Wiley Encyclopedia of Electrical and Electronics Engineering, J. Webster (ed.)
Asynchronous Logic Design
Ruchir Puri, IBM Thomas J. Watson Research Center, Yorktown Heights, NY
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W7604
Article Online Posting Date: December 27, 1999
Abstract. The sections in this article are Fundamentals and Motivation, Asynchronous Logic, Asynchronous Design Methodologies, Asynchronous Processors and Other Design Examples, and Conclusions.
ASYNCHRONOUS LOGIC DESIGN
The performance, cost effectiveness, and complexity of digital computers have experienced explosive growth in the past two decades. These advances in the performance and cost effectiveness of integrated circuits (ICs) are a direct result of innovations in fabrication technology, architecture, and advanced circuit techniques. Recent advances in semiconductor technology have pushed the number of transistors on a single microprocessor chip beyond the ten million mark and clock frequencies beyond 600 MHz (1). Distributing such a high-frequency clock signal over the entire chip is an extremely challenging task, which is guaranteed to get more difficult with future generations of microprocessors synchronized by a single clock. Thus, it is worth exploring other types of digital design that might offer a solution to this clock distribution problem. Asynchronous logic circuits hold the promise of alleviating these clock distribution problems because they do not require a global clock signal for their operation (2,3).

FUNDAMENTALS AND MOTIVATION

Modern computers are designed with digital logic circuits. The design of digital logic circuits is broadly classified into two basic types: combinational logic design and sequential logic design (4). In combinational logic circuits, outputs depend only on the present inputs. On the other hand, sequential logic circuits are those circuits whose outputs depend on their past as well as present inputs. This implies that sequential circuits must incorporate some form of memory to hold information about past inputs. This information about past inputs contained in the memory elements (also called latches or flip-flops) is called state information. Thus, the output of a sequential circuit, as shown in Fig. 1, is a function of its current input and state. Frequently, sequential circuits are also referred to as finite state machines (FSMs) or sequential machines.
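This distinction between combinational and sequential behavior can be sketched in executable form; the XOR gate and the toggle element below are illustrative examples, not circuits from the article.

```python
def xor_gate(a, b):
    """Combinational: the output depends only on the present inputs."""
    return a ^ b

class ToggleFlipFlop:
    """Sequential: the output depends on the input AND the stored state."""

    def __init__(self):
        self.state = 0          # state information held in a memory element

    def step(self, toggle):
        if toggle:
            self.state ^= 1     # next state = f(present input, present state)
        return self.state

# The same input value produces different outputs over time, because the
# element remembers its past inputs; the combinational gate never does.
t = ToggleFlipFlop()
print([t.step(x) for x in [1, 0, 1, 1]])  # [1, 1, 0, 1]
print(xor_gate(1, 0), xor_gate(1, 0))     # 1 1
```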
Sequential circuits can be classified into synchronous or clocked circuits and asynchronous circuits. In synchronous or clocked sequential circuits, time is quantized, and all actions take place at discrete intervals of time determined by a periodic source of pulses called a clock. The clock signal controls the memory elements whose values reflect the circuit state. Inputs to synchronous circuits can change only during the period when the clock pulses essentially disable the memory elements and prevent the circuit from changing state (i.e., events in synchronous circuits are sequenced by a globally distributed periodic clock signal).

[Figure 1. Sequential logic circuit: combinational logic whose inputs, outputs, and state signals are coupled through memory elements.]

The role of a clock in synchronous circuits can be described with an excellent analogy to a conductor in an orchestra regulating every beat of the music (5). In an orchestra, individual performers know exactly when to play each note and know implicitly that their output will blend appropriately with the output of other functional units within the orchestra, so long as they observe the conductor's beat. Unlike an orchestra, synchronous operation is not fundamental to an electronic circuit. The operation of an electronic circuit can be compared to a production line where partly assembled products are passed from one stage to the next when they are ready. A smooth flow must be maintained for efficient production. Asynchronous circuits follow this production line mode of operation. In contrast to synchronous sequential circuits, asynchronous circuits are a broader class of circuits wherein events may take place at any time (i.e., they are designed to operate without a global synchronization clock signal). Asynchronous circuits use local synchronization rather than a global synchronous signal such as a clock (i.e., various circuit components communicate with each other using handshaking or request-and-acknowledge signals). Since its advent, digital logic design has been dominated by synchronous logic because of its relative ease of design compared to asynchronous logic circuits. The ease of designing synchronous logic is a direct result of timing restrictions placed on circuit signals and the global clock. These timing constraints ensure that all signals stabilize before the onset of the clock signal that stores the stabilized signal values into the memory elements.
If this constraint is not satisfied, the clock frequency must be reduced to avoid circuit malfunction resulting from the storage of incorrect signal values in the memory elements. In addition, distributing gigahertz-frequency clock signals over a complex chip is an extremely challenging task that is guaranteed to get more difficult with future generations of microprocessors synchronized by a single clock. This has recently revived research efforts that eliminate the global clock by employing asynchronous circuits or locally clocked circuits. In addition, asynchronous circuits can also offer the following desirable properties, when designed carefully: • Alleviating Global Clock Distribution Problems. In a synchronous circuit, the clock signal connects all the memory elements and clocked logic gates through wires that are routed over the entire chip. Because long wires have significant resistance and capacitance, electrical signals carried on these wires incur significant delays. Due to this large distributed resistive-capacitive clock network, a given clock pulse arrives at different times at different parts of the circuit. This effect is called clock skew, and it can severely affect the performance of the circuit. Distributing the clock signal over the entire chip with relatively low clock skew is one of the most difficult tasks facing the designers of advanced high-performance microprocessors. One of the goals of the clock signal is to synchronize the updating of the state information in the memory elements. This synchronization in the presence of clock skew can result in a circuit malfunction. Because asynchronous circuits do not employ a global clock for
synchronization, the problem of global clock skew and global clock distribution is alleviated. Although asynchronous circuits eliminate the global clock, they replace this problem with synchronization or handshaking constraints at the local level. Thus, a large global problem is replaced by several smaller problems at the local level. • Low-Power Potential. In high-performance microprocessors, designers have developed several methods to address clock distribution and clock-skew issues. Invariably, they involve building a clock distribution tree or mesh with amplifying buffers. Unfortunately, high-performance microprocessor designs with clock frequencies in the 600 MHz range require very large clock buffers to obtain an acceptable clock-distribution network; these buffers result in excessive power consumption and occupy a large area. For example, Digital Equipment Corporation's Alpha microprocessor with a 433 MHz clock frequency requires a clock-distribution network that occupies 10% of the total chip area (16.8 mm × 13.9 mm) and accounts for 40% of the total power consumption of 30 W (6). With additional increases in clock frequency, it will be increasingly expensive to distribute a single global clock due to power dissipation constraints. In addition, in a synchronous circuit, the clock signal is always active, even if the circuit is not processing any data. Thus, low-power synchronous circuits usually switch the clock signal off to a subcircuit that is not active. This requires an additional circuit that monitors the input signals for any activity, which itself consumes power. Also, the switch that shuts off the clock signal to an inactive subcircuit presents an additional capacitive load to the clock distribution network. Because asynchronous circuits do not require a continuously running global clock for their operation, they are free of these problems and can provide potential savings in power consumption for low-power applications.
Because the market for wireless, portable, and hand-held consumer products that require long battery life and thus low-power circuits is growing very fast, potential low-power benefits are a significant motivating factor in the recent resurgence of interest in asynchronous circuits (7). • Improved Performance Potential. In synchronous sequential circuits, the pulse period of the clock signal is chosen according to the worst-case timing path through the combinational logic. This clock period is quite inflexible and cannot take advantage of best-case or even average-case behavior. Because components in asynchronous circuits synchronize events by generating local completion signals when the computation is complete, their performance is not determined by worst-case delays. Thus, asynchronous circuits hold the potential of achieving increased performance. • Modularity and Upgradability. An important aspect of any design methodology is its support for "modularity," since a divide-and-conquer approach is central to any complex design task. A clock signal is, in effect, a global system variable, and therefore an impediment to modularity. Because asynchronous circuits do not impose a global clock constraint, a system designed with asynchronous components will function correctly by simply connecting these components together (provided their interfaces match and they observe the same signaling
protocol). Early research work by Clark and Molnar (7a) demonstrated the composition benefits of asynchronous circuit modules and formed the basis for several recent asynchronous synthesis approaches (54,55). The formation of the Virtual Socket Interface (VSI) alliance by leading semiconductor corporations provides significant thrust to design methodologies that naturally support more modular systems with easy-to-replace components. Asynchronous design methodologies are bound to benefit from this recent push toward modularity and reusability. In synchronous sequential circuits, various components are synchronized using the same global clock signal. Thus, even a minor change in circuit functionality must observe the fixed clock-period constraint if a major redesign is to be avoided. In contrast, a logic change in asynchronous circuits can be accommodated locally, since various components do not have to adhere to a strict global timing constraint. • Environmental Robustness. Synchronous circuits are designed for correct operation within a range of temperature, power-supply voltage, and fabrication process variations. While designing synchronous circuits, the clock time period is chosen according to the delay of the components with worst-case parameters (e.g., maximum allowable temperature and minimum allowable power-supply voltage). Thus, if a synchronous circuit was designed to operate at a maximum temperature of 95 °C, it may not function correctly above this temperature (because of the increase in component delays with increasing temperature). In contrast, asynchronous circuits are robust with respect to variations in their environment, such as temperature, power supply, and fabrication process, because their operation is independent of component delays and thus of their variations.
In addition, asynchronous circuits offer the advantage of reduced electromagnetic interference (EMI) emissions because they distribute the switching energy over time and frequency, in contrast to synchronous circuits, where the switching energy is concentrated around the clock frequency (5). In spite of all these potential benefits, asynchronous circuits have not been widely used in commercial applications except where unavoidable, such as in interface circuits between two independently clocked circuits or in real-time circuits. Even in the case of asynchronous interface circuits, their design is considered to be very informal and is thought to be one of the most difficult tasks by integrated circuit designers. The major reason for the overwhelming popularity of synchronous circuits is their ease of design, which is due to a clear separation between functionality and timing in synchronous designs. This separation is taken away in asynchronous circuits [e.g., in the isochronic fork assumption (51) or the fundamental mode assumption (15)], which makes the design of these circuits more complex. Although asynchronous circuits eliminate the global clock, they replace this problem with synchronization or handshaking constraints at the local level. Unfortunately, in some cases the overhead due to handshaking can actually result in a performance degradation. To reduce this performance penalty, architectural changes that hide this handshaking overhead must be incorporated (63). In addition, asynchronous circuits are prone to generating unwanted signal changes known as hazards that may cause a circuit to malfunction. Hazards are naturally filtered out in a synchronous design by choosing a long-enough clock period, which ensures that the circuit is in a stable state before the next input changes take place. Thus, high-performance asynchronous logic circuits that are free of hazards are more difficult to design than their synchronous counterparts. In general, the presence of hazards in asynchronous circuits is a major hindrance to their widespread use. Although an asynchronous implementation of a DCC error corrector chip at Philips Research Labs (72) demonstrated a fivefold reduction in power over the best synchronous design, a clear advantage of asynchronous designs for large-scale high-performance and low-power circuits still remains to be seen. This chip was designed with the help of an internal asynchronous synthesis tool called TANGRAM. The unavailability of such industrial-strength synthesis tools from commercial CAD vendors is another hurdle in the proliferation of commercial asynchronous designs. Despite the few current commercial applications of asynchronous circuits, it is worth exploring other areas of digital design that may offer a solution to the clock distribution and power dissipation problems in high-performance circuits. Asynchronous logic circuits hold the promise of solving these problems, which is a major motivating factor in the recent resurgence of interest in asynchronous and locally clocked circuits. Asynchronous design should not be viewed as a single alternative to synchronous logic design. More accurately, synchronous design is a special case representing a single design point in a multidimensional asynchronous design space. Thus, one can implement a logic circuit using a completely asynchronous design, a completely synchronous design, or a suitable combination of the two design techniques.

ASYNCHRONOUS LOGIC

There are many flavors of asynchronous logic.
However, a few key features characterize the underlying implementation of various asynchronous designs. These implementation features can be seen as a choice between different synchronization protocols (i.e., two-phase versus four-phase signaling), data encoding schemes (i.e., dual-rail versus bundled data encoding), and delay models (8).

Asynchronous Control Protocols

In general, asynchronous circuit components communicate with each other using handshaking or request-and-acknowledge signaling protocols. The correct functionality of an asynchronous circuit depends upon each component following a sequence of request-and-acknowledge events in its signaling protocol and is independent of signal timings. One component, the sender, sends a request to start an event to another component, the receiver. After the receiver completes the requested action, it sends an acknowledge signal back to the sender to complete the protocol. In order for this protocol to function correctly, the sender must not produce a new request signal until the previous request signal has been acknowledged. In addition, the receiver must not reply with an acknowledge signal unless it has received the request signal. This standard request-acknowledge signaling protocol can be
[Figure 2. (a) Two-phase signaling; (b) four-phase signaling. The sender and receiver are connected by request (REQ) and acknowledge (ACK) wires.]
realized using a transition-sensitive circuit or a level-sensitive circuit, as described next. Two-Phase Signaling. In two-phase signaling, also known as transition or non-return-to-zero signaling, the request-and-acknowledge events are actually transitions (i.e., either rising or falling). A request or acknowledge event is said to have occurred if the corresponding signal wire has made a transition from its present state to another (i.e., from high level to low level or vice versa). The direction of the transition is unimportant. As shown in Fig. 2(a), in the two-phase signaling protocol, the sender can issue a request to start an event by generating a transition on the request wire, and the receiver can respond after completing the requested event by generating a transition on the acknowledge wire. There are two states in this signaling protocol: a state where the sender is active and the receiver has not yet responded, and a state where the receiver has responded and is waiting for the sender to become active again. In general, both request and acknowledge wires are initialized to level zero. Thus, the inactive state in the protocol is defined as the state when both request and acknowledge wires are at the same level. Four-Phase Signaling. In the four-phase signaling protocol, also known as return-to-zero signaling, the request-and-acknowledge events are levels instead of transitions [i.e., either low (zero) or high (one)]. As in two-phase signaling, the inactive state in this signaling protocol is also initialized to level zero for both request and acknowledge wires. As shown in Fig. 2(b), a four-phase signaling sequence starts with both request and acknowledge wires at level zero. Then the sender initiates a request to start an event by changing the request wire to level one. After completing the requested event, the receiver responds with the acknowledge signal by changing the acknowledge wire to level one.
When the sender receives the acknowledge event, it changes the level of the request wire back to zero, and subsequently the receiver also changes the level on the acknowledge wire back to zero. Thus, after completing the four-phase signaling, both request and acknowledge wires return to level zero. Now the sender is ready to send another request signal.
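The two protocols can be contrasted with a small sketch that records the REQ/ACK events per transfer (illustrative code, assuming an ideal sender and receiver):

```python
def four_phase(transfers):
    """Return-to-zero: REQ up, ACK up, REQ down, ACK down per transfer."""
    events = []
    for _ in range(transfers):
        events += [("REQ", 1), ("ACK", 1), ("REQ", 0), ("ACK", 0)]
    return events

def two_phase(transfers):
    """Non-return-to-zero: one transition per wire per transfer; the
    direction of each transition is irrelevant."""
    events, req, ack = [], 0, 0
    for _ in range(transfers):
        req ^= 1
        events.append(("REQ", req))   # sender toggles the request wire
        ack ^= 1
        events.append(("ACK", ack))   # receiver toggles the acknowledge wire
    return events

# One transfer costs four transitions in four-phase signaling but only two
# in two-phase signaling, which is the efficiency argument discussed below.
print(len(four_phase(1)), len(two_phase(1)))  # 4 2
```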
[Figure 3. Bundled data encoding, shown with both two-phase signaling and four-phase signaling: the sender and receiver exchange data wires together with request and acknowledge wires.]
The choice of either the two-phase or the four-phase signaling protocol depends on the designer and the particular application. In theory, two-phase signaling may appear to be faster and less power-consuming than the four-phase protocol, since a complete sequence of the two-phase protocol requires half the number of transitions of a four-phase protocol. However, two-phase signaling requires special logic to detect transitions rather than levels, which results in more complex circuits. This in turn penalizes the interface in terms of both performance and power. Asynchronous Data Encoding Schemes In addition to the choice of the two- or four-phase control signaling protocol discussed earlier, the data signals can be encoded using either bundled data encoding or dual-rail encoding, described as follows. Bundled Data Encoding. Bundled data encoding employs separate wires for data and control signals. As shown in Fig. 3, for n bits of data to be communicated from the sender to the receiver, bundled data encoding requires n + 2 wires: n wires for the data, one wire for the request signal, and one wire for the acknowledge signal. Bundled data encoding can be employed with either the two-phase or the four-phase signaling protocol, as illustrated in Fig. 3. The separation of data and control timing in bundled data encoding imposes a constraint, described as follows. Suppose that the request signal were faster than at least one data signal. In such a case, the receiver would receive the request event from the sender even before the data are received and may initiate the required computation with wrong data values. Thus, the use of bundled data encoding implies an implicit assumption that the request signal is slower than the data signals (i.e., the request event occurs only after all data signals are asserted). This constraint is known as the bundled data constraint and is widely used in the design of asynchronous logic circuits. Thus, the request signal in bundled data encoding is similar to a clock signal. Dual-Rail Encoding.
In the case that the bundled data constraint cannot be satisfied, dual-rail encoding (shown in Fig. 4) is used, where data and request signals are encoded together onto two wires (dual rails). In addition, an acknowledge wire is also required for every data bit. Thus, communicating n data bits with this encoding requires 2n wires for the dual-rail encoding of request and data signals and n wires for acknowledge signals. In the case that the receiver requires the availability of all data bits before activating the acknowledge signal, only one acknowledge wire is sufficient for all dual-rail data signals, which reduces the total number of wires from 3n to 2n + 1. In general, the values on the dual rails are interpreted as follows: 10 corresponds to data value 0, 01 implies a data value 1, 00 implies that the data value is not yet available, and 11 is not allowed.

[Figure 4. Dual-rail data encoding: the sender and receiver are connected by dual-rail data wires and an acknowledge wire.]

Dual-rail encoding is insensitive to the delays on any wire and is often used when the bundled data constraint cannot be satisfied. Unfortunately, it results in increased complexity in both the number of wires and the logic. To compensate for the high complexity of dual-rail encoding, other coding schemes have been developed (8a,8b) in which the logic overhead lies only in the completion detection and the complexity of data processing is avoided. Delay Models and Hazards in Asynchronous Circuits A formal design method for a digital circuit requires a delay model of the logic gate operation. The accuracy of this model determines the practicality of the design method. Logic gates and interconnections (i.e., wires) are the fundamental building blocks of digital circuits. Physically, both logic gates and wires exhibit finite delays and are modeled with delay elements when analyzing their behavior. A delay element is said to have a pure delay if it delays the transition of its input waveform to its output but does not otherwise alter the waveform itself. On the other hand, a delay element is said to exhibit an inertial delay with threshold di if it does not react to input pulses shorter than di, while pulses longer than di are transmitted to its output with a delay di.
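The pure and inertial delay models can be illustrated with a discrete-time sketch (unit time steps; the waveform and threshold below are invented for illustration): the pure delay shifts the waveform intact, while the inertial delay also swallows pulses shorter than its threshold.

```python
def pure_delay(wave, d):
    """Pure delay: shift the waveform by d steps without reshaping it."""
    return [0] * d + wave[:len(wave) - d]

def inertial_delay(wave, d):
    """Inertial delay with threshold d: an input value must persist for at
    least d steps to reach the output; shorter pulses are filtered out."""
    out, last = [], 0
    for t in range(len(wave)):
        window = wave[max(0, t - d + 1): t + 1]
        if len(window) == d and all(v == window[0] for v in window):
            last = window[0]        # input held long enough: propagate it
        out.append(last)            # otherwise hold the previous output
    return out

glitchy = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0]  # a 1-step pulse, then a 3-step pulse
print(pure_delay(glitchy, 2))      # [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]
print(inertial_delay(glitchy, 2))  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
```

Note how the 1-step pulse survives the pure delay but is absorbed by the inertial delay, while the 3-step pulse passes through both.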
A delay element is said to have a bounded delay if its delay can take any value within a given interval and is said to have an unbounded delay if its delay can take any finite value. Using this definition, logic circuits can be characterized with a bounded delay model, an unbounded gate delay model, or an unbounded wire delay model (9). The bounded delay model associates a bounded delay with both logic gates and interconnecting wires. In contrast, the unbounded gate delay model associates an unbounded delay with every gate, and the interconnecting wires are assumed to have zero delay, whereas the unbounded wire delay model associates an unbounded delay with both logic gates and the interconnecting wires. Because real gates and wires have finite delays, output signals may glitch before settling down to their final value. In a synchronous design, the clock signal controls all the state changes and communication between various components. Between active clock edges, combinational logic generates the next-state function, possibly producing many spurious output values (glitches). Synchronous logic operates correctly by ensuring that all the outputs are correct and stable before the next active clock edge. Because asynchronous circuits do not employ a global clock, glitches cannot be filtered out by simply controlling a global synchronization signal. In asynchronous circuits, these glitches are known as hazards, which can cause the circuit to malfunction.

[Figure 5. Hazards in asynchronous circuits: combinational logic with inputs x1 and x2, output z, and feedback wires carrying state variables y1 and y2.]

For example, consider the circuit shown in Fig. 5. For this circuit, if input x1x2 changes from 00 to 10, then output z goes high, and the next-state variable Y1 changes to 1 while the next-state variable Y2 remains at 1. In this process, the output of the AND gate x1y1 changes to 0 while the output of the AND gate x1y2 changes to 1. If the AND gates have nonzero finite delays and one AND gate has more delay than the other, then the output of the OR gate implementing Y2 can go to 0 for a short duration (equal to the difference in the delays of the AND gates). If this difference in AND gate delays is large enough, the circuit can stabilize in some other state because the transient can be propagated back through the feedback wires. This transient on the output of the OR gate is known as a hazard. Depending on the specifications, changing the logic implementation can eliminate some types of hazards. For example, adding an AND gate y1y2 (shown dotted in Fig. 5) to the circuit implementing Y2 keeps the output Y2 at 1, and no hazard can occur, assuming that the feedback wires have much larger delays than the delays in the combinational logic. If this assumption is not met and y1 falls too fast, then the output of gate y1y2 may fall, giving a transient Y2 output again. We can now define a hazard as a deviation of an output signal from the specified behavior.
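A glitch of this kind can be reproduced with a unit-delay simulation. The circuit below is not the Fig. 5 machine but the textbook function f = ab + a'c evaluated with b = c = 1: when a falls, the inverter's extra gate delay lets both AND terms be 0 for one step.

```python
def simulate(steps=8):
    a_of = lambda t: 1 if t < 3 else 0   # input a falls at t = 3
    b = c = 1
    inv, g1, g2, f = 0, 1, 0, 1          # gate outputs settled for a = 1
    trace = []
    for t in range(steps):
        a = a_of(t)
        # unit-delay gates: every output updates from previous-step values
        inv, g1, g2, f = 1 - a, a & b, inv & c, g1 | g2
        trace.append(f)
    return trace

# f should stay at 1 when a falls (since b = c = 1), but it glitches to 0
# for one step: a static 1-hazard. Adding the redundant consensus term
# b & c to the OR gate would hold f at 1 throughout.
print(simulate())  # [1, 1, 1, 1, 0, 1, 1, 1]
```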
A hazard can be classified as a static hazard, a dynamic hazard, a combinational hazard, or a sequential hazard. These are formally defined as follows. A signal is said to have a static hazard if it should remain constant but changes twice or more (in opposite directions). A static hazard is a 0-hazard if the signal should remain 0, and it is a 1-hazard if the signal should remain 1. A signal is said to have a dynamic hazard if it should change only once but
it changes multiple times. A combinational hazard occurs as a result of a distribution of finite gate delays in the combinational logic (as described in the preceding example). A combinational hazard is classified as a logic hazard if the hazard depends on the particular logic implementation, as in the preceding example. A logic hazard can be eliminated by changing the logic implementation of the function. A combinational hazard is classified as a function hazard if it cannot be eliminated by changing the logic, irrespective of the gate delays. A sequential hazard occurs as a result of the feedback wire delays. A sequential hazard is called an essential hazard if it is inherent in the finite state machine specification and occurs irrespective of the logic implementation. For example, hazards due to critical races that cannot be eliminated by proper state assignment are essential hazards. There is a wide body of literature on hazard analysis and elimination that establishes several properties of static, dynamic, combinational, and sequential hazards. Further details on hazards in asynchronous circuits can be obtained from Unger (15,16). Given the various delay models, asynchronous circuits are generally classified into speed-independent circuits and delay-insensitive circuits. Speed-independent circuits (10) operate correctly (without hazards) irrespective of gate delays, and wires are assumed to have zero delays. Thus, their operation is defined using the unbounded gate delay model. In contrast, delay-insensitive circuits (11) operate correctly (without hazards) irrespective of gate as well as wire delays. Thus, their operation is defined using the unbounded wire delay model. In general, digital logic circuits are designed with basic logic gates such as NAND and NOR gates.
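The discussion below mentions the Muller C-element, which implements an AND of transitions: its output switches only after both inputs have switched, and holds its value otherwise. A behavioral sketch (illustrative, two-input case):

```python
class CElement:
    """Muller C-element: the output copies the inputs when they agree and
    holds its previous value when they disagree."""

    def __init__(self):
        self.out = 0

    def step(self, a, b):
        if a == b:              # both inputs have made the transition
            self.out = a
        return self.out         # otherwise: hold the previous output

c = CElement()
print([c.step(a, b) for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]])
# [0, 1, 1, 0] -- the output changes only once both inputs agree
```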
It is known that to ensure delay-insensitive behavior of a circuit, any logic gate must wait for a transition on all of its inputs before generating a transition on its output (12). Because single-output standard logic gates such as AND, NAND, OR, and NOR do not satisfy this constraint, they cannot be used to build delay-insensitive circuits. Thus, only logic elements that satisfy this constraint, such as the C-element (which implements an AND of transitions), inverters, buffers, and wires, can be used to implement delay-insensitive circuits. In order to design delay-insensitive asynchronous circuits in practice, more complex delay-insensitive logic elements with a range of functionality are used (13). Because it is impractical to design pure delay-insensitive circuits with simple logic elements, researchers have relaxed the pure delay-insensitivity constraint to develop quasi-delay-insensitive circuits. Quasi-delay-insensitive circuits are similar to delay-insensitive circuits except that they assume isochronic forks (72). An isochronic fork is a forked wire where all branches have exactly the same delay. Some researchers have relaxed this constraint to allow a bounded skew between different branches of the fork. In contrast, the delays on different branches of a fork in delay-insensitive circuits are completely independent of each other. In addition to speed-independent and delay-insensitive circuits, timed asynchronous circuits (14,45) have also been developed, which utilize the bounded nature of delays in practice to optimize the performance of an asynchronous implementation. There are several approaches to designing asynchronous circuits. Asynchronous design methodologies can be characterized in many ways (17,18) (e.g., by the delay model used for their implementation or the type of design specifications used to specify their behavior). In the following section, several major design methods for implementing asynchronous designs are
Figure 6. Asynchronous Huffman state machine.
briefly described. These methods are categorized according to the type of specifications they use.
ASYNCHRONOUS DESIGN METHODOLOGIES
Asynchronous Design Using Finite State Machines
Huffman State Machines. Asynchronous behavior is traditionally synthesized from finite state machine specifications, which are implemented using a Huffman machine (15), shown in Fig. 6. A Huffman machine follows a bounded delay model and consists of combinational logic with primary inputs, primary outputs, and feedback state variables. Instead of latches or flip-flops in the state signal feedback loop, delay elements are used to store state information. In each state, a state machine can receive inputs, generate outputs, and move to the next state. If the state machine receives an input and remains in the same state, then that state is called a stable state. Because asynchronous circuits do not have a clock, state transitions are caused by input changes. If an input change causes a transition from a stable state to an unstable state, then the state continues to change until a stable state is reached. Asynchronous state machines can also be characterized according to the number of inputs that may change between any two state transitions. In a single-input change machine, only one input may change at any given time, and any subsequent input change takes place only after the circuit has stabilized in response to this input. In contrast, in a multiple-input change asynchronous state machine, any number of inputs may change simultaneously (i.e., within a very narrow interval), and any subsequent input changes take place only after the circuit has stabilized. Asynchronous FSM specifications are described using a flow table (15). A flow table expresses a relationship between present states, inputs, next states, and outputs, representing the output and next-state behavior of the circuit as a function of its inputs and present state.
In Table 1, asynchronous flow table specifications are illustrated with a simple asynchronous modulo-4 counter having two binary inputs x1 and x2 and two binary outputs z1 and z2. The input x1 = 0, x2 = 0 is a reset input that clears the counter to an initial state producing the output 00. In this example, only a single input variable is allowed to change in a transition (single-input change mode). The count is incremented by one when the input x1 = 0, x2 = 1 is received and incremented by two when the input x1 = 1, x2 = 0 is received. The count remains
the same for input x1 = 1, x2 = 1. The outputs z1 and z2 represent the counter outputs (i.e., the count). Table 1 shows the flow table specifications for the asynchronous counter. For each state Si of the flow table, next-state entries are specified for the different input signals, and stable states are denoted by a box. The reset state a is stable in input column 00. This input may be followed by input 01, which results in a next state b and outputs z1 = 0, z2 = 1 (i.e., a count of 1). Similarly, input 10 after the input 00 results in a next state c and outputs z1 = 1, z2 = 0 (i.e., a count of 2). The next-state entry and the outputs in input column 11 for present state a are unspecified. The stable state b in input column 01 may be followed by input 00 or 11, taking the state machine to reset state a or state d, respectively. In this case, the counter outputs z1 and z2 remain unchanged (i.e., the count remains the same). Similarly, the rest of the flow table entries can be specified, as shown in Table 1. Outputs are specified only for stable states, assuming that outputs may change at any time during the transition. The initial flow table describing the desired asynchronous behavior usually contains some redundancy. The removal of redundant states (i.e., state minimization) is important to reduce the circuit complexity of the final implementation. To implement the reduced state table as a logic circuit, every state must be assigned a unique Boolean encoding. This step is known as state assignment (15). In the case of a transition that involves multiple state signal changes, it must be ensured that the circuit operation is independent of the order in which these state signals change. A situation in which more than one state variable must change in the course of a transition is called a race condition.
A race is said to be a critical race if the stable state reached by the circuit depends on the order in which the signals change (i.e., on the outcome of the race). There are many techniques to achieve critical-race-free state assignment, such as one-hot assignments (15) or those described by Tracey (19). Further details on various state minimization and state assignment techniques can be obtained from Unger (15). The minimized and critical-race-free encoded flow table describes a set of logic equations that produce the next-state and output functions. These logic functions can be implemented by a combinational circuit. The logic function of a next-state (i.e., feedback wire) or output signal can be obtained by deriving its Karnaugh map (i.e., finding its implied value in each binary encoded flow table state). Connecting the next-state output and present-state input of this circuit yields the logic circuit implementing the asynchronous state machine. This implementation may exhibit hazardous behavior. Hazard removal is much easier in single-input change machines than in multiple-input change machines because of their more constrained operation, as discussed previously. The issue of hazard removal can be simplified if it is assumed that the inputs to the logic circuit do not change until the circuit outputs have stabilized. This condition is known as the fundamental mode constraint, which generally results in a severe performance penalty. The fundamental mode constraint is similar to a hold time constraint for a simple latch or flip-flop. Several formal design procedures exist for synthesizing hazard-free fundamental-mode single-input change asynchronous state machines and are described in (15). Although multiple-input change state machines are more flexible, they are significantly more difficult to design.
Table 1. An Example Flow Table Specification
(entries are next state/output z1z2; stable states, boxed in the original, are shown in square brackets; − denotes an unspecified entry)

Present State | x1x2 = 00 | x1x2 = 01 | x1x2 = 11 | x1x2 = 10
a | [a]/00 | b/− | −/− | c/−
b | a/− | [b]/01 | d/− | −/−
c | a/− | −/− | e/− | [c]/10
d | −/− | f/− | [d]/01 | g/−
e | −/− | h/− | [e]/10 | i/−
f | a/− | [f]/10 | e/− | −/−
g | a/− | −/− | j/− | [g]/11
h | a/− | [h]/11 | j/− | −/−
i | a/− | −/− | k/− | [i]/00
j | −/− | m/− | [j]/11 | l/−
k | −/− | b/− | [k]/00 | c/−
l | a/− | −/− | d/− | [l]/01
m | a/− | [m]/00 | k/− | −/−
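The modulo-4 counter of Table 1 can be exercised with a short script. The dictionaries below are my own transcription of the (partially garbled) table extraction, so treat the entries as a best-effort reconstruction rather than a verbatim copy.

```python
# NEXT[s][i]: next state from state s under input x1x2 = i (None where
# the flow table is unspecified); OUT[s]: output z1z2 in stable state s.
NEXT = {
    'a': {'00': 'a', '01': 'b', '11': None, '10': 'c'},
    'b': {'00': 'a', '01': 'b', '11': 'd', '10': None},
    'c': {'00': 'a', '01': None, '11': 'e', '10': 'c'},
    'd': {'00': None, '01': 'f', '11': 'd', '10': 'g'},
    'e': {'00': None, '01': 'h', '11': 'e', '10': 'i'},
    'f': {'00': 'a', '01': 'f', '11': 'e', '10': None},
    'g': {'00': 'a', '01': None, '11': 'j', '10': 'g'},
    'h': {'00': 'a', '01': 'h', '11': 'j', '10': None},
    'i': {'00': 'a', '01': None, '11': 'k', '10': 'i'},
    'j': {'00': None, '01': 'm', '11': 'j', '10': 'l'},
    'k': {'00': None, '01': 'b', '11': 'k', '10': 'c'},
    'l': {'00': 'a', '01': None, '11': 'd', '10': 'l'},
    'm': {'00': 'a', '01': 'm', '11': 'k', '10': None},
}
OUT = {'a': '00', 'b': '01', 'c': '10', 'd': '01', 'e': '10',
       'f': '10', 'g': '11', 'h': '11', 'i': '00', 'j': '11',
       'k': '00', 'l': '01', 'm': '00'}

def run(state, inputs):
    """Apply a sequence of single-input-change inputs, chasing
    unstable entries until a stable state (NEXT[s][i] == s) is
    reached, and collect the stable-state outputs."""
    outs = []
    for i in inputs:
        while NEXT[state][i] != state:
            state = NEXT[state][i]
        outs.append(OUT[state])
    return state, outs

# count +1 (input 01), hold (11), count +2 (10), reset (00):
print(run('a', ['01', '11', '10', '00']))
# -> ('a', ['01', '01', '11', '00'])
```

Note that each unstable entry leads directly to a state that is stable in the same input column, so every transition settles in one step, as required of a well-formed flow table.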
Some methods (15) for synthesizing multiple-input change machines rely on inertial delays to filter out glitches. Unfortunately, inertial delays are difficult to build, and they penalize the logic implementation in terms of performance. In the absence of the fundamental mode constraint, it is extremely difficult to obtain hazard-free behavior in Huffman state machine implementations. In addition, to obtain a hazard-free implementation, the Huffman state machine specifications must be free from essential hazards (i.e., hazards inherent to the specification). Burst mode specifications, developed by Nowick et al. (22), provided solutions to these hazard-free implementation problems. In addition to providing the flexibility of multiple-input changes, burst mode specifications avoid essential hazards by construction.
Self-synchronized State Machines. The difficulties and overhead of hazard elimination associated with Huffman state machines gave rise to an asynchronous design style called self-synchronized state machines (20,21). As shown in Fig. 7, these state machines generate their own synchronization signal, which acts like a clock on internal flip-flops. The self-synchronized machine consists of combinational logic, storage elements with clock control, inputs and outputs, and state variables that are fed back to the machine inputs. The local clock is generated from the external inputs and the current state.
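The inertial delays mentioned above pass a value only if it persists longer than the delay; shorter pulses are absorbed. A minimal discrete-time sketch (my own illustration, not a circuit from the article):

```python
def inertial_delay(wave, d):
    """Propagate a value only after it has held for at least d steps;
    shorter pulses (glitches) are filtered out and the output holds."""
    out, last = [], wave[0]
    run = d                      # how long the current value has held
    for v in wave:
        run = run + 1 if v == last else 1
        last = v
        if run >= d:
            out.append(v)        # value persisted: propagate it
        else:
            out.append(out[-1] if out else v)  # glitch: hold output
    return out

# a one-step glitch (the 0 in the middle) is absorbed when d = 2:
print(inertial_delay([1, 1, 1, 0, 1, 1], 2))  # -> [1, 1, 1, 1, 1, 1]
```

This also illustrates why inertial delays are costly: the output is delayed by the filtering window on every legitimate transition as well.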
Figure 7. Self-synchronized state machine.
The local clock is also used to eliminate a number of possible hazards. In general, a self-synchronized machine is idle until an input change occurs. Subsequently, the combinational logic generates the corresponding outputs and state signals. The storage elements update the machine state when the local clock generation circuitry generates a clock pulse. The machine is ready to accept new inputs after the storage elements have been updated. Self-synchronized machines try to combine the benefits of synchronous and asynchronous state machines. Both single-input change and multiple-input change self-synchronized machines can be designed. They offer the advantages of synchronous machines (i.e., they are simple and do not require critical-race-free state assignments). In addition, they do not require hazard-free logic for outputs and next-state entries. However, they do transfer the problem of hazard elimination to the local clock generation logic, which may require an inertial delay at its output to eliminate hazards. In addition, they may have poor performance as a result of their design being based on the worst-case delay of output and state variables.
Burst Mode State Machines. Nowick et al. (22) made significant contributions to the design of asynchronous circuits using state machines by developing burst mode state machine specifications. One of the biggest advantages of burst mode machines is that the logic implementation is guaranteed to be hazard-free while maintaining high performance. In addition, burst mode machines allow multiple-input changes, thereby yielding more concurrent systems. In burst mode state machines, a set of input transitions (i.e., an input burst) is followed by a set of output transitions (i.e., an output burst). Finally, the state change takes place. The inputs in the input burst may occur in any order. The output burst can occur only after the entire input burst has occurred.
A new input burst cannot occur until the machine has reacted to the previous input burst. Thus, these specifications also require the fundamental mode assumption, but only between transitions in different input bursts. In order to distinguish different input bursts in a given state, no input burst can be a subset of another input burst. This constraint is known as the maximal set constraint. In addition, each state must follow the unique entry constraint, which implies that every state has a unique entry point. Thus, a given state is always entered with the same set of input values.
Figure 8. Burst mode specifications of a simple controller.
Burst mode specifications are specified using state diagrams. Figure 8 shows the burst mode specifications of a simple controller with three inputs a, b, c and two outputs y, z. Each transition between two states is labeled with an input burst followed by an output burst. For example, the transition between state A and state B takes place with an input burst of {a+ b+} and generates an output burst of {y+ z−}, where a+ denotes a rising transition on input a and z− denotes a falling transition on output z. The key difference between burst mode state machines and multiple-input change Huffman state machines is that, unlike multiple-input change machines, inputs within a burst can be uncorrelated, arriving in any order and at any time. Burst mode asynchronous specifications are implemented using the self-synchronized or locally clocked state machine shown in Fig. 7. Implementation of burst mode specifications in locally clocked machines differs from self-synchronized machines in several respects. In burst mode implementations, the clock is generated selectively (i.e., some transitions do not require a clock pulse). In addition, unlike many self-synchronized methods (20,21), the clock unit does not require inertial delays to eliminate hazards. As mentioned previously, burst mode specifications impose simple constraints on input transitions, such as the maximal set and unique entry constraints, to guarantee hazard-free logic. Initially, the burst mode locally clocked machine is stable in some state. Inputs in a specified burst may then change value in any order and at any time. Throughout this input burst, the machine outputs and state remain unchanged. When the input burst is complete, the outputs change as specified. A state change may also occur concurrently with the output change. Then the machine is driven to a new stable state. It is also possible for the input burst and output burst to occur without a state change. In either case, no further inputs may arrive until the machine is stable. After the machine is stable, the transition is complete and the machine is ready to receive a new input burst. Throughout the entire machine cycle, outputs and state variables must be free of glitches. Although this design style allows multiple input changes that can arrive at arbitrary times, it is restricted in terms of modeling concurrency between input and output signals. Yun et al. (23) removed some of these restrictions with extended burst mode specifications. To implement large designs, the behavior is usually described using distributed specifications. Kudva et al. developed an approach for synthesizing distributed burst mode specifications (24). Nowick et al. have developed a comprehensive suite of burst mode synthesis tools, which includes hazard-free logic minimization capability (24a) as well. Burst mode specifications can also be implemented using the Huffman-style state machines illustrated in Fig. 6. Yun and Dill (25) proposed automatic synthesis techniques for burst mode specifications using Huffman state machines, also known as 3D asynchronous state machines. Burst mode specifications have enjoyed significant success in industry as well. Research by Davis et al. (26) at Hewlett-Packard Labs resulted in a complete CAD methodology for burst mode synthesis, which was used to develop several industrial designs.
Asynchronous Design Using Petri Net-Based Specifications
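The burst mode execution cycle just described can be sketched as a small interpreter. The SPEC fragment below is hypothetical, loosely following the A to B transition {a+ b+} / {y+ z−} described for Fig. 8; each state maps a complete input burst (a set, so order-free) to an output burst and a next state.

```python
# Hypothetical two-state burst mode specification fragment.
SPEC = {
    'A': {frozenset({'a+', 'b+'}): (['y+', 'z-'], 'B')},
    'B': {frozenset({'c+'}): (['z+'], 'A')},
}

def run(state, events):
    """Accumulate input transitions; fire the output burst and change
    state only once the entire input burst has been seen."""
    pending, fired = set(), []
    for ev in events:
        pending.add(ev)
        for burst, (outs, nxt) in SPEC[state].items():
            if pending == burst:        # input burst complete:
                fired.extend(outs)      # ...emit the output burst,
                state, pending = nxt, set()
                break                   # ...then take the state change
    return state, fired

# inputs within a burst may arrive in any order (b+ before a+ here):
print(run('A', ['b+', 'a+', 'c+']))  # -> ('A', ['y+', 'z-', 'z+'])
```

The frozenset keys capture the maximal set idea directly: a burst matches only when the pending set equals it exactly, never when it is a proper subset.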
Petri nets (27) are a modeling tool for the study of systems. A petri net is a bipartite directed graph ⟨P, T, F, M0⟩, consisting of a finite set of transitions T, a finite set of places P, and a flow relation F ⊆ (P × T) ∪ (T × P) specifying a binary relation between transitions (represented as bars) and places (represented as circles). For example, in the petri net of Fig. 9, six places p0, p1, p2, p3, p4, and p5 correspond to six conditions (i.e., a job to be processed, a job is waiting, the processor is idle, a job is being processed, a job is waiting to be output, and a job has been processed). This example petri net has four transitions t1, t2, t3, and t4 that correspond to four events (i.e., a job is put in the input queue, a job is started, a job is completed, and a job is output). The net structure represents the static nature of the modeled system. Its dynamic behavior is captured by its markings and the firing of transitions, which transform one marking into another. A marking M is a collection of places corresponding to the local conditions that hold at a particular moment. It is graphically represented by solid circles, called tokens, residing in these places; i.e., for a given place p, a marking defines a nonnegative integer representing the number of tokens in p. The initial marking is denoted as M0. A transition is said to be enabled when all its input places are marked with at least one token. The firing of an enabled transition removes one token from each input place and deposits one token in each output place. The transformation of a marking M into another marking M′ by firing a transition t is denoted by M →t M′.
Figure 9. Modeling of a simple computer system with petri nets.
For the petri net example of Fig. 9, the transition t1 (a job is put in the input queue) is enabled because place p0 has one token (i.e., the condition a job to be processed is valid) and p0 is the only input place for transition t1. After the transition t1 occurs (fires), a token is removed from place p0 and deposited in transition t1's output place p1. This implies that the condition a job is waiting becomes valid. The initial marking M0 of the example petri net (Fig. 9) is {p0, p2}. Occurrence (firing) of transition t1 transforms the petri net marking to {p1, p2}. One of the main features of petri nets is their inherent concurrency. In the petri net model, two events that are both enabled and do not interact may occur independently. Another major characteristic of petri nets is their asynchronous nature. Petri nets define a partial order on the occurrence of events. The petri net structure contains all the necessary information to define all possible sequences of events. Thus, for the petri net of Fig. 9, the event a job is completed (transition t3) must follow the event a job is started (transition t2). Petri net execution is viewed as a sequence of discrete events. The order in which the events occur is one of possibly many allowed by the petri net structure. In addition to representing concurrency, petri nets can also represent nondeterminism, i.e., a choice between several events. If, at any time, more than one interacting transition (i.e., transitions that have a common place as input) is enabled simultaneously, then any of the enabled transitions may fire next. The choice as to which transition fires is made in a nondeterministic manner, i.e., randomly.
Signal Transition Graphs. Signal transition graphs (STG) are event-based graphical models that specify asynchronous circuit behavior using temporal relationships between signal transitions rather than states. They were introduced independently by Rosenblum and Yakovlev (28) and Chu (29). Signal transition graph specifications are based on petri nets as an underlying formalism, where the transitions T of the net are interpreted as rising (positive) and falling (negative) transitions on the input I and output O wires of the asynchronous circuit (i.e., T ⊆ (I × {+, −}) ∪ (O × {+, −})). In an STG, transitions are represented by their names instead of a bar and a label. Every place with a single input and output transition is represented by an arc between these transitions, which represents their temporal relationship [as illustrated in Fig. 10(b)].
Figure 10. (a) An example timing diagram and (b) the corresponding signal transition graph.
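The enabling and firing rules above can be sketched directly for the Fig. 9 example. The FLOW relation below is my reading of the narrative (e.g., starting a job consumes both a waiting job and the idle processor), not a verbatim copy of the figure.

```python
# Each transition maps to (input places, output places).
FLOW = {
    't1': ({'p0'}, {'p1'}),          # a job is put in the input queue
    't2': ({'p1', 'p2'}, {'p3'}),    # a job is started
    't3': ({'p3'}, {'p2', 'p4'}),    # a job is completed
    't4': ({'p4'}, {'p5'}),          # a job is output
}

def enabled(marking, t):
    """A transition is enabled when every input place holds a token."""
    return all(marking.get(p, 0) >= 1 for p in FLOW[t][0])

def fire(marking, t):
    """Remove one token from each input place and deposit one token
    in each output place, yielding the new marking."""
    assert enabled(marking, t)
    m = dict(marking)
    for p in FLOW[t][0]:
        m[p] = m[p] - 1
    for p in FLOW[t][1]:
        m[p] = m.get(p, 0) + 1
    return m

m0 = {'p0': 1, 'p2': 1}   # initial marking M0 = {p0, p2}
m1 = fire(m0, 't1')       # now p1 (a job is waiting) is marked
```

Firing t1 from M0 yields the marking {p1, p2} described in the text, after which t2 (a job is started) becomes enabled.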
STG specifications can explicitly describe asynchronous circuit behavior (e.g., concurrency, causality and conflict) and have captured wide attention. Figure 10(a) shows a timing diagram that specifies three signals x, y, and z. A positive transition of signal y follows the positive transition on signal x. Similarly, a positive transition of signal z follows the positive transition on signal x. Because there is no ordering constraint between the positive transitions of y and z, they are said to be concurrent. This timing diagram can be directly transformed into STG specifications by representing signal transitions as nodes and ordering constraints as directed arcs. The STG specifications corresponding to the timing diagram of Fig. 10(a) are shown in Fig. 10(b). A model similar to STG specifications, called change diagrams, was proposed by Varshavsky et al. (30). Change diagrams can specify concurrent behavior but are unable to specify conflict behavior (i.e., either one of the events can occur but not both). STG specifications have several advantages for specifying control-intensive asynchronous behavior. STG specifications can explicitly describe the major aspects of asynchronous control circuit behavior (e.g., concurrency, causality, and conflict). Most control-intensive circuits are specified using timing diagrams that can be directly transformed into STG specifications. Thus, they are appealing to designers. Because STG specifications are based on petri nets as an underlying formalism, they can directly use the wide body of petri net analysis techniques. The graphical nature of a signal transition graph makes it easier to analyze circuit behavior at a higher level of abstraction. A logic circuit can be derived by transforming the STG specifications into a state graph (29) [as shown in Fig. 11(b)]. The state graph represents all the states of STG specifications. It captures all the possible transition sequences in the STG. 
A state graph can be derived by exhaustively generating all possible markings (i.e., states) of the STG. A state graph can be mapped into a circuit by assigning a unique binary code to each state in the state graph. This binary encoding is derived from the values of the STG signals in each state: the value of the input and output signals in a given state gives the binary code for that state, derived using a consistent state assignment defined as follows. Consistent state assignment: For STG signals {s1, s2, ..., sn}, a state M in the state graph is assigned a binary code ⟨M(s1), M(s2), ..., M(sn)⟩. If a transition t is enabled in state M (i.e., M →t M′), then t = si+ implies M(si) = 0 and M′(si) = 1, and t = si− implies M(si) = 1 and M′(si) = 0. For example, in the initial marking of the STG shown in Fig. 11(a), transition x+ is enabled. Thus, signal x must have a value 0 in the initial state because it must go to a value 1 after firing a positive transition of x. Similarly, in the initial state, STG signals y and z can also be evaluated to have values of 0 and 0, respectively. This assigns a binary code 000 to the initial state of the state graph shown in Fig. 11(b). The values in binary code 000 correspond to the values of STG signals x, y, and z, respectively. Binary encoding for the rest of the states can easily be derived by simply firing the enabled signal transitions and changing the signal values accordingly. Figure 11(b) shows the state graph corresponding to the signal transition graph of Fig. 11(a).
Figure 11. (a) An example STG; (b) the corresponding state graph; and (c) the logic derivation.
Every state in the state graph of Fig. 11(b) has been encoded according to the values of the input and output signals in the STG. The following constraint ensures that the circuit will be able to distinguish between two states by using only the input and output signal values. A state graph satisfies the complete state coding (CSC) constraint (29) if and only if the transitions of non-input signals enabled in two states having the same binary code assignment are the same. Thus, only the input transitions enabled in two states having the same binary code may differ, and it is assumed that the environment can distinguish between them. A state graph satisfying the CSC constraint has a well-defined logic function, and there is no conflict of implied values even if the binary code assignments of two states are the same. Because only input and output signal values are used for coding the states, the CSC constraint is necessary for logic implementation of the STG specifications. A CSC violation must be corrected by inserting extra signals in the STG so as to distinguish between the states violating CSC (31). For example, Fig. 12 shows two states S1 and S2 having the same state encoding 011 in a state graph. State S1 enables an input signal ai transition and an output signal ao transition. On the other hand, state S2 enables only an output signal bo transition. Because states S1 and S2 have the same binary encoding 011 and they enable different output (i.e., non-input) signals, they violate the CSC constraint according to the definition given previously.
Figure 12. Complete State Coding violation in STGs.
This violation can be corrected by inserting another signal n (called a state signal) in the state
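A CSC check of this kind is mechanical. The sketch below mirrors the S1/S2 situation of Fig. 12 (the state codes and transition names are illustrative): it flags any binary code shared by states that enable different non-input transitions.

```python
from collections import defaultdict

def csc_violations(states, is_input):
    """states: list of (binary code, set of enabled transitions).
    Return the codes shared by states whose enabled non-input
    transitions differ (i.e., CSC violations)."""
    by_code = defaultdict(list)
    for code, en in states:
        by_code[code].append({t for t in en if not is_input(t)})
    return [code for code, groups in by_code.items()
            if any(g != groups[0] for g in groups[1:])]

states = [
    ('011', {'ai+', 'ao-'}),   # S1: input ai and output ao enabled
    ('011', {'bo-'}),          # S2: only output bo enabled
    ('010', {'ai+'}),          # some other state with a unique code
]
print(csc_violations(states, is_input=lambda t: t.startswith('ai')))
# -> ['011']
```

States that share a code but differ only in enabled input transitions are not reported, matching the assumption that the environment can distinguish them.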
graph. This signal must have a value of 1 in one state and 0 in the other state. The state signal insertion to resolve the complete state coding violation is also illustrated in Fig. 12, where n takes the value 1 in the binary code of state S1 and the value 0 in the binary code of state S2. Thus, states S1 and S2 now have different binary codes (i.e., 011 extended with n = 1 and with n = 0, respectively). A logic circuit can be derived from a state graph with no CSC violations by finding the implied values of the output signals in every state. The implied value of an output o in some state graph state s is defined as
• the complement of the present value of o in the binary code of state s, if o is enabled in state s, or
• the present value of o in the binary code of state s, if o is not enabled in state s.
For example, in the initial state 000 of the state graph in Fig. 11(b), output y is not enabled. Thus, the implied value of y in state 000 is the same as the present value of y (i.e., 0). On the other hand, in state 100, the output transition y+ is enabled. Thus, the implied value of y in state 100 is the complement of the present value of y (i.e., 1). Similarly, the implied value of output y in the rest of the state graph states can be obtained. The logic function of an output can be obtained by constructing a Karnaugh map that contains the entries of output-implied values. For example, in the Karnaugh map of output y in Fig. 11(c), the entry corresponding to xyz = 000 is 0 (i.e., the implied value of y in state graph state 000) and the entry corresponding to xyz = 100 is 1 (i.e., the implied value of y in state graph state 100). A Karnaugh map gives the logic function for the corresponding output [e.g., the Karnaugh map in Fig. 11(c) generates the logic function y = x + z].
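The implied-value rule is easy to mechanize. The state graph below is a hypothetical reconstruction consistent with the Fig. 11 discussion (only the 000 and 100 entries are stated in the text); under it, the implied value of y equals x OR z in every state, matching the derived function y = x + z.

```python
# code (x, y, z) -> set of enabled transitions (hypothetical graph).
GRAPH = {
    (0, 0, 0): {'x+'}, (1, 0, 0): {'y+', 'z+'}, (1, 0, 1): {'y+'},
    (1, 1, 0): {'z+'}, (1, 1, 1): {'x-'},       (0, 1, 1): {'z-'},
    (0, 1, 0): {'y-'}, (0, 0, 1): {'y+'},
}
SIGS = ('x', 'y', 'z')

def implied(code, sig, enabled):
    """Implied value: complement of the present value if a transition
    of sig is enabled in this state, else the present value."""
    i = SIGS.index(sig)
    if sig + '+' in enabled or sig + '-' in enabled:
        return 1 - code[i]
    return code[i]

# One Karnaugh-map entry per state; together they give y = x + z.
table = {code: implied(code, 'y', en) for code, en in GRAPH.items()}
print(all(v == (c[0] | c[2]) for c, v in table.items()))  # -> True
```

Filling these implied values into the Karnaugh map positions is exactly the step the text describes for Fig. 11(c).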
It has been proved that CSC is the necessary and sufficient requirement for implementing hazard-free STG specifications as logic circuits, if the logic can be implemented with a single complex gate (32). A number of state encoding techniques have been developed to satisfy the CSC constraint. Lin and Lin (37) and Vanbekbergen et al. (38) proposed state encoding techniques that satisfy a stricter constraint than the general CSC constraint. This constraint requires that each state in
the state graph must have a unique state code. In addition, these techniques were restricted to STG specifications describing only concurrent asynchronous behavior (also known as marked graphs). These techniques were also limited by an additional restriction that, for every signal, the STG can specify only one rising and one falling transition. Lavagno et al. (31) proposed a new state encoding framework for STG specifications with a limited interplay of concurrency and choice. They solved the CSC constraint satisfaction problem at the state graph level by transforming the STG into an FSM state table. This state table is then reduced with state minimization and encoded using critical-race-free state assignment techniques. This approach guarantees sufficient conditions for CSC satisfaction. The algorithm inserts state signals into the original STG to satisfy the CSC constraints but only handles live free-choice petri nets. The CSC solutions obtained in this framework correspond to a special class of STG transformations. Vanbekbergen et al. (40) proposed a general framework to solve the CSC satisfaction problem for general STG specifications. It is not limited to marked graphs or free-choice petri nets. They formulated the CSC problem as a Boolean satisfiability (SAT) problem and gave the necessary and sufficient conditions for the insertion of state signals, which ensures the CSC property while preserving the original STG behavior. It is well known that many combinatorial optimization problems can be directly transformed into the SAT problem. Unfortunately, the instances of SAT formulas derived from practical STGs are too large to be solved efficiently. Puri and Gu (41) proposed an efficient modular approach for solving complete state coding problems by first partitioning the signal transition graph into a number of simpler and more manageable modular graphs.
This approach is applicable to general signal transition graphs and achieves significant performance improvements (42,43). A group of researchers including Cortadella, Kishinevsky, Kondratyev, Lavagno, Pastor, Semenov, and Yakovlev (43a,43b,43c) has made significant progress in reducing the complexity of solving complete state coding. They solved the CSC problem by directly analyzing the STG specifications rather than the state graphs derived from them. Because the number of states in a state graph can be very large for a highly concurrent STG, methods that work directly at the STG level can yield significant performance improvements in some cases. In general, logic circuits derived from practical STG specifications are too large to be implemented in a single complex gate. This implies that the gate-level logic implementation obtained from state graphs satisfying the CSC constraint may not be free from all hazards (32). Lavagno et al. (34) and Yu and Subrahmanyan (39) developed heuristic techniques that add delays for gate-level hazard removal under bounded delay models. Moon proved in (33) that an STG that satisfies the CSC requirement is free from all functional hazards, all critical races, and all static 0-hazards under the unbounded gate delay (speed-independent) model. Thus, the speed-independent logic implementation of an STG satisfying the CSC constraint may have only static 1-hazards and dynamic hazards. Moon et al. further proposed algorithms to remove these remaining hazards (32,33). Significant progress was made by Kondratyev et al. (43d), who developed sufficient conditions, called the monotonous cover condition and the unique entry condition, to derive a hazard-free speed-independent (i.e., under unbounded gate delay) circuit. Pastor and Cortadella (43e) developed efficient algorithms for hazard-free synthesis of speed-independent circuits directly from STGs that satisfy the CSC constraint. Beerel and Meng (35) and Kishinevsky et al. (36) also developed efficient algorithms to derive hazard-free speed-independent circuits from state graphs. Myers and Meng (45) extended STG specifications to incorporate timing constraints and developed efficient synthesis algorithms to implement them. Over several years, a group of researchers including Kishinevsky, Kondratyev, Taubin, and Varshavsky made significant contributions to developing a publicly available design tool called FORCAGE for synthesizing practical STG specifications into speed-independent circuits (36). Another publicly available design tool for synthesizing hazard-free circuits from STG specifications was developed by Lavagno et al., and this tool is integrated with the publicly available Berkeley SIS tool (45a). Vanbekbergen, Ykman, and Lin et al. at IMEC Belgium developed a tool called ASSASSIN for synthesis and analysis of asynchronous control circuits from general STGs, including timed signal transition graphs (45b). Cortadella et al. have developed a state encoding and synthesis tool called PETRIFY for designing speed-independent circuits from STG specifications (43b,43c).
Asynchronous Design Using Communicating Processes
CSP-Based Specifications. Hoare introduced a specification language, called communicating sequential processes (CSP) (46), for a set of concurrent processes that communicate over fixed links called channels.
Martin used a subset of this specification language consisting of sequential, communication, and probe constructs and developed a rule-based asynchronous synthesis procedure that transforms CSP (communicating sequential processes) specifications describing asynchronous behavior into CMOS (complementary metal oxide semiconductor) circuits (47,48). In Martin's method, asynchronous behavior specified using the CSP notation is transformed into a semantically equivalent set of VLSI (Very Large Scale Integrated circuit) operators using transformations such as process decomposition, handshaking expansion, and production-rule expansion. The first step of the transformation, called process decomposition, replaces one process with several processes by application of a decomposition rule. Process decomposition makes it possible to reduce a process with an arbitrary control structure to a set of subprocesses of only two different types: either a (finite or infinite) sequence of communication actions or a repetition of process selections. The next step of the transformation, called handshaking expansion, replaces each channel with a pair of wire-operators and each communication action in a program with its implementation in terms of the elementary actions of a four-phase handshaking protocol. Production-rule expansion is the transformation from a handshaking expansion to a set of production rules. It is the most important step of the compilation and consists of state assignment, guard strengthening, and symmetrization. State assignment transforms the handshaking expansion to ensure that each state of the expansion is unique. After state variables have been introduced so as to distinguish any two states of the handshaking expansion, it is possible to strengthen the Boolean guards of the production rules to enforce program-order execution. Subsequently, symmetrization may be performed on the production rules to minimize the number of state-holding operators. Finally, the production rules can be transformed into a circuit implementation. Burns developed an automated version of this procedure and further improved the performance of the implemented circuits (49,50). This design style assumes a four-phase handshaking protocol [Fig. 2(b)]. The designed circuits are quasi-delay-insensitive: the operation of the circuit is independent of component and interconnecting wire delays, provided the delays on the branches of each wire fork are comparable [i.e., the design conforms to the isochronic fork assumption (51)]. This method based on CSP specifications has produced many practical examples, such as the distributed mutual exclusion element (52) and an asynchronous microprocessor (53). A similar compilation-based method that generalizes Martin's communication style to include shared variables has been proposed by van Berkel et al. (54). These techniques have also been implemented by van Berkel et al. into a robust asynchronous design tool called TANGRAM at Philips Research Labs. TANGRAM first compiles the CSP-based specifications into an intermediate representation called a handshake circuit, which consists of a network of handshake processes that communicate on channels using asynchronous protocols. The circuit is then optimized using peephole optimization, and finally the components are mapped to a VLSI implementation. TANGRAM has been successfully used to implement several DSP (Digital Signal Processing) designs at Philips. Brunvand and Sproull used a programming language called occam, which is based on CSP (55). Occam describes asynchronous computations as a set of concurrent processes that interact by communication over channels.
In occam, control over the concurrent and sequential aspects of communication is explicit. Brunvand and Sproull developed a design methodology to translate programs written in a subset of occam automatically into delay-insensitive circuits using syntax-driven techniques and a two-phase handshaking protocol [Fig. 2(a)]. The resulting circuits are then improved using semantics-preserving circuit-to-circuit transformations.
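The difference between the two-phase protocol [Fig. 2(a)] used here and the four-phase protocol [Fig. 2(b)] assumed by Martin's method can be made concrete with a small Python sketch (illustrative only; the function names are ours, not from the article):

```python
# Two-phase (transition) signaling: every transition on req or ack,
# rising or falling, is an event, so one transfer costs two transitions.
def two_phase(n_transfers):
    req = ack = False
    trace = []
    for _ in range(n_transfers):
        req = not req; trace.append(("req", req))  # sender signals an event
        ack = not ack; trace.append(("ack", ack))  # receiver acknowledges
    return trace

# Four-phase (return-to-zero) signaling: each transfer must also return
# both wires to zero, so one transfer costs four transitions.
def four_phase(n_transfers):
    req = ack = False
    trace = []
    for _ in range(n_transfers):
        req = True;  trace.append(("req", req))    # request
        ack = True;  trace.append(("ack", ack))    # acknowledge
        req = False; trace.append(("req", req))    # return to zero
        ack = False; trace.append(("ack", ack))    # return to zero
    return trace

print(len(two_phase(1)), len(four_phase(1)))  # 2 4
```

The two-phase style spends fewer wire transitions per transfer, while the four-phase style restores the wires to a known idle state after every transfer; the trade-off between the two recurs throughout the asynchronous design literature.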
Trace Theory-Based Specifications. Trace theory was inspired by Hoare's CSP and developed by van de Snepscheut (56) and Rem et al. (57). Ebergen used trace theory to describe asynchronous behavior (58). A trace of a circuit represents a history of execution by listing all the transitions of signals at its interface. The set of all possible traces of a circuit, known as the trace set, completely specifies the behavior of the circuit at its interface. More formally, a trace structure is defined using the communication alphabet Σ of a circuit. This alphabet consists of a finite number of symbols used to represent the wires over which a circuit communicates. A trace structure is defined as a triple T = ⟨I, O, X⟩, where I ⊆ Σ is a finite set of input symbols, O ⊆ Σ is a finite set of output symbols, and X ⊆ Σ* is the set of all possible traces of the circuit. The set Σ* is the set of all finite-length sequences of symbols in Σ. Although an individual trace represents a single execution history of the circuit, the trace set X captures all possible execution histories of an interface and thus completely specifies the behavior of the interface. A trace containing no execution symbols is represented by the symbol ε. Traces may be extended by appending a new symbol corresponding to a possible transition onto the end of the trace. This indicates that the transition is allowed to occur immediately following the events already recorded in the trace. Trace sets, being sets of simple lists of symbols, are often expressed using familiar regular-expression notation. This notation makes the description of an entire trace set more compact than simply listing all the possible traces. Regular expressions are composed of the symbols in the alphabet Σ ∪ {ε} of the trace structure and the special symbols |, *, (, ), interpreted as follows. Let r1 and r2 be two regular expressions.
• r1r2 is a regular expression representing the concatenation of r1 and r2.
• r1|r2 is a regular expression representing a choice between expressions r1 and r2.
• r1* is a regular expression that represents zero or more repetitions of expression r1.
• (r1) is a regular expression that groups all the symbols of expression r1 into a single expression.
For example, if the alphabet Σ = {r, a} is used to represent request-and-acknowledge wires in a two-phase signaling protocol, the trace set of this protocol can be expressed with the regular expression {[(ra)*]|[(ra)*r]}. In designing delay-insensitive asynchronous circuits through trace theory, components are described using commands that describe sequences of possible events (i.e., traces). Based on trace theory, Ebergen developed the concept of formal decomposition of a component (59). A decomposition of a component represents a realization of that component by means of a network of other basic components such that the correctness of the network is insensitive to delays in the basic components. The basic components in trace theory are implemented with C-elements, XOR gates, Toggle elements, and Merge elements (59). All communication in this method is through a two-phase handshaking protocol.

Asynchronous Design Using Micropipelines
Pipelines provide an efficient framework for performing high-speed computations because their separate stages can operate in parallel. Pipelines both store and process data: storage elements (registers) and processing logic blocks alternate along the length of the pipeline. Thus, without any processing logic blocks, a pipeline acts as a shift register. Synchronous circuits use clocked pipelines [as shown in Fig. 13(a)], where data advance through the pipeline at a fixed clock rate. Because the processing logic blocks in a pipeline may have different delays, the clock rate is chosen according to the worst-case delay of any processing block. Because of this fixed clock rate, a clocked pipeline operates at a data rate well below its optimal performance. This drawback of clocked pipelines can be eliminated by employing asynchronous pipelines, in which different stages operate at different rates and communicate with each other using handshaking protocols. Micropipelines were introduced by Sutherland (8) as an asynchronous alternative to synchronous pipelines. As shown
Figure 13. (a) Clocked pipeline and (b) micropipeline with processing.
in Fig. 13(b), a micropipeline consists of alternating logic blocks separated by registers and control circuitry. Computation on data in a micropipeline is accomplished by the logic blocks. In addition to its elastic nature, a major benefit of the micropipeline structure is that hazards can be filtered out by the registers that moderate the flow of data through the logic blocks. In micropipelines, event-controlled registers are employed, as opposed to the clocked registers in clocked pipelines. An event-controlled register responds to events on its two control wires, called capture [denoted by C in Fig. 13(b)] and pass [denoted by P in Fig. 13(b)]. When the control signals C and P are in the same state, the event-controlled register is transparent (i.e., its data input is directly connected to its output). When signals C and P differ in their state, the event-controlled register captures and stores the data, and its data output is disconnected from its data input. The behavior of this event-controlled register can also be described in terms of events. Initially, the register is assumed to be transparent. In addition, it is assumed that the capture control signal and the pass control signal always alternate. An event on the capture control wire causes the register to capture and hold the data value passing through it. This event isolates the output of the register from changes at its input but does not change the output value. A subsequent event on the pass control wire returns the register to its transparent state, permitting the next data value to appear at its output and possibly changing its output value. Thus, after each event on the pass control wire, a new output value may appear. The event-controlled register also includes two control outputs, called capture done (denoted as Cd) and pass done (denoted as Pd), which are amplified and thus delayed versions of the corresponding control input signals capture and pass.
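The capture/pass behavior described above can be sketched as a small software model (a hypothetical Python class, not from the article; a real event-controlled register is level-sensitive hardware):

```python
# Toy model of Sutherland's event-controlled register: transparent while
# the capture (C) and pass (P) control wires are in the same state, and
# holding the captured value while they differ.
class EventRegister:
    def __init__(self):
        self.c = self.p = False   # equal states -> transparent
        self.held = None

    def output(self, data_in):
        return data_in if self.c == self.p else self.held

    def capture(self, data_in):   # event on C: latch the value flowing through
        self.held = data_in
        self.c = not self.c

    def pass_(self):              # event on P: become transparent again
        self.p = not self.p

r = EventRegister()
assert r.output(7) == 7           # transparent: input appears at the output
r.capture(7)                      # capture event: hold 7
assert r.output(99) == 7          # input changes are now isolated
r.pass_()                         # pass event: transparent again
assert r.output(99) == 99
```

As in the text, the model assumes capture and pass events strictly alternate; the Cd and Pd completion outputs are omitted for brevity.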
The control output signals Cd and Pd deliver output events after the register has done its action (i.e., they must be delayed until after the register has performed the corresponding capture and pass actions). The datapath of a micropipeline as shown in Fig. 13(b) consists of event-controlled registers and logic blocks for data processing, which alternate along the length of the micropipeline. The control for the micropipeline consists of a string of Muller C-elements with inverters interposed as illustrated in Fig. 13(b). Initially, the outputs of all Muller C-elements are
initialized to the same state. This is achieved through a master clear signal. A request event received on the Ri wire of the micropipeline initiates the processing. This allows C-element C1 to generate a capture signal for register Reg1. After some delay, register Reg1 generates a capture done (Cd) signal, which sends an acknowledge event back on the Ai wire. The event on the Cd wire generated by register Reg1 also reaches C-element C2 after being delayed by more than the worst-case delay d1 of logic block L1; C2 then generates a capture signal for register Reg2. This indicates that logic block L1 has completed processing and that register Reg2 should capture its data. After register Reg2 captures the data, it generates a capture done signal that in turn signals register Reg1, through its pass (P) signal, that it can now pass new data to logic block L1. At this point, logic block L2 can process the data captured in register Reg2 concurrently with processing in logic block L1. After the event on its pass wire, register Reg1 generates an event on its pass done (Pd) signal to indicate that it has switched to the transparent mode. With this event on the Pd signal of register Reg1, C-element C1 will again generate a capture signal for register Reg1 once a request event signaling the availability of new data is received on Ri, and the same process continues. Micropipelines have been used extensively in one of the most comprehensive research efforts to design an asynchronous microprocessor, AMULET (60), an asynchronous version of the popular synchronous ARM microprocessor. Although micropipelines avoid the hazard problem and are elastic in nature (i.e., each pipeline stage can have a different delay), they still deliver worst-case performance in each single stage of the micropipeline.
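The cost of this per-stage worst-case behavior can be illustrated with a toy calculation (the delay numbers below are invented for illustration, not taken from the article): a stage whose matched delay is fixed at the worst case wastes the slack of every fast operation, whereas an ideal completion-sensing stage would take only the actual delays.

```python
import random
random.seed(1)  # reproducible illustrative run

# Data-dependent delays (ns) for one processing stage: most operations
# are fast, a few hit the worst case.
op_delays = [1.0 if random.random() < 0.9 else 4.0 for _ in range(1000)]

worst_case_time = len(op_delays) * max(op_delays)  # delay matched to worst case
actual_time = sum(op_delays)                       # ideal completion sensing

print(f"worst-case matched: {worst_case_time:.0f} ns, "
      f"completion-sensed: {actual_time:.0f} ns")
assert actual_time < worst_case_time
```

Closing this gap is exactly what completion-detection techniques (e.g., dual-rail data encodings) aim for, at the price of extra circuitry.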
ASYNCHRONOUS PROCESSORS AND OTHER DESIGN EXAMPLES

Early digital computers embodied a variety of design styles. Although most designs were based on synchronous techniques, there were several examples that used an asynchronous approach. For example, ORDVAC, built at the University of Illinois in 1951, and the IAS machine, built by John von Neumann's group at Princeton University in 1951, were asynchronous designs and operated without any central clock. Later, in 1974,
a design team at Manchester University built a processor, called MU5, that used asynchronous control. In 1978, Davis designed the first dataflow computer, called DDM-1 (61), which used locally synchronous modules that communicated using a request-and-acknowledge protocol. While modern digital computers are dominated by the synchronous approach, some asynchronous processor designs are beginning to make the transition from research to products. Alain Martin at Caltech developed the first quasi-delay-insensitive 16-bit asynchronous microprocessor (53), which was fabricated in a 1.6 µm CMOS process and consumed only 200 mW at 5 V and 7.6 mW at 2 V. This design was later implemented in GaAs technology as well (62). Recently, Martin et al. also designed an asynchronous version of the MIPS R3000 processor (63). Researchers at Manchester University have designed several asynchronous versions of the ARM microprocessor, called AMULET (60). The most recent version of this asynchronous microprocessor, known as AMULET2e (64), delivers 40 MIPS at only 150 mW and targets portable embedded applications. Advanced RISC (Reduced Instruction Set Computer) Machines Ltd., the inventors of the ARM microprocessor, have already initiated efforts to transfer this asynchronous design into a commercial offering (65). Sharp Corporation recently announced an asynchronous media processor chip, called NMP (66), that employs a data-driven architecture. It uses eight clock-free CPUs (Central Processing Units), each of which delivers 600 MOPS using only 40,000 gates and consuming less than 60 mW. In a significant development in the area of commercial asynchronous processors, Cogency Technology, Inc., revealed a design system (67) that can produce made-to-order self-timed CPUs and DSPs. In its first commercial use, the suite has produced a completely asynchronous digital signal-processing chip, stDSP, which was designed for LG Semicon.
This chip is roughly the same size as a functionally identical synchronous version but uses 47% less power. Researchers at the Tokyo Institute of Technology recently designed a 32-bit delay-insensitive asynchronous version of the MIPS R2000 processor, called TITAC-2 (68). Like any asynchronous processor, TITAC-2 works correctly even under large temperature and power supply variations. In addition, TITAC-2 achieves a performance of 52 MIPS with a power consumption of 2 W at 3.3 V. Researchers at Sun Microsystems have recently designed an asynchronous counterflow pipeline processor (69). A significant design effort by a team of designers at Hewlett-Packard Labs resulted in a completely asynchronous full-custom CMOS chip called Post Office (70), which had 300,000 transistors and was designed to support internode communication for the Mayfly parallel processing system. Mark Dean at Stanford University built a processor called STRiP (self-timed RISC processor) (71), which combines both synchronous and asynchronous design techniques. In addition to the large asynchronous chips mentioned, several smaller asynchronous chips, such as a DCC error corrector chip (72) and a high-speed packet-switching component chip (73) by Philips Research Labs and a communication chip at Hewlett-Packard Labs (7), are also slowly beginning to make their way into the commercial marketplace. In spite of all the asynchronous design applications mentioned earlier, almost all commercial digital designs employ synchronous circuits. The major reason for the overwhelming popularity of synchronous circuits is their ease of design. In addition, asynchronous circuits are prone to hazards that may cause a circuit to malfunction. Hazards are naturally filtered out in a synchronous design by choosing a long enough clock period, which ensures that the circuit is in a stable state before the next input changes take place. Thus, high-performance asynchronous logic circuits that are free of hazards are more difficult to design than their synchronous counterparts. In general, the presence of hazards in asynchronous circuits is a major hindrance to their widespread use. In addition, a clear advantage of asynchronous designs over synchronous designs for large-scale high-performance and low-power circuits still remains to be demonstrated in general.
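A concrete example of such a hazard is the classic static-1 hazard in a two-way multiplexer. The following Python sketch (illustrative only; the one-unit inverter delay and signal names are our assumptions) shows the output glitching low even though it should remain high throughout:

```python
# Static-1 hazard demonstration: the multiplexer y = (a AND s) OR (b AND NOT s)
# should hold y = 1 when a = b = 1, but a one-unit delay through the inverter
# on s opens a momentary 0-glitch window when s falls.
def mux_with_delay(a, b, s_now, s_prev):
    # NOT s arrives one time unit late, so it still reflects the old s value
    return (a and s_now) or (b and not s_prev)

a = b = True
s_timeline = [True, False, False]          # select signal s falls at t = 1
y = [mux_with_delay(a, b, s_timeline[t],
                    s_timeline[t - 1] if t else s_timeline[0])
     for t in range(3)]
print(y)  # [True, False, True] -- the middle False is the hazard glitch
```

A synchronous design simply waits out this transient with a long enough clock period; an asynchronous design must restructure the logic (e.g., add the redundant consensus term a AND b) so the glitch never occurs.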
CONCLUSIONS

Asynchronous logic circuits hold the promise of solving the clock distribution and power dissipation problems in high-performance circuits. This is a major motivating factor in the recent resurgence of interest in applying asynchronous design techniques to processor design as well as to low-power applications. Asynchronous logic design should not be viewed as a single alternative to synchronous logic design. More accurately, synchronous design is a special case, representing a single design point in a multidimensional asynchronous design space that varies from totally distributed control to global control with a clock. Because of the problems of distributing clocks with frequencies in excess of 1 GHz, it is likely that the next generation of processors will exploit some of the advantages offered by asynchronous logic by implementing circuits with a suitable combination of synchronous and asynchronous design techniques. Although modern digital computers are dominated by the synchronous approach, asynchronous designs have already started to emerge in commercial products. Recent years have seen a surge in research activities related to asynchronous logic design. This article has focused on the logic design aspect of asynchronous circuits and discusses only a portion of the vast body of literature available on asynchronous design (a complete bibliography of the field is maintained at Eindhoven University of Technology, Netherlands, and can be obtained from
[email protected]). Specification and design of logic circuits is just one of several crucial steps in a practical design methodology. Testing and verification of the implemented circuits are an integral part of any design flow and consume a dominant portion of the design cycle of a complex system. Verifying that what was implemented is what was specified is critical for avoiding costly errors late in the design cycle. Formal techniques for verifying circuit implementations are especially critical for asynchronous designs because of the subtlety of their timing behavior. Research into these new frontiers of practical formal verification techniques for asynchronous designs is the focus of some recent efforts and will be an interesting area for further study. Although much progress has been made toward developing robust asynchronous design and verification techniques, establishing asynchronous designs over competing synchronous implementations in industry still remains a challenge. Now more than ever, it is crucial to evaluate the advantages of the design techniques already researched through practical design implementations and to develop methods that fit into the design methodologies practiced by industry.
BIBLIOGRAPHY 1. B. Gieseke et al., A 600MHz superscalar RISC microprocessor with out-of-order execution, Int. Solid State Circuits Symp., 1997, pp. 176–177. 2. S. B. Furber, Breaking step: The return of asynchronous logic. IEE Rev., 39 (4): 159–162, 1993. 3. P. Song, Asynchronous design shows promise, Microprocessor Report, 11 (13): 1997. 4. S. H. Unger, The Essence of Logic Circuits, 2nd ed., New York: IEEE Press, 1997. 5. S. B. Furber, Asynchronous design, in W. Nebel and J. Mermet (eds.), Proc. Submicron. Electr., Il Ciocco, Italy, 1996, pp. 461–492. 6. P. E. Gronowski et al., A 433MHz 65b quad-issue RISC microprocessor, in Int. Solid State Circuits Symp., 1996, pp. 222–223. 7. A. Marshall, B. Coates, and P. Siegel, Designing an asynchronous communications chip, IEEE Design & Test Comput., 11 (2): 8– 21, 1994. 7a. W. A. Clark and C. E. Molnar, Macromodular computer systems, in R. W. Stacy and B. D. Waxman (eds.), Computers in Biomedical Research, vol. IV, New York: Academic Press, 1974, ch. 3, pp. 45–85. 8. I. E. Sutherland, Micropipelines, Commun. ACM, 32 (6): 720– 738, 1989. 8a. M. Dean, T. Williams, and D. Dill, Efficient self-timing with level-encoded 2-phase dual-rail (LEDR), in C. H. Sequin (ed.), Advanced Research in VLSI, Cambridge, MA: MIT Press, 1991, pp. 55–70. 8b. T. Verhoeff, Delay-insensitive codes—an overview. Distributed Computing, 3 (1): 1–8, 1988. 9. J. A. Brzozowski and C.-J. H. Seger, Asynchronous Circuits, New York: Springer-Verlag, 1995. 10. D. E. Muller and W. S. Bartky, A theory of asynchronous circuits, in Proc. Int. Symp. Theory of Switching, Harvard University Press, April 1959, pp. 204–243. 11. J. T. Udding, Classification and Composition of Delay-Insensitive Circuits, PhD thesis, Dept. of Math. and Comp. Sci., Eindhoven Univ. of Technology, 1984. 12. J. B. Dennis and S. S. Patil, Speed-independent asynchronous circuits, in Proc. Hawaii Int. Conf. System Sci., 1971, pp. 55–58. 13. J. C. 
Ebergen, A formal approach to designing delay-insensitive circuits, Distributed Comput., 5 (3): 107–119, 1991. 14. D. A. Huffman, The synthesis of sequential switching circuits, in E. F. Moore (ed.), Sequential Machines: Selected Papers, Reading, MA: Addison-Wesley, 1964. 15. S. H. Unger, Asynchronous Sequential Switching Circuits, New York: Wiley-Interscience, 1969. 16. S. H. Unger, Hazards, critical races, and metastability, IEEE Trans. Comput., 44: 754–768, 1995. 17. S. Hauck, Asynchronous design methodologies: An overview, Proc. IEEE, 83: 69–93, 1995. 18. A. L. Davis and S. M. Nowick, An introduction to asynchronous circuit design, in A. Kent and J. G. Williams (eds.), The Encyclopedia of Computer Science and Technology, vol. 38, New York: Marcel Dekker, 1997. 19. J. H. Tracey, Internal state assignments for asynchronous sequential machines, IEEE Trans. Electron. Comput., EC-15: 551–560, 1966. 20. A. B. Hayes, Stored state asynchronous sequential circuits, IEEE Trans. Comput., C-30 (8): 596–600, 1981. 21. C. A. Rey and J. Vaucher, Self-synchronized asynchronous sequential machines, IEEE Trans. Comput., 23: 1306–1311, 1974.
22. S. M. Nowick and D. L. Dill, Synthesis of asynchronous state machines using a local clock, Proc. Int. Conf. Comput. Design ICCD, Los Alamitos, CA: IEEE Computer Society Press, October 1991, pp. 192–197. 23. K. Y. Yun, D. L. Dill, and S. M. Nowick, Practical generalizations of asynchronous state machines, Proc. Eur. Conf. Design Autom. EDAC, Los Alamitos, CA: IEEE Computer Society Press, February 1993, pp. 525–530. 24. P. Kudva, G. Gopalakrishnan, and H. Jacobson, A technique for synthesizing distributed burst-mode circuits, Proc. ACM/IEEE Design Autom. Conf., 1996. 24a. S. M. Nowick and D. L. Dill, Exact two-level minimization of hazard-free logic with multiple-input changes, IEEE Trans. Comput.-Aided Des., 14 (8): 986–997, 1995. 25. K. Y. Yun and D. L. Dill, Automatic synthesis of 3D asynchronous state machines, Proc. Int. Conf. Comput.-Aided Design ICCAD, Los Alamitos, CA: IEEE Computer Society Press, November 1992, pp. 576–580. 26. A. Davis, B. Coates, and K. Stevens, Automatic synthesis of fast compact asynchronous control circuits, in S. Furber and M. Edwards (eds.), Asynchronous Design Methodologies, vol. A-28 IFIP Transactions, Elsevier, 1993, pp. 193–207. 27. T. Murata, Petri nets: Properties, analysis and applications, Proc. IEEE, 77: 541–580, 1989. 28. L. Y. Rosenblum and A. V. Yakovlev, Signal graphs: From selftimed to timed ones, in Proc. Int. Workshop Timed Petri Nets, Torino, Italy, Los Alamitos, CA: IEEE Computer Society Press, July 1985, pp. 199–207. 29. T.-A. Chu, Synthesis of Self-Timed VLSI Circuits from GraphTheoretic Specifications, PhD thesis, MIT Laboratory for Computer Science, MIT, June 1987. 30. V. I. Varshavsky (ed.), Self-Timed Control of Concurrent Processes: The Design of Aperiodic Logical Circuits in Computers and Discrete Systems, Dordrecht: Kluwer, 1990. 31. L. Lavagno et al., Solving the state assignment problem for signal transition graphs, Proc. ACM/IEEE Design Autom. Conf., IEEE Computer Society Press, June 1992, pp. 
568–572. 32. C. W. Moon, P. R. Stephan, and R. K. Brayton, Synthesis of hazard-free asynchronous circuits from graphical specifications, Proc. Int. Conf. Comput.-Aided Design ICCAD, IEEE Computer Society Press, November 1991, pp. 322–325. 33. C. W. Moon and R. K. Brayton, Elimination of dynamic hazards in asynchronous circuits by factoring, Proc. ACM/IEEE Design Autom. Conf., IEEE Computer Society Press, June 1993, pp. 7–13. 34. L. Lavagno, K. Keutzer, and A. Sangiovanni-Vincentelli, Algorithms for synthesis of hazard-free asynchronous circuits, Proc. ACM/IEEE Design Autom. Conf., Los Alamitos, CA: IEEE Computer Society Press, pp. 302–308, 1991. 35. P. Beerel and T. H.-Y. Meng, Automatic gate-level synthesis of speed-independent circuits, Proc. Int. Conf. Comput.-Aided Design ICCAD, Los Alamitos, CA: IEEE Computer Society Press, November 1992, pp. 581–587. 36. M. Kishinevsky et al., Concurrent Hardware: The Theory and Practice of Self-Timed Design, New York: Wiley, 1994. 37. K.-J. Lin and C.-S. Lin, Automatic synthesis of asynchronous circuits, Proc. ACM/IEEE Design Autom. Conf., Los Alamitos, CA: IEEE Computer Society Press, 1991, pp. 296–301. 38. P. Vanbekbergen et al., Optimized synthesis of asynchronous control circuits from graph-theoretic specifications, Proc. Int. Conf. Comput.-Aided Design ICCAD, Los Alamitos, CA: IEEE Computer Society Press, 1990, pp. 184–187. 39. M.-L. Yu and P. A. Subrahmanyan, A path oriented approach for reducing hazards in asynchronous design, Proc. ACM/IEEE Design Automation Conf., 1992, pp. 239–244.
ASYNCHRONOUS LOGIC DESIGN 40. P. Vanbekbergen et al., A generalized state assignment theory for transformations on signal transition graphs, Proc. Int. Conf. Comput.-Aided Design ICCAD, Los Alamitos, CA: IEEE Computer Society Press, November 1992, pp. 112–117. 41. R. Puri and J. Gu, A modular partitioning approach for asynchronous circuit synthesis, Proc. ACM/IEEE Design Autom. Conf., June 1994, pp. 63–69. 42. R. Puri and J. Gu, Area efficient synthesis of asynchronous interface circuits, Proc. Int. Conf. Comput. Design ICCD, Los Alamitos, CA: IEEE Computer Society Press, October 1994. 43. R. Puri and J. Gu, Asynchronous circuit synthesis with boolean satisfiability, IEEE Trans. Comput.-Aided Design, 14: 961–973, 1995. 43a. A. Semenov et al., Synthesis of speed-independent circuits from STG-unfolding segment, in Proc. ACM/IEEE Des. Automation Conf., pp. 16–21, 1997. 43b. J. Cortadella et al., Decomposition and technology mapping of speed-independent circuits using Boolean relations, in Proc. Int. Conf. Comput.-Aided Des. ICCAD, 1997. 43c. J. Cortadella et al., A region-based theory for state assignment in speed-independent circuits, IEEE Trans. Comput.-Aided Des., 16: 793–812, 1997. 43d. A. Kondratyev et al., Basic gate implementation of speed-independent circuits, in Proc. ACM/IEEE Des. Automation Conf., pp. 56–62, 1994. 43e. E. Pastor and J. Cortadella, Polynomial algorithms for the synthesis of hazard-free circuits from signal transition graphs, in Proc. Int. Conf. Comput.-Aided Des. ICCAD, pp. 250–254, 1993. 44. J. Cortadella et al., Methodology and tools for state encoding in asynchronous circuit synthesis, Proc. ACM/IEEE Design Autom. Conf., 1996. 45. C. Myers and T. H.-Y. Meng, Synthesis of timed asynchronous circuits, Proc. Int. Conf. Comput. Design ICCD, Los Alamitos, CA: IEEE Computer Society Press, October 1992, pp. 279–282. 45a. L. Lavagno and A. 
Sangiovanni-Vincentelli, Algorithms for Synthesis and Testing of Asynchronous Circuits, Dordrecht: Kluwer, 1993. 45b. C. Ykman-Couvreur, B. Lin, and H. de Man, ASSASSIN: A synthesis system for asynchronous control circuits, Tech. rep., IMEC, Belgium, September 1994, user and tutorial manual. 46. C. A. R. Hoare, Communicating sequential processes, Commun. ACM, 21 (8): 666–677, 1978. 47. A. J. Martin, Compiling communicating processes into delay-insensitive VLSI circuits, Distributed Comput., 1 (4): 226–234, 1986. 48. A. J. Martin, A synthesis method for self-timed VLSI circuits, Proc. Int. Conf. Comput. Design ICCD, Los Alamitos, CA: IEEE Computer Society Press, 1987, pp. 224–229. 49. S. M. Burns, Automated Compilation of Concurrent Programs into Self-Timed Circuits, Master’s thesis, Pasadena, CA: California Institute of Technology, 1988. 50. S. M. Burns and A. J. Martin, Synthesis of self-timed circuits by program transformation, in G. J. Milne (ed.), The Fusion of Hardware Design and Verification, New York: Elsevier, 1988, pp. 99–116. 51. C. H. van Berkel, Beware the isochronic fork, Nat. Lab. Unclassified Report UR 003/91, Philips Research Lab., Eindhoven, The Netherlands, 1991. 52. A. J. Martin, The design of a self-timed circuit for distributed mutual exclusion, in Henry Fuchs (ed.), Proc. 1985 Chapel Hill Conf. VLSI, Computer Science Press, 1985, pp. 245–260. 53. A. J. Martin et al., The design of an asynchronous microproces-
sor, in Charles L. Seitz (ed.), Advanced Research in VLSI: Proc. Decennial Caltech Conf. VLSI, Cambridge, MA: MIT Press, 1989, pp. 351–373. 54. C. H. (K.) van Berkel et al., VLSI programming and silicon compilation, Proc. Int. Conf. Comput. Design ICCD, Los Alamitos, CA: IEEE Computer Society Press, 1988, pp. 150–166. 55. E. Brunvand and R. F. Sproull, Translating concurrent programs into delay-insensitive circuits, in Proc. Int. Conf. Comput.-Aided Design ICCAD, Los Alamitos, CA: IEEE Computer Society Press, November 1989, pp. 262–265. 56. M. Rem, J. L. A. van de Snepscheut, and J. T. Udding, Trace theory and the definition of hierarchical components, in Randal Bryant (ed.), Proc. 3rd Caltech Conf. VLSI, Rockville, MD: Computer Science Press, 1983, pp. 225–239. 57. J. L. A. van de Snepscheut, Trace Theory and VLSI Design, vol. 200, Lecture Notes in Computer Science, Berlin: Springer-Verlag, 1985. 58. J. C. Ebergen, Translating Programs into Delay-Insensitive Circuits, vol. 56 of CWI Tract, Centre for Mathematics and Computer Science, 1989. 59. J. C. Ebergen, Arbiters: An exercise in specifying and decomposing asynchronously communicating components, Research Report CS-90-29, Computer Science Dept., Univ. of Waterloo, Canada, July 1990. 60. S. B. Furber et al., AMULET1: A micropipelined ARM, Proc. IEEE Comput. Conf. COMPCON, March 1994, pp. 476–485. 61. A. L. Davis, The architecture and system method of DDM-1: A recursively-structured data driven machine, Proc. 5th Annu. Symp. Comput. Architecture, 1978. 62. J. A. Tierno et al., A 100-MIPS GaAs asynchronous microprocessor, IEEE Design & Test Comput., 11 (2): 43–49, 1994. 63. A. J. Martin et al., The design of an asynchronous MIPS R3000 microprocessor, Proc. 17th Conf. Advanced Res. VLSI, September 1997, pp. 164–181. 64. S. B. Furber et al., AMULET2e: An asynchronous embedded controller, Proc. Int. Symp. Advanced Res. Asynchronous Circuits Syst., Los Alamitos, CA: IEEE Computer Society Press, April 1997. 65. R.
Weiss, ARM researchers asynchronous CPU design, Comput. Design, 1995. 66. J. Yoshida, Sharp’s processor beats the clock, Electron. Eng. EE Times, 1996. 67. P. Clarke, Startup pushes asynchronous chips towards mainstream, Electron. Eng. EE Times, October 1977. 68. T. Nanya et al., TITAC: Design of a quasi-delay-insensitive microprocessor, IEEE Design & Test of Comput., 11 (2): 50–63, 1994. 69. R. F. Sproull, I. E. Sutherland, and C. E. Molnar, The counterflow pipeline processor architecture, IEEE Design & Test of Comput., 11 (3): 48–59, 1994. 70. A. Davis, B. Coates, and K. Stevens, The Post Office experience: Designing a large asynchronous chip, Proc. Hawaii Int. Conf. Syst. Sci., Los Alamitos, CA: IEEE Computer Society Press, January 1993, vol. 1I, pp. 409–418. 71. M. E. Dean, STRiP: A Self-Timed RISC Processor Architecture, PhD thesis, Stanford University, 1992. 72. K. van Berkel et al., A fully-asynchronous low-power error corrector for the DCC player, Int. Solid State Circuits Conf., February 1994, pp. 88–89. 73. W. O. Budde et al., An asynchronous, high-speed packet switching component, IEEE Design & Test Comput., 11 (2): 33–42, 1994.
RUCHIR PURI IBM Thomas J. Watson Research Center
ASYNCHRONOUS MULTIPLEXING. See STATISTICAL MULTIPLEXING.
Wiley Encyclopedia of Electrical and Electronics Engineering
Clock Distribution in Synchronous Systems
Standard Article
Eby G. Friedman, University of Rochester, Rochester, NY
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W7605
Article Online Posting Date: December 27, 1999
The sections in this article are: Synchronous Systems; Theoretical Background of Clock Skew; Timing Constraints Caused by Clock Skew; Clock Distribution Design of Structured Custom VLSI Circuits; Automated Synthesis and Layout of Clock Distribution Networks; Analysis and Modeling of the Timing Characteristics of Clock Distribution Networks; Specification of the Optimal Timing Characteristics of Clock Distribution Networks; Example Implementations of Clock Distribution Networks; Directions for Future Research in the Design of Clock Distribution Networks; Summary and Conclusions.
CLOCK DISTRIBUTION IN SYNCHRONOUS SYSTEMS
In a synchronous digital system, the clock signal is used to define a time reference for the movement of data within that system. Because this function is vital to the operation of a synchronous system, much attention has been given to the characteristics of these clock signals and the networks used in their distribution. Clock signals are often regarded as simple control signals; however, these signals have some very special characteristics and attributes. Clock signals are typically loaded with the greatest fanout, travel over the greatest distances, and operate at the highest speeds of any signal, either control or data, within the entire system. Because the data signals are provided with a temporal reference by the clock signals, the clock waveforms must be particularly clean and sharp. Furthermore, these clock signals are particularly affected by technology scaling, in that long global interconnect lines become much more highly resistive as line dimensions are decreased. This increased line resistance is one of the primary reasons for the increasing significance of clock distribution on synchronous performance. Finally, the control of any differences in the delay of the clock signals can severely limit the maximum performance of the entire system and create catastrophic race conditions in which an incorrect data signal may latch within a register.

Most synchronous digital systems consist of cascaded banks of sequential registers with combinatorial logic between each set of registers. The functional requirements of the digital system are satisfied by the logic stages. The global performance and local timing requirements are satisfied by the careful insertion of pipeline registers into equally spaced time windows to satisfy critical worst case timing constraints.
The proper design of the clock distribution network further ensures that these critical timing requirements are satisfied and that no race conditions exist (1–27). With the careful design of the clock distribution network, system-level synchronous performance can actually increase, surpassing the performance advantages of asynchronous systems by permitting synchronous performance to be based on average path delays rather than worst case path delays, without incurring the handshaking protocol delay penalties required in most asynchronous systems.

In a synchronous system, each data signal is typically stored in a latched state within a bistable register (28) awaiting the incoming clock signal, which determines when the data signal leaves the register. When the enabling clock signal reaches the register, the data signal leaves the bistable register, propagates through the combinatorial network, and,
for a properly working system, enters the next register and is fully latched into that register before the next clock signal appears. Thus, the delay components that make up a general synchronous system are composed of the following three individual subsystems (29–31):
1. the memory storage elements,
2. the logic elements, and
3. the clocking circuitry and distribution network.

Interrelationships among these three subsystems of a synchronous digital system are critical to achieving maximum levels of performance and reliability. The important area of clock generation, as compared to clock distribution, which is the primary topic of this article, bears separate focus.

The article is organized as follows. In the first section, an overview of the operation of a synchronous system is provided. In the next section, fundamental definitions and the timing characteristics of clock skew are discussed. The timing relationships between a local data path and the clock skew of that path are then described. The interplay among the aforementioned three subsystems making up a synchronous digital system is described next, particularly how the timing characteristics of the memory and logic elements constrain the design and synthesis of clock distribution networks. Different forms of clock distribution networks, such as buffered trees and H-trees, are discussed, as are the automated layout and synthesis of clock distribution networks. Then techniques for making clock distribution networks less sensitive to process parameter variations are discussed. Localized scheduling of the clock delays is useful in optimizing the performance of high-speed synchronous circuits. Determining the optimal timing characteristics of a clock distribution network is reviewed next. The application of clock distribution networks to high-speed circuits has existed for many years. The design of the clock distribution network of certain important very large scale integration (VLSI)-based systems has been described in the literature, and some examples of these circuits are described. In an effort to provide some insight into future and evolving areas of research relevant to high-performance clock distribution networks, some potentially important topics for future research are discussed in the next section. Finally, a summary of this article with some concluding remarks is provided.

SYNCHRONOUS SYSTEMS

A digital synchronous circuit is a network of functional logic elements and globally clocked registers. For an arbitrary ordered pair of registers (R1, R2), one of the following two situations can be observed: either (1) the input of R2 cannot be reached from the output of R1 by propagating through a sequence of logical elements only; or (2) there exists at least one sequence of logic blocks that connects the output of R1 to the input of R2. In the former case, switching events at the output of the register R1 do not affect the input of the register R2 during the same clock period. In the latter case, denoted by R1 ⇒ R2, signal switching at the output of R1 will propagate to the input of R2. In this case, (R1, R2) is called a sequentially adjacent pair of registers, which make up a local data path.

Delay Components of Data Path

The minimum allowable clock period TCP(min) between any two registers in a sequential data path is given by
TCP(min) = 1/fclk,MAX ≥ TPD(max) + TSkew    (1)

where

TPD(max) = TC-Q + TLogic + TInt + TSet-up = D(i, f)    (2)
and the total path delay of a data path TPD(max) is the sum of the maximum time required for the data to leave the initial register after the clock signal Ci arrives TC-Q, the time necessary to propagate through the logic and interconnect TLogic + TInt, and the time required to successfully propagate to and latch within the final register of the data path TSet-up. Observe that the latest arrival time is given by TLogic(max) and the earliest arrival time is given by TLogic(min) because data are latched into each register within the same clock period. The sum of the delay components in Eq. (2) must satisfy the timing constraint of Eq. (1) in order to attain the clock period TCP(min), which is the inverse of the maximum possible clock frequency fclk,MAX. The clock skew TSkew,ij can be positive or negative depending on whether Cj leads or lags Ci, respectively.

The clock period is chosen such that the latest data generated by the initial register is latched by the final register with the next clock edge after the clock edge that activated the initial register. Furthermore, in order to avoid race conditions, the local path delay must be chosen such that for any two sequentially adjacent registers in a multistage data path, the latest data signal must arrive and be latched within the final register before the earliest data signal generated with the next clock pulse in the output of the initial register arrives. The waveforms in Fig. 1 show the timing requirement of Eq. (1) being barely satisfied (i.e., the data signal arrives at Rf just before the clock signal arrives).

An example of a local data path Ri ⇒ Rf is shown in Fig. 2. The clock signals Ci and Cf synchronize the sequentially adjacent pair of registers Ri and Rf, respectively. Signal switching at the output of Ri is triggered by the arrival of the clock signal Ci. After propagating through the logic block Lif, this signal will appear at the input of Rf.
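The quantities in Eqs. (1) and (2) can be illustrated with a short numeric sketch; all delay values below are hypothetical and chosen only for illustration.

```python
# Hypothetical delay values (ns) for one local data path, illustrating
# Eqs. (1) and (2): TCP(min) = TPD(max) + TSkew.
T_C_Q = 1.0      # clock-to-output delay of the initial register
T_logic = 5.0    # worst-case propagation delay through the logic
T_int = 1.5      # interconnect delay along the path
T_setup = 0.5    # setup time of the final register
T_skew = 0.4     # positive clock skew between Ci and Cf

# Eq. (2): total path delay D(i, f)
T_PD_max = T_C_Q + T_logic + T_int + T_setup

# Eq. (1): minimum clock period and maximum clock frequency
T_CP_min = T_PD_max + T_skew          # ns
f_clk_max = 1e3 / T_CP_min            # MHz, since delays are in ns

print(f"TPD(max) = {T_PD_max} ns")    # 8.0 ns
print(f"TCP(min) = {T_CP_min} ns")    # 8.4 ns
print(f"fclk,MAX = {f_clk_max:.1f} MHz")
```

Lowering any term of Eq. (2), or making the skew negative, raises the attainable clock frequency accordingly.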
Therefore, a nonzero amount of time elapses between the triggering event and the signal switching at the input of Rf. The minimum and maximum values of this delay are called the short and long delays and are denoted by d(i, f) and D(i, f), respectively. Note that both d(i, f) and D(i, f) are caused by the accumulative effects of three sources of delay (4). These sources are the clock-to-output delay of the register Ri, a delay introduced by the signal propagating through Lif, and an interconnect delay caused by the presence of wires on the signal path Ri ⇒ Rf.

Figure 1. Timing diagram of clocked local data path.

Figure 2. A local data path, composed of an initial and final register with a logic block between the two registers.

A synchronous digital system can be modeled (32,33) as a directed graph G with vertex set V = {v1, . . ., vNR} and edge set E = {e1, . . ., eNP} ⊆ V × V. An example of a circuit graph G is illustrated in Fig. 3. The number of registers in a circuit is |V| = NR, and vertex vk corresponds to the register Rk. The number of local data paths in a circuit is |E| = NP. There is an edge directed from vi to vj iff Ri ⇒ Rj. In the case where multiple paths exist between a sequentially adjacent pair of registers Ri ⇒ Rj, only one edge connects vi to vj. The underlying graph Gu of the graph G is a nondirected graph that has the same vertex set V, where the directions have been removed from the edges. In Fig. 3, an input or an output of the circuit is indicated by an edge that is incident to only one vertex.
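The directed-graph model just described can be sketched in a few lines; the five-register topology below is invented for illustration.

```python
# Sketch of the directed-graph model G = (V, E) of a synchronous circuit:
# vertices are registers, and an edge (Ri, Rj) exists iff Ri => Rj, i.e.,
# a purely combinatorial path connects the output of Ri to the input of Rj.
# This five-register topology is hypothetical.
edges = {("R1", "R2"), ("R2", "R3"), ("R3", "R4"), ("R2", "R5")}
vertices = {r for e in edges for r in e}

N_R = len(vertices)   # number of registers, |V|
N_P = len(edges)      # number of local data paths, |E|

def sequentially_adjacent(ri, rj):
    """True iff (Ri, Rj) forms a local data path (an edge of G)."""
    return (ri, rj) in edges

assert N_R == 5 and N_P == 4
assert sequentially_adjacent("R1", "R2")
assert not sequentially_adjacent("R1", "R3")  # connected only through R2
```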
THEORETICAL BACKGROUND OF CLOCK SKEW

A schematic of a generalized synchronized data path is presented in Figs. 1 and 2, where Ci and Cf represent the clock signals driving a sequentially adjacent pair of registers, specifically the initial register Ri and the final register Rf of a data path, respectively. Both clock signals originate from the same clock signal source, and a pair of registers are sequentially adjacent if only combinatorial logic (no sequential elements) exists between the two registers. The propagation delay from the clock source to the jth clocked register is the clock delay, TCj. The clock delays of the initial clock signal TCi and the final clock signal TCf define the time reference when the data signals begin to leave their respective registers. These clock signals originate from a clock distribution network that is designed to generate a specific clock signal waveform used to synchronize each register. This standard clock distribution network structure is based on equipotential clocking, where the entire network is considered a surface that must be brought to a specific voltage (clock signal polarity) at each half of the clock cycle. Ideally, clocking events occur at all registers simultaneously. Given this global clocking strategy, clock signal arrival times (at each register) are defined with respect to a universal time reference.

Figure 3. Graph G of a circuit with NR = 5 registers. (a) The directed graph G. (b) The underlying graph Gu corresponding to the graph G in (a).

Definition of Clock Skew

The difference in clock signal arrival time between two sequentially adjacent registers, as shown in Eq. (3), is the clock skew TSkew. If the clock signals Ci and Cf are in complete synchronism (i.e., the clock signals arrive at their respective registers at exactly the same time), the clock skew is zero. A definition of clock skew follows.

Definition 1. Given two sequentially adjacent registers, Ri and Rj, and an equipotential clock distribution network, the clock skew between these two registers is defined as

TSkew,ij = TCi − TCj    (3)
where TCi and TCj are the clock delays from the clock source to the registers Ri and Rj, respectively.

It is important to observe that the temporal skew between the arrival times of different clock signals is relevant only to sequentially adjacent registers making up a single data path, as shown in Fig. 2. Thus, systemwide (or chipwide) clock skew between nonsequentially connected registers, from an analysis viewpoint, has no effect on the performance and reliability of the synchronous system and is essentially meaningless. However, from a design perspective, systemwide global clock skew places constraints on the permissible local clock skew. It should be noted that in Refs. 11 and 34, Hatamian designates the lead/lag clock skew polarity (positive/negative clock skew) notation as the opposite of that used here.

Different clock signal paths can have different delays for a variety of reasons. Wann and Franklin (3) present the following causes of clock skew:

1. differences in line lengths from the clock source to the clocked register,
2. differences in delays of any active buffers (e.g., distributed buffers) within the clock distribution network (caused by 3 and 4),
3. differences in passive interconnect parameters, such as line resistivity, dielectric constant and thickness, via/contact resistance, line and fringing capacitance, and line dimensions, and
4. differences in active device parameters, such as MOS threshold voltages and channel mobilities, which affect the delay of the active buffers.

Figure 4. Global data path composed of multiple local data paths.
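How these delay differences accumulate along a chain of registers such as that of Fig. 4 follows directly from Eq. (3); the clock delay values below are hypothetical.

```python
# Clock skew along a global data path (cf. Fig. 4), computed directly
# from Eq. (3): TSkew,ij = TCi - TCj. Clock delays are hypothetical (ns).
clock_delay = {"R1": 2.0, "R2": 2.6, "R3": 2.1, "R4": 3.0}
path = ["R1", "R2", "R3", "R4"]

def skew(ri, rj):
    """Eq. (3): difference in clock arrival time between two registers."""
    return clock_delay[ri] - clock_delay[rj]

# Skews of each sequentially adjacent pair along the path.
local_skews = [skew(path[k], path[k + 1]) for k in range(len(path) - 1)]

# The sum telescopes to the skew between the end registers of the path,
# so the local skews fully determine the end-to-end skew.
total = sum(local_skews)
print(local_skews, total, skew("R1", "R4"))
```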
It should be noted that for a well-designed and well-balanced clock distribution network, the distributed clock buffers are the principal source of clock skew. To determine the clock delay from the clock source to each register, it is important to investigate the relationship between the clock skews of the sequentially adjacent registers occurring within a global data path. Furthermore, it is necessary to consider the effects of feedback within global data paths on the clock skew. The path between two sequentially adjacent registers is described in this article as a local data path, as compared to a global data path, where a global data path can consist of one or more local data paths (Fig. 4). The relationship between the clock skew of sequentially adjacent registers in a global data path is called conservation of clock skew and is formalized here.

Theorem 1. For any given global data path, clock skew is conserved. Alternatively, the clock skew between any two registers in a global data path that are not necessarily sequentially adjacent is the sum of the clock skews between each pair of registers along the global data path between those same two registers.

Although clock skew is defined between two sequentially adjacent registers, Theorem 1 shows that clock skew can exist between any two registers in a global data path. Therefore, it extends the definition of clock skew introduced by Definition 1 to any two nonsequentially adjacent registers belonging to the same global data path. It also illustrates that the clock skew between any two nonsequentially adjacent registers that do not belong to the same global data path has no physical meaning because no functional data transfer between these registers occurs.

A typical sequential circuit may contain sequential feedback paths, as illustrated in Fig. 5. It is possible to establish a relationship between the clock skew in the forward path and the clock skew in the feedback path because the initial and final registers in the feedback path are also registers in the forward path. As shown in Fig. 5, the initial and final registers in the feedback path Rl − Rj are the final and initial registers of the forward path Rj − Rk − Rl. This relationship is formalized in Theorem 2.

Figure 5. Global path with feedback path.

Theorem 2. For any given global data path containing feedback paths, the clock skew in a feedback path between any two registers, say Rl and Rj, is related to the clock skew of the forward path by the following relationship

TSkew,feedback,lj = −TSkew,forward,jl    (4)

Both Theorems 1 and 2 are useful for determining the optimal clock skew schedule within a synchronous digital system, specifically, the set of local clock skew values that maximizes system performance and reliability. The process for determining these clock skew values is discussed in the section entitled "Specification of the Optimal Timing Characteristics of Clock Distribution Networks."

TIMING CONSTRAINTS CAUSED BY CLOCK SKEW

The magnitude and polarity of the clock skew have a two-sided effect on system performance and reliability. Depending upon whether Ci leads or lags Cf and upon the magnitude of TSkew with respect to TPD, system performance and reliability can either be degraded or enhanced. These cases are discussed next.

Maximum Data Path/Clock Skew Constraint Relationship

For a design to meet its specified timing requirements, the greatest propagation delay of any data path between a pair of data registers, Ri and Rf, being synchronized by a clock distribution network must be less than the minimum clock period (the inverse of the maximum clock frequency) of the circuit, as shown in Eq. (1) (7,8,11,16,31,34–37). If the time of arrival of the clock signal at the final register of a data path TCf leads the time of arrival of the clock signal at the initial register of the same sequential data path TCi [see Fig. 6(a)], the clock skew is referred to as positive clock skew, and, under this condition, the maximum attainable operating frequency is decreased. Positive clock skew is the additional amount of time that must be added to the minimum clock period to reliably apply a new clock signal at the final register, where reliable operation implies that the system will function correctly at low as well as at high frequencies (assuming fully static logic). It should be noted that positive clock skew affects only the maximum frequency of a system and cannot create race conditions. In the positive clock skew case, the clock signal arrives at Rf before it reaches Ri. From Eqs. (1) and (2), the maximum
permissible positive clock skew can be expressed as (7,8,11,16,31,34–37)

TSkew ≤ TCP − TPD(max) = TCP − (TC-Q + TLogic(max) + TInt + TSet-up)  for TCi > TCf    (5)

where TPD(max) is the maximum path delay between two sequentially adjacent registers. This situation is the typical critical path timing analysis requirement commonly seen in most high-performance synchronous digital systems. If Eq. (5) is not satisfied, the system will not operate correctly at that specific clock period (or clock frequency). Therefore, TCP must be increased for the circuit to operate correctly, thereby decreasing the system performance. In circuits where the tolerance for positive clock skew is small [TSkew in Eq. (5) is small], the clock and data signals should be run in the same direction, thereby forcing Cf to lag Ci and making the clock skew negative.

Figure 6. Clock timing diagrams: (a) positive clock skew, (b) negative clock skew.

Minimum Data Path/Clock Skew Constraint Relationship

If the clock signal arrives at Ri before it reaches Rf [see Fig. 6(b)], the clock skew is defined as being negative. Negative clock skew can be used to improve the maximum performance of a synchronous system by decreasing the delay of a critical path; however, a potential minimum constraint can occur, creating a race condition (11,12,31,34,36,38–41). In this case, when Cf lags Ci, the clock skew must be less than the time required for the data to leave the initial register, propagate through the interconnect and combinatorial logic, and set up in the final register (see Fig. 1). If this condition is not met, the data stored in register Rf is overwritten by the data that had been stored in register Ri and has propagated through the combinatorial logic. Furthermore, a circuit operating close to this condition might pass system diagnostics but malfunction at unpredictable times because of fluctuations in ambient temperature or power supply voltage (36). Correct operation requires that Rf latches data that correspond to the data Ri latched during the previous clock period. This constraint on clock skew is

|TSkew| ≤ TPD(min) = TC-Q + TLogic(min) + TInt + THold  for TCf > TCi    (6)

where TPD(min) is the minimum path delay, denoted by d(i, f), between two sequentially adjacent registers and THold is the amount of time the input data signal must be stable after the clock signal changes state. An important example in which this minimum constraint can occur is in those designs that use cascaded registers, such as a serial shift register or a k-bit counter, as shown in Fig. 7 (note that a distributed RC impedance is between Ci and Cf). In cascaded register circuits, TLogic(min) is zero and TInt approaches zero (because cascaded registers are typically designed, at the geometric level, to abut). If TCf > TCi (i.e., negative clock skew), then the minimum constraint becomes

|TSkew| ≤ TC-Q + THold  for TCf > TCi    (7)

and all that is necessary for the system to malfunction is a poor relative placement of the flip-flops or a highly resistive connection between Ci and Cf. In a circuit configuration such as a shift register or counter, where negative clock skew is a more serious problem than positive clock skew, provisions should be made to force Cf to lead Ci, as shown in Fig. 7.

Figure 7. k-bit shift register with positive clock skew.

As higher levels of integration are achieved in high-complexity VLSI circuits, on-chip testability (42) becomes necessary. Data registers, configured in the form of serial set/scan chains when operating in the test mode, are a common example of a design for testability (DFT) technique. The placement of these circuits is typically optimized around the functional flow of the data. When the system is reconfigured to use the registers in the role of the set/scan function, different local path delays are possible. In particular, the clock skew of the reconfigured local data path can be negative and greater in magnitude than the local register delays. Therefore, with increased negative clock skew, Eq. (7) may no longer be satisfied, and incorrect data may latch into the final register of the reconfigured local data path. Therefore, it is imperative that attention be placed on the clock distribution of those paths that have nonstandard modes of operation.

In ideal scaling of MOS devices, all linear dimensions and voltages are multiplied by the factor 1/S, where S > 1 (16,43–45). Device-dependent delays, such as TC-Q, TSet-up, and TLogic, scale as 1/S, whereas interconnect-dominated delays such as TSkew remain constant to first order and, if fringing capacitance and electromigration are considered, actually increase with decreasing dimensions. Therefore, when examining the effects of dimensional scaling on system reliability, Eqs. (6) and (7) should be considered carefully (46). One straightforward method to avoid the effect of technology scaling on those
data paths particularly susceptible to negative clock skew is to not scale the clock distribution lines. Svensson and Afghahi (47) show that by using coarser than ordinary lines for global clock distribution, 20 mm wide chip sizes with CMOS circuits scaled to 0.3 µm polysilicon lines would have comparable logic and cross-chip interconnect delays (on the order of 0.5 ns), making possible synchronous clock frequencies of up to 1 GHz. Therefore, the scaling of device technologies can severely affect the design and operation of clock distribution networks, necessitating specialized strategies and compensation techniques.

Enhancing Synchronous Performance by Applying Localized Clock Skew

Localized clock skew can be used to improve synchronous performance by providing more time for the critical worst case data paths (29–31,36,41,48). By forcing Ci to lead Cf at each critical local data path, excess time is shifted from the neighboring, less-critical local data paths to the critical local data paths. This negative clock skew represents the additional amount of time that the data signal at Ri has to propagate through the logic stages and interconnect sections and into the final register. Negative clock skew subtracts from the logic path delay, thereby decreasing the minimum clock period. Thus, applying negative clock skew, in effect, increases the total time that a given critical data path has to accomplish its functional requirements by giving the data signal released from Ri more time to propagate through the logic and interconnect stages and latch into Rf. Thus, the differences in delay between each local data path are minimized, thereby compensating for any inefficient partitioning of the global data path into local data paths that may have occurred, a common situation in many practical systems.
Different terms have been used in the literature to describe negative clock skew, such as double-clocking (36), deskewing data pulses (38), cycle stealing (48,49), useful clock skew (45), and prescribed skew (50). The maximum permissible negative clock skew of a data path, however, is dependent upon the clock period itself as well as the time delay of the previous data paths. This results from the structure of the serially cascaded local data paths making up the global data path. Because a particular clock signal synchronizes a register that functions in a dual role, as the initial register of the next local data path and as the final register of the previous data path, the earlier Ci is for a given data path, the earlier that same clock signal, now Cf, is for the previous data path. Thus, the use of negative clock skew in the ith path results in a positive clock skew for the preceding path, which may then establish the new upper limit for the system clock frequency.

An Example of Applying Localized Negative Clock Skew to Synchronous Circuits. Consider the nonrecursive synchronous circuit shown in Fig. 8, where the horizontal circles represent logic elements with logic delays and the vertical circles represent clock delays. Because the local data path from R2 to R3 represents the worst case path (assuming the register delays are equal), by delaying C3 with respect to C2, negative clock skew is added to the R2 − R3 local data path. If C1 is synchronized with C3, then the R1 − R2 local data path receives some positive clock skew. Thus, assuming the register delays are both 2 ns, C2 should be designed to lead C3 by 1.5 ns, forcing both paths to have the same total local path delay, TPD + TSkew = 7.5 ns. The delay of the critical path of the synchronous circuit is temporally refined to the precision of the clock distribution network, and the entire system (for this simple example) could operate at a clock frequency of 133.3 MHz, rather than the 111.1 MHz attainable if no localized clock skew is applied. The performance characteristics of the system, both with and without the application of localized clock skew, are summarized in Table 1. Note that |TSkew| < TPD (|−1.5 ns| < 9 ns) for the R2 − R3 local data path; therefore, the correct data signal is successfully latched into R3 and no minimum data path/clock skew constraint relationship exists.

Figure 8. Example of applying localized negative clock skew to synchronous circuit.

This design technique of applying localized clock skew is particularly effective in sequentially adjacent, temporally irregular local data paths; however, it is applicable to any type of synchronous sequential system, and for certain architectures, a significant improvement in performance is both possible and likely. The limiting condition for applying localized negative clock skew is determined by the control of the clock skew variations and by the difference in path delay between neighboring local data paths.
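The arithmetic of the Fig. 8 example (and of Table 1) can be checked directly; the only inputs are the values quoted above.

```python
# Reproduces the localized-clock-skew arithmetic of the Fig. 8 example
# (register delays 2 ns, logic delays 4 ns and 7 ns, +/-1.5 ns of skew),
# matching Table 1.
T_reg = 2.0
logic = {"R1-R2": 4.0, "R2-R3": 7.0}

# Zero skew: the worst local data path sets the clock period.
zero_skew = {p: T_reg + d for p, d in logic.items()}          # 6.0 and 9.0 ns
f_zero = 1e3 / max(zero_skew.values())                        # 111.1 MHz

# Localized skew: +1.5 ns on R1-R2 and -1.5 ns on R2-R3 equalize the paths.
skew = {"R1-R2": +1.5, "R2-R3": -1.5}
with_skew = {p: zero_skew[p] + skew[p] for p in logic}        # both 7.5 ns
f_skewed = 1e3 / max(with_skew.values())                      # 133.3 MHz

# Race check on R2-R3: |TSkew| must stay below the path delay, per the
# minimum constraint discussion above; here |-1.5| < 9, so no race occurs.
assert abs(skew["R2-R3"]) < zero_skew["R2-R3"]
print(f"{f_zero:.1f} MHz -> {f_skewed:.1f} MHz")
```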
These clock skew variations are caused by power supply variations, by process tolerances, where process parameters may vary over a specified range, and by environmental effects, such as temperature or radiation, which, for example, can shift both MOS threshold voltages and channel mobilities.

Table 1. Performance Characteristics of Circuit of Fig. 8 without and with Localized Clock Skew

Local Data Path   TPD (min) with Zero Skew   TCi   TCf   TSkew   TPD (min) with Nonzero Skew
R1 to R2          4 + 2 + 0 = 6              3     1.5   1.5     4 + 2 + 1.5 = 7.5
R2 to R3          7 + 2 + 0 = 9              1.5   3     -1.5    7 + 2 - 1.5 = 7.5
fMax              111.1 MHz                                      133.3 MHz

All time units are in nanoseconds.

CLOCK DISTRIBUTION DESIGN OF STRUCTURED CUSTOM VLSI CIRCUITS

Many different approaches, from ad hoc to algorithmic, have been developed for designing clock distribution networks in VLSI circuits. The requirement of distributing a tightly controlled clock signal to each synchronous register on a large, nonredundant, hierarchically structured VLSI circuit (an example floorplan is shown in Fig. 9) within specific temporal bounds is difficult and problematic. Furthermore, the tradeoffs that exist among system speed, physical die area, and power dissipation are greatly affected by the clock distribution network. The design methodology and structural topology of the clock distribution network should be considered in the development of a system for distributing the clock signals. Therefore, various clock distribution strategies have been developed. The most common and general approach to equipotential clock distribution is the use of buffered trees, which are discussed in the subsection entitled ‘‘Buffered Clock Distribution Trees.’’ In contrast to these asymmetric structures, symmetric trees, such as H-trees, are used to distribute high-speed clock signals. This topic is described in the subsection entitled ‘‘Symmetric H-Tree Clock Distribution Networks.’’ In developing structured custom VLSI circuits, such as the floorplan pictured in Fig. 9, specific circuit design techniques are used to control the delays within the clock distribution network. One important compensation technique is described in the subsection entitled ‘‘Compensation Techniques for Controlling Clock Skew.’’ Low-power design techniques are an area of significant current importance. Some recent efforts to reduce the power dissipated within the clock distribution network are reviewed in the subsection entitled ‘‘Design of Low-Power Clock Distribution Networks.’’

Buffered Clock Distribution Trees

The most common strategy for distributing clock signals in VLSI-based systems is to insert buffers at the clock source and/or along a clock path, forming a tree structure. Thus, the unique clock source is frequently described as the root of the tree, the initial portion of the tree as the trunk, individual paths driving each register as the branches, and
Figure 9. Floorplan of structured custom VLSI circuit requiring synchronous clock distribution.
the registers being driven as the leaves. This metaphor for describing a clock distribution network is commonly accepted and used throughout the literature; it is illustrated in Fig. 10. Occasionally, a mesh version of the clock tree structure is used, in which shunt paths are placed farther down the clock distribution network to minimize the interconnect resistance within the clock tree. This mesh structure effectively places the branch resistances in parallel, minimizing both the clock delay and the clock skew. An example of this mesh structure is described and illustrated in the subsection entitled ‘‘The DEC 64 Bit Alpha Microprocessor.’’ The mesh version of the clock tree is considered in this article as an extended version of the standard, more commonly used clock tree depicted in Fig. 10.

The clock distribution network is typically organized as a rooted tree structure (31,51), as illustrated in Figs. 8 and 9, and is often called a clock tree (31). If the interconnect resistance seen by the buffer at the clock source is small compared with the buffer output resistance, a single buffer is often used to drive the entire clock distribution network. This strategy may be appropriate if the clock is distributed entirely on metal, making load balancing of the network less critical. The primary requirement of a single-buffer system is that the buffer should provide enough current to drive the network capacitance (both interconnect and fanout) while maintaining high-quality waveform shapes (i.e., short transition times) and minimizing the effects of the interconnect resistance by ensuring that the output resistance of the buffer is much greater than the resistance of the interconnect section being driven. An alternative approach to using only a single buffer at the clock source is to distribute buffers throughout the clock distribution network, as shown in Fig. 10.
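As a first-order illustration of this distributed-buffer approach (not a sizing method from the article), if each buffer can drive at most a fixed number of downstream buffers or registers, the number of buffer levels grows logarithmically with the number of clocked registers:

```python
import math

# First-order sketch (illustrative only): if every buffer in a balanced
# distributed buffer tree can drive at most `max_fanout` downstream buffers
# or registers, the tree needs about ceil(log_fanout(n_registers)) buffer
# levels between the clock source and the clocked registers. In practice the
# fanout limit is set by the current drive of the source buffer and the
# capacitive load, as discussed in the text.
def buffer_levels(n_registers, max_fanout):
    if n_registers <= 1:
        return 0
    return math.ceil(math.log(n_registers) / math.log(max_fanout))

print(buffer_levels(2000, 16))  # e.g., 2000 registers with a fanout limit of 16
```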
This distributed-buffer approach requires additional area but greatly improves the precision and control of the clock signal waveforms and is necessary if the resistance of the interconnect lines is nonnegligible. The distributed buffers serve the dual function of amplifying the clock signals degraded by the distributed interconnect impedances and isolating the local clock nets from upstream load impedances (35). A three-level buffer clock distribution network using this strategy is shown in Fig. 11. In this approach, a single buffer drives multiple clock paths (and buffers). The number of buffer stages between the clock source and each clocked register depends on the total capacitive loading, in the form of registers and interconnect, and the permissible clock skew (52). It is worth noting that the buffers are a primary source of the total clock skew within a well-balanced clock distribution network because the active device characteristics vary much more widely than the passive device characteristics. The maximum number of buffers driven by a single buffer is determined by the current drive of the source buffer and the capacitive load (assuming an MOS technology)
Figure 10. Attributes of the tree structure of a clock distribution network.
of the destination buffers. The final buffer along each clock path provides the control signal of the driven register. Historically, the primary design goal in clock distribution has been to ensure that a clock signal arrives at every register within the entire synchronous system at precisely the same time. This concept of zero clock skew design has been extended, as is explained in the section on timing constraints caused by clock skew, to provide either a positive or a negative clock skew, with a magnitude depending upon the temporal characteristics of each local data path, in order to improve system performance and enhance system reliability.

Symmetric H-Tree Clock Distribution Networks
Another approach for distributing clock signals, a subset of the distributed buffer approach depicted in Fig. 10, uses a hierarchy of planar symmetric H-tree or X-tree structures
(see Fig. 12) (45,53,54) to ensure zero clock skew by maintaining the distributed interconnect and buffers to be identical from the clock signal source to the clocked register of each clock path. In this approach, the primary clock driver is connected to the center of the main H structure. The clock signal is transmitted to the four corners of the main H. These four close-to-identical clock signals provide the inputs to the next level of the H-tree hierarchy, represented by the four smaller H structures. The distribution process then continues through several layers of progressively smaller H structures. The final destination points of the H-tree are used to drive the local registers or are amplified by local buffers that drive the local registers. Thus, each clock path from the clock source to a clocked register has practically the same delay. The primary delay difference between the clock signal paths is caused by variations in process parameters that affect the interconnect impedance and, in particular, any active distributed buffer amplifiers. The amount of clock skew within an H-tree structured clock distribution network is strongly dependent upon the physical size, the control of the semiconductor process, and the degree to which active buffers and clocked latches are distributed within the H-tree structure. The conductor widths in H-tree structures are designed to progressively decrease as the signal propagates to lower levels of the hierarchy. This minimizes reflections of the high-speed clock signals at the branching points. Specifically, the impedance of the conductor leaving each branch point ZK+1 must be twice the impedance of the conductor providing the signal to the branch point ZK for an H-tree structure (45,53–55) and four times the impedance for an X-tree structure.
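This tapering rule can be illustrated numerically. Assuming, for illustration only, that wire impedance is inversely proportional to wire width, halving the width at each branch point doubles the impedance, as required; the root impedance and width values below are arbitrary:

```python
# Illustrative sketch of the H-tree tapering rule: the impedance of the
# conductor leaving a branch point is twice that of the conductor feeding it
# (Z_{k+1} = 2 * Z_k). Under the simplifying assumption that impedance is
# inversely proportional to wire width, widths halve at every level moving
# from the root toward the leaves.
def htree_levels(z_root, w_root, n_levels):
    """Return a list of (impedance, width) pairs from the root downward."""
    levels = []
    z, w = z_root, w_root
    for _ in range(n_levels):
        levels.append((z, w))
        z, w = 2.0 * z, w / 2.0  # at each branch: impedance doubles, width halves
    return levels

# Assumed values: 50 ohm root impedance, 8 um root conductor width, 4 levels.
for z, w in htree_levels(z_root=50.0, w_root=8.0, n_levels=4):
    print(f"Z = {z:6.1f} ohm, width = {w:4.1f} um")
```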
Figure 11. Three-level buffer tree-structured clock distribution network.
Figure 12. Symmetric H-tree and X-tree clock distribution networks.
This tapered H-tree structure is illustrated in Fig. 13:

ZK = ZK+1 / 2    for an H-tree structure    (8)

Figure 13. Tapered H-tree clock distribution network.

The planar H-tree structure places constraints on the physical layout of the clock distribution network as well as on the design methodology used in the development of the VLSI system. For example, in an H-tree network, clock lines must be routed in both the vertical and horizontal directions. For a standard two-level metal CMOS process, this creates added difficulty in routing the clock lines without using either resistive interconnect or multiple high-resistance vias between the two metal lines. This is a primary reason for the development of three or more layers of metal in logic-based CMOS processes. Furthermore, the interconnect capacitance (and therefore the power dissipation) is much greater for the H-tree as compared with the standard clock tree because the total wire length tends to be much greater (56). This increased capacitance of the H-tree structure exemplifies an important tradeoff between clock delay and clock skew in the design of high-speed clock distribution networks. Symmetric structures are used to minimize clock skew; however, an increase in clock signal delay is incurred. Therefore, the increased clock delay must be considered when choosing between buffered tree and H-tree clock distribution networks. Also, because clock skew affects only sequentially adjacent registers, the obvious advantages of using highly symmetric structures to distribute clock signals are significantly degraded. There may, however, be certain sequentially adjacent registers distributed across the integrated circuit. For this situation, a symmetric H-tree structure may be appropriate. Another consideration in choosing a clock distribution topology is that the H-tree and X-tree clock distribution networks are difficult to implement in those VLSI-based systems that are irregular in nature, such as those pictured in Figs. 9 and 14. In these types of systems, buffered tree topologies integrated with structured custom design methodologies (57) should be used in the design of the clock distribution networks in order to maximize the system clock frequency, minimize clock delay, and control any deleterious effects of local (particularly, negative) clock skew.

Figure 14. Trunk-structured clock distribution network for structured custom VLSI circuit.

Compensation Techniques for Controlling Clock Skew
One structured custom approach, oriented for hierarchical VLSI-based circuits, uses compensation techniques to minimize the variation of interconnect impedances and capacitive loads between clock signal paths (35,38,58,59). A general schematic of a clock distribution network is shown in Fig. 15, in which the nodes i, j, and k represent different clock signal destinations (i.e., clocked registers). Different clock paths could conceivably have different levels of buffering, where each buffer drives a localized distributed RC impedance. The location of these buffers is often chosen so that the active buffer output impedance is comparable to or greater than the interconnect resistance seen at the buffer output. This ensures that the locally distributed RC interconnect section can be accurately modeled as being mostly capacitive. The use of distributed buffers in this manner is described as buffer repeaters (45,60). However, in general, the interconnect impedance should be modeled as a distributed resistive-capacitive section of interconnect. The difficulty with applying symmetric clock distribution strategies is that they do not easily support the ability to partition large VLSI systems into hierarchically structured functional blocks. Preferably, each large functional block would contain its own locally optimized clock distribution network to satisfy the local timing and loading of that particular functional block. For a globally synchronous system, however, local optimization within a functional element does not necessarily lead to global optimization of the overall on-chip clock distribution system. If the interconnect resistance of the global clock distribution network is relatively small, a chip-level centralized clock buffer circuit can be used to satisfy the synchronization requirements of a VLSI circuit. 
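The basic compensation principle can be sketched briefly: each clock path receives a tunable buffer delay chosen so that every total path delay matches the slowest path, equalizing clock arrival times. The block names and delay values below are illustrative, and delays are kept in integer picoseconds for exactness:

```python
# Hedged sketch of delay compensation: each clock path has a fixed
# interconnect delay; a tunable (parameterized) buffer delay is assigned per
# path so that all total path delays match the slowest path, yielding equal
# clock arrival times. Delays are in picoseconds; values are illustrative.
def compensate(interconnect_delays_ps):
    """Return per-path buffer delays that equalize the total path delays."""
    target = max(interconnect_delays_ps.values())
    return {path: target - d for path, d in interconnect_delays_ps.items()}

paths = {"block_A": 1200, "block_B": 400, "block_C": 900}  # interconnect delays
buffers = compensate(paths)
totals = {p: paths[p] + buffers[p] for p in paths}
print(buffers)  # {'block_A': 0, 'block_B': 800, 'block_C': 300}
print(totals)   # every path now totals 1200 ps
```

The real design problem, as the text explains, is harder: the compensating elements are RC structures or sized transistors whose delays themselves vary with process and environment.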
However, in most large VLSI circuits, the physical distances are such that line resistances coupled with any via/contact resistances and the significant line and coupling capacitances will create large interconnect impedances. Therefore, even with a centrally located clock generation and distribution circuit, additional techniques are required to compensate for variations in interconnect and register loading. In order to control the delay of each clock signal path and to minimize the skew between these paths, passive RC delay elements (38) or geometrically sized transistor widths (35) are used to compensate for the variation of the delay of each clock signal path caused by different on-chip locations (i.e., different path-dependent interconnect impedances) and capacitive loading of the clock destinations (i.e., the number and load of the clocked registers per clock signal path). Clock buffers are placed along the clock path such that the highly resistive interconnect lines (typically long lines) drive loads with low capacitance, whereas the low-resistance interconnect lines (typically short lines) drive loads with high capacitance. Thus, either a centralized module of clock buffer drivers can be used, or those clock buffers driving large capacitive loads can be placed close to the registers, thereby decreasing the interconnect resistance. This design strategy of using compensation techniques to control the local clock skew is graphically depicted in Fig. 16. The variation of clock delay between each of the functional elements is compensated for by parameterizing the current drive of each of the functional block clock buffers resident in the centrally located clock buffering circuit (see Fig. 16). If feedback circuitry is being used to further control the delays and skews within the clock distribution network, as in on-chip phase-locked loops (PLLs), taps are placed close to the register and are fed back to maintain lock. In order to ensure that the clock distribution network is successfully designed, the following practices should be followed: (1) the number of stages of clock buffering within each of the functional blocks should be the same to maintain equal polarity, (2) the maximum clock signal rise and fall times within each functional block should be specified and controlled, and (3) the internal functional block clock skew
Figure 15. Tree-structured clock distribution network with cascaded buffers and distributed RC interconnect impedances.
should be specified and controlled using the same hierarchical clock distribution strategy as is used at the global VLSI system level (35).

Advantages and Disadvantages of Compensation Technique. The primary advantage of using a compensation technique is controlling (and reducing) the on-chip clock skew. Also, the clock delay from the clock source to the clocked registers is reduced. This is a result of improved partitioning of the RC loads. Because the inverters located within each functional block drive large capacitive loads, the interconnect impedance and, in particular, the interconnect resistance driven by any specific clock buffer are small in comparison to the buffer output impedance. Because of their length, the intrablock clock signal paths are fairly resistive; these paths, however, are isolated from the highly capacitive loads. Thus, the RC time constants are reduced, reducing the overall clock
Figure 16. Parameterized buffers integrated into a global clock distribution network to control local clock skew (35).
delay. Another important advantage of this design technique is the ease of partitioning the clock distribution problem among a team of VLSI circuit designers. The overall VLSI system design can be partitioned hierarchically into a manageable domain of information while still providing a strategy for implementing optimal clock distribution networks. The usefulness of this compensation technique is dependent upon the ability to characterize the device and interconnect impedances within the VLSI circuit. With an accurate estimate of these impedances, the parameterized buffers can be designed so as to satisfy a specific clock skew schedule. It is important to note one significant disadvantage of this compensation technique. Unlike interconnect impedances, transistor conductances tend to be very sensitive to variations in supply voltage and to process and environmental conditions (e.g., temperature, radiation). A clock signal path whose delay is dominated by interconnect impedances may vary differently than a clock signal path whose delay is dominated by device impedances (35). Several specific examples of clock distribution networks are discussed in the literature (e.g., 1,11,35,52,59,61,62). Some examples of clock distribution networks applied to high-speed circuits are described in the section entitled ‘‘Directions for Future Research in the Design of Clock Distribution Networks.’’ In each of these clock distribution networks, significant effort has been placed on accurately estimating the magnitude of the resistive and capacitive interconnect impedances to determine the effect of these RC loads on the shape of the clock signal waveform. This information is typically back-annotated into a SPICE-like circuit simulator to adjust the clock delays for minimum clock skew (63). Minimal work exists, however, in developing circuit procedures and algorithms for automating the circuit design of clock distribution networks in structured custom VLSI circuits (64–67).
One primary requirement for developing these algorithms is a physical model for estimating the delay of the clock signal path. An important observation is that the accuracy required to calculate delay differences (as in clock skew) is much greater than that required when calculating absolute delay values (as in the delay of a clock path).

Design of Low-Power Clock Distribution Networks (68–70)

In a modern VLSI system, the clock distribution network may drive thousands of registers, creating a large capacitive load that must be efficiently sourced. Furthermore, each transition of the clock signal changes the state of each capacitive node within the clock distribution network, in contrast with the switching activity in combinational logic blocks, where the change of logic state is dependent on the logic function. The combination of large capacitive loads and a continuous demand for higher clock frequencies has led to an increasingly larger proportion of the total power of a system dissipated within the clock distribution network, in some applications much greater than 25% of the total power (69,71). The primary component of power dissipation in most CMOS-based digital circuits is dynamic power. It is possible to reduce CV²f dynamic power by lowering the clock frequency, the power supply, and/or the capacitive load of the clock distribution network. Lowering the clock frequency, however, conflicts with the primary goal of developing high-speed VLSI systems. Therefore, for a given circuit implementation, low dynamic power dissipation is best achieved by employing design techniques that minimize the power supply and/or the capacitive load. De Man (68) introduced a technique for designing clock buffers and pipeline registers such that the clock distribution network operates at half the power supply swing, reducing the power dissipated in the clock tree by 60% without compromising the clock frequency of the circuit. Kojima et al. (69) describe a similar strategy in which the clock signals also operate over only half of the power supply rail, reducing the power dissipated in the clock tree by ideally 75%. The degradation in system speed is very small because, unlike the clock signals, the data signals operate over the full power supply rail. Thus, the voltage is reduced only in the clocking circuitry, resulting in significantly reduced power with a minimal degradation in system speed. Experimentally derived savings of 67% were demonstrated on a test circuit (a 16-stage shift register) fabricated in a 0.5 µm CMOS technology with only a 0.5 ns degradation in speed using this half-swing clocking scheme. Other approaches exist for reducing the power dissipated within a clock distribution network. These approaches reduce power by decreasing the total effective capacitance required to implement a clock tree. Reductions of 10% to 25% in power dissipated within the clock tree are reported with no degradation in clock frequency (70). The development of design strategies for minimizing the power dissipated both internal to the clock tree and in the overall system being synchronized is a research topic of great relevance to a variety of important applications.
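The ideal saving from half-swing clocking follows directly from the CV²f relation. The sketch below uses assumed values for the clock network capacitance, supply voltage, and frequency, and captures only the ideal quadratic scaling, not the circuit-level details of the half-swing buffer and register designs:

```python
# Illustrative sketch of clock-tree dynamic power, P = C * V^2 * f, and the
# ideal reduction from half-swing clocking (the clock swings over half the
# supply rail). All numeric values below are assumed for illustration.
def dynamic_power(c_farads, v_swing, f_hz):
    return c_farads * v_swing**2 * f_hz

c_clock = 2e-9   # 2 nF total clock network capacitance (assumed)
vdd = 2.5        # supply voltage in volts (assumed)
f_clk = 200e6    # 200 MHz clock frequency (assumed)

p_full = dynamic_power(c_clock, vdd, f_clk)
p_half = dynamic_power(c_clock, vdd / 2, f_clk)
print(f"full swing: {p_full:.2f} W, half swing: {p_half:.3f} W")
print(f"ideal reduction: {100 * (1 - p_half / p_full):.0f}%")  # 75%
```

The 75% figure is the ideal quadratic bound cited for the Kojima scheme; the measured saving on the test circuit was 67%.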
AUTOMATED SYNTHESIS AND LAYOUT OF CLOCK DISTRIBUTION NETWORKS

Different approaches have been taken in the automated synthesis and layout of clock distribution networks, ranging from procedural behavioral synthesis of pipelined registers (72–77) to the automated layout of clock distribution nets for application to gate array and standard cell-based integrated circuits (45,50,61,78–108). The areas of automated layout and of the integration of the effects of clock distribution into behavioral synthesis methodologies are described in this section.

Automated Layout of Clock Distribution Networks

A second research path in the area of automated layout has been the development of algorithms that carefully control the variations in delay among the clock signal nets so as to minimize clock skew (e.g., 50,82,85–90,100). The strategy used is to construct binary tree-like structures with the clock pins at the leaf nodes. Minimal skew clock distribution networks are created using a recursive bottom-up approach. At each of the clock pins of the registers, which represent the leaves of the clock distribution tree, a clock net is defined. The point where two zero-skew clock nets connect is chosen so that the effective delay from that point to each clocked register is identical (within the accuracy of the delay model). This process continues up the clock distribution tree, with the point of connection of each new branch chosen to satisfy the zero-skew design goal. The layout process terminates when the root (or source) of the clock tree is reached. The schematic diagram of this geometric matching process is illustrated in
Figure 17. Geometric matching to create zero clock skew.
Fig. 17. Thus, the automated layout algorithm attempts to balance the delay of each clock branch in a recursive manner, moving from the leaves to the root of the tree. The appropriate branching points of the zero-skew subtree are chosen so as to maintain equal delay. If the zero-skew tapping location falls directly on an unroutable location, such as an existing macrocell, a nonzero clock skew would be realized (108–110). Some early clock-routing algorithms (82,88,89) define the delay as a measure of the total wire length along a path. These algorithms attempt to equalize the lengths of each net from the root of the clock tree to each of the leaf nodes. Thus, the clock skew is minimized during the routing phase of the layout process. No attempt is made to postprocess the layout database to further improve the skew characteristics. In Refs. 108–110, the automated layout of the clock tree is a two-phase process. The clock net is initially routed in a binary-tree manner with the clock pins as leaf nodes and the clock buffer as the root. This layout phase is followed by a post-layout phase in which the clock nets are widened according to the zero-skew specifications, thereby giving the clock layout system additional flexibility in routing around possible blockages. The choice of which clock net to widen is determined by analyzing the sensitivity of the clock net impedance. These sensitivities provide a means of choosing the nets whose widening will bring the average delay of the RC trees as near as possible to a specified target delay. Those nets whose delay must be increased and that are less sensitive to increasing capacitance are widened. However, if all the clock nets are relatively thin, statistical variations in the widths of those wires closest to the clock driver may affect the actual clock skew the most (108).
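The recursive zero-skew merging described above can be sketched under the simple linear (wire length) delay model used by the early clock-routing algorithms. The function name and geometry are illustrative; production algorithms use Elmore-style RC delay models rather than raw wire length:

```python
# Hedged sketch of one step of bottom-up zero-skew merging under a linear
# delay model (delay = wire length): the tap point joining two subtrees is
# placed along the connecting wire so that both sides see equal root-to-leaf
# length, i.e., delay_left + x == delay_right + (wire_length - x).
def merge_tap(delay_left, delay_right, wire_length):
    """Distance of the zero-skew tap point from the left subtree's root.

    The result is clamped to the wire; if clamping occurs, exact zero skew
    is not achievable with this wire alone (analogous to an unroutable
    tapping location producing nonzero skew).
    """
    x = (delay_right - delay_left + wire_length) / 2.0
    return min(max(x, 0.0), wire_length)

# Two subtrees with root-to-leaf lengths 3 and 7, joined by a wire of length 6:
x = merge_tap(3.0, 7.0, 6.0)
print(x)                          # 5.0 units from the left root
print(3.0 + x, 7.0 + (6.0 - x))  # both sides now have length 8.0
```

Repeating this step up the tree, with each merged pair treated as a new subtree, terminates at the root, mirroring the geometric matching process of Fig. 17.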
The widths of the wires nearest the driver matter most because that section of interconnect sees the greatest portion of the distributed RC impedance of the line; therefore, the greatest change in delay occurs as its width varies. These automated clock layout algorithms tend to use simplified delay models, such as linear delay, where the delay is linearly related to the path length, or the Elmore delay (111–113), where the delay along a path is the summation of the products of the branch resistance and the downstream capacitance of every branch on the path from the root to the downstream node i (the clock pin of the register), and is

TDi = Σk Rki Ck    (9)
where Ck is the capacitance at node k and Rki is the resistance of the portion of the (unique) path between the input and the output node i that is common with the (unique) path between the input and node k. The Elmore delay is a first-order step response approximation of the delay through a distributed resistive–capacitive interconnect section. For slow input waveforms, the Elmore delay approximation can become highly inaccurate because the shape and magnitude of the clock waveforms are not considered. The fundamental difficulty with these delay models, however, is the inability to accurately consider the effects of active devices, such as distributed buffers, when estimating delay. Other, more subtle considerations, such as bias-dependent loading and varying waveform shapes, must also be considered. The primary focus of the existing research into the automatic layout of clock distribution networks has been placed on minimizing total wire length, metal-to-metal contacts, and crossovers, as well as attaining zero system-wide clock skew (assuming nominal conditions).

Integration of Clock Distribution into Behavioral Synthesis

Localized clock distribution has been considered only minimally in automated layout or physical synthesis. However, early work in applying local clock skew to behavioral synthesis is described in Refs. 64–67, 72–77, 114. These papers represent early efforts to develop strategies that consider the effects of clock distribution networks in the behavioral synthesis process rather than after the circuit has been partitioned into logic (or register transfer level) blocks. This capability will improve high-level exploratory design techniques as well as optimize the performance of circuits implemented with high-level synthesis tools. As described in Refs.
64–66 and 115–123, the automated synthesis of clock distribution networks can be broken up into four phases: (1) optimal clock scheduling, (2) topological design, (3) circuit design, and (4) physical layout. Optimal scheduling represents a primary research activity in clock distribution networks and is discussed in more detail later. The area of topological design, in which the structure of the clock distribution network is derived from the local clock skew schedule, is discussed in Refs. 64 and 116. Algorithms are developed for converting the clock skew information into path-specific clock delays. With this information and some information describing the hierarchy of the circuit function, a clock distribution tree is developed with delay values assigned to each branch of the tree. With the topological structure and delay information determined, circuit delay elements that satisfy the individual branch delays are synthesized. Circuit techniques to implement the clock tree delay elements are discussed further in Refs. 65, 66, 116, and 123. Finally, a variety of techniques exist to lay out the clock distribution trees. This work represents early research in the development of a systematic methodology for synthesizing tree-structured clock distribution networks that contain distributed cascaded buffers and, furthermore, exploit nonzero localized clock skew. In Refs. 73–75 and 124, a delay model characterizing the timing components of a local data path, similar to Eqs. (1) and (2), is used to incorporate the effects of local clock distribution delays into the retiming process. This is accomplished by assuming that physical regions of similar clock delay exist
throughout an integrated circuit. Retiming is an automated synthesis process for relocating pipeline registers such that the critical worst case path delay is minimized, creating a synchronous system with the highest possible clock frequency while maintaining the function and latency of the original system. Previous work in the area of retiming ignored clock skew in the calculation of the minimum clock period. In the algorithm presented in Refs. 74, 75, and 124, clock delays are attached to individual paths between logic gates. As a register is placed on a new path, it assumes the clock delay of that path. Because each sequentially adjacent pair of registers defines a local data path, as registers are moved from one region to another during the retiming process, the displaced registers assume the clock delay of the new physical region. Thus, the local clock skews of each data path are determined at each iteration of the retiming process, permitting both increased accuracy in estimating the maximum clock frequency and detection and elimination of any catastrophic race conditions. If a choice of register locations does not satisfy a particular clock period or a race condition is created, that specific register instantiation is disallowed. This algorithm, therefore, integrates the effects of clock skew (and variable register and interconnect delays) directly into the synchronous retiming process. It is interesting to note that adding clock delay to a clock path (applying localized clock skew) has an effect similar to retiming, where the register crosses logic boundaries. Thus, time can be shifted by moving the registers or changing the local clock delays, where retiming is discrete in time and localized clock skew is continuous in time. In general, the two methods complement each other (36). As Fishburn mentions in Ref. 
36, because both methods are linear, ‘‘it is likely that efficient procedures could be given for optimizing systems by jointly considering both sets of variables.’’

ANALYSIS AND MODELING OF THE TIMING CHARACTERISTICS OF CLOCK DISTRIBUTION NETWORKS

This research area is composed of a number of disparate topics, all of which have in common the attributes of modeling the general performance characteristics of clock distribution networks. An important and active area of research in clock distribution networks is the design of circuits which are less sensitive to variations in process parameters. This topic is discussed in the subsection entitled ‘‘Design of Process-Insensitive Clock Distribution Networks.’’ Techniques for calculating clock skew are summarized in the section entitled ‘‘Timing Constraints Caused by Clock Skew’’ as well as in the subsection entitled ‘‘Deterministic Models for Estimating Clock Skew.’’

A clock-tree topology that implements a given clock schedule must enforce a clock skew TSkew(i, f) for each local data path, Ri ⇒ Rf, of the circuit in order to ensure that both Eqs. (5) and (6) are satisfied. Previous research (116,121) has indicated that tight control over the clock skews rather than the
Figure 18. Permissible range of the clock skew of a local data path, Ri ⇒ Rf. A clock hazard exists if TSkew(i, f) ∉ [d(i, f), TCP − D(i, f)].
clock delays is necessary for the circuit to operate reliably. Equations (5) and (6) are used in Ref. 121 to determine a permissible range of the allowed clock skew for each local data path. The concept of a permissible range for the clock skew TSkew(i, f) of a local data path Ri ⇒ Rf is illustrated in Fig. 18. When TSkew(i, f) ∈ [d(i, f), TCP − D(i, f)], as shown in Fig. 18, Eqs. (5) and (6) are satisfied. TSkew(i, f) is not permitted to fall in the interval [−∞, d(i, f)], because a race condition would be created, or in the interval [TCP − D(i, f), +∞], because the minimum clock period would be limited. It is this range of permissible clock skew, rather than a specific target value of clock skew, that provides an important opportunity to improve the tolerance of a synchronous digital system to process parameter variations. By selecting a value of clock skew toward the center of the permissible range, the tolerance of the local data path to process parameter and environmental delay variations is improved. Furthermore, this decreased sensitivity to process parameter variations, obtained by exploiting the localized permissible range of each local data path, is completely compatible with the design techniques described in the following subsection.

Design of Process-Insensitive Clock Distribution Networks

A primary disadvantage of clock distribution networks is that the delay of each of the elements of a clock path, the distributed buffers and the interconnect impedances, is highly sensitive to the geometric, material, and environmental variations that exist in an implementing technology. Thus, as device and interconnect parameters vary from process lot to process lot, the specific performance characteristics of the clock distribution network may change. This phenomenon can have a disastrous effect on both the performance and the reliability of a synchronous system and can limit the precision and the design methodology of the clock distribution network.
CLOCK DISTRIBUTION IN SYNCHRONOUS SYSTEMS

It is essential for a robust clock distribution network to exhibit a certain degree of tolerance to variations in process parameters and environmental conditions. In an effort to overcome this problem, various approaches have been developed that mitigate the effect of process tolerances on the design of clock distribution networks while maintaining an effective design methodology.

Threshold Tracking to Control Clock Skew. An important circuit design technique for making clock distribution networks less process sensitive is described by Shoji (125). The technique exploits the Metal Oxide Semiconductor (MOS) circuit characteristic that n-channel and p-channel parameters tend not to track each other as a process varies. Interestingly, the response times of these devices tend to move in opposite directions: a positive threshold voltage shift in one type of MOS transistor (e.g., an n-channel device) is typically accompanied by a p-channel threshold voltage shift in the opposite direction. Shoji quantitatively describes how the delays of the p-channel and n-channel transistors within the distributed buffers of a clock distribution network should be individually matched to ensure that, as the process varies, the path delays between different clock paths continue to track each other. The primary objective of this process-insensitive circuit design technique is to match the two clock edges (of either a p-channel or an n-channel transistor) as the process parameters vary. Shoji presents two rules to minimize the effects of process variations on clock skew:

1. Match the sum of the pull-up delays of the p-channel Metal Oxide Semiconductor Field-Effect Transistors (MOSFETs) with the pull-up delays of any related clock signal paths.

2. Match the sum of the pull-down delays of the n-channel MOSFETs with the pull-down delays of any related clock signal paths.

Figure 19. Elimination of process-induced clock skew by scaling matched transistor types.

Although process variations may change the total clock delay along a given path, the difference in delay between paths will track, keeping the skew small. A circuit using this technique is shown in Fig. 19. Delay times T1, T3, and TA are directly related to the conductances of the n-channel devices N1, N3, and NA, respectively. Delay times T2 and TB are directly related to the conductances of the p-channel devices P2 and PB, respectively. The conductance of each of these devices is proportional to its geometric (MOSFET channel) width. In conventional Complementary Metal Oxide Semiconductor (CMOS) circuit design, the transistor widths are adjusted to satisfy

TA + TB = T1 + T2 + T3    (10)
thereby ensuring that the skew at the output of each clock branch is close to zero. If, for example, either the n-channel or the p-channel threshold voltage varies significantly from its assumed value, Eq. (10) will no longer be satisfied, and a large skew will develop between the outputs, clock and clock′, shown in Fig. 19. Instead of designing the circuit in Fig. 19 to satisfy Eq. (10), the circuit is designed to satisfy Eqs. (11)
and (12). Thus, the primary objective of Eq. (10) (zero clock skew) is maintained, while the added, more stringent design constraints make the entire clock distribution circuit more tolerant to process variations. This increased tolerance occurs because, by satisfying both Eqs. (11) and (12), the n-channel and p-channel transistors of both branches individually track each other, making the system more tolerant to variations in the n-channel and p-channel transistor characteristics:

TA = T1 + T3    (11)

TB = T2    (12)
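A small numerical sketch can illustrate why satisfying Eqs. (11) and (12) individually is stronger than satisfying Eq. (10) alone. The stage delays and process-shift factors below are assumptions chosen only for illustration:

```python
# Hypothetical stage delays (ns) for the two clock branches of Fig. 19.
# T1 and T3 are set by n-channel devices, T2 by a p-channel device.
T1, T2, T3 = 1.0, 1.4, 0.8

# Choose TA and TB per Eqs. (11) and (12) so each device type tracks.
TA = T1 + T3   # n-channel delays matched, Eq. (11)
TB = T2        # p-channel delay matched, Eq. (12)
assert abs((TA + TB) - (T1 + T2 + T3)) < 1e-12  # Eq. (10) follows

# If a process shift scales all n-channel delays by kn and all
# p-channel delays by kp, the two branch delays still track exactly:
for kn, kp in [(1.3, 0.9), (0.8, 1.2)]:
    branch1 = kn * T1 + kp * T2 + kn * T3
    branch2 = kn * TA + kp * TB
    assert abs(branch1 - branch2) < 1e-12  # zero skew despite n/p mismatch
```

A design that satisfied only Eq. (10), with TA and TB split arbitrarily between device types, would not preserve zero skew under such asymmetric n-channel/p-channel shifts.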
This design technique can be used to make circuits less sensitive to process variations and environmental conditions even if the circuits are not inverters but more general forms of logic gates. The technique also ensures similar behavior when interconnect impedances are included within the circuit. Simulated worst case clock skews of circuits using this technique are 10% less than those of conventionally designed circuits (125).

Interconnect Widening to Minimize Clock Skew Sensitivity. As described in the subsection entitled ‘‘Automated Layout of Clock Distribution Networks,’’ one approach for the automated layout of clock nets is to lengthen specific clock nets to equalize the length of every clock line, thereby keeping the clock skew close to zero. A disadvantage of this approach is that these minimum-width lines are very susceptible to variations in the etch rate of the metal lines, as well as to mask misalignment and local spot defects. Therefore, the effective interconnect impedance (and delay) of these long, thin clock nets can vary greatly from wafer to wafer as the line widths vary. In order to make these clock nets less sensitive to process variations, Pullela et al. (108–110) have developed an automated layout algorithm that widens the clock nets, rather than lengthening them, while equalizing the line delays. These nets are therefore less sensitive to both under- and over-etching during the metal-patterning process. Widening the clock lines decreases the interconnect resistance but increases the interconnect capacitance. It is interesting to note that increasing the line width of those branches closer to the root of the RC tree has a greater effect on the clock path delay than increasing the widths closer to the leaf nodes (the clocked registers).
Thus, decreasing the resistance at the source by increasing the line width affects the total path delay more significantly than decreasing the resistance at the leaf node because more capacitance is seen by the large source resistance than if the resistance is greater near the leaf. Therefore, the clock skew is particularly sensitive to changes in line width close to the clock source. One approach to making the clock lines more tolerant of process variations is to make the width of the clock interconnect lines widest near the clock source and thinner as the leaf nodes are approached. This strategy would provide a reasonable tradeoff between controlling the effect of process variations (particularly, metal etch rates) on the clock skew and minimizing line dimensions for process yield and circuit layout efficiency. The relative sensitivities of each net to changes in the capacitive and resistive interconnect impedances are analyzed and integrated into the Elmore delay model (111–113). One of the primary advantages of this approach is that the process of automatically laying out the clock nets is separated from the clock
skew reduction process. Thus, local layout techniques, such as widening the clock nets, can be used to make the overall circuit less sensitive to variations in process parameters.

Deterministic Models for Estimating Clock Skew

A clock signal path within a clock distribution network has a single input and, although paths branch off from the trunk of the tree, a single path (or branch) exists from the clock source to the clock input of a register. This branch is typically composed of distributed buffers and interconnect sections, as shown in Fig. 15. In order to simplify the calculation of the path delay and to provide simple closed-form delay models of the path, it is typically assumed that the buffer on-resistance is much greater than the interconnect resistance that the buffer is driving. This permits the distributed RC interconnect section to be modeled as a simple lumped capacitor. Percent errors reflecting this assumption are provided in Ref. 126, where the errors are dependent upon the ratio of the load resistance to the output buffer on-resistance. However, if the line resistance is not significantly smaller than the buffer output resistance, repeaters (45,60) are often inserted at a point within the clock line to ensure that the output resistance of the repeater buffer is much larger than the local line resistance of the interconnect section between the repeaters (16). In order to calculate the clock path delay and skew, a simple model of a CMOS inverter driving another inverter with line resistance and capacitance between the two inverters is often used. A well-known empirical estimate of the rise (or fall) time of a single CMOS inverter driving an RC interconnect section with a capacitive load (representing the following CMOS inverter) is (44,45,126)

TR/F = 1.02 RInt CInt + 2.21 (RTr CInt + RTr CTr + RInt CTr)    (13)
where RInt and CInt are the resistance and capacitance of the interconnect section, respectively, and RTr and CTr are the output on-resistance of the driving buffer and the input load capacitance (= COX·W·L) of the following buffer, respectively. Note that COX is the oxide capacitance per unit area, and W and L are the width and length, respectively, of the following buffer. An approximate estimate of the output resistance of the buffer may be obtained from (44,45,126)

RO ≈ (L/W) / [µCOX (VDD − VT)]    (14)

where µ is the channel mobility, VDD is the power supply voltage, and VT is the device threshold voltage. Equation (14) is derived from the large-signal I–V equation of a MOSFET operating in the saturation region close to the linear region and is accurate for small channel geometries because velocity saturation decreases the quadratic behavior of the MOS device operating in saturation. The physical delay model represented by Eqs. (13) and (14) is a fairly simple approximation of the delay of a CMOS inverter driving a distributed RC impedance. More complex and accurate delay models exist. This area of inquiry represents an important topic of intensive research unto itself and is discussed in great detail throughout the literature. An important research area in VLSI circuits is timing analysis, where simplified RC models are used to estimate the
delay through a CMOS circuit. Clock characteristics are provided to a timing analyzer to define application-specific temporal constraints, such as the minimum clock period or hold time, on the functional timing of a specific synchronous system (127). Tsay and Lin (48,49) continue this approach by describing an innovative timing analyzer that exploits negative clock skew (i.e., time is ‘‘stolen’’ from adjacent data paths to increase system performance). Therefore, the descriptive term ‘‘cycle stealing’’ is used to describe this process. Dagenais and Rumin (128,129) present a timing analysis system that determines important clocking parameters from a circuit specification of the system, such as the minimum clock period and hold time. This approach is useful for top-down design when performing exploratory estimation of system performance.
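The closed-form estimates of Eqs. (13) and (14) are easy to exercise directly. The sketch below mirrors the two equations; all numeric parameter values are illustrative assumptions, not figures from the article:

```python
# Sketch of the empirical delay estimates of Eqs. (13) and (14).

def buffer_output_resistance(W, L, mu_cox, vdd, vt):
    """Eq. (14): RO ≈ (L/W) / [µ·COX·(VDD − VT)]."""
    return (L / W) / (mu_cox * (vdd - vt))

def rise_fall_time(r_int, c_int, r_tr, c_tr):
    """Eq. (13): rise/fall time of a buffer driving an RC section."""
    return (1.02 * r_int * c_int
            + 2.21 * (r_tr * c_int + r_tr * c_tr + r_int * c_tr))

# Illustrative values: a 20/0.5 µm driver on a short clock segment.
r_int = 100.0      # interconnect resistance, ohms (assumed)
c_int = 0.4e-12    # interconnect capacitance, F (assumed)
c_tr = 50e-15      # input load COX*W*L of the next buffer, F (assumed)
r_tr = buffer_output_resistance(W=20e-6, L=0.5e-6,
                                mu_cox=200e-6, vdd=3.3, vt=0.7)
t = rise_fall_time(r_int, c_int, r_tr, c_tr)
print(f"RO = {r_tr:.0f} ohms, TR/F = {t * 1e12:.0f} ps")
```

Note how the cross terms RTr·CInt and RInt·CTr couple the driver and line impedances; this is why the lumped-capacitor simplification holds only while RTr dominates RInt.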
SPECIFICATION OF THE OPTIMAL TIMING CHARACTERISTICS OF CLOCK DISTRIBUTION NETWORKS

An important element in the design of clock distribution networks is choosing the minimum local clock skews that increase circuit performance by reducing the maximum clock period while ensuring that no race conditions exist. This design process is called optimal clock skew scheduling; it has been extensively studied in Refs. 29, 30, 36, and 130–138 and is described in this section. Starting with the timing characteristics of the circuit, such as the minimum and maximum delay of each combinational logic block and register, it is possible to obtain the localized clock skews and the minimum clock period. This information is determined by formulating the optimal clock scheduling problem as a linear programming problem and solving it with linear programming techniques (36,134). The concept of scheduling the system-wide clock skews for improved performance while minimizing the likelihood of race conditions was first presented by Fishburn in 1990 (36), although the application of localized clock skew to increase the clock frequency and to eliminate race conditions was known previously (34). Fishburn presents a methodology in which a set of linear inequalities is solved using standard linear programming techniques in order to determine each clock signal path delay from the clock source to every clocked register. Two clocking hazards, identical to the constraint relationships described in the section entitled ‘‘Timing Constraints Caused by Clock Skew,’’ are eliminated by solving the set of linear inequalities derived from Eqs. (5) and (6) for each local data path. The deleterious effect of positive clock skew (the maximum data path/clock skew constraint relationship) is described as zero clocking, while the deleterious effect of negative clock skew (the minimum data path/clock skew constraint relationship) is described as double clocking.
This approach is demonstrated on a 4-bit ripple-carry adder with accumulation and input register in a 1.25 µm CMOS technology. The minimum clock period is decreased from 9.5 ns with zero clock skew to 7.5 ns with localized clock skew (36). Szymanski improves this methodology for determining an optimal clock skew schedule by selectively generating the short-path constraints, permitting the inequalities describing the timing characteristics of each local data path to be solved more efficiently (133).
Figure 20. Multistage block diagram of a synchronous digital system.
In order to describe the process for determining an optimal clock skew schedule, a system-timing model is presented. A block diagram of a multistage synchronous digital system is depicted in Fig. 20. Between each stage of registers, there is typically a block of combinational logic with possible feedback paths between the registers. Each register is either a multi- or single-bit register, and all the inputs are assumed to change at the same time point of the transition of the clock signal. A single single-phase clock source is assumed for the circuit. The registers are composed of edge-triggered flip-flops and are considered to assume a single value for each clock cycle. Each combinational logic block is described in terms of its maximum and minimum delay values, TLogic(max) and TLogic(min). These logic delay values are obtained by considering the delay of all possible input-to-output paths within each combinational logic block. For simplicity and without loss of generality, the block diagram in Fig. 20 considers only a single-input, single-output circuit. In this figure, ∆1 and ∆2 are clock delays outside the VLSI-based system, and δ is that portion of the on-chip clock delay that is shared by each of the clock paths (the initial trunk of the clock tree). The registers Rin and Rout make up one set of registers placed at the input and output, respectively, of the VLSI-based system. The circuit is composed of the registers Ri to Rl, where the logic blocks are between the registers, and the signal clock delays are Ci to Cl.

Off-Chip Clock Skew. The circuit model in Fig. 20 also considers the relationship between off-chip and on-chip clock skew. The registers Rin and Rout symbolize off-chip registers and are controlled by the off-chip clock source, which also provides the on-chip clock signals, because the circuit is assumed to be a fully synchronous system. This relationship is represented by

TSkew(in,out) = TSkew(in,i) + TSkew(i,j) + · · · + TSkew(l,out) = 0    (15)
Therefore, to satisfy Eq. (15) in Fig. 20, ∆1 = ∆2 = ∆. Although it is possible to have off-chip nonzero clock skew, it is desirable to ensure that the clock skew between VLSI input/output (I/O) approaches zero, in order to avoid complicating the design of a circuit board or the specification of the
interface of the circuit with other components also controlled by the same clock source. For example, a circuit with intentional nonzero clock skew requires that the clock distribution network of any other synchronous circuit sharing the same global clock be offset by the same amount of temporal skew, otherwise race conditions, such as described in the subsection entitled ‘‘Minimum Data Path/Clock Skew Constraint Relationship,’’ may occur at the board level. This strategy of minimizing clock skew as the level of design complexity shifts should be applied at each higher level of design, such as from the board level to the multiboard level. A fully synchronous circuit must generate data at a rate defined by the clock period of the clock source; otherwise, race conditions may occur at the interface of the circuit with other parts of the system. These race conditions occur when there is a negative clock skew, intentionally or unintentionally introduced in the circuit. Observe that every circuit has unintentional clock skew caused by several factors, one common cause being variations in process parameters. This type of clock skew must be considered during the design of the circuit and should be less than the intentional on-chip clock skew introduced to increase the performance of the circuit. Furthermore, the magnitude of the intentional clock skew at each register I/O may vary substantially, according to the optimization applied to the data path. Therefore, clock skew at the system level of a VLSI circuit should be constrained to approach zero, in order to allow the circuit to communicate correctly with other board-level circuits. For example, a symmetric zero clock skew distribution system should be used for the external registers (symmetric networks are discussed in the subsection entitled ‘‘Symmetric H-Tree Clock Distribution Networks’’). 
Observe that restricting the off-chip clock skew to zero does not preclude the circuit from being optimized with localized clock skew. The primary effect is that the performance improvement is less than that obtained without this constraint. Observe that Eq. (15) is valid only if the interface circuitry is controlled by the same clock source. The restriction does not apply to asynchronous circuits or synchronous circuits that communicate asynchronously (i.e., globally asynchronous, locally synchronous systems).
Global and Local Timing Constraints. As described earlier, in order to avoid either type of clock hazard (either a maximum or minimum data path/clock skew constraint relationship), a set of inequalities must be satisfied for each local data path in terms of the system clock period TCP and the individual delay components within the data path. To avoid limiting the maximum clock rate between two sequentially adjacent registers, Ri and Rj, Eq. (5) must be satisfied. To avoid race conditions between two sequentially adjacent registers, Ri and Rj, Eq. (6) must be satisfied. The system-wide clock period is minimized by finding a set of clock skew values that satisfy Eqs. (5) and (6) for each local data path and Eq. (15) for each global data path. These relationships are sufficient conditions to determine the optimal clock skew schedule such that the overall circuit performance is maximized while any race conditions are eliminated. The timing characteristics of each local data path are assumed to be known. The minimum clock period is obtained when the problem is formalized as a linear program. Consider the following.

Minimize TCP subject to the local and global timing constraints:

TSkew(i,j) ≤ TCP − TPD(max) = TCP − (TC-Qi + TLogic(max) + TInt + TSet-upj), i.e., TCP ≥ TPD(max) + TSkew(i,j), for TCi > TCj

|TSkew(i,j)| ≤ TPD(min) = TC-Q + TLogic(min) + TInt + THold, for TCj > TCi

TSkew(in,i) + TSkew(i,j) + TSkew(j,k) + · · · + TSkew(n,out) = 0 for each global data path

An Example of Determining the Optimal Clock Schedule of a Pipelined System. An example of determining the minimum clock period of a multistage system with feedback paths is the circuit illustrated in Fig. 21. This example is similar to that used in Ref. 23, adapted to consider zero clock skew between off-chip registers. The numbers inside the logic blocks are the minimum and maximum delays of each block, respectively. Similar to the approach taken in Ref. 23, all the register timing parameters are assumed to be zero for simplicity. The linear program that gives the minimum clock period and the optimal clock skew schedule for the circuit shown in Fig. 21 follows.

Minimize TCP subject to

R1 − R2: C1 − C2 = TSkew12 ≥ −20 ns, TSkew12 − TCP ≤ −22 ns
R2 − R3: C2 − C3 = TSkew23 ≥ −26 ns, TSkew23 − TCP ≤ −27 ns
R3 − R4: C3 − C4 = TSkew34 ≥ −2 ns, TSkew34 − TCP ≤ −3 ns
R3 − R2: C3 − C2 = TSkew32 ≥ −2 ns, TSkew32 − TCP ≤ −3 ns
R1 − R4: TSkew12 + TSkew23 + TSkew34 = 0

where the optimal clock schedule is

TSkew12 = −3 ns, TSkew23 = −12 ns, TSkew34 = 15 ns, TCP = 19 ns

If zero clock skew between off-chip registers is not required, the minimum clock period is TCP = 15 ns. Although the restriction of zero off-chip clock skew increases the clock period, there is still an improvement in performance from applying intentional localized nonzero clock skew to the circuit. With zero clock skew everywhere, the minimum period is TCP = 27 ns as a result of the worst case path delay of the local data path between registers R2 and R3.

EXAMPLE IMPLEMENTATIONS OF CLOCK DISTRIBUTION NETWORKS
Figure 21. Example of a synchronous circuit with feedback: registers R1 to R4 (with clock delays C1 to C4 from a common clock source) are separated by logic blocks with minimum/maximum delays of 20/22 ns (R1 to R2), 26/27 ns (R2 to R3), and 2/3 ns (R3 to R4).
A number of interesting and innovative examples of high-performance fully synchronous clock distribution networks have been developed for highly specialized, high-performance commercial processors and described in the literature. These VLSI-based systems required an unusual combination of methodologies and practices commensurate with large design teams while maintaining the localized circuit optimization requirements important to high-speed VLSI circuit design. The design of the clock distribution networks used to synchronize some well-known industrial circuit examples is discussed in this section.

The Bell Telephone WE32100 32 Bit Microprocessor (139,140)

In the early 1980s, a family of 32 bit microprocessors was developed at Bell Laboratories using a variety of advanced CMOS circuit design techniques and methodologies. Because performance was of fundamental importance, significant attention was placed on synchronization, particularly the design of the global clock distribution network. In 1982, Shoji (139) described the clock distribution of the BELLMAC-32A, a 146,000 transistor central processing unit (CPU) operating at 8 MHz and built using a 2.5 µm single-level metal silicide CMOS technology. The clock delay and maximum tolerance of the clock distribution network are specified as 15 ns ± 3.5 ns, defining the maximum permissible clock skew. A four-phase clocking strategy is used; each phase synchronizes, in order, the slave latches, the slave latch logic, the master latches, and the master latch logic. Each time a clock signal crosses a
power bus, a silicide crossunder is used to route the clock signal. In order to equalize the series resistance of each clock path, each clock path is routed with an identical number of three power bus crossunders from either of the two primary clock lines around the chip periphery. Buffers are strategically placed after each crossunder to amplify the degraded clock signal. Using this clock distribution strategy, the circuit satisfied the clock frequency specification of 8 MHz at 70°C with the clock skew not exceeding ±3.5 ns. It is worth noting that, because of the significantly increased complexity encountered when distributing four separate clock signals, a four-phase clocking strategy is not particularly appropriate for higher-density, higher-speed VLSI-based systems. This perspective is consistent with the processor described next. In 1986, Shoji (140) reported on the electrical design of the WE32100 CPU built using a 1.75 µm CMOS technology. The approach used in designing the WE32100 synchronizing clock system is described; local clock skew is optimized for a small number of the critical paths by applying negative clock skew. This strategy is consistent with the customized design methodology used in the design of the CPU. The clock distribution network uses a standard tree structure in which the input clock signal is buffered by the clock driver and distributed over the entire circuit. Buffers are again placed after each crossunder. A strategy very similar to the approach presented in the subsection entitled ‘‘Compensation Techniques for Controlling Clock Skew’’ (35) and depicted in Fig. 14 is used to compensate for the variation in interconnect impedance and register load of each clock line. Clock edges at each register are further synchronized by adjusting the MOSFET transistor geometries of the distributed buffers within the clock distribution network.
Another circuit technique used in the WE32100 to minimize the dependence of the clock skew on process variations is discussed in the subsection entitled ‘‘Design of Process-Insensitive Clock Distribution Networks.’’ This technique minimizes process-induced clock skew caused by asymmetric variations of the device parameters of the n-channel and p-channel MOSFETs. Clock distribution networks with skews an order of magnitude less than those of conventionally designed circuits have been simulated with this technique (125). The issue of chip-to-chip synchronization is important in the design of integrated circuits because these circuits make up the components of a larger computing system. Because individual integrated circuits are processed in different wafer lots, the device parameters may differ; therefore, any internal delays, such as those of the clock lines, will also differ. If the clock delay is TCD for the slowest integrated circuit, as much as TCD/2 clock skew may develop when the fastest and slowest chips communicate synchronously. Therefore, a preferable strategy is to reduce the on-chip clock delay TCD to a minimum and to preselect chips for similar clock delays (140).

The DEC 64 Bit Alpha Microprocessor (71,141)

An important application area for high-speed clock distribution networks is the development of high-speed microprocessors. The performance of these circuits is often limited by the clocking strategy used in their implementation. The DEC Alpha chip currently represents a significant milestone in microprocessor technology. The VLSI circuit operates above 200 MHz with a 3.3 V power supply, implemented in a 0.75 µm
CMOS three-level metal technology. A clock period of 5 ns must be satisfied for each local data path. Therefore, the clock skew must be of very small magnitude (e.g., less than 0.5 ns for a 10% positive clock skew requirement). This strategy assumes the clock skew to be a global effect rather than a local effect. Thus, careful attention to modeling the circuits and interconnects is required in order to design and analyze this type of high-speed system. The Alpha microprocessor contains 1.68 million transistors and supports a fully pipelined 64 bit data structure. The functional attributes of the microprocessor are described in greater detail in Ref. 71; the focus here is on the clocking strategy used within the circuit. In designing this high-speed microprocessor, significant attention was placed on the circuit implementation. The single-phase clock signal is distributed globally on the topmost level of the three-level metal process, as is the power distribution, because the third layer of metal is thicker and wider (7.5 µm pitch with contacts, as compared with 2.625 µm and 2.25 µm for the second and first layers of metal, respectively). Therefore, the resistance per unit length of the third layer of metal and its metal-to-substrate capacitance are lower. A number of inherent difficulties exist within the clock distribution requirements of the Alpha chip. For example, a substantial capacitive load, 3250 pF (3.25 nF), must be driven at high speed by the clock distribution network. Also, for both latch design and power dissipation reasons, so as to minimize short-circuit current, a fast clock edge rate (<0.5 ns) must be maintained throughout the clock distribution network. The large capacitive load is caused by the 63,000 transistor gates driven by the clock distribution system. The distribution of the loads is nonsymmetric, necessitating a specialized strategy for distributing the clock.
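The quoted figures imply a substantial clock power budget. A back-of-the-envelope estimate, treating the full 3250 pF load as switching rail-to-rail every cycle, is:

```python
# Full-swing C*V^2*f dissipation of the Alpha clock load, using the
# figures quoted above: 3250 pF switched load, 3.3 V supply, 200 MHz.
C = 3250e-12   # clock load capacitance, F
V = 3.3        # supply voltage, V
f = 200e6      # clock frequency, Hz
P = C * V * V * f
print(f"clock-load dynamic power = {P:.1f} W")
```

On this simple estimate, the clock load alone dissipates on the order of 7 W, which helps explain the attention paid to edge rates and short-circuit current in the clock network.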
The single 200 MHz clock signal is distributed through five levels of buffering, where the total network consists of 145 separate elements. Each of the elements contains four levels of buffering with a final output stage locally driving the clocked registers. These distributed buffers are configured as a tree, as shown in Fig. 22 (142). Vertical straps are placed on the second level of metal (M2) to minimize any skew that may develop within the initial four-stage portion of the buffer tree. The primary signal-wide distribution is on the third level of metal (M3), designed to be particularly thick to minimize any line resistance as well as to improve process yield. The approach used in designing the clock distribution of the Alpha chip is to permit only positive clock skew, thereby assuring that no catastrophic race conditions induced by negative clock skew can occur. Thus, only a graceful degradation in maximum clock rate caused by localized positive clock skew is possible. This strategy is accomplished by centrally locating the clock generation circuitry within the integrated circuit. The clock signal is then radially distributed from the center of the chip die to its periphery. By carefully monitoring the design of this clock distribution methodology, the likelihood of developing a catastrophic amount of negative clock skew (i.e., |TSkew| > TPD(min)) is minimized.

Figure 22. Clock distribution network of the DEC Alpha microprocessor. Note the mesh structure within the clock tree.
Another application area that requires sophisticated clock distribution is heavily pipelined digital signal processors (DSP), such as Finite Impulse Response (FIR)/Infinite Impulse Response (IIR) digital filters, multiply-adders, multiply-accumulators, and frequency synthesizers. These types of circuits repeatedly use similar circuit elements, such as multipliers, adders, and registers. Careful attention is placed on developing high-performance customized versions of these circuit elements, and these circuit components are repeatedly used within larger DSP systems. The primary difference between different versions of these DSP circuit components is typically the size of the bit slice (e.g., 8 bit versus 64 bit) and the degree of pipelining. Substantial pipelining is applied to these circuits to increase the system clock frequency (29–31). The multiplier function is a good example of a complex circuit element capable of significant improvement in clock rate with high levels of pipelining. Because the data flow is nonrecursive, fewer pipeline registers are required as compared to those structures that contain substantial feedback. Furthermore, the multiplier tends to be the critical element (in terms of speed, area, and power) in most DSP circuits. Heavily pipelined multipliers requiring sophisticated clock distribution networks are the focus of considerable research. In this subsection, specific examples of clock distribution networks in highly pipelined DSP-based multipliers implemented in VLSI technologies are described. A common feature of these VLSI-based multipliers (and many VLSI-based systems) is the repetitive organization of the physical layout. Repetitive parallel arrays of abutted adder cells, pipelined at each bit (a register placed between each adder cell), provide worst-case path delays of only a single adder and a register delay (TC-Q and TSet-up), permitting very high multiplication throughput. 
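Since the worst-case local data path is a single adder cell between two registers, the attainable clock frequency follows directly from the stage delays. A minimal sketch; the delay values are hypothetical and chosen only to illustrate the calculation, not taken from the cited designs:

```python
# The worst-case local data path in a bit-level pipelined multiplier is one
# adder cell between two registers, so the attainable clock frequency is
# roughly 1 / (T_C-Q + T_adder + T_Set-up + T_Skew). The delay values below
# are hypothetical, chosen only to illustrate the calculation.

def max_frequency_hz(t_c_q: float, t_adder: float, t_setup: float,
                     t_skew: float) -> float:
    """Upper bound on clock frequency for a register-adder-register stage
    (all times in seconds)."""
    return 1.0 / (t_c_q + t_adder + t_setup + t_skew)

f = max_frequency_hz(t_c_q=0.4e-9, t_adder=0.7e-9, t_setup=0.3e-9,
                     t_skew=0.1e-9)
print(f"{f / 1e6:.0f} MHz")  # 667 MHz for these assumed delays
```

The register overhead (T_C-Q plus T_Set-up) is paid once per stage, which is why pipelining below the single-bit level yields diminishing returns.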
Specialized architectures, which are beyond the scope of this article, such as carry-save addition, are used to improve the throughput of these VLSI-based multipliers (11,56,143–146). In these types of highly arrayed structures, clock skew can appear both horizontally (serial skew), in the direction of the data flow, and vertically (parallel skew), orthogonal to the data flow. As described by Hatamian and Cash (143,144) and pictured in Fig. 23, assuming that the clock source originates from point A, the clock skew between points D and E at the cell inputs is quite close to the skew between points B and C. As long as this horizontal clock skew is less than the local data path delay between cells, no negative clock skew condition will occur [see the subsection entitled "Minimum Data Path/Clock Skew Constraint Relationship" and Eq. (6)], and the multiplier array will operate properly. Furthermore, additional cells can be added to the array without creating any race conditions as long as the same constraint is maintained.
8 Bit × 8 Bit Pipelined Multiplier (11,56,143–146)
Figure 23. Clock distribution network of pipelined multiplier array (143,144).
Unlike the horizontal skew, however, the vertical skew is cumulative. The clock skew increases as the signal propagates vertically from point A. Thus, the cumulative skew between points F and G dominates over the horizontal skew, again assuming that the clock source originates from point A. It is worth noting that the highly arrayed structure of the multiplier, which makes it particularly amenable to a VLSI implementation, also constrains and limits the topology and layout of the clock distribution network used to synchronize the circuit. In the 8 bit × 8 bit multiplier described in Refs. 143 and 144, which is implemented in a 2.5 µm CMOS technology and operates up to 70 MHz, the clock signals are distributed entirely on metal, except where short polysilicon crossunders are used to traverse the power lines. A two-level buffer clock distribution network is used, where the clock load is balanced at the output of the second buffer, and the path before the second buffer is kept symmetric. This clock distribution strategy is similar to that described in Ref. 145. The multiplier described in Ref. 145, also an 8 bit × 8 bit design, is implemented in a 1.0 µm nMOS technology and operates up to 330 MHz at room temperature and up to 600 MHz with liquid nitrogen cooling. With pipelining after each 1 bit addition, the multiplier architecture is very similar to that described in Ref. 143. The clock signals are distributed in metal outside the multiplier array and in polycide (polysilicon strapped with silicide to minimize the line resistance) inside the multiplier array. Two-phase clocking is used with a total master and slave register fanout of 8 pF. No special circuitry to prevent overlap of the two-phase clock is used because this would degrade the active-high portion of the clock signal. Maximal pipelining of multiplying architectures is taken one step further by pipelining each half bit of an 8 bit × 8 bit multiplier (56). 
Operating at 230 MHz and implemented in a 1.6 µm double-level metal CMOS process, the multiplier architecture is similar to that used in Refs. 143–145. The circuit uses a single-phase clocking scheme (147). A standard three-level buffer network is used to distribute the clock signal. Each row of the multiplier presents a capacitive load of 5.3 pF. A common clock line runs horizontally in metal one (M1) and is driven by a large-sized buffer. The complete buffer tree is composed of 14 buffers, where 10 of these buffers drive the registers (i.e., the leaves of the tree). The maximum skew measured between two clock lines is 150 ps (56). These three multiplier examples are intended to provide some insight into distributing clock signals within highly arrayed VLSI-based DSP systems. The primary attributes of these multipliers (repetitive circuit elements, abutted or closely spaced layouts, and extremely high throughput, i.e., many hundreds of megahertz, resulting from heavy pipelining) make the multiplier an important class of VLSI system that requires highly specialized clock distribution networks.
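The contrast between bounded horizontal skew and cumulative vertical skew in arrays such as that of Fig. 23 can be illustrated with a toy model. The per-cell skew increments and the local data path delay below are assumed values, not measurements from the cited designs:

```python
# Toy model of skew in a pipelined multiplier array (cf. Fig. 23), assuming
# a fixed, hypothetical skew increment per cell in each direction. Horizontal
# skew between adjacent cells is a single increment regardless of position
# (D-to-E is close to B-to-C), while vertical skew accumulates with the
# number of rows traversed from the clock source A.

PER_CELL_H = 20e-12   # horizontal skew increment per cell (assumed, 20 ps)
PER_CELL_V = 20e-12   # vertical skew increment per row (assumed, 20 ps)

def horizontal_skew() -> float:
    """Skew between adjacent cells along the data flow: one increment."""
    return PER_CELL_H

def vertical_skew(rows_from_source: int) -> float:
    """Cumulative skew orthogonal to the data flow: grows linearly."""
    return rows_from_source * PER_CELL_V

def array_race_free(local_path_delay: float) -> bool:
    """The array operates properly as long as the horizontal skew is less
    than the local data path delay between cells (the constraint discussed
    with Eq. (6))."""
    return horizontal_skew() < local_path_delay

print(vertical_skew(8))         # cumulative skew eight rows from the source
print(array_race_free(1.0e-9))  # True for a 1 ns local path delay
```

The linear growth of the vertical term is why the clock entry point and routing direction relative to the data flow matter so much in these arrays.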
DIRECTIONS FOR FUTURE RESEARCH IN THE DESIGN OF CLOCK DISTRIBUTION NETWORKS
Significant research opportunities remain in the design of clock distribution networks. Some examples of these research areas are briefly described in this section.
Automated Synthesis of Clock Distribution Networks
Much of the current research focuses on automating the synthesis of clock distribution networks to support higher performance requirements. The optimal placement of localized distributed buffers, improved delay models that account for nonlinear active transistor behavior, the use of localized clock skew to increase circuit speed, and the integration of RC interconnect and buffer physical delay models for more accurate delay analysis must be considered in the automated design and layout of clock distribution networks. The effects of clock skew, both positive and negative, must also be integrated into behavioral synthesis and RC timing analyzers to detect race conditions as well as to satisfy local and global performance constraints. Synchronous timing constraints must be integrated into high-level behavioral synthesis algorithms, thereby improving the accuracy and generality of these synthesis (and simulation) tools. Most clock distribution networks are tree structured; however, in many customized VLSI circuits, certain portions of the network are strapped in a mesh configuration to minimize interconnect resistance. These mesh networks decrease clock skew as well as improve circuit reliability. Therefore, one area of future research is the automated layout of clock meshes. Both timing models and physical layout must be integrated to handle this important extension of tree-structured clock distribution networks.
Design of Process Insensitive Clock Distribution Networks
A research area of primary importance to the practical design of high-speed clock distribution networks is improving the tolerance of these networks to both process and environmental variations. As functional clock periods approach and exceed one ns (i.e., a one GHz clock frequency), variations in delay of tens of picoseconds could significantly degrade the performance and reliability of these high-speed synchronous systems. Variations of this magnitude are quite common in modern semiconductor technologies. This topic is of immediate importance to the existing high-speed processor community.
Design of Microwave Frequency Clock Distribution Networks
As system-wide clock frequencies increase beyond gigahertz levels, transmission line effects will begin to influence the performance characteristics of clock distribution networks. Models of interconnect sections will require the inclusion of accurate inductive elements. These microwave effects will become more apparent as ultra-high-speed digital technologies, such as heterostructure bipolar transistors (HBT) and the superconductive digital electronic technology, single flux quantum (SFQ) logic (148–153), which operate well into the 10 GHz to 100 GHz frequency range, become more readily available.
System Issues on the Design of Clock Distribution Networks
System issues in the design of clock distribution networks also necessitate focused research. Important and developing architectural structures, such as parallel processors, neural networks, supercomputers, hybrid circuits or monolithic multi-chip modules (154), and wafer scale integration, all require specialized synchronization strategies. Related system issues, such as ultra-low-power circuits that require ultra-low-power clock distribution networks (68–70), are becoming increasingly important. Improving the power dissipation characteristics of clock distribution networks is particularly important because these networks dissipate a large portion of the total system-wide power budget (as much as 40% of the total power dissipated in large microprocessors (71)). Topics such as distributing small differential voltage signals and task monitoring (or power management) strategies are important areas of research in low-power and low-voltage clock distribution network design.
Debug and Production Test of Clock Distribution Networks
A necessary requirement in developing a product is evaluating the quality of that product. Both debug and production test of high-performance clock distribution networks are of fundamental importance. There is little research that describes how best to debug high-speed clock distribution networks, as well as how best to test these networks in a production environment.
SUMMARY AND CONCLUSIONS
All electronic systems are fundamentally asynchronous in nature; by carefully inserting precise localized timing relationships and storage elements, an asynchronous system can be adapted to appear to behave synchronously. As long as the specific local timing and functional relationships are satisfied, synchronous systems can be applied, easing the timing constraints on data flow, albeit requiring a clock distribution network to provide the synchronizing reference signal. 
By synchronizing with distributed clock signals, clock frequency, a measure of how often new data appear at the output of a system, will remain as the primary performance metric in synchronous systems. Furthermore, systems that operate at performance levels that would be otherwise unattainable without requiring significant architectural or technological improvements can be developed. It is often noted that the design of the clock distribution network represents the fundamental circuit limitation to performance in high-speed synchronous digital systems. The local data path-dependent nature of clock skew, rather than its global characteristics, requires extreme care in the design, analysis, and evaluation of high-speed clock distribution networks. The design complexity and difficulty in scaling these networks to finer geometries are the primary reasons for the recent emphasis placed on asynchronous systems. Clearly,
however, synchronous systems will be commonplace for a long time to come, necessitating improved techniques for designing and implementing high-speed, highly reliable clock distribution networks. Furthermore, as tighter control of the clocking parameters improves, approaches such as localized clock skew will be more generally applied to the design of clock distribution networks to further enhance system performance. In this article on synchronous clock distribution, timing relationships are examined and are used to constrain the timing characteristics of the overall system. Various architectures and applications are considered, and circuit strategies for distributing the clock signals are offered. It is the intention of this article to integrate these various subtopics and to provide some sense of cohesiveness to the field of clocking and, specifically, clock distribution networks. BIBLIOGRAPHY 1. F. Anceau, A synchronous approach for clocking VLSI systems, IEEE J. Solid-State Circuits, SC-17: 51–56, 1982. 2. M. A. Franklin and D. F. Wann, Asynchronous and clocked control structures for VLSI based interconnection networks, Proc. 9th Annu. Symp. Comput. Archit., 1982, pp. 50–59. 3. D. Wann and M. Franklin, Asynchronous and clocked control structures for VLSI based interconnection networks, IEEE Trans. Comput., C-32: 284–293, 1983. 4. S. Dhar, M. Franklin, and D. Wann, Reduction of clock delays in VLSI structures, Proc. IEEE Int. Conf. Comput. Des., 1984, pp. 778–783. 5. S. Unger and C.-J. Tan, Optimal clocking schemes for high speed digital systems, Proc. IEEE Int. Conf. Comput. Des., 1983, pp. 366–369. 6. J. Beausang and A. Albicki, A method to obtain an optimal clocking scheme for a digital system, Proc. IEEE Int. Conf. Comput. Des., 1983, pp. 68–72. 7. S. H. Unger and C.-J. Tan, Clocking schemes for high-speed digital systems, IEEE Trans. Comput., C-35: 880–895, 1986. 8. D. Noice, R. Mathews, and J. 
Newkirk, A clocking discipline for two-phase digital systems, Proc. IEEE Int. Conf. Circuits Comput., 1982, pp. 108–111.
18. V. L. Chi, Salphasic distribution of clock signals for synchronous systems, IEEE Trans. Comput., C-43: 597–602, 1994. 19. M. C. Papaefthymiou and K. H. Randall, Edge-triggering vs. two-phase level-clocking, Proc. Symp. Integr. Syst., 1993, pp. 201–218. 20. W. K. C. Lam, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, Valid clocking in wavepipelined circuits, Proc. IEEE Int. Conf. Comput.-Aided Des., 1992, pp. 124–131. 21. D. A. Joy and M. J. Ciesielski, Clock period minimization with wave pipelining, IEEE Trans. Comput.-Aided Des., CAD-12: 461–472, 1993. 22. C. T. Gray, W. Liu, and R. K. Cavin, III, Timing constraints for wave-pipelined systems, IEEE Trans. Comput.-Aided Des., CAD13: 987–1004, 1994. 23. X. Zhang and R. Sridhar, Synchronization of wave-pipelined circuits, Proc. IEEE Int. Conf. Comput. Des., 1994, pp. 164–167. 24. E. G. Friedman, Clock distribution design in VLSI circuits—an overview, Proc. IEEE Int. Symp. Circuits Syst., 1993, pp. 1475– 1478. 25. K. A. Sakallah et al., Synchronization of pipelines, IEEE Trans. Comput.-Aided Des., CAD-12: 1132–1146, 1993. 26. K. Wagner and E. McCluskey, Tuning, Clock Distribution, and Communication in VLSI High-Speed Chips, Stanford, CA: Stanford Univ., 1984, CRC Tech. Rep. 85-5. 27. K. D. Wagner, A Survey of Clock Distribution Techniques in HighSpeed Computer Systems, Stanford, CA: Stanford Univ., 1986, CRC Tech. Rep. No. 86-20. 28. E. G. Friedman, Latching characteristics of a CMOS bistable register, IEEE Trans. Circuits and Systems I, Fundam. Theory Appl., CAS-40: 902–908, 1993. 29. E. G. Friedman and J. H. Mulligan, Jr., Clock frequency and latency in synchronous digital systems, IEEE Trans. Signal Process., SP-39: 930–934, 1991. 30. E. G. Friedman and J. H. Mulligan, Jr., Pipelining of high performance synchronous digital systems, Int. J. Electron., 70 (5): 917–935, 1991. 31. E. G. Friedman and J. H. Mulligan, Jr., Pipelining and clocking of high performance synchronous digital systems, in M. A. 
Bayoumi and E. E. Swartzlander, Jr., (eds.), VLSI Signal Processing Technology, Norwell, MA: Kluwer, 1994, pp. 97–133. 32. T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, Cambridge, MA: MIT Press, 1989.
9. C. Svensson, Signal resynchronization in VLSI system, Integr. VLSI J., 4 (1): 75–80, 1986.
33. D. B. West, Introduction to Graph Theory, Upper Saddle River, NJ: Prentice-Hall, 1996.
10. M. S. McGregor, P. B. Denyer, and A. F. Murray, A single-phase clocking scheme for CMOS VLSI, Proc. Stanford Conf. Adv. Res. Very Large Scale Integr., 1987, pp. 257–271.
34. M. Hatamian, Understanding clock skew in synchronous systems, in S. K. Tewksbury, B. W. Dickinson, and S. C. Schwartz (eds.), Concurrent Computations Algorithms, Architecture and Technology, New York: Plenum, 1988, pp. 87–96.
11. M. Hatamian and G. L. Cash, Parallel bit-level pipelined VLSI designs for high-speed signal processing, Proc. IEEE, 75: 1192–1202, 1987. 12. M. Hatamian et al., Fundamental interconnection issues, AT&T Tech. J., 66 (4): 13–30, 1987.
35. E. G. Friedman and S. Powell, Design and analysis of a hierarchical clock distribution system for synchronous standard cell/ macrocell VLSI, IEEE J. Solid-State Circuits, SC-21: 240–246, 1986.
13. K. D. Wagner, Clock system design, IEEE Des. Test Comput., 5 (5): 9–27, 1988.
36. J. P. Fishburn, Clock skew optimization, IEEE Trans. Comput., C-39: 945–951, 1990.
14. A. F. Champernowne et al., Latch-to-latch timing rules, IEEE Trans. Comput., C-39: 798–808, 1990.
37. J. Alves Marques and A. Cunha, Clocking of VLSI circuits, in Randell and Treleaven (eds.), VLSI Architecture, Englewood Cliffs, NJ: Prentice-Hall, 1983, pp. 165–178.
15. D. G. Messerschmitt, Synchronization in digital system design, IEEE J. Sel. Areas Commun., 8: 1404–1419, 1990. 16. M. Afghahi and C. Svensson, Performance of synchronous and asynchronous schemes for VLSI systems, IEEE Trans. Comput., C-41: 858–872, 1992. 17. V. Chi, Designing salphasic clock distribution systems, Proc. Symp. Integr. Syst., 1993, pp. 219–233.
38. Method of deskewing data pulses, IBM Tech. Disclosure Bull., 28 (6): 2658–2659, 1985. 39. S. C. Menon and K. A. Sakallah, Clock qualification algorithm for timing analysis of custom CMOS VLSI circuits with overlapping clocking disciplines and on-section clock derivation, Proc. 1st Int. Conf. Syst. Integr., 1990, pp. 550–558.
40. K. A. Sakallah, T. N. Mudge, and O. A. Olukoton, Analysis and design of latch-controlled synchronous digital circuits, Proc. ACM/IEEE Des. Autom. Conf., 1990, pp. 111–117.
63. G. Yacoub et al., A system for critical path analysis based on back annotation and distributed interconnect impedance models, Microelectron. J., 19 (3): 21–30, 1988.
41. E. G. Friedman, Performance limitations in synchronous digital systems, Ph.D. dissertation, Irvine: Univ. California, 1989.
64. J. L. Neves and E. G. Friedman, Topological design of clock distribution networks based on non-zero clock skew specifications, Proc. 36th Midwest Symp. Circuits Syst., 1993, pp. 468–471. 65. J. L. Neves and E. G. Friedman, Circuit synthesis of clock distribution networks based on non-zero clock skew, Proc. IEEE Int. Symp. Circuits Syst., 1994, pp. 4.175–4.178. 66. J. L. Neves and E. G. Friedman, Synthesizing distributed buffer clock trees for high performance ASICs, Proc. IEEE Int. Conf. ASICs, 1994, pp. 126–129. 67. J. Chung and C.-K. Cheng, Optimal buffered clock tree synthesis, Proc. IEEE Int. Confer. ASICs, 1994, pp. 130–133. 68. E. De Man and M. Schobinger, Power dissipation in the clock system of highly pipelined ULSI CMOS circuits, Proc. Int. Workshop Low Power Des., 1994, pp. 133–138. 69. H. Kojima, S. Tanaka, and K. Sasaki, High-swing clocking scheme for 75% power saving in clocking circuitry, Proc. IEEE Symp. Very Large Scale Integr. Circuits, 1994, pp. 23–24. 70. J. L. Neves and E. G. Friedman, Minimizing power dissipation in non-zero skew-based clock distribution networks, Proc. IEEE Int. Symp. Circuits Syst., 1995, pp. 1576–1579. 71. D. W. Dobberpuhl et al., A 200-MHz 64-b dual-issue CMOS microprocessor, IEEE J. Solid-State Circuits, SC-27: 1555–1565, 1992. 72. N. Park and A. Parker, Synthesis of optimal clocking schemes, Proc. 22nd ACM/IEEE Des. Autom. Conf., 1985, pp. 489–495. 73. E. G. Friedman, The application of localized clock distribution design to improving the performance of retimed sequential circuits, Proc. IEEE Asia-Pac. Conf. Circuits Syst., 1992, pp. 12–17. 74. T. Soyata, E. G. Friedman, and J. H. Mulligan, Jr., Integration of clock skew and register delays into a retiming algorithm, Proc. IEEE Int. Symp. Circuits Syst., 1993, pp. 1483–1486. 75. T. Soyata and E. G. Friedman, Retiming with non-zero clock skew, variable register, and interconnect delay, Proc. IEEE Int. Conf. Comput -Aided Des., 1994, pp. 234–241. 76. L.-F. Chao and E. 
H.-M. Sha, Retiming and clock skew for synchronous systems, Proc. IEEE Int. Symp. Circuits Syst., 1994, pp. 1.283–1.286. 77. B. Lockyear and C. Ebeling, The practical application of retiming to the design of high-performance systems, Proc. IEEE Int. Conf. Comput.-Aided Des., 1993, pp. 288–295. 78. Y. Ogawa et al., Efficient placement algorithms optimizing delay for high-speed ECL masterslice LSI’s, Proc. 23rd ACM/IEEE Des. Autom. Conf., 1986, pp. 404–410. 79. S. Boon et al., High performance clock distribution for CMOS ASICs, Proc. IEEE Custom Integr. Circuits Conf., 1989, pp. 15.4.1–15.4.5. 80. A. Chao, Clock tree synthesis for large gate arrays, High Perform. Syst., 10: 32, 1989. 81. D. Y. Montuno and R. C. S. Ma, A layout methodology for the synthesis of high speed global clock nets, Proc. IEEE Custom Integr. Circuits Conf., 1992, pp. 28.4.1–28.4.4. 82. P. Ramanathan and K. G. Shin, A clock distribution scheme for non-symmetric VLSI circuits, Proc. IEEE Int. Conf. Comput.Aided Des., 1989, pp. 398–401. 83. K. D. Boese and A. B. Kahng, Zero-skew clock routing trees with minimum wirelength, Proc. IEEE Int. Conf. ASICs, 1992, pp. 1.1.1–1.1.5. 84. J. Burkis, Clock tree synthesis for high performance ASICs, Proc. IEEE Int. Conf. ASICs, 1991, pp. 9.8.1–9.8.4. 85. T.-H. Chao, Y.-C. Hsu, and J.-M. Ho, Zero skew clock net routing, Proc. ACM/IEEE Des. Autom. Conf., 1992, pp. 518–523.
42. T. W. Williams and K. P. Parker, Design for testability—A survey, Proc. IEEE, 71: 98–112, 1983. 43. R. H. Dennard et al., Design of ion-implanted MOSFET’s with very small physical dimensions, IEEE J. Solid-State Circuits, SC-9: 256–268, 1974. 44. H. B. Bakoglu and J. D. Meindl, Optimal interconnection circuits for VLSI, IEEE Trans. Electron Devices, ED-32: 903–909, 1985. 45. H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Reading, MA: Addison-Wesley, 1990. 46. C. V. Gura, Analysis of Clock Skew in Distributed Resistive-Capacitive Interconnects, Urbana: Univ. Illinois, 1987, SRC Tech. Rep. No. T87053. 47. C. Svensson and M. Afghahi, On RC line delays and scaling in VLSI systems, Electron. Lett., 24 (9): 562–563, 1988. 48. I. Lin, J. A. Ludwig, and K. Eng, Analyzing cycle stealing on synchronous circuits with level-sensitive latches, Proc. ACM/ IEEE Des. Autom. Conf., 1992, pp. 393–398. 49. R.-S. Tsay and I. Lin, Robin Hood: A system timing verifier for multi-phase level-sensitive clock designs, Proc. IEEE Int. Conf. ASICs, 1992, pp. 516–519. 50. T.-H. Chao et al., Zero skew clock routing with minimum wirelength, IEEE Trans. Circuits Syst. II, Analog Digital Signal Process., CAS-39: 799–814, 1992. 51. S. Y. Kung, VLSI Array Processors, Englewood Cliffs, NJ: Prentice-Hall, 1988. 52. D. Mijuskovic, Clock distribution in application specific integrated circuits, Microelectron. J., 18 (4): 15–27, 1987. 53. H. B. Bakoglu, J. T. Walker, and J. D. Meindl, A symmetric clock-distribution tree and optimized high-speed interconnections for reduced clock skew in ULSI and WSI circuits, Proc. IEEE Int. Conf. Comput. Des., 1986, pp. 118–122. 54. M. Nekili et al., Logic-based H-trees for large VLSI processor arrays: A novel skew modeling and high-speed clocking method, Proc. 5th Int. Conf. Microelectron., 1993, pp. 1–4. 55. D C. Keezer and V. K. Jain, Clock distribution strategies for WSI: A critical survey, Proc. IEEE Int. Conf. Wafer Scale Integr., 1991, pp. 
277–283. 56. D. Somasekhar and V. Visvanathan, A 230-MHz half-bit level pipelined multiplier using true single-phase clocking, IEEE Trans. Very Large Scale Integr. VLSI Syst., VLSI-1: 415–422, 1993. 57. D. D. Gajski, Silicon Compilation, Reading, MA: Addison-Wesley, 1988. 58. B. Wu and N. A. Sherwani, Effective buffer insertion of clock tree for high-speed VLSI circuits, Microelectron. J., 23 (4): 291– 300, 1992. 59. S. Padin, Scheme for distributing high-speed clock signals in a large digital system, Electron. Lett., 25 (2): 92–93, 1989. 60. S. Dhar and M. A. Franklin, Optimum buffer circuits for driving long uniform lines, IEEE J. Solid-State Circuits, SC-26 (1): 32– 40, 1991. 61. G. M. Blair, Skew-free clock distribution for standard-cell VLSI designs, IEE Proc. G, 139 (2): 265–267, 1992. 62. R. Maini, J. McDonald, and L. Spangler, A clock distribution circuit with a 100 PS skew window, Proc. IEEE Bipolar Circuits Technol. Meet., 1987, pp. 41–43.
86. J. Cong, A. B. Kahng, and G. Robins, Matching-based methods for high-performance clock routing, IEEE Trans. Comput.-Aided Des., CAD-12: 1157–1169, 1993.
109. N. Menezes et al., Skew reduction in clock trees using wire width optimization, Proc. Custom Integr. Circuits Conf., 1993, pp. 9.6.1–9.6.4.
87. M. Edahiro, A clock net reassignment algorithm using Voronoi diagrams, Proc. IEEE Int. Conf. Comput.-Aided Des., 1990, pp. 420–423. 88. M. A. B. Jackson, A. Srinivasan, and E. S. Kuh, Clock routing for high performance ICs, Proc. ACM/IEEE Des. Autom. Conf., 1990, pp. 573–579. 89. A. B. Kahng, J. Cong, and G. Robins, High-performance clock routing based on recursive geometric matching, Proc. ACM/ IEEE Des. Autom. Conf., 1991, pp. 322–327. 90. R. S. Tsay, Exact zero skew, Proc. IEEE Int. Conf. Comput.Aided Des., 1991, pp. 336–339. 91. T. Saigo et al., Clock skew reduction approach for standard cell, Proc. IEEE Custom Integr. Circuits Conf., 1990, pp. 16.4.1– 16.4.4. 92. P. D. Ta and K. Do, A low power clock distribution scheme for complex IC system, Proc. IEEE Int. Conf. ASICs, 1991, pp. 1-5.1–1-5.4. 93. P. R. Mukund and D. W. Bouldin, A graph theoretic approach to the clock distribution problem, Proc. IEEE Int. Conf. ASICs, 1991, pp. 7-5.1–7-5.4. 94. J. Cong, A. Kahng, and G. Robins, On clock routing for general cell layouts, Proc. IEEE Int. Conf. ASICs, 1991, pp. 14-5.1– 14-5.4. 95. N. A. Sherwani and B. Wu, Clock layout for high-performance ASIC based on weighted center algorithm, Proc. IEEE Int. Conf. ASICs, 1991, pp. 15-5.1–15-5.4. 96. D. A. Joy and M. J. Ciesielski, Placement for clock period minimization with multiple wave propagation, Proc. 28th ACM/ IEEE Des. Autom. Conf., 1991, pp. 640–643. 97. Y-M. Li and M. A. Jabri, A zero-skew clock routing scheme for VLSI circuits, Proc. IEEE Int. Conf. Comput.-Aided Des., 1992, pp. 458–463. 98. W. Khan, M. Hossain, and N. Sherwani, Zero skew clock routing in multiple-clock synchronous systems, Proc. IEEE Int. Conf. Comput.-Aided Des., 1992, pp. 464–467. 99. Q. Zhu and W. W-M. Dai, Perfect-balance planar clock routing with minimal path-length, Proc. IEEE Int. Conf. Comput.-Aided Des., 1992, pp. 473–476. 100. R-S. Tsay, An exact zero-skew clock routing algorithm, IEEE Trans. 
Comput.-Aided Des., CAD-12: 242–249, 1993. 101. W. Khan and N. Sherwani, Zero skew clock routing algorithm for high performance ASIC systems, Proc. IEEE Int. Conf. ASICs, 1993, pp. 79–82. 102. W. Khan, S. Madhwapathy, and N. Sherwani, An hierarchical approach to clock routing in high performance systems, Proc. IEEE Int. Symp. Circuits Syst., 1994, pp. 1.467–1.470. 103. S. Lin and C. K. Wong, Process-variation-tolerant zero skew clock routing, Proc. IEEE Int. Conf. ASICs, 1993, pp. 83–86. 104. M. Edahiro, A clustering-based optimization algorithm in zeroskew routings, Proc. ACM/IEEE Des. Autom. Conf., 1993, pp. 612–616. 105. N-C. Chou and C-K. Cheng, Wire length and delay minimization in general clock net routings, Proc. IEEE Int. Conf. Comput.Aided Des., 1993, pp. 552–555. 106. M. Edahiro, Delay minimization for zero-skew routing, Proc. IEEE Int. Conf. Comput.-Aided Des., 1993, pp. 563–566. 107. M. Edahiro, An efficient zero-skew routing algorithm, Proc. 31st ACM/IEEE Des. Autom. Conf., 1994. 108. S. Pullela, N. Menezes, and L. T. Pillage, Reliable non-zero clock trees using wire width optimization, Proc. ACM/IEEE Des. Autom. Conf., 1993, pp. 165–170.
EBY G. FRIEDMAN University of Rochester
Dram Chips. Standard Article. Yoichi Oshima (Japanese Patent Office, Tokyo, Japan), Bing J. Sheu and Steve H. Jen (University of Southern California, Los Angeles, CA). Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W7601. Article Online Posting Date: December 27, 1999.
The sections in this article are: DRAM Memory Cell; Basic DRAM System; Low-Power Design Technology; High-Throughput DRAM Technology; Acknowledgments.
DRAM CHIPS
Figure 2. DRAM memory cell cross-sectional view. (Labels: bit line, word line, access MOS transistor, plate electrode for storage capacitor.)
DRAM MEMORY CELL

Dynamic random access memory (DRAM) is defined as random access memory that requires a refresh operation to maintain the stored data. DRAM has always been the leading semiconductor product, requiring the most advanced semiconductor device and fabrication technology (1–3). DRAM is the most popular memory device because of its high performance-to-cost ratio. In comparison with other kinds of memories (4,5), DRAM has a very simple memory cell: the one-transistor, one-capacitor cell is the most popular type in present DRAMs. Figure 1 shows the schematic diagram of the basic DRAM memory cell. The stored data are represented by the amount of charge on the capacitor. For example, an ample amount of charge stored on the capacitor is recognized as logic 1, while no charge is recognized as logic 0. The storage capacitor is selected through the switching transistor. The storage capacitor is very important for DRAM performance. Since the capacitor leaks some of its charge, a large capacitance is a great advantage in maintaining data integrity; additionally, a large capacitance can improve speed performance. Figure 2 shows the cross section of a conventional planar DRAM cell. Since only a minimal area can be allowed in order to achieve a high-density DRAM, the structure of the cell capacitor in advanced memory cells is extended in the vertical dimension, unlike the conventional planar capacitor. Figures 3 and 4 show the diagrams of the two well-known fundamental three-dimensional capacitor cells (6–9). To achieve the required minimum value of capacitance, both approaches have been successful in minimizing the area consumption and fabrication process complexity. Furthermore, a variety of shapes of capacitor electrodes have been investigated in order to obtain larger capacitance in a small area for higher density. Figure 5 shows an advanced three-dimensional capacitor cell. This type of capacitor will play a major role in 64 Mb DRAM and higher-density versions. In addition to the above-mentioned approaches, larger capacitance obtained with high-dielectric-constant insulators and with surface area enlarged by microvillus patterning of the electrodes has been investigated for next-generation memory cells (10,11).

Figure 1. DRAM memory cell. (Labels: word line, bit line, access MOS transistor, storage capacitor.)

Figure 3. Cross-sectional view of trench capacitor cell. (Labels: bit line, word line, access MOS transistor, buried Si, plate electrode for storage capacitor.)

Figure 4. Cross-sectional view of stacked capacitor cell. (Labels: bit line, word line, access MOS transistor, SiO2, storage node electrode, plate electrode for storage capacitor.)

Figure 5. Cross-sectional view of advanced three-dimensional capacitor cell. (Labels: column select line, second-level word line, shield bit line, word line, storage node electrode, Ta2O5, plate electrode for storage capacitor.)

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

BASIC DRAM SYSTEM
Figure 6 shows the block diagram of a DRAM system. In principle, the function of DRAM is very simple: write the data into a certain address, preserve it for a certain period of time, and read the data out. We can trace this simple function using the diagram shown in Fig. 6. In the read operation, the memory chip first needs to set up the address data. In contrast with other memory technologies, DRAM adopts a unique address input scheme. With the address multiplex scheme, the address data are divided into two portions, defined as the row address and the column address, respectively. According to the address enable signals, each address portion is stored in an address input buffer with its own timing. Therefore, the number of address pins is equal to one-half of the total number of address bits; on the other hand, two sequential address inputs are necessary. After obtaining the address information, the stored data selected by the row address are read by sense amplifiers, which amplify the signals. Then, the column address is used to transmit the amplified signals to the output buffers. In the write operation, a write amplifier reverses the data on the I/O bus if the input data differ from the data already on the I/O bus.

Sense Amplifier

A sense amplifier is as important as the memory cell, because its performance significantly affects the performance of the whole chip. The function of the sense amplifier is to read the data
Figure 6. DRAM system block diagram. (Blocks: control signal input buffer, address input buffer, row decoder, column decoder, memory cell array, sense amplifiers, data input buffer, data output buffer; inputs: control signal, address signal, input data, control clock; output: output data.)

Figure 7. Conventional sense amplifier block. (Labels: word lines WL1, WL2; memory cells MC1, MC2; bit line pair BL/BL; precharge signal PR; transistors M1–M7; sense drivers M8 (SDP) and M9 (SDN); supply VDD; precharge level 1/2 VDD.)
from the storage cell and to magnify the signal level up to the appropriate logic level so that it can be treated by digital logic gates. Taking a DRAM sense amplifier as an example, Fig. 7 shows the schematic diagram of a conventional sense amplifier for DRAM. This circuit block operates as follows. In the initial condition, assume that an ''H'' level is stored in MC1. The bit line pair is precharged and equalized by M1, M2, and M3, and the other MOS transistors are turned off. First, M1, M2, and M3 turn off to set up the preparation stage for sensing. Then, the selected word line (WL1) is activated by the row address. As a result, one MOS access transistor connects its storage capacitor to the bit line. This action unbalances the bit line pair: the voltage of one bit line changes slightly because of the charge stored in the capacitor of MC1, whereas the other bit line remains at the precharged level. This difference is detected at the sensing nodes. Sense amplifier driver transistors M8 and M9 then turn on to activate the amplification function, and the small signal between the bit line pair is amplified to a large signal level by the cross-coupled MOS transistors. The basic function is simple; however, optimizing the device sizes and the timing of each clock is an essential issue for the DRAM system designer. Figure 8 shows the timing diagram for the sensing operation.

LOW-POWER DESIGN TECHNOLOGY

Low-power DRAM technology is increasingly important for general memory use (12). Since the current memory market tends toward outdoor usage, battery operability is a crucial requirement for memories. In general, the total power
consumption of the DRAM chip (P), which is operated at VDD and cycle time tRC, can be expressed as follows:
P ≅ Σj Cj · ΔVj · VDD / tRC + IDC · VDD
  ≅ ΔQT · VDD / tRC + IDC · VDD
  ≅ (CBT · ΔVD + CPT · ΔVP) · VDD / tRC + IDC · VDD
  ≅ (ΔQBT + ΔQPT) · VDD / tRC + IDC · VDD          (1)
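Equation (1) lends itself to a quick numerical sanity check. The sketch below plugs in purely hypothetical component values (the supply voltage, cycle time, capacitances, and dc current are illustrative assumptions, not taken from this article or any datasheet) to show how the bit-line and peripheral charging terms combine with the dc term:

```python
# Hypothetical example values -- for illustration only.
V_DD = 3.3              # supply voltage (V)
t_RC = 90e-9            # cycle time (s)
I_DC = 1e-3             # dc current component (A)

C_BT = 4096 * 100e-15   # total bit-line capacitance: 4096 bit lines at 100 fF each (F)
dV_D = V_DD / 2         # bit-line charged level: swing from the 1/2 VDD precharge (V)
C_PT = 300e-12          # total capacitance of peripheral blocks (F)
dV_P = V_DD             # peripheral nodes assumed to swing the full rail (V)

dQ_BT = C_BT * dV_D     # total bit-line charge per cycle (C)
dQ_PT = C_PT * dV_P     # total peripheral charge per cycle (C)

# Eq. (1): P ~= (dQ_BT + dQ_PT) * V_DD / t_RC + I_DC * V_DD
P = (dQ_BT + dQ_PT) * V_DD / t_RC + I_DC * V_DD
print(f"P ~ {P * 1e3:.1f} mW")
```

With these numbers the charging term dominates (roughly 61 mW against 3.3 mW of dc power), which is why the text singles out VDD, ΔQBT, and IDC as the main levers for low-power design.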
where

Cj is the capacitance of node j
ΔVj is the voltage variation at node j
ΔQT is the total charge of the chip during one cycle
ΔQBT is the total charge of the bit lines during one cycle
ΔQPT is the total charge of the peripheral blocks during one cycle
IDC is the dc current component
CBT is the total capacitance of the bit lines
ΔVD is the bit line charged level
CPT is the total capacitance of the peripheral blocks
ΔVP is the voltage variation of the peripheral blocks

There are several effective methods to reduce the power consumption. Special care with the supply voltage VDD, the bit line charge ΔQBT, and the dc current component IDC can minimize the power consumption effectively.

HIGH-THROUGHPUT DRAM TECHNOLOGY
Figure 8. Timing diagram of sensing operation. (Waveforms: PR, WL1, SDN, SDP, BL/BL.)
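The "slight" bit-line voltage change that the cross-coupled pair of Fig. 7 then amplifies can be estimated with a standard first-order charge-sharing calculation. This relation is a textbook approximation rather than one stated explicitly in the article, and the capacitance values below are hypothetical:

```python
# Charge sharing between the storage capacitor and the bit line.
# All component values are hypothetical, for illustration only.
V_DD = 3.3
V_pre = V_DD / 2          # bit line precharged and equalized to 1/2 VDD
V_cell = V_DD             # an "H" level stored in the cell (as for MC1 in Fig. 7)
C_s = 30e-15              # storage capacitor (F)
C_bl = 300e-15            # bit-line parasitic capacitance (F)

# When the word line turns on the access transistor, the cell and bit-line
# capacitances share charge, so the bit line moves by only a fraction of
# the stored voltage swing.
dV = (V_cell - V_pre) * C_s / (C_s + C_bl)
print(f"bit-line signal ~ {dV * 1e3:.0f} mV")
```

A signal of only one to two hundred millivolts on a multivolt rail is why a large storage capacitance and carefully timed sense amplification matter so much.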
Since the speed of the central processing unit (CPU) has increased drastically, DRAM is under pressure to achieve fast response. A variety of useful approaches can be categorized according to three criteria: multibit data output, high-frequency control clock, and advanced interface technology. These approaches are selected with careful consideration of cost-per-bit performance, which is the most important factor for any type of memory chip. Figure 9 shows the recent memory market, organized by speed performance and capacity. Instead of common memory devices, several advanced approaches, such as extended data out (EDO) DRAM, synchronous DRAM (SDRAM), and Rambus DRAM (RDRAM), will play the main roles in the future memory market. Although the
(Figure 9 charts: bandwidth (MB/s) plotted in part (a) against memory capacity (Mbyte), from low-end PCs (fast page mode DRAM) through middle-end PCs (EDO DRAM) and unified-memory PCs (Rambus DRAM) to high-end PCs (synchronous DRAM); and in part (b) against number of colors and picture size, for EDO, synchronous, and Rambus DRAM.)
purpose of the new approaches is to achieve high data throughput, each approach has unique features.

Figure 9. (a) DRAM application for PC. (b) DRAM application for graphics.

Multibit DRAM

Increasing the number of data outputs is a simple and straightforward approach to achieving high data throughput. This approach has already been adopted successfully in a variety of memory chips; in fact, ×4, ×8, and ×16 bit DRAMs are very popular products in the commercial market. Recently, a ×32 bit 16M DRAM for mass production was reported (13). In order to realize a large number of output bits, the development of an advanced data output buffer is indispensable. Basically, there are three inherent problems for multibit memory: increased switching noise, increased power consumption, and increased chip area. Therefore, the data output buffer is designed to minimize these effects. Switching noise is caused by parasitic inductance in the wire connecting the external pin and the internal pad on the chip. The larger the number of data output buffers that are activated, the more the output signal is distorted by the noise. Power consumption is proportional to the number of activated data output buffers. The following approaches are useful in solving these problems. Adoption of multiple power supply lines and ground lines can prevent noise interference. Minimization of the physical wire length can reduce the parasitic inductance. Lowering the power supply voltage can reduce power consumption. In practice, a combination of these approaches can produce stable memory performance. Figure 10 shows a schematic diagram of a capacitor-boosted-type data output buffer (14). At the initial stage, Q1 and Q3, which are transistors of small drive ability, turn on to avoid the abrupt voltage transition that would cause considerable noise. After a certain output level is reached, Q2 and Q4, which are transistors of large drive ability, turn on to increase speed. T1 and T2 are used to set the delay time. Since the gate voltage of Q2 is boosted by capacitor coupling, a full-VDD-level output signal can be achieved.

Fast Page Mode DRAM

The currently dominant mode of DRAM is fast page mode, although there are modes of similar performance, such as static
Figure 10. Schematic diagram of capacitor-boosted-type data output buffer. (Labels: VDD, RD, Q1–Q4, T1, T2, φ, Dout.)
column mode and nibble mode. There are three fundamental operation modes: read, write, and refresh. The user needs to set up control signals such as RAS, CAS, WE, OE and address signals in order to activate the proper mode. RAS is the signal to strobe the row address data, whereas CAS is the signal to
strobe the column address data. Since DRAM adopts multiplexed addressing, the user places both on the same address pins during different time periods. WE is defined as the write-enable signal; thus, WE is inactive in the read mode and active in the write mode. OE is defined as the output-enable signal. This signal is used for ×4, ×8, and ×16 bit operation chips. Figure 11 shows the timing diagram of the standard read and write operations. Access time is the most important specification item for DRAM. There are four kinds of access time: tRAC, tCAC, tAA, and tOEA; the user has to pay attention to the particular access time values. First, tRAC is defined as the time from RAS low to valid data output. This is the longest of the access times and is usually used to classify the speed version of a part. Second, tCAC is defined as the time from CAS low to valid data output. For data stored at the same row address, DRAM can offer the minimum access time, tCAC. Third, tAA is defined as the time from column address setup to valid data output. The tAA is a useful specification for static column mode, which can produce data upon a column address transition. Fourth, tOEA is defined as the time from OE low to valid data output. Note that tOEA is usually equal to tCAC. The tOEA is a valuable specification for ×4, ×8, and ×16 bit DRAM. The refresh operation is a unique feature of DRAM. This unavoidable operation adds to the complexity of DRAM. The refresh operation is equivalent to row address
Figure 11. Fundamental DRAM operation scheme. (a) Read mode operation. (b) Write mode operation. (Waveforms: RAS, CAS, ADD (row/column), WE, Din, Dout; timing parameters include tRC, tRAS, tRP, tRCD, tASR, tRAH, tASC, tCAS, tCAH, tRCH, tRRH, tAA, tCAC, tRAC, tOFF, tWCS, tWCH, tWP, tDH, and tCP.)

Figure 12. Refresh mode timing diagram. (a) RAS-only refresh mode (row 1, row 2). (b) CBR (CAS before RAS) refresh mode. (Waveforms: RAS, CAS, ADD; timing parameters include tRC, tRAS, tRP, tRPC, tASR, tRAH, tWRP, and tWRH.)
block activation: data are read out onto the bit lines and stored back into the same memory cells. For this operation, the column address block is not activated. If the user supplies a specific row address, the chip can be refreshed in RAS-only refresh mode. If the user sets CAS low before RAS low (CBR), an internal counter provides the specific row address instead of an external row address and starts the refresh operation. Figure 12 shows the refresh operation timing diagram. In addition to the above-mentioned operation modes, there are further modes, such as read-modify-write mode, hidden-refresh mode, and so on. Data books provide detailed information on the various operation modes (15). The fast page mode is useful for achieving a high throughput rate. In fast page mode, the access time is just the CAS access time (tCAC), although random access is limited to the selected row address. This mode is based on the parallel structure of the DRAM memory cell array: the data in every cell connected to the selected word line are amplified and placed on the bit lines. The external
column address data are used to select the data on the bit lines. Theoretically, it is possible to fetch all the data connected to the selected word line successively. This operation does not require repeating the word line boost and sensing operations, so the access time is simply the CAS access time (tCAC), defined as the column address access time. Figure 13 shows the fast page read cycle mode.

Extended Data Out Mode (EDO)

Recent progress has improved the throughput rate considerably with a technique called extended data out (EDO) mode, or hyper page mode. With a simple modification of the data output buffers, it increases the data throughput performance. Figure 14 shows the read cycle timing chart in the EDO mode. The difference between conventional fast page mode and EDO mode is the data output control. In fast page mode, the output data are reset by the rising edge of CAS; in EDO mode, however, the output data are maintained until the next CAS falling edge. This
Figure 13. Fast page read cycle mode. (Waveforms: RAS, CAS, ADD (row, then columns 1 through m), Dout; timing parameters include tRASP, tPC, tCP, tRCD, tASC, tCAH, tOFF, tAA, tCAC, and tRAC.)

Figure 14. Extended data out DRAM read cycle mode. (Waveforms: RAS, CAS, ADD (row, then columns 1 through m), OE, Dout; timing parameters include tRASP, tRP, tRNCD, tRCD, tRAD, tRAH, tASR, tASC, tCAS, tCAH, tHPC, tCSH, tRHCP, tCEA, tAA, tCAC, tHCAC, tCPA, tCOH, and tCLZ.)
unique feature enables one to use DRAM efficiently in terms of data throughput. Figure 15 shows the comparison of conventional fast page mode DRAM with the word interleave technique and EDO mode. In fast page mode, the minimum pulse duration of CAS high cannot be used, because some period is required to reset the data after the CAS rising edge. Even when the word interleave technique is used, fast page mode cannot reach the same performance as the EDO mode. This modification considerably improves the DRAM speed performance in terms of data throughput rate. Since the trend in DRAM control is toward synchronous operation, this synchronization capability is a great advantage of the EDO. In terms of productivity, EDO also has an advantage, because it can be manufactured based on the current fast page mode DRAM with small modifications rather than major changes. In addition, full compatibility with conventional fast page mode
Figure 15. Comparison of EDO and fast page mode by word interleave operation. (EDO (hyper page) mode with CLK at 33 MHz and a 30 ns cycle; conventional fast page mode with word interleave using two DRAMs, a and b, driven by CASa and CASb; waveforms RAS, CAS, data out.)

Figure 16. Comparison of operation methods of SDRAM and EDO mode DRAM. (SDRAM: CLK at 66 MHz, CAS latency = 2, burst length = 4, commands ACT, RED, PRE; data D1–D4 delivered within 105 ns after a 60 ns row access. Conventional EDO mode: RAS/CAS with columns 1–3; fourth data item at 180 ns.)
makes EDO an acceptable product in the commercial market. Therefore, the EDO scheme can be recognized in practice as a suitable bridge between fast page mode and synchronous DRAM.

Synchronous DRAM

Although the EDO mode approach has improved speed performance, it has not reached the level that memory application designers expect. As one of the future memory technologies, synchronous DRAM (SDRAM) has been proposed to improve throughput performance by introducing synchronous operation to DRAM (16). Figure 16 shows a comparison of data throughput performance between SDRAM and EDO DRAM.
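The throughput gap that Fig. 16 illustrates can be sketched numerically. The cycle counts and timings below are illustrative assumptions chosen in the spirit of that figure (a 66 MHz clock, CAS latency of 2, and burst of 4 for the SDRAM; a 60 ns row access for the EDO device); they are not datasheet values:

```python
# Rough burst-of-4 read timing comparison, SDRAM vs. EDO (assumed numbers).
def sdram_burst_ns(clk_mhz=66.0, cas_latency=2, burst=4, ras_to_cas_clks=2):
    """Time from row activation to the last burst data item, in ns."""
    t_clk = 1000.0 / clk_mhz
    n_clks = ras_to_cas_clks + cas_latency + (burst - 1)
    return n_clks * t_clk

def edo_burst_ns(t_rac=60.0, t_page=40.0, burst=4):
    """First access pays the full row access; later ones pay a page cycle."""
    return t_rac + (burst - 1) * t_page

print(f"SDRAM, burst of 4: {sdram_burst_ns():.0f} ns")   # ~106 ns
print(f"EDO,   burst of 4: {edo_burst_ns():.0f} ns")     # 180 ns
```

The outcome (roughly 105 ns versus 180 ns) matches the flavor of the figure: once its pipeline is primed, SDRAM delivers one data item per clock, whereas EDO pays a page cycle for every item.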
In SDRAM operation, a read or write operation is set by a command, which is a combination of the levels of the input pins at the rising edge of the clock. The command is decoded by the command decoder, and operation starts according to the decoded signal. Figure 17 shows how commands such as active, read/write, and precharge are set up by the input pin levels. Owing to the adoption of the command input scheme, the designer only has to pay attention to the setup and hold times of each input signal relative to the clock rising edge. This contributes significantly to high-performance system design. SDRAM requires an initialization step to set several important characteristics of the chip, such as the CAS latency, the length of burst data, and the address generation type. CAS latency is the number of clocks required after receiving the read command to produce the first data. The length of burst data is defined
Figure 17. Command input scheme of SDRAM. (Commands active, read/write (read = H, write = L on WE), and precharge are set by the levels of input pins CLK, CKE, CS, RAS, CAS, WE, and ADD at the rising clock edge.)
Figure 18. Comparison of the pipeline process of SDRAM and the conventional DRAM internal process. (Stages: column decoder, data amplifier, output buffer; SDRAM overlaps addresses Add 1–6 with data Data 1–5, whereas conventional DRAM completes each access (Add 1, Data 1; Add 2, Data 2) serially; access time marked per data item.)

Figure 19. Internal pipeline structure of SDRAM. (Pipeline: address input/address buffer (clock 1); column decoder and column switch into the memory cell array (clock 2); read/write amplifier and output buffer (clock 3); cases shown for CAS latency 3, 2, and 1; inactive period indicated.)

Figure 20. Multibank structure of SDRAM. (Bank select between Bank A, shown in precharge, and Bank B, shown active and producing data.)
as the consecutive number of data words. Address generation can be classified into two types, one being sequential and the other interleave. SDRAM has a great advantage in data throughput performance. Because of the synchronous operation, SDRAM has several crucial advantages that enhance its performance. First, all input signals are latched by the clock, so system designers do not have to worry about complicated timing problems such as signal skew. In addition, pipeline operation, which is defined as the parallel operation of sequentially separated blocks, can be more efficient than asynchronous operation. Figure 18 shows the timing chart of pipeline operation, and Fig. 19 shows the diagram of the internal circuit structure for pipeline operation. According to the control clock, each segment of the column block is operated in parallel. Although the access time for the first data item remains the same, this approach can improve the access time from the second data item to the last one. Furthermore, synchronous operation allows the implementation of a multibank configuration on the chip, instead of using several discrete chips. In fact, the multibank structure can provide a nonprecharged period for the user, because some banks can be accessed while others are in the precharge period. Figure 20 shows the block diagram of a two-bank scheme. If Bank B is active, data are produced only from the Bank B memory cell array. During this period, Bank A is prepared for the next active period through precharging. This hidden multibank technique allows the designer to use memory in a very efficient manner. SDRAM could play a major role in high-performance memory, especially for high-speed applications.

RAMBUS DRAM

Besides the improvement of DRAM chip performance itself, the development of a high-performance interface is an alternative approach for the future. Rambus DRAM (RDRAM) was proposed to provide an optimized interface solution for data transfer between the CPU and memory (17). In order to achieve this goal, RDRAM adopts a new memory architecture. Figure 21 shows the comparison of a conventional memory hierarchy and the Rambus system. The system-on-a-chip approach is the target goal for RDRAM. RDRAM adopts a 9-bit data bus. Since there is no dedicated address bus, when a chip is accessed, a request packet, which includes the command configuring the bus as an address bus, must first be sent to the chip through the control bus. After the request packet come the acknowledge packet and the data packet. Because the initial condition has to be set, the first data access is comparatively slow. However, once a transfer condition is set, data access is achieved at considerably high speed, such as 500 Mbyte/s. For the purpose of stable data input and output, a phase-locked loop (PLL) circuit is located on the chip to synchronize operation between the chip and the external clock. Sense amplifiers in the memory array are used as cache memory to realize fast response. Figure 22 shows the block diagram of the read cycle operation for RDRAM.

Figure 21. Comparison of (a) a conventional memory system and (b) an RDRAM system. (Conventional: MPU with cache memory, memory control to DRAMs, and graphics control to VRAM and RamDAC driving a monitor. RDRAM system: MPU with memory control and graphics control on an ASIC, RDRAMs on a 500 Mbyte/s Rambus channel, and a DAC driving the monitor.)

ACKNOWLEDGMENTS
Yoichi Oshima would like to thank Kunio Matsumoto for his continual encouragement, and appreciates the helpful support of Akira Shoyama, Jun Ihara, and Masaaki Awano, Semiconductor Memory Technical Group, Semiconductor Device Section, Japanese Patent Office. Dr. Bing Sheu and Mr. Steve Jen would like to thank the Integrated Media Systems Center, an NSF Engineering Research Center in the multimedia field.

Figure 22. RDRAM operation scheme. (Labels: clock; data bus (9 bits) carrying the request packet and data out (1 to 256 bytes); control bus (1 bit) carrying the request packet and acknowledge.)

DRY TYPE TRANSFORMER. See TRANSFORMERS, DRY TYPE.
DTL. See DIODE-TRANSISTOR LOGIC.
DTV TRANSMITTERS. See TRANSMITTERS FOR DIGITAL TELEVISION.

BIBLIOGRAPHY
1. T. Masuhara et al., VLSI memories: Present status and future prospect, J. Inst. Electron. Inf. Commun. Eng., E74 (1): 130–141, 1991.
2. H. Komiya, Future technological and economical prospects for VLSI, IEEE Int. Solid-State Circuits Conf., San Francisco, pp. 16–19, 1993.
3. H. Sasaki, Multimedia complex on a chip, IEEE Int. Solid-State Circuits Conf., San Francisco, pp. 16–19, 1996.
4. K. Ishibashi et al., A 6-ns 4-Mb CMOS SRAM with offset-voltage insensitive current sense amplifiers, IEEE J. Solid-State Circuits, 30 (4): 480–485, 1995.
5. K. Kashihara, T. Okudaira, and H. Itoh, A novel metal-ferroelectric insulator-semiconductor (MFS) capacitor using PZT/SrTiO3 layered insulator, IEEE Symp. VLSI Tech. Dig., Kyoto, Japan, pp. 49–50, 1993.
6. T. V. Rajeevakumar and G. B. Bronner, A novel trench capacitor structure for ULSI DRAMs, IEEE Symp. VLSI Tech. Dig., Oiso, Kanagawa, Japan, pp. 7–8, 1991.
7. D. Kenney et al., A buried-plate trench cell for 64-Mb DRAM, IEEE Symp. VLSI Tech. Dig., Oiso, Kanagawa, Japan, pp. 14–15, 1991.
8. T. Kaga et al., Crown-shaped stacked-capacitor cell for 1.5-V operation 64-Mb DRAMs, IEEE Trans. Electron Devices, 38: 255–261, 1991.
9. H. Watanabe et al., A novel stacked capacitor with porous-Si electrodes for high density DRAMs, IEEE Symp. VLSI Tech. Dig., Honolulu, HI, pp. 19–20, 1994.
10. K. W. Kwon et al., Ta2O5/TiO2 composite films for high density DRAM capacitors, IEEE Symp. VLSI Tech. Dig., Kyoto, Japan, pp. 45–46, 1993.
11. J. H. Ahn et al., Microvillus patterning technology for 256 Mb DRAM stack cell, IEEE Symp. VLSI Tech. Dig., Seattle, WA, pp. 12–13, 1992.
12. K. Itoh, K. Sasaki, and Y. Nakagome, Trends in low-power RAM circuit technologies, IEEE Symp. Low Power Electron., San Diego, CA, pp. 84–87, 1994.
13. H. Koinumai and T. Kimura, ×32 bit 16 M DRAM structure, Nikkei Microdevices, June, pp. 84–85, 1995.
14. M. Aoki et al., New DRAM noise generation under half-Vcc precharge and its reduction using a transposed amplifier, IEEE J. Solid-State Circuits, 24 (4): 889–894, 1989.
15. Texas Instruments, MOS Memory Data Book, June 1995.
16. Y. Takai et al., 250 Mbyte/s synchronous DRAM using a 3-stage-pipelined architecture, IEEE J. Solid-State Circuits, 29 (4): 426–431, 1994.
17. N. Kushiyama et al., 500 Mbyte/sec data-rate 512 Kbits ×9 DRAM using a novel I/O interface, IEEE Symp. VLSI Circuits Dig., Seattle, WA, pp. 66–67, 1992.
YOICHI OSHIMA Japanese Patent Office
BING J. SHEU STEVE H. JEN University of Southern California
DRIVES, DC. See MOTOR DRIVES, DC.
Wiley Encyclopedia of Electrical and Electronics Engineering
Integrated Circuit Signal Delay
Standard Article
Ivan S. Kourtev and Eby G. Friedman, University of Rochester, Rochester, NY
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W7606
Article Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Background to Signal Delay; Delay Metrics; Devices and Interconnections; Impact of DSM on Design Methodologies; Conclusions.
INTEGRATED CIRCUIT SIGNAL DELAY
Technologies for designing and building microelectronics-based computational equipment have been steadily advancing ever since the first commercial discrete circuits were introduced in the late 1950s (1) (monolithic integrated circuits (ICs) were introduced in the 1960s). As predicted by Moore's law in the 1960s (2), integrated-circuit density has been doubling approximately every 18 months, and this doubling in size has been accompanied by a similar exponential increase in circuit speed (or, more precisely, clock frequency). These trends of steadily increasing circuit size and clock frequency are illustrated in Figs. 1(a) and 1(b), respectively. As a result of this amazing revolution in semiconductor technology, it is not unusual for modern integrated circuits to contain over 10 million switching elements (i.e., transistors) packed into a chip area as large as 500 mm2 (3–5). This truly exceptional technological capability is due to advances in both design methodologies and physical manufacturing technologies. Research and experience demonstrate that this trend of exponentially increasing integrated-circuit computational power will continue into the foreseeable future.

[Figure 1. Moore's law—exponential increase in circuit integration and clock frequency (2). (a) Evolution of the transistor count per integrated circuit. (b) Evolution of clock frequency.]

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

Integrated-circuit performance is usually characterized (6) by the speed of operation, the available circuit functionality, and the power consumption, and there are multiple factors that directly affect these performance characteristics. While each of these factors is significant, on the technological side increased circuit performance has been largely achieved by the following approaches:

• Reduction in feature size (technology scaling), that is, the capability of manufacturing physically smaller and faster circuit structures
• Increase in chip area, permitting a larger number of circuits and therefore greater on-chip functionality
• Advances in packaging technology, permitting the increasing volume of data traffic between an integrated circuit and its environment as well as the efficient removal of heat created during circuit operation

The most complex integrated circuits are referred to as VLSI circuits, where VLSI stands for very large scale integration. This term describes the complexity of modern integrated circuits consisting of hundreds of thousands to many millions of active transistor elements. Currently, the leading integrated-circuit manufacturers have the technological capability for the mass production of VLSI circuits with feature sizes as small as 0.12 µm (7). These sub-0.5 µm technologies are identified with the term deep submicrometer (DSM) since the minimum feature size is well below the 1 µm mark. As these dramatic advances in fabrication technologies take place, integrated-circuit performance is often limited by effects closely related to the very reasons behind these advances, such as small-geometry interconnect structures.
Circuit performance has become strongly dependent on, and limited by, electrical issues that are particularly significant in DSM integrated circuits. Signal delay and related waveform effects are among those phenomena that have a great impact on high-performance integrated-circuit design methodologies and the resulting system implementation. In the case of fully synchronous VLSI systems, these effects have the potential to create catastrophic failures due to the limited time available for signal propagation among gates.

BACKGROUND TO SIGNAL DELAY

Data processing in the most widely available types of digital integrated circuits [complementary metal-oxide semiconductor (CMOS), bipolar junction transistor, bipolar CMOS (BiCMOS), and GaAs] is based on the transport of electrical energy from one location to another. Typically, the information being processed is encoded as a physical variable that can be easily stored and transmitted to other locations while being functionally manipulated along the way. Such a physical variable—also called a signal—is, for example, the electrical voltage provided by a power supply (with respect to a ground potential) and developed in circuit elements in the presence of electromagnetic fields. The voltage signal or bit of information (in a digital circuit) is temporarily stored in a circuit structure capable of accumulating electric charge. This accumulating or storage property is called capacitance—denoted by the symbol C—and, depending on the materials and the physical conditions, is created by a variety of conductor–insulator–conductor structures commonly found in integrated circuits. Furthermore, modern digital circuits utilize binary logic, in which information is encoded by two values of a signal. These logic values are typically called false and true (or low and high, or logic zero and logic one) and correspond to the minimum and maximum allowable values of the signal voltage for a specific integrated-circuit implementation. Since the voltage V is proportional to the stored electric charge q (q = CV, where C is the storage capacitance), the logic low value corresponds to a fully discharged capacitance (q = CV = 0) while the logic high value corresponds to a capacitance storing the maximum possible charge (fully charged to a voltage V). The largest and most complicated digital integrated circuits today contain many millions of circuit elements, each processing these binary signals (2,6,8,9). Every element has a specific number of input terminals through which it receives data from other elements and a specific number of output terminals through which it makes the results of its processing available to other elements. For the circuit to implement a particular function, the inputs and outputs of each element must be properly interconnected. These connections are accomplished with wires, which are collectively referred to as an interconnect network, while the set of binary-state circuit elements is often simply called the logic gates.
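As a small numerical illustration of the q = CV relation just described, consider a hypothetical storage node (the capacitance and supply values below are illustrative assumptions, not figures from the article):

```python
C_STORE = 10e-15   # assumed storage capacitance: 10 fF (illustrative)
V_DD = 1.2         # assumed supply voltage, i.e., the logic high level (V)

def stored_charge(v, c=C_STORE):
    """q = C * V: charge held on the storage capacitance at voltage v."""
    return c * v

q_low = stored_charge(0.0)    # logic low: fully discharged capacitance, q = 0
q_high = stored_charge(V_DD)  # logic high: fully charged, q = C * Vdd = 12 fC
```

Switching the node between the two logic states thus means moving 12 fC of charge onto or off the capacitance, which is exactly the charging/discharging process whose finite duration gives rise to signal delay.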
During normal circuit operation, the logic gates receive signals at their inputs, process the signals to generate new data, and transmit the resulting data signals to the corresponding logic elements through a network of interconnections. This process involves the transport of a voltage signal from one physical location to another physical location. In each case, this process takes a small yet finite amount of time to complete and is often called the propagation delay of the signal. The rate of data processing in a digital integrated circuit is directly related to two factors: how fast the circuit can switch between the two logic values and how precisely a circuit element can interpret a specific signal value as the intended binary logic state. Switching the state of a circuit between two logic values requires either charging a fully discharged capacitance or discharging a fully charged capacitance, depending upon the type of state transition—low to high or high to low. This charging/discharging process is controlled by the active switching elements in the logic gates and is strongly affected by the physical properties of both the gates and the interconnections. Specifically, the signal waveform shapes change, either enhancing or degrading the signals, affecting both the ability of the logic gates to properly recognize these signals and the time required to do so. The concept of signal propagation delay between two different points A and B of a circuit is illustrated in Fig. 2. The signals at points A and B—denoted sA and sB, respectively—are plotted versus time for two different cases in Figs. 2(a) and 2(b), respectively. Without considering the specific electronic devices and circuits required to create these waveform shapes, it is assumed that signal sA makes a transition from high to low and triggers a computation that causes signal sB to make an opposite transition from low to high.

[Figure 2. Signal propagation delay from point A to point B. (a) Linear ramp input and ramp output. (b) Linear ramp input and exponential output.]

Several important observations can be made from Fig. 2:

• Although sA is the same in each case, sB may have different shapes.
• A temporal (or causal) relationship between sA and sB exists in the sense that sA "causes" sB, with sA preceding the switching event of sB by the amount of time required for the physical switching process to propagate through the circuit structure.
• Regardless of shape, sB has the same logical meaning, namely, that the state of the circuit at point B changes from low to high; this transition from low to high and the reverse state transition (signal sA from high to low) require a positive amount of time to complete.

The temporal relationship between sA and sB as shown in Fig. 2 must be evaluated quantitatively. This information permits the speed of the signals at different points in the same circuit, or in different circuits built in different semiconductor technologies, to be temporally characterized. By quantifying the physical speed of the logical operations, circuit designers are provided with the information necessary to design correctly functioning integrated circuits.
The design of a digital VLSI system may require a great deal of effort to consider a broad range of architectural and logic issues, that is, choosing the appropriate gates and interconnections among these gates to achieve the required circuit function. No design is complete, however, without considering the dynamic (or transient) characteristics of the signal propagation, or, alternatively, the behavior of signals changing over time. Every computation performed by a switching circuit involves multiple signal transitions between logic states and requires a finite amount of time to complete. The voltage at every circuit node must reach a specific value for the computation to be completed. Therefore, state-of-the-art integrated-circuit design is largely centered around the difficult task of predicting and properly interpreting signal waveform shapes at various points in a circuit. In a typical VLSI system, millions of signal transitions occur, such as those shown in Fig. 2, which determine the individual gate delays and the overall speed of the system. Some of these signal transitions can be executed concurrently, while others must be executed in a strict sequential order (8). The sequential occurrence of the latter operations—or signal transition events—must be properly coordinated in time so that logically correct system operation is guaranteed and its results are reliable (in the sense that these results can be repeated). This coordination is known as synchronization and is critical to ensuring that any pair of logical operations in a circuit with a precedence relationship proceed in the proper order. In modern digital integrated circuits, synchronization is achieved at all stages of system design and system operation by a variety of techniques, known as a timing discipline or timing scheme (9–12).
With few exceptions, these circuits are based on a fully synchronous timing scheme, specifically developed to cope with the finite speed at which physical signals propagate through the system. An example of a fully synchronous system is shown in Fig. 3(a). As illustrated in Fig. 3(a), there are three easily recognizable components in this system. The first component—the logic gates, collectively referred to as the combinational logic—provides the range of operations that the system executes. The second component—the clocked storage elements, or simply the registers—comprises the elements that store the results of the logical operations. Together, the combinational logic and registers constitute the computational portion of the synchronous system and are interconnected in a way that implements the required system function. The third component of the synchronous system—known as the clock distribution network—is a highly specialized circuit structure that does not perform a computational process but rather provides an important control capability. The clock generation and distribution network controls the overall synchronization of the circuit by generating a time reference and properly distributing this time reference to every register. The normal operation of a system, such as the example shown in Fig. 3(a), consists of the iterative execution of computations in the combinational logic followed by the storage of the processed results in the registers. The actual process of storing is temporally controlled by the clock signal and occurs once the signal transients in the logic gate outputs are completed and the outputs have settled to a valid state. At the beginning of each computational cycle, the inputs of the system together with the data stored in the registers initiate a new switching process. As time proceeds, the signals propagate through the logic, generating results at the logic output.
By the end of the clock period, these results are stored in the registers and are operated upon during the following clock cycle. Therefore, the operation of a digital system can be thought of as the sequential execution of a large set of simple computations that occur concurrently in the combinational logic portion of the system. The concept of a local data path is a useful abstraction for each of these simple operations and is shown in Fig. 3(b). The magnitude of the delay of the combinational logic is bound by the requirement of storing data in the registers within a clock period. The initial register Ri is the storage element at the beginning of the local data path and provides some or all of the input signals for the combinational logic at the beginning of the computational cycle (defined by the beginning of the clock period). The combinational path ends with the data successfully latching within the final register Rf in which the results are stored at the end of the computational cycle. Each register acts as a source or sink for the data depending on the current phase of the system operation. The behavior of a fully synchronous system is well defined and controllable as long as the time window provided by the clock period is sufficiently long to allow every signal in the circuit to propagate through the required logic gates and interconnect wires and successfully latch within the final register. In designing the system and choosing the proper clock
period, however, two contradictory requirements must be satisfied. First, the smaller the clock period, the more computational cycles can be performed by the circuit in a given amount of time. At the same time, the time window defined by the clock period must be sufficiently long so that the slowest signals reach the destination registers before the current clock cycle is concluded and the following clock cycle is initiated. This way of organizing computation has certain clear advantages that have made a fully synchronous timing scheme the primary choice for digital VLSI systems: • It is easy to understand and its properties and variations are well understood. • It eliminates the nondeterministic behavior of the propagation delay in the combinational logic (due to environmental and process fluctuations and the unknown input signal pattern) so that the system as a whole has a completely deterministic behavior corresponding to the implemented algorithm. • The circuit design does not need to be concerned with glitches in the combinational logic outputs so the only relevant dynamic characteristic of the logic is the propagation delay. • The state of the system is completely defined within the storage elements—this fact greatly simplifies certain aspects of the design, debug, and test phases in developing a large system.
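The two contradictory requirements on the clock period described above reduce to a simple rule: the period can be no shorter than the largest register-to-register delay. A minimal sketch follows (the path delays are made-up illustrative numbers; a real static timing analysis would also account for register setup and hold times and for clock skew):

```python
# Propagation delays (in ns) of the register-to-register paths of a
# hypothetical circuit; the values are illustrative only.
path_delays_ns = [1.2, 3.7, 0.9, 2.4, 5.1, 1.8]

# The critical (worst-case) path bounds the clock period from below.
t_clk_min_ns = max(path_delays_ns)           # 5.1 ns
f_clk_max_hz = 1.0 / (t_clk_min_ns * 1e-9)   # maximum clock frequency

# Faster paths simply finish early; their slack is idle time each cycle.
slack_ns = [t_clk_min_ns - d for d in path_delays_ns]
```

This mirrors the critical-path behavior of fully synchronous systems: a single slow path (here the 5.1 ns path) dictates the speed of the entire system, even though most of the paths have substantial slack.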
[Figure 3. A synchronous system. (a) Finite-state machine model of a synchronous system: input data and output data, combinational logic, clocked storage (registers), and a clock distribution network providing synchronization. (b) A local data path: an initial register Ri and a final register Rf, both driven by the clock, with combinational logic between them; signal activity begins at the beginning of the clock period and completes by the end of the clock period.]

However, the synchronous paradigm also has certain limitations that make the design of synchronous VLSI systems increasingly challenging:
• This synchronous approach has a serious drawback in that it requires the overall circuit to operate as slowly as the slowest register-to-register path. Thus, the global speed of a fully synchronous system depends upon those paths in the combinational logic with the largest delays—these paths are also known as the worst-case or critical paths. In a typical VLSI system, the propagation delays in the combinational paths are distributed unevenly, so there may be many paths with delays much smaller than the clock period. Although these paths could take advantage of a shorter clock period—a higher clock frequency—it is the paths with the largest delays that bound the clock period, thereby imposing a limit on the overall system speed. This imbalance in propagation delays is sometimes so dramatic that the system speed is dictated by only a handful of very slow paths.
• The clock signal has to be distributed to tens of thousands of storage registers scattered throughout the system. Therefore, a significant portion of the system area and dissipated power is devoted to the clock distribution network—a circuit structure that does not perform any computational function.
• The reliable operation of the system depends upon the assumptions concerning the values of the propagation delays, which, if not satisfied, can lead to catastrophic timing violations and render the system unusable.

DELAY METRICS

The delay for a signal to propagate from one point within a circuit to another point is caused by both the active electronic devices (transistors) in the logic elements and the various passive interconnect structures connecting the logic gates. While the physical principles behind the operation of transistors and interconnect are well understood at the current–voltage level, it is often computationally difficult to apply this detailed information to the densely packed, multimillion-transistor DSM integrated circuits of today. A general form of a circuit with N input (x1, . . ., xN) and M output (y1, . . ., yM) terminals is shown in Fig. 4(a). The box labeled ''Circuit'' may represent a simple wire, a transistor, a logic gate consisting of several transistors, or an arbitrarily complex combination of these elements. If the box shown in Fig. 4(a) corresponds to the portion of the logic circuit schematically outlined in Fig. 4(b), a logically possible signal activity at the circuit points X, Y, and Z is shown in Fig. 5(a). The dynamic characteristics of the signal transitions as well as their relationships in time are described and formalized in Definitions 1 to 3.

[Figure 4. A simple electronic circuit. (a) Abstract representation of a circuit with inputs x1, . . ., xN and outputs y1, . . ., yM. (b) Logic schematic of the circuit in panel (a), with internal nodes X, Y, and Z.]

[Figure 5. Switching characteristics of the circuit shown in Fig. 4. (a) Signal waveforms sX, sZ, and sY for the circuit shown in Fig. 4(b), illustrating tPDXY = tPDXZ + tPDZY. (b) Signal waveforms for the inverter in the circuit shown in Fig. 4(b), illustrating tPHLXZ < 0 and tPLHXZ > 0.]

Definition 1. If X and Y are two points in a circuit and sX and sY are the signals at X and Y, respectively, the signal propagation delay tPDXY from X to Y is defined as the time interval from the 50% point of the signal transition of sX to the 50% point of the signal transition of sY. (Although the delay can be defined from any point X to any other point Y, X and Y typically correspond to an input and an output of a logic gate, respectively. In such a case, the signal delay from X to Y is the propagation delay of the gate.)

This formal definition of the propagation delay is related to the concept that, ideally, the switching point of a logic gate is at the 50% level of the output waveform. Thus, 50% of the maximum output signal level is assumed to be the boundary point where the state of the gate switches from one binary logic state to the other. Practically, a more physically correct definition of propagation delay is the time from the switching point of the driving circuit to the switching point of the driven circuit. Currently, however, this switching-point-based reference for signal delay is not widely used in practical computer-aided design applications because of the computational complexity of the algorithms and the increased amount of data required to estimate the delay of a path. Therefore, choosing the switching point at 50% has become generally accepted practice for referencing the propagation delay of a switching element. Also, note that the propagation delay tPD as defined in Definition 1 is mathematically additive, thereby permitting the delay between any two points X and Y to be determined by summing the delays through the consecutive structures between X and Y. From Figs. 4(b) and 5(a), for example, tPDXY = tPDXZ + tPDZY. However, this additivity property must be applied with caution since neither of the switching points of consecutively connected gates may occur at the 50% level. In addition, passive interconnect structures along signal paths do not exhibit switching properties, although physical signals propagate through these structures with finite speed. Therefore, if the properties of a signal propagating through a series connection of logic gates and interconnections are under investigation, an analysis of the entire signal path composed of gates and wires—rather than adding 50%-to-50% delays—is necessary to avoid accumulating error.

In high-performance CMOS VLSI circuits, logic gates often switch before the input signal completes its transition. (Also, a gate may have asymmetric signal paths, whereby the gate switches faster in one direction than in the other.) This difference in switching speed may be sufficiently large that an output signal of a gate reaches its 50% point before the input signal reaches its 50% point. If this is the case, tPD as defined by Definition 1 may have a negative value. Consider, for example, the inverter connected between nodes X (inverter input) and Z (inverter output) in Fig. 4(b). The specific input and output waveforms for this inverter are shown in detail in Fig. 5(b). When the input signal sX makes a transition from high to low, the output signal sZ makes a transition from low to high (and vice versa). In this specific example, the low-to-high transition of the signal sZ crosses the 50% signal level after the high-to-low transition of the signal sX. Therefore, the signal delay tPLH (the signal name index is omitted for clarity) is positive, as shown by the direction of the arrow in Fig. 5(b), coinciding with the positive direction of the x axis. However, when the input signal sX makes a low-to-high transition, the output signal sZ makes a faster high-to-low transition and crosses the 50% signal level before the input signal sX crosses the 50% signal level. The signal delay tPHL in this case is negative, as shown by the direction of the arrow in Fig. 5(b), coinciding with the negative direction of the x axis. As illustrated in Fig. 5(b), the asymmetry of the switching characteristics of a logic gate requires the ability to discriminate between the values of the propagation delay in the two different switching situations (a low-to-high or a high-to-low transition).
One single value of the propagation delay tPD, as defined in Definition 1, does not provide sufficient information about this possible asymmetry in the switching characteristics of a logic gate. Therefore, the concept of delay is extended further to include this missing information. Specifically, the direction of the output waveform (since the output of a gate is typically the evaluation node) is included in the definition of delay, thereby permitting the evaluation of the gate switching speed to account for the effects of the output signal transition:

Definition 2. The signal propagation delays tPLHXY and tPHLXY denote the signal delay from input X to output Y (as defined in Definition 1), where the output signal (at point Y) transitions from low to high and from high to low, respectively (the low-to-high and high-to-low transitions).

It is important to consider both tPLH and tPHL during circuit analysis and design. However, if only a single value of tPD is specified, tPD usually refers to the arithmetic average, (tPLH + tPHL)/2. Furthermore, Definition 2 specifies the time between switching events but does not convey any information about the transition time of the events themselves. This transition time is finite and is characterized by the two parameters described in the following definition:

Definition 3. For a signal making a transition between two different logic states, the transition time is defined as the time interval between the 10% point and the 90% point of the signal. For a low-to-high transition, the rise transition time is tr = t|90% − t|10%. For a high-to-low transition, the fall transition time is tf = t|10% − t|90%.

The parameters defined in Definition 3 are illustrated in Fig. 2, where the fall time tfA and the rise time trB for the signals sA and sB, respectively, are indicated. As tr and tf are related to the slope of the signal transitions, the transition times also affect the values of tPLH and tPHL, respectively. In Fig. 5(a), for example, note that if the signal sY had been slower—with a larger fall time tfY—sY would have crossed the 50% level at a later time, effectively increasing the propagation delay tPLHXY. However, as illustrated in Fig. 2, it is possible for the 50%-to-50% delay to remain nearly the same even though the signal slope changes significantly [compare the rise time trB in Figs. 2(a) and 2(b)].
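Definitions 1 to 3 translate directly into measurements on sampled waveforms. The sketch below (the helper names are mine, not from the article) finds threshold crossings by linear interpolation and applies the 10%/50%/90% conventions; note that a delay computed this way can legitimately come out negative, as discussed above for asymmetric gates:

```python
def crossing_time(t, v, level):
    """First time at which the sampled waveform v(t) crosses `level`,
    located by linear interpolation between adjacent samples."""
    for i in range(len(v) - 1):
        lo, hi = v[i], v[i + 1]
        if (lo - level) * (hi - level) <= 0 and lo != hi:
            frac = (level - lo) / (hi - lo)
            return t[i] + frac * (t[i + 1] - t[i])
    raise ValueError("waveform never crosses the given level")

def rise_time(t, v, vdd):
    """Definition 3: tr = t|90% - t|10% for a low-to-high transition."""
    return crossing_time(t, v, 0.9 * vdd) - crossing_time(t, v, 0.1 * vdd)

def prop_delay(t, v_in, v_out, vdd):
    """Definition 1: 50%-to-50% propagation delay (may be negative)."""
    return crossing_time(t, v_out, 0.5 * vdd) - crossing_time(t, v_in, 0.5 * vdd)

# Demo: a fast ramp input driving a slower ramp output (arbitrary time units).
clip = lambda x: min(max(x, 0.0), 1.0)
t = [i * 0.01 for i in range(1001)]
v_in = [clip(x - 1.0) for x in t]           # rises over t = 1..2, 50% at t = 1.5
v_out = [clip((x - 2.0) / 4.0) for x in t]  # rises over t = 2..6, 50% at t = 4.0

tr_out = rise_time(t, v_out, vdd=1.0)       # 5.6 - 2.4 = 3.2
tpd = prop_delay(t, v_in, v_out, vdd=1.0)   # 4.0 - 1.5 = 2.5
```

Because each value is a 50%-to-50% interval, delays measured this way add along a cascade (tPDXY = tPDXZ + tPDZY), subject to the caveats about switching points and passive interconnect noted above.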
DEVICES AND INTERCONNECTIONS

The technology of choice for most modern high-performance digital integrated circuits is based on the metal-oxide-semiconductor field-effect transistor (MOSFET) structure. The primary reasons for the wide application of MOSFETs are, among other things, high packing density and, in its complementary form, low power dissipation. In this section, the properties of both active devices and interconnections are discussed from the perspective of circuit performance. An n-channel enhancement-mode MOSFET (NMOS) is depicted in Fig. 6(a). Note that in most digital applications the substrate is connected to the source, i.e., Vs = Vb and Vsb = 0. Therefore, the four-terminal transistor depicted in Fig. 6(a) can be considered a three-terminal device with the voltages Vs, Vg, and Vd controlling the operation of the transistor. Assuming no substrate current, Idd = Iss; both currents are usually referred to simply as Ids. In the following discussion, the additional indices n and p are used to indicate which type of transistor is being considered, n-channel or p-channel, respectively. To first order, the drain current Idsn through the transistor is modeled by the classical Shichman–Hodges equations (13):
Idsn =
  βn [(Vgsn − Vtn) Vdsn − (1/2) Vdsn²],  Vgsn ≥ Vtn and Vgdn ≥ Vtn   (linear mode)
  (1/2) βn (Vgsn − Vtn)²,                Vgsn ≥ Vtn and Vgdn ≤ Vtn   (saturation mode)
  0,                                     Vgsn ≤ Vtn                  (cutoff mode)    (1)
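Equation (1) transcribes directly into code. The sketch below is a minimal evaluation of the model; the parameter values used in the checks are arbitrary illustrative numbers, not taken from the article:

```python
def ids_n(vgs, vds, beta_n, vt_n):
    """Shichman-Hodges drain current of an n-channel MOSFET, Eq. (1).

    Mode selection follows the conditions of Eq. (1), using Vgd = Vgs - Vds.
    """
    vgd = vgs - vds
    if vgs <= vt_n:                                   # cutoff mode
        return 0.0
    if vgd >= vt_n:                                   # linear mode
        return beta_n * ((vgs - vt_n) * vds - 0.5 * vds ** 2)
    return 0.5 * beta_n * (vgs - vt_n) ** 2           # saturation mode

# Illustrative spot checks with beta_n = 1e-4 A/V^2 and Vtn = 0.7 V:
BETA_N, VT_N = 1e-4, 0.7
i_cut = ids_n(0.5, 1.0, BETA_N, VT_N)   # cutoff: 0
i_sat = ids_n(2.0, 5.0, BETA_N, VT_N)   # saturation: 0.5 * 1e-4 * 1.3**2
i_lin = ids_n(2.0, 0.1, BETA_N, VT_N)   # linear: 1e-4 * (1.3*0.1 - 0.005)
```

Note that at the linear/saturation boundary (Vgdn = Vtn, i.e., Vdsn = Vgsn − Vtn) the two branches of Eq. (1) give the same current, so the model is continuous there.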
The derivation of the PMOS I–V equations is straightforward, accounting for the changes in voltage polarities and current directions. In Eq. (1), the parameter βn is a device parameter commonly called the gain factor or the current gain of the transistor; the dimension of βn is [A/V²]. The value of the current gain βn is

βn = Kn (Wn / Ln)   (2)
where Kn is the process transconductance parameter and Wn and Ln are the width and length of the transistor channel, respectively. The process transconductance Kn is found as the
384
INTEGRATED CIRCUIT SIGNAL DELAY
the timing relationships connecting the transistor terminal voltages as these voltages are the signal representations of the data being processed. By performing a dynamic analysis, the signal delay from an input waveform to its corresponding output waveform can be evaluated with a certain level of accuracy. Complementary MOS logic or CMOS logic is the most popular circuit style for most modern high-performance digital integrated circuits. An analytical analysis of a simple CMOS logic gate is presented next for one of the simplest CMOS gates—the CMOS inverter shown in Fig. 6(b). Performing such a simple analysis illustrates the process for estimating circuit performance as well as provides insight into what factors and how these factors may affect the timing characteristics of a logic gate.
Vd –
+
Vgd
Idd
Drain Vg
+
Gate
+
Base (substrate)
Vb
Vds
Source
Iss Vgs – –
Analytical Delay Analysis Consider the CMOS inverter circuit consisting of the PMOS device Q1 and NMOS device Q2 shown in Fig. 6(b). For this analysis, assume that the capacitive load of the inverter— consisting of any device capacitances, interconnect capacitances, and the load capacitance of the following stage—can be lumped into a single capacitor CL. The output voltage Vo ⫽ VCL is the voltage across the capacitive load and the terminal voltages of the transistors are shown in Table 1. The regions of operation for the devices Q1 and Q2 are illustrated in Fig. 7 depending upon the values of Vi and Vo. Referring to Fig. 7 may be helpful in understanding the switching process in a CMOS inverter. Determining the values of the fall time tf and the propagation delay tPHL is described in the following section. Similarly, closed-form expressions for the rise time tr and the propagation delay tPLH are derived later.
(b) Figure 6. (a) An NMOS transistor and (b) the basic CMOS inverter gate.
product

Kn = µnCox = µn(εox/tox)   (3)
where µn is the carrier mobility and Cox is the gate capacitance per unit area (εox is the relative dielectric constant of the gate oxide material, 3.9 for SiO2, and tox is the gate oxide thickness). By substituting the index p for the index n in Eqs. (1) to (3), analogous expressions for βp and Kp of a p-channel enhancement mode MOSFET transistor can be developed (2,6,9,14). Also note that the threshold voltage Vtn of an n-channel transistor is positive (Vtn > 0), while the threshold voltage Vtp of a p-channel transistor is negative (Vtp < 0). Equation (1) and its counterpart for a p-channel MOS device are fundamental to both static and dynamic circuit analysis. Static, or dc, analysis refers to circuit bias conditions in which the voltages Vg, Vd, and Vs remain constant. Dynamic analysis is attractive from the signal delay perspective since it deals with voltage and current waveforms that change over time. An important goal of dynamic analysis is to determine
The Value of tf and tPHL. The transition process used to derive tf and tPHL is illustrated in Fig. 8(a). Assume that the input signal Vi has been held at logic low (Vi = 0) for a sufficiently long time such that the capacitor CL is fully charged to the value of Vdd. The operating point of the inverter is point A on Fig. 7. At time t0 = 0 the input signal abruptly switches to a logic high. The capacitor CL cannot discharge instantaneously, thereby forcing the operating point of the circuit to point B, (Vi, Vo) = (Vdd, Vdd). At B, the device Q1 is cut off while Q2 is conducting, thereby permitting CL to begin discharging through Q2. As this discharge process develops, the operating point moves down the line BD approaching point D, where CL is fully discharged, that is, Vo(D) = 0. Observe that during the interval 0 ≤ t < t2 the operating point is between B and C and the device Q2 operates in saturation. At time t2, the capacitor is discharged to Vdd − Vtn and Q2 begins to operate in the linear region. For t ≥ t2, the device Q2 is in

Table 1. Terminal Voltages for the p-Channel and n-Channel Transistors in a CMOS Inverter Circuit

       p-Channel               n-Channel
Vg     Vgp = Vin               Vgn = Vin
Vs     Vsp = VDD               Vsn = 0
Vd     Vdp = Vout              Vdn = Vout
Vgs    Vgsp = Vin − VDD        Vgsn = Vin
Vgd    Vgdp = Vin − Vout       Vgdn = Vin − Vout
Vds    Vdsp = Vout − VDD       Vdsn = Vout
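The Table 1 relations, combined with the standard square-law region conditions, determine the operating modes plotted in Fig. 7. A small sketch (the supply and threshold values below are assumed for illustration):

```python
# Sketch: classifying the operating mode of Q2 (NMOS) and Q1 (PMOS) from the
# terminal voltages of Table 1 and the standard square-law region conditions.
# Vdd, Vtn, and Vtp below are assumed example values.

def nmos_region(Vin, Vout, Vdd, Vtn):
    Vgs, Vds = Vin, Vout                 # Table 1: Vgsn = Vin, Vdsn = Vout
    if Vgs <= Vtn:
        return "cutoff"
    return "saturation" if Vds >= Vgs - Vtn else "linear"

def pmos_region(Vin, Vout, Vdd, Vtp):    # Vtp < 0
    Vgs, Vds = Vin - Vdd, Vout - Vdd     # Table 1: Vgsp, Vdsp
    if Vgs >= Vtp:
        return "cutoff"
    return "saturation" if Vds <= Vgs - Vtp else "linear"

Vdd, Vtn, Vtp = 3.3, 0.6, -0.6
# Point B in Fig. 7: the input has just switched high, the output is still Vdd.
print("Q2:", nmos_region(Vdd, Vdd, Vdd, Vtn))   # saturation
print("Q1:", pmos_region(Vdd, Vdd, Vdd, Vtp))   # cutoff
```

Once the output discharges below Vdd − Vtn, the same test reports Q2 in the linear region, matching the transition at point C described in the text.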
Figure 7. Modes of operation for the devices in the CMOS inverter. (a) Operating modes depending on the input voltage, Vi, and the output voltage, Vo. (b) Operating point trajectory for different input waveforms (only rising input shown).
the linear region. If (as is typical) 0.1Vdd < Vtn < 0.5Vdd, then t1 < t2 < t3 as shown in Fig. 8(a). Therefore, the fall time is tf = t4 − t1 and the propagation delay is tPHL = t3 − 0 = t3. To determine the values of tf and tPHL, the output waveform Vo(t) must be evaluated for each of the intervals [t0, t2) and [t2, ∞). For t0 ≤ t < t2, the current discharging the capacitor Idsn, shown in Fig. 6(b), is
Idsn = (1/2)βn(Vdd − Vtn)² = −CL dVo/dt   (4)

Substituting

η = Vtn/Vdd   and   γn = βnVdd(1 − η)/CL   (5)
Figure 8. Switching waveforms for a step input at the CMOS inverter in Fig. 6(b). (a) High-tolow output transition. (b) Low-to-high output transition.
and solving Eq. (4) for Vo with the initial condition Vo(0) = Vdd yields

Vo(t) = Vdd − (βn/2CL)(Vdd − Vtn)² t = Vdd[1 − (γn/2)(1 − η)t]   for t0 ≤ t < t2   (6)

From Eq. (6) it can be further shown that

Vo(t2) = Vdd − Vtn   for   t2 = 2CLVtn/[βn(Vdd − Vtn)²] = 2η/[γn(1 − η)]   (7)

The interval t ≥ t2 is considered next. The device Q2 is in the linear mode and Idsn is given by

Idsn = βn[(Vdd − Vtn)Vo − (1/2)Vo²] = −CL dVo/dt   (8)

A closed-form expression for the output voltage Vo(t) for time t ≥ t2 is obtained by solving Eq. (8) (a Bernoulli equation) with the initial condition Vo(t2) = Vdd − Vtn:

Vo(t) = 2Vdd(1 − η)/[1 + exp(γn(t − t2))]   for t ≥ t2   (9)

The values of t1 from Eq. (6) and t3 and t4 from Eq. (9) are

t1 = (1/γn)[0.2/(1 − η)],   t3 = (1/γn)[2η/(1 − η) + ln(3 − 4η)],   t4 = (1/γn)[2η/(1 − η) + ln(19 − 20η)]   (10)

The fall time tf is

tf = t4 − t1 = [CL/(βnVdd(1 − η))][2(η − 0.1)/(1 − η) + ln(19 − 20η)]   (11)

and the propagation delay tPHL is

tPHL = t3 − 0 = t3 = [CL/(βnVdd(1 − η))][2η/(1 − η) + ln(3 − 4η)]   (12)

The Value of tr and tPLH. The rise time tr and the propagation delay tPLH are determined from the switching process illustrated in Fig. 8(b) (similarly to tf and tPHL earlier). Assume that the input signal Vi has been held at logic high (Vi = Vdd) for a sufficiently long time such that the capacitor CL is fully discharged to Vo = 0. The operating point of the inverter is point D shown in Fig. 7. At time t0 = 0, the input signal abruptly switches to a logic low. Since the voltage on CL cannot change instantaneously, the operating point is forced to point E. At E, the device Q2 is cut off while Q1 is conducting, thereby permitting CL to begin charging through Q1. As this charging process develops, the operating point moves up the line EA toward point A, at which CL is fully charged, that is, Vo(A) = Vdd. Note that during the interval 0 ≤ t < t2, the operating point is between E and F and the device Q1 operates in the saturation region. At time t2, the capacitor is charged to −Vtp (recall that Vtp < 0) and Q1 starts operating in the linear region. For t ≥ t2, the device Q1 is in the linear region.

If (as is typical) 0.1Vdd < |Vtp| < 0.5Vdd, then t1 < t2 < t3 as shown in Fig. 8(b). Therefore, the rise time is tr = t4 − t1 and the propagation delay is tPLH = t3 − 0 = t3. To determine the values of tr and tPLH, the output waveform Vo(t) must be evaluated for each of the intervals [t0, t2) and [t2, ∞). An analysis similar to that described earlier can be performed to derive expressions for t1, t3, and t4 in Fig. 8(b). Substituting

π = −Vtp/Vdd   and   γp = βpVdd(1 − π)/CL   (13)

t1, t3, and t4 are

t1 = (1/γp)[0.2/(1 − π)],   t3 = (1/γp)[2π/(1 − π) + ln(3 − 4π)],   t4 = (1/γp)[2π/(1 − π) + ln(19 − 20π)]   (14)

Therefore, the value of the rise time tr is

tr = t4 − t1 = [CL/(βpVdd(1 − π))][2(π − 0.1)/(1 − π) + ln(19 − 20π)]   (15)

and the value of the propagation delay tPLH is

tPLH = t3 − 0 = t3 = [CL/(βpVdd(1 − π))][2π/(1 − π) + ln(3 − 4π)]   (16)

Several observations can be made by analyzing the expressions derived earlier for tr, tf, tPHL, and tPLH. First, the factors that affect the inverter delays are analyzed. Following this analysis, the related waveform effects are considered and short-channel effects of submicrometer devices are then described.

Controlling the Delay
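As a quick numeric illustration, Eqs. (11) and (12) can be evaluated directly; the device values below (βn, CL, Vdd, Vtn) are assumed for illustration only, and the rise-side expressions, Eqs. (15) and (16), follow by replacing η with π and βn with βp.

```python
import math

# Sketch: evaluating the fall-side step-input delay expressions, Eqs. (11)
# and (12), for assumed, illustrative device values.

def fall_delays(beta_n, C_L, Vdd, Vtn):
    eta = Vtn / Vdd                              # Eq. (5)
    k = C_L / (beta_n * Vdd * (1 - eta))         # = 1/gamma_n
    tf = k * (2 * (eta - 0.1) / (1 - eta) + math.log(19 - 20 * eta))   # Eq. (11)
    tphl = k * (2 * eta / (1 - eta) + math.log(3 - 4 * eta))           # Eq. (12)
    return tf, tphl

beta_n, C_L, Vdd, Vtn = 3.5e-3, 100e-15, 3.3, 0.6   # assumed values
tf, tphl = fall_delays(beta_n, C_L, Vdd, Vtn)
print(f"tf = {tf * 1e12:.1f} ps, tPHL = {tphl * 1e12:.1f} ps")

# Both delays scale with CL/beta: doubling CL doubles them.
tf_2, tphl_2 = fall_delays(beta_n, 2 * C_L, Vdd, Vtn)
```

The CL/β proportionality visible here is the basis of the sizing arguments that follow.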
Note that in Eqs. (11) and (15), the fall and rise times, respectively, are the product of a term of the form CL/β and another process-dependent term (a function solely of Vdd and Vt). These relationships imply that for a given manufacturing process, improvements in individual gate delays are possible by reducing the load impedance CL or by increasing the current gain of the transistors. Reducing the load impedance is possible by controlling physical aspects of the design (the specific gate layout). Alternatively, increasing β of the devices (recall that β ∝ W/L) is typically accomplished by controlling the value of W, a process known as transistor or gate sizing. (Typically, the device channel length is chosen to be the minimum permitted by the technology and therefore cannot be decreased to further increase β.) Transistor sizing, however, has limits: area requirements may limit the maximum channel width W, and increasing W will also increase the input load capacitance of the gates.

Waveform Effects

The ideal step input waveform used to derive the delay expressions presented earlier is a physical abstraction. Such an ideal waveform does not exist naturally, although it is used
to simplify the analysis presented before. Note that despite ideally fast input waveforms, the output signal of a CMOS logic gate has a finite slope, thereby contributing to a certain gate delay. In a practical VLSI integrated circuit, both the input and output signals have a nonzero rise and fall time due to the impedances along any signal path. Fast input waveforms can be effectively considered as step inputs, and the delay expressions derived in Eqs. (11) and (15) model the delays for such cases with reasonable accuracy. Slow input waveforms, however, contribute significantly to the delay of the charge–discharge path in a gate output (6,9,14,15), making the aforementioned delay expressions inaccurate. Furthermore, it is considerably more difficult to derive closed-form delay expressions for nonstep input waveforms. Consider, for example, the derivation of the fall time of the inverter shown in Fig. 6(b) assuming a nonideal input, such as the linear ramp signal sA in Fig. 2(a). Referring to Fig. 7(b), the trajectory of the operating point relating Vi and Vo for a nonideal (nonstep) input is as shown in the upper diagram. This trajectory is a curve passing through regions I, II, III, and IV (through regions I, II, III, IV, and V for slower input signals), and down the line C′ → C → D rather than the two straight-line segments A → B and B → C → D (as shown in the lower diagram). Therefore, calculating an exact expression for tf in this case would require separately evaluating the delay for all five portions of the output Vo, one for each region. Analysis of the CMOS inverter shown in Fig. 6(b) with other than an ideal step input, as well as the respective delay expressions, can be found in Ref. 15. Consider, for example, the linear ramp input described by
Vi(t) = 0              for t < 0
      = (t/tri)Vdd     for 0 ≤ t < tri     (17)
      = Vdd            for t ≥ tri
where tri is the rise time of the input voltage signal Vi(t). In the case depicted in the upper diagram shown in Fig. 7(b), the total propagation delay tPHLramp at the 50% level (15) is given by

tPHLramp = (1/6)(1 + 2η)tri + tPHLstep   (18)
where tPHLstep is the propagation delay time for a step input given by Eq. (12). Note that the ramp input described by Eq. (17) is also an idealization intended to simplify analysis. In a practical integrated circuit, the input waveform to the inverter is not a linear ramp, but rather the output waveform of another gate within the circuit. For such an input, also known as a characteristic input, it is practical to regard the propagation delay through the inverter gate shown in Fig. 6(b) as a function of the CL/β ratio of the preceding gate or, equivalently, as a function of the step response delay of the preceding stage (15). This kind of direct analytical solution, obtained by breaking the output waveform into regions depending upon the trajectory of the operating point, becomes even more complicated for a gate with more than one input arriving at an arbitrary time and with arbitrary waveforms. Because of the growing complexity of such an analytical solution, it is imperative that alternative
methods for the delay calculation be developed and used in practice. Nonideal input waveforms also have implications for the power dissipation of individual logic gates and therefore of the entire circuit. Observe that in regions II, IV, and VI, shown in Fig. 7, both devices conduct simultaneously, creating a temporary direct path for current from Vdd to ground. The short-circuit current in this direct current path is only slightly related to the output voltage of the gate and adds to the total power dissipation. This added component is known as short-circuit power. The short-circuit power can be a substantial fraction of the total power dissipation of a circuit and can become an obstacle to meeting a specific design goal. Faster waveforms throughout the circuit generally mean less time spent switching within regions II, IV, and VI and therefore decreased short-circuit current and short-circuit power.

Short-Channel Effects

The active device model used in the analyses described earlier, Eq. (1), is accurate for long-channel devices. As technology is scaled down into the submicrometer range, a variety of physical phenomena develop that require improved device models in order to preserve accuracy. In this section, certain key effects, known as short-channel effects, are described as they relate to the discussion of propagation delay.

Channel-Length Modulation. A MOSFET device modeled by Eq. (1) has an infinite output resistance in saturation and acts as a voltage-controlled current source. Recall the linear portion of the falling or rising output waveforms from the analysis presented earlier. The device acts as a current source because of the complete independence in saturation of the drain current Idsn from the voltage Vdsn assumed in Eq. (1). This independence, however, is an idealization that does not take into account the effect of the voltage Vdsn on the shape of the channel.
In practice, as Vdsn is increased beyond the value required for saturation (such that Vgdn < Vtn, or Vgdp > Vtp for a PMOS device), the channel pinch-off point moves towards the source. Therefore, the effective channel length is reduced, an effect known as channel-length modulation. To account for channel-length modulation analytically, the expression for the current in saturation in Eq. (1) is modified as follows:

Idsn = (1/2)βn(Vgsn − Vtn)²(1 + λnVdsn)   (19)
The additional factor (1 + λnVdsn) in Eq. (19) is the cause of a finite device output resistance ∂Vdsn/∂Idsn = 2/[λnβn(Vgsn − Vtn)²] in saturation. The output waveform is degraded due to the degradation of the transfer characteristic of the inverter.

Velocity Saturation. In a long-channel transistor, the drift velocity of the carriers in the channel is proportional to both the carrier mobility and the lateral electric field in the channel (parallel to the source–drain path). In short-channel devices, however, the velocity of the carriers eventually saturates for some value of the voltage Vds within the operating range of the circuit. This velocity saturation phenomenon is due to the fact that the power supply voltage is not scaled
down as quickly as the device dimensions due to system constraints. The saturation in carrier velocity for high electric field strengths, caused by the high voltage Vds applied over a short channel, causes a reduction in both the process transconductance [see Eq. (3)] and the current gain of a saturated device. This reduction in the current gain β has a direct effect on the ability of the devices to drive a specific load, resulting in increased delay times. Recall that the propagation delays described in Eqs. (11), (12), (15), and (16) are inversely proportional to β. A more realistic device model for DSM devices, known as the α-power model, has been proposed in Ref. 16 to include the carrier velocity saturation effect in submicrometer devices (short-channel devices in general):
Idsn = ID0n                for Vgsn ≥ Vtn, Vdsn ≥ VD0n   (pentode or saturation region)
     = (ID0n/VD0n)Vdsn     for Vgsn ≥ Vtn, Vdsn < VD0n   (triode or linear region)     (20)
     = 0                   for Vgsn ≤ Vtn                (cutoff mode)

where

ID0n = ID0[(Vgsn − Vtn)/(Vdd − Vtn)]^α,   VD0n = VD0[(Vgsn − Vtn)/(Vdd − Vtn)]^(α/2)   (21)
In Eqs. (20) and (21), α is the velocity saturation index, VD0 is the drain saturation voltage for Vgsn = Vdd, and ID0 is the drain saturation current for Vgsn = Vdsn = Vdd. A typical value for the velocity saturation index of short-channel devices is 1 ≤ α ≤ 2, where Eq. (20) becomes Eq. (1) for α = 2. Analytical solutions for the output voltage of a CMOS inverter with a purely capacitive load CL for step, linear ramp, and exponential input waveforms can be found in Ref. 17. Closed-form expressions for the delay of the CMOS inverter shown in Fig. 6(b) under the α-power model are given in Ref. 16 and are repeated here:

tPHL = tPLH = [1/2 − (1 − η)/(1 + α)]tT + CLVdd/(2ID0),   η = Vtn/Vdd   (22)
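As a sketch, Eq. (22) can be evaluated numerically; the values of ID0, α, the input transition time tT, and the load below are assumed for illustration, since in practice ID0 and α are fitted to measured device characteristics (16).

```python
# Sketch: evaluating the alpha-power delay expression, Eq. (22), for
# assumed device and waveform values (not measured data).

def tpd_alpha_power(tT, C_L, Vdd, Vtn, ID0, alpha):
    eta = Vtn / Vdd
    return (0.5 - (1 - eta) / (1 + alpha)) * tT + C_L * Vdd / (2 * ID0)  # Eq. (22)

C_L, Vdd, Vtn, ID0, tT = 100e-15, 3.3, 0.6, 2e-3, 50e-12
t_long = tpd_alpha_power(tT, C_L, Vdd, Vtn, ID0, 2.0)    # long-channel limit
t_short = tpd_alpha_power(tT, C_L, Vdd, Vtn, ID0, 1.3)   # velocity-saturated
print(f"alpha = 2.0: {t_long * 1e12:.1f} ps; alpha = 1.3: {t_short * 1e12:.1f} ps")
```

The first term grows with the input transition time tT; the second is the familiar load-over-drive term.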
The propagation delay described by Eq. (22) can be applied to nonideal input waveforms and consists of two terms. The first term reflects the effect on the gate delay of the input waveform shape and is proportional to the input waveform transition time tT. The second term reflects the dependency of the delay on the gate load, similarly to the CL /웁 term in Eqs. (12) and (16). The Importance of Interconnections The analysis of the CMOS gate delay as described earlier is based on the assumption that the load of the inverter shown in Fig. 6(b) is a purely capacitive load (C). This assumption is generally true for logic gates placed close to each other in the physical layout of an integrated circuit. In a multimillion transistor VLSI circuit, however, certain connected logic gates may be relatively distant from each other. In this situation, the impedance of the interconnect wires cannot be considered
as being purely capacitive, but rather as being resistive–capacitive (RC). An important type of global circuit structure where the gates can be very far apart is the clock distribution network (18). The interconnect has become a major concern due to the high resistance that can limit overall circuit performance. These interconnect impedances have become significant as the minimum line dimensions have been scaled down into the deep-submicrometer region while the overall chip dimensions have increased. Perhaps the most important consequence of these trends of scaling transistor and interconnect dimensions and increasing chip sizes is that the primary source of signal propagation delay has shifted from the active transistors to the passive interconnect lines. Therefore, the nature of the load impedance has shifted from a lumped capacitance to a distributed resistance–capacitance, thereby requiring new qualitative and quantitative interpretations of the signal switching processes. To illustrate the effects of scaling, consider ideal scaling (6) where devices are scaled down by a factor of S (S > 1) and chip sizes are scaled up by a factor of Sc (Sc > 1). The delay of the logic gates decreases by 1/S while the delay due to the interconnect increases by S²Sc² (6,19). Therefore, the ratio of interconnect delay to gate delay after ideal scaling increases by a factor of S³Sc². For example, if S = 4 (corresponding to scaling down from a 2 µm CMOS technology to a 0.5 µm CMOS technology) and Sc = 1.225 (corresponding to the chip area increasing by 50%), the ratio of interconnect delay to gate delay will increase by a factor of 4³ × 1.225² ≈ 96 times.

Delay Estimation in RC Interconnect. Interconnect delay can be analyzed by considering the CMOS inverter shown in Fig. 6(b) with the capacitive load CL representing the accumulated capacitance of the fanout of the inverter.
The interconnect connecting the drains of the devices Q1 and Q2 to the upper terminal of the load is replaced by a distributed RC line with a resistance and capacitance Rint and Cint, respectively (19). Closed-form expressions for the signal delay with an RC load are described by Wilnai (20). The delay values for both a distributed and lumped RC load are summarized in Table 2. These delay values are obtained assuming a step input signal. The results listed in Table 2 are illustrated graphically in Fig. 9 (20). Two waveforms of the signal output response making a low-to-high transition are shown in Fig. 9. These two waveforms are based on the RC load being distributed and lumped, respectively. Assuming an on-resistance Rtr of the driving transistor (19), the interconnect delay Tint can be characterized by the following expression:
Table 2. Closed-Form Expressions for the Signal Delay Response Driving a Distributed and Lumped RC Load (an ideal step input is assumed)

Output Voltage Range          Distributed RC    Lumped RC
0% to 90%                     1.0RC             2.3RC
10% to 90% (rise time tr)     0.9RC             2.2RC
0% to 63%                     0.5RC             1.0RC
0% to 50% (delay tPLH)        0.4RC             0.7RC
0% to 10%                     0.1RC             0.1RC
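The Table 2 multipliers can be applied directly; a sketch with an assumed interconnect of R = 100 Ω and C = 1 pF (RC = 100 ps), values chosen purely for illustration:

```python
# Sketch: applying the Table 2 (distributed, lumped) multipliers of RC to an
# assumed 100-ohm, 1-pF line.

TABLE_2 = {                     # (distributed, lumped) multipliers of RC
    "0% to 90%":  (1.0, 2.3),
    "10% to 90%": (0.9, 2.2),   # rise time tr
    "0% to 63%":  (0.5, 1.0),
    "0% to 50%":  (0.4, 0.7),   # delay tPLH
    "0% to 10%":  (0.1, 0.1),
}

R, C = 100.0, 1e-12
tplh_dist = TABLE_2["0% to 50%"][0] * R * C
tplh_lump = TABLE_2["0% to 50%"][1] * R * C
print(f"tPLH: distributed {tplh_dist * 1e12:.0f} ps, lumped {tplh_lump * 1e12:.0f} ps")
```

Note that a lumped model of the same R and C overestimates the 50% delay, which is why the distributed entries matter.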
Figure 9. Illustration of the RC signal delay expressions in Table 2 (20). Waveforms are shown for both a distributed and lumped RC load.

Table 3. Circuit Network to Model Distributed RC Line with a Maximum Error of 3%

RT \ CT    0    0.01   0.1   0.2   0.5   1     2     5     10    20    50    100
0          Π3   Π3     Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.01       Π3   Π3     Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.1        T2   T2     Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.2        T2   T2     Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.5        T1   T1     T1    T1    Π1    Π1    Π1    Π1    Π1    C     C     C
1          T1   T1     T1    T1    Π1    Π1    Π1    Π1    Π1    C     C     C
2          T1   T1     T1    T1    Π1    Π1    Π1    Π1    L1    L1    C     C
5          Π1   Π1     Π1    Π1    Π1    Π1    Π1    L1    L1    L1    C     C
10         Π1   Π1     Π1    Π1    Π1    Π1    L1    L1    L1    L1    C     C
20         R    R      R     R     R     R     R     R     R     R     C     C
50         R    R      R     R     R     R     R     R     R     R     C     N
100        R    R      R     R     R     R     R     R     R     R     N     N

From Ref. 21. The notations Π, T, and L correspond to a Π, T, and L model, respectively. The notations R and C correspond to a single lumped resistance and capacitance, respectively. The notation N means that the interconnect impedance can be ignored. The number after certain models (e.g., Π3) corresponds to a multiple model structure.
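The ladder models named in Table 3 approximate a distributed line with lumped sections. A quick sketch (R and C are assumed values) shows why this works: the Elmore delay of an n-section RC ladder converges to the 0.5RC distributed value listed in Table 2, while n = 1 recovers the 1.0RC lumped value.

```python
# Sketch: Elmore delay of an n-section RC ladder approximating a distributed
# line. Each section is a series resistor R/n followed by a shunt capacitor
# C/n. The assumed R = 100 ohms and C = 1 pF give RC = 100 ps.

def elmore_ladder(R, C, n):
    """Elmore delay of n equal RC sections; closed form is RC*(n+1)/(2n)."""
    r, c = R / n, C / n
    return sum((i * r) * c for i in range(1, n + 1))

R, C = 100.0, 1e-12
for n in (1, 2, 10, 100):
    print(f"n = {n:3d}: {elmore_ladder(R, C, n) * 1e12:.2f} ps")
```

A handful of sections already lands within a few percent of the distributed limit, which is why the small Π and T networks of Table 3 suffice in circuit simulation.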
Tint = RintCint + 2.3(RtrCint + RtrCL + RintCL)   (23)
     ≈ (2.3Rtr + Rint)Cint   (24)
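A numeric sketch of Eqs. (23) and (24), with the driver on-resistance taken from the 1/(βVdd) approximation of Eq. (25); the driver and wire values (β, Rint, Cint, CL) are assumed for illustration only:

```python
# Sketch: evaluating the interconnect delay of Eq. (23) and its Eq. (24)
# approximation for assumed driver and wire values.

beta, Vdd = 3.5e-3, 3.3
R_tr = 1.0 / (beta * Vdd)                       # Eq. (25): driver on-resistance
R_int, C_int, C_L = 200.0, 0.5e-12, 50e-15      # assumed wire and fanout load

T_int = R_int * C_int + 2.3 * (R_tr * C_int + R_tr * C_L + R_int * C_L)  # Eq. (23)
T_approx = (2.3 * R_tr + R_int) * C_int         # Eq. (24): CL terms neglected
print(f"Rtr = {R_tr:.1f} ohms, Tint = {T_int * 1e12:.1f} ps "
      f"(approx. {T_approx * 1e12:.1f} ps)")
```

The gap between the two results is exactly the neglected CL terms, so Eq. (24) is adequate only when the fanout load is small relative to the wire capacitance.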
The on-resistance of the driving transistor Rtr in Eqs. (23) and (24) can be approximated (19) by

Rtr ≈ 1/(βVdd)   (25)

where the term β in Eq. (25) is the current gain of the driving transistor [see Eq. (2)]. Approximating a distributed RC line by a combination of lumped resistances (R) and capacitances (C) is commonly used in circuit simulation programs. Three typical ladder circuits are illustrated in Fig. 10. The names of the ladder circuits shown in Fig. 10 are derived based on the similarities between the shape of the circuit and a known structure such as a letter. The RC interconnect is replaced in circuit simulation programs with circuit ladder structures such as those shown in Fig. 10. To increase the accuracy of simulation, more detailed Πn and Tn ladder models can be used (21). A lumped Π or T ladder circuit model better approximates a distributed RC model than a lumped L ladder circuit (21) by up to 30%. As described in Ref. 21, the strategy to model a distributed RC line depends upon two circuit parameters:

1. The ratio CT = CL/C of the load capacitance CL of the fanout to the capacitance C of the interconnect line
2. The ratio RT = Rtr/R of the output resistance of the driving MOSFET device Rtr to the resistance R of the interconnect line

The appropriate ladder circuit (from Ref. 21) to model a distributed RC interconnect line properly within 3% error as a function of RT and CT is shown in Table 3. By using the appropriate RC model (21), the computational time of the simulation can be reduced more efficiently while preserving the accuracy of the circuit simulation (22).

Delay Mitigation

A variety of different techniques have been developed to improve the signal delay characteristics depending upon the type of load and other circuit parameters. Among the most important techniques are as follows:

• Gate sizing to increase the output current drive capability of the transistors along the logic chain (23–25). Gate sizing must be applied with caution, however, because of the resulting increase in area and power dissipation.
• Tapered buffer circuit structures are often used to drive large capacitive loads (such as at the output pad of a chip) (8,26–31). A series of CMOS inverters such as the circuit shown in Fig. 6(b) are cascaded, where the output drive of each buffer is increased by a constant tapering factor.
• The use of repeater circuit structures to drive resistive–capacitive (RC) loads. Unlike tapered buffers, repeaters are typically CMOS inverters of uniform size (drive capability) that are inserted at uniform intervals along an interconnect line (6,32–37).
• A different timing discipline such as asynchronous timing (2,8,38). Unlike fully synchronous circuits, the order of execution of logic operations in an asynchronous circuit
Figure 10. L, ⌸, and T ladder circuits to approximate an RC interconnect impedance.
is not controlled by a global clock signal. Therefore, asynchronous circuits are essentially independent of the signal delays. The logical order of operations in an asynchronous circuit is enforced by requiring the generation of special handshaking signals that communicate the status of the computation. Among other useful techniques to improve the signal delay characteristics are the use of dynamic CMOS logic circuits including Domino logic (9) and differential circuit logic styles, such as cascade voltage switch logic or CVSL (9).

IMPACT OF DSM ON DESIGN METHODOLOGIES

The capability of applying and analyzing timing relationships and delays in deep-submicrometer integrated circuits requires a great amount of knowledge describing the physical phenomena of these circuits. As described earlier, the development of purely analytic equations is practically impossible to carry out even for very simple circuits. Furthermore, from a design perspective, it is important to be able to apply intuitive knowledge that incorporates circuit physics and operation when both analyzing existing circuits and synthesizing, or designing, new circuits based on their topological, functional, and timing characteristics. The use of powerful computers coupled with efficient algorithms is absolutely fundamental to the successful analysis and synthesis of multimillion-transistor integrated circuits. In fact, the majority of these algorithms are specifically developed with circuit complexity in mind and the related issues of accuracy, run time, and memory requirements. Therefore, CAD software tools play a vital role in the circuit design and manufacturing process. As noted earlier, however, improvements in technology and the demand for greater functionality and performance are changing the physical models of devices and interconnects in DSM circuits.
A serious consequence of these changes is that the traditional design flow (the sequence of steps involved in the design and analysis of circuits) is no longer able to handle the required circuit complexity in an efficient manner. In the traditional design flow, a great amount of effort and time is devoted to the architectural and logical aspects of the circuit. In this front-end portion of the design process, the circuit is partitioned into smaller subsystems and the individual logical networks. At the front end, the emphasis is on the behavioral, register transfer level (RTL), and logic levels of abstraction, concentrating on satisfying the functional design goals. Approximate timing information is used at the front end to estimate the delay of the logic gates and to determine the correct architectural (rather than physical) placement of the registers within a circuit. Actual circuit and physical design are at the back end of the design process and consist of determining the circuit description of the specific physical transistor and interconnect patterns corresponding to the previously developed networks of logic gates. During this phase, the locations of the logic gates on the chip area are determined and wires are routed among the terminals of these gates as required by the logic network specifications. Besides being a time-consuming process targeted to satisfy these many geometrical and connectivity constraints, the physical design process must also preserve the dynamic specifications of the circuit assumed during
the front-end design process. Alternatively, the gates must be placed and the wires routed among them to guarantee that the circuit will function correctly given the system input signals. The primary difficulty with this approach is that the frontend methodologies and CAD tools largely ignore the details and problems of the physical domain. Such an approach cannot be tolerated in the design of DSM circuits for multiple reasons, among which the following are related to the signal delay and waveform shapes in an important way: 1. With advances in technology, transistor devices and gates become smaller and faster, while the size of the integrated circuit increases. These trends lead to the appearance of many global interconnect wires the length of which increases proportionally with increasing die size. Not only are the devices smaller but these transistors often have to drive relatively larger loads due to the long global interconnections. 2. As the average length of a wire increases, the electrical model of an interconnect wire changes from a purely capacitive (C) model to a resistive–capacitive (RC) model and finally to an inductive (RLC) model. The wire geometry also changes in order to satisfy performance, density, and yield objectives. Therefore, fringing capacitances between lines and cross-wire signal coupling begin to play an increasingly important role in signal integrity and circuit speed. 3. Multiple wire planes are often used, thereby increasing the complexity of the routing tools and making it significantly more difficult to account for any coupling and noise effects during analysis and synthesis portions of the design process. 4. Fast turnaround times and increased market pressure often require the reuse of large circuit subsystems (known as ASIC cores or megacells) surrounded by customized glue logic. 
(Application-specific integrated circuits—or ASIC—are specialized circuits developed to satisfy a specific manufacturer’s need rather than be distributed as off-the-shelf parts.) The reusable portions and the glue logic are naturally separated from each other on the surface of the integrated circuit, requiring multiple long interconnect wires. CAD software tool developers and circuit designers have become increasingly concerned with new approaches to the integrated-circuit design flow in order to cope with these aforementioned effects. A paradigm shift towards merging the capabilities of front-end and back-end tools is currently emerging as an alternative to the traditional methodology of separating these design efforts. Thus, in order to relieve constraints on the back-end tools and increase the likelihood of a successful design, the front-end tools must account for the lower-level DSM-related physical effects at a much earlier stage of the system design process. An important approach in circuit extraction and simulation is applying advanced mathematical methods to extract parasitic wire impedances and to reduce the complexity of the extracted data. This reduction is needed so that the analysis and simulation of the critical wires in a circuit can be performed in a reasonable amount of time while not sacrificing precision.
INTEGRATED CIRCUIT SIGNAL DELAY
Wire impedance and signal coupling effects are quite important in DSM circuits and must not be overlooked during the design process. These signal-integrity-related effects are also extremely difficult to deal with during the post-layout verification phase. Circuit design methodologies are emerging that are targeted at identifying possible wire delay bottlenecks and dealing with these effects before the actual physical layout is completed. Among the most promising techniques are automatic repeater insertion to reduce the RC delays of long wires and circuit and architectural techniques that effectively "balance" the wire delay distribution to ease the physical design process. Clock skew scheduling, retiming, wave pipelining, and a combination of these methods are potentially feasible and useful techniques for balancing the distribution of the delays within a circuit.
CONCLUSIONS

With the incessant advances of integrated-circuit design and manufacturing technologies, the performance of CMOS integrated circuits has become strongly dependent on low-level physical effects. Lower supply voltages and effects such as velocity saturation and channel-length modulation are contributing to the degradation of signal waveforms as signals propagate through the logic gates. Furthermore, the changing nature of the interconnect, from a lumped capacitance to a distributed resistance-capacitance structure, has increased the signal propagation delay through passive interconnect structures. Therefore, it is becoming increasingly difficult to design correctly functioning circuits while satisfying performance criteria such as higher clock frequencies (shorter clock periods) and low power.

The smaller device sizes and larger integrated-circuit chip dimensions create conditions for the existence of long interconnect structures. These structures account for a growing portion of the combined logic and interconnect delay and shift the primary cause of the delay from the logic elements to the interconnect. In addition, the variations of the values of the physical parameters introduced during circuit manufacturing can substantially change the overall behavior and timing characteristics of the physical structures. It is therefore imperative that the signal properties be well understood and properly applied in the design of high-performance VLSI integrated circuits.
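The shift from lumped to distributed interconnect noted above can be quantified with the Elmore delay of a uniform RC ladder: as the number of segments grows, the delay at the far end approaches RC/2 rather than the lumped value RC. This is a minimal sketch with illustrative element values.

```python
def elmore_delay(n_seg, r_total, c_total):
    """Elmore delay at the far end of a uniform RC ladder with n_seg segments."""
    r = r_total / n_seg
    c = c_total / n_seg
    # Node i's capacitance is charged through the upstream resistance i*r.
    return sum((i * r) * c for i in range(1, n_seg + 1))

# Illustrative line: 1 kohm total resistance, 2 pF total capacitance.
lumped = elmore_delay(1, 1000.0, 2e-12)        # single RC section: delay = R*C
distributed = elmore_delay(200, 1000.0, 2e-12) # fine ladder: delay -> R*C/2
```

The closed form for N equal segments is RC(N+1)/(2N), which converges to RC/2; the distributed model therefore predicts roughly half the lumped delay for the same totals.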
IVAN S. KOURTEV EBY G. FRIEDMAN University of Rochester
INTEGRATED CIRCUITS, MICROWAVE. See MICROWAVE INTEGRATED CIRCUITS.
INTEGRATED CIRCUITS, NEURAL. See NEURAL CHIPS.
INTEGRATED CIRCUITS, OPTOELECTRONIC. See OPTOELECTRONICS IN VLSI TECHNOLOGY.
INTEGRATED CIRCUITS, POWER. See POWER INTEGRATED CIRCUITS.
INTEGRATED CIRCUIT TESTING. See AUTOMATIC TESTING.
Wiley Encyclopedia of Electrical and Electronics Engineering
Microprocessors, Standard Article
Robert P. Colwell, Intel Corporation, Hillsboro, OR
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W7602
Article Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Microprocessors and Computers; Moore's Law; Microprocessor Architectures; The Evolution of ISAs; Coprocessors and Multiple Processors; High-End Microprocessor Systems; Future Prospects for Microprocessors; Acknowledgments.
MICROPROCESSORS

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

In 1971, Intel Corporation sparked an industrial revolution with the world's first microprocessor, the 4004. The 4004 replaced the logic of a numeric calculator with a general-purpose computer, implemented in a single silicon chip. The 4004 is shown in Fig. 1. The 4004 integrated 2300 transistors and ran at a clock rate of 108 kHz (108,000 clock cycles per second). In 1997, the 4004's most recent successor was the Pentium II processor, running at 300 MHz (300 million clock cycles per second) and incorporating nearly 8 million transistors. The Pentium II processor is shown in Fig. 2.

Figure 1. The world's first microprocessor, the Intel 4004, ca. 1971. Originally designed to be a less expensive way to implement the digital logic of a calculator, the chip instead spawned a computing revolution that still shows no signs of abating.

From the 4004's humble beginning, the microprocessor has assumed an importance in the world's economy similar to that of the electric motor or the internal combustion engine. Microprocessors now supply more than 90% of the world's computing needs, from small portable and personal desktop computers to large-scale supercomputers such as Intel's Teraflop machine, which contains over 9000 microprocessors. A variant of the microprocessor, the microcontroller, has become the universal controller in machines from automobile engines to audio systems to wristwatches.

MICROPROCESSORS AND COMPUTERS

Microprocessors are the processing units, or the "brains," of the computer system. Every action that a microprocessor performs is specified by a computer program that has been encoded into "object code" by a software program known as a compiler. Directed by another software program known as the operating system (e.g., Microsoft's Windows 95), the microprocessor locates the desired application code on the hard drive or compact disk and orders the drive to begin transferring the program to the memory subsystem so that the program can be run. Digital electronic computers have at least three major subsystems:

• A memory to hold the programs and data structures
• An input/output (I/O) subsystem
• A central processor (CPU)

A microprocessor is the central processor subsystem, implemented on a single chip of silicon. In microprocessor-based computer systems, the I/O subsystem moves information into and out of the computer system. I/O subsystems usually include some form of nonvolatile storage, which is a means of remembering data and programs even when electrical power is not present. Disk drives, floppy drives, and certain types of memory chips fulfill this requirement in microprocessor-based systems. Keyboards, trackballs, and mice are common input devices. Networks, modems, and compact discs are also examples of I/O devices. The memory subsystem, a place to keep and quickly access programs or data, is usually built from random-access memory (RAM) chips.

Microprocessors and microcontrollers are closely related devices. The differences lie in how they are used. Essentially, microcontrollers are microprocessors for embedded control applications. They run programs that are permanently encoded into read-only memories and are optimized for low cost so that they can be used in inexpensive appliances (printers, televisions, power tools, and so on). The versatility of a microcontroller is responsible for user-programmable VCRs and microwave ovens, the fuel savings of an efficiently managed automobile engine, and the convenience of sequenced traffic lights on a highway and of automated bank teller machines.

Microprocessor software is typically created by humans who write their code in a high-level language such as C or Fortran. A compiler converts that source code into a machine language that is unique to each particular family of microprocessors. For instance, if the program needs to write a character to the screen, it will include an instruction to the microprocessor that specifies the character, when to write it, and where to put it. Exactly how these instructions are encoded into the 1s and 0s (bits) that a computer system can use determines which computers will be able to run the program successfully. In effect, there is a contract between the design of a microprocessor and the compiler that is generating object code for it. The compiler and microprocessor must agree on what every computer instruction does, under all circumstances of execution, if a program is to perform its intended function. This contract is known as the computer's instruction set.
The instruction set plus some additional details of implementation, such as the number of registers (fast temporary storage), are known as the computer's instruction set architecture (ISA). Programs written or compiled to one ISA will not run on a different ISA. During the 1960s and 1970s, IBM's System/360 and System/370 were the most important ISAs.
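The "contract" nature of an instruction set can be made concrete with a toy encoder/decoder: as long as both sides agree on the bit layout, object code produced by one can be interpreted by the other. The 16-bit format and opcode values below are invented purely for illustration and do not correspond to any real ISA.

```python
# Toy 16-bit ISA: 4-bit opcode, two 4-bit register fields, 4-bit immediate.
# Hypothetical encoding for illustration only.
OPCODES = {"ADD": 0x1, "SUB": 0x2, "LD": 0x3, "ST": 0x4}
MNEMONICS = {v: k for k, v in OPCODES.items()}

def encode(op, rd, rs, imm):
    """The 'compiler' side of the contract: mnemonic fields -> 16-bit word."""
    return (OPCODES[op] << 12) | (rd << 8) | (rs << 4) | imm

def decode(word):
    """The 'microprocessor' side: 16-bit word -> the same fields."""
    return (MNEMONICS[(word >> 12) & 0xF],
            (word >> 8) & 0xF,
            (word >> 4) & 0xF,
            word & 0xF)

word = encode("ADD", 2, 3, 5)       # packs the fields into one 16-bit word
assert decode(word) == ("ADD", 2, 3, 5)
```

A program encoded this way would be meaningless to a decoder that assigned the bit fields differently, which is exactly why programs compiled to one ISA will not run on another.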
With the ascendancy of the microprocessor, Intel's x86 ISA vied with Motorola's MC68000 for control of the personal computer market. By 1997, the Intel architecture was found in approximately 85% of all computer systems sold. Early microprocessor instruction set architectures were designed to operate on small data values (operands): the 4004 handled 4-bit data, and successors such as the 8080 handled 8-bit data. Later microprocessors migrated to 16-bit operands, including the microprocessor in the original IBM PC (the Intel 8088). Microprocessors settled on 32-bit operands in the 1980s, with the Motorola 68000 family and Intel's 80386. In the late 1980s, the microprocessors being used in the fastest servers and high-end workstations began to run into the intrinsic addressability limit of 4 GB (four gigabytes, or four billion bytes, which is 2 raised to the power 32). These microprocessors introduced 64-bit addressing and data widths. It is likely that 64-bit computing will eventually supplant 32-bit microprocessors. It also seems likely that this will be the last increase in addressability the computing industry will ever need, because 2 raised to the power 64 is an enormous number of addresses.

Prior to the availability of microprocessors, computer systems were implemented in discrete logic, which required the assembly of large numbers of fairly simple digital electronic integrated circuits to realize the basic functions of the I/O, memory, and central processor subsystems. Because many (typically thousands) of such circuits were needed, the resulting systems were large, power-hungry, and costly. Manufacturing such systems was also expensive, requiring unique tooling, hand assembly, and a large amount of human debug effort to repair the inevitable flaws that accumulate during the construction of such complex machinery. In contrast, the fabrication process that underlies the microprocessor is much more economical.
As with any silicon integrated circuit, microprocessor fabrication is mainly a series of chemical processes performed by robots, so the risk of introducing human errors that would later require human debugging is largely eliminated. The overall process can also produce many more microprocessors than discrete methods could.
Figure 2. The 1998 successor to the line of microprocessors started by the 4004: Intel's Pentium II processor, mounted within its Single-Edge Cartridge Connector (SECC). This picture shows the cartridge with its black case removed. On the substrate within the cartridge, the large octagonal package in the center is the Pentium II CPU itself. The rectangular packages to the right and left of the CPU are the cache chips. The small components mounted on the substrate are resistors and capacitors needed for power filtering and bus termination.
MOORE'S LAW

In 1965, Gordon Moore made an important observation regarding the rate of improvement of the silicon integrated circuit industry. He noted that the chip fabrication process permitted the number of transistors on a chip to double roughly every 18 months. This resulted from the constantly improving silicon process that determines the sizes of the transistors and wiring on the integrated circuits. Although he made the initial observation on the basis of experience with memory chips, it has turned out to be remarkably accurate for microprocessors as well. Moore's Law has held for well over 30 years. Figure 3 plots the number of transistors on each Intel microprocessor since the 4004.

These improvements of the underlying process technology have fueled the personal computer industry in many different ways. Each new process generation makes the transistors smaller. Smaller transistors are electrically much faster, allowing higher clock rates. Smaller wires represent less electrical capacitance, which also increases overall clock rates and reduces power dissipation. The combination of the two permits far more active circuitry to be included in new designs. Constant learning in the silicon fabrication plants has also helped drive the production efficiency, or yield, of each new process higher than its predecessor's, which in turn helps support larger die sizes per silicon chip.

The impact of this progression has been profound for the entire industry. The primary benefit of a new microprocessor is its additional speed over its predecessors, at ever better price points. The effect of Moore's Law has been for each new microprocessor to become obsolete within only a few years after its introduction. The software industry that supplies the applications to run on these new microprocessors expects this performance improvement.
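A quick way to see both the force of Moore's observation and its sensitivity to the assumed doubling period is to extrapolate from the 4004's 2300 transistors. The function and doubling periods below are assumptions for comparison; the projections are deliberately rough.

```python
def projected_transistors(year, base_year=1971, base_count=2300,
                          months_per_doubling=18):
    """Extrapolate a transistor count under an assumed doubling period."""
    doublings = (year - base_year) * 12 / months_per_doubling
    return base_count * 2 ** doublings

# Doubling every 18 months overshoots the Pentium II's roughly 8 million
# transistors by 1997; assuming 24 months instead comes much closer.
at_18 = projected_transistors(1997)
at_24 = projected_transistors(1997, months_per_doubling=24)
```

The exponential form means a small change in the assumed doubling period compounds into an enormous difference over a 26-year span, which is why quoted versions of the "law" vary between 18 and 24 months.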
The industry tries to design so that its new products will run acceptably on the bulk of the installed base but can also take advantage of the new performance for the initially small number of platforms that have the new processor. The new processor's advantages in price/performance will cause it to begin to supplant the previous generation's volume champion. The fabrication experience gained on the new product allows its price to be driven ever downward until the new design completely takes over. Then an even more advanced processor on an even better process technology is released, and the hardware/software spiral continues.

[Figure 3 is a semilog plot of transistors per die (1 to 100,000,000) versus year (1965 to 1995), showing memory chips from 1K through 64M and microprocessors from the 4004 through the 8080, 8086, 80286, 386, 486, Pentium, and Pentium Pro.]

Figure 3. Moore's Law has accurately predicted the number of transistors that can be incorporated in microprocessors for over 25 years. Since this transistor count strongly influences system performance, this remarkable "law" has become one of the central tenets in the field of computers and integrated electronics. It guides the design of software, hardware, manufacturing production capacity, communications, and corporate planning in nearly every major area.
MICROPROCESSOR ARCHITECTURES

Another factor in the performance improvement of microprocessors is the microarchitecture. Microarchitecture refers to how a microprocessor's internal systems are organized; it is not to be confused with the instruction set architecture. The ISA determines what kind of software a given chip can execute. The earliest microprocessors (e.g., Intel 4004, 4040, 8008, 8080, 8086) were simple, direct implementations of the desired ISA. But as the process improvements implied by Moore's Law unfolded, microprocessor designers were able to borrow many microarchitectural techniques from the mainframes that preceded them, such as caching (Intel's 486, MC68010), pipelining (i486 and all subsequent chips), parallel superscalar execution (Pentium processor), superpipelining (Pentium Pro processor), and out-of-order and speculative execution (Pentium Pro processor, MIPS R10000, DEC Alpha 21264).

Microprocessor designers choose their basic microarchitectures very carefully because a chip's microarchitecture has a profound effect on virtually every other aspect of the design. If a microarchitecture is too complicated to fit a certain process technology (e.g., it requires many more transistors than the process can economically provide), then the chip designers may encounter irreconcilable problems during the chip's development; the chip development may need to wait for the next process technology to become available. Conversely, if a microarchitecture is not aggressive enough, then it could be very difficult for the final design to have high enough performance to be competitive. Microarchitectures are chosen and developed to balance efficiency and clock rate. All popular microprocessors use a synchronous design style in which the microarchitecture's functions are subdivided in a manner similar to the way a factory production line is subdivided into discrete tasks.
And like the production line, the functions comprising a microprocessor’s microarchitecture are pipelined, such that one function’s output becomes the input to the next. The rate at which the functions comprising this pipeline can complete their work is known as the pipeline’s clock rate. If the functions do not all take the same amount of time to execute, then the overall clock rate is determined by the slowest function in the pipeline. One measure of efficiency for a microarchitecture is the average number of clock cycles required per instruction executed (CPI). For a given clock rate, fewer clocks per instruction implies a faster computer. The more efficient a microarchitecture is, the fewer the number of clock cycles it will need to execute the average instruction. Therefore, it will need fewer clock cycles to run an entire program. However, the desire for high microarchitectural efficiency is often in direct conflict with designing for highest clock rate. Generally, the clock rate is determined by the time it takes a signal to traverse the slowest path in the chip, and adding transistors to a microarchitecture to boost its efficiency usually makes those paths slower.
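The efficiency-versus-clock-rate tension described above is captured by the standard execution-time relation: CPU time = instruction count × CPI ÷ clock rate. The instruction count, CPI figures, and clock rates in this sketch are hypothetical, chosen only to show that a lower clock rate can still win.

```python
def exec_time_seconds(instructions, cpi, clock_hz):
    """CPU time = instruction count * cycles-per-instruction / clock rate."""
    return instructions * cpi / clock_hz

# Two hypothetical designs running the same billion-instruction program:
time_efficient = exec_time_seconds(1e9, cpi=1.2, clock_hz=200e6)   # low CPI, slower clock
time_fast_clock = exec_time_seconds(1e9, cpi=2.0, clock_hz=300e6)  # high CPI, faster clock
# The efficient design finishes first despite its 50% lower clock rate.
```

This is why designers cannot chase clock rate alone: adding hardware that lowers CPI may lengthen the critical path, and the product of the two terms is what the user experiences.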
[Figure 4 shows the i486 block diagram: a bus interface unit on the processor bus connects over 32-bit address and data paths to the cache unit and the instruction prefetch unit; the code stream feeds the instruction decode unit and control unit, which in turn drive the segmentation unit, paging unit, integer (datapath) unit, and floating-point unit.]
Figure 4 illustrates the functional block diagram of the Intel 486, a very popular microprocessor of the early 1990s. (The microarchitectures of the microprocessors that followed the 486, such as the Pentium processor or the Pentium Pro processor, are too complex to be described here.) The prefetch unit of the 486 fetches the next instruction from the instruction cache at a location that is either the next instruction after the last instruction executed or some new fetch address that was calculated by a previous branch instruction. If the requested instruction is not present in the cache, then the bus interface unit generates an access to main memory across the processor bus, and the memory sends the missing instruction back to the cache. The requested instruction is sent to the instruction decode unit, which extracts the various fields of the instruction, such as the opcode (the operation to be performed), the register or registers to be used in the instruction, and any memory addresses needed by the operation. The control unit forwards the various pieces of the instruction to the places in the microarchitecture that need them (register designators to the register file, memory addresses to the memory interface unit, opcode to the appropriate execution unit). Certain very complex instructions are implemented in an on-chip read-only memory called the microcode. When the instruction decoder encounters one of these, it signals a microcode entry point for the microcode unit to use in supplying the sequence of machine operations that correspond to that complex macroinstruction.

Figure 4. Block diagram of the most popular microprocessor of the early 1990s, the Intel i486. The various blocks shown work together to execute the Intel Architecture instruction set with approximately 1.1M transistors. Newer designs, such as the Pentium processor, or the P6 microarchitecture at the core of the latest Pentium II processor, are much more complicated.

Although it is not obvious from the block diagram, the Intel 486 microarchitecture is pipelined, which allows the machine to work on multiple instructions at any given instant. While one instruction is being decoded, another instruction is accessing its registers, a third can be executing, and a fourth is writing the results of an earlier execution to the memory subsystem. See References 1–4 for sources of more details on designing microarchitectures.
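The instruction overlap just described can be counted with a simple idealized model. The function below is illustrative and assumes a perfect pipeline with no stalls, cache misses, or branch penalties.

```python
def cycles(n_instructions, n_stages, pipelined=True):
    """Cycles to run n_instructions through an n_stages datapath (ideal case)."""
    if pipelined:
        # Fill the pipeline once, then complete one instruction per cycle.
        return n_stages + (n_instructions - 1)
    # Unpipelined: each instruction occupies the whole datapath start to finish.
    return n_stages * n_instructions

assert cycles(4, 4) == 7        # decode/register/execute/write overlap as described
assert cycles(4, 4, False) == 16
```

For long instruction streams the pipelined machine approaches one instruction per cycle regardless of stage count, which is the payoff that makes the added control complexity worthwhile.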
THE EVOLUTION OF ISAs

Although microprocessor ISAs are crucial in determining which software will run on a given computer system, they are not static and unchangeable. There is a constant urge to develop the ISA further, adding new instructions to the instruction set or (much more rarely) removing old, obsolete ones. Almost all old ISAs have many instructions, typically hundreds, some of which are quite complicated and difficult for compilers to use. Such architectures are known as Complex
Instruction Set Computers (CISC). In the early 1980s, substantial academic research was aimed at simplifying ISAs [Reduced Instruction Set Computers (RISC)] and designing them with the compiler in mind, in the hope of yielding much higher system performance. Some important differences remain, such as the number of registers, but with time the differences in implementations between these two design philosophies have diminished. RISC ISAs have adopted some of the complexity of the CISC ISAs, and the CISC designers borrowed liberally from the RISC research. Examples of the CISC design style are the Intel x86, the Motorola MC68000, the IBM System/360 and /370, and the DEC VAX. RISC ISAs include MIPS, PowerPC, Sun's SPARC, Digital Equipment Corp.'s Alpha, and Hewlett-Packard's PA-RISC.

COPROCESSORS AND MULTIPLE PROCESSORS

Some microprocessor systems have included a separate chip known as a coprocessor. This coprocessor was intended to improve the system's performance at some particular task for which the main microprocessor was unsuited. For example, in Intel 386 systems, the microprocessor did not implement the floating-point instruction set; that was relegated to a separate numerics coprocessor. (In systems that lacked the coprocessor, the microprocessor would emulate the floating-point functions, albeit slowly, in software.) This saved die size and power on the microprocessor in those systems that did not need high floating-point performance, yet it made the high performance available, via the coprocessor, in systems that did need it. However, in the next processor generation, the Intel 486, enough transistors were available on the microprocessor, and the perceived need for floating-point performance was large enough, that the floating-point functions were directly implemented on the microprocessor.
Floating-point coprocessors have not reappeared, but less-integrated hardware for providing audio (sound-generation cards) and fast graphics is quite common in personal computers of the 1990s; these devices are similar to the coprocessors of the past. As the CPUs get faster, they can begin to implement some of this functionality in their software, thus potentially saving the cost of the previous hardware. But the audio and graphics hardware also improves, offering substantially faster functionality in these areas, so that buyers are tempted to pay a small amount extra for a new system.

HIGH-END MICROPROCESSOR SYSTEMS

Enough on-chip cache memory and external bus bandwidth is now available that having multiple microprocessors in a single system has become a viable proposition. These microprocessors share a common platform, memory, and I/O subsystem. The operating system attempts to balance the overall computing workload equitably among them. Dedicated circuits monitor the traffic on the system buses, in a procedure known as "snooping" the bus, to keep each microprocessor's internal cache consistent with every other microprocessor's cache. The system buses are designed with enough additional performance so that the extra microprocessors are not starved. In the late 1990s, systems of 1, 2, and 4 microprocessors became more common. Future high-end systems will probably
continue that trend, introducing 8, 16, 32, or more microprocessors organized into clusters. As of the mid-1990s, the fastest computers in the world no longer relied on exotic specialized logic circuits but were composed of thousands of standard microprocessors.

FUTURE PROSPECTS FOR MICROPROCESSORS

From their inception in 1971, microprocessors have been riding an exponential growth curve in the number of transistors per chip, delivered performance, and growth in the installed base. But no physical process can continue exponential growth forever. It is of far more than academic interest to determine when microprocessor development will begin to slow and what form such a slowdown will take. For example, it is reasonable to surmise that the process technology will eventually hit fundamental limitations in the physics of silicon electronic devices. The insulators most commonly used in an integrated circuit are layers of oxide, and these layers are only a few atoms thick. To keep these insulators from breaking down in the presence of the electric fields on an integrated circuit, designers try to lower the voltage of the chip's power supply. At some point, the voltage may get so low that the transistors no longer work.

Power dissipation is becoming an increasingly important problem. The heat produced by fast microprocessors must be removed so that the silicon continues to work properly. As the devices get faster, they also generate more heat. Providing the well-regulated electrical current for the power supply, and then removing the heat, means higher expense in the system. With the 486 generation, aluminum blocks with large machined surface areas, known as heat sinks, became commonplace. These heat sinks help transfer the heat from the microprocessor to the ambient air inside the computer; a fan mounted on the chassis transfers this ambient air outside the chassis.
With the Pentium processor generation, a passive aluminum block was no longer efficient enough, and a fan was mounted directly on the heat sink itself. Future microprocessors must find ways to use less power, transfer the heat more efficiently and inexpensively to the outside, and modulate their operation to their circumstances more adroitly. This may involve slowing down when high performance is temporarily unnecessary, changing power supply voltages in real time, and managing the program workload based on each program's thermal characteristics.

Microprocessor manufacturers face another serious challenge: complexity, combined with a larger and less technically sophisticated user base. Microprocessors are extremely complicated, and this complexity will continue to rise commensurate with, among other things,

• Higher performance
• Higher transistor counts
• The increasing size of the installed base (which makes achieving compatibility harder)
• New features to handle new workloads
• Larger design teams
• More difficult manufacturing processes

This product complexity also implies a higher risk that intrinsic design or manufacturing flaws may reach the end user
undetected. In 1994, such a flaw was found in Intel’s Pentium processor, causing some floating-point divides to return slightly wrong answers. A public relations debacle ensued, and Intel took a $475 million charge against earnings, to cover the cost of replacing approximately 5 million microprocessors. In the future, if existing trends continue, microprocessor manufacturers may have tens or even hundreds of millions of units in the field. The cost of replacing that silicon would be prohibitive. Design teams are combating this problem in a number of ways, most notably by employing validation techniques such as random instruction testing, directed tests, protocol checkers, and formal verification. What really sets microprocessors apart from the other tools that humankind has invented is the chameleonlike ability of a computer to change its behavior completely under the control of software. A computer can be a flight simulator, a business tool for calculating spreadsheets, an Internet connection engine, a household tool to balance the checkbook, and a mechanic to diagnose problems in the car. The faster the microprocessor and its supporting chips within the computer, the wider the range of applicability across the problems and opportunities that people face. As microprocessors continue to improve in performance, there is ample reason to believe that the computing workloads of the future will evolve to take advantage of the new features and higher performance, and applications that are inconceivable today will become commonplace. Conversely, one challenge to the industry could arise from a saturated market that either no longer needs faster computers or can no longer afford to buy them. Or perhaps the ability of new software to take advantage of newer, faster machines will cease to keep pace with the development of the hardware itself. Either of these prospects could conceivably slow the demand for new computer products enough to threaten the hardware/software spiral. 
Then the vast amounts of money needed to fund new chip developments and chip manufacturing plants would be unavailable.
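One of the validation techniques mentioned above, random instruction testing, amounts to running the same stream of randomly generated instructions through a trusted reference model and the design under test, and flagging any divergence in architectural state. The miniature harness below illustrates the idea on a hypothetical three-opcode machine; the instruction set, register file, and function names are invented for the example, and a real flow would compare RTL simulation or silicon against the reference model.

```python
import random

# Random instruction testing in miniature: feed identical random
# instruction streams to a reference model and a "design under test"
# and check that their architectural state never diverges.

def reference_model(regs, op, a, b):
    # A trusted (if slow) interpreter for a toy 8-bit, 4-register ISA.
    if op == 'add':
        regs[a] = (regs[a] + regs[b]) & 0xFF
    elif op == 'sub':
        regs[a] = (regs[a] - regs[b]) & 0xFF
    elif op == 'mov':
        regs[a] = regs[b]

def design_under_test(regs, op, a, b):
    # Stand-in for the real design; identical here, so the test passes.
    reference_model(regs, op, a, b)

def random_test(num_instructions, seed=0):
    rng = random.Random(seed)                 # seeded, so failures reproduce
    ref, dut = [0, 1, 2, 3], [0, 1, 2, 3]     # four 8-bit registers each
    for _ in range(num_instructions):
        op = rng.choice(['add', 'sub', 'mov'])
        a, b = rng.randrange(4), rng.randrange(4)
        reference_model(ref, op, a, b)
        design_under_test(dut, op, a, b)
        if ref != dut:
            return False                      # divergence found: a bug
    return True
```

Seeding the generator matters in practice: when a random stream does expose a mismatch, the same seed replays the exact failing sequence for debugging.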
However, negative prognostications about computers or microprocessors have been notoriously wrong in the past. Predictions such as "I think there is a world market for maybe five computers" (Thomas Watson, chairman of IBM, 1943) or "photolithography is no longer useful beyond one-micron line widths" have become legendary for their wrongheadedness. It is usually far easier to see impending problems than to conceive ways of dealing with them, but computer history is replete with examples of supposedly immovable walls that turned out to be tractable. In its short life, the microprocessor has already proven itself to be a potent agent of change. It seems a safe bet that the world will continue to demand faster computers and that this incentive will provide the motivation for new generations of designers to continue driving the capabilities and applications of microprocessors into areas as yet unimagined.

ACKNOWLEDGMENTS

Various trademarks are the property of their respective owners.
ROBERT P. COLWELL Intel Corporation