AUTOMATIC TESTING

Lee A. Belfore II, Old Dominion University, Norfolk, VA

Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W1804. Online posting date: December 27, 1999.
Abstract. The sections in this article are: Testing Principles; Design for Testability; Test Pattern Generation; CAD Tools; Mixed-Signal Testing; Automatic Test Examples; Emerging Technologies; Summary; Appendix 1, Acronyms and Abbreviations.
This article describes the automatic testing of electronic integrated circuits. Because of the variety of implementations and technologies, automatic testing takes several forms. Automatic testing first appeared in the form of automatic test equipment (ATE) used to test newly manufactured integrated circuits. The automated operation of these machines facilitated the mass production of circuits. In ATE, the testing process consists of presenting a series of inputs to the circuit. Simultaneously with the presentation of inputs, the circuit outputs are compared against acceptable responses. In the event discrepancies appear, the devices are either scrapped or, if the design allows, reconfigured into working circuits. With the ever-increasing complexity of circuits, it was observed that some ATE capabilities could be integrated onto the circuit itself, enabling the circuit to fully or partially test itself. Built-in self-test circuitry is now being integrated into increasingly complex chips, conferring significant benefits in these applications.

The science of automatic testing draws on several areas of knowledge. First, defects must be classified according to models that accurately render the defect behavior of the circuit, giving test engineers a target for developing defect tests. Second, given a defect model, methodologies for determining test patterns must be examined. Some circuits are inherently easy to develop tests for (highly testable), while others require special design practices to become easily testable; adding special test structures can make a circuit more easily testable. Third, in many cases the number of tests needed to give acceptable test performance may be large, especially for circuits that are not designed to be testable. Fourth, simulation can assist the test engineer in assessing the efficacy of a particular test regimen. Fault simulations are used to determine which faults can be detected by the test regimen, providing valuable feedback in the test design process. Two excellent references on topics related to automatic testing are (1,2). These works provide comprehensive treatments of testing methodologies.

The ever-increasing complexity of integrated circuits and digital systems makes verifying that the circuit or system is fully functional more difficult. Circuit testers can control and observe system inputs and outputs, respectively, which may number in the thousands. The number of potential component failures in current systems can be in the range of billions. This mismatch between the number of inputs and outputs and the number of internal structures suggests that there are many challenges in testing these devices. If the direct effects
of the faults must be seen at the circuit outputs, an enormous number of test patterns is likely to be required unless attention is paid to efficient generation of test patterns, design for test methodologies, and/or built-in self-test methodologies. By allowing internal structures to test themselves automatically, the internal structures need only report whether they are fault-free or faulty.

Many vendors today supply modules and subsystems that can be included in larger designs. Because the design is considered intellectual property, many vendors do not want to release enough implementation detail to enable construction of an effective test strategy. By incorporating built-in self-test (BIST) capabilities in the subsystem, the vendor can supply a testable design without disclosing any details of the design. Importantly, the customer can achieve acceptable testing results without needing to know the details of the design.

Not surprisingly, economics plays a role in the manner in which testing approaches are applied. One study (3) has shown that in consumer electronics, BIST approaches may not be cost-effective. Testing affects the profitability of a design in several ways. BIST requires the addition of circuitry that is not necessary to maintain functionality. This additional circuitry increases the size of the chip, which reduces the number of chips that can be manufactured per wafer and also reduces chip yield as a consequence of the increased die size. For high-volume products, the costs can be enormous. For example, Intel calculated in 1995 that, for the Pentium processor, a 1% increase in chip area resulted in a $63 million increase in production costs, and a 15% increase in chip area resulted in an almost $1 billion increase in production costs. The study in Ref. 3 showed that high-production chips that are part of consumer electronics typically have a useful design life of 2 years. For these chips, the addition of automatic test capabilities increased costs when all cost factors were taken into account. On the other hand, designs with longer useful design lives, say 5 years, benefit from an automatic test capability.

This article is organized into sections including an introduction, a description of foundational testing paradigms, a description of commonly used automatic testing methodologies, CAD tool support, frontiers in automatic testing, and automatic testing case studies. In the section entitled "Testing Principles," the general principles and methods for testing are introduced; this topic is covered in more detail in other encyclopedia articles. In "Design for Testability," design for test methodologies are presented as they relate to automatic testing, and in "Test Pattern Generation," the subject of test pattern generation is discussed. In "Built-In Self-Test," those methodologies are discussed, and in "CAD Tools," the CAD tool support available in current tools is presented. In "Mixed-Signal Testing," the principles of automatic testing for analog systems are presented. In "Automatic Test Examples," several actual implementations that employ automatic testing are presented, and in "Emerging Technologies" the frontiers of automatic testing are described.
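The cost sensitivity described above can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration, not material from the article: it assumes roughly 706 cm² of usable wafer area, a nominal wafer cost of $1000, and a negative binomial yield model with clustering parameter α = 2 at 1 defect/cm², values chosen only so that the output matches the die-count and yield figures quoted later in the "Economics of Test" discussion.

```python
# Hedged sketch: effect of added test-circuitry area on cost per good die.
# Assumptions (not from the article): ~706 cm^2 of usable wafer area, a
# $1000 wafer cost, and a negative binomial yield model with clustering
# parameter alpha = 2 at 1 defect/cm^2.

def die_yield(area_cm2, defects_per_cm2=1.0, alpha=2.0):
    """Negative binomial yield model: fraction of dies that work."""
    return (1.0 + area_cm2 * defects_per_cm2 / alpha) ** (-alpha)

def cost_per_good_die(area_cm2, wafer_cost=1000.0, wafer_area_cm2=706.0):
    dies = round(wafer_area_cm2 / area_cm2)          # gross dies per wafer
    good = int(dies * die_yield(area_cm2))           # expected working dies
    return wafer_cost / good, dies, good

base_cost, base_dies, base_good = cost_per_good_die(1.0)   # no DFT circuitry
dft_cost, dft_dies, dft_good = cost_per_good_die(1.1)      # +10% area for DFT
print("baseline: %d dies, %d good" % (base_dies, base_good))
print("with DFT: %d dies, %d good" % (dft_dies, dft_good))
print("cost per good die increases by %.0f%%" % (100 * (dft_cost / base_cost - 1)))
```

Under these assumptions, a 10% area increase for test circuitry raises the cost per good die by roughly 17%.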
TESTING PRINCIPLES

In this section, several topics related to testing methodologies, techniques, and philosophy are discussed.
Figure 1. An illustration of fault models. Digital circuits are susceptible to many types of faults.
Many automated testing approaches are derived from less restrictive testing methodologies. For a more in-depth discussion of testing techniques, the interested reader should see Refs. 1 and 2.

Fault Modeling

Circuits can fail in many ways. The failures can be the result of manufacturing defects, infant mortality, random failures, age, or external disturbances (4). The defects can be localized, affecting the function of one circuit element, or distributed, affecting many or all circuit elements. The failures can result in temporary or permanent failure of the circuit. The quality and detail of the fault models can have an impact on the success of the test strategy, and the fault model may also influence the overall test strategy. In order to develop effective testing methodologies, accurate models for circuit failures must be agreed upon and targeted by the testing approach. The fault models selected depend on the technology used to implement the circuits.

Manufacturing defects exist as a consequence of the manufacture of the circuit. Manufacturing defects are heavily studied because of their impact on the profitability of the device. Dust or other aerosols in the air can affect the defect statistics of a particular manufacturing run, and mask misalignment and defects in the mask can also increase the defect densities. Figure 1 gives some example faults, which are discussed in more detail in the following sections.

Stuck-at Fault Models

Stuck-at fault models are the simplest and most widely used fault models in testing. The stuck-at fault model requires the adoption of several fundamental assumptions. First, a stuck-at fault manifests itself as a node being stuck at either of the allowable logic levels, zero or one, regardless of the inputs applied to the gate that drives the node. Second, the stuck-at fault model assumes that the faults are permanent. Third, the stuck-at fault model assumes that gates maintain their ordinary function in the presence of the fault. Significantly, the stuck-at fault model also models a common failure mode in digital circuits. The circuit shown in Fig. 1 can be used to illustrate the fault model. The output of gate G1 can be stuck-at 1 as a result of a defect. When the fault is present, the corresponding input to G4 will always be one. In order for the fault to manifest itself, a discrepancy must occur in the circuit as a consequence of the fault.
To force the discrepancy, the circuit inputs are manipulated so that A = B = 1, with the discrepancy appearing at the output of the gate. This discrepancy may ultimately result in a system malfunction. A second example circuit is shown in Fig. 2, consisting of an OR gate (G1) driving one input of each of three AND gates (G2, G3, and G4). Consider the occurrence of a stuck-at-1 fault at an input to G1; the fault results in the output O1 being 1. In this simple example, one can thus observe the indistinguishability between output and input stuck-at-1 faults. For modeling purposes, these faults can be collapsed into a single fault. In the event gate input I2 has a stuck-at-0 or stuck-at-1 fault, the situation is somewhat different. In this case, O1, I3, and I4 are unchanged, and G3 and G4 will not be directly affected by the fault.

Delay Fault Models

A delay fault is a fault in which a part of the circuit operates more slowly relative to other circuit structures. When such a fault is present, the circuit may operate correctly at slower clock rates but does not operate at speed under certain circumstances. Delay faults can be modeled at several levels (5). A gate delay fault is modeled as excessive delay in a gate as a consequence of faults. The transition fault model represents a node that is slow to transition from 0 to 1 or from 1 to 0. A path delay fault is present when the propagation delay through a series of gates is larger than some desired delay. Indeed, a current industry practice is to perform statistical timing analysis of parts. The manufacturer can determine that the parts can be run at a higher speed with a certain probability, so that higher levels of performance can be delivered to customers. However, this relies on the statistical likelihood that delays will not be worst case (5). By running the device at a higher clock rate, devices and structures that satisfy worst-case timing along the critical path may not meet the timing at the new higher clock rate. Hence, a delay fault can appear as a consequence of manufacturing decisions.
Figure 2. Illustration of input stuck-at faults. O1 stuck-at 0 forces the outputs of gates G2, G3, and G4 to be stuck-at 0.
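The indistinguishability argument above can be checked by brute force. The sketch below is an illustrative model of the Fig. 2 circuit (an OR gate G1 whose output O1 drives one input of each AND gate); the OR-gate input names I0 and I1 are assumptions, since they are not labeled in the text. It compares the complete faulty truth tables for a stuck-at-1 on a G1 input and on the G1 output.

```python
# Hedged sketch of the Fig. 2 discussion: an input stuck-at-1 and an output
# stuck-at-1 on the OR gate G1 produce identical behavior at every output,
# so the two faults are equivalent and can be collapsed into one.
from itertools import product

def circuit(i0, i1, i2, i3, i4, fault=None):
    o1 = i0 | i1                      # G1: OR gate
    if fault == "g1_input_sa1":       # one OR input stuck at 1
        o1 = 1 | i1
    elif fault == "g1_output_sa1":    # OR output O1 stuck at 1
        o1 = 1
    return (o1 & i2, o1 & i3, o1 & i4)   # G2, G3, G4: AND gates

tables = {}
for fault in ("g1_input_sa1", "g1_output_sa1"):
    tables[fault] = [circuit(*bits, fault=fault)
                     for bits in product((0, 1), repeat=5)]
print("equivalent:", tables["g1_input_sa1"] == tables["g1_output_sa1"])
```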
Figure 3. Illustration of a delay fault. The presence of a delay fault causes the wrong value to be clocked into the flip-flop.
Assuming the delay fault indicated in Fig. 1, Fig. 3 gives a timing diagram showing the manifestation of the fault. In this circuit, the delay fault causes the flip-flop input, J, to be delayed for a particular combination of inputs and input change(s), resulting in the value stored in the flip-flop being delayed by one clock period. Because of its nature, a delay fault must be tested at speed; at slower clock rates the circuit functions correctly and the fault cannot be detected. Furthermore, because the delay fault is dynamic, the combinational circuit must receive an input change and the flip-flop must be clocked to make the delay fault observable.

Bridging Faults

A bridging fault is the presence of an undesirable electrical connection between two nodes. This connection results in the circuit malfunctioning or behaving in a degraded fashion. Bridging faults may manifest themselves in wired-AND or wired-OR fashion, changing the circuit function. In addition to degrading the signal, a bridging fault may be manifested as a stuck fault when it occurs between a node and a supply line, or as sequential behavior when it creates a feedback connection (5a). Bridging faults require physical proximity between the circuit structures afflicted by the fault. Figure 1 gives an example of a bridging fault that changes the combinational circuit into a sequential circuit.

CMOS Fault Models

CMOS technology has several fault modes that are unique to the technology (5a). Furthermore, as a consequence of the properties of the technology, CMOS offers alternative methods for identifying defective circuits. CMOS gates consist of complementary networks of PMOS and NMOS transistors configured such that significant currents may be drawn only when signal changes occur. When no signal changes occur, a normally working circuit draws very low leakage currents. Since a CMOS circuit should draw current only when it is switching, any significant divergence from a known current profile is indicative of faults. For example, the gates may function correctly logically, but as a consequence of the faults being present, the circuit may show abnormally large power supply currents. These changes in current draw characteristics can be used as a diagnostic for indicating the presence of faults and possibly for identifying the faults. Testing for faults based on this observation is called IDDQ testing. Bridging faults are common in CMOS circuits (6) and are effectively detected with IDDQ testing (7). IDDQ faults can have a significant impact on portable designs, where the low current drawn by CMOS circuits is necessary.

CMOS circuits also have an interesting failure mode in which an ordinary gate can be turned into a sequential circuit. The fault is a consequence of a transistor failure, low quiescent currents, and capacitive gate inputs. In Fig. 4, if transistor Q1 is stuck open, the gate input to the inverter G1 dynamically stores the prior value on node C when A = 0 and B = 1. In order to make this fault visible, C must be forced to 1 by setting A = B = 0, followed by setting B = 1 to store the value at the input of G2.

Memory Faults

Semiconductor memories have structures that are very regular and very dense. As a result, they can exhibit faults that are not ordinarily seen in other circuits, which can complicate the testing process. The faults can affect the memory behavior in unusual ways (8). First, a fault can link two memory cells in such a way that when a value is written into one cell, the value toggles in another cell. Second, a memory cell may be writable to 0 or 1 but not to the opposite value. Third, the behavior of a memory cell may be sensitive to the contents of the neighboring cells; for example, a particular pattern of values stored in surrounding cells may prevent writing into the affected cell. Fourth, a particular pattern of values stored in the cells can result in the value in the affected cell changing. The nature of these faults makes their detection challenging.

Crosspoint Faults

Crosspoint faults (1) are a type of defect that can occur in PLAs (programmable logic arrays). PLAs consist of AND arrays and OR arrays, with individual terms included in each through the programming of transistors that include or exclude a term through the presence or absence of connections in the array. In field programmable devices, a transistor is programmed to be on or off, respectively, to represent the presence or absence of a connection. A crosspoint fault is the undesired presence or absence of a connection in the PLA. The crosspoint fault can result in a change in the logic function that cannot be modeled by the stuck fault model. A crosspoint fault with a missing connection in the AND array results in a product term of fewer variables, while an extra connection results in more variables in the product term. For example, consider the function f(A, B, C, D) = AB + CD implemented on a PLA. The existence of a crosspoint fault can change the function to f_cpf(A, B, C, D) = ABC + CD. Figure 5 diagrams the structure of the PLA and the functional effect of the crosspoint fault.
Figure 4. An illustration of a CMOS memory fault. CMOS circuits can suffer faults that impart sequential behaviors.
Figure 5. An illustration of a crosspoint fault. The crosspoint fault results in the programmable logic array evaluating the wrong function.
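The functional effect of the crosspoint fault in the example above can be enumerated directly. The sketch below is an illustration, not part of the article: it compares f(A, B, C, D) = AB + CD with the faulted f_cpf = ABC + CD and lists the inputs that expose the extra connection.

```python
# Hedged sketch: enumerate the inputs on which a crosspoint fault that adds
# the literal C to the product term AB changes the PLA's output.
from itertools import product

f     = lambda a, b, c, d: (a & b) | (c & d)        # intended function
f_cpf = lambda a, b, c, d: (a & b & c) | (c & d)    # extra AND-array connection

detecting = [bits for bits in product((0, 1), repeat=4)
             if f(*bits) != f_cpf(*bits)]
print("patterns that detect the crosspoint fault:", detecting)
# Expected: (1, 1, 0, 0) and (1, 1, 0, 1), i.e., AB = 11 with C = 0.
```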
Measures of Testing

In order to gauge the success of a test methodology, some measure of the testing success and overheads is necessary. In this section, the measures of test coverage, test set size, hardware overhead, and performance impact are discussed.

Test Coverage

Test coverage is the percentage of targeted faults that have been covered by the test regimen. Ideally, 100% test coverage is desired; however, this figure can be misleading if the fault model does not accurately reflect the types of faults that can be expected to occur (8a). For example, the stuck fault model is a popular and simple model that works well in many situations. CMOS circuits, however, have several failure modes that cannot be modeled by the stuck fault model. A stuck fault test can be constructed that covers 100% of the stuck faults yet may be only partially successful in identifying other faults. Fault coverage is determined through fault simulation of the respective circuit. In order to assess the performance of a test, a fault simulator must be able to accurately model the targeted fault to get a realistic measure of fault coverage.

Size of Test Set

The size of the test set is an indirect measure of the complexity of the test set. The size of the test set impacts the test time, which has a direct impact on circuit cost if expensive circuit testers are employed. In addition, the test set size is related to the effort, both personnel and computational, required to develop the test. The size of the test set depends on many factors, including the ease with which the design can be tested as well as the implementation of design for test (DFT) methodologies. Indeed, DFT methodologies do not necessarily result in shorter tests. Scan path approaches, in which flip-flops are interconnected as shift registers, give excellent fault coverage, yet the process of scanning into and out of the shift register can result in large test sets.

Hardware Overhead

The addition of circuitry to improve testability or incorporate BIST capabilities can increase the
size of a chip or system and, as shown previously, can have a disproportionate impact on circuit costs. In the event improved testability is a requirement, the increase in hardware can be used as a criterion for evaluating different designs. The ratio of the circuit size with test circuitry to the circuit size without test circuitry is a straightforward measure of the hardware overhead. The additional circuitry for test can increase the likelihood that defects are present in the test circuitry itself. In addition, failure rates of circuits in service are a function of the size of the circuit, with larger circuits having higher failure rates.

Impact on Performance

Likewise, the addition of test circuitry can have an impact on system performance, measured in terms of reduced clock rate and higher power requirements. For example, scan design methods add circuitry to flip-flops that multiplexes between normal and test modes and typically has larger delays than circuits not so equipped. For devices with fixed die sizes and PLAs, the addition of test circuitry may be at the expense of circuitry that improves the performance of operation.

Testability

Testability is an analysis and metric that describes how easily a design may be tested for defects. In testing a circuit for defects, the goal is to supply inputs to the circuit so that it behaves correctly when no defects are present but malfunctions if a single defect is present. In other words, the only way to detect the defect is to force the circuit to malfunction. In general, testability is measured in terms of the specific and collective observability and controllability of nodes within a design. For example, a circuit that gives the test engineer direct access (setting and reading) to flip-flop contents is more easily testable than one that does not, and it would receive a correspondingly better testability figure. In the test community, testability is often described in the context of controllability and observability. Controllability of a circuit node is the ability to set the node to a particular value. Observability of a circuit node is the ability to observe the value of the node (either complemented or uncomplemented) at the circuit outputs. Estimates of the difficulty of controlling and observing circuit nodes form the basis for testability measures. Figure 6 presents a simple illustration of the problem and process. The node S is susceptible to many types of faults. The general procedure for testing for the correct operation of node S is to control the node to a value contrary to the fault value. Next, the observed value of the signal is communicated to the outputs of the system for observation. Detecting faults in circuits with redundancy of any sort requires special consideration in order to be able to detect all possible faults.
Figure 6. Representative circuit with fault. In order to test for faults on S, the node must be controlled by the inputs and observed at the outputs.
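Testability measures of the kind mentioned above are usually computed structurally from the netlist. The fragment below is a rough SCOAP-style controllability estimate over a tiny assumed netlist; the cost rules and the example are illustrative assumptions, not a measure defined in this article.

```python
# Hedged sketch: SCOAP-style combinational controllability estimates.
# CC0/CC1 approximate how hard it is to drive a node to 0 or 1; primary
# inputs cost 1, and each gate adds 1 to the cost contributed by its inputs.

def and_gate(in_costs):
    cc1 = sum(c1 for _, c1 in in_costs) + 1      # all inputs must be 1
    cc0 = min(c0 for c0, _ in in_costs) + 1      # any single input at 0
    return cc0, cc1

def or_gate(in_costs):
    cc0 = sum(c0 for c0, _ in in_costs) + 1
    cc1 = min(c1 for _, c1 in in_costs) + 1
    return cc0, cc1

# Tiny example netlist: g = AND(a, b), h = OR(g, c)
pi = (1, 1)                      # (CC0, CC1) for a primary input
g = and_gate([pi, pi])
h = or_gate([g, pi])
print("controllability of g (CC0, CC1):", g)   # (2, 3)
print("controllability of h (CC0, CC1):", h)   # (4, 2)
```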
For example, fault-tolerant systems that employ triple modular redundancy (TMR) will not show any discrepancies at the circuit outputs when one fault is present (4). In order to make the modules testable, they must be somehow separated so that the redundancy does not mask the presence of faults. In addition, redundant gates necessary to remove hazards from combinational circuits result in a circuit where certain faults are untestable. Testability can be achieved by making certain internal nodes observable.

Automatic Test Equipment

Automatic test equipment tests integrated circuits by applying a set of inputs and comparing output responses with known good responses. The equipment can be used to determine whether a circuit works, and if desired, the tests can be constructed to indicate what part of the circuit has failed. In the event the circuit outputs differ from the known good responses, the circuit is labeled as bad. Reference 8b gives a good overview of ATE. A block diagram of an automatic tester is shown in Fig. 7. Each of the major blocks is discussed below.

Circuit Under Test (CUT)

The circuit under test (CUT) is the device being tested. Testing can be conducted at several points in the manufacturing process. For example, the device can be tested immediately after manufacture, before the wafer has been broken up into individual dies. At the other end of the spectrum, the circuit can be tested after the die has been packaged. The circuit can also be tested after it has been placed into service; bed-of-nails testers have some capacity to test devices that are an integral part of a circuit board. In order to achieve an accurate and reliable test, the tester must operate at the device speed and also check the timing of all parameters against specifications. In practice, testers can test the device at speed; however, they can be extremely restricted in their capacity to check exact timing. The problem lies in the manner in which outputs are checked. In most testers, the outputs are sampled at a regular rate. For example, the Teradyne model J971SP is capable of a peak data rate of 400 MHz, or 2.5 ns per pattern (http://www.teradyne.com/prods/std/j971/j971.html). In addition, a tester capable of taking samples at this rate must be capable of accepting the outputs and incorporating them in a process that determines whether the device is defective.
Figure 7. Block diagram of an automatic tester. The automatic tester is used to test integrated circuits.
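At its core, the tester's pass/fail decision is a comparison of sampled outputs against stored good responses, as in Fig. 7. The loop below is a minimal software analogue of that flow, with a placeholder cut() function standing in for the physical device and tester hardware; it is illustrative only and does not model timing or parametric checks.

```python
# Hedged sketch of the ATE flow in Fig. 7: apply stored patterns, capture
# the CUT outputs, and compare them against known good responses.

def run_test(cut, patterns, golden_responses):
    failures = []
    for pattern, expected in zip(patterns, golden_responses):
        observed = cut(pattern)                  # stimulate and sample outputs
        if observed != expected:
            failures.append((pattern, expected, observed))
    return failures

# Placeholder CUT: a 2-input AND with an injected stuck-at-0 output fault.
good_cut = lambda p: p[0] & p[1]
faulty_cut = lambda p: 0

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
golden = [good_cut(p) for p in patterns]         # known good responses
print("good device failures:  ", run_test(good_cut, patterns, golden))
print("faulty device failures:", run_test(faulty_cut, patterns, golden))
```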
Input Generator

The input generator can provide previously specified patterns or have some capacity to generate patterns. The simplest and most straightforward implementation is a memory that holds patterns specified by a test engineer. The memory architecture must be designed to supply patterns at a rate consistent with the test rate. Alternatively, the patterns applied to the CUT can adapt depending on the output responses of the CUT. For example, in the event the CUT is defective, an adaptive test can be applied to identify the specific defect. Lastly, patterns can be generated in a pseudorandom fashion. Pseudorandom pattern generation has the benefit of requiring simple circuitry and provides provable probabilistic defect coverage for combinational circuits as a function of the number of pseudorandom patterns applied.

Output Collector

The output collector collects the output results for subsequent passage to the output analyzer. One form of output collector is a simple pass-through operation, in which the outputs of the circuit are passed to the output analyzer without modification. In some testing applications, the output collector compacts the output responses into a signature that is indicative of whether or not defects are present in the CUT. In many digital circuits, linear feedback shift registers (LFSRs) are used to compute the signatures. Other applications may utilize different structures to compute the signatures. For example, in digital signal processing (DSP) applications, a digital integrator has been shown to compute a signature with provable fault detection capabilities (9).

Output Analyzer

The purpose of the output analyzer is to determine whether the circuit is operating fundamentally defect free. In many automatic testers, the output analyzer consists of a memory containing all of the expected responses of the circuit. For each presentation of the input, the CUT outputs are compared to the previously determined good-circuit outputs. A miscomparison results in the circuit being found to have one or more defects. The miscomparison usually does not indicate the specific location of the defect; additional tests may be required to identify the location. In the event the specific defect location is desired, additional, likely adaptive, tests would be necessary. Through a searching process, output results can be fed back to the input generator to emit tests consistent with the search. In the event output response compaction has been applied, the relevant comparisons are made against the compacted signatures.

Parametric Testing

Parametric testing tests the electrical characteristics of the signals. For example, circuit outputs are generally required to sink and source particular current levels while maintaining acceptable logic levels. Such testing requires measurement of both voltage and current parameters. In addition, testers have the capability to perform some timing tests beyond those related to the device clock frequency.

Economics of Test

Not surprisingly, economics plays a role in how testing approaches are applied (3). In the manufacturing process, defects will occur, so it is impossible to ignore testing entirely. The costs include design, manufacturing, testing, and maintenance costs. On the design side, the engineer must decide what, if any, test structures to include in the design. If these
design structures are not in a library, they must be designed. In addition, the actual test regimen must be constructed. Complicating the matter, the costs associated with the test regimen can affect the costs associated with constructing the test set. Employing many design for test (DFT) approaches reduces the effort needed to develop test regimens. Since these costs are incurred in the design phase, they can be distributed over the manufacturing run, with a large manufacturing run minimizing the impact of the larger design costs.

The manufacturing costs are direct costs and are incurred with the manufacture of each device. In general, adding DFT capabilities increases the die size of the device, reducing the number of manufactured dies. Assuming standard yield models, and assuming that no redundant capacity is included to compensate for increased die defect rates, the yield of the individual dies is reduced. Using highly simplistic assumptions, some useful observations can be made. First, the wafer is assumed to be 30 cm in diameter and the die area 1 cm². This example assumes that the entire wafer area is available for dies and makes no attempt to compensate for test dies, test regions, or dies on the perimeter. In this example, 706 dies can be manufactured on each wafer. Assuming DFT increases the die area by 10%, the number of dies on the wafer is reduced to 642. Assuming a fixed cost to manufacture each wafer, the apparent cost per die increases by 10%. More significantly, the defect yield models show that the fraction of good dies manufactured decreases nonlinearly with increasing area. Assuming negative binomial defect statistics and an average of 1 defect/cm², the yield of the 1 cm² die is 44.4%, producing 313 working dies, while the yield of the 1.1 cm² die is 41.6%, producing 267 working dies. The cost per working die manufactured is 17% higher for the circuit that employs test circuitry.

Testing of the manufactured device is an essential aspect of the manufacture. The extensiveness of the testing performed and the design for test approaches used in the design can impact the testing cost. To determine whether or not the device is defect free, a series of test inputs is applied to the device and the responses are observed. In the event the observed behaviors differ, the device is marked as bad. DFT in some cases can simplify and reduce in number the set of test inputs and can reduce testing time proportionately. Furthermore, circuits with BIST capabilities can dramatically reduce or even eliminate the reliance on external test equipment. In addition, testing can be used to qualify a part according to a government or industry standard. For example, burn-in is the process of running a device just after manufacture in order to weed out the weaker parts. Failure of these parts is termed infant mortality, reflecting the early failure and the increased failure rates of newly manufactured parts. Additional testing can be performed at the extremes of temperature, vibration, and humidity to receive qualification for devices used in military designs.

Maintenance costs are related to the costs of the device once it is placed in service. While the device is in service, device failures can cause the system within which it is contained to fail. Indeed, without some sort of built-in self-test (BIST) capability, determining the cause of and compensating for the failure can be difficult. For a device within the system, determining the difference between a defect and a design flaw may be extremely difficult.
Incorporation of a built-in test capability can greatly enhance the maintainability because the
device, in effect, may be able to indicate its failure and simplify the task of the equipment maintainers.

DESIGN FOR TESTABILITY

Design for testability methods are used to simplify and enhance the testing process through analysis and the application of methodologies that facilitate easier testing. The understanding of DFT requires background in several areas, including how circuits fail, the models for circuit failure, and general methods for identifying circuit failures. Automated testing can be performed in several fashions, which can have a great impact on design (1,2). First, the test can be performed either on-line or off-line. In an off-line test, the system is taken out of service to perform testing. Once the system is out of service, the CUT is either placed in an ATE or BIST circuitry is enabled, and the test is performed. BIST approaches are discussed in detail in the next section. Once testing has been conducted, the system resumes its normal function. An on-line test requires the system to remain in service. In one approach, the system is tested at idle times, when the system is operational but has no tasks to perform. This type of test is nonconcurrent, because the test and the circuit function are mutually exclusive. On-line concurrent testing requires testing to be performed concurrently with ordinary circuit function. A simple on-line testing approach is parity checking of transmitted data, which, however, may have unacceptable fault latencies. Special design techniques may be necessary to support on-line concurrent testing that achieves high fault coverage with low fault latencies.

Test Point Selection

A simple method for improving the testability of a design is the addition of judiciously selected test points to serve as auxiliary points for controlling and observing internal nodes of the circuit. The identification of these points can follow from a testability analysis of the entire design that determines the difficulty with which internal points may be tested. Circuit nodes that are among the most difficult to test are selected as the test points. As test points are identified, the testability analysis can be repeated to determine both how well the additional test points improve testability and whether further test points are necessary. The disadvantages of employing test points are that it is an ad hoc method for identifying test points and that the test points require additional inputs and outputs to the system.

Scan Design Methods

Scan design methods are DFT methods that build on the test point selection methodology by identifying specific circuit structures that serve as suitable test points. Furthermore, these test points are interconnected as a large shift register, enabling direct controllability and observability at all stages of the shift register. The scan design methods use either the flip-flops that are part of the design or other "geographic" clues such as system partitions to define the shift register paths. A key feature of scan design methods is that large sequential circuits are broken into smaller pieces that are more easily tested. Taken to its fullest extent, the
circuit is decomposed entirely into combinational and sequential parts. Historically, these approaches follow from techniques incorporated into the IBM System/360, where shift registers were employed to improve testability of the system (10). A typical application of a scan design is given in Fig. 8. The switching of the multiplexer at the flip-flop inputs controls whether the circuit is in test or normal operation. Differences between scan design methods occur in the flip-flop characteristics or in the clocking.

Figure 8. Scan path test structure. All flip-flops can be configured into an externally loadable shift register. In this configuration, combinational logic can be tested directly.

Scan Path Design

In scan path design (11), the circuit is designed to operate in one of two modes: normal and test. When the circuit operates in normal mode, all test circuitry is disabled and the circuit operates per its functional requirements. When in test mode, all flip-flops are reconfigured into a shift register whose contents can be scanned in or out through special test inputs and outputs to the circuit. The purpose of the shift register is to give direct controllability and observability of all the flip-flops in the shift register. In essence, the test mode decomposes the circuit into its sequential and combinational parts. The combinational logic can be tested with any of the well-known approaches for developing test patterns for combinational circuits. Relevant aspects of the design include the application of a race-free D flip-flop, improving the testability of the flip-flops.

Level Sensitive Scan Design

One long-standing and successful DFT approach is IBM's level sensitive scan design (LSSD) approach (12). Similar to the scan path approach, the circuit can be operated in one of two modes: normal and test. A key difference between the approaches is in the structure of the flip-flops. In LSSD, two clock phases are used to clock the flip-flops to guarantee the detection of clock faults.

Boundary Scan Techniques

In many design approaches, applying design for testability to some components is impossible. For example, standard parts that might be used on printed circuit boards (PCBs) are not typically designed with scan path in mind. As another example, more and more ASIC design consists of design using cores, subsystems designed by third-party suppliers. The core subsystems are typically processors, memories, and other devices that until recently were individual VLSI circuits themselves. To enable testing in these situations, boundary scan methods were developed. Boundary scan techniques employ shift registers to achieve controllability and observability at the input/output pins of circuit boards, chips, and cores. An important application of boundary scan approaches is to test the interconnect between chips and circuit boards that employ boundary scan techniques. In addition, the boundary scan techniques provide a minimal capability to perform defect testing of the components at the boundary. The interface to the boundary scan is a test access port (TAP) that enables setting and reading of the values at the boundary. In addition, the TAP may also allow internal testing of the components delimited by the boundary scan. Applications of boundary scan approaches include BIST applications (13), test of cores (14), and hierarchical circuits (15). The IEEE has created and approved the IEEE Std 1149.1 boundary scan standard (16). This standard encourages designers to employ boundary scan techniques by making possible testable designs constructed with subsystems from different companies that conform to the standard.
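The scan-path operation shown in Fig. 8 can be mimicked in software: shift a stimulus into the chain of flip-flops, apply one functional clock so the combinational logic response is captured, and shift the result back out. The model below is a hedged sketch under assumed names and an assumed combinational block, not a description of any particular scan cell library.

```python
# Hedged sketch of scan-path operation: the flip-flops form a shift register
# in test mode, giving direct controllability and observability of the state.

class ScanChain:
    def __init__(self, length):
        self.flops = [0] * length

    def shift_in(self, bits):            # test mode: serially load the chain
        for b in bits:
            self.flops = [b] + self.flops[:-1]

    def capture(self, comb_logic):       # normal-mode clock: capture logic outputs
        self.flops = comb_logic(self.flops)

    def shift_out(self):                 # test mode: serially unload the chain
        out, self.flops = self.flops[:], [0] * len(self.flops)
        return out

# Example combinational block: next state is the bitwise AND of neighbors.
def comb(state):
    return [state[i] & state[(i + 1) % len(state)] for i in range(len(state))]

chain = ScanChain(4)
chain.shift_in([1, 1, 0, 1])     # control: load a test stimulus
chain.capture(comb)              # one functional clock
print("captured response:", chain.shift_out())   # observe the result
```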
Built-In Self-Test

Built-In Self-Test (BIST) approaches add circuitry to a design to enable the circuit to test itself. The test can, depending on its design, be conducted autonomously or while the device is out of service. The necessity of having the device test itself places constraints on the manner in which inputs are generated and outputs are monitored, compared with a test used in conjunction with ATE. Furthermore, in cases where ATE testing is difficult, the device test can be composed of a combination of ATE testing and BIST approaches. Running chip tests on all devices can be expensive, especially when yields are low. Effective application of BIST can enable circuits to assess their own health, reducing the amount of time necessary on expensive automatic test equipment. Furthermore, BIST enables systems in service to be configured to run automated tests. The benefit of this type of test is that systems can determine their health and report a failure, possibly before problems appear at the system level. In addition, for systems that have built-in redundancy, the results from an automated test can be used to switch out a failed component or module and insert a new one. BIST can require that the designer conform to a particular set of design rules and that additional circuitry be added to the system being designed. This can reduce design flexibility by restricting the types of designs that are possible as well
as potentially restricting the functionality of the design so that the entire system can fit in one device. Despite this, BIST approaches confer great benefits.

Test Pattern Generation and Built-In Test

The requirement for built-in test places great restrictions on the test generation process in two ways. First, the actual generation of test patterns must be self-contained within the circuitry itself. This implies the presence of circuitry that generates sequences of test patterns. While in theory circuitry can be designed to produce any test pattern sequence, in practice the required circuit may be excessive or impossible to include. As a result, simpler circuitry must be employed to perform test pattern generation. Three classes of circuits are typically employed because they have good asymptotic performance and because they have the ability to produce any sequence of test patterns.

The first type of circuit is a simple counter. The counter can be constructed asynchronously and will thus require only as many flip-flops as there are test inputs. Counter solutions may be impractical for two reasons. First, if the number of test inputs is large, the time required to count through the entire sequence can be too long. Second, the counter may be unable to test for other types of faults, such as delay faults or CMOS faults. Researchers have investigated reducing the count sequence so that more reasonable test lengths can be achieved (22). The second type of circuit generates pseudorandom sequences with LFSRs. Theoretically, as the number of random test patterns applied to the circuit increases, fault coverage increases asymptotically to 100%. Much research effort has gone into the development of efficient pseudorandom sequence generators; an excellent source on many aspects of pseudorandom techniques is (17). Third, a special circuit can be constructed that efficiently generates the test patterns. In this case, the desired sequence of test patterns is examined and a machine is synthesized to recreate the sequence. Memory tests have shown some success in using specialized test pattern generator circuits (27).

In order to determine whether a fault is present, the outputs of the circuit must be monitored and compared with outputs representative of fault-free behavior. External test equipment solves this by storing the expected behaviors for the circuit given the sequence of inputs supplied by the tester. As with the problem of supplying test patterns, it may be impractical for built-in test to store or recreate the exact circuit responses. Duplication approaches (several are summarized in Ref. 4) can be employed to duplicate a subsystem to ensure that a copy of the expected responses is available. By completely duplicating the system circuitry, the duplicated systems can be operated concurrently from the same set of inputs, where any discrepancy between the duplicated systems results in the detection of a fault. While this approach may be applicable to systems requiring high reliability or fault tolerance, it would be an undesirable approach in most cases. Another widely used solution is to compress the circuit responses into a special code word, a signature, that is indicative of the presence or absence of faults. The signature represents a lossy compression of the circuit responses. The computation of the signature should achieve a high probability of detecting the faults in the system while using the smallest quantity of resources to do so.
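An LFSR-based pattern generator of the kind described above costs only a few gates in hardware and a few lines to model. The sketch below assumes the primitive feedback polynomial x⁴ + x + 1 purely for illustration (the article does not prescribe one); starting from any nonzero seed it cycles through all 15 nonzero 4-bit patterns.

```python
# Hedged sketch: 4-bit Fibonacci LFSR used as a pseudorandom pattern generator.
# The feedback recurrence corresponds to the primitive polynomial x^4 + x + 1.

def lfsr_patterns(seed=0b0001, n_bits=4, taps=(3, 2), count=15):
    state = [(seed >> i) & 1 for i in range(n_bits)]   # state[0] is stage 0
    for _ in range(count):
        yield list(state)
        feedback = state[taps[0]] ^ state[taps[1]]     # XOR of tapped stages
        state = [feedback] + state[:-1]                # shift toward later stages

patterns = list(lfsr_patterns())
print(len(set(map(tuple, patterns))), "distinct nonzero patterns")   # 15
```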
Methods for Compressing Test Data Sets

As indicated, test set compression is essential to producing economical test cases for use in automatic testing. Compression methods utilize test hardware in a fashion that retains much of the fault detection capability while minimizing testing resources. The compaction process can be measured with respect to the following attributes (25): space, time, function specificity, linearity, and memory. The test data set can be represented as a two-dimensional matrix in which each row is associated with one measurement point and each column corresponds to a test pattern applied at a particular time. Compression is a transformation that reduces the size of this matrix. The compression may or may not reduce the effectiveness of the overall test; however, a small reduction in test coverage is often acceptable. Space and time compression occur when the number of rows and/or columns is reduced. The compression can be expressed mathematically as D = ΦC, where D is the original test data matrix, C is the compressed test data matrix, and Φ is the compression transformation. The manner in which the transformation Φ is selected affects the matrix size as illustrated, but it may also have implications for the remaining three attributes. Function specificity occurs when the transformation relates the number of compactor stages to either the number of inputs or the number of test patterns, or when the compaction is related to the sequences of binary patterns or inputs. Linearity is a property that follows from using the exclusive-OR function to derive Φ, which can be shown to be a linear operator in the finite field GF(2) (26); for example, the LFSR would be considered a linear time compactor. Memory is the property that a bit in D is a function of both past and present bits in C.

Space compression measures the reduction in test data width. For example, consider a test strategy that uses 100 test points. Suppose further that the values from pairs of test points are combined with exclusive-OR gates. If the resulting 50 signals are monitored, a space compaction of 50% is realized. Time compaction occurs when the number of columns is reduced by some process. For example, SISRs and MISRs can be used to compress the number of columns to 1.

Signature Analysis

Signature analysis is the process, frequently used in BIST, of compressing output responses with a single-input signature register (SISR) or multiple-input signature register (MISR) to compute a signature, which is a compression of the output responses (17). The signature is a quantity that is representative of the condition of the system. Figure 9 gives the architecture of SISRs and MISRs.
Figure 9. Signature generation architectures. The signature can be computed by sampling one signal or several at the same time. (A) SISR. (B) MISR.
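Signature computation with an SISR follows directly from the structure in Fig. 9: each observed response bit is folded into the LFSR feedback, and only the final register contents are retained. The feedback taps and the toy response streams below are assumptions for illustration; the point is that a single flipped response bit yields a different signature, while the signature itself says nothing about where the fault lies.

```python
# Hedged sketch: single-input signature register (SISR).  Each observed
# response bit is XORed into the LFSR feedback; the final state is the
# signature.  Aliasing is possible because many response streams map to
# the same few signature bits.

def sisr_signature(response_bits, n_bits=4, taps=(3, 2)):
    state = [0] * n_bits
    for bit in response_bits:
        feedback = state[taps[0]] ^ state[taps[1]] ^ bit
        state = [feedback] + state[:-1]
    return state

good_stream = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]     # fault-free responses (assumed)
faulty_stream = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # one bit flipped by a fault
print("good signature:  ", sisr_signature(good_stream))
print("faulty signature:", sisr_signature(faulty_stream))
```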
Figure 10. Linear feedback shift register. The LFSR is a key building block in the implementation of signature generation, pseudorandom pattern generation, and code generators/checkers.
A signature for the known good circuit is compared with the signature for the CUT, where any discrepancy indicates the presence of a defect. The SISR and MISR are linear sequential compaction schemes for the observed test point(s); both are built around an LFSR. The signature, being a compression of the output responses, gives no indication of the nature of the failure. Furthermore, the number of bits in the signature is far less than the number of test patterns. As a result, several different circuit conditions can result in the same signature, which is termed aliasing. In practice, if a fault is present, it is desired that the signature produced differ from the fault-free case; it is possible in some designs for the signature to be the same even though it was the result of compaction of an entirely different set of output results. In Ref. 18, upper bounds on the aliasing probability were derived for signatures computed with SISRs, and in Ref. 19, methods were developed for constructing MISRs with no aliasing for single faults.

LFSRs have properties that are useful for generating effective signatures. The LFSR offers a simple and effective compaction approach for the detection of faults in the circuit. The LFSR consists of some number of flip-flops, N, and linear (exclusive-OR) connections fed back to stages nearer the beginning of the shift register. The general structure of the LFSR used as an SISR is shown in Fig. 10. In addition, other types of circuits have shown the ability to compute signatures effectively. For example, DSP design often results in modular designs constructed from accepted DSP building blocks; in particular, digital integrator circuits have been employed to compute signatures (9).

BILBO

The built-in logic block observer (BILBO) approach has gained fairly wide usage as a result of its modularity (28). The BILBO approach is derived from the scan path approach
where the scan circuitry and other functions are encapsulated in a BILBO register. In addition, connections within the BILBO register enable computation of a signature suitable for fault detection. A four-bit BILBO register is shown in Fig. 11. BILBO registers operate in one of four modes. The first mode is used to hold the state of the circuitry in D flip-flops. In the second mode, the BILBO register is configured as a shift register that can be used to scan values into the register. In the third mode, the register operates as a multiple-input signature register (MISR). In the fourth mode, the register operates as a parallel random pattern generator (PRPG). These four modes make possible several test capabilities. One example application of BILBO registers is shown in Fig. 12. In test mode, two BILBO registers are configured to isolate one combinational logic block. The BILBO register at the input, R1, is configured as a PRPG, while the register at the output, R2, is configured as an MISR. In operation, for each random pattern generated, one output is taken and used to compute the next intermediate signature in the MISR. When all tests have been conducted, the signature is read and compared with the known good signature. Any deviation indicates the presence of faults in the combinational circuit. In order to test the other combinational logic block, the functions of R1 and R2 need only be swapped. Configuration of the data path to support test using BILBO registers is best achieved by performing register allocation and data path design with testability in mind (29).

Memory BIST

Semiconductor memories are regular structures manufactured with the goal of maximizing the capacity for a given area of silicon. While in principle memories can be tested like other sequential storage elements, in reality the overhead associated with scan path and similar test approaches would severely impact the storage capacity of the devices. The regularity of the structure, however, makes it possible to test the memory with hardware structures outside the memory arrays proper. Among the test design considerations for memories is the number of tests as a function of the memory capacity. For example, a test methodology was developed (27) for creating memory test patterns; these patterns can be computed by a state machine that is relatively simple and straightforward. The resulting state machine was shown to be implementable in random logic and as a microcode-driven sequencer.
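Memory BIST pattern generators commonly implement march algorithms, which sweep the address space with simple read/write sequences. The sketch below runs the well-known MATS+ march over a small memory model with an injected stuck-at-0 cell; MATS+ is used here only as a representative algorithm and is not the specific generator of the methodology cited above.

```python
# Hedged sketch: MATS+ march test over a word-oriented memory model.
#   { up(w0); up(r0, w1); down(r1, w0) }
# A cell stuck at 0 is injected to show the test failing.

class FaultyMemory:
    def __init__(self, size, stuck_at_zero=None):
        self.cells = [0] * size
        self.stuck = stuck_at_zero          # address of a stuck-at-0 cell, or None

    def write(self, addr, value):
        self.cells[addr] = 0 if addr == self.stuck else value

    def read(self, addr):
        return self.cells[addr]

def mats_plus(mem, size):
    for a in range(size):                   # up(w0)
        mem.write(a, 0)
    for a in range(size):                   # up(r0, w1)
        if mem.read(a) != 0:
            return False
        mem.write(a, 1)
    for a in reversed(range(size)):         # down(r1, w0)
        if mem.read(a) != 1:
            return False
        mem.write(a, 0)
    return True

print("good memory passes: ", mats_plus(FaultyMemory(16), 16))
print("stuck cell detected:", not mats_plus(FaultyMemory(16, stuck_at_zero=5), 16))
```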
Figure 11. Four-bit BILBO register. The BILBO register can operate in one of four modes: simple register, shift register, pseudorandom pattern generator, or signature register.
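The four modes listed in the caption above can be captured in a small behavioral model. The sketch below is an assumption-laden illustration (the feedback taps, bit ordering, and the stand-in combinational block are arbitrary choices), intended only to show how a single register can act as an ordinary register, a scan segment, an MISR, and a PRPG.

```python
# Hedged sketch: behavioral model of a 4-bit BILBO register and its modes.

class Bilbo:
    def __init__(self, taps=(3, 2)):
        self.state = [0, 0, 0, 0]
        self.taps = taps

    def _lfsr_shift(self):
        fb = self.state[self.taps[0]] ^ self.state[self.taps[1]]
        self.state = [fb] + self.state[:-1]

    def clock(self, mode, parallel_in=None, scan_in=0):
        if mode == "register":             # mode 1: ordinary D flip-flops
            self.state = list(parallel_in)
        elif mode == "scan":               # mode 2: serial shift register
            self.state = [scan_in] + self.state[:-1]
        elif mode == "misr":               # mode 3: compact parallel inputs
            self._lfsr_shift()
            self.state = [s ^ p for s, p in zip(self.state, parallel_in)]
        elif mode == "prpg":               # mode 4: pseudorandom generator
            self._lfsr_shift()
        return list(self.state)

r1, r2 = Bilbo(), Bilbo()
r1.clock("scan", scan_in=1)                       # seed the generator
for _ in range(10):
    pattern = r1.clock("prpg")                    # R1 drives the logic under test
    response = [pattern[0] & pattern[1]] * 4      # stand-in combinational block
    r2.clock("misr", parallel_in=response)        # R2 compacts the responses
print("signature held in R2:", r2.state)
```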
Figure 12. Example BILBO architecture. BILBO registers are used in pairs to isolate combinational logic functions.

Programmable Logic Arrays

With the increasing complexity of programmable logic arrays (PLAs), with sizes approaching 1 million devices, PLA test is becoming an important part of the design process. PLA test can be viewed from two perspectives. First, the "blank" device can be tested and deemed suitable for use in an application. Second, the programmed device may be subject to the testing process itself.

Self-Checking Circuits

Self-checking circuits (SCCs) are circuits that, by their construction, have the inherent capability to detect whether a failure exists in their circuitry (4). Self-checking circuits are not employed in typical applications because of the overhead required, which can be slightly more than 50% additional circuitry. SCCs enable automatic testing while the circuit is operational and can achieve high fault coverage without extensive test structures or regimens. The key feature of SCCs is that they employ complementary circuits and convey information using a code word consisting of two bits. The pair of bits represents the true and complementary values of the signal being transmitted. Any fault results in the pair of signals being identical, allowing clear detection of the fault. A simple example of such an SCC is shown in Fig. 13. The dual of a function is defined in terms of the original function in the following way:

f_dual(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n) = \overline{f(x_1, x_2, \ldots, x_n)}     (1)
In short, the dual outputs the complement of the desired function given the complements of the inputs. In ASIC development, SCCs provide two levels of protection and detection of faults. First, if either the function or its dual has a fault, then only that function is affected. As a result, if the fault is present, either the circuit will provide the correct output or it will give an incorrect output. In the event an incorrect output is supplied, an error will be detected because the dual function will still be providing the correct output. Second, in the event of a failure in the design tools, the function and its dual may be affected in different ways, so that a tool failure will have a higher probability of detection. The final issue that must be addressed in SCCs is the checking circuitry. The checking circuitry must be designed both to detect occurrences of illegal signal combinations and to be self-checking itself. SCCs are used in applications where fault detection is essential to the completion of the mission and are rarely used in common applications. Cases where SCCs might be used are space and aerospace applications.

TEST PATTERN GENERATION

Two general approaches can be used for developing tests that achieve high fault coverage. The first approach employs functional tests and the second specifically targets defect tests. A test is an input or sequence of inputs that can be used to detect a fault in a circuit. For example, in Fig. 14, a test for G stuck-at 1 is ABCDEF = 110X0X, where X is a don't care. Note that by setting AB = 11, the expected value on node G is 0, while with the fault present the value is 1. In addition, setting CD = 0X and EF = 0X enables the propagation of the discrepancy due to the fault to the output of G4, where it can be detected by examining node J. Test pattern generation is the process of determining a set of test patterns that result in an effective test giving acceptable test coverage (1). A good test requires testing of all aspects of the circuit for as many defects as are reasonably possible. The creation of the test is very much dependent on the type of defect model that is assumed for the device. For example, the stuck fault model may be used to develop test patterns. Even if the test engineer created a test that covered 100% of all stuck faults, defective CMOS devices may pass the test because the actual defects in the technology tend to show up as delay or bridging faults. Test patterns can be created in several ways.
Figure 13. A self-checking circuit. Self-checking circuits are composed of complementary circuits of which only one malfunctions when a defect is present.

Figure 14. Detection of a fault in a combinational circuit. The input ABCDEF = 110X0X detects the G stuck-at-1 fault.
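The Fig. 14 example can be reproduced with a small simulation. The gate types are an assumption (the article does not specify them): G1 is taken as a NAND gate, G2 and G3 as AND gates, and G4 as an OR gate, which is consistent with the values quoted for the test ABCDEF = 110X0X.

def circuit(a, b, c, d, e, f, g_stuck_at_1=False):
    g = not (a and b)          # node G; G1 assumed to be a NAND gate
    if g_stuck_at_1:
        g = True               # inject the G stuck-at-1 fault
    h = c and d                # node H; G2 assumed to be an AND gate
    i = e and f                # node I; G3 assumed to be an AND gate
    return g or h or i         # node J; G4 assumed to be an OR gate

# Test pattern ABCDEF = 110X0X; the don't-care inputs D and F are set to 0 here.
good = circuit(True, True, False, False, False, False)
bad = circuit(True, True, False, False, False, False, g_stuck_at_1=True)
print(good, bad)               # prints False True: the faulty response differs, so the fault is detected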
First, perhaps the most intuitive source of test patterns comes from functional tests of the system under consideration. The second source of test patterns comes from a concerted attempt to exhaustively test for every defect in the circuit. The third approach is to create a random set of test patterns. Designs are often supported with DFT methodologies, such as scan registers, to decompose the system into sequential and combinational parts that can be tested largely independently of one another. The sequential part is often configured as a shift register, which is straightforward to test. As a result, much effort has been expended on developing test patterns for combinational circuits. Several test generation approaches are discussed below.

Functional Test

A functional test can be used to determine whether the circuit meets computational and functional requirements. In addition, functional tests can exploit known circuit structures with known testability characteristics. A designer will construct these tests in the process of debugging the design so that the functional characteristics of the system can be demonstrated. In theory, functional tests can be constructed to achieve very nearly 100% fault coverage; however, the number of tests may be prohibitive. For example, a functional test of a multiplication circuit might include tests computing several products. Functional tests can be used alone or in conjunction with defect tests to achieve the desired fault coverage. In the event DFT circuitry is incorporated, functional tests verify that the circuit function is achieved in its normal mode of operation by deactivating the test circuitry.

Defect Tests

In defect testing, it is taken for granted that the system design meets functional requirements. In general, the goal of defect testing is to verify that all circuit structures and components are operating defect free and within manufacturing tolerances. As a result, the actual test patterns presented to the system may bear no resemblance to circuit inputs that would ordinarily be encountered in the field. Defect tests can be constructed in an ad hoc fashion, provided the appropriate tools are available, such as an accurate fault simulator. Constructing effective tests can be eased with the introduction of DFT methods that improve the controllability and observability of internal circuit structures. For example, the introduction of test points and scan registers can aid the designer in efficiently achieving acceptable defect tests. In addition, algorithms exist for automatically determining a set of defect tests that, under certain conditions and assumptions, can achieve 100% fault coverage. One such algorithm is the D-algorithm (20,21). The D-algorithm can be used to determine test vectors for combinational logic and achieves 100% fault coverage for such circuits, provided the circuits do not contain any redundancy, as might be present for eliminating hazards or for achieving fault tolerance. The D-algorithm can be effectively applied to circuits that include a test configuration where all flip-flops are clocked by the same clock signal, test circuitry is present that reconfigures the circuit so that all flip-flops form a long shift register, and a mechanism for shifting values into and out of the shift register is present. The test patterns can be used for BIST
applications, provided the patterns can be efficiently stored or recreated by test circuitry. Lastly, pseudorandom tests can be constructed by randomly creating test patterns that are applied to the circuit. The fault coverage is a function of the circuit and the number of different test patterns that are applied. Furthermore, the coverage asymptotically approaches 100% as the number of different test patterns that have been applied increases. The benefit of pseudorandom tests is that the tests can be easily created with simple circuits, which is desirable for BIST applications.

Exhaustive Testing

Exhaustive testing is testing under all possible operating conditions for all faults. Because each possible operating condition is targeted, all faults are guaranteed to be detected, provided they can be detected by any test. The problem with exhaustive testing is that the number of tests required is in most cases intractably large. For example, an irredundant combinational circuit with four inputs may be exhaustively tested for all stuck faults with 16 test patterns, one for each input combination. A circuit with 64 inputs would require 2⁶⁴ ≈ 1.8 × 10¹⁹ tests to be similarly tested. Given a tester that could present 10⁹ patterns per second, the test would require over 500 years. Further complicating the process of exhaustive testing is the presence of sequential circuits and also several failure modes that require two or more consecutive patterns to test for individual faults.

Pseudoexhaustive Testing

In pseudoexhaustive testing, the circuit is partitioned into component blocks that can be tested exhaustively. This approach achieves some benefits of exhaustive testing, in that all faults of the desired class are tested, while reducing the scope of the test to a subcircuit of a more manageable size. In addition, counters (22) and LFSR circuits (17) are capable of generating exhaustive input patterns for some fault models. The circuit can be partitioned based on several criteria. One criterion for partitioning is to examine the dependence of outputs on the inputs. If an output depends on fewer than the total number of inputs, then these inputs can be selected as the inputs to the pseudoexhaustive test (2).

The D-Algorithm

The D-Algorithm (20,21) is a test pattern generation algorithm that can be used to determine the set of test patterns to detect all detectable stuck faults in combinational circuits. While the circuits used in digital systems are generally not purely combinational, they can be decomposed into combinational parts using scan design methodologies. The ‘‘D’’ in the D-Algorithm represents a discrepancy between the fault-free and faulty circuit, where D is used both to represent the fault for which a test pattern is desired and to represent the propagation of the discrepancy to the output of the circuit for detection of the fault. In the circuits, D and its complement D̄ are used as logic values alongside the familiar 1 and 0 logic values. In the context of fault detection, D represents a discrepancy where the circuit node should be 1 yet is 0 due to the presence of a fault. In a complementary fashion, D̄ represents a discrepancy where the circuit node should be 0 yet is 1 due to the presence of a fault. The theme of the algorithm is to assign D and D̄ to
circuit nodes to represent faults. Next, the discrepancy is propagated to the outputs by sensitizing gates to enable propagation of the discrepancy along all paths until the discrepancy appears at the circuit output(s). This process is termed the D-drive. Next, inputs of the remaining gates are set to support the propagation of the discrepancy to the circuit outputs, which is termed the consistency operation. Backtracking in the consistency operation occurs whenever a node is required to be both 1 and 0. Among the significant features of the D-Algorithm is the ability to find a stuck fault test if one exists. In addition, the D-Algorithm can be extended to test for delay faults and nonfeedback bridging faults (21).

Pseudorandom Test Pattern Generation

While the D-Algorithm can be used to determine the test patterns for a particular circuit, as a result of its searching the D-Algorithm can be computationally expensive. A computationally less expensive yet undirected approach uses test patterns generated in a pseudorandom fashion (17). The basic idea behind the parallel random pattern generator (PRPG) is the assumption that many approximately equally desirable tests exist for a circuit. By randomly sampling the individual tests, probabilistically speaking, an acceptable test set can be found. LFSRs are simple circuits that can be used to generate pseudorandom test patterns.

Weighted Test Patterns

Test patterns constructed using PRPGs result in patterns where each bit has an equal chance of being either 1 or 0. For some circuits, this uniformity in the test patterns may not result in an effective set of tests. In particular, certain types of patterns are more productive in terms of the quantity of faults that can be detected. Adders and counters, for example, are effectively tested when patterns result in carries being generated, or when carries could be generated if faults are present. One of the first applications of weighted test patterns assumed that the circuit had no DFT circuitry and that the tester changed one input at a time. After presenting a random set of test patterns, it was observed that some input changes resulted in a larger number of devices in the circuit changing. By biasing the statistics of the inputs and making each input change with a pseudorandom frequency dictated by its importance, measured by internal gate signal changes, the benefits of weighted test patterns were demonstrated (23).

Multiple Pattern Methods

Under certain circumstances, a single pattern applied to a circuit is insufficient to test for some types of faults or to achieve certain testing goals. Multiple test patterns are necessary when it is desired to locate a fault. Since one test pattern will detect the presence of several faults, additional tests are required to rule out other defects. In addition, two test patterns are required to detect delay faults: the first pattern sets up the conditions and the second pattern makes the presence of the fault visible. The D-Algorithm can be easily enhanced to test for delay faults (21). In CMOS circuits, transistor faults can result in a combinational circuit operating as a sequential circuit. To detect this type of fault, the first pattern stores a value and the second pattern propagates the discrepancy to the output of the circuit (24).
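A minimal sketch of pseudorandom pattern generation with an LFSR follows. The 4-bit register width, seed, and tap positions are illustrative assumptions; the chosen taps give a maximal-length sequence of period 15.

def lfsr_patterns(seed=0b1001, taps=(3, 2), width=4, count=15):
    # Fibonacci-style LFSR: the new bit is the XOR of the tapped bits of the
    # current state; with these taps the 4-bit register cycles through all
    # 15 nonzero states before repeating.
    state = seed
    for _ in range(count):
        yield state
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = ((state << 1) | fb) & ((1 << width) - 1)

for pattern in lfsr_patterns():
    print(format(pattern, "04b"))    # each word serves as one test pattern for the CUT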
Test Pattern Generation for PLAs

As indicated earlier, PLAs are susceptible to crosspoint faults, where the function programmed into the PLA is changed by the addition or omission of a term in the AND array or OR array. These faults do not fit the standard stuck fault model, so a different strategy to test for crosspoint faults must be adopted.

CAD TOOLS

Support for automatic testing in CAD tools is as important as understanding the test techniques, given the high density and complexity of today's digital systems. The inclusion of test support in the CAD tool facilitates integrating good test practices into the design process. Tool builders have recognized the importance of including testing approaches. Two leading companies in this respect are LogicVision, Inc. and Mentor Graphics. Furthermore, the development of hardware description languages, which has resulted in the widely used Verilog and VHDL, greatly simplifies the process of adding automatic testing capabilities. VHDL is the VHSIC Hardware Description Language, where VHSIC stands for very high speed integrated circuits (32). Reference 30 gives a good summary of CAD approaches.

Fault Simulation

Fault simulation is a simulation capable of determining whether a set of tests can detect the presence of faults within the circuit. A fault simulator is a logic simulator with the capability of keeping track of whether one or more faults can be detected given the inputs to the circuit. In practice, a fault simulator simulates the fault-free circuit concurrently with circuits having faults. In the event that a fault produces circuit responses that differ from the fault-free case, the fault simulator records the fault as detected. In order to validate a testing approach, fault simulation is employed to determine the efficacy of the test. Fault simulation can be used to validate the success of a test regimen and give a measure of the fault coverage achieved by the test. In addition, test engineers can use fault simulation for developing functional test patterns. By examining the faults covered, the test engineer can identify circuit structures that have not received adequate coverage and target these structures for more intensive tests. In order to assess different fault models, the fault simulator must be able both to model the effect of the faults and to report the faults that were detected by a particular input sequence. In testing for bridging faults detectable by IDDQ testing, traditional logic and fault simulators are incapable of detecting such faults because these faults may not produce a fault value that can differentiate faulty from fault-free instances. In Ref. 31, a fault simulator capable of detecting IDDQ faults is described.

Test Structure Libraries

By incorporating DFT structures and techniques in libraries, the designer is given the ability to easily include accepted testing practices. Some testing practices can be incorporated in a fashion that is to some extent transparent to the design engineer. In doing so, DFT techniques can become a more integral
part of the design process. For example, scan design approaches can be incorporated in a design provided all flip-flops operate off the same clock. Connecting flip-flops in a scan path architecture can be an option, with specific details of the test structure being hidden from the designer unless they are immediately relevant to the design. Languages such as VHDL can offer DFT capabilities with the addition of libraries (32).

MIXED-SIGNAL TESTING

Analog systems have increased in complexity along with digital systems. With increased system complexity comes the necessity to perform defect and parameter testing. Furthermore, since both analog and digital circuitry require essentially the same manufacturing process, a common practice is to include both analog and digital circuitry on the same chip. With the increasing complexity of analog circuits and with the introduction of digital circuitry, testing becomes a more important issue. Analog circuitry presents special problems in testing because analog signals are continuous, making it impossible to develop a test for all possible cases. Second, complex digital circuitry can be included in a circuit that contains analog circuitry for which testing is desired. At the interface, the digital signals serve as actuators of the analog functionality. In general, however, ABIST (analog built-in self-test) shares many of the same features and challenges as digital BIST. First, for nonconcurrent testing, the CUT must be switched to a special test mode where inputs are generated and output responses are compared with accepted norms for behavior. At the system level, ABIST is not unlike digital BIST. The CUT requires an input stimulus, and the outputs must be monitored and compared against a measure indicative of the CUT fault status.

Signal Generation

Among the challenges of developing ABIST architectures is having the capability of injecting a test signal into the CUT (33). Digital systems are straightforward from the point of view that, in principle, any pattern can be generated for input. Analog signals can take on an infinity of values and combinations. Indeed, it is impossible to test an analog device for all possible operating conditions and levels. With the selection of appropriate signals, however, a fairly complete test is possible. Ideally, having the signal generation capabilities of a lab bench signal generator, along with the capability of programming the signal generator, gives the ideal combination of capabilities for testing analog devices. Integrating these capabilities on analog chips, however, can be impractical due to the difficulty of creating a signal generator with sufficient capabilities to test the CUT. The simplest form of signal generator simply generates a sine wave with either a fixed or a variable frequency. Such a signal generator is termed a single-tone signal generator. The frequency response of an analog device can be obtained with a programmable single-tone signal generator that scans the desired range of frequencies. This approach can be ineffective at generating tests of nonlinear properties. For example, a two-tone signal generator can be used to test for nonlinearities by determining the magnitude of the frequencies that result from mixing the original two tones.
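The two-tone idea can be sketched numerically as follows; the tone frequencies, sample rate, and the weakly nonlinear model of the CUT are assumptions for illustration, not values from the article.

import numpy as np

fs = 100_000.0                               # sample rate (assumed)
t = np.arange(0, 0.1, 1.0 / fs)
f1, f2 = 1000.0, 1300.0                      # the two test tones (assumed)
x = 0.5 * np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)

# A weakly nonlinear model of the CUT: the cubic term creates intermodulation
# products at 2*f1 - f2 (700 Hz) and 2*f2 - f1 (1600 Hz), among others.
y = x + 0.05 * x**3

spectrum = np.abs(np.fft.rfft(y * np.hanning(len(y))))
freqs = np.fft.rfftfreq(len(y), 1.0 / fs)

for f_im in (2 * f1 - f2, 2 * f2 - f1):
    k = int(np.argmin(np.abs(freqs - f_im)))
    print(f"intermodulation product near {f_im:.0f} Hz, relative level {spectrum[k] / spectrum.max():.1e}")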
Mixed Testing

In some analog applications, both analog and digital circuits are present. For example, an A/D converter requires analog circuits to sample the analog input and compare the analog signal levels to internal references. Digital circuits are then used to take the results of these comparisons and output a binary integer representative of the analog voltage level. For a complete test, both the analog and digital circuits must be tested, as well as the interfaces between the two. The digital parts of such circuits can be tested using the techniques for purely digital circuits. In Ref. 34, a BIST approach was developed that incorporates digital DSP hardware to measure the frequency response and other parameters of an analog-to-digital converter.

AUTOMATIC TEST EXAMPLES

Two examples of microprocessors that employ automatic testing methodologies are presented in this section. It is interesting to note the range of techniques that are employed even within the same design. Microprocessor CPUs present special challenges for automatic testing as a result of their enormous complexity, state-of-the-art implementation methodologies, and the complications of mass production. Indeed, because microprocessors are manufactured in volume, the economics of employing testing methodologies are carefully weighed against the benefits. In today's modern microprocessors, subsystem structures of every type, each with its own testing challenges, coexist and must be tested. The techniques employ a combination of both built-in test and the application of ATE.

Failure Analysis in Intel Pentium and Pentium Pro Processors

Intel incorporated several DFT approaches that enable device test and manufacturing failure analysis (35). Failure analysis is necessary to determine the cause of device failure. In a manufacturing setting, specific patterns of failures may be indicative of problems with the process. Thus, the identification of the specific failure mode is essential for adjusting the manufacturing process to reduce the incidence of a specific type of failure. Approaches such as e-beam probing can identify failures, but Intel engineers recognized that current manufacturing technologies reduced the effectiveness of e-beam probing for several reasons. First, the component geometries were becoming too small to be resolved. Second, the layers of metalization impede observation of signals at the first and second metalization layers. Third, flip-chip packaging complicates e-beam probing. Several DFT techniques were employed in the Intel Pentium processor in order to facilitate testing off the manufacturing line. The Intel Pentium has microcode that can be patched with externally loaded microinstructions, accomplished by downloading the new microinstructions into a special memory. Special control logic transfers control to the microinstruction patches at desired points in the microprogram. The microinstructions are coded to aid in localizing the source of defects and can be used in conjunction with external probing methodologies. Another Pentium DFT capability allows external dumping of the contents of memory arrays, enhancing the observability of memories and control logic. Complementing these DFT techniques is a scan-out mechanism enabling
the sampling of internal control and datapath signals. The cache memories also support a direct access test mode whereby the memories are accessed as a pipelined memory. A more detailed test mode, the low yield analysis mode, enables extraction of dc parameters for individual memory cells that are useful in failure analysis.

DEC Alpha Processor

Several testing methodologies were employed in the DEC Alpha processor to achieve several goals (36). First, the Alpha was designed to achieve extremely high performance, and thus the testing methodologies had to be incorporated in such a way as to not significantly affect performance. Second, testability methodologies were employed to ease the burden on chip tester technology. Third, the testability methodologies were designed to work in conjunction with repair methodologies for the chip memories. On the Alpha, it was decided not to use full scan path methodologies as a result of the size of the processor and because some of the logic was implemented using dynamic logic. It was further recognized that the instruction cache could be used to store programs valuable in the testing process. Upon initial wafer probe, the Alpha instruction cache was designed to undergo a built-in self-repair operation prior to being loaded with a special test program. The purpose of the test program is to determine failed bits in the cache memories and report the failures for subsequent masking through laser repair. In addition, the Alpha has a manufacturing test port that supports limited internal scan path capabilities in the form of 27 linear feedback shift registers. A mostly IEEE 1149.1 compatible TAP enables boundary scan test capabilities.
EMERGING TECHNOLOGIES

IDDQ Test Patterns

In IDDQ testing, the current profiles of CMOS circuits are used to test for defects in the circuits (7). CMOS circuits are particularly susceptible to bridging faults that are not always testable when developing tests according to the stuck fault model. IDDQ testing is generally considered to be an augmentation to other testing approaches and has been shown to improve fault coverage. In IDDQ testing, as test patterns are presented to the circuit, the quiescent current is monitored. If the current exceeds some threshold, defects are presumed to be present. Test patterns for IDDQ testing can be the same patterns used to perform fault testing according to other models, augmented with current profiles for the different patterns in the test set. Because CMOS circuits can draw different quiescent currents depending on the internal function of the circuit, adopting a threshold for each test will give a more accurate test. In addition, test patterns can be created specifically for IDDQ testing.

Challenges of Core-Based Design

In core-based designs, VLSI systems are constructed using complex components from libraries (37). Using parts of simple to moderate complexity from a library has been a part of the industry for some time. The difference is that today's library components have the complexity of entire systems from the past, such as CPUs and memories. Since the cores are supplied by third parties who consider their cores to be intellectual property, supplying enough information to effectively test the cores would require the disclosure of the design. Yet it is impossible to construct an effective test regimen without some knowledge of the internal structure of the core. Along with the core, the core vendors will supply test vectors that test the core to achieve a given level of defect coverage. The difficulty is that this requires that the test patterns be applied directly to the inputs and outputs of the core, which can be embedded deep within the design (14).

Synthesis for DFT

In many ways, effective application of testing methodologies requires attention at every stage of the design process (29). While the focus of testing is ultimately on the individual components and gates in the circuit, the designer will rarely work in this domain, given the complexity of most designs produced today. More typically, the designer will design at the behavioral level. Synthesis for DFT can facilitate the design of a testable circuit in several ways. First, in synthesis, test structures can be transparently compiled into the design, making it more testable. Second, by adding datapaths that improve controllability and observability, the testability of the design can be improved. In Ref. 29, resource scheduling is used to enhance testability. Four rules that can be applied to the design at the behavioral level were introduced to improve testability.

Rule 1: Whenever possible, allocate a register to at least one primary input or primary output.

Rule 1 improves the controllability and observability of the design.
Rule 2: Reduce the sequential depth from an input register to an output register.

In Rule 2, the paths through which data is processed are designed so that the data is stored in as few intermediate registers as possible. In doing so, improved controllability and observability of the design result.

Rule 3: Reduce sequential loops by
—proper resource sharing to avoid creating sequential loops for acyclic data flow graphs, and
—assigning IO registers to break sequential loops in cyclic data flow graphs.

Sequential data path loops reduce the testability of the circuit. By avoiding the creation of sequential loops and breaking sequential loops that are unavoidable, testability is improved.

Rule 4: Schedule operations to support the application of Rules 1 to 3.

In Ref. 29, application of Rules 1 to 4 is shown to significantly improve the testability of circuits compared to cases that do not employ these rules. Interestingly, this approach does not require test circuitry traditionally used to improve testability, such as scan design approaches.
SUMMARY

In this article, many aspects of automatic testing were investigated. Automatic testing can be conducted by external test equipment or can be incorporated in the circuitry of the system itself. Important issues in developing and using automatic tests include knowledge of the types of failures that can be expected in the application technology. The types of failures can have a great impact both on the methodology for generating tests and on the design of the systems. With the complexity of today's VLSI circuits, many circuits incorporate DFT and BIST approaches to ease testing burdens.

APPENDIX 1. ACRONYMS AND ABBREVIATIONS

ABIST  Analog Built-in Self-test
ATE    Automatic Test Equipment
BILBO  Built-in Logic Block Observer
BIST   Built-in Self-test
CUT    Circuit Under Test
DFT    Design for Test
DSP    Digital Signal Processing
LFSR   Linear Feedback Shift Register
LSSD   Level Sensitive Scan Design
MISR   Multiple Input Shift Register
PCB    Printed Circuit Board
PLA    Programmable Logic Array
PRPG   Parallel Random Pattern Generator
SCC    Self-Checking Circuits
SISR   Single Input Shift Register
TAP    Test Access Port

BIBLIOGRAPHY

1. M. Abramovici, M. A. Breuer, and A. D. Friedman, Digital Systems Testing and Testable Design, IEEE rev. ed., Piscataway, NJ: IEEE Press, 1990.
2. P. K. Lala, Digital Circuit Testing and Testability, San Diego, CA: Academic Press, 1997.
3. J. M. Miranda, A BIST and boundary-scan economics framework, IEEE Des. Test Comput., 14 (3): 17–23, July–Sept 1997.
4. B. W. Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Reading, MA: Addison-Wesley, 1989.
5. M. Sivaraman and A. J. Strojwas, A Unified Approach for Timing Verification and Delay Fault Testing, Boston: Kluwer, 1998.
5a. N. K. Jha and S. Kundu, Testing and Reliable Design of CMOS Circuits, Boston: Kluwer, 1992.
6. J. Galiay, Y. Crouzet, and M. Vergniault, Physical versus logical fault models in MOS LSI circuits: Impact on their testability, IEEE Trans. Comput., 29: 1286–1293, 1980.
7. C. F. Hawkins et al., Quiescent power supply current measurement for CMOS IC defect detection, IEEE Trans. Ind. Electron., 36: 211–218, 1989.
8. R. Dekker, F. Beenker, and L. Thijssen, A realistic fault model and test algorithms for static random access memories, IEEE Trans. Comput.-Aided Des., 9: 567–572, 1996.
8a. K. M. Butler and M. R. Mercer, Assessing Fault Model and Test Quality, Boston: Kluwer, 1992.
8b. K. Brindley, Automatic Test Equipment, Oxford: Newnes, 1991.
9. J. Rajski and J. Tyszer, The analysis of digital integrators for test response compaction, IEEE Trans. Circuits Syst. II, Analog Digit., 39: 293–301, 1992.
10. W. C. Carter et al., Design of serviceability features for the IBM System/360, IBM J. Res. Develop., 115–126, July 1964.
11. S. Funatsu, N. Wakatsuki, and T. Arima, Test generation systems in Japan, Proc. 12th Des. Automation Conf., 1975, pp. 114–122.
12. E. B. Eichelberger and T. W. Williams, A logic design structure for LSI testability, Proc. 14th Des. Automation Conf., New Orleans, LA, 1977, pp. 462–468.
13. A. S. M. Hassan et al., BIST of PCB interconnects using boundary-scan architecture, IEEE Trans. Comput., 41: 1278–1288, Oct 1992.
14. N. A. Touba and B. Pouya, Using partial isolation rings to test core-based designs, IEEE Des. Test Comput., 14 (4): 52–59, Oct–Dec 1997.
15. Y. Zorian, A structured testability approach for multi-chip modules based on BIST and boundary-scan, IEEE Trans. Compon. Packag. Manuf. Technol. B, Adv. Packag., 17: 283–290, 1994.
16. IEEE, IEEE Standard Test Access Port and Boundary-Scan Architecture, Piscataway, NJ: IEEE Press, 1990.
17. P. H. Bardell, W. H. McAnney, and J. Savir, Built-In Test for VLSI: Pseudorandom Techniques, New York: Wiley, 1987.
18. S. Feng et al., On the maximum value of aliasing probabilities for single input signature registers, IEEE Trans. Comput., 44: 1265–1274, 1995.
19. M. Lempel and S. K. Gupta, Zero aliasing for modeled faults, IEEE Trans. Comput., 44: 1265–1274, 1995.
20. J. Paul Roth, Diagnosis of automata failures: A calculus and a method, IBM J. Res. Develop., 277–291, 1966.
21. J. Paul Roth, Computer Logic, Testing, and Verification, Potomac, MD: Computer Science Press, 1980.
22. D. Kagaris, S. Tragoudas, and A. Majumdar, On the use of counters for reproducing deterministic test sets, IEEE Trans. Comput., 45: 1405–1419, 1996.
23. H. D. Schnurmann, E. Lindbloom, and R. G. Carpenter, The weighted random test-pattern generator, IEEE Trans. Comput., C-24: 695–700, 1973.
24. C. Chan and S. K. Gupta, BIST test pattern generators for two-pattern testing: Theory and design algorithms, IEEE Trans. Comput., 45: 257–269, 1996.
25. A. Ivanov, B. K. Tsuji, and Y. Zorian, Programmable BIST space compactors, IEEE Trans. Comput., 45: 1393–1404, 1996.
26. D. Green, Modern Logic Design, Reading, MA: Addison-Wesley, 1986.
27. M. Franklin, K. K. Saluja, and K. Kinoshita, A built-in self-test algorithm for row/column pattern sensitive faults in RAMs, IEEE J. Solid-State Circuits, 25: 514–524, 1990.
28. B. Koenemann, B. J. Mucha, and G. Zwiehoff, Built-in test for complex digital integrated circuits, IEEE J. Solid-State Circuits, SC-15: 315–318, 1980.
29. M. Tien-Chien Lee, High-Level Test Synthesis of Digital VLSI Circuits, Boston: Artech House, 1997.
30. M. Gössel and S. Graf, Error Detection Circuits, New York: McGraw-Hill, 1993.
31. S. Chakravarty and P. J. Thadikaran, Simulation and generation of IDDQ tests for bridging faults in combinational circuits, IEEE Trans. Comput., 45: 1131–1140, 1996.
32. C. H. Roth, Jr., Digital Systems Design Using VHDL, Boston: PWS Publishing Company, 1998.
33. G. W. Roberts and A. K. Lu, Analog Signal Generation for Built-In Self-Test of Mixed-Signal Integrated Circuits, Norwell, MA: Kluwer, 1995.
34. M. F. Toner and G. W. Roberts, A frequency response, harmonic distortion, and intermodulation distortion test for BIST of a
sigma-delta ADC, IEEE Trans. Circuits Syst. II, Analog Digit., 43: 608–613, 1992.
35. Y. E. Hong et al., An overview of advanced failure analysis techniques for Pentium and Pentium Pro microprocessors, Intel Technology J., 2, 1998.
36. D. K. Bhavsar and J. H. Edmondson, Alpha 21164 testability strategy, IEEE Des. Test Comput., 14 (1): 25–33, 1997.
37. R. K. Gupta and Y. Zorian, Introducing core based system design, IEEE Des. Test Comput., 14 (4): 15–25, Oct–Dec 1997.
LEE A. BELFORE, II Old Dominion University
AUTOMATIC TESTING FOR INSTRUMENTATION AND MEASUREMENT. See AUTOMATIC TEST EQUIPMENT.
CAD for Field Programmable Gate Arrays. Kai Zhu, Actel Corporation, Sunnyvale, CA; D. F. Wong, University of Texas at Austin, Austin, TX. Wiley Encyclopedia of Electrical and Electronics Engineering, DOI 10.1002/047134608X.W1809, online posting date December 27, 1999. Copyright © 1999 by John Wiley & Sons, Inc.
Figure 1. Row-based architecture consists of rows of logic modules separated by horizontal routing channels. The routing tracks in horizontal routing channels are segmented. Vertical routing resources are relatively limited compared with horizontal routing resources. IO modules are at the boundary.
CAD FOR FIELD PROGRAMMABLE GATE ARRAYS

Field-programmable gate arrays (FPGAs) are one of the most popular electronic devices that circuit designers use. Because of the high complexity of circuit designs, software tools have become indispensable to the circuit designer in implementing circuits on FPGAs. This article discusses the internal mechanism of the computer-aided design (CAD) software tools used by circuit designers to implement circuits on FPGAs. FPGAs were first introduced into the market in the mid-1980s to combine the field programmability of programmable logic devices and the high density of gate arrays. Compared to the traditional application-specific integrated circuit (ASIC) technology, FPGAs have the advantage of rapid customization with negligible nonrecurring engineering cost. The advantage of rapid turnaround with relatively low cost has led to increasing usage of FPGAs for a wide variety of applications, including rapid system prototyping, small volume production, logic emulation, and special-purpose reconfigurable computing.
Conceptually, an FPGA device can be visualized as composed of three types of basic components embedded in a two-dimensional grid: logic modules, routing resources, and IO modules. A logic module can be customized to realize various logic functions for different circuit designs. IO modules are located around the periphery of an FPGA device. Routing resources consist of routing segments in both vertical and horizontal directions. Usually, adjacent routing segments in the same direction are grouped together to form routing channels. Interconnections between logic modules are realized by routing nets through the routing channels. Row-based and symmetrical-array architectures are two popular architectures used in commercial FPGA products. In row-based architecture (see Fig. 1), logic modules are grouped into rows separated by horizontal channels. Compared to the horizontal routing resources, vertical routing segments are much more limited. In symmetrical-array architecture (see Fig. 2), routing channels are distributed evenly in both horizontal and vertical directions. Logic modules are surrounded by the adjacent routing channels. Customization of logic modules and routing segments for implementing a particular circuit design is realized by programming a selected set of switches. A switch can be programmed into either a conductive state (on) or an insulative state (off). Physically, a switch can be implemented using an anti-fuse, a pass transistor controlled by a static random-access memory (SRAM) cell, or other technologies. An FPGA device is reprogrammable if the device can be programmed multiple times. SRAM-based FPGAs are an example of reprogrammable FPGAs. Conversely, an FPGA device is one-time programmable if the device can be programmed only once. Anti-fuse-based FPGAs are one-time programmable. More description of the architectural and physical details of FPGAs can be found in several references (1–3) (see PROGRAMMABLE LOGIC ARRAYS). The density of a state-of-the-art FPGA device is over 100K gates and continues to increase rapidly. It is practically infeasible to design circuits on FPGAs without using sophisticated CAD software tools. While there are FPGAs of different
Figure 2. Symmetrical-array architecture consists of islands of logic modules surrounded by routing channels in both vertical and horizontal directions. Because of silicon area limitation, the intersecting vertical and horizontal channels in general are not fully connected.
architectures in both industry and academic research, the CAD software flow for any FPGA design is similar and consists of several basic steps, as illustrated in Figure 3:
Figure 3. A typical CAD flow for FPGAs goes through the following steps: design entry, logic synthesis, physical design, delay extraction, and device programming. Logic synthesis and physical design steps each can be divided into several substeps, as outlined by the dashed lines.
• Design Entry. Specify a circuit design by using schematic capture or hardware description languages (such as VHDL or Verilog).
• Logic Optimization. Transform the circuit network into another equivalent circuit network which is more suitable for the subsequent technology mapping step.
• Technology Mapping. Transform the technology-independent circuit network into a network of library cells of the target FPGA architecture so that the transformed network is functionally equivalent to the original circuit network.
• Partitioning. Partition the network of library cells into several subcircuits so that each subcircuit can fit into a given set of resources of FPGAs.
• Placement. Assign cells of the circuit network to logic and IO modules on an FPGA device.
• Routing. Assign nets of the circuit design to the routing segments on an FPGA device. Select the set of switches that need to be programmed into the on state.
• Delay Extraction. Compute the routing delay from the physical routing information. Routing delay data will be used for post-layout circuit timing calculation and analysis.
• Device Programming. Program the selected switches into the on state.

In the literature, the logic optimization and technology mapping steps are also called logic synthesis, and the software for performing these steps is normally called front-end tools. On the other hand, the tasks of partitioning, placement, and routing are called physical design, and programs for solving these problems are called back-end tools. This flow for FPGA designs is very similar to that used in traditional ASIC technologies. However, the algorithms used for solving the problems encountered in the FPGA design flow can be very different from the algorithms used in ASIC technologies. Very often, it is necessary to develop FPGA-specific algorithms in order to obtain effective as well as efficient solutions. The reason for having FPGA-specific algorithms is mainly that the resources in FPGAs are fixed and limited, and the architectural details of logic modules and routing resources vary significantly in different FPGA products. Strictly limited and fixed resources in FPGA devices pose many constraints on feasible solutions. In comparison with the CAD problems in ASIC designs, the CAD problems in FPGA designs are generally more constraint driven than optimization driven. Finding a feasible solution for an FPGA CAD problem is usually more difficult than finding a feasible solution for an ASIC CAD problem. Practically, it is often acceptable to use as many logic modules and IO pins as are available in an FPGA device as long as the utilization is under the resource limits and the solution is routable. Currently, FPGA architectures are still in constant evolution. There is not yet a universal architecture that is used across different FPGA products. FPGA CAD algorithms, especially physical design algorithms, strongly depend on the architectural details. It is generally necessary to develop architecture-specific algorithms for solving CAD problems in the various stages of the FPGA design flow in order to fully take advantage of the architectural features in different FPGA products. In addition to these algorithmic differences, the primary advantage of quick turnaround of FPGAs dictates that CAD tools for FPGAs must run much faster than the CAD tools for
ASICs. Thus, more restrictions are imposed on the efficiency of critical algorithms for solving FPGA CAD problems. This article focuses on the major FPGA-specific CAD problems in technology mapping, partitioning, placement, and routing. To keep the article concise, algorithmic details for solving the FPGA-specific CAD problems are omitted. In most cases, only the basic ideas of the algorithms are described, and references to the literature that contains detailed descriptions are given. General background information on logic synthesis and physical layout can be found in the literature (see CAD FOR MANUFACTURABILITY OF INTEGRATED CIRCUITS).
TECHNOLOGY MAPPING

Details of technology mapping algorithms vary for different architectures. The basic strategy of most FPGA technology mapping algorithms, however, consists of two basic steps: decomposition and covering. In the decomposition step, logic gates in the original circuit network are decomposed into a different set of logic gates so that the transformed network is more suitable for achieving the optimization objectives, such as area or timing. In the subsequent covering step, logic gates in the circuit are covered by cells in the library of the target FPGA device, where each cell can be implemented by using a logic module. The differences of FPGA technology mapping from the conventional approach result from the fact that the number of distinct logic functions that can be implemented with a logic module in most FPGAs is much larger than the typical library size for conventional ASIC technologies. It is therefore not practical to follow the conventional approach of enumerating all possible functions to determine the optimal selection of library cells. Logic modules in FPGAs can be broadly classified into two categories: lookup table (LUT) based and non-LUT based. Techniques used in technology mapping, especially in the covering step, are different for these two types of logic modules (4).

LUT-Based Logic Modules

A K-input LUT-based logic module can implement a total of 2^(2^K) distinct logic functions, each with no more than K inputs. Examples of commercial FPGAs that use LUTs for logic modules include Actel's ES6500, Altera's Flex, Lucent's ORCA, and Xilinx's XC4000 product families. For values of K greater than 3, the size of the library for the library-based covering approach becomes impractically large. Many specialized algorithms have been developed to address the LUT-based FPGA technology mapping problem (5). An important optimization objective for LUT-based technology mapping is to minimize the area, that is, the number of LUTs used for covering a circuit network. One fast and effective approach for LUT area minimization is to formulate the decomposition and covering problems as the bin-packing problem (6). The bin-packing problem is to pack a set of objects of given sizes into the minimum number of bins of fixed capacity. The bin-packing problem is NP-hard, but simple, fast, and effective heuristic algorithms for solving it exist. The technology mapping results generated by using this approach are significantly better than those of the conventional approach in terms of both run time and area.
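The bin-packing view can be illustrated with a generic first-fit-decreasing heuristic, sketched below. The item sizes and the LUT input count K = 5 are assumptions for illustration, and the actual LUT mapper of Ref. 6 is considerably more elaborate than this generic packer.

def first_fit_decreasing(sizes, capacity):
    # Generic first-fit-decreasing bin packing: place each item into the first
    # bin with enough remaining room, opening a new bin when none fits.
    bins = []
    for s in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + s <= capacity:
                b.append(s)
                break
        else:
            bins.append([s])
    return bins

# Example: pack subfunction input counts into 5-input LUT "bins" (K = 5 assumed).
print(first_fit_decreasing([4, 3, 3, 2, 2, 1, 1], capacity=5))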
Figure 4. A multiplexer-based logic module in (a) can be represented by the binary decision diagram (BDD) shown in (b). The BDD of a subcircuit, shown in (c), is isomorphic to a subgraph of the logic module BDD in (b), as indicated by the shaded nodes.
Another important optimization objective is circuit performance. During logic synthesis steps, a commonly used performance metric is the maximum circuit level, i.e., the maximum number of cells on any path from a primary input to a primary output in a combinational circuit. It has been shown that the problem of minimizing the maximum circuit level for combinational circuits in the covering step can be solved optimally using the network flow technique (7). Furthermore, algorithms have been developed that often achieve a practically desirable balance between area and performance (5).

Non-LUT-Based Logic Modules

A K-input non-LUT-based logic module cannot implement every logic function with no more than K inputs. An example of a non-LUT-based logic module is the multiplexer-based logic module used in Actel's ACT FPGA families (see Fig. 4). In the covering step for non-LUT-based FPGAs, an important operation is to determine whether a cover of logic gates can be implemented by personalizing a non-LUT logic module. This problem is also known as the Boolean matching problem (8). For the logic module shown in Figure 4, the number of distinct logic functions implementable by a logic module
is more than 700, which makes it impractical to apply a conventional library enumeration approach. A specialized technique for non-LUT-based logic module Boolean matching is based on the reduced ordered binary decision diagram (BDD) technique (9). Given a subcircuit logic function F and a logic module function G, BDDs for F and G, denoted BDDF and BDDG, respectively, are constructed. Boolean matching of F on G is performed by detecting whether BDDF is isomorphic to any subgraph of BDDG. Figure 4 illustrates BDDs for a logic function F = xy + x′z and the logic module function shown in Fig. 4, G = (a + b)(c′d + ce) + (a + b)′(f′g + fh). Function F can be implemented by G because BDDF is isomorphic to a subgraph of BDDG, as induced by the shaded nodes in BDDG. Technology mapping and fast Boolean matching algorithms using the BDD isomorphism approach have been developed for multiplexer-based logic modules (10,11).
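Assuming the complement placement in the reconstructed expression for G above, the following sketch checks by exhaustive simulation that F = xy + x′z can be realized by personalizing the module inputs. The particular personalization a = x, b = 0, c = 0, d = y, e = 0, f = 0, g = z, h = 0 is an illustrative assumption.

from itertools import product

def F(x, y, z):
    # target subcircuit function, F = xy + x'z (a 2-to-1 multiplexer)
    return (x and y) or ((not x) and z)

def G(a, b, c, d, e, f, g, h):
    # reconstructed module function: two 2:1 muxes selected by (a + b)
    s = a or b
    m1 = ((not c) and d) or (c and e)
    m2 = ((not f) and g) or (f and h)
    return (s and m1) or ((not s) and m2)

for x, y, z in product([False, True], repeat=3):
    # personalization: a = x and b = 0 make the select equal to x;
    # d = y and g = z feed the data inputs; the remaining inputs are tied to 0
    assert G(x, False, False, y, False, False, z, False) == F(x, y, z)
print("F matched onto the multiplexer-based module")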
PARTITIONING

In the partitioning step, a circuit is partitioned into a collection of subcircuits. Depending on the number of FPGA devices involved, FPGA partitioning can be either multiple-FPGA partitioning or single-FPGA partitioning. In multiple-FPGA partitioning, a circuit is partitioned among multiple FPGA devices so that each subcircuit can fit into a single FPGA device. An example where multiple-FPGA partitioning is necessary is a logic emulation system. A logic emulation system verifies the functionality of a circuit design by implementing the circuit design on FPGAs running at a slower clock speed. Typically, a system-level design is too large for a single FPGA device and therefore must be implemented using multiple FPGAs. Single-FPGA partitioning partitions a circuit within a single FPGA device and is most commonly used for hierarchical-architecture FPGAs. In a hierarchical-architecture FPGA device, routing resources between logic modules are not uniformly distributed. Instead, logic modules are grouped into clusters, where each cluster contains a number of logic modules. Routing resources between clusters are normally much more limited than the routing resources within a single cluster. A hierarchical architecture has the advantage of a smaller device die size than a flat architecture of the same device density, and therefore is most popular for supporting high-density FPGA devices. In the physical design flow for a hierarchical-architecture FPGA, a circuit is usually first partitioned into subcircuits so that each subcircuit can fit into a single cluster. Then, subcircuits are placed and routed within individual clusters. Similar to conventional partitioning problems, the most basic objective of FPGA partitioning is minimization of the interconnections between subcircuits. However, compared to conventional partitioning problems, FPGA partitioning needs to satisfy more constraints in order to obtain a feasible partitioning solution. Finding a feasible partitioning solution is both more difficult and more important than in conventional partitioning problems. This is because the resources in an FPGA device, especially logic modules and IO ports (or the routing resources between clusters within a single FPGA device), are strictly limited. Consequently, FPGA partitioning problems are more resource-constraint driven than conventional partitioning problems.
The two most essential constraints for both multiple-FPGA and single-FPGA partitioning are the IO constraint and the capacity constraint. Capacity constraints for an FPGA device can be very complex. Driven by the demand of supporting system-level circuit designs, FPGA devices are becoming larger in terms of density as well as more heterogeneous in terms of the types of resources. It is not uncommon to find a commercial FPGA device that contains different logic modules, complex IO modules, clocks of various speed grades, embedded memory arrays, and dedicated resources designed for supporting special functions (e.g., wide input gates). Different types of resources on an FPGA device have different upper limits, and a feasible partitioning must satisfy the limitations for each of the different resources. To further complicate the capacity constraint, a logic function can be implemented by using different resources in an FPGA device. For example, a 2-input gate can be implemented using either a 2-input LUT or a 3-input LUT logic module. Consequently, the capacity constraint of an FPGA device cannot be accurately captured by simple measurements such as gate count upper bounds. In addition to the limitation on each type of resource, capacity constraints for an FPGA device need to take into account the multiple choices of logic function implementation. FPGA partitioning algorithms implemented in commercial CAD tools are usually based on traditional move-based approaches, such as the Fiduccia–Mattheyses algorithm (12), with modifications to incorporate FPGA-specific constraints into the algorithms. Starting with an initial feasible partitioning solution that satisfies the capacity constraints, the algorithms maintain feasibility by allowing only moves that do not violate capacity constraints. An initial feasible partitioning solution is usually not difficult to find if the device utilization is not close to the limits. However, for partitioning problems where device utilization approaches the resource limits, finding an initial feasible partitioning solution can be challenging.
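A minimal sketch of the constraint-preserving move loop described above is given below. It is a simplified greedy variant rather than a full Fiduccia–Mattheyses implementation, and the single scalar capacity per partition is an assumption; real tools track several resource types per partition.

import random

def cut_size(nets, part):
    # number of nets whose cells lie in both partitions
    return sum(1 for net in nets if len({part[c] for c in net}) > 1)

def partition(cells, nets, capacity, moves=200):
    # cells: {name: resource cost}; capacity: per-partition resource budget (scalar, assumed)
    part = {c: i % 2 for i, c in enumerate(cells)}                 # simple starting assignment (feasible here)
    used = [sum(cells[c] for c in cells if part[c] == p) for p in (0, 1)]
    for _ in range(moves):
        c = random.choice(list(cells))
        src, dst = part[c], 1 - part[c]
        if used[dst] + cells[c] > capacity:
            continue                                               # reject capacity-violating moves
        before = cut_size(nets, part)
        part[c] = dst
        if cut_size(nets, part) <= before:
            used[src] -= cells[c]
            used[dst] += cells[c]                                  # accept improving or neutral move
        else:
            part[c] = src                                          # undo worsening move
    return part

cells = {"a": 1, "b": 1, "c": 2, "d": 1}
nets = [("a", "b"), ("b", "c"), ("c", "d")]
print(partition(cells, nets, capacity=3))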
PLACEMENT In the placement step, each cell in the circuit netlist is assigned to a module on an FPGA device. The two most important issues for FPGA placement are routability and performance. Because of the fixed routing resources available on an FPGA device, routability is usually treated as a constraint in the placement process. Net length minimization, which is usually the most important optimization objective for conventional placement problems, is only of secondary importance in FPGA placement. Circuit performance in FPGA placement is also typically treated as a set of timing constraints as specified by the circuit designer. Placement algorithms that consider timing constraints are called timing-driven placement algorithms in the literature. Similar to the placement algorithms for ASIC technology, the placement steps in FPGAs consist of initial placement followed by placement optimization. Initial placement normally concentrates on general objectives, such as net length minimization, and uses constructive algorithms such as min-cut placement in order to achieve fast run time and reasonable quality. During placement optimization, initial placement results are further improved to ensure that the routing resource constraints are satisfied and other objectives, such as timing,
are optimized. Despite similarity to the ASIC placement approach, there exist several FPGA-specific issues that most FPGA placement algorithms need to address, especially during placement optimization, which is very often based on simulated annealing techniques.
Global Routing for Channel Density Computation

The number of routing tracks in each routing channel is fixed for an FPGA device. A necessary condition for any feasible placement solution is that the channel density in every channel cannot exceed the number of routing tracks available in the channel. Minimization of net length alone tends to cause local congestion and produce a placement solution from which it is very difficult for the subsequent global routing algorithm to generate a feasible routing. In order to calculate the channel density accurately, simulated annealing-based placement algorithms need to perform global routing iteratively for every move. Therefore, in addition to producing high-quality routing solutions, run time becomes a critical requirement in designing FPGA global routing algorithms. Such closely interleaved global routing and placement in FPGAs is different from the placement algorithms used in standard cell architectures, where channel heights can be adjusted and, therefore, global routing does not need to be embedded within the placement process.

Fast Interconnection Delay Estimation

Interconnection delay estimation for timing-driven FPGA placement is also very different from that for ASICs. Normally, an FPGA device contains routing tracks of various lengths in order to achieve a delicate balance between routability and performance. Simple interconnection delay estimation models based on net length or fanout are no longer accurate enough for use within timing-driven placement algorithms. On the other hand, more accurate interconnection delay computation methods, such as the distributed RC model, are too computationally expensive to incorporate into simulated annealing-based placement algorithms. Therefore, special techniques for fast and sufficiently accurate interconnection delay estimation are essential for timing-driven FPGA placement. Fast interconnection delay estimation techniques have been successfully developed and used for channel-based FPGA architectures (13).

Clock Skew

Controlling clock skew is a critical issue in synchronous circuit designs, especially for high-speed system-level designs. As long as other higher priority constraints are satisfied, it is always desirable to reduce clock skew to further improve circuit performance and fault-tolerance margin. FPGA architectures allow further clock skew reduction during placement. A typical FPGA device usually contains several clock networks. Clock pins on sequential elements such as flip-flops are connected to the selected clock networks through programmable switches. Figure 5 illustrates connections between a clock pin (CLK) and two clock networks (CLK1 and CLK2) in a row-based FPGA architecture. The clock pin CLK can be connected to either CLK1 or CLK2, depending on the circuit design. Different sets of logic modules chosen for circuit placement in an FPGA device lead to different capacitance load distributions on a clock network.
Figure 5. A clock pin on a logic module can be connected to one of two clock network tracks (CLK1, CLK2) in the adjacent routing channels. The connection is established by turning on appropriate switches, represented by the circles.
Consequently, the clock skews on the clock networks may vary with different placements, and FPGA-specific placement algorithms can take advantage of this fact to further reduce clock skew where desired (14).
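The following is a minimal sketch of a simulated annealing placement loop of the kind described above. The cost function, which mixes half-perimeter wirelength with a crude per-row channel overflow penalty, and all parameter values are illustrative assumptions rather than what a production placer would use.

import math
import random

def anneal_placement(cells, nets, sites, channel_cap, temp=10.0, cooling=0.95, steps=2000):
    place = dict(zip(cells, random.sample(sites, len(cells))))     # random initial placement

    def cost(pl):
        # half-perimeter wirelength of every net ...
        wl = sum((max(pl[c][1] for c in n) - min(pl[c][1] for c in n)) +
                 (max(pl[c][0] for c in n) - min(pl[c][0] for c in n)) for n in nets)
        # ... plus a penalty when the number of nets touching a row exceeds the
        # channel capacity (a crude stand-in for the embedded global routing step)
        demand = {}
        for n in nets:
            for r in {pl[c][0] for c in n}:
                demand[r] = demand.get(r, 0) + 1
        overflow = sum(max(0, d - channel_cap) for d in demand.values())
        return wl + 10 * overflow

    cur = cost(place)
    for _ in range(steps):
        a, b = random.sample(cells, 2)
        place[a], place[b] = place[b], place[a]                    # propose a swap of two cells
        new = cost(place)
        if new <= cur or random.random() < math.exp((cur - new) / temp):
            cur = new                                              # accept the move (possibly uphill)
        else:
            place[a], place[b] = place[b], place[a]                # reject: undo the swap
        temp *= cooling
    return place

cells = ["c0", "c1", "c2", "c3"]
nets = [("c0", "c1"), ("c1", "c2"), ("c2", "c3"), ("c0", "c3")]
sites = [(r, col) for r in range(2) for col in range(2)]
print(anneal_placement(cells, nets, sites, channel_cap=2))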
ROUTING

Because of the high complexity involved in the routing problem, FPGA routing normally is performed in two phases: global routing and detailed routing. Global routing assigns each net a routing path by selecting a set of routing channels, but does not choose specific routing tracks and switches for each net. The goal of global routing is to produce a subproblem that makes it easy for the subsequent detailed router to select routing segments. Since routability is the most important issue, minimization of channel density is normally the optimization objective in FPGA global routing. Similar to the approach for conventional ASIC technologies, FPGA global routing problems are normally formulated as minimum Steiner tree problems and solved by using Steiner tree minimization algorithms. However, there exist two FPGA-specific issues in global routing. The first one is run time. As mentioned in the previous placement section, since global routing is embedded within the placement optimization process, the run time of the FPGA global router is more restricted than that of the global routers used for ASIC technologies. The second issue is routability estimation. For an FPGA architecture where channel intersection areas are not fully populated with switches, routability in the intersection areas cannot be accurately measured with channel densities. Instead, connectivity architectural details within the channel intersection areas need to be considered in order to estimate the routability more accurately (15).

The task of detailed routing is to assign each net to specific routing segments in the channels as restricted by the global router. Design of detailed routing algorithms depends heavily on FPGA routing architectures. Detailed routing algorithms for row-based and symmetrical-array-based architectures are significantly different.

Detailed Routing for Row-Based Architectures

Routing channels in row-based architectures are segmented. A routing track in the segmented channel is divided into several routing segments with various lengths by placing
Figure 6. Routing in segmented channels. Switches in "off" and "on" states are represented by open and solid circles, respectively. Switches can be turned on to connect adjacent routing wire segments on the same track in order to route longer connections.
switches between the adjacent routing segments (Fig. 6). Routing track segmentation is designed based on the net connection distribution statistics collected from a large pool of real circuit designs to achieve a delicate balance between routability and performance. Where desirable, two adjacent routing segments on the same track can be connected by turning on the switch in between to form a longer routing segment that can be used to complete a longer net connection. Most of the vertical routing segments are attached to the logic modules and provide routing resources similar to the feed throughs found in the standard cell architecture. Intersecting vertical and horizontal routing segments are fully populated with switches so that any vertical routing segments can be connected to any intersecting horizontal routing segments as necessary. Therefore, the detailed routing problem in rowbased architectures is reduced to solving segmented channel routing problems. Because switches can introduce significant delay to interconnections due to the relatively high fuse resistance, the number of switches allowed for completing a net connection is usually restricted in order to achieve high circuit performance. In a K-segment channel routing, the maximum number of segments used for routing any net connection is limited to K. For K equal to 1, the segmented channel routing problem can be solved efficiently by using a bipartite matching technique. However, for K greater than 1, segmented channel routing becomes an NP-complete problem (16), except for several special segmentations which, unfortunately, are not used in most commercial FPGA products. Efficient and effective heuristic algorithms have been developed in the commercial tools to solve the general segmented channel routing problem. Detailed Routing for Symmetrical-Array Architectures The intersecting vertical and horizontal routing tracks in a symmetrical-array-based architecture usually are not fully populated with switches. Consequently, the detailed routing problem for symmetrical-array-based architectures cannot be reduced to solving individual channel routing problems. A commonly followed approach is to explore the connectivity
Table 1. FPGA Logic Synthesis Tool Vendors and Their Products

CAD Tool Vendor                          CAD Tool Product Name
Cadence Design Systems, San Jose, CA     FPGA Designer
Exemplar Logic, Alameda, CA              Galileo Logic Explorer
Synopsys, Mountain View, CA              FPGA Compiler
Synplicity, Mountain View, CA            Synplify
within the routing channels specified by the global router by using a search technique, such as maze router. The search approach is practically feasible due to the coarse granularity of the architecture, where the number of tracks in each channel is less than the number of tracks found in a segmented channel in a row-based architecture. Moreover, the tracks in symmetrical-array architectures are not as finely segmented as the segmentation found in the row-based architectures. The search space therefore is significantly limited. To improve the routability with the limited search space, the competition on the critical routing segments between different nets must be considered in the routing process. The critical routing segments contended by different nets can be identified based on the number of distinct nets that may use the routing segments for routing (17). COMMERCIAL CAD SOFTWARE Front-end logic optimization and technology mapping algorithms normally do not have strong dependence on FPGA architecture details. A small set of basic technology mapping algorithms can be used to support different FPGA products from different FPGA vendors. Consequently, front-end software tools used in FPGA designs are normally from independent CAD software vendors, instead of from FPGA companies. Table 1 lists several major CAD software companies that develop and market FPGA synthesis tools that can support various FPGA architectures (18). In addition to commercial tools, a number of FPGA logic synthesis tools developed at universities are in the public domain. Table 2 lists several such logic synthesis tools. Unlike synthesis and technology mapping algorithms, FPGA place and route algorithms are strongly tied to the architecture details of the individual FPGA product. Interactive evaluation between the physical design algorithms and architecture details during a new FPGA product development is
Table 2. FPGA Logic Synthesis Tools Developed at Universities

Institute                 CAD Tool Name
UC Berkeley               MIS-pga
UCLA                      FlowMAP/RASP
University of Toronto     Chortle
critical to the success of product development. Currently, most FPGA vendors develop physical design software tools in-house, and provide proprietary place and route tools together with silicon products to their customers.

FUTURE TRENDS IN FPGA CAD RESEARCH AND DEVELOPMENT

The goal of CAD tools is to help circuit designers use FPGA devices efficiently and effectively, and to help FPGA device architects design new FPGA architectures. Research and development of FPGA CAD tools therefore must be driven by the needs of FPGA users and designers. In this section we discuss several areas that are important for future FPGA CAD tool development.

Run Time Reduction

Currently, the capacity of an FPGA device can far exceed 100K gates and is rapidly increasing. As FPGA devices become larger, the run time of CAD tools for completing an FPGA circuit design is getting longer, especially in the physical design stage. Making matters worse, the increase in run time of current CAD tools is greater than the increase in silicon gate capacity. It is no longer unusual to take more than a day to complete a design of a 100K-gate FPGA device with current CAD tools. If the run time of CAD tools continues to increase at a faster rate than the increase in silicon capacity, the competitive advantage of fast turnaround provided by FPGAs will diminish. In order to maintain the fast turnaround advantage, it is necessary to reduce CAD tool run time, especially in the physical design stage.

Support of Different FPGA Architectures

The demand for flexible CAD tools that are able to support different FPGA architectures is driven by two issues. The first is that new FPGA architectures continue to emerge to accommodate the requirements of new applications and technologies, and designing new FPGA architectures requires CAD tool support for architectural evaluation. The second issue is that developing new CAD tools is a time-consuming and hard-to-predict process, and very often this process is the bottleneck in new FPGA product development. In order to address these issues, CAD tools should consist of a number of modular, independent point tools that can be easily modified and integrated to form a complete design flow to support new FPGA architecture development. The flexibility of integration of point tools is supported by carefully designed device and netlist databases that are used to transfer data between individual point tools. Each of the point tools must be able to support common features in different FPGA architectures and be flexible enough to support new architectural features.

Innovative Algorithms

Innovative algorithms are always in demand as FPGA architectures continue to evolve. For example, a new trend in FPGA architecture design is to integrate specialized functional modules implemented in ASIC together with FPGA in a single device. It is also becoming common to provide embedded memory arrays, especially on large-capacity FPGA devices, in order to support system-level designs that require both logic and memory. New CAD algorithms for logic synthesis and physical design may need to be developed in order to effectively integrate different functionalities on a single FPGA device. Another example where new algorithms are desirable is in hierarchical architectures. As FPGA capacity continues to increase, hierarchical FPGA architectures are more efficient compared with flattened architectures for achieving an appropriate balance between area, performance, and routability. Algorithms such as partitioning and clustering that were previously developed within other contexts will need to be modified in order to accommodate the special requirements of hierarchical FPGA architectures.

BIBLIOGRAPHY

1. S. D. Brown et al., Field-Programmable Gate Arrays, The Netherlands: Kluwer Academic Publishers, 1992.
2. S. M. Trimberger (ed.), Field-Programmable Gate Array Technology, The Netherlands: Kluwer Academic Publishers, 1994.
3. A. El Gamal (ed.), Special section on field-programmable gate arrays, Proc. IEEE, 81 (7): 1993.
4. R. Murgai, R. Brayton, and A. Sangiovanni-Vincentelli, Logic Synthesis for Field-Programmable Gate Arrays, New York: Kluwer Academic Publishers, 1995.
5. J. Cong and Y. Ding, Combinational logic synthesis for LUT based field programmable gate arrays, ACM Trans. Des. Autom. Electron. Syst., 1 (2): 145–204, 1996.
6. R. Francis, J. Rose, and Z. G. Vranesic, Chortle-crf: Fast technology mapping for lookup table-based FPGAs, Proc. 28th Des. Autom. Conf., San Francisco, CA, pp. 227–233, 1991.
7. J. Cong and Y. Ding, An optimal technology mapping algorithm for delay optimization in look-up-table based FPGA designs, Proc. IEEE Int. Conf. Comput.-Aided Des., pp. 48–53, 1992.
8. L. Benini and G. De Micheli, A survey of Boolean matching techniques for library binding, ACM Trans. Des. Autom. Electron. Syst., 2 (3): 1996.
9. R. E. Bryant, Graph-based algorithms for Boolean function manipulation, IEEE Trans. Comput., C-35: 677–691, 1986.
10. A. Bedarida, S. Ercolani, and G. De Micheli, A new technology mapping algorithm for the design and evaluation of fuse/antifuse-based field-programmable gate arrays, 1st Int. ACM/SIGDA Workshop FPGAs, pp. 103–108, 1992.
11. K. Zhu and D. F. Wong, Fast Boolean matching for field-programmable gate arrays, Proc. Eur. Des. Autom. Conf., pp. 352–357, 1993.
12. C. M. Fiduccia and R. M. Mattheyses, A linear-time heuristic for improving network partitions, Proc. ACM/IEEE Des. Autom. Conf., pp. 175–181, 1982.
13. M. Chew and J. C. Lien, Fast delay estimation in segmented channel FPGAs, 2nd Int. ACM/SIGDA Workshop Field-Programmable Gate Arrays, Section 8, 1994.
14. K. Zhu and D. F. Wong, Clock skew minimization during FPGA placement, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., CAD-16: 376–385, 1997.
15. Y.-W. Chang et al., A new global routing algorithm for FPGAs, Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., San Jose, CA, pp. 380–385, 1994.
16. J. Greene et al., Segmented channel routing, Proc. 27th ACM/IEEE Des. Autom. Conf., pp. 567–572, 1990.
17. S. Brown, J. Rose, and Z. G. Vranesic, A detailed router for field-programmable gate arrays, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., CAD-11: 620–627, 1992.
18. S. Schulz, Logic synthesis and silicon compilation tools, Integr. Syst. Des., 1996.
KAI ZHU Actel Corporation
D. F. WONG University of Texas at Austin
DESIGN VERIFICATION AND FAULT DIAGNOSIS IN MANUFACTURING
DESIGN VERIFICATION AND FAULT DIAGNOSIS IN MANUFACTURING When designing and building digital systems, we must ensure that the manufactured final product is exactly what was intended. As shown in Fig. 1, there are two processes in creating digital systems: design process and manufacturing process. Corresponding to these two processes, there are two key issues for ensuring digital systems behave as originally intended. The first is to make sure that what we are designing is correct, that is, the design is exactly the same as what we intend. The second is to make sure that what we are manufacturing is correct, that is, the product is exactly the same as what we have designed. The former process is called design verification and the latter is called manufacturing test and diagnosis. In this article we will give an overview of design verification and manufacturing fault diagnosis technology. DESIGN VERIFICATION As mentioned before, design verification is the process to ensure that what we are designing is exactly what is intended.
Figure 1. Creating digital systems.
This is one of the most important and sometimes the most time-consuming processes in designing complicated systems. The specifications describe what we want, and verification is the process of checking whether the designs satisfy their specifications. The first step in verification is to describe both the specification and the design in mathematical ways so that we can formally apply logic to them. In the case of digital systems, Boolean functions and mathematical logics such as first-order predicate calculus are typically used, since the behaviors of digital systems can be directly described by these types of logic. Once we have mathematical descriptions for specifications and designs, the next step is to verify that the designs satisfy their specification via reasoning. Because verification must establish the correctness of a design with respect to its specification, it is commonly done by simulating the design and checking the appropriateness of the simulation outputs. However, this approach cannot be complete until we simulate all possible cases, which is impossible for large circuits with many input signals (i.e., all possible values of n inputs, or 2^n combinations). Formal verification is a process that tries to prove the correctness of designs mathematically. It implicitly checks all possible cases and guarantees the correctness of designs for all possible input combinations.

Let us clarify the difference between formal verification and simulation with an example. Figure 2 is an example combinational circuit. It uses a gate called the NAND gate. NAND(x, y) gives the complement of the conjunction of x and y: it generates a 0 at its output only when both inputs are 1; otherwise, it generates 1. The specification for the circuit is the EXCLUSIVE-OR function of x and y, which must be realized at the output of the circuit. Here EXCLUSIVE-OR is a logic function that gives the value 1 if and only if the two input values are different; otherwise it gives 0. The EXCLUSIVE-OR function of x and y is defined as $x \cdot \overline{y} + \overline{x} \cdot y$. Formal verification is done to make sure that the circuit in Fig. 2 realizes the EXCLUSIVE-OR function at the output.

We can simulate the circuit and test for its correctness. Verification by simulation is sometimes called validation, since it does not guarantee the correctness of the design completely unless we can simulate all possible cases, which is mostly impossible for large circuits. What we can do is to test some, but not all, cases. Since digital systems are described in mathematical logic or its extensions, their behaviors can be simulated by repeatedly computing logic functions. By simulating the functions of the NAND gates in the circuit, we can obtain the values for the output of the circuit. We need to check all four possible input combinations for the two variables. On the other hand, formal verification of the circuit in Fig. 2 is to prove that its output is mathematically equivalent to
the logic function EXCLUSIVE-OR of x and y. This can be checked by manipulating the Boolean formulas generated from the circuit in the following way:

$z = \overline{b \cdot c},\quad b = \overline{x \cdot a},\quad c = \overline{a \cdot y},\quad a = \overline{x \cdot y}$   (definitions from the circuit)

$z = \overline{\overline{x \cdot \overline{x \cdot y}} \cdot \overline{\overline{x \cdot y} \cdot y}}$   (substitution)

$z = x \cdot \overline{x \cdot y} + \overline{x \cdot y} \cdot y$   (De Morgan's law)

$z = x \cdot (\overline{x} + \overline{y}) + (\overline{x} + \overline{y}) \cdot y$   (De Morgan's law)

$z = x \cdot \overline{y} + \overline{x} \cdot y$   (simplification)
The last formula is the definition of the EXCLUSIVE-OR function of x and y. Since manipulation of all preceding formulas is independent of the values of x and y, the circuit is formally verified to be equivalent to the EXCLUSIVE-OR function of x and y.

Formal verification of logic circuits using transformations of logic formulas like those just given is sometimes called theorem-proving-based verification, since it is trying to prove mathematically the correctness of designs by manipulating logic formulas. As can be seen from the previous example, an appropriate ordering of the application of various transformations, such as substitution, De Morgan's law, and simplification, must be identified in order to obtain the goal formulas (i.e., formulas in the specification). Moreover, if the designs are not correct, transformations do not work and the verification process may not terminate. Therefore appropriate user guidance is essential, and so the verification process is interactive, that is, each transformation of the formulas is guided by users who are verifying the designs. There is a significant amount of research on the use of theorem-proving methods for formal verification (1). Although there has been much success, this method is not yet widely used because it is not completely automatic and needs human interaction.

Automatic verification techniques perform an exhaustive case analysis for all combinations of values of variables, similar to the simulation of all possible cases. Typically, the techniques are based on case analysis. They first analyze the case for which the first chosen variable in the formula is 0 and then check the case for which that variable is 1, and so on. This case pattern may have to continue for all variables in the formula. Fortunately, in most cases, we can reach special cases where we can decide the value of the formula immediately. For example, suppose we analyze the formula $x_1 \cdot x_2 \cdot x_3$. When the variable $x_1$ is set to 0, the entire formula immediately becomes 0 regardless of the values of the other variables. Further analysis is unnecessary for this case. Although this case analysis technique performs much better than exhaustive simulation, it is still very time consuming as its execution time grows exponentially in principle. Because of this, the case analysis technique cannot be applied to large circuits.

Situations have, however, changed completely since a new data representation method for logic functions in computers, called binary decision diagrams (BDDs) (2–4), and its efficient manipulation algorithms were proposed in the 1980s. By using BDDs, significantly larger circuits can be verified in much less time.
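For the two-input circuit of Fig. 2, validation by exhaustive simulation is trivial because there are only 2^2 input combinations. The short sketch below checks all four cases against the EXCLUSIVE-OR specification; the gate structure is taken from the derivation above, and the function names are illustrative only. This is exactly the kind of case enumeration that becomes hopeless when the number of inputs is large.

```python
def nand(p, q):
    return 1 - (p & q)

def circuit_z(x, y):
    """The four-NAND circuit of Fig. 2: a = NAND(x, y), b = NAND(x, a),
    c = NAND(a, y), z = NAND(b, c), as in the derivation above."""
    a = nand(x, y)
    b = nand(x, a)
    c = nand(a, y)
    return nand(b, c)

# Exhaustive simulation is feasible here only because there are 2**2 cases;
# for n inputs the number of cases grows as 2**n.
print(all(circuit_z(x, y) == (x ^ y) for x in (0, 1) for y in (0, 1)))  # True
```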
Figure 2. An example circuit that realizes an EXCLUSIVE-OR function.
Binary Decision Diagram

The binary decision diagram was proposed in the late 1980s and since then it has been widely used for various problems in
Figure 3. Decision tree and its corresponding binary decision diagram.
computer science, especially in computer-aided-design areas. Here we briefly introduce BDD. BDDs are derived from binary decision trees. An example of a binary decision tree is shown in the left side of Fig. 3. It is basically an all-case analysis of the given logic function based on the values of variables. $x_1$, $x_2$, and $x_3$ are variables and 0 and 1 are constants. Each left edge indicates that the value of that variable is 0, whereas each right edge indicates that the value is 1 (unless constant values are added to edges as attributes). We first fix the ordering of variables. In this case the ordering is $x_3$, $x_2$, $x_1$. On all paths from the root node to the leaves, all variables must appear only in this order. By traversing the edges from the root node, we can determine the value of the function. For example, the value of the function for $x_1 = x_2 = x_3 = 0$ is 1, whereas the value for $x_1 = x_2 = 0$, $x_3 = 1$ is 0. Please note that the sizes of binary decision trees are exponential with respect to the numbers of variables. BDD is derived from this tree by removing redundant nodes, as can be seen from the right side of the figure.

Figure 4 shows ways to generate BDD from the binary decision tree. First, isomorphic subgraphs are merged as can be seen from the first transformation in the figure. For example, the left three nodes for $x_3$ are isomorphic and are merged. Then any nodes with two edges going to the same nodes are deleted, as can be seen from the second transformation of the figure. If the two edges go to the same nodes, the function does not depend on the value of that variable for that particular case, and hence those nodes can be deleted. After these steps, binary decision trees become binary decision graphs, since there is sharing of subgraphs. As can be seen from Fig. 4, BDD is a lot smaller than the binary decision tree in general. An important fact is that sizes of BDDs can be polynomial for many useful logic functions, such as adders, parity functions, and most control circuits. Another key issue is that BDD is a canonical representation for logic functions with respect to the predetermined orderings of variables. That is, if the two logic functions are equivalent, their corresponding BDDs will be isomorphic as long as they are using the same ordering of variables. This is an important fact when we apply
Figure 5. Using "apply" to manipulate logic operations on BDDs.
BDD to verification problems. Because of these advantages BDD is now widely used. Although BDD can be obtained from binary decision trees as shown in Fig. 4, this is not an efficient way to generate BDDs, since sizes of binary decision trees are exponential with respect to the numbers of variables. So we need more efficient ways to generate BDD directly from logic circuit representation. This can be done by the procedure ‘‘apply’’ that computes logic operations directly on BDDs. Examples of apply processes are shown in Fig. 5. The apply procedure basically traverses the two given BDDs from the roots to the leaves in a depth-first order. For each step in the depth-first traversal of the two BDDs, it applies logic operations, such as AND and OR, on the two current nodes and generates a new node that corresponds to the results of logic operations. The amount of time for completion of this procedure is proportional to the product of the sizes of the BDDs that it traverses, and hence it is very efficient as long as the BDD sizes can be kept small. By using the apply procedure, we can generate BDD directly from logic circuits and do not have to generate binary decision trees. Although BDD is a very efficient and also effective way to manipulate logic functions, it surely has several drawbacks. One of the most important is the fact that sizes of BDDs are very sensitive to ordering of variables. Figure 6 shows an ex-
Figure 4. BDD is a canonical representation for logic functions.
Figure 6. Ordering of variables is important for BDD.
treme case. The two BDDs represent the same logic function that corresponds to the output of the circuit diagram in the figure. The left BDD uses the best ordering, x1, x2, x3, x4, x5, x6, whereas the right BDD uses the worst ordering, x1, x3, x5, x2, x4, x6. So, if we use bad ordering of variables, the resulting BDDs can be too large to be manipulated. Variable ordering for BDD is one of the most important problems in BDD-related research. It is known that to find the best ordering is NP-complete; so we have to use heuristic approaches for large logic functions (5). There are several good heuristics for giving good ordering (6–11). These heuristics are generally good for practical use, but sometimes BDDs cannot be built simply because of poor ordering. In that sense, the variable ordering problem for BDD is still a good research topic. Because BDDs are so widely used, several BDD packages are available in the public domain (12). They include a com-
plete set of useful routines for BDDs, and users can manipulate logic functions in BDDs by just using those routines appropriately.

Practical Verification Technique for Combinational Circuits

In order to compare the equivalence among combinational circuits, it is sufficient to generate BDDs from the circuits and to check if they are isomorphic, since BDD is a canonical representation for logic functions once ordering of variables is fixed. So, given a circuit, first of all, ordering of variables is determined by using appropriate heuristics. Then we generate BDDs for each gate in the circuit individually using the apply procedure as shown in Fig. 7. After this process, we get the BDD for the output of the circuit.
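The following is a minimal sketch of this flow: a reduced ordered BDD with a unique (hash-consing) table and a memoized apply operator, used to build the BDD of the Fig. 2 circuit gate by gate and to compare it with the BDD of the specification. It is only illustrative — production packages such as those cited above add complement edges, a global computed table, garbage collection, and dynamic reordering — and the class and function names here are invented for the example.

```python
class BDD:
    """Minimal reduced ordered BDD: each internal node is a (var, low, high)
    triple hash-consed in a unique table; node ids 0 and 1 are the terminals."""
    def __init__(self, num_vars):
        self.num_vars = num_vars
        self.unique = {}                    # (var, low, high) -> node id
        self.node = {}                      # node id -> (var, low, high)
        self.next_id = 2

    def mk(self, var, low, high):
        if low == high:                     # both branches equal: node is redundant
            return low
        key = (var, low, high)
        if key not in self.unique:          # hash-consing keeps equal functions shared
            self.unique[key] = self.next_id
            self.node[self.next_id] = key
            self.next_id += 1
        return self.unique[key]

    def var(self, i):
        return self.mk(i, 0, 1)             # BDD of the single variable x_i

    def apply(self, op, u, v, memo=None):
        """Combine two BDDs with a Boolean operator op taking two 0/1 values."""
        memo = {} if memo is None else memo
        if (u, v) in memo:
            return memo[(u, v)]
        if u in (0, 1) and v in (0, 1):
            res = op(u, v)
        else:
            vu = self.node[u][0] if u > 1 else self.num_vars
            vv = self.node[v][0] if v > 1 else self.num_vars
            top = min(vu, vv)               # expand on the topmost variable
            u0, u1 = self.node[u][1:] if vu == top else (u, u)
            v0, v1 = self.node[v][1:] if vv == top else (v, v)
            res = self.mk(top, self.apply(op, u0, v0, memo),
                               self.apply(op, u1, v1, memo))
        memo[(u, v)] = res
        return res

# Build the BDD of the Fig. 2 circuit gate by gate, then compare with the spec.
bdd = BDD(num_vars=2)
x, y = bdd.var(0), bdd.var(1)
NAND = lambda p, q: 1 - (p & q)
a = bdd.apply(NAND, x, y)
b = bdd.apply(NAND, x, a)
c = bdd.apply(NAND, a, y)
z = bdd.apply(NAND, b, c)
spec = bdd.apply(lambda p, q: p ^ q, x, y)       # EXCLUSIVE-OR specification
print(z == spec)                                 # canonicity: same function, same node id
```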
Figure 7. Creating BDDs from circuits.
Figure 8. Verification based on BDD.

Figure 10. Use relationship among internal signals to reduce the size of BDD for output.
We repeat this process on the other circuit to be compared and then check if the two BDDs obtained are isomorphic (13). An example verification based on this approach is shown in Fig. 8. In this case, both circuits give the same isomorphic BDD and so they are logically equivalent. In this approach the most important part is how to obtain ordering of the variables of BDDs, since it will determine whether we can verify circuits. If we can have a good ordering, the BDD size can be relatively small and we may be able to finish BDD construction. But if we use a bad ordering, the BDD construction process may not finish because of the prohibitively large size. By using a good heuristic for variable ordering, the state-of-theart verifier based on this approach can verify circuits having up to a couple of thousands of gates. How can we proceed if the circuits to be verified are much larger than a couple of thousands of gates? One way is to construct a ‘‘miter’’ as shown in Fig. 9 (14,15). The two circuits to be compared are connected by an EXCLUSIVE-OR gate. Then if the two circuits are equivalent, the output of the EXCLUSIVE-OR gate is always 0. So, we have only to build BDD for the output of the EXCLUSIVE-OR gate and check if it is a constant 0 or not. In so doing, we do not necessarily build a BDD for each circuit. Instead, we can construct a BDD for the output of the EXCLUSIVE-OR gate by traversing the circuit from output to input. Hence, even if the BDDs for the original two circuits are large, the BDD that we construct may not become large. Although this is a better approach, it may still not be sufficient to solve verification problems for
Figure 9. Creating a miter to check the equivalence of two circuits.
large circuits, because the sizes of intermediate BDDs during construction of the BDD for the output may become too large.

The approach just mentioned can, however, be significantly improved by using information on the relationship among values of internal signals in the two circuits. For the equivalence check of two combinational circuits, there are cases in which we can verify much larger circuits, for example, circuits having 100,000 gates or larger. One such case involves two similar circuits, for example, one circuit is a slight modification of the other. This occurs frequently in real designs, as designers try to improve the performance of circuits by modifying circuits partially or incrementally. If the two circuits are similar we can expect much signal value dependency among internal signals in the two circuits. For example, if the circuit optimization performed by designers consists of just inserting buffers to speed up a circuit, we will see much internal equivalence between the two circuits. By using internal equivalence we can partition circuits into smaller ones and will only need to check the equivalence among those partitioned circuits instead of the original large circuits. Also, we can use relationships among internal signals in order to reduce the sizes of intermediate BDDs when constructing BDDs for the output of the EXCLUSIVE-OR gate from output to inputs (see Fig. 10). By appropriately using those relationships and reducing BDD sizes, we can verify circuits having more than 100,000 gates rather easily if the two circuits to be compared are similar. Since this approach can treat circuits of real-life sizes, it is becoming widely used (16,17).

Formal Verification of Sequential Circuits

So far we have discussed only combinational circuits. Now we describe techniques on how to verify sequential circuits formally. First we discuss comparison between two sequential circuits. Since sequential circuits generate output sequences over many time units, we have to make sure that the outputs have the same values at all times. That is, as shown in Fig. 11, two sequential circuits are connected and we check to see if the values of the outputs are always the same (18). Since there are only a finite number of flip-flops, the number of possible states in the sequential circuits is finite. Therefore, when we have checked the values of the outputs for all possible states in the two circuits, we can finish verification. For each state, essentially the same procedure as for combinational verification is followed, using the method shown in the previous sections.
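An explicit-state version of this check is easy to state, and is sketched below for two tiny Mealy machines described as Python dictionaries (the machines and their state names are made up for illustration): the two circuits are run in lockstep from their reset states, and every reachable pair of states is examined for an input that makes the outputs differ. Symbolic methods, discussed later in this section, replace the explicit set of state pairs with BDDs.

```python
from collections import deque

def equivalent(fsm1, fsm2, inputs, reset1, reset2):
    """Explicit-state check of two Mealy machines: breadth-first search over the
    reachable pairs of states; each fsm maps (state, input) -> (next_state, output).
    Returns (True, None) or (False, counterexample input sequence)."""
    seen = {(reset1, reset2)}
    frontier = deque([(reset1, reset2, [])])
    while frontier:
        s1, s2, trace = frontier.popleft()
        for inp in inputs:
            n1, o1 = fsm1[(s1, inp)]
            n2, o2 = fsm2[(s2, inp)]
            if o1 != o2:
                return False, trace + [inp]      # outputs disagree on this input sequence
            if (n1, n2) not in seen:
                seen.add((n1, n2))
                frontier.append((n1, n2, trace + [inp]))
    return True, None

# Two structurally identical toy machines (state names are made up); the search
# terminates because the set of reachable state pairs is finite.
m1 = {('even', 0): ('even', 0), ('even', 1): ('odd', 0),
      ('odd', 0): ('odd', 0), ('odd', 1): ('even', 1)}
m2 = {('A', 0): ('A', 0), ('A', 1): ('B', 0),
      ('B', 0): ('B', 0), ('B', 1): ('A', 1)}
print(equivalent(m1, m2, inputs=(0, 1), reset1='even', reset2='A'))   # (True, None)
```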
Figure 11. Verification of sequential circuits.
A state-transition graph can be extracted from the sequential circuit. An example state-transition graph is shown in the left side of Fig. 12. s0 is the initial state, which corresponds to the reset state of the original circuit. In this case, there are three additional states and state transitions that interconnect them. All possible behaviors are represented by all possible state-transition sequences starting from the initial state s0, as shown on the right side of the figure. This is also called a computation tree, because it represents all possible computations that can be done by the state-transition graph on the left side. Thus the goal of sequential verification is to ensure that the values of the outputs are equal to the specified values at each node of the computation tree. This can be checked by traversing the states of the state-transition graph one by one until a state that has already been traversed is reached. This is basically a depth-first search on computation trees. The time to complete this process, however, is exponential in the number of flip-flops, since there are 2^n states in n flip-flop circuits. Hence this approach does not work for large circuits (19,20).

Another method for traversing state-transition graphs is based on a breadth-first traversal on computation trees, as shown in Fig. 13. It maintains a set of states that have already been checked. First the set has just the initial state s0 in the case of Fig. 13. In the next step, it will have s1, s2, and s3 as well. Those are the states that can be reached directly by a single state transition from the state s0. Then, in the next step, we see that no more states can be added to the set, and therefore the search terminates and we have traversed everything. The key idea here is to process sets of states instead of each state individually. The next question is how to represent sets of states efficiently. One commonly used approach is to represent sets with their characteristic logic functions. We introduce new variables to encode each state in the state-transition graph.
Figure 13. Breadth-first search of state transitions.
Basically we need log2(number of states) new variables. Then we assign values of those variables so that each state has different values. This is a type of state assignment for the given state-transition graph. Then a set of states can be represented as a disjunction of the values of the variables for those states. Let us see an example, shown in Fig. 14. Since there are four states, we need two variables for encoding of states. Suppose they are x and y, and we use the following state encoding:

A: (x, y) = (0, 0)
B: (x, y) = (0, 1)
C: (x, y) = (1, 0)
D: (x, y) = (1, 1)
From this, we can get the corresponding state-transition table as shown in Fig. 15. In the table, x and y are the encoding variables corresponding to the present states and x' and y' are those corresponding to the next states. From this table, we can compute the transition relation for the state-transition graph as follows:

$TR(x, y, x', y') = \overline{x} \cdot \overline{y} \cdot \overline{x'} \cdot y' + \overline{x} \cdot y \cdot x' \cdot \overline{y'} + \overline{x} \cdot y \cdot x' \cdot y' + x \cdot \overline{y} \cdot \overline{x'} \cdot y' + x \cdot \overline{y} \cdot x' \cdot y'$

TR(x, y, x', y') is 1 if and only if there is a state transition from the state (x, y) to the state (x', y'). Now we can traverse the state-transition graph in Fig. 14 in a breadth-first order. Let us assume the initial state to be {A}. In the next step we get the set of states {A, B}. Then we get {A, B, C, D} in the following step. This can be computed
Figure 12. State-transition graph and its trace of transitions.

Figure 14. Symbolic manipulation of breadth-first traversal of the state-transition graph.
Present state (x, y)    Next state (x', y')
A (0, 0)                B (0, 1)
B (0, 1)                C (1, 0)
B (0, 1)                D (1, 1)
C (1, 0)                B (0, 1)
C (1, 0)                D (1, 1)

Figure 15. State-transition table corresponding to the state-transition graph.
using the transition relation and the state-encoding variables x, y. For example, in order to get the set of states {A, B, C, D} from the set of states {A, B}, we compute as follows. {A, B} can be represented as $\overline{x} \cdot \overline{y} + \overline{x} \cdot y = \overline{x}$, and so we compute

$\overline{x} \cdot TR(x, y, x', y') = \overline{x} \cdot \overline{x} \cdot \overline{y} \cdot \overline{x'} \cdot y' + \overline{x} \cdot \overline{x} \cdot y \cdot x' \cdot \overline{y'} + \overline{x} \cdot \overline{x} \cdot y \cdot x' \cdot y' + \overline{x} \cdot x \cdot \overline{y} \cdot \overline{x'} \cdot y' + \overline{x} \cdot x \cdot \overline{y} \cdot x' \cdot y'$

which, after the present-state variables x and y are existentially quantified out, reduces to $x' + y'$. The function $x' + y' = \overline{x'} \cdot y' + x' \cdot \overline{y'} + x' \cdot y'$ corresponds to the set {B, C, D}. By adding the original set {A, B}, the result is {A, B, C, D}. Since computing state transitions is now formalized as the manipulation of logic functions, this process can be efficiently automated by using BDDs. It is called the symbolic traversal of state-transition graphs and is now widely used. State-of-the-art implementations of this approach can verify circuits having up to around 200 flip-flops, which may have 2^200 states (21–24).
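The following sketch mirrors the {A} to {A, B} to {A, B, C, D} computation above. For readability it represents the transition relation and the state sets as Python sets of binary codes rather than as BDDs of their characteristic functions, but the fixed-point structure of the breadth-first (image) computation is the one that symbolic traversal implements with BDD operations; the transition tuples are simply the five transitions of Fig. 14 under the encoding given earlier.

```python
def image(states, tr):
    """One breadth-first step: the states reachable in one transition from
    'states'.  A state set is a set of (x, y) codes and tr is a set of
    (x, y, x', y') tuples; a BDD-based tool would represent both by their
    characteristic functions and compute the image symbolically."""
    return {(xn, yn) for (x, y, xn, yn) in tr if (x, y) in states}

# Encoding from the text: A=(0,0), B=(0,1), C=(1,0), D=(1,1); one tuple per
# edge of the state-transition graph of Fig. 14.
TR = {(0, 0, 0, 1),            # A -> B
      (0, 1, 1, 0),            # B -> C
      (0, 1, 1, 1),            # B -> D
      (1, 0, 0, 1),            # C -> B
      (1, 0, 1, 1)}            # C -> D

reached = {(0, 0)}             # start from the initial state {A}
while True:
    new = reached | image(reached, TR)
    if new == reached:         # fixed point: no state was added in this step
        break
    reached = new
print(sorted(reached))         # [(0,0), (0,1), (1,0), (1,1)], i.e., {A, B, C, D}
```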
MANUFACTURING FAULT DIAGNOSIS Fault location for digital logic circuits is studied here. After testing is performed to determine whether a circuit is faulty, fault location or diagnosis is performed to locate the failure. Diagnosis may be performed with a view to improving the manufacturing process or may be intended for the identification and replacement of a faulty subcircuit. Efficient diagnosis has been known to yield rapid improvement. Given a defective chip and good design criteria, the aim of the diagnostic process is to identify a subset of faults that can explain all the errors observed while testing the chip. Techniques described in this article are typically used to reduce the time required for expensive failure analysis procedures that aim at the physical confirmation of the defect (e.g., under an electron microscope). The time reduction is achieved by reducing the number of candidates to examine by analysis at the logic level. We shall first review diagnosis techniques based on their classification of usage of precomputed information (as opposed to run-time analysis) in the diagnosis process. The techniques are broadly grouped under static (cause–effect), dynamic (effect–cause), and integrated techniques. Then, we briefly review work on important tools required for diagnosis, diagnostic fault simulation, and diagnostic test generation. After this, we review diagnosis techniques specifically designed to handle unmodeled faults. Specific techniques that are representative of their genre are explained in greater detail whenever possible.
Diagnosis Strategies Diagnostic techniques can be broadly classified into three groups. The first group, called static (cause–effect) fault diagnosis, uses precomputed information in the form of fault dictionaries for matching with the faulty responses produced by defective circuits (25–33). Fault dictionaries store output information, produced by the circuit under consideration, on application of the given set of test vectors and under the influence of the set of modeled faults. In contrast, dynamic (effect–cause) diagnosis techniques detect the faulty behavior of the circuit while the test set is applied (34–40). Recent trends show the increasing popularity of integrated diagnosis techniques in which the focus is on using small amounts of precomputed information and coupling this with efficient dynamic algorithms to perform fault location (31,41). The main advantage of static fault diagnosis techniques occurs when multiple copies of the same design are being diagnosed (as in an integrated-circuit manufacturing process). Another significant advantage of the fault dictionary approach is that it is relatively simple to use. However, a common problem associated with these techniques is that it is typically infeasible to store all the precomputed information. (Typical full fault dictionaries can require several gigabytes of storage for even moderately large circuits containing 20,000 gates.) Hence, research in this direction has concentrated on providing compact fault dictionaries. The main motivation for dynamic diagnosis algorithms is that they do not require any precomputed information. This eliminates the storage problem with fault dictionaries and also relies to a lesser extent on the type of defects being diagnosed. However, this results in the fact that the time spent for diagnosing each single faulty unit is typically much larger than that required by static techniques. Hence, research in this area has concentrated on reducing the run times. Integrated techniques have been proposed to incorporate the advantages of both the static and dynamic techniques. The main advantage of integrated fault diagnosis is the flexibility provided in choosing the kind and amount of precomputed information. This, in turn, has an effect on the time required for performing diagnosis at run time. Static Fault Diagnosis. An example of a fault dictionary is shown in Fig. 16(a) for a circuit with six modeled faults, two vectors, and two primary outputs. A typical use of the information in this dictionary could be in the following manner: If the faulty response produced by a defective chip on the application of vectors v1 and v2 was 10 and 11, then the dictionary could be used to indicate fault 5’s presence in the defective chip. Techniques for handling situations when the faulty responses do not match with any of the stored responses (exactly) are discussed later in this article under the section Unmodeled Fault Diagnosis. Since fault dictionaries are typically prohibitively large to store, fault-dictionary compaction has been an important focus of research. Past work addressing the size problem has yielded solutions in two distinct directions. The first set of contributions provide fault-dictionary compaction targeting high modeled fault resolution (25,29–32), while the second set offers alternative representations for storing the full fault dictionary (30,31,33). Fault-Dictionary Compaction Research Pass/Fail Dictionary (29). 
This type of fault dictionary records the faults detected, potentially detected, and not de-
Faults    v1    v2
0         00    11
1         00    11
2         01    01
3         00    10
4         01    00
5         10    11

Figure 16. (a) Matrix dictionary; (b) vector-based tree; (c) output-based tree.
tected for each vector. It does not record detections separately by output. It is created by a single full-fault simulation and is much smaller than a full-fault dictionary. But, as might be expected, this dictionary loses some diagnostic capability when compared with the full-fault dictionary.

Compact Dictionary (29). One method of enhancing the diagnostic capability of the pass/fail dictionary is to add output information. Such an approach is used in the creation of the compact fault dictionary. The compact algorithm is computationally intensive, requiring multiple simulations of all vectors against some faults, plus a full-fault simulation to produce the vector dictionary and another to produce the final dictionary after extra columns are added. The dictionary produced is known to be considerably compressed, with no loss of modeled fault resolution (30).

Sequential Dictionary (30). In this technique, a pass/fail dictionary is enhanced by a single full-fault simulation. An entry is added to the dictionary for any vector and output that distinguishes between any pair of faults not previously distinguished. This is computationally cheaper than the compact dictionary generation algorithm. There is no loss of modeled fault resolution.

List Splitting Dictionary (30). This dictionary is created by using efficient list splitting. The lists correspond to faults that are not distinguished at each vector–output combination in the diagnosis process. However, it is not accurate for sequential circuits; hence the diagnostic resolution suffers.

Drop on K Dictionary (30). While creating this dictionary, the fault simulator drops each fault after its Kth detection and creates an otherwise standard dictionary, including possible detections until each fault's Kth definite detection. This technique assumes that K detections distinguish between most fault pairs and that some faults cause errors for many vectors, filling dictionaries with unneeded data. Simulation costs here are less than those for a full-fault dictionary.
First Failing Pattern Dictionary (30). This is a special case of the drop on K dictionary for K = 1.

Detection Frequency Dictionary (30). A full-fault simulation is performed, and for each fault f, the number of vectors definitely producing errors (df) and potentially producing errors (pf) are counted. Each fault can cause errors numbering between df and df + pf. The list of faults that causes each possible number of errors forms an indistinguishability class for this dictionary. The resolution of this dictionary is poor in comparison with other schemes.

Tree-Based Compaction Dictionary (32). Diagnostic experiment trees (as shown in Fig. 16) have also been used to identify information that is not diagnostically useful (for modeled faults) to provide compact dictionaries. An example of information that is eliminated corresponds to output information for faults after they were completely distinguished from other faults.

Full-Fault-Dictionary Representation Research. A key problem with the compaction techniques that have been previously described lies in the fact that the information that they identify as diagnostically useful is useful only with respect to modeled faults. Hence, the diagnostic accuracy of such dictionaries in the presence of unmodeled faults may degrade. Thus there is a necessity for developing storage structures that enable efficient representation of the information in the full-fault dictionary. This approach is orthogonal to compaction, which has achieved storage savings by removing output information.

Matrix Dictionary (42). Full-fault dictionaries need to store output information corresponding to each vector and fault pair. Conventionally, they have been stored using a matrix representation. For a circuit with v vectors, o outputs, and f faults, the size of the matrix dictionary is v·o·f bits for combinational circuits and 2·v·o·f bits for sequential circuits.
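A minimal sketch of how such a matrix dictionary is used at diagnosis time is given below, with the stored responses taken from Fig. 16(a); the data structures and function names are illustrative only. The observed faulty responses are compared against each modeled fault's stored signature, and the matching faults become the diagnosis candidates.

```python
# The matrix dictionary of Fig. 16(a): for each modeled fault, the response
# (one bit per primary output) stored for each test vector.
DICTIONARY = {
    0: {'v1': '00', 'v2': '11'},
    1: {'v1': '00', 'v2': '11'},
    2: {'v1': '01', 'v2': '01'},
    3: {'v1': '00', 'v2': '10'},
    4: {'v1': '01', 'v2': '00'},
    5: {'v1': '10', 'v2': '11'},
}

def candidates(observed):
    """Static (cause-effect) lookup: return the modeled faults whose stored
    signature exactly matches the observed faulty responses."""
    return [f for f, sig in DICTIONARY.items() if sig == observed]

# The example from the text: responses 10 on v1 and 11 on v2 point to fault 5.
print(candidates({'v1': '10', 'v2': '11'}))   # -> [5]
```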
List Dictionary (31). List-based dictionaries have been proposed as an alternative to the matrix representation (31). The list dictionary records only information corresponding to detections.

Tree-Based Fault-Dictionary Compaction and Representation. Diagnostic experiment trees (32,33,43) are powerful tools for modeling the information corresponding to a diagnostic experiment. Diagnostic experiment trees are labeled trees; hence the dictionary storage problem can be reduced to a labeled-tree encoding problem. Two labeled trees that were used to represent the diagnostic experiment are shown in Figs. 16(b) and 16(c).

Definition 1 [Vector-Based Diagnostic Experiment Tree T_V(V, E)]. A diagnostic experiment tree in which each level represents the application of a test vector and in which each edge e ∈ E(T_V) is associated with a list of outputs O(e) that is the set of all the primary outputs of the circuit is called a vector-based diagnostic experiment tree.

Definition 2 [Output-Based Diagnostic Experiment Tree T_O(V, E)]. A diagnostic experiment tree in which each level represents a (test vector, output) pair rather than a test vector, and in which each edge e ∈ E(T_O) is associated with a single primary output of the circuit is called an output-based diagnostic experiment tree.

Example. Figures 16(b) and 16(c) show the vector-based and output-based diagnostic experiment trees corresponding to the full-fault dictionary shown in the matrix format in Fig. 16(a). The information embedded in the vector-based diagnostic experiment tree is fully exploited to identify output sequences that may be eliminated to produce highly compact dictionaries even while they retain high diagnostic resolution with respect to modeled faults. The compact storage structures developed for storing the information identified to be useful provide compaction of up to 2 orders of magnitude (32). For full-fault-dictionary representation, it is shown that both of the labeled trees can be efficiently represented by disjointly storing the label information and the underlying unlabeled tree. The vector-based tree is encoded by the use of a compact binary code, while the regular structure of the output-based tree is exploited to provide a spectrum of eight alternative representations for the full-fault dictionary. It is worth noting that the currently known list and matrix formats arise as special cases in this framework. The results give some of the best currently known storage requirements for full-fault-dictionary representation (33).

Dynamic Diagnosis. Dynamic diagnosis techniques analyze the output responses produced by the failed chip at diagnosis time with the possible use of diagnostic fault simulation to derive a set of failures that best explain the set of observed responses. The approach does not require the storage of any precomputed information. We present a brief overview of dynamic diagnosis research with emphasis on work targeting large, practical circuits.

The Deduction Algorithm (42). This analysis processes the response obtained from the faulty unit to determine the possible stuck-at faults that can generate that response, based on deducing internal values in the unit under test (UUT). Any line for which both 0 and 1 values are deduced can be neither s-a-0 (stuck-at-0) nor s-a-1 (stuck-at-1) and is identified as fault-free. Faults are located on some of the lines that cannot
be proved normal. Internal values are computed by the deduction algorithm, which implements a line-justification process the primary goal of which is to justify all the values obtained at the POs (primary outputs), given the tests applied at the PIs (primary inputs). Backtracking is used either to recover from incorrect decisions or to generate all possible solutions. However, no results are available from this work for circuits of practical size. The Pair-Analysis Approach (34). In contrast to other techniques, this work considers pairs of vectors rather than single vectors. This gives the method an additional capability to encode polarity of different paths in the circuit by applying transitions on a limited number of inputs. The primary claim in this paper is that by the use of this technique, all faults can be diagnosed to their equivalence classes. This work is applicable only to combinational circuits. Sensitizing Input Pairs (45). A technique that has some similarity to the pair-analysis approach has been recently proposed. This is the first work that successfully provided analysis-based solutions to nontrivial sequential circuits. However, like other analysis techniques, it is still not possible to apply this technique to large circuits. Full-Scan Diagnosis Algorithms (35,36). This work targets full-scan designs. The heart of this work lies in an efficient vector parallel fault simulator that rapidly reduces the number of candidate faults based on the faulty responses and the expected failures due to the fault. Modeled Fault Simulation (38,46,47). A common dynamic diagnosis strategy that has been used to diagnose large circuit defects is to obtain expected output responses by the use of modeled fault simulation. However, due to the excessive fault-simulation costs, the time taken to perform the diagnosis may be large for repeated diagnosis of large circuits. Path Tracing (PT) (40). A strategy for dynamic diagnosis with reduced diagnostic fault simulation time performs fault dropping during diagnosis time with the help of critical path tracing. Faults are dropped when it is decided that they are on lines that do not influence any faulty output lines. Example Dynamic Diagnosis. An example of a diagnosis decision arrived based on path tracing is shown in Fig. 17. The output of gate e fails. The path trace starts from this output and proceeds to the inputs. Because gate e has two controlling inputs, the trace continues from one of them. Node B, which is part of the bridging fault A@B (node A shorted with node B), is included (along with other candidates on the paths traced) in the candidate set of faulty nodes by the path– trace procedure. Expert Systems and Artificial Intelligence Techniques. Diagnosis has been attempted in rule-based expert systems that utilize encoded empirical knowledge obtained from human experts. These systems are not entirely deductive and bear
Figure 17. Path trace from failing output.
some resemblance to the fault-dictionary approach. In contrast, some artificial intelligence researchers have proposed techniques that are based on more detailed structural and behavioral models of the system being diagnosed. However, the most important problem with such techniques is that they target only small circuits and do not attempt to tackle the problems that arise with more elaborate designs. Integrated Diagnosis. The prohibitive size of fault dictionaries and the large run times required for dynamic diagnosis have given rise to integrated fault-diagnosis techniques, in which the focus is on storing a limited amount of essential information and utilizing this information effectively along with analysis or simulation at run time. We now provide an overview of this research. Dynamic Dictionaries. This approach involves two stages (31,41). The first stage identifies a small group of candidate faults, and then a small part of the full-fault dictionary is generated dynamically in the second stage for just those faults and for only a few of the vectors that detect them. Hence, two-stage fault isolation avoids the static cost normally associated with full dictionaries and most of the computation time that is required in a pure dynamic technique, while still providing most of the resolution. The limited dictionary used in the first stage of the two-stage process is a very small dictionary that can be generated by limited fault simulation. The diagnosis algorithm lists all candidate faults that have been observed by comparing observed errors with records in the limited dictionary. Then, in a second stage, a set of vectors is fully simulated against candidate faults, and a matching algorithm ranks all faults. Experimental results were provided for a variety of benchmark circuits and industrial implementations. It was also shown that the loss of resolution incurred was not significant. State-Information-Based Diagnosis. State-information-based diagnosis solves a crucial problem with traditional diagnostic techniques based on storage (48). Typically, such techniques store only primary output-based information, offering only a black-box view of the circuit and thus little diagnostic flexibility. This technique provides a solution by storing information corresponding to the internal nodes in the circuit, namely the state nodes. The selective storage of state information has been shown to improve the time for diagnostic fault simulation significantly. Experimental results on large circuits were presented. Level-Information-Based Diagnosis. Precomputed information tracking the diagnostic classes at each level of the diagnostic experiment tree, specifically targeting a reduction in the fault simulation costs to be incurred at diagnosis time, is the key contribution of this work (49). Fault simulation costs are modeled in terms of computations associated with each (fault, vector) pair. Tools for Diagnosis: Diagnostic Fault Simulation and Test Generation Diagnostic Fault Simulation. Diagnostic fault simulation is useful for determining the diagnostic capability of a given test set and for generating fault dictionaries and diagnostic information specific to a given test set. Diagnostic capability is reported using various diagnostic measures. Diagnostic test generation involves generating tests to distinguish between
fault pairs. Efficient generation of diagnostic test vectors can be assisted by a fast diagnostic fault simulator. Typically, diagnostic fault-simulation techniques have focused on simulation based on stuck-at faults and the developed measures are also for the same models. Rapid techniques are available both for combinational and sequential circuits, and we review the more general case of sequential circuits here. During fault simulation of a circuit starting from an unknown state, a good or faulty sequential circuit can produce a 0, 1, or X on each primary output for each test vector input, where X is an unknown value whose actual binary value depends on the initial state of the machine. If fault simulation indicates that a fault f i produces an output of 0 and another fault f j produces an output of 1 on the same primary output for the same input, then the faults f i and f j are said to be distinguished. However if a fault f i produces an output of 0 or 1 and another fault f j produces an output of X, then it is possible that the faults f i and f j may not be distinguished. Therefore, the pessimistic assumption is made that an output of 1 or 0 is indistinguishable (with respect to this test set) from an output of X. Diagnostic Measures. Camurati et al. (50) proposed two diagnostic measures. Diagnostic resolution (DR) is the fraction of fault pairs distinguished by a test set. Diagnostic power (DP) is the fraction of faults that are fully distinguished. A fault is fully distinguished if the test set distinguishes it from every other fault in the fault list. A third measure (51), which gives a more complete picture, is to identify sets of faultequivalence classes and report the number of these classes by size; this measure is applicable to combinational circuits and sequential circuits that start from a known reset state. This is extended to indistinguishable fault classes (38) to account for unknown values occurring at the outputs of sequential circuits during simulation. Another measure, the diagnostic expectation (30), is the average of indistinguishability class sizes over all faults. It is assumed that all faults are equally likely to occur. Distinguishability Matrix Approach. Early methods for performing diagnostic fault simulation for moderately large circuits (38) used a distinguishability matrix. The distinguishability matrix is an f ⫻ f matrix, where f is the number of faults. An entry of 1 indicates that the two faults specified at the intersection of the row and column are distinguished by some sequence of test vectors in the test set. It requires O( f 2) space, and the time complexity is O(vof 2), where v is the number of vectors in the set and o is the number of outputs in the circuit. List-Based Methods. Ryan, Fuchs, and Pomeranz (30) mention that a more efficient way to represent faults that are indistinguishable by a given test set is by using lists of faults. Jou and Chen (52) and Chen and Jou (53) represent pairs of indistinguishable faults using lists. This representation is a compact implementation of the distinguishability matrix. It is equivalent to storing only those entries of the distinguishability matrix with values of 0. Here, faults may appear in multiple lists. The indistinguishability relationship between all pairs of faults can be represented as an undirected graph, with the faults as nodes and the indistinguishability relationships between them as edges. 
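As a rough illustration of these ideas (a hypothetical sketch, not the algorithm of any of the cited tools), the fragment below applies the pessimistic three-valued rule to a handful of invented fault signatures and then computes diagnostic resolution and diagnostic power:

```python
from itertools import combinations

# Hypothetical fault signatures: for each fault, the value observed at a
# primary output per test vector ('0', '1', or 'X' for unknown).  The fault
# names and responses are invented purely for illustration.
signatures = {
    "f1": "0101",
    "f2": "0X01",
    "f3": "1101",
    "f4": "0101",
}

def distinguished(resp_a, resp_b):
    """Two faults are distinguished only if, in some position, one produces 0
    and the other 1.  A 0 or 1 against an X is pessimistically treated as
    indistinguishable, as is X against X."""
    return any(a != b and a != "X" and b != "X" for a, b in zip(resp_a, resp_b))

faults = sorted(signatures)
pairs = list(combinations(faults, 2))
dist_pairs = [(a, b) for a, b in pairs
              if distinguished(signatures[a], signatures[b])]

# Diagnostic resolution: fraction of fault pairs distinguished by the test set.
DR = len(dist_pairs) / len(pairs)

# Diagnostic power: fraction of faults distinguished from every other fault.
fully = [f for f in faults
         if all(distinguished(signatures[f], signatures[g])
                for g in faults if g != f)]
DP = len(fully) / len(faults)

print(f"DR = {DR:.2f}, DP = {DP:.2f}")
# With the signatures above, f2's X makes it indistinguishable from f1 and f4,
# and f1/f4 respond identically, so only f3 is fully distinguished
# (DR = 0.50, DP = 0.25).
```

Storing every such pairwise verdict explicitly is exactly what leads to the quadratic cost of the distinguishability matrix; the list- and class-based representations discussed next aim to avoid materializing it.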
Previous approaches essentially represent this graph as an adjacency matrix (38) or as incidence lists (52,53).
Later representations (39) avoided explicit storage of the indistinguishability relationship between all pairs of faults, but represent the indistinguishability relationship between classes of faults. Each fault is present in only one of the classes. This makes the representation more compact than those previously proposed (38,52,53). Although the worst-case space complexity is still O( f 2), experimental results demonstrated that the average memory usage is almost linear for the benchmark circuits. The representation also reduces the number of output response comparisons between faults and hence speeds up the simulation process. Diagnostic Test-Pattern Generation. Diagnostic automatic test pattern generation (DATPG) is critical to performing efficient fault diagnosis. In diagnostic test generation, the goal is to find a test sequence such that the circuit produces a different response under one fault than it does under another. Such techniques have been primarily targeted towards stuckat faults and for combinational circuits, although recent work has made progress towards both unmodeled faults and sequential circuits. The diagnostic test-generation problem for sequential circuits is more acute than its combinational circuit counterpart mainly because of multiple time frames that need to be handled. The problem is compounded by the unknown values in state elements; these unknown values may increase the number of fault pairs that need to be explicitly considered by a diagnostic test generator. Combinational Circuits. Work on DATPG for combinational circuits has been developed based on both functional (e.g., BDD-based) and structural techniques (PODEM-based) (50,54–57). DIATEST (56) is a combinational diagnostic testgeneration program that was developed based on the conversion of a conventional test generator into a diagnostic test generator. Complete results (with no aborted fault pairs) were provided on moderate-sized (on the largest standard public benchmark circuits) combinational circuits. Since equivalence identification, much like redundancy identification, is a computationally intensive operation in the DATPG process, techniques to identify combinational equivalences (57–61) have been proposed. Sequential Circuits. Formal techniques have also been used for sequential circuit diagnostic test generation (62,63); however, the drawbacks of these approaches are the assumption of a fault-free reset state and the inability to handle large circuits due to memory requirement problems. Simulationbased diagnostic test generation algorithms for large sequential circuits have also been presented (64), but there is a lack of indistinguishability identification. Later, a powerful method to modify a conventional sequential test generator into a sequential diagnostic test generator has been proposed (65). The method utilizes circuit netlist modification along with a forced 0/1 or 1/0 (66) value at a primary input in the modified circuit. Indistinguishability. There is also evidence (62,63,65,67,68) indicating that a main burden of diagnostic test generation is in proving indistinguishability. Another difficulty in solving this problem arises in sequential circuits because the terms distinguishable, indistinguishable, detectable, and undetectable take on different meanings with different test methodologies [multiple observation time (69,70) or conventional, gatelevel test generation with single observation time and threevalued simulation (42)]. Methods to characterize these rela-
tions and identify them implicitly (without explicitly making a call to the diagnostic engine for each relation) have simplified the computational task of diagnostic test-pattern generation (67,68). Unmodeled Fault Diagnosis The fault model used to predict defect behavior plays an important role in diagnosis (47). In order for a fault model to be valid for diagnosis it should accurately model the corresponding defect, and such defects should occur in real circuits (71). It is worth noting that static (cause–effect) techniques are perhaps more dependent on the fault models than dynamic (effect–cause). Based on Modeled Faults. A typical approach for diagnosing unmodeled faults is to use the information available from the modeled faults in a controlled manner to make conclusions about the presence of unmodeled faults. Issues concerning accuracy and the time required to perform diagnosis govern the kind of matching algorithm being used. These schemes can range from dropping all faults whose response shows a definite mismatch with the observed faulty response (applicable to pure modeled fault diagnosis; fast) (31,35,36) to dropping few or no faults with the use of scoring schemes to obtain a set of candidate faults (applicable to arbitrary unmodeled fault diagnosis; slow) (31,38). Schemes studying the use of various combinations of matching schemes and fault models have also received research attention, and information corresponding to vectors showing failures and vectors showing no failures has been used to obtain separate matching parameters (31,47,72). This approach has been suggested to attain better diagnosis for unmodeled faults. An intuitive explanation for the better accuracies obtained using the separate handling of the failing (failures observed) and passing (good values observed) vectors is given from the fact that obtaining separate parameters makes it possible to explain observed failures as opposed to other matching schemes in which matching of an error is not distinguished from the matching of a good value. Bridging Fault Diagnosis. A common failure mode in current complementary metal-oxide semiconductor (CMOS) technologies is that of short circuits. Thus, many failures can be modeled as bridging faults and they have hence received extra attention. Techniques for diagnosing bridging faults have been primarily targeted at combinational circuits because of the large computational overheads associated with the simulation of bridging faults and the lack of a clear understanding of the complete effects of sequential bridging faults. Even for combinational circuits, only a limited set of realistic bridging faults that are extracted from the layout (73) are typically used because of the prohibitively large numbers of all possible bridging faults, even for small circuits. An additional complicating factor for these faults is that a short circuit (that may produce an intermediate voltage value) may be interpreted differently by logic gates downstream from the bridged lines due to variable input logic thresholds. This is known as the Byzantine generals problem. Several techniques have been proposed for bridging-fault diagnosis in combinational circuits. The most popular approaches are ones that use stuck-at dictionaries to diagnose bridging faults. The reason for this is that this avoids compu-
tationally intensive bridging-fault simulation. Millman, McClusky, and Acken (74) presented an approach to diagnose bridging faults using stuck-at dictionaries. Chess et al. (46) and Lavo, Larrabee, and Chess (72) improved on this technique. These techniques enumerate bridging faults and are hence constrained to use a reduced set of bridging faults extracted from the layout. Furthermore, they need to either store a stuck-at fault dictionary or perform stuck-at fault simulation. Chakravarty and Liu (75) proposed a technique based on Iddq (quiescent current) using only good circuit simulation. Chakravarty and Gong (76) described a voltage-based algorithm that used the wired-AND (wired-OR) model. This work (76) implicitly considers all bridging faults. It is worth noting that wired-AND and wired-OR models that are assumed work only for technologies for which one logic value is always more strongly driven than the other. A deductive technique for combinational circuits that does not explicitly simulate faults has been proposed. However, this technique is not complete because it only reduces the candidate set of bridging faults and may end up with a potentially large set of candidates. BIBLIOGRAPHY 1. A. Gupta, Formal hardware verification methods: A survey, Formal Methods Syst. Des., 1 (2/3): 151–238, October, 1992. 2. S. B. Aker, Binary decision diagrams, IEEE Trans. Comput., C27: 509–516, 1978. 3. R. E. Bryant, Graph-based algorithms for boolean function manipulation, IEEE Trans. Comput., C-35: 667–691, 1986. 4. S. Minato, N. Ishiura, and S. Yajima, Shared binary decision diagram with attributed edges for efficient boolean function manipulation, Proc. 27th ACM/IEEE Des. Autom. Conf., 1990, pp. 52–57. 5. S. J. Friedman and K. J. Spowit, Finding the optimal variable ordering for binary decision diagrams, Proc. 24th ACM/IEEE Des. Autom. Conf., 1987, pp. 348–356. 6. M. Fujita, H. Fujisawa, and N. Kawato, Evaluation and implementation of boolean comparison method based on binary decision diagrams, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’88), 1988, pp. 6–9. 7. N. Ishiura, H. Sawada, and S. Yajima, Minimization of binary decision diagrams based on exchanges of variables, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’91), 1991, pp. 472–745. 8. T. Kakuda, M. Fujita, and Y. Matsunaga, On variable ordering of binary decision diagrams for the application of multi-level logic synthesis, Proc. Eur. Des. Autom. Conf. (EDAC ’91) , 1991, pp. 50–54. 9. S. Malik et al., Logic verification using binary decision diagrams in a logic synthesis environment, Proc. IEEE Int. Conf. Comput.Aided Des. (ICCAD ’88), 1988, pp. 6–9. 10. S. Minato, Minimum-width method of variable ordering for binary decision diagrams, IEICE Jpn. Trans. Fundam., E75-A (3): March, 1992. 11. R. Rudell, Dynamic variable ordering for ordered binary decision diagrams, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’93), 1993. 12. K. S. Brace, R. L. Rudell, and R. E. Bryant, Efficient implementation of a bdd package, Proc. 27th ACM/IEEE Des. Autom. Conf., 1990, pp. 40–45. 13. J. C. Madre and J. P. Billon, Proving circuit correctness using formal comparison between expected and extracted behavior, Proc. 25th ACM/IEEE Des. Autom. Conf., 1988, pp. 205–210. 14. D. Brand, Verification of large synthesized designs, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’93), 1993.
15. W. Kunz, Hannibal: An efficient tools for logic verification based on recursive learning, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’93), 1993. 16. J. Jain, R. Mukherjee, and M. Fujita, Advanced verification techniques based on learning, Proc. ACM/IEEE Des. Autom. Conf., 1995. 17. A. Kuehlmann and F. Krohm, Equivalence checking using cuts and heaps, Proc. 34th ACM/IEEE Des. Autom. Conf., 1997. 18. O. Coudert and J. C. Madre, A unified framework for the formal verification of sequential circuits, Proc. IEEE Int. Conf. Comput.Aided Des. (ICCAD ’90), 1990, pp. 126–129. 19. E. M. Clarke and E. A. Emerson, Automatic verification of finitestate concurrent systems using temporal logic specification, ACM Trans. Programm. Lang. Syst., 8 (2): 244–263, 1986. 20. M. Fujita, H. Tanaka, and T. Moto-oka, Logic design assistance with temporal logic, Proc. IFIP WG10.2 Int. Conf. Hardw. Descript. Lang. Their Appl., 1985. 21. J. R. Burch, et al., Sequential circuit verification using symbolic model checking, Proc. 27th ACM/IEEE Des. Autom. Conf., 1990, pp. 46–51. 22. J. R. Burch et al., Symbolic model checking: 1020 states and beyond, Proc. 5th Annu. IEEE Symp. Logic Comput. Sci., 1991. 23. R. P. Kurshan, Automata-theoretic verification of coordinating processes, Lect. Notes Comput. Sci., 430: 414–453, 1990. 24. H. Touati et al., Implicit state enumeration of finite state machines using bdds, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’90), 1990, pp. 130–133. 25. R. E. Tulloss, Size optimization of fault dictionaries, Proc. Int. Test Conf., 1978, pp. 264–265. 26. R. E. Tulloss, Fault dictionary compression: Recognizing when a fault may be unambiguously represented by a single failure detection, Proc. Int. Test Conf., 1980, pp. 368–370. 27. J. Richman and K. R. Bowden, The modern fault dictionary, Proc. Int. Test Conf., 1985, pp. 696–702. 28. V. Ratford and P. Keating, Integrating guided probe and fault dictionary: An enhanced diagnostic approach, Proc. Int. Test Conf., 1986, pp. 304–311. 29. I. Pomeranz and S. M. Reddy, On the generation of small dictionaries for fault location, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’92), 1992, pp. 272–279. 30. P. G. Ryan, W. K. Fuchs, and I. Pomeranz, Fault dictionary compression and equivalence class computation for sequential circuits, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’93), 1993, pp. 508–511. 31. P. G. Ryan, Compressed and Dynamic Fault Dictionaries for Fault Isolation, Tech. Rep. UILU-ENG-94-2234, Center for Reliable and High-Performance, Urbana-Champaign: Computing, Univ. of Illinois, 1994. 32. V. Boppana and W. K. Fuchs, Fault dictionary compaction by output sequence removal, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’94), 1994, pp. 576–579. 33. V. Boppana, I. Hartanto, and W. K. Fuchs, Full fault dictionary storage based on labeled tree encoding, Proc. VLSI Test Symp., 1996, pp. 174–179. 34. H. Cox and J. Rajski, A method of fault analysis for test generation and fault diagnosis, IEEE Trans. Comput.-Aided Des., 7: 813–833, 1988. 35. J. A. Waicukauski et al., Fault simulation for structured VLSI, VLSI Syst. Des., 6 (12): 20–32, 1985. 36. J. A. Waicukauski and E. Lindbloom, Failure diagnosis of structured VLSI, IEEE Des. Test Comput., 6(4): 49–60, 1989. 37. M. Abramovici and M. A. Breuer, Fault diagnosis based on effectcause analysis, Proc. 24th ACM/IEEE Des. Autom. Conf., 1987, pp. 69–76.
DESIGN VERIFICATION AND FAULT DIAGNOSIS IN MANUFACTURING 38. E. M. Rudnick, W. K. Fuchs, and J. H. Patel, Diagnostic fault simulation of sequential circuits, Proc. Int. Test Conf., 1992, pp. 178–186. 39. S. Venkataraman et al., Rapid diagnostic fault simulation at stuck-at faults in sequential circuits using compact lists, Proc. 32nd ACM/IEEE Des. Autom. Conf., 1995, pp. 133–138. 40. S. Venkataraman, I. Hartanto, and W. K. Fuchs, Dynamic diagnosis of sequential circuits based on stuck-at faults, Proc. VLSI Test Symp., 1996, pp. 198–203. 41. P. Ryan, S. Rawat, and W. K. Fuchs, Two-stage fault location, Proc. Int. Test Conf., 1991, pp. 963–968. 42. M. Abramovici, M. A. Breuer, and A. D. Friedman, Digital System Testing and Testable Design, New York: Computer Science Press, 1990. 43. Z. Kohavi, Switching and Finite Automata Theory, New York: McGraw-Hill, 1978. 44. F. C. Hennie, Finite-State Models for Logical Machines, New York: Wiley, 1968. 45. N. Yanagida, H. Takahashi, and Y. Takamatsu, Multiple fault diagnosis in sequential circuits using sensitizing sequence pairs, Proc. Int. Symp. Fault Tolerant Comput., 1996, pp. 86–95. 46. B. Chess et al., Diagnosing of realistic bridging faults with stuckat information, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’95), 1995, pp. 185–192. 47. R. C. Aitken and P. C. Maxwell, Better models or better algorithms? Techniques to improve fault diagnosis, Hewlett-Packard J., February, 46 (1): 110–116, 1995. 48. V. Boppana, I. Hartanto, and W. K. Fuchs, Fault diagnosis using state information, Proc. Int. Symp. Fault Tolerant Comput., 1996, pp. 96–103. 49. V. Boppana and W K. Fuchs, Integrated fault diagnosis targeting reduced simulation, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD ’96), 1996, pp. 267–271. 50. P. Camurati et al., A diagnostic test pattern generation algorithm, Proc. Int. Test Conf., 1990, pp. 52–58. 51. K. Kubiak et al., Exact evaluation of diagnostic test resolution, Proc. 29th ACM/IEEE Des. Autom. Conf., 1992, pp. 347–352. 52. J. M. Jou and S.-C. Chen, A fast and memory-efficient diagnostic fault simulation for sequential circuits, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD’94), 1994, pp. 723–726. 53. S.-C. Chen and J. M. Jou, Diagnostic fault simulation for synchronous sequential circuits, IEEE Trans. Comput.-Aided Des., 16: 299–308, 1997. 54. P. Camurati et al., Diagnostic oriented test pattern generation, Proc. Eur. Des. Autom. Conf., (ECAD’90), 1990, pp. 470–474. 55. J. Savir and J. P. Roth, Testing for, and distinguishing between failures, Proc. Int. Symp. Fault Tolerant Comput., 1982, pp. 165–172. 56. T. Gru¨ning, U. Mahlstedt, and H. Koopmeiners, DIATEST: A fast diagnostic test pattern generator for combinational circuits, Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD’91), 1991, pp. 194–197. 57. I. Hartanto, V. Boppana, and W. K. Fuchs, Diagnostic fault equivalence identification using redundancy information & structural analysis, Proc. Int. Test Conf., 1996, pp. 294–302. 58. E. J. McCluskey and F. W. Clegg, Fault equivalence in combinational logic networks, IEEE Trans. Comput., C-20: 1286–1293, 1971. 59. A. Goundan and J. P. Hayes, Identification of equivalent faults in logic networks, IEEE Trans. Comput., C-29: 978–985, 1980. 60. B. K. Roy, Diagnosis and fault equivalences in combinational circuits, IEEE Trans. Comput., C-23: 955–963, 1974. 61. A. Lioy, Advanced fault collapsing, IEEE Des. Test Comput., 9 (1): 64–71, 1992.
62. G. Cabodi et al., An approach to sequential circuit diagnosis based on formal verification techniques, J. Electron. Test.: Theory Appl., 4: 11–17, 1993. 63. K. E. Kubiak, Symbolic Techniques for VLSI Test and Diagnosis, Tech. Rep. UILU-ENG-94-2207, Urbana-Champaign: Center for Reliable and High-Performance Computing, University of Illinois, 1994. 64. G. Cabodi et al., GARDA: A diagnostic ATPG for large synchronous sequential circuits, Proc. Eur. Des. Test Conf., 1995, pp. 267–271. 65. I. Hartanto et al., Diagnostic test pattern generation for sequential circuits, Proc. VLSI Test Symp., 1997, pp. 196–202. 66. J. P. Roth, W. G. Bouricius, and P. R. Schneider, Programmed algorithms to compute tests to detect and distinguish between failures in logic circuits, IEEE Trans. Electron. Comput., EC-16: 567–579, 1967. 67. V. Boppana, I. Hartanto, and W. K. Fuchs, Characterization and implicit identification of sequential indistinguishability, Proc. Int. Conf. VLSI Des., 1997, pp. 376–380. 68. V. Boppana, State information-based solutions for sequential circuit diagnosis and testing, Tech. Rep. CRHC-97-20, Ph.D. thesis, Center Reliable High-Performance Comput., Univ. of Illinois at Urbana-Champaign, 1997. 69. I. Pomeranz and S. M. Reddy, The multiple observation time test strategy, IEEE Trans. Comput.-Aided Des., 40: 627–637, 1992. 70. I. Pomeranz and S. M. Reddy, Classification of faults in synchronous sequential circuits, IEEE Trans. Comput., 42: 1066–1077, 1993. 71. R. C. Aitken, Finding defects with fault models, Proc. Int. Test Conf., 1995, pp. 498–505. 72. D. B. Lavo, T. Larrabee, and B. Chess, Beyond the byzantine generals: Unexpected behavior and bridging fault diagnosis, Proc. Int. Test Conf., 1996, pp. 611–619. 73. A. Jee and F. J. Ferguson, Carafe: An inductive fault analysis tool for CMOS VLSI circuits, Proc. VLSI Test Symp., 1993, pp. 92–98. 74. S. D. Millman, E. J. McCluskey, and J. M. Acken, Diagnosing CMOS bridging faults with stuck-at fault dictionaries, Proc. Int. Test. Conf., 1990, pp. 860–870. 75. S. Chakravarty and M. Liu, Algorithms for current monitoring based diagnosis of bridging and leakage faults, Proc. 29th ACM/ IEEE Des. Autom. Conf., 1992, pp. 353–356. 76. S. Chakravarty and Y. Gong, An algorithm for diagnosing twoline bridging faults in CMOS combinational circuits, Proc. 30th ACM/IEEE Des. Autom. Conf., 1993, pp. 520–524.
MASAHIRO FUJITA VAMSI BOPPANA Fujitsu Labs of America
ELECTRICAL AND TIMING SIMULATION
Chandu Visweswariah, IBM Thomas J. Watson Research Center, Yorktown Heights, NY
Figure 1. Qualitative depiction of the relative accuracy and efficiency of different types of circuit simulators. For switch-level and logic simulators, there is no notion of timing accuracy.
In the early days of integrated circuits, a discrete ‘‘breadboard’’ version of the circuit was constructed to test the design before making the first chip. In modern practice, verification of the functionality and performance of a circuit is carried out by means of computer-aided design (CAD) software. Circuit simulation is an essential step before subjecting an integrated circuit design to a costly manufacturing process. This article will introduce the reader to CAD tools used for electrical and timing analysis of circuits, with particular emphasis on digital integrated circuits. The tools model circuits and circuit elements, typically by a set of equations, and then solve the resulting equations to predict the circuit’s behavior. Simulation is used to verify a circuit’s functionality and performance, to study design alternatives, and to optimize circuits. Circuit simulation is a numerically intensive task and the huge amount of simulation required to verify all the details of a complex integrated circuit far outstrips the capability of state-of-the-art simulation algorithms. Although full-chip or exhaustive simulation is rarely feasible, circuit simulation is an essential step in characterizing, designing, and optimizing smaller blocks of circuitry, packages, and discrete circuits. Perhaps the best-known circuit simulator is SPICE (1). In SPICE, electronic devices are modeled by accurate nonlinear equations and the resulting circuit equations are solved using numerical methods. Although the device models may contain inaccuracies and the numerical algorithms solve the equations only to a predetermined accuracy, this type of electrical simulation will be referred to as ‘‘exact simulation’’ in the rest of this article. Unfortunately, exact simulation cannot be applied to large circuits because the computer resources required for such simulation quickly become inordinate. In an effort to simulate much larger circuits, ‘‘timing simulators’’ were developed. These simulators make approximations in order to speed up the analysis. Often they apply only to digital circuits. They run two orders of magnitude faster than exact simulators, but at reduced accuracy (typically producing timing results within 10% of their exact counterparts). Switch-level and logic simulators sacrifice accuracy and the notion of circuit timing to gain even larger speed-ups in simulation efficiency. Figure 1 shows qualitatively the relative accuracy and efficiency of the different types of simulators. Each type of simulation has a practical limit on the size of circuit that can be accommodated, beyond which the memory or computer run time requirements are exorbitant. For example, it may be possible to simulate a circuit macro containing tens of thousands of transistors using exact simulation methods. In the context of a microprocessor, a full adder may be amenable
to exact simulation. However, anything larger like an instruction unit or memory controller can only be handled by timing or switch-level simulation. At the full-chip level, logic simulation is often the only practical alternative. On an entire multiprocessor system, logic simulation is barely practical for a limited number of cycles of simulation, and hardware accelerators are used to speed up the process. All of these types of verification with their concomitant levels of model abstraction are necessary to prevent bugs in first-pass chip hardware. Simulation typically requires specification of the input signals and initial conditions to be applied during the analysis. In the context of digital circuits, static timing analysis can be used to circumvent the dependence on being able to predict the input signals that cause worst-case timing behavior. Graph-tracing algorithms are employed to predict conservative bounds on the slowest (and fastest) delays through a circuit. During the simulation of a circuit, sensitivity analysis can be used to efficiently determine the gradient of circuit response to design parameters such as element values or transistor sizes. These sensitivities can then be used to optimize circuits automatically or improve their yield. Thus a rich variety of techniques has evolved to simulate and optimize circuits at various levels of modeling abstraction.
SPICE

SPICE and SPICE-like simulators (1,2) are the workhorses of circuit design. They accept as input a description of the circuit by means of a netlist file, a set of device models which are equations representing the electrical behavior of electronic devices, and a set of input signals. The simulator then conducts an analysis of the circuit by solving the circuit equations using numerical techniques. Although these methods can typically only handle a circuit that is a small fraction of an entire chip, they are useful for carefully checking, analyzing, and optimizing crucial building blocks that may be replicated several times.
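To make the flow concrete, the following is a minimal sketch of the kind of computation described in the next subsection: nodal equations are written for a small nonlinear circuit, linearized by Newton's method, and solved repeatedly until convergence. The two-node topology and element values are invented for illustration, and the sketch is not SPICE's actual implementation; production simulators add far more elaborate device models, sparse-matrix machinery, and convergence aids.

```python
import numpy as np

# Invented example circuit: a 3 V source drives node 1 through R1, R2 connects
# node 1 to node 2, and a diode (Is, Vt) ties node 2 to ground.
Vs, R1, R2, Is, Vt = 3.0, 1e3, 2e3, 1e-14, 0.025

def kcl(v):
    """KCL residuals at nodes 1 and 2 (sum of currents leaving each node)."""
    v1, v2 = v
    i_d = Is * (np.exp(v2 / Vt) - 1.0)          # diode branch relation (BCR)
    return np.array([(v1 - Vs) / R1 + (v1 - v2) / R2,
                     (v2 - v1) / R2 + i_d])

def jacobian(v):
    """Matrix of partial derivatives used to linearize the equations."""
    g_d = Is / Vt * np.exp(v[1] / Vt)           # small-signal diode conductance
    return np.array([[1/R1 + 1/R2, -1/R2],
                     [-1/R2,        1/R2 + g_d]])

v = np.zeros(2)                                  # initial guess for node voltages
for _ in range(100):                             # Newton iterations
    dv = np.linalg.solve(jacobian(v), -kcl(v))   # solve the linearized system
    dv = np.clip(dv, -0.5, 0.5)                  # crude step limiting
    v += dv
    if np.max(np.abs(dv)) < 1e-9:                # converged
        break

print("dc operating point:", v)
```

The crude step limit stands in for the damping and junction-voltage-limiting heuristics that real simulators use to keep Newton's method from diverging on exponential device characteristics.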
Dc, ac and Transient Analyses The simplest type of analysis in SPICE is an operating-point or direct current (dc) analysis. In this analysis, energy-storage elements such as capacitors and inductors are not considered (or, if they exist in the circuit, they are opened and shorted, respectively). Then a set of equations is formulated. All circuits obey three sets of equations. (1) Kirchhoff ’s Current Law (KCL) states that the sum of the currents at each node of a circuit is identically zero at any instant of time. In an n-node, b branch circuit, there are n KCL equations. (2) Kirchhoff ’s Voltage Law (KVL) states that the voltage of each branch is the difference between the node voltages at its terminals, and there are b such equations. Alternatively, KVL states that the algebraic sum of branch voltages in any closed loop of a circuit is identically zero at any instant of time. (3) Finally, each electronic device must obey its Branch Constitutive Relation (BCR), which is the equation that governs its behavior, such as Ohm’s Law for a resistor. Thus we have (2b ⫹ n) equations in (2b ⫹ n) unknowns, the unknowns being b branch voltages, b branch currents and n node voltages. Each equation involves a small subset of the unknowns, and likewise each unknown appears in just a few equations. Hence the equations are said to be sparse. In sparse tableau analysis (STA) (3), all (2b ⫹ n) equations are formulated and solved simultaneously, while maximally exploiting the sparsity of the equations. In modified nodal analysis (MNA) (4), loosely speaking, the BCR and KVL equations are substituted into the KCL equations, to formulate n equations in the n unknown node voltages, thus leading to a more compact set of equations. In tree link analysis (TLA) or hybrid analysis (5), a spanning tree in the graph corresponding to the circuit is chosen. KCL and KVL are written in terms of the fundamental cutsets and fundamental loops [see Appendix A of (6)] of the graph, respectively. Then the circuit equations are formulated with the voltages of the tree branches and the currents of the remaining branches (cotree or link branches) as the basis variables. The term hybrid analysis reflects the fact that some of the basis variables are currents and some voltages. Whichever method of equation formulation is used, a set of nonlinear algebraic equations is obtained. Equations formulated by any of the above methods are solved to obtain the dc solution of the circuit. Newton’s method is first used to linearize the system of equations and sparse LU factorization is employed to solve the resulting linear system of equations. To apply Newton’s method, the Jacobian (matrix of partial derivatives) of the system matrix must be computed and LU factored at each iteration. The iterative method is repeated until convergence is obtained. Dc analysis is at the heart of the various analysis modes offered by exact simulators. In alternating current (ac) analysis, the circuit equations are formulated in the frequency domain, with each electronic device being represented by a linearized model of complex admittance about its operating point. The resulting equations are solved for various choices of frequency by the same means described above. In transient analysis, the simulator must determine the behavior of the circuit in the time-domain. A transient analysis typically begins with a dc operating point analysis to determine the initial conditions. The transient analysis begins
Figure 2. The steps involved in transient simulation: equation formulation and dc analysis (modified nodal analysis) → algebraic, nonlinear ODEs; integration by stepping through time (trapezoidal, backward Euler) → algebraic, nonlinear equations; linearization until convergence (Newton iterations) → algebraic, linear equations; device model evaluation and solution (sparse LU factorization).
from the computed operating point. Energy storage elements like capacitors and inductors contribute ordinary differential equations to the system of circuit equations. Thus the simulator must solve a set of nonlinear differential algebraic equations (DAEs). The first step is integration in the time-domain to convert the equations to a set of nonlinear algebraic equations. Integration is achieved by advancing time by a small interval called the time-step, and integrating the currents through capacitors and the voltages across inductors, using one of several stable numerical integration algorithms. Circuits that exhibit a wide range of time constants, known as stiff circuits, typically require a large number of time steps. The Trapezoidal rule, Backward Euler and Gear’s variable order integration method are popular algorithms for achieving this step. Integration is equivalent to replacing an energy storage element by a suitable ‘‘companion model’’ (6). The resulting nonlinear algebraic equations are solved as in the case of a dc analysis. Then the ‘‘truncation error’’ which results from the Taylor series approximation inherent in the integration algorithm is estimated. If the error is larger than a predetermined tolerance, the time-step is reduced in half and another attempt made. In this fashion, time is marched forward until the required simulation results are produced. Such algorithms are called incremental-in-time algorithms, as opposed to event-driven algorithms which will be discussed in a later section of this article. The steps involved in transient simulation are summarized in Fig. 2. Although exact circuit simulation has existed since the late 1960s (7), it is an ongoing topic of research, particularly as applied to analog and communications circuits. Research into speeding up the simulation (8), computing the dc operating point on tricky analog circuits (9), handling nonlinear circuits in the frequency domain (10), accommodating frequency-dependent elements in the time domain (11) and conducting mixed time-frequency simulations (12) are vigorously pursued research topics. TIMING SIMULATION Exact simulation methods have the advantage of being accurate and general, but are computationally burdensome. For
even modest-sized circuits (say, 50,000 transistors) and modest simulation intervals (say, 1 애s) the memory and run time requirements they place on computers are unacceptable. Timing simulators were invented in an effort to break this simulation bottleneck. In general, they sacrifice accuracy and generality for a gain of about two orders of magnitude in speed (see Fig. 1). The relative timing accuracy of timing simulators is in the 10% range. Typically applied only to transient simulation of digital FET circuits, timing simulators are based on the following concepts: • Repeated evaluation of the nonlinear analytic device models in the inner loop of exact simulators is extremely expensive. Timing simulators seek to simplify the device models, thus sacrificing accuracy for speed. • Exact simulators are incremental in time, thereby being forced to take small time steps if there is activity anywhere in the circuit. Most digital circuits have large subcircuits that are not active at any given time and large subintervals of time when the subcircuits are inactive. Timing simulators attempt to exploit this ‘‘latency’’ or ‘‘multirate nature’’ by employing event-driven algorithms. These algorithms incur computation only in those subintervals of time when the circuit has activity and in only those portions of the circuit that are active. (For a description of relaxation methods, which seek to exploit latency but maintain the accuracy of exact simulators, see CIRCUIT ANALYSIS COMPUTING BY WAVEFORM RELAXATION.) • Simplified linearization or integration techniques are used by timing simulators to gain a speedup over exact simulators. • Timing simulators have compact representations of the network and individual components in order to be able to store and simulate large circuits. • These simulators typically partition the circuit into ‘‘channel-connected components’’ (also called ‘‘strongly connected components’’ or ‘‘DC-connected components’’ in the literature), which are subcircuits consisting of transistors that are source-drain channel-connected, as shown in Fig. 3. The boundary of each channel-connected component consists of either gates of transistors, primary
inputs of the network, primary outputs of the network, power supply, or ground. • Several timing simulators offer variable accuracy. The user can loosen the accuracy requirements in return for faster execution times. There is a wealth of literature on timing simulation and the following two paragraphs present a far-from-exhaustive sampling of the best-known simulators that have been developed. See Chapter 11 of (6) for a review of timing simulation. One of the first timing simulators was MOTIS (13) in which table models were used to store I–V characteristics of transistors. The channel currents of transistors were quickly determined by table look-up. These currents were combined to determine the charging current of the load capacitance of each channel-connected component. Using a secant approximation for the conductance of transistors, the change of voltage at the output of the channel-connected component was computed and propagated to its fanouts in an event-driven fashion. SAMSON (14) is a mixed circuit/logic simulator, which uses event-driven algorithms to exploit latency, but solves each channel-connected component more accurately. In the late 1980s and early 1990s, a series of timing simulators were developed that approximated device and circuit quantities by piecewise approximate functions. For example, E-LOGIC (15) uses nodal analysis, but discretizes node voltages into several ‘‘states.’’ The time at which each node reaches the next state in its discrete set of voltages is computed, and simulation is thus carried out in an event-driven fashion. To simplify device modeling, the MOS transistor is modeled as a current-limited switch for fast event-driven simulation in (16). In SPECS (17), device I–V characteristics are approximated by piecewise constant functions, and again the time needed for each device to reach the next segment boundary in its I–V table is computed. TimeMill and PowerMill (18) are timing simulators employing piecewise approximate modeling and event-driven techniques. In ACES (19), piecewise linear functions are employed to model I–V characteristics, modified nodal analysis equations are written for each channel-connected component, and explicit integration with adaptive control of the time step is used to achieve efficient simulation. These fast but approximate timing simulation algorithms have provided designers of digital ICs with additional simulation and power estimation methods that have rapidly become an integral part of the design methodology of modern memory, microprocessor, and ASIC chips. SWITCH-LEVEL AND LOGIC SIMULATION
Figure 3. An example of a channel-connected component.
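The partitioning into channel-connected components that both timing simulators and the switch-level simulators described next rely on can be sketched as a union-find over source-drain connectivity. The netlist below is invented, and only the internal nets are grouped; a full partitioner would also attach each component's transistors and its boundary (gate, supply, and ground) nodes.

```python
# Gate nets, the supply, and ground act as component boundaries and are not merged.
POWER_NETS = {"vdd", "gnd"}

# (name, gate net, source net, drain net): p1/n1 form an inverter driving out1;
# p2/n2/n3 form a second dc-connected structure with internal node x.
transistors = [
    ("p1", "in",   "vdd",  "out1"),
    ("n1", "in",   "out1", "gnd"),
    ("p2", "out1", "vdd",  "out2"),
    ("n2", "out1", "x",    "out2"),
    ("n3", "en",   "gnd",  "x"),
]

parent = {}

def find(net):
    parent.setdefault(net, net)
    while parent[net] != net:            # union-find with path halving
        parent[net] = parent[parent[net]]
        net = parent[net]
    return net

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

for _, gate, source, drain in transistors:
    find(source)
    find(drain)
    if source not in POWER_NETS and drain not in POWER_NETS:
        union(source, drain)             # merge only across the channel

components = {}
for net in parent:
    if net not in POWER_NETS:
        components.setdefault(find(net), set()).add(net)

print(sorted(map(sorted, components.values())))
# -> [['out1'], ['out2', 'x']]; the gate-only nets 'in' and 'en' are boundaries.
```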
Although rich and important topics in their own right, switchlevel and logic simulation are only briefly addressed in this section for completeness. They apply almost exclusively to digital circuits. In switch-level simulation (20), each transistor is treated as a switch which is either on or off. Additionally, signals are allowed to have one of a discrete number of states, such as 1 (logic high), 0 (logic low), and X (unknown or uninitialized). For example, if the gate of an n-type transistor is high, it is on. From each node of each channel-connected component, all possible conducting paths are traced to the power supply and to ground. Based on heuristics, the
‘‘strength’’ of each path is computed. If a signal has conducting paths exclusively to the power supply or the combined strengths of the paths to the power supply are much larger in magnitude than the strengths of the paths to ground, then the node is assigned a logic high state. Similarly, if the paths to ground are stronger, it is assigned a logic low. If neither of the above is true, the node is assigned an X. Then signal values are propagated to the fanouts of the channel-connected component. Such a computation is repeated for each cycle of simulation after applying new values of the signals at the primary inputs. Switch-level simulation is efficient because its MOSFET model is so simple. Hence large circuits can be simulated for many time cycles. However, switch-level simulators have some fundamental limitations linked to their simplistic modeling of transistors and time. First, they provide little or no timing information. Second, their handling of analog situations like charge-sharing, glitches, or bidirectional signal flow is at best inaccurate. Logic simulation (21) is one level higher in abstraction. In its simplest form, the circuit is modeled as an interconnection of primitive logic gates, and each gate has a precompiled logic behavior. In addition, each gate has a delay model that represents the delay from the arrival of each input to the availability of each output signal. Signal representation is much like switch-level simulators, but logic simulators often have special signal representations for high-impedance states, tristate signals, and so on. The simulation algorithm consists of a simple event-driven or selective-trace mechanism. Primary inputs are first assigned their initial values. Gates with exclusively primary inputs are evaluated and their outputs scheduled for update at an appropriate future time. When the outputs are updated, the fanouts of each of these signals are then evaluated and their outputs scheduled for updating. A simple ‘‘time wheel’’ allows coordination of the queue of evaluation and update events. Thus events are repeatedly scheduled and evaluated until the circuit has been simulated for the required number of cycles. Logic simulation is the backbone of digital system verification and permits the verification of large and complex systems by running simulations for a large number of cycles. In fact, logic simulation is such an important step in system verification that various specialpurpose hardware engines have been built to speed up the logic simulation process (22).
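A minimal event-driven logic simulator, with a dictionary standing in for the time wheel, might look like the following sketch. The gates, delays, and stimulus are invented, and production simulators add multi-valued signals, richer delay models, and much more efficient scheduling.

```python
import collections

GATES = {
    # output net: (function, [input nets], delay)
    "n1": (lambda a, b: 1 - (a & b), ["A", "B"], 2),   # NAND
    "Z":  (lambda a, b: a ^ b,       ["n1", "C"], 3),  # XOR
}
FANOUT = collections.defaultdict(list)
for out, (_, ins, _) in GATES.items():
    for i in ins:
        FANOUT[i].append(out)

values = {"A": 0, "B": 1, "C": 1, "n1": 1, "Z": 0}     # consistent initial state
wheel = collections.defaultdict(list)                  # time -> [(net, value)]

def evaluate(out, now):
    """Evaluate a gate and schedule its output change after the gate delay."""
    func, ins, delay = GATES[out]
    new = func(*(values[i] for i in ins))
    if new != values[out]:
        wheel[now + delay].append((out, new))

wheel[1].append(("A", 1))                              # stimulus: raise A at t=1

for t in range(0, 20):                                 # march through time
    for net, val in wheel.pop(t, []):
        if values[net] == val:
            continue
        values[net] = val
        print(f"t={t}: {net} -> {val}")
        for gate_out in FANOUT[net]:                   # selective trace of fanout
            evaluate(gate_out, t)
```

Running the sketch prints three events (A rises at t=1, n1 falls at t=3, Z rises at t=6), illustrating how computation is incurred only along the fanout of signals that actually change.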
STATIC TIMING ANALYSIS SPICE and the timing simulators described above are all dynamic simulators. Input signals are specified in the time-domain, and circuit quantities are computed, starting from initial conditions, for a given interval of time. However, the simulation is only as good as the selection of input signals (‘‘patterns’’ or ‘‘vectors’’ in digital circuit argot). Digital circuits have numerous paths through them and the simulation verifies the function and timing of only those paths that are sensitized by the input signals. Often in digital circuits, it is required to compute an upper bound on the delay of all paths from the primary input to the outputs, irrespective of input signals. Such an upper bound is computed by means of static simulation, more commonly known as static timing analysis (23), which is used to charac-
Figure 4. Illustration of static timing analysis.
terize the delay of an interconnected set of combinational logic blocks between the flip-flops of a digital circuit. Figure 4 shows a simple circuit consisting of two banks of flip-flops (FF1 and FF2) and four combinational blocks (B1 through B4). In this example, static timing analysis seeks to predict the earliest time at which FF2 can be clocked while ensuring that valid signals are being latched into the flip-flops. Before embarking on static timing analysis, each combinational block’s delay is precharacterized. The delay from each input pin to each output pin is either described as an equation or stored in a look-up table. The delays are functions of such variables as input slope, fanout, and output load capacitance. The precharacterization phase consists of many circuit simulation runs at different temperatures, power levels, loading conditions, and so on. Delay data from these runs are abstracted into a timing model for each block. The actual analysis is carried out in two phases. In the first phase, the delay of each signal is propagated forward through the combinational blocks, using the precharacterized delay models. Thus each signal is labeled with a latest arrival time at which its correct digital value can be guaranteed. In the second phase, the required arrival time is propagated backwards from the target bank of flip-flops, FF2. The required arrival time on a signal is the latest time by which that signal must have its correct value in order for the system to meet timing requirements. The difference between the required arrival time and the actual arrival time of each signal is termed the slack of the signal. After the analysis, all the signals are listed in increasing order of their slack. This analysis yields a wealth of timing information. Clearly, if there is negative slack on any of the signals, the circuit will not meet its performance requirements. The path with the least (perhaps most negative) slack on all of its signals is the critical path. The nodes along this path will all have the same slack. The slacks also contain clues needed to redesign the circuit to cause it to function correctly. The above analysis can be carried out with a minimum and maximum delay for each block. In that case, a set of early and late arrival times can be computed for each signal. The early mode is computed using the best possible case for the arrival of all input signals to a block and the late mode considers the most pessimistic scenario. Then two sets of slacks are computed for each signal. These slacks yield valuable information about the timing properties of the circuit including possible violations of flip-flop setup or hold times, or possible ‘‘fast paths’’ that might cause spurious switching of the flip-flops. Delay models are not always available, particularly for custom-designed circuitry. To overcome this problem, the circuit is partitioned into channel-connected components and each component is analyzed by means of dynamic simulation.
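The two phases described above can be illustrated on a block graph like that of Fig. 4 with a short sketch. The block delays, launch time, and cycle constraint below are invented numbers; a real timer works on pin-to-pin delay arcs, separate rise and fall times, and both early and late modes.

```python
delays = {"B1": 4, "B2": 3, "B3": 5, "B4": 2}     # precharacterized block delays
preds  = {"B1": [], "B2": [], "B3": ["B2"], "B4": ["B1", "B3"]}
succs  = {b: [s for s, p in preds.items() if b in p] for b in delays}
launch = 0          # data launched from FF1 at time 0
cycle  = 12         # required arrival time at FF2 (clock period minus setup)

# Phase 1: latest arrival time at each block output (forward, topological order).
arrival = {}
for b in ["B1", "B2", "B3", "B4"]:
    arrival[b] = max([arrival[p] for p in preds[b]] or [launch]) + delays[b]

# Phase 2: required arrival time at each block output (backward order).
required = {}
for b in ["B4", "B3", "B2", "B1"]:
    required[b] = min([required[s] - delays[s] for s in succs[b]] or [cycle])

slack = {b: required[b] - arrival[b] for b in delays}
for b in sorted(slack, key=slack.get):            # list signals by slack
    print(f"{b}: arrival={arrival[b]} required={required[b]} slack={slack[b]}")
```

With these numbers, B2, B3, and B4 share the smallest slack and therefore lie on the critical path, matching the observation above that all nodes along the critical path have the same slack.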
The dynamic simulation is automatically configured and run under the covers to create a timing model for each channel-connected component on the fly. Depending on whether an early or late mode analysis is being performed, loading and sensitization conditions are chosen for the dynamic simulation to actuate the best-case or worst-case delay through the channel-connected component. Thus static timing analysis is capable of handling circuits without a dependency on precharacterization or being restricted to cells from a standard library. Static timing analysis is a highly efficient method of characterizing the timing of digital logic circuits. It can be used to determine the critical path of a circuit and obtain valuable timing information. However, it assumes that all paths in the circuit are active or sensitizable. In reality, there are certain paths in logic circuits that are not sensitizable because of the nature of the logic or the manner in which the circuit is exercised. These paths are called false paths. Because it ignores the false-path problem and because late mode analysis makes conservative choices, static timing analysis often predicts pessimistic worst-case delays.

Special Considerations in Interconnect Analysis

As on-chip dimensions are scaled down, the fraction of total delay contributed by wiring or interconnect increases. Further, on-chip distribution of global signals like power supply, ground, or clocks is a thorny problem. Hence interconnect analysis has received special attention in recent years. Frequency-domain methods have been increasingly popular and effective for the analysis of large, lumped, linear networks. The basic idea is to create a reduced-order model that captures the salient behavior of the original interconnect up to a required frequency. In the earliest attempts (24), the impulse response of the original circuit in the frequency domain, written as a polynomial in the complex frequency s, was matched to a rational polynomial of order q, where q is much less than the order of the original circuit. By means of Padé approximation (25), the coefficients of the rational polynomial approximation (or equivalently, the poles and residues of the reduced-order model) can be determined. The reduced-order frequency-domain model can be used to predict time-domain responses by means of an inverse Laplace transform. The Padé approximation can be performed via a Lanczos process (26) to preserve numerical accuracy and stability. Several variants on this process have been described in the literature to preserve stability, accuracy and passivity of the reduced-order models [see, e.g., (27)]. Reduced-order models of large interconnect networks can be plugged into an exact or timing simulator and simulated along with nonlinear drivers and receivers more efficiently than if the entire network were simulated at once. Reduced-order models are incorporated either by stamping their contribution directly into the system matrix of the simulator or by first synthesizing them into simplified equivalent circuits. The reduced-order models are also amenable to the sensitivity analysis methods (28) that are the topic of the subsequent section.

SENSITIVITY ANALYSIS

In the design of a circuit, assume that one is interested in some response v (perhaps a voltage) and that a value is cho-
sen for a design parameter p (perhaps a resistance value or transistor size). Then the sensitivity of the response v to the parameter p, ⭸v/⭸p, indicates how much the response will change due to a small perturbation in p. Note that such a ‘‘small change’’ sensitivity or gradient is valid only in a small neighborhood around the nominal value of p. The ability to compute sensitivities efficiently is tremendously useful in circuit design. Sensitivities can be used in tolerance analysis, circuit optimization, computation of periodic steady-state solutions, enhancement of manufacturability, and so on. Note that obtaining approximations to the sensitivities by finite differences is too inefficient since it involves rerunning the simulator with small perturbations of the parameters one at a time. The effect of ‘‘large changes’’ in circuits can also be computed (29,30,6), but such methods are less efficient and therefore rarely used in practice. There are two well-known methods of computing gradients, the direct method (31) and the adjoint method (32). The reader is referred to (6) for a tutorial description of the theory behind these two methods. In both methods, a new circuit is formulated whose solution yields the required sensitivities. The new circuit in both cases is topologically identical to the nominal circuit and LU factors computed during the nominal simulation can be reused to solve the reformulated circuit efficiently. Hence these methods are collectively termed incremental sensitivity analysis. The direct method is based on direct differentiation of the branch constitutive relations (BCRs) that govern the electrical behavior of the elements of the circuit. For example, consider a resistor governed by Ohm’s law v = iR
(1)

Assuming that the parameter of interest is R, we differentiate to obtain

∂v/∂R = i + R ∂i/∂R    (2)

which can be rewritten as

v̂ = i + R î    (3)
where v̂ and î are the unknowns in our sensitivity analysis. Therefore, we replace a resistor in the nominal circuit by a resistor of equal value in the sensitivity circuit, but with a voltage source in series. The value of the voltage source is the current through the resistor in the nominal circuit, i, which is known once the nominal circuit has been solved. We thus replace elements in the nominal circuit by appropriate elements, and solve the resulting sensitivity circuit to determine all the v̂ and î variables simultaneously. The relations thus derived for each element represent a sensitivity circuit of the same topology as the original circuit but perhaps different circuit elements. The solution of this related circuit yields the sensitivity of all measurements with respect to a single parameter. Fortunately, the (linearized) system matrices of the original and sensitivity circuits are the same at each time instant, and hence the cost of LU factorization can be amortized during the analysis of the sensitivity circuit. In the resistor example above, the extra voltage source appears on the right-hand side of the circuit equations
and hence does not change the system matrix of the nominal circuit. Note that the sensitivity circuit must be repeatedly solved as many times as the number of parameters, which is expensive for large numbers of parameters.

The adjoint method is the method of choice for computing gradients of large circuits. While the direct method involves direct differentiation of BCRs, the adjoint method involves differentiation of the matrix of circuit equations. In the circuit context, the adjoint method is best understood as an application of Tellegen's theorem (33). As in the direct method, an associated circuit called the adjoint circuit is formed. The adjoint circuit has the same topology as the nominal circuit, but possibly different electrical elements. Like the direct method, the LU factors of the nominal circuit can be re-used during the adjoint analysis, modulo some time-point mismatch issues, as discussed below. Control is reversed and time run backwards during the adjoint analysis. Finally, the waveforms of the original and adjoint circuits are convolved to yield the required sensitivities. The main advantage of the adjoint method is that it yields the gradients of one function with respect to all the parameters in a single adjoint analysis. However, because time is run backwards, the nominal and adjoint analyses cannot be carried out simultaneously. Further, it is not easy to make the time points of the two analyses coincide, leading to a clumsy time-point mismatch problem. The convolution of waveforms is an additional source of computational and memory overhead in the adjoint method. The single function which forms the sensitivity function in the adjoint method can be any scalar differentiable function of any number of circuit measurements (34). In particular, if the sensitivities are being computed for the purposes of being supplied to a nonlinear optimizer, then it is quite likely that the optimizer formulates an internal scalar merit function. In such a situation, the gradients of the entire merit function can be computed by means of a single adjoint analysis, irrespective of the number of measurements and the number of parameters!

With either method of computing gradients, chain ruling and combining of gradients is essential. For example, when the width of an MOS transistor varies, the associated intrinsic device capacitances as well as the diffusion capacitances on the source and drain vary. The sensitivity of each measurement with respect to these parasitics must be computed, and then chain ruled and combined to obtain the composite sensitivity with respect to all ramifications of the variation of the parameter of interest.

Incremental sensitivity computation in the case of dc and frequency-domain analyses is practical in the context of exact simulators. In the time-domain, the saving and interpolation of Jacobian matrices and the storage of waveforms for convolution in the adjoint method render incremental sensitivity less tractable. However, in the context of timing simulators, it has been shown that sensitivity computation of large circuits is practical and extremely efficient (35,34,36,6). Depending on the modeling simplifications used, the associated sensitivity or adjoint circuit can be trivial to solve. In the case of SPECS (17), the associated circuit consists of disconnected capacitors, with impulses of charge transferred between the capacitors at times corresponding to event times in the nominal transient solution. Further, the piecewise nature of the waveforms reduces the cost of otherwise costly convolutions. Fast sensitivity computation on circuits with several thousand transistor width parameters has been reported (37).
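To make the element-replacement rule above concrete, the sensitivity element for the resistor follows directly from its branch constitutive relation (a brief sketch restating the description above, with \hat{v} and \hat{i} denoting the derivatives of the branch voltage and current with respect to the resistance R):

v = R\,i, \qquad \hat{v} \;=\; \frac{\partial v}{\partial R} \;=\; R\,\hat{i} + i .

The term R\,\hat{i} corresponds to a resistor of equal value in the sensitivity circuit, and the known nominal current i acts as the series voltage source; because i appears only as a source term, it lands on the right-hand side of the circuit equations and leaves the system matrix unchanged.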
CIRCUIT OPTIMIZATION

Automatic circuit optimization (or tuning) (38) is an important part of rapidly, repeatably and robustly designing high-performance circuits. The relentless push for ever higher performance, the need to design circuits of greater complexity, the emphasis on custom design, and shrinking product cycles have led to an increased interest in optimization techniques. New challenges such as power minimization for portable applications, noise reduction, and signal integrity also increase dependence on automatic design methods.

Given a functioning circuit schematic, the circuit tuning problem can be stated as that of optimally assigning values to components (e.g., transistor sizes, wire sizes, resistor values, compensating capacitor values). The performance metrics in digital circuits include (some subset of) delay, transition time, area, power dissipation, signal integrity, additional timing constraints, layout constraints, and manufacturability. Most of these metrics are nonlinear functions of the tunable parameters. Each metric can be presented as either an objective function or a constraint. The parameters of the problem are typically transistor and wire sizes, and these parameters are required to lie within simple bounds. Many circuit tuning problems are best stated as minimax problems in which the optimizer is required to minimize the maximum of a finite set of functions. For example, the problem may be stated as minimizing the worst delay across several paths through the logic.

Circuit tuning is best approached by gradient-based nonlinear optimization. In the absence of gradients, large problems cannot be solved and one is typically limited to problems in a few tens of variables. Worse, there is often no guarantee of convergence or optimality in such ‘‘gradient-free’’ techniques. The efficient computation of gradients and even Hessians (matrices of second partial derivatives) is key to effective optimization of large circuits. Note that gradient-based nonlinear optimizers attempt to converge to a feasible and locally optimal point; there is no guarantee of global optimality.

Circuit optimization techniques fall into three broad categories. The first is dynamic tuning, based on time-domain simulation of the underlying circuit, typically combined with adjoint sensitivity computation. These methods are accurate but require the specification of input signals, and are best applied to small data-flow circuits and ‘‘cross-sections’’ of larger circuits. Efficient sensitivity computation renders feasible the tuning of circuits with a few thousand transistors. Second, static tuners employ static timing analysis to evaluate the performance of the circuit. All paths through the logic are simultaneously tuned, and no input vectors are required. Large control macros are best tuned by these methods. However, in the context of deep submicron custom design, the inaccuracy of the delay models employed by these methods often limits their utility. Aggressive tuning can push a circuit into a precipitous corner of the manufacturing process space, which is a problem addressed by the third class of circuit optimization tools, statistical tuners. Statistical techniques are
used to improve yield in the face of inevitable manufacturing variations.

Figure 5. Typical flow of a dynamic tuner.

Dynamic Circuit Optimization

Dynamic tuning (39–41,37) implies circuit optimization based on dynamic time-domain simulation of the underlying circuit. The typical flow of a dynamic tuner is shown in Fig. 5. Under the control of a nonlinear optimizer, tunable parameters are set to their initial values and a simulation is performed. The measurements of interest and the sensitivities of each measurement with respect to all tunable parameters are fed back to the optimizer. Based on this information, the nonlinear optimization package suggests a new solution vector, which is a new assignment of parameter values that is expected to improve the circuit. The iterative process is carried to convergence, or until a user-specified maximum number of iterations is reached.

The parameters in dynamic tuning usually include transistor and wire sizes, and they must conform to simple bounds. Ratioing of transistor widths to one another must be permitted. Further, grouping of similar structures is useful to ensure that they can share a common layout by maintaining the corresponding transistors of the structures at the same size during the tuning procedure.

The measurements in dynamic tuning usually include area (often modeled by the sum of the tunable transistor widths), delay, noise, and transition time or slew. The objective function and constraints are expressed in terms of these measurements. Minimax optimization is a useful feature whereby the worst of a set of measurements is minimized. For example, the problem may be stated as minimizing the worst delay of m paths through the circuit, as shown below:

\min_{x \in \mathbb{R}^n} \;\; \max_{i \in \{1, 2, \ldots, m\}} \; d_i(x) \qquad (4)
Note that the optimizer has no a priori knowledge of which of these paths exhibits the worst delay. Further, different paths may be critical during different iterations of the optimization.
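The iterative flow described above can be sketched in a few lines of code. The following toy example is only illustrative: the simulate function stands in for transient simulation with adjoint sensitivities (here a simple analytic delay-versus-width model), and a plain projected gradient step stands in for the nonlinear optimizer; it is not the interface of any particular tuner.

# Minimal sketch of a dynamic-tuner loop (toy model, not a real simulator or optimizer).
def simulate(widths):
    # Stand-in for transient simulation plus sensitivity computation:
    # a toy delay that decreases with width plus a small area-like penalty.
    delay = sum(1.0 / w + 0.05 * w for w in widths)
    gradient = [-1.0 / (w * w) + 0.05 for w in widths]
    return delay, gradient

def tune(widths, lower=1.0, upper=20.0, step=2.0, max_iters=50, tol=1e-6):
    best = float("inf")
    for _ in range(max_iters):
        delay, gradient = simulate(widths)       # measurements and sensitivities
        if best - delay < tol:                   # no useful improvement: converged
            break
        best = delay
        # "Optimizer" suggests a new solution vector, respecting simple bounds.
        widths = [min(upper, max(lower, w - step * g))
                  for w, g in zip(widths, gradient)]
    return widths, best

if __name__ == "__main__":
    sizes, delay = tune([1.0, 2.0, 4.0])
    print("tuned widths:", [round(w, 2) for w in sizes], "delay:", round(delay, 3))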
The main advantage of dynamic tuning is its accuracy. The tuning is realistic since it is based on full-blown transient simulation. Likewise, false paths are avoided in contrast to static tuning methods. If the transistor sizing at any iteration causes failure of a measured signal to switch correctly, the transient simulation is able to detect this situation. In such a case, a nonworking circuit has been obtained, usually because of the optimizer taking too aggressive a step. Recovery from this situation is implemented by requiring the optimizer to cut back on its step size and trying again.

However, dynamic tuning suffers from a number of disadvantages. The main disadvantage is that it is specific to the input pattern sensitizations and measurements specified. Unlike static tuning, it is not possible to tune any but the smallest circuit for all possible input patterns and all possible paths through the logic. As with the use of any optimizer, the solution obtained is only as good as the problem specification. Dynamic tuning is particularly vulnerable to designers omitting tacit requirements and then encountering unexpected results. A disciplined approach to accurately expressing all aspects of the problem at hand is essential to making good use of any optimization program! Dynamic tuning is most often applied to small data-flow circuits in which the critical paths are well known and the input patterns to sensitize these paths are easy to come by. The relative computational inefficiency of these tools also limits the size of circuit that can be tuned. Most dynamic tuners are limited to a few tens of transistors.

DELIGHT.SPICE (41) was one of the early practical implementations of a dynamic circuit optimization capability. Recently, a SPECS-based tuner called JiffyTune (34,37) was reported to tune circuits with over 10,000 transistors. The large capacity was achieved by using a simulator that simplifies device models, employs event-driven simulation, and applies the adjoint method to compute sensitivities.

Static Circuit Optimization

Static tuning implies circuit optimization based on static timing analysis (23). One of the earliest static tuners was TILOS (42). In these approaches (43), transistors are usually modeled by equivalent RC circuits. The actual values of the resistances and capacitances are computed during a precharacterization procedure. The delay of each channel-connected set of transistors is computed using the Elmore delay model (44,45), a special and simplified case of the reduced-order models discussed earlier. Alternatively, delay macromodels are used in (46). Conventional static timing analysis is used to determine the critical path. The delay of the critical path can then be expressed as a function of the widths of transistors and wires. This expression is modeled by a posynomial function [a particular algebraic form; see (47)] of the parameters of the optimization. The observation is then made that by a simple variable substitution, the posynomial function can be converted to a convex function. Thus any local minimum is guaranteed to be a global minimum. The procedure in TILOS is to start all transistors at their minimum widths, and iteratively bump up the width of the transistor to which the critical path is most sensitive at each step of the algorithm. The procedure is repeated until the lowest critical path delay through the circuit is found.
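For reference, the Elmore delay used by such static tuners has a compact closed form for an RC tree (a standard formulation, not specific to any one tool): the delay to node k is

T_k \;=\; \sum_{i} R_{ki}\, C_i ,

where C_i is the capacitance at node i and R_{ki} is the resistance of the portion of the path from the source to node k that is shared with the path from the source to node i. Under the usual sizing model in which a transistor's resistance scales as 1/W and its capacitances scale with W, each such term is a posynomial in the widths, which is the structure exploited by the convex formulation mentioned above.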
More recently, power optimization has also been proposed in this general framework (48).

The main advantages of static timing analysis are its pattern independence and its speed. Very large circuits can be tuned relatively quickly. All paths are implicitly taken into account because of the underlying static timing basis. The designer is freed of the onus of coming up with input patterns or identifying critical paths. Since the timing of many industrial designs is verified by static timing analysis, there are obvious advantages to carrying out static tuning in that same general framework. Further, interconnect delay can easily be modeled and accommodated into this framework.

Unfortunately, static timing analysis has a number of drawbacks. The most serious one is accuracy. Elmore delays do not provide reasonable accuracy in the context of high-performance submicron circuits. Other modeling techniques like ‘‘collapsing’’ each logic gate into an equivalent inverter significantly degrade the accuracy. Unfortunately, the mathematical elegance of mapping the problem into a convex one and the intuitive satisfaction of finding a global minimum are rendered void by the crudeness of the delay approximation. The second major drawback of static tuning is the false-path problem. The optimizer may be hard at work tuning false paths through the circuit, and therefore unable to achieve any performance gains in the paths that really matter. But this problem is no more or less serious than the false-path problem in static timing analysis, and if the circuit ‘‘sign-off’’ is based on static timing analysis, this activity may be legitimate, though wasteful. The third problem with static tuning is the lack of delay models as functions of transistor sizes. Analytic delay models for gates in a library are generally not constructed as functions of transistor sizes, thus making them unusable during optimization. The creation of such models involves an exhaustive and time-consuming SPICE-based characterization process. Finally, starting with all transistors set to their minimum size could lead to circuits that may not even have the correct logical transitions. Dynamic tuners, since they are based on a realistic simulation of the circuit, have the advantage of being able to detect such ‘‘nonworking’’ circuits and attempting to recover from them.

Static timing analysis in the context of custom circuits is successful only when each channel-connected component is timed using a dynamic simulator of reasonable accuracy under the covers. For tuning purposes, the fast gradient computation methods mentioned above can then be exploited.

Design for Manufacturability

Yield loss on a fabrication line can be attributed to catastrophic and parametric (or circuit-limited) yield loss. Catastrophic yield loss is due, for example, to dust particles that cause opens or shorts on metal lines. Parametric yield loss, which is discussed in this section, occurs due to inherent manufacturing variations, leading to chips that do not have the required performance characteristics. In sorted designs like microprocessor chips, this degradation can mean that insufficient chips end up in the high-performance, high-profit bin. In nonsorted designs (e.g., a bus controller chip), circuits below a performance threshold must be thrown away. Across-chip linewidth variations (ACLVs) constitute the single dominant set of parameters that lead to variations in the performance
of the circuit. Statistical tuning is the process of changing design parameters to minimize circuit-limited yield loss. Aggressive tuning of a circuit often drives it into a corner of the process space, thus causing its yield to suffer. This problem has been studied extensively in the literature, and Ref. 49 is a good tutorial introduction to the subject. In addition, the books (50,51) provide a survey of the state-of-the-art as well as extensive pointers to further reading, while (52) is a useful reference on the topic of creating and building statistical models. The approaches to various aspects of statistical tuning are listed below (see CAD FOR MANUFACTURABILITY).

• In Monte Carlo analysis, the parametric space is sampled and the design simulated at each sample point (a small sketch follows this list). Of course, this method assumes that distributions of the parameters are known. Further, by various principal component and correlation analyses, the number of independent parameters is reduced so as to limit the dimensionality of the space being sampled and therefore the number of simulations required. The results of the simulation runs can be used to determine both the distribution and worst-case behavior of the circuit. Designs are often simulated at multiple process corners, which is a simple form of Monte Carlo analysis. It is possible in the context of a dynamic tuner to replicate the nominal objective function(s) and constraints across all process corners, and simultaneously tune at all process corners (34). Nominal objective functions are transformed into minimax functions across the process corners.

• Extreme case analysis is aimed at finding the worst-case behavior of the circuit given a statistical model of the parameter variations. The goal is not to predict the statistical distribution of the performance, but to predict the worst-case. A simple statement of the problem would be, for example, to maximize the delay of a circuit by optimally assigning transistor lengths that are constrained to conform to a precharacterized distribution. A word about relying on extreme case analysis is appropriate here. It is obviously unwise to rely on ‘‘best-case’’ assumptions during the design of a chip. Going too far the other way, while providing an easy guarantee of working hardware, is also wasteful. Unrealized performance can represent significant lost revenue. Worse, pessimistic projections may cause other parts of the system to be designed to work at lower specifications. Due to system limitations, it may not then be possible to harness the ‘‘surprisingly good’’ performance provided by chips that were designed using extreme case analysis. In practice, therefore, it is unwise to have a design methodology that is either overly pessimistic or overly optimistic.

• Yield prediction and optimization seek to explicitly model the yield characteristics of a circuit as a response surface. Once this is done, the parametric yield of a circuit in the face of manufacturing variations can be predicted. Further, based on the yield model, the circuit can be modified to maximize the yield.

• Design centering methods do not explicitly compute or model yields. Instead, they take the approach that pushing the circuit deeper into the interior of the feasible region in the space of parameter variations will result in
a more robust circuit and therefore higher yields. While design centering methods operate on such a geometric model of the feasible region, method-of-moments-based techniques implicitly attempt to move designs away from regions of low yield to regions of high yield without seeking to explicitly compute the feasible region (53).
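As a small illustration of the Monte Carlo approach referenced in the list above, the following sketch samples a single assumed parameter distribution and evaluates a hypothetical delay model at each sample to estimate parametric yield, the delay distribution, and the worst sampled case; the model, distribution, and specification limit are illustrative assumptions only.

# Minimal Monte Carlo parametric-yield sketch (hypothetical delay model, not a real simulator).
import random
import statistics

def delay_model(dl_eff):
    # Assumed performance model: delay grows with the channel-length variation dl_eff.
    return 1.0 + 0.8 * dl_eff + 0.5 * dl_eff * dl_eff

def monte_carlo_yield(spec=1.15, n_samples=10000, sigma=0.1, seed=7):
    rng = random.Random(seed)
    delays = [delay_model(rng.gauss(0.0, sigma)) for _ in range(n_samples)]
    yield_estimate = sum(d <= spec for d in delays) / n_samples
    return yield_estimate, statistics.mean(delays), max(delays)

if __name__ == "__main__":
    y, mean_delay, worst_delay = monte_carlo_yield()
    print(f"estimated yield: {y:.3f}  mean delay: {mean_delay:.3f}  worst sample: {worst_delay:.3f}")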
Despite much research on the topic of statistical tuning, Monte Carlo and extreme case analyses are the most popular approaches; industrial practice consists predominantly of these two methods. The advantages of these methods, not shared by the other techniques, are that they are easy to understand and that their results are in a form easily accessible to the design engineer.

CONCLUSION

Electrical and timing simulation and circuit optimization are crucial components of circuit design. The challenge is to design tomorrow's faster and more complex systems with today's computers and computer aids. As a result, a wealth of algorithms and techniques for simulation at various levels of abstraction has been developed over the last four decades. As the challenges grow, new approaches have been successful in tackling simulation and verification of larger and more complex integrated circuits. Simulation and optimization of circuits continues to be an active and vibrant research area.

BIBLIOGRAPHY

1. L. W. Nagel, SPICE2, a computer program to simulate semiconductor circuits, Memo UCB/ERL M520, Univ. California, Berkeley, May 1975.

2. W. T. Weeks et al., Algorithms for ASTAP—A network analysis program, IEEE Trans. Circuit Theory, CT-20: 628–634, 1973.

3. G. D. Hachtel, R. K. Brayton, and F. G. Gustavson, The sparse tableau approach to network analysis and design, IEEE Trans. Circuit Theory, CT-18: 101–118, 1971.

4. C. W. Ho, A. E. Ruehli, and P. A. Brennan, The modified nodal approach to network analysis, Proc. 1974 IEEE Int. Symp. Circuits Syst., New York, 1974, pp. 505–509.

5. P. M. Russo and R. A. Rohrer, The tree-link analysis approach to the transient analysis of a class of nonlinear networks, IEEE Trans. Circuit Theory, CT-18: 400–403, 1971.

6. L. T. Pillage, R. A. Rohrer, and C. Visweswariah, Electronic Circuit and System Simulation Methods, New York: McGraw-Hill, 1995.

7. L. W. Nagel and R. A. Rohrer, Computer analysis of nonlinear circuits, excluding radiation (CANCER), IEEE J. Solid State Circuits, SC-6 (4): 166–182, 1971.

8. E. Lelarasmee, A. E. Ruehli, and A. L. Sangiovanni-Vincentelli, The waveform relaxation method for the time-domain analysis of large scale integrated circuits, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., CAD-1: 131–145, 1982.

9. J. S. Roychowdhury and R. C. Melville, Homotopy techniques for obtaining a DC solution of large-scale MOS circuits, Proc. 1996 Design Autom. Conf., Las Vegas, NV, 1996, pp. 286–291.

10. P. Feldmann, R. C. Melville, and D. E. Long, Efficient frequency-domain analysis of large nonlinear analog circuits, Proc. Custom Integrated Circuits Conf., San Diego, CA, 1996, pp. 461–464.

11. S. Kapur, D. E. Long, and J. Roychowdhury, Efficient time-domain simulation of frequency-dependent elements, IEEE Int. Conf. Comput.-Aided Des., San Jose, CA, 1996, pp. 569–573.

12. P. Feldmann and J. Roychowdhury, Computation of circuit waveform envelopes using an efficient, matrix-decomposed harmonic balance algorithm, IEEE Int. Conf. Comput.-Aided Des., San Jose, CA, 1996, pp. 295–300.

13. B. R. Chawla, H. K. Gummel, and P. Kozak, MOTIS—An MOS timing simulator, IEEE Trans. Circuits Syst., CAS-22: 901–910, 1975.

14. K. A. Sakallah and S. W. Director, SAMSON2: An event driven VLSI circuit simulator, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 4: 668–684, 1985.

15. Y. H. Kim, S. H. Hwang, and A. R. Newton, Electrical-logic simulation and its applications, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 8: 8–22, 1989.

16. G. Ruan, J. Vlach, and J. A. Barby, Logic simulation with current-limited switches, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 9: 133–141, 1990.
17. C. Visweswariah and R. A. Rohrer, Piecewise approximate circuit simulation, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 10: 861–870, 1991. 18. C. X. Huang et al., The design and implementation of PowerMill, Proc. Int. Workshop Low Power Des., 1995, pp. 105–110. 19. A. Devgan and R. A. Rohrer, Adaptively controlled explicit simulation, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., CAD-13: 746–762, 1994. 20. R. E. Bryant, MOSSIM: A switch level simulator for MOS LSI, Proc. 1981 Des. Autom. Conf., Nashville, TN, 1981. 21. B. H. Scheff and S. P. Young, Gate-Level Logic Simulation. Englewood Cliffs, NJ: Prentice-Hall, 1972. 22. T. Blank, A survey of hardware accelerators used in computeraided design, IEEE Des. Test Comput., 1 (3): 21–39, 1984. 23. R. B. Hitchcock, Sr., G. L. Smith, and D. D. Cheng, Timing analysis of computer hardware, IBM J. Res. Develop., 26 (1): 100– 105, 1982. 24. L. T. Pillage and R. A. Rohrer, Asymptotic waveform evaluation for timing analysis, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 9: 352–366, 1990. 25. J. G. A. Baker, Essentials of Pade´ Approximants, New York: Academic Press, 1975. 26. P. Feldmann and R. W. Freund, Efficient linear circuit analysis by Pade´ approximation via the Lanczos process, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 14: 639–649, 1995. 27. L. M. Silveira et al., A coordinate-transformed Arnoldi algorithm for generating guaranteed stable reduced-order models of RLC circuits, IEEE Int. Conf. Comput.-Aided Des., San Jose, CA, 1996, pp. 288–294. 28. R. W. Freund and P. Feldman, Efficient small-signal circuit analysis and sensitivity computations with the PVL algorithm, IEEE Int. Conf. Comput.-Aided Des., San Jose, CA, 1994, pp. 404–411. 29. G. Kron, Tensor Analysis of Networks. New York: Wiley, 1939. 30. A. S. Householder, A survey of some closed methods of inverting matrices, SIAM J. Appl. Math., 5: 153–169, 1957. 31. D. A. Hocevar et al., Transient sensitivity computation for MOSFET circuits, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., CAD-4: 609–620, 1985. 32. S. W. Director and R. A. Rohrer, The generalized adjoint network and network sensitivities, IEEE Trans. Circuit Theory, CT-16: 318–323, 1969. 33. B.D. H. Tellegen, A general network theorem, with applications, Philips Res. Rep., 7: 259–269, 1952.
34. A. R. Conn et al., Circuit optimization via adjoint Lagrangians, IEEE Int. Conf. Comput.-Aided Des., San Jose, CA, 1997, pp. 281–288. 35. P. Feldmann et al., Sensitivity computation in piecewise approximate circuit simulation, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 10: 171–183, 1991. 36. T. V. Nguyen, A. Devgan, and O. J. Nastov, Adjoint transient sensitivity computation in piecewise linear simulation, Proc. 1998 Des. Autom. Conf., San Francisco, CA, 1998. 37. A. R. Conn et al., JiffyTune: Circuit optimization using time-domain sensitivities, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 1998, to appear. 38. C. Visweswariah, Optimization techniques for high-performance digital circuits, IEEE Int. Conf. Comput.-Aided Des., San Jose, CA, 1997, pp. 198–205. 39. R. K. Brayton and R. Spence, Sensitivity and Optimization, vol. 2 of CAD of Electronic Circuits, Amsterdam: Elsevier, 1980. 40. R. K. Brayton, G. D. Hachtel, and A. L. Sangiovanni-Vincentelli, A survey of optimization techniques for integrated-circuit design, Proc. IEEE, 69: 1334–1362, 1981. 41. W. Nye et al., DELIGHT.SPICE: An optimization-based system for the design of integrated circuits, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., CAD-7: 501–519, 1988. 42. J. P. Fishburn and A. E. Dunlop, TILOS: A posynomial programming approach to transistor sizing, IEEE Int. Conf. Comput.Aided Des., Santa Clara, CA, 1985, pp. 326–328. 43. S. S. Sapatnekar et al., An exact solution to the transistor sizing problem for CMOS circuits using convex optimization, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., CAD-12: 1621– 1634, 1993. 44. W. C. Elmore, The transient analysis of damped linear networks with particular regard to wideband amplifiers, J. Appl. Phys., 19 (1): 55–63, 1948. 45. P. Penfield and J. Rubinstein, Signal Delay in RC Tree Networks, Proc. 2nd Caltech VLSI Conf., Pasadena, CA, 1981, pp. 269–283. 46. M. D. Matson and L. A. Glasser, Macromodeling and optimization of digital MOS VLSI circuits, IEEE Trans. Comput.-Aided Des. Integr.Circuits Syst., CAD-5: 659–678, 1986. 47. R. J. Duffin, E. L. Peterson, and C. Zener, Geometric Programming—Theory and Applications, New York: Wiley, 1967. 48. P. K. Sancheti and S. S. Sapatnekar, Optimal design of macrocells for low power and high speed, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., CAD-15: 1160–1166, 1996. 49. R. Spence and R. S. Soin, Tolerance Design of Electronic Circuits, Reading, MA: Addison-Wesley, 1988. 50. S. W. Director and W. Maly, eds., Statistical Approach to VLSI, vol. 8 of Advances in CAD for VLSI, Amsterdam: North-Holland, 1994. 51. J. C. Zhang and M. A. Styblinski, Yield Variability Optimization of Integrated Circuits, Dordrecht: Kluwer, 1995. 52. C. Michael and M. Ismail, Statistical Modeling for ComputerAided Design of MOS VLSI Circuits. Dordrecht: Kluwer, 1993. 53. G. Kjellstrom and L. Taxen, Stochastic optimization in system design, IEEE Trans. Circuits Syst., CAS-28: 702–715, 1981.
CHANDU VISWESWARIAH IBM Thomas J. Watson Research Center
HIGH LEVEL SYNTHESIS

High level synthesis (also known as behavioral synthesis) has been a popular research and development area since the 1970s. Several thousand papers on high level synthesis have been published and more than a hundred prototype and production-quality high level synthesis systems have been developed. There are several survey papers and books on high level synthesis (1,2,3,4,5). Recently, system-level synthesis and hardware–software codesign have also received widespread attention (6). Therefore, a comprehensive survey of the field is beyond the scope of a single article. Our goal here is to explain the key concepts and some of the most interesting directions as well as to provide a good starting point for readers who want to study high level synthesis.

This article is organized in the following way. We first discuss computational and hardware models, define the main tasks within high level synthesis, and list the key design objectives. After that we explain some of the techniques recently proposed in several research areas, which have received much attention in high level synthesis in the last few years.
Computational Model

A number of computational models well suited for high level synthesis are surveyed in Ref. 7. In general, DSP systems have multiple inputs, multiple outputs, and a finite number of states. They accept streams of samples on each of the inputs and produce streams of samples on each of the output ports. We represent an algorithm for a system by a hierarchical directed control-data flow graph (CDFG). In a CDFG the nodes represent data operators or subgraphs, data edges represent the flow of data between nodes, and control edges represent sequencing and timing constraints between nodes.

Often, designers restrict themselves to operators that are synchronous in that they consume at every input, and produce at every output, a fixed number of samples on every execution. This restriction has two important ramifications. First, the operators, and hence the system, are determinate in that a given set of input samples always results in the same outputs independent of the execution times. Second, the system is well behaved in that the data sample rate at any given data edge in the CDFG is independent of the inputs, and the ratio between any two data sample rates will be a statically known rational number. Mathematically, such a synchronous CDFG is equivalent to a continuous function over streams of data samples. Because CDFGs are, of course, causal, they are equivalent to a function that expresses the ith set of output samples in terms of the ith and earlier sets of input samples.

The system state is represented in a CDFG by special delay operator nodes that are initialized to a user-specified value. A delay operator node (often referred to as just delay or state) delays by one sample the stream of data on its sole input port. A CDFG corresponds to an algorithm for computing the output samples and the new samples to be stored at the delay nodes (i.e., the new state) given the input samples and the old (current) samples at the delay nodes (i.e., the old state). Intuitively, one can think of delay operators as representing registers holding states and the other operators as combinational logic.
Fig. 1. An example CDFG constructed for the following computation
where x and y are inputs, z is the output, and D1 and D2 are computation states.
A system is completely represented by a CDFG and the initial values for all the delays in the CDFG. We further restrict ourselves to single-rate systems where the data rate is identical on all the inputs. The term CDFG often refers to such a single-rate synchronous CDFG. Figure 1 shows an example CDFG with two inputs X and Y, and one output Z. The two delay nodes U and V are represented by boxes with the letter D.

Also associated with a CDFG are timing constraints specified by the user. These constraints arise from requirements of the interface to the external world and from performance requirements. Consider a CDFG with P inputs, Q outputs, and R delay nodes (state nodes). Let X[n] = (X_1[n] X_2[n] ... X_P[n])^T be the vector of nth samples at the P input nodes of the CDFG, Y[n] = (Y_1[n] Y_2[n] ... Y_Q[n])^T be the vector of nth samples at the Q output nodes of the CDFG, and S[n] = (S_1[n] S_2[n] ... S_R[n])^T be the vector of nth samples at the R delay nodes of the CDFG. Given initial values S[0] of the samples in the delay nodes, the CDFG repeatedly computes output samples Y[n] and new samples S[n] for the delay nodes from input samples X[n] and old samples S[n − 1] at the delay nodes, for n = 1, 2, 3, . . ..

Various timing parameters are associated with a CDFG, as shown in Fig. 2. Because we have restricted ourselves to single-rate synchronous CDFGs, the data rates are identical at all nodes, so the intersample time interval is identical and constant for all nodes. The maximum rate at which such a CDFG can process samples is called its throughput, and the inverse of this rate, called the sample period, is the minimum required time between successive samples at a node in the CDFG. The sample period, denoted by T_S, is an important timing parameter that is usually constrained not to exceed a maximum value.

A second set of timing parameters associated with a CDFG is the set of pairwise latencies. The latency T_L(i, j), from the ith input node to the jth output node, is the delay between the arrival of a sample at the ith input and the production of the corresponding sample at the jth output. The sample correspondence is defined by the initial CDFG given by the user to specify the system—the nth sample at an output corresponds to the nth sample at an input. However, adding or deleting pipeline stages during algorithm transformation changes this correspondence. For example, if one were to add one level of pipelining to the initial CDFG, then
the latencies will be defined in terms of the (n + 1)-th input samples and the nth output samples. The pairwise latencies T_L(i, j) associated with a CDFG are also important timing parameters because they are constrained not to exceed specified maximum values in latency-critical systems.

Fig. 2. Timing parameters associated with a CDFG.

Fig. 3. A datapath with two functional units and four dedicated register files.
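In summary, with the vectors X[n], Y[n], and S[n] defined above, one iteration of a single-rate synchronous CDFG can be written compactly as (a restatement of the description above, where F and G denote the output and next-state functions implied by the graph):

Y[n] = F\bigl(X[n],\, S[n-1]\bigr), \qquad S[n] = G\bigl(X[n],\, S[n-1]\bigr), \qquad n = 1, 2, 3, \ldots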
Hardware Models

There are a number of hardware models used in high level synthesis. Most often the emphasis is on the datapath, although recently control logic hardware cost modeling and memory have received considerable attention. One popular hardware model is shown in Fig. 3. To stress the importance of interconnect minimization early in the design process, this model clusters all registers in register files, connected only to the inputs of the corresponding execution units. Other popular hardware models include a variation where any register file can be connected to any functional unit. Until recently, a large number of high level synthesis systems assumed hardware models where registers are not grouped. However, it has recently become apparent that for that model, interconnect and control logic overhead is too high for realistic designs.
High Level Synthesis Tasks

The main high level synthesis tasks follow.

(1) Transformations. The structure of the computation is reorganized in such a way that input/output behavior is not altered. The goal is to make the computation more amenable to a good implementation.

(2) Scheduling. Operations are assigned to the control steps in which they are to be executed so as to satisfy design constraints (a minimal scheduling sketch follows this list).

(3) Resource Allocation. Variables and operations are assigned to registers and functional units, respectively, and the interconnection of the various resources in terms of buses and multiplexors is determined.

(4) Template Matching. This is the process of mapping high-level algorithmic descriptions to specialized hardware libraries or instruction sets. The difficulty of template matching lies in the fact that the number of template matches can be extremely large, so enumerating all matches is prohibitive.

(5) Hardware Mapping. This refers to the process of selecting, for each operation in the CDFG, the type of functional unit that will perform it. For example, a ripple carry adder, carry lookahead adder, or carry select adder can be selected for an addition.

(6) Clock Selection. This refers to the process of choosing a suitable clock period for the controller and data-path circuit. The choice of the clock period is known to have a significant effect on area, performance, and power consumption.

(7) Partitioning. The behavior of an architecture is divided among multiple chips, or into multiple parts within the bounds of a single chip, while minimizing the global interconnect between parts. Partitioning in high level synthesis is different from partitioning in physical design because, due to the possibility of resource sharing of global interconnect, the number of connections between parts does not directly correspond to the amount of global interconnect required.
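As a minimal illustration of the scheduling task (item 2 above), the following sketch computes an as-soon-as-possible (ASAP) schedule for a small, hypothetical dependence graph with unit-delay operations; real schedulers must additionally honor resource and timing constraints.

# Minimal ASAP scheduling sketch over a hypothetical CDFG-like dependence graph.
# Each operation gets the earliest control step consistent with its data predecessors;
# every operation is assumed to take exactly one control step.
def asap_schedule(deps):
    steps = {}
    def step_of(op):
        if op not in steps:
            preds = deps.get(op, [])
            steps[op] = 1 if not preds else 1 + max(step_of(p) for p in preds)
        return steps[op]
    for op in deps:
        step_of(op)
    return steps

if __name__ == "__main__":
    # z = (a * b) + (c * d): the addition must wait for both multiplications.
    dependence_graph = {"mul1": [], "mul2": [], "add1": ["mul1", "mul2"]}
    print(asap_schedule(dependence_graph))  # {'mul1': 1, 'mul2': 1, 'add1': 2}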
Design Metrics

Applications and implementation technology have had a significant impact on the relative popularity of design objectives. Even though area and speed were initially the dominant metrics, the focus has recently moved to metrics such as low power, testability, suitability for debugging, and protection of the intellectual property rights of designers.
Transformations for High Level Synthesis

Transformations alter the structure of a computation in such a way that the user-specified input/output relationship is maintained (8). Examples of transformations include pipelining, retiming, unfolding, folding, associativity, distributivity, and strength reduction. A simple application of associativity to reduce the critical path length is shown in Fig. 4.

Many transformations have been known for a long time and have been used in the compiler domain. For example, transformations such as constant and copy propagation and common subexpression elimination are often used. Comprehensive reviews of the use of transformations in parallelizing compilers, state-of-the-art general-purpose computing environments, and VLSI DSP design are given in Refs. 9, 10, and 11, respectively. In high level synthesis, transformations have received widespread attention (2,12,13,14,15) because of strong experimental evidence that they are most effective at the highest levels of abstraction, such as high level synthesis, although they have been widely used at all levels of abstraction in the synthesis process.
Fig. 4. Applying associativity to reduce the length of the critical path: CDFG (a) before and (b) after.
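As a minimal numerical illustration of the restructuring in Fig. 4 (the expression used here is hypothetical, not necessarily the one in the figure): assuming each addition takes one time unit, associativity can rebalance a chain of additions into a tree,

z = \bigl((a + b) + c\bigr) + d \quad\longrightarrow\quad z = (a + b) + (c + d),

reducing the critical path from three addition delays to two while leaving the input/output behavior unchanged.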
Transformations have been recognized as the high level synthesis step with the highest impact on design metrics. It has been demonstrated that exceptionally high improvements in all design metrics, including area (16), throughput and latency (17,18), power (19), transient and permanent fault-tolerance overhead (14,20,21), memory (22), and testability (15), are achievable. Most of these approaches, however, consider only individual or small sets of transformations, which limits the improvement, mainly for the following reasons:

• Transformations are notorious for their ability to alter unpredictably the numerical properties of a design and therefore its required word-length.

• The implementation and maintenance of a large number of transformations in compiler and high level synthesis environments is a formidable software task.

• The accurate prediction of the effects of transformations has been rarely addressed, and it is widely considered infeasible.

• Their full potential can be explored only with proper orders of transformations. This is the objection that is most often raised and quoted with respect to the application of transformations. However, deriving such orders is mainly an art practiced by experienced developers of compiler and CAD tools, and it often results in disappointing outcomes.
The previous approaches for transformation ordering can be classified in seven groups: local (peephole) optimization, static scripts, exhaustive search-based “generate-and-test” methods, algebraic approaches, probabilistic search techniques, bottleneck removal methods, and enabling-effect-based techniques.

Probably the most widely used technique for ordering transformations is local (peephole) optimization (23), where a compiler considers only a small section of code at a time in order to apply one by one iteratively and locally all available transformations. The advantages of the approach are that it is fast and simple to implement. However, performance is rarely high and is usually inferior to other approaches.

Another popular technique is a static approach to transformations ordering where their order is given a priori, most often in the form of a script (24). Script development is based on the experience of the compiler/synthesis software developer. This method has at least three drawbacks: it is a time-consuming process
which involves a lot of experimentation on random examples in an ad-hoc manner; any knowledge about the relationship among transformations is only implicitly used; and the quality of the solution is often relatively low for programs/designs which have different characteristics than the ones used for the development of the script.

The most powerful approach to transformation ordering is enumeration-based “generate and test” (25). All possible combinations of transformations are considered for a particular compilation, and the best one is selected using branch-and-bound or dynamic programming algorithms. The drawback is the large run time, often exponential in the number of transformations. Another interesting approach is to use a mathematical theory behind the ordering of some transformations. However, this method is limited to only several linear loop transformations (26).

Simulated annealing, genetic programming, and other probabilistic techniques in many situations provide a good trade-off between the run time and the quality of solution when little or no information about the topology of the solution space is available. Recently, several probabilistic search techniques have been proposed for ordering of transformations in both the compiler and high level synthesis literature. For example, backward-propagation-based neural network techniques were used to develop a probabilistic approach to the application of transformations in compilers for parallel computers (27), and approaches that combine both simulated annealing-based probabilistic and local heuristic optimization mechanisms were used to demonstrate significant reductions in area and power (19).

In high level synthesis, several bottleneck identification and elimination approaches for ordering of transformations have been proposed (28). This line of work has mainly addressed the throughput and latency optimization problems, where the bottlenecks can be easily identified and well quantified.

Finally, the idea of enabling and disabling transformations has recently been explored in a number of compilation (29) and high level synthesis papers (13,30). Using this idea, several very powerful transformation scripts have been developed, such as one for maximally and arbitrarily fast implementation of linear computations (13), and one for joint optimization of latency and throughput for linear computations (18). Also, the enabling mechanism has been used as a basis for several approaches for ordering of transformations for optimization of general computations (31). The key advantage of this class of approaches is related to the intrinsic importance and the power of the enabling/disabling relationship between a pair of transformations.

Hong et al. (32) proposed an approach for the development of fully automatic, fast, and effective ordering of transformations for a variety of design metrics, such as throughput, area, and power. They introduced a new potential-driven statistical approach based on two new synthesis ideas. The first idea is to identify the characteristics of all transformations and the relationships between them based on their potential to reorganize a computation such that the complexity of the corresponding implementation is reduced. The second one is based on the observation that transformations may disable each other not only because they prevent the application of the other transformation but also because both transformations target the same potential of the computation.
These two observations drastically reduce the search space to find efficient and effective scripts of transformations. We explain the key ideas of their approach for ordering transformations using two small, but meaningful examples.

We first consider the example in Fig. 5. Figure 5(a) shows the CDFG of a computation that consists of three additions and three multiplications. We assume that each operation takes one clock cycle and that the available time is two cycles. Note that because the length of the critical path is also two cycles, all operations are on the critical path. Obviously, regardless of the scheduling algorithm used, the final implementation requires at least three multipliers and three adders. This design has relatively low (50%) resource utilization for execution units.

An easy and effective way to improve this design is to apply transformations. Figure 5(b) shows the same CDFG after the application of distributivity on the isolated component which computes output x. Again all operations are on the critical path. Therefore, the only feasible schedule is the one shown in Fig. 5(b). It is easy to see that only two multipliers and two adders are required.
Fig. 5. An explanatory example to show the importance of considering the potential of computation and transformations for ordering transformations: (a) Initial CDFG and (b) transformed CDFG.
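A simple lower bound helps quantify the potential discussed below: assuming execution units can be fully utilized in every control step, a computation with N_t operations of type t and S available control steps needs at least

N_{\min}(t) \;=\; \left\lceil \frac{N_t}{S} \right\rceil

units of type t. For the initial CDFG in Fig. 5(a), with three multiplications and three additions in two control steps, this bound is two multipliers and two adders, whereas the data precedences force an implementation with three of each; that gap is the improvement potential that the distributivity transformation exploits.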
It is interesting and important to analyze why the CDFG in Fig. 5(b) is more amenable to an implementation with high resource utilization. The transformed design has one more operation than the initial design, so one may expect that the transformed CDFG is inferior. However, this design has more evenly distributed operations of the same type along the time axis.

How can one generalize these observations so that they can be incorporated in a synthesis program? The essence of the approach can be summarized in the following way. The initial CDFG has a high potential to be improved with respect to the area of the final implementation because it has data precedence constraints that force low resource utilization of execution units. This can be observed by comparing the current solution (or bounds on the quality of the current solution) with the solution derived under the assumption that all units can be used in all control steps. In this example, the bounds of the initial solution are three multipliers and three adders, whereas the bounds for fully utilized units are two multipliers and two adders. To achieve improvement, one needs a transformation that will more evenly distribute operations of the same type along the time axis (i.e., alter data dependencies in such a way that some of the multiplications can be moved to the second control step, and some of the additions to the first control step). It is easy to deduce that distributivity is such a transformation. As shown, distributivity is indeed effective.

Another important observation is that, when all execution units are well utilized, no further improvement can be achieved using transformations which alter only data and timing dependencies. This is because all potential along this dimension has already been realized (as indicated by absolute bounds on the number of execution units assuming 100% utilization). Thus, for further improvement, transformations that address some other potential of the computation are required. The substitution of constant multiplications with shifts and additions is such a transformation.

The second motivational example demonstrates the importance of simultaneous consideration of the effects of more than one transformation. It also illustrates the conceptual complexity of transformations ordering and provides initial hints about how this task can be simplified while preserving the power of transformations. The design goal is area optimization under a throughput constraint. Figure 6(a) shows the initial CDFG, which has five multiplications and two additions. The available time is three clock cycles. In the same fashion as in the first example, all operations are on the critical path. Therefore, the only feasible schedule is shown in Fig. 6(a). It is easy to see that three multipliers and two adders are required. The resource utilization is low for an ASIC design: 55% for multipliers and 33% for adders.

It is often required to apply one or more enabling transformations in order to eventually apply a transformation that will improve a design. Following this direction, we observe that our goal is to move some additions from the second control step to the first one and to move some multiplications from the first control step
Fig. 6. An explanatory example to show the importance of considering enabling and disabling effects for ordering transformations. All data denoted by ci are constants: (a) three multipliers, two adders; (b) four multipliers, two adders; (c) two multipliers, two adders; (d) two multipliers, two adders; (e) two multipliers, two adders; and (f) one multiplier, one adder.
to the second one. It is easy to verify that distributivity can accomplish this task. However, the application of distributivity is disabled because the result of multiplication ∗2 is used in two places. We first replicate multiplication ∗2 as shown in Fig. 6(b). If we stop the transformation process at this stage, it is easy to see that the corresponding implementation requires four multipliers and two adders, even more hardware than in the initial design. However, we can now apply distributivity twice, as shown in Fig. 6(c). We need only two multipliers and two adders. More importantly, we can continue the optimization process by first replicating constant c1 and then applying constant propagation, as shown in Figs. 6(d) and 6(e). Finally, we can introduce a pipeline stage (if the computation is in a loop, retiming is performed) and obtain the functionally equivalent
computation shown in Fig. 6(f). The numbers next to the nodes (operations) indicate the clock cycle in which a particular operation is scheduled. We need only one multiplier and one adder.

This small example illustrates not only high effectiveness, but also an exceptional richness of degrees of freedom during optimization using sequences of transformations. The authors have shown, however, that, once the obtained insights are coupled with statistical methods, there is a conceptually simple, effective, and efficient solution for ordering of transformations. On a large set of diverse real-life examples, improvements in throughput, area, and power by large factors have been obtained. Both qualitative and quantitative statistical analysis supported the effectiveness, high robustness, and consistency of their approach to ordering transformations.
High Level Synthesis Techniques for Design Debugging

Functionality and timing debugging dominates both the development time and cost of the modern design process. The key technological and application trends indicate that the cost and time expenses of debugging follow sharply ascending trajectories. The technological trends, inducing additional constraints on debugging, are mainly related to increasingly reduced design observability and controllability.

For example, the UltraSPARC design team reported that debugging efforts (mainly architecture and functional verification) took two times longer than the design activities (33). Similarly, the Hitachi design team reported that specification of the GMICRO500 microprocessor took less than 36 man-months, functional design took 80 man-months, standard cell design took only 3 man-months, physical design took less than 24 man-months, and finally, debugging efforts took more than 270 man-months (34). The difficulty of verifying designs is likely to worsen in the future. The Intel development strategy team foresees that a major design concern for year-2006 microprocessor designs will be the need to test exhaustively all possible computational and compatibility combinations (35). The same team also states that the circuitry in their future designs devoted to debugging purposes is estimated to increase sharply to 6% from the current 3% of the total die area. Because the product is expected to encompass 350 million transistors, the computational power of two state-of-the-art, 10 million transistor processors is expected to be devoted solely to supporting debugging activities.

Two factors most directly related to the expected increased difficulty of debugging future designs are the rapid growth in the number of transistors per pin in each new generation of designs and the increasing levels of hardware sharing caused by increasing clock speeds (see Fig. 7). Higher levels of hardware sharing are associated with more complex information flows in application-specific systems. For example, the analysis of physical data of state-of-the-art microprocessors (according to The Microprocessor Report) indicates that in less than 2 years (from late 1994 to the middle of 1996) the number of transistors per pin increased by more than a factor of 2, from slightly more than 7000 to 14,100 transistors per pin. Whereas in 1994 eight microprocessors had a total of 16.1 million transistors and 2296 pins, in 1996 nine microprocessors had a total of 53.4 million transistors and 3781 pins.

There is a wider perspective on this trend of huge discrepancy between the growth in the number of transistors and the number of pins per chip. Figure 7(a) points to the fact that packaging technology has shown significant improvements over the past two decades. However, the rapid decrease of the silicon feature size has resulted in an even faster-growing amount of available resources (transistors/gates) on chip. The transistor-per-pin ratios collected in Fig. 7(b) for the most popular general-purpose processors over the past three decades clearly point to the resulting exponential trend of semiconductor over packaging technology.

Similarly, the size of an average embedded or DSP application has been approximately doubling each year, the time to market has been getting shorter for each new product generation, and there has been a strong market need for user customization of application-specific systems. These three application factors have resulted in shorter available debugging time for increasingly more complex designs.
Finally, design and CAD trends that additionally emphasize the importance of debugging include design reuse, introduction of system
software layer, and increased importance of collaborative design. These factors result in increasingly intricate functional errors, often due to interaction of parts of designs written by several designers.

Fig. 7. Trends in the physical design of general-purpose microprocessors.

The Debugging Process. After the functional design of a hardware component is specified, functional debugging is used to partially or fully verify and validate the design. In the search for an error, a number of steps are taken, and they can be partitioned into the following four debugging phases (see Fig. 8).

Test Input Generation. In the first phase of debugging, the goal is to generate and execute input data likely to make functional errors visible. In order to create the most promising set of input vectors, the designer’s intuition can be used as well as such sophisticated methods as specialized databases or expert systems (36).

Error Detection. In this phase, the designer discovers that the design does not function correctly for a particular input. The approach to detecting an error in the functionality or timing depends on the system used for functional execution of the design. If design simulation is used, then a set of conditional breakpoints selected by the designer may be used to check for functional or timing errors. In the case of executing the design functionality using an emulation system, the error discovery can be obtained either by output-matching software run by the supervising workstation or by design-internal error events defined at design time. Upon an event that undoubtedly indicates a functional error, the chip signals the supervising station, which terminates the emulation.

Error Diagnosis. In the third phase, the designer identifies the statement in the specification that causes the incorrect behavior. This step in the debugging process may be exceptionally time consuming because of the characteristics of the currently widely used functional execution approaches: design simulation and emulation. Modern design simulation is six to ten orders of magnitude slower than emulation and thus is used primarily for short, focused test sequences. Emulation has the required speed but imposes
strict limitations on signal observability and controllability. To facilitate the advantages of both execution domains, Kirovski and Potkonjak (37,38) developed a technique that integrates emulation and simulation. Their technique uses a set of tools, transparent to both the design and debugging process, to enable the user to run long test sequences in emulation and, upon error detection, to roll back to an arbitrary instance in execution time and switch over to simulation-based debugging for full design visibility and controllability.

Fig. 8. The four phases of functional debugging: test vector generation, error detection, error diagnosis, and error correction. Emerging debugging techniques rely on cut-based combining of simulation and emulation for fast error diagnosis.

Error Correction. In the final phase, the faulty section or statement responsible for the observed fault is replaced by the corrected section. The design is recompiled if simulated, or its specification is updated and the design is, again, emulated or fabricated.
Existing Debugging Techniques. The existing, widely employed approaches to functional debugging are design simulation on the one hand and chip emulation or fabrication on the other. Design simulation is typically run in a workstation environment and provides an arbitrarily accurate view of the simulated architecture (39,40). Unfortunately, simulation is an extremely computation-intensive process and results in functional execution that is two to ten orders of magnitude slower than the fabricated chip. Most important, the actual simulation speed depends heavily on the simulation accuracy. State-of-the-art VHDL or Verilog RTL-level simulators are equipped with debuggers capable of performing error tracing and timing analysis (Interra's Picasso) (41) and simulation backtracking (Synopsys' Cyclone) (42). For programmable processor simulation, instruction-set simulators providing full system visibility and controllability, at varying degrees of accuracy, are used (40).

Emulated or fabricated designs provide real-life execution speed but extremely limited observability and controllability of the design's internal structure (see Table 1). For example, the Hewlett-Packard functional verification team for the new HP PA8000 processor reports almost six orders of magnitude difference in speed
between the RTL-level simulation (0.5 Hz on a typical workstation) and the FPGA-based emulation (300 kHz) of the functional execution of their PA8000-based 200 MHz workstation (43).

The in-circuit emulator circuitry is made up of capture logic, which monitors the contents of the program address register, the internal data bus, and the control lines of the processor; trace circuitry comprising a FIFO buffer, which puts data from the capture logic onto the output pins of the chip; and a content-addressable memory and a software-programmable logic array with emulation counters that together function as a finite-state machine, which performs the desired predetermined testing of the system. The debugging circuitry in the emulator, usually implemented using a JTAG boundary scan methodology (44) or dedicated high-bandwidth debug ports, enables controllability and observability of particular internal states of the emulated processor (45). The lack of full controllability and visibility of variables can be overcome by injecting debug instructions over the debug port directly into the processor's pipeline. The injected instructions read or write data in user-specified registers. This approach modifies the pipeline and thus introduces novel design challenges.

In-circuit emulation has evolved into logic (functional) porting of the processor model into arrays of rapid prototyping modules [arrays of gates, FPGAs (46,47), or specialized processors such as Hydra (48)]. Such emulation engines provide both high execution speed and relatively high observability and controllability of all registers. The system that captures the signal values of the emulated design provides variable observability and controllability either by addressing user-customized SRAM memory cells (Arkos emulator) (42,47) or by probing nets in the FPGA testbed [MP4 (46); HDL-ICE (47)]. The former raises expenses, and the latter reduces visibility [only 256 probes are available in MP4 and 1152 in HDL-ICE (47) for data inspection or update; moreover, probe reprogramming (i.e., repositioning) takes a considerable amount of time]. This methodology was used during the development of Intel's Pentium processor (49) and Sun's UltraSPARC-I and -II (47). Although this approach is significantly faster than typical software simulation, it is still one to two orders of magnitude slower than the real operation speed (49). A method for optimal cutover from one simulation/emulation method to the other during system debugging has been patented (50).

In the CAD domain, Powley and De Groat developed a VHDL model for an embedded controller that supports debugging at the application software level (51). Naganuma et al. (52) combined structured analysis approaches (34) with algorithmic debugging techniques to speed up the design validation process. Potkonjak et al. (53) proposed a design-for-debugging technique. Their technique focuses only on the error diagnosis phase. It assumes that the designer specifies the debugging variables at design time, and it provides controllability and observability of only those variables. The technique is applicable only to hard-wired ASIC designs and assumes single functional fault tolerance. Hosseini et al. proposed a code generation methodology and analysis for improved functional verification of microprocessors (54).
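The practical consequence of the speed gap quantified above is easy to estimate. The sketch below uses the execution rates reported for the PA8000 verification effort (0.5 Hz RTL simulation, 300 kHz FPGA emulation, 200 MHz silicon) together with an assumed test length; it is an illustration of scale only.

# Rough wall-clock estimates for executing a given number of design cycles.
rates_hz = {"RTL simulation": 0.5, "FPGA emulation": 300e3, "fabricated chip": 200e6}
cycles = 1e9   # assumed test length: a few seconds of real operation at 200 MHz
for name, rate in rates_hz.items():
    seconds = cycles / rate
    print(f"{name:>16}: {seconds:.3g} s  (~{seconds / 86400:.3g} days)")
# Simulating 1e9 cycles would take roughly 63 years, emulation about an hour,
# and silicon 5 s, which is why long test sequences are run in emulation and
# simulation is reserved for short, focused sequences.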
Finally, Kirovski and Potkonjak developed an approach that coordinates in-circuit emulation and simulation using a cut-based mechanism for transferring the computation state. They have shown that an effective combination of simulation and emulation can be achieved for both ASIC (37) and programmable system-on-silicon (38) design verification. It has been demonstrated experimentally that this approach enables both fast functional execution and complete controllability and observability of the design under debugging, with minimal hardware overhead.

Fig. 9. Optimal cut example for a fourth-order CF IIR filter: (a) CDFG and (b) scheduled CDFG.

Emerging Debugging Techniques. In the remainder of this section, we explain how this approach is conducted and describe the enabling mechanisms for switching functional execution between the two execution domains. First, the notion of a complete cut of a computation is introduced. Then, using an explanatory example, the design-for-debugging trade-offs are elucidated.

A complete cut is a set of variables generated within one computation iteration that bisects all possible paths in the CDFG. Such a set of variables fully determines the state of the machine between two iterations (i.e., the set of primary inputs and the optimal cut of a particular iteration fully determine the set of primary outputs of the same computation iteration). Thus, a complete cut corresponds to a subset of variables in the computation that bisects all paths between the states that delimit successive iterations of the computation. Clearly, if one has complete controllability and observability over the values of all variables in the complete cut for a specific breakpoint, the computation can be continued correctly from that breakpoint. In a sense, the cut contains complete information about the history of the computation process and its primary inputs up to a given point in time (the breakpoint).

An Explanatory Example. The diagnosis approach and the accompanying optimization issues are illustrated using a fourth-order continued fraction infinite impulse response (CF IIR) filter (55). Figure 9(a) shows the CDFG for this popular filter structure. In Fig. 9(b) the computation control flow graph is presented with respect to the particular control steps. Recall that the goal of the design-for-debugging step is to allocate minimal hardware resources that enable computation observability and controllability. One possible solution would be to select variables D1, D2, D3, and
D4 [dotted lines in Fig. 9(b)] as a complete cut. In this case, because all variables of the cut are concurrently alive, they have to be stored in four different registers. In order to provide full controllability and observability of the design, the designer must have the ability to read/write those registers from the designated I/O pins. Because four registers are in the cut, four sets of register-to-I/O connections must be allocated to enable variable observability and controllability. If the cut is instead defined among the output variables of adders A2, A4, A6, and A8 [bold lines in Fig. 9(b)], only one register is required to hold the values of all those variables, because they are not alive simultaneously during the computation. In this case, only one instance of the selection hardware is dedicated to the register that holds the cut. Cut dispensing is performed in four consecutive control steps (cycles 2, 3, 4, and 5). Note that in both cases the variable that is included in the cut but not mentioned is the actual output of the chip. The primary output of the computation is always used as a part of the complete cut because its dispensing is inevitable.
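The register-cost trade-off in this example can be checked mechanically from the definitions above: a candidate set of variables is a complete cut if every path from one iteration boundary to the next passes through it, and its register cost is the maximum number of its variables that are alive at the same time. The sketch below only illustrates those definitions on a generic dataflow graph; it is not the optimization algorithm of Refs. 37 and 38, and the toy graph and lifetimes are invented for illustration.

# Sketch: checking whether a set of variables forms a complete cut of one
# iteration of a dataflow graph, and how many registers the cut needs.
# The graph is a dict: variable -> list of variables computed from it.
def is_complete_cut(graph, sources, sinks, cut):
    # The cut is complete if every source-to-sink path passes through a cut variable.
    cut = set(cut)
    stack = [s for s in sources if s not in cut]
    seen = set(stack)
    while stack:
        node = stack.pop()
        if node in sinks:
            return False                   # found a path that avoids the cut
        for succ in graph.get(node, []):
            if succ not in cut and succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return True

def registers_needed(lifetimes, cut):
    # lifetimes: variable -> (birth_step, death_step), inclusive control steps.
    first = min(b for b, _ in lifetimes.values())
    last = max(d for _, d in lifetimes.values())
    return max(sum(1 for v in cut if lifetimes[v][0] <= t <= lifetimes[v][1])
               for t in range(first, last + 1))

# Toy example (not the CF IIR filter itself): x -> a -> b -> y
graph = {"x": ["a"], "a": ["b"], "b": ["y"]}
print(is_complete_cut(graph, sources={"x"}, sinks={"y"}, cut={"a"}))   # True
print(is_complete_cut(graph, sources={"x"}, sinks={"y"}, cut=set()))   # False
print(registers_needed({"a": (1, 1), "b": (2, 2)}, cut={"a", "b"}))    # 1: never alive together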
Considering Testability in High Level Synthesis

Recently, the importance and advantages of addressing testability early in the design process have been established. The topic has attracted attention in both the testing and the high level synthesis research literature. A summary of the developments in this emerging field can be found in Ref. 56. Even though several scheduling, assignment, and allocation algorithms for testability enhancement have been proposed, only limited attention has been paid to exploring the relationship between other high level synthesis tasks, such as transformations and partitioning, and testability. Clearly, it is important to evaluate the potential of high level synthesis techniques for testability improvement during the high level design process. Here, a set of techniques is presented that uses a variety of transformations to reduce the area overhead required by design-for-testability (DFT) techniques to make the final implementation highly testable. During testability optimization, area minimization and timing (throughput) constraints are targeted simultaneously.

In the last few years, numerous high level synthesis approaches that address testability at the behavioral level have been reported (56). Whereas several systems target built-in self-test (BIST) or hierarchical testability as the test strategy, a majority of high level synthesis techniques explore the relationship between hardware sharing and sequential automatic test pattern generation (ATPG) methods. The presence of loops in a sequential circuit is a major source of problems for sequential ATPG. Partial scan is an effective technique for breaking loops in the circuit by scanning a subset of flip-flops (FFs) (57,58). Empirical evidence shows that breaking all nontrivial sequential cycles (those with at least two FFs) is an effective heuristic for making a circuit highly testable (57,58). An approach that minimizes the number of scan registers required to break all nontrivial loops in the data path is described in more detail later; this number is used as a measure of testability. It has been demonstrated that a design can be synthesized from a high level specification in such a way that a relatively small increase in the area of the data path is often sufficient to reduce significantly the cardinality of the minimum feedback vertex set (MFVS) and, hence, the number of scan registers needed to break all loops (59). This is achieved by alternating classical scheduling and assignment algorithms in such a way that the testability overhead is taken into account (60).

The mandatory tasks during high level synthesis are allocation, scheduling, and assignment (1,61), all of which have been shown to have a significant impact on the testability of the synthesized designs. Existing high level synthesis for testability techniques can be broadly classified according to the targeted testing methodology: BIST, gate-level sequential ATPG, or hierarchical test pattern generation. Several research groups (57,62) have developed high level synthesis systems that target sequential ATPG testability. These systems synthesize data paths without loops by using proper scheduling and assignment, and they use scan registers to break the remaining loops (63,64). A data path synthesized from a behavioral specification may contain several types of loops (e.g., CDFG, assignment, and sequential false loops) (64).
Fig. 10. Motivational example for demonstrating use of transformation for testability: fourth-order parallel IIR filter.
A CDFG loop is formed in the data path when there exists a cycle consisting of data-dependency edges in the CDFG. The other types of loops are introduced into the data path during high level synthesis, specifically as a result of hardware sharing. For instance, when the operations along a CDFG path from operation u to operation v are assigned to n separate modules, with u and v assigned to the same module, an assignment loop of length n is created in the data path. A comprehensive analysis of the formation of loops in circuits synthesized by high level synthesis techniques is presented in Ref. 64. Recently, Bhatia and Jha (65) reported high level synthesis techniques that target hierarchical test pattern generation. Other works addressing testability during high level design relate to the use of testability analysis to guide test statement and test hardware insertion (66,67).

Transformations are changes in the structure of a computation made so that a particular objective is achieved while the functional and timing dependencies between the inputs and the outputs are preserved. Recently, a new transformation technique was developed that increases the complexity of the behavioral description while reducing the structural complexity of the resulting data path (60). The application of this transformation technique to reduce the partial scan overhead and generate easily testable data paths was demonstrated in Ref. 59.

An Explanatory Example of Testability in High Level Synthesis. The use of transformations for testability optimization is illustrated using the CDFG of the fourth-order parallel IIR filter shown in Fig. 10. It is assumed that each operation takes one control cycle. The available time is six control cycles, and the critical path is also six cycles long. To meet the available time of six clock cycles, the minimal resource allocation requires three multipliers (three multiplications have to be scheduled in the first control step), two adders (two additions must be scheduled simultaneously in the second control step), and two subtracters (two subtractions must be scheduled simultaneously in the third control step). The result of scheduling and assignment using an existing behavioral test synthesis system, BETS (64), is shown in Fig. 10. For instance, the first operation, a multiplication by k9, is assigned to multiplier m3 and scheduled in control cycle 1, as shown by the tuple (m3, 1). To generate the easily testable design shown in Fig. 11 (64), BETS performs allocation, scheduling, and assignment while simultaneously considering the testability overhead and the area of the implementation. The resulting minimum feedback vertex set contains four scan registers, shown shaded in Fig. 11.
Fig. 11. The data path of the initial fourth-order parallel IIR filter. When the shaded registers are scanned, all loops in the data path are broken.
It can be shown that it is impossible to reduce the number of scan registers using any other scheduling and assignment. Consider the lower biquad in Fig. 10. In order to break the loops in the corresponding part of the data path, at least two scan registers are required. If the register file of the transfer unit to which delay D1 is assigned is selected, then all the CDFG loops are broken. However, in this case, another scan register is required to break the assignment loops, which are formed when the two + operations have to be assigned to the same adder A2. If the input variable to the transfer unit is not selected for scanning, then it is obvious that at least two scan registers are needed just to break the CDFG loops. Similarly, it can be shown that a minimum of two scan registers is needed for the upper biquad in Fig. 10. Note that the variables corresponding to the delays cannot share the same scan register because they are simultaneously alive in the first control step.

Consider now how transformations can be used to reduce simultaneously the area and the testability cost of the design. A sequence of transformations, shown in Fig. 12, is applied so that eventually the CDFG shown in Fig. 12(c) is obtained. First, algebraic transformations are applied to the CDFG in Fig. 10 to obtain the functionally equivalent CDFG shown in Fig. 12(a). It can be shown that there is a correspondence between the coefficients ci and ki used in the two structures. The next transformation applied is the scaling operation (13), where two feedforward cuts are multiplied by b and 1/b to obtain the CDFG shown in Fig. 12(b). Finally, retiming is used to relocate the delay D3 to three new positions, D5, D6, and D7. In addition, the CDFG is pipelined by introducing a delay D8, which is retimed back across the multiplication by c9. The final CDFG, shown in Fig. 12(c), is significantly more suitable for BETS to produce a testable implementation. The schedule and assignment obtained by BETS are shown in Fig. 12(c). Note that, although not targeted, the throughput is also improved, because the critical path is reduced to five control steps and the schedule uses five control steps. The resulting data path is shown in Fig. 13. The hardware requirements are reduced to only two multipliers, one adder, and one subtracter, and only one scan register is now needed to break all loops in the data path (see Fig. 13). Simulation using Hyper (68) indicates that all the computational structures of the fourth-order parallel IIR filter shown in Figs. 10 and 12 have identical numerical properties and therefore identical word-length requirements.
Fig. 12. The sequence of transformations for the simultaneous optimization of testability and area. First (a) associativity and the inverse element law are applied; next (b) scaling transformation; and (c) finally retiming. The data path corresponding to the final CDFG is shown in Fig. 13.
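Selecting scan registers so that all loops are broken is an instance of the (NP-hard) minimum feedback vertex set problem on the data-path dependency graph, which is why heuristics are used in practice. The sketch below is a generic greedy heuristic on an invented register dependency graph; it is not the BETS procedure of Ref. 64 and makes no claim of minimality.

# Greedy sketch: choose scan registers so that every cycle in a register
# dependency graph is broken (a heuristic feedback vertex set).
def find_cycle(graph, removed):
    color = {}                          # 0 = on current DFS path, 1 = finished
    def dfs(v, path):
        color[v] = 0
        path.append(v)
        for w in graph.get(v, []):
            if w in removed:
                continue
            if color.get(w) == 0:
                return path[path.index(w):]          # back edge -> cycle found
            if w not in color:
                cyc = dfs(w, path)
                if cyc:
                    return cyc
        color[v] = 1
        path.pop()
        return None
    for v in graph:
        if v not in color and v not in removed:
            cyc = dfs(v, [])
            if cyc:
                return cyc
    return None

def greedy_scan_selection(graph):
    scanned = set()
    while True:
        cycle = find_cycle(graph, scanned)
        if cycle is None:
            return scanned
        # Scan the register on the cycle with the most outgoing dependencies.
        scanned.add(max(cycle, key=lambda v: len(graph.get(v, []))))

# Toy data path with two coupled loops; scanning R2 breaks both.
g = {"R1": ["R2"], "R2": ["R1", "R3"], "R3": ["R2"]}
print(greedy_scan_selection(g))          # -> {'R2'}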
High Level Synthesis for Low Power

Low power consumption has been established as an important design metric for VLSI design, mainly because of the remarkable success and growth of the class of personal computing devices and wireless communication systems, which demand high-speed computation and complex functionality with low power consumption.
Fig. 13. The data path of the transformed fourth-order parallel IIR filter. Only one scan register (shaded) is required to break all loops in the data path, compared with four scan registers in the initial design.
Recent work (19,69,70,71) has shown that the largest savings in power consumption are often obtained at the higher levels of the design hierarchy, in particular during high level synthesis. In CMOS technology, there are three sources of power dissipation: switching, short-circuit, and leakage currents. The switching component is the only one that cannot be made negligible if proper design techniques are followed (69). The average switching power is given by P = α CL Vdd^2 f, where f is the system clock frequency, Vdd is the supply voltage, CL is the load capacitance, and α is the switching activity (the probability of a 0 → 1 transition during a clock cycle) (69). The product α CL is called the effective switched capacitance.

The equation for power consumption implies that optimizing for power entails an attempt to reduce one or more of three degrees of freedom: supply voltage, switching activity, and physical capacitance. The supply voltage Vdd offers the most effective means to minimize power consumption because of its quadratic contribution. An unfortunate side effect of decreasing Vdd is that the delay of the circuit increases; the delay can be shown to be k[Vdd/(Vdd − VT)^2], where VT is the device threshold voltage and k is a technology-dependent constant (69). Reducing the switching activity and the physical capacitance can also be used to reduce power consumption, although the effect is not as drastic as that of supply voltage reduction.

High Level Power Estimation. The availability of efficient and accurate high level power estimation tools is key to increasing the effectiveness of high level synthesis. High level power estimation addresses the problem of estimating the power consumed by a design from a high level description. Because the gate-level structure of the design components is unavailable to power estimation techniques at the high level, the estimation techniques usually rely on abstract notions of switching activity and physical capacitance. The approaches to high level power estimation can be categorized into four groups. First, information-theoretic models depend on information-theoretic measures of activity, such as entropy, which characterizes the randomness of a sequence of inputs, to obtain fast power estimates (72,73). Intuitively, if the switching activity for a sequence of inputs is high, the sequence is likely to be random, resulting in high entropy. Average entropy plays a key role in estimating physical capacitance and switching activity. Second, complexity-based models rely on the assumption that the power consumption of a design can be related to its complexity, which can be determined by such parameters as the number and type of operations and the number of states and transitions in a controller description. Third, profiling-based models capture relevant data statistics, such as the number of operations of a given type, bus and memory accesses, and I/O operations, by profiling based either on stochastic analysis of the high level description and input streams (19) or on direct simulation of the design under a typical input stream (74). Last, power macro-modeling methods use regression-based or sampling-based switched-capacitance models for circuit modules (75,76).

High Level Power Optimization. High level power optimization techniques have been proposed for all the various high level synthesis steps. Some techniques deal with only one high level synthesis step, whereas others consider several synthesis steps simultaneously.
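A quick numerical illustration of the switching-power and delay expressions given above shows why voltage reduction is so attractive and why it must be paired with speed-up transformations; all parameter values in the sketch below are arbitrary placeholders, not data from the cited work.

# Numerical illustration of the models quoted above:
#   P = alpha * C_L * Vdd**2 * f      and      delay ~ Vdd / (Vdd - V_T)**2
def switching_power(alpha, c_load, vdd, freq):
    return alpha * c_load * vdd**2 * freq

def relative_delay(vdd, v_t, k=1.0):
    return k * vdd / (vdd - v_t) ** 2

alpha, c_load, freq, v_t = 0.2, 20e-12, 50e6, 0.8   # placeholder values
for vdd in (5.0, 3.3, 2.5):
    p = switching_power(alpha, c_load, vdd, freq)
    d = relative_delay(vdd, v_t)
    print(f"Vdd={vdd:.1f} V: power={p*1e3:.2f} mW, relative delay={d:.2f}")
# Dropping Vdd from 5 V to 3.3 V cuts switching power by (3.3/5)^2 ~ 0.44x
# while the gate delay grows, which is why voltage scaling is usually paired
# with speed-up transformations that recover the lost throughput.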
Several works (77,78) present scheduling algorithms in which operations are scheduled to reduce the switching activity of the functional units by minimizing the switching activity of their input operands. The scheduling algorithm proposed in Ref. 79 attempts to schedule operations so as to maximize the possibility of shutting down resources that are not performing useful computations in a given control step, thereby avoiding unnecessary power dissipation. A number of research groups have addressed the scheduling problem with multiple different voltages (restricted to two or three in their software implementations) (80,81,82,83).

In the resource allocation procedure, functional units, registers, and interconnections are assigned to variables and operations. The power dissipated by such resources usually depends on the input switching activity induced by the data being stored or processed. Assuming that the probability density functions of the inputs to the functional units are known, Chang and Pedram (84) formulated the allocation problem for functional units as a max-cost multicommodity network flow problem, which can be solved optimally. Musoll and Cortadella (78) also proposed a resource-binding algorithm to reduce the switching activity of functional units. Chang and Pedram (85) proposed an effective graph-based register allocation algorithm based on an accurate computation of the probability density functions at the inputs of the various resources, given the probability distributions of the system primary inputs. Raghunathan and Jha (86) demonstrated how hardware sharing affects the capacitance and switching activity, and hence the power dissipation, of resources and proposed an integer linear programming (ILP) formulation of the allocation problem that considers registers and functional units simultaneously. Mehra et al. (87) proposed an allocation algorithm that reduces power consumption in interconnects by exploiting locality in a computation. Locality in a computation relates to the degree to which the computation has natural clusters of operations (87).

Several authors have noted that the various high level synthesis steps must be considered simultaneously to explore the possible trade-offs resulting from their interaction. Raghunathan and Jha (71) proposed an iterative improvement algorithm that considers scheduling, clock and module selection, and resource allocation simultaneously. Mehra and Rabaey proposed scheduling and resource allocation techniques that exploit regularity in a computation to reduce power consumption in interconnects; regularity refers to the repeated occurrence of computational patterns such as a multiplication followed by an addition (88).

Transformations alter the structure of a computation in such a way that the user-specified input/output relationship is maintained (8). The problem of automatically finding, using transformations, the computational structures that result in the lowest power consumption for a specified throughput has been addressed by Chandrakasan et al. (19,89). They used two key approaches: reducing the supply voltage by applying speed-up transformations and reducing the capacitance being switched using a variety of transformations. Most of the transformations used are basically the same as the high level transformations targeting area and speed, but different cost functions are used to evaluate the results obtained through such transformations. In Refs. 77 and 90, high level transformation techniques such as loop interchange, operand reordering, and loop folding are used to reduce the switching activity of functional units. High level transformations to reduce off-chip memory references have been proposed in Ref. 22.

It is well known that some operations require less energy per computation than others when implemented in hardware; for example, multiplications usually require more energy than additions. Therefore, strength reduction transformations are used to substitute shifts and additions/subtractions for constant multiplications. Chandrakasan et al. (19) demonstrated the effectiveness of this transformation by showing a significant reduction in power when applying it to several designs. Potkonjak et al. (91) extended the transformation by considering multiple constant multiplications simultaneously.

One of the most effective ways to reduce the total switched capacitance is to reduce the number of operations in the CDFG. Various transformations can be used for this purpose; Fig. 14 illustrates how the number of operations is reduced by applying the distributivity transformation. Potkonjak and Rabaey (13) addressed the minimization of the number of multiplications and additions in linear computations in their maximally fast form so that the throughput is preserved. Potkonjak et al. (91) presented a set of techniques for minimizing the number of shifts and additions in linear computations.
Fig. 14. An example to show how the number of operations is reduced by distributivity transformation.
Sheliga and Sha (92) presented an approach for the minimization of the number of multiplications and additions in linear computations. Srivastava and Potkonjak (30) developed an approach for the minimization of the number of operations in linear computations using unfolding and the application of the maximally fast procedure. Hong et al. (93) introduced an approach for power optimization using a set of compilation techniques. They proposed a novel divide-and-conquer compilation technique to minimize the number of operations for general computations. Their technique not only optimizes a significantly wider set of computations than the previously published techniques but also outperforms, or performs at least as well as, the other techniques on all their design examples. A variant of the technique of Ref. 30 is used in the "conquer" phase of their approach. Their approach differs from Ref. 30 in two respects. First, the technique of Ref. 30 can handle only a very restricted class of computations, namely linear ones, whereas their approach can optimize arbitrary computations. Second, their approach outperforms or performs at least as well as the technique of Ref. 30 for linear computations.

We illustrate the key ideas of their approach for minimizing the number of operations by considering the computation of Fig. 15(a). Each node represents a subpart of the computation. They make the following assumptions only to clarify and simplify the presentation of this example. Assume that each subpart is linear and can be represented by state-space equations. Also assume that every subpart is dense, which means that every output and state in a subpart is a linear combination of all inputs and states in the subpart, with no 0, 1, or −1 coefficients. The number inside a node is the number of delays or states in the subpart. Assume that when there is an arc from a subpart X to a subpart Y, every output and state of Y depends on all inputs and states of X.

The number of operations per input sample is initially 2081, where the number of operations of the computation is calculated as that of its maximally fast tree representation (13). Using the technique of Ref. 30, which unfolds the entire computation, the number can be reduced to 725 with an unfolding factor of 12. Their approach optimizes each subpart separately. This separate optimization is enabled by isolating the subparts using pipeline delays; Fig. 15(b) shows the computation after the isolation step. Because every subpart is linear, unfolding is performed to optimize the number of operations for each subpart. Unfolding results in the simultaneous processing of consecutive iterations of a computation. The subparts P1, P2, P3, P4, and P5 cost 120.75, 53.91, 114.86, 129.75, and 103.0 operations per input sample with unfolding factors 3, 10, 6, 7, and 2, respectively. The total number of operations per input sample for the entire computation is 522.27. Subpart merging is now applied to further reduce the number of operations. Their heuristic considers merging of adjacent subparts. Initially, the possible merging candidate pairs are P1P2, P1P3, P2P5, P3P4, and P4P5, which produce gains of −51.48, −112.06, −52.38, 122.87, and −114.92, respectively. P3 and P4 are merged with an unfolding factor of 22.
Fig. 15. An explanatory example to show the minimization of the number of operations by the divide-and-conquer method: (a) the initial computation and (b) the computation after the isolation by pipelining.
In the next iteration, there are four subparts and four candidate pairs for merging, all of which yield negative gains, so the heuristic stops at this point. The total number of operations per input sample has further decreased to 399.4. Their approach has thus reduced the number of operations by a factor of 1.82 relative to the previous technique of Ref. 30, and by a factor of 5.2 relative to the initial number of operations.

For their experimental results, they used many real-life designs, including all the benchmark examples used in Ref. 30 as well as the following typical portable DSP, video, communication, and control applications: DAC (a four-stage NEC digital-to-analog converter for audio signals); modem (a two-stage NEC modem); GE controller (a five-state GE linear controller); APCM receiver (Motorola's adaptive pulse code modulation receiver); Audio Filter 1 (an analog-to-digital converter (ADC) followed by a 14th-order cascade IIR filter); Audio Filter 2 (an ADC followed by an 18th-order parallel filter); Video Filter 1 (two ADCs followed by a 10th-order two-dimensional IIR filter); Video Filter 2 (two ADCs followed by a 12th-order two-dimensional IIR filter); and VSTOL (a VSTOL robust observer structure aircraft speed controller). DAC, modem, and GE controller are linear computations, and the rest are nonlinear computations. The benchmark examples from Ref. 30 are all linear; they include ellip, iir5, wdf5, iir6, iir10, iir12, steam, dist, and chemical. Their method has reduced the number of operations by an average factor of 1.77 (an average of 43.5%) for the examples for which previous techniques are either ineffective or inapplicable.
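The merging step of the divide-and-conquer approach can be sketched as a simple greedy loop: repeatedly merge the pair of adjacent subparts with the largest positive gain, where the gain is the reduction in operations per input sample obtained by re-optimizing the merged subpart. The sketch below assumes the per-group costs are supplied by such an optimizer; the toy numbers are invented and are not the published results.

# Sketch of the greedy "merge adjacent subparts while the gain is positive"
# step described above. cost(group) returns operations per input sample after
# optimal unfolding of that group; adjacent(a, b) tests adjacency in the CDFG.
def greedy_merge(parts, adjacent, cost):
    while True:
        best_gain, best_pair = 0.0, None
        for a in parts:
            for b in parts:
                if a is not b and adjacent(a, b):
                    gain = cost(a) + cost(b) - cost(a | b)
                    if gain > best_gain:
                        best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            return parts
        a, b = best_pair
        parts = (parts - {a, b}) | {a | b}

# Toy cost table: merging P3 and P4 is the only profitable move.
costs = {frozenset({"P3"}): 115.0, frozenset({"P4"}): 130.0,
         frozenset({"P3", "P4"}): 122.0}
cost = lambda g: costs.get(g, sum(costs[frozenset({p})] for p in g) + 50.0)
parts = {frozenset({p}) for p in ("P3", "P4")}
print(greedy_merge(parts, adjacent=lambda a, b: True, cost=cost))
# -> {frozenset({'P3', 'P4'})}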
Conclusion

In this survey, we have provided a nonexhaustive review of existing methodologies and tools for high level synthesis, concentrating mainly on research that targets a new set of design metrics such as power, testability, and debuggability. We expect this field to remain quite active in the foreseeable future.
BIBLIOGRAPHY 1. M. C. McFarland, A. C. Parker, R. Composano, The high level synthesis of digital systems, Proc. IEEE, 78: 301–317, 1990. 2. R. A. Walker, R. Camposano, A Survey of High-level Synthesis Systems, Norwell, MA: Kluwer, 1991. 3. D. D. Gajski et al., High-level Synthesis: Introduction to Chip and System Design, Norwell, MA: Kluwer, 1992. 4. G. De Micheli, Synthesis and Optimization of Digital Circuits, New York: McGraw-Hill, 1994. 5. Y.-L. Lin, Recent developments in high-level synthesis, ACM Trans. Des. Autom. Electron. Syst., 2 (1): 2–21, 1997. 6. W. H. Wolf, R. O’Donnell, Hardware-software co-design of embedded systems (and prolog), Proc. IEEE, 82: 965–989, 1994. 7. E. A. Lee, T. M. Parks, Dataflow process networks, Proc. IEEE, 83: 773–801, 1995. 8. M. C. McFarland, Formal analysis of correctness of behavioral transformations, Formal Methods Syst. Des., 2 (3): 231–257, 1993. 9. U. Banerjee et al., Automatic program parallelization, Proc. IEEE, 81: 211–243, 1993. 10. D. F. Bacon, S. L. Graham, O. J. Sharp, Compiler transformations for high performance computing, ACM Comput. Surv., 26 (4): 345–420, 1994. 11. K. K. Parhi, High-level algorithm and architecture transformations for DSP synthesis, J. VLSI Signal Process., 9 (1–2): 121–143, 1995. 12. D. C. Ku, G. De Micheli, High Level Synthesis of ASICs under Timing and Synchronization Constraints, Norwell, MA: Kluwer, 1992. 13. M. Potkonjak, J. Rabaey, Maximally fast and arbitrarily fast implementation of linear computations, IEEE/ACM Int. Conf. Comput.-Aided Des., 1992, pp. 304–308. 14. L. Guerra, M. Potkonjak, J. Rabaey, High level synthesis for reconfigurable datapath structures, IEEE/ACM Int. Conf. Comput.-Aided Des., 1993, pp. 26–29. 15. M. Potkonjak, S. Dey, R. K. Roy, Considering testability at behavioral level: Use of transformations for partial scan cost minimization under timing and area constraints, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 14: 531–546, 1995. 16. M. Potkonjak, M. B. Srivastava, Behavioral synthesis of high performance, low cost, and low power application specific processors for linear computations, Int. Conf. Appl. Specific Array Processors, 1994, pp. 45–56. 17. K. K. Parhi, D. G. Messerschmitt, Static rate-optimal scheduling of iterative dat-flow programs via optimum unfolding, IEEE Trans. Comput., 40: 178–195, 1991. 18. M. B. Srivastava, M. Potkonjak, Transforming linear systems for joint latency and throughput optimization, Eur. Des. Autom. Conf., 1994, pp. 267–271. 19. A. P. Chandrakasan et al., Optimizing power using transformations, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 14: 12–31, 1995. 20. R. Karri, A. Orailoglu, High-level synthesis of fault-secure microarchitectures, Des. Autom. Conf., 1993, pp. 429–433. 21. A. Orailoglu, R. Karri, Coactive scheduling and checkpoint determination during high level synthesis of self-recovering microarchitectures, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2: 304–311, 1994. 22. F. Catthoor et al., Global communication and memory optimizing transformations for low power signal processing systems, VLSI Signal Process., 7: 178–187, 1994. 23. A. S. Tanenbaum, H. van Straven, J. W. Stevenson, Using peephole optimization on intermediate code, ACM Trans. Programm. Lang. Syst., 4 (1): 21–36, 1982. 24. J. D. Ullman, Database and Knowledge-Based Systems, Rockville, MD: Computer Science Press, 1989, vol. 2. 25. H. Massalin, Superoptimizer: A look at the smallest program, Int. Conf. Arch. Support Programm.Lang. Oper. Syst., 1987, pp. 
122–126. 26. M. E. Wolf, M. S. Lam, A loop transformation theory and an algorithm to maximize parallelism, IEEE Trans. Parallel Distrib. Syst., 2: 452–471, 1991. 27. G. C. Fox, J. G. Koller, Code generation by a generalized neural network, J. Parallel Distrib. Comput., 7 (2): 388–410, 1989. 28. Z. Iqbal et al., Critical path minimization using retiming and algebraic speedup, Des. Autom. Conf., 1993, pp. 573–577. 29. D. Whitfield, M. L. Soffa, An approach to ordering optimizing transformations, ACM Symp. Prin. Pract. Parallel Program., 1990, pp. 137–147.
30. M. Srivastava, M. Potkonjak, Power optimization in programmable processors and ASIC implementations of linear systems: Transformation-based approach, Des. Autom. Conf., 1996, pp. 343–348. 31. S.-H. Huang, J. M. Rabaey, Minimizing the throughput of high performance DSP applications using behavioural transfomrations, EDAC-ETC-EUROASIC, 1994, pp. 25–30. 32. I. Hong, D. Kirovski, M. Potonjak, Potential-driven statistical ordering of transformations, Des. Autom. Conf., 1997, pp. 347–352. 33. L. Yang et al., System design methodology of UltraSPARC-I, Des. Autom. Conf., 1995, pp. 7–12. 34. S. Narayan et al., System specification and synthesis with the SpecCharts language, IEEE/ACM Int. Conf. Comput.Aided Des., 1991, pp. 266–269. 35. A. Yu, The future of microprocessors, IEEE Micro, 16 (6): 46–53, 1996. 36. K. Saab et al., LIMSoft: Automated tool for sensitivity analysis and test vector generation, IEEE Proc. Circuits, Devices Syst., 143: 386–392, 1996. 37. D. Kirovski, M. Potkonjak, Quantitative approach to functional debugging, IEEE/ACM Int. Conf. Comput.-Aided Des., 1997. 38. D. Kirovski, M. Potkonjak, L. M. Guerra, Functional debugging of systems-on-silicon, unpublished manuscript, Computer Science Department, UCLA, 1998. 39. M. Rosenblum et al., Complete computer system simulation: The SimOS approach, IEEE Parallel Distrib. Technol.: Syst. Appl., 3 (4): 34–43, 1995. 40. V. Zivojnovic, H. Meyr, Compiled HW/SW co-simulation, Des. Autom. Conf., 1996, pp. 690–695. 41. [Online], Available http://www.interrainc.com/picasso.html 42. [Online], Available http://www.synopsys.com/products/simulation/cyclone-cs.html 43. S. T. Mangelsdorf et al., Functional verification of the HPA PA 8000 processor, Hewlett-Packard J., August, 1997; [Online], Available http://www.hp.com:80/hpj/97aug/au97a3.pdf 44. C. Maunder, JTAG, the Joint Test Action Group, IEEE Colloq. New Ideas Test., 1986, pp. 6/1–4. 45. P. C. Ching, Y. C. Cheng, M. H. Ko, An in-circuit emulator for TMS320C25, IEEE Trans. Educ., 37: 51–56, 1994. 46. [Online], Available http://www.aptox.com:80/Products/mp4.html 47. [Online], Available http://www.quickturn.com/prod/hdlice/hdlice.htm 48. K. Olukotun et al., Hydra project, [Online], Available http://ogun.stanford.edu 49. A. Saini, Design of the Intel Pentium (tm) Processor, Int. Conf. Comput.-Aided Des., 1993, pp. 258–261. 50. J. D. Myers, J. L. Rivero, System for optimal electronic debugging and verification employing scheduled cutover of alternative logic simulations, US Patent No. 5,566,097, 1993. 51. G. S. Powley, J. E. DeGroat, Experience in testing and debugging the i960 MX VHDL model, VHDL Int. Users Forum, 1994, pp. 130–135. 52. J. Naganuma et al., High-level design validation using algorithmic debugging, EDAC, 1994, pp. 474–480. 53. M. Potkonjak, S. Dey, K. Wakabayashi, Design-for-debugging of application specific designs, IEEEACM Int. Conf. Comput.-Aided Des., 1995, pp. 295–301. 54. A. Hosseini, D. Mavroidis, P. Konas, Code generation and analysis for the functional verification of microprocessors, Des. Autom. Conf., 1996, pp. 305–310. 55. R. E. Crochiere, A. V. Oppenheim, Analysis of linear networks, Proc. IEEE, 63: 581–595, 1975. 56. L. J. Avra, E. J. McCluskey, High level synthesis of testable designs: An overview of university systems, Int. Test Conf., 1994. 57. K. T. Cheng, V. D. Agrawal, A partial scan method for sequential circuits with feedback, IEEE Trans. Comput., 39: 544–548, 1990. 58. D. H. Lee, S. M. Reddy, On determining scan flip-flops in partial-scan designs, IEEE Int. 
Conf. Comput.-Aided Des., 1990, pp. 322–325. 59. S. Dey, M. Potkonjak, R. K. Roy, Synthesizing designs with low-cardinality minimum feedback vertex set for partial scan application, VLSI Test Symp., 1994, pp. 2–7. 60. S. Dey, M. Potkonjak, Transforming behavioral specifications to facilitate synthesis of testable designs, Int. Test Conf., 1994. 61. C. Papachristou et al., SYBNTEST: A method for high-level SYNthesis with self TESTability. IEEE/ACM Int. Conf. Comput.-Aided Des., 1991, pp. 458–462.
62. A. Majumdar, K. Saluja, R. Jain, Incorporating testability considerations in high-level synthesis, Int. Symp. FaultTolerant Comput., 1992. 63. T. C. Lee, N. K. Jha, W. H. Wolf, Behavioral synthesis of highly testable data paths under non-scan and partial scan environment, Des. Autom. Conf., 1993, pp. 292–297. 64. S. Dey, M. Potkonjak, R. Roy, Exploiting hardware-sharing in high level synthesis for partial scan optimization, IEEE/ACM Int. Conf. Comput.-Aided Des., 1993, pp. 20–25. 65. S. Bhatia, N. K. Jha, Genesis: A behavioral synthesis system for hierarchical testability, Eur. Des. Test Conf., 1994, pp. 272–276. 66. P. Vishakantaiah, J. A. Abraham, M. Abadir, Automatic test knowledge extraction from VHDL (ATKET), Des. Autom. Conf., 1992, pp. 273–278. 67. C.-H. Chen, D. G. Saab, BETA: Behavioral testability analysis, IEEE/ACM Int. Conf. Comput.-Aided Des., 1991, pp. 202–205. 68. J. Rabaey et al., Fast prototyping of datapath-intensive architectures, IEEE Des. Test Comput., 8 (2): 40–51, 1991. 69. A. P. Chandrakasan, S. Sheng, R. W. Broderson, Low-power CMOS digital design, IEEE J. Solid-State Circuits, 27: 473–484, 1992. 70. R. Mehra, J. Rabaey, Behavioral level power estimation and exploration, Int. Workshop Low Power Des., 1994, pp. 197–202. 71. A. Raghunathan, N. K. Jha, An iterative improvement algorithm for low power data path synthesis, Int. Conf. Comput.Aided Des., 1995, pp. 597–602. 72. D. Marculescu, R. Marculescu, M. Pedram, Information theoretic measures for power analysis, IEEE Trans. Comput.Aided Des. Integr. Circuits Syst., 15: 599–610, 1996. 73. M. Nemani, F. N. Najm, Towards a high-level power estimation capability, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 15: 588–598, 1996. 74. N. Kumar et al., Profile-driven behavioral synthesis for low power VLSI systems, IEEE Des. Test Comput., 12: 70–84, 1995. 75. S. Gupta, F. N. Najm, Power macromodeling for high level power estimation, Des. Autom. Conf., 1997, pp. 365–370. 76. P. E. Landman, J. M. Rabaey, Architectural power analysis: The dual bit type method, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 3: 173–187, 1995. 77. E. Musoll, J. Cortadella, High-level synthesis techniques for reducing the activity of functional units, Int. Symp. Low Power Des., 1995, pp. 99–104. 78. E. Musoll, J. Cortadella, Scheduling and resource binding for low power, Int. Symp. Syst. Synth., 1995, pp. 104–109. 79. J. Monteiro et al., Scheduling techniques to enable power management, Des. Autom. Conf., 1996, pp. 349–352. 80. J.-M. Chang, M. Pedram, Energy minimization using multiple supply voltages, Int. Symp. Low Power Electron. Des., 1996, pp. 157–162. 81. M. C. Johnson, K. Roy, Datapath scheduling with multiple supply voltages and level converters, ACM Trans. Des. Autom. Electron. Syst., 2: 1997. 82. Y.-R. Lin, C.-T. Hwang, A. C.-H. Wu, Scheduling techniques for variable voltage low power designs, ACM Trans. Des. Autom. Electron. Syst., 2 (2): 81–97, 1997. 83. S. Raje, M. Sarrafzadeh, Variable voltage scheduling, Int. Symp. Low Power Des., 1995, pp. 9–14. 84. J.-M. Chang, M. Pedram, Module assignment for low power, EURO-DAC Eur. Des. Autom. Conf., 1996, pp. 376–381. 85. J.-M. Chang, M. Pedram, Register allocation and binding for low power, Des. Autom. Conf., 1996, pp. 29–35. 86. A. Raghunathan, N. K. Jha, Behavioral synthesis for low power, Int. Conf. Comput. Des., 1994, pp. 318–322. 87. R. Mehra, L. Guerra, J. Rabaey, Low-power architectural synthesis and the impact of exploiting locality, J. 
VLSI Signal Process., 13 (2–3): 239–258, 1996. 88. R. Mehra, J. Rabaey, Exploiting regularity for low-power design, Int. Conf. Comput.-Aided Des., 1996, pp. 166–172. 89. A. P. Chandrakasan et al., Hyper-lp: A system for power minimization using architectural transformation, IEEE/ACM Int. Conf. Comput.-Aided Des., 1992, pp. 300–303. 90. D. Kim, K. Choi, Power-conscious high level synthesis using loop folding, Des. Autom. Conf., 1997, pp. 441–445. 91. M. Potkonjak, M. Srivastava, A. P. Chandrakasan, Multiple constant multiplications: Efficient and versatile framework and algorithms for exploring common subexpression elimination, IEEE Trans Comput.-Aided Des. Integr. Circuits Syst., 15: 151–165, 1996.
92. M. Sheliga, E. H.-M. Sha, Global node reduction of linear systems using ratio analysis, Int. Symp. High-Level Synth., 1994, pp. 140–145. 93. I. Hong, M. Potonjak, R. Karri, Power optimization using divide-and-conquer techniques for minimization of the number of operations, IEEE/ACM Int. Conf. Comput.-Aided Des., 1997, pp. 108–113.
INKI HONG DARKO KIROVSKI MIODRAG POTKONJAK University of California at Los Angeles
Wiley Encyclopedia of Electrical and Electronics Engineering
Logic Synthesis
Standard Article
Tan-Li Chou (Intel Corporation, Hillsboro, OR) and Kaushik Roy (Purdue University, West Lafayette, IN)
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W1802
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Two-Level Logic Synthesis; Multilevel Logic Synthesis; Sequential Circuit Synthesis; Logic Synthesis for FPGAs; Summary.
LOGIC SYNTHESIS

Synthesis of digital circuits consists of a series of steps involving translation, optimization, and mapping. In general, a design is described at different levels of abstraction using hardware description languages like VHSIC hardware description language (VHDL) or Verilog. Design descriptions at each level are usually optimized based on area, timing, and power dissipation measures to generate a design description at the following lower level. In this article we consider design description and optimization at the logic gate level. For sequential circuits, we assume that a higher level of description, such as the state-transition graph, is available. Our aim is to survey various logic synthesis techniques that target different optimization criteria, like area, timing, and power consumption. It is beyond the scope of this article to delve into the details of logic synthesis algorithms. For details of the algorithms, the readers are referred to the papers and books referenced here.

This article is organized as follows. In the first section we describe two-level logic optimization techniques. Two-level logic usually consists of an AND gate level and an OR gate level. The logic description can be in a product-of-sums or a sum-of-products form (such as in programmable logic arrays). It turns out that logic implemented in multiple levels (rather than two) can be more area efficient. The next section describes multilevel logic optimization algorithms that consider area, timing, and power dissipation. After logic optimization, the design is usually mapped into a target library. Such mapping techniques are called technology mapping and are briefly described in this section. The previous sections consider only combinational circuits. Because sequential circuits can have feedback and use memory elements, like flip-flops or latches,
the synthesis techniques differ. Sequential circuits can be represented by state-transition diagrams. In the following section we consider the state assignment and sequential circuit synthesis problem. The synthesis techniques described in the previous sections are suitable for application-specific integrated circuit (ASIC) or gate array implementations. Recently, a new style of design called field-programmable gate arrays (FPGAs) has become very popular. FPGAs usually consist of rows of logic modules (which can implement different types of logic gates) with (re)programmable routing architectures. Such logic styles require a different logic synthesis style. The next section considers the basics of logic optimization for FPGAs. Finally, conclusions are given in the last section.

TWO-LEVEL LOGIC SYNTHESIS

By definition, two-level logic circuits are logic circuits with two levels of logic gates. A typical technology that uses two-level logic circuits is programmable logic arrays (PLAs). Usually NOR–NOR or NAND–NAND arrays are used. For example, the logic function y = ((x1 + x2)′ + (x3 + x4)′)′ is a NOR–NOR implementation. Here xi′ denotes the complement of xi. In addition to its use in PLAs, the two-level logic design style is adopted because of its speed advantage. One of the factors that determine circuit delay is the number of stages (levels) through which a signal goes. Therefore, two-level logic circuits are fast compared to multilevel logic circuits (introduced later) and may be chosen at the cost of area (using more and/or bigger gates to implement the circuit). Another reason we are interested in two-level logic is that it represents a component of a multilevel logic network; if we can simplify this representation, we simplify the multilevel logic optimization.

For simplicity, the AND–OR representation is assumed in this section; that is, sums of products like z = x1x2 + x3x4 are our focus in this discussion. Because, by DeMorgan's laws (1), NOR–NOR or NAND–NAND expressions can be rewritten as sums of products, this assumption does not lose generality. For example, y = [(x1 + x2)′ + (x3 + x4)′]′ mentioned earlier can be rewritten as y′ = x1′x2′ + x3′x4′ by DeMorgan's laws (1). Given a Boolean function represented as a sum of products, we want to know how to implement it with minimum area. Most of the literature uses the minimum number of product terms as the cost function to be optimized. This simplifies the problem and separates the optimization from the specific technology that implements the function; however, it may not correspond to the minimum area. To understand how different optimization algorithms work, we have to introduce some key concepts.

Basic Concepts

A Boolean function f can be expressed as a sum of products of n literals. These product terms are called minterms. For example, w = x1x2 + x1x3 can be rewritten as x1x2x3′ + x1x2x3 + x1x2′x3, with three minterms of three literals. An implicant of a Boolean function f is a product term that is contained in f. An implicant p is said to be contained in f if p = 1 implies f = 1, which is denoted as p ⊆ f. Similarly, an implicant p1 is said to be contained in another implicant p2, denoted as p1 ⊆ p2, if p1 = 1 implies p2 = 1. For example, f = x1x2 + x1x3 has implicants x1x2 and x1x3. Function f contains both x1x2 and
x1x3, and neither does x1x2 contain x1x3 nor does x1x3 contain x1x2. In addition, x1x2x3 is also an implicant of f and is contained in both x1x2 and x1x3. A prime implicant of f is an implicant of f that is not contained in any other implicant of f; that is, x1x2 and x1x3 are two prime implicants of f, whereas x1x2x3 is not. A cover Cf of a Boolean function f is a set of implicants that contains all the minterms of f, and f contains Cf. A cover is said to be prime if all the implicants of the cover are prime. Sometimes f may have don't care conditions, which specify that the result of f is not of concern for certain inputs. In this case, if the don't care conditions are denoted by DCf,

f ⊆ Cf ⊆ f ∪ DCf    (1)
A cover can contain some implicants that are don't care conditions. A minimum cover of a Boolean function f is a cover of f with a minimum number of implicants. In contrast, an irredundant (or minimal) cover is a cover such that no proper subset of it is also a cover of f; if any implicant is taken away from an irredundant cover, it is no longer a cover. Given these definitions, the two-level logic optimization problem becomes equivalent to finding a minimum cover of a Boolean function. There are two approaches to solving this problem, exact and heuristic. Because solving the problem exactly may not be feasible for large circuits, a heuristic approach is usually taken. Though a heuristic approach may yield a suboptimal solution, it often gives a minimum solution.

Exact and Heuristic Solutions

Among all the minimum covers, there is always a minimum cover that is prime. This was proved by Quine (2) and allows us to limit the search space and to look for a minimum cover among all the prime covers. In addition, we can make prime every nonprime implicant of a minimum cover. This is done by replacing the nonprime implicant with a prime implicant that contains it. For example, suppose that abcde is a nonprime implicant and is in a minimum cover Cf. By the definition of prime implicants, a prime implicant, say abd, must exist that contains abcde. Then abd can replace abcde. Usually the area cost of abd is less than that of abcde, and therefore prime minimum covers give solutions of smaller area than nonprime minimum covers.

The Karnaugh map and Quine–McCluskey methods are systematic procedures for simplifying two-level logic functions (1). The Karnaugh map method is useful for simplifying functions of two to four variables and can be extended to handle functions of five and six variables. For example, let y = a′b′ + a′b + ab′c. Figure 1 shows the Karnaugh map of the function y. From the map we conclude that it has two prime implicants and that a prime minimum cover is {a′, b′c}. Notice that the dotted squares or rectangles are prime implicants. The Karnaugh map method can be seen as an attempt to find all the prime implicants and then a prime minimum cover out of these prime implicants. The Quine–McCluskey method finds all prime implicants first and then builds a prime implicant chart to determine prime minimum covers. For example, let z = ab + ab′c + a′c′ = abc + abc′ + ab′c + a′bc′ + a′b′c′.
Figure 1. Karnaugh Map of a logic function: It is used to simplify two-level logic functions.
Figure 3. Prime implicant chart of a function: It helps find a minimum cover.
The optimization procedure is carried out as shown in Fig. 2. The table on the left-hand side is a list of minterms, and the one on the right is a list of the four prime implicants p1, p2, p3, and p4. Implicant p1 (0–0, meaning a′c′) is derived by combining minterms 1 and 2, and p2 (−10, meaning bc′) is the result of combining minterms 2 and 4. The rest are derived as shown in the figure. Figure 3 shows the prime implicant chart. It shows that minterm 1 (a′b′c′) is covered only by p1 and that minterm 3 (ab′c) is covered only by p3. Such prime implicants are called essential prime implicants and must be chosen for any minimum prime cover. As a result, minterms 1, 2, 3, and 5 are covered, whereas minterm 4 (abc′) is not. In this case, it is very easy to see that two minimum prime covers exist, {p1, p3, p2} and {p1, p3, p4}. However, when the number of variables and prime implicants increases, the computational time and memory used for searching for a minimum prime cover may be exponential in the worst case. Researchers have improved the Quine–McCluskey method by building a smaller prime implicant chart and by applying an efficient branch-and-bound algorithm to search for the minimum prime covers (3). Further improvements have been made in Refs. 4–7. However, for most practical cases, the run time of the exact solutions may not be tolerable. This motivates heuristic approaches. The heuristic minimizer ESPRESSO (3) gives a minimal (not necessarily minimum) prime cover. ESPRESSO builds a prime cover and uses an iterative improvement strategy to modify and delete implicants. Often it gives solutions close to minimum covers.
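The chart-covering step can be illustrated directly on this example. In the sketch below, the minterms and the cubes for p1 through p4 are written in positional-cube notation as inferred from the discussion above (only p1 and p2 are spelled out explicitly in the text, so the p3 and p4 cubes are an assumption); the exhaustive completion of the cover is practical only because the example is tiny, and this is not an efficient Quine–McCluskey or ESPRESSO implementation.

from itertools import combinations

# Prime implicant chart for the example above ('-' marks a don't care bit).
minterms = ["000", "010", "101", "110", "111"]      # a'b'c', a'bc', ab'c, abc', abc
primes = {"p1": "0-0", "p2": "-10", "p3": "1-1", "p4": "11-"}

def covers(cube, minterm):
    return all(c == "-" or c == m for c, m in zip(cube, minterm))

chart = {m: {p for p, cube in primes.items() if covers(cube, m)} for m in minterms}
essential = {next(iter(ps)) for ps in chart.values() if len(ps) == 1}
print("essential:", sorted(essential))              # ['p1', 'p3']

remaining = [m for m in minterms if not chart[m] & essential]
others = [p for p in primes if p not in essential]
for size in range(len(others) + 1):                 # smallest completion of the cover
    done = [set(extra) for extra in combinations(others, size)
            if all(chart[m] & set(extra) for m in remaining)]
    if done:
        print("minimum covers:", [sorted(essential | extra) for extra in done])
        break
# -> minimum covers: [['p1', 'p2', 'p3'], ['p1', 'p3', 'p4']]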
MULTILEVEL LOGIC SYNTHESIS

Optimal multilevel logic synthesis is a known difficult problem which has been studied since the 1950s. Unlike two-level logic optimization, the exact optimization methods for multilevel logic are not practical for today's circuit design because of their computational complexity. Therefore, we focus on heuristic optimization methods in our discussion. The first of the modern developments is the logic synthesis system (LSS) at IBM (9), which has a variety of gate arrays and standard cells as target technology. More recent work in multilevel logic optimization includes MIS (10) and BOLD (11). Both MIS and BOLD are aimed at optimization techniques which are independent of particular technologies and were developed to bring multilevel logic optimization to the level of science obtained for two-level logic optimization. Because the results are independent of particular technologies, there is one more step, called technology mapping, that needs to be taken. Technology mapping maps the logic functions to the gates of a particular technology before it can be implemented in a very large scale integration (VLSI) circuit. First we review some basic concepts before introducing the optimization techniques behind MIS and BOLD. Then optimization techniques based on different algebra models (algebraic and Boolean models) are discussed, followed by a brief discussion of other approaches and technology mapping.

Basic Concepts
The on-set of a function f is the set of minterms for which the function evaluates to 1. The off-set of the function f is the set of minterms for which f equals 0. The don't care set or dc-set is the set of minterms for which the value of the function is unspecified. A function which does not have a dc-set is a completely specified function. A function with a nonempty dc-set is termed an incompletely specified function.
Figure 2. Quine-McCluskey method of a function: It is used to simplify two-level logic functions.
Sum of Product Form. A literal is a Boolean variable (say, x) or its complement (x′ or x̄). A cube can be defined as a product of literals, for example, xyz′. The trivial cubes, written as 0 and 1, represent the Boolean functions 0 and 1, respectively. A sum-of-products (SOP) form (also called an expression) is a set of cubes. For example, the expression a + bc′ consists of two cubes, a and bc′.
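As an illustration (a hypothetical sketch, not from the original article; the representation and function names are invented), literals, cubes, and SOP expressions can be represented directly as sets, which makes cube containment checks and the removal of redundant cubes straightforward:

```python
# A possible representation of cubes and sum-of-products expressions:
# a literal is a (variable, phase) pair, a cube is a frozenset of literals,
# and an expression is a set of cubes.

def cube(s):
    """Parse a cube such as "ab'c" into a frozenset of (var, phase) pairs."""
    lits, i = [], 0
    while i < len(s):
        var = s[i]
        if i + 1 < len(s) and s[i + 1] == "'":
            lits.append((var, 0)); i += 2
        else:
            lits.append((var, 1)); i += 1
    return frozenset(lits)

def contains(c1, c2):
    """Cube c1 contains cube c2 when every literal of c1 appears in c2."""
    return c1 <= c2

def make_nonredundant(expr):
    """Drop any cube properly contained in another cube of the expression."""
    return {c for c in expr if not any(other != c and contains(other, c)
                                       for other in expr)}

expr = {cube('a'), cube('ab'), cube("bc'")}
print(make_nonredundant(expr) == {cube('a'), cube("bc'")})   # True: a + ab + bc' -> a + bc'
```

The last line reflects the definition that follows: a + ab is redundant because the cube a contains ab, so the redundant cube is dropped.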
An expression is algebraic (nonredundant) if no cube in the expression properly contains another. For example, a + ab is redundant because a contains ab, but the expression a + bc′ is algebraic. A Boolean expression is a nonredundant expression. The union of two expressions f and g, denoted f ∪ g, is the set that consists of all the cubes of f and g and is transformed into a nonredundant expression. Similarly, the intersection of f and g, denoted f ∩ g, is the set of all the common cubes of f and g.

Factored Form. The usual representation of a logic function is the sum-of-products form. An alternative representation is the factored form. It is the generalization of the sum-of-products form allowing nested parentheses. For example, the expression ace + ade + bce + bde + e′ can be written in factored form as e(a + b)(c + d) + e′. In other words, a factored form is a sum of products of arbitrary depth. Generally the factored form is not unique. For example, the expression abc + abd + cd is itself a factored form but can also be written as ab(c + d) + cd or abc + (ab + c)d. Both are factored forms. In many applications, it is infeasible to describe each single-output function of a multiple-output function as a single expression or a single factored form. Often, a set of intermediate functions is introduced, and each depends on the original inputs and possibly other intermediate functions. Then, each single-output function can be expressed as a function of the original inputs and the intermediate functions. For example, the multiple-output function

F1 = [(a + b)c + d]e + f
F2 = [(a + b)c + d]g + h

can be expressed as the following set of functions involving variables x and y:

F1 = ye + f
F2 = yg + h
y = xc + d
x = a + b

Multi-level logic refers to any multiple-output Boolean function represented by a set of interconnected functions. Therefore, multilevel logic is a particular representation of multiple-output functions. A multiple-level logic function can be graphically represented as a directed acyclic graph (DAG) (V, E), where V and E are the sets of all vertices and edges in the graph, respectively. The set of input nodes VIN does not have any incoming edges, and the set of output nodes VOUT does not have any outgoing edges. The set of intermediate nodes is given by VINT = V − (VIN ∪ VOUT). Each node vi ∈ VINT computes a Boolean function Fi in terms of its fan-in nodes and is also associated with a "local output" variable yi, where yi = Fi.

The desired design style affects the synthesis and optimization methods. Indeed, the search for an interconnection of logic gates that optimizes area and/or performance depends on the constraints on the choice of the gates themselves. Multiple-level logic synthesis is usually partitioned into two tasks. First, a logic network is optimized while neglecting the implementation constraints on the logic gates and assuming loose models for their area and performance. Second, the constraints on the usable gates (e.g., those represented in the cell library) are taken into account.

Logic Synthesis Using the Algebraic Model

A logic network can be optimized by using the general properties of polynomial algebra to simplify the Boolean model. The simplification includes the assumption that the exponent of every variable in the network is at most one in the polynomial algebra and ignores don't care conditions. Such an assumption simplifies the problem but may yield a less than optimal solution.

Division. Given functions f and p, we can find functions q and r such that f = pq + r. Every such operation is like the division operation and is therefore called (algebraic) division of f by p, generating quotient q and remainder r. The function p is called a divisor of f if r is not null and a factor if r is null. Optimization can be carried out if good algebraic divisors can be found. The set of algebraic divisors is defined as D(f) = {g | f/g ≠ ∅}. The primary divisors of f are defined as P(f) = {f/c | c is a cube}. That means the primary divisors of an expression f are the expressions f/c, where c is a cube. For example, if f = abc + abde, then

f/a = bc + bde
is a primary divisor. The kernels of f are defined as K(f) = {k | k ∈ P(f), k is cube-free}. An expression is cube-free if it does not have a cube factor. In other words, the kernels of an expression f are the cube-free primary divisors of the expression. The cube c used to obtain the kernel k = f/c is called the co-kernel of f. Continuing with the previous example, f/a is a primary divisor but not cube-free, because b is a factor of f/a. Therefore,

f/a = b(c + de)

However,

f/(ab) = c + de

is a kernel, and ab is called a co-kernel. Note that by definition, co-kernels are always cubes.

Example. Given the expression

X = abcdg + abcdh + abce + abcf + abi

then

X/a = bcdg + bcdh + bce + bcf + bi
is a primary divisor, but is not a kernel because the expression is not cube-free (cube b divides each term). However,

X/(ab) = cdg + cdh + ce + cf + i
and

X/(abcd) = g + h

are both kernels, with associated co-kernels ab and abcd. Because no single cube is cube-free, a kernel must contain two or more cubes. Also, because 1 is a cube, if f is cube-free, then f is considered one of its own kernels. The level of a kernel is defined to provide easily identifiable subsets of the set of all kernels. Recall that kernels are expressions, and hence it makes sense to refer to the kernels of a kernel. A kernel is called a level-0 kernel of f if it does not have any kernels except itself. No literal appears twice in a level-0 kernel. A kernel is called a level-n kernel if it contains a kernel of level (n − 1) but does not contain any kernels of level n except itself. This gives us a natural partition of the kernels:

K0(f) ⊂ K1(f) ⊂ K2(f) ⊂ . . . ⊂ Kn(f) ⊂ K(f)

Example. The expression

x = [a(b + c) + d][eg + g(f + e)] + (b + c)(h + i)

has, among others, the kernels b + c and a(b + c) + d, which are level-0 and level-1, respectively, whereas x itself is a kernel of level 2 because it has level-1 kernels but no level-2 kernels other than itself.

The motivation for this definition of the kernels of a logic expression comes from the following theorem, which was used in MIS (10):

Theorem 1. f and g have a common multicube divisor if and only if there exist kf ∈ K(f) and kg ∈ K(g) such that |kf ∩ kg| ≥ 2.

That is, two functions have a common multiple-cube divisor if and only if the intersection of a kernel from f and a kernel from g has more than one cube. It is important to remember that an expression is a set of cubes and that the intersection of kernels refers to the set intersection of the expressions, not the Boolean intersection of the logic functions implied by the expressions. The computation of the kernel set of each expression in the logic network is the first step toward the extraction of multiple-cube expressions. Then the candidate common subexpressions to be extracted are chosen among the kernel intersections.

Example. Consider a network with the following expressions:

fx = ace + bce + de + g
fy = ad + bd + cde + ge
fz = abc

The kernel sets are

K(fx) = {(ac + bc + d), (a + b), (ace + bce + de + g)}
K(fy) = {(a + b + ce), (cd + g), (ad + bd + cde + ge)}

and the kernel set of fz is empty. Hence, multiple-cube common subexpressions can be extracted only from fx and fy. There is only one kernel intersection, namely, between (a + b) ∈ K(fx) and (a + b + ce) ∈ K(fy). The intersection is a + b and can be extracted to yield
fw = a + b
fx = wce + de + g
fy = wd + cde + ge
fz = abc

Logic Transformation

The goal of multilevel logic optimization is to obtain a representation of the Boolean function that is optimal with respect to area, speed, testability, and power dissipation. To restructure a logic function, the operations described in the following are used.

Decomposition. The decomposition of a Boolean function is the process of reexpressing a single function as a collection of new functions. For example, the process of translating the expression

F = abc + abd + a′c′d′ + b′c′d′

to the set of expressions
F = XY + X′Y′
X = ab
Y = c + d

is a decomposition, shown in Fig. 4. The associated optimization problem is to find a decomposition with minimum total area or power.

Figure 4. Decomposition decomposes a large function into smaller ones.

Extraction. Extraction is applied to many functions. It is the process of identifying and creating some intermediate functions and variables and reexpressing the original functions in terms of the original and the intermediate variables. Extraction creates nodes with multiple fan-outs. For example, extraction applied to the following three functions:
F = (a + b)cd + e
G = (a + b)e′
H = cde
yields
F = XY + e
G = Xe′
H = Ye
X = a + b
Y = cd

Thus the operation identifies common subexpressions among different logic functions that form a network. This is shown in Fig. 5. New nodes corresponding to the common subfunctions are created, and each of the logic functions in the original network is simplified as a result of introducing these new nodes. The optimization operation is to find a set of intermediate functions such that the resulting network has minimum area, low power dissipation, small delay, or maximum testability.

Figure 5. Extraction extracts common expressions from functions.

Factoring. Factoring is the process of deriving a factored form from a sum-of-products form of a function. For example,

F = ac + ad + bc + bd + e

can be factored to

F = (a + b)(c + d) + e

The optimization problem associated with factoring is to find a factored form with the minimum number of literals.

Substitution. Substitution, also called resubstitution, of a function G into F is the process of reexpressing F as a function of its original inputs and G. Let us consider the example of Fig. 6. Substituting G = a + b into F = (a + b)(a + c) produces

F = G(a + c)

This operation creates an arc (the wider line in Fig. 6) in the Boolean network that connects the node of the substituting function, namely G, to the node of the function being substituted in, namely F.

Figure 6. Substitution substitutes one function into another.

Elimination. Elimination, collapsing, or flattening is the inverse operation of substitution. If G is a fan-in of F, collapsing G into F reexpresses F without G. It undoes the operation of substituting G into F. For example, if

F = Ga + G′b
G = c + d

then collapsing G into F results in

F = ac + ad + bc′d′
G = c + d

This is illustrated in Fig. 7.

Figure 7. Elimination eliminates one particular function representation from a function. That is, it flattens the original function.
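The decomposition, extraction, and substitution operations above all rest on algebraic division and on the enumeration of kernels. The following Python sketch (hypothetical code, not part of the original article; the function and variable names are invented) shows one straightforward way to compute weak division and the kernels of a small expression such as fx = ace + bce + de + g; production tools use far more efficient recursive kernel extraction.

```python
# A compact sketch of algebraic (weak) division and kernel extraction.
# An expression is a set of cubes; a cube is a frozenset of single-letter
# variables (complemented literals are not needed for this example).

from itertools import combinations

def divide(f, d):
    """Weak division of expression f by expression d: returns (q, r) with
    f = d*q + r in the algebraic sense."""
    quotients = []
    for dc in d:
        quotients.append({frozenset(c - dc) for c in f if dc <= c})
    q = set.intersection(*quotients) if quotients else set()
    product = {frozenset(qc | dc) for qc in q for dc in d}
    r = {c for c in f if c not in product}
    return q, r

def cube_free(expr):
    """An expression is cube-free if no single cube divides every term."""
    return len(expr) > 1 and not set.intersection(*(set(c) for c in expr))

def kernels(f):
    """All kernels of f, obtained as cube-free quotients f / c over cubes c."""
    literals = set().union(*f)
    result = {frozenset(f)} if cube_free(f) else set()
    for n in range(1, len(literals) + 1):
        for lits in combinations(sorted(literals), n):
            q, _ = divide(f, {frozenset(lits)})
            if q and cube_free(q):
                result.add(frozenset(q))
    return result

def parse(s):
    return {frozenset(term) for term in s.replace(' ', '').split('+')}

fx = parse("ace + bce + de + g")
for k in kernels(fx):
    print(sorted(''.join(sorted(c)) for c in k))
# Prints the kernels a + b, ac + bc + d, and fx itself (in some order).
```

Running the sketch on fx reproduces the kernel set K(fx) listed in the earlier example, with co-kernels ce, e, and 1, respectively.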
Logic Synthesis Using Don't Cares
Logic networks can also be efficiently synthesized using the concepts of don't cares. Logic networks obtained by using this optimization are much more testable than merely 100% testable for conventional input and output stuck-at faults. The approach is based on determining the complete don't-care set for each two-level function embedded in a network of such functions. Once this is done, a two-level minimizer can be used to minimize the subfunction. Before we describe the don't care-based logic optimization technique, let us consider some definitions. Let F(y1, y2, . . ., yn) be a Boolean expression. The cofactor of F with respect to yk is given as (F)yk; it is obtained by setting yk to '1' in the expression of F. The cofactor of F with
respect to yk′, (F)yk′, is defined similarly (by setting yk to '0'). Using Shannon's theorem, F can be represented as

F = yk (F)yk + yk′ (F)yk′        (2)
Given a multilevel circuit graph, there may exist a set of input vectors, specified by the user, which can never occur. This set of inputs is called the external don't care set. For all our discussions, we assume that this set is empty. Note that the Boolean function Fi representing node vi, when considered in isolation, is a completely specified function. However, when embedded in a Boolean network, Fi is incompletely specified, that is, it has a don't-care set. Such redundancies can be generated by using the structure of the Boolean network. In Ref. 8, two sets of don't cares are defined: the intermediate variable don't care set DIV (common to all nodes) and the transitive fan-out don't care set DTj (which is specific to node j). Logic optimization is then based on such don't cares obtained from the multilevel Boolean network. The overall intermediate variable don't care set is defined by

DIV = DIV1 ∪ DIV2 ∪ . . . ∪ DIVm
where DIVj = yjFj′ + yj′Fj, that is, the EXOR of yj and Fj. DIVj can be best understood by considering the example of Fig. 8. It is clear from the figure that a certain set of inputs, namely those with d ≠ (c·e), can never occur at the inputs to the OR gate. The set of such inputs is the intermediate variable don't care set.

Figure 8. Example showing the intermediate variable don't care set: d = ab, c = a, and e = b, and therefore d = ce. Since d ≠ ce can never happen, it is a don't care and should be utilized for logic optimization.

The set of input vectors for which node i is insensitive to the value of node j (where i is in the transitive fan-out of node j and is a primary output) is called the transitive fan-out don't care
set of node j, DTj:

DTj = ∏_{i ∈ FOj} Eij

where

Eij = (Fi)yj ≡ (Fi)yj′                               (3)
    = (Fi)yj (Fi)yj′ + ((Fi)yj)′ ((Fi)yj′)′          (4)
and FOj represents the transitive fan-out of j. Let us consider a simple two-input AND gate (a and b are inputs, whereas c is the output). The output c is insensitive to input b when the input (a = 0, b = 1) is applied to the circuit. Now each node of the Boolean network can be optimized using a two-level optimizer with the don't care sets described above. Note, however, that the don't care sets for each node can be large. Hence, researchers have tried to optimize don't care sets using heuristics, such as don't care filters. Such filters trade off computational time versus optimization quality.

Other Optimization Algorithms

There are other interesting approaches in addition to the one just discussed. One is to use circuit redundancy to simplify the network (12,13). For example, let a node vi in a logic network F be untestable for stuck-at-0; that is, if F′ denotes the new logic network obtained by forcing vi to logic 0, then F and F′ are equivalent. If vi is an input to a NAND gate, the NAND gate can be replaced by a logic value 1. If vi is an input to a NOR gate, vi can be eliminated from the inputs of the NOR gate. Transduction (14), global flow (15), and rule-based systems (9) are some of the most significant techniques.

The logic synthesis techniques described previously are geared mainly toward area and performance optimization. We described area optimization techniques in detail. However, performance can be optimized by reducing the fan-out (achieved using logic duplication), by optimizing the number of inverters required (standard CMOS efficiently implements only inverting logic), and by reducing the logic depth during synthesis. More recently, because of the increased number of devices per chip and the proliferation of battery-operated components, coupled with the fact that frequency has doubled every two years, power dissipation is becoming a major
concern. Average power dissipation in a standard CMOS circuit is given approximately by

Pavg = Vdd² Σ_{i ∈ gates} Ci ai + Isckt Vdd + Ileak Vdd
where Vdd is the supply voltage, Ci is the capacitance associated with the output of a logic gate, and ai represents the average number of signal switchings at node i. Isckt and Ileak represent the short-circuit and the leakage currents of the design, respectively. The first component results from the switching current and is by far the dominant one (more than 85% in current technology in the active mode of operation); hence, let us concentrate on the switching component of power. The term Σ_{i ∈ gates} Ci ai represents the switched capacitance per unit time. Estimating ai is difficult because it depends on the primary input signal distribution. It has been shown that the inputs to a logic circuit can be represented as stochastic processes that have certain properties: signal probability (the probability that a signal is logic ONE) and signal activity (the probability of signal switching). Probabilistic and statistical methods (18,19) have been developed to determine the signal activity at the internal nodes of a logic circuit. It should also be noted that in CMOS circuits, Ci is proportional to the fan-out of the logic gate. Hence, during logic synthesis, one can try to optimize power dissipation by properly selecting common subexpressions which reduce the overall switched capacitance based on the given input signal probability and activity (31). However, it should be observed that power-conscious technology mapping techniques are also important, since the logic network derived after the logic optimization phase is modified during technology mapping (23).

Technology Mapping

In the multilevel logic synthesis described previously, optimization is a technology-independent process. The logic network F generated by multilevel logic synthesis has to be mapped to real circuits that satisfy timing and area constraints. The real circuits usually consist of a set of predefined gates called the library L, and the gates are called library cells. Library cells are technology-dependent. To simplify the mapping from F to library cells of L, two different approaches have been taken: tree-based and Boolean-matching approaches.

Tree-Based Matching and Covering. In this approach, the logic network F and all library cells are decomposed into a set of base functions B. The set of base functions is usually small; for example, it can be the 2-input NAND gate and the inverter. The decomposed F corresponds to a graph, called the subject graph. Similarly, each decomposed library cell corresponds to a pattern graph. In the subject graph and pattern graphs, each node corresponds to a base function. However, there may be more than one way to decompose F and library cells. Therefore, the final result depends on how we decompose F. In addition to decomposition, the subject graph is partitioned into trees (16). For a node vi in a tree, if a pattern graph is isomorphic to a subtree rooted in vi, the pattern graph matches vi. For example, let us consider Fig. 9. Assume that the set of base functions B consists of a 2-input NAND gate and an inverter and that the library cells are 2-input NAND and NOR gates and an inverter. The logic network F = abcd is decomposed into three NAND gates and three inverters, which is a subject
Figure 9. Decomposition of logic network and library cells: Technology mapping first decomposes library cells into pattern graphs and the logic network into a subject tree.
tree. The library cells are also decomposed into base functions. In Fig. 10, the subtrees of the subject tree are matched by some pattern graphs and therefore are mapped into the corresponding library cells. Assume that the cost of a pattern graph does not depend on the parent of vi; for example, the cost function can be the area of the pattern graph. A bottom-up dynamic programming algorithm can then be performed from the leaves to the root to compute the optimum tree covering. However, if delay is the cost function, the cost of a pattern graph depends on the parent of the root of the subtree. An approximate solution to this was proposed in Refs. 17 and 20 by storing a piecewise function at every node.
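As an illustration of the bottom-up covering step, the following Python sketch (hypothetical code; the toy library, its cell areas, and the tuple encoding of the subject tree are invented, and real mappers are considerably more elaborate) computes a minimum-area cover of the F = abcd subject tree of Fig. 9:

```python
# Minimum-area tree covering by dynamic programming over a subject tree of
# 2-input NANDs and inverters. A subject node is ('IN', name),
# ('INV', child), or ('NAND', left, right).

import functools

# Pattern graphs for a toy library, expressed in the same base functions.
# 'X' marks a pattern input (it may bind to any subject subtree).
LIBRARY = {
    'INV':   (1, ('INV', 'X')),
    'NAND2': (2, ('NAND', 'X', 'X')),
    'AND2':  (3, ('INV', ('NAND', 'X', 'X'))),
    'OR2':   (3, ('NAND', ('INV', 'X'), ('INV', 'X'))),
}

def match(pattern, node, inputs):
    """Overlay a pattern on the subject subtree rooted at node, collecting
    the subject subtrees bound to pattern inputs; return True on success."""
    if pattern == 'X':
        inputs.append(node)
        return True
    if node[0] != pattern[0]:
        return False
    return all(match(p, n, inputs) for p, n in zip(pattern[1:], node[1:]))

@functools.lru_cache(maxsize=None)
def best_cover(node):
    """Minimum area needed to implement the subtree rooted at node."""
    if node[0] == 'IN':
        return 0
    best = float('inf')
    for area, pattern in LIBRARY.values():
        inputs = []
        if match(pattern, node, inputs):
            best = min(best, area + sum(best_cover(i) for i in inputs))
    return best

# Subject tree for F = abcd decomposed into three NANDs and three inverters.
a, b, c, d = (('IN', v) for v in 'abcd')
ab = ('INV', ('NAND', a, b))
cd = ('INV', ('NAND', c, d))
F  = ('INV', ('NAND', ab, cd))
print(best_cover(F))   # 8 with this toy library
```

With this toy library the sketch returns an area of 8, corresponding to two NAND2 cells, one OR2 cell that absorbs the two inverters, and one output inverter, which beats the trivial cover of three NAND2 cells plus three inverters (area 9).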
Figure 10. Tree-based technology mapping: Subtrees are matched by pattern graphs and are mapped into corresponding library cells.
Boolean Matching and Covering. In the Boolean-matching approach (21,22) the logic network F is still decomposed into a subject graph. However, the subject graph is partitioned into rooted directed acyclic graphs (DAGs) rather than trees. In addition, the library cell’s functions are checked against rooted subgraphs of rooted DAGs by equivalence checking. Equivalence between two Boolean functions can be checked by using ordered binary decision diagrams (OBDDs). Technology Mapping for Low Power and Deep Submicron Circuits. In addition to area and performance (delays) as cost functions or optimization constraints, power dissipation has become another design constraint. The key to reducing power dissipation is to hide nodes with higher switching activity (23). Another new design concern is the change of delay and area models of library cells. In today’s deep submicron circuits, interconnects contribute significantly to delays and die area. Therefore, when optimizing delays and area, we need to take layout into account (24).
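To make the idea of Boolean matching concrete, the short sketch below (hypothetical code; the cell, cluster, and function names are invented, and real matchers rely on OBDDs and canonical forms rather than explicit truth-table enumeration) checks whether a cluster function cut out of the subject graph is equivalent to a library cell under some permutation of its inputs:

```python
# A simplified sketch of Boolean matching by truth-table comparison under
# all input permutations (phase assignments are omitted for brevity).

from itertools import permutations, product

def truth_table(f, n):
    """Truth table of an n-input function f as a tuple of 0/1 values."""
    return tuple(int(f(*bits)) for bits in product((0, 1), repeat=n))

def boolean_match(f, g, n):
    """True if f equals g up to a permutation of its n inputs."""
    target = truth_table(g, n)
    for perm in permutations(range(n)):
        h = lambda *x, p=perm: f(*(x[i] for i in p))
        if truth_table(h, n) == target:
            return True
    return False

# Library cell: AOI21, f = ((a AND b) OR c)'
aoi21 = lambda a, b, c: 1 - ((a & b) | c)
# Cluster cut out of the subject graph, with its inputs in a different order.
cluster = lambda x, y, z: 1 - (x | (y & z))
print(boolean_match(cluster, aoi21, 3))   # True: they match under a permutation
```

The cluster is recognized as an instance of the AOI21 cell even though its inputs arrive in a different order, which structural tree matching would not detect.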
SEQUENTIAL CIRCUIT SYNTHESIS

A sequential circuit can be specified by a set of registers (latches) and a logic network, which is a clocked sequential network or synchronous logic network. It can also be described by a more abstract model, the finite-state machine (FSM). We first focus on the FSM model. A finite-state machine is defined by a sextuple (I, S, δ, s0, O, λ), where I, O, S, and s0 are the primary inputs, primary outputs, the set of states, and an initial or reset state (s0 ∈ S). The next-state function δ takes a current state and primary inputs as input and gives a next state; therefore, δ: S × I → S. The output function λ takes either a state for a Moore model (λ: S → O) or a state and primary inputs for a Mealy model (λ: S × I → O). We concentrate on the Mealy model because it is more general and the techniques that are introduced later can also be applied to Moore machines. Figure 11 shows a Mealy-clocked sequential network. There are two more concise but equivalent ways to describe an FSM than the sextuple: the state-transition table and the state
Figure 12. State transition table (left) and state transition graph (right) are two representations of a finite state machine.
transition graph (STG) shown in Fig. 12. Each node in the STG represents a state, and the edges represent the transition function δ. An FSM is incompletely specified if the next-state function and/or the outputs are not specified for some combinations of inputs and present states. Otherwise, it is completely specified. Usually, the fewer the states, the smaller the circuit area. Therefore, it is desirable to reduce the number of states (state minimization), if possible, before binary numbers are assigned to states (state assignment).

State Minimization

State minimization takes slightly different forms for completely and incompletely specified FSMs. For convenience, we denote by OS(IS, si) the output sequence obtained when the FSM is initialized in state si and an input sequence IS is applied. For a completely specified FSM, two states si and sj are equivalent if the output sequences OS(IS, si) and OS(IS, sj) are the same for any input sequence IS. It can be shown that si and sj are equivalent if and only if, for any input I (not input sequence IS), they have the same output and have equivalent next states (25). This suggests an iterative procedure to find the equivalence classes within ns steps and thus obtain the minimum number of states, where ns is the number of states before minimization. For an incompletely specified FSM, a pairwise compatibility relation is defined instead (25). However, the computational complexity of finding a minimum is too high, and oftentimes heuristic algorithms are applied to obtain near-optimal solutions.
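The iterative equivalence-class procedure for a completely specified FSM can be sketched in a few lines of Python (hypothetical code; the dictionaries delta and lam and the three-state example machine are invented for illustration):

```python
# Start from states grouped by their output rows and split groups until the
# next states of every group member land in equal groups for every input.

def minimize_states(states, inputs, delta, lam):
    """delta[(s, i)] -> next state, lam[(s, i)] -> output."""
    def key(s, part):
        return tuple((lam[(s, i)], part[delta[(s, i)]]) for i in inputs)

    part = {s: tuple(lam[(s, i)] for i in inputs) for s in states}
    while True:
        new_part = {s: key(s, part) for s in states}
        if len(set(new_part.values())) == len(set(part.values())):
            return new_part            # fixed point: no group was split
        part = new_part

# Example: states B and C behave identically and can be merged.
states, inputs = ['A', 'B', 'C'], [0, 1]
delta = {('A', 0): 'B', ('A', 1): 'C', ('B', 0): 'A', ('B', 1): 'B',
         ('C', 0): 'A', ('C', 1): 'C'}
lam   = {('A', 0): 0,  ('A', 1): 1,  ('B', 0): 1,  ('B', 1): 0,
         ('C', 0): 1,  ('C', 1): 0}
classes = minimize_states(states, inputs, delta, lam)
print(len(set(classes.values())))   # 2 equivalence classes: {A} and {B, C}
```

For the example machine, states B and C fall into the same class, so the three-state machine reduces to two states before encoding.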
Figure 11. State machine representation: Output and states are functions of input and the previous states.
State Assignment

State assignment or state encoding is the process of assigning binary numbers to the states of an FSM. Because the number of bits used to encode states is related to circuit complexity, a minimum number of state bits is preferred. State assignment for two-level logic circuits can be thought of as symbolic minimization (26,27), where present states and next states are input symbols and output symbols in the state-transition table. However, because si may appear both as a present state and as a next state, a constraint must be added to make sure that si gets the same encoding when serving as an input and as an output symbol. State assignment for multilevel logic circuits uses a multilevel logic network as the combinational logic circuit shown in Fig. 11. Recall from the third section that many transformations optimize the network. A heuristic method was proposed (28–30)
Figure 13. Fanout-oriented encoding tries to maximize the size of common cubes.
to consider common cube extraction; that is, state assignment is done so that larger and/or more common cubes can be extracted. For example, assume that S = {s1, s2, s3, s4, s5, s6} and 3-bit (minimum number) encoding is used. s1 and s2 both have a transition to s3 when the inputs are x1 and x2, respectively, as shown in Fig. 13. We encoded s1 (001) and s2 (000) so that their Hamming distance is short. The Hamming distance between two states is the number of 1's in s1 XOR (bitwise) s2. Let the encoded transition functions be f1, f2, f3 and the state bits be a, b, c. Then f1 has two cubes, x1a′b′c and x2a′b′c′. Therefore, f1 has a common cube a′b′ because

x1a′b′c + x2a′b′c′ = a′b′(x1c + x2c′)        (5)
Similarly, f2 and the output have the same common cube a′b′. Keeping the Hamming distance short between states si and sj that have a transition to the same state is called fan-out oriented encoding. It tries to maximize the size of common cubes. Figure 14 shows an example of fan-in oriented encoding. The idea is to keep the Hamming distance short between states si and sj that have an incoming transition from the same state. Let us consider Fig. 14. We encoded s1 (111), s2 (110),
Figure 14. Fanin-oriented encoding tries to maximize the number of common cubes.
and s3 (001) so that the distance between s1 and s2 is short. It tries to maximize the number of common cubes. The common cube a⬘b⬘c appears five times. FSM and combinational logic synthesis have been conventionally targeted to reducing area and critical path delay (10,26,28). However, power dissipation during the logic synthesis process is being considered only recently. The synthesis process consists of two parts: state assignment, which determines the combinational logic function, and multilevel optimization of the combinational logic, which tries to minimize area while at the same time trying to reduce the sum over all circuit nodes of the product of the circuit activity at a node and the capacitance at the node. The state assignment scheme in (31) considers the likelihood of state transitions—the probability of a state transition (say, from state S1 to state S2) when the primary input signal probabilities (probability that an input is equal to logic ONE) are given. The state assignment minimizes the total number of transitions occurring at the V inputs (or the present state inputs) of the state machine shown in Fig. 11. It can be observed that scaled-down supply voltage technologies can still be applied after logic synthesis to reduce power dissipation further. The multilevel logic optimization process is iterative. During each iteration, the best subexpression from among all promising common subexpressions is selected. The objective function is based on both area and power savings. The selected subexpression is factored out of all affected expressions.
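As a small illustration of the cost that such power-conscious state assignment targets, the following sketch (hypothetical code with invented transition probabilities and encodings) computes the expected number of state-bit transitions per clock cycle for two candidate encodings; the encoding that gives frequently taken transitions a small Hamming distance has the lower expected switching:

```python
# Expected number of state-bit transitions per clock cycle for a candidate
# state assignment, given the probability of each state transition.

def hamming(c1, c2):
    return sum(b1 != b2 for b1, b2 in zip(c1, c2))

def expected_bit_transitions(encoding, transition_prob):
    """encoding: state -> bit string; transition_prob: (s, t) -> probability,
    summing to 1 over all transitions."""
    return sum(p * hamming(encoding[s], encoding[t])
               for (s, t), p in transition_prob.items())

# Three-state machine in which the transition s1 -> s2 is the most frequent.
probs = {('s1', 's2'): 0.6, ('s2', 's1'): 0.3, ('s1', 's3'): 0.1}
enc_a = {'s1': '00', 's2': '01', 's3': '11'}   # frequent pair differs in 1 bit
enc_b = {'s1': '00', 's2': '11', 's3': '01'}   # frequent pair differs in 2 bits
print(expected_bit_transitions(enc_a, probs))  # 1.1
print(expected_bit_transitions(enc_b, probs))  # 1.9
```

The first encoding is preferred because it switches, on average, fewer present-state inputs of the combinational logic in Fig. 11, which is the quantity a power-aware assignment tries to reduce.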
LOGIC SYNTHESIS FOR FPGAs Field-programmable gate arrays (FPGAs) combine the flexibility of mask programmable gate arrays with the convenience of field programmability. This can be achieved by channelled gate array architecture consisting of rows of configurable logic modules interspersed with programmable routing tracks (32,33). In such an architecture, a logic module can be configured to implement different functions by connecting one or more of its inputs to logical 0 (GND) or logical 1 (VCC) or by shorting them together. This connectivity is achieved by programming the appropriate programmable connections (PCs). Figure 15 shows the row-based FPGA architecture (32,33). Each logic module consists of an interconnection of a set of logic gates and can implement different logic functions. Two possible logic block architectures are multiplexor-based or lookup-table-based. The multiplexor-based architectures use logic blocks that combine a number of multiplexors and some AND or OR gates. Lookup-table-based logic blocks can implement any logic functions with no more than a certain number of inputs (35). A possible logic synthesis and technology mapping technique would exhaustively identify all possible logic functions that a logic module can implement by tying inputs of the logic module to ZERO or ONE and others to signals. The number of unique functions for logic modules can be large (⬎700 for Actel Act2, which is multiplexor-based). The number is usually much larger in lookup-table-based blocks. A library with these many combinational functions is difficult to manage. The function list can be reduced by qualifying the functions against an existing gate array or FPGA library. The number of functions reduced from 766 to 115 for Act2 by us-
Figure 15. FPGA architecture can configure logic modules to implement different functions.
ing this scheme (34). The reduced number of functions, used after logic synthesis for technology mapping, produced excellent results on benchmark circuits. Hence, such logic synthesis and technology mapping requires minimal changes in the logic synthesis and technology mapping algorithms described earlier for gate arrays and ASICs.

Let us consider the lookup-table-based FPGA architecture. A direct mapping technique can also be used for mapping logic functions into FPGAs. We assume that logic synthesis or an optimization step has already been performed. The logic network is decomposed into a forest of trees. The network does not have to be represented as a set of 2-input NAND gates and inverters because there is no explicit library. Each tree is optimally mapped into lookup tables by using dynamic programming (36). Note that the structure of the logic function is not important for mapping; what is important is the number of inputs, because a lookup table can implement any logic function up to a certain number of inputs. Hence, a modified technology mapping can efficiently map a logic network directly into a lookup-table-based architecture.

SUMMARY

In recent years automatic logic synthesis and technology mapping have been used very successfully to synthesize random logic and control circuitry for optimizing area, timing, and power dissipation. In this article we have surveyed combinational and sequential logic synthesis and technology mapping techniques. We also presented a modified technology mapping technique for FPGAs. Unfortunately the details of the algorithms were omitted, but readers should find the references useful for further studies.

BIBLIOGRAPHY

1. F. Hill and G. Peterson, Introduction to Switching Theory and Logical Design, 3rd ed., New York: Wiley, 1981.

2. W. Quine, The problem of simplifying truth functions, Amer. Math. Mon., 59: 521–531, 1952.

3. R. Rudell and A. Sangiovanni-Vincentelli, Multiple-valued minimization for PLA optimization, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., CAD-6: 727–750, 1987.
4. O. Coudert and J. Madre, Implicit and incremental computation of primes and essential primes of Boolean functions, Proc. Design Autom. Conf., 1992, pp. 36–39. 5. O. Coudert, J. Madre, and H. Fraisse, A new viewpoint on twolevel logic minimization, Proc. Design Autom. Conf., 1993, pp. 625–630. 6. M. Dagenais, V. Agarwal, and N. Rumin, McBOOLE: A new procedure for exact logic minimization, IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst., CAD-5: 229–232, 1986. 7. P. McGeer et al., ESPRESSO-SIGNATURES: A new exact minimizer for logic functions, Proc. Design Autom. Conf., 1993, pp. 618–621. 8. K. Bartlett et al., Multilevel logic minimization using implicit don’t cares, IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst. IC’s, CAD-7 (6): 723–740, 1988. 9. J. Darringer et al., LSS: A system for production logic synthesis, IBM J. Res. Dev., 28 (5): 537–545, 1984. 10. R. Brayton et al., Multiple-level logic optimization system, IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst., CAD-6: 1062– 1081, 1987. 11. K. A. Bartlett et al., Multilevel logic minimization using implicit don’t cares, IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst., CAD-7: 723–740, 1988. 12. D. Bryan, F. Brglez, and R. Lisanke, Redundancy identification and removal, Int. Workshop Logic Synthesis, 1991. 13. S. C. Chang et al., Layout driven logic synthesis for FPGAs, Proc. Design Autom. Conf., 1994, pp. 308–313. 14. S. Muroga et al., The transduction method-design of logic networks based on permissible functions, IEEE Trans. Comput., C38: 1404–1424, 1989. 15. L. Berman and L. Trevillyan, Global flow optimization in automatic logic design, IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst., CAD-10: 557–564, 1991. 16. K. Keutzer, DAGON: Technology binding and local optimization by DAG matching, Proc. Design Autom. Conf., 1987, pp. 341–347. 17. R. Rudell, Logic Synthesis for VLSI Design, Ph.D. Thesis, Univ. California, Berkeley, 1989. 18. F. Najm, A survey of power estimation techniques in VLSI circuits, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., VLSI-2 (4): 446–455, 1994. 19. T.-L. Chou and K. Roy, Accurate Estimation of Power Dissipation in CMOS Sequential Circuits, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., VLSI-4 (3): 369–380, 1996. 20. H. Touati, Performance oriented technology mapping, Ph.D. Thesis, Univ. California, Berkeley, 1990. 21. F. Mailhot and G. De Micheli, Technology mapping with Boolean matching, IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst., CAD-12: 559–620, 1993. 22. C. R. Morrison, R. M. Jacoby, and G. D. Hachtel, Techmap: Technology mapping with delay and area optimization, in G. Saucier and P. M. McLellan (eds.), Logic and Architecture Synthesis for Silicon Compilers, Amsterdam, The Netherlands: North-Holland, 1989, pp. 53–64. 23. M. Pedram, Power minimization in IC design: Principles and applications, ACM Trans. Des. Autom. Electron. Syst., 1 (1): 3–56, 1996. 24. M. Pedram, N. Bhat, and E. S. Kuh, Combining technology mapping and layout, The VLSI Design: An Int. J. Custom-Chip Des., Simulation Test., 5 (2): 111–124, 1997. 25. F. Hill and G. Peterson, Switching Theory and Logical Design, New York: Wiley, 1981.
26. G. De Micheli, R. Brayton, and A. Sangiovanni-Vincentelli, Optimal state assignment for finite state machines, IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst., CAD-4: 269–284, 1985.

27. G. De Micheli, Symbolic design of combinational and sequential logic circuits implemented by two-level logic macros, IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst., CAD-5: 597–616, 1986.

28. S. Devadas et al., MUSTANG: State assignment of finite-state machines targeting multi-level logic implementations, IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst., CAD-7: 1290–1300, 1988.

29. X. Du et al., MUSE: A MUltilevel Symbolic Encoding algorithm for state assignment, IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst., CAD-10: 28–38, 1991.

30. B. Lin and A. R. Newton, Synthesis of multiple level logic from symbolic high-level description language, Proc. IFIP Int. Conf. VLSI, 1989, pp. 187–196.

31. K. Roy and S. Prasad, Circuit Activity Based Logic Synthesis for Low Power Reliable Operations, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., VLSI-1 (4): 503–513, 1993.

32. A. E. Gammal et al., An architecture for electrically configurable gate array, IEEE J. Solid State Circuits, 24 (2): 394–398, 1989.

33. J. Birkner et al., A very high speed field programmable gate array using metal to metal antifuse programming elements, IEEE Custom Integr. Circuits Conf., May 1991, pp. 1.7.1–1.7.6.

34. C.-H. Shaw et al., An FPGA architecture evaluation framework, ACM Workshop FPGAs, 1992, pp. 15–20.

35. H.-C. Hsieh et al., Third generation architecture boosts speed and density of field programmable gate arrays, IEEE Custom Integr. Circuits Conf., 1990, pp. 31.2.1–31.2.7.

36. R. Francis, J. Rose, and K. Chung, A technology mapping program for lookup table based field programmable gate arrays, 27th Design Autom. Conf., 1990, pp. 613–619.
TAN-LI CHOU Intel Corporation
KAUSHIK ROY Purdue University
LOGIC, TEMPORAL. See TEMPORAL LOGIC.
Power Estimation and Optimization. Standard Article by Farid N. Najm, University of Illinois at Urbana-Champaign, Urbana, IL. In J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering, Copyright © 1999 by John Wiley & Sons, Inc. DOI: 10.1002/047134608X.W1810. Online posting date: December 27, 1999. The sections in this article are: Sources of Power Dissipation; The Power Design Space; VLSI Design Abstraction Levels; Power Estimation; Power Optimization.
POWER ESTIMATION AND OPTIMIZATION
In recent years, there has been great demand for portable electronic systems, such as cellular phones, two-way pagers, personal digital assistants (PDAs), and general portable audio and video communications equipment. If current market trends persist, as is generally believed, then these systems will probably continue to be in high demand in the future. A common feature of all portable systems is that they are usually powered by batteries, and that the power consumption of the system determines how long one can use the equipment before having to replace or recharge the batteries, the so-
called battery life. Naturally, users of portable electronics prefer to use equipment with the longest possible battery life. In order to increase battery life, one can either use a longerlasting battery or design the system so that it consumes less power. Unfortunately, progress in battery design has been slow, and longer-lasting batteries can be heavier, which is undesirable in portable equipment. This has led equipment manufacturers to pursue aggressively new and innovative designs that consume as little power as possible. Modern portable equipment contains a variety of powerconsuming components, such as integrated circuits (IC) or chips, disk drives, and display devices (typically liquid-crystal displays for computers and PDAs). In the pursuit of lowpower design, it is natural to start by reducing the power of the most power-consuming components of a system. Thus, much has been done lately to reduce the power consumption of displays and disk drives, as these parts can consume large amounts of power if not well designed. This article will focus on integrated circuits, which are pervasive throughout the electronics industry, and not just in portable components. Indeed, low-power ICs are desirable not only for prolonging battery life, but also for better IC reliability, in both portable and line equipment. This is because if the heat generated in a high-power IC is not properly dissipated (through the use of expensive IC packages), the chip temperature will rise, degrading the circuit performance and aggravating various failure mechanisms that can cause the chip to fail much sooner than it otherwise would. Due to the complexity of modern integrated circuits (millions of transistors per chip), ICs are designed by using sophisticated computer-aided design (CAD) software systems. These consist of myriad software tools (CAD tools) that can be used to enter a description of the design in terms of its parts or behavior, simulate the design using computer simulation models in order to verify its correctness, and then transform the design into the low-level physical specifications (layout) required to manufacture the chip. The design process of state of the art ICs is a very complex procedure involving design groups of hundreds of people, for periods of a year or a few years. This involves the use of many CAD tools, and the overall project requires careful planning and management. The overall flow of activities is referred to as the design methodology. Thus, in recent years, the industry has been slowly shifting toward the use of a low-power design methodology. While these efforts are still continuing, it is clear that part of such a methodology must be CAD capabilities for estimating the power requirements of a proposed design, and for optimizing (reducing) the power consumption of a proposed design. Only recently have such power estimation and optimization tools been introduced into the market place, and it is certain that better tools will be released in the future as a result of current research and development efforts in both industry and academia. Electronic circuits are usually classified as either digital or analog. Digital circuits are those that implement logic functions, and are pervasive throughout modern electronic equipment. Analog circuits process signals that do not have discrete (logic) states, but are more continuous in nature. Amplifiers are a common example of analog circuits. This article will focus on digital circuits, for the following reasons. 
Analog circuits are typically small (in terms of their number of components), so that designers can easily predict their
power from their knowledge of the design. Furthermore, designers have a few good design practices that lead to lowpower analog circuits, beyond which there is little scope for computer-aided optimization. Finally, since most systems are becoming predominantly digital, it is the digital parts that are consuming the most power, and it makes sense to focus on digital chips. The most power-consuming logic ICs are the very large scale integration (VLSI) components. ICs of this type can contain millions of transistors and consume several tens of watts of power. Next generation VLSI circuits are forecasted to consume around 100 W, which requires very costly packaging and system design solutions. VLSI circuits are built mostly in complementary metal-oxide-semiconductor (CMOS) technology transistors. Even though CMOS was originally introduced for its desirable low-power properties, the sheer number of transistors per chip and the high clock frequencies used have led to the present situation, where modern VLSI CMOS micro-processor chips are dissipating up to 60 W or more. It is instructive to consider why it is that modern chips consume so much power, and how the technology has evolved up to this point. In broad terms, power in VLSI CMOS is directly proportional to the supply voltage, transistor count, and clock frequency. Although the supply voltage has been steadily and slowly reduced over the last few years in order to control the chip power, the exponential increase in transistor count and clock frequency since the early 1970s has been a phenomenal trend; this trend is the main cause for the high levels of power dissipation in modern ICs. The Semiconductor Industry Association (SIA) road-map for 1998–2001 forecasts transistor widths around 0.18 애m, clock frequencies of 300 to 600 MHz, and 28 to 64 million transistors per chip. Some of these predictions have already been met or exceeded. Managing the power of a chip design adds to the list of problems that IC designers have to contend with. In the 1970s and 1980s, designers had to worry about two dimensions, namely circuit delay and circuit area. In the 1990s and beyond, power has become the third dimension, significantly complicating the overall design methodology. This article covers the CAD techniques and tools that have been developed or proposed to help designers overcome this problem, specifically for digital CMOS VLSI chips. In some cases, where CAD techniques are not available, certain design decisions and choices that can be made to reduce the power are briefly mentioned. This has been an active area of research in the recent past. Some good survey articles have been written on the overall methodology (1), the estimation problem (2,3), and the optimization problem (4,5).
SOURCES OF POWER DISSIPATION

A CMOS VLSI chip dissipates power because it draws current from its power supply. Indeed, the product of the average supply current and the (approximately constant) power supply voltage gives the average power dissipation of the chip. The supply current consists of four components, listed here in decreasing order of importance:

1. Capacitive current
2. Short-circuit current
3. Standby current
4. Leakage current

Figure 1. Current flow in a CMOS inverter during a logic transition.

These will be explained with the help of the CMOS inverter circuit in Fig. 1. During a low-to-high output transition, power supply current is required in order to charge up the output capacitance CL. This is indicated by the current path labeled I1 in the figure. In the next transition (high-to-low), the output capacitance is discharged through the local discharge current path I2. Since the charge is not returned to the power supply, there is a net transfer of charge from the power supply to the ground rail. The component of current required to charge or discharge circuit capacitances is called the capacitive current, and the power resulting from it is called the capacitive power. This component represents the biggest component of IC power. It depends only on the magnitude and number of the capacitors and on how often they are charged and discharged.

Still referring to the CMOS inverter in Fig. 1, the short-circuit current is the current that flows from the power supply directly to ground through the two p and n transistors during a logic transition. This current takes the form of a current pulse, and it is there because, for a short time interval during a logic transition, both transistors will be conducting. If the circuit is well designed, so that the gate input signal makes a fast logic transition, then the short-circuit current pulse will be narrow, and the short-circuit power will be relatively small. In this case, for typical circuits, this component of the power represents a small fraction (about 15%) of the total circuit power. Otherwise, if the gate input signal is slow, this component can be more important. In any case, the total short-circuit power of a logic circuit depends on the number of gates, how often they switch, and how fast their inputs switch.

The sum of the short-circuit and capacitive power contributions is called the dynamic power of the design, because this power is dissipated only during logic transitions. In contrast, the power due to the standby and leakage currents is not directly linked to logic transitions, and the sum of the standby and leakage power is called the static power. Standby current refers to current that is continuously drawn from the power supply, but excluding leakage current. While a fully complementary static CMOS gate, as in Fig. 1, consumes no standby current, the chip may contain other
types of circuitry (memory circuits, pseudo-nMOS gates, pass-transistor logic) which can draw standby current from the power supply. Finally, the leakage current is of two kinds, diode leakage and subthreshold leakage. The diode leakage current is the small, often negligible, current which flows in a reverse-biased pn-junction. Reverse-biased pn-junctions abound in CMOS circuits, since all drain and source diffusions constitute reverse-biased diodes. This current is small, typically in the 10⁻¹² A range for a single diode. It is, therefore, often neglected. The subthreshold leakage current is the current that flows in an MOS transistor when it is supposed to be off (open circuit, due to vGS = 0). Even though this current is also small, it is becoming more important nowadays because it increases significantly for the low-threshold transistors that are being proposed for future low-power IC technology.
THE POWER DESIGN SPACE

If a capacitor C is being charged from 0 to a voltage V and then discharged, and if this happens f times per second, then the average power being delivered by the charging circuitry is CV²f. Since capacitive power is the dominant component of the total power, this simple analysis makes it clear that the power dissipation of a VLSI chip depends on the node capacitances, the power supply voltage, and the node-switching frequencies. Thus, in order to develop low-power chip designs, the industry has attacked this problem on three fronts:

1. Reduce the Power Supply Voltage. This option is most attractive because of the quadratic dependence on V. However, this is not easy to implement. For one thing, with a new supply voltage, the whole transistor manufacturing process may have to be modified so that the transistors have lower threshold voltages and reasonable noise margins. This can be very expensive. Another complication is that it would complicate the overall system design if some chips are at 5 V while others are at 3 V. Thus, if a microprocessor design is to be executed with a 3.3 V supply, then enough support chips must be made available on the market so that the whole board or system is at 3.3 V. The industry has by now migrated from 5 V to 3.3 V, although 5 V technology is still in use, and is forecasted to slowly migrate to 2 V, or even 1 V, supplies. Currently, semiconductor companies are struggling with the technology design issues for transistors with such low supply voltages. In summary, this voltage scaling is expensive and slow, and it is imperative that other options be explored.

2. Reduce the Node Capacitance, by Scaling (Shrinking) the Technology. Smaller transistors lead to logic gates with less output capacitance, which leads to faster gates and overall higher clock speeds. This is desirable for improving chip and system performance, but the smaller capacitances also provide for lower power dissipation per logic transition. Sometimes, though, the increase in transistor count and clock frequency that results from technology scaling offsets the power gains from reduced capacitance, and the overall effect can be an increase in power.

3. Reduce the Node Activity-Capacitance Product by Design. The third option is to reduce the product of node switching activity and node capacitance. This leads to power reduction because the capacitive power dissipation of the whole circuit is proportional to the summation of product terms of the form Ci fi, where Ci is the capacitance at node i and fi is its average switching frequency. This can be achieved by either reducing fi, without reducing the overall clock speed, or reducing Ci, by reducing the fanout, or both. This can be done through proper design of the circuit. In contrast, the circuit/logic designer has no control over the supply voltage or over the basic capacitance per unit area of the metal, both of which are determined by the manufacturing process.
The design process for VLSI spans many levels of abstraction. When the design is specified as an interconnection of logic gates, we refer to that as being a logic-level view of the design. When the design is described as an interconnection of transistors, it is said to be at the transistor-level. When described in terms of its layout, the design is said to be at the physical or layout level. Designs are usually specified at high levels of abstraction, and then the design effort is aimed at translating that description to lower levels of abstraction, all the way to the layout level. Higher levels of abstraction include the architectural-level (also called register-transferlevel, or RTL). At this level, the structure of the design is specified, typically in terms of functional blocks and memory elements. Yet another higher level of abstraction is the behavioral-level, where the chip behavior, rather than structure, is described in terms of interconnections of behavioral/functional modules. Digital signal processing (DSP) chips are typically represented at the behavioral level, while general microprocessors are specified at the architectural level. DSP designs are transformed from the behavioral to architectural level using high-level behavior-preserving transformations. In reality, complex chips are specified as a mix of behavioral and architectural constructs, because some parts of the system may be amenable to an algorithmic (behavioral) descriptions, while others may not. Since most designs are implemented by transforming some high-level specifications to lower-level implementations, a gate or transistor level representation of the chip may not be available until the design process is almost complete. If one were to get to that point and only then discover (using a power estimation tool) that the power consumption is unacceptably high, then it would be too expensive to make design changes. The circuit may require significant rework, involving perhaps changes to the overall architecture of the chip, so that the whole design effort may have to be repeated. For this reason, it would be very beneficial to have a power estimation capability at a high level of abstraction. However, as we will see below, estimation from a high level of abstraction is potentially inaccurate, while low-level power estimation can be very accurate. Therefore, a power estimation capability is needed at every level of abstraction, in order to check the design at every step in the design flow. As for optimization, it is also generally believed that higher reductions in power would be possible at higher levels
of abstraction, simply because there would be more scope for changing and optimizing the design at a higher level. Once the logic design has been specified, for instance, there is very little that an optimization tool can do besides rewire certain connections or resize certain gates. At the architectural level, in contrast, an optimization algorithm may change the design to use a different style of implementation altogether, say by using a Booth rather than an array multiplier, possibly leading to large reductions in power. Thus, optimization capabilities are most beneficial at higher levels of abstraction, but are also needed at every abstraction level. In the remainder of this article, we will discuss power estimation and optimization techniques at the behavioral, architectural, and logic levels. In the literature, estimation from a transistor-level description and optimization at the layout level are also discussed, but these techniques will not be covered, due to lack of space and also because there is more interest in the industry in estimation and optimization at the higher levels of abstraction.
POWER ESTIMATION

By power estimation we will generally refer to the problem of estimating the average power dissipation of a digital circuit. This is different from estimating the worst case instantaneous power (6–8), also referred to as the voltage drop problem, or the worst case power per cycle (9). These problems will not be discussed in this article. Instead, we will focus on average power estimation, which is directly related to chip heating and temperature and to battery life. Given a transistor-level description, a simple and straightforward method of average power estimation is to simulate the circuit, say using a circuit simulator, to obtain the power supply voltage and current waveforms, from which the average power can be computed. Techniques of this kind were the first to be proposed. Since they are based on circuit simulation, these techniques can be quite expensive. In order to improve computational efficiency, several other simulation-based techniques were also proposed using various kinds of RTL, gate-, switch-, and circuit-level simulation. Given a set of input patterns or waveforms, the circuit is simulated, and a power value is reported based on the simulation results. Almost all of these techniques assume that the supply and ground voltages are fixed, and only the supply current waveform is estimated. Even though these simulation-based techniques can be efficient, their utility, in practice, is limited because the estimate of the power which they provide corresponds directly to the input patterns that were used to drive the simulation. This points to the central problem in power estimation, namely, that the power dissipation is input pattern-dependent. Indeed, in CMOS and in most other modern logic styles, the chip components (gates, cells) draw power supply current only during a logic transition (if we ignore the small leakage current). Thus, the power dissipation is highly dependent on the switching activity inside these circuits. Simply put, a more active circuit will consume more power. Since internal activity is determined by the input signals, the circuit power is input pattern-dependent. In practice, the pattern-dependence problem is a serious limitation. Often, the power dissipation of a circuit block may
need to be estimated when the rest of the chip has not yet been designed, or even completely specified. In such a case, very little may be known about the inputs to this block, and exact information about its inputs may be impossible to obtain. Furthermore, for a microprocessor or a DSP chip, the exact data inputs cannot be determined a priori, because they depend on how the chip is used in the design of a larger board or system. Recently, several techniques have been proposed to overcome this problem by using probabilities as a compact way to describe a large set of possible logic signals, and then studying the power resulting from the collective influence of all these signals. In order to use these techniques, the user only specifies typical behavior at the circuit inputs, in the form of transition probability, or average frequency. If typical input pattern sets are available, then the required input probability or frequency information can be easily obtained by a simple averaging procedure. Thus, the average switching frequency at a node is considered to be the mean or average of many possible switching behaviors at the node. We will classify power estimation techniques as being either static or dynamic. An approach is called static when it is based on propagating a probability or activity measure directly through the logic, in order to estimate the average switching frequency. To perform this, special models for circuit blocks must be developed and stored in the cell library. In contrast, other techniques, referred to as dynamic, do not require specialized circuit models. Instead, they use traditional simulation models and simulate the circuit, using existing simulation capabilities, for a limited number of randomly generated input vectors while monitoring the power. These vectors are generated from user-specified probability information about the circuit inputs. Essentially, these techniques are based on statistical mean estimation resulting from a Monte Carlo procedure. Using statistical estimation techniques, one can determine when to stop the simulation in order to obtain certain user-specified accuracy and confidence. Logic Level At this level of abstraction, we usually exclude special and highly structured circuits such as memory arrays or PLAs, because these are not amenable to a gate level representation and are perhaps better represented at higher levels of abstraction. Instead, we focus on circuits that can be represented as an interconnection of logic gates (of any kind) and memory elements (flip-flops or registers). In describing the estimation techniques at this level, it is also helpful to restrict the discussion to synchronous circuits, and to circuits with just a single clock. Circuits with multiphase clocks can be handled by extensions of the techniques to be presented. Finally, another common simplification is to assume that the memory elements are edge-triggered flip-flops, rather than transparent latches. In a circuit with transparent latches, the power analysis should be identical to the edge-triggered case if no cycle-borrowing is employed. The first issue to consider is the gate power model. A common simplification at this level of abstraction is to ignore the leakage power and to assume that a logic gate consumes power only when its output makes a logic transition. In reality, a gate does consume power due to incomplete output transitions and due to charging/discharging events at its internal
capacitances that do not lead to an output transition, but these are considered second-order effects and are generally ignored. If, for now, we also ignore the short-circuit current, it can be shown that the average energy consumed per logic transition is: E = QVdd =
(1/2) C Vdd^2
where Vdd is the supply voltage, Q is the average charge delivered per transition, and where C is the gate output capacitance, which includes the gate intrinsic (diffusion) capacitance and the extrinsic capacitance due to interconnect capacitance and gate oxide capacitance of the fanout gates. This simple model is very desirable because it greatly simplifies the power estimation problem and affords reasonable accuracy. It is possible to include the short-circuit power in this model by choosing an appropriate value of C, which we will call Ceff (C-effective), such that (1/2) Ceff Vdd^2 represents the average energy per transition including the short-circuit power. Since the short-circuit power depends on the input signal slope during transition and the output loading, this requires one to assume a ‘‘nominal’’ slope for the gate input signals, based on knowledge of the technology, and a nominal output loading. The value of Ceff can be determined a priori during a characterization step that takes these factors into account. Thus, in summary, the average energy per output transition for a logic gate is: E =
(1/2) Ceff Vdd^2
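Given this per-transition energy model, the total average power follows by weighting each gate's (1/2) Ceff Vdd^2 by its switching activity and summing over the circuit. The following Python sketch shows this bookkeeping; the gate names, Ceff values, and activities are made-up numbers used only to illustrate the calculation.

# Total average power from per-gate effective capacitance and switching activity.
# E = (1/2) * Ceff * Vdd^2 per transition; activity = transitions per second.
def average_power(gates, vdd):
    return sum(0.5 * g["ceff"] * vdd ** 2 * g["activity"] for g in gates)

netlist = [  # hypothetical values for illustration only
    {"name": "u1", "ceff": 20e-15, "activity": 40e6},
    {"name": "u2", "ceff": 35e-15, "activity": 12e6},
    {"name": "u3", "ceff": 15e-15, "activity": 95e6},
]
print(f"Estimated average power: {average_power(netlist, 3.3) * 1e6:.2f} uW")

As the text explains next, the difficult part is obtaining these switching activity values in the first place.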
Thus, all that is required in order to estimate the power is to compute the average number of transitions per second for every gate output node, a quantity which we will simply refer to as the switching activity at that node. Similar modeling can be performed for the flip-flops. In a synchronous, edge-triggered, single-clock, sequential circuit, it is clear that when a flip-flop output makes a logic transition, it does so simultaneously with the clock and makes at most one transition per clock cycle. The same is not true for all logic gates. Many gates may experience multiple transitions per cycle, due to the possibly unequal delays from the flip-flop outputs to the inputs of the logic gate. If an even number of transitions occur in a certain cycle, then the gate output was not intended to make a transition at all, and all the transitions that did occur were artificial and not needed. If the number of transitions is odd, then only one of them was required and all the rest were not needed. In any case, these additional unneeded transitions have been called glitches in the power literature, and the power due to them is called the glitch power. The glitch power for a circuit can be small (20% of total power) or can be large (70% of total power), depending on the circuit structure. The glitch power is hard to handle because it depends on the relative delays inside the circuit. It is generally the case that all feedback paths in a circuit go through flip-flops. Thus, if somehow the flip-flops are removed, we would be left with (perhaps disconnected) circuit blocks that are all combinational, that is, they contain no feedback paths. Given a general sequential circuit, it is convenient to think of it as partitioned into blocks of combinational logic separated by boundaries consisting of flip-flops. Most of
the proposed power estimation techniques assume this partitioning is provided, and take the following two-step approach to power estimation:

1. From an analysis of the overall (sequential) circuit, compute certain switching statistics for the flip-flop outputs.

2. From a knowledge of the flip-flop switching statistics, compute the switching activity for every logic gate in the combinational blocks.

This decoupling of the problem is purely a simplification. It leads to some loss of accuracy because the relationships between various flip-flop output values are lost when applying the second step. The error is felt to be generally acceptable, and this is an ongoing research topic. The two steps mentioned above will be discussed in detail in the following. It is simpler to start with a discussion of the second step, the combinational circuit analysis, and then present the sequential circuit step.

Combinational Circuit Power. Several power estimation methods have been proposed for isolated combinational circuits. All assume that some information is provided about the circuit inputs, mainly in the form of switching statistics. Two statistics are deemed important, the signal probability and the switching activity. The signal probability is the fraction of time that a node is in the high state. The switching activity is the average number of logic transitions per second. Two styles of techniques have been proposed: static and dynamic. In the static methods, one directly propagates the supplied input statistics into the circuit to compute the corresponding switching activity at all the nodes. We can give a flavor of these methods by considering one of the earliest proposed techniques, the signal probability propagation method of (10). In this method, the signal probabilities at the inputs to a logic gate are propagated to its output by making the simplifying assumption that the gate inputs are (statistically) independent. As a result, if z = AND(x, y) is an AND gate, then the probability of z is computed simply as Pz = Px Py. Once the signal probability is available for every gate output node, the node transition probabilities are computed as 2P(1 − P), which assumes that the node values before and after the clock edge are independent. Finally, the switching activity is obtained as the transition probability divided by the clock period. This algorithm is very efficient, but the first independence assumption (called a spatial independence assumption) may be unacceptable if the circuit has many reconvergent fanout paths, and the second (called a temporal independence assumption) also may not be acceptable. Furthermore, the method does not take delay into account, so that glitches are ignored. This and other improved methods are reviewed in more detail in (2). Some of the improved methods are probabilistic simulation (11), transition density propagation (12,13), symbolic propagation (14), correlation coefficients (15), and spatio-temporal correlation (16,17). The above three issues of temporal independence, spatial independence, and delay sensitivity are major failings of the static methods, because no one method completely deals with these issues in an efficient and practical way. Some methods which succeed in doing so are computationally too expensive.
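To make the static style concrete, here is a small Python sketch of the basic signal-probability propagation just described, with the same spatial and temporal independence assumptions; the two-gate netlist, input probabilities, and clock period are invented values for illustration only, not part of the method of (10) itself.

# Propagate signal probabilities through AND/OR/NOT gates assuming independent
# inputs, then convert to switching activity as 2P(1 - P) per clock period.
def propagate(netlist, input_probs):
    # netlist: list of (output_name, gate_type, input_names) in topological order
    p = dict(input_probs)
    for out, gate, ins in netlist:
        if gate == "AND":
            prob = 1.0
            for name in ins:
                prob *= p[name]
        elif gate == "OR":
            prob = 1.0
            for name in ins:
                prob *= 1.0 - p[name]
            prob = 1.0 - prob
        elif gate == "NOT":
            prob = 1.0 - p[ins[0]]
        else:
            raise ValueError(f"unsupported gate type: {gate}")
        p[out] = prob
    return p

def switching_activity(probs, t_clock):
    # Transition probability 2P(1 - P), divided by the clock period.
    return {node: 2.0 * pr * (1.0 - pr) / t_clock for node, pr in probs.items()}

netlist = [("n1", "AND", ["a", "b"]), ("y", "OR", ["n1", "c"])]
probs = propagate(netlist, {"a": 0.5, "b": 0.5, "c": 0.2})
print(switching_activity(probs, t_clock=10e-9))  # transitions per second at each node

Note that reconvergent fanout (a signal reaching a gate through two different paths) would violate the independence assumption this sketch relies on, which is exactly the weakness noted above.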
Particularly annoying is that the error cannot be predicted or bounded up-front. Typically, if one is interested only in the total circuit power, then static methods will on average have an error of around 10% for large circuits. But if one is interested in the power consumption of every gate or of small circuits, then the error is potentially much larger and unacceptable. Nevertheless, these methods are the fastest available and are the only way that we may be able to estimate the power at the gate level for extremely large circuits. Because of this, at least one of these methods (12,13) has been incorporated into commercial products in spite of the above weaknesses. This method directly propagates the switching activity from the circuit inputs to provide the switching activity at all nodes, and is based on a spatial independence assumption. The switching activities D(xi) at the inputs to a logic gate (or, more generally, a Boolean logic block) are used to compute the switching activity D(y) at its output, according to:

D(y) = Σ_{i=1}^{n} P(∂y/∂xi) D(xi)

where ∂y/∂x is the Boolean difference of y with respect to x, defined as ∂y/∂x = y|x=1 ⊕ y|x=0, where ⊕ denotes the exclusive-or operation. Another class of methods, called dynamic, is based on the use of traditional simulators and simulation models. The key idea is that, even though the power depends on the specific vectors, if one were to simulate the circuit for randomly chosen typical vectors, then the average cumulative power measured from the simulation will converge to the true average power. It turns out that the number of vectors required to achieve convergence can be quite small, under 100 vectors for circuits with a few hundred gates or more. Often, circuits with thousands of gates will converge with under 50 vectors. Also, it turns out that using statistical mean estimation techniques, it is possible to tell, during the simulation, when to stop the simulation in order to achieve some user-specified accuracy and confidence in the result, leading to a stopping criterion. Being able to specify the accuracy up-front is a major advantage of these methods. Glitches are taken into account and only the primary inputs are assumed spatially independent, although this is not a limitation of these methods (they can be extended to not require this). Such methods were proposed in (18,19) for finding the total average power of a circuit. They are fast but do not provide estimates of the individual node power values. An extension, given in (20), does so, but becomes somewhat slower. Further improvements have been proposed in (21) and (22).

Sequential Circuit Power. In this case, the objective is to compute the switching statistics at the flip-flop outputs. These would then be used for the combinational circuit analysis. Given the above discussion of combinational circuit techniques, it is clear that we need two statistics, the signal probability and the switching activity. To simplify the analysis, we will only discuss the estimation of the signal probabilities. Estimation of the switching activity is similar. Since flip-flop outputs constitute the circuit state, they are usually referred to as the state bits. Once more, two styles of techniques have emerged, static and dynamic. In the static methods (23,24), initial values of probability at the state bits are assumed, and they are then
propagated through the circuit (around the feedback loops) in order to provide updated values of the assumed probabilities. This process is repeated until convergence of the probability values is obtained. One problem with this approach is that the state bits are assumed independent during the propagation, but the error due to this can be reduced by unrolling the feedback loop several times. In the dynamic method (25), several copies of the whole circuit are simulated in parallel using randomly chosen input vectors until convergence of the monitored node statistics is obtained. This method can solve very large sequential circuits very efficiently, and has the advantage that the desired accuracy and confidence can be specified up-front. It has been used to estimate the power for circuits with up to 1,500 flip-flops and 20,000 gates.

Architectural Level

At this level of abstraction, also called the register-transfer level (RTL), the circuit is described as an interconnection of clocked memory elements (flip-flops or registers) and combinational logic blocks whose gate-level structure is not specified. Typically, the combinational logic blocks may be specified only as Boolean functions, so that the design description consists of flip-flops and Boolean black boxes. Since the flip-flops are specified, it is possible to simulate the circuit, as described above, in order to estimate the switching statistics at the flip-flop outputs. After this essential first step, it remains to compute the power consumed by the Boolean blocks, given their input/output switching statistics. Thus, the problem reduces to developing a power model for these blocks that can be used to compute their power from their I/O statistics. In some cases, the Boolean blocks may correspond to circuit blocks that were used in previous designs. In this case, the detailed implementation of the combinational black boxes is completely known. This is, for example, the case in DSP designs, where the circuit blocks come from a library of well-characterized adders, multipliers, and so forth, and where the design task may be to determine which type of adder or multiplier to use in a given chip design. In this case, it is still advantageous to carry out the analysis at a high level of abstraction, because the analysis can be done much faster. We refer to techniques of this kind as being bottom-up approaches: the low-level details are known, but we choose to ignore them and use instead a simplified high-level model of the block behavior. This is essentially a macro-modeling for power approach. Bottom-up techniques have been proposed in (26), where black-box models (macro-models) are built for circuit blocks by a process of characterization that models the block power as a function of the input/output signal statistics (probabilities) of the block. Other details are also included, such as the bus width, average capacitance, etc. In other cases, the low-level details of the circuit blocks may not yet be known at all, because such a circuit block may never have been designed before. This presents the harder problem of extracting power out of pure (Boolean) functionality. We refer to such techniques as being top-down. Recently, some top-down techniques have been proposed (27,28) that make use of the entropy of a logic signal as a measure of the amount of information that can be carried by that signal. The rationale for this is that the power requirements of a circuit
is probably related to the amount of computational work that the circuit performs, which has traditionally been modeled with the entropy measure. All these techniques are fairly recent, and it is not clear yet how useful they will be in practice, or how their performances compare/contrast in a practical setting.

Behavioral Level

In this case, the circuit description is at an even higher level of abstraction, where it is not clear exactly where the flip-flops are, and the design is an interconnection of blocks for which only their behavior is known. These blocks will eventually be implemented using flip-flops and/or combinational logic, but the specific architecture to be chosen for the implementation is not known. For instance, all we may know about a given block is that it performs n additions, but we may not know whether it performs them sequentially (on a single adder) or concurrently (on n adders). Or, a block may be a small microprocessor that will be embedded in a larger chip design, and which may be only specified in terms of its instruction set and its I/O ports. Power estimation at this level of abstraction is, understandably, very difficult, but also very appealing because of the potential gains of knowing the power so early in the design process. Estimation techniques in this area are still in their infancy, and much needs to be done to develop practical and accurate solutions. Notable is the method of (29) in which the design is described with a behavioral flow-chart that shows several computational resources (behavioral blocks, or modules) and the way that they interact. From a simulation of the behavior, the frequency with which a resource is accessed is measured (in accesses per second); call this f. The total capacitance inside the module, call this C, is either known from its low-level description (bottom-up approach), or is a rough estimate from prior design knowledge or using techniques borrowed from high-level synthesis (top-down approach). With this, the average power for a behavioral module is given by:

Pavg = α C Vdd^2 f
where α is the average node switching activity per access inside the module. The estimation of α is difficult. One way is to just use a fixed number, obtained from experimental studies of prior designs of the same class as this. For instance, minicomputers have been found experimentally to have α in the range 0.01 to 0.005, while microcomputers have α in the range 0.05 to 0.01. Another way of estimating α is to simulate a low-level description of the design, if available (bottom-up approach), under some arbitrary input switching statistics, and to use the resulting α value as a fixed number, and not bother to account for its dependence on the input switching statistics once it is embedded in the system being considered (to do otherwise would be computationally too expensive).

Interconnect Issues

Missing in all of the above discussion has been the issue of interconnect length, which, of course, determines the physical capacitance at a node. In the past, when gate delays (capacitance) were significantly larger than wire delays (capacitance), the interconnect could be ignored, or somehow its
length and capacitance would have been estimated by an intelligent guess. Today and in the future, due to the fine dimensions of sub-micron CMOS technology, this is no longer the case. Interconnect delays are now dominant, significantly larger than gate delays (52). So any accurate power estimation must factor in the interconnect capacitance. If the design description has been extracted from a layout, then we may have good estimates of the wire capacitance. But this is rarely the case, because it would be too late to wait until the layout has been done before doing a power estimation. More frequently, power estimation (and optimization) would be attempted long before layout. This problem impacts estimation and optimization at all levels of abstraction. The research literature (30,31) includes several references to approaches that attempt to estimate the average wire length from knowledge of the circuit structure and function. Application and validation of these techniques in the realm of power analysis is an ongoing and future research problem. POWER OPTIMIZATION Reducing the average power dissipation of a VLSI chip is a fairly recent concern of the industry, which became prevalent in the early 1990s. It is now a major concern, and all design groups are faced with power management problems. A variety of techniques have been developed, both in industry and academia, to address this problem. As pointed out previously, three courses of action are open: (1) reduce the power supply voltage, (2) reduce the node capacitance through technology scaling, and (3) reduce the node activity-capacitance product by design. The first two of these have to do with the manufacturing process, while the third has to do with design and CAD, and is the one option with which we are concerned in this article. If the circuit contains n nodes (logic gate outputs) and if each node has capacitance Ci (this can be Ceff in order to include the short-circuit current as explained previously) and has a switching activity Di (average number of transitions per second), then we can write: Pavg
∝ Σ_{i=1}^{n} Ci Di
Thus the power is proportional to the summation of a large number of cross-products of activity and capacitance. If the values of the product terms are reduced, then the power will be reduced as well. To be sure, one further way of reducing the value of the summation is to reduce the value of n, the number of gates, which is related to circuit area. Thus, to some extent, area optimization can lead to power optimization. In some cases, though, such as when one considers area reduction through gate sizing, the minimum-power circuit is not necessarily the minimum-area one (51), mainly due to the effect of gate sizing on the overall short-circuit current of the circuit. Nevertheless, it is fortunate that, at least as far as the capacitive power is concerned, the two objectives of area and power do not conflict. Unfortunately, this is not the case with the delay objective. Reducing circuit delay often requires the use of larger gates, which dissipate larger capacitive power.
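Because an optimization algorithm mainly needs to know whether one candidate circuit is better than another, the summation above can serve as a simple relative power cost function of the kind discussed below. The Python sketch that follows compares two hypothetical gate lists in exactly this way; all (Ci, Di) pairs are invented numbers, and a real tool would obtain them from the estimation techniques described earlier.

# Relative power cost: sum of capacitance * activity over all gates.
# All (Ci, Di) pairs below are hypothetical values for illustration.
def power_cost(gates):
    return sum(c * d for c, d in gates)  # proportional to capacitive power

original  = [(20e-15, 40e6), (35e-15, 12e6), (15e-15, 95e6)]
candidate = [(20e-15, 40e6), (28e-15, 12e6), (15e-15, 60e6)]  # after a candidate resizing/rewiring move

if power_cost(candidate) < power_cost(original):
    print("accept move: lower estimated capacitive power")
else:
    print("reject move")

In practice, the capacitance and activity values would be refreshed by estimation runs as the optimization proceeds.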
The complex interactions between the area, delay, and power objectives lead to a situation where the objective function to be reduced during optimization can be hard to specify, and even harder to evaluate. Traditionally, optimization has been done with a compound objective function that takes into account both delay and area. Nowadays, it also has to take into account the power. The power component of this objective function is supposed to reflect the change in circuit power due to some candidate transformation that is being considered during optimization. We will refer to this as the power cost function. In principle, the power cost function has only to perform a power estimation. However, since the cost function has to be evaluated several times during an optimization process, dynamic (simulation-based) estimation methods cannot be used, because they can take too long, especially since we often need to know the power consumption (or switching activity) of every gate in order to decide what optimizations to apply. On the other hand, static estimation methods are notoriously inaccurate for estimating individual gate power or switching activity. This is a serious problem that confronts power optimization. The few existing solutions include application of a mix of static and dynamic estimation in a way that offers an acceptable trade-off between execution speed and accuracy. In the following, we will look at optimizations that can be applied at the three levels of abstraction previously considered, namely, the logic, architectural, and behavioral levels. At all levels, certain transformations will be applied to modify the design, and the cost function is the central difficulty. Given the loss of design detail at higher levels of abstraction, it is clear that estimation, and therefore the cost function evaluation, will not be entirely accurate. Given this, we are interested in cost functions with so-called relative accuracy rather than absolute accuracy. This means that we need a cost function that correctly predicts that we are on the right track, that transformation A would reduce the power, and would do so more than transformation B, but the exact value of the cost function may not be equal to the true power consumption of the circuit. To date, there is no simple cost function that has been rigorously demonstrated to have this property, and this is an ongoing research topic. In the sections below, we will discuss briefly the different optimizations that have been applied at the different levels. Most of these will be part of CAD, that is, will be part of synthesis methods [for a general reference on synthesis, see (32)], but some will be design decisions or styles that can be used irrespective of automatic CAD optimization. It is generally agreed that optimizations at the highest level have the potential of producing the most gains, because there is more freedom to make design changes at the higher levels. But optimizations at all levels are useful and possible, as we discuss below.

Logic Level

Logic level optimization is usually performed as part of the process of logic synthesis, which is the automatic translation from an RTL description to a gate-level description that refers to a given gate library. In this process of translation, the circuit representation goes through different intermediate forms, and several kinds of optimizations or transformations are performed at intermediate steps in order to improve the
value of the objective function. These transformations range from retiming the provided RTL, to looking for common Boolean expressions to form a multi-level logic network, to choosing the right size gates from the library, etc. We will consider some important transformations below and briefly give a feel for how they work and how they may be adapted to the new power objective. State Assignment. If the circuit implements a finite-state machine (FSM), then the circuit states are assigned specific codes (Boolean vectors) that uniquely identify each state. Choosing the best state codes (to provide the smallest objective function value) is called the state assignment problem. This is a very difficult classical synthesis problem that can only be solved efficiently when the objective function is quite simple and the number of states is small. In case of classical area optimization, the cost function may be simply the number of literals in a two-level Boolean expression for the Boolean function representing the combinational part of the circuit. For power, one can aim to minimize the switching activity on the state lines, or to minimize the Hamming distance between neighboring states. The Hamming distance between two states is the number of bits that are not the same in their two codes, i.e., it is the number of bits that have to switch if a transition occurs from one of these states to the other. The Hamming distance measure does not reflect the effect of the code assignment on the size and functionality (and therefore, power) of the combinational logic. A recent technique that does include these effects is given in (33). Clock Control. A common-sense technique to reduce power dissipation is to shut off circuit sections that are not required to perform a useful function for a certain time period. Inside an IC, this does not mean that the supply voltage is turned off to parts of the circuit, because that creates many problems, such as having to recover the correct circuit state and the turn-off/turn-on transients and the resulting noise. Instead, this only means turning the clock off (holding it fixed at either logic 0 or 1) so that the registers or flip-flops do not keep switching when the data stored in them is not needed. This general solution can be applied at all levels of abstraction and is common practice in the industry by now. However, it is not always easy to automate. At the logic level, there have been some attempts to automate this procedure. The method in (34) uses so-called precomputation architectures, which are essentially generalizations of the idea of look-ahead circuitry, in order to determine inexpensively whether or not some Boolean function needs to be computed in the next clock cycle. This is implemented by using additional circuitry to monitor the flip-flop inputs and, if the transition there is a ‘‘don’t care’’ (will not cause a useful transition at the circuit outputs), then the clock is disabled, and that transition is not applied at the combinational circuit inputs. Other related techniques include guarded evaluation (35) and activity-driven clock network design (50). Multi-Level Logic Optimization. At this point in the synthesis flow, one is dealing with a combinational logic circuit, represented as a multi-level Boolean network. A Boolean network is an interconnection of artificial Boolean gates that may not correspond to any real gates in the library. For this reason, this representation is said to be technology indepen-
dent and constitutes an intermediate form of the circuit being designed. Previously, several kinds of optimizations have been applied to the Boolean network in order to reduce its area, including making use of ‘‘don’t cares’’ and extraction of common sub-expressions to do Boolean factoring. For area minimization, the cost function used was related to the literal count (the number of Boolean variables). In order to adapt these techniques to the power minimization objective, the methods in (36,37) include switching activity in the cost function and predict the effect of the change in the function on its fanout cone. The method in (38) modifies the Boolean factoring procedure by using a cost function that looks at the amount of loading and logic sharing. One needs to keep in mind though, that the circuit is at this point technology independent, so that notions of loading and switching activity are approximate, because no capacitance or delay information is available yet. Technology Decomposition. Once the processing of the technology independent Boolean network is complete, it is time to start selecting gates from the library to implement the artificial gates of the Boolean network. As a prerequisite to that step, one performs a so-called technology decomposition, by which an artificial gate is decomposed into a tree of smaller gates, whose sizes are such that it is possible to find a library gate to implement each of them. For example, an artificial gate in the Boolean network may be a 10-input NAND gate. For performance reasons, such a large gate is never implemented as a single gate and will not be found in the library. Instead, this is decomposed into a simple tree interconnection of 2-input NANDs, which are sure to be found in the library. Traditionally, this decomposition is implemented as a balanced binary tree, in order to keep the delay of the tree structure at a minimum. For low-power, considerations of switching activity at the tree inputs are used to restructure the tree to reduce its overall switching activity. Typically, the result is a tree which is not balanced, and examples of these techniques may be found in (39,40). Technology Mapping. After technology decomposition, the circuit is mapped to the library by selecting library gates that can be used to implement either single gates or groups of gates in the decomposed Boolean network. The cost function used in the traditional mapping algorithm has been extended to take switching activity and power into account in (39,41). Post-Mapping Transformations. After the circuit has been mapped, there is a much better chance of doing a good job of low-power optimization because one is working with a technology-dependent representation. The gate delays and capacitances are known throughout the circuit. It is only at this point, for instance, that one can accurately estimate the glitching power, because one can use dynamic power estimation to compute a cost function that includes the glitches. However, given the relatively higher computational cost of dynamic estimation, when the individual gate power values or switching activity are desired, it cannot be applied for every cost function evaluation. Instead, it is applied every now and then as a corrective measure, and in the meantime static estimation techniques can be used, usually in a small neighborhood around the optimization site, in order to update the cost
function value. We briefly consider the following optimizations:

1. Rewiring and Transduction. The transduction method (42) is a logic synthesis approach that applies transformations to the mapped network. One of these transformations rewires the network in a way that does not change its overall function (making use of internal don’t cares) if the rewired network has a lower objective function value. This has been applied with a power cost function, leading to power reductions of 5 to 20%.

2. Path Balancing. If the delays along the paths leading to the inputs of a logic gate are approximately equal, then there will be few or no glitches at the gate output. Thus path balancing has been applied to equalize the path delays, through either delay insertion or gate sizing. In either case, there are possible disadvantages that one should watch out for, such as increased area or delay, or even increased power in some other part of the circuit.

3. Retiming. If a node has high glitching activity and a flip-flop is inserted right after it, only the one meaningful transition at that node is allowed to propagate through. This is the idea behind using retiming for power reduction. The flip-flops are moved around, while preserving the circuit functionality and delay, in order to throttle glitches and stop them from propagating downstream, thus reducing power.

As a final word, it should be stated that, in spite of all the above proposed techniques, some of which are quite complex and involved, power reduction from gate-level optimization is limited. In the industry, if an industrial-strength logic synthesis tool is applied for minimum area, it is typically found that the resulting circuit has been optimized to such an extent that only about 5% reduction in its power is possible through further power-specific gate-level optimization. This is in line with the comments made earlier that optimizations at higher levels can have more impact.

Architectural Level

Higher gains in power reduction are possible at this level. Indeed, order-of-magnitude reductions in power have been reported by choosing the right architecture for an application. Unfortunately, there are no good automatic algorithms for making these choices. The automatic translation from a behavioral description to an architectural (RTL) description is called high-level synthesis. This topic has been studied for over 10 years (43), but no commercial tools exist yet that efficiently provide a good RTL solution given a general behavioral level specification for a large design. As a result, there are no commercial architectural level power optimization tools available today. In this section, we will consider the types of decisions/choices that such a tool would have to make in order to reduce power, and consider some attempts that have been made recently to solve this problem. One type of transformation that has been applied at this level is to treat the supply voltage as a design variable and to decrease it as much as possible in order to reduce the power. Since voltage reduction leads to longer delays, and in order to compensate for this, the logic is repli-
cated and concurrent computation and multiplexing are used to maintain the computational throughput. This has been applied mainly to DSP designs (44). Other approaches aim to construct the architecture in a way that maintains a locality of reference. This means that, whenever possible, communication between computational resources should take place over small distances, in a local neighborhood. One should minimize the accesses to central resources such as memories, ALUs, or buses. This approach helps reduce the power because long interconnect lines can be a big source of power dissipation. This technique, even though it sounds simple, is hard to automate, because of the complexity of the interactions that have to be considered. Architectural choices can impact the amount of concurrency, multiplexing, and the frequency of a design, so the search problem for the best architecture is a very difficult one. Existing high-level synthesis approaches include a 3-step process by which the behavioral specification is translated to the RTL level. These are allocation (how many instances of each hardware resource are needed?), assignment (on what hardware resource will a behavioral level computational operation be performed?), and scheduling (when will the operation be executed?). Traditionally, high-level synthesis tries to minimize the number of hardware resources needed (area) and/or the total time of the schedule (delay). A common traditional approach in high-level synthesis is to schedule the operations so as to keep as many hardware resources busy as possible. This helps reduce the number of required resources (area) but is not always a good approach for power reduction. In fact, in some cases, one may have to add additional resources in order to reduce the overall switched capacitance. For low-power, and in order to maintain the locality of reference, the assignment algorithm should aim for regularity and locality. This reduces the interconnect overhead and power. Scheduling can also affect the signal correlations at the inputs of resources, which can affect the power consumed in these resources. In general, the difficulty lies in developing a cost function that takes the switching activity into account and in deciding what transformations to apply and in what sequence. For further reading, consult (4,29,45).

Behavioral Level

At this highest level of design specification, the IC may be represented by an algorithm, or a combination of algorithms or functional modules. Due to the difficulty of this problem, there are almost no CAD optimization techniques at this level, but there are many proven design decisions and styles that have been shown to work in certain situations. An obvious approach is to choose the right algorithm for a given function. In practice, this means rewriting the algorithm in order to reduce the number of times that certain functional operations are performed. The operations chosen should be those that are known to require a lot of energy per computation. High-level design decisions may be made at this point, such as choosing to recycle the energy back to the power supply, so-called adiabatic computing (46). These very interesting techniques, however, are not yet practical enough. Memory-optimizing transformations can also be used to maintain the locality of reference. For instance, one can replace expensive
accesses to background (secondary) memory by accessing foreground memory or by using distributed memory (47). Another option is to choose the right data representation or encoding so as to reduce switching activity. For instance, the commonly used 2's complement notation has the disadvantage that all data bits switch together when the data value changes from 0 to −1, which occurs quite often. In contrast, the sign magnitude representation produces a single bit change for this occurrence. And finally, for embedded processors, one can consider choosing the right instruction set so as to reduce power, and compiling software to produce low-power programs. This issue has been explored in (48). Other power optimizations have been applied at the behavioral level. The reader is referred to (49) for a more detailed discussion.
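The 0 to −1 example can be checked with a few lines of Python; the 16-bit word width and the particular sign-magnitude encoding below are assumptions made only to illustrate the bit-switching argument: essentially every bit toggles in 2's complement, while only the sign and low-order bits change in sign magnitude.

# Count bit flips for a 0 -> -1 data transition in two encodings (16-bit words assumed).
WIDTH = 16

def twos_complement(value):
    return value & ((1 << WIDTH) - 1)          # e.g., -1 -> 0xFFFF

def sign_magnitude(value):
    sign = 1 << (WIDTH - 1) if value < 0 else 0
    return sign | abs(value)                   # e.g., -1 -> 0x8001

def bit_flips(a, b):
    return bin(a ^ b).count("1")

print("2's complement, 0 -> -1:", bit_flips(twos_complement(0), twos_complement(-1)))  # all 16 bits flip
print("sign magnitude, 0 -> -1:", bit_flips(sign_magnitude(0), sign_magnitude(-1)))    # only 2 bits flip (sign and LSB)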
BIBLIOGRAPHY 1. D. Singh et al., Power conscious CAD tools and methodologies: A perspective. Proc. IEEE, 83: 570–593, 1995. 2. F. N. Najm, Power estimation techniques for integrated circuits. Proc. IEEE/ACM Int. Conf. on Computer-Aided Design, 13: 492– 499, 1995. 3. P. Landman, High-level power estimation. Proc. Int. Symp. Low Power Electron. Design, 1: 29–35, 1996. 4. M. Pedram, Power minimization in IC design: principles and applications. ACM Trans. Design Autom. Electronic Syst., 1 (1): 3– 56, 1996. 5. S. Devadas and S. Malik, A survey of optimization techniques targeting low power VLSI circuits. Proc. Design Autom. Conf., 32: 242–247, 1995. 6. S. Chowdhury and J. S. Barkatullah, Estimation of maximum currents in MOS IC logic circuits. IEEE Trans. Comput.-Aided Des., 9: 642–654, 1990. 7. S. Devadas, K. Keutzer, and J. White, Estimation of power dissipation in CMOS combinational circuits using Boolean function manipulation. IEEE Trans. Comput.-Aided Des., 11: 373–383, 1992. 8. H. Kriplani, F. N. Najm, and I. Hajj, Pattern independent maximum current estimation in power and ground buses of CMOS VLSI circuits: Algorithms, signal correlations, and their resolution. IEEE Trans. Comput.-Aided Des., 14: 998–1012, 1995. 9. S. Manne et al., Computing the maximum power cycles of a sequential circuit. Proc. Design Automation Conf., 32: 23–28, 1995. 10. M. A. Cirit, Estimating dynamic power consumption of CMOS circuits, Proc. Int. Conf. on Computer-Aided Design, 5: 534–537, 1987. 11. F. Najm, R. Burch, P. Yang, and I. Hajj, Probabilistic simulation for reliability analysis of CMOS VLSI circuits, IEEE Trans. Comput.-Aided Des., 9: 439–450, 1990 (Errata in July 1990). 12. F. Najm, Transition density: A new measure of activity in digital circuits, IEEE Trans. Comput.-Aided Des., 12: 310–323, 1993. 13. F. Najm, Low-pass filter for computing the transition density in digital circuits, IEEE Trans. Comput.-Aided Des., 13: 1123– 1131, 1994. 14. A. Ghosh et al., Estimation of average switching activity in combinational and sequential circuits, Proc. Design Automation Conf., 29: 253–259, 1992. 15. C-Y. Tsui, M. Pedram, and A. M. Despain, Efficient estimation of dynamic power consumption under a real delay model, Proc. Int. Conf. on Computer-Aided Design, 11: 224–228, 1993.
16. R. Marculescu, D. Marculescu, and M. Pedram, Switching activity analysis considering spatiotemporal correlations, Proc. Int. Conf. on Computer-Aided Design, 12: 294–299, 1994. 17. R. Marculescu, D. Marculescu, and M. Pedram, Efficient power estimation for highly correlated input streams, Proc. Design Automation Conf., 32: 628–634, 1995. 18. C. M. Huizer, Power dissipation analysis of CMOS VLSI circuits by means of switch-level simulation, IEEE European Solid State Circuits Conf., pp. 61–64, 1990. 19. R. Burch et al., A Monte Carlo approach for power estimation, IEEE Trans. VLSI Syst., 1: 63–71, 1993. 20. M. Xakellis and F. Najm, Statistical Estimation of the Switching Activity in Digital Circuits, 31st ACM/IEEE Design Autom. Conf., 31: 728–733, 1994. 21. L-P. Yuan, C-C Teng, and S-M Kang, Statistical estimation of average power dissipation in CMOS VLSI circuits using nonparametric techniques, Proc. Int. Symp. on Low Power Electronics and Design, 1: 73–78, 1996. 22. C-S. Ding et al., Stratified random sampling for power estimation. Proc. Int. Conf. on Computer-Aided Design, 14: 576–582, 1996. 23. J. Monteiro and S. Devadas, A methodology for efficient estimation of switching activity in sequential logic circuits, ACM/IEEE 31st Design Autom. Conf., 31: 12–17, 1994. 24. C-Y Tsui, M. Pedram, and A. M. Despain, Exact and approximate methods for calculating signal and transition probabilities in FSMs, ACM/IEEE 31st Design Autom. Conf., 31: 18–23, 1994. 25. F. N. Najm, S. Goel, and I. N. Hajj, Power estimation in sequential circuits, ACM/IEEE 32nd Design Autom. Conf., 32: 635–640, 1995. 26. P. E. Landman and J. M. Rabaey, Architectural power analysis: The dual bit type method, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 3: 173–187, 1995. 27. M. Nemani and F. N. Najm, Towards a high-level power estimation capability. IEEE Trans. Comput.-Aided Des., 15: 588–598, 1996. 28. D. Marculescu, R. Marculescu, and M. Pedram, Information theoretic measures for power analysis. IEEE Trans. Comput.-Aided Des., 15: 599–610, 1996. 29. R. Mehra and J. Rabaey, Behavioral level power estimation and exploration. Proc. Int. Workshop on Low Power Design, 1: 197–202, 1994. 30. F. Kurdahi and A. C. Parker, Techniques for area estimation of VLSI layouts. IEEE Trans. Comput.-Aided Des., 8: 81–92, 1989. 31. H. T. Heineken and W. Maly, Standard cell interconnect length prediction from structural circuit attributes. Proc. Custom Integrated Circuits Conf., pp. 167–170, 1996. 32. G. De Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, 1994. 33. C-Y. Tsui et al., Low power state assignment targeting two- and multi-level logic implementations. Proc. Int. Conf. on Computer-Aided Design, 12: 82–87, 1994. 34. M. Alidina et al., Precomputation-based sequential logic optimization for low power. IEEE Trans. VLSI, 2: 426–436, 1994. 35. V. Tiwari, S. Malik, and P. Ashar, Guarded evaluation: pushing power management to logic synthesis/design. Proc. Int. Symp. on Low Power Design, 1: 221–226, 1995. 36. A. Shen et al., On average power dissipation and random pattern testability of CMOS combinational logic networks. Proc. Int. Conf. on Computer-Aided Design, 10: 402–407, 1992. 37. S. Iman and M. Pedram, Multi-level network optimization for low power. Proc. Int. Conf. on Computer-Aided Design, 12: 372–377, 1994.
38. K. Roy and S. C. Prasad, Circuit activity based logic synthesis for low power reliable operations. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 1: 503–513, 1993. 39. C-Y Tsui, M. Pedram, and A. M. Despain, Technology decomposition and mapping targeting low power dissipation. ACM/IEEE 30th Design Autom. Conf., 30: 68–73, 1993. 40. R. Panda and F. N. Najm, Technology decomposition for lowpower synthesis. Proc. Custom Int. Circuits Conf., pp. 627–630, 1995. 41. V. Tiwari, P. Ashar, and S. Malik, Technology mapping for low power. Proc. Design Automation Conf., 30: 74–79, 1993. 42. S. Muroga et al., The Transduction Method—Design of Logic Networks Based on Permissible Functions. IEEE Trans. Comput., 38: 1404–1424, 1989. 43. R. A. Walker and R. Camposano, A Survey of High-Level Synthesis Systems. Boston: Kluwer, 1992. 44. A. P. Chandrakasan et al., HYPER-LP: A system for power minimization using architectural transformations. Proc. Int. Conf. on Computer-Aided Design, 10: 300–303, 1992. 45. S. Wuytack et al., Global communication and memory optimizing transformations for low power systems. Int. Workshop for LowPower Design, 1: 203–208, 1994. 46. W. C. Athas et al., Low-power digital systems based on adiabaticswitching principles. IEEE Trans. VLSI Syst., 2: 398–407, 1994. 47. J. Bunda, D. Fussel, and W. Athas, Evaluating power implications of CMOS microprocessor design decisions. Proc. Int. Workshop on Low-Power Design, 1: 147–152, 1994. 48. V. Tiwari, S. Malik, and A. Wolfe, Power analysis of embedded software: a first step towards software power minimization. IEEE Trans. VLSI, 2: 437–445, 1994. 49. R. Mehra et al., Algorithm and architectural level methodologies for low power. In Low Power Design Methodologies. J. Rabaey and M. Pedram, eds., New York: Kluwer, 1996, pp. 333–362. 50. G. E. Tellez, A. Farrahi, and M. Sarrafzadeh, Activity-driven clock design for low power circuits. Proc. Int. Conf. on ComputerAided Design, 13: 62–65, 1995. 51. S. S. Sapatnekar and W. Chuang, Power vs. delay in gate sizing: conflicting objectives? Proc. Int. Conf. on Computer-Aided Design, 13: 463–466, 1995. 52. L. Pileggi, Coping with RC(L) interconnect design headaches. Proc. Int. Conf. on Computer-Aided Design, 13: 246–253, 1995.
FARID N. NAJM University of Illinois at Urbana-Champaign
POWER ESTIMATION OF SOFTWARE. See SOFTWARE PERFORMANCE EVALUATION.
POWER FACTOR. See REACTIVE POWER.
Wiley Encyclopedia of Electrical and Electronics Engineering Vlsi Circuit Layout Standard Article Jun Dong Cho1 and Majid Sarrafzadeh2 1Sung Kyun Kwan University, Suwon 2Northwestern University, Evanston, IL Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. : 10.1002/047134608X.W1801 Article Online Posting Date: December 27, 1999 Abstract | Full Text: HTML PDF (326K)
Abstract The sections in this article are Layout Methodology Physical Design Flow Partitioning Floor Planning Placement Routing Other Issues in Physical Design Flow High-Performance Layout Design with Submicron Technologies
VLSI CIRCUIT LAYOUT

Very-large-scale integrated (VLSI) circuits have been recognized as an important area of electrical and computer engineering. With the accelerating complexity of semiconductors shown in Fig. 1 (1), VLSI must address the needs of consumer products, personal computers, workstations, midrange computers, mainframes, and supercomputers. Among the stages of VLSI design, such as architectural design, functional design, logic design, and circuit design, the physical (or layout) design stage is the back-end process of determining the physical locations of circuit components from the circuit net list and interconnecting them inside the VLSI chip, down to the level of individual transistors. Physical design is an art based on the science of establishing interconnections and fulfilling system functions by placing modules in a chip. Continuous advances in the speed and scale of integrated circuits have created ever greater demands for higher density packaging to ensure reduced interconnection delays for improved electrical performance. In the physical design stage, the circuit representation of each component is converted into a geometric representation. This representation is a set of geometric patterns that perform the intended logical function of the corresponding component. Connections between different components are also expressed as geometric patterns. The geometric representation of a circuit is called a layout. The exact details of the layout also depend on design rules, which are guidelines based on the limitations of the fabrication process and the electrical properties of the fabrication materials. Physical design is a very complex process, and thus it is generally partitioned into smaller subproblems and solved in a hierarchical fashion. At the highest level of the hierarchy, for example, in multichip module (MCM) or printed circuit board (PCB) designs, a set of integrated circuits (ICs) is interconnected. Within each IC, a set of modules, such as memory units, arithmetic logic units (ALUs), input-output ports, and random logic, are arranged and interconnected. Each module consists of a set of gates. Because the cost of fabricating a circuit is a function of the circuit area, circuit layout techniques aim to produce layouts with a small area. Also, a smaller area implies fewer defects and interconnects, hence a higher yield and higher performance. Conventional physical design is becoming a limiting factor in translating semiconductor speed into system performance. In high-end systems such as supercomputers, mainframes, and medical and military electronics, more than 50% of the total system delay usually results from interconnection, and by the year 2000, the share of interconnection delay is expected to rise to 80% (2). In the medical electronics industry, speed and reliability are the main objectives. Moreover, increasing circuit count and density continue to place demands on high-level physical design. Physical design in the timing and performance-driven environment differs from classical VLSI design in a number of important ways. Performance is of overriding importance in
[Figure 1. Trends in semiconductor device integration: critical dimension (µm) versus year (1993–1999), from a 0.5 µm process (3ML/3.3 V, 5,600 gates/mm2) through 0.35 µm (5ML/3.3 & 2.5 V, 18,000 gates/mm2) and 0.25 µm (6/7ML/2.5 V, 30,000 gates/mm2) to a 0.18 µm process (7ML/1.8 V, 50,000 gates/mm2).]
‘‘aggressive’’ design. The need for a systematic approach to physical design and analysis is paramount. This approach must link high-performance tools with timing and integrity parameters. Engineers must be able to exploit these linkages to make early design trade-offs based on such parameters. Thus, physical design algorithms must be driven by performance constraints. This involves careful consideration to minimize signal integrity effects, such as cross talk, reflections, and the effects of crossings, bends, and vias (3–11). The number of layers required continually increases, with a three-dimensional flavor, which is lacking in existing VLSI routing where the number of layers rarely exceeds three. Because of rapid development in VLSI technology, the average transistor count in a chip has increased enormously. The minimization of power consumption in modern circuits, therefore, is of great importance. In particular, battery operated products, such as portable computers, cellular phones, etc., have come to a point in which minimization of power consumption is among the most crucial issues. On the other hand, it has been shown (12) that circuit activity (which is closely related to power consumption) is an acceptable measure of failure in digital circuits. Transitional density or average switching rate at different sites in a circuit is introduced in Ref. 13 as a model to measure the circuit activity, which can be used to estimate the power consumption in the circuit. Because of the importance of power consumption, there has been a great shift of attention in the logic and layout synthesis from delay and area minimization toward this issue (14–17). Today’s electronics industry requires new automated design tools and design methodologies that allow designers to concurrently design high-performance integrated circuits and high-performance packaging. The purpose of this article is to provide the tool developer and VLSI designer with recent automated design techniques and algorithms in VLSI and MCM methodologies. This article focuses on the VLSI physical design methodology and automation and covers most aspects of up-to-date physical design from partitioning, floor
planning, and placement to routing, together with such related areas as special structures for clock, power, and cross talk. This article explores various high-level physical layout problems in designing high-speed, very large integrated circuits in multilayer environments to minimize interconnect delays, bends, vias, cross talk, clock skew, and, finally, power consumption. Continual advances in the speed and integration scale of integrated circuits have created ever greater demands for high density VLSI to ensure reduced interconnection delays for improved electrical performance. Discontinuities must be controlled to keep the resulting reflections to a minimum. Because high-performance circuits are usually designed aggressively (i.e., most of the nets are considered critical nets), it is preferable to minimize the number of bends and vias. In high-speed clock design, a number of physical constraints may result in a substantial increase in wiring area, especially in smaller routing subregions. The remainder of this article is organized in the following sections: Layout Methodology, Physical Design Flow, Partitioning, Floor Planning, Placement, Routing, Other Issues in Physical Design Flow, and High-Performance Layout Design with Deep Submicron Technologies (Clock-Tree Synthesis, Cross Talk). For a comprehensive book on VLSI circuit layout, see Ref. 18.
LAYOUT METHODOLOGY

Design styles are broadly classified as full-custom or semicustom. In a full-custom layout, different blocks of a circuit are placed at any location on a silicon wafer as long as all the blocks do not overlap. On the other hand, in semicustom layout, some parts of a circuit are predesigned and placed at some specific point on the silicon wafer. Selection of layout styles depends on many factors including type of chip, cost,
and time to market. Full-custom layout is a preferred style for mass produced chips because the time required to produce a highly optimized layout is justified. On the other hand, to design an application-specific integrated circuit (ASIC), a semicustom layout style is usually preferred. The term custom means the least restricted design style. The term gate array is used to describe chips which are personalized by metal levels only. The gate array is the most restricted design style. The term standard cell describes chips which are personalized by all masking levels within the constraints of some design system. The standard cell falls into the middle of this restrictivity spectrum. These chips are sometimes called master image and semicustom. Standard cells have a fixed height but different widths, depending on the functionality of the modules. They are laid out in rows, with routing channels or spaces between rows reserved for laying out the interconnects between the chip components. Standard cells are usually designed so the power and ground interconnects run horizontally through the top and bottom of the cells. When the cells are adjacent to each other, these interconnects form a continuous track in each row. The logic inputs and outputs of the module are available at pins or terminals along the top or bottom edge (or both). They are connected by running interconnects or wires through the routing channels. Connections from one row to another are made through vertical wiring channels at the edges of the chip or by using feedthrough cells, which are standard height cells with a few interconnects running through them vertically. Macro blocks are logic modules not in the standard cell format, usually larger than standard cells, placed at any convenient location on the chip. Here, the circuit consists only of primitive logic gates, such as NAND gates, not only predesigned but prefabricated as a rectangular array, with horizontal and vertical routing channels between gates reserved for interconnects. Then the design of a chip is reduced to designing the layout for the interconnects according to the circuit diagram. Likewise, fabrication of a custom chip requires only the making of steps for interconnect layout. The third chip layout style uses only macro blocks. The blocks are irregular in shape and size and do not fit together in regular rows and columns. Once again, space is left around the modules for wiring. For a detailed description of layout styles, see Ref. 19a. A VLSI chip designer has several approaches to design. Each of these choices offers the designer trade-offs between chip density and chip design time. A semiconductor wafer is coated with a light-sensitive substance called a photoresist. A mask containing patterns that represent circuits and their interconnections is placed over the wafer, and an ultraviolet light shines through the mask. This light exposes the photoresist on the wafer, and with additional processing, the shapes on the mask are transferred to the wafer. The process is similar to photography, where the mask is the negative and the wafer with photoresist is the light-sensitive photographic paper. The chip design problem is to design a mask with shapes that represent all of the circuits on the chip. Therefore, the complexity of a chip design is measured by the number of shapes that must be drawn to represent a complete chip. This number has historically doubled every two years. 
There are at least three major reasons for this increase in complexity: decreasing minimum dimensions, larger chip sizes, and increasing process complexity in the vertical dimension. Let us
look at each one of these in turn. Because of rapid advances in the state of the art of mask and wafer processing, the minimum dimensions of the shapes on the mask have been shrinking. A minimum dimension is the smallest width of a shape, or the smallest spacing between two adjacent shapes. Because the minimum spacings in a typical process are reduced in both the x and y dimensions, the increase in complexity is a square function, that is, a reduction in minimum dimensions by a factor of 2 results in an increase in complexity by a factor of 4. In the early 1970s, minimum features were on the order of 10 microns. In the early to mid 1980s the minimum features dropped to 2 microns and less. In the 1990s the minimum features shrank to less than one micron. The second driving force for increased chip complexity is the increasing chip size. As the chip size increases, the number of shapes on a chip increases. The maximum size of a chip is controlled by the ability of the maskmaking tools and by the quality (purity) of the chip manufacturing process. Both of these have been improving, and the result is that chip sizes are increasing. This improvement is projected to continue so that chips in the future will be still larger. The third driving force for the increase in chip complexity is the complexity of the process in the vertical dimension. Early processes typically had only four masking levels. Today’s processes range from 8 to 14 levels, and in the future even more masking levels will be employed. In addition there is exploratory work for placing another layer of circuits on top. This will obviously require even more masking levels resulting in more shapes that must be drawn. Gate Arrays A gate array consists of a chip in which a predefined pattern of transistors is fabricated on the chip. These transistors are constructed only in the silicon levels. The customization of the chip occurs only on the metal levels. Because every chip contains the same pattern of transistors independent of its final usage, the chips (wafers) are manufactured in volume, stockpiled, and the metal customizing is placed on the chips at the last moment. Thus, the unique manufacturing processing is only a portion of the total manufacturing time, and the time from design to end product is significantly reduced. In a typical gate array, the manufacturing turnaround time is half that for a chip which has customization built into all of the mask levels. The layout of a typical gate array is shown in Fig. 2. The chip has three main areas. Around the periphery of the chip
[Figure 2. Typical gate array: I/O pads around the periphery, with rows of gate-array cells separated by wiring bays in the active area.]
[Figure 3. Wiring channels: first-level (horizontal) and second-level (vertical) wiring.]
is the area reserved for input/output (I/O) circuits and the bonding pads which connect the chip to the next higher level package. Inside the area is the so-called active area that contains the array of transistors which are connected to implement the required logic. This active area is typically divided into entities called cells and wiring bays. Cells are regularly shaped areas that contain the transistors. A circuit is formed by interconnecting the transistors in one or more cells. Between the rows of cells are wiring channels used to connect circuits. In typical gate array, the connections between circuits consist of a series of line segments in a vertical and horizontal direction. These wiring segments lie on the wiring grid. The wiring grid consists of all of the allowable locations for placing line segments used to interconnect circuits. It is usually defined as a series of vertical and horizontal lines spaced as close together as the technological rules allow. These lines do not exist in reality, but they are stored by computer to define permissible line segment locations. If the gate array has two metal levels, the first metal level is typically routed horizontally and the second level metal is routed vertically (Fig. 3). The connection between metal levels is made by a via (a hole in the dielectric that separates the metal levels). These vias can be thought of as wiring segments in a vertical direction. If only one metal level is available, the vertical segments must be implemented on a diffusion or polysilicon mask level. A typical gate-array cell contains four transistors. The small square shapes are contacts to the transistors which must be connected together to define a specific circuit. All input and output (I/O) connections of this gate are brought to the cell boundary next to the wiring channels. These circuit I/Os are the only valid connection points to the circuit. They must fall on the exact locations of the vertical and horizontal wiring grid described before. Thus the connections of the transistors to form circuits are independent of the connections between circuits by making all circuit I/Os fall off the wiring grid. This is an important concept in design called hierarchy. Hierarchy is a methodology of doing various parts of a design independent of other portions of the design but under certain constraints. These constraints ensure that the various pieces of the design are merged at the end and that the chip works. This is done by defining standard interfaces which each part of the design must meet. A simple analogy to this is the way automobiles are assembled. Car engines are built in one plant, transmissions are built in another plant, and the car is assembled in still another plant. By designing the interface between the engine and the transmission in a standard format, the engines and transmissions are built in-
dependently and a large variety of engine and transmission options are combined in the final car. The entire chip image described previously is designed ahead of time (before the logic function is implemented on the chip). The image design is usually done in the laboratory of the chip manufacturer by its design engineers. Similarly, these same engineers are usually responsible for the electrical design and the physical design (layout) of the circuits. Doing this portion of the design requires a detailed knowledge of the particular technology, including circuit design and process. This is best done by the manufacturer of the chip. By segmenting the design process in this manner, the logic designers, who implement the logic, do not have to acquire the detailed skills necessary to design the image and circuits. A further advantage is that the chip manufacturer is free to modify the process without concern over distributing that information to a wide variety of designers. Once the image and circuits are designed, they are converted to digital form for storage in a design system. The chip design process consists of assembling the various circuit configurations on the chip image. Because all of the design requiring detailed technical knowledge is done ahead of time, the process of designing a chip is done by people who do not understand the technological details. In fact, if the design process is fully automated, which many gate array design systems are at this time, the chip design is done by relatively unskilled people. This is described in more detail in this section. The discussion of gate arrays shows some clear advantages for this approach. There are, however, some drawbacks. Gate arrays shorten and simplify the design process. They do this at the expense of silicon efficiency. A design done as a gate array is larger than a design done by a more manual process. The reason for this is that the gate array image designer must anticipate all of the potential usages of the gate array and incorporate them in the selection of transistors. This makes the transistors larger then if they were designed for each individual application. Also in any typical design, not all of the transistors are used. Another problem with the gate array is that it uses more power than a more customized design. The larger transistors and the inability to tailor the design to the specific application result in higher power usage. The switching speed for gate arrays is usually not as high as that of more customized chips. Performance is usually related to chip size. For any given logic function, the bigger the chip, the slower it runs. Also the inability to custom design transistors for a given application affects performance. In summary, gate arrays offer the designer a trade-off between design time and cost/performance/power of the finished product. Standard Cells For the purpose of this section, standard cells are defined as chips which are customized on all mask levels with the exception of pure custom designs. Referring to the design spectrum, standard cells cover the entire spectrum from the gate arrays on one side to the custom design on the other side of the spectrum. This results in a wide variety of design methods. In this introductory section, a simple, restricted design methodology has been chosen. This simple methodology is sufficient to explain the concepts and to contrast standard cells with gate
[Figure 4. Standard cell layout with macro blocks: rows of standard cells and routing channels, feedthrough cells, RAM and ROM macro blocks, pads, and wasted space.]
arrays. The last part of this section deals with variations in the basic approach covered in this section. At this level of detail, the image looks just like the gate array (Fig. 2). The chip is divided into the same major areas. The outside of the chip contains the I/O circuits and pads. Inside is the so-called active area that contains cells into which circuits are placed and also wiring bays which contain the intercircuit wiring. For the gate array, the cells contained a predefined pattern of transistors. In a standard cell, the cells are totally blank at the time the image is defined. They are just reserved areas of silicon which will contain circuits. Because these areas are reserved by the design system and the customization of the chip occurs on all mask levels, there are no restrictions on the types of circuit devices that are placed in the cells. This allows much more flexibility in designing standard cell circuits, resulting in a more efficient utilization of silicon. In a standard cell, only the transistors needed to make a circuit are used. In addition the transistors used are tailored to the specific use. Another feature of standard cells that allows for higher density is the ability to place large complex circuits (called macros) on the image (Fig. 4). These macros occupy all of the image area and also the wiring bay area. They are random access memory (RAM) macros, read only memory (ROM) macros, registers, arithmetic logic units (ALUs), or any other customized piece of logic.
Field-Programmable Gate Arrays The field-programmable gate array (FPGA) is a new approach to ASIC design that dramatically reduces manufacturing turnaround time and cost (18–19). FPGA designs provide large-scale integration and user programmability. An FPGA consists of horizontal rows of programmable logic blocks which are interconnected by a programmable routing network. In its simplest form, a logic block is a memory block that is programmed to remember the logic table of a function. Given a certain input, the logic block ‘‘looks up’’ the corresponding output from the logic table and sets its output line accordingly. Thus by loading different look-up tables, a logic block is programmed to perform different functions. It is clear that 2K bits are required in a logic block to represent a K-bit input, 1-bit output combinational logic function. Obviously, logic blocks are feasible only for small values of K. Typically, the value of K is 5 or 6. The value of K is even less for multiple outputs and sequential circuits. The rows of logic blocks are separated by horizontal routing channels. The channels are not simply empty areas in which metal lines are arranged for a specific design. Rather, they contain predefined wiring ‘‘segments’’ of fixed lengths. Each input and output of a logic block is connected to a dedicated vertical segment. Other vertical segments merely pass through the blocks, serving as feedthroughs between channels. Connection between a horizontal segment and a vertical segment is provided through a
cross fuse. Figure 5 shows the general architecture of an FPGA, which consists of five rows of logic blocks. The cross fuses are shown as circles, whereas antifuses are shown as rectangles. Because there are no user-specific fabrication steps in an FPGA, the fabrication process is set up cost effectively to produce large quantities of generic (unprogrammed) FPGAs. The customization (programming) of an FPGA is rather simple. Given a circuit, it is decomposed into smaller subcircuits, so that each subcircuit is mapped to a logic block. The interconnections between any two subcircuits are achieved by programming the FPGA interconnects between the corresponding logic blocks. Programming (blowing) one of the fuses (antifuse or cross fuse) provides a low resistance bidirectional connection between two segments. When blown, antifuses connect the two segments to form a longer one. To program a fuse, a high voltage is applied across it. FPGAs have special circuitry to program the fuses. The circuitry consists of the wiring segments and control logic at the periphery of the chip. Fuse addresses are shifted into the fuse programming circuitry serially. The programmable nature of these FPGAs requires new CAD algorithms to effectively use logic and routing resources. The problems involved in customizing an FPGA are somewhat different from those of other design styles, but many steps are common. For example, the partition problem of FPGAs is different from the partitioning problem in other design styles whereas the placement and the routing are similar to the gate-array approach.

[Figure 5. Architecture of an array-style FPGA, with rows of logic blocks, switches, cross fuses, and programmable interconnect points.]

Sea of Gates

The sea of gates is an improved gate array in which the master is filled completely with transistors. The master of the sea of gates has a much higher density of logic implemented on the chip and allows a designer to fabricate complex circuits, such as RAMs. In the absence of routing channels, interconnects have to be completed by routing through gates or by adding more metal or polysilicon interconnection layers. There are problems associated with either solution. The former reduces the gate utilization, and the latter increases the mask count and increases fabrication time and cost.
Comparison of Different Design Styles The choice of design style depends on the intended functionality of the chip, time to market, and the total number of chips to be manufactured. It is common to use full-custom design style for microprocessors whereas FPGAs are used for a small circuit used in networking. However, there are several chips which have been manufactured by using a mix of design styles. For large circuits, it is common to partition the circuit into several small circuits which are then designed by different teams. Each team may use a different design style or a number of design styles. Another factor complicating the issue of design style is reusability of existing designs. It is a common practice to reuse a complete or partial layout from existing chips for new chips to reduce the cost of a new design. Design styles are a continuum from very flexible (full-custom) to a rather rigid design style (FPGA) to cater to differing needs. A comparison of layout styles is given in Table 1. For example, it takes longer to fabricate a standard cell; however, the resulting chip operates at a higher speed. As seen from the table, full-custom provides compact layouts for high-performance designs but requires considerable fabrication effort. On the other hand, an FPGA is completely prefabricated and does not require any user-specific fabrication steps. However, FPGAs are used only for small, general purpose designs. Other Layout Environments A printed circuit board (PCB) is a multilayer sandwich of routing layers. ICs are packaged into ceramic or plastic carriers and then mounted on a PCB. The current PCB technology offers as many as 30 or more routing layers. Via specifications are also very flexible and vary so that a wide variety of combinations is possible. For example, a set of layers can be connected by a single via called the stacked via. The traditional approach of single chip packages on a PCB has intrinsic limitations in silicon density, system size, and contribution to propagative delay. Thru-hole assemblies gave way to surfacemounted assemblies (SMAs). In an SMA, pins of the device do not go through the board but are rather soldered to the surface of the board and devices are placed on both sides of
Table 1. Comparison Among Various Layout Styles (+ desirable, − not desirable)

                                 Full Custom   Standard Cell   Gate Array   FPGA
Fabrication time                     −−             −−              +         ++
Packing density                      ++              +              −         −−
Unit cost in large quantity          ++             ++              +         −−
Unit cost in small quantity          −−             −−              +         ++
Easy design and simulation           −−              −              −         ++
Remedy for erroneous design          −−             −−              −         ++
Accuracy of timing simulation         −              −              −          +
Chip speed                           ++             ++              +          −
the board. SMAs eliminate the need for large diameter plated-thru-holes, allowing finer pitch packages and increased routing density. SMAs reduce the package footprint and improve performance. The layout problems for printed circuit boards are similar to those in VLSI design, although printed circuit boards offer more flexibility and a wider variety of technologies. The routing problem is much easier for PCBs because many routing layers are available. The planarity of wires in each layer is a requirement in a PCB as it is a chip. There is little distinction between global routing and detailed routing in the case of circuit boards. In fact, because many layers are available, the routing algorithm must be modified to adapt to this threedimensional problem. Compaction has no place in PCB layout because of the gridlike domain used to represent layout information. For more complex VLSI devices with 120 to 196 I/Os, even the surface-mounted approach becomes inefficient and limits system performance. A 132 pin device in a 635 애m pitch carrier requires a 25.4 to 38.1 mm2 footprint. This represents a four to sixfold density loss and twofold increase in interconnect distances versus a 64 pin device. It has been shown that the interconnect density for current packaging technology is at least one order of magnitude lower than the interconnect density at the chip level. This translates into long interconnection lengths between devices and a corresponding increase in propagation delay. For high-performance systems, the propagation delay is unacceptable, even with surface mounting. A higher performance packaging and interconnection approach is necessary to achieve the performance improvements promised in VLSI technologies. This has led to the development of multichip modules (MCMs). The key to semiconductor device improvement is the shrinking feature size, that is, the minimum gate or line width on a device. The shrinking feature size provides increased gate density, increased gates per chip, and increased clock rates. These benefits are offset by an increased number
of I/Os and increased chip power dissipation. The increased clock rate is directly related to the device feature size. With reduced feature sizes, each on-chip device is smaller and therefore has reduced parasitic effects and allows faster switching. Furthermore, the scaling has reduced on-chip gate distances and therefore interconnect delays. Much of the improvement in system performance promised by the ever increasing semiconductor device performance, however, has not been realized because of the performance barriers imposed by today’s packaging and interconnection technologies. Increasingly more complex and dense semiconductor devices are driving the development of advanced VLSI packaging and interconnection technology to meet increasingly more demanding system performance requirements. The alternative approach to the interconnect and packaging limits of conventional chip carrier/PCB assemblies is to eliminate packaging levels between the chip and PCB. One such approach uses multichip modules. The MCM approach eliminates the single chip package and, instead, mounts and interconnects the chips directly onto a higher density, fine-pitch interconnection substrate. In some MCM technologies, the substrate is simply a silicon wafer on which layers of metal lines are patterned. This substrate provides all of the chip-tochip interconnections within the MCM. Because the chips are only one-tenth of the area of the packages, they are placed closer together on an MCM. This provides higher density assemblies and shorter and faster interconnects. Figure 6 is a diagram of an MCM package. It is predicted that multichip modules will have a major impact on all aspects of electronic system design. Multichip module technology offers advantages for all types of electronic assemblies. Mainframes will need to interconnect the high numbers of custom chips needed for the new systems. Costperformance systems will use the high density interconnect to assemble new chips with a collection of currently available chips to achieve high performance without time-consuming custom design, allowing quick time to market. The significant
benefits of multichip modules are reduced size, reduced number of packaging levels, reduced complexity of the interconnection interfaces and clearly cheaper and more efficient assemblies. The multichip revolution in the 1990s will have an impact on electronics as great or greater than surface-mount technology had in the 1980s. The layout problems in MCMs are essentially performance-driven (Fig. 7). The partitioning problem minimizes the delay in the longest wire. Although placement in MCM is simple compared with VLSI, global routing and detailed routing are complex in MCM because of the large number of layers. The critical issues in routing include the effects of cross talk and delay modeling of long interconnect wires. MCM packaging technology does not completely remove all of the barriers of IC packaging technology. Wafer-scale integration (WSI) is considered the next major step, bringing with it the removal of a large number of barriers. In WSI, the entire wafer is fabricated with several types of circuits, the circuits are tested, and the defect-free circuits are intercon-
nected to realize the entire system on the wafer. The attractiveness of WSI lies in its promise of greatly reduced cost, high performance, high level of integration, greatly increased reliability, and significant application potential. However there are still major problems with WSI technology, such as redundancy and yield, that are unlikely to be solved in the near future.
[Figure 6. A typical multichip module: chips with chip-level pads mounted on a substrate with module-level pads, showing interchip and intrachip connections down to the logic cells.]

[Figure 7. Multichip module design flow: starting from a system description and bare dies, the flow proceeds through system partitioning, chip placement, thermal analysis, pin redistribution, and physical design (global routing, layer assignment, timing analysis, and detailed routing) to the final MCM design.]
PHYSICAL DESIGN FLOW

The design cycle of VLSI chips consists of consecutive steps from high-level synthesis (functional design) to production (packaging). The physical design is the process of transforming a circuit description into the physical layout, which describes the positions of cells and routes for the interconnections between them. The main concern in the physical design of VLSI chips is to find a layout with minimal area and minimal total wire length. Some critical nets have much stricter limitations for the maximal wire length. Since VLSI is such a complex system, the designer is forced to use a hierarchical, top-down approach. In circuit layout, we do placement first, then global and detailed routing. Because of its complexity, the physical design is normally broken into various substeps:
• First, the circuit has to be partitioned to generate some (up to 50) macro cells.
• In the floor-planning stage, the cells have to be placed on the layout surface.
• After placement, the global routing is done. In this step, the "loose" routes for the interconnections between the single modules (macro cells) are determined.
• In detailed routing, the exact routes for the interconnecting wires in the channels between the macro cells have to be computed.
• The last step in the physical design is compacting the layout by compressing it in all dimensions, so that the total area is reduced.
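As a concrete (and deliberately simplified) illustration of how these substeps chain together, the following Python sketch runs a toy design through placeholder stage functions. The function names and the dictionary-based design representation are illustrative assumptions, not part of any particular CAD system.

```python
# Toy sketch of the physical design substeps listed above, chained as a
# pipeline. Each stage function is a trivial placeholder; a real tool
# implements each one with the algorithms discussed in later sections.

def partition(design):
    design["macro_cells"] = ["cell_%d" % i for i in range(4)]
    return design

def floorplan(design):
    # assign a rough region origin (x, y) to every macro cell
    design["regions"] = {c: (i * 100, 0) for i, c in enumerate(design["macro_cells"])}
    return design

def place(design):
    # refine rough regions into exact cell locations (toy: keep region origins)
    design["positions"] = dict(design["regions"])
    return design

def global_route(design):
    design["loose_routes"] = [("cell_0", "cell_3")]   # net -> rough path
    return design

def detailed_route(design):
    design["wires"] = [((0, 50), (300, 50))]          # exact wire geometry
    return design

def compact(design):
    design["area"] = 300 * 100                        # compress and measure area
    return design

def physical_design(design):
    for stage in (partition, floorplan, place, global_route, detailed_route, compact):
        design = stage(design)
    return design

print(physical_design({"netlist": [("cell_0", "cell_3")]}))
```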
Although we decompose the complex problem into a series of simpler problems to reduce the design complexity, relying on incomplete and abstract models when making early design decisions may adversely affect the quality of the final solution. Unfortunately, the exact physical attributes of a design are not known, even in the placement step, until the entire design process is carried out. Because design steps are quite expensive, it is not feasible to go through a full design iteration every time a high-level decision is made. At the heart of all methodologies, however, are logic entry programs and checking and auditing programs that ensure that the chip has been placed and wired properly according to the logic specification. Figure 8 shows a typical design methodology flowchart. Each block in the logic list corresponds to a graphical layout of a logic function. The discipline of physical design entails placing, wiring, and checking these physical blocks and then generating the final graphics for manufacture. As described previously, the chip image consists of an array of cells separated by wiring bays into which the physical blocks fall. Automatic placement and wiring programs take advantage of
[Figure 8. Physical design flow: logic entry and logic simulation, followed by physical design (logic-to-physical assignment, partitioning, floorplanning, placement, routing, and compaction), test generation and verification, postlayout simulation, mask generation, and manufacture.]
this regularity and save much time in the physical design process. Now we can place and wire the chip. Automatic placement and wiring programs are not strictly necessary for the chip design process. Earlier design systems have relied upon large manual efforts to accomplish this task. Indeed, a manual placement and wiring capability is necessary because often the programs do not do a complete job. Placement simply consists of assigning a module location to each logic circuit. If the logic library is not all identical in size, then the physical size of each circuit must be known so that the placement does not overlap the circuits. It is all too easy to assign a one-cell circuit to an area already occupied by a multicell circuit. The end goal of placement is to reduce the total wire length on the chip. Preplacement is done if performance of a particular logic path is critical and must be manually controlled. Some standard cell systems offer the op-
tion of large array macros, such as RAMs and PLAs. These macros might have to be manually placed, because placing them automatically is not easy. After placement, global routing decomposes a large routing problem into small, manageable problems for detailed routing. The method first partitions the routing region into a collection of disjoint rectilinear subregions. This decomposition is carried out by finding a ‘‘rough’’ path for each net in order to reduce the chip size, shorten the wirelength, and evenly distribute the congestion over the routing area. Detailed routing follows global routing. The traditional model of detail routing is the two-layer Manhattan model with reserved layers, where horizontal wires are routed on one layer and vertical wires are routed in the other layer. The layout is verified to ensure that the layout meets the system specifications and the fabrication requirements. Design verification consists of design rule checking and circuit extraction. Once the chip is place and wired, then design rule checking is next. This checks the placement for violations, such as overlapping circuits, circuits falling off the chip, and I/O circuits placed in the internal cell structure, or vice versa. Wiring checks include checks for shorts and opens and checks for wires routed through blockages created by chip power buses. Electrical checking is also done at this point. Bipolar circuits are sensitive to wire widths because they drive currents through global nets. Checking for adequate wire widths should be done here. Excessive numbers of plane to plane vias are checked for, because they add resistance and, hence, delay to the nets. Results from all this checking must be folded back into the placement and wiring, forming another iterative loop in the design process. The next step in the physical design process is to extract the net lengths from the design to create a more accurate set of delays. Parasitic resistance and capacitance per unit length for the technology are stored in the technology database, and these values are multiplied by the lengths of each net. The delay of each circuit is computed accurately only when this information is known. Then the delays are fed back to the logic designer for a more accurate delay simulation. If delay paths are too long, then the logic must be changed, necessitating a complete restart on the design, or the placement or wiring must be changed. The crude delay estimation done before physical design should try to be accurate enough that this does not happen. Some design systems offer the capability of doing power optimization. This consists of automatically replacing each logic circuit with a higher power version until certain delay criteria are met. The advantage of doing this is that unnecessary ac or dc power consumption is minimized automatically. The design is started with all logic at the lowest power level. Then, successive iterations of delay calculation, identifying the failing blocks and powering up are done until either all of the nets pass their delay criteria or the maximum power level is reached. The final step in physical design is converting the placement and wiring information and the graphic designs of the chip power structure and circuit library into a complete graphical description of the chip to send to manufacturing. Shape level layout checking can be done at this point as a final check on the design. If the image and library are designed correctly, then there should never be a layout error at this stage. 
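As a rough illustration of the net-length extraction and delay calculation described above, the sketch below multiplies assumed per-unit parasitics by each extracted net length and forms a lumped, Elmore-style delay. The parasitic values, driver resistance, and load capacitance are illustrative assumptions, not data for any particular technology.

```python
# Assumed per-unit parasitics (illustrative values only, not from a real
# technology database).
R_PER_UM = 0.05      # wire resistance, ohms per micron
C_PER_UM = 0.2e-15   # wire capacitance, farads per micron

def net_delay(length_um, driver_resistance=1000.0, load_capacitance=5e-15):
    """Lumped Elmore-style delay estimate (seconds) for one extracted net."""
    r_wire = R_PER_UM * length_um
    c_wire = C_PER_UM * length_um
    # The driver charges all capacitance; the wire resistance sees half of
    # its own capacitance plus the load at the far end.
    return (driver_resistance * (c_wire + load_capacitance)
            + r_wire * (c_wire / 2.0 + load_capacitance))

# Hypothetical extracted net lengths in microns.
extracted = {"net_clk": 1200.0, "net_a": 350.0}
for net, length in extracted.items():
    print(net, "%.1f ps" % (net_delay(length) * 1e12))
```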
Extremely large data volumes characteristic of
VLSI have been raising questions about the ability to continue this chipwide layout checking. One solution to the problem is to use a "shadow" for each circuit when actually doing the checking. The layout of the circuits is checked when the library is designed, and a simplified outline shape is generated which is substituted for the actual graphics before checking. If the gate array approach is chosen, then only the metallization levels need to be checked and sent to manufacturing. Otherwise, all design levels must be checked and sent. The data volumes of VLSI have forced modern placement programs to approach the chip hierarchically. The task is divided into a coarse or global phase and a fine or detailed phase. The global phase is concerned with partitioning, or dividing the logic into more manageable portions, and the detailed phase makes the exact cell assignment. The global and detailed phases are repeated as many times as necessary to fit each logic partition into the program's available storage area. Partitioning looks at the circuit's interconnection and decides where a boundary can be drawn with relatively few nets crossing it. Detailed placement then iterates through many physical arrangements based on different random number seeds looking for the smallest total net length.
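The "iterate over many random arrangements and keep the smallest total net length" strategy just described can be sketched as a simple multi-start loop. The cost function below is a toy stand-in for a real estimated-wire-length evaluation.

```python
import random

def total_net_length(order):
    # Toy stand-in for an estimated-wire-length cost: distance between
    # consecutively connected cells in a one-dimensional arrangement.
    return sum(abs(order[i] - order[i + 1]) for i in range(len(order) - 1))

def multi_start_placement(num_cells, seeds):
    best_order, best_cost = None, float("inf")
    for seed in seeds:
        rng = random.Random(seed)
        order = list(range(num_cells))
        rng.shuffle(order)                    # one candidate arrangement
        cost = total_net_length(order)
        if cost < best_cost:                  # keep the smallest net length
            best_order, best_cost = order, cost
    return best_order, best_cost

print(multi_start_placement(8, seeds=range(20)))
```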
PARTITIONING

A chip may contain several million transistors. Layout of the entire circuit cannot be handled because of the limitation of memory space and computational power available. Even though fabrication technologies have made great improvements in packing more logic in a smaller area, the complexity of circuits has also been increasing correspondingly. This necessitates breaking a circuit and distributing it across several chips. Thus, the first step in the physical design phase is partitioning. Partitioning is a complex, discrete, and highly intractable problem which is nondeterministic polynomial (NP)-complete. The nature of the partitioning problem and the size of the circuit make it difficult to perform an exhaustive search for an optimal solution. We review the commonly used methods for partitioning. Classic iterative approaches, such as Kernighan–Lin and Fiduccia–Mattheyses, begin with some initial solution and try to improve it by making small changes, such as swapping modules between clusters. Iterative improvement has become the industry standard for partitioning because of its simplicity and flexibility. Recently, there have been many significant improvements to the basic FM algorithm. These are broadly classified into four categories: direct method, group migration method, metric allocation method, and simulated annealing. All these methods, except simulated annealing, suffer from the problem of getting stuck in locally optimal solutions. Simulated annealing (SA) is a probabilistic hill-climbing search technique with the advantage of not being stuck in a local optimum. For this reason, it is widely used. The foundation of SA is the analogy between the cost function of the problem and the energy of a physical system. The stability of a physical system depends on the energy possessed by it, and lower energy (or a lower cost function value) indicates a more stable state. Each state of the system in the SA method corresponds to a solution of the problem. The pro-
cess begins at a high temperature T, at which a large number of moves that increase the energy are accepted. The temperature is gradually reduced according to a cooling schedule until it reaches zero or a very low value. As the temperature decreases, the number of accepted moves that increase the cost functional value (or the energy) also decreases. At low temperatures, the method resembles a greedy algorithm because only moves that reduce the cost functional value are accepted. SA does not get trapped in a local optimum because moves that increase the cost functional value are also accepted which lead to a more exhaustive search of the solution space. Register transfer level (RTL) designs are evaluated with a spreadsheet-like approach and violations for area, power, and pin count constraints are checked. The hierarchical clustering technique is used in partitioning behavioral hardware descriptions in which a similarity measure is computed for all pairs that communicate with one another. Clusters of functions are formed with these similarity measures. However, no designed constraints are considered while forming these clusters. In addition to these four broad categories of partitioning algorithms, several variations and improvements to partitioning strategies continue to be researched. Network flow techniques have been applied to graph bipartitioning and also to multiway partitioning with area and pin constraints. The maximum-flow, minimum-cut algorithm transforms the minimum-cut problem into a maximum-flow problem (22). An approach that also serves as a strategy to generate initial partitions is based on eigenvector decomposition. This approach requires transformation of every multiterminal net into two terminal nets which could result in a loss of information needed for a performance-based partitioning. Although this algorithm finds the optimal solution between any pair of nodes in a network, there is no constraint on size of resultant partitions. Another related partitioning technique is that based on spectral partitioning, which relies on finding the eigenvalues and eigenvectors of a Laplacian matrix. The eigenvectors yield a k-dimensional Euclidean space embedding of a graph which minimizes the weighted sum of squared distance of the vertices. The k smallest eigenvectors provide an embedding of the n vertices of the graphs in a kdimensional subspace. Then efficient clustering heuristics are used to coerce the points in the embedding into k partitions. Hagen et al. (24) propose the Rent parameter as a quality measure for the underlying partitioning algorithm. The Rent parameter characterizes a power-law relationship between the number of external terminals of a partition in the layout and the number of nodes in the layout. Their results suggest that top-down layout techniques based on spectral ratio-cut partitioning achieve denser layouts than the current methods based on Fiduccia–Mattheyses partitioning. The power of spectral methods lies in their ability to take global information into account during partitioning. Another effective method is using clustering within a two-phase methodology. A clustering on the net list induces a smaller net list. For example, if the average cluster size is 5, a 10000 module net list is clustered to a 2000 module net list. Then iterative improvement is run on the smaller net list to derive an initial solution for the larger solution. Then iterative improvement is run a second time on the larger net list. 
Bui et al., Hagen and Kahng, and Alpert and Kahng (24) have all shown the effectiveness of the two-phase methodology. This two-phase
approach can be repeated many times to give a multilevel algorithm. Multilevel approaches are very popular in the sparse matrix computational community, with works such as those of Hendrickson and Leland and of Kumar and Karypis (23). To apply multilevel methods to circuit net lists as opposed to graphs, an extension of the work of Kumar and Karypis is presented in Ref. 24. There are three steps in the algorithm: first, the coarsening algorithm; then, the initial partitioning algorithm; and, finally, the refinement algorithm.
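To make the net-cut objective and the simulated-annealing acceptance rule of this section concrete, the following sketch performs an SA-based bipartitioning of a tiny net list, accepting cost-increasing moves with probability exp(−ΔC/T). It is a minimal illustration under arbitrary assumptions (no balance constraint, toy cooling schedule), not the Fiduccia–Mattheyses or multilevel algorithms themselves.

```python
import math
import random

def cut_size(nets, part):
    # A net is cut if its cells do not all lie on the same side.
    # `part` maps cell name -> 0 or 1.
    return sum(1 for net in nets if len({part[c] for c in net}) > 1)

def sa_bipartition(cells, nets, t0=5.0, cooling=0.95, steps=200, seed=1):
    # Simulated annealing: move one random cell per step and accept worse
    # solutions with probability exp(-delta/T). Real partitioners would also
    # enforce a size-balance constraint on the two sides.
    rng = random.Random(seed)
    part = {c: rng.randint(0, 1) for c in cells}
    cost, temp = cut_size(nets, part), t0
    while temp > 0.01:
        for _ in range(steps):
            cell = rng.choice(cells)
            part[cell] ^= 1                      # tentative move
            new_cost = cut_size(nets, part)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost                  # accept the move
            else:
                part[cell] ^= 1                  # reject and undo
        temp *= cooling                          # cooling schedule
    return part, cost

cells = ["a", "b", "c", "d", "e", "f"]
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("a", "f")]
print(sa_bipartition(cells, nets))
```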
FLOOR PLANNING

In the floor-planning phase, the macro cells have to be positioned on the layout surface so that no blocks overlap and there is enough space left to complete the interconnections. The input for the floor planning is a set of modules, a list of terminals (pins for interconnections) for each module, and a net list, which describes the terminals to be connected. At this stage, good estimates for the area of the single macro cells are available, but their exact dimensions still vary in a wide range. Consider, for example, a register file module consisting of 64 registers: it can be laid out with different numbers of register rows, giving several feasible height/width combinations. These alternatives are described by shape functions. A shape function is a list of feasible height/width combinations for the layout of a single macro cell. The result of the floor-planning phase is the sized floor plan, which describes the position of the cells in the layout and the chosen implementations for the flexible cells. In this stage, the relative positions of the modules to be laid out are determined. Timing, power, and area estimates are the factors guiding the relative placement. Floor planning is used to verify the feasibility of integrating a design onto a chip without performing the detailed layout and design of all of the blocks and functions. If the control logic is implemented with standard cells, then the number of rows used for the modules is not necessarily fixed. Many rows produce a long, skinny block. Few rows produce a short, fat block. As other examples, folding and partitioning of a PLA is used to modify the aspect ratio of the module, or the number of bits used for row and column decoding in a RAM or ROM module also modifies their aspect ratio. Since floor planning is done very early in the design process, only estimates of the area requirements are given for each module. Recently, the introduction of simulated annealing algorithms has made it possible to develop algorithms where the optimization is carried out with all the degrees of freedom mentioned previously. A system developed at the IBM T. J. Watson Research Center and the TimberWolf package developed at Berkeley use the simulated annealing algorithm to produce a floor plan that gives the relative positions of the modules and also aspect ratios and pin positions. Automatic floor planning is more and more important as automatic module generators become available which accept pin positions and aspect ratios of the blocks as constraints or parts of the cost functions. Typically, floor planning consists of the following two steps: first, the topology, that is, the relative positions of the modules, is determined. At this point, the chip is viewed as a rectangle and the modules are the (basic) rectangles whose relative positions are fixed. Next, we consider the area optimization problem, that is, we determine a set of implementations (one for each module) so that the total area of the chip is minimized. The topology of a floor plan is ob-
tained by recursively using circuit partitioning techniques. A partition divides a given circuit into k parts so that (1) the sizes of the k parts are as close as possible and (2) the number of nets connecting the k parts is minimized. If k = 2, a recursive bipartition generates a slicing floor plan. A floor plan is slicing if it is a basic rectangle or there is a line segment (called a slice) that partitions the enclosing rectangle into two slicing floor plans. A slicing floor plan is represented by a slicing tree. Each leaf node of the slicing tree corresponds to a basic rectangle, and each nonleaf node corresponds to a slice. There are many different approaches to the floor-planning problem. Wimer et al. (25) describe a branch and bound approach for the floor-plan sizing problem, that is, finding an optimal combination of all possible layout alternatives for all modules after placement. Although their algorithm finds the best solution for this problem, it is very time-consuming, especially for real problem instances. Cohoon et al. (26) implemented a genetic algorithm for the entire floor-planning problem. Their algorithm uses estimates for the required routing space to ensure completing the interconnections. Another more often used heuristic solution method for placement is simulated annealing (27,28). When the area of the floor plan is considered, the problem of choosing, for each module, the implementation that optimizes a given evaluation function is called the floor plan area optimization problem (29). A floor plan consists of an enveloping rectangle partitioned into nonoverlapping basic rectangles (or modules). For every basic rectangle, a set of implementations is given, which have a rectangular shape characterized by a width w and a height h. The relative positions of the basic rectangles are specified by the floor-plan tree. The leaves are the basic rectangles, the root is the enveloping rectangle, and the internal nodes are the composite rectangles. Each of the composite rectangles is divided into k parts in a hierarchical floor plan of order k. If k = 2 (slicing floor plan), a vertical or horizontal line is used to partition the rectangle. If k = 5, a right or left wheel is obtained. The general case of composite blocks which cannot be partitioned into two or five rectangles is dealt with by allowing them to be composed of L-shaped blocks. Once the implementation for each block has been chosen, the sizes of the composite rectangles are determined by traversing up the floor-plan tree. When the root is reached, the area of the enveloping rectangle is computed. The goal of the floor plan area optimization problem is to find the implementation for each basic rectangle, so that the minimum area enveloping rectangle is obtained. The problem has been proved to be NP-complete in the general case, although it is reduced to a problem solvable in polynomial time in the case of slicing floor plans. Several solutions have been proposed for this problem:
• a branch and bound algorithm able to produce the exact solution when the size of the problem is not too large;
• a complete algorithm which is based on a bottom-up traversal of the floor-plan tree and produces the optimum solution with medium-size problems;
• an approximation algorithm which deals with the greatest problem sizes with acceptable CPU time and memory requirements;
• an alternative algorithm which is based on the replacement of superblocks by equivalent basic blocks;
• a recent algorithm particularly effective for floor plans which are approximately slicing, and which reduces to a polynomial algorithm for slicing floor plans; and
• a genetic algorithm, proposed recently to deal with general floor plans (29).
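The bottom-up sizing step described above can be illustrated with a small sketch that combines the shape functions of two blocks joined by a slice and keeps only the non-dominated (width, height) options. The data values, including the register-file implementations, are invented purely for illustration.

```python
# Combining shape functions in a slicing floor plan. A shape function is a
# list of feasible (width, height) implementations for a module or for a
# composite rectangle built up the slicing tree.

def combine(shapes_a, shapes_b, slice_dir):
    """Enclosing (width, height) options for two blocks joined by a slice."""
    options = []
    for wa, ha in shapes_a:
        for wb, hb in shapes_b:
            if slice_dir == "vertical":      # blocks placed side by side
                options.append((wa + wb, max(ha, hb)))
            else:                            # "horizontal": blocks stacked
                options.append((max(wa, wb), ha + hb))
    # Keep only non-dominated options: drop any option for which another
    # option is no larger in both dimensions.
    pruned = []
    for w, h in sorted(set(options)):
        if not any(w2 <= w and h2 <= h and (w2, h2) != (w, h) for w2, h2 in options):
            pruned.append((w, h))
    return pruned

# Example: a 64-register file offered as 2x32, 4x16, or 8x8 register arrays.
reg_file = [(2, 32), (4, 16), (8, 8)]
alu = [(6, 10), (10, 6)]
print(combine(reg_file, alu, "vertical"))
```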
PLACEMENT

The placement problem is defined as follows (Fig. 9). Given an electrical circuit consisting of modules with predefined input and output terminals and interconnected in a predefined way, construct a layout indicating the positions of the modules so that the estimated wire length and layout area are minimized. The inputs to the problem are the module description, consisting of the shapes, sizes, and terminal locations, and the net list, describing the interconnections between the terminals of the modules. The output is a list of x and y coordinates for all modules. The main objectives of a placement algorithm are minimizing the total chip area and the total estimated wire length for all of the nets. We need to optimize chip area to fit more functionality into a given chip area. We need to minimize wire length to reduce the capacitive delays of longer nets and speed up the chip operation. These goals are closely related to each other for standard cell and gate-array designs, because the total chip area is approximately equal to the area of the modules plus the area occupied by the interconnects. Hence, minimizing the wire length is approximately equivalent to minimizing the chip area. In the macro design style, the irregularly sized macros do not always fit together, and some space is wasted. This plays a major role
in determining the total chip area, and we have a tradeoff between minimizing area and minimizing the wire length. In some cases, secondary performance measures, such as the preferential minimization of the wire length of a few critical nets, are also needed, at the cost of increased total wire length. Module placement is an NP-complete problem and, therefore, cannot be solved exactly in polynomial time. Trying to get an exact solution by evaluating every possible placement to determine the best one would take time proportional to the factorial of the number of modules. This method therefore, is impossible to use for circuits with any reasonable number of modules. A heuristic algorithm must be used to search through a large number of candidate placement configurations efficiently. The quality of the placement obtained depends on the heuristic used. At best, we can hope to find a good placement with wire length quite close to the minimum, with no guarantee of achieving the absolute minimum. Classification of Placement Algorithm Placement algorithms are divided into two major classes: constructive placement and iterative improvement. In constructive placement, a method is used to build up a placement from scratch; in iterative improvement, algorithms start with an initial placement and repeatedly modify it in search of a cost reduction. If a modification results in a reduction in cost, the modification is accepted; otherwise it is rejected. In typical constructive placement, a seed module is selected and placed in the chip layout area. Then other modules are selected, one at a time, in order of their connectivity to the modules placed (most densely connected first) and are placed at a vacant location close to the placed modules, so that the wire length is minimized. Such algorithms are generally very fast, but typi-
[Figure 9. An example of placement: 16 cells placed in a 400 × 750 region with VDD and GND rails and pads A–E. Placement (cell, x, y): (1,0,600) (2,0,400) (3,100,400) (4,100,600) (5,0,200) (6,0,0) (7,75,200) (8,100,0) (9,200,0) (10,150,200) (11,300,600) (12,200,600) (13,300,400) (14,200,400) (15,300,0) (16,250,200).]
cally result in poor layouts. These algorithms are now used for generating an initial placement for iterative improvement algorithms. The main reason for their use is their speed. They take a negligible amount of computation time compared to iterative improvement algorithms and provide a good starting point for them. More recent constructive placement algorithms, such as numerical optimization techniques (30,31), placement by partitioning (32), integer programming formulation (33), and simulated annealing (34) yield better layouts but require significantly more CPU time. Iterative improvement algorithms typically produce good placements but require enormous amounts of computation time. The simplest iterative improvement strategy interchanges randomly selected pairs of modules and accepts the interchange if it results in a reduction in cost (35). The algorithm is terminated when there is no further improvement during a given large number of trials. An improvement over this algorithm is repeated iterative improvement in which the iterative improvement process is repeated several times with different initial configurations in the hope of obtaining a good configuration in one of the trials. Currently popular iterative improvement algorithms include simulated annealing, the genetic algorithm, and some force-directed placement techniques, which are discussed in detail in the following sections. Other possible classifications for placement algorithms are deterministic algorithms and probabilistic algorithms. Algorithms that function on the basis of fixed connectivity rules or formulas or determine the placement by solving simultaneous equations are deterministic and always produce the same result for a particular placement problem. Probabilistic algorithms, on the other hand, work by randomly examining configurations and produce a different result each time they are run. Constructive algorithms are usually deterministic, whereas iterative improvement algorithms are usually probabilistic. Simulated annealing (34) is probably the most well-developed method available for module placement today. It is very timeconsuming but yields excellent results. It is an excellent heuristic for solving any combinatorial optimization problem, such as the graph coloring (36), partitioning, routing (37,39), floor planning (28), or placement (39,40). The basic procedure in simulated annealing is to accept all moves that result in
reduced cost. Moves that result in a cost increase are accepted with a probability that decreases with the increase in cost. A parameter T, called the temperature, controls the acceptance probability of the cost-increasing moves. Higher values of T cause more such moves to be accepted. In most implementations of this algorithm, the acceptance probability is given by exp(−ΔC/T), where ΔC is the cost increase. In the beginning, the temperature is set to a very high value so that most of the moves are accepted. Then the temperature is gradually decreased so that the moves increasing cost have less chance of being accepted. Ultimately, the temperature is reduced to a very low value so that only moves reducing cost are accepted, and the algorithm converges to a low-cost configuration.

Wire Length Estimates

The placement process is followed by routing, that is, determining the physical layout of the interconnects through the available space. Finding an optimal routing given a placement is also an NP-complete problem. Many algorithms work by iteratively improving the placement and, at each step, estimating the wire length of an intermediate configuration. It is not feasible to route each intermediate configuration to determine how good it is. Instead we estimate the wire length. To make a good estimate of the wire length, we should consider the way in which routing is actually done by routing tools.

Almost all automatic routing tools use Manhattan geometry, that is, only horizontal and vertical lines are used to connect any two points. Further, two layers are used: only horizontal lines are allowed in one layer and only vertical lines in the other. The shortest route for connecting a set of pins together is a Steiner tree [Fig. 10(a)]. In this method, a wire can branch at any point along its length. This method is usually not used by routers because of the complexity of computing both the optimum branching point and the resulting optimum route from the branching point to the pins. Instead, minimum spanning tree connections and chain connections are the most commonly used connection techniques. For algorithms that compute the Steiner tree, see (41–45). Source-to-sink connections [Fig. 10(b)], called the Steiner distance-preserving tree (46–49), where the output of a module is connected to all of the inputs by the shortest path, are the simplest to implement.
Figure 10. Some wiring topologies: (a) minimum rectilinear Steiner tree (minimum-cost tree); (b) minimum rectilinear distance-preserving tree (minimum-cost, minimum-delay tree); (c) chain connection; (d) minimum-cost, minimum-delay, zero-skew tree.
They result, however, in excessive interconnect length and significant wiring congestion. Hence, this type of connection is seldom used for estimation. Chain connections [Fig. 10(c)] do not allow any branching at all. Each pin is simply connected to the next one in the form of a chain. These connections are even simpler to implement than the spanning tree connection, but they result in slightly longer interconnects.

An efficient and commonly used method for estimating wire length is the semiperimeter method. The wire length is approximated by half the perimeter of the smallest bounding rectangle enclosing all of the pins. For Manhattan wiring, this method gives the exact wire length for all two-terminal and three-terminal nets, provided that the routing does not overshoot the bounding rectangle. For nets with more pins and more zigzag connections, the semiperimeter wire length is generally less than the actual wire length. Moreover, this method provides the best estimate for the most efficient wiring scheme, the Steiner tree. Some of the algorithms use the Euclidean wire length or the squared Euclidean wire length. The squared wire length is used to save the time required for floating-point computations as compared with integer processing. Optimization of the squared wire length ensures that the Euclidean wire length is optimized. For a comprehensive overview of placement and routing algorithms see (50).

Over the years, a wide variety of placement algorithms have been developed. Shahookar and Mazumder (51) provide a survey of various placement techniques. Several genetic placement algorithms have been presented (26,40,52,53). The classical work on layout generation with a genetic algorithm (GA) was done by Cohoon et al. (40). They encode a placement by a Polish notation of a binary slicing tree, thus having a chromosome represented by a string. They use different recombinatory operators, which work either directly on this string or take the tree structure into consideration by decoding the chromosome. Chan et al. (54) represent a placement in a bit matrix. The layout area is divided into discrete regions, and each row in the matrix describes the orientation and the position of a single cell. Recombination is done by mixing two matrices. During the optimization, incorrect placements with overlapping cells are allowed and are handled by adding a penalty term when computing the fitness. A GA for macro cell placement is described where the genotype representation is a binary tree (40). In contrast to the approach of Cohoon et al., this tree does not directly characterize a placement, but a placement is generated by decoding the tree in a traversal in a given order. The operators work directly on the tree structure.
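As a concrete illustration of the ideas above, the following sketch combines the semiperimeter (half-perimeter) wire length estimate with the simulated annealing acceptance rule exp(−ΔC/T). It is a minimal sketch rather than a production placer; the cell names, netlist, and cooling-schedule parameters are illustrative assumptions.

```python
import math, random

def hpwl(net_pins, pos):
    """Semiperimeter (half-perimeter) wire length of one net."""
    xs = [pos[c][0] for c in net_pins]
    ys = [pos[c][1] for c in net_pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wire_length(nets, pos):
    return sum(hpwl(n, pos) for n in nets)

def anneal_placement(nets, pos, T=100.0, alpha=0.95, moves_per_T=200, T_min=0.1):
    """Pairwise-interchange placement refined by simulated annealing."""
    cells = list(pos)
    cost = total_wire_length(nets, pos)
    while T > T_min:
        for _ in range(moves_per_T):
            a, b = random.sample(cells, 2)
            pos[a], pos[b] = pos[b], pos[a]          # trial interchange
            new_cost = total_wire_length(nets, pos)
            dC = new_cost - cost
            if dC <= 0 or random.random() < math.exp(-dC / T):
                cost = new_cost                       # accept the move
            else:
                pos[a], pos[b] = pos[b], pos[a]       # reject: undo the swap
        T *= alpha                                    # cool down
    return pos, cost

# Toy example: four cells on a 2 x 2 grid and two nets.
positions = {"c1": (0, 0), "c2": (0, 1), "c3": (1, 0), "c4": (1, 1)}
netlist = [("c1", "c4"), ("c2", "c3", "c4")]
print(anneal_placement(netlist, positions))
```

In practice the cost change ΔC is computed incrementally over only the nets incident to the two swapped cells rather than recomputed over the whole netlist, as is done in this sketch for brevity.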
ROUTING

The routing stage, where the interconnections are laid out on the chip, is broken into two stages because of complexity: global and detailed routing. In global routing, sometimes called channel assignment, the ‘loose’ routes for the nets (i.e., which channels the interconnections go through) are determined. To compute global routing, the routing space is represented as a graph. The edges of this graph represent the routing regions and are weighted with the corresponding capacities.
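To make the graph formulation concrete, the sketch below models routing regions as weighted graph edges with capacities and checks that a candidate global routing (a list of region edges per net) does not exceed any capacity. The data structures and names are illustrative assumptions rather than the representation of any particular tool.

```python
from collections import defaultdict

# Routing-region graph: each edge (region boundary) has a capacity,
# i.e., the number of wires that may cross it.
capacity = {("R1", "R2"): 3, ("R2", "R3"): 2, ("R1", "R3"): 1}

def region_usage(global_routes):
    """Count how many nets use each routing-region edge."""
    use = defaultdict(int)
    for net, edges in global_routes.items():
        for e in edges:
            use[tuple(sorted(e))] += 1
    return use

def feasible(global_routes):
    """A global routing is valid if no region capacity is exceeded."""
    use = region_usage(global_routes)
    return all(n <= capacity.get(e, 0) for e, n in use.items())

routes = {"netA": [("R1", "R2"), ("R2", "R3")], "netB": [("R1", "R2")]}
print(dict(region_usage(routes)), feasible(routes))
```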
Global routing is described by a list of routing regions for each net of the circuit, such that none of the capacities of any routing region is exceeded. After global routing is done, the number of nets routed through each routing region is known. In the detailed routing phase, the exact physical routes for the wires inside routing regions have to be determined. This is done stepwise, that is, one channel at a time is routed in a predefined order.

Channel Router

If the region to be routed contains pins only on two sides, then effective detailed routing tools called channel routers are used. If the routing region has pins on four sides, then a switch-box router is used. The shortest path problem is stated as follows: Given a graph G in which each arc connecting vi and vj has an associated length dij, find the shortest path between two vertices in the graph. Here, the length of the path is defined as the sum of the lengths of the arcs in the path. This problem was solved by Dijkstra (55). A straightforward implementation of Dijkstra's algorithm needs O(n2) time. Moore and Lee (56) suggest that breadth-first search be used to find the shortest path. This is known as the maze router (a small breadth-first sketch is given at the end of this section). This algorithm is guaranteed to find a path with minimum wire length. Another type of algorithm, called the line-expansion algorithm, was developed to minimize the number of vias in the path, reduce the memory storage, and increase speed (57). The Lee–Moore grid-expansion algorithm (20) and many of its variations (19) have been widely used.

The channel routing problem is stated as follows: Given two parallel horizontal lines at distance m + 1 units apart (i.e., there are m tracks between the two horizontal lines), with terminals numbered 0, 1, 2, . . ., n written on the two horizontal lines, connect all nets using two layers and minimize the required number of tracks m. Channel routing is widely used in automatic layout design because of its high packing density. Given the wiring list of a channel routing problem, we define the maximum density as the maximum number of intersections of a vertical column with the horizontal net segments over all columns, assuming that each net occupies one track; the routing of a channel with a number of tracks equal to or approaching the maximum density is therefore efficient. Thus the maximum density is used to estimate the required channel width, which is crucial in chip planning. The problem is NP-complete (58,59). For over two decades, many researchers have tried to solve various channel routing problems. Several channel routers were developed to generate near-optimal solutions: an optimal solution for the channel assignment problem (60), a dogleg channel router (61), a greedy channel router (62), a hierarchical channel router (63), three- or multilayer channel routing (64–67), 45° channel routing (68), gridless channel routing (69,70), a parallel algorithm (71), segmented channel routing for FPGAs (72–75), constrained channel routing for analog and mixed-signal circuits (76,77), and crosstalk-minimum channel routers (78,79). See (80,81) for a survey of other previous channel routing problems and their algorithms.

Global Routing

The problem of global routing is very much like a traffic problem. The pins are the origins and destinations of traffic. The wires connecting the pins are the traffic, and the channels are the streets. If there are more wires than the number of tracks
in a given channel, some of the wires have to be rerouted, just like rerouting traffic. In the real traffic problem, all drivers want to reach their destinations in the quickest way and may try different routes every day. Finally, each driver selects the best route possible and the traffic pattern is stabilized. Intuitively, we can do the same for the routing problem. In global routing the usual approach was to route one net at a time sequentially until all nets are connected. To connect one net, we could use the maze router, for example. If some nets cannot be routed at the end, these nets must be routed manually. Obviously, the success of this kind of routing depends on the order in which we route the nets, and there is no systematic way of rerouting the nets. Ting and Tien (82a) used the following approach for gate-array designs:

• First, route every net as if it were the first net to be routed, that is, pay no attention to boundary overflows.
• After all nets have been routed, identify the boundaries that overflowed and the amount of overflow.
• Identify the nets that use these overflow boundaries and form a bipartite graph with one part of the vertices representing the nets and the other part of the vertices representing the overflowed boundaries. An edge connects a net to a boundary if the net uses the boundary. The bipartite graph shows the supply-demand situation among boundaries and nets, and a subset of nets is selected for rerouting. The criterion of selection is ‘‘greedy.’’

An interesting probabilistic approach called simulated annealing is due to Vecchi and Kirkpatrick (37). Basically, they first route the nets randomly. Then they change the routing pattern if the new pattern decreases the objective function

F = Σv mv²
where mv is the number of wires in the vth arc. Because this objective function is convex, a lower value of the objective function indicates that the wires are more uniformly distributed over the whole area. The novel aspect of the approach is that the pattern may be changed even if the objective function increases: such a change is accepted with a small probability, say 1/100, whereas a change that decreases the objective function is accepted with a high probability, say 99/100. As previously mentioned, many ingenious heuristic algorithms have been proposed. However, the fundamental question is: is there an algorithm whose optimality can be proved mathematically? (Here we discount algorithms, such as backtrack or branch-and-bound, which are classified as implicit enumerations.) The amazing answer is ‘‘yes,’’ using a linear programming approach. The algorithm was invented in 1949 by G. B. Dantzig. A serious reader should consult books on linear programming.

Recently, as VLSI chips have become more tightly packed, multilayer routing with more than two layers has become necessary. Several approaches have been proposed for global wire routing. One simple approach is to incorporate net weights into the global routing process by assigning high weights to critical nets. Another method is to assign net priorities: critical nets are routed before less critical nets, so that their paths are much more direct than the others. A rerouting technique
is used to solve the timing problem by first routing all nets to minimize some global objective (such as total wire length or overall timing) and then rerouting the critical nets, net-by-net, based on net priorities. A performance-driven global routing algorithm for custom chip design is presented in Ref. 82b. Interconnection delays are modeled, directly included, and incrementally updated during the routing process. The objective of routing is to maximize the minimum remaining delay slack. By having the maximum allowable delay from the signal source to each sink terminal of the signal (which is obtained from the timing analysis) as a set of constraints during the routing of a signal net, the router (which is based on the A*-search technique) uses the remaining net delay slack as the primary parameter for guiding the connective path search. It is shown that when interconnection resistance is comparable to the output driver resistance, minimizing the total net length is not always equivalent to minimizing the delay for a multiterminal net. Sometimes reducing delays at some critical sink terminals is achieved by increasing delays at some less critical terminals, because the delay at each sink terminal of a signal depends on how the interconnection tree is constructed. There are various approaches to the routing problem in two-dimensional arrays based on hierarchical wiring (82,83), sequential methods (84–87), simulated annealing (34,37), linear programming (88,89), multicommodity flow (90,91), and flat approaches (92,93). Global routing is NP-complete even in the case of one-bend routing of two-terminal nets (92).

Based on a binary-cut tree, a custom chip routing algorithm using linear assignment together with hierarchical net decomposition is proposed in Ref. 94. The algorithm is based on a top-down hierarchical scheme. At each level of hierarchy, the current routing problem is partitioned into two subproblems by assigning pins to channels on a cut line (or bisector). To find an optimal pin assignment for a cost function, a linear assignment minimizes the total summation of the costs subject to the capacity constraints of the channels (or holes). Then, to determine the path of nets inside each subregion, interface (or pseudo) pins are created on the cut lines and nets are broken into several parts that are processed independently. The process of cutting and linear assignment continues until all the nets are connected or all the boundaries are included in the cuts. Note that this approach produces many bends.

With signals in the gigahertz range, the electrical characteristics of the packages require treating the signal lines as transmission lines. On most conventional interconnecting substrates, transmission line delay is linearly proportional to the distance traveled, but with multilayer thin films, delay becomes proportional to the distance squared, because of the high series resistance of thin conductive lines and the high capacitance of these lines to ground caused by thin dielectrics. In practice, transmission lines are not perfectly uniform; that is, at the package level, significant reflections are generated from capacitive and inductive discontinuities along the transmission lines. Moreover, in a multilayer ceramic substrate of an MCM, wires at different levels do not have exactly the same impedance. Such mismatches of line impedance cause reflections from the junction points, such as vias and bends. In general, xy plane-pair routing techniques used for multilayer routing yield many vias.
If we minimize the number of segments per net (i.e., the number of bends), the number of vias is reduced proportionally. Therefore, propagative
delays associated with discontinuities (e.g., see Refs. 5, 95–97) are minimized by careful design. We call the process of assigning pseudoterminals to cutlines in each level of hierarchy terminal propagation. The terminal propagation depth (denoted δi) for a net i denotes the number of levels of the top-down hierarchy through which the net's pseudoterminals are allowed to propagate. In this way, the depth of terminal propagation controls the number of bends.

Moat Router

The final stage in detailed routing typically is routing the connections between the I/O pads and the core circuits. The area between the core and the pads is called the moat. Moat routing consists of nets whose pins lie on either or both the inside perimeter of the pad frame and the outside perimeter of the core circuit area. The moat between the pads and the core is divided into a number of concentric tracks, similar to channel routing except that each track forms a circle rather than a line segment. The moat routing problem is to minimize the number of tracks required to connect all nets in the moat. The pad assignment that avoids vertical constraints and minimizes the total wire length between pads and terminals on the core is effectively solved by the linear assignment algorithm. A similar problem is the case when all the points lie on a core frame periphery. The problem is to find the smallest enclosing rectangle with the least area among all feasible layouts. Gonzalez and Lee (98) presented a linear-time optimal algorithm for the case when all nets have exactly two terminals. They (98a) also presented a polynomial-time approximation algorithm for multiterminal nets that generates a layout with an area at most 1.6 times the area of an optimal layout. Also note that a similar problem, minimum-congestion routing around a rectangle, is polynomial-time solvable for two-pin nets (99), and the problem of routing two-terminal nets around two equal-width rectangles to minimize the total area was presented by Gonzalez and Lee (100) with an O(n log n) time two-approximation.

The density of a channel routing problem is the maximum number of intervals that intersect any vertical line, that is, the size of the maximum clique in the corresponding interval graph. Clearly the density in a channel is a lower bound on the number of tracks required to route through the given channel. In fact, interval graphs are perfect graphs, meaning that the maximum clique size is equal to the minimum number of colors required to color the graph. Thus, the optimal number of tracks is precisely equal to the density. The left-edge algorithm (60) computes an optimal channel routing solution. A circular-arc graph is similar to an interval graph except that the vertices correspond to arcs on a circle rather than intervals on a line. Unlike interval graphs, the chromatic number of a circular-arc graph is not necessarily equal to its density. Thus, coloring the circular-arc graph is NP-complete. It is proved by Tucker (101) that χ(G) ≤ 2ω(G). The theorem results in a two-approximation algorithm for coloring a circular-arc graph G: find a maximum clique C in G in O(n log n) time (102), and color it using |C| colors; then color the interval graph G − C optimally using the left-edge algorithm, which requires at most |C| colors in O(n log n) time (60,103). Thus, a coloring using at most 2|C| colors is computed, and the optimal coloring requires at least |C| colors. The time complexity of the algorithm is O(n log n) time.
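The left-edge algorithm just mentioned is easy to state: sort the horizontal net intervals by their left endpoints and greedily pack each interval into the first track whose last interval ends before it begins. The sketch below is a minimal illustration under the simplifying assumption that vertical constraints are ignored; the net names and intervals are made up for the example.

```python
def left_edge(intervals):
    """Assign horizontal intervals (net, left, right) to tracks.

    Greedy left-edge channel routing: intervals sorted by left end,
    each placed in the lowest-numbered track that is free there.
    Vertical constraints are ignored in this sketch.
    """
    tracks = []          # tracks[t] = right end of the last interval on track t
    assignment = {}      # net -> track index
    for net, left, right in sorted(intervals, key=lambda iv: iv[1]):
        for t, last_right in enumerate(tracks):
            if left > last_right:          # fits after the previous interval
                tracks[t] = right
                assignment[net] = t
                break
        else:                              # no existing track fits: open a new one
            tracks.append(right)
            assignment[net] = len(tracks) - 1
    return assignment, len(tracks)

# Example channel: five nets given as (name, leftmost column, rightmost column).
nets = [("n1", 1, 4), ("n2", 2, 7), ("n3", 5, 9), ("n4", 8, 12), ("n5", 1, 10)]
print(left_edge(nets))   # here the number of tracks equals the channel density
```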
One may suspect that moat routing is analogous to coloring a circular-arc graph. Then one may easily prove that moat routing under this restricted model is NP-complete, but the complexity for the case containing only two-terminal nets is not known. Furthermore, moat routing has a special property: the circular moat creates the possibility of many different routing paths, independent of the assignment of nets to tracks. Specifically, a net with m pins can be routed in m different ways, each corresponding to a complete track with the span between two adjacent pins removed. Because nets in a moat can be routed in multiple ways, the notion of density applied to channel routing is not directly generalizable to moat routing. Thus, given restricted moat routing, we should resort to an approximation algorithm based on finding the associated circular-arc graphs. An approximation algorithm called the iterated maximum independent set heuristic is proposed by Ganley and Cohoon (104). The idea is to construct a circular-arc graph G′ (we call it an extended circular-arc graph) considering all different routing paths as additional arcs and to iteratively peel off a maximum independent set (MIS) from G′ until no MIS remains. The time complexity of the heuristic is O[n(nm)²] time.

Over-The-Cell Router

The conventional channel is defined as a rectangular area with terminal rows along its top and bottom. The main role of the channel router is to connect the given terminals according to the net list with minimum area. In the two-layer standard cell design methodology, the M1 layer is typically used for connections internal to the cell, and the M2 layer is available for routing over the cell. Recently, the concept of over-the-cell (OTC) routing was introduced to minimize layout area. In OTC routing, the cell layout area as well as the channel area between two cell rows is used as a routing resource. Some heuristics have been developed that achieve a 20–35% reduction in channel height compared with non-OTC channel routers. A recent algorithm yields as much as a 65% reduction in the channel height with a triple-layer OTC router. Recently, Kim and Kang (105) presented a new channel routing algorithm that uses three metal layers in the OTC area. Metal layers can be dynamically allocated for horizontal and vertical connections in the OTC area even when a cyclic constraint exists. The algorithm relocates as many critical net segments as possible to the OTC area and thus reduces the real channel space.

High-Performance MCM Router

With the accelerating complexity of semiconductors shown in Fig. 1 (1), packaging must address the needs of consumer products, personal computers, workstations, midrange computers, mainframes, and supercomputers. Electronic packaging is an art based on the science of establishing interconnections and fulfilling system functions. Continuous advances in the speed and integrative scale of integrated circuits have created ever greater demands for higher density packaging to ensure reduced interconnective delays for improved electrical performance. Packaging is becoming a limiting factor in translating semiconductor speed into system performance. Moreover, increasing circuit count and density in circuits have been continuing to place further demands on packaging. To minimize
the delay, chips must be placed close together. Thus, multichip module (MCM) (106) technology has been introduced to improve system performance significantly by eliminating an entire level of interconnection. MCM is a packaging technique that places several semiconductor chips, interconnected in a high-density substrate, into a single package. This innovation led to major advances in interconnective density at the chip level of packaging. Compared with single-chip packages or surface-mounted packages, MCMs reduce circuit board area by 5 to 10 times and improve system performance by 20% or more. Therefore, MCMs are used in a large percentage of today's mainframes to replace individual packages. The size of MCMs varies widely (107): 10–150 ICs, 40–1000 I/Os per IC, and 1,000–10,000 nets, where the low end is ceramic/wire-bond and the high end is thin film/flip chip (the maximum linear dimension is now up to 4–6 inches for thin-film MCMs, up to around 8.5 inches for ceramic MCMs, and up to 18 inches for laminated MCMs).

Physical design in the MCM environment differs from classical VLSI design in a number of important ways. Performance is of overriding importance in MCM design, because that is the primary reason for choosing MCMs over conventional single-chip packages. Thus, MCM physical design algorithms must be driven by performance constraints. This involves careful consideration of minimizing transmission line effects, such as crosstalk, reflections, and the effects of crossings, bends, and vias, as opposed to the simple lumped-capacitor approximations commonly used in single-chip packages. Another important difference is that some types of MCMs, such as ceramic and laminated MCMs, require several tens of wiring layers. For example, the ceramic MCM designed by IBM uses over 60 layers of signal, power, and ground wiring (106). This imbues the MCM routing problem with a three-dimensional flavor, which is lacking in existing VLSI routing, where the number of layers rarely exceeds three. It can be argued that printed circuit board (PCB) wiring, which has been studied by many researchers over the past two decades, also handles comparable numbers of layers. However, MCM high-density substrates have many unique features, such as blind, buried, stacked, and segmented vias, which are not commonly considered in PCB routing algorithms. Also, the need for simultaneous consideration of performance issues, which are rarely brought up in PCB wiring algorithms, provides stimulus for new and innovative design approaches (3–7,9–11).

Several wiring layers are required to route the large number of interchip connections; thus, the problem is three-dimensional. A multilayer routing strategy for high-performance MCMs is to route all nets optimizing routing performance while satisfying various design constraints (e.g., minimizing coupling between vias and signal lines and minimizing discontinuities, such as vias and bends). There are several new and interesting MCM routers for this new packaging technique (3,6,7,9,10). Automatic layout of silicon-on-silicon hybrid packages developed at Xerox PARC is presented in Ref. 3. Placement determines the relative positions of the ICs automatically or interactively and organizes the routing areas into channels. The hybrid router uses a topological model that reduces the complexity of hybrid routing by abstracting the geometrical information; computation of geometry is deferred until needed. Global routing
attempts to find a minimum Steiner tree based on symmetric expansion from all pins.
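The expansion just mentioned, like the Lee–Moore maze router cited earlier in this section, is at heart a breadth-first wavefront search. The following is a minimal sketch on a two-dimensional grid with blocked cells; the grid representation and cell coordinates are illustrative assumptions, and a real router would add multiple layers, via costs, and net ordering.

```python
from collections import deque

def maze_route(grid, source, target):
    """Lee-Moore style maze routing: breadth-first wavefront expansion.

    grid[y][x] == 1 marks a blocked cell. Returns a shortest path of
    grid cells from source to target, or None if no path exists.
    """
    rows, cols = len(grid), len(grid[0])
    prev = {source: None}                      # remembers the expansion order
    frontier = deque([source])
    while frontier:
        x, y = frontier.popleft()
        if (x, y) == target:
            path, cell = [], target            # retrace phase: walk back to the source
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < cols and 0 <= ny < rows \
               and grid[ny][nx] == 0 and (nx, ny) not in prev:
                prev[(nx, ny)] = (x, y)
                frontier.append((nx, ny))
    return None

# A small routing grid: 0 = free, 1 = blocked (existing wires or obstacles).
grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(maze_route(grid, (0, 0), (0, 2)))   # a shortest path that detours around the blockage
```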
OTHER ISSUES IN PHYSICAL DESIGN FLOW

Compaction is simply the task of compressing the layout in all directions so that the total area is reduced. By making the chip smaller, wire lengths are reduced, which, in turn, reduces the signal delay between circuit components. At the same time, a smaller area means that more chips are produced on a wafer, which in turn reduces the manufacturing cost. The expense of computing time, however, mandates that extensive compaction be used only when large quantities of ICs are produced. Compaction must ensure that no rules regarding the design and fabrication process are violated during the process.

Design verification encompasses many different techniques to guarantee the specification of a VLSI logic chip. The types of analysis done on VLSI logic chips include functional simulation, delay calculation, delay simulation, and timing analysis. A chip designer may choose to do all of the above analyses or only a subset. All simulation routines require a model which describes the function of the logic chip. This model is created by interconnecting functions available in the design system library using a standardized descriptive language. Each circuit in the library has a logical model coded by the circuit designer and provided to the chip designer via the technology database described in the last section. The types of functions in a gate array design system library include primitive functions (i.e., NAND, NOR, AND, and OR) and complex functions (i.e., AO, OA, AOI, and OAI). The standard cell library includes all the functions in the gate array library with the addition of macros. A typical list of macros includes RAMs, PLAs, ALUs, and register macros. The logical models for simple circuit functions are coded with Boolean primitives. The logical models for macros are coded as flat (Boolean) models using the primitives for test generation, but behaviorals are coded for functional simulation. Behaviorals are programs which interact with the simulator during the actual simulation and reduce the overall CPU time when simulating an entire chip.

Once the logical model of the chip is coded, the designer must code patterns for exercising the logic. The first step is defining all the primary inputs and outputs of the chip and organizing them into groups which form the data flow. Each group of inputs or outputs is given a variable name, and that name is used to set the input patterns and monitor the results at the output. The input pattern modes available in a typical simulator include binary, hexadecimal, and decimal. A well-structured input language offers DO loops for repetitive operations and a WAVE facility for clock pulse generation.

After functional simulation is completed, the next step in design verification is delay calculation. Delay calculation is run before placement and wiring by estimating the amount of capacitance per fan-out. An average capacitance value for a single fan-out is hard-coded into the delay program. The delays are obtained by two different techniques. For the simple books in the library (NOR, NAND, etc.), the device parameters are stored in the technology rules. The parameters include device width and length and input capacitance. These parameters
are used as input to a device model which uses a numerical integration scheme to compute the book delays. This scheme proves too costly and inefficient for large macros, however. These functions have precalculated delays which are stored in tables of delay versus capacitance. The delay calculator either interpolates or extrapolates the delay from the points given in the delay tables. The macro function (i.e., RAMs, PLAs, etc.) delays are divided into two parts: the skin delay and the body delay. The skin delays are for the input and output blocks of the macro and depend on the input transition and output capacitance. These delays are precalculated and stored in the database. The body delay depends on a particular path in the macro and is embedded in the behavioral.

Delay calculation is typically run again after the logic is placed and wired on the chip. Actual wiring lengths for poly and metal wires are used to compute the wiring capacitance. Each wiring level has a precalculated capacitance value per unit length which is stored in the technology rules. Certain assumptions are made about the amount of wire crossings and the amount of parallel and fringe capacitance. In a design system, it is too costly to compute the exact capacitance for each net, because it depends on the positional relationship to other wires. The input gate capacitances are stored in the rules and are added to the wire capacitance to compute the total net capacitance. The capacitance values are passed to the delay calculator, and the delays are computed for each book and stored in a file. Delays computed by the delay calculator are typically three-sigma worst case.
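As a rough illustration of this table-driven delay calculation, the sketch below computes a net capacitance from a per-unit-length wiring capacitance plus input pin capacitances and then linearly interpolates (or extrapolates) a cell delay from a delay-versus-capacitance table. All numeric values and table entries are invented for the example and do not come from any real technology rules.

```python
# Hypothetical technology data: capacitance per unit length for each wiring
# level, and a delay-versus-load table for one library cell ("book").
CAP_PER_UNIT = {"poly": 0.20, "metal1": 0.08, "metal2": 0.06}        # pF per unit
DELAY_TABLE = [(0.1, 0.30), (0.5, 0.55), (1.0, 0.95), (2.0, 1.80)]   # (pF, ns)

def net_capacitance(segments, pin_caps):
    """Wire capacitance from (level, length) segments plus input pin caps."""
    wire = sum(CAP_PER_UNIT[level] * length for level, length in segments)
    return wire + sum(pin_caps)

def table_delay(load, table=DELAY_TABLE):
    """Linear interpolation (or extrapolation at the ends) of delay vs. load."""
    pts = sorted(table)
    for (c0, d0), (c1, d1) in zip(pts, pts[1:]):
        if load <= c1 or (c1, d1) == pts[-1]:
            return d0 + (d1 - d0) * (load - c0) / (c1 - c0)

load = net_capacitance([("metal1", 3.0), ("metal2", 5.0)], pin_caps=[0.05, 0.07])
print(load, table_delay(load))   # total load in pF and the interpolated delay in ns
```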
HIGH-PERFORMANCE LAYOUT DESIGN WITH SUBMICRON TECHNOLOGIES

IC designers moving to deep submicron technologies face big challenges. Initial design projects have experienced unexpectedly long design cycles, a larger than expected number of design iterations, problems getting chips to operate at target clock speeds, and surprises with die size late in the design cycle. The effects of deep submicron geometries, high clock speeds, and soaring gate counts all create new design problems that are not addressed by existing tools and methodologies. The limitations of available tools and methodologies are clear. Logic designers, who once needed to know little about the physical implementation of their devices to successfully complete their designs, now must have access to key physical design information early in the design process. Without this information, timing delays, routability, and power dissipation are not accurately predicted, and the logic designer has no way of knowing if basic design constraints, such as functionality, cost, power, and speed, are being met. The result is often big surprises late in the design cycle. Gate delay has decreased by about 30%, but interconnect delays have not been reduced as quickly. For 0.5 micron technologies, the solution lies in enabling early design improvement by providing logic designers with insight into physical implementation realities without forcing designers to do detailed placement and routing.

Formerly, because information about the actual physical implementation of a design was not necessary to make accurate predictions, logic and physical design were highly decoupled. Conventionally, only a single iteration between the logic and the
physical design phases was necessary to successfully complete a design. For deep submicron designs, this model does not hold true. Delay, routability, size, and power dissipation are no longer predicted accurately unless information about the physical implementation is available to the logic designer. Interconnect, for example, now plays such a dominant role in predicting delay and power dissipation that it cannot be loosely estimated. Timing, routability, size, and power problems are not discovered until after detailed placement and routing. Multiple, lengthy iterations ensue between the logic and physical designers to repair the problems. The end result is that iterations between logic and physical designers grow from 1–3 iterations for 0.8 micron designs to more than 10 iterations for 0.6 micron and smaller feature size designs.

Only by developing a floor plan of a design and precisely predicting the placement of cells and the routing of nets are interconnect effects accurately forecast during the logic design phase. Because a floor plan provides a high-level abstraction of the eventual physical design implementation, it is created and modified quickly. As a result, multiple iterations with synthesis and timing analysis tools are completed quickly, and timing, size, and power constraint violations are resolved. Because the logic designer has access to key physical design information, problems are solved during the logic design phase, where they are least expensive to fix. Because the final logic design takes into account the physical implementation of the design, a single placement and routing cycle is achieved and design cycle times are significantly reduced.

Increased interconnect resistance affects the load seen by a driving gate and the pin-to-pin interconnect delay. The increase in coupling and interlayer capacitance caused by interconnects running closer together and the heightened use of multilayer interconnect technologies must also be considered. Failure to consider these submicron effects causes significant predictive errors. After full placement and routing, most physical design groups accurately predict delays if they have taken into account the submicron effects discussed previously. But during logic design, tools and methodologies in common use today often predict delays with errors in excess of 100–200%, and the design fails.

Determining the routability of a design is critical to predicting both size and performance. Today, relatively simple calculations based on total cell area and number of nets are typically used to estimate design size early in the design process. The estimates do not consider routability. Given the mounting routability challenges facing deep submicron design projects, predictions based on these calculations are extremely inaccurate and cause major surprises late in the design process. Design routability has become a big problem for a number of reasons. One important factor is the continued explosion in design complexity. There are more cells and nets in a design to place and route optimally, and the likelihood of considerable routability problems is increased. A second factor is the use of top-down design methodologies, specifically synthesis tools. They have increased the average number of connections per net by 50–100% throughout the chip. A third factor is the increased number of macro blocks or mega cells with a high number of input and output pins, which increases congestion and creates more challenging routing problems.
Finally, the use of larger and larger buses causes special problems for routers.
Depending on the design, one or more of these issues needs to be considered to accurately determine the routability of a design. Fixing routability problems is also extremely difficult. Because most place and route tools operate flat, it is difficult to isolate routability problems and fix them incrementally. In addition, because place and route tools typically immediately flatten the design hierarchy, it is impossible for them to take advantage of the highly structured major portions of a logic hierarchy. As a result, resolving routability issues requires many iterations and much time.

Clock-Tree Synthesis

In a synchronous VLSI design, the clock distribution network, which carries the heaviest load and switches at high frequency, is a major source of power dissipation. Also, circuit speed and chip area have been important considerations, and the delay on the longest path (phase delay) through combinational logic and the maximum skew among the synchronizing components should be minimized. There has been active research in the area of high-performance and low-power clock routing. This section gives an overview of a number of combinatorial aspects and highlights the state of the art of currently active research in clock-tree synthesis.

Reduction in power consumption without sacrificing processing speed is an increasingly important objective in integrated circuit design nowadays. One reason is the longer battery life required for the growing class of personal computing devices, such as portable computers and wireless communication equipment. Another reason is the dramatic decrease in chip size and increase in transistor count and clock rate for integrated circuits. Power is reduced at various design levels, including the circuit, logic, register-transfer, behavior, and system levels. The clock is the fastest and most heavily loaded net in a digital system. Power dissipation of the clock net contributes a large fraction of the total power consumption. The clocking circuitry in an adaptive equalizer consumes 33% of the total power. In a microprocessor, 18% of the total power is consumed by clocking (108), because the clock frequency is typically several times higher than that of other signals, such as data and control.

Clock signals synchronize the clocked elements. Both clock and data signals are delayed by combinational blocks and by the delays introduced by wires. For a circuit to function correctly, clock pulses must arrive nearly simultaneously at the clock pins of all clocked components. Performance of a digital system is measured by its cycle time; a shorter cycle time means higher performance. At the layout level, performance of a system is affected by two factors, signal propagative time and clock skew. The difference in arrival times of a single pulse at two different clocked components is referred to as clock skew, which must be within a certain tolerance. Using advanced routing tools to minimize total wire length is helpful in reducing the resistance of wires. But in high-frequency applications, clock skew and phase delay should be considered to attain a desirable chip performance. It is said that the clock skew must be less than 5% of the critical path delay time to build a high-performance system. Timing-critical nets with a high fan-out require several levels of buffering to drive all the leaf cells. With the scaling of device technology and die size, interconnective delay now contributes up to 70% of the clock cycle in dense, high-performance circuits. Earlier works on the timing-driven routing problem are given in (109).
Figure 11. Clock network topologies: (a) single driver scheme; (b) distributed buffers scheme.
A minimum rectilinear Steiner tree (MRST) approach for interconnecting the terminals of a clock net is not necessarily the best in various applications. To reduce phase delays and supply sufficient driving current, several levels of buffers are added to create a so-called multistaged clock tree [refer to Fig. 11(b)]. To optimize the buffer placement inside each group and between groups, we simply apply variations of the H-tree [called the hierarchical matching tree (110)]. Then, we generate the rectilinear clock net topology to minimize the clock skew, Elmore delay (111), and wire length simultaneously. Because of its fidelity to SPICE-computed delay, Elmore delay is a good performance objective for constructing high-performance routing trees. Figure 12 shows an instance of Elmore delay estimation for a hierarchical distributed buffer tree. Because interconnect resistance and capacitance are usually proportional to the edge length, we see that the delay has a quadratic relationship to the length of the source-sink path, suggesting a min-radius criterion. However, the Cj term implies that Elmore delay is also linear in the total edge length of the tree which lies outside the n0–ni path, suggesting a min-cost criterion. The relative size of the driver resistance heavily influences the optimal routing topology (ORT). If rd decreases, the ORT tends to a star topology.
Figure 12. An instance of Elmore delay estimation for a hierarchical, distributed, buffer tree. For example, d(A) = βRs(C1 + C2 + C3 + C4 + CC + CD + CE), d(B) = d(A) + αR1C1 + βR1(C3 + C4 + CC + CD).
Elmore delay implies that the number of Steiner points in the source-sink path should be minimized and the Steiner points ‘‘shifted’’ toward n0 (i.e., branches off the source-sink path should occur as close to the source as possible). It is well understood that reducing the wire length reduces the delay sensitivities at the leaf nodes, because small process variations in the wire length result in smaller changes in the overall delays and, consequently, smaller skew. Then the objective is to make the delay sensitivities at the leaf nodes small enough (by shortening wires) so that the upper bound on the skew is acceptable without elongating wires excessively. It is important to point out that the wire lengths are optimized in a bottom-up manner from the leaf nodes to properly consider the possible changes in the upstream sensitivities. The change in the Elmore delay to any node n downstream of branch 1 is given by ΔTDn = ΔR1(Cd1 − C1). We see that the skew of binary tree-like clock nets is extremely sensitive to changes in the wire lengths of the branches closest to the root of the tree. Small changes in the lengths of such branches, therefore, have a large effect on skew. Delay, power, skew, area, and sensitivity to process variations are the most important concerns in current clock-tree design. As with other signal integrity issues, clock layout problems force designers to evaluate a set of complicated, opposing alternatives, such as area, delay, skew sensitivity, and power consumption.

In the global routing phase, route shapes for all nets are determined on a two-dimensional grid to minimize the maximum routing density and minimize the interconnect delays in the nets. In VLSI routing, the primary objective is to minimize wire length. This is equivalent to minimizing delay when the lumped-capacitor model is used. However, when the Elmore or a second-order model is used, minimum wire length does not necessarily imply minimum interconnect delay. Consequently, conventional minimum Steiner tree algorithms developed for VLSI routing are inadequate for MCMs. Performance-driven, tree-generating algorithms are currently a topic of considerable research. Using a first-order delay model, Cong et al. (112) show that the delay of a net consists of three terms. The first term is proportional to the total wire length of the net. The second term is proportional to the sum of the path lengths from the source to each of the sink vertices in the tree. The third term is related quadratically to the path lengths. Based on this analysis, they introduce the concept of an A-tree, whose interesting property is that the second term is always minimized and the first and third terms are closely related, so that minimization of one leads to minimization of the others. They propose a near-optimal algorithm for A-tree construction and find experimentally that the algorithm reduces delays by up to 43% compared with a very good Steiner tree algorithm.

A different approach, based on a second-order delay model, is proposed in Ref. 113. In this approach, the tree for a net is generated constructively by adding one sink at a time. During the ith step, a point k is found in the current tree, so that, when a minimum-length path from the current sink si to k is introduced, the second-order delays to sinks s1, . . ., si are minimized. The point k is found efficiently by performing a ‘‘trial’’ connection from si to a set of points in the current tree and incrementally updating the admittance and coefficients to
compute the delays. Although the algorithm is greedy, it finds trees with significantly smaller delays (up to 50% less) than a sophisticated Steiner tree algorithm (114).

In the PowerPC clock design methodologies (115), the first stage of clock design is during synthesis. The synthesis performs load balancing and duplication of clock buffers to minimize clock skew. The objective of load balancing is to balance capacitive and resistive loadings among the groups and the number of pins among the groups. Capacitive loading is approximated by the Elmore sum of the clocked cells' input loading capacitances. Resistive loading of each group is approximated by the half perimeter of the smallest bounding box enclosing all of the clocked pins in the group. The purpose of balancing the number of clocked pins is to distribute the wiring congestion over the routing area. All three parameters are incorporated into a cost function to be optimized. Here clock buffers are inserted based on net fan-out and pin capacitances. The second stage is physical clock design. First, clustering of regenerators (circuits that split the master clock into phases) and estimation of skews inside circuit blocks are performed. The second step of physical clock design is optimizing the wire widths of the main trunks of the clock tree. The final step is final buffer selection in the clock network. Two different clock distribution schemes are used in PowerPC designs: an H-tree clock distribution network [Fig. 13(a)] and a multilevel buffered clock grid [Fig. 13(b)]. The H-tree style designs are used for designs targeted for lower power and desktop markets, whereas the multilevel buffered grid design style has been used for high-performance processors with larger clock distribution areas. There are many works related to high-performance physical clock network design (116,117).

In this article, we explore a number of important contributions to clock synthesis, and we use the following notation:

T: a clock tree with a driver w0 at the root (clock source, N0) and a set of s sinks {N1, N2, . . ., Ns}.
S: a set of t Steiner points {S1, S2, . . ., St} whose locations are to be determined.
dist(Ni, Nj): Manhattan distance between Ni and Nj.
wi: ith wire segment or buffer.
X: X = (x0, x1, . . ., xn+m), where n is the number of wire segments and m is the number of buffers; a wire- and buffer-sizing solution.
ri, ci: resistance and capacitance of wi, respectively.
Ui, Li: upper bound and lower bound of the size of wi.
xi, ℓi: size and length of wi.
Pi: all wires and buffers on the path from the source to sink Ni.
Ti: all wires and buffers in the subtree of T rooted at wi.
Ans(wi): all wires and buffers on the path from wi to the nearest upstream buffer or the root.
Dec(wi): all wires, buffers, or sinks on the path from wi to the nearest downstream buffers or sinks.
Ri: upstream resistance of wi, Ri = Σwj∈Ans(wi) rj.
Ci: downstream capacitance of wi, Ci = Σwj∈Dec(wi) (Cj + cj) + ΣNj∈Dec(wi) c′j, where c′j is the capacitance of sink Nj.
Di: Elmore signal delay (Fig. 12) at sink Ni, Di = Σwj∈Pi,wire rj(Cj + cj/2) + Σwj∈Pi,buffer rjCj.
S: clock skew, defined as the maximum difference in the delays from the clock source to the clock sinks, S = maxi,j |Di − Dj|.
A: area of a clock tree, A = Σi=1,...,n ℓi + Σi=n+1,...,n+m xi.
P: power consumption, proportional to nCVdd², where nC is the sum of capacitance times transitions (switching activity) needed to compute an operation and Vdd is the supply voltage.

The signal net N is to be embedded in an underlying graph G = (V, E) with N ⊆ V. The graph G is associated with Hanan's grid (41) and has variable edge costs. Each edge (i, j) ∈ E has a cost d(i, j) equal to the routing cost between node i = (xi, yi) and node j = (xj, yj), that is, the rectilinear distance between the two points (|xi − xj| + |yi − yj|). The nodes in V correspond to the interconnection points (called Hanan's points) in Hanan's grid. Hanan's grid is generated by drawing a horizontal straight line and a vertical straight line crossing each point in N. The cost of T is defined as cost(T) = Σ(i,j)∈T d(i, j). A Steiner tree (118) is a routing tree T in G that spans N. The set of Hanan's points, denoted H, can contain as many as n² points. We set H = H − N. A Steiner tree for a set N contains at most (n − 2) other points of the set S ⊆ H, called Steiner points, in the plane. The Minimum Rectilinear Steiner Tree (MRST) problem, given a set N of n points in the plane, is to determine a set S of Steiner points so that the tree over N ∪ S has a minimum rectilinear cost. The problem has long been known to be NP-hard (119). Figure 10 shows a set of alternative Steiner trees, each of whose objective is different. In this section, we exhibit a number of results on constrained Steiner trees for clock networks.

Hierarchical, Recursive, Matching Tree. Let us briefly describe the algorithm proposed by Kahng, Cong, and Robins (120). Given a set N of randomly distributed clock pins and a distinguished pin called a source, they first match the closest pairs using minimum edge-weighted matching (MEWM).
Figure 13. Typical PowerPC clocking schemes: (a) H-tree clock distribution; (b) multilevel buffered clock grid.
A balance point is computed by finding the point p along the straight line connecting the roots of the two subtrees so that the difference in path lengths from p to any two leaves in the combined tree is minimum. Then another MEWM is performed on the generated balance points. In this manner, a balanced binary clock tree of height log n is constructed by recursive geometric matching in a bottom-up manner. We call the generated tree the rectilinear hierarchical matching tree (RHMT). The RHMT produces a near-optimal solution in terms of total wiring cost and clock skew when the clock pins are evenly distributed over the chip area.

Bounded-Skew Steiner Tree: The Deferred-Merge Embedding Approach. Based on the hierarchical matching tree, the deferred-merge embedding (DME) algorithm (121) improves the existing method and consists of two stages. In a bottom-up phase, a tree of merging segments is constructed that represents loci of possible placements of internal nodes (or Steiner points) in a zero-skew tree; in a top-down phase, the tree is embedded, determining exact locations for the internal nodes in T. For a node v with children a and b, the merging region of v corresponds to the set of all locations at which subtrees Ta and Tb are joined to v with minimum wiring cost while still maintaining the skew bound B. In general, to ensure correct clock operation under a required clock period P, the allowable clock skews between two adjacent flip-flops i and j are constrained (1) to avoid double-clocking (a hold violation) with negative skew, Di ≤ Dj:

Dj − Di ≤ MIN(Dlogic) + Dff − Dhold

and (2) to avoid zero-clocking (a setup violation) with positive skew, Di ≥ Dj:

Di − Dj ≤ P − [Dff + MAX(Dlogic) + Dsetup]
where Di and Dj are clock arrival times, MAX(Dlogic) and MIN(Dlogic) denote the longest and shortest path delays of the combinational block between the two FFs, and Dff is the delay in a FF. However, the formulation of the previous algorithm does not address the skew constraints. In practice, only minimizing the skew is not an actual design requirement. We should allow some tolerable (or sometimes useful for lower power) skew with which the system functions correctly. Bounded-skew clock routing under the Elmore delay model is presented in Ref. 122, and its problem is as follows: Minimum-Cost, Bounded-Skew Routing Tree Problem: Given a set N = (N1, N2, . . ., Ns) of sinks in a plane and a skew bound B, find a routing topology G and a minimum-cost clock tree T that satisfies Dmax − Dmin ≤ B. The work proves several key properties of the deferred-merge embedding regions.

Minimal Steiner Trees with Bounded Path Length. Let R be the length of the direct path from the source to the farthest sink and ε be a nonnegative, user-specified parameter. The shortest path between u and v in graph G is distG(u, v). The shortest path between u and v in tree T is distT(u, v). The radius of node v ∈ G is radiusT(v) (i.e., max distT(u, v) over all u ∈ V). The method of (123) constructs a bounded path length minimal Steiner tree (BMST) with radius at most (1 + ε)·R by using an analog of the classical Kruskal MST construction. The tree cost is empirically at most 1.19 times that of an optimal BMST. Given the routing graph G(V, E) in Manhattan space, find a minimum cost routing tree with radius(S) ≤ (1 + ε)·R. A heuristic algorithm for the problem is as follows: the Kruskal algorithm adds an edge (u, v) in G to the MST or, equivalently, merges two partial trees tu and tv by the edge (u, v), if (u, v) is the least weight edge among the available edges and
the merged tree satisfies the path length bound (1 + ε)·R from the farthest sink. A negative-sum exchange is defined as a sequence of T-exchanges where the sum of the weights of the exchanges is negative. An exact algorithm starts from any solution tree, finds a negative-sum exchange, converts the solution tree to a new solution tree by exchanging edges, and iterates until no more such exchanges are found. Consider the search tree in which each node represents a spanning tree: the root is the initial solution, and a child node is generated by a T-exchange from its parent node. Note that one reaches any spanning tree from the initial tree by a series of at most (V − 1) T-exchanges. Because the number of possible T-exchanges in a tree T is O(EV), a node in the search tree has O(EV) children, so the search tree has O((EV)ⁿ) nodes, where n is its depth.

Trade-off between Cost and Skew. We denote by L(T) the lower bound on the wire length (cost or weight) of a worst-case tree T in the unit square. Kahng, Cong, and Robins (120) showed that, for random sets of terminals chosen from a uniform distribution in the unit square, the total wire length of the rectilinear hierarchical matching tree (RHMT) is, on average, within a constant factor of the total wire length of the optimal Steiner tree. The constant is at most two (124). Then the bound is used for establishing bounds on the wire length and clock skew of the combined approach of MRST and RHMT (124,110) (refer to Fig. 14). Theorem: The total weight of a worst-case MRST-RHMT or RHMT-MRST in the unit square is at most twice the total weight of a worst-case MRST. Many systems use clock trees with a buffer at each internal node and short daisy chains at the clustered leaves. By properly adjusting the size of the clusters at the leaves, one can trade off between skew and wire length. Minimizing the wire length or the number of buffers also leads to power savings.
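Whatever topology is chosen, candidate clock trees are usually compared by evaluating the Elmore delay at every sink and the resulting skew, as defined in the notation list above. The sketch below does exactly that for a small RC tree; the tree structure and the resistance and capacitance values are invented for the example.

```python
class Node:
    def __init__(self, name, r, c, children=None):
        self.name = name          # sink or internal node name
        self.r = r                # resistance of the wire driving this node
        self.c = c                # capacitance of that wire (plus pin load at sinks)
        self.children = children or []

def downstream_cap(node):
    """Total capacitance of this node's wire and everything below it."""
    return node.c + sum(downstream_cap(ch) for ch in node.children)

def elmore_delays(node, upstream_delay=0.0, delays=None):
    """Elmore delay at every node: each wire adds r * (downstream cap + half its own cap)."""
    if delays is None:
        delays = {}
    d = upstream_delay + node.r * (downstream_cap(node) - node.c / 2.0)
    delays[node.name] = d
    for ch in node.children:
        elmore_delays(ch, d, delays)
    return delays

# A toy two-level clock tree: the source drives A, which branches to sinks B and C.
tree = Node("A", r=1.0, c=0.5, children=[
    Node("B", r=2.0, c=0.3),
    Node("C", r=1.5, c=0.4),
])
d = elmore_delays(tree)
sinks = ("B", "C")
skew = max(d[s] for s in sinks) - min(d[s] for s in sinks)
print(d, "skew =", skew)
```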
Algorithm-1: MRST-RHMT construction
Input: a set of clock pins N
Output: a clock tree topology
Step 1. Construct an RHMT.
Step 2. Reconstruct a set of subtrees of the RHMT at level k using minimum rectilinear Steiner subtrees (MRSS).

Figure 14. MRST-RHMT construction.
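A toy sketch of the matching-based construction on which RHMT-style clock trees rest: points are paired greedily by nearest Manhattan distance, and each pair is replaced by a merge point at its midpoint. The greedy matching and midpoint tapping points are simplifying assumptions for illustration, not the construction of Ref. 120.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def recursive_matching_tree(points):
    """Return (edges, root): connections (child, parent) built by recursively
    matching nearest pairs and merging each pair at its midpoint."""
    level = list(points)
    edges = []
    while len(level) > 1:
        unmatched = list(level)
        nxt = []
        while len(unmatched) > 1:
            a = unmatched.pop()
            b = min(unmatched, key=lambda p: manhattan(a, p))   # greedy nearest pair
            unmatched.remove(b)
            m = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)       # merge (tapping) point
            edges += [(a, m), (b, m)]
            nxt.append(m)
        nxt.extend(unmatched)             # an odd leftover point passes to the next level
        level = nxt
    return edges, level[0]                # level[0] is the final tapping point (clock source)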
Minimum-Cost, Rectilinear, Distance-Preserving Tree. The shortest distance between the source and a given sink i in G is denoted as ri. The shortest distance between the source and a given sink i in T is denoted as ci. The problem of the minimum-cost, rectilinear, Steiner distance-preserving tree (MRDPT) (46) seeks a minimum-cost tree with the special property that ci = ri for every sink i (refer to Fig. 5). It is known that the minimum-cost Steiner distance-preserving tree (125) (or min-cost, shortest-path Steiner tree (117)) in general graphs is NP-hard (126). The complexity of the problem in a planar graph remains unknown; see Ref. 117 for the history. A typical approach for finding an MRDPT, as in Ref. 125, is as follows:

1. Partition the plane into quadrants Q0, Q1, Q2, and Q3. The partitioning of the plane divides the terminals into one-quadrant MRDPT problems (1Q MRDPT), as depicted in Fig. 1(c).
2. Solve the 1Q MRDPT problem for Q0, Q1, Q2, and Q3 independently, obtaining MRDPTs T(Q0), T(Q1), T(Q2), and T(Q3).
3. Merge the solutions for the quadrants, thus obtaining the MRDPT T.

Cho (127) was concerned with finding T(Q0), whose sinks are in the quadrant Q0 (i.e., the upper right corner of the plane). Given the solutions for the quadrants, the MRDPT T is found in polynomial time (125). The underlying graph of the 1Q MRDPT (we call it just the MRDPT) is the flow graph G = [A, V = (H ∪ N)], such that there is a directed arc in A from i to j in V if rj ≤ ri, xj ≤ xi, and yj ≤ yi; that is, every arc is oriented toward the source s. Thus, arcs are embedded using some monotone or "staircase" path between the source and a sink. A tree is a directed in-tree rooted at node s if the unique path in the tree from any node to node s is a directed path. Observe that every node in the directed in-tree has outdegree 1. Note that all paths from a sink to the source have the same length in G. Thus the problem of the MRDPT is to find a directed in-tree in G with minimum cost. An exact algorithm based on a min-cost flow transformation is given in Ref. 127.

Minimum-Cost, Bounded-Skew, Bounded-Delay Clock Tree. The wire- and buffer-sizing problem is defined as follows:

The Clock-Tree Wire- and Buffer-Sizing Problem. Given: a clock tree T with source N0 and sinks {N1, N2, . . ., Ns}, wire segments {w1, w2, . . ., wn}, buffers {w0, wn+1, wn+2, . . ., wn+m}, upper bounds {U0, U1, . . ., Un+m}, and lower bounds {L0, L1, . . ., Ln+m}. Objective: find an X that minimizes max1≤i≤s Di, S, P, A, and/or Δ.

An approach by Chen et al. (128) is based on Lagrangian relaxation, so that the delay constraints are relaxed into the objective function by introducing Lagrange multipliers λi and δi, where λi and δi are the Lagrange multipliers associated
with the delay constraints Di(X) ≤ Dmax and Di(X) ≥ Dmin, respectively. The constraints are relaxed, scaled by the Lagrange multipliers, and folded into the objective function, and then the subproblems resulting from dynamically adjusting the Lagrange multipliers are solved iteratively. The relaxed problem is then
Minimize αDmax + βP + γA + δ(Dmax − Dmin) + Σ_{i=1}^{s} λi[Di(X) − Dmax] + Σ_{i=1}^{s} δi[Dmin − Di(X)]
subject to Li ≤ xi ≤ Ui, 0 ≤ i ≤ n + m, Dmax ≥ 0, Dmin ≥ 0.

Note that Dmax is a variable introduced to minimize the maximum delay and Dmin is introduced to minimize the clock skew. Chen et al. (128) presented an algorithm for simultaneously optimizing the previous objectives by sizing wires and buffers in clock trees. The algorithm, based on the Lagrangian relaxation method, effectively minimizes delay, power, and area simultaneously, with low skew and sensitivity. However, the formulation of the previous algorithm does not address the skew constraints. In practice, minimizing only the skew is not an actual design requirement. We should allow some tolerable (or sometimes useful for lower power) skew with which the system functions correctly. Bounded-skew clock routing is presented in Ref. 122. This method, however, considers only the skew bound and does not control the maximum source-sink delay. Long wires require more buffers and cause slower rise and fall times; more buffers and slower switching result in higher power dissipation. A power-optimizing clock routing algorithm with bounded skew and bounded maximum source-sink delay under the Elmore delay model is presented in Ref. 129. In this algorithm, delays are controlled by buffer sizing rather than by controlling the wire lengths, the clock tree is an equal source-sink path-length Steiner tree regardless of skew bounds (it is a zero-skew tree under the linear model), and the routing cost becomes large when nonzero skew is required. Thus, Oh et al. (130) proposed allowing the user to specify different delay bounds for each individual sink, which leads to a further reduction of the routing cost. In clock routing, the required signal arrival times may differ among the sinks. In addition, if the combinational delay between two FFs violates the short-path delay constraint, common practice is to insert delay elements on the short path or to increase the wire length. However, one cannot arbitrarily increase the length of a wire, because doing so may violate the required arrival times of other sinks. These observations motivated the development of a method for controlling the path lengths so that all delays lie between given upper and lower bounds. The variables of the proposed mathematical programming formulation are the edge lengths of the trees. The following formulation leads to a simple linear programming problem under the linear delay model, which is solved optimally in polynomial time:
Minimize Σ_{i=1}^{n} ℓi
subject to Σ_{wi ∈ path(Ni, Nj)} ℓi ≥ dist(Ni, Nj)  ∀ sinks Ni, Nj (Steiner point constraints),
Dmin ≤ Di ≤ Dmax  ∀ sink Ni (delay constraints).

Once the edge lengths are determined, the positions of the Steiner points are determined from geometric considerations.
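To make the linear program concrete, the following toy instance (a source s, one Steiner point p, and two sinks N1 and N2) is solved with SciPy's linprog. The coordinates, the delay bounds, and the use of SciPy are assumptions of this sketch; under the linear delay model the delay of a sink is simply its tree path length from the source.

from scipy.optimize import linprog

# Edge-length variables x = [l0, l1, l2] for edges (s, p), (p, N1), (p, N2).
# Hypothetical geometry: s = (0, 0), N1 = (4, 0), N2 = (0, 3), so
# dist(s, N1) = 4, dist(s, N2) = 3, dist(N1, N2) = 7 (Manhattan).
Dmin, Dmax = 5, 6

c = [1, 1, 1]                    # minimize total edge length l0 + l1 + l2
A_ub = [[-1, -1,  0],            # l0 + l1 >= dist(s, N1) = 4
        [-1,  0, -1],            # l0 + l2 >= dist(s, N2) = 3
        [ 0, -1, -1],            # l1 + l2 >= dist(N1, N2) = 7  (Steiner point constraint)
        [ 1,  1,  0],            # l0 + l1 <= Dmax   (delay constraints, linear model)
        [ 1,  0,  1],            # l0 + l2 <= Dmax
        [-1, -1,  0],            # l0 + l1 >= Dmin
        [-1,  0, -1]]            # l0 + l2 >= Dmin
b_ub = [-4, -3, -7, Dmax, Dmax, -Dmin, -Dmin]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3, method="highs")
print(res.x, res.fun)            # optimal edge lengths and total wire length

For this toy instance the optimum works out to a total length of 8.5, with both source-to-sink path lengths pinned at Dmin.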
The method for placing Steiner points is similar to the DME algorithm. In the DME algorithm, the feasible regions for Steiner points and the edge lengths are found in a bottom-up fashion, and then the Steiner points are placed in the feasible regions in a top-down fashion. The proposed method is different in that the edge lengths are predetermined and the feasible regions are rectangular regions instead of simple line segments. In the Elmore delay model, the delay equation is quadratic with respect to the ℓi. Because the Elmore delay function is quadratic and the sum of the quadratic terms is positive (i.e., the function is posynomial in ℓi), the delay function is convex. The feasible set defined by a convex function with both lower and upper bounds, however, is not a convex set. Some edges may be given higher weights to account for wirability concerns, blockage, the type of metal used, cross talk, or switching activities. However, the approach splits the original problem into two subproblems that are solved separately and does not consider the interaction between the two stages. Thus, we may need a new approach that determines the bounded-skew, bounded-delay Steiner tree in a "single" step.

Bounded-Radius, Weighted Steiner Tree. With progress in VLSI fabrication technology, interconnective delay has become increasingly significant in determining circuit speed. Recently, it has been reported that interconnective delay contributes from 50% to 70% of the clock cycle in the design of dense, high-performance circuits (131,132). Thus, with submicron device dimensions and up to a million transistors integrated on a single microprocessor, on-chip and chip-to-chip interconnections play a major role in determining the performance of digital systems. Because of this trend, performance-driven layout design has received increased attention in the past several years. Most of the work in this area has been on the timing-driven placement problem, where a number of methods have been developed for placing blocks or cells on timing-critical paths close together; see, for example, Refs. 131–135.
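Because the Elmore delay model used above recurs throughout this discussion, a minimal routine for evaluating it on an RC tree may be useful; the parent-array representation and the use of R[0] for the driver resistance are assumptions made for this sketch.

def elmore_delays(parent, R, C):
    """Elmore delay of every node in an RC tree.
    parent[i] is the parent of node i (parent[0] is None for the root),
    R[i] is the resistance of the wire from parent[i] to i (R[0] models the
    driver resistance), and C[i] is the capacitance lumped at node i."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for i in range(1, n):
        children[parent[i]].append(i)
    order, stack = [], [0]
    while stack:                       # iterative preorder: parents before children
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    cdown = list(C)                    # total downstream capacitance of each subtree
    for v in reversed(order):          # children before parents
        for w in children[v]:
            cdown[v] += cdown[w]
    # Elmore recurrence: delay(i) = delay(parent(i)) + R[i] * Cdown(i)
    delay = [0.0] * n
    delay[0] = R[0] * cdown[0]
    for v in order:
        for w in children[v]:
            delay[w] = delay[v] + R[w] * cdown[w]
    return delay

# three-node chain: driver -> node 1 -> node 2 (illustrative values)
print(elmore_delays([None, 0, 1], [10.0, 100.0, 100.0], [0.0, 1e-12, 1e-12]))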
Although such techniques have been developed for timing-driven placement, only limited progress has been reported on the timing-driven interconnection problem. In Ref. 109, net priorities are determined on the basis of static timing analysis; nets with high priorities are processed earlier, using fewer feedthroughs. In Ref. 136, a hierarchical approach to timing-driven routing was outlined. A timing-driven global router based on the A* heuristic search algorithm was proposed for building-block design. However, these results do not provide a general formulation of the timing-driven global routing problem. Moreover, their solutions are not flexible enough to provide a trade-off between interconnective delay and routing "cost." Cong et al. (137) proposed a bounded-radius minimum routing tree and gave experimental results on the trade-off between interconnective delay and routing "cost." Three types of Steiner trees have been studied before in connection with global routing: min-max weight Steiner trees (84), minimum-length Steiner trees (20,38), and minimum-weight Steiner trees (138). A minimum-length Steiner tree is not appropriate, for it goes through "critical regions," including the modules. Indeed, such a path needs to be modified so that it does not go through any modules. After the modification (shown by an arrow in Fig. 15), the path still goes through "critical regions" (e.g., the region with weight 9 in Fig. 15), and the modification may even violate the optimality of its length. A min-max Steiner tree is also not suitable, for it is excessively long. (Heuristics have been introduced to obtain shorter min-max Steiner trees in Ref. 84.) A minimum-weight Steiner tree is a Steiner tree with minimum total weight, where an edge of length ℓ in a region with weight w has total weight ℓw (138). A minimum-weight Steiner tree deals with length and density simultaneously. A minimum-weight Steiner tree, however, does not consider the interconnective delay (e.g., from P1 to P3 in Fig. 15). We employ the notion of bounded-radius, weighted Steiner trees (BRWRST) to trade off between interconnective delay and routing weight.
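The trade-off that bounded-radius constructions negotiate can be seen by comparing the two extremes: a star of direct source-sink connections (smallest radius, large total wire length) and a minimum spanning tree (small total wire length, possibly large radius). The Prim construction, the Manhattan metric, and the sample points below are assumptions of this sketch; it is not the bounded-radius routing-tree algorithm of Ref. 137.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def prim_mst(points):
    """MST edges of the complete Manhattan graph on the points."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(points):
        w, i, j = min((manhattan(points[i], points[j]), i, j)
                      for i in in_tree
                      for j in range(len(points)) if j not in in_tree)
        in_tree.add(j)
        edges.append((i, j, w))
    return edges

def radius_from_source(points, edges, src=0):
    """Longest tree path length from the source to any sink."""
    adj = {i: [] for i in range(len(points))}
    for i, j, w in edges:
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist, stack = {src: 0}, [src]
    while stack:
        v = stack.pop()
        for u, w in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + w
                stack.append(u)
    return max(dist.values())

pts = [(0, 0), (8, 1), (9, 2), (10, 0), (1, 9)]          # pts[0] is the source
mst = prim_mst(pts)
star = [(0, j, manhattan(pts[0], pts[j])) for j in range(1, len(pts))]
for name, t in (("MST", mst), ("star", star)):
    print(name, sum(w for _, _, w in t), radius_from_source(pts, t))

On this small example the MST has the smaller total length but the larger radius, and the star the reverse; a bounded-radius tree interpolates between them.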
Figure 15. Four types of Steiner trees (terminals N1, N2, N3; region weights shown, with the weight of a region denoted w; legend: length, density, weight, radius; source and sinks marked).
Figure 16. Balanced-mesh clock routing in cell-array designs: a balanced tree connects the chip clock source to clock buffers, each of which drives mesh routing for the clock net among the FFs of its region.
Balanced-Mesh Clock Routing. Clock routing using balanced-mesh routing with circuit partitioning (Fig. 16) was proposed (21,139). The circuit is partitioned into subblocks called mesh-routing regions (MRs), in which clock skew is suppressed below a constant by mesh routing. Then the net from the clock source to each MR is routed as a balanced tree. Using this technique in the design of an MPEG2-encoder LSI, a skew of 210 ps was achieved. The balanced-tree method (BTM) achieves very low skew, but it increases area and delay time by making the skew unnecessarily small. This is especially crucial in the design of chips having many FFs (e.g., MPEG2 LSIs). The fixed-mesh method (FMM) generates a clock net in a fixed mesh driven by a large buffer, with wire sizing. FMM has been applied to the design of a DEC Alpha chip (140), where the entire chip is covered by a big mesh of interconnect metal that drives all the FFs. Although it achieves a clock skew of less than 300 ps for 0.75 µm technology, the power dissipated by the clock is almost 40% of the total power dissipation of the chip, because the FMM overestimates the skew, which increases the number of interconnects and requires a large buffer. However, a fixed mesh is easy to route, and at most one routing track is required in each channel. Taking advantage of both BTM and FMM, Sato et al. (139) developed a practical clock routing method called the balanced-mesh method (BMM), in which the circuit is partitioned into subblocks and the clock net in each subblock is routed as a mesh. Each mesh is driven by a relatively small clock buffer placed at its center row, and these buffers are routed from the clock
source by a balanced or a minimum-delay tree. The circuit is partitioned so that each subblock's skew and the clock-signal delay time are bounded under given allowances, based on the relationship among the clock skew, the delay time, and the FF density in a chip. A subblock region that ensures the skew bound is called a mesh-routing region (MR). In MR partitioning, the circuit is partitioned into MRs so that they satisfy the MR constraints, and then a clock buffer, whose size depends on the number of FFs and the area, is selected in each MR. In the placement stage, the FF cells and their buffers are placed within the MR to which they belong. The routing is classified into two types: intra-MR and inter-MR. Intra-MR routing is the mesh routing within each MR. Inter-MR routing is the minimum-delay-time routing from the clock source to all MRs.
The MR-partitioning problem: Minimize Σ_{i,j} C(i, j) subject to the MR constraints, where C(i, j) is the number of nets connecting MRi and MRj.

Activity-Driven Clock Design. The activity patterns are obtained from Fig. 17. The tree also contains the possible activity patterns of its internal nodes. These activity patterns are derived assuming that a gate is placed at every node of the clock tree, and they represent the times when these gates must allow propagation of the clock signal.
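A small helper showing how the MR-partitioning objective above can be evaluated for a candidate partition; the representation of nets as sets of cell names and the assignment dictionary are assumptions of this sketch.

from itertools import combinations

def mr_cut_cost(nets, assignment):
    """Sum over MR pairs of C(i, j), the number of nets connecting MRi and MRj.
    A net spanning k > 2 MRs contributes to every pair of MRs it touches."""
    cost = 0
    for net in nets:
        mrs = {assignment[cell] for cell in net}
        cost += sum(1 for _ in combinations(sorted(mrs), 2))
    return cost

# example: two MRs and three nets; only net {"b", "c"} crosses the cut
nets = [{"a", "b"}, {"c", "d"}, {"b", "c"}]
assignment = {"a": 0, "b": 0, "c": 1, "d": 1}
print(mr_cut_cost(nets, assignment))   # -> 1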
Figure 17. High-level design transformation of a differential equation to a control data-flow graph.
Figure 18. Differential equation example: a clock-tree circuit for the modules of the differential equation circuit (tree nodes v1 to v15 above the module leaves; active and idle periods of each node are indicated).
A gate at the root of any clock tree must be active during a time period if any of its sinks is also active. Thus, the activity of an internal node is obtained by ORing the activity patterns of the sinks that belong to the corresponding subtree. Using Fig. 17, consider the node whose subtree contains modules A2 and M2 as sinks (001000 + 110110 = 111110). The tree in Fig. 18 has a total of 40 idle time periods. This is improved to 52 idle periods by closely placing modules of similar activity. The second task is to locate the clock gates, given a clock-tree topology, so that the total power is minimized. The gate-insertion problem requires detailed information about the parasitic capacitances of the clock tree and the control lines of the gates. Hence, we model the clock-tree topology with an H-tree construction. First we define an activity pattern for an element i of the system, U(i) = {aij | j = 1, . . ., u, aij ∈ {0, 1}}, with u time periods and activity aij. Let element i consume total active circuit power PA(i) during periods when the circuits are active, and thus the clock must be supplied for proper function. Also let PI(i) denote the total inactive or idle circuit power, when the clock supply is unnecessary. The power consumed by the clock supply itself is negligible. The power consumed by clocked element i is P(U(i), i) = Σ_{j=1}^{u} [aij PA(i) + (1 − aij) PI(i)]. Define the function t[U(i)], which measures changes in the activity pattern and the power consumed by the control signals of the clock gates. The total power of the clock tree is obtained from the sum of the power contributions defined previously.

Activity-Driven Clock Tree Construction. Let the activity pattern of a clock-tree node be obtained by ORing the patterns of its sinks. Construct a tree T(V, E) on a set of sinks N so that the weighted sum of node activities A(T) = Σ_{vi ∈ V} {ci t[UT(i)] + Σ_{j=1}^{u} kij aij} in the resulting tree is minimized. The weights kij and ci are derived from the power contributions defined previously. The algorithm proposed by Téllez et al. (141) is based on recursive weighted matching, where the matching weight is the value of the objective function of the resulting subtree.
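The activity-pattern bookkeeping described above is easy to make concrete. In the sketch below the patterns are strings of 0s and 1s of equal length, the control-signal term t[U(i)] is ignored, and the power numbers are arbitrary illustrative values.

def or_patterns(p, q):
    """Bitwise OR of two equal-length activity patterns given as 0/1 strings."""
    return "".join("1" if a == "1" or b == "1" else "0" for a, b in zip(p, q))

def clocked_power(pattern, p_active, p_idle):
    """P(U(i), i) = sum_j [a_ij * PA(i) + (1 - a_ij) * PI(i)]."""
    return sum(p_active if a == "1" else p_idle for a in pattern)

# leaves of a small clock subtree with their activity patterns
a2, m2 = "001000", "110110"
internal = or_patterns(a2, m2)           # "111110": the gate above A2 and M2
print(internal)
print(clocked_power(m2, p_active=1.0, p_idle=0.1))   # gating saves power in idle periods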
Cross Talk

Rapid growth of multimedia and communication systems demands the use of both analog/digital mixed-signal ICs and deep submicron (below 0.6 µm) CMOS technologies. The higher density and improved electrical performance of such technologies are needed in these systems. In mixed analog-digital layout design and deep submicron CMOS technologies, automated synthesis of interconnections during the early placement stage of the design cycle is emerging as a most promising approach. Current placement-level design models do not capture important physical design effects, such as cross talk, power, and timing, simultaneously, even though these are first-order factors in chip performance. Because of the continually decreasing distances between components and the simultaneous increase of operating frequencies, the noise induced from signal wires physically too close to each other, called cross talk, becomes stronger. Increased cross talk arises from coupling via the interconnects and also via the substrate (142). Cross talk contributes as much as 50% to 75% to the interconnective delay as the width of the wires and the space between wires are reduced (2). The problem is particularly important in high-frequency integrated circuits. A critical area in the layout is an area in which spot defects are centered and a malfunction in the respective critical circuit arises. Both cross talk and spot defects occur frequently in a channel and are avoided by rearranging the wires and vias. Cross-talk noise should be considered because cross talk between long wires increases delay (because of larger effective line capacitance) and also degrades signal integrity and causes logic faults. Excessive local congestion gives rise to future routing difficulty and also increases the potential cross-talk noise in high-speed signal lines. Furthermore, it increases power dissipation due to coupling capacitance. Cross talk is minimized by ensuring that wires carrying high-activity signals are placed sufficiently far from other wires. Moreover, for high-performance circuit routing, intersections of wires cause the use of more vias which, in turn, require more routing resources (because of the large via pitch), lower manufacturing yield, and cause noise problems (because of the mismatched characteristic impedance between wires and vias) (143). The problem of cross talk is typically addressed after the placement step. The next step in physical design is to assign every global route in the layout environment to a plane pair, called layer assignment, so that the capacity constraints are
satisfied on all plane pairs and the number of plane pairs is minimized. A layer assignment algorithm to reduce cross talk, presented in Refs. 9 and 144–145, maximizes the layer separation between interfering nets so as to reduce both intralayer and interlayer cross talk. There are several works related to cross-talk-minimum routing. The main goal of the MCM router developed in Refs. 79 and 146–148 is to route all the nets with a minimum number of layers and to reduce the cross talk by separating high-frequency wires, with a bound on the number of vias used in routing. In the following, we present cross-talk minimization techniques for the placement, global routing, and channel routing phases. The placement model here targets MCMs. The given input is a set of rectangular chips of the same size, with pins fixed within each block, and a specification of n nets, including timing constraints on the nets. Each output solution specifies an absolute position of each chip. The problem is stated as follows: Given a set of chips C and a set of chip sites S, find a mapping C → S so as to minimize the cross talk, crossings, and total wire length needed for routing and to ensure routability of the design in a minimum number of routing layers. Early estimation of wirability during placement is important, but net topology is difficult to estimate at the placement stage. Conventionally, a cost metric based on wire length plus congestion increases the wirability. However, in our formulation, we do not consider the congestion measure explicitly. We observed by experiments that congestion minimization is done automatically when we perform cross talk and crossing minimization simultaneously, because doing so distributes wires evenly over the MCM substrate. Note that minimizing the number of crossings reduces the wire length, whereas minimizing the cross talk does not always do so. Next, we introduce a new interference measure based on cross talk and crossings.

Net Topology and Graph Generation. Multiterminal nets have many possible routing topologies, such as daisy chain, Steiner tree, star, and A-tree (128,149). However, it is impractical to consider all configurations of a large fan-out net because the number of net topologies increases rapidly with the number of receivers. Raghavan, Cohoon, and Sahni (150) demonstrated a polynomial-time solution (O(n^2) time) for a one-layer routing problem called the single-bend wirability problem for two-terminal nets, which is the problem of determining whether there exists a planar routing with at most one bend per net. The problem can be reduced to the 2-satisfiability problem. However, allowing multiple terminals renders the single-bend wirability problem NP-complete (151). The formulation, however, cannot be directly applied to solve our problem, which considers multiple constraints on wires. The bounding-box measure (of wiring interference) for placement, without taking net topologies into account, is not sufficient. For example, a simple measure that satisfies this property adds an edge between two vertices if the bounding boxes of the corresponding nets intersect. If the bounding boxes of two nets intersect in a highly congested region, the routability is more severely affected than if they intersect in a region with very few nets. Thus, we consider, for a two-terminal net i, two possible one-bend global routes, τi(1) and τi(2).
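For a two-terminal net, the two one-bend candidate routes τi(1) and τi(2) used to build the interference graph are simple to enumerate; the coordinate representation below is an assumption of this sketch.

def one_bend_routes(p, q):
    """Return the two L-shaped (single-bend) routes between terminals p and q,
    each as [p, corner, q]; they coincide when p and q share a row or column."""
    return [[p, (q[0], p[1]), q],    # horizontal first, then vertical
            [p, (p[0], q[1]), q]]    # vertical first, then horizontal

print(one_bend_routes((0, 0), (3, 2)))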
Figure 19. Cross talk and crossing: (a) a global routing with more cross talk than (b); (b) a global routing with more crossings than (a). The shaded areas are cross-talk-critical regions. Observe that minimizing cross talk introduces more crossings.
It is desirable that multiterminal nets are routed within the smallest bounding box enclosing the terminals belonging to the nets, and with their favorite topologies as mentioned previously. For example, one may restrict a multiterminal net to a specific routing pattern given by a min-cost Steiner tree having minimum wire length, minimum bends, and minimum stubs. For nets with many terminals, a min-cost Steiner heuristic is used. A stub or branch in a tree introduces extra delay and/or ringing in the received signal waveform (113). Evidently, the topology estimate obtained from a placement in this way is poorer than the estimate from global routing, but it is a necessary compromise for a strong coupling between placement and global routing. Based on these facts, given a placement, we create a new interference graph G = (V, E), where |V| = 2n and |E| ≤ 2n(n − 1) (in the case of two-terminal nets; refer to Fig. 19), to formulate the interference relationship between nets, so that each node in V represents a net and a weight on an edge in E represents a net-pair cross talk and crossing, measured as below.

Cross-Talk Measure. A popular approach used in the past to model the dependency of performance functions on parasitics is net classification. Nets are classified according to the type of signal they carry (stable, large swing, sensitive to noise, etc.). A bus of several sensitive nets running parallel to each other with correlated signals might inject considerable noise into a single net. A cross-talk-critical region is defined as a region enclosed by two wire segments of net i and net j such that their coupling distance d(i, j) is less than or equal to a small constant δ. The value of δ depends on the device technology; for example, using ac device technology on an MCM-L layer, δ = 1 cm (146). The shaded regions in Fig. 19 correspond to the set of cross-talk-critical regions induced by the given global routes of the two nets. The cross talk between two nets i(τp) and j(τq), denoted as µ[i(τp), j(τq)], is estimated as proportional to the maximum length for which the two nets run in parallel and
is inversely proportional to the minimum separation between the parallel wires:

µ[i(τp), j(τq)] = Σ_{k ∈ K[i(τp), j(τq)]} (ℓk / dk)

where K[i(τp), j(τq)] is the set of cross-talk-critical regions between the two nets i with topology τp and j with topology τq, ℓk is the parallel run length in region k, and dk is the separation there. An interference graph is established for the net-pairwise cross-talk value, which serves as an edge weight of the graph, in O(n^2) time. Then, the noise tolerance Ti for net i with topology τ with respect to the cross-talk measure µ is approximated as Ti(τ) = Mi(τ) − Σ_{∀j} µ[i(τ), j], where Mi(τ) is the maximum allowable coupled noise for net i with topology τ and j ranges over the cross-talk-critically adjacent nets with respect to net i. We aim to identify the placement that maximizes the minimum noise tolerance over all nets. We say that such a placement satisfies the noise-tolerance condition.
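A sketch of how the cross-talk estimate between two global routes might be computed; representing each route as a list of axis-parallel segments, considering only same-orientation parallel overlaps, and the threshold parameter delta are assumptions of this sketch rather than the authors' implementation.

def parallel_coupling(seg_a, seg_b):
    """Return (overlap_length, separation) for two parallel axis-aligned segments
    given as ((x1, y1), (x2, y2)), or None if they do not run in parallel."""
    (ax1, ay1), (ax2, ay2) = seg_a
    (bx1, by1), (bx2, by2) = seg_b
    if ay1 == ay2 and by1 == by2:                      # both horizontal
        lo = max(min(ax1, ax2), min(bx1, bx2))
        hi = min(max(ax1, ax2), max(bx1, bx2))
        return (hi - lo, abs(ay1 - by1)) if hi > lo else None
    if ax1 == ax2 and bx1 == bx2:                      # both vertical
        lo = max(min(ay1, ay2), min(by1, by2))
        hi = min(max(ay1, ay2), max(by1, by2))
        return (hi - lo, abs(ax1 - bx1)) if hi > lo else None
    return None

def crosstalk(route_i, route_j, delta):
    """mu[i, j] ~ sum over critical regions of (parallel run length / separation)."""
    total = 0.0
    for sa in route_i:
        for sb in route_j:
            c = parallel_coupling(sa, sb)
            if c and 0 < c[1] <= delta:                # within the critical coupling distance
                total += c[0] / c[1]
    return total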
BIBLIOGRAPHY

1. E. E. Davidson and G. A. Katopis, Package electrical design, in E. R. Tummala and E. J. Rynaszewski (eds.), Microelectronics Packaging Handbook, New York: Van Nostrand Reinhold, 1989, Chap. 3. 2. H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Reading, MA: Addison-Wesley, 1990, pp. 81–112. 3. B. Preas, M. Pedram, and D. Curry, Automatic layout of silicon-on-silicon hybrid packages. In Design Automation Conference, 1989, pp. 394–399. 4. A. Hanafusa, Y. Yamashita, and M. Yasuda, Three-dimensional routing for multilayer ceramic printed circuit boards. In Int. Conf. Computer-Aided Des., IEEE, November 1990, pp. 386–389. 5. W. Wei-Ming Dai, Performance driven layout of thin-film substrates for multichip modules, in Proc. Multichip Module Workshop, IEEE, 1990, pp. 114–121. 6. W. Wei-Ming Dai, Topological routing in SURF: Generating a rubber-band sketch. In Proc. IEEE Des. Automation Conf., IEEE, 1991, pp. 39–48. 7. J. M. Ho et al., Layer assignment for multi-chip modules. IEEE Trans. Comput. Aided Des., CAD-9: 1272–1277, 1990. 8. J. D. Cho, K. F. Liao, and M. Sarrafzadeh, Multilayer Routing Algorithm for High Performance MCMs. In Proc. Fifth Annu. IEEE Int. ASIC Conf. Exhibit, 1992, pp. 226–229. 9. J. D. Cho et al., Crosstalk minimum layer assignment. In Proc. IEEE Custom Integr. Circuits Conf., San Diego, CA, 1993, pp. 29.7.1–29.7.4. 10. J. D. Cho and M. Sarrafzadeh, The pin redistribution problem in multichip modules. In Proc. Fourth Annu. IEEE Int. ASIC Conf. Exhibit, IEEE, September 1991, pp. p9-2.1–p9-2.4. 11. M. Sriram and S. M. Kang, Detailed layer assignment for MCM routing. In Int. Conf. Computer-Aided Des., IEEE, 1992, pp. 386–389. 12. R. Iyer, D. Rossetti, and H. Hsueh, Measurement and modeling of computer reliability as affected by system activity, ACM Trans. Comput. Syst., 4 (3): 214–237, 1986. 13. F. Najm, Transition density: A new measure of activity in digital circuits, IEEE Trans. Comput. Aided Des., 12 (2): 310–323, 1992. 14. B. Lin and H. DeMan, Low-power driven technology mapping under timing constraints. Int. Conf. Comput. Des., IEEE, 1993, pp. 421–427.
15. V. Tiwari, P. Ashar, and S. Malik, Technology mapping for low power. In Des. Automation Conf., ACM/IEEE, 1993, pp. 74–79. 16. C. Tsui, M. Pedram, and A. M. Despain, Technology decomposition and mapping targeting low power dissipation. In Des. Automation Conf., ACM/IEEE, 1993, pp. 68–73. 17. H. Vaishnav and M. Pedram, A performance driven placement algorithm for low power designs. In EURO-DAC, 1993. 18. M. Sarratzadeh and C. K. Wong, An Introduction to VLSI Physical Design, New York: McGraw–Hill, 1996. 19. E. S. Kuh and T. Outsuki, Recent advances in VLSI layout, Proc. IEEE, 1990, pp. 250–251. 19a. S. Muroga, VLSI System Design, New York: Wiley, 1982. 20. C. Y. Lee, An algorithm for path connection and its application. IRE Trans. Electronic Comput., EC-10: 346–365, 1961. 21. Y. Y. Lee and J. D. Cho, A new VLSI clock layout synthesis system. In Technical Report, Sung Kyun Kwan University, 46 (2): p. 891–903, 1995. 22. H. Liu and D. F. Wong, Network flow based multi-way partitioning with area and pin constraints. In Proc. 1997 IEEE/ACM Int. Symp. Physical Des., 1997, pp. 12–17. 23. G. Karypis and V. Kumar, Unstructured graph partitioning and sparse matrix ordering. In Technical Report, CS Dept., University of Minnesota, 1995. 24. C. Alpert, L. Hagen, and A. Kahng, A hybrid multilevel genetic approach for circuit partitioning, in Proc. IEEE Asia Pacific Conf. Circuits Syst., November 1996. 25. S. Wimer, I. Koren, and I. Cederbaum, Optimal aspect ratios of building blocks in VLSI, IEEE Trans. Comput. Aided Des., 8: 139–145, 1989. 26. J. P. Cohoon, Distributed genetic algorithms for the floorplan design problem, IEEE Trans. Comput. Aided Des., 10: 483– 492, 1991. 27. R. M. Kling and P. Banerjee, Optimization simulated evolution with applications to standard cell placement. In Des. Automation Conf., 1990, pp. 20–25. 28. D. F. Wong, H. W. Leong, and C. L. Liu, Simulated Annealing for VLSI Design, Norwell, MA: Kluwer Academic, 1988. 29. M. Rebaudengo and M. S. Reorda, GALLO: A genetic algorithm for floorplan area optimization, IEEE Trans. Comput. Aided Des., 15: 1996. 30. J. M. Kleinhans et al., GORDIAN: VLSI placement by quadratic programming and slicing optimization, IEEE Trans. Comput. Aided Des., 10: 356–365, 1991. 31. C. Alpert et al., Faster minimization of linear wirelength for global placement. In Proc. 1997 IEEE/ACM Int. Symp. Physical Des., 1997, pp. 4–11. 32. D. Huang and A. B. Kahng, Partitioning-based standard-cell global placement with an exact objective. In Proc. 1997 IEEE/ ACM Int. Symp. Physical Des., 1997, pp. 18–25. 33. T. Lengauer and M. Lugering, Integer programming formulation of global routing and placement problems. In M. Sarrafzadeh and D. T. Lee (eds.), Special volume on Algorithm Aspects of VLSI Layout, Singapore: World Scientific, 1993. 34. S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, Optimization by simulated annealing, Science, 220: 671–680, 1983. 35. D. G. Schweikert, A two-dimensional placement algorithm for the layout of electrical circuits. In Proc. IEEE Des. Automation Conf., IEEE/ACM, 1976, pp. 408–416. 36. D. S. Johnson et al., Optimization by simulated annealing: An experimental evaluation, Part II (graph coloring and number partitioning), Oper. Res., 39: 378–406, 1991. 37. M. P. Vecchi and S. Kirkpatrick, Global wiring by simulated annealing, IEEE Trans. Comput. Aided Des., CAD-2: 215–222, 1983.
VLSI CIRCUIT LAYOUT 38. K.-W. Lee and C. Sechen, A new global router for row-based layout, Int. Conf. Computer-Aided Des., IEEE, November 1988, pp. 180–183. 39. C. Sechen, VLSI Placement and Global Routing Using Simulated Annealing, Deventer, The Netherlands; Kluwer, B. V., 1988. 40. H. Esbensen and P. Mazumder, SAGA–A unification of genetic algorithm with simulated annealing and its application to macrocell placement, Proc. 7th Int. Conf. VLSI Des., 1994, pp. 211–214. 41. M. Hanan, On Steiner’s problem with rectilinear distance, SIAM J. Appl. Math., 14 (2): 255–265, 1966. 42. A. V. Aho, M. R. Garey, and F. K. Hwang, Rectilinear steiner trees: efficient special-case algorithm, Networks, 7: 35–58, 1977. 43. J. M. Ho, G. Vijayan, and C. K. Wong, New algorithm for the rectilinear Steiner Tree Problem. IEEE Trans. Comput. Aided Des., 9: 185–193, February 1990. 44. F. K. Hwang and D. S. Richards, Steiner tree problems, manuscript, 1989. 45. A. Z. Zelikovsky, P. Berman, and M. Karpinski, Improved Approximation Bounds for the Rectilinear Steiner Tree Problem, Technical Report Report No. 85108-CS, Institut fur Informatik, Universitat Bonn, 1994. 46. S. K. Rao et al., The rectilinear Steiner arborescence problem, Algorithmica, 7 (2–3): 277–288, 1992. 47. G. E. Te´llez and M. Sarrafzadeh, On rectilinear distance-preserving trees, Int. Symp. Circuits Syst., IEEE, 1995, Vol. 1, pp. 163–166. 48. J. Cong, A. Kahng, and K.-S. Leung, Efficient heuristics for the minimum shortest path Steiner Arborescence problem with applications to VLSI physical design. In Proc. 1997 IEEE/ACM Int. Symp. Physical Des., 1997, pp. 88–95. 49. J. D. Cho, A min-cost flow based min-cost rectilinear Steiner distance-preserving tree construction. In Proc. 1997 IEEE/ACM Int. Symp. Physical Des., 1997, pp. 82–87. 50. S. Chattopadhyay, D. Bouldin, and P. Dehkordi, An overview of placement and routing algorithms & multi-chip nodules, in J.D. Cho and P. D. Fyazon (eds.), High Performance Design Automation for MCM and Packages, Singapore: World Scientific, 1996, pp. 3–23. 51. K. Shahookar and P. Mazumder, VLSI cell placement techniques, ACM Computing Surveys, 23 (2): 143–220, 1991. 52. K. Shahookar and P. Mazumder, A genetic approach to standard cell placement using meta-genetic parameter optimization, IEEE Trans. Comput. Aided Des., 9: 500–511, 1990. 53. R. Vemuri, Genetic algorithms for partitioning placement and layer assignment for multichip modules, PhD Thesis, University of Cincinnati, Cincinnati, 1994. 54. H. Chan, P. Mazumder, and K. Shahookar, Macro-cell and module placement by genetic adaptive search with bitmap-represented chromosome, Integration, 12: 49–77, 1991. 55. E. W. Dijkstra, A note on two problems in connection with graphs, Numerische Mathematik, 1: 269–271, 1959. 56. E. F. Moore, Shortest Path Through a Maze. In Annals of Computation Laboratory, Cambridge, MA: Harvard University Press, 1959, pp. 285–292. 57. D. W. Hightower, A solution to line routing problems on the continuous plane, 6th Des. Automation Workshop, IEEE, 1969, pp. 1–24. 58. T. G. Szymanski, Dogleg channel routing is NP-complete, IEEE Trans. Comput. Aided Des., CAD-4: 31–41, 1985. 59. M. Sarrafzadeh, Channel-routing problem in the knock-knee mode is NP-complete, IEEE Trans. Comput. Aided Des., 6: 503– 506, 1987.
60. D. T. Lee, U. I. Gupta, and J. Y. Leung, An optimal solution for the channel assignment problem, IEEE Trans. Comput., 28: 807–810, 1979. 61. D. N. Deutsch, A dogleg channel router. In Des. Automation Conf., IEEE/ACM, 1976, pp. 425–433. 62. R. L. Rivest and C. M. Fiduccia, A greedy channel router, Des. Automation Conf., IEEE/ACM, 1982, pp. 418–424. 63. M. Burstein and R. Pelavin, Hierarchical channel router, Integration, 1, 1983. (Also Proc. 20th Des. Automation Conf., 1983.) 64. F. P. Preparata and W. Lipski, Jr., Optimal three-layer channel routing, IEEE Trans. Comput. Aided Des., C-33: 427–437, 1984. 65. M. L. Brady and D. J. Brown, Optimal multilayer channel routing with overlap. In 4th MIT Conf. Advanced Res. VLSI, Cambridge, MA: MIT Press, 1986, pp. 281–296. 66. R. J. Enbody and H. C. Du, Near optimal n-layer channel routing. Des. Automation Conf., IEEE/ACM, 1986, pp. 708–714. 67. J. Cong, D. F. Wong, and C. L. Liu, A new approach to the threeor four-layer channel routing, IEEE Trans. Comput. Aided Des., 7: 1094–1104, 1988. 68. K. Chaudhary and P. Robinson, Channel routing by sorting. IEEE Trans. Comput. Aided Des., 10: 754–760, 1991. 69. H. Chen and E. Kuh, A variable-width gridless channel router. In International Conference on Computer-Aided Design, IEEE/ ACM, 1985, pp. 304–306. 70. A. Sangiovanni-Vincentelli, M. Santomauro, and J. Reed, A new gridless channel router: Yet another channel router the second (YACR-II), Int. Conf. Computer-Aided Des., 1984. 71. N. Funabiki and Y. Takefuji, A Parallel algorithm for channel routing problems, IEEE Trans. Comput. Aided Des., 11: 464– 474, 1992. 72. J. Greene et al., Segmented channel routing, Des. Automation Conf., IEEE/ACM, 1990, pp. 567–572. 73. K. Zhu and D. F. Wong, On channel segmentation design for row-based FPGA’s, Int. Conf. Computer-Aided Des., 1992, pp. 26–29. 74. M. Pedram, B. Nobandegani, and B. Preas, Design and analysis of segmented routing channels for row-based FPGA’s, IEEE Trans. Comput. Aided Des., 13: 1470–1479, 1994. 75. V. Roychowdhury, J. Greene, and A. El Gamal, Segmented channel routing, IEEE Trans. Comput. Aided Des., 79–95, 1993. 76. J.–C. Jeen, R. S. Gyurcsik, and W.–T. Liu, A two layer chemical routing algorithm for mixed analog and digital signal nets. In IEEE Custom Integrated Circuits Conf., 1988, pp. 11.5.1–11.5.4. 77. U. Choudhury and A. Sangiovanni–Vincentelli, Constrainedbased channel routing for analog and mixed analog/digital circuits, Int. Conf. Computer-Aided Des., 1990, pp. 198–201. 78. J. D. Cho and M. S. Chang, LEXA: A left-edge based crosstalkminimum k-colour permutation in VHV channels. manuscript, July 1996. 79. S. Thakur, K.–Y. Chao, and D. F. Wong, An optimal layer assignment algorithm for minimizing crosstalk for three layer VHV channel routing. To appear in VLSI DESIGN, an international J. Custom-Chip Design, Simulation, and Testing, JunDong Cho (Guest Editor), 1995. 80. T. Leighton, A survey of problem and results for channel routing, in A.W.O.C., 1987. 81. R. L. Rivest, A. E. Baratz, and G. Miller, Provably good channel routing algorithms. In H. T. Kung, R. Sproull and G. Steele (eds.), VLSI Systems and Computations, Computer Science Press, Rockville, MD, 1981, pp. 178–185. 82. M. Burstein and R. Pelavin, Hierarchical wire routing, IEEE Trans. Comput. Aided Des., CAD-2: 223–234, 1983. 82a. B. S. Ting and B. N. Tien, Routing techniques for gate arrays, IEEE Trans. Comput. Aided Des., CAD-2: 301–312, 1983.
82b. A. E. Dunlop et al., Chip layout optimization using critical path weighting. In Design Automation Conf., 1984, pp. 133–136. 83. W. K. Luk et al., A hierarchical global wiring algorithm for custom chip design, IEEE Trans. Comput. Aided Des., CAD-6: 518– 533, 1987. 84. C. Chiang, M. Sarrafzadeh, and C. K. Wong, Global routing based on Steiner min-max trees, IEEE Trans. Comput. Aided Des., 9: 1315–1325, 1990. 85. J. T. Li and M. Marek–Sadowska, Global routing for gate arrays. IEEE Trans. Comput. Aided Des., CAD-3: 298–307, 1984. 86. K. F. Liao, M. Sarrafzadeh, and C. K. Wong, Single-layer global routing, Proc. 4th Annu. IEEE Int. ASIC Conf. Exhibit, IEEE, September 1991, pp. p14-4.1–p14-4.4. 87. R. Nair, A simple yet effective technique for global wiring, IEEE Trans. Comput. Aided Des., CAD-6: 165–172, 1987. 88. G. Meixner and U. Lauther, A new global router based on a flow model and linear assignment, Int. Conf. Computer-Aided Des., IEEE, November 1990, pp. 44–47. 89. Y. Nishizaki, M. Igusa, and A. Sangiovanni-Vincentelli, Mercury: A new approach to macro-cell global routing, Proc. VLSI, Germany, 1989, pp. 411–419. 90. R. C. Carden IV and C. K. Cheng, A global router using an efficient approximate multicommodity multiterminal flow algorithm. In Des. Automation Conf., IEEE/ACM, 1991, pp. 316–321. 91. E. Shargowitz and J. Keel, A global router based on multicommodity flow model, INTEGRATION: VLSI J., 5: 3–16, 1987. 92. R. M. Karp et al., Global wire routing in two-dimensional arrays, Algorithmica, 2 (1): 113–129, 1987. 93. M. Sarrafzadeh and D. Zhou, Global routing of short nets in two-dimensional arrays, Int. J. Comput. Aided VLSI Des., 2 (2): 197–211, 1990. 94. M. Marek–Sadowska, Route planner for custom chip design, Int. Conf. Computer-Aided Des., IEEE, November 1986, pp. 246–249. 95. L.-T. Hwang et al., Thin-film pulse propagation analysis using frequency techniques, IEEE Trans. CHMT, 14: 1991. 96. L.–T. Hwang and I. Turlik, Calculation of voltage drops in the vias of a multichip package, MCNC Technical Reports, Technical Report Series TR90-41, 1990. 97. L.–T. Hwang and I. Turlik, The skin effect in thin-film interconnections for ULSI/VLSI Packages. MCNC Technical Reports, Technical Report Series TR91-13, 1991. 98. T. F. Gonzalez and S. L. Lee, A linear time algorithm for optimal wiring around a rectangle, J. ACM, 35 (4): 810–832, 1988. 98a. T. F. Gonzalez and S. L. Lee, A 1.6 approximation algorithm for routing multiterminal nets, SIAM J. Comput., 16 (4): 669– 704, 1987. 99. A. Frank et al., Algorithm for routing around a rectangle, Discrete Appl. Math., 40: 363–378, 1992. 100. T. F. Gonzalez and S. Lee, Routing Around Two Rectangles to Minimize the Layout Area. In M. Sarrafzadeh and D. T. Lee, eds., Algorithmic Aspects of VLSI Layout, Singapore: World Scientific, 1993, pp. 365–397. 101. A. C. Tucker, Structure theorem for some circular-arc graphs, Discrete Math. 7: 167–195, 1974. 102. D. T. Lee, U. I. Gupta, and J. Y. Leung, Efficient algorithms for interval graphs and circular arc graphs, Networks, 12: 459– 467, 1982. 103. A. Hashimoto and J. Stevens, Wire routing by optimizing channel assignment within large apertures, Proc. 8th Design Automation Workshop, Atlantic City, NJ, June 1971, pp. 155–169. 104. J. L. Ganley and J. P. Cohoon, Provably good moat routing, Manuscript, 1996.
105. J. Kim and S. M. Kang, A New Triple-Layer OTC Channel Router. IEEE Trans. Comput. Aided Des., 1996. 106. A. J. Blodgett, Microelectronic packaging, Sci. Amer., pp. 86–96, July 1983. 107. P. D. Franzon, Private communication. Electronic MCM Clearing House in North Carolina State University, 1992. 108. R. Bechade, R. Flaker, and B. Kauffmann et al., A 32b 66 mhz 1.8W microprocessor. In IEEE Int. Solid-State Circuit Conf., 1994, pp. 208–209. 109. A. E. Dunlop et al., Chip layout optimization using critical path weighting, Des. Automation Conf., IEEE/ACM, 1984, pp. 133–136. 110. J. D. Cho and M. Sarrafzadeh, Buffer distribution algorithm for high-performance clock optimization. IEEE Trans. VLSI Syst., 3: 84–98, 1995. 111. W. C. Elmore, The transient response of damped linear networks with particular regard to wideband amplifiers, J. Appl. Phys., 19 (1): 55–63, 1948. 112. J. Cong, K.-S. Leung, and D. Zhou, Performance-driven interconnect design based on distributed RC delay model. In UCLA Computer Science Department Technical Report CSD920043, October 1992. 113. M. Sriram and S. M. Kang, Performance driven MCM routing using a second order RLC tree delay model, Proc. IEEE Int. Conf. Wafer Scale Integration, San Francisco, January 1993. 114. M. Sarrafzadeh and C. K. Wong, Hierarchical Steiner tree construction in uniform orientations, IEEE Trans. Comput. Aided Des., 11: 1095–1103, 1992. 115. S. Ganguly and S. Hojat, Clock distribution design and verification for power PC microprocessor, Int. Conf. Computer-Aided Des., 1995, p. Issues in Clock Designs. 116. E. G. Friedman, Clock Distribution Networks in VLSI Circuits and Systems, IEEE, 1995. 117. A. B. Kahng and G. Robins, On Optimal Interconnections for VLSI. Norwell, MA: Kluwer Academic Publishers, 1995. 118. P. Winter, Steiner problem in networks: A Survey, Networks, 17: 129–167, 1987. 119. M. R. Garey and D. S. Johnson, The rectilinear Steiner tree problem is NP-complete, SIAM J. Appl. Math., 32 (4): 826– 834, 1977. 120. A. Kahng, J. Cong, and G. Robins, High-performance clock routing based on recursive geometric matching, Des. Automation Conf., IEEE/ACM, 1991, pp. 322–327. 121. J. Cong and C. K. Koh, Minimum-cost bounded-skew clock routing, Int. Symp. Circuits Syst., 1995, pp. 215–218. 122. J. Cong et al., Bounded-skew clock and Steiner routing under Elmore delay. In Int. Conf. Computer-Aided Des., p. Issues in Clock Designs, 1995. 123. I. Pyo, J. Oh, and M. Pedram, Constructing minimal spanning/ Steiner Trees with bounded path length, Eur. Des. Test Conf., 1996, pp. 244–248. 124. M. A. B. Jackson, A. Srinivasan, and E. S. Kuh, Clock routing for high-performance ICs, Des. Automation Conf., IEEE/ACM, 1990, pp. 573–579. 125. G. Tellez and M. Sarrafzadeh, On Rectilinear Distance-Preserving Trees, manuscript to appear in VLSI DESIGN, the special issue in High Performance Design Automation for VLSI Interconnections, May 1995. 126. H. A. Choi and A. H. Esfahanian, On complexity of a messagerouting strategy for multicomputer systems. In 16th Int. Workshop Graph-Theoretic Concepts Comput. Sci., Germany, pp. 170– 181, 1990. 127. J. D. Cho, Min-cost flow based minimum-cost rectilinear Steiner distance-preserving tree, 1996.
128. C. P. Chen, Y. W. Chang, and D. F. Wong, Fast performance-driven optimization for buffered clock trees based on Lagrangian relaxation. In Des. Automation Conf., 1996, pp. 405–408. 129. J. G. Xi and W. W.–M. Dai, Buffer insertion and sizing under process variations for low power clock distribution, Des. Automation Conf., 1995, pp. 491–496. 130. J. Oh, I. Pyo, and M. Pedram, Constructing lower and upper bounded delay routing trees using linear programming, Des. Automation Conf., 1996, pp. 401–404.
150. R. Raghavan, J. Cohoon, and S. Sahni, Single bend wiring, J. Algorithms, 7 (2): 232–257, 1986. 151. H. C. Yen, On multiterminal single bend wirability. IEEE Trans. Comput. Aided Des., 13: 822–826, 1994.
JUN DONG CHO Sung Kyun Kwan University
MAJID SARRAFZADEH Northwestern University
131. W. E. Donath et al., Timing driven placement using complete path delays, Des. Automation Conf., IEEE/ACM, 1990, pp. 84–89. 132. S. Sutanthavibul and E. Shragowitz, An adaptive timing-driven layout for high speed VLSI. In Des. Automation Conf., IEEE/ ACM, 1990, pp. 90–95. 133. P. S. Hauge, R. Nair, and E. J. Yoffa, Circuit placement for predictable performance, Int. Conf. Computer-Aided Des., IEEE, 1987, pp. 88–91. 134. E. S. Kuh et al., Timing driven layout. In VLSI Logic Synthesis Des., 1991, pp. 263–270. 135. I. Lin and D. H. C. Du, Performance-driven constructive placement. Des. Automation Conf., IEEE/ACM, 1990, pp. 103–106. 136. M. A. B. Jackson, E. S. Kuh, and M. Marek–Sadowska, Timingdriven routing for building block layout, Int. Symp. Circuits Syst., IEEE, 1987, pp. 518–519. 137. J. Cong et al., Provably good performance-driven global routing, IEEE Trans. Comput. Aided Des., 11: 739–752, 1992. 138. C. Chiang, M. Sarrafzadeh, and C. K. Wong, A global router with simultaneous length and density minimization, manuscript, June 1991. 139. H. Sato, A. Onozawa, and H. Matsuda, A Balanced-mesh clock routing technique using circuit partitioning, Eur. Des. Test Conf., 1996, pp. 237–243. 140. D. Dobberpuhl et al., A200-Mhz 64-b dual-issue CMOS microprocessor, IEEE J. Solid-State Circuits, 1555–1567, 1992. 141. G. E. Te´llez, A. Farrahi, and M. Sarrafzadeh, Activity-driven clock design for low-power circuits, Int. Conf. Computer-Aided Des., IEEE/ACM, November 1995. 142. N. P. van der Meijs, T. Smedes, and A. J. Genderen, Extraction of circuit models for substrate cross-talk, Int. Conf. ComputerAided Des., 1995, pp. 199–206. 143. Cadence Design Systems, A vision for multichip Module Design in the Nineties. Tech. Rep. Cadence Design Systems Inc., Santa Clara, CA, 1993. 144. K. Y. Chao and D. F. Wong, Layer assignment for high-performance multi-chip modules, in J. D. Cho and P. D. Franzon (eds.), High Performance Design Automation for MCM and Packages, Singapore: World Scientific, 1996, pp. 61–79. 145. J. M. Ho et al., Layer assignment for multi-chip modules, IEEE Trans. Comput. Aided Des., CAD-9: 1272–1277, 1990. 146. H. H. Chen and C. K. Wong, 63-layer TCM wiring with threedimensional crosstalk constraints, in J. D. Cho and P. D. Franzon (eds.), High Performance Design Automation for MCM and Packages, Singapore: World Scientific, 1996, pp. 81–92. 147. G. Devaraj, Distributed placement and crosstalk driven router for multichip modules, M.S. Thesis, University of Cincinnati, Cincinnati, 1994. 148. K. Chaudhary, A. Onozawa, and E. Kuh, A spacing algorithm for performance enhancement and crosstalk reduction. In Int. Conf. Computer-Aided Des., 1993, pp. 697–702. 149. A. Vittal and M. Marek–Sadowska, Minimal delay interconnect design using alphabetic trees, Des. Automation Conf., 1994, pp. 392–394.