Alexander Taubin, Jordi Cortadella, Luciano Lavagno, Alex Kondratyev, and Ad Peeters

The number of gates on a chip is quickly growing toward and beyond the one billion mark. Keeping all the gates running at the beat of a single clock, or of a few rationally related clocks, is becoming impossible. However, the electronics industry is for the most part still reluctant to adopt asynchronous design, due to a common belief that there is a lack of commercial-quality Electronic Design Automation tools for asynchronous circuits. Design Automation of Real-Life Asynchronous Devices and Systems presents design flows that can tackle large designs without significant changes with respect to the synchronous design flow. Limiting itself to the four design flows that come closest to this goal, it starts by overviewing the most commercially and technically proven one, Tangram. The other three flows, Null Convention Logic, de-synchronization, and gate-level pipelining, can be considered asynchronous re-implementations of synchronous specifications. The book demonstrates the possibility of implementing large legacy synchronous designs in an almost “push-button” manner, removing the need to re-educate synchronous RTL designers. It is essential reading for designers and researchers in large-scale integrated circuit design.
Foundations and Trends® in Electronic Design Automation 2:1 (2007)

This book was originally published as Foundations and Trends® in Electronic Design Automation, Volume 2, Issue 1 (2007), ISSN 1551-3939.

Design Automation of Real-Life Asynchronous Devices and Systems

Alexander Taubin, Boston University, USA
Jordi Cortadella, Universitat Politècnica de Catalunya, Spain
Luciano Lavagno, Politecnico di Torino, Italy
Alex Kondratyev, Cadence Design Systems, USA
Ad Peeters, Handshake Solutions, The Netherlands
Boston – Delft
Foundations and Trends® in Electronic Design Automation

Published, sold and distributed by: now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA. Tel. +1-781-985-4510. www.nowpublishers.com, [email protected]

Outside North America: now Publishers Inc., PO Box 179, 2600 AD Delft, The Netherlands. Tel. +31-6-51115274

The preferred citation for this publication is A. Taubin, J. Cortadella, L. Lavagno, A. Kondratyev and A. Peeters, Design Automation of Real-Life Asynchronous Devices and Systems, Foundations and Trends® in Electronic Design Automation, vol 2, no 1, pp 1–133, 2007.

ISBN: 978-1-60198-058-8
© 2007 A. Taubin, J. Cortadella, L. Lavagno,
A. Kondratyev and A. Peeters All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without prior written permission of the publishers. Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The ‘services’ for users can be found on the internet at: www.copyright.com For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1-781-871-0245; www.nowpublishers.com;
[email protected] now Publishers Inc. has an exclusive license to publish this material worldwide. Permission to use this content must be obtained from the copyright license holder. Please apply to now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail:
[email protected]
Foundations and Trends® in Electronic Design Automation, Volume 2, Issue 1, 2007

Editorial Board

Editor-in-Chief: Sharad Malik, Department of Electrical Engineering, Princeton University, Princeton, NJ 08544

Editors: Robert K. Brayton (UC Berkeley), Raul Camposano (Synopsys), K.T. Tim Cheng (UC Santa Barbara), Jason Cong (UCLA), Masahiro Fujita (University of Tokyo), Georges Gielen (KU Leuven), Tom Henzinger (EPFL), Andrew Kahng (UC San Diego), Andreas Kuehlmann (Cadence Berkeley Labs), Ralph Otten (TU Eindhoven), Joel Phillips (Cadence Berkeley Labs), Jonathan Rose (University of Toronto), Rob Rutenbar (CMU), Alberto Sangiovanni-Vincentelli (UC Berkeley), Leon Stok (IBM Research)
Editorial Scope

Foundations and Trends® in Electronic Design Automation will publish survey and tutorial articles in the following topics:
• System Level Design
• Physical Design
• Behavioral Synthesis
• Circuit Level Design
• Logic Design
• Reconfigurable Systems
• Verification
• Analog Design
• Test
Information for Librarians

Foundations and Trends® in Electronic Design Automation, 2007, Volume 2, 4 issues. ISSN paper version 1551-3939. ISSN online version 1551-3947. Also available as a combined paper and online subscription.
Foundations and Trends® in Electronic Design Automation, Vol. 2, No. 1 (2007) 1–133
© 2007 A. Taubin, J. Cortadella, L. Lavagno, A. Kondratyev and A. Peeters
DOI: 10.1561/1000000006
Design Automation of Real-Life Asynchronous Devices and Systems

Alexander Taubin, Boston University, USA, [email protected]
Jordi Cortadella, Universitat Politècnica de Catalunya, Spain, [email protected]
Luciano Lavagno, Politecnico di Torino, Italy, [email protected]
Alex Kondratyev, Cadence Design Systems, USA, [email protected]
Ad Peeters, Handshake Solutions, The Netherlands, [email protected]
Abstract

The number of gates on a chip is quickly growing toward and beyond the one billion mark. Keeping all the gates running at the beat of a single clock, or of a few rationally related clocks, is becoming impossible. In static timing analysis, process variations and signal integrity issues stretch the timing margins to the point where they become too conservative and result in significant over-design. The importance and difficulty of such problems push some developers to once again turn to asynchronous alternatives. However, the electronics industry is for the most part still reluctant to adopt asynchronous design (with a few notable exceptions), due to a common belief that we still lack commercial-quality Electronic Design Automation tools (similar to the synchronous RTL-to-GDSII flow) for asynchronous circuits.
The purpose of this paper is to counteract this view by presenting design flows that can tackle large designs without significant changes with respect to the synchronous design flow. We limit ourselves to the four design flows that we believe to be closest to this goal. We start from the Tangram flow, because it is the most commercially proven and it is one of the oldest from a methodological point of view. The other three flows (Null Convention Logic, de-synchronization, and gate-level pipelining) can be considered together as asynchronous re-implementations of synchronous (RTL- or gate-level) specifications. The main common idea is to replace the global clock with local synchronization. Their most important aspect is that they open the possibility of implementing large legacy synchronous designs in an almost “push-button” manner, where all the asynchronous machinery is hidden, so that synchronous RTL designers do not need to be re-educated. These three flows offer a trade-off ranging from very low overhead, almost synchronous implementations to very high-performance, extremely robust dual-rail pipelines.
Contents
1 Introduction
  1.1 Requirements for an Asynchronous Design Flow
  1.2 Motivation for Asynchronous Approach
  1.3 Asynchronous Design
  1.4 An Overview of Asynchronous Design Styles
  1.5 Asynchronous Design Flows
  1.6 Paper Organization

2 Handshake Technology
  2.1 Motivation
  2.2 Functional Design
  2.3 Design Language Haste
  2.4 Handshake Circuits
  2.5 Handshake Implementations
  2.6 Library Connection
  2.7 Simulation and Verification
  2.8 Structural Design Flow
  2.9 Physical Design

3 Synchronous-to-asynchronous RTL Flow Using NCL
  3.1 Overview
  3.2 Null Convention Logic
  3.3 NCL Design Flow with HDL Tools
  3.4 DIMS-based NCL Design Flow
  3.5 NCL Flow with Explicit Completeness
  3.6 Verification of NCL Circuits

4 De-synchronization: Simple Mutation of Synchronous Circuit into Asynchronous
  4.1 Introduction
  4.2 Signal Transition Graphs
  4.3 Revisiting Synchronous Circuits
  4.4 Relaxing the Synchronous Requirement
  4.5 Minimum Requirements for Correct Asynchronous Communication
  4.6 Handshake Protocols for De-synchronization
  4.7 Implementation of Handshake Controllers
  4.8 Design Flow
  4.9 Why De-synchronize?
  4.10 Conclusions

5 Automated Gate Level Pipelining (Weaver)
  5.1 Automated Pipelining: Motivation
  5.2 Automated Gate-Level Pipelining: General Approach
  5.3 Micropipeline Stages: QDI Template
  5.4 Design Flow Basics
  5.5 Fine-Grain Pipelining with Registers
  5.6 Pipeline Petri Net Model of the Flow
  5.7 Weaving: Simple Examples

6 Applications and Success Stories
  6.1 Low-Power Robust Design Using Haste
  6.2 Low-Power Robust Design Using De-synchronization
  6.3 Design of Cryptographic Coprocessor

7 Conclusions

Acknowledgments

References
1 Introduction
1.1 Requirements for an Asynchronous Design Flow
With the increases in die size and clock frequency, it has become increasingly difficult to drive signals across a die following a globally synchronous approach. Process variations and signal integrity stretch the timing margins in static timing analysis to the point where they become too conservative and result in significant over-design. Asynchronous circuits measure, rather than estimate, the delay of the combinational logic, of the wires, and of the memory elements. They operate reliably at very low voltages (even below the transistor threshold), when device characteristics exhibit second and third order effects. They do not require cycle-accurate specifications of the design, but can exploit specification concurrency virtually at every level of granularity. They can save power because they naturally perform computation on-demand. Asynchronous design also provides unique advantages, such as reduced electromagnetic emission, extremely aggressive pipelining for high performance, and improved security for cryptographic devices. In addition, asynchronous design might become a necessity for
non-standard future fabrication technologies, such as flexible electronics based on low-temperature poly-silicon TFT technology [50] or nanocomputing [97]. In the next section, we will discuss in more detail the main motivations for starting to use asynchronous approaches in real-life designs. Here we just mention that the problems listed above are prompting some designers to consider asynchronous implementation strategies again, after decades of disuse. Despite all its potential advantages, asynchronous design has traditionally been avoided by digital circuit and system designers for several reasons:

• There were no good EDA tools and methodologies that completely covered the design flow (especially for large, realistic designs).
• Testing without a clock did not benefit from the well-established and reliable set of procedures ensuring that a synchronous circuit will work as expected after fabrication.
• Asynchrony required a deep change in designers' mentality when devising the synchronization among the various components of a digital system.

Custom and semi-custom design of asynchronous devices and systems, on the other hand, is a well-established academic research area, offering several well-known success stories including realistic designs (e.g., [31, 113, 36, 50, 72, 79, 80]). Some asynchronous implementations have reportedly been developed and sometimes commercialized by major design companies (e.g., Intel's RAPPID design [107]; Sun's high-speed pipelined devices used in a commercial Sun Ultra are based on results from [15, 83, 124]; an IBM/Columbia low-latency synchronous–asynchronous FIR filter chip [114, 128] was fabricated; etc.). However, until recently design and testing automation was considered a major weakness of asynchronous design approaches. Asynchronous design is a very large research domain, and it is almost impossible to cover it in depth within a single paper. The interested reader is referred to a number of excellent research papers, books, and surveys devoted to design automation tools and flows for
asynchronous circuits (e.g., [17, 18, 23, 32, 33, 64, 88]) and in general to asynchronous design (e.g., [123, 51, 81]). This article is devoted specifically to one topic: the automation of asynchronous design based on industrial-quality tools and flows. This requires support throughout the design cycle, mostly re-using synchronous tools and modeling languages due to the otherwise prohibitive investment in training and development. We also aim at dispelling a common misbelief, namely that asynchronous design is difficult and that only specially educated PhDs can do it. We show that this is not true and that there exist flows and tools that satisfy the following key requirements:

• They must be able to handle real-size designs and re-use the huge investments in tools, languages and methodologies that enable synchronous design.
• They must use standard hardware description languages (HDL), such as Verilog or VHDL, with a modeling style that does not differ much from the synthesizable subset, in order to handle legacy designs.
• They must use standard EDA tools to synthesize, map, place, and route the design, and to perform timing analysis, equivalence checking, parasitics extraction, scan insertion, automated test pattern generation, and so on.
• They must use CMOS standard-cell libraries, except when specially designed dynamic CMOS libraries are required for maximum performance and minimum power consumption.
• They must be scalable, in order to create industrial-quality and industrial-size designs.

Any change in the design methodology, no matter how small, must be strongly motivated. We believe that this is happening now, particularly since asynchrony is the most natural, powerful and effective mechanism to address manufacturing and operating condition variability. For integrated circuits at 45 nm and beyond, asynchrony is also a natural way to overcome power, performance and timing convergence issues related to the use of clock signals, supply voltage drop and delay uncertainty caused by noise.
1.2 Motivation for Asynchronous Approach
Feedback closed-loop control is a classical engineering technique used to improve the performance of a design in the presence of manufacturing uncertainty. In digital electronics, synchronization is performed in an open-loop fashion. That is, most synchronization mechanisms, including clock distribution, clock gating, and so on, are based on a feed-forward network. All delay uncertainties in both the clock tree and the combinational logic must be designed out, i.e., taken care of by means of appropriate worst-case margins. This approach has worked very well in the past, but currently it shows several signs of weakness. A designer, helped by electronic design automation tools, must estimate at every design stage (floor-planning, logic synthesis, placement, routing, mask preparation) the effect that uncertainties will have on the geometry, performance and power (or energy) of the circuit. Cycles in the design and/or fabrication flow translate into both time-to-market delays and direct non-recurrent engineering costs. In the case of mask geometry and fabrication, these uncertainties have so far had a mostly local effect, which can be translated into design rules. However, as technology progresses towards 45 nm and beyond, non-local effects, e.g., due to geometry density and orientation, are becoming more and more significant. In the case of delay and power, these uncertainties add up to significant margins (e.g., 30% was reported in [89]) in order to ensure that a sufficiently large percentage of manufactured chips works within specifications. Statistical timing analysis [3] attempts to improve the modeling of the impact of correlated and independent variability sources on performance. However, it requires a deep change in the business relationship between foundries and design houses in order to make statistical process data available to design tools. The commercial demonstrations of statistical timing analysis viability have been scarce up to now. Moreover, it is still based on improving predictions, not on actual post-manufacturing measurements. Signal integrity comes into play as both crosstalk-induced timing errors and voltage drop on power lines. Commercial tools for signal and power integrity analysis and minimization (e.g., [11, 112]) predict and
Fig. 1.1 Delay penalties in a synchronous system.
help reduce the impact of voltage drop and crosstalk noise on circuit performance. However, it is believed that up to 25% of the delay penalty may be due to signal integrity. Figure 1.1 (from [7]) summarizes the delay penalties that are typical for a state-of-the-art synchronous methodology. In addition to design flow and manufacturing process uncertainties, modern circuit-level power minimization techniques, such as Dynamic Voltage Scaling and Adaptive Body Biasing, deliberately introduce performance variability. Changing the clock frequency in order to match performance with a scaled supply voltage is very difficult, since it multiplies the complexity of timing analysis by the number of voltage steps, and the impact of variability at low voltages is much more significant. Doing the same in the presence of varying body biasing, and hence varying threshold voltages, is even more complex. Moreover, phase-locked loops provide limited guarantees about periods and phases during frequency changes, hence the clock must be stopped while the frequency is stepped up or down. It is well known that, under normal operating conditions, the delay of a CMOS circuit scales linearly with its voltage supply, while its power scales quadratically. Thus the normalized energy-per-cycle or energy-per-operation efficiency measure scales linearly with the voltage supply. However, it is very difficult to exploit this optimization opportunity to the extreme by operating very close to the threshold voltage. Two approaches have been proposed in the literature to tackle this problem with purely synchronous means. Both are based on sampling
the output of a signal which is forced to make a transition very close to the end of the clock cycle, and on slowing down the clock frequency or increasing the voltage supply if this critical sampling happens at the current voltage and frequency conditions. The Razor CPU [25] is designed with double slave latches and an XOR in each master–slave pair (thus increasing by over 100% the area of each converted latch). The second slave is clocked half a clock cycle later than the first slave. When the comparator detects a difference in values between the slaves, the inputs must have changed very close to the falling edge of the clock of the first slave, and the latch memorized an incorrect value. The Razor in that case “skips a beat” and restarts the pipeline with the value of the second latch, which is always (assuming that environmental conditions change slowly) latched correctly. An external controller always keeps voltage and clock frequency very close to this “critical clocking” condition in order to operate the processor very close to the best Vdd point for the required clock frequency, under the current temperature conditions. The approach, while very appealing for processors, has an inherent problem that makes it not applicable to ASICs. Due to the near-critical clocking, it is always possible that the first latch goes meta-stable. In that case, the whole detector and the clock controller may suffer from meta-stability problems. That case is detected with analogue mechanisms, and the whole pipeline is flushed and restarted. This is easy to do in a processor, for which flushing and restarting is already part of a modern micro-architecture. However, it is very difficult, if not impossible, to achieve the same objective automatically for a generic logic circuit. Another technique that has been proposed is to dynamically monitor the delay of the logic at the current voltage value and adjust the clock frequency accordingly (the PowerWise™ technology from National Semiconductor). It samples, with a high-frequency clock, the output of a digital delay line that toggles once per system clock cycle. This is used, more or less as in Razor, to measure the delay of the line under the current environmental conditions (temperature, Vdd, etc.). The scheme is safer than Razor, because it allows one to insert enough synchronizers after the delay line to reduce the meta-stability
danger. However, it is an indirect measure, and it requires complicated (patented) logic to monitor and control the clock speed. Asynchronous implementation, as demonstrated, e.g., in [52, 80, 90], achieves similar goals with much simpler logic, because the delay of the logic is directly used to generate the synchronization signals in a feedback-control fashion. On the other hand, several kinds of applications, in particular most of those that use complex processor architectures for part of the computation (e.g., general-purpose computing and multi-media), and several others that are tolerant to environmental variations (e.g., wireless communications), do not have to obey strict timing constraints at all times. The widespread use of caches, the difficulty of tight worst-case execution time analysis for software, and the use of multi-tasking kernels require the use of DSP algorithms that are tolerant to internal performance variations and offer only average-case guarantees. Hence, a design style in which the device provides average-case guarantees, but may occasionally go slower or faster, is quite acceptable for many applications. If the performance of that device on average is twice that of a traditionally designed device, then the performance advantage is significant enough to make a limited change in the design flow acceptable.
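The closed-loop idea shared by Razor-style double sampling and by delay-line monitoring can be sketched in a few lines of code. The following toy model is my own illustration with made-up numbers, not either product's actual control law: a detector reports when the critical path no longer fits within the cycle, and the controller uses that feedback to keep the clock period (or, equivalently, the supply voltage) just above the failure point.

import random

def run_cycle(clock_period_ns, critical_path_ns):
    # The detector fires when the logic settles later than the sampling edge.
    actual = critical_path_ns * random.uniform(0.95, 1.05)   # PVT variation
    return actual > clock_period_ns                          # timing error?

def closed_loop(critical_path_ns=10.0, cycles=2000):
    period = 20.0                      # start with a generous margin
    for _ in range(cycles):
        if run_cycle(period, critical_path_ns):
            period *= 1.02             # error detected: back off (or raise Vdd)
        else:
            period *= 0.999            # no error: creep toward the critical point
    return period

random.seed(1)
print(round(closed_loop(), 2))         # settles slightly above the 10 ns path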
1.3 Asynchronous Design
Asynchronous design can be viewed as a method to introduce feedback control for the synchronization of latches and flip–flops in a digital design. Asynchronous circuits measure, rather than estimate, the delay of the combinational logic, of the wires, and of the memory elements. Handshaking between controllers that generate the clock signal for groups of flip–flops and latches ensures the satisfaction of both setup and hold constraints, as illustrated in Figure 1.2. Asynchronous circuits also reduce electromagnetic emission with respect to equivalent synchronous ones [53, 82] because they reduce the power consumption peaks in the vicinity of clock edges. Hence they produce a flatter power spectrum and exhibit smaller voltage supply drops.
Fig. 1.2 Synchronous–Asynchronous Direct Translation: from synchronous (a) to desynchronized (b) and fine-grain pipelined (c) circuits.
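To make the handshaking idea concrete, the following behavioral sketch (my own illustration in Python, not part of any of the flows presented in this paper) models a four-phase request/acknowledge handshake between a producer and a consumer. Data is assumed to be valid while the request is asserted, as in a bundled-data protocol; the essential point is that the two sides synchronize locally, with no global clock anywhere.

class Channel:
    """One handshake channel: req/ack wires plus a bundled data value."""
    def __init__(self):
        self.req = False
        self.ack = False
        self.data = None

def producer(ch, values):
    for v in values:
        ch.data = v              # drive the data first (bundling constraint)
        ch.req = True            # phase 1: raise request
        yield
        assert ch.ack            # phase 2: acknowledge received
        ch.req = False           # phase 3: withdraw request
        yield
        assert not ch.ack        # phase 4: acknowledge withdrawn

def consumer(ch, sink):
    while True:
        if ch.req and not ch.ack:
            sink.append(ch.data)     # latch data while it is guaranteed stable
            ch.ack = True
        elif not ch.req and ch.ack:
            ch.ack = False
        yield

received = []
ch = Channel()
p, c = producer(ch, [1, 2, 3]), consumer(ch, received)
for _ in range(20):                  # crude interleaving of the two processes
    next(c)
    try:
        next(p)
    except StopIteration:
        break
print(received)                      # [1, 2, 3]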
Asynchronous circuits offer two kinds of power-related advantages. First, they provide very fine-grained control over the activation of both datapath and storage resources, in a manner that is similar to clock gating but much easier to describe, verify, and implement robustly at the circuit level. Second, they reliably operate at very low voltages (even below the transistor threshold), when device characteristics exhibit second and third order effects. Synchronous operation becomes virtually impossible under these conditions [90] because: • Library cells are seldom characterized by the manufacturer at such extreme operating conditions. Hence the normal synchronous ASIC design flow is unsuitable to guarantee correct operation. • The transistor electrical models deviate significantly from those used under nominal conditions and make a straightforward scaling of performance and power impossible, or at least very risky. • The effects of various random or hard-to-predict phenomena, such as threshold voltage variations, wire width variations, and local voltage supply variations due to IR drop, are dramatically magnified. All this means that, even if one were able to use the traditional synchronous flow for circuits that will operate at a voltage supply that is close to or even below the transistor threshold voltage, the performance margins that one would have to use to ensure correct operation would be huge. The robustness of asynchronous circuits to delay variations allows them to run at very low voltage levels. Recently, Achronix reported that
an asynchronous FPGA chip, built in 90 nm CMOS, reliably operates with a supply of just 0.2 V and exhibits an 87% power consumption reduction by scaling the supply voltage from 1.2 V to 0.6 V [2].
1.4 An Overview of Asynchronous Design Styles
Clocking is a common, simple abstraction for representing the timing issues in the behavior of real circuits. Generally speaking, it lets designers ignore timing when considering functionality. Designers can describe both the functions performed and the circuits themselves in terms of logical equations (Boolean algebra). In general, synchronous designers do not need to worry about the exact sequence of gate switching as long as the outputs are correct at the clock pulses. In contrast, asynchronous circuits must strictly coordinate their behavior. Logic synthesis for asynchronous circuits not only must handle circuit functionality but must also properly order gate activity (switching). The solution is to use functional redundancy to explicitly model computation flows without using abstract means such as clocks. Using logic to ensure correct circuit behavior under any delay distribution can be costly and impractical. Therefore, most asynchronous design styles use some timing assumptions to correctly coordinate and synchronize computations. These assumptions can have different degrees of locality, from matching delays on some fanout wires, to making sure that a set of logic paths are faster than others. Localized assumptions are easier to meet in a design because they simplify timing convergence and provide more modularity. But ensuring the correctness of such assumptions can be costly because it requires more system redundancy at the functional level. Asynchronous design styles differ in the way they handle the trade-off between locality of timing assumptions [121] and design cost. The main asynchronous design flows rely on the following assumptions: • Delay-insensitive (DI ) circuits [86] impose no timing assumptions, allowing arbitrary gate and wire delays. Unfortunately, the class of DI implementations is limited and impractical [77].
• Quasi-delay-insensitive (QDI ) circuits [76] partition wires into critical and noncritical categories. Designers of such circuits consider fanout in critical wires to be safe by assuming that the skew between wire branches is less than the minimum gate delay. Designers thus assume these wires, which of course must be constrained to physically lie within a small area of the chip, to be isochronic. In contrast, noncritical wires can have arbitrary delays on fanout branches. • Bundled-delay (BD ) circuits assume that the maximum delay of each combinational logic island is smaller than that of a reference logic path (usually called matched delay) [121]. Matched delays are implemented using the same gate library as the rest of the datapath and they are subject to the same operating conditions (temperature, voltage). This results in consistent tracking of datapath delays by matched delays and allows one to reduce design margins with respect to synchronous design (bundled-data protocols). As Figure 1.3 shows, the locality of timing assumptions decreases, from DI systems (which make no global assumptions) to synchronous circuits.
Fig. 1.3 Functional redundancy and locality of timing assumptions in asynchronous designs.
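As an illustration of how the QDI styles above avoid timing assumptions on data wires, here is a sketch (my own, showing the generic dual-rail idea rather than any specific flow in this paper) of dual-rail encoding with a completion detector: the receiver knows that a word has arrived because every bit has left the NULL state, regardless of wire delays.

NULL = (0, 0)                    # spacer: no data present on this bit

def encode(bit):
    """Encode one Boolean value on a dual-rail pair (rail0, rail1)."""
    return (0, 1) if bit else (1, 0)

def decode(pair):
    assert pair != (1, 1), "illegal dual-rail code"
    return None if pair == NULL else bool(pair[1])

def complete(word):
    """Completion detector: every dual-rail pair carries a valid code word."""
    return all(pair not in (NULL, (1, 1)) for pair in word)

word = [NULL] * 4                # a 4-bit word arrives one bit at a time
for i, bit in enumerate([True, False, True, True]):
    word[i] = encode(bit)
    print(i, complete(word))     # False, False, False, True
print([decode(p) for p in word]) # [True, False, True, True]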
The imposed timing assumptions help in differentiating asynchronous implementations. To further categorize asynchronous design flows, one needs to examine how two key issues are addressed: (a) how a designer expresses his/her intent, i.e., the design specification, and (b) how a designer proceeds with synthesis. The spectrum of the flows described in this paper ranges from "think asynchronously — implement asynchronously" to "think synchronously — implement almost synchronously." Table 1.1 gives a high-level picture of the flows considered.
1.5 Asynchronous Design Flows
Table 1.1 High-level view of the presented fully automated design flows.

Design flow: Haste
  Design style: From QDI to Bundled Data
  Specification style: Asynchronous high-level and RTL (CSP-based)
  Type of synthesis: Asynchronous
  Implementation library: Asynchronous DesignWare mapped to standard cells
  Summary: Think asynchronously — implement asynchronously

Design flow: NCL
  Design style: QDI
  Specification style: Synchronous RTL for datapath, asynchronous for control
  Type of synthesis: Synchronous + scripts to map into async. library
  Implementation library: Custom NCL. Possible to extend to standard cells
  Summary: Think asynchronously — implement almost synchronously

Design flow: Desync.
  Design style: Bundled Data
  Specification style: Synchronous RTL
  Type of synthesis: Synchronous + scripts to implement local clocking
  Implementation library: Standard cells
  Summary: Think synchronously — implement almost synchronously

Design flow: Fine grain pipeline (Weaver)
  Design style: QDI
  Specification style: Synchronous RTL
  Type of synthesis: Synchronous + scripts to map into pipeline cells
  Implementation library: Custom (dynamic logic). Possible to extend to standard cells
  Summary: Think synchronously — implement asynchronously

The ability to specify a design at a relatively high level, roughly equivalent to the Register Transfer Level (RTL), is essential for designer productivity today. Two basic approaches have been proposed in the asynchronous domain for this purpose.

(1) The first one is to use an asynchronous HDL, generally based on the formal model of Communicating Sequential Processes (CSP [44]), because it fits very well the underlying asynchronous implementation fabrics [75, 134]. By nature, asynchronous circuits are highly concurrent and communication between modules is based on handshake channels (local synchronization). Haste, the design language used by the flow in Chapter 2, offers a syntax similar to behavioral Verilog, and in addition has explicit constructs for parallelism, communication, and hardware sharing. Starting from Haste, one can compile to behavioral Verilog for functional verification. Its main forte is the ability to achieve all the advantages of asynchronous implementation, e.g., by controlling the activation of both control and datapath units (and hence power and energy consumption) at a very fine level, with direct and explicit HDL support. Several groups have tried to combine RTL with handshake-channel-based specifications, in particular by adding channels as an extension of standard HDLs (e.g., Verilog or VHDL) [103, 108, 109]; this can be considered an HDL-level automation of the Caltech group's ideas [79]. Unfortunately, this approach, while attractive from a theoretical point of view, requires re-education of RTL designers and rewriting of existing specifications, which is not realistic for big and expensive designs.

(2) The other approach, which we call Synchronous–Asynchronous Direct Translation (SADT), starts from a synchronous synthesizable specification, translates it into a gate-level netlist using traditional logic synthesis tools, and then applies a variety of techniques to translate it into an asynchronous implementation [20, 68, 70, 117]. Its main advantage is to allow re-implementation of legacy designs without any designer intervention at the HDL level. Since eliminating logic bugs takes up to 50% of design time, this can potentially translate into a significant advantage in terms of design time and cost with respect to approaches that require a significant redesign. This approach is represented by the following design styles.

(a) The Null Convention Logic (NCL) style, which is used by Theseus Logic and is described in Chapter 3, is based on a proprietary library of majority gates and produces fully delay-insensitive circuits. This has very high overhead (about 3–4× in terms of area), but is also very robust, because functional correctness is achieved independently of gate and wire delays.

(b) De-synchronization, which is described in Chapter 4, uses a micropipeline-style architecture and an interconnection of small handshaking controllers to derive the clock for groups of up to 100 latches (derived by splitting synchronous flip–flops). It has very low overhead in terms of area (between 5% and 10%), but requires careful matching of delays all the way down through physical design. Note that the same implementation architecture is used by the Handshake Solutions flow of Chapter 2, starting from the
Haste language, thus easing interfacing between logic blocks produced by one of these two flows.

(c) Automated fine-grain pipelining, presented in Chapter 5. At the finest granularity (gate-level pipelining, Figure 1.2(c)), in addition to replacing global synchronization with local handshake control, this flow also removes functionally unnecessary synchronization. The flow offers support for automated pipelining and is therefore targeted at improving the performance of the original design. In this flow [115, 117], pipelining is by default performed at the finest degree, resulting in high performance. The flow can also exploit aggressive pipelining to reduce the performance gap when the supply voltage is scaled down, while maintaining low power. For example, for a standard-cell library developed using a 180 nm TSMC process, the fine-grain pipelined cells are functional with supply voltages down to 0.6 V; a FIFO running at 780 MHz at the nominal 1.8 V drops to 135 MHz at a safe 0.8 V, with 14.2× lower power consumption.

We believe that the Synchronous–Asynchronous Direct Translation model could play a role for asynchronous design automation as important as the one RTL played for synchronous EDA. The main contribution of the RTL model to EDA is the separation of optimization and timing (all sequential behavior lies in the interaction between registers; synthesis and optimization deal only with the combinational clouds). The key idea that enables SADT flows is as follows. The RTL model (Figure 1.2(a)) is based on global synchronization and timing assumptions (computations complete in every stage before the next clock edge). During every clock cycle every latch undergoes two phases: pass and store. Master–slave flip–flops prevent the register from being transparent at any given time, but introduce the requirement to carefully control clock skew in order to avoid double-clocking failures (also called hold time violations), which are especially nasty because they cannot be fixed simply by slowing down the clock. Dynamic logic also has
a two-phase operation: evaluate and precharge (reset). These phases naturally map to asynchronous handshake (request–acknowledge) protocols [121]. The separation into phases enables, in asynchronous just as in synchronous design, a clean separation between functionality and timing. A datapath implemented using any of the techniques described in this paper behaves just like combinational logic (and is in fact just plain combinational logic in the Haste and de-synchronization flows) during evaluation. A resetting or precharge phase is used in the NCL and fine-grain pipelining flows to ensure reliable delay-insensitive or high-speed operation. In other words, asynchronous implementation does not change the externally observable behavior, and the sequence of values that appears on the boundary registers is the same before and after the application of SADT. Among other advantages (e.g., in terms of reuse of functional and timing verification simulation vectors), this enables easy interfacing with synchronous logic. The latter can be achieved by driving the clocks of the synchronous blocks with the request signals coming from the asynchronous blocks, if the following two assumptions are satisfied:

• Each synchronous block has an interface whose fanin and fanout registers are all clocked by a single asynchronous controller.
• Each synchronous block is faster than the asynchronous one that drives its clock.

For this reason, synchronous legacy logic can be used unchanged in an asynchronous fabric if it is non-critical. This is very different from GALS-based methods, which require the use of wrappers and introduce significant system-level integration problems due to the need to create ad-hoc manual handshaking between blocks that belong to different clock domains. The various SADT flows described in this paper differ in the granularity of the pipeline stages. The NCL and de-synchronization approaches retain exactly the same position of registers, and hence the same pipelining level, as the original synchronous specification. Gate-level
pipelining, on the other hand, significantly decreases the granularity of the pipelining, down to the gate level, as shown in Figure 1.2(c). All the approaches considered in this paper share several common characteristics. First of all, control is derived through a syntax-directed translation from a specification, whether it is written in Haste or in synthesizable Verilog. Second, the datapath is generated (at least initially, for the NCL and fine-grain pipelining flows) using traditional logic synthesis techniques, starting from design libraries such as DesignWare [22]. Third, physical design and implementation verification (equivalence checking, extraction, back-annotation, etc.) are essentially unchanged. Finally, testing of the datapath and of the controllers is performed mostly synchronously, thanks to the fact that timing faults in the controller networks are easy to detect with simple functional tests. Hence design-for-testability and automated test pattern generation tools and techniques can be reused almost without change. We present a variety of automated flows (Haste and the flows based on the SADT approach) because we believe that the full advantages of asynchronous design in tackling power, energy, variability, and electromagnetic emission will come from a judicious mix of:

• Fully redesigning performance- and power-critical, widely used components (e.g., microprocessors) using a language like Haste.
• Converting synchronous designs of special-purpose critical modules to asynchronous implementations, using one of the SADT techniques outlined in this paper. The choice of method will depend on whether the main goal is robustness, cost, or performance.
• Leaving non-critical components synchronous and clocking them with the handshake signals produced by the asynchronous interface controllers.
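To make concrete the earlier claim that SADT does not change the externally observable behavior, i.e., that the sequence of values seen at the boundary registers is preserved, here is a toy token-flow model of my own (not the actual de-synchronization or pipelining algorithms). Each register holds at most one token and fires only when its local handshake condition holds, yet the output sequence matches what the synchronous RTL computes, regardless of the order in which the stages happen to fire.

import random

def logic1(x): return x + 1          # stand-ins for the combinational clouds
def logic2(x): return 2 * x

def desynchronized(inputs, seed=0):
    rng = random.Random(seed)
    src = list(inputs)
    r1 = r2 = None                   # None = register holds no token
    out = []
    while src or r1 is not None or r2 is not None:
        # Local handshake condition: a stage may fire when its producer has
        # data and its own register is empty; ready stages fire in any order.
        ready = []
        if r2 is not None:
            ready.append("drain")
        if r1 is not None and r2 is None:
            ready.append("stage2")
        if src and r1 is None:
            ready.append("stage1")
        action = rng.choice(ready)
        if action == "drain":
            out.append(r2); r2 = None
        elif action == "stage2":
            r2 = logic2(r1); r1 = None
        else:
            r1 = logic1(src.pop(0))
    return out

inputs = [3, 1, 4, 1, 5]
reference = [logic2(logic1(x)) for x in inputs]   # what the synchronous RTL computes
assert desynchronized(inputs, seed=7) == reference
print(reference)                                  # [8, 4, 10, 4, 12]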
1.6 Paper Organization
We will start our review by considering in Chapter 2 the most radical design approach that has been applied so far to real-life designs. It is based on the Haste language, and commercialized by Handshake
Solutions. It is radical because it starts from a non-standard, asynchronous-specific model and requires some specific education for designers. However, this non-standard HDL can lead to rather efficient solutions that may be unreachable from standard HDL specifications, and it has been successfully used for real, large designs. Then we will describe the SADT approaches, starting from the pioneering NCL technique [57, 68] discussed in Chapter 3. NCL was the first approach to asynchronous design exploiting the idea of synthesizing large designs using commercial EDA tools. NCL circuits are dual-rail to enable completion detection. They are architecturally equivalent to the RTL implementation. However, full synchronization of completion detection at each register implies that NCL circuits are significantly slower and larger (by a factor of 2 to 4) than the synchronous starting point. Reducing NCL overheads moves us to de-synchronization [19, 20], in Chapter 4, which uses delay matching in order to achieve a good compromise between robustness, performance, and cost. When run at their worst-case speed, de-synchronized designs exhibit an almost negligible overhead with respect to synchronous ones. On the other hand, when run at the actual speed allowed by the current process, voltage, and temperature conditions, they can dramatically reduce the delay margins required by synchronous design. None of the above approaches offers support for automated pipelining; therefore, they cannot directly improve this aspect of the performance equation. Gate-level pipelining [117, 118], on the other hand, can pipeline at the level of individual gates, thus achieving performance levels that are virtually impossible to match with synchronous designs. We present this flow in Chapter 5. Chapter 6 is dedicated to design examples that illustrate both the achievable results and the possible application areas of the various design flows. Finally, Chapter 7 presents some conclusions on the opportunities offered by asynchronous circuits and flows.
2 Handshake Technology
2.1 Motivation
The Handshake Technology design flow is aimed at making clockless circuit technology available to the IC community in general, and not only to asynchronous circuit experts. This has led us to develop a complete tool set that goes all the way from a high-level design entry down to a circuit-level realization. The design flow has been set up in such a way that new tools are developed only where needed. Furthermore, it is designed to be complementary to and compatible with standard design flows. In particular, the Handshake Technology design flow reuses the following: • Cell library. Any standard-cell library can be connected, as no dedicated “asynchronous” cells are needed, cf. Section 2.6. • Memories. Standard (on-chip) memories can be interfaced with directly in a glue-less manner. • Optimization. Combinational datapath logic can either be incorporated directly (and the tools will create a handshake wrapper around it) or it can be generated using the design entry language. Logic optimization can be done using 19
tools designed for synchronous flows, since the datapaths are implemented using single-rail logic. • Test-pattern generation. Although asynchronous circuits will always require a dedicated Design-for-Test solution, the test-pattern generation part can be done using standard ATPG (automatic test-pattern generation) tools, provided that a scan-like solution is implemented, as we did in the Handshake Technology flow, cf. Section 2.8.2. • Placement and routing. Since the design flow uses only standard cells, we can reuse existing placement and routing tools to do the layout. Care is required for the correct handling of timing assumptions and combinational feedbacks, cf. Section 2.9. • Timing analysis. Timing characterization of digital circuits has become a science in itself, and we are happy that we can reuse these tools as well, by cutting all combinational loops before timing analysis and by specifying different dummy clocks. • FPGA prototyping. Although in principle asynchronous circuits can be mapped onto an FPGA as well, we have chosen to use a clocked implementation of handshake protocols for prototyping. This results in standard clocked circuits which can be mapped onto FPGAs directly and serve as functional prototypes. Naturally, the timing behavior on an FPGA will be different from that on final silicon.
The Handshake Technology design flow does not consist only of scripts around standard third-party EDA tools. Dedicated tools and representations have also been developed, specifically in the following areas: • Design entry. A new design-entry language — called Haste, formerly known as Tangram — has been developed to enable the expression of asynchronous behavior and communication. • Handshake circuits. An intermediate architecture called Handshake Circuits has been defined, in which we still
abstract from the implementation details and the handshake protocols that are used. • Netlist generation. We have chosen to implement netlist generators for handshake circuits, where we select a specific handshake protocol for each handshake channel and component, and then generate a circuit implementation in terms of standard-cells from a pre-selected standard-cell library. This step could alternatively be implemented through the use of asynchronous control synthesis tools such as Petrify [17, 64] and Minimalist [32, 33]. • Design for Test. Design-for-Test of asynchronous circuits cannot be handled by synchronous scan insertion tools directly, as these cannot work with combinational feedback loops that are abundant in any asynchronous circuit, for instance inside Muller-C [86] and mutual-exclusion elements [111]. A new test solution has therefore been developed that in itself is as compatible as possible with synchronous scan-based test. The remainder of this chapter goes into more detail regarding the innovative aspects of the Handshake Technology design flow. It addresses both the dedicated “asynchronous” tools and representations that have been developed, and the interfacing to “synchronous” tools.
2.2 Functional Design
The functional part of the design flow, during which the designer’s focus is on the functional correctness of the design, is shown schematically in Figure 2.1. In this diagram, boxes denote design representations and test benches, and the oval boxes refer to tools. The two central tools are htcomp, which compiles a Haste program into a handshake circuit, and htmap, which compiles a handshake circuit into a Verilog netlist. Other tools and representations also play a role during functional design.
2.3 Design Language Haste
The goal of the Handshake Technology design flow is to make the benefits of clockless IC design available to designers that themselves are
Fig. 2.1 Functional design flow.
not asynchronous circuit experts. Even more, we want to make Handshake Technology available to system designers and application specialists as well, rather than to ASIC/IC designers only. The design language thus has to abstract from the physical implementation, such as the handshake protocols and the data-valid schemes that are being used. Therefore it needs to be a behavioral rather than a structural design language. Three essential requirements for the design language — requirements that were not available in any design language when Haste was conceived — are: (1) Parallelism must be supported at any granularity, at the process level, at the level of atomic actions, and anything in between. Arbitrary nesting of parallel and sequential operations should also be supported. (2) Channels should be supported for communication (and synchronization) between parallel processes. For such channels, send and receive actions should be available as atomic
communication actions. The representation of channels and their communication actions (send, receive) in the language must be independent of their implementation (which could be clocked or clockless and employ many different data-encoding options). (3) The parameter mechanism (for, e.g., modules, functions, procedures) should support channels as well. More precisely, it must be possible to refer to such a channel parameter as an atomic entity. Since, at the implementation level, a channel is a bidirectional object (with both input and output signals in it), most languages do not support this by design. The combination of the above requirements was not found in any programming or design language. In addition, no language was found where this could be added through an abstraction mechanism. The considerations mentioned above could have led to the choice of starting from a standard language and then supporting an extended subset. Such an approach was taken, for instance, by Celoxica when developing Handel-C [10]. This choice typically reflects the expectation that, in order to save on engineering (re)training, a standard language would be preferable. Indeed, a standard language would offer the advantage of a familiar syntax. However, there are some disadvantages to a standard language as well. First of all, there is no consensus as to which standard language should be chosen. We have had requests for C, System-C, Verilog, SystemVerilog, and VHDL. Each language has its own proponents, depending on the background of the teams, which may have a focus on hardware, software, or system-level design. Supporting a multitude of extended subsets of different input languages could be an option, but it is quite a challenge to maintain consistency between the expectation levels and capabilities of such an approach. Additionally, for each language, a designer will typically have a standard way of using it. That may be a handicap when that language is also going to be used to design clockless circuits, for which a new way of using the language would have to be learned. Having a new syntax may be an advantage in this learning process, as it makes the designer aware of the new approach.
With the choice of developing a design language that strikes a good balance between the asynchronous target implementation and the high abstraction level that was envisioned, it has been possible to demonstrate the high-level design capabilities of an asynchronous circuit design style. Haste has been set up such that it is fully synthesizable: any Haste program can be compiled into a silicon implementation. Since the work on Haste and its silicon compiler started, the offering of design languages has changed. Whether, in hindsight, a new design language has been the optimal choice remains to be seen. It is definitely intriguing to see that academic design environments that have been developed to mimic the Haste environment, such as the Balsa system, have chosen to develop additional new languages. It is likely that, with the emergence of high-level modeling languages like System-C and SystemVerilog, at some point the Haste compiler will be tuned so as to also accept a SystemVerilog or System-C syntax, in combination with coding-style guidelines. A complete overview of the design language Haste can be found in the Haste manual [21]. Below, we highlight some key aspects that make Haste a unique and powerful design language.

2.3.1 Haste as a Behavioral Design Language
The gain in code size that is reported by designers using Haste is to a large extent due to the behavioral aspects of the language. In contrast to an RTL description, Haste does not require alignment and scheduling of actions to a global event such as a clock edge. One can freely mix actions of different grain size, as in

x := 0          // ‘fast’ (reset)
y := y + a*z    // ‘slow’ (multiply-acc),

which combines a fast reset action with a slower multiply-accumulate. In a synchronous circuit, the reset would be slowed down to the slowest action in the circuit, so at least to the time needed for the multiply-accumulate. In an asynchronous circuit, such a fast action can actually be exploited. The non-uniform timing does not
only allow microtiming at a finer granularity than what one would achieve in a uniformly clocked circuit, it also allows exceptions to take more time. In a synchronous circuit, significant effort goes into fitting even extreme cases into the clock period, which may result in a serious investment in circuit area and operation energy for that exception. Asynchronous circuits can accommodate such slow actions naturally. Actions can have different durations that may even be data dependent. This may come in the form of an if-statement, where the different branches may have (completely) different completion times, leading to completion times of the if-statement that depend on the actual value of the guard, as in:

if x>y then x:=0 else y:=y+x*z fi

It may also come in the form of a data-dependent while loop, where the number of iterations depends on the value of the guard and its evolution over time, such as in the following implementation of the greatest common divisor algorithm:

do x<>y then if x<y then y:=y-x else x:=x-y fi od
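For comparison, here is the same algorithm as a plain Python function (my own sketch, not taken from the Haste manual). The point carries over directly: the number of iterations, and hence the completion time of the corresponding hardware block, depends on the data, and an asynchronous implementation can exploit exactly this.

def gcd(x, y):
    """Greatest common divisor by repeated subtraction, counting iterations."""
    steps = 0
    while x != y:
        if x < y:
            y = y - x
        else:
            x = x - y
        steps += 1
    return x, steps

print(gcd(12, 8))    # (4, 2)   -- only a couple of iterations
print(gcd(1, 100))   # (1, 99)  -- many iterations, much longer completion time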
2.3.2 Haste as a Parallel Programming Language
Haste offers a parallel operator, denoted by “||”, that can be used at any grain size. It can be used to parallelize atomic actions such as assignments and communications, as an operator to compose processes in parallel, or at anything in between, and it can be nested arbitrarily in combination with the sequential composition operator (which is denoted by “;”). This parallel operator has been inherited from CSP (Communicating Sequential Processes [43, 44]). An example that illustrates the orthogonal use of parallel and sequential composition is shown next:

( (x:=x+1 || y:=0) ; ( ( (x:=x+1 ; y:=x) || z:=z+x0*x1 ) || if p
parallel composition of two one-place fifos (or pipeline stages), resulting in a two-place fifo:

forever do a?x ; b!x od || forever do b?y ; c!y od

Haste supports both broadcast and narrowcast channels. The difference between these two is how parallel input (receive) actions on such a channel are handled. On a broadcast channel, all receivers participate in each communication action, whereas on a narrowcast channel, exactly one receiver participates. The latter may require arbitration, which can be made explicit in Haste. Arbitration also plays a role in the parallel composition of output (send) actions in Haste, where arbitration may also be needed, and the designer can help the compiler in avoiding unneeded arbitration by making this explicit in the program. An example of the use of a narrowcast channel is the following fragment, which has arbitration for parallel outputs (indicated by arb!) of the values 0, 1, 2, and 3 on narrowcast channel a, and for the parallel inputs (indicated by arb?) into variables x0, x1, x2, and x3. Execution of this program will result in the values being distributed in a non-deterministic way over these four variables.

begin a : chan [0..3] narrow arb! arb?
| a!0 || a!1 || a!2 || a!3 || a?x0 || a?x1 || a?x2 || a?x3
end

The above example is academic. The different combinations have all proven their value in practice, for instance in the design of the asynchronous bus infrastructure as used in the implementation of an 8051 microcontroller [54].
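For readers more at home in software, the two-place FIFO shown earlier in this subsection can be mimicked with threads and bounded queues. This is an approximation of my own: a real Haste channel is an unbuffered rendezvous, so the bounded Python queues only stand in for the one place of storage held by each process.

import queue, threading

def one_place_fifo(cin, cout):
    while True:
        x = cin.get()            # a?x : receive from the input channel
        cout.put(x)              # b!x : send on the output channel
        if x is None:            # poison pill so the toy example terminates
            return

a, b, c = queue.Queue(1), queue.Queue(1), queue.Queue(1)
threading.Thread(target=one_place_fifo, args=(a, b), daemon=True).start()
threading.Thread(target=one_place_fifo, args=(b, c), daemon=True).start()

for v in (1, 2, None):
    a.put(v)
received = []
while True:
    v = c.get()
    if v is None:
        break
    received.append(v)
print(received)                  # [1, 2]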
Haste as an Asynchronous Design Language
Since the design language Haste is primarily targeted at the design of clockless circuits, it also has some features that allow the exploitation
of asynchrony, such as constructs for waiting on events and conditions, and a construct to obtain a digitally stable sample of an unreliable input signal. In the design of interfaces, edge constructs are typically quite useful. Like Verilog, Haste offers three edge constructs: one for rising edges, one for falling edges, and one to detect any edge, as shown in the following three examples:

posedge(clk)
negedge(reset)
edge(intrq)

One may observe that the expressions used in such edge constructs cannot be handshaked, but are direct signals that are considered to be “asynchronous” inputs. When activated, the edge constructs will terminate (or complete) when a corresponding signal change is observed. It should be noted that this will typically involve potentially metastable behavior, which has to be handled correctly during the gate-level implementation of the corresponding handshake components. The edge constructs are based on sequential or parallel composition of wait constructs. As in Verilog, the wait construct completes only when the associated condition is true. Since a wait could be activated when the condition is just changing, the associated handshake component will also have to deal with this. In the code fragment

wait(x<>y)

the wait completes only when signals x and y differ, independent of the condition (or of changes therein) at the activation of this wait construct. In addition, Haste offers the possibility for the designer to express that such arbitration is not required, as in the following code, which assumes that signal clk is not rising when the wait is activated:

wait(-clk, narb)

A generalization of the wait construct is also offered, through the select construct, which may also be considered as a combination of an
if and wait:

    sel x<>c then x := c or r then x := 0 les

The last non-handshake construct is the sample, which is needed to make reliable copies of an asynchronously changing (external) signal. If such a signal is latched or copied when sampled, a stable digital copy is made of the sampled expression, even if the expression is changing at the moment of inspection, as in:

    x := sample(c)

Sampling of unstable input data is important not only when making a copy of such data for storage, but especially when such an unstable input enters the control, e.g., when used in a conditional statement:

    if sample(c) then x := x+1 fi
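Returning to the wait construct, the sketch below gives a simulation-only Verilog impression of a four-phase handshake component that implements wait on a condition. The module and signal names are ours, not output of the Haste compiler, and the metastability handling discussed above is not modeled.

    // Behavioral sketch (assumed names): a four-phase "wait" component.
    // It raises its acknowledge only once the asynchronous condition is
    // observed high while the request is active, and returns to zero
    // when the request is withdrawn.
    module wait_component (input req, input cond, output reg ack);
      always @(*) begin
        if (!req)             ack = 1'b0;   // return-to-zero phase
        else if (req && cond) ack = 1'b1;   // condition observed: complete
        // otherwise: hold (still waiting for the condition)
      end
    endmodule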
2.4 Handshake Circuits
Handshake circuits form the intermediate architecture between a high-level design language such as Haste or Balsa, and the gate-level implementation as a VLSI circuit [134]. The starting point in the syntax-directed compilation approaches that have been developed around handshake circuits is to have at least a set of handshake components that corresponds to the syntactic operators in the associated design language. These handshake components abstract from the actual gate-level implementation, and merely provide a specification in the form of a handshake protocol.

2.4.1 Handshake Components
A library of some 30 components suffices to construct a sufficiently powerful design language and a compiler into handshake circuits. The set of handshake components that is used may vary depending on which constructs are offered in the design language and how the compilation
Fig. 2.2 Handshake components: run (left), sequencer (middle), and parallel (right).
scheme is set up. In that sense, there is for instance a slight difference between the set of handshake components used for Haste and that used for Balsa [6]. Three simple components that are always used are the run component, the sequencer, and the parallel. Symbols for these are shown in Figure 2.2. For a more extensive overview and introduction to handshake components we refer to [75, 94, 96, 121, 134]. The run component has only a passive port a. A passive port is drawn as an open circle, and refers to a handshake port that plays the passive role in the handshake protocol: in each handshake it first waits for a request and later sends the acknowledge. When activated via port a, that is, when receiving a request along this port, it will react by giving an acknowledge. This component corresponds to a skip construct in a design language, which is needed, for instance, as the default alternative in an if-then construct (without an else branch). The sequencer has one passive port a and two active ports, b and c. An active port is drawn as a filled circle, and refers to a handshake port that plays the active role in the handshake protocol: in each handshake it first sends a request and waits until the associated acknowledge arrives. When activated via a, the sequencer first performs a handshake via b and then via c. It is used to control the sequential execution of commands connected to b and c. After receiving a request along a, it sends a request along b, waits for the corresponding acknowledge, then sends a request along c, waits for the acknowledge on c, and finally signals completion of its operation by sending an acknowledge along channel a. The sequencer corresponds to a sequential composition of operations in the design language, typically denoted by ";." The parallel component, when activated by a request along a, sends requests along channels b and c concurrently, waits until both
Fig. 2.3 Handshake circuit for while loop.
acknowledges have arrived, and then sends an acknowledge along channel a. The parallel corresponds to parallel composition of operations in the design language, typically denoted by "||." The compilation of a while loop do Guard then Command od can be implemented by using a do-component to abstract from the control structure, as shown in Figure 2.3. The do-component, when activated, collects the value of the guard. When the guard is false, it completes the handshake on its passive port; otherwise it activates the command, and after its completion re-evaluates the guard to start a new cycle. Another essential handshake component is one that implements storage of data in registers, which enables the use of variables in the design/programming language. Also, the use of channels for synchronized handshake communication requires a synchronization component.
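To make the sequencing protocol above more concrete, the following simulation-only Verilog sketch mimics the four-phase behavior of the sequencer. It is not the gate-level handshake component of the actual flow, and all names are illustrative.

    // Behavioral model of the sequencer protocol: port a is passive,
    // ports b and c are active, and all handshakes are four-phase.
    module sequencer_beh (input a_req, output reg a_ack,
                          output reg b_req, input b_ack,
                          output reg c_req, input c_ack);
      initial begin
        {a_ack, b_req, c_req} = 3'b000;
        forever begin
          wait (a_req);                // passive port: wait for a request
          b_req = 1;  wait (b_ack);    // full handshake on b first ...
          b_req = 0;  wait (!b_ack);
          c_req = 1;  wait (c_ack);    // ... then a full handshake on c
          c_req = 0;  wait (!c_ack);
          a_ack = 1;  wait (!a_req);   // finally complete the handshake on a
          a_ack = 0;
        end
      end
    endmodule

The parallel component would differ only in raising b_req and c_req together and waiting for both acknowledges before completing the handshake on a.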
2.4.2 Optimizations
The use of a handshake component per syntactic construct in the language leads to a complete solution, but not necessarily an optimal one. Different types of optimizations exist [12, 56, 96]. First of all, generalizations of handshake components are useful. In both Haste and Balsa, the parallel and the sequential composition operator are used as binary operators. A straightforward step is to use generalized versions of these operators that can compose any number of commands. This can be introduced by offering N-way sequential and
parallel composition as operators in the language (such as was done for OCCAM), or by a simple peephole optimization step. A second option is to combine different components into one, thereby offering handshake components that implement a combination of language constructs. In the handshake design flow this has been applied at several places, for instance to enable optimized pipeline implementations.
2.5 Handshake Implementations
At the level of handshake components and handshake circuits, it is not yet specified how the handshake protocol should be implemented; it is only dictated that a handshake protocol should be used. Many different implementations of handshake protocols are possible, as long as a relation between the events that are used to implement a protocol and the protocol itself can be defined [94]. Some of the choices that can be made are:

• Combining request and acknowledge events on a single wire (known as single-track signaling [135]), or using separate wires. In the case of single-track implementations, circuits will have tri-state operation, and different options are available as to how robust these will be.

• When separate wires are used, the number of phases in the handshake protocol implementation still needs to be selected. Two popular options are two-phase and four-phase implementation. In the former case, each event is relevant for the handshake protocol; in the latter case, half of the events are redundant, and three different choices can be made as to which events are relevant and which are redundant. The choices are referred to as early (when the first two events are relevant for the protocol, and the last two serve as a return-to-zero cycle), broad (when only the first and the last events are relevant and the middle two are redundant), and late (when the first two events are redundant and the last two define the handshake protocol that is implemented).
In addition to the two asynchronous options mentioned above, one can also opt for a sample-based implementation of the handshake protocol. In this case, a separate event (for instance a periodic clock) is used to observe the state of the handshake wires and, based on these samples, determine how the handshake protocol is evolving, that is, whether request or acknowledge events have occurred. This is in contrast with the transition-based interpretation of request and acknowledge wires, where the transitions on these signals directly translate into handshake events. Sample-based implementations lend themselves to synchronous implementation, and allow for glitches on the handshake wires between sample moments [95].

• The encoding of data in handshakes can also be approached in different ways. A whole range of delay-insensitive data encodings is available [143], of which double-rail and 1-out-of-4 may be the most popular ones. In practice, so-called single-rail data encoding [111] is most popular. In this case one uses only a single wire per bit, allowing standard combinational logic to implement datapath functions, and an extra wire to indicate the validity of this data. This data-valid function can either be combined with the request or with the acknowledge signal.

The choices for control and data are orthogonal, and different combinations may be useful. Two options that are in use today in the handshake technology design flow that is outlined in this paper are four-phase single-rail (targeted at VLSI implementation [96], and sketched below), and synchronous single-rail (targeted at FPGA prototyping [95]). Two other options that have been implemented as part of the handshake technology design flow are four-phase double-rail and single-track single-rail. Efficient and reliable implementation of single-track circuits requires the use of dedicated standard cells, whereas double-rail can also be implemented efficiently with "synchronous" standard-cell libraries [136].
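The following simulation-only Verilog fragment illustrates what a four-phase, single-rail (bundled-data) push channel with early data validity looks like from the sender's side. The module name, signal names, and the fixed #5 matched delay are assumptions for illustration, not part of the actual flow.

    // Four-phase single-rail sender: data is bundled with the request and
    // must be stable before req rises (matched-delay assumption).
    module bundled_sender (output reg req, input ack, output reg [7:0] data);
      initial begin
        req = 0; data = 0;
        forever begin
          data = data + 8'd1;        // produce the next value
          #5;                        // matched delay: data settles first
          req = 1;   wait (ack);     // events 1 and 2 carry the protocol ("early")
          req = 0;   wait (!ack);    // events 3 and 4 form the return-to-zero cycle
        end
      end
    endmodule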
More exotic combinations are also possible, such as synchronous double-rail or single-track 1-out-of-N [28, 29]. Most notably, when using four-phase control in combination with single-rail data encoding, many different variants are possible, and one may want to determine the optimal protocol per handshake channel and handshake component, rather than opting for one rigidly fixed alternative.
2.6 Library Connection
The Handshake Technology design flow is set up to be complementary to and compatible with third-party EDA flows that are already up and running at typical semiconductor and electronics companies and design teams. The design environments that we aim at are standard-cell design flows. Design teams typically use a variety of tools from established companies such as Cadence, Synopsys, Mentor, Magma, and Synplicity to go from a design description at register-transfer level in Verilog or VHDL to a layout description in GDS2. A strength of the single-rail design flow that has been developed is that no dedicated "asynchronous" standard cells are needed in the standard-cell library. This strongly facilitates the connection to new libraries and the migration between different versions of a library. The connection to a specific library is established via a mapping file that specifies how the required asynchronous elements, like delay matchers, C-elements, and mutual-exclusion elements, are composed in terms of the standard-cell library at hand. The creation of a mapping file is partly automated and partly manual. Structural verification against a golden mapping file is done using third-party logic equivalence checking tools. However, the mapping onto a library also involves some analog properties, such as in the implementation of mutual-exclusion elements and delay elements. If a mutual-exclusion element is not available in the library, a mapping can be made onto a cross-coupling of NAND gates and a filter section consisting of inverters or NOR gates with shorted inputs. Inverters and NOR gates typically have a lower switching threshold than a
NAND-gate, which guarantees proper operation of the decomposition of the mutual-exclusion element.
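As a rough structural illustration of this decomposition (our naming; the actual mapping file and the analog sizing of the filter are not shown), a Verilog sketch might look as follows. Correct metastability filtering relies on the analog threshold difference discussed above, which this purely digital view cannot capture.

    // Mutual-exclusion element composed from standard cells: a cross-coupled
    // NAND latch arbitrates between requests r1 and r2; NOR gates with
    // shorted inputs act as low-threshold inverters that filter the grants.
    module mutex_sketch (input r1, r2, output g1, g2);
      wire q1, q2;
      nand u1 (q1, r1, q2);
      nand u2 (q2, r2, q1);
      nor  u3 (g1, q1, q1);   // g1 rises only once q1 has fully fallen
      nor  u4 (g2, q2, q2);   // g2 rises only once q2 has fully fallen
    endmodule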
2.7 Simulation and Verification
Simulation is a vital aspect of chip design. In the design flow, as shown in Figure 2.1, simulation is supported on three different levels of abstraction: at behavioral level, at handshake circuit level, and at gate level. For each simulation level, a simulation trace file can be generated and viewed with the tool htview. There is no simulator for the design language Haste directly. Instead, behavior-level simulation models can be created, either in behavioral Verilog or in System-C. The mapping onto these languages made it clear that the expressive power of Haste, when it comes to the different types of channels (narrowcast and broadcast, arbitrated and non-arbitrated, the support for multiple parallel senders), is stronger than what is natively supported in Verilog and System-C. It has also become obvious that, although these languages are typically considered to be standards, there are quite a few different versions, each with its own limitations. These differences are mainly found in the behavioral parts of the languages, where parallelism and communication need to be modeled. Behavioral Verilog remodeling is typically targeted at design teams that want to reuse existing test benches at different levels of design. The System-C models are typically used by system-level design teams that have at some point created a System-C model of a system, and want to verify whether the implementation that has been made is compliant with the model created earlier. Handshake circuit simulation is supported by a dedicated tool called htsim. This offers a fast, bit-accurate handshake circuit simulation based on C models of the handshake components. The modeling is both bit-level accurate and handshake-protocol accurate with respect to the four-phase single-rail [96] implementation at the gate level. This model is mostly used for initial design iterations. Channel communication is supported via file I/O. Test benches for this representation level can easily be written in Haste.
Simulation at gate-level is rather straightforward. The only exception is the simulation of arbiters. When these are available as dedicated standard cells, their behavior can be modeled correctly. In contrast, when an arbiter is constructed from a combination of gates, the correctness of this implementation will always depend on analog properties of these gates (such as thresholds). These properties are not modeled in the digital cell libraries, which may cause the cross-coupled elements in the arbiter to oscillate in simulation, a behavior that will not appear in silicon. In order to prevent simulators from erroneously demonstrating this behavior, corrective models need to be added to the gate-level netlist. These models monitor the cross-coupled element in the arbiter, and take corrective actions to prevent such oscillation. Besides functional simulation, code coverage simulation is also supported. This simulation variant, which is performed at netlist level, measures the degree to which the Haste code has been validated. Thus, code coverage simulation assures the quality of the testbench and can be used to monitor the validation progress. This tool analyzes the simulation results with respect to statement coverage (“has each Haste statement been executed”) and expression coverage (“has each required expression value been reached”). Feedback is presented at Haste level.
2.8 Structural Design Flow
The structural part of the design flow is where multiple designs can be combined into one, so as to support design reuse and modular design. This part of the flow also prepares a design for placement and routing, and includes the design-for-test transformation. An overview is given in Figure 2.4, and the different steps are discussed next.

2.8.1 Linking
The Handshake Technology design flow supports modular design through import and export constructs in Haste. This allows Haste programs to be compiled separately, and then linked later using htlink. This tool not only supports the linking of designs compiled from Haste, but can also link in combinational blocks that have been generated using, e.g., datapath compilers for multipliers and other structures.
Fig. 2.4 Structural design flow.
When the linker finds such combinational blocks, it puts a handshake wrapper and a delay line around each of them. This guarantees that during the remainder of the flow, such datapath blocks are truly integrated in a correct way.

2.8.2 Design for Test
The main roadblock toward successful widespread use of asynchronous circuits has long been the lack of a design-for-test solution of industrial quality. Testing asynchronous circuits for production faults is known to be fundamentally more difficult than testing synchronous circuits [47, 105]. The main reason for this test problem is the high autonomy of these circuits, which have only a few quiescent states and lack a global control or pacing signal. This complicates two fundamental operations required for proper testability, namely bringing the circuit into a known state and evaluating its operation from that point onward.
The asynchronous community originally focused on exploiting the request–acknowledge properties of asynchronous control circuits to keep the cost of design-for-test low [40, 78, 104]. This approach was based on the observation that, in the presence of a stuck-at fault on a control signal, the control circuit will deadlock when that signal is activated. This can be observed by a timeout. However, although this observation in itself is true, it turns out that many stuck-at (input) faults are not covered by this simple approach [40]. A more structural approach to testing of asynchronous circuits is therefore required. A rigorous full-scan solution has been developed that includes not only a solution for the datapath, but also makes the control elements (such as Muller C-elements) scan testable [127]. Scanning of C-elements was achieved by integrating a latch or flip-flop function inside these elements that can be controlled via freshly introduced clock-like control signals. It turned out that this solution was most cost effective if dedicated scan elements based on LSSD clocking and a dynamic implementation of the latching function were introduced [137]. Dynamic scan cells have two disadvantages. First of all, such cells are typically not available in standard-cell libraries, thus complicating the introduction of this solution. Second, although the elements are dynamic only during scan, and fully static during asynchronous operation, the dynamic operation leads to reduced reliability in deep-submicron technologies, and moreover complicates the use of IDDQ testing. For the latter it is required to bring the circuit into a known state using scan, and then measure the current during a halt. This halting during a dynamic mode leads to floating gates that may draw current and interfere with the IDDQ measurement. The same problem also manifests itself when the scan chain needs to drive inputs to an analog circuit during test. A new solution has therefore been introduced, in which only multiplexers are introduced in the combinational loop inside the C-elements, the so-called multiplexer-based scan test [126]. The latching function is then reused from existing latches and flip-flops in the datapath, which further reduces cost. Although optimization and fine-tuning of this approach is ongoing, the solution is already in the same league in terms
of cost as scan solutions in synchronous circuits, which also only add a multiplexer for each sequential element.
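A hedged impression of the multiplexer-based scan idea is sketched below in Verilog. The exact cell structure, signal names, and insertion point used in [126] are not given in the text, so this only illustrates the principle of inserting a multiplexer in the C-element's internal feedback loop.

    // C-element with a scan multiplexer breaking its combinational loop.
    // In functional mode (scan_en = 0) the gate behaves as y = ab + y(a+b);
    // in scan mode the state is forced from the scan path, which in the
    // actual scheme is reused from an existing datapath latch.
    module c_element_scan (input a, b, input scan_en, scan_in, output y);
      wire fb = scan_en ? scan_in : y;        // multiplexer in the feedback loop
      assign y = (a & b) | (fb & (a | b));    // set / hold function of the C-element
    endmodule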
2.8.3 Handshake Constraint Format
The Handshake Constraint Format (HCF) has been introduced as a repository to support the bookkeeping that is needed throughout the physical part of the design flow. It supports the documentation of all timing constraints that are made in the implementation, such as relative timing assumptions and the specification of timing paths in the control that run in parallel with the datapath, and can be used, for instance, to optimize (reduce) delay matching. In addition to timing specifications, the HCF file is also used to document where combinational loops should be broken in order to enable standard timing tools to generate a timing report, which can then be used to quantify and validate all the timing margins that play a role. It is also used to specify the points in the circuit where high-fanout nets are found, and which of these have minimum skew requirements, such that clock-tree synthesis steps can be applied where required during physical design.
2.9 Physical Design
A schematic overview of the physical part of the Handshake Technology design flow is shown in Figure 2.5. In this phase of the design, all tools except for htremodel are standard third-party tools. The operation of these tools is partly controlled by scripting that, based on the handshake constraint file, instructs the tools how to handle the clockless circuitry correctly. The first step in this flow is the logic optimization run. In this phase, the datapath logic may be optimized toward criteria such as low power, small area, or a specified speed target. The main control that is needed is to instruct the optimizer which hierarchical instances in the netlist it is allowed to optimize. Especially in the control circuit, where combinational feedback loops are abundant, the optimization and even rebuffering should be handled with care, hence the use of scripts and constraints. As a second step after the (optional) optimization of the
Fig. 2.5 Physical design flow.
datapath, the delay lines that are used for delay matching and other timing assumptions are pre-dimensioned based on a pre-layout timing report of the circuit. In order to establish such a timing report, the handshake constraint file also specifies where timing loops have to be broken, so as to guarantee that the required timing arcs can be found and are not accidentally broken. The pre-layout netlist that results after the optimization step is fed into the selected placement and routing tools. The layout procedure is again under the guidance of dedicated scripting that interprets the handshake constraint file and helps the hardening tool to do its job. An example of this is formed by the relative timing constraints that are specified for the single-rail datapath. After the timing of the logic blocks in the datapath is fixed, the corresponding delay elements can be fine-tuned so that they meet the absolute and relative safety margins that have been specified. Once detailed routing has completed, the sign-off procedure can start, based on extracted parasitic data to obtain a precise timing
report. As part of the sign-off process, all timing assumptions that have been specified in the handshake constraint file are verified, preferably using a timing engine that is different from the one used during placement and routing. Should this step fail, then ECO changes can be applied to fix the violations, using the same adjustment procedures as during the layout process. Timing sign-off is typically done in multiple corners. One dedicated tool is still required after layout, namely htremodel. This tool is part of the design-for-test procedure and generates remodel files of the post-layout netlist. These files can then be fed into a standard automatic test-pattern generation (ATPG) tool that produces the patterns that can later be used for post-fabrication testing and failure analysis. The physical flow described above has been used to develop many different applications in the areas of automotive, wireless, identification, and smartcards. All of these application domains are driven by the low-power capabilities of the handshake implementations. In addition, reduced electromagnetic emission and simplified integration with analog and RF parts, either in the same product, in the same package, or on the same die, are increasingly relevant.
3 Synchronous-to-asynchronous RTL Flow Using NULL Convention Logic (NCL)
3.1 Overview
Micropipelining [125] is the backbone of coordination in asynchronous designs. Similarly to clocks in synchronous systems, it provides a convenient separation between the tasks of implementing functionality and of coordinating the design. There are important differences, though. Clocks force designers to think about a specified algorithm as a walk through a state space, i.e., in a state-centric way. Moving separate pieces of data through a system is done indirectly, as a result of the system transitioning from state to state. Contrary to that, micropipelining suggests an event-centric specification of an algorithm. Many data tokens might be present in a pipeline, and the movement of each of them can be traced easily because they do not collide and are governed by very local control. This feature is the basis for the low-power promise of asynchronous systems. In micropipeline implementations the power consumption is proportional to the number of events in a system, and the lower the system activity, the less power is consumed. The latter is not necessarily true in a synchronous domain unless a designer makes significant efforts in explicitly defining power-saving modes.
The need to rethink an algorithm was considered an inevitable cost of moving to asynchronous systems. From this it is commonly concluded that EDA tools developed for synchronous systems are totally useless for asynchronous implementations. To the best of our knowledge, the NCL approach [68] was the first to challenge this misconception. This approach is based on implementing control using micropipelines and implementing functionality (the datapath) using QDI logic. QDI logic provides completion signals to indicate that all transient processes in the datapath have finished. This fits well with the request/acknowledgment protocols of micropipelines and ensures a robust implementation, able to tolerate variability of the process and/or the operating conditions. The applicability of synchronous RTL compilers to the NCL flow is based on the following two observations.

(1) The structure of a micropipeline is fully defined by the number and position of its pipeline stages. Once this decision is made, the control part of the algorithm is fully defined. This step can be directly translated into choosing the number and position of registers in an RTL specification. The fact that these registers are different from conventional synchronous flip-flops is irrelevant for the design.

(2) QDI logic has a two-phase behavior. In the first phase data is evaluated; it is followed by a reset phase preparing the datapath to evaluate the next data item. This seems to be very different from a conventional single-phase synchronous behavior.¹ However, when QDI logic is self-resetting (i.e., a reset phase is automatically inserted between any two evaluation phases), then propagating data through QDI logic is similar to propagation through synchronous logic. In this case, a designer needs to take care only of the proper function evaluation, and such a task is common to both synchronous and QDI logic.

¹ Although the mainstream logic organization relies on single-phase functioning, the idea of two-phase operation is well known in the synchronous world. Dynamic logic operates in two phases very much like QDI designs do. Note, however, that similarly to QDI designs, the automation of dynamic logic synthesis is an open problem and its support is provided mostly through custom-design CAD tools.

These observations are the basis of the automated NCL design flow, in which synchronous tools are "cheated" about the object to which they are applied and the asynchronous nature of these objects is fully concealed. The NCL flow was the first to marry QDI synthesis with synchronous RTL and CAD tools (see Table 1.1). However, due to the high penalties incurred, the practicality of this flow is limited.
3.2 Null Convention Logic

3.2.1 Delay-Insensitive Combinational Circuits
A combinational gate output reflects the gate's Boolean function after the gate delay. Certain functions let one infer the input state by observing the output. For example, output 0 for a two-input OR gate means both inputs are 0. For the other three possible inputs, the output is 1. When transitioning between two such input sets, the output can temporarily become 0 before returning to its proper value of 1. This behavior constitutes a hazard. A circuit in which hazards cannot occur under any distribution of gate and wire delays is delay insensitive. Imagine a circuit in which false transitions do not occur and where you can always infer the inputs by merely observing the outputs. One could say that the output of such a circuit has full acknowledgment of all its inputs. The notion of acknowledgment is key to ensuring delay insensitivity. If every transition at a wire or gate in a circuit translates its firing results into changes in the primary outputs, the circuit behavior does not depend on transition timing and is delay insensitive. Figure 3.1 illustrates the acknowledgment concept. In Figure 3.1(a), the transition at y does not properly acknowledge the rising input transitions at both a and b, because y changes as soon as any one of its inputs transitions, regardless of the value at the other input. If inputs a and b are represented as dual-rail pairs {a.0, a.1} and {b.0, b.1} with reset values a.0 = a.1 = b.0 = b.1 = 0, then output y properly acknowledges the same input transitions in the circuit in Figure 3.1(b) because, in this case, y does not change until the circuit asserts both a.1 and b.1.
Fig. 3.1 Illustration of acknowledgment property.
3.2.2 NCL Functioning
NCL is a specific way of implementing data communication based on delay-insensitive (DI) encoding [27, 26]. Data changes from the spacer (Null) to a proper code word (Data) in the set phase, and then back to Null in the reset phase. NCL targets simple DI encodings in which Data code words are one-hot codes (only one bit of the code can be asserted to 1), and a vector with all entries equal to 0 represents the spacer Null. For example, in dual-rail encoding, two wires, a.0 and a.1, represent each signal a. Thus, this method encodes a = 1 as a.0 = 0 and a.1 = 1, and a = 0 as a.0 = 1 and a.1 = 0. DI encoding lets the receiver determine that a code word has arrived by observing the code word itself, without appealing to timing assumptions. In particular, for dual-rail signals a.0 and a.1, an OR gate (a.0 + a.1) is the simplest detector for validating a code word at a.0 and a.1. At an architectural level, NCL systems clearly separate sequential and combinational parts, much in the same way as synchronous systems, as Figure 3.2 shows. To understand how an NCL system functions, assume that all registers are initially in the Null state and that all Ack signals are asserted to 0. When Data arrives, a register's outputs change from Null to Data, and the Data wave front propagates through a combinational circuit to the next register's inputs. Simultaneously, a completion detector checks for a Data code word at its inputs, and replies by raising the Ack signal. This signal disables the previous register's request line and prepares the register for storing the next Null wave front.
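To make the dual-rail convention concrete, the following behavioral Verilog sketch (our names; simulation model only) shows the per-bit OR detectors for a two-bit dual-rail word, combined by a C-element so that the detector output rises only when every bit carries Data and falls only when every bit has returned to Null.

    // Dual-rail completion detection for a 2-bit word (all-zero = NULL).
    module completion_2bit (input a0_f, a0_t, a1_f, a1_t, output reg done);
      wire v0 = a0_f | a0_t;   // bit 0 holds a Data code word
      wire v1 = a1_f | a1_t;   // bit 1 holds a Data code word
      // C-element behavior: rise when both bits are Data, fall when both are Null
      always @(*) begin
        if (v0 & v1)        done = 1'b1;
        else if (!v0 & !v1) done = 1'b0;   // otherwise hold
      end
    endmodule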
Fig. 3.2 NCL system implementation.
The request–acknowledgment mechanism of register interaction ensures a two-phase discipline in NCL system functioning and prevents collisions between different Data wave fronts. Guaranteed implementation of this behavior requires gates, such as NCL gates, that satisfy the following properties:

• Monotonic Null → Data (Data → Null) transitions at a combinational circuit's inputs result in monotonic Null → Data (Data → Null) transitions at its outputs. These properties are achievable by using gates that implement a positively unate function (all inputs in such functions are used without inversions). Each gate can then make at most one transition in the set or reset phase, much like in precharge dynamic circuits.

• For intermediate states of the Null → Data input transition, a combinational circuit must keep some of its outputs at Null (so that it does not produce Data prematurely). For intermediate states of the Data → Null input transition, the circuit must keep some of its outputs at Data (so that it does not produce Null prematurely). Thus, NCL gates must track the current function; that is, gates must have internal memory.

These conditions lead to a general representation of an NCL gate as g = S + gR, where S and R are the unate set and reset functions. Because there is only one designated value for Null (the vector with all 0's), the system must uniformly reset every gate by changing its output to 0 only when all inputs to that gate are 0. This
Fig. 3.3 Semistatic implementation of an NCL gate (a), a Muller C-element as an example of a particular NCL gate (b), and its notation (c).
refines the representation to g(x1, x2, ..., xn) = S + g·(x1 + x2 + ··· + xn). Figure 3.3(a) shows a semistatic CMOS implementation of an NCL gate. Figures 3.3(b) and 3.3(c) show an implementation and notation for a particular NCL gate with the function g = x1x2 + g(x1 + x2), known from the literature as a Muller C-element [86]. The uniform reset of each NCL gate provides the reset for the whole combinational circuit as soon as its inputs become NULL. In this way, the propagation of NULL wavefronts is guaranteed by the nature of NCL gates and does not depend on circuit functionality. It is the implementation of the circuit behavior in the set phase (propagation of DATA wavefronts) that solely defines the functionality of an NCL circuit. Table 3.1 shows the content of the proprietary cell library from Theseus Logic [119].

Table 3.1 Library of NCL gates.
NCL gate    Set function                Dual NCL gate   Set function
TH22        A*B                         TH12            A + B
TH33        A*B*C                       TH13            A + B + C
TH44        A*B*C*D                     TH14            A + B + C + D
TH23W2      A + B*C                     TH33W2          AB + AC
TH23        AB + BC + AC                TH23            AB + BC + AC
TH24W22     A + B + CD                  TH54W22         ABC + ABD
TH24W2      A + BC + BD + CD            TH44W2          ABC + ABD + ACD
TH24        AB + (A + B)(C + D) + CD    TH34            AB(C + D) + C(A + B)
TH34W22     AB + AC + AD + BC + BD      TH44W22         AB + ACD + BCD
TH34W2      AB + AC + AD + BCD          TH34W2          AB + AC + AD + BCD
TH34W32     A + BC + BD                 TH54W32         AB + ACD
TH44W322    AB + AC + AD + BC           TH54W322        AB + AC + BCD
TH44W3      AB + AC + AD                TH34W3          A + BCD
THAND       AC + AD + BC                THAND           AC + AD + BC
THXOR       AC + BD                     THCOMP          AB + AD + CB + CD
These gates might be realized following the structure of Figure 3.3(a) and are called hysteresis gates because of their sequential nature. They implement all unate functions of up to four inputs.
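As an illustration of the g = S + gR template, the following behavioral Verilog model (simulation only; names are ours, not Theseus Logic's cell code) captures the TH23 gate of Table 3.1: it sets when at least two of its three inputs are 1 and resets only when all inputs return to 0.

    // Hysteresis (threshold) gate TH23: set = AB + BC + AC, uniform reset.
    module th23_beh (input a, b, c, output reg g);
      wire set_f   = (a & b) | (b & c) | (a & c);  // set function S
      wire reset_f = ~a & ~b & ~c;                 // reset only when all inputs are 0
      always @(*) begin
        if (set_f)        g = 1'b1;
        else if (reset_f) g = 1'b0;
        // otherwise hold the previous value (internal memory)
      end
    endmodule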
3.3 NCL Design Flow with HDL Tools
NCL is coded at the register-transfer level (RTL). To synthesize and simulate an NCL circuit at the RTL using commercial tools, the tools must handle the NULL value and sequential behavior of gates with hysteresis. This is accomplished by the following rules:

• Separate combinational logic and registers. Like in clocked logic, combinational logic is written as concurrent signal assignments or in processes (VHDL).

• Instantiate NCL registers and provide a simulation-only model with hysteresis behavior. The simulation model is ignored during synthesis.

• Use a hysteresis procedure inside processes to simulate hysteresis, but ignore the procedure during synthesis.
Example 3.1. The following example shows an RTL specification of an NCL 4-to-2 encoder with registration at its output.

    library ncl;
    use ncl.ncl_logic.all, ncl.ack_logic.all;
    use ncl.ncl_components.all;

    entity enc_4_to_2 is
      port ( din           : in  ncl_logic_vector(4 downto 1);
             ack_in, reset : in  ack_logic;
             ack_out       : out ack_logic;
             dout          : out ncl_logic_vector(2 downto 1));
    end enc_4_to_2;

    architecture behave of enc_4_to_2 is
      signal d : ncl_logic_vector(2 downto 1);
      -- combinational part
    begin
      encode : process(din)
      begin
        if    din = "1000" then d <= "11";
        elsif din = "0100" then d <= "10";
        elsif din = "0010" then d <= "01";
        elsif din = "0001" then d <= "00";
        else  d <= (others => '0');
        end if;
        -- for simulation purposes only
        hysteresis(d, din);
      end process encode;

      -- instantiation of register with initial value NULL (-1)
      ir1 : ncl_register_ss
        generic map (width => 2, initial_value => -1, stages => 1)
        port map (datain => d, ki => ack_in, rst => reset,
                  dataout => dout, ko => ack_out);
    end behave;
The NCL design flow uses off-the-shelf simulation and synthesis components (see Figure 3.4). The flow executes two synthesis steps. The first step treats NCL variables as single wires. The synthesis tool performs HDL optimizations and outputs a network in generic library
Fig. 3.4 RTL flow for NCL.
(GTECH) as if it were a conventional Boolean RTL description. The second step expands the intermediate generic netlist into a dual-rail NCL netlist by performing dual-rail expansion and mapping into the library of hysteresis gates. The particulars of the Step 1 and Step 2 implementations impact the quality of the final results. Note that, although two-rail expansion is a specific step that is not present in synchronous flows, it can be performed by mere manipulation of gate networks. Editing capabilities on gate networks are present in all modern synthesis tools. Thus, two-rail expansion is performed within commercial flows by developing corresponding scripts.
3.4 DIMS-based NCL Design Flow
In [68] a regular method for NCL implementation based on Delay Insensitive Minterm Synthesis (DIMS) [122] was suggested. In this method, Steps 1 and 2 of the design flow are implemented as follows: Step 1 performs a mapping of the optimized network into two-input NAND, NOR, and XOR gates.
Fig. 3.5 Dual-rail expansion for NAND gate.
Step 2 first represents each wire a as a dual-rail pair a.0 and a.1 and then makes a direct translation of two-input Boolean gates into pairs of gates with hysteresis, performing some limited optimizations. Figure 3.5 illustrates the Step 2 implementation for the two-input NAND gate c = (a ∗ b)' (a Verilog sketch of this expansion is given at the end of this section). Note that in the DIMS method a function is represented as a set of minterms, with only one of them turning "ON" in the set phase. Therefore the outputs of the pair of gates with hysteresis acknowledge all changes at their inputs, which is the basis for the delay insensitivity of DIMS-based NCL implementations. The advantages of the DIMS-based approach are the simplicity of the translation scheme and the automatic guarantee of DI properties in the implementation. Unfortunately, DIMS-based implementations have significant overhead, which comes from three sources:

(1) Over-design due to the locality of ensuring DI. Each dual pair of gates acknowledges ALL of its inputs. This is excessive when inputs are shared between several pairs of gates.

(2) Little room for optimization. This comes from the close coupling of implementing functions and ensuring delay insensitivity by the very same gates. Optimization of functionality can easily destroy DI properties.

(3) Need for proprietary gates. Gates with hysteresis are required to ensure a correct-by-construction reset in DATA → NULL transitions. These gates are significantly larger than conventional gates (approximately 2 times).
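The sketch below gives one common Verilog rendering of the DIMS expansion of Figure 3.5, using a behavioral TH22 (C-element) model for the minterm detectors; module and signal names are ours, and the real library cells would replace the behavioral model.

    // Behavioral TH22 / C-element: output rises when both inputs are 1
    // and falls only when both inputs are 0 (hysteresis).
    module th22 (input x1, x2, output reg g);
      always @(*) begin
        if (x1 & x2)        g = 1'b1;
        else if (~x1 & ~x2) g = 1'b0;
      end
    endmodule

    // DIMS dual-rail expansion of the two-input NAND c = (a * b)'.
    module dims_nand (input a0, a1, b0, b1, output c0, c1);
      wire m00, m01, m10, m11;        // exactly one minterm fires per DATA phase
      th22 u00 (a0, b0, m00);
      th22 u01 (a0, b1, m01);
      th22 u10 (a1, b0, m10);
      th22 u11 (a1, b1, m11);
      assign c1 = m00 | m01 | m10;    // NAND output is 1 for inputs 00, 01, 10
      assign c0 = m11;                // NAND output is 0 only for input 11
    endmodule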
3.5 NCL Flow with Explicit Completeness
The next step in gaining efficiency of NCL implementations was proposed in [57] with the development of the so-called NCL_X flow. It exploits the idea of separate implementations for functionality and delay insensitivity. NCL_X partitions an NCL circuit into functional and completion parts, which permits independent optimization of each. Modifying the flow's implementation steps permits this separate implementation and optimization. In Step 1, designers perform conventional logic synthesis (with optimization) from the RTL specification of an NCL circuit. This also maps the resulting network into a given library, using gates that implement the set functions of threshold gates. Step 2 involves the following substeps:

• reducing the logic network to unate gates by using two different variables, a.0 and a.1, for the direct and inverse values of signal a (the obtained unate network implements rail.1 of a dual-rail combinational circuit);

• enabling dual-rail expansion of the combinational logic by creating a corresponding dual gate in the rail.0 network for each gate in the rail.1 network (for this one needs to confine the library content by ensuring that each gate of the library has a gate with the dual function in the same library);

• ensuring delay insensitivity by providing local completion detectors (OR gates) for each pair of dual gates and connecting them in a completion network (a multi-input C-element) with a single output, done.

Implementing the NCL_X design flow requires a minor modification of the interfacing conventions within the NCL system. We assume that for each two-rail primary input (a.0, a.1), an explicit signal a.go exists such that a.0 ≠ a.1 → a.go = 1 (set phase), whereas a.0 = a.1 = 0 → a.go = 0 (reset phase). Figure 3.6 shows the modified organization of the NCL system. Unlike the system in Figure 3.2, this system has separate completion detectors for the combinational logic and the registers. Designing the combinational logic for a 4-to-2 encoder (from Section 3.3) illustrates this design flow.
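Before walking through the encoder, a minimal Verilog sketch (our names) of what Step 2 produces for a single AND gate may help: a rail.1 gate, its dual rail.0 gate, and the local completion detector that would feed the multi-input C-element producing done together with I.go.

    // One dual-gate pair with its local completion detector, for c = a AND b.
    module nclx_and_pair (input a0, a1, b0, b1, output c0, c1, c_cd);
      assign c1   = a1 & b1;    // rail.1 gate (set function of the threshold gate)
      assign c0   = a0 | b0;    // dual gate in the rail.0 network
      assign c_cd = c0 | c1;    // local completion detector (OR gate)
    endmodule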
Fig. 3.6 NCL system with explicit completion detection.
Figure 3.7 shows the design steps in the NCL_X implementation of the encoder. After the reduction to unate gates (Figure 3.7(b)), the network is expanded into a dual-rail implementation (Figure 3.7(c)). This is done locally by providing a dual function for each of the gates in the unate representation of Figure 3.7(b). Finally, each pair of dual gates is supplied with a local completion detector (an OR gate); the detector outputs are connected to a multi-input C-element providing the signal done (Figure 3.7(d)). Information about the validity of the input code words is provided by the C-element I.go
Fig. 3.7 4-to-2 encoder with explicit completion.
that combines all completion signals for the primary inputs. The output I.go is connected to the C-element done. The following property is crucial for the proposed implementation: an NCL circuit implemented with explicit completion belongs to the QDI class. The proof of this statement is rather straightforward. In each phase (set and reset) exactly one gate out of each dual pair must switch. The switching is acknowledged by the local completion detectors which, in turn, are acknowledged by the primary output done. Hence, transitions at every gate of the circuit are acknowledged by the circuit outputs and the circuit belongs to the QDI class. A clear boundary between a circuit's functional and completion parts simplifies optimization procedures and results in a significant area reduction of NCL_X versus the conventional NCL flow (about 30% [57]). Optimization of NCL implementations is an on-going topic of research. Jeong and Nowick [48] suggested a set of QDI-preserving transformations that help in more efficient technology mapping of NCL structures. Zhou et al. [145] proposed the idea of partial completion to reduce the size of the NCL gates and of the completion network. Note, however, that area improvements in the NCL flow are limited by the theoretical 2× penalty bound stemming from the need to perform dual-rail expansion. Figure 3.8 summarizes the efforts in optimizing NCL implementations and provides a comparison to the reference synchronous implementations.
3.6 Verification of NCL Circuits
In synchronous circuits, static timing analysis plays the major role in design sign-off. Every combinational path must be checked for setup and hold constraints to make sure that the circuit functions correctly. The big advantage of QDI circuits (NCL circuits in particular) is that their behavior is independent of the delays of gates. This makes static timing analysis obsolete for verifying NCL circuits. Ideally, the QDI properties of NCL implementations should be satisfied by using correct-by-construction synthesis methods. In practice, explicitly verifying QDI is necessary because
Fig. 3.8 Comparative analysis of NCL implementations.
(1) Parts of a system might be designed manually. (2) Even the best automated tools are prone to errors. (3) It increases a designer's confidence in the implementation results.

Formally, QDI verification reduces to checking that every gate of a circuit translates the result of its firing into changes at the primary outputs. This is essentially the same as the acknowledgment property introduced in Section 3.2. If the outputs of a circuit take a settled value (Null or Data) before some internal gate transitions to its settled value, then that internal gate is called an orphan [58] because its transition is left unsupervised. In this terminology, verifying QDI properties means proving that a circuit is free from orphan gates. In NCL systems, orphan checking plays the same role in design sign-off as static timing analysis does for synchronous circuits. Several important observations help in reducing the complexity of the delay-insensitive analysis of NCL systems:

• The analysis of a whole NCL system is decomposed into smaller tasks of checking its combinational subparts, in the same way as with ordinary static timing analysis.
• Data and Null wavefronts cannot collide in NCL systems. This suggests that orphans must be checked separately for the set and reset phases (but not for their mix). For a gate g with hysteresis (g = S + gR), checking must be performed with respect to a combinational network of gates with functions S (for the set phase) or a network of gates with functions R (for the reset phase).

• It is easy to prove that consecutive Data and Null wavefronts sensitize the very same combinational paths. This property simplifies the job of orphan checking by confining it to the set phase only.

Figure 3.9 shows an overall flow for QDI verification in NCL circuits. The key component of the verification tool is the satisfiability solver Chaff [85], which provides a simple and efficient solution for orphan analysis. The analysis is based on constructing a miter [63] that combines the outputs of the "correct" sub-network IO*(gi) (in which gi might switch) and a "faulty" sub-network O*(gi) (in which the value of gi is kept at "0"). The output F of the resulting circuit (Figure 3.10) takes the value "0" for an input set DATAj that does not distinguish the outputs of O*(gi) and IO*(gi). Gate gi must switch to "1" under DATAj. However, in O*(gi), the value at the output of gi is replaced by "0". Therefore, under the DATAj input set, the outputs settle down irrespective of the transition at gi. This behavior describes an orphan crossing gi. Clearly, F being a tautology implies that the circuit has no orphans and is delay-insensitive.
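As a toy illustration of this miter (assumed names; a real flow would translate the construction into CNF for the SAT solver rather than simulate it), consider a tiny set-phase network y = (a & b) | c with gi being the AND gate:

    // Miter for checking whether an orphan can cross gate gi = a & b.
    module orphan_miter (input a, b, c, output F);
      wire gi_ok    = a & b;         // IO*(gi): gi is free to switch
      wire y_ok     = gi_ok | c;
      wire y_faulty = 1'b0  | c;     // O*(gi): gi forced to 0
      // F is 1 when the two copies are distinguished. The DATA input
      // a = b = c = 1 makes gi switch yet leaves F = 0, exposing an
      // orphan: the OR gate masks gi's transition.
      assign F = y_ok ^ y_faulty;
    endmodule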
Fig. 3.9 Verification flow for NCL.
Fig. 3.10 Miter construction for checking orphans crossing gi .
Fig. 3.11 ALU design size vs verification time.
The efficiency of the SAT approach for orphan checking is confirmed by Figure 3.11 from [58], which shows results for a scalable ALU example. One can see that as the size of the circuit increases, the verification time grows fairly slowly.

3.6.1 Testing of NCL
Testing is the last piece of the mosaic needed to provide a complete picture of the NCL design methodology. Delay-insensitive NCL systems are immune to timing failures: delay faults in these circuits influence the circuit performance but not the correctness of its behavior. Therefore the main focus of testing efforts for NCL designs is centered around stuck-at faults. As we will show below, a scan-based approach is well suited for testing NCL circuits. This suggests that bridging faults in NCL implementations
might be tested using conventional techniques of applying IDDQ vectors. This topic is, however, outside the scope of this section. At first glance, testing micropipelines seems to be very difficult because of the numerous loops that couple pipeline stages and implement handshakes. Fortunately, many asynchronous systems (micropipelines in particular) enjoy self-checking properties with respect to output stuck-at faults [139]. If one of the handshake wires is stuck at a constant level, then sooner or later the whole pipeline stalls, which provides excellent observability for control faults. These results can be generalized beyond handshake wires to any input stuck-at faults in completion detectors [59].² This brings us to an important conclusion: the register and control circuitry is self-checking in NCL designs and can therefore be removed from consideration when exploring test patterns. Intuitively this is easy to understand because transparent latches under monotonic changes at their data inputs behave like buffers and do not influence the testability of a system. Therefore, for the purpose of test pattern generation one might take a simplified view of the NCL pipeline as a connection of combinational logic networks in sequence, which is equivalent to a single combinational logic network (see Figure 3.12).
Fig. 3.12 Removal of registers from NCL pipeline.
² Here, we assume that feedback wires in NCL gates and C-elements are not externally visible inputs. Testing input stuck-at faults for feedback wires is an open problem for NCL designs.
Even after the registers are stripped out of the NCL pipeline, a designer is left with a combinational cloud of sequential gates with hysteresis. These gates cannot be directly submitted to any of the existing commercial ATPG tools. However, using the very same observations as for verification, one can represent a single cloud of NCL gates by two separate components: one specifying the behavior of the NCL network in the set phase and another its behavior in the reset phase. Both of these components are conventional Boolean networks. Furthermore, from the symmetry of path sensitization in the set and reset phases it follows that testing the single combinational network for the set phase suffices to cover all stuck-at faults. This leads to the following important conclusion: acyclic NCL pipelines are 100% testable for input stuck-at faults. For the general case of pipelines with computational loops, one needs to first break all computational cycles and can then enjoy the ease of testing of the acyclic case. Breaking cycles can be done by inserting scan registers that are very similar to synchronous scan registers (see [59]). This method is effective and has low cost because it uses partial scan rather than full scan insertion.

3.6.2 Summary
The NCL design methodology was the first to show that asynchronous circuits can indeed be designed using existing commercial EDA tools. This was an important proof of concept. NCL implementations provide the following advantages.

(1) Significant reduction of electromagnetic interference (orders of magnitude) due to the even distribution of circuit transitions in time [82].

(2) Potential power savings that stem from reimplementing an algorithm in an event-centric fashion. One encouraging example is provided by the NCL version of Motorola's STAR08 processor core, which saved about 40% of the power consumption with respect to its synchronous counterpart [82].
(3) Robustness to process and operating-condition variations, which is inherent to QDI implementations.

However, the NCL advantages come at a cost:

(1) A significant area penalty (up to 3–4× for conventional NCL and up to 2.5× for NCL_X). The latter cannot be reduced below 2× due to the nature of the dual-rail implementation.

(2) Power consumption may grow significantly because one gate out of each pair of dual gates switches up and down in every phase of operation. This may offset the power savings that are achieved at the algorithmic level.

Area overhead seems to be the main bottleneck in accepting the NCL methodology. It makes it difficult to use NCL as a universal approach, leaving, however, room for "sweet spot" applications [82]. The NCL example stimulated the development of other automated design flows. De-synchronization is a successful follower of NCL that manages to keep area penalties negligible without significant sacrifices in the robustness of the implementation.
4 De-synchronization: A Simple Mutation to Transform a Circuit into Asynchronous
4.1 Introduction
If one were to ask for a strategy to design an asynchronous circuit that keeps the differences from its synchronous counterpart to a minimum, that strategy would probably be the one presented in this chapter: de-synchronization. The idea is simple. Let us consider the implementation of every flip-flop as a pair of master/slave latches. The role of the clock signal is to generate the enable pulses for the latches in such a way that data flows through the circuit to perform a certain computation. De-synchronization simply substitutes the clock with an asynchronous control layer that generates the enable pulses of the latches in such a way that the behavior is preserved. The data-path remains intact. De-synchronization has been "on the air" since Ivan Sutherland, in his Turing Award lecture [125], pioneered micropipelines, a scheme to generate local clocks for a synchronous latch-based datapath. Mimicking synchronous behavior by asynchronous logic was suggested in [138]. This work was mainly focused on the synchronization
of asynchronous arrays. Similar ideas were exploited in the doubly-latched asynchronous pipeline proposed in [55]. However, the previous approaches were never formalized in a way that allowed them to be integrated into an automated synthesis flow. The first attempt to derive an asynchronous circuit from an arbitrary synchronous circuit was presented in [69]. The scheme required a re-encoding of the data-path using LEDR, a delay-insensitive code. This results in a significant overhead in area and power that makes the approach not very attractive to designers. To alleviate this overhead, a coarse-grain approach was suggested in [102]. Finally, a formal approach that requires no overhead for the datapath was presented in [20]. This approach will be explained in more detail in this chapter. In this chapter, a formal model to specify timing diagrams will be used. This model is called the Signal Transition Graph and is presented in the next section. The behavior of synchronous systems is then revisited and analyzed. Understanding the properties of synchronous data transfers is crucial to proposing similar schemes with asynchronous control. The foundations and implementation of de-synchronization are presented in the remaining sections.
4.2 Signal Transition Graphs
The model used in this paper to specify asynchronous controllers is based on Petri nets [87] and is called the Signal Transition Graph (STG) [14, 106]. Roughly speaking, an STG is a formal model for timing diagrams that can specify causality, concurrency, and choice among rising and falling transitions of signals. For those readers not familiar with Petri nets, we refer to [87]. Graphically, Petri nets are depicted as bipartite graphs with boxes representing transitions, circles representing places, and arcs representing the flow relation. The tokens held in places represent markings of the net and denote the state of the system. In an STG, the Petri net transitions are labeled with signal events, e.g., a+ and a− representing rising and falling transitions of signal a, respectively. An example of an STG is shown in Figure 4.1(a).
Fig. 4.1 (a) STG, (b) compact representation of an STG, (c) reachability graph.
To represent STGs in a more compact form, places with exactly one predecessor and one successor transition are omitted. Therefore, arcs between transitions implicitly represent a place. In that case, the tokens are held on the arc. A compact representation of the previous STG is depicted in Figure 4.1(b). A Petri net models the dynamic behavior of a concurrent system. The set of markings reachable from the initial marking forms a graph called the reachability graph. Figure 4.1(c) depicts the reachability graph of the previous STG. Let us analyze the behavior of the example by looking at the STG in Figure 4.1(b). After firing the transitions a+ and b+, a marking is reached with a token in place p1, in which two branches can be taken (choice). After firing either of them, ⟨c+ e+ c−⟩ or ⟨d+ e+ d−⟩, another marking is reached (tokens in p2 and p3). At this point, the sequence ⟨a− b−⟩ can be fired concurrently with the event e−, thus reaching the initial marking again.
Fig. 4.2 (a) Netlist, (b) marked graph representation.
There is a specific class of Petri nets that has special relevance in this work: marked graphs. A marked graph only has places with one predecessor and one successor transition. Most of the STGs presented in this paper will be marked graphs in which places hold at most one token in every reachable marking. Figure 4.2 describes how a netlist can be modeled with a marked graph. The combinational logic (blocks A, B, and C) is represented by transitions. Latches (or flip-flops) are represented by places. Every latch has only one input predecessor, which is the combinational block that produces the result stored in the latch. A latch may have multiple fanout. In that case, the latch is represented by a set of places, one associated with every output block. This place multiplicity is required to model the fact that the value of the latch is transmitted to all combinational blocks. A place with multiple output arcs would represent a choice, i.e., a transfer to only one of the successors, which is not the actual behavior of the circuit. This is the reason why netlists are typically represented with choice-free nets, i.e., marked graphs.
4.3 Revisiting Synchronous Circuits
Before presenting the principles of de-synchronization it is convenient to understand how synchronous circuits work. The clock signal is the one that keeps the system continuously beating, interleaving data computation and storage. It is a signal with a fixed frequency arriving simultaneously (or with negligible skew) at all latches. A conceptual clocking scheme is depicted in Figure 4.3. The clock is generated by an oscillator (inverted delay) that feeds a non-overlapping phase generator. The oscillator can be internally implemented as a
Fig. 4.3 Conceptual view of the clock generation in a synchronous circuit.
ring with an odd number of inverters. However, the most conventional way is by means of an external quartz crystal that guarantees a fixed oscillation frequency with high accuracy. The registers are typically implemented as flip–flops, internally composed of two level-sensitive latches: master and slave. Both latches operate with the two complementary phases of the clock. The cross-coupled NOR gates guarantee that the phases Φ1 and Φ2 do not overlap. The scheme depicted in Figure 4.3 has two important properties: • Absence of races. At each phase, half of the latches are sending data while the other half are receiving data. Transfers occur from master to slave during Φ2 and from slave to master during Φ1. Since the phases do not overlap, simultaneous transfers master → slave → master or slave → master → slave can never occur and, therefore, no data overruns are produced. This constraint is what determines the hold time of a latch. • No data loss. The frequency of the clock, determined by the delay of the oscillator, guarantees that the receiving latch does not become opaque before capturing the incoming data; otherwise, the data would be lost. This constraint is what determines the performance of the system: the clock period
must accommodate the longest combinational path from latch to latch. At the level of latches, this constraint is directly related to the setup time. A small numeric check of both constraints is sketched below.
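As a concrete illustration of the two properties, the inequalities they impose on a latch-to-latch transfer can be written down and checked in a few lines; the Python sketch below uses invented delay values and a simplified single-skew term, so it should be read as a sketch rather than a complete static timing model.

def min_clock_period(t_clk_to_q, t_comb_max, t_setup, t_skew=0.0):
    # No data loss: the period must cover the slowest latch-to-latch path.
    return t_clk_to_q + t_comb_max + t_setup + t_skew

def hold_ok(t_clk_to_q, t_comb_min, t_hold, t_skew=0.0):
    # Absence of races: the fastest path must not overrun the receiving latch.
    return t_clk_to_q + t_comb_min >= t_hold + t_skew

T = min_clock_period(t_clk_to_q=0.3, t_comb_max=4.2, t_setup=0.2)   # ns, invented values
print("minimum clock period:", T, "ns")                              # 4.7 ns
print("hold satisfied:", hold_ok(t_clk_to_q=0.3, t_comb_min=0.4, t_hold=0.1))  # True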
4.4 Relaxing the Synchronous Requirement
The synchronous scheme results in simple implementations, since only one synchronization signal is required in the whole system. However, this requirement is not strictly necessary for a correct operation of the system. A well-known asynchronous scheme is the Muller pipeline [86], depicted in Figure 4.4. In this case, there is no explicit distinction between master and slave latches and there is no global synchronization. Instead, a correct synchronization is guaranteed by the local interactions with the neighboring latches: a transfer only occurs when the sender has “valid” data (req = 1) and the receiver is “empty” (ack = 0). A behavioral specification of the interaction of the req and ack signals between two adjacent stages is shown in Figure 4.5(a). The acki signals play the role of the enable signals (Φi ) for the latch at stage i. By hiding the events of the req signals and observing only the events of the enable signals (Φi = acki ), the causality relation depicted in Figure 4.5(b) can be derived. With this particular protocol, the phases Φi−1 and Φi overlap during the initial period of the transfer (Φi−1 + → Φi +), but this always occurs
Fig. 4.4 Muller pipeline.
Fig. 4.5 Muller pipeline: (a) STG specification, (b) enable signals of adjacent latches, (c) timing diagram.
Fig. 4.6 Asymmetric delay.
when the previous data item at stage i has been delivered to the next stage (Φi+1− → Φi+), thus preventing data overruns. To avoid data losses, a delay is included on the req signals. This delay must be designed in such a way that the latch does not become opaque before the combinational logic has completed its calculation. To make this scheme more efficient, it is possible to use asymmetric delay chains with short return-to-zero delays, as shown in Figure 4.6. Finally, Figure 4.5(c) shows a possible timing diagram of the interaction between consecutive stages in a Muller pipeline.
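The control behavior of the Muller pipeline can be mimicked with a chain of C-elements, each taking the request of the previous stage and the inverted acknowledgment of the next one. The Python sketch below is a deliberately simplified, zero-delay model written for this text (the stage count, the always-ready environment, and the fixed-point iteration are all simplifying assumptions); it only illustrates the four-phase, return-to-zero nature of the handshake.

def c_element(a, b, prev):
    # Muller C-element: the output follows the inputs when they agree, otherwise it holds.
    return a if a == b else prev

def settle(stages, req_in):
    # Iterate the C-element equations until nothing changes (zero-delay approximation).
    changed = True
    while changed:
        changed = False
        for i in range(len(stages)):
            left_req = req_in if i == 0 else stages[i - 1]
            right_out = 0 if i == len(stages) - 1 else stages[i + 1]   # env always ready
            new = c_element(left_req, 1 - right_out, stages[i])
            if new != stages[i]:
                stages[i], changed = new, True
    return stages

stages = [0, 0, 0, 0]
print(settle(stages, req_in=1))   # [1, 1, 1, 1]: the request ripples down the pipeline
print(settle(stages, req_in=0))   # [0, 0, 0, 0]: the return-to-zero phase empties it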
4.5 Minimum Requirements for Correct Asynchronous Communication
In this section, we face the problem of data communication in a more general way. In particular, we want to answer the following question: what are the minimum requirements for an asynchronous scheme to implement correct data transfers between latches? Figure 4.7(a) depicts a linear pipeline in which the signals A − D denote the enable signals of the latches. The dark and white circles inside the latches represent valid and non-valid data (tokens and
Fig. 4.7 Asynchronous data transfer for a linear pipeline and a ring.
bubbles, respectively). To operate correctly, data transfers must always move tokens to latches with bubbles. Figure 4.7(b) describes the minimum causality relations between the events of the enable signals to guarantee correct data transfers. The picture depicts a fragment of the unfolded marked graph. The tokens on the arcs denote the current state of the system. There are three types of arcs in the diagram: • A+ → A− → A+, arcs that simply denote that the rising and falling transitions of each signal must alternate. • B− → A+, arcs that indicate that for latch A to read a new data token, B must have completed the reading of the previous
token coming from A. If this arc is not present, data overruns can occur, or, in other terms, hold constraints can be violated. • A+ → B−, arcs that indicate that for latch B to complete the reading of a data token coming from A it must first wait for the data token to be stored in A. If this arc is not present, B can "read a bubble" and a data token can be lost, or, in other terms, setup constraints can be violated. Figure 4.7(c) depicts the marked graph corresponding to the causality relations in Figure 4.7(b). A simplified notation is used in Figure 4.7(d) to represent the same graph, substituting each cycle x −→ •y −→ x by a double arc x ←→ •y, where the token is located close to the enabled event in the cycle (y in this example). It is interesting to notice that the previous model is more aggressive than the Muller pipeline or the classical scheme for generating non-overlapping phases for latch-based designs. As an example, the following sequence can be fired in the model of Figure 4.7(a–d): D+ D− C+ B+ B− A+ C− ··· After the events ⟨D+ D− C+ B+⟩, a state in which B = C = 1 and A = D = 0 is reached, where the data token stored in A is rippling through the latches B and C. A timing diagram illustrating this sequence is shown in Figure 4.8. This feature only illustrates the fact that various schemes can be proposed for a correct flow of data along a pipeline, exploring different degrees of concurrency among the events. The practicality and efficiency of every specific scheme will depend on its implementation. Another important detail is the initialization of the marked graph. If the latch has a token (e.g., latch A), the arc A+ → B− must be marked, i.e., the pulse at latch B must occur before the one at latch A. Conversely, the arc C− → B+ is marked since there is a bubble in latch B. Thus, latch C can deliver its token before capturing the new incoming token from B. As a particular case, Figure 4.7(f) depicts the graph for a 2-stage ring (Figure 4.7(e)). This is the typical communication mechanism obtained from a state machine that holds the combinational logic between the latches B and A. Interestingly, the folded marked graph
Fig. 4.8 Timing diagram of the linear pipeline in Figure 4.7(a–d).
after removal of redundant arcs (Figure 4.7(h)) represents the behavior of two non-overlapping phases. Precisely, this is the only protocol that allows a correct operation of the ring, since any phase overlapping would overwrite the contents of both latches. Still, the interesting question that we need to answer is: can this asynchronous data transfer model be generalized to an arbitrary netlist?
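These causality and initialization rules can be checked mechanically. The Python sketch below is an illustration added here, not part of the original development: it builds the marked graph of the protocol for a four-latch linear pipeline, assuming one possible initial state (tokens in latches A and C, bubbles in B and D, all enable signals low so that every X− → X+ arc starts marked), and verifies that the aggressive firing sequence quoted above is executable under those assumptions. The actual initial state drawn in Figure 4.7 may differ.

def build_desync_mg(latches, has_token):
    # Arcs are (source event, target event); the marking maps an arc to its token count.
    arcs, marking = [], {}
    for x in latches:
        arcs += [(x + "-", x + "+"), (x + "+", x + "-")]
        marking[(x + "-", x + "+")] = 1          # assumption: enables start low
        marking[(x + "+", x + "-")] = 0
    for up, down in zip(latches, latches[1:]):
        setup_arc = (up + "+", down + "-")       # data must be stored in `up` first
        hold_arc = (down + "-", up + "+")        # `down` must finish reading first
        arcs += [setup_arc, hold_arc]
        marking[setup_arc] = 1 if has_token[up] else 0   # rule: token in the latch
        marking[hold_arc] = 0 if has_token[up] else 1    # rule: bubble in the latch
    return arcs, marking

def fireable(arcs, marking, sequence):
    # Standard marked-graph firing: consume a token per input arc, produce one per output arc.
    m = dict(marking)
    for ev in sequence:
        pre = [a for a in arcs if a[1] == ev]
        if not all(m[a] > 0 for a in pre):
            return False
        for a in pre:
            m[a] -= 1
        for a in arcs:
            if a[0] == ev:
                m[a] += 1
    return True

arcs, m0 = build_desync_mg("ABCD", {"A": True, "B": False, "C": True, "D": False})
print(fireable(arcs, m0, ["D+", "D-", "C+", "B+", "B-", "A+", "C-"]))   # True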
4.5.1 A Simple Example
Figure 4.9 depicts a synchronous netlist with a global clock. The shadowed boxes represent level-sensitive latches. The labels 0 and 1 at the latches represent the polarity of the phase. The reader can notice that the phases alternate in any path of the netlist. The de-synchronization method proposed in this chapter substitutes the clock network with a network of asynchronous controllers. This network is shown in Figure 4.10. Every box is an asynchronous controller that generates the enable signal (Φ) of one of the latches. The controllers synchronize through request and acknowledge signals: Ri and Ai for the input data and Ro and Ao for the output data. When a latch receives data from more than one latch, a C element synchronizes the incoming requests. Similarly, a C element synchronizes the acknowledgments of the output data. As an example, latch A receives data from F and G through the combinational
Fig. 4.9 A synchronous circuit with a single global clock.
Fig. 4.10 Asynchronous control for the netlist in Figure 4.9.
logic that precedes A and, thus, the C element for Ri synchronizes the Ro signals coming from F and G. The rest of this chapter will be devoted to studying the properties and implementations of the asynchronous controllers that guarantee a correct operation of the system.
4.5.2 Correct De-synchronization
The interaction of the asynchronous controllers in Figure 4.10 produces a specific alternation of the rising and falling transitions of Φi for all latches. We are seeking the requirements for these alternations to produce an asynchronous behavior equivalent to the synchronous one. The main results in this area were presented in [20]. We next review and discuss these results in an informal and intuitive way.
4.5.2.1 Flow Equivalence
This is a concept that defines the behavioral equivalence between two circuits [65]. Two circuits are flow equivalent if they have the same set of latches and, for each latch A, the projections of the traces onto A are the same in both circuits. Intuitively, two circuits are flow-equivalent if their behavior cannot be distinguished by observing the sequence of values stored at each latch. This observation is done individually for each latch and, thus, the relative order with which values are stored in different latches can change, as illustrated in Figure 4.11. The top diagram depicts the behavior of a synchronous system by showing the values stored in two latches, A and B, at each clock cycle. The diagram at the bottom shows a possible de-synchronization, where the pulses of signals A and B represent the behavior of the enable signals of these latches. An important result presented in [20] is the following: The de-synchronization protocol presented in Figure 4.7 preserves flow equivalence.
Fig. 4.11 Flow equivalence.
This result guarantees that all computations performed by a de-synchronized circuit will be the same as the ones performed by the synchronous counterpart. However, another subtle property is required to guarantee a complete mimicry of the synchronous behavior: liveness. The following result also holds [20]: The de-synchronization protocol presented in Figure 4.7 preserves liveness. This property guarantees that the protocol will never enter a deadlock state due to an inappropriate synchronization. At first sight it might look like liveness is an obvious property of any reasonable handshake protocol. Unfortunately, this is not true. The protocol of the Muller pipeline presented in Figures 4.4 and 4.5 preserves flow equivalence. In fact, the protocol can be obtained by reducing the concurrency of the protocol shown in Figure 4.7. However, an n-stage ring with n/2 tokens (such a ring can be obtained by connecting the output of D to the input of A in Figure 4.7) would enter a deadlock if we used the controllers of a Muller
Fig. 4.12 Synchronization of a ring: (a) non-overlapping phases (live), (b) Muller pipeline (deadlock).
pipeline. Liveness can be deduced by using a well-known result in the theory of marked graphs [16]: A marked graph is live if and only if the initial marking assigns at least one token to each directed circuit. One can immediately realize that the marked graph in Figure 4.12(b) contains a cycle without tokens, the one covering all falling signal transitions (A−, B−, C−, and D−).
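This theorem is also easy to exploit programmatically: a marked graph is live exactly when the subgraph formed by its token-free arcs is acyclic, which a depth-first search can decide in linear time. The Python sketch below is an added illustration on a made-up two-signal ring, not a transcription of Figure 4.12.

def live(arcs, marking):
    # Live iff every directed circuit carries a token, i.e., the token-free subgraph is acyclic.
    succ = {}
    for (u, v) in arcs:
        if marking.get((u, v), 0) == 0:
            succ.setdefault(u, []).append(v)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    def acyclic_from(u):
        color[u] = GREY
        for v in succ.get(u, []):
            c = color.get(v, WHITE)
            if c == GREY or (c == WHITE and not acyclic_from(v)):
                return False          # back edge: a token-free circuit exists
        color[u] = BLACK
        return True
    for u in list(succ):
        if color.get(u, WHITE) == WHITE and not acyclic_from(u):
            return False
    return True

# Hypothetical ring: the circuit A- -> B- -> A- carries no token, so the net deadlocks.
arcs = [("A-", "B-"), ("B-", "A-"), ("A+", "A-"), ("A-", "A+")]
print(live(arcs, {("A-", "A+"): 1}))                        # False
print(live(arcs, {("A-", "A+"): 1, ("B-", "A-"): 1}))       # True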
4.6 Handshake Protocols for De-synchronization
What is inside the asynchronous controllers in Figure 4.10? Several handshake protocols have been proposed in the literature for asynchronous pipelines. The question is: are they suitable for a fully automatic de-synchronization approach? Is there any controller that manifests the concurrency of the de-synchronization model proposed in this paper? We now review the classical four-phase micropipeline latch control circuits presented in [35]. For that, the specification of each controller (Figures 5, 7, and 11 in [35]) has been projected onto the handshake signals (Ri, Ai, Ro, Ao) and the latch control signal (A), thus abstracting away the behavior of the internal state signals. The controller shown in Figures 5 and 6 of [35] (simple 4-phase control) corresponds exactly to the behavior of the Muller pipeline. Figures 4.13(a) and 4.13(b) show the projections of the semi-decoupled and fully-decoupled controllers from [35]. The left part of the figure depicts the connection between adjacent controllers generating the latch enable signals A and B, respectively. The right part depicts
Fig. 4.13 Handshake protocols.
only the projection on the latch control signals when three controllers are connected in a row with alternating tokens and bubbles. The controllers from [35] show less concurrency than the de-synchronization model. For this reason, we also propose a new controller implementing the protocol with maximum concurrency proposed in this paper (Figure 4.13(d)). For completeness, a handshake decoupling the falling events of the control signals (fall-decoupled) is also described in Figure 4.13(c). In all cases, it is crucial to properly define the initial state of each controller, which turns out to be different for the latches initially holding tokens and those initially holding bubbles. We now present a general study of four-phase protocols, illustrated in Figure 4.14. The figure describes a partial order defined by the degree of concurrency of different protocols. Each protocol has been annotated with the number of states of the corresponding state graph. The marked graphs in the figure do not contain redundant arcs. An arc in the partial order indicates that one protocol can be obtained from the other by a local transformation (i.e., by moving the
Fig. 4.14 Different degrees of concurrency in handshake protocols.
target of one of the arcs of the model). The arcs A+ ↔ A− and B+ ↔ B− cannot be removed for obvious reasons (they can only become redundant). For example, the semi-decoupled protocol (5 states) can be obtained from the rise-decoupled protocol (6 states) by changing the arc A+ −→ •B− (not explicitly drawn in the picture because it is redundant) to the arc A+ −→ •B+, thus reducing concurrency. The model with 8 states, labeled as "de-synchronization model," corresponds to the most concurrent model presented earlier in this chapter. The other models are obtained by successive reductions or increases of concurrency. The nomenclature rise- and fall-decoupled has been introduced to designate the protocols in which the rising or falling edges of the pulses have been decoupled, respectively. The rise-decoupled protocol corresponds to the fully decoupled one proposed in [35]. Any attempt to increase the concurrency of the de-synchronization model results in the violation of one of the properties for correct data transfer described in Section 4.3: absence of races or no data loss. In [20], the following results were proved for the models shown in Figure 4.14: • All models preserve flow equivalence. • All models except the simple four-phase protocol (top left corner) are live. • De-synchronization can be performed by using any hybrid combination of the live models shown in the figure (i.e., using different types of controllers for different latches). These results offer great flexibility to design different schemes for de-synchronized circuits.
4.6.1 Which Controller to Use?
There is no simple answer to this question. It is generally known that the degree of concurrency is highly correlated with the complexity of the controller. Depending on the technology used for the implementation, the critical cycles in the controller may vary.
Since any hybrid combination of controllers is valid for de-synchronization, a natural approach to select the controllers can be the following (a small sketch of this selection procedure is given after the list): (1) Calculate the criticality of the cycles in the datapath. The most stringent cycles are those that have the highest average delay per combinational block. (2) Use simple controllers (e.g., non-overlapping) for the latches in non-critical cycles. (3) Use more concurrent controllers for the latches in critical cycles.
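The sketch below codes this heuristic in Python purely for illustration; the data structures, the 25% criticality threshold, and the controller labels are invented, and a real implementation would derive the cycles and delays from the synthesized netlist and its timing reports.

def select_controllers(cycles, critical_fraction=0.25):
    # Each cycle is a dict {"latches": [...], "block_delays": [...]} (hypothetical format).
    def criticality(c):
        return sum(c["block_delays"]) / len(c["block_delays"])   # average delay per block
    ranked = sorted(cycles, key=criticality, reverse=True)
    n_critical = max(1, int(len(ranked) * critical_fraction))
    choice = {}
    for i, c in enumerate(ranked):
        ctrl = "maximally concurrent" if i < n_critical else "non-overlapping"
        for latch in c["latches"]:
            if choice.get(latch) != "maximally concurrent":
                choice[latch] = ctrl          # a latch keeps the most concurrent choice
    return choice

cycles = [{"latches": ["A", "B"], "block_delays": [2.0, 1.8]},
          {"latches": ["B", "C"], "block_delays": [0.6, 0.7]}]
print(select_controllers(cycles))
# {'A': 'maximally concurrent', 'B': 'maximally concurrent', 'C': 'non-overlapping'}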
4.7 Implementation of Handshake Controllers
The protocols described in Section 4.6 can be implemented in different ways using different design styles. In this section, a possible implementation of the semi-decoupled four-phase handshake protocol is described. It is an implementation with static CMOS gates, while the original one was designed at transistor level. This controller offers a good trade-off between performance and complexity. In time-critical applications, other controllers can be used, including hybrid approaches that combine protocols different from the ones shown in Figure 4.14. Figure 4.15 depicts an implementation of a pair of controllers for a fragment of data-path. The figure also shows the marked graphs modeling the behavior of each controller. The only difference is the initial marking, that determines the reset logic (signal RST). Resetting the controllers is crucial for a correct behavior. In this case, alternating latches are, respectively, transparent and opaque in the initial state, playing the role of master and slave latches in synchronous designs. With this strategy, only the opaque latches must be reset in the data-path. The controllers also include a delay that must match the delay of the combinational logic and the pulse width of the latch control signal. Each latch control signal (A and B) is produced by a buffer (tree) that drives all the latches. If all the buffer delays are similar, they can
Fig. 4.15 Implementation of semi-decoupled controllers.
be neglected during timing analysis. Otherwise, they can be included in the matched delays, with a similar but slightly more complex analysis. In particular, the delay of the sequence of events A+ → Ro/Ri− → logic delay → B+ is the one that must be matched with the delay of the combinational logic plus the clock-to-output delay of a latch. The event Ro/Ri− corresponds to the falling transition of the signal Ro/Ri between the A- and B-controllers. On the other hand, the delay of the sequence B+ → Ai/Ao− → Ro/Ri+ → pulse delay → B− is the one that must be matched with the minimum pulse width. It is interesting to note that both delays appear between transitions of
the control signals of Ri and B, and can be implemented with just one asymmetric delay. A more detailed analysis about the timing requirements of the asynchronous controllers used for de-synchronization can be found in [20].
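Expressed as inequalities, the two requirements are straightforward to verify once pre-layout delay numbers are available. The Python check below is an illustrative sketch with invented values; it does not reproduce the detailed timing analysis of [20].

def matched_delay_ok(d_match, t_clk_to_q, t_comb_max, margin=0.0):
    # The A+ -> Ro/Ri- -> matched delay -> B+ path must cover the latch
    # clock-to-output delay plus the worst-case combinational delay.
    return d_match >= t_clk_to_q + t_comb_max + margin

def pulse_width_ok(d_pulse, t_pulse_min, margin=0.0):
    # The B+ -> Ai/Ao- -> Ro/Ri+ -> B- path must exceed the minimum latch pulse width.
    return d_pulse >= t_pulse_min + margin

print(matched_delay_ok(d_match=4.7, t_clk_to_q=0.3, t_comb_max=4.2, margin=0.1))  # True
print(pulse_width_ok(d_pulse=0.5, t_pulse_min=0.4, margin=0.05))                  # True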
4.8 Design Flow
Given a synchronous circuit implemented with combinational logic and D flip–flops, all of which work at the same clock edge, de-synchronization can be performed in the following steps: (1) Conversion into a latch-based design. All flip–flops must be decomposed into the corresponding master and slave latches. (2) Generation of matched delays. Each matched delay must be greater than the critical path of the corresponding combinational block. The matched delays serve as completion detectors. In the case of variable-delay units that may have data-dependent delays, a multiplexer selecting a specific delay can be used, thus taking advantage of the average performance of the functional units [9]. (3) Selection and implementation of the asynchronous controllers. As shown in Section 4.6, different controllers can be used for each latch, trading off area and delay. Finally, the C elements combining the request and acknowledge signals of multiple inputs and outputs, respectively, must be inserted. Unfortunately, the delay of each functional unit is not known until the circuit layout has been generated. This can make the estimation of matched delays overly conservative, with a negative impact on performance. Matched delays are simply chains of inverters and NAND/NOR gates, as shown in Figure 4.6. A practical approach to design the matched delays is to initially start with pessimistic delays, according to pre-layout timing analysis. After layout, timing analysis can be performed and more accurate delays can be calculated. The already placed delays can be re-adjusted by removing some of the gates. This operation only
requires incremental rewiring, leaving the unused gates in their original location. The previous design flow offers important benefits with regard to other more intricate flows for asynchronous design. Given that the datapath remains intact, conventional equivalence checking tools can be used for verification and scan paths can be inserted for testability. Asynchronous handshake circuits can also be tested by using a full-scan methodology, as discussed in [98]. Handshake circuits are self-checking, and the work in [59] showed that 100% stuck-at coverage can be achieved for asynchronous pipelines using conventional test pattern generation tools. Since variability is small between devices in relatively small regions of the layout, it is reasonable to assume that the performance of the circuit is determined by the matched delays. In this scenario, the response time of every individual die can be measured by observing the request and acknowledge wires at the boundaries of the system. In other words, the maximum speed of a die can be estimated by only looking at the timing of transitions of some output signals with respect to the clock input, without the need for expensive at-speed delay testing equipment. This allows one to classify dies according to their maximum operational speed (binning), which so far was only used for leading-edge CPUs due to the huge cost of at-speed testing equipment.
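The post-layout re-adjustment of a matched delay can be reduced to a small calculation: drop gates from the end of the chain as long as the remaining delay still covers the measured critical path plus a margin. The Python sketch below is a hypothetical illustration with invented gate delays.

def gates_to_remove(chain_gate_delays, post_layout_path, margin):
    # chain_gate_delays: delays of the gates in the matched-delay chain, last gate first.
    required = post_layout_path + margin
    total = sum(chain_gate_delays)
    removed = 0
    for d in chain_gate_delays:
        if total - d < required:
            break
        total -= d
        removed += 1
    return removed, total

n, remaining = gates_to_remove([0.12] * 10, post_layout_path=0.9, margin=0.05)
print(n, "gates bypassed; remaining matched delay =", round(remaining, 2), "ns")   # 2, 0.96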
4.9 Why De-synchronize?
While de-synchronized circuits closely resemble their synchronous counterparts, there are subtle but important differences. These make de-synchronization an attractive approach to tackle some of the problems inherent to modern circuit design. The significant uncertainties in gate and wire delays of current technologies are causing designers to adopt extremely conservative strategies to ensure that a sufficiently large number of manufactured chips work correctly. While statistical timing analysis may help to identify and quantify the different sources of variability, designers must still define huge margins that have a negative impact on performance.
De-synchronization is an approach to tackle the effects of correlated variability sources, such as supply voltage, operating temperature, and large-scale process variations (e.g., optical imperfections). We believe that the natural capability of accommodating circuit performance to environmental conditions is what allows this approach to explore a wide range of voltage/frequency points that are not achievable by synchronous circuits. We refer the reader to Section 6.2, where the results obtained from the de-synchronization of a DLX microprocessor illustrate the features previously mentioned.
4.10 Conclusions
The approach presented in this chapter is probably the simplest strategy to mimic a synchronous circuit without a clock. A correct synchronization is guaranteed by replacing the clock with a network of distributed asynchronous controllers associated with the sequential elements of the data-path. This simple strategy allows one to use existing tools for synthesis and timing analysis of synchronous systems. Section 6.2 presents a practical example of de-synchronization.
5 Automated Gate Level Pipelining (Weaver)
5.1 Automated Pipelining: Motivation
This chapter presents a direct translation approach for automatic implementation of conventional RTL designs as fine-grain pipelined asynchronous QDI circuits. This approach has two main distinctive features. It solves the global clocking problem per se and replaces the register transfer architecture with a gate transfer architecture (further referred to as GTL). This results in very fine-grain pipelined circuits, regardless of the level of pipelining of the original synchronous implementation. In synchronous designs, automation of pipelining is difficult to implement as it changes the number and position of registers, which finally results in a completely new specification. There are no tools capable of establishing the correspondence between the functionality of synchronous pipelined and synchronous non-pipelined designs. Moreover, synchronous pipelining is reasonable only for more than eight levels of logic. This number is increasing because gate latency decreases as technology shrinks while clock limitations (routing and skew) stay the same [39, 46].
Fig. 5.1 Retiming based automated RTL pipelining.
To illustrate the inefficiency of fine-grain pipelining in synchronous systems, we set up an experiment with retiming-based automated synchronous pipelining. We synthesized the Electronic Code Book RTL of the Advanced Encryption Standard (AES) [30] with Synopsys DC using the Artisan 0.18 library (typical). The first column of Figure 5.1 shows the non-pipelined implementation. It is a combinational design comprising over 42,000 gates. Columns 2–9 present results of various modes of automatic pipelining. Columns 2 and 3 show automatically pipelined hierarchical and flat implementations targeting maximum performance. The results with high numbers of stages (columns 4–9) are obtained by forcing Design Compiler to insert 10, 25, 50, 100, 200, and 400 pipeline stages, respectively. It is clear from Figure 5.1 that finer pipeline granularity in synchronous RTL causes a significant area overhead with little effect on performance. In the extreme cases, performance of the very fine-grain pipelined implementations is worse due to the growing complexity of the clock tree. These results support the theoretical conclusions from [39, 46] on the granularity limits of synchronous pipelining. The comparison between synchronous and automatically fine-grain pipelined asynchronous implementations
for the same example is presented later on (see Figure 6.7 in Section 6.3). Contrary to synchronous case in QDI designs, delays of gates and delays of handshake control implementations shrink consistently. No margin is required because data transfers are based on explicit local handshakes. These local handshakes allow a circuit operate correctly regardless of process and environmental variations. In GTL synthesis flow [116, 117, 118], pipelining is done by default at the gate level which gives the finest degree of pipelining and the highest potential performance. A part from efforts to automate pipelining in circuit synthesis a serious effort is dedicated to designing efficient micropipeline implementation based on dynamic logic. Dynamic logic is relevant not only for high speed implementations [101], but for hardware security as well [129]. Unfortunately, up to now the usage of dynamic gates was limited to custom or semi-custom design flows. There are two major reasons why EDA support for dynamic logic designs is difficult when using the synchronous methodology [13, 38]. First, each synchronous dynamic gate requires a clock input and uses both levels of the clock signal which makes it behave like a flip– flop from the EDA tool’s point of view. Second, charge sharing and uncertainty about worst case delay makes static timing analysis (STA) of dynamic circuits very complex. As a result commercial RTL synthesis tools provide almost no support for design in dynamic logic libraries. Our approach incorporates dynamic logic within QDI implementations for which timing problems are solved graciously. As a result the approach enables usage of dynamic gates in automatic standard-cell based design flow. The QDI implementation out of GTL synthesis flow is based on ideas borrowed from custom design methodology suggested for MIPS microprocessor in [71, 79] which were further developed to result in a fully automated design flow. The most attractive feature of this flow is a complete compatibility with standard synchronous RTL flow. The dual-rail nature of the implementation and the handshake control are completely hidden from the RTL designer. The GTL-flow does not require channel extensions of RTL. This is important as any standard
synchronous Verilog or VHDL specification can be directly implemented as an asynchronous fine-grain pipelined design without any modifications to the existing HDL code. In short, our flow uses results of synchronous synthesis to trace register transfers and set/reset points, and structurally replaces communication between registers by chains of gate-level communications. Thus, the asynchronous channels (dual-rail data and handshake control) are introduced only at the level of technology mapping. Other known approaches suggest the addition of channels to standard HDL (e.g., Verilog or VHDL) [103, 108, 109] to make these languages more appropriate for describing behavior of asynchronous systems. This approach, though attractive, requires the re-education of RTL designers and increases the difficulty of using legacy RTL code.
5.2 Automated Gate-Level Pipelining: General Approach
Micropipeline synthesis (as a particular case of SADT methodology) consists of three main steps: RTL synthesis, re-implementation, and final mapping. RTL synthesis is performed by standard RTL synthesis tools from a given HDL specification. Initially, RTL functionality is identified. Every combinational gate or clocked latch (gi ) is represented by a library cell (see examples in Figure 5.2). Clocked flip–flops are treated as pairs of sequentially connected latches (master gmi and slave gsi ) controlled by two opposite phase clocks and are represented as two cells each. Every data wire (i.e., any wire except for clock and reset) is mapped to a cell connection. This way, no additional data dependencies are added and no existing data dependencies are removed. The initial state of state holding gates (D-latches and D-flip–flops) is guaranteed by an appropriate reset. The only difference from standard RTL synthesis is that implementation is performed using a virtual library whose content corresponds to micropipeline library (see details below). This synthesis step determines the design implementation architecture. It can be tuned, in the same way as conventional RTL synthesis, to trade-off area, performance or dynamic power consumption.
Fig. 5.2 Micropipeline synthesis examples: clocked latch (top) and AND2 gate with a fork (bottom).
Re-implementation takes the RTL netlist obtained in the previous step as input. The algorithm has a linear complexity as it is substitution based. We assume that for every logic gate or latch there exists a cell or a previously synthesized module implementing the cell's functionality. This assumption is satisfied by targeting RTL synthesis to the virtual library, which is functionally equivalent to the micropipeline library, and by the bottom-up synthesis of hierarchical designs. Next, implementability and optimality are ensured. The netlist of virtual cells is translated into a micropipeline netlist using the stage cells from the micropipeline library. Finally, the micropipeline netlist is checked for fan-out/driving capability design rules and modified, if necessary, by the same RTL synthesis tool. The final netlist can be written in any of the formats supported by the tool in hierarchical or flat form. Figure 5.2 presents micropipeline synthesis examples for a clocked latch with fan-out of 1 and an AND2 gate with fan-out 2. The latch (Figure 5.2(a), top) is connected to reset with its preset pin and is initialized to "1" in the RTL implementation. During micropipeline synthesis an identity cell from the micropipeline library replaces the latch. This cell is initialized to the dual-rail value of logical "1."
The combinational AND gate is implemented by a micropipeline stage with equivalent dual-rail functionality. The gate output depends on both inputs; therefore, the inputs must be synchronized by a join module. Likewise, the output is split to X and Y, which makes it necessary to synchronize the feedback acknowledgments with a fork module. Channels denote sets of signals (two for dual-rail) carrying data and up to two handshake signals associated with data (request propagating in the same direction as data and acknowledge moving backward). Other components of micropipeline cells will be explained shortly. Figure 5.2 shows a general case of channel expansion using request (req), acknowledgement (ack), and dual-rail binary data wires. The join and fork module implementations are protocol dependent. The Muller C-element, for example, may serve as a handshake synchronizer. Channels depicted in Figure 5.2(b) show the most general case when both req and ack signals are present together with synchronization for multiple-input and multiple-fan-out gates. However, some templates may not use all four (handshake req, ack, and data d0, d1) communication lines (e.g., the precharge half-buffer (PCHB) from [92] does not use req). Portions of combinational logic are substituted by functionally equivalent pipeline stages. Without loss of generality, let every such portion be a single logic gate in the synchronous implementation. The following properties of micropipeline synthesis are guaranteed by the flow design: during the mapping of the RTL implementation, data wires are substituted by channels and: • No additional data dependencies are added and no existing data dependencies are removed. • Every gate implementing a logical function is mapped to a fine-grain cell (stage) implementing the equivalent function for dual-rail encoded data and initialized to the spacer. More detailed consideration of this and other properties of micropipeline synthesis can be found in [115, 118].
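Because the re-implementation is purely structural, its core can be pictured as a netlist rewrite of only a few lines. The Python sketch below is an informal illustration (the cell naming convention DR_<type> and the netlist format are invented): every single-rail gate becomes a dual-rail stage, stages with more than one input need a join on the request side, and stages with more than one fanout need a fork to gather the returning acknowledgments.

def weave(netlist):
    # netlist: gate name -> {"type": cell type, "inputs": [driving gates or primary inputs]}
    fanout = {}
    for gate, cell in netlist.items():
        for src in cell["inputs"]:
            fanout.setdefault(src, []).append(gate)
    stages = {}
    for gate, cell in netlist.items():
        stages[gate] = {
            "cell": "DR_" + cell["type"],               # dual-rail micropipeline stage
            "channels_in": list(cell["inputs"]),        # every data wire becomes a channel
            "needs_join": len(cell["inputs"]) > 1,      # C-element on incoming requests
            "needs_fork": len(fanout.get(gate, [])) > 1 # C-element on returning acks
        }
    return stages

rtl = {"u1": {"type": "AND2", "inputs": ["x", "y"]},
       "r1": {"type": "LATCH1", "inputs": ["u1"]},
       "r2": {"type": "LATCH1", "inputs": ["u1"]}}
for name, s in weave(rtl).items():
    print(name, s["cell"], "join" if s["needs_join"] else "-", "fork" if s["needs_fork"] else "-")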
5.3 Micropipeline Stages: QDI Template
The suggested fine-grain pipelining flow is targeted at QDI implementations. The assumptions about isochronic forks are supported by keeping all wire forks internal to micropipeline cells and by constraining the internal layout of cells. An example of a dynamic implementation of a micropipeline stage cell implementing the AND2 function is shown in Figure 5.3. This particular example illustrates the reduced stack precharge half-buffer (RSPCHB) template from [92]. A block implementing the stage logical function is denoted by F. The rest of the blocks are typical for most of the stages. LReq and LAck are the left and RReq and RAck the right requests and acknowledgments, respectively. ACK implements the handshake, PC is the phase (precharge/evaluate) control, CD serves as completion detection, and M stands for memory. "Staticizers" (or keepers) are formed by adding weak inverters as shown in Figure 5.3. They are needed to store the stage output value for an unlimited time, thus eliminating timing assumptions. At the same time, keepers solve the charge sharing problem and improve the noise margin of precharge-style implementations. The req line is used in protocols to signal data availability to the following stages, while the ack line is used to indicate data consumption to the previous stage.
Fig. 5.3 AND2 micropipeline stage dynamic implementation example.
Depending on the communication protocol, some or all of the handshake events can be transmitted over the data lines, so the req and/or ack lines may not be needed. A dedicated micropipeline library, with each cell representing an entire micropipeline stage, localizes in-stage timing assumptions and power balancing inside the cell, thereby leaving them to the library designer. With DI inter-stage communication, the functionality of the implementation no longer depends on its place & route. Note (Figure 5.3) that the implementations of the memory and logic functions are of the same cost and speed as their synchronous dual-rail domino counterparts. The main sources of area overhead are the Muller C-element for the handshake control implementation (ACK), the completion detection circuitry (CD), and the ack/req synchronization (see Figure 5.2).
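The dual-rail discipline these stages rely on can be modeled behaviorally in a few lines. The Python sketch below is an added illustration of the encoding and of completion detection for an AND2 stage; it abstracts away the precharge control, the handshake, and the keepers of the actual RSPCHB circuit.

SPACER = (0, 0)            # neither rail asserted: the "no data" value between tokens

def encode(bit):
    # Convention used here: (rail1, rail0); rail1 carries logical 1, rail0 logical 0.
    return (1, 0) if bit else (0, 1)

def valid(x):
    # Per-channel completion detection: exactly one of the two rails is asserted.
    return (x[0] ^ x[1]) == 1

def dr_and2(a, b):
    # Evaluate only when both inputs hold valid data; otherwise keep the spacer.
    if not (valid(a) and valid(b)):
        return SPACER
    return encode(a[0] == 1 and b[0] == 1)

print(dr_and2(encode(1), encode(1)))   # (1, 0): logical 1
print(dr_and2(encode(1), encode(0)))   # (0, 1): logical 0
print(dr_and2(encode(1), SPACER))      # (0, 0): one input is still missing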
5.4 Design Flow Basics
The Weaver synthesis tool [115, 144] consists of two main parts: an engine and a set of scripts. These scripts implement the user interface (commands) and handle interaction with a standard RTL synthesis tool. The flow is implemented [115, 117, 118] so as to maximize the use of commercial design tools. The Weaver engine is a VHDL compiler based on the Savant VHDL compiler [110]. The engine incorporates VHDL and Synopsys Liberty [67] parsers/generators to interface the design and library specifications with industry-standard tools. The RTL synthesis tool currently used in the flow is Synopsys Design Compiler®. This set of tools is targeted at micropipeline synthesis but also automates some library installation tasks. Library installation is executed once per library or every time the library is modified. This step is essential for ensuring the flexibility of the flow in using various micropipeline libraries. A micropipeline stage cell is defined as a pre-designed module that implements one or more functions of its inputs. This layer of abstraction allows the flow to be
flexible and support different micropipeline styles. Every data input or output is considered to be a channel consisting of encoded data and, optionally, handshake lines. The virtual library is an imaginary single-rail synchronous RTL library functionally equivalent to the micropipeline library. The virtual library is generated from the micropipeline library during its installation. The cell AND2 in Figure 5.2(a) is a virtual library cell generated for the micropipeline cell AND2 (Figure 5.2(b)) found in the micropipeline library. Area and delay characteristics of the virtual library cells are mapped from the corresponding micropipeline library cells to make optimization during the RTL synthesis meaningful. For the adopted QDI style, only the skew of wire delays after forks matters. The delays after forks are kept under control by ensuring that inter-cell communications are delay insensitive. The flow achieves this by placing all isochronic forks inside library cells. Together with guaranteeing correct functionality, this requirement lowers the effect of variation in design parameters. As long as individual cells are designed to handle variation inside the cell, no global constraints are necessary. The communication protocol is a return-to-zero (RTZ) DI protocol. All submodules of any given module must use this protocol, in which data items are separated with spacers. The flow currently cannot handle a library designed for non-return-to-zero (non-RTZ) protocols, like phased logic, since the synthesis procedure would be quite different. Apart from the handshake synchronization cells, the library must also implement the equivalent of an AND2 cell (or OR2, assumed to be the dual of AND2) and an identity function stage (buffer stage). The inverter is implemented as the cross-over of data wires. There must be at least three versions of each buffer stage implementation (resettable to data1, data0, and spacer) and at least one resettable-to-spacer version of cells implementing other logic functions.
5.5 Fine-Grain Pipelining with Registers
Suppose that a synchronous RTL netlist produced by a standard RTL synthesis engine consists of combinational logic (CL) gates, D-latches (DL), and D-flip–flops (FF). Clocked registers (either composed of
Fig. 5.4 Weaving with clocked registers.
DFFs or DLs) are used in RTL implementations to separate the concurrent processing of distinct consecutive data items which advance through the registers at the pace of the clock signal. Figure 5.4(a) shows master and slave latches with empty and shaded circles representing bubble and data items. "Master" and "slave" latches are controlled by opposite edges of a clock. The pipeline cells can be divided into two categories: full-buffer (FB) and half-buffer (HB) [79, 92]. The two differ in token capacity, i.e., the number of data items that can fit in a pipeline of a given length. The handshake protocols we are considering support the separation of consecutive data tokens (with a spacer/bubble), which makes it possible to substitute every DL with one HB stage (and every DFF with two HB stages or one FB stage). The result is shown in Figure 5.4(b) where, in the absence of a clock signal, the circles represent the initial states of the stages. Bubbles are denoted by empty circles and tokens by shaded ones. In HB pipelines, distinct tokens are always separated with bubbles (no two distinct tokens in any two adjacent stages). From this follows the next property: for each FF in the RTL implementation, the GTL implementation contains two HB stages, one initialized to a bubble and the other to a token. More tokens can simultaneously fit in the pipeline, increasing its performance relative to the synchronous implementation.
For linear and loop-free data paths, initialization is less of an issue because it impacts only a few output items until the pipeline is flushed with new data tokens. In the presence of loops and/or nonlinearities, initialization is crucial for liveness, as shown in [118].
5.6 Pipeline Petri Net Model of the Flow
We use Petri nets (PN) [87] to model the behavior of the original synchronous and resulting GTL circuits. Petri nets are also used to prove the correctness of the GTL implementation. We use a high-level abstraction where PN markings represent the positions of tokens (data portions) in the pipeline.
5.6.1 Linear Case
In synchronous implementations each pipeline stage is implemented with a D-flip–flop (DFF) or two D-latches (DL) with oppositely driven clocks to store the result of data processing in the combinational logic (CL) (see Figures 5.5(a) and 5.5(b)). The synchronous pipeline PN model is shown in Figure 5.5(c), where t1 represents the DL1 state change, t3 the DL2 state change, etc., while the transitions clk0 and clk1 represent clock edges. For asynchronous pipeline implementations, the model from [139] (Figure 5.5(d)) can be used. It restricts the PN in such a way that for every two transitions ti, tj connected through a place pk, there exists a place pl that succeeds tj and precedes ti. We refer to such PNs as Pipeline Petri nets (PPNs) [139]. We call a pair of places like pk, pl a stage space, and the stage space together with its adjacent transitions a stage. A PPN consists of stages for which the following properties are satisfied: • No stage space is shared between any two stages. • Exactly one place is marked in every stage space. Depending on whether the post- or precondition of a stage is marked, the stage is said to contain a token or a bubble. In the course of PPN execution adjacent stages having opposite markings can change their states by firing transitions. Tokens propagate in one direction (the
Fig. 5.5 Modeling pipeline with Petri nets.
direction of data propagation) while bubbles propagate in the opposite direction. The PPN feedback arcs, along with a proper initial marking M0, preserve PPN 1-safeness. Safeness is important because it is a necessary condition to guarantee that different data tokens never collide in PN stages. In Figure 5.5 stages are delimited by vertical lines. Linear PPNs might be represented by marked graphs (MG) [16] in which places are not shown explicitly but are assumed to be sitting on every arc (see Figure 5.5(d)). To model half-buffer (HB) GTL pipelines we modify the PPN model by making the PPN feedbacks twice as long (to span over one stage).
This reduces the modeled pipeline token capacity by the same factor of two. We call this model HB PPN and use it to model HB pipelines (Figure 5.5(e)). Based on the token capacity equivalence of the HB PPN model to HB pipelines, it was shown [118] that the following property is satisfied: a GTL-style pipeline is properly modeled by an HB PPN. Note that during pipeline synthesis both DLs and portions of CL (individual gates by default) are mapped to HB GTL stages, increasing the pipeline token capacity. Hence, starting from the pipeline, there are 8 stages in Figure 5.5(e) (assuming CL1, CL2, and CL3 are one gate each). Liveness of the HB PPN model is rather straightforward, and related results are presented in [118].
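The HB PPN firing rule has a simple operational reading: a token may advance into the next stage only if that stage is a bubble and no token sits immediately ahead of it, so distinct tokens always stay separated by bubbles. The Python sketch below is a rough behavioral abstraction written for this text (it ignores the actual four-phase handshake and assumes equal stage delays); it simulates a linear HB pipeline with an arbitrary initial state.

TOKEN, BUBBLE = "T", "."

def step(stages):
    # One round of firing, scanned from the output side so a token moves at most one stage.
    nxt = list(stages)
    n = len(nxt)
    for i in range(n - 1, 0, -1):
        ahead_free = (i == n - 1) or nxt[i + 1] == BUBBLE   # keeps tokens separated
        if nxt[i] == BUBBLE and nxt[i - 1] == TOKEN and ahead_free:
            nxt[i], nxt[i - 1] = TOKEN, BUBBLE
    return nxt

state = [TOKEN, BUBBLE, TOKEN, BUBBLE, BUBBLE, BUBBLE]
for _ in range(4):
    print("".join(state))
    state = step(state)
# Prints: T.T...  ->  .T.T..  ->  ..T.T.  ->  ...T.T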
5.6.2 Register Elimination and Slack Matching
Let us discuss area vs. performance trade-offs specific to fine-grain pipelines. The stages corresponding to DFFs and/or DLs carry no functionality (identity function stages). In synchronous designs, they are needed to store data. In the mapped fine-grain pipelines, every pipeline stage is capable of storing data. Therefore, identity stages are not needed and might be optimized out. This has the effect of reducing area and latency overhead as long as the initial marking can be transferred to other stages. Consider Figure 5.6(a) in which two identity stages are initialized to carry tokens. If the CL portions are deep enough (greater than 2 levels) these identity stages can be optimized out, with the initialization moved to the preceding CL stages as in Figure 5.6(b). Eliminating redundant identity stages must be performed with care. The performance of pipelines may suffer when they have too few bubbles because data between HB stages can propagate only by overwriting bubbles. Let a pipeline have two reconvergent paths: one long (with a large number of stages) and one short (with a few stages). Then, no matter how many bubbles the pipeline has in the long path, the performance will be limited by the throughput of the short path. This is due to the request–acknowledge protocol, which causes the number of handshakes in these paths to be the same. One may consider inserting
Fig. 5.6 Removing redundant stages.
additional bubbles (identity stages) into the short path to improve the performance. This problem of balancing the path length in asynchronous pipelining is known as slack matching [74]. In the example from Figure 5.6, it is assumed that all paths through combinational logic are 4 levels deep. Such an assumption is impractical. A more realistic example is shown in Figure 5.7, where combinational gates (denoted by rectangles) are placed according to their topological sorting. One may push identity stages from a register (e.g., level 5 in Figure 5.7(a)) into CL paths (Figure 5.7(b)), with further elimination of redundant stages, once the desired performance target is achieved (Figure 5.7(c)). Generally, the best performance is achieved if the "shape" of a module is close to rectangular, i.e., all paths along the pipeline have the same slack (Figure 5.7(d)). The current implementation of slack matching in the Weaver flow assumes that HB stages found in the library have approximately the same delay. Thus, path lengths are measured in the number of stages. The procedure always preserves the shape of submodules in hierarchical designs and only the top module is forced to have a rectangular "shape" (white rectangles represent the stages added at this step).
Fig. 5.7 Slack matching.
The complexity of the suggested analysis for slack matching by balancing pipeline paths is O(|X||C|²), where |X| is the number of primary inputs and |C| is the number of connection points in the netlist. More sophisticated slack matching algorithms were reported recently [8, 100]. They reduce the problem to an ILP formulation and achieve better results. The scalability of these methods might be a bottleneck in applying them to big and hierarchical designs.
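To make the balancing step concrete, the Python sketch below implements a simplified variant of the idea (it is not the Weaver procedure itself): every stage is assumed to have the same delay, the depth of each gate from the primary inputs is computed in one topological pass, and identity stages are padded onto any reconvergent input whose depth lags the deepest one by more than the allowed capacity skew.

def slack_match(netlist, primary_inputs, max_skew):
    # netlist: gate -> list of inputs (gates or primary inputs), given in topological order.
    depth = {p: 0 for p in primary_inputs}
    buffers = {}
    for gate, ins in netlist.items():
        input_depths = [depth[i] for i in ins]
        target = max(input_depths)
        for src, d in zip(ins, input_depths):
            pad = target - d - max_skew
            if pad > 0:
                buffers[(src, gate)] = pad       # identity stages inserted on this channel
        depth[gate] = target + 1
    return depth, buffers

# Toy example: a three-stage branch reconverges with a direct connection at gate "z".
net = {"a": ["x"], "b": ["a"], "c": ["b"], "z": ["c", "x"]}
_, buffers = slack_match(net, primary_inputs=["x"], max_skew=1)
print(buffers)   # {('x', 'z'): 2}: two identity stages pad the short branch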
5.6.3 Correctness of Gate-Level Pipelining
Similarly to [20] the correctness of GTL implementations is based on the following properties: (1) Different data items never collide when propagating through a circuit, i.e., every pipeline stage may contain a single data item, at most. (2) Pipelines are live, to ensure continuous operation. (3) Given a synchronous and a corresponding GTL implementation, sequences of values passing through a pipeline stage and a corresponding sequential gate (latch or flip–flop) coincide (flow equivalence). The following is the rationale behind satisfying properties 1–3 in GTL pipelines. Liveness comes from the requirement that, to the best of our knowledge, all RTL synthesis engines satisfy: every loop in a synchronous implementation contains at least one flip–flop (or a pair of latches). This implies that every loop in the corresponding PPN model will contain a token and, therefore, PPN is live. Flow equivalence can be proven along the lines of [20] since the communication protocols are compatible. There is a difference though. For de-synchronized and synchronous implementations there exists a natural one-to-one correspondence between sequential elements. This property is not guaranteed in GTL implementations since identity stages corresponding to registers might be optimized out (as discussed in the previous section). We omit the proof here (see [118] for more detailed consideration). Informally, it is based on the fact that, in a pipeline, the number of tokens stays the same, regardless of the number of stages inserted for slack matching or removed during register optimization.
5.7 Weaving: Simple Examples
In previous sections, we explained the general approach for converting a single-rail netlist into a GTL implementation. Converting combinational parts is straightforward. It reduces to mere substitution of
Fig. 5.8 Weaving for a combinational circuit.
combinational gates with the corresponding GTL cells (Figure 5.2(b)) and is illustrated by Figure 5.8, where join and fork are shown by multiple xor-like symbols. Mapping of sequential parts, and circuits with loops, is less obvious and requires some explanation.
5.7.1 Mapping Latches and Flip–Flops
Figure 5.9 shows a shift register-like example where the original synchronous implementation has shallow combinational logic between
Fig. 5.9 Latches and flip–flops mapping: shallow CL.
flip–flops (one level to none). The implementation first proceeds by mapping every pair of latches onto a pair of identity HB stages (one of which must be initialized to data). Then, the optimization step removes some of the HB stages, while ensuring that any two stages initialized to data remain separated by at least one stage initialized to a bubble. Note that, due to the lack of combinational logic between flip–flops, the room for optimizing identity stages is very limited and most of them are non-redundant (see Figure 5.9(b)).
5.7.2 Coding for GTL
From the explanation above, it follows that coding for GTL is not different from the conventional "synchronous" HDL coding style. Figure 5.10(a) shows a sample VHDL specification of a sequence detector and Figures 5.10(b) and 5.10(c) sketch its corresponding implementations, where HB and FB stand for half and full buffers while F stands for fork. The clock signal (shaded in Figure 5.10(a)) is optimized out in the final design. Such an optimization does not result in the loss of functionality, because replacing the clock with handshakes provides an equivalent means of control over data propagation. The polarity of the clocks is used to identify the initialization state of the pipeline
Fig. 5.10 Sequence detector (a) behavioral VHDL specification; synchronous (b) and GTL (c) implementations.
stages: for positive edge driven latches, the corresponding stages are initialized to data, while the stages corresponding to negative edge driven latches are initialized to bubbles.
5.7.3 Finite State Machines: Mapping Circuits with Shallow Loops
The performance of GTL implementations is limited by the throughput of computational loops. Throughput of a computational cycle is defined as the overall delay of a cycle, divided by the number of data tokens in it. FSMs consist of a state register, with the combinational logic for next state computation looping back to this register. This loop, naturally, has a single token, as only one state is enabled at any moment of time. Care must be taken to ensure that this loop is not a bottleneck that slows down the speed of the whole system. The only way to improve the speed of the FSM loop is to utilize the gain in efficiency in the next-state logic generation. This is achieved in the GTL flow by identifying the FSM and re-synthesizing it with a one-hot encoding style. Weaver is responsible for the FSM identification, and the host synthesis engine is called to perform re-synthesis.
5.7.4 Experimental Results
To evaluate the advantages and penalties provided by the GTL flow, we compare GTL circuits with synchronous ones implementing the same functionality. The comparison mainly concerns two design metrics: performance and area. For GTL, we chose dynamic logic as the implementation basis because of its superiority over static gates, both in area and in performance. Note, that the difficulty of timing characterization of dynamic logic (which is a big obstacle for using it in synchronous flows) is irrelevant for GTL because of the QDI nature of the proposed pipelined implementation. We implemented a simple dynamic library in TSMC 0.18 µm technology obtained through MOSIS. The library was optimized for speed and is minimal, consisting of an identity function stage and an AND2 gate only. This library was used to map the set of MCNC combinational benchmarks. Performance related results for
several benchmarks are presented in [117, 118]. The performance of GTL implementations is limited by: (1) the handshake cycle time of the slowest stage, i.e., the maximum delay of a local loop; and (2) the delay of the computation loop divided by the number of tokens in this loop, i.e., by the throughput of computational loops. In our minimal library, the cycle time of local loops is slower than the latency of a conventional static gate. Due to this, GTL implementations provide performance advantages starting from a combinational depth of 6–10 gates. This is typically satisfied because in modern ASIC designs the average combinational depth is about 15 levels of logic (see [13]). Note that a more efficient domino library with a careful cell design could provide performance of several gigahertz, as was shown for QDI custom design by Fulcrum [34]. This data point shows huge room for performance improvements of GTL implementations. We also checked the typical performance of one-hot GTL FSMs mapped into the minimal library. On average, we found that it is about 500 MHz. The area overhead for FSM implementations is about the same as for combinational benchmarks. In general, the area penalty is significant for QDI implementations. It comes from the dual-rail nature of QDI designs (about 2× area increase), from the completion detectors, and from handshake overhead. Pipelined implementations have an additional source of area penalty that stems from the need to perform slack matching to achieve the desired performance target. We found [115] an interesting trade-off while developing the slack matching method described in Section 5.6.2. Slack matching balances the token capacities of the convergent paths, thereby reducing token congestion and increasing performance. The amount of tolerated "capacity skew" and, therefore, the cost of slack matching (in the number of buffer stages inserted) depends on the example. We varied the capacity skew per length of a path and found that zero skew, i.e., complete balancing of all paths, is a very costly proposition. Figure 5.11 shows how performance and area depend on the capacity skew of a 10-round implementation of the Electronic Codebook
Fig. 5.11 Area-performance trade-off: slack matching of a 10-round AES implementation.
(ECB) mode from the Advanced Encryption Standard (AES) [30]. The libraries used for these experiments (M2PCHB and power-balanced BSDT [73]) contain implementations of functions with up to 2 inputs and resettable buffer stages with all the necessary synchronizers. The performance plots (left) depend on the design properties and on the micropipeline library implementation and protocol. M2PCHB and BSDT implement different protocols, which explains the significant difference in performance. The qualitative picture of performance degradation with capacity skew remains the same for both libraries. The area plots are shown in Figure 5.11 (right). The shift between curves, for the same example and different libraries, is proportional to the difference in the area of the library cells. One can see that there is a sweet spot for slack matching where both the induced area penalty and the performance degradation are small. For the given example, it lies around a skew value of 6 stages. Unfortunately, we are unaware of algorithms that determine the capacity skew giving the best trade-off between area penalty and performance; this is an open problem. In practice, it should be a parameter that depends on the example and might be adjustable across synthesis runs.
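The capacity-skew parameter discussed above can be illustrated with a small sketch. The following Python fragment is only a schematic model under our own simplifying assumptions (one unit of capacity per stage, a single reconvergence point, a user-chosen skew tolerance); it is not the actual slack-matching algorithm of the Weaver flow.

```python
def slack_match(path_lengths, skew_tolerance):
    """Estimate the buffer stages needed to balance reconvergent paths.

    path_lengths   -- lengths (in pipeline stages) of the paths that
                      reconverge at the same join.
    skew_tolerance -- maximum tolerated capacity skew, in stages;
                      0 means complete balancing.
    Returns (buffers_per_path, total_buffers).
    """
    longest = max(path_lengths)
    buffers = [max(0, longest - length - skew_tolerance)
               for length in path_lengths]
    return buffers, sum(buffers)


if __name__ == "__main__":
    # Hypothetical reconvergent paths of 4, 9, and 21 stages.
    for skew in (0, 3, 6, 9):
        per_path, total = slack_match([4, 9, 21], skew)
        print(f"skew tolerance {skew:2d} -> buffers {per_path}, total {total}")
```

With zero tolerance every short path is padded up to the longest one, which is exactly the costly complete-balancing case noted above; relaxing the tolerance by a few stages removes most of the inserted buffers at a modest cost in throughput.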
5.7.5 Conclusions
In this chapter, we presented a GTL framework and weaving technique to automatically compile synchronous specifications into fine-grain
pipelined QDI circuits. We have shown how the basic synchronous constructs are mapped onto asynchronous pipeline cells. This technique provides a performance speedup that is proportional to the combinational depth of the RTL netlist. The area and performance figures depend heavily on the quality of the library, and the development of efficient libraries is crucial for the industrial success of this methodology. Our preliminary results with a limited pipeline library show that the suggested approach is capable of providing a significant performance boost and might be useful for applications requiring high speed. Another area where the suggested flow provides significant added value is security-related applications (see the next chapter for a more detailed discussion).
6 Applications and Success Stories
This chapter illustrates the techniques described in the rest of this paper by using some realistic (and in some cases real-life) examples. We consider both standard components, such as microprocessors, and dedicated components, such as security co-processors, to show that asynchronous implementation can provide advantages to all sorts of applications and domains.
6.1 Low-Power Robust Design Using Haste
The ARM996HS is a new 32-bit RISC processor core, part of the ARM9E family of cores. The ARM9E family is based on the ARMv5TE Instruction Set Architecture (ISA), which includes the 16-bit Thumb and 32-bit ARM instruction sets. The Harvard-architecture core is based on a 5-stage integer pipeline with a fast 32-bit multiply-accumulate (MAC) block. It has tightly coupled memories (TCMs) for both instruction and data, each of which can be up to 4 MB. Dual AMBA AHB-Lite synchronous buses provide the instruction and data interfaces. Specific security enhancements for the ARM996HS core include a 12-region, 32-byte-granularity memory protection unit (MPU) and the provision
of non-maskable interrupts (NMI). A hardware divide co-processor is also provided, offering unsigned and signed 32-bit division through normal coprocessor instructions. The pipeline within the ARM996HS core mirrors the normal ARM9E pipeline, with the exception that it is implemented as a clockless design. The pipeline handshakes with the system controller to fetch instructions and to load and store data. The key advantages of the ARM996HS include low power, with the clockless processor consuming almost 3× less power than the ARM968E-S. Additionally, the characteristically flat EM spectrum results in reduced interference with radio receivers. The ability of the ARM996HS to automatically track and adapt to changes in current, temperature, and voltage parameters enables reliable operation across a wider range of environmental conditions than a conventional synchronous core. The benchmark results are based on simulations for the ARM968E-S and ARM996HS that were performed on a post-layout netlist in a Sage-X 0.13-micron TSMC process. The ARM968E-S netlist was synthesized for 100 MHz. To establish power metrics, both designs were simulated running at an equivalent 77 MHz (or 83.3 DMIPS), in nominal conditions. Performance of the ARM996HS is 50 equivalent MHz in the worst case (1.08 V, 125°C); under nominal conditions (1.2 V, 25°C), 77 equivalent MHz is achieved. These figures correspond to 54 and 83 Dhrystone MIPS, respectively. (The equivalent MHz numbers are based on how fast an ARM968E-S would have to run to deliver the same performance.) At 89 K gates for the ARM996HS versus 88 K gates for the ARM968E-S, the ARM996HS is very competitive, considering that it also includes a 12-region MPU and a hardware divide coprocessor. Power consumption for the ARM996HS is very low, at only 0.045 mW/MHz. This is a factor of 2.8 lower than the most power-efficient synchronous ARM9E core, the clock-gated ARM968E-S. Current peaks are reduced by a factor of 2.4 compared to a clock-gated ARM968E-S. In clockless circuits, activity is spread over time,
leading to low current peaks and preventing noise on supply lines. The low electromagnetic emission prevents interference with radio receivers.
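Putting the quoted figures together (simple arithmetic on the numbers above, not additional measured data), at the 77 equivalent MHz operating point:

\[
P_{\mathrm{ARM996HS}} \approx 0.045\ \mathrm{mW/MHz} \times 77\ \mathrm{MHz} \approx 3.5\ \mathrm{mW},
\qquad
P_{\mathrm{ARM968E\text{-}S}} \approx 2.8 \times 3.5\ \mathrm{mW} \approx 9.7\ \mathrm{mW}.
\]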
6.2 Low-Power Robust Design Using De-synchronization
In this section, we present results for the application of de-synchronization to a large realistic circuit: a DLX microprocessor, designed with a 0.25 µm library and later fabricated.
6.2.1 DLX ASIC Core
The de-synchronized version of the DLX processor [41] is called ASPIDA (ASynchronous open-source Processor Ip of the Dlx Architecture). It was first designed synchronously at the Register Transfer Level, and then automatically de-synchronized using an early version of the desync tool described in [4]. Figure 6.1 shows the overall structure, including the multiplexed clocks and the five architectural pipeline stages, four of which actually correspond to circuit blocks (at the circuit level, WB is merged with ID). Each block is controlled by its own latch controller. The arrows between the latch controllers correspond to the Ro and Ao signals and illustrate the datapath dependencies. Stages ID, EX, and MEM form a ring. ID is the heart of the processor, containing the register file and all hazard-detection logic. It also synchronizes instructions leaving MEM (for WB) with instructions coming from IF. Data hazard detection is performed by ID, which compares the destination registers and opcodes of the instructions in the other pipeline stages with the instruction being decoded and decides how many NOPs to insert. After the initial synthesis of each circuit block using latches (without retiming), the whole design is optimized incrementally to meet all timing requirements. Max-delay constraints between latches are used to ensure the cycle time of the datapaths. The control blocks are left untouched by the synthesis tool. The gate-level netlist is then placed and routed together with the matching timing constraints. If delay-line gates are placed far from each other in the floorplan, the routing delay becomes unpredictable at synthesis time; hence, floor-planning was used extensively to control the routing delay.
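To make the hazard check concrete, the following Python toy model sketches the kind of decision ID has to take. It assumes a blocking pipeline without forwarding and a register file that is written in WB and readable in the same cycle; it is not the actual ASPIDA hazard logic, and the register and stage names are illustrative.

```python
def nops_to_insert(id_sources, downstream):
    """Toy model of the data-hazard check performed in ID.

    id_sources -- registers read by the instruction currently in ID.
    downstream -- list of (stage, destination register or None) for the
                  instructions in flight, ordered from the stage nearest
                  to ID (EX) to the stage nearest to write-back (MEM).
    Returns the number of NOPs ID must insert before issuing.
    """
    for position, (stage, dest) in enumerate(downstream, start=1):
        if dest is not None and dest in id_sources:
            # The nearest conflicting producer dominates: it still needs
            # len(downstream) - position + 1 cycles to reach write-back.
            return len(downstream) - position + 1
    return 0


if __name__ == "__main__":
    in_flight = [("EX", "r2"), ("MEM", "r7")]
    print(nops_to_insert({"r2", "r3"}, in_flight))  # conflict with EX  -> 2 NOPs
    print(nops_to_insert({"r7"}, in_flight))        # conflict with MEM -> 1 NOP
    print(nops_to_insert({"r4"}, in_flight))        # no conflict       -> 0 NOPs
```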
Fig. 6.1 De-synchronized DLX with Multiplexed-Clocking.
Table 6.1 ASPIDA post-layout power breakdown.

                    Sync.                De-sync.
Latches             28.81 mW (15.82%)    28.81 mW (14.41%)
Delay elements      0 mW (0%)            1.7 mW (0.85%)
Low-skew nets       27.3 mW (15%)        49.98 mW (25%)
The fabricated ASPIDA chip contains a DLX processor core and two on-chip memories. It supports multiplexed clocking, i.e., the chip can be operated in fully synchronous mode or in de-synchronized mode. In synchronous mode (used both for scan testing and for power and performance measurements), the master and slave latches are driven by two non-overlapping clocks from two global clock trees. In de-synchronized mode, the signals that open and close the latches are generated by asynchronous handshaking controllers. A coarse-grain partitioning was used for ASPIDA: each controller drives the master (M) or slave (S) latches of one of the four physical pipeline stages, including the processor’s register file, which resides within the ID pipeline stage. Table 6.1 shows the power breakdown for both designs, simulated at 50 MHz. Total power is divided into the power consumed by the latches, by the delay elements, and by the low-skew clocking nets. The ASPIDA chip, shown in Figure 6.2, was fabricated in early 2005, and post-manufacturing measurements verified the correct operation of the first silicon. Extensive tests over about 90 fabricated chips yielded very interesting results. The de-synchronized mode of operation demonstrated excellent tolerance of voltage-induced variability. Figure 6.3 shows that the de-synchronized mode of operation has significant performance (and hence energy) advantages over the synchronous mode when the voltage is varied. This is due to the self-adapting nature of the de-synchronized design, whereas in synchronous mode the external clocks were adjusted manually on the test machine, based on the results of functional testing.
6.3 Design of Cryptographic Coprocessor Resistant to Multiple Side-Channel Attacks
Strong cryptographic algorithms have been designed to withstand rigorous cryptanalysis. However, if the overall cryptographic system is
Fig. 6.2 ASPIDA die photo.
considered, including the physical implementation, the strong notions of security are far from guaranteed. The physical implementation provides attackers with important extra information and modes of attack not taken into account in classical considerations. Numerous attacks have been developed which exploit physical properties of implementations and information that is leaked through side channels, i.e., channels other than the data channel [42]. By exploiting the information leaked through side channels, an attacker can, with the help of statistical methods, quickly compromise the system. The importance of information-leakage attacks has been acknowledged by the industry, and mechanisms to protect against them comprise a major part of today’s security design. Various techniques aiming at decreasing the signal-to-noise ratio of the side-channel information have been developed for synchronous circuits. Some of these techniques artificially add variations in timing, power, and radiation by randomized clocking, noise generation, and randomization of parameters in
Fig. 6.3 ASPIDA: Period in synchronous (top) and de-synchronized (bottom) operation vs. supply voltage Vdd.
cryptographic algorithms [42]. However, all of the techniques developed for synchronous designs can be considered ad-hoc measures aimed mostly at increasing the “noise” rather than decreasing the variations of the original “signals” (power, electromagnetic interference,
timing, etc.) emitted by the device. Recent asynchronous dual-rail implementations [1, 84, 99, 120] provide more natural countermeasures against the sources of variations. However, these designs [1, 84, 99, 120] use traditional asynchronous techniques (not fine-grain pipelining) and circuitry that is not balanced at the level of gates, and they need ad-hoc measures to provide symmetry in logic between the bits in order to prevent information leakage. As was recently shown [62], such a straightforward asynchronous dual-rail methodology, without a carefully balanced cell library, can be even less secure than a synchronous dual-rail methodology. Without a pipelined array organization, such circuitry is not resistant to differential electromagnetic analysis attacks [66]. Our design methodology and the Weaver EDA flow address these practical and physical limitations and provide tool support for the complete design cycle of secure cryptographic hardware. Such hardware is capable of eliminating almost all sources of non-invasive side-channel information while allowing for very high performance of the implementation as well as short design time. It is based on asynchronous fine-grain pipelining with power-balanced cells, combining high performance with the best available power-analysis resistance and excellent fault-attack countermeasures. Using the gate-level pipelining methodology, we have designed a complete hardware implementation of the Advanced Encryption Standard (AES) [30]. Our AES implementation was built, simulated, and taped out using our EDA tools and balanced library. This AES implementation combines high differential power analysis (DPA) attack resistance with high performance.
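A small illustration of why dual-rail (1-of-2) data encoding helps at the source of the leakage: every codeword asserts exactly one of the two rails per bit, so the weight of the encoded data, a first-order proxy for switching activity, is independent of the value being processed. The Python sketch below is a toy model for illustration only and is not part of any of the flows described here.

```python
def dual_rail(bit):
    """1-of-2 (dual-rail) encoding of one bit: (true rail, false rail)."""
    return (1, 0) if bit else (0, 1)


def encode_word(bits):
    """Encode a list of bits as a flat list of rail values."""
    rails = []
    for b in bits:
        rails.extend(dual_rail(b))
    return rails


def complete(rails):
    """Completion detection: every bit has exactly one rail asserted,
    i.e., the OR of each rail pair is 1 (an all-zero word is a spacer)."""
    return all(t | f for t, f in zip(rails[0::2], rails[1::2]))


if __name__ == "__main__":
    for word in ([0, 0, 0, 0], [1, 0, 1, 1], [1, 1, 1, 1]):
        rails = encode_word(word)
        # The number of asserted rails equals the word length for every value.
        print(word, "weight:", sum(rails), "complete:", complete(rails))
```

Constant codeword weight removes the first-order value dependence; as noted above, however, without gate-level balancing of the two rails the residual mismatch can still leak.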
6.3.1 Dynamic Logic and Security
Some of the most promising methods for DPA resistance are based on specially designed, balanced, dynamic gates like those from [129]. Recent results on DPA resistance based on special power-balanced cells [129, 131, 132] show a very large reduction of the fluctuations in the power consumption of the circuits. However, most gate-level approaches, such as those from [129, 131, 132], have no countermeasures against glitch and fault-injection
attacks and require additional protection. More importantly, since the differential and dynamic (DD) approaches from [129, 131, 132] require dynamic (domino) logic cell design, the usage of DD gates has been limited to custom or semi-custom design flows. This greatly limits the perceived universality of DD-based designs. Problems with synthesis support for dynamic libraries make power-balanced dynamic circuitry practically unavailable for rapid ASIC development. Thus, researchers had to resort to less secure (e.g., less balanced) but easier-to-implement solutions based on standard static non-balanced gate libraries (see, e.g., the motivation for using WDDL in [130]). Asynchronous, dual-rail, balanced, pipelined circuit techniques can play a key role in hardware security due to many of their unmatched benefits. Some styles of asynchronous circuits naturally possess many desirable properties that designers try to add artificially to synchronous designs. Some of the benefits of this approach are discussed in [61, 62].
6.3.2 Cell Customization and Security Benefits
The previously described synthesis approach based on fine-grained templates is in large part independent of the detailed implementation of the template cell. For example, SABL gates [129, 132], one of the best known balanced gate designs, can easily be adapted to the template cells, preserving all of their balance properties while enhancing their fault resistance and robustness. The addition of asynchronous control removes the clocking and timing difficulties normally associated with these gates and makes them more suitable for security applications thanks to the benefits of asynchronous behavior [61, 62]. The function and operation of the additional asynchronous wrapper are almost completely data independent; only the completion detection of the wrapper requires a trivial power-balancing consideration, which can easily be met with two additional minimum-size transistors [73]. By simply using an unmodified SABL gate as the functional block of the asynchronous template and using the handshake circuitry of the template (like that shown in Figure 5.3) to generate the clock signal for the SABL gate, as shown in Figure 6.4, a fully QDI balanced
Fig. 6.4 Incorporation of a SABL gate into the QDI template.
gate is created. The resulting gate has power-balance characteristics identical to those of the original SABL gate. Moreover, the explicit synchronization and completion detection of the asynchronous template allows for fewer restrictions on the design of the balanced functional block. Restrictions such as the elimination of the early-propagation effect, which must be considered explicitly in synchronous implementations [60], are automatically satisfied. Explicit input completion can be incorporated into the design, which, coupled with the C-element, prevents evaluation until all of the input data have arrived. Furthermore, the timing and voltage tolerance of the QDI implementation allows for more aggressive dynamic designs, which can achieve better balance than previous designs. A balanced library designed specifically for the fine-grained asynchronous template, called Balanced Symmetric with Discharge Tree (BSDT) gates, was fully incorporated into the flow. The gates showed better balance than the synchronous SABL implementations (Figure 6.5) [73]. In addition to allowing a more robust design for dynamic balanced function blocks, the asynchronous handshake protocol and template add natural fault resistance to the design. For the balanced asynchronous gates presented in [73], out of all the possible transistor-level single stuck-at faults, inside and outside of the complete asynchronous gate, not a single fault changes the Boolean function of the gate. Almost
Fig. 6.5 BSDT-style XOR and the Standard Deviation of the evaluation phase of SABL and BSDT implementations.
80% of the faults result in a pipeline stall, which naturally prevents further data processing and creates a deadlock within the pipeline. We are currently performing a full analysis of the side-channel information leakage from sample implementations. Initial simulated power-analysis attacks on the Sbox of the Data Encryption Standard (DES) indicate that the beneficial properties of the balanced dynamic gates and asynchronous circuits carry over to the proposed implementations. Figure 6.6 shows the results of the performed simulations and
Fig. 6.6 Differential power traces of a simulated DPA attack on a DES subcircuit implemented with (a) synchronous gates, which reveal the correct secret key (red line), and (b) asynchronous balanced fine-grained gates, for which the secret key cannot be distinguished from the other keys.
the comparison between the static unsecured CMOS and the balanced fine-grained implementations. The static CMOS implementation (first graph) quickly revealed the key in simulation, while the attack on the balanced asynchronous implementation was not successful even after an exhaustive simulation of all input patterns.
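For readers unfamiliar with the kind of attack being simulated, the sketch below is a minimal correlation-based power-analysis experiment in the same spirit; the toy 4-bit Sbox, the Hamming-weight leakage model, and all trace data are invented for illustration and bear no relation to the DES experiment reported above.

```python
import numpy as np

# Toy 4-bit Sbox and Hamming-weight leakage model (hypothetical).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def hw(x):
    """Hamming weight, used as a crude power model for an unprotected gate."""
    return bin(x).count("1")

rng = np.random.default_rng(0)
SECRET_KEY = 0x9
N_TRACES, N_SAMPLES, LEAK_POINT = 2000, 50, 20

plaintexts = rng.integers(0, 16, N_TRACES)
traces = rng.normal(0.0, 1.0, (N_TRACES, N_SAMPLES))   # measurement noise
# "Unprotected" device: power at one sample depends on the Sbox output weight.
traces[:, LEAK_POINT] += [hw(SBOX[p ^ SECRET_KEY]) for p in plaintexts]

def best_correlation(key_guess):
    """Correlate the predicted leakage for a key guess against every
    trace sample and return the largest absolute Pearson correlation."""
    hyp = np.array([hw(SBOX[p ^ key_guess]) for p in plaintexts], dtype=float)
    hyp -= hyp.mean()
    centered = traces - traces.mean(axis=0)
    corr = hyp @ centered / (np.linalg.norm(hyp) * np.linalg.norm(centered, axis=0))
    return np.max(np.abs(corr))

scores = {k: best_correlation(k) for k in range(16)}
print("recovered key:", hex(max(scores, key=scores.get)), "secret:", hex(SECRET_KEY))
```

With the value-dependent leakage of an unprotected gate the correct key stands out clearly, which is qualitatively the behavior of Figure 6.6(a); if the leakage term is removed, or made value independent as the balanced gates aim to do, no key guess produces a dominant peak.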
6.3.3 AES Implementations Comparison
To estimate the efficiency of the flow, we compare the performance of automatically synthesized synchronous implementations with that of balanced and unbalanced asynchronous fine-grain pipelined implementations [144]. We implemented simple dynamic-logic-based micropipeline libraries using TSMC 0.18 µm technology obtained through MOSIS. One of the libraries, BSDT, is a power-balanced library implemented using minimum transistor sizes; the other, MPCHB (a modified PCHB from [92]), is optimized for performance. MPCHB was characterized in parts using the Cadence SignalStorm Library Characterizer (SLC); for the balanced library (BSDT), characterization was done manually. From an RTL Electronic Codebook mode (unfolded 10-round) HDL specification of the Advanced Encryption Standard (AES) [30], we synthesized implementations using our fine-grained libraries and the described synthesis flow. We compared our fine-grain micropipeline implementations with synchronous RTL designs synthesized using the Artisan Sage-X [133] standard cell library in the same (TSMC 0.18 µm) technology. Implementation performances are summarized in Figure 6.7. The two leftmost bars represent the non-pipelined and automatically pipelined (with the Synopsys Design Compiler pipeline design -period 0 command, i.e., the maximum-performance setting) implementations using the library from Artisan. Although Design Compiler can be forced to insert more pipeline stages, in this way the implementation becomes unreliable, since static timing analysis is not considered reliable at the fine granularity of synchronous pipelining [49]. Additionally, reducing the amount of logic per pipeline stage in synchronous RTL reduces the amount of useful work per cycle while not affecting the overheads associated with latches, clock skew, and jitter [39, 46]. In the handshake implementation of QDI
Fig. 6.7 AES implementations.
circuits, latency and area shrink together with the rest of the circuit. No margins are necessary, since data transfers are based on explicit local handshakes, which make the circuit operate correctly regardless of process and environmental variation. The 3rd and 4th bars in Figure 6.7 represent the performance bounds achieved using the MPCHB library. The higher bound is achieved with the slack matching technique, which balances the critical paths through the micropipeline, thereby minimizing the waiting time. The rightmost two bars of Figure 6.7 represent the same bounds obtained for BSDT. Taking into account the word length of 128 bits, the performance of our implementation comes to over 35 Gbps (298 MHz × 128 bit) for the balanced and over 55 Gbps (468 MHz × 128 bit) for the unbalanced implementation, at the nominal supply voltage (1.8 V) and room temperature. Compare these numbers with commercial ASIC implementations, such as the one from [91] available on the market today, which delivers 25 Gbps for a word length of 256 bits (i.e., 12.5 Gbps scaled to a word length of 128 bits), or with the best known academic custom (manual) design [45]. Note that in both cases there is no side-channel attack protection. The high performance and protection level cost a
Fig. 6.8 AES implementation: die picture.
significant area overhead: the area of the protected gate-level pipelined implementation approaches 30 mm² (Figure 6.8; the vertical lines are the power/ground grid stripes, and the light green frame around the die is the pad frame). Finally, we would like to note that both the MPCHB and BSDT libraries are still under development. In their current state they feature logic gates (stages) with up to two data inputs, the identity function (used for initialization and slack matching), micropipeline stages along with synchronization cells, and a minimal set of standard logic cells. The design characteristics can be improved with a better optimized and richer micropipeline library. The combination of asynchronous operation and balanced dynamic gates enables automated, standard-cell-library-based designs that are highly resistant to side-channel attacks.
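For reference, the bit rates quoted earlier in this subsection follow directly from the measured cycle rates and the 128-bit word width:

\[
298\ \mathrm{MHz} \times 128\ \mathrm{bit} \approx 38.1\ \mathrm{Gbps}\ (>35\ \mathrm{Gbps}),
\qquad
468\ \mathrm{MHz} \times 128\ \mathrm{bit} \approx 59.9\ \mathrm{Gbps}\ (>55\ \mathrm{Gbps}).
\]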
7 Conclusions
The conventional wisdom of computer system design has changed dramatically. People used to believe that hardware was rigid and software was flexible. The enormous legacy of existing software and existing HDL specifications has inverted this wisdom. Hardware is now relatively flexible and can be re-synthesized; software and HDL specifications have become hard to change (see, e.g., [93]). The popularity of synchronous design and its support by EDA tools on the one hand, and the crisis of the synchronous paradigm on the other, have resulted in a number of approaches based on the asynchronous re-implementation of synchronous designs. We call this the synchronous-asynchronous direct translation (SADT) flow. One of the main targets of SADT is the reuse of existing synchronous HDL specifications for asynchronous design, without any changes. Design automation of real-life asynchronous devices and systems is supposed to be focused on ease of use. To paraphrase Michael Keating’s sentence from his ISQED’06 keynote: “The art of design [automation] is the art of making the complex appear very simple” [37]. One of the main aims of this paper was to argue, in a constructive way, against
the common misbelief that asynchronous design is difficult and requires the re-education of RTL-level designers. We have presented here flows that make asynchronous design as simple as the synchronous Register Transfer Level (RTL) flow. One of the main tendencies in electronic design automation today is a shift from RTL to higher-level specifications. Many people believe that most RTL-level design efforts are going to be superseded by Electronic System Level (ESL) flows. At the very least, an ESL flow is supposed to incorporate a front end with higher-level specifications (e.g., SystemC or ANSI-C) into the synthesis flow. There are several interesting publications related to an ESL flow for asynchronous (bundled-delay) design [140, 141, 142]. This flow is not based on commercial synthesis tools; actually, it is closer to a software C compiler than to a hardware synthesis flow. Besides, there are serious doubts that ANSI-C is the appropriate language for ESL [24]. Detailed consideration of this flow is outside the scope of our paper. However, we would like to note that the ESL approach to asynchronous EDA is very attractive for embedded design applications and is a natural progression, since asynchronous implementations do not require cycle-accurate model generation [5] (there are no clock cycles). Finally, another important purpose of this paper was to retrace this direct translation methodology through the different flavors of the SADT flow developed so far, and to establish the general principles of the SADT methodology. SADT approaches differ mostly in the granularity (or level: HDL, RTL, etc.) of handshake protocol insertion and in the degree of automation: from HDL-level specialized handshake control primitives for Haste, and HDL-level manual insertion of special registers with handshake interaction and completion detection for NCL, to register-based automatic insertion for the de-synchronization flow, and gate-level (handshake between gates) automatic pipelining for Weaver. The suggested methodology can result in easier SoC integration and shorter design cycles. The partitioning of the clock trees offers lower power and possibly lower electromagnetic emissions. These are important to reduce the cost of packaging, to increase security, and to
integrate analogue circuitry on-chip. Moreover, it provides the foundation for achieving power savings by tolerating broader performance variations. Early fabrication results confirm the simulation results and increase our confidence in the widespread applicability of SADT to real ASIC designs.
Acknowledgments
The authors would like to thank Sharad Malik, Masahiro Fujita, and the anonymous reviewers for their help in making the manuscript more readable and understandable. We also thank Konrad Kulikowski, Alex Smirnov, and Vyas Venkataraman for their technical help in the preparation of the paper. We are very grateful to all our collaborators for the important contributions they made to the research area, as listed in the bibliography, and for many helpful discussions. This work has been partially funded by the ASPIDA IST-2002-37796 project, CICYT TIN2004-07925, and a Distinction for Research from the Generalitat de Catalunya.
References
[1] A. Abrial, J. Bouvier, M. Renaudin, P. Senn, and P. Vivet, “A new contactless smart card IC using an on-chip antenna and an asynchronous microcontroller,” IEEE Journal of Solid-State Circuits, vol. 36, no. 7, pp. 1101–1107, 2001. [2] Achronix Semiconductor, www.achronix.com, 2006. [3] A. Agarwal, D. Blaauw, V. Zolotov, and S. Vrudhula, “Statistical timing analysis using bounds,” in Design, Automation and Test in Europe (DATE), pp. 10062–10067, 2003. [4] N. Andrikos, A Fully Automated Desynchronization Flow for Synchronous Circuits. PhD thesis, University of Crete, 2006. [5] M. Baleani, F. Gennari, Y. Jiang, Y. Patel, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “HW/SW partitioning and code generation of embedded control applications on a reconfigurable architecture platform,” in Proceedings of the Tenth International Symposium on Hardware/Software Codesign, CODES 2002, pp. 151–156, Estes Park, Colorado, USA, May 6–8 2002. [6] A. Bardsley, Implementing Balsa Handshake Circuits. PhD thesis, Department of Computer Science, University of Manchester, 2000. [7] P. Beerel, J. Cortadella, and A. Kondratyev, “Bridging the gap between asynchronous design and designers (Tutorial),” in VLSI Design Conference, (Mumbai), 2004. [8] P. A. Beerel, M. Davies, A. Lines, and N.-H. Kim, “Slack matching asynchronous designs,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 184–194, March 2006.
[9] L. Benini, E. Macii, and M. Poncino, “Telescopic units: Increasing the average throughput of pipelined designs by adaptive latency control,” in Proc. ACM/IEEE Design Automation Conference, pp. 22–27, 1997. [10] Celoxica, Handel-C Language Reference Manual. Celoxica. 2003. [11] CeltIC signal integrity analysis, www.cadence.com/products. [12] T. Chelcea and S. M. Nowick, “Resynthesis and peephole transformations for the optimization of large-scale asynchronous systems,” in Proc. ACM/IEEE Design Automation Conference, June 2002. [13] D. Chinnery and K. Keutzer, Closing the Gap between ASIC & Custom. Tools and Techniques for Gigh-Performance ASIC Design. Kluwer Academic Publishers, 2002. [14] T.-A. Chu, C. K. C. Leung, and T. S. Wanuga, “A design methodology for concurrent VLSI systems,” in Proc. International Conf. Computer Design (ICCD), pp. 407–410, IEEE Computer Society Press, 1985. [15] W. S. Coates, J. K. Lexau, I. W. Jones, S. M. Fairbanks, and I. E. Sutherland, “FLEETzero: An asynchronous switch fabric chip experiment,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 173–182, IEEE Computer Society Press, March 2001. [16] F. Commoner, A. W. Holt, S. Even, and A. Pnueli, “Marked directed graphs,” Journal of Computer and System Sciences, vol. 5, pp. 511–523, 1971. [17] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev, “Petrify: A tool for manipulating concurrent specifications and synthesis of asynchronous controllers,” IEICE Transactions on Information and Systems, vol. E80-D, no. 3, pp. 315–325, March 1997. [18] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev, Logic Synthesis of Asynchronous Controllers and Interfaces. Springer-Verlag, 2002. [19] J. Cortadella, A. Kondratyev, L. Lavagno, K. Lwin, and C. Sotiriou, “From synchronous to asynchronous: An automatic approach,” in Proc. Design, Automation and Test in Europe (DATE), pp. 1368–1369, February 2004. [20] J. Cortadella, A. Kondratyev, L. Lavagno, and C. Sotiriou, “Desynchronization: Synthesis of asynchronous circuits from synchronous specifications,” IEEE Transactions on Computer-Aided Design, vol. 25, no. 10, pp. 1904–1921, October 2006. [21] M. de Wit and A. Peeters, “Haste language reference manual,” Tech. Rep., 2006. [22] Designware, intellectual property: www.synopsys.com/products/designware/. [23] D. Edwards and A. Bardsley, “Balsa: An asynchronous hardware synthesis language,” The Computer Journal, vol. 45, no. 1, pp. 12–18, 2002. [24] S. A. Edwards, “The challenges of hardware synthesis from c-like languages,” in DATE ’05: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 66–67, 2005. [25] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N. S. Kim, and K. Flautner, “Razor: Circuit-level correction of timing errors for low-power operation,” IEEE Micro, November 2004.
[26] K. M. Fant, Logically Determined Design: Clockless System Design with NULL Convention Logic. John Wiley & Sons, 2005. [27] K. M. Fant and S. A. Brandt, “NULL conventional logic: A complete and consistent logic for asynchronous digital circuit synthesis,” in International Conference on Application-specific Systems, Architectures, and Processors, pp. 261–273, 1996. [28] M. Ferretti and P. A. Beerel, “Single-track asynchronous pipeline templates using 1-of-N encoding,” in Proc. Design, Automation and Test in Europe (DATE), pp. 1008–1015, March 2002. [29] M. Ferretti, R. Ozdag, and P. Beerel, “High performance asynchronous ASIC back-end design flow using single-track full-buffer standard cells,” in Proc. International Symposium on AdvancedResearch in Asynchronous Circuits and Systems, pp. 95–105, IEEE Computer Society Press, April 2004. [30] FIPS PUB 197: advanced encryption standard, http://csrc.nist.gov/ publications/fips/fips197/fips-197.pdf. [31] First clockless processor for real-time chip designs – http://www.handshake solutions.com/News/. [32] R. M. Fuhrer and S. M. Nowick, Sequential Optimization of Asynchronous and Synchronous Finite-State Machines: Algorithms and Tools. Kluwer Academic Publishers, 2001. [33] R. M. Fuhrer, S. M. Nowick, M. Theobald, N. K. Jha, B. Lin, and L. Plana, “Minimalist: An environment for the synthesis, verification and testability of burst-mode asynchronous machines,” Tech. Rep. TR CUCS-020-99, Columbia University, NY, July 1999. [34] Fulcrum microsystems www.fulcrummicro.com/. [35] S. B. Furber and P. Day, “Four-phase micropipeline latch control circuits,” IEEE Transactions on VLSI Systems, vol. 4, no. 2, pp. 247–253, June 1996. [36] J. D. Garside, W. J. Bainbridge, A. Bardsley, D. A. Edwards, S. B. Furber, J. Liu, D. W. Lloyd, S. Mohammadi, J. S. Pepper, O. Petlin, S. Temple, and J. V. Woods, “AMULET3i — an asynchronous system-on-chip,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 162–175, IEEE Computer Society Press, April 2000. [37] R. Goering, “Simple designs aren’t easy, speaker says,” EE Times, http:// www.eetimes.com/showArticle.jhtml?articleID=184400784, no. 03/28/2006, 2006. [38] D. Harris, Skew-Tolerant Circuit Design. Morgan Kaufmann Publishers, 2001. [39] A. Hartstein and T. R. Puzak, “Optimum power/performance pipeline depth,” in MICRO-36 International Symposium on Microarchitecture, 2003. [40] P. J. Hazewindus, Testing Delay-Insensitive Circuits. PhD thesis, California Institute of Technology, 1992. [41] J. L. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publisher Inc., 1990. [42] E. Hess, N. Janssen, B. Meyer, and T. Schutze, “Information leakage attacks against smart card implementations of cryptographic algorithms and countermeasures — a survey,” in EUROSMART Security Conference, 2000.
[43] C. A. R. Hoare, “Communicating sequential processes,” Communications of the ACM, vol. 21, no. 8, pp. 666–677, August 1978. [44] C. A. R. Hoare, Communicating Sequential Processes. Prentice-Hall, 1985. [45] A. Hodjat, D. D. Hwang, B. Lai, K. Tiri, and I. Verbauwhede, “A 3.84 gbits/s AES crypto coprocessor with modes of operation in a 0.18-um CMOS technology,” in 15th ACM Great Lakes symposium on VLSI, (Chicago, Illinois, USA), pp. 60–63, 2005. [46] M. S. Hrishikesh, N. P. Jouppi, K. I. Farkas, D. Burger, S. W. Keckler, and P. Shivakumar, “The optimal depth per pipeline stage is 6 to 8 fo4 inverter delays,” in 29th Int’l Symp. Computer Architecture, pp. 14–24, IEEE CS Press, 2002. [47] H. Hulgaard, S. M. Burns, and G. Borriello, “Testing asynchronous circuits: A survey,” Integration, the VLSI Journal, vol. 19, no. 3, pp. 111–131, November 1995. [48] C. Jeong and S. M. Nowick, “Optimal technology mapping and cell merger for asynchronous threshold networks,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 128–137, March 2006. [49] J. Kahle, “The myth of the optimal fo4,” in ACM/IEEE TAU Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, 2004. [50] N. Karaki, T. Nanmoto, H. Ebihara, S. Utsunomiya, S. Inoue, and T. Shimoda, “A flexible 8b asynchronous microprocessor based on low-temperature poly-silicon TFT technology,” in International Solid State Circuits Conference, pp. 272–274, February 2005. [51] C. H. (Kees) van Berkel, M. B. Josephs, and S. M. Nowick, “Scanning the technology: Applications of asynchronous circuits,” Proceedings of the IEEE, vol. 87, no. 2, pp. 223–233, February 1999. [52] C. Kelly, V. Ekanayake, and R. Manohar, “SNAP: A sensor-network asynchronous processor,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 24–33, IEEE Computer Society Press, May 2003. [53] J. Kessels and A. Peeters, “The Tangram framework: Asynchronous circuits for low power,” in Proc. of Asia and South Pacific Design Automation Conference, pp. 255–260, February 2001. [54] J. Kessels, A. Peeters, T. Kramer, M. Feuser, and K. Ully, “Designing an asynchronous bus interface,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 108–117, IEEE Computer Society Press, March 2001. [55] R. Kol and R. Ginosar, “A doubly-latched asynchronous pipeline,” in Proc. International Conf. Computer Design (ICCD), pp. 706–711, October 1996. [56] T. Kolks, S. Vercauteren, and B. Lin, “Control resynthesis for controldominated asynchronous designs,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, March 1996. [57] A. Kondratyev and K. Lwin, “Design of asynchronous circuits using synchronous CAD tools,” IEEE Design & Test of Computers, vol. 19, no. 4, pp. 107–117, 2002.
[58] A. Kondratyev, L. Neukom, O. Roig, A. Taubin, and K. Fant, “Checking delay-insensitivity: 104 gates and beyond,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 149–157, April 2002. [59] A. Kondratyev, L. Sorensen, and A. Streich, “Testing of asynchronous designs by inappropriate means. synchronous approach,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 171– 180, April 2002. [60] K. Kulikowski, M. Karpovsky, and A. Taubin, “Power attacks on secure hardware based on early propapation of data,” in 12th IEEE International On-Line Testing Symposium, 2006. [61] K. Kulikowski, A. Smirnov, and A. Taubin, “Automated design of cryptographic devices resistant to multiple side-channel attacks,” in Chyptographic Hardware and Embedded Systems CHES, (Yokohama), 2006. [62] K. J. Kulikowski, M. Su, A. Smirnov, A. Taubin, M. G. Karpovsky, and D. MacDonald, “Delay insensitive encoding and power analysis: A balancing act,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 116–125, 2005. [63] T. Larrabee, “Test pattern generation using boolean satisfiability,” IEEE Transactions on Computer-Aided Design, vol. 11, no. 1, pp. 4–15, 1992. [64] L. Lavagno and S. M. Nowick, “Asynchronous control circuits,” in Logic Synthesis and Verification, pp. 255–284, Kluwer Academic Publishers, 2002. [65] P. Le Guernic, J.-P. Talpin, and J.-C. L. Lann, “Polychrony for system design,” Journal of Circuits, Systems and Computers, April 2003. [66] H. Li, A. Markettos, and S. W. Moore, “Security evaluation against electromagnetic analysis at design time,” in Workshop on Cryptographic Hardware and Embedded Systems (CHES), 2005. [67] Liberty CCS: www.synopsys.com/products/libertyccs/libertyccs.html. [68] M. Ligthart, K. Fant, R. Smith, A. Taubin, and A. Kondratyev, “Asynchronous design using commercial HDL synthesis tools,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 114–125, IEEE Computer Society Press, April 2000. [69] D. H. Linder, Phased Logic: A Design Methodology for Delay-Insensitive, Synchronous Circuitry. PhD thesis, Mississippi State University, 1994. [70] D. H. Linder and J. C. Harden, “Phased logic: Supporting the synchronous design paradigm with delay-insensitive circuitry,” IEEE Transactions on Computers, vol. 45, no. 9, pp. 1031–1044, September 1996. [71] A. Lines, Pipelined Asynchronous Circuits. PhD thesis, California Institute of Technology, 1998. (CaltechCSTR:1998.cs-tr-95-21). [72] A. Lines, “Nexus: An asynchronous crossbar interconnect for synchronous system-on-chip designs,” in Proceedings of the 11th Symposium on High Performance Interconnects, pp. 2–9, August 2003. [73] D. J. MacDonald, MS Thesis. A Balanced-Power Domino-Style Standard Cell Library for Fine-Grain Asynchronous Pipelined Design to Resist Differential Power Analysis Attacks. PhD thesis, Boston University, 2005.
[74] R. Manohar and A. J. Martin, “Slack elasticity in concurrent computing,” in Proc. 4th International Conference on the Mathematics of Program Construction, (J. Jeuring, ed.), pp. 272–285, 1998. [75] A. J. Martin, “Programming in VLSI: From communicating processes to delayinsensitive circuits,” in Developments in Concurrency and Communication, (C. A. R. Hoare, ed.). UT Year of Programming Series, pp. 1–64, AddisonWesley, 1990. [76] A. J. Martin, “Compiling communicating processes into delay-insensitive VLSI circuits,” Distributed Computing, vol. 1, no. 4, pp. 226–234, 1986. [77] A. J. Martin, “The limitations to delay-insensitivity in asynchronous circuits,” in Advanced Research in VLSI, (W. J. Dally, ed.), pp. 263–278, MIT Press, 1990. [78] A. J. Martin and P. J. Hazewindus, “Testing delay-insensitive circuits,” in Advanced Research in VLSI, (C. H. S´equin, ed.), pp. 118–132, MIT Press, 1991. [79] A. J. Martin, A. Lines, R. Manohar, M. Nystr¨ om, P. P´enzes, R. Southworth, and U. Cummings, “The design of an asynchronous MIPS R3000 microprocessor,” in Advanced Research in VLSI, pp. 164–181, September 1997. [80] A. J. Martin, M. Nystr¨ om, K. Papadantonakis, P. I. P´enzes, P. Prakash, C. G. Wong, J. Chang, K. S. Ko, B. Lee, E. Ou, J. Pugh, E.-V. Talvala, J. T. Tong, and A. Tura, “The lutonium: A sub-nanojoule asynchronous 8051 microcontroller,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 14–23, IEEE Computer Society Press, May 2003. [81] J. Martin Alain and M. Nystrom, “Asynchronous techniques for system-onchip design,” Proceedings of the IEEE, vol. 94, no. 6, pp. 1089–1120, June 2006. [82] J. McCardle and D. Chester, “Measuring an asynchronous processor’s power and noise,” in SNUG, 2001. [83] C. E. Molnar, I. W. Jones, B. Coates, and J. Lexau, “A FIFO ring oscillator performance experiment,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 279–289, IEEE Computer Society Press, April 1997. [84] S. Moore, R. Anderson, R. Mullins, G. Taylor, and J. J. A. Fournier, “Balanced self-checking asynchronous logic for smart card applications,” Microprocessors and Microsystems, vol. 27, no. 9, pp. 421–430, October 2003. [85] M. Moskewicz, C. Madigan, Y. Zhao, L. Zhang, and S. Malik, “Chaff: Engineering an efficient SAT solver,” in Proc. ACM/IEEE Design Automation Conference, 2001. [86] D. E. Muller, “Asynchronous logics and application to information processing,” in Symposium on the Application of Switching Theory to Space Technology, pp. 289–297, Stanford University Press, 1962. [87] T. Murata, “Petri nets: Properties, analysis and applications,” Proceedings of the IEEE, vol. 77, no. 4, pp. 541–574, April 1989. [88] C. Myers, Asynchronous Circuit Design. John Wiley & Sons, 2001.
[89] S. R. Nassif, “Modeling and forecasting of manufacturing variations,” in AsiaSouth Pacific Design Automation Conference (ASP-DAC), 2001. [90] L. Necchi, L. Lavagno, D. Pandini, and L. Vanzago, “An ultra-low energy asynchronous processor for wireless sensor networks,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 78–85, March 2006. [91] . Overview Datasheet -high performance aes (rijndael) cores for asic http://www.heliontech.com/downloads/aes asic helioncore.pdf. [92] R. O. Ozdag and P. A. Beerel, “High-speed QDI asynchronous pipelines,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 13–22, April 2002. [93] D. Patterson, “The future of computer architecture. berkeley eecs annual research symposium 2006, http://www.eecs.berkeley.edu/BEARS/presentations/06Patterson.ppt,” 2006. [94] A. Peeters, “Implementation of handshake components,” in Comunicating Sequential Processes, the First 25 Years, Volume 3525 of Lecture Notes in Computer Science, (A. E. Abdallah, C. B. Jones, and J. W. Sanders, eds.), pp. 98–132, 2005. [95] A. Peeters and K. van Berkel, “Synchronous handshake circuits,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 86–95, IEEE Computer Society Press, March 2001. [96] A. M. G. Peeters, Single-Rail Handshake Circuits. PhD thesis, Eindhoven University of Technology, June 1996. [97] F. Peper, J. Lee, S. Adachi, and S. Mashiko, “Laying out circuits on asynchronous cellular arrays: A step towards feasible nanocomputers?,” Nanotechnology, vol. 14, pp. 469–485, 2003. [98] O. A. Petlin and S. B. Furber, “Scan testing of micropipelines,” in Proc. IEEE VLSI Test Symposium, pp. 296–301, May 1995. [99] L. A. Plana, P. A. Riocreux, W. J. Bainbridge, A. Bardsley, J. D. Garside, and S. Temple, “SPA — A synthesisable amulet core for smartcard applications,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 201–210, April 2002. [100] P. Prakash and A. J. Martin, “Slack matching quasi delay-insensitive circuits,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 195–204, March 2006. [101] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits. Prentice-Hall, 2nd ed., 2002. [102] R. B. Reese and M. A. T. C. Traver, “A coarse-grain phased logic CPU,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 2–13, IEEE Computer Society Press, May 2003. [103] M. Renaudin, P. Vivet, and F. Robin, “A design framework for asynchronous/synchronous circuits based on CHP to HDL translation,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 135–144, April 1999.
[104] M. Roncken and R. Saeijs, “Linear test times for delay-insensitive circuits: A compilation strategy,” in Asynchronous Design Methodologies, (S. Furber and M. Edwards, eds.), pp. 13–27, Elsevier Science Publishers, 1993. [105] M. Roncken, “Defect-oriented testability for asynchronous IC’s,” Proceedings of the IEEE, vol. 87, no. 2, pp. 363–375, February 1999. [106] L. Y. Rosenblum and A. V. Yakovlev, “Signal graphs: From self-timed to timed ones,” in Proceedings of International Workshop on Timed Petri Nets, (Torino, Italy), pp. 199–207, IEEE Computer Society Press, July 1985. [107] S. Rotem, K. Stevens, R. Ginosar, P. Beerel, C. Myers, K. Yun, R. Kol, C. Dike, M. Roncken, and B. Agapiev, “RAPPID: An asynchronous instruction length decoder,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 60–70, April 1999. [108] A. Saifhashemi and P. A. Beerel, “High Level modeling of channel-based asynchronous circuits using verilog,” in Communicating Process Architectures, (J. F. Broenink et al., ed.), pp. 275–288, IOS Press, September 2005. [109] A. Saifhashemi and H. Pedram, “Verilog HDL, powered by PLI: A suitable framework for describing and modeling asynchronous circuits at all levels of abstraction,” in Proc. ACM/IEEE Design Automation Conference, pp. 330– 333, June 2003. [110] Savant Project-http://www.cliftonlabs.com/savantp.htm. [111] C. L. Seitz, “System timing,” in Introduction to VLSI Systems, (C. A. Mead and L. A. Conway, eds.), ch. 7, Addison-Wesley, 1980. [112] Sequence signal-integrity tools, www.sequencedesign.com. [113] Silistix, self-timed interconnect technology, www.silistix.com. [114] M. Singh, J. A. Tierno, A. Rylyakov, S. Rylov, and S. M. Nowick, “An adaptively-pipelined mixed synchronous-asynchronous digital FIR filter chip operating at 1.3 gigahertz,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 84–95, April 2002. [115] A. Smirnov and A. Taubin, “Synthesizing asynchronous micropipelines with design compiler,” in Synopsys Users Group, Boston, 2006. [116] A. Smirnov, A. Taubin, and M. Karpovsky, “Automated pipelining in ASIC synthesis methodology: Gate transfer level,” in IWLS 2004 Thirteenth International Workshop on Logic and Synthesis, 2004. [117] A. Smirnov, A. Taubin, and M. Karpovsky, “An automated fine-grain pipelining using domino style asynchronous library,” in ACSD 2005: Fifth International Conference on Application of Concurrency to System Design, (St.Malo, France), IEEE CS Press, 2005. [118] A. Smirnov, A. Taubin, M. Karpovsky, and L. Rozenblyum, “Gate transfer level synthesis as an automated approach to fine-grain pipelining,” in Workshop on Token Based Computing (ToBaCo), (Bologna, Italy), 2004. [119] G. E. Sobelman and K. Fant, “CMOS circuit design of threshold gates with hysteresis,” in Proc. International Symposium on Circuits and Systems, pp. 61–64, June 1998. [120] D. Sokolov, J. Murphy, A. Bystrov, and A. Yakovlev, “Design and analysis of dual-rail circuits for security applications,” IEEE Transactions on Computers, vol. 54, no. 4, pp. 449–460, April 2005.
[121] J. Sparsø and S. Furber, eds., Principles of Asynchronous Circuit Design: A Systems Perspective. Kluwer Academic Publishers, 2001. [122] J. Sparsø and J. Staunstrup, “Delay-insensitive multi-ring structures,” Integration, the VLSI Journal, vol. 15, no. 3, pp. 313–340, October 1993. [123] Special issue on asynchronous circuits and systems. Proceedings of the IEEE, vol. 87, no. 2, pp. 1–375, February 1999. [124] I. Sutherland and S. Fairbanks, “GasP: A minimal FIFO control,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 46–53, IEEE Computer Society Press, March 2001. [125] I. E. Sutherland, “Micropipelines,” Communications of the ACM, vol. 32, no. 6, pp. 720–738, June 1989. [126] F. te Beest and A. Peeters, “A multiplexer based test method for self-timed circuits,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 166–175, 2005. [127] F. te Beest, A. Peeters, K. van Berkel, and H. Kerkhoff, “Synchronous full-scan for asynchronous handshake circuits,” Journal of Electronic Testing: Theory and Applications, vol. 19, pp. 397–406, 2003. [128] J. Tierno, A. Rylyakov, S. Rylov, M. Singh, P. Ampadu, S. Nowick, M. Immediato, and S. Gowda, “A 1.3 GSample/s 10-tap full-rate variable-latency selftimed FIR filter with clocked interfaces,” in International Solid State Circuits Conference, February 2002. [129] K. Tiri, M. Akmal, and I. Verbauwhede, “A dynamic and differential cmos logic with signal independent power consumption to withstand differential power analysis on smart cards,” in 28th European Solid-State Circuits Conference (ESSCIRC 2002), 2002. [130] K. Tiri, W. Hwang, A. Hodjat, L. Bo-Cheng, Y. Shenglin, P. Schaumont, and I. Verbauwhede, “Prototype IC with wddl and differential routing — dpa sesistance assessment,” in Chyptographic Hardware and Embedded Systems — CHES, (Edinburgh), pp. 354–365, LNCS3659, Springer, 2005. [131] K. Tiri and I. Verbauwhede, “Securing encryption algorithms against dpa at the logic level: Next generation smart card technology,” Workshop on Cryptographic Hardware and Embedded Systems (CHES 2003), 2003. [132] K. Tiri and I. Verbauwhede, “A logic level design methodology for a secure dpa resistant asic or fpga implementation,” Design Automation and Test in Europe Conference (DATE 2004), 2004. [133] TSMC 0.18mm Process 1.8-Volt SAGE-X TM Standard Cell Library Databook. September 2003. [134] K. van Berkel, Handshake Circuits: An Asynchronous Architecture for VLSI Programming. Vol. 5 of International Series on Parallel Computation, Cambridge University Press, 1993. [135] K. van Berkel and A. Bink, “Single-track handshaking signaling with application to micropipelines and handshake circuits,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 122–133, IEEE Computer Society Press, March 1996.
[136] K. van Berkel, F. Huberts, and A. Peeters, “Stretching quasi delay insensitivity by means of extended isochronic forks,” in Asynchronous Design Methodologies, pp. 99–106, IEEE Computer Society Press, May 1995. [137] K. van Berkel, A. Peeters, and F. te Beest, “Adding synchronous and LSSD modes to asynchronous circuits,” Microprocessors and Microsystems, vol. 27, no. 9, pp. 461–471, October 2003. [138] V. Varshavsky, V. Marakhovsky, and T.-A. Chu, “Logical timing (global synchronization of asynchronous arrays,” in The First International Symposium on Parallel Algorithm/Architecture Synthesis, (Aizu-Wakamatsu, Japan), pp. 130–138, March 1995. [139] V. I. Varshavsky, ed., Self-Timed Control of Concurrent Processes: The Design of Aperiodic Logical Circuits in Computers and Discrete Systems, (Dordrecht, The Netherlands), Kluwer Academic Publishers, 1990. [140] G. Venkataramani, T. Bjerregaard, T. Chelcea, and S. C. Goldstein, “Hardware compilation of application-specific memory-access interconnect,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 25, no. 5, pp. 756–771, 2006. [141] G. Venkataramani, M. Budiu, T. Chelcea, and S. Goldstein, “C to asynchronous dataflow circuits: An end-to-end toolflow,” in IWLS, pp. 501–508, Temecula, CA, June 2004. [142] G. Venkataramani and S. Copen Goldstein, “Leveraging protocol knowledge in slack matching,” in IEEE/ACM International Conference on Computer-Aided Design, (San Jose, CA, USA), November 2006. [143] T. Verhoeff, “Delay-insensitive codes—an overview,” Distributed Computing, vol. 3, no. 1, pp. 1–8, 1988. [144] Weaver: GTL synthesis flow. http://async.bu.edu/weaver/. [145] Y. Zhou, D. Sokolov, and A. Yakovlev, “Cost-aware synthesis of asynchronous circuits based on partial acknowledgement,” in Proc. International Conf. Computer-Aided Design (ICCAD), pp. 255–260, IEEE Computer Society Press, November 2006.