Integrated Circuits and Systems
Series Editor Anantha P. Chandrakasan Massachusetts Institute of Technology Cambridge, Massachusetts
For further volumes: http://www.springer.com/series/7236
Abbas Sheibanyrad • Frédéric Pétrot • Axel Jantsch
Editors
3D Integration for NoC-based SoC Architectures
Editors Abbas Sheibanyrad TIMA Laboratory 46, Avenue Felix Viallet 38000 Grenoble France
[email protected]
Axel Jantsch Royal Institute of Technology Forum 120 SE-16440 Kista Sweden
[email protected]
Frédéric Pétrot TIMA Laboratory 46, Avenue Felix Viallet 38000 Grenoble France
[email protected]
ISSN 1558-9412
ISBN 978-1-4419-7617-8    e-ISBN 978-1-4419-7618-5
DOI 10.1007/978-1-4419-7618-5
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number:
© Springer Science+Business Media, LLC 2011
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
3D integration technologies, 3D design techniques, and 3D architectures are emerging as truly hot and broad research topics. As the end of scaling of the CMOS transistor comes in sight, the third dimension may come to the rescue of the industry and allow for continuing exponential growth of integration during the 2015–2025 period. As such, 3D stacking may be the key technology to sustain growth until more exotic technologies such as nanowires, quantum dot devices, and molecular computers become sufficiently mature for deployment in mainstream application areas.

The present book gathers recent advances in the domain, written by renowned experts, to build a comprehensive and consistent volume around the topics of three-dimensional architectures and design techniques. In order to take full advantage of 3D integration, the decision on the use of in-circuit vertical connections (predominantly Through-Silicon Vias (TSVs) and inductive wireless interconnects) must come upfront in the architecture planning process rather than as a packaging decision after circuit design is completed. This requires taking the 3D design space into account right from the start of the system design. Previously published books about this active research domain focus on fabrication technologies and physical aspects rather than network- and system-level architectural concerns. In contrast, the present book covers almost all architectural design aspects of 3D-SoCs, and as such can be useful both for introducing the current research topics to researchers and engineers and as a basis for education and training in M.Sc. and Ph.D. programs.

The book is divided into three parts. The first part, which contains two chapters, deals with the promises and challenges of 3D integration. Chapter 1 introduces 3D integration of integrated circuits and discusses performance enhancement as well as new integration capabilities, enabled technology platforms, and potential applications made possible by 3D technology. Chapter 2 elaborates on the promises and limitations of 3D integration by studying the limits of performance under different memory distribution constraints of various 2D and 3D topologies in current and future technology nodes.

The second part of the book consists of four chapters. It discusses technology and circuit design for 3D integration. Chapter 3 focuses on the available solutions and open challenges for testing 3D Stacked ICs (3D-SICs). It provides an overview
of the manufacturing steps of TSV-based 3D-SICs relevant to the testing issues. Chapter 4 reviews the process of designing 3D-ICs that exploit Through-Silicon-Via technology and introduces the notion of re-architecting systems explicitly to exploit high-density TSV processes. Chapter 5 investigates physical properties of NoC topologies for 3D integrated systems. It describes an enhanced physical analysis methodology, providing a means to estimate early in the design cycle the behavior of a 3D topology for an integrated system interconnected with an on-chip network. Chapter 6 characterizes the performance of multiple 3D NoC architectures in the presence of realistic traffic patterns through cycle-accurate simulation and establishes the performance benchmark and related design trade-offs.

The last part of the book includes five chapters that globally concern system and architecture design for 3D integration. Chapter 7 makes a case for using asynchronous circuits to implement 3D-NoCs. It claims that asynchronous logic allows for serializing vertical links, leading to the definition of innovative architectures which, by reducing the number of TSVs, can address some critical issues of 3D integration. Chapter 8 considers the problem of designing application-specific 3D-NoC architectures, supporting both unicast and multicast traffic flows, that are optimized for a given application. Chapter 9 presents methodologies and tools for automated 3D interconnect design, focusing on application-specific 3D-NoC synthesis, which consists of finding the best NoC topology for the application, computing paths for the communication flows, assigning network components onto the layers of the 3D stack, and placing them in each layer. Chapter 10 describes constructing 3D-NoCs based on inductive-coupled wireless interconnect, in which the data modulated by a driver are transferred between two inductors placed at exactly the same position on two stacked dies. Chapter 11 discusses how 3D technology can be implemented in GPUs. It investigates the problems and constraints of implementing such a technology, proposes architectural designs for a GPU that implements 3D technology, and evaluates these designs in terms of fabrication cost, power consumption, and thermal profile.

We greatly appreciate the hard work of all the authors, whose valuable contributions, patience, and support are the basis for the production of this book. Their innovations and comprehensive presentations make this book novel, unique, and useful. We also thank the publisher for the strong support and quick actions that made a speedy and timely publication possible. We sincerely hope that you will find reading this book as exciting and informative as we did when selecting and discussing its content.

Abbas Sheibanyrad
Frédéric Pétrot
Axel Jantsch
Contents
I    3DI Promises and Challenges ........................................... 1

1    Three-Dimensional Integration of Integrated Circuits—an Introduction ... 3
     Chuan Seng Tan

2    The Promises and Limitations of 3-D Integration ........................ 27
     Axel Jantsch, Matthew Grange and Dinesh Pamunuwa

II   Technology and Circuit Design .......................................... 45

3    Testing 3D Stacked ICs Containing Through-Silicon Vias ................. 47
     Erik Jan Marinissen

4    Design and Computer Aided Design of 3DIC ............................... 75
     Paul D. Franzon, W. Rhett Davis and Thor Thorolfsson

5    Physical Analysis of NoC Topologies for 3-D Integrated Systems ......... 89
     Vasilis F. Pavlidis and Eby G. Friedman

6    Three-Dimensional Networks-on-Chip: Performance Evaluation ............ 115
     Brett Stanley Feero and Partha Pratim Pande

III  System and Architecture Design ........................................ 147

7    Asynchronous 3D-NoCs Making Use of Serialized Vertical Links .......... 149
     Abbas Sheibanyrad and Frédéric Pétrot

8    Design of Application-Specific 3D Networks-on-Chip Architectures ...... 167
     Shan Yan and Bill Lin
9    3D Network on Chip Topology Synthesis: Designing Custom Topologies
     for Chip Stacks ....................................................... 193
     Ciprian Seiculescu, Srinivasan Murali, Luca Benini and Giovanni De Micheli

10   3-D NoC on Inductive Wireless Interconnect ............................ 225
     Hiroki Matsutani, Michihiro Koibuchi, Tadahiro Kuroda and Hideharu Amano

11   Influence of Stacked 3D Memory/Cache Architectures on GPUs ............ 249
     Ahmed Al Maashri, Guangyu Sun, Xiangyu Dong, Yuan Xie and Narayanan Vijaykrishnan

Index ...................................................................... 273
Contributors
Ahmed Al Maashri  The Pennsylvania State University, University Park, PA, USA
Hideharu Amano  Keio University, Yokohama, Japan
Luca Benini  The University of Bologna, Bologna, Italy
W. Rhett Davis  The North Carolina State University, Raleigh, NC, USA
Giovanni De Micheli  EPFL, Lausanne, Switzerland
Xiangyu Dong  The Pennsylvania State University, University Park, PA, USA
Brett Stanley Feero  ARM Inc., Austin, TX, USA
Paul D. Franzon  The North Carolina State University, Raleigh, NC, USA
Eby G. Friedman  University of Rochester, Rochester, NY, USA
Matthew Grange  Lancaster University, Lancaster, UK
Axel Jantsch  Royal Institute of Technology, Sweden
Michihiro Koibuchi  Japanese National Institute of Informatics, Tokyo, Japan
Tadahiro Kuroda  Keio University, Yokohama, Japan
Bill Lin  University of California, San Diego, CA, USA
Erik Jan Marinissen  IMEC, Leuven, Belgium
Hiroki Matsutani  The University of Tokyo, Tokyo, Japan
Srinivasan Murali  EPFL, Lausanne, Switzerland
Dinesh Pamunuwa  Lancaster University, Lancaster, UK
Partha Pratim Pande  Washington State University, Pullman, WA, USA
Vasilis F. Pavlidis  EPFL, Lausanne, Switzerland
Frédéric Pétrot  TIMA Laboratory, Grenoble, France
Ciprian Seiculescu  EPFL, Lausanne, Switzerland
Abbas Sheibanyrad  TIMA Laboratory, Grenoble, France
Guangyu Sun  The Pennsylvania State University, University Park, PA, USA
Chuan Seng Tan  Nanyang Technological University, Singapore
Thorlindur Thorolfsson  The North Carolina State University, Raleigh, NC, USA
Narayanan Vijaykrishnan  The Pennsylvania State University, University Park, PA, USA
Yuan Xie  The Pennsylvania State University, University Park, PA, USA
Shan Yan  University of California, San Diego, CA, USA
Part I
3DI Promises and Challenges
Chapter 1
Three-Dimensional Integration of Integrated Circuits—an Introduction Chuan Seng Tan
1.1 Background and Introduction

Imagine a situation where you need to travel between your home and office every day. You have to put up with the time lost during the commute as well as pay for the fuel. One possible solution is to have your home on another floor of your office building. In this way, all you need to do is go up and down between floors, and you save both time and cost. This simple idea can similarly be applied to boost the overall performance of future integrated circuits.

For more than 40 years, higher computing power was achieved primarily through commensurate performance enhancement of transistors as a result of continuously scaling down the device dimensions in a harmonious manner. This has resulted in a steady doubling of device density from one technology node to the next, as famously described by Moore's Law. Improvements in transistor switching speed and count are two of the most direct contributors to the historical performance growth in integrated circuits (particularly in silicon-based digital CMOS). This scaling approach has been so effective in many aspects (performance and cost) that integrated circuits have essentially remained a planar platform throughout this period of rigorous scaling.

As performance enhancement through geometrical scaling becomes more challenging and the demand for higher functionality increases, there is tremendous interest in and potential for exploring the third dimension, i.e., the vertical dimension, of integrated circuits. This was rightly envisioned by Richard Feynman, physicist and Nobel Laureate, when he delivered a talk on "Computing Machines in the Future" in Japan in 1985. His original text reads: "Another direction of improvement (in computing power) is to make physical machines three dimensional instead of all on a surface of a chip. That can be done in stages instead of all at once—you can have several layers and then add many more layers as time goes on" [1].

The need for 3D integration has become clear going forward, and it was reiterated by Dr. Chang-Gyu Hwang, President and CEO of Samsung Electronics' Semiconductor Business, when he delivered a keynote speech at the 2006 International Electron Devices Meeting (IEDM) in San Francisco entitled "New Paradigms in the Silicon Industry" [2]. Some important points of his speech are quoted: "The approaching era of electronics technology advancement—the Fusion Era—will be massive in scope, encompassing the fields of information technology (IT), bio-technology (BT), and nano-technology (NT) and will create boundless opportunities for new growth to the semiconductor industry. The core element needed to usher in the new age will be a complex integration of different types of devices such as memory, logic, sensor, processor and software, together with new materials, and advanced die stack technologies, all based on 3-D silicon technology."

C. S. Tan, School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798 Singapore. Tel.: +65-67905636, e-mail: [email protected]
1.2 Motivations and Drivers

This section examines the role of 3D integration in ensuring that the performance growth enjoyed by the semiconductor industry as a direct result of geometrical scaling, coupled with the introduction of performance boosters in more recent nodes, as predicted by Moore's Law, can continue in the future. Scaling alone has met with diminishing returns due to fundamental and economic scaling barriers (non-commensurate scaling). 3D integration explores the third dimension of IC integration and offers a new dimension for performance growth. 3D integration also enables the integration of disparate chips in a more compact form factor, and it is touted by many as an attractive method for system miniaturization and functional diversification, commonly known as heterogeneous integration.
1.2.1 Sustainable IC Performance Growth

Beginning with the invention of the first integrated circuit in 1958 by Kilby and Noyce, the world has witnessed sustained performance growth in ICs. The trend is best exemplified by the exponential improvement in computing power (measured in million instructions per second, MIPS) of Intel's microprocessors over the last 40 years, as shown in Fig. 1.1 [3].

Fig. 1.1 The evolution of computing performance (MIPS, 1960–2020, approaching the TIPS scale). (Source: Intel)

This continuous growth is a result of the ability to scale the silicon transistor to smaller dimensions in every new technology node. The growth continued, instead of hitting a plateau, in more recent nodes thanks to the addition of performance boosters (e.g., strained Si, high-κ and metal gate, etc.) on top of conventional geometrical scaling. Scaling doubles the number of transistors on an IC in every generation (as famously described by "Moore's Law"), allowing us to integrate more functions on the IC and increase its computing power. We are now in the giga-scale integration era, featuring billions of transistors, GHz operating frequencies, etc. Going forward to tera-scale integration, however, there are a number of imminent show-stoppers (described in the next section) that pose a serious threat to continuous performance
enhancement in ICs, and a new paradigm shift in IC technology and architecture is needed to sustain the historical growth. It is widely recognized that the growth can be sustained if one utilizes the vertical (i.e., the third) dimension of the IC to build three-dimensional ICs, a departure from today's planar ICs, as projected in Fig. 1.2. A three-dimensional integrated circuit (3D IC) refers to a stack consisting of multiple ultra-thin layers of IC that are vertically bonded and interconnected with through-silicon vias (TSVs), as shown in Fig. 1.3. In a 3D implementation, each block can be fabricated and optimized using its respective technology and assembled to form a vertical stack. 3D stacking of ultra-thin ICs is identified as an inevitable solution for future performance enhancement, system miniaturization, and functional diversification.
Fig. 1.2 Historical IC performance growth can be sustained with a new paradigm shift to 3-D integration (performance vs. year: geometrical scaling in the 1990s, performance boosters such as Cu/low-k, strain, HK/MG, high-mobility channels and 3D device structures in the 2000s, and architectural 3D integration going forward)
Fig. 1.3 A conceptual representation of a 3D IC: multiple layers (Layer 1 substrate through Layer 4) carrying device/interconnect levels, joined at bonding interfaces and connected by through-silicon vias
1.2.2 Show-Stoppers and 3-D Integration as a Remedy

1.2.2.1 Transistor Scaling Barriers

There are at least two barriers that will slow down or impede further geometrical scaling. The first one relates to the fundamental properties of transistors in extremely scaled devices. Experimental and modeling data suggest that performance improvement in devices is no longer commensurate with ideal scaling as in the past, due to high leakage and parasitics, and hence devices consume more power. This is shown in Fig. 1.4 by Khakifirooz and Antoniadis [4]. The intrinsic delay of the n-MOS transistor is shown to increase beyond the 45 nm node despite continuous downscaling of the transistor pitch.
Fig. 1.4 The intrinsic delay of the n-MOS transistor is projected to increase in future nodes despite continuous downscaling of the device pitch [4] (ring-oscillator stage delay and intrinsic NMOS delay, in ps, vs. technology node in nm; 45 nm point and 32 nm target marked)

Fig. 1.5 Measured leakage current and threshold voltage in 65 nm devices reported by IBM [6] (frequency histograms of normalized leakage current and normalized threshold voltage)
Another issue related to scaled devices is variability [5]. Variability in transistor performance and leakage is a critical challenge to the continued scaling and effective utilization of CMOS technologies with nanometer-scale feature sizes. Some of the factors contributing to the increase in variability are fundamental to the planar CMOS transistor architecture; random dopant fluctuations (RDFs) and line-edge roughness (LER) are two examples of such intrinsic sources of variation. Other reasons for the increase are the advanced resolution-enhancement techniques (RETs) required to print patterns with feature sizes smaller than the wavelength of lithography. Transistor variation affects many aspects of IC manufacturing and design, and increased transistor variability can have a negative impact on product performance and yield. Figure 1.5 shows measurement data reported by IBM on its 65 nm devices that clearly show variation in leakage current and threshold voltage. Variability worsens as we continue to scale in future technology nodes, and it is a severe challenge.

The second barrier concerns the economic aspect of scaling. Development and manufacturing costs have increased from one node to the next, making scaling a less favorable option in future nodes. 3D integration, on the other hand, achieves device density multiplication by stacking IC layers in the third dimension without aggressive scaling. Therefore, it can be a viable and immediate remedy as conventional scaling becomes less cost effective.

1.2.2.2 On-Chip Interconnect

While dimensional scaling has consistently improved device performance in terms of gate switching delay, it has the reverse effect on global interconnect latency [7]. The global interconnect RC delay has increasingly become the circuit-performance-limiting factor, especially in the deep sub-micron regime. Even though Cu/low-κ multilevel interconnect structures improve interconnect RC delay, they are not a long-term solution, since the diffusion barrier required with Cu metallization has a finite thickness that is not readily scaled.
Fig. 1.6 a Long global wires on an IC can be shortened by chip partitioning and stacking. b 3D integration reduces the number of long wires on an IC (wire count vs. wire length for 2-D and 3-D wiring)
The effective resistance of the interconnect is larger than would be achieved with bulk copper, and the difference increases with reduced interconnect width. Surface electron scattering further increases the Cu line resistance, and hence the RC delay suffers [8]. As chip size continues to increase to accommodate more functionality, the total interconnect length increases at the same time. This causes a tremendous amount of power to be dissipated unnecessarily in interconnects and in the repeaters used to minimize delay and latency. On-chip signals also require more clock cycles to travel across the entire chip as a result of increasing chip size and operating frequency. The rapid rise in interconnect delay and power consumption, due to smaller wire cross-sections, tighter wire pitch, and longer lines that traverse larger chips, severely limits IC performance enhancement in current and future nodes. A 3D IC with multiple active Si layers stacked vertically is a promising way to overcome this scaling barrier, as it replaces long inter-block global wires with much shorter vertical inter-layer interconnects, as shown in Fig. 1.6.
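To give a feel for why replacing long inter-block wires with short vertical connections helps, the following sketch evaluates the standard 0.38·RC delay approximation for a distributed line at two lengths. The per-unit-length resistance and capacitance and the two lengths are illustrative assumptions, not values from this chapter; the point is simply that distributed RC delay grows with the square of the wire length, so wire shortening pays off quadratically.

```python
# Rough, illustrative comparison of distributed RC delay for a long 2-D global
# wire versus a short vertical (TSV-like) path in a 3-D stack.
# All parameter values are assumptions chosen only for illustration.

def rc_delay_ps(length_mm, r_per_mm=1000.0, c_per_mm=0.2e-12):
    """Distributed-line delay (ps) using the 0.38*R*C approximation.

    r_per_mm -- assumed wire resistance in ohm/mm
    c_per_mm -- assumed wire capacitance in F/mm
    """
    r_total = r_per_mm * length_mm      # total resistance (ohm)
    c_total = c_per_mm * length_mm      # total capacitance (F)
    return 0.38 * r_total * c_total * 1e12

long_2d_wire = rc_delay_ps(10.0)        # 10 mm cross-chip global wire
short_3d_path = rc_delay_ps(0.05)       # 50 um vertical inter-layer path

print(f"2-D global wire (10 mm)  : {long_2d_wire:10.2f} ps")
print(f"3-D vertical path (50 um): {short_3d_path:10.4f} ps")
print(f"ratio: {long_2d_wire / short_3d_path:.0f}x (quadratic in length)")
```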
1.2.2.3 Off-Chip Interconnect (Memory Bandwidth Gap)

Fig. 1.7 a Memory hierarchy in today's computer system (core and cache connected by on-chip wires to main DRAM memory over off-chip buses). b Direct placement of memory on the processor via TSVs improves the data bandwidth

Figure 1.7a depicts the memory hierarchy in today's computer systems, in which the processor core is connected to the main memory (DRAM) via power-hungry and slower off-chip buses at the board level. Data transmission on these buses experiences
severe delay and consumes a significant amount of power. The number of available bus channels is also limited by the external pin count available on the packaged chips. As a consequence, the data bandwidth suffers. As the computing power of the processor increases in each generation, the limited bandwidth between the processor core and memory places a severe limitation on overall system performance [9]. The problem is even more pressing in multi-core architectures, as every core demands a data supply. To close this gap, the most direct way is to shorten the connections and to increase the number of data channels. By placing memory directly on the processor, the close proximity shortens the connections, and the density of connections can be increased by using more advanced CMOS processes (as opposed to packaging/assembly processes) to achieve fine-pitch TSVs. This massively parallel interconnection is shown in Fig. 1.7b. Table 1.1 is a comparison between 2D and 3D implementations in terms of connection density and power consumption. Clearly, 3D can provide bandwidth enhancement (a 100X increase at the same frequency) at lower power consumption (a 10X reduction). Effectively, this translates into a 1,000X improvement in bandwidth/power efficiency, an extremely encouraging and impressive number.
Table 1.1 Comparison of 2D and 3D implementations

                                      2D              3D              Remark
Connection density                    <1e3 per cm2    ~1e5 per cm2    100X increment
Power consumption per pin/via*        30–40 mW        ~25 µW
Total power consumption (per cm2)     30–40 W         2.5 W           10X reduction

*Data from Tezzaron [10].
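The headline ratios quoted above follow directly from the entries in Table 1.1; the short sketch below simply recomputes them, taking the 30–40 figures at their lower end and assuming that, at a fixed signaling frequency, achievable bandwidth scales with connection density.

```python
# Recompute the improvement factors implied by Table 1.1 (data from Tezzaron [10]).
# Assumption: at the same frequency, bandwidth scales with connection density,
# so bandwidth/power efficiency scales as (density / total power).

density_2d, density_3d = 1e3, 1e5        # connections per cm^2
power_2d, power_3d = 30.0, 2.5           # total power per cm^2 (W), lower end of 30-40 W

bandwidth_gain = density_3d / density_2d                        # ~100x
power_reduction = power_2d / power_3d                           # ~12x, quoted as ~10x
efficiency_gain = (density_3d / power_3d) / (density_2d / power_2d)

print(f"bandwidth gain        : {bandwidth_gain:7.0f}x")
print(f"total power reduction : {power_reduction:7.0f}x")
print(f"bandwidth/power gain  : {efficiency_gain:7.0f}x")       # ~1,000x
```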
1.3 Options of 3D IC

1.3.1 System Integration Landscape

System integration, that is, the integration of circuits or intellectual property (IP) blocks, is one of the major applications of 3D integration. As such, 3D integration must compete against a number of established technologies. Figure 1.8 compares the relative capability of several system integration methods (board, 2D multi-chip module—2D-MCM, package-on-package—PoP, system-in-package—SiP, and 2D system-on-chip—2D-SoC) in terms of form factor and interconnect density between circuit blocks [11]. 3D integration offers a more compact form factor and higher chip-to-chip interconnect density [11, 12]. Compared with a 2D-SoC, 3D integration shortens time-to-market and lowers the system cost. By using a larger number of smaller and shorter through-silicon vias (TSVs), as compared to wire bonding in a SiP, 3D integration enhances performance through lower latency and higher bandwidth, as well as lower power consumption.
1.3.2 Classification

There are a number of technology options for arranging integrated circuits in a vertical stack.
Fig. 1.8 Comparison of various system integration technologies (board, 2D MCM, PoP, SiP stacked with wire bond, SoC, 3D) in terms of form factor and circuit-to-circuit interconnect density [11, 12]
It is possible to stack ICs in a vertical fashion at various stages of processing: (1) post-singulation 3D packaging (e.g., chip-to-chip), and (2) pre-singulation wafer-level 3-D integration (e.g., chip-to-wafer, wafer-to-wafer, and monolithic approaches). Active layers can be vertically interconnected using physical contact such as bond wires or inter-layer vertical vias (including TSVs). It is also possible to establish chip-to-chip connections via non-contact (or wireless) links such as capacitive and inductive coupling [13]. Capacitive coupling utilizes a pair of electrodes that are formed using conventional IC fabrication. The inductive-coupling I/O is formed by placing two planar coils (planar inductors) above each other and is also made using conventional IC fabrication. The advantages of these approaches are fewer processing steps and hence lower cost, no requirement for ESD protection, low power, and a smaller-area I/O cell. Since there is substantial overlap between the various options and a lack of standardization in terms of definitions, the classification of 3D IC technology is often not straightforward. This chapter makes an attempt to classify 3D IC technology based on the processing stage at which stacking takes place.
1.3.3 Monolithic Approaches

In these approaches, devices in each active layer are processed sequentially, starting from the bottom-most layer. Devices are built on a substrate wafer by mainstream process technology. After proper isolation, a second device layer is formed, and devices are processed by conventional means on the second layer. This sequence of isolation, layer formation, and device processing can be repeated to build a multi-layer structure. The key technology in this approach is forming a high-quality active layer isolated from the bottom substrate. This bottom-up approach has the advantage that precision alignment between layers can be accomplished. However, it suffers from a number of drawbacks. The crystallinity of the upper layers is usually low and imperfect; as a result, high-performance devices cannot be built in the upper layers. Thermal cycling during upper-layer crystallization and device processing can degrade the underlying devices, and therefore a tight thermal budget must be imposed. Due to the sequential nature of this method, manufacturing throughput is low. A simpler FEOL process flow is feasible if polycrystalline silicon can be used for the active devices; however, a major difficulty is obtaining high-quality electrical devices and interconnects. While obtaining single-crystal device layers in a generic IC technology remains in the research stage, polycrystalline devices suitable for non-volatile memory (NVM) have not only been demonstrated but have been commercialized (for example by SanDisk). A key advantage of FEOL-based 3-D integration is that IC BEOL and packaging technologies are unchanged; all the innovation occurs in the 3-D stacking of active layers. FEOL techniques include laser beam recrystallization [14, 15], seeding-assisted recrystallization [16, 17], selective epitaxy and over-growth [18], and grapho-epitaxy [19].
1.3.4 Assembly Approaches

This is a parallel integration scheme in which fully processed or partially processed integrated circuits are assembled in a vertical fashion. Stacking can be achieved with one of these methods: (1) chip-to-chip, (2) chip-to-wafer, and (3) wafer-to-wafer. Vertical connection in chip-to-chip stacking can be achieved using wire bonds, as shown in Fig. 1.9, or through-silicon vias (TSVs), as shown in Fig. 1.10. Wafer-level 3D integration, such as chip-to-wafer and wafer-to-wafer stacking, uses TSVs as the vertical interconnect. This integration approach often involves a sequence of wafer thinning and handling, alignment, TSV formation, and bonding. The key differentiators are:

• Bonding medium—metal-to-metal, dielectric-to-dielectric (oxide, adhesive, etc.) or hybrid bonding;
• TSV formation—via first, via middle or via last;
• Stacking orientation—face-to-face or back-to-face stacking;
• Singulation level—chip-to-chip, chip-to-wafer or wafer-to-wafer.
Fig. 1.9 Stacked die with wire bond interconnections in a chip-scale package [20]

Fig. 1.10 3D chip stacking using through-silicon vias (TSVs between a top chip and a bottom chip on a package) [21]
Fig. 1.11 Wafer bonding techniques for wafer-level 3-D integration: a dielectric-to-dielectric bonding (dielectric = SiO2 or BCB; mechanical bond, via after bonding), b metal-to-metal Cu–Cu bonding (mechanical and electrical bonds, via during bonding), and c dielectric/metal hybrid bonding (dielectric = SiO2; mechanical and electrical bonds, via during bonding)
The types of wafer bonding potentially suitable for wafer-level 3D integration are depicted in Fig. 1.11. Dielectric-to-dielectric bonding is most commonly accomplished using silicon oxide or BCB polymer as the bonding medium. These types of bonding provide a primarily mechanical bond, and the inter-wafer via is formed after wafer-to-wafer alignment and bonding (Fig. 1.11a). When metallic copper-to-copper bonding is used (Fig. 1.11b), the inter-wafer via is completed during the bonding process; note that appropriate interconnect processing within each wafer is required to enable 3D interconnectivity. Besides providing electrical connections between IC layers, dummy pads can also be inserted at the bonding interface at the same time to enhance the overall mechanical bond strength. This bonding scheme inherently leaves behind isolation gaps between the Cu pads, which can be a source of concern for moisture corrosion and can compromise structural integrity, especially when the IC layers above the substrate are thinned down further. Figure 1.11c shows a bonding scheme utilizing a hybrid medium of dielectric and Cu. This scheme in principle provides a seamless bonding interface consisting of a dielectric bond (primarily a mechanical bond) and a Cu bond (primarily an electrical bond). However, very stringent requirements with regard to surface planarity (dielectric and Cu) and Cu contamination of the dielectric layer due to misalignment must be met.

The selection of the optimum technology platform is subject to ongoing development and applications. Cu-to-Cu bonding has significant advantages for the highest inter-wafer interconnectivity. As a result, this approach is desirable for microprocessors and digitally-based system-on-a-chip (SoC) technologies. Polymer-to-polymer bonding is attractive when heterogeneous integration of diverse technologies is the driver and the inter-wafer interconnect density is more relaxed; benzocyclobutene (BCB) is the polymer most widely investigated. Taking advantage of the viscosity of the polymer, this method is more forgiving in terms of surface planarity and particle contamination. Oxide-to-oxide bonding of fully processed IC wafers requires atomic-scale smoothness of the oxide surface.
Fig. 1.12 TSVs can be formed at various stages of IC processing: via first (TSV before FEOL and BEOL), via middle (TSV between FEOL and BEOL), via last from the front side (TSV after BEOL), and via last from the back side (TSV after bonding and thinning)
In addition, wafer distortions introduced by FEOL and BEOL processing cause sufficient wafer bowing and warping to prevent the contact area needed to achieve the required bonding strength. While oxide-to-oxide bonding after FEOL and local interconnect processing has been shown to be promising (particularly with SOI wafers, which allow extreme thinning down to the buried oxide layer), the increased wafer distortion and oxide roughness after multilevel interconnect processing require extra attention during processing.

TSVs can be formed at various stages during the 3D IC process, as shown in Fig. 1.12. When the TSV is formed before any CMOS processes, the process sequence is known as "via first". It is also possible to form the TSV when the front-end processes are completed; in this "via middle" process, back-end processes continue after the TSV process is completed. When the TSV is formed after the CMOS processes are completed, it is known as a "via last" process. The TSV can be formed from the front side or the back side of the wafer. The above schemes have different requirements in terms of process parameters and materials selection, and the choice depends on the final application requirements and the infrastructure in the supply chain.

Another key differentiator in 3D IC integration is the stacking orientation. One option is to perform face-to-face (F2F) alignment and bonding with all required I/Os brought to the thinned backside of the top wafer (which becomes the face of the two-wafer stack). Another approach is to temporarily bond the top wafer to a handling wafer, after which the device wafer is thinned from the back side and permanently bonded to the full-thickness bottom wafer; after this permanent bonding the handling wafer is removed. This is also called back-to-face (B2F) stacking. These two stacking orientations are shown in Fig. 1.13. F2F stacking allows a high-density layer-to-layer interconnection, which is limited by the alignment accuracy.
Fig. 1.13 a Face-to-face stacking and b back-to-face stacking (IC1/ILD1 and IC2/ILD2 layers connected by TSVs)
A handle wafer is not required in F2F stacking, and this imposes more stringent requirements on the mechanical strength of the bonding interface in order to sustain the shear forces during wafer thinning, which is often achieved by mechanical grinding or polishing. Since one of the IC layers is facing down in the final assembly, F2F stacking also complicates the layout design as opposed to more conventional layout design in which the IC layers face up. Another potential disadvantage of F2F stacking relates to the thickening of the ILD layer at the bonding interface, which presents a higher barrier to effective heat dissipation. B2F stacking requires the use of a temporary handle, and the layer-to-layer interconnection density is limited by the TSV pitch. Since the device layer is bonded to a temporary handle, the final permanent bond does not sustain damage resulting from wafer thinning. It requires a temporary bonding medium that can provide sufficient strength during wafer handling and can be readily released after successful permanent transfer of the device layer onto the substrate.

In wafer-level 3D integration, permanent bonding can be done either in chip-to-wafer (C2W) or wafer-to-wafer (W2W) stacking. A comparison of these two methods is summarized in Table 1.2. As shown in Fig. 1.14, the choice of C2W or W2W depends on two key requirements: chip size and alignment accuracy.

Table 1.2 Comparison between wafer-to-wafer and chip-to-wafer stacking

                     Wafer-to-wafer                                    Chip-to-wafer
Wafer/die size       Wafers/dies of common size, to avoid              Dissimilar wafer/die sizes are acceptable
                     silicon area wastage
Throughput           Wafer scale                                       Die scale
Yield                Lower than the lowest-yield wafer; therefore,     Known good die can be used if pre-stacking
                     high-yield wafers must be used                    testing is available
Alignment accuracy   <2 µm global alignment                            ~10 µm for >1,000 dph; <2 µm for <100 dph
Fig. 1.14 The choice between C2W and W2W depends on the chip size and the required alignment accuracy
When high-precision alignment is desired in order to achieve high-density layer-to-layer interconnections, W2W is the preferred choice to maintain acceptable throughput by performing a wafer-level alignment. W2W is also preferred when the chip size gets smaller.
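For intuition about the throughput entries in Table 1.2, the sketch below converts a per-die pick, align, and bond cycle time into a dies-per-hour (dph) figure for chip-to-wafer stacking. The cycle times are illustrative assumptions, not data from this chapter; they merely show why relaxing the alignment to ~10 µm keeps throughput above 1,000 dph while sub-2 µm per-die alignment pushes it below 100 dph.

```python
# Illustrative chip-to-wafer throughput estimate in dies per hour (dph).
# The per-die handling times below are assumptions for illustration only.

def dies_per_hour(pick_s, align_s, bond_s):
    """Convert a per-die cycle time (seconds) into a throughput in dies/hour."""
    return 3600.0 / (pick_s + align_s + bond_s)

relaxed = dies_per_hour(pick_s=1.0, align_s=0.5, bond_s=1.5)    # ~10 um alignment
precise = dies_per_hour(pick_s=1.0, align_s=30.0, bond_s=10.0)  # <2 um alignment

print(f"~10 um placement: {relaxed:6.0f} dph")   # ~1,200 dph
print(f"<2 um placement : {precise:6.0f} dph")   # ~90 dph
```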
1.4 Technology Platforms and Strategies

A number of new enabling technologies must be developed and introduced into the existing fabrication process flow to make 3D integration a reality. Depending on the level of granularity, new capabilities include wafer bonding (permanent or temporary), through-silicon/strata vias (TSVs), wafer thinning and handling, and precision alignment. A number of treatments of these technology platforms are available in the literature and the references therein [22, 23]. A brief introduction to the TSV process flow is given below; this section primarily discusses low-temperature Cu–Cu permanent bonding, which is the author's core research expertise.
1.4.1 Through Silicon Via

Figure 1.15 shows a generic process flow for TSV fabrication using Cu as the core metal. It begins with high-aspect-ratio deep etching of the Si. A dielectric liner layer is then deposited on the via sidewall, followed by deposition of the barrier and Cu seed layers. The liner layer, made of a dielectric such as silicon dioxide, provides electrical isolation between the Cu core and the Si substrate. The liner thickness must be chosen appropriately to control the leakage current and the capacitance between the Cu core and the Si substrate.
Fig. 1.15 A generic process flow of Cu-filled TSV fabrication: high-aspect-ratio Si deep etching; liner deposition, followed by barrier and seed layer deposition; super-conformal Cu filling; removal of the Cu over-burden
Super-conformal Cu filling is then achieved with an electroplating process; super-conformal filling is required to prevent void formation in the Cu TSV. Finally, the Cu over-burden is removed by chemical mechanical polishing. More information on TSV fabrication can be found in the literature, such as [22, 23].
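Since the liner sets the capacitance between the Cu core and the Si substrate, a quick way to see how the liner thickness enters is to treat the TSV as a coaxial (cylindrical) capacitor across the oxide liner. The sketch below does this for an assumed geometry; the dimensions are illustrative rather than taken from this chapter, and the simple model ignores the depletion region in the silicon, which reduces the effective capacitance further.

```python
import math

# Cylindrical-capacitor estimate of the Cu-core-to-substrate capacitance across
# the SiO2 liner of a TSV. The geometry values are assumptions for illustration.

EPS0 = 8.854e-12      # vacuum permittivity, F/m
K_SIO2 = 3.9          # relative permittivity of the SiO2 liner

def tsv_liner_capacitance(radius_um, liner_nm, depth_um):
    """Oxide-limited TSV capacitance (F), ignoring Si depletion effects."""
    r_inner = radius_um * 1e-6                   # Cu core radius (m)
    r_outer = r_inner + liner_nm * 1e-9          # core radius + liner thickness (m)
    depth = depth_um * 1e-6                      # TSV depth (m)
    return 2 * math.pi * K_SIO2 * EPS0 * depth / math.log(r_outer / r_inner)

# Example: 5 um diameter TSV, 50 um deep, with two liner thicknesses.
c_thin = tsv_liner_capacitance(radius_um=2.5, liner_nm=200, depth_um=50)
c_thick = tsv_liner_capacitance(radius_um=2.5, liner_nm=500, depth_um=50)
print(f"200 nm liner: {c_thin * 1e15:5.0f} fF")   # thicker liner -> lower capacitance
print(f"500 nm liner: {c_thick * 1e15:5.0f} fF")
```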
1.4.2 Cu–Cu Permanent Bonding

3D integration of integrated circuits by means of bump-less Cu–Cu direct bonding is an attractive choice, as it accomplishes both electrical and mechanical bonds simultaneously. Cu–Cu direct bonding is preferred over solder-based connections because: (1) a Cu–Cu bond is more scalable and ultra-fine pitch can be achieved; (2) Cu has better electrical and thermal conductivities; and (3) Cu has much better electromigration resistance and can withstand the higher current densities of future nodes. Direct Cu–Cu bonding has been demonstrated using thermo-compression bonding (also known as diffusion bonding). As the name implies, thermo-compression bonding involves simultaneous mechanical pressing (~200 kPa) and heating of the wafers (~300–400°C). Two wafers can be held together when the Cu thin films bond to form a uniform bonded layer. In order for this technique to be applicable to wafers that carry device and interconnect layers, an upper bound of 400°C is set on the temperature step to prevent undesired damage, particularly to the interconnects. The main objective of this Cu thermo-compression bonding study is to explore its suitability as a permanent bond that holds active device layers together in a multi-layer IC stack. Cu is the metal of choice for 3-D IC applications because it is a mainstream CMOS material and has better electrical and thermal conductivities than Al-based interconnect. Most importantly, Cu bonds to itself under conditions compatible with CMOS back-end processes, as initially demonstrated by Fan et al. [24].
1.4.2.1 Wafer Preparation and Bonding Procedures

In this section, wafer bonding by Cu thermo-compression is demonstrated and characterized on blank Si wafers. All wafers used in this experiment were p-type 150 mm Si-(100) wafers of 10–20 Ω-cm resistivity. Thermal oxide (5,000 Å) was grown on the wafers. All wafers received a 10 min piranha (H2O2:H2SO4 = 1:3, by volume) solution clean followed by a deionized water rinse and spin-dry prior to metallization. The next step was the deposition of tantalum (50 nm) and copper (300 nm) in an e-beam deposition system; Ta was used to prevent Cu out-diffusion into the oxide layer. The chamber pressure during metal deposition was 1 × 10−6 Torr. The rms roughness of the Cu/Ta/SiO2/Si wafers is estimated to be around 1.99 nm from an AFM scan.

A pair of wafers was aligned face-to-face in a wafer aligner and clamped together on a bonding chuck. Three separation metal flaps were inserted between the wafers at the edges, and the pair was loaded into the bonder. Three cycles of N2 purge were done, and the chamber was evacuated to ~1 × 10−3 Torr. At this point, a down force was applied on the wafer pair while the flaps were pulled out. The temperatures of the chuck and the top electrode were ramped up to and maintained at 300°C. The contact force was 4,000 N when the wafer pair was in full contact at 300°C, and the bonding step lasted for 1 h. After bonding, the bonded wafers were annealed in an atmospheric N2 ambient for 1 h at 400°C.
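As a quick consistency check on the numbers above, the average pressure implied by a 4,000 N contact force on a fully contacted 150 mm wafer can be computed directly; the only assumption in this sketch is that the force is spread uniformly over the whole wafer area.

```python
import math

# Average bonding pressure implied by a 4,000 N force on a 150 mm wafer,
# assuming the force is distributed uniformly over the full wafer area.

force_n = 4000.0
wafer_diameter_m = 0.150
wafer_area_m2 = math.pi * (wafer_diameter_m / 2) ** 2   # ~0.0177 m^2

pressure_kpa = force_n / wafer_area_m2 / 1e3
print(f"average contact pressure ~ {pressure_kpa:.0f} kPa")
# ~230 kPa, consistent with the ~200 kPa thermo-compression pressure
# mentioned at the start of Sect. 1.4.2.
```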
1.4.2.2 Bonding Mechanism

In order to understand the microstructure of the bonded Cu layer, transmission electron microscopy (TEM) analysis was performed on this sample. Note that the two Cu bonding layers merge and a homogeneous bonded layer is obtained, as shown in the TEM image in Fig. 1.16. As can be seen from this image, large Cu grains, which often extend beyond the original bonding interface, are obtained after bonding and annealing. Dislocation lines are also found in the Cu grains. A possible bonding mechanism that gives rise to the above grain structures is proposed in this section.

Fig. 1.16 TEM image of the bonded Cu layer (Ta/Cu–Cu/Ta stack; scale bar 0.2 µm). The bonding Cu layers merge and a homogeneous Cu layer is obtained after bonding and anneal. Grain structures that extend across the original bonding interface are observed, and dislocation lines are clearly seen in the grains

From the TEM image, it is evident that there is substantial grain growth during bonding and annealing. The jagged Cu–Cu interface suggests that inter-diffusion between the two Cu layers has taken place. During bonding and subsequent annealing, the Cu layers are in intimate contact under the applied pressure. At the bonding temperature, Cu atoms acquire sufficient energy to diffuse rapidly, and Cu grains begin to grow. At the bonding interface, diffusion can happen across the interface, and grain growth can progress across it. After a sufficiently long duration, large Cu grains on the order of 300–500 nm are obtained, and a homogeneous bonded Cu layer is formed. Energy-dispersive spectroscopy (EDS) analysis of the bonded Cu layer shows that, apart from Cu, no appreciable foreign contaminant is found in the bonded layer within the detection limit of EDS.
Since e-beam deposition is not used in typical manufacturing environments for metallization, the above experiment was repeated using Cu deposited by electrochemical means (electroplated Cu), and a similar bonding characteristic was observed.

1.4.2.3 Surface Oxide

Since the mechanism of Cu thermo-compression bonding is based on Cu inter-diffusion and grain growth, surface contaminants such as oxide are detrimental to successful bonding, especially at very low temperature. However, there is more often than not a time lag between Cu deposition and bonding, and therefore the formation of surface oxide is inevitable. Excessive oxygen incorporation into the bonded Cu layer might also increase the resistivity of the Cu layers and hence degrade the electrical performance of the Cu interconnects. Techniques that can be used to reduce the surface oxide prior to bonding include a chemical clean such as HCl [1] or glacial acetic acid, followed by a forming gas purge in the bonding chamber [25] prior to bonding. A forming gas anneal can also be performed on the Cu wafers prior to bonding; experimental evidence of the reduction of the oxygen content in the bonded Cu layer is reported in [26].

1.4.2.4 Process Parameters

There are a number of important process parameters that directly determine the quality of the final bond during Cu thermo-compression bonding.
Three important bonding parameters, i.e., temperature, duration, and contact pressure, are frequently considered. In the bonding procedure described above, thermo-compression bonding of Cu is accomplished in two steps: an initial bonding step to establish a bond between the pairing wafers, and a post-bonding anneal to enhance the bond. Since the bonding step is a single-wafer-pair step, a long bonding duration decreases throughput in a manufacturing environment. On the other hand, annealing can be accomplished in an atmospheric furnace, and it is possible to process batches of wafers during annealing. Therefore, a better way to achieve high-throughput Cu wafer bonding is to initiate a preliminary bond with a short bonding step and to enhance the bond strength with a post-bonding anneal. A number of references that discuss process parameters during thermo-compression bonding of Cu can be found in [27, 28].

1.4.2.5 Observation of Interfacial Voids

In order for the bonded Cu layer to act as a reliable electrical or mechanical bond, a defect-free and uniformly bonded Cu layer is desired. Careful examination by SEM analysis across a length of 20 μm reveals large voids in the bonded Cu layers, as shown in Fig. 1.17. These voids are located at and near the original bonding interface. They can provide nucleation sites for electromigration failure and can lead to open-circuit failures. When the void density is too high, the voids can also cause thin-film delamination.
Fig. 1.17 Interfacial void growth during Cu thermo-compression bonding at 300°C for a 10 min, b 30 min, and c 60 min
Fig. 1.18 Non-blanket bonding of Cu lines
Therefore, this observation requires a careful understanding of the origin of these voids so that counter-measures can be implemented [29].

1.4.2.6 Non-Blanket Cu Bonding

Since Cu is a conductive medium, a continuous Cu bonding layer between active layers is of no practical use. In an actual multi-layer 3-D IC implementation with Cu as the bonding medium, Cu bonding should be done in the form of pad-to-pad or line-to-line bonding with proper electrical isolation. Figure 1.18 shows a cross section of Cu lines (2–9 µm) that are successfully bonded [30]. The spacing between the bonded lines is 5.3 µm, and it is filled with air. Interfacial voids are observed in the bonded lines, and they can lead to serious reliability concerns; the bonding process should be optimized to minimize void formation. Another reliability concern is the empty space between the bonded lines, which might reduce the mechanical support between the active layers. Moisture in the empty space can also potentially corrode the bonded Cu lines. One solution is to form damascene Cu lines and to perform hybrid bonding of Cu and dielectric. A few examples are:

1. Jourdain et al. [31] at IMEC have successfully demonstrated the 3-D stacking of an extremely thinned IC chip onto a Cu/oxide landing substrate using simultaneous Cu–Cu thermo-compression and compliant glue-layer (BCB) bonding.
The goal of this intermediate BCB glue layer between the two dies is to reinforce the mechanical and thermal stability of the bonded stack and to enable separation of die pick-and-place operations from a collective bonding step;
2. Gutmann et al. [32] at RPI have demonstrated another scheme of hybrid bonding using face-to-face bonding of Cu/BCB redistribution layers. The first step is to prepare the single-level damascene-patterned structures (Cu and BCB) by CMP in the two Si wafers to be bonded. The second step is to align the two wafers and bond them;
3. Researchers at Ziptronix have developed a Cu/oxide hybrid bonding technology known as Direct Bond Interconnect (DBI™) [33]. Vertical interconnections in the direct oxide bond DBI™ are achieved by preparing a heterogeneous surface of non-conductive oxide and conductive Cu. The surfaces are aligned and placed together to effect a bond. The high bond energies possible with the direct oxide bond between the heterogeneous surfaces result in vertical DBI™ electrical interconnections.
1.4.2.7 Low Temperature Cu Bonding

Thermo-compression bonding of Cu layers is typically performed at a temperature of 300°C or higher. There is strong motivation to move the bonding temperature to an even lower range, primarily because of the thermal stress induced by the CTE mismatch of dissimilar materials in a multi-layer stack and by temperature swings. A number of approaches have been explored:

1. Surface Activated Bonding [34]—In this method, a low-energy Ar ion beam is used to activate the Cu surface prior to bonding. Contacting two surface-activated wafers enables successful Cu–Cu direct bonding. The bonding process is carried out under ultrahigh vacuum (UHV) conditions. No thermal annealing is required to increase the bonding strength. Tensile test results show that a bonding strength equivalent to the bulk material is achieved at room temperature. In [35], the adhesion of Cu–Cu bonded at room temperature under UHV conditions was measured to be about ~3 J/m2 using an AFM tip pull-off method;
2. Cu Nanorod [36]—Recent investigation of the surface melting characteristics of copper nanorod arrays shows that the threshold of the morphological changes of the nanorod arrays occurs at a temperature significantly below the copper bulk melting point. With this unique property of copper nanorod arrays, wafer bonding using copper nanorod arrays as a bonding intermediate layer has been investigated at low temperatures (400°C and lower). Silicon wafers, each with a copper nanorod array layer, are bonded at 200–400°C. The FIB/SEM results show that the copper nanorod arrays fuse together, accompanied by grain growth, at a bonding temperature as low as 200°C;
3. Solid-Liquid Inter-diffusion Bonding (SLID) [37]—This method involves the use of a second solder metal with a low melting temperature, such as tin (Sn), in between two sheets of Cu with a high melting temperature.
Typically, a short reflow step is followed by a longer curing step. The required temperature is often slightly higher than the Sn melting temperature (232°C). The advantages of SLID are that the inter-metallic phase is stable up to 600°C and that the contact force requirement is not critical;
4. In the DBI™ technology described in [33], a moderate post-oxide-bonding anneal may be used to effect the desired bonding between the Cu surfaces. Due to the difference in coefficient of expansion between the oxide and Cu and the constraint of the Cu by the oxide, the Cu surfaces compress each other during heating and a metallic bond can be formed;
5. Direct Cu–Cu bonding at atmospheric pressure has been investigated by researchers at LETI [38]. By means of CMP, the roughness and hydrophilicity (measured by contact angle) of the Cu film are improved from 15 to 0.4 nm and from 50° to 12°, respectively. Blanket wafers were successfully bonded at room temperature with an impressive bond strength of 2.8 J/m2. With a post-bonding anneal at 100°C for 30 min, the bond strength improved to 3.2 J/m2;
6. A novel Cu–Cu bonding process has been developed and characterized to create all-copper chip-to-substrate input/output (I/O) connections [39]. Electroless copper plating followed by low-temperature annealing in a nitrogen environment was used to create an all-copper bond between copper pillars. The bond strength of the all-copper structure exceeded 165 MPa after annealing at 180°C. While this technique is demonstrated as a packaging solution, it is an attractive low-temperature process for Cu–Cu bonding;
7. In the author's research group, a method of Cu surface passivation using a self-assembled monolayer (SAM) of alkane-thiol has been developed. This method has been shown to be effective in protecting the Cu surface from particle contamination and in retarding surface oxidation. The SAM layer can be thermally desorbed in-situ in the bonding chamber rather effectively, hence providing a clean Cu surface for successful low-temperature bonding. Cu wafers bonded at 250°C show a significant reduction in micro-voids and substantial Cu grain growth at the bonding interface [40–43].
1.5 Applications, Status and Outlook

Applications enabled by 3D technology fall into a few categories, depending on the use of TSVs and bonding. At the time of this writing, there are already commercial products made with TSVs, and many others are still under research and development. This section samples a number of closely watched applications enabled by 3D technology, including image sensors, high-density memory stacks, memory/logic integration, and more futuristic ones like 3D heterogeneous systems. It ends with a positive note on the current status and future outlook of 3D integration.
Fig. 1.19 Applications enabled by 3D technology, grouped by their use of TSVs and bonding: CMOS image sensor, backside ground, Si interposer, memory stacking with no redesign, memory/logic and logic/logic integration, heterogeneous systems, and F2F chip/chip bonding with conventional I/O at the periphery
The main drivers for 3D applications include: (1) form factor, such as replacing wire bonds with TSVs in CMOS image sensors; (2) high density, such as stand-alone memory stacks; (3) performance, such as bandwidth enhancement in memory on logic; and (4) heterogeneous integration of disparate chips. Regardless of the main driver, the feasibility of, and key consideration for, any 3D application aimed at consumer products has always been low-cost manufacturing.

Broadly, applications enabled by 3D technology can be classified into three categories, as shown in Fig. 1.19. The first group of products utilizes only TSVs, such as CMOS image sensors (at the time of writing, there are commercial products from companies such as ST Microelectronics, Toshiba, OKI, etc.), backside ground (e.g., a SiGe power amplifier by IBM), and silicon interposers (based on the silicon carrier alone). In this class of devices, chip-to-chip bonding is not required. In another group, 3D devices are implemented by bonding chip on chip in a face-to-face fashion. Unlike the system shown in Fig. 1.9, chip-to-chip electrical connections can be established with micro-bumps or bump-less metal–metal bonding; I/O is formed using conventional wire bond or flip chip in the non-bonded periphery area. One such example is the Sony PlayStation, featuring memory on logic. The real 3D devices that make use of both TSVs and bonding include stand-alone high-density memory stacks, memory on logic, logic on logic, and heterogeneous systems. At the time of this writing, there has been an announcement from Elpida of a multi-layer DRAM stack using 3D stacking technology.

For more than 40 years, performance growth in ICs was realized primarily through geometrical scaling. In more recent nodes, performance boosters have been used to sustain this historical growth. Moving forward, 3D integration is an inevitable path. There has been significant investment in 3D technology by various sectors, and the development has been both rewarding and encouraging. While 3D technology is not without its challenges, wider adoption of 3D technology is likely in the future, when solutions for thermal management, EDA tools, testing, and standardization are made available in the IC supply chain.

Acknowledgments  The author is supported by funding from Nanyang Technological University through the award of a Nanyang Assistant Professorship, the Defense Science and Technology Agency (DSTA, Singapore), the Semiconductor Research Corporation (SRC, USA) through a subcontract from the Interconnect and Packaging Center at the Georgia Institute of Technology, and the Defense Advanced Research Projects Agency (DARPA, USA). The author thanks Professor Rafael Reif of MIT for his constructive and valuable comments on the content of this chapter.
Chapter 2
The Promises and Limitations of 3-D Integration Axel Jantsch, Matthew Grange and Dinesh Pamunuwa
2.1 Introduction

Due to their compact geometry, 3-D integrated systems hold the promise of significantly reducing latency, power consumption and area, while increasing bandwidth. In the following we quantify the potential and limits of 3-D integration by analyzing the theoretical performance of various 2-D and 3-D topologies.

We adopt T. Claasen's notion of the intrinsic computational efficiency of silicon [1]. The intrinsic computational efficiency is obtained when all the silicon area is filled with elementary operations, say 32-bit adders, and no area is "wasted" on data communication and control. Figure 2.1 shows how the intrinsic computational efficiency, measured in millions of operations per second per Watt, increases with technological progress. The left part of the curve is copied from T. Claasen's original paper [1], while the right part is based on our own model, as introduced below. For comparison we have marked the performance of two recent multi-core processors from Tilera Inc. [2]. Different architectures, such as micro-processors, DSPs, FPGAs and custom hardware, will approximate this line to a higher or lower degree depending on how well an application matches the architecture and how much flexibility is built in. But no real processing unit can match or exceed it. Larger and more general-purpose processors exhibit a greater gap because they spend more area and power on interconnect, control and the provision of programmability.

Communication performance, area and power consumption directly benefit from 3-D integration due to geometric properties. Figure 2.2 shows how the geometric distance between cores grows very differently in 2-D and 3-D structures with the number of cores.
Fig. 2.1 Computational efficiency vs. minimum feature size (intrinsic computational efficiency in MOPS/Watt plotted against the technology node in nm, with the Tilera 36-core and 100-core processors marked for comparison). The discrepancy in the overlapping section between T. Claasen's work and this study is due to the significant variation in the energy consumption of different adder architectures. The performance of the two processors shown falls outside of this range due to the overhead of the control circuitry and interconnect, which is not accounted for by this metric

Fig. 2.2 The average geometric distance for a multicore system for a 2-D and a 3-D realization (average communication distance plotted against the number of cores, between 16 and 1,024)
Since for global and long-distance communication the geometric distance translates linearly to latency, we can expect to cut communication latency by 50%. A number of recent studies of communication performance in 3-D structures [3–7] demonstrate the significant potential of 3-D integration technology for reducing power consumption and increasing performance.

3-D integration enables stacking of memory on top of processors, thus realizing a direct low-latency and high-bandwidth memory access link. However, to exploit the benefits, the memory architecture has to be adapted to allow for multi-port, parallel memory access. Several recent studies have explored various memory and cache architectures that exploit the third dimension. For instance, Li et al. [8] propose a 3-D distributed L2 cache and observe a 50% access latency reduction, essentially due to shorter wires within the L2 cache. Loh [9] explores the effect of parallel memory access by means of multiple memory controllers and ranks in a 3-D stacked DRAM-based memory architecture and reports a performance increase of more than 280% over a conventional memory architecture for a set of benchmark applications. In our model we assume that in a 3-D topology DRAM is used as embedded memory, because it can be placed on a separate die, thus leveraging the capability of 3-D to integrate different process technologies in the same system.
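The geometric trend behind Fig. 2.2 can be illustrated with a small experiment. The sketch below is our own illustration rather than the authors' exact model: it computes the average Manhattan distance between all core pairs for a square 2-D grid and for a stacked 3-D arrangement of the same number of cores, assuming unit spacing between neighbouring cores and, for simplicity, counting a layer crossing as one unit as well.

```python
# Illustrative sketch (not the chapter's exact model): average pairwise
# Manhattan distance between cores in a 2-D vs. a stacked 3-D arrangement,
# assuming unit spacing between adjacent cores and unit cost per layer crossing.
import itertools
import math

def average_distance(dims):
    """Average Manhattan distance over all ordered pairs of grid points."""
    coords = list(itertools.product(*[range(d) for d in dims]))
    total = sum(sum(abs(a - b) for a, b in zip(p, q))
                for p in coords for q in coords)
    return total / (len(coords) ** 2)

for cores in (16, 64, 256, 1024):
    side2d = math.isqrt(cores)                 # e.g. 1024 cores -> 32 x 32 grid
    layers = 4                                 # assumed number of stacked layers
    side3d = math.isqrt(cores // layers)       # e.g. 1024 cores -> 16 x 16 x 4
    d2 = average_distance((side2d, side2d))
    d3 = average_distance((side3d, side3d, layers))
    print(f"{cores:5d} cores: 2-D avg distance {d2:5.2f}, "
          f"3-D ({layers} layers) avg distance {d3:5.2f}")
```

Even under these crude assumptions, the 3-D arrangement keeps the average distance noticeably shorter, and the gap widens as the core count grows, which is the effect exploited throughout this chapter.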
In Sect. 2.2, we introduce and motivate our analytical model for performance comparison. Then, we describe the technological parameters for performance, power consumption and area for the 2-D and 3-D topologies (Sect. 2.3). Also, we explain our scaling methodology. Section 2.4.1 delves into the impact of how memory is distributed in the system. Not surprisingly, a distributed memory exhibits higher performance than a centralized memory. However, after distributing 80% of the memory, further diffusion of the storage has little added benefit. In Sect. 2.4.3 we become more concrete by assuming specific system sizes and frequencies. This allows for analyzing performance limits under power, frequency and area constraints. Finally, we provide further discussion of the model and our conclusions in Sect. 2.5.
2.2 Computational Efficiency of Silicon

2.2.1 Computation

The Intrinsic Computational Efficiency (ICE) of silicon [1] is the number of 32-bit add operations per Joule, or the number of operations per second per Watt. As a concrete example, let's assume one 32-bit adder covers an area of 2,956 μm² and each 32-bit addition consumes 6.9 pJ in a 130 nm technology; it covers an area of 437.3 μm² and consumes 1.85 pJ in a 50 nm technology. Thus, we get the following intrinsic computational efficiencies:

ICE_130 = 1/(6.9 pJ) = 144 GOPS/W
ICE_50 = 1/(1.85 pJ) = 540 GOPS/W
The ICE reflects the amount of computation that can be done within an energy envelope, but it does not measure the amount of computation per area or per volume. We now define the Intrinsic Computational Density (ICD) as the number of 32-bit adders that fit into 1 mm². For instance, with the example above we get

ICD_130 = 1/(2,956 μm²) = 338.3 operators/mm²
ICD_50 = 1/(437.3 μm²) = 2,286.8 operators/mm²
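As a quick check of these numbers, the short sketch below recomputes ICE and ICD from the adder energy and area figures quoted above; the figures themselves come from the text, and everything else is plain unit conversion.

```python
# Recompute ICE (operations per Joule, shown as GOPS/W) and ICD (adders per mm^2)
# from the 32-bit adder figures quoted in the text.
adders = {
    "130 nm": {"energy_pJ": 6.9,  "area_um2": 2956.0},
    "50 nm":  {"energy_pJ": 1.85, "area_um2": 437.3},
}

for node, a in adders.items():
    ice_gops_per_watt = 1.0 / (a["energy_pJ"] * 1e-12) / 1e9   # 1/J -> GOPS/W
    icd_per_mm2 = 1.0 / (a["area_um2"] * 1e-6)                 # 1/um^2 -> 1/mm^2
    print(f"{node:6s}: ICE = {ice_gops_per_watt:6.1f} GOPS/W, "
          f"ICD = {icd_per_mm2:7.1f} operators/mm^2")
```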
Figure 2.1 shows the intrinsic computational efficiency as a function of technology nodes. Theo Claasen's plot from 1999 is repeated and, for comparison, the ICE figures of our model (Sect. 2.3) for technology nodes between 180 and 17 nm are added.
2.2.2 Adding Memory and Communication

ICE and ICD give the upper bound of the amount of computation that can be done with a given silicon technology under the assumption that the entire area is densely packed with computation units. However, we also need to account for memory, where the data are stored before and after processing, and for interconnect, which allows the data to move between processing units and memory. In the following we study variants of ICE and ICD under less ideal assumptions in different 2-D and 3-D configurations. In particular we explore the following factors and configurations.

• Ratio of computation to memory μ: We distinguish between the temporal ratio μT and the spatial ratio μS. The relative number of memory accesses for each operation is μT, while μS is the number of memory words in the system for each operator. With memory we mean SRAM, caches and the like, but also off-chip DRAM. If μT = 1, there is one memory access for each operation. Typical values are between 1 and 3. On the other hand, the amount of memory is usually much higher than the number of operators. Hence, typical values for μS are between 1,000 and 10,000, as we discuss later.
• Ratio of on-chip versus off-chip memory ω: If ω = 1, all memory is on-chip; if ω = 0, all memory is off-chip. In a 3-D topology, with on-chip we mean all dies in the 3-D stack.
• Memory distribution factor Δ:
  − Δ = 0: completely distributed memory, where the distance between a computation unit and the memory is always 0;
  − Δ = 1: completely central memory, where the distance between a computation unit and the memory is always the diameter of the system (or off-chip);
  − for example, Δ = 0.05 models a cache system where 95% of all memory accesses are local and 5% are far away.
  The idea is to account for the communication required to write to and read from memory. If the memory is completely distributed, we assume all memory reads and writes are local and no long-range communication is required. Obviously, this is a simplification, but any specific architecture-application pair can be characterized by a Δ value between 0 and 1, denoting the amount of global on-chip communication occurring.
• We explore different 2-D and 3-D topologies, but we typically compare systems with the same total silicon area. For example, if the total area is 400 mm², the configurations considered are (the per-die geometry they imply is sketched below):
  − 2D: one plain silicon die of size 20 × 20 mm²;
  − 3D2: two dies stacked upon each other, each 200 mm²;
  − 3D4: four dies stacked upon each other, each 100 mm²;
  − 3D8: eight dies stacked upon each other, each 50 mm²;
  − 3D16: sixteen dies stacked upon each other, each 25 mm².
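To make the later formulas easier to follow, the sketch below (our own illustration, using the 400 mm² example) derives for each configuration the per-die area, the die edge length, the average lateral traversal (half the edge in each of the two in-plane dimensions) and the average number of vertical layer crossings. These are the quantities that reappear in the interconnect terms of Sect. 2.2.3.

```python
# Per-die geometry implied by the 400 mm^2 example configurations.
# The average lateral traversal (half the die edge in each of the two in-plane
# dimensions) and the average number of vertical layer crossings (n/2) are the
# quantities that reappear in the E_int and A_int terms of Sect. 2.2.3.
import math

TOTAL_AREA_MM2 = 400.0
configs = {"2D": 1, "3D2": 2, "3D4": 4, "3D8": 8, "3D16": 16}

for name, layers in configs.items():
    die_area = TOTAL_AREA_MM2 / layers
    edge = math.sqrt(die_area)                     # die edge length in mm
    lateral = edge / 2 + edge / 2                  # average in-plane distance, mm
    vertical = layers / 2 if layers > 1 else 0     # a planar die crosses no TSV layers
    print(f"{name:5s}: die {die_area:6.1f} mm^2, edge {edge:5.2f} mm, "
          f"lateral {lateral:5.2f} mm, vertical hops {vertical:4.1f}")
```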
2.2.3 Effective Computational Efficiency

To include the effects of memory and interconnect, we define the Effective Energy (EE) for a 32-bit addition as follows:

EE^tn_arch = E32 + μT · (ω · (e1 + Δ · E_int^tn_arch) + (1 − ω) · (e1 + E_int^tn_arch + Eoffchip))
for a given technology node, tn, and a given architecture, arch ∈ {2D, 3D2, 3D4, 3D8, 3D16}. The three main terms correspond to the energy consumption of an addition, of on-chip memory access and of off-chip memory access, respectively.

• e1 is the amount of energy it takes to read or write one 32-bit word in on-chip SRAM.
• E_int^tn_arch is the energy it takes to transport one 32-bit word from a non-adjacent on-chip memory to the local cache (see Table 2.1). For example, if the total silicon area is 400 mm², we have

  E_int^tn_arch =
    (10 + 10) e2(tn)                 if arch = 2D
    (7.07 + 7.07) e2(tn) + e3(tn)    if arch = 3D2
    (5 + 5) e2(tn) + 2 e3(tn)        if arch = 3D4
    (3.5 + 3.5) e2(tn) + 4 e3(tn)    if arch = 3D8
    (2.5 + 2.5) e2(tn) + 8 e3(tn)    if arch = 3D16

• e2 is the energy it takes for a 32-bit word to be transported 1 mm horizontally in a given technology.
• e3 is the energy it takes to move a 32-bit word from one vertical level to the next via a set of TSVs.
• Eoffchip is the energy it takes to get off-chip and to read or write the off-chip memory. It includes the I/O drivers, the inter-chip communication and the energy consumption of the memory chip.

The idea of E_int is to capture the communication energy in different architectures to get from an arbitrary point in the system to a particular point at the system boundary. For a 2-D 20 × 20 mm² die, the distance is on average 10 mm in each dimension, hence 20 mm in total. For a 3-D structure we also have to traverse half of the vertical levels on average; for example, for a 3D4 we have to traverse two vertical levels. Thus, the effective energy EE gives the required energy for a 32-bit addition when memory access and communication are taken into account. The factors μT, ω and Δ are abstractions of architectural choices and features. Based on EE we define the Effective Computational Efficiency (ECE) as follows:

ECE^tn_arch = 1 / EE^tn_arch
which gives the amount of computation we can do within the energy envelope of 1 J, or the amount of computation per second we can do within the power envelope of 1 W.
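A compact way to see how these pieces interact is to code the model directly. The sketch below implements EE and ECE as defined above; apart from the 50 nm adder energy taken from the text, the per-word energy values are placeholders chosen only to make the example run, not calibrated data from this chapter.

```python
# Minimal implementation of the Effective Energy (EE) and Effective Computational
# Efficiency (ECE) model defined above. Apart from e32 (the 50 nm adder energy
# from the text), the energy values are placeholders for illustration.
import math

def e_int(total_area_mm2, layers, e2, e3):
    """Energy to reach a non-adjacent on-chip memory: average lateral traversal
    (the die edge length) times e2, plus n/2 TSV layer crossings times e3."""
    lateral_mm = math.sqrt(total_area_mm2 / layers)
    vertical_hops = layers / 2 if layers > 1 else 0
    return lateral_mm * e2 + vertical_hops * e3

def effective_energy(e32, e1, e2, e3, e_offchip,
                     total_area_mm2, layers, mu_t, omega, delta):
    """EE = E32 + muT*(omega*(e1 + delta*E_int) + (1-omega)*(e1 + E_int + E_offchip))."""
    ei = e_int(total_area_mm2, layers, e2, e3)
    on_chip = e1 + delta * ei
    off_chip = e1 + ei + e_offchip
    return e32 + mu_t * (omega * on_chip + (1 - omega) * off_chip)

# Placeholder energies in Joules per 32-bit word / operation.
params = dict(e32=1.85e-12, e1=1.0e-12, e2=0.5e-12, e3=0.05e-12, e_offchip=50e-12)

for name, layers in [("2D", 1), ("3D4", 4), ("3D16", 16)]:
    ee = effective_energy(**params, total_area_mm2=400, layers=layers,
                          mu_t=1.0, omega=1.0, delta=0.05)
    print(f"{name:5s}: EE = {ee * 1e12:5.2f} pJ, ECE = {1.0 / ee / 1e9:6.1f} GOPS/W")
```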
Table 2.1 Notation and metrics of comparison

Abstraction of architectural design parameters
  tn: Minimum feature size of a technology node (nm)
  arch: Architecture of the system (2D, 3D2, 3D4, 3D8, 3D16)
  ω: Ratio of on- to off-chip memory (ω = 1: all memory is on-chip, ω = 0: all off-chip)
  Δ: Memory distribution factor (Δ = 1: all centralized memory, Δ = 0: all local)
  μT: Number of memory accesses per h/w operation (μT = 1: one memory access per operation)
  μS: Amount of memory per h/w operator (typically μS = 1,000–10,000)
  σ: Interconnect sharing factor (σ = 1: no sharing, σ = 0: completely shared)
  n: Number of die layers for 3-D architectures
  area: Area of a die

Technology and architecture dependent parameters
  E32: Energy for a 32-bit add operation
  e1: Energy for a 32-bit read/write to local SRAM
  e2: Energy to transport a 32-bit word over 1 mm on a planar on-chip bus
  e3: Energy to transport a 32-bit word over one vertical layer across TSVs
  a1: Area for a 32-bit memory word in SRAM or DRAM
  a2: Area for a 1 mm long 32-bit planar on-chip bus
  a3: Area for 32 TSVs
  Eoffchip: Energy to read/write to off-chip memory; includes I/O drivers, inter-chip communication and memory chip energy consumption

Primary comparison metrics
  ICE: Number of 32-bit add operations per Joule
  ICD: Number of 32-bit adders per mm²
  A_int^tn_arch: Interconnect area required to transport a 32-bit word from a non-adjacent on-chip memory to the local cache: √(area/n) · a2 + (n/2) · a3
  E_int^tn_arch: Interconnect energy required to transport a 32-bit word from a non-adjacent on-chip memory to the local cache: √(area/n) · e2 + (n/2) · e3
  EE^tn_arch: Effective Energy for a 32-bit addition: EE^tn_arch = E32 + μT(ω(e1 + Δ · E_int^tn_arch) + (1 − ω)(e1 + E_int^tn_arch + Eoffchip))
  ECE^tn_arch: Amount of computation achieved with 1 J: 1/EE^tn_arch
  EA^tn_arch: Effective area for a 32-bit addition without off-chip memory: EA^tn_arch = A32 + μS ω a1 + σ A_int^tn_arch
For example, in the special case of μT = 0 (no memory read or write) and a 2-D topology in a 130 nm technology we get

ECE^130_2D = 1 / EE^130_2D = 1 / E^130_32 = ICE^130 = 144.3 GOPS/W
2.2.4 Effective Computational Density

Similarly, the area cannot be filled with computational units only. We need to take memory and interconnect into account as well. We define the Effective Area (EA) as follows:
EA^tn_arch = A32 + μS · ω · a1 + σ · A_int^tn_arch
EA is defined similarly to EE (see Table 2.1), but the off-chip component is omitted since we do not include the area for off-chip memory. Again, with "off-chip" we really mean "out-of-package"; different dies in a 3-D stack are considered "on-chip".

• a1 is the area for a 32-bit memory word. Depending on the geometry we assume either SRAM or DRAM memory. For a 2-D system we use the area of embedded SRAM, while for a 3-D system we use DRAM. Concretely, we use a 60F² area for one SRAM cell [10] and between 8F² and 4F² for DRAM cells [11], where F is the minimum feature size.
• A_int^tn_arch is the interconnect area required for transporting a 32-bit word to memory. For example, in a 400 mm² system we have

  A_int^tn_arch =
    (10 + 10) a2(tn)                 if arch = 2D
    (7.07 + 7.07) a2(tn) + a3(tn)    if arch = 3D2
    (5 + 5) a2(tn) + 2 a3(tn)        if arch = 3D4
    (3.5 + 3.5) a2(tn) + 4 a3(tn)    if arch = 3D8
    (2.5 + 2.5) a2(tn) + 8 a3(tn)    if arch = 3D16
• σ is the interconnect sharing factor. If σ = 1, no sharing takes place and every operator has its own private interconnect across the system. If σ = 0, the interconnect is optimally shared and the interconnect area per operator is 0. In analogy to μS, it gives the ratio of area occupied by operators versus interconnect. Typical values are between 0.01 and 0.1. For instance, the Tilera TILE64 [12] with 64 cores has eight 32-bit operators per core¹. For 512 operators on the 64-core Tilera die with an 8 × 8 mesh interconnect, the sharing factor is σ = 16/512 = 0.031. Note that only the global interconnect for the dataflow is counted, while the local interconnect and global control lines are ignored.
• a2 is the required area for a 1 mm long 32-bit bus.
• a3 is the required area for 32 TSVs to connect from one horizontal layer to the next.

For instance, if one 32-bit adder has a size of 437.3 μm² in 50 nm technology and μS = σ = 0, we get ECD_50 = ICD_50 = 2,286.8 operators/mm² and we can fill a 400 mm² chip with 914,720 adders. In a more realistic scenario with μS = 4,000 and σ = 0.031 we get 345 operators on a 400 mm² area.
¹ Again, this is a simplification, because there are fewer operators but they are pipelined. In effect, 8 operations can be completed per cycle in the best case, motivating the number 8 that we use in this example.
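The area model can be exercised in the same way as the energy model. The sketch below evaluates EA and the resulting operator count for a 400 mm² system; apart from the 437.3 μm² adder and the 60F²/8F² cell sizes quoted above, the wiring and TSV areas are placeholder assumptions, so the printed counts illustrate the structure of the model rather than reproducing the chapter's calibrated figures.

```python
# Minimal sketch of the Effective Area model: EA = A32 + muS*omega*a1 + sigma*A_int.
# The adder area (437.3 um^2) and the 60F^2 / 8F^2 cell sizes come from the text;
# the bus and TSV areas below are placeholders, not the chapter's calibrated values.
import math

def a_int(total_area_mm2, layers, a2_um2_per_mm, a3_um2):
    """Interconnect area to reach a non-adjacent on-chip memory:
    die edge length times planar bus area per mm, plus n/2 TSV crossings."""
    lateral_mm = math.sqrt(total_area_mm2 / layers)
    vertical_hops = layers / 2 if layers > 1 else 0
    return lateral_mm * a2_um2_per_mm + vertical_hops * a3_um2

def effective_area_um2(a32, a1, a2, a3, total_area_mm2, layers, mu_s, omega, sigma):
    return a32 + mu_s * omega * a1 + sigma * a_int(total_area_mm2, layers, a2, a3)

A32 = 437.3                   # 32-bit adder area in 50 nm (from the text), um^2
F = 0.05                      # minimum feature size, um
a1_sram = 32 * 60 * F ** 2    # 32-bit word of 60F^2 SRAM cells (2-D case), um^2
a1_dram = 32 * 8 * F ** 2     # 32-bit word of 8F^2 DRAM cells (3-D case), um^2
a2 = 32 * 1000 * 0.4          # placeholder: 32 wires, 1 mm long, 0.4 um pitch, um^2
a3 = 32 * (2 * 2.0) ** 2      # placeholder: 32 TSVs of 2 um radius at 2x-radius pitch, um^2

for name, layers, a1 in [("2D", 1, a1_sram), ("3D16", 16, a1_dram)]:
    ea = effective_area_um2(A32, a1, a2, a3, total_area_mm2=400, layers=layers,
                            mu_s=2000, omega=1.0, sigma=0.031)
    operators = 400e6 / ea    # 400 mm^2 expressed in um^2
    print(f"{name:5s}: EA = {ea / 1e3:7.1f} x10^3 um^2 -> {operators:8.0f} operators fit")
```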
2.3 Technology Parameter Scaling

To capture the performance benefits of feature-size scaling in each successive technology generation, the physical properties of various on-chip communication transactions and logical operations were modeled for each node. The technology parameters are broken down into several categories: global planar 2-D wires, logical operations, memory transactions and 3-D TSV signaling.
2.3.1 Planar 2-D Wire Models

The minimum feature size on a die scales by roughly 0.7 each generation; however, global on-chip wires do not scale as aggressively as intermediate or local wires. In Ho et al.'s "The Future of Wires" [13], conservative and aggressive wire scaling trends for decreasing feature sizes are discussed. Capacitance, including parallel plate, fringe and coupling terms, and resistance are extracted from field solver simulations, and compact models are fitted to extract parameters for future technology nodes given the predicted global wire dimensions, barrier thickness, spacing, resistivity of the medium, vertical and horizontal dielectric constants and the switching probability of the surrounding wires. Using the RC characteristics of the wires, typical repeater insertion strategies, and scaling supply voltages, we determine the energy per bit for a requisite wire length across technology nodes from 180 nm down to 17 nm. We assume bus widths of 32 bits and the percentage of bits switching per transaction to be one half of the total number of bits.² Driver, receiver and repeater energies are also considered for each wire in the bus. The energy consumed in transmitting a 32-bit word per network link is calculated from the following equation:

Elink = k · (1/2) · Vdd² · (Cwself + 2 Cc + h Crep) × Bus_width × Switch_factor
where k is the number of repeaters and h the size of each, Cwself is the self capacitance of a global wire, Cc is the coupling capacitance to neighboring wires, Crep is the total input gate and output drain capacitance of the repeater and Vdd is the supply voltage for a given technology.
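For readers who want to experiment with this wire model, a direct transcription of the link-energy formula is sketched below; the capacitance, repeater and supply values are placeholders, not the numbers extracted by the field-solver flow described above.

```python
# Direct transcription of the planar link-energy formula above.
# All electrical values are placeholders, not the authors' extracted data.
def link_energy(k, h, c_wself, c_c, c_rep, vdd, bus_width=32, switch_factor=0.5):
    """E_link = k * 1/2 * Vdd^2 * (Cwself + 2*Cc + h*Crep) * bus_width * switch_factor."""
    return k * 0.5 * vdd ** 2 * (c_wself + 2 * c_c + h * c_rep) * bus_width * switch_factor

# Placeholder example: a global wire with two repeater stages at roughly 50 nm.
e = link_energy(k=2,              # repeaters along the wire
                h=40,             # repeater size (multiple of a minimum inverter)
                c_wself=100e-15,  # self capacitance per segment (F)
                c_c=80e-15,       # coupling capacitance to one neighbour (F)
                c_rep=2e-18,      # gate + drain capacitance of a unit repeater (F)
                vdd=1.0)          # supply voltage (V)
print(f"Energy per 32-bit link transaction: {e * 1e12:.2f} pJ")
```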
² This figure depends less on architectural choices and more on how information is coded, protected and compressed. Although a simplification, it is important to note that the same assumption is used for all architectures, and the relative comparisons and main trends are not sensitive to the chosen value of the switching activity.

2.3.2 3-D TSV Interconnect Models

We have conducted field solver simulations of cylindrical, copper-filled through-silicon vias to extract the relevant RLC parasitics. TSV electrical characteristics
depend on their geometrical parameters as well as on material properties such as the dielectric properties of the barrier and insulating layers and the dopant concentration in the substrate. For the purpose of extracting parasitics and subsequent analysis, a representative structure for a TSV is assumed to be a copper-filled via with a uniform circular cross-section and an annular dielectric barrier of SiO2 or Si3N4 surrounding the Cu cylinder with a thickness of 0.2 μm [14]. The dimensions vary depending on the technology node; the cross-section is assumed to be uniformly circular, with radii of 10, 8, 6, 4, 2 and 1 μm and a constant length of 50 μm. The pitch of the TSVs is twice the radius, to match planar global wire spacing trends. In this work we have considered a substrate conductivity of 10 S/m, representing typical values used in digital processes. The topology considered is three parallel coupled TSVs, a representative unit in any size row of TSVs. We use the extracted parasitics with the same methodology as for the planar wires, where the bus width and switch factor match the 2-D parameters. The energy is calculated for a single transaction from one layer to the next adjacent layer, so the hop length is fixed. Driver and receiver energy is also considered; however, no repeaters are required for the 3-D interconnect. The TSVs are arranged in a row, thus the total area of the interconnect follows directly from the pitch and radius of the TSVs.
2.3.3 Logical Operation and DRAM Scaling

We extract the energy per operation of several logic operations, such as a 32-bit addition or an SRAM read, by using published data [15, 16] for a particular technology node and scaling the energy and area for future or past generations. Dally in [16] publishes the energy per add operation of a 32-bit adder in a 130 nm, 1.2 V technology as 5 pJ. A reasonable approximation for the energy in other technology nodes, ignoring leakage, can be obtained by scaling according to the following:

Energy_new = Energy_130nm × (Featuresize_new × Vdd_new²) / (Featuresize_130nm × Vdd_130nm²)
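Applied in code, this rule scales the 5 pJ, 130 nm, 1.2 V reference point to other nodes; the per-node supply voltages in the sketch below are rough assumptions of ours, not values given in the chapter.

```python
# Scale the 5 pJ / 130 nm / 1.2 V adder energy to other nodes using the rule above.
# The supply voltages per node are rough assumptions for illustration.
REF_ENERGY_PJ, REF_FEATURE_NM, REF_VDD = 5.0, 130.0, 1.2

def scaled_energy_pj(feature_nm, vdd):
    return REF_ENERGY_PJ * (feature_nm * vdd ** 2) / (REF_FEATURE_NM * REF_VDD ** 2)

for feature_nm, vdd in [(90, 1.1), (65, 1.0), (45, 0.9), (32, 0.85)]:
    print(f"{feature_nm:3d} nm @ {vdd:.2f} V: "
          f"~{scaled_energy_pj(feature_nm, vdd):.2f} pJ per 32-bit add")
```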
The area can be scaled in a similar manner by a straightforward relation of the minimum feature sizes between nodes. The off-chip DRAM transaction energy is not a simple function of the feature size; it depends on the DRAM architecture, its peripheral circuitry and also on the characteristics of off-chip drivers and terminations, and chip, package and board trace parasitics. We have used the Micron System Power Calculator [11] to estimate the average read/write power for different generations of DRAM, from SDRAM to DDR3. We then divide this power by the number of bits and the simulated transaction data rate to determine the energy per bit per generation of off-chip DRAM. We have matched the DRAM generation to the technology node, such that 180 nm corresponds to SDRAM and 17 nm to DDR3. There are a number of complexities
associated with off-chip transactions, such as bus controller architecture, termination power, transaction delay, and the number of peripheral I/O devices, which cause the energy to vary over a wide range depending on these choices. In our study we have been consistent with the values we use in order to minimize the impact on comparisons between different schemes.
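The per-bit figure described above amounts to a one-line calculation once the simulated power is known. The sketch below shows the arithmetic with made-up power and interface numbers, not output from the Micron calculator.

```python
# Energy per bit for an off-chip DRAM interface: average read/write power divided
# by (bus width x data rate). All numbers below are made up for illustration,
# not output from the Micron System Power Calculator.
def dram_energy_per_bit_pj(avg_power_mw, bus_width_bits, data_rate_mtps):
    bits_per_second = bus_width_bits * data_rate_mtps * 1e6
    return (avg_power_mw * 1e-3) / bits_per_second * 1e12   # Joules -> pJ

examples = [("SDRAM-133", 400.0, 16, 133),
            ("DDR2-667",  450.0, 16, 667),
            ("DDR3-1333", 500.0, 16, 1333)]
for name, power_mw, width, rate in examples:
    print(f"{name:10s}: ~{dram_energy_per_bit_pj(power_mw, width, rate):6.1f} pJ/bit")
```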
2.4 ECE Trends and Dependencies

Next, we study the dependency of ECE on parameters like Δ and ω, and then we investigate the limits of ECE and raw performance under power, area and frequency constraints. To see the overall trend, Fig. 2.3 illustrates how the ECE, the performance for a given power envelope, will develop as technology scales. As a reference, the plot shows the ECE of two recent multi-core Tilera processors. 3-D topologies have a 3 times higher ECE, mainly due to lower communication power consumption in a more compact geometry. Moreover, this increased efficiency of 3-D is gained at a much smaller area and lower frequency for the same performance, as will be illustrated below in Sect. 2.4.3.
2.4.1 Distributed versus Central Memory

To study the effect of the memory distribution factor on the ECE, we assume that for every operation on average one word has to be read from or written to memory (or cache)³. Hence, μT = 1.0.
Fig. 2.3 Performance of different topologies (ECE in (operations/sec)/Watt) at different technology nodes with Δ = 0.05, ω = 1 and μT = 1. The data for the two Tilera processors are closer to the theoretical performance than in Fig. 2.1 due to the fact that the interconnect overhead has been accounted for, although the control circuitry overhead is neglected
³ We assume registers and small register files close to the operators. Reading and writing of registers is not considered as memory access.
Figure 2.4 shows the ECE for various topologies when Δ, representing the proportion of centralized memory, varies between 0 and 1. For all topologies a centralized memory drags the ECE down significantly, from over 60 GOPS/W to about 5–10 GOPS/W. Hence, there is a benefit from distributing memory, but only a distribution of Δ < 0.2 gives a significant effect. This benefit from distribution is more pronounced for a 2-D topology. Going from Δ = 1 to Δ = 0.1 improves the ECE for 3D16 by a factor of 5, while the improvement is a factor of 8 for 2D. Intuitively, the reason for this is that the cost of transporting data across the chip to a central memory is much lower for a 3-D topology. Hence, if it is difficult to decentralize most memory accesses, the penalty will be lower for 3-D. However, the cost of centralized memory becomes steeper for more advanced technologies. Figure 2.5 shows this effect for a 3D16 topology. While the difference between Δ = 0 and Δ = 1 is a factor of 5.4 for a 180 nm technology, it grows to a factor of 34 for a 17 nm technology. Hence, even if a 3-D topology can mitigate the cost of centralized memory, that cost still grows steeply as technology advances, due to the inverse effect of scaling on the performance of logic versus interconnect.
2.4.2 On-chip versus Off-chip Memory

While Figs. 2.4 and 2.5 assume all memory to be on-chip (ω = 1), Fig. 2.6 shows the cost of having part of the memory off-chip. At ω = 1 all memory is on-chip.
Fig. 2.4 The effect of the memory distribution factor Δ on ECE for different topologies

Fig. 2.5 The effect of the memory distribution factor Δ on ECE for the 3D16 topology and various technology nodes with μT = 1.5 and ω = 1

Fig. 2.6 The effect of on-chip versus off-chip memory (ω) on ECE for various topologies, with a memory distribution factor Δ = 0.1
As the fraction of on-chip memory accesses decreases, the ECE drops. When 20% of the memory accesses are off-chip (ω = 0.8), the ECE drops by a factor of 9 for 2-D and by a factor of 19 for 3D16 systems. Intuitively, the ECE drops more for 3-D topologies because their ECE figures with all memory on-chip are more favorable, while the ECE figures with all memory off-chip are almost the same for all considered topologies.
2.4.3 Size Constrained System

ECE is a metric that does not consider at what frequency or within what space a set of computations is performed. It is an abstract metric for a technology rather than for a concrete system. In order to better understand the limits of performance under given power, area and frequency constraints, we consider systems of a concrete size. Figure 2.7 shows the effect of varying the ratio of memory and operators in a concrete system with a 400 mm² area. To start with, the area occupied by operators is relatively small (Fig. 2.7a). Most of the area is covered by memory for realistic memory-operator ratios of 1,000 ≤ μS ≤ 5,000. For instance, the TILE64 [12] has a maximum performance of 8 operations per cycle per core, yielding 512 operations per cycle.
Fig. 2.7 Area, performance and power consumption for a concrete 400 mm² system with μT = 1.0, σ = 0.031 and ω = 1.0, clocked at 1 GHz, with a 50 nm technology. a Area distribution. b Number of operations. c Operations per second. d Power consumption, which limits how densely computational units can be packed
It has 4 MB (= 1 M words) of on-chip cache, resulting in μS = 2²⁰/512 = 2,048. The Niagara 2 [17] processor from Sun Microsystems, which is an 8-core, 64-thread processor with 4 MB of on-chip cache, falls into a similar range. Keep in mind that our model illustrates trends and limits but does not account for control logic, decoders, arbiters, etc. The area contribution of that part is not seen in Fig. 2.7a. We usually attribute those elements to the processing units and hence their area fraction is lower in Fig. 2.7a than we would intuitively expect. However, the comparison of 2-D and 3-D topologies is interesting. Due to the higher density of memory in a 3-D architecture (DRAM vs. SRAM in 2-D), the area dominance of memory in 2-D is much higher than in 3-D for the same μS. Consequently, more of the area in a 3-D system is filled with computation units and interconnect. (The relative ratio of the latter two is given by σ. A lower σ would reserve more of the area for computation.) Figure 2.7b shows how the area not covered by memory is used for computation in 2-D and 3-D topologies. For μS = 2,000 we can afford 684 operators in a 2-D system, while we can squeeze 24,551 operators into a 3D16 system. The reason that 35 times more operators fit into the same area is mainly the much higher density of DRAM as opposed to the SRAM that is common in 2-D based systems. This naturally translates to a similar increase of performance, as Fig. 2.7c illustrates. It also results in a prohibitively high power consumption, since the computation consumes much more power than the memory. Apparently, we cannot power all these computations in reality, but we can translate the increased potential that 3-D offers into either smaller chips, or lower frequency, or higher memory content. Figure 2.8 shows performance and power consumption for a smaller system (100 mm²) clocked at a somewhat lower frequency and at the 35 nm technology node. With μS > 4,000 we get a practical power consumption and still a very respectable tera-scale performance.
Fig. 2.8 Performance and power consumption for a concrete 100 mm² system with μT = 1.0, σ = 0.031, ω = 1.0 and Δ = 0.1, clocked at 700 MHz, with a 35 nm technology. a Operations per second. b Power consumption
2.4.4 Power and Frequency Constrained Systems

For a given power budget, the performance is significantly higher for 3-D architectures, as shown in Fig. 2.9. However, these performance and power figures are obtainable at small sizes and low frequencies for 3-D topologies. Figure 2.9a gives the maximum performance under a given power budget. A 3D16 topology offers 2.4 times higher performance per Watt than a 2-D topology. Interestingly, for every doubling of the stack height, we see a 20–30% increase of the performance-per-Watt figure. This trend is only slowly decreasing, from 30% (2D–3D2) down to 20% (3D8–3D16) in Fig. 2.9a. The somewhat higher ECE of 3-D topologies is obtainable at significantly lower frequency and smaller area. Figure 2.9b shows the area-frequency trade-off for a given power (100 W) and performance (6,170.5 GOPS). For any given area, the frequency required for a 2-D topology is about 25 times the frequency of the 3D16 system. Since frequencies above a few GHz are hard and costly to realize, a 2-D chip faces a tough performance hurdle, while 3-D topologies can approach their ECE limits at much lower frequencies.
Fig. 2.9 Dependencies of area, frequency, performance and power consumption of different topologies in a 35 nm technology (μT = 1.0, μS = 4,000, Δ = 0.050, ω = 1.0, σ = 0.031). a Performance under a power constraint for a 400 mm² system. b Under a given power budget of 100 W the performance is 6,170 GOPS (area vs. frequency trade-off). c Given a frequency of 1 GHz, the performance is a function of the area. d Given an area of 400 mm², the performance is a function of frequency
In Fig. 2.9c the frequency is set to 1 GHz and the performance increases with the area. For 3-D systems the performance increase is significantly steeper than for 2-D, because most added area in a 2-D chip is spent on memory and interconnect, and relatively little is invested in additional operators. Figure 2.9d illustrates the same trend; it fixes the area at 400 mm² and shows how the performance grows with the frequency.

Figure 2.9 demonstrates clearly the tremendous potential of 3-D stacked systems, which mainly stems from (1) the possibility to integrate dense DRAM tightly into the multi-core architecture, and (2) the more power-efficient interconnection in the third dimension, which is essentially due to shorter geometric distances.
2.5 Conclusion

Inspired by the intrinsic computational efficiency of silicon proposed by T. Claasen, we have developed the concepts of effective computational efficiency (ECE) and effective computational density (ECD). They consider memory and interconnect in addition to computational operators. A small number of parameters allows an abstract characterization of a broad range of architectures and topologies. We have used ECE and ECD to study the limits of performance of 2-D and 3-D topologies for technologies down to 17 nm.

Our model is an abstraction of real systems and ignores many relevant aspects and details. Thus, it can only give upper bounds on the performance, and real systems will not be able to exhibit comparable performance numbers. In particular, we have not considered control logic, local interconnect, registers and register files. These components can consume a significant portion of the area and power of a real system. Another limitation is our focus on throughput as the main performance characteristic, while ignoring latency. Latency is much harder to capture at an abstract level since it is influenced strongly by the many details of the architecture, the arbitration policies and resource management. In real systems the theoretical limits of throughput are often not achieved because raw capacity is over-provisioned and a lot of control logic is spent on keeping critical latency figures low. It can be noted, however, that a main benefit of 3-D topologies is the lower latency of memory transactions, since high-capacity memory can be located much closer to the computation units. This may mean that 3-D systems come closer to their intrinsic performance limits than 2-D topologies.

In summary, although our model constitutes an idealization of systems, it still expresses correct trends and bounds of real systems, and we draw the following main conclusions from our study:

• 3-D systems have a 2–3 times higher ECE due to lower interconnect power;
• 3-D systems have one order of magnitude higher memory density due to DRAM integration;
• Consequently, 3-D systems can accommodate many more computation units in a given area with the same amount of memory;
• This allows for much higher performance but also causes very high power density;
• The same performance with the same power can be realized in 3-D topologies with a much smaller area and at a lower frequency.

The last point means a cost advantage for 3-D systems, which may compensate for the more expensive 3-D integration technology.
References

1. T. Claasen. High speed: Not the only way to exploit the intrinsic computational power of silicon. Proceedings of the International Solid State Circuits Conference (ISSCC), 1999.
2. Tilera Corporation. Tilera Home Page. http://www.tilera.com.
3. W.J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6):775–785, 1990.
4. A.Y. Weldezion, M. Grange, D. Pamunuwa, Z. Lu, A. Jantsch, R. Weerasekera and H. Tenhunen. Scalability of network-on-chip communication architecture for 3-D meshes. Proceedings of the International Symposium on Networks-on-Chip, 2009.
5. V.F. Pavlidis and E.G. Friedman. 3-D topologies for networks-on-chip. IEEE Transactions on Very Large Scale Integration Systems, 15(10):1081, 2007.
6. R. Weerasekera, D. Pamunuwa, L.-R. Zheng and H. Tenhunen. Two-dimensional and three-dimensional integration of heterogeneous electronic systems under cost, performance and technological constraints. IEEE Transactions on Computer-Aided Design, 28(8):1237–1250, 2009.
7. B. Feero and P.P. Pande. Networks on chip in a three dimensional environment: A performance evaluation. IEEE Transactions on Computers, 58(1), 2009.
8. F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan and M. Kandemir. Design and management of 3D chip multiprocessors using network-in-memory. ACM SIGARCH Computer Architecture News, 34(2):130–141, 2006.
9. G. Loh. 3D-stacked memory architectures for multi-core processors. Proceedings of the 35th ACM/IEEE International Symposium on Computer Architecture (ISCA), 2008.
10. S. Lai and T. Lowrey. OUM—A 180 nm nonvolatile memory cell element technology for stand alone and embedded applications. Proceedings of the International Electron Devices Meeting, 2001.
11. J. Janzen. The Micron system-power calculator. Micron web site, 2009. http://www.micron.com/support/dram/power_calc.html
12. S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney and J. Zook. TILE64 Processor: A 64-core SoC with mesh interconnect. Proceedings of the International Solid State Circuits Conference, 2008.
13. R. Ho, K.W. Mai and M.A. Horowitz. The future of wires. Proceedings of the IEEE, 89(4):490–504, 2001.
14. R. Weerasekera, M. Grange, D. Pamunuwa, H. Tenhunen and L.-R. Zheng. Compact modelling of through-silicon vias (TSVs) in three-dimensional (3-D) integrated circuits. Proceedings of the IEEE International Conference on 3D System Integration (3D IC), 2009.
15. S. Perri, P. Corsonello and G. Staino. A low-power sub-nanosecond standard-cells based adder. Proceedings of the 2003 10th IEEE International Conference on Electronics, Circuits and Systems, 2003.
16. W.J. Dally, U.J. Kapasi, B. Khailany, J.H. Ahn and A. Das. Stream processors: programmability and efficiency. Queue, 2(1):52–62, 2004.
17. U. Nawathe, M. Hassan, K. Yen, L. Warriner, B. Upputuri, D. Greenhill, A. Kumar and H. Park. An 8-core 64-thread 64b power-efficient SPARC SoC. Proceedings of the International Solid State Circuits Conference, 2007.
Part II
Technology and Circuit Design
Chapter 3
Testing 3D Stacked ICs Containing Through-Silicon Vias Erik Jan Marinissen
3.1 Introduction

The semiconductor industry is on an ongoing quest to integrate more functionality into a smaller form factor with increased performance, lower power, and reduced cost. Traditionally, only two-dimensional planes were used for this: through conventional CMOS scaling, multiple IP cores in a single die (System-on-Chip, SoC), multiple dies in a single package (Multi-Chip Package, MCP), and multiple ICs on a Printed Circuit Board (PCB). More recently, the third, vertical dimension also started to be exploited: System-in-Package (SiP), in which multiple naked dies are vertically stacked in a single IC package and interconnected by means of wire-bonds to the substrate; and Package-on-Package (PoP), in which multiple packaged chips are vertically stacked. The latest evolution in this list of innovations is the so-called three-dimensional stacked IC (3D-SIC): a single package containing a vertical stack of naked dies which are interconnected by means of Through-Silicon Vias (TSVs) [1–3]. TSVs are conducting nails which extend out of the back-side of a thinned-down die, and which allow that die to be vertically interconnected to another die. TSVs are high-density, low-capacity interconnects compared to traditional wire-bonds, and hence allow for many more interconnects between stacked dies; these interconnects can operate at higher speeds and with lower power dissipation. TSV-based 3D technologies enable the creation of a new generation of 'super chips' by opening up new architectural opportunities [4, 5]. Hence, they allow the semiconductor industry to continue to still its hunger for more functionality, bandwidth, and performance at smaller sizes, power dissipation [6], and cost, even in an era in which conventional feature-size scaling becomes increasingly difficult and expensive.
Like all ICs, these new TSV-based 3D-SICs need to be tested for manufacturing defects, in order to guarantee sufficient outgoing product quality to the customer. In this chapter, we describe the test needs of 3D-SICs, review to what extent existing test solutions can be used to address them, and discuss what new test challenges come with this new class of products.

The remainder of this chapter is organized as follows. Section 3.2 positions TSV-based 3D-SICs in the market place and describes various TSV technology options which have a bearing on test technology. Test flows for 2D and 3D ICs are compared in Sect. 3.3. Section 3.4 argues that modular testing is a very appropriate approach for 3D-SICs and lists the required infrastructure for it. Section 3.5 discusses the new test content we need for intra-die defects due to new 3D processing steps and for the TSV interconnects. Sections 3.6 and 3.7 focus on test access: the chip-external wafer probe access and the chip-internal DfT architecture, respectively. Section 3.8 concludes this chapter.
3.2 3D-SICs Based on TSV Technology

Micro-electronic systems typically consist of multiple integrated circuits (ICs). Traditionally, these ICs were brought together as a system by interconnecting them on a PCB (see Fig. 3.1a). Ongoing integration has brought us SoC, MCP, SiP, and PoP, all with their respective benefits and drawbacks [7]. SoCs integrate all functionality in a single die, with the benefit of fast and abundant inter-module communication; however, SoC integration forces all modules to make use of the same (for some design modules non-optimal) process technology and might lead to large, low-yielding 'monster' chips. MCPs, SiPs, and PoPs (see Fig. 3.1b, c) integrate multiple dies, and hence provide advantages over SoCs with respect to heterogeneous system integration, as these dies can be manufactured in different, optimized process technologies. All three integration approaches are based on wiring interconnects, which are limited in number and relatively slow and power-consuming. Only SiP and PoP make use of the vertical dimension, and hence offer the benefit of a smaller footprint.

A TSV-based 3D-SIC (see Fig. 3.1d) combines the advantages of the above integration technologies [3]. It integrates multiple dies, with as potential benefits smaller, high-yielding dies in possibly different, dedicated process technologies [8].
Fig. 3.1 Ongoing system integration evolution: 2D a PCB, b MCP, and 3D c SiP, and d 3D-SIC
It utilizes the vertical dimension for stacking thinned dies, thereby creating a smaller footprint as well as a higher transistor density per volume unit. And the TSV-based interconnects between the various dies have, in number, speed, and power dissipation, more similarity to on-chip wires than to off-chip wire bonds [9, 5]. On-chip interconnect wire length, especially of the global and semi-global wires, can be reduced drastically [10].

The process technology for 3D-SICs based on TSVs is becoming available now, design support tools are emerging [11], and the first products start to appear on the market. Among the early applications are CMOS image sensors stacked on their corresponding read-out and digital processing circuitry, and stacks of memories [12]. Memory-on-logic [13] and logic-on-logic [14] are expected to follow suit.

Through-Silicon Vias provide, as their name indicates, an electrical connection from the active front-side ('face') of a silicon die through the silicon substrate to the back-side. This allows multiple vertically stacked dies to be interconnected with each other. Below, we describe one of the TSV types developed at IMEC. These TSVs are cylindrical copper nails of 25 μm height and 5 μm diameter (i.e., an aspect ratio of 5:1), and have a minimum pitch of 10 μm [15, 16]. The TSV fabrication steps are depicted in Fig. 3.2 and consist of (1) deep silicon etching of the TSV holes, (2) oxide deposition, (3) copper seed deposition, (4) copper plating, and (5) chemical-mechanical polishing (CMP).

As can be seen in Fig. 3.2, when the wafer processing is completed, the TSVs are still deeply buried within the wafer; their height is only 25 μm, while the total wafer thickness is around 750 μm. To expose the TSV tips, the wafer needs to be thinned down from the back-side to just below 25 μm thickness. In order to provide sufficient mechanical strength and prevent it from breaking or cracking, the to-be-thinned wafer is temporarily bonded onto a carrier wafer prior to thinning [17]. Subsequently, the thinned product wafer on its carrier wafer is permanently bonded to the next die, after which the temporary carrier wafer is removed. The thinning and bonding steps are depicted in Fig. 3.3. This process can be repeated in case more than two dies are stacked.

There are many different variants of TSV types, process steps, and stacking options [3, 18, 19]. Below, we describe some of them, in as far as they are relevant for the remainder of this chapter. The TSVs we described above have a 5 μm diameter, 25 μm height, and 10 μm minimum pitch; TSVs can also be larger or smaller. Taller TSVs cause fewer problems with respect to wafer thinning, but have a larger aspect ratio and hence are more difficult to fill properly [20]. That is, unless we also make the TSV diameter larger, but that has a negative impact on the silicon area they consume as keep-out area, the maximum TSV density, and the capacitive load (and performance and power dissipation) of the TSV [7].
Fig. 3.2 Subsequent TSV fabrication steps: deep silicon etching, via oxide deposition, Cu seed deposition, Cu plating, and CMP
Fig. 3.3 Subsequent wafer thinning and bonding steps: temporary carrier bonding, back-side thinning to expose the Cu nails, permanent bonding to the bottom wafer, and temporary carrier de-bonding
TSVs can be fabricated in a 'via-first', 'via-middle', or 'via-last' process. The TSVs described above are fabricated in a 'via-middle' process, i.e., after the Front-End-Of-Line (FEOL) transistors are fabricated, but before the Back-End-Of-Line (BEOL) metal layers [16]. 'Via-first' TSVs are fabricated before FEOL, while 'via-last' TSVs are fabricated after BEOL. Typically, 'via-first' and 'via-middle' TSVs are quite a lot smaller and denser, and have larger aspect ratios, than the rather large 'via-last' TSVs. One could say that 'via-first/middle' TSVs follow more the characteristics of semiconductor interconnect technology, while 'via-last' TSVs are closer to assembly interconnect technology.

The orientation of the individual dies in the stack is another item of variation. Dies can be connected 'face-to-face', 'back-to-back', or 'face-to-back' [14], as depicted in Fig. 3.4. 'Face' refers to the active front-side of the die, whereas 'back' refers to the substrate back-side, where the exposed TSV tips stick out.
Fig. 3.4 Different stacking orientation variants with wire-bond and flip-chip external connections. a Face-to-face. b Back-to-back. c Face-to-back
In ‘face-to-face’ stacking (Fig. 3.4a), the active sides of the two dies are interconnected directly. If the external I/O is through wire-bonds, no TSVs are necessary; but the bottom die should be slightly larger than the top die to allow space for wire-bonding. In the case of flip-chip bumps, TSVs are required to make the connection to them. ‘Face-to-face’ stacking does not scale well to stacks of more than two dies.

In ‘back-to-back’ stacking (Fig. 3.4b), TSVs in both dies are used to interconnect the two dies, while the external I/O goes via either wire-bonds from the top die or flip-chip bumps at the bottom die. This scenario also does not scale well to stacks of more than two dies and, moreover, does not use the minimum number of TSV-equipped layers (which is important, as the fabrication of TSVs comes with extra processing steps and associated costs).

‘Face-to-back’ stacking (Fig. 3.4c) has the advantage that it is scalable to stacks of more than two dies. The TSV tips sticking out of the back of one die connect to dedicated TSV landing pads on the face of the other die. ‘Face-to-back’ stacking can be used with all dies in the stack face-up, while the external I/O is done via wire-bonding from the bottom die; again, it is required that the bottom die is somewhat larger than the other dies. Alternatively, ‘face-to-back’ stacking also works with all dies in the stack face-down, while the external I/O is done via flip-chip bonds at the face-down bottom die.

The bonding technology is yet another item of variation. The TSVs with 10 μm minimum pitch described above have been realized with copper-to-copper (Cu–Cu) direct bonding. The TSV tips are cylindrical copper nails with a 5 μm diameter (as shown in Fig. 3.5a), and the TSV landing pads are copper rectangles of 9 × 9 μm2. IMEC uses thermo-compression bonding at relatively high temperatures. This bonding technology achieves sub-micron stand-off heights between the two dies, which also means that the bond quality is very sensitive to small particles in between both dies [21]. IMEC has also experimented with bonds based on copper-tin (Cu–Sn) micro-bumps between both dies, as depicted in Fig. 3.5b [8]. While this enlarges the minimum pitch from 10 μm to around 40 μm (and hence reduces the maximum TSV density), this bonding type works at much lower temperatures and, due to the larger stand-off height between the two dies, is far less sensitive to particles in between the dies.

We distinguish between Wafer-to-Wafer (W2W), Die-to-Wafer (D2W), and Die-to-Die (D2D) stacking. W2W stacking has the benefit that it avoids the pick-and-place operations required for D2W and D2D stacking; however, W2W stacking has far fewer possibilities to exploit pre-bond test results, and hence might suffer from low compound yields and correspondingly higher costs.
Fig. 3.5 Photos of two alternative bonding technologies: a 5 μm-diameter Cu TSV tip of a direct Cu–Cu bond process, and b inter-metallic bond of a Cu–Sn–Cu micro-bump process
The technology options described in this section (and others we did not discuss) mean that there is currently a wide variety of TSV technologies under consideration by the global semiconductor R&D community. Some will prove ineffective or inefficient and will be abandoned; nevertheless, it seems likely that even as this field matures, multiple alternative TSV technologies will continue to exist alongside each other [2].
3.3 3D Test Flows

A generally-applicable test strategy is to test items as early as possible in their manufacturing flow, in order to prevent additional costs for faulty items in subsequent manufacturing steps, and hence lower the overall production cost.
3.3.1 Prior Art in Test Flows

Conventional single-die chips have two natural test moments: (1) wafer test (also referred to as ‘e-sort’) takes place after wafer fabrication and before assembly and packaging, and (2) package test takes place after assembly and packaging (see Fig. 3.6a). The outgoing product quality is guaranteed by the final test, i.e., the package test.
Fig. 3.6 Typical 2D test flow (a) and potential 3D test flow (b)
The wafer test typically serves merely as an economic optimization, to prevent unnecessary package and packaging costs for dies that can already be identified as faulty during wafer test. Hence, the wafer test content is typically a large subset of the package test content. Whether or not wafer testing should be done for a particular product depends on the wafer fabrication yield y, the fraction d of faulty products that the test can detect (which is determined by the test quality), the cost p that can be prevented per product (here: the cost of packaging a single die), and the cost t of executing a wafer test on a single product (which is typically directly related to the test quality d). Executing a wafer test contributes to overall cost savings if
(1 − y) · d · p > t.        (3.1)
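As a quick illustration, the trade-off of Eq. (3.1) can be evaluated directly. The sketch below uses made-up placeholder numbers, not figures from this chapter:

    def wafer_test_pays_off(y, d, p, t):
        """Eq. (3.1): wafer testing saves cost if the expected prevented
        downstream cost of detected faulty dies exceeds the test cost."""
        return (1 - y) * d * p > t

    # Illustrative (hypothetical) numbers: 85% wafer yield, 95% fault detection,
    # $2.00 preventable packaging cost per die, $0.10 wafer-test cost per die.
    print(wafer_test_pays_off(y=0.85, d=0.95, p=2.00, t=0.10))  # True: wafer test pays off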
For many conventional ICs, the benefits of wafer testing indeed exceed its execution cost, and hence the wafer test is part of their test flow. In recent years, a market has grown for the delivery of unpackaged dies, either still in their original wafers or already singulated. These unpackaged dies are delivered by one company (the ‘die provider’) and subsequently used by another company (the ‘die user’), for example for embedding in an MCP or SiP. Suddenly, the naked die is no longer an intermediate product; for the die provider, it is the final product. The die user wants the die provider to deliver high-quality products. This changes the role of the wafer test. Instead of consisting of an economically-justified subset of the (final) package test, it is now the final test. This has led to the concept of Known-Good Die (KGD) tests [22], which should, as much as possible, guarantee the quality of the outgoing dies. A typical KGD wafer test includes at-speed and burn-in tests, in order to achieve quality levels otherwise known only from final tests.
3.3.2 Test Flows for 3D-SICs

For 3D-SICs, we distinguish between (1) pre-bond tests, (2) post-bond tests, and (3) the package test [23]. These tests are depicted in Fig. 3.6b. Only the package test is a test on packaged products; all other tests are tests of naked dies and die stacks. As handling and testing individual dies is rather cumbersome, we assume here that all pre-bond and post-bond tests are wafer-level tests. We distinguish between the pre-bond and post-bond wafer tests, as they are distinctly different, not only in content and purpose, but also in test access. For pre-bond tests, each die requires its own probe access points, while in post-bond tests, all test data is assumed to be pumped in and out through the bottom die of the stack. The package test serves the purpose of guaranteeing the outgoing product quality of the final product.

Depending on the relationship and agreements between the die provider(s) and die user, the pre-bond and post-bond tests might have different roles. In the case of a vertically-integrated 3D-SIC maker, wafer tests have the role of economic optimizers only. However, in case die provider and die user are distinct companies, pre-bond and/or post-bond wafer tests might have the role of final test
for a particular company’s product, and hence require Known-Good Die (or Known-Good Stack) quality levels. For the generic case of a 3D-SIC consisting of n dies, as depicted in Fig. 3.6b, we introduce the following notation.

• PRE(i): pre-bond test of Die i (for 1 ≤ i ≤ n).
• POST(k): post-bond test of a stack up to level k (for 1 < k ≤ n). This test potentially consists of the following sub-tests:
  − POST_die(k, i): test of Die i (for 1 ≤ i ≤ k).
  − POST_int(k, j): test of the interconnect between Dies j − 1 and j (for 1 < j ≤ k).
• PT: package test of the completed stack. This test potentially consists of the following sub-tests:
  − PT_die(i): test of Die i (for 1 ≤ i ≤ n).
  − PT_int(j): test of the interconnect between Dies j − 1 and j (for 1 < j ≤ n).
A 3D-SIC test flow potentially brings about a tremendous increase in wafer tests, both in the number of tester insertions and in the number of executed sub-tests, especially for large values of n. Executing all tests leads, for pre-bond testing, to n tester insertions and n sub-tests, and, for post-bond testing, to n − 1 tester insertions and Σ_{k=2}^{n} (k + (k − 1)) sub-tests. Including the package test, this would lead to 6 tester insertions and 16 sub-tests for n = 3, and to 16 tester insertions and 86 sub-tests for n = 8! Executing all these tests will certainly lead to a sharp increase in test costs when compared to conventional 2D chip test flows. Whether or not all these tests indeed make technical and economic sense is essentially governed by the same trade-off as expressed by Eq. (3.1).

For 3D-SICs, the cost p that can be prevented per product by executing a pre-bond or post-bond wafer test is typically more comprehensive than is the case for conventional single-die chips; it includes all downstream material and processing costs: other dies in the stack, stacking costs, the package, and packaging costs. The parameters y, d, and t are strongly influenced by which faults are targeted during a particular test. For pre-bond tests, important questions are whether or not the test is executed on the thinned die, in order to be able to cover thinning-induced defects, and whether or not the test targets, in addition to the common intra-die defects, TSV faults as well; this increases the number (1 − y) · d of faulty parts we might catch, but at the same time increases the test execution cost t. For post-bond tests, an important question is whether we only test the newly-made latest bond, or also re-test the previously tested dies and bonds in the stack. Re-testing comes with extra test execution costs t, and hence makes sense only if the latest stacking operation can indeed introduce a significant number of new defects in the already tested items.

The cost-benefit trade-off offered by all these wafer test possibilities should by no means be considered static. A typical manufacturing process matures during the lifetime of a product, allowing cost to be reduced by omitting certain tests on intermediate product stages without reducing the overall product quality. Table 3.1 gives four example test flow scenarios that allow the number of tester insertions and/or sub-tests to be reduced.
Table 3.1 Various example test flow scenarios and their respective test costs, expressed as the number of (tester insertions, sub-tests)

  Example test flow scenario                           n = 3     n = 5      n = 8
  1. All PRE tests, all POST tests, all PT tests       (6, 16)   (10, 38)   (16, 86)
  2. All ‘first-time’ tests + all PT tests             (6, 10)   (10, 18)   (16, 30)
  3. All PRE tests, final POST test, all PT tests      (5, 13)   (7, 23)    (10, 38)
  4. No wafer tests, only all PT tests                 (1, 5)    (1, 9)     (1, 15)
In Scenario 1, all tests of Fig. 3.6b are executed; this might be appropriate in prototype production, where the production process is not stable yet, and the many tests provide valuable opportunities to learn and improve the production process. Scenario 2 is a case in which every ‘item’ is tested at the first opportunity, but never re-tested during wafer testing; this scenario leads to a drastic reduction in the number of sub-tests, but the number of tester insertions remains high, as after every intermediate stacking operation a test of the newly made interconnects is performed. Scenario 3 includes all pre-bond tests, but does post-bond wafer testing only on the final stack. In this scenario, the dies are re-tested only once, and the number of tester insertions goes down. Scenario 4 is a true low-cost version, in which only package tests are executed. In theory, this does not diminish the test quality, but it does require a very mature production process with high yields in all the subsequent intermediate production steps in order to be cost-effective. A modular test approach, as discussed in the next section, is very suitable for flexibly adapting a test flow with tests and re-tests of modules.
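The counts in Table 3.1 follow directly from the notation introduced above. As a quick cross-check, the minimal sketch below (the function names are ours, purely for illustration) recomputes the entries for Scenarios 1 and 4:

    def scenario1_counts(n):
        """Scenario 1: all PRE tests, all POST tests, and all PT sub-tests."""
        insertions = n + (n - 1) + 1                        # pre-bond + post-bond + package test
        pre = n                                             # PRE(i) for every die
        post = sum(k + (k - 1) for k in range(2, n + 1))    # POST_die plus POST_int sub-tests
        pt = n + (n - 1)                                    # PT_die plus PT_int sub-tests
        return insertions, pre + post + pt

    def scenario4_counts(n):
        """Scenario 4: no wafer tests, only the package test."""
        return 1, n + (n - 1)

    for n in (3, 5, 8):
        print(n, scenario1_counts(n), scenario4_counts(n))
    # 3 (6, 16) (1, 5)
    # 5 (10, 38) (1, 9)
    # 8 (16, 86) (1, 15)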
3.3.3 Pre-Bond Testing for W2W Stacking

All dies comprising a 3D stack need to be fault-free to obtain a fault-free stack. In a D2W or D2D stacking approach, pre-bond testing makes sense; utilizing its results, we can avoid stacking a good die onto a bad one or vice versa. In a W2W stacking approach, we stack entire wafers, and hence cannot avoid stacking an individual bad die. At first sight, pre-bond testing does not seem to make sense in W2W stacking, as we cannot act on its results. However, pre-bond test results can still be exploited, provided we are capable of performing matching over a repository of pre-tested wafers [24–26]. As depicted in Fig. 3.7, the idea is that an algorithm determines which pairs of wafers out of the repository should be stacked in order to maximize the compound yield. The yield increase due to this approach, compared to ‘blind’ stacking, depends on the number of stack tiers, the number of dies per wafer, the number of faulty dies per wafer, and the repository size.

Figure 3.8 shows, based on Monte-Carlo simulation, the relative and absolute expected yield increase for a realistic example 3D-SIC. We assume a die area A = 50 mm2, a defect density d0 = 0.5 defects/cm2, and a defect clustering parameter α = 0.5. Assuming a negative binomial defect distribution [27], this gives a yield y = 81.6%.
Fig. 3.7 Exploitation of pre-bond test results in wafer matching
On 300 mm wafers with 3 mm edge clearance, we fit about 1,278 dies of size A [28], of which on average 235 (i.e., 18.4%) are faulty. The graph in Fig. 3.8a shows, for a stack of n = 2 dies and varying die yield y, the relative expected yield increase as a function of the repository size. At the right-hand side of the figure, the absolute yields for repository sizes k = 1 and k = 50 are given for the various curves. (Note that a typical wafer cassette holds k = 25 wafers.) Figure 3.8b shows the same for a fixed die yield y = 81.6% and a varying number of stack tiers n. The expected relative yield increases grow with decreasing die yield y and with increasing tier count n. The relative yield increases vary between 0.5 and 10% and hence are significant. In this way, pre-bond test results can be made useful, even in the context of W2W stacking. Obviously, this approach comes with additional costs, in the form of additional tester insertions and test execution costs for the pre-bond tests. However, these costs are not as high as they seem at first sight, as we can save again on post-bond test costs. A more in-depth cost-benefit analysis is discussed in [26].
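A minimal Monte-Carlo sketch of the wafer matching idea is shown below. The yield formula is the standard negative binomial model cited as [27]; the greedy best-match selection and all function names are ours, a simplification for illustration rather than the actual matching algorithm of [24–26]:

    import random

    def die_yield(A_cm2=0.5, d0=0.5, alpha=0.5):
        """Negative binomial yield model [27]: y = (1 + A*d0/alpha)**(-alpha).
        With A = 50 mm2 = 0.5 cm2, d0 = 0.5 defects/cm2 and alpha = 0.5,
        this evaluates to about 0.816 (81.6%)."""
        return (1 + A_cm2 * d0 / alpha) ** (-alpha)

    def wafer_map(dies=1278, y=0.816):
        """Random pass/fail map of one pre-tested wafer (True = good die)."""
        return [random.random() < y for _ in range(dies)]

    def compound_yield(repo_size, dies=1278, y=0.816, runs=100):
        """Greedy matching for a two-tier stack: pair each wafer from fab 1
        with the repository wafer from fab 2 that overlaps its good dies best."""
        good = 0
        for _ in range(runs):
            w1 = wafer_map(dies, y)
            repo = [wafer_map(dies, y) for _ in range(repo_size)]
            best = max(repo, key=lambda w2: sum(a and b for a, b in zip(w1, w2)))
            good += sum(a and b for a, b in zip(w1, best))
        return good / (runs * dies)

    print(die_yield())         # ~0.816
    print(compound_yield(1))   # 'blind' stacking: ~0.816**2 = ~0.666
    print(compound_yield(25))  # matching over one cassette: slightly higher (~0.67)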
3.4 Modular Testing

Modular testing refers to an approach in which the various modules that together constitute an IC product are tested as separate, stand-alone units [29]. For complex SoCs, modular testing is increasingly gaining traction, for the following reasons.

• Heterogeneous modules contain different circuit structures and exhibit different defects; consequently, they require dedicated fault models and test patterns. For example, embedded memories and mixed-signal blocks are tested separately from the embedding digital logic.
• Black-boxed IP cores can only be tested with the test patterns delivered by their core providers, as only they know the implementation details of those cores.
• A divide-and-conquer approach, in which a monolithic design is broken up into more digestible chunks, makes test pattern generation more tractable and significantly reduces the resulting test data volume as well, as every module gets only the test patterns it requires, instead of the SoC-wide maximum [30].
Fig. 3.8 Expected yield increases for an example 3D-SIC with a varying die yield y and fixed stack size n = 2, and b varying stack size n and fixed die yield y = 81.6% [26]
• Modular testing allows for easy reuse of tests throughout the life-cycle of one SoC, and also in subsequent derivative designs that reuse the same design module.

The modular test approach is especially suited to 3D-SICs. All four arguments listed above for modular SoC testing apply even more strongly to the case of 3D-SICs. In
addition, a modular test approach fits very naturally with the test flow as depicted in Fig. 3.6b. A modular test approach allows one to freely decide where in the test flow a certain module is tested and/or re-tested, and to adapt that test flow during the life cycle of the product. Finally, in case of a faulty product, a modular test approach provides a first-order diagnosis, sufficient to pin-point the faulty component. This might be crucial, especially if the overall 3D-SIC product is assembled from components of different manufacturing sources.

Obvious test modules are the various individual dies, the TSV-based interconnect between dies, and the ‘extra-connect’ to the external world. For example, in a two-die stack, we would consider at least four separate test modules: (1) Die 1, (2) Die 2, (3) the interconnect between Dies 1 and 2, and (4) the extra-connect to the product pins. If the dies themselves are complex, possibly heterogeneous SoCs, the modular test hierarchy can be further extended by dividing them into various sub-modules. Modular testing requires the following infrastructure.

1. A language in which the test description is transferred from one party to the next (standardized as IEEE Std. 1450.6, Core Test Language (CTL) [31]).
2. On-chip DfT in the form of a core-test wrapper (standardized as IEEE Std. 1500 [32]) and a Test Access Mechanism (TAM) such as a TestRail or test bus [33].
3. EDA support for automated test expansion, i.e., the translation from module-level tests into chip-level tests [34].

Section 3.7 discusses the on-chip DfT to support modular testing, as part of the overall DfT architecture for 3D-SICs, in more detail.
3.5 3D Test Contents

The test content of 3D-SICs is, in a first-order approximation, not very different from that of conventional 2D ICs. In the (FEOL/BEOL) wafer manufacturing, the same defects can occur, and hence the faults, fault models, and test patterns are the same. In this section, we focus on the differences in defects and test content between conventional 2D ICs and new 3D-SICs. Firstly, the 3D processing might lead to new intra-die defects and faults that require new tests. Secondly, the TSV-based interconnects constitute a new structure, for which we have to define new tests.
3.5.1 New Intra-Die Defects

Commonly used tests for digital logic, memories, analog, and RF modules cover the majority of known defects. However, for 3D-SICs, the question is whether some of the new 3D processing steps cause new defects that are not covered by conventional fault models.
Wafer thinning is one such 3D processing step that can cause new defects. The TSV processing only allows for limited TSV heights (say, 10–100 μm) and aspect ratios (say, 10:1 maximum) [20]. In order to expose the TSV tips at the back-side for bonding to another die, the wafer needs to be thinned down. Although this is an area for further research, early results indicate degradation of some I-V characteristics, shifts in device performance, and limited yield losses due to wafer thinning [35, 36]. Thinning defects might be detected during a post-bond or package test. If they turn out to form a significant fraction of the total defect population, it might be worthwhile to consider performing pre-bond tests on thinned wafers, in order to increase the compound stack yield; Sect. 3.6.1 discusses the related challenges in more detail.

Thermal dissipation and thermo-mechanical stress are other causes of concern. Integrated circuits heat up during operation. In densely packed stacks of thinned dies, the heat density might become quite high, and the heat has little way of escaping. The heat generated might easily impact the correct operation of the various dies, especially since some dies (e.g., DRAMs) are more heat-sensitive than others. Due to the different Coefficients of Thermal Expansion (CTEs) of the various materials in the stack, the stack might also suffer from thermo-mechanical stress, causing further malfunction. Tests for these phenomena are executed as part of the post-bond or package tests.
3.5.2 Test of TSV-Based Interconnects

The TSV-based interconnects of 3D-SICs constitute a new structure that does not exist in conventional 2D ICs. Hence, we need to consider which defects they might suffer from, how these defects behave as faults, how to model these faults, and how to test for them. TSV-related defects might occur either in the fabrication of the TSVs themselves (see Fig. 3.9), in the bonding of the TSVs to the next layer, or during the lifetime of the 3D stack. During the fabrication of TSVs, (micro-)voids, for example due to quasi-conformal plating, might lead to (hard or weak) opens in TSVs. Pinholes in the TSV oxide might lead to shorts between TSV and substrate. Ineffective removal of the seed layer might lead to shorts between neighboring TSVs. The bond quality might be negatively impacted by oxidation or contamination of the bond surface, height variation of the TSVs, or particles in between the two dies. Misalignment during bonding, in either the x, y, or (tilted) z direction, might lead to opens or shorts. In the case of Cu–Sn micro-bumps, the tin might squeeze out due to TSV height variation and cause shorts in between them. During the product’s lifetime, CTE mismatch might cause the thinned dies to warp, after processing and/or during operation; also, thinned dies might be more susceptible to the effects of mechanical load.

TSVs can already be tested during the pre-bond test. If done on not-yet-thinned wafers, we have test access to only one side of the TSV, as the other side is still buried in the thick substrate. Tsai et al. [37] propose DfT to perform a leakage test for detecting TSV oxide pinholes.
Fig. 3.9 Examples of TSV defects: a insufficiently filled TSVs, and b TSVs containing micro-voids along the axis and at the bottom
Chen et al. [38] propose DfT to perform a capacitive test for open or shorted TSVs. If we want to test signal propagation through the TSV, we necessarily have to probe on thinned wafers; this is described in more detail in Sect. 3.6. The interconnect is only formed during stacking, which implies that the full interconnect (including TSV and bond) can only be tested during post-bond or package tests.

Although the actual defect mechanisms might be different, the resulting faults for TSV-based interconnects are rather similar to the faults normally considered for wiring interconnects: opens, shorts, and delay faults. This means there is also a large body of existing work with respect to test pattern generation [39] which we can leverage, such as the Counting Sequence Algorithm [40], the Modified Counting Sequence Algorithm [41], and the True/Complement Test Algorithm [42]. These algorithms detect all (hard) opens and shorts through a set of digital test patterns that can be kept small, as it grows only logarithmically with the number of interconnects. The test algorithms rely on the fact that we have full controllability at all interconnect inputs and full observability at all interconnect outputs (as is, for example, the case with Boundary-Scan-Test (IEEE Std. 1149.1) equipped ICs on a PCB). In Sect. 3.7 of this chapter we discuss this requirement in more detail. Testing delay faults on TSV-based interconnects is neither difficult nor test-time consuming, but it requires good synchronization between both dies.

The tests described above all rely on the power, ground, and clocking infrastructure in the dies on both sides of the TSV-based interconnect being present and working. All dies in a 3D-SIC stack, apart from the bottom die, receive their power, ground, and clock signals also through TSVs. Typically, multiple redundant TSVs are used to transfer power and ground, and hence that infrastructure might be less sensitive to faults in single TSVs. On the other hand, this redundancy also complicates the detection of faults in power and ground TSVs. Faults in clock-signal TSVs are typically easier to detect, as a non-transferred clock signal often leads to catastrophic results.
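To illustrate why such interconnect test sets stay small, the sketch below generates patterns in the spirit of the Modified Counting Sequence Algorithm [41]: every interconnect is assigned a unique binary code, excluding the all-0 and all-1 codes so that stuck-at behavior caused by opens is detected as well, and test pattern p drives bit p of each code in parallel. The function name and structure are ours, for illustration only:

    from math import ceil, log2

    def modified_counting_sequence(num_nets):
        """Return a list of parallel test patterns; patterns[p][i] is the value
        driven on interconnect i during test step p. The pattern count grows
        only logarithmically with the number of interconnects."""
        width = ceil(log2(num_nets + 2))        # +2: leave out the all-0 and all-1 codes
        codes = list(range(1, num_nets + 1))    # unique non-zero, non-all-1 codes
        return [[(c >> p) & 1 for c in codes] for p in range(width)]

    # Example: 300 TSV interconnects need only ceil(log2(302)) = 9 patterns.
    patterns = modified_counting_sequence(300)
    print(len(patterns), "patterns of", len(patterns[0]), "bits each")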
3.6 3D Wafer Test Challenges

Compared to a conventional 2D test flow, a 3D test flow potentially contains many more wafer tests, for carrying out pre-bond and post-bond tests on intermediate product stages; these tests prevent downstream processing of bad products and hence are necessary to keep the overall production cost down. Pre-bond and post-bond wafer tests both have their specific challenges with respect to probe access.
3.6.1 Pre-Bond Wafer Test Access

For pre-bond tests, the main wafer test access challenges stem from probing on very many, very small probe points, possibly on thinned wafers mounted on a temporary carrier wafer. Today’s probe technology, using either cantilever or vertical probes, goes down to a minimum pitch of about 35 μm [43], has a maximum probe count of several hundreds, and makes significant scrub marks in order to achieve a proper electrical contact. This is insufficient for probing on TSV tips of 5 μm diameter and 10 μm pitch (or smaller), which might number several thousand (the 10 μm pitch allows TSV densities of up to 10 k/mm2), are made of fragile copper, and do not tolerate scrub marks, which would inhibit downstream Cu–Cu bonding on the same surface. Probing on Cu–Sn micro-bumps is also challenging, but nevertheless a bit easier, as the sizes and pitches of the micro-bumps are larger, their numbers consequently smaller, and the constraints on scrub marks less strict.

For pre-bond testing of the various dies that will make up a 3D-SIC, we distinguish between the bottom die and the other (non-bottom) dies. The non-bottom dies receive all their functional signals (power, ground, clocks, control, data) exclusively through TSV connections, and hence only possess I/Os which cannot be probed with today’s probe technology. The bottom die is different, in the sense that, next to its TSV connections, it also has wire-bond or flip-chip pads for the extra-connect, which provide an interface probe-able with today’s probe technology and which allows us to get test data into and out of that die and test it. If we want to execute pre-bond tests on the other, non-bottom dies, new solutions need to be developed. The following solution approaches are being explored.

• Additional probe pads: Providing dedicated additional probe pads (as a form of DfT) at the side to be probed, sized such that today’s probe technology can handle them. Obviously, this comes with an area penalty, and hence the number of extra pads should be minimized; on-chip DfT can help with this (see Sect. 3.7).
• Probe technology improvement: Significant improvement of wafer probe technology, scaling down to (in the order of) 25 μm pitch for micro-bump probing, or down to (in the order of) 10 μm pitch for TSV tip and TSV landing pad probing, while at the same time increasing the maximum probe count and reducing the scrub mark damage.
• Contactless wafer probing: Further development of contactless wafer probe technology, for example based on capacitive [44] or inductive coupling [45, 46]. This technology has the inherent advantage that it does not cause any scrub marks. However, this technology also needs to scale down in order to probe on micro-bumps and/or TSVs in the sizes and densities in which they occur. Moreover, DUT power delivery is not contactless and hence still requires conventional contact probes.

While the latter two approaches still require further development effort, the first approach is feasible today. As an example, let us consider the flip-chip 3D-SIC depicted in Fig. 3.4c. It consists of three dies, which are stacked face-down and face-to-back. The bottom die and the middle die contain TSVs; the top die does not. The bottom die provides the extra-connect by means of flip-chip bumps. Figures 3.10–3.12 show the various pre-bond wafer probe options that exist for these three dies.

The bottom die can be probed on its (regular-sized) flip-chip pads on the front-side, as shown in Fig. 3.10a. The wafer is not yet thinned, and hence its TSVs are buried within the thick wafer and we cannot test through them yet. Alternatively, it could be attractive to execute the pre-bond tests on a thinned wafer, in order to allow the pre-bond test to also detect defects due to wafer thinning. A thinned wafer only exists on a carrier wafer, to prevent cracking. The front-side flip-chip pads are then inaccessible, due to the carrier wafer. Wafer probe access could be obtained from the back-side of the thinned wafer, where the exposed TSV tips or micro-bumps are located. Unfortunately, these are too small and too numerous to be probed by today’s probe technology. A solution could be to provide a number of dedicated additional regular-sized probe pads on the back-side (as a form of DfT). This is depicted in Fig. 3.10b, where both the small TSV tips/pads and the larger dedicated probe-able pads are shown.

Figure 3.11 shows the two similar alternatives (front-side probing on a thick wafer and back-side probing on a thinned wafer on a carrier) for the middle die. The difference between the bottom die and the middle die is that the middle die has functional probe-able pads on neither the front- nor the back-side. Hence, also for front-side probing (Fig. 3.11a), dedicated additional regular-sized probe pads need to be provided.
Fig. 3.10 Wafer probe options for the bottom die. a Front-side on thick wafer. b Back-side on thinned wafer

Fig. 3.11 Wafer probe options for the middle die. a Front-side on thick wafer. b Back-side on thinned wafer

Fig. 3.12 Wafer probe options for the top die. Front-side on thick wafer
In our example stacking scenario, the top die does not contain TSVs, and hence will not be thinned. Therefore, probing on its back-side is not an option. As shown in Fig. 3.12, this leaves us only with the possibility of probing on the front-side of the thick wafer, where again dedicated additional regular-sized probe pads need to be provided, as the functional TSV landing pads are too small for today’s probe technology.

A question that arises is whether the costs associated with providing additional probe pads to enable a pre-bond test do not entirely consume the benefits of that same pre-bond test. Using IMEC’s elaborate 3D cost model (which accounts for the cost of clean room, equipment, maintenance, materials, and personnel, as well as production yields and test coverage), we have done cost-benefit calculations, which demonstrate that, for most manufacturing yields, providing additional probe pads on the non-bottom dies to enable pre-bond testing does pay off [47]. As an example, one such trade-off calculation assumed the following case.

• A two-die stack in a D2W or D2D stacking process (such that the pre-bond test results can be fully utilized, and dies failing the pre-bond test are excluded from stacking).
• Both dies are full-scan testable digital logic dies. The top die contains 15 M gates (10 × 10 mm2), while the bottom die contains 20 M gates (12 × 12 mm2).
• The bottom die always undergoes a pre-bond test. It contains 100 scan chains, and its pre-bond test uses 2k patterns to achieve 99% test coverage.
• The trade-off concerns the pre-bond test for the top die. If tested, it uses 13 scan chains and 2k patterns to achieve 99% test coverage. To be able to do this, it requires 30 additional probe pads (with associated silicon area costs). Alternatively, the top die is not tested; we then do not need additional probe pads, but are required to blindly stack both good and bad top dies.
• All bad stacks (due to a bad top die, a bad bottom die, or bad TSV interconnect) are detected by the final test. Hence, the decision to execute or not execute the pre-bond test has no impact on the final outgoing product quality. However, the total manufacturing and test cost can only be recuperated by increasing the cost price of good products.

The results of the cost model calculations are depicted in Fig. 3.13. The horizontal axis shows the wafer manufacturing yield for both the top and bottom die, which is assumed to be equal for both and, in order to take the different die sizes into account, is expressed as yield per cm2. The vertical axis shows the relative production cost per produced good 3D stack, normalized to the (cheapest) case in which the wafer yield is 100% and no pre-bond test for the top die is performed. Figure 3.13 shows that performing the pre-bond test pays off for wafer yields below about 98%/cm2, despite the extra test execution costs and the costs for providing extra, probe-able sized pads on the top die.
Fig. 3.13 Cost comparison between executing a top-die pre-bond test or not, where the first alternative requires extra probe pads
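The shape of that trade-off can be reproduced with a deliberately crude model. The sketch below is not IMEC’s calibrated cost model; all cost figures, parameter names, and the structure are made-up placeholders, chosen only to show why a yield threshold appears:

    def cost_per_good_stack(die_yield, pre_bond_test_top,
                            die_cost=1.0, stack_and_package_cost=0.5,
                            test_cost=0.03, pad_area_overhead=0.02):
        """Toy first-order model of the Fig. 3.13 trade-off (illustrative only).
        Assumes the bottom die is always pre-bond tested (so only good bottom
        dies are stacked) and that all tests have perfect fault coverage."""
        known_good_bottom = die_cost / die_yield      # faulty bottom dies are discarded
        if pre_bond_test_top:
            # A known-good top die carries the pad-area and test cost, divided by
            # the yield because faulty top dies are discarded before stacking.
            top = (die_cost * (1 + pad_area_overhead) + test_cost) / die_yield
            good_fraction = 1.0                       # every built stack is good
        else:
            top = die_cost                            # blindly stack any top die
            good_fraction = die_yield                 # stack is good only if the top die is
        attempt = top + known_good_bottom + stack_and_package_cost
        return attempt / good_fraction

    for y in (0.99, 0.97, 0.95, 0.90):
        print(y, round(cost_per_good_stack(y, False), 3),
                 round(cost_per_good_stack(y, True), 3))
    # With these made-up numbers the break-even point lies around 96-97% die
    # yield; the calibrated model of [47] places it near 98%/cm2 (Fig. 3.13).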
3.6.2 Post-Bond Wafer Handling

In post-bond testing, the wafer test access is typically through the regular functional pads of the bottom die of the stack. This is largely ‘business-as-usual’ as far as probe technology is concerned. The challenges for post-bond testing are in the wafer handling within the probe station, especially in the case of D2W stacking. On top of the bottom wafer, stacks consisting of one or multiple dies stick out. If that top side is the probe side of the wafer, the stacks might obstruct the contact view, making probe needle positioning difficult. Also, during probe needle movement, we need to make sure that the needles do not collide with the stacks. If, on the other hand, we probe on the bottom side of the wafer, the stacks create a very non-planar surface, which is difficult to keep stable on the chuck. Resolving these issues is still a matter of ongoing research.
3.7 3D On-Chip DfT Architecture

3.7.1 Hierarchical DfT Requirements

The primary role of on-chip Design-for-Testability (DfT) is to provide controllability and observability from the chip I/Os into the heart of the chip design and vice versa. We consider a hierarchical 3D-SIC consisting of multiple stacked dies, which each consist of one or multiple ‘test modules’ (see Sect. 3.4). DfT requirements exist at every level of the design hierarchy.

1. Core-Level DfT: DfT within the test modules, e.g., internal scan chains, Test Data Compression (TDC) hardware [48], and/or Built-In Self-Test (BIST).
2. Die-Level DfT: DfT to enable modular testing per die, i.e., wrappers around the test modules and TAMs [29]. Wrappers based on IEEE Std. 1500 [32] provide controllability and observability at the module boundary, for both inward-facing tests (InTest) and outward-facing tests (ExTest). The TAM architecture can be optimized for the test bandwidth vs. test length trade-off [33].
3. SIC-Level DfT: This level constitutes the new DfT for TSV-based 3D-SICs. It includes the following.
  − A wrapper at the die boundary, to support modular testing per die, and to allow for both InTest (testing the dies and cores) and ExTest (testing the TSV-based interconnect in between the dies). The wrapper cells could be based either on IEEE Std. 1149.1 or on IEEE Std. 1500 (see Fig. 3.14). Especially if dies are designed by different teams or companies, and are not necessarily ‘friends’ at their interfaces, it is probably wise to equip them with ripple-protected wrapper cells.
Fig. 3.14 Various wrapper cell implementations. a IEEE 1149.1 default, ripple-free. b IEEE 1500 default. c IEEE 1500 special, ripple-free
  − For pre-bond tests: dedicated additional probe pads for the non-bottom dies (see Sect. 3.6).
  − For post-bond tests: TestElevators that transparently pass test stimuli from a lower-level die to a higher-level die and test responses in the opposite direction.
  − A switch to select between additional probe pad access (for pre-bond tests) and TestElevator access (for post-bond tests).
  Also, the DfT of a lower-level die should be able to operate independently of the presence (or absence) of the die(s) above it in the stack. This requirement allows pre-bond tests, as well as post-bond tests on incomplete stacks.
4. Board-Level DfT: DfT to support board-level testing, once the product is soldered to a Printed Circuit Board (PCB). Towards this end, the overall product should be compliant with IEEE Std. 1149.1/4/6.
3.7.2 Simplified Example DfT Architecture

For illustration purposes, Fig. 3.15 shows a simplified example of a DfT architecture that meets the above requirements. It consists of two dies, stacked face-down in a face-to-back fashion.
Fig. 3.15 Simplified example of a 3D DfT architecture
Each die is a single scan-testable logic module. The figure shows the functional circuitry and I/Os, as well as the additional DfT: wrapper cells, multiplexers, additional probe pads for pre-bond testing, and TSV-based TestElevators for post-bond testing.

The above example is simplified in various ways. Figure 3.15 shows only test data and abstracts from test control signals. The stack consists of two dies only. Each die consists of a single test module only. Each test module has only two scan chains: one wrapper boundary scan chain and one consisting of two concatenated internal scan chains. In reality, many of these parameters will be (significantly) larger. As is the case for DfT architectures for conventional 2D SoCs [33], it is worthwhile for real-life large 3D-SICs to optimize the DfT architecture, such that the corresponding test schedule allows for a minimal test length [49–52].

The example DfT architecture has seven different test access modes, which are enabled by various DfT multiplexer settings. Figure 3.16 shows the active test access paths of the various modes. All modes (except for the board-level test mode) have two scan chains: one for the wrapper boundary register, and one for the module-internal scan chains.

• Pre-Bond Tests
1. A mode in which a pre-bond test can be applied before wafer thinning to the bottom die from its front side (Fig. 3.16a). Test access is via two scan chains, which are multiplexed onto already-existing functional pads, viz. D1/SI1–Q1/SO1 and D2/SI2–Q2/SO2. Note that in this mode, the wrapper cells at the extra-connect terminals of the bottom die do not need to be included in scan chains, as they are directly controllable and observable from the wafer probes.
Fig. 3.16 The seven test access modes of the example 3D DfT architecture. a Pre-bond test Die 1 (thick). b Pre-bond test Die 1 (thinned). c Pre-bond test Die 2. d Post-bond test Die 1. e Post-bond test Die 2. f Post-bond test interconnect. g PCB test
2. A mode in which a pre-bond test can be applied after wafer thinning to the bottom die from its back side (Fig. 3.16b). Test access is via two scan chains which connect to dedicated back-side pads, viz. SIb1–SOb1 and SIb2–SOb2. Note that in this mode, the wrapper cells at the extra-connect terminals of the bottom die do need to be included in the wrapper boundary register scan chain.
3. A mode in which a pre-bond test can be applied to the top die from its front side (Fig. 3.16c). Test access is via two scan chains which connect to dedicated front-side pads, viz. SIt1–SOt1 and SIt2–SOt2.
• Post-Bond Wafer and Package Tests
For all post-bond wafer and package tests, test access is via two scan chains which are multiplexed onto the already-existing functional pads on the front-side of the bottom die, viz. D1/SI1–Q1/SO1 and D2/SI2–Q2/SO2.
4. A mode in which a post-bond wafer or package test can be applied to the bottom die (Fig. 3.16d).
5. A mode in which a post-bond wafer or package test can be applied to the top die (Fig. 3.16e).
6. A mode in which a post-bond wafer or package test can be applied to the TSV-based interconnects between the bottom and top die (Fig. 3.16f).
• Board-Level Test
7. A mode in which a board-level test is executed (Fig. 3.16g). Test access is via one boundary scan chain along the extra-connect interface of the bottom die, which connects to dedicated front-side pads, viz. TDI–TDO (part of the IEEE 1149.1 Boundary Scan standard).
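For reference, the seven access modes can be tabulated compactly. The structure below is merely our own summary of the description above (the field ordering and naming are ours), not part of the DfT architecture itself:

    # mode: (test moment, target, access interface) -- our summary of the example above
    TEST_ACCESS_MODES = {
        1: ("pre-bond, thick wafer",   "Die 1 (bottom)",   "functional pads D1/SI1-Q1/SO1, D2/SI2-Q2/SO2"),
        2: ("pre-bond, thinned wafer", "Die 1 (bottom)",   "dedicated back-side pads SIb1-SOb1, SIb2-SOb2"),
        3: ("pre-bond",                "Die 2 (top)",      "dedicated front-side pads SIt1-SOt1, SIt2-SOt2"),
        4: ("post-bond / package",     "Die 1 (bottom)",   "functional pads of the bottom die"),
        5: ("post-bond / package",     "Die 2 (top)",      "bottom-die pads, via TestElevators"),
        6: ("post-bond / package",     "TSV interconnect", "functional pads of the bottom die"),
        7: ("board-level",             "extra-connect",    "TDI-TDO boundary scan (IEEE 1149.1)"),
    }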
3.7.3 Advanced DfT Techniques and Test Resource Partitioning

The following advanced DfT techniques play a special role in the context of 3D-SICs.

• Reduced Pad-Count Testing (RPCT): RPCT is a DfT technique to reduce the width of a scan test interface [53]. As discussed in Sect. 3.6, dedicated additional probe pads might need to be provided to execute a pre-bond test on the bottom die after thinning, or to execute a pre-bond test on the other, non-bottom dies. These extra probe pads are costly in area. RPCT can be exploited to reduce the number of extra pads required. Note that the usage of RPCT does not affect the total test data volume, which implies that a reduced test interface width comes at the expense of an increased test length.
• Test Data Compression (TDC): TDC is a DfT technique that exploits the many ‘don’t care’ bits in ATPG-generated test patterns to compress the off-chip test data in a (near-)lossless manner. It allows the volume of both test stimuli and responses (and the corresponding test length) to be reduced by one or two orders of magnitude [48]. TDC can play a role in reducing the test data volumes of 3D-SICs. When applied per test module, the TAMs that provide these test modules with stimuli and responses can be scaled down. Also, RPCT and TDC form an attractive DfT combination, as the first scales down the test interface width, while the second prevents the test length from increasing; a back-of-the-envelope sketch at the end of this section illustrates this interplay.
• Built-In Self-Test (BIST): BIST is a DfT technique that performs the entire generation of test stimuli and evaluation of test responses on chip, such that the IC test becomes completely self-contained and does not require external test equipment (other than to start the BIST and read out its result). BIST can play a role in reducing the test data volumes of 3D-SICs. Actually, BIST reduces the test data volume even more than TDC, viz. down to (virtually) zero. The price to be paid for this is that LBIST (for digital logic) typically requires more implementation area, the test time can be longer, and/or the test quality lower. MBIST for (embedded) memories, on the other hand, does not have these drawbacks and seems very attractive for 3D-SICs containing either memory dies or memories embedded in logic dies. Other benefits of BIST are that it might play a role in protecting proprietary test content, such that a die user can execute a test without having to know its exact contents, and in re-using tests across the IC’s lifetime [54, 55].

Just as 3D-SICs offer system architects the opportunity to re-design and re-optimize their system architecture, these die stacks also offer DfT architects the opportunity to re-design and re-optimize their DfT architecture. Especially where this involves the decision of which DfT resource to put into which die, we refer to this as Test Resource Partitioning.

Let us illustrate the possibilities by means of an example. Consider a 3D-SIC product consisting of a memory die stacked on top of a logic die. The memory die provider might be used to making only stand-alone memories, which often do not come with BIST; being stand-alone, the memory I/Os are fully accessible from the test equipment, and the test costs are typically reduced by testing many memory chips in parallel (‘multi-site’). In a 3D-SIC context, for the pre-bond test the memory is still a stand-alone memory, but for the post-bond test that same memory die is actually more similar to an embedded memory. And, for embedded memories, BIST is the DfT methodology of choice. One can envision a scenario in which the memory die comes ‘3D-prepared’, i.e., with an on-chip BIST engine. This scenario has the advantage that any proprietary memory test content does not need to be released. However, the MBIST (logic) might be difficult to implement in the technology of the memory die (probably DRAM). Therefore, an alternative scenario could be that the memory maker provides a ‘drop-in’ description of an MBIST engine, which is actually implemented in the bottom logic die. The MBIST operation is controlled from within the logic die, whereas the stimuli and responses flow over TSV-based interconnect into and out of the memory die.
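The interplay between interface width, compression, and test length mentioned in the RPCT and TDC items above can be illustrated with back-of-the-envelope arithmetic; the pattern counts, chain lengths, and compression ratio below are made-up placeholders:

    def scan_test_length_cycles(patterns, scan_length, interface_width,
                                compression_ratio=1.0):
        """Very rough first-order estimate: the test data volume (patterns times
        scan bits, reduced by on-chip TDC) is shifted through the available pins."""
        return patterns * scan_length / compression_ratio / interface_width

    full = scan_test_length_cycles(2000, 10000, interface_width=30)    # ~0.67M shift cycles
    rpct = scan_test_length_cycles(2000, 10000, interface_width=6)     # 5x longer with fewer pads
    both = scan_test_length_cycles(2000, 10000, interface_width=6,
                                   compression_ratio=50)               # TDC wins the time back
    print(int(full), int(rpct), int(both))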
3.8 Conclusion

3D-SICs have many compelling benefits and hence are quickly gaining ground. It is to be expected that they will soon take a significant share of the overall semiconductor market. Test solutions need to be ready for this new generation of ‘super chips’.
3D-SICs are chips in which all basic as well as most advanced test technologies come together. In addition, they pose some truly new test challenges.

The manufacturing flow for 3D-SICs is more complex than that for conventional 2D ICs, and hence offers more natural moments to perform wafer tests. We introduced the notions of pre-bond and post-bond wafer tests, in addition to the (final) package test. In the case of a disintegrated manufacturing flow, the individual dies and/or not-yet-packaged die stacks might be final products for a particular company. In such a setting, high-quality pre-bond and/or post-bond tests might be required to guarantee the outgoing product quality. We typically refer to such tests as Known-Good Die (KGD) and Known-Good Stack (KGS) tests. For an integrated device manufacturer, the main role of wafer tests is merely to prevent costs downstream: stacking a bad die onto a good one, or packaging a bad stack. Whether or not it is economical to perform pre-bond and/or post-bond wafer tests in that context depends on factors such as yield, preventable downstream costs, and test costs. We have demonstrated that, for most die yields, it is indeed economical to perform pre-bond tests, even if this implies that non-bottom dies need to be equipped with additional probe pads. We also showed that even in the case of Wafer-to-Wafer (W2W) stacking, pre-bond testing might pay off, as the additional test costs are surpassed by the yield benefits if wafer matching is performed.

A modular test approach treats the various components of the product as separate test units and hence fits 3D-SICs very well. Different dies might contain different circuit structures or come from different fabs, and hence require different tests. In addition, modular testing provides a first-order diagnostic resolution in case of test failure. The definition of an optimal test flow changes over the lifetime of the product, as component yields mature. Modular testing allows the test flow to be flexibly adjusted and optimized, by freely deciding where in that flow a certain module is tested and/or re-tested.

In a first-order approximation, the test content for stacked dies is similar to that of conventional 2D ICs. However, evidence is starting to appear for new defects due to novel 3D processing steps such as wafer thinning, as well as thermal and thermo-mechanical stress. The TSV-based interconnects are an entirely new structure, which needs to be tested as well. Wiring interconnect tests for opens and shorts are a good first attempt, but need to be refined for more subtle defects.

Wafer test for 3D-SICs is a challenge, as today’s probe solutions do not allow probing on the I/Os (in the form of TSV tips or micro-bumps) of the non-bottom dies. A way out is to provide extra probe-able-sized pads, but a more cost-effective solution can hopefully be found in a significant improvement of wafer probe technologies.

As in all ICs, DfT plays a crucial role in transporting test stimuli and responses into and out of the various test modules. Our chapter showed an example DfT architecture that enables modular testing and supports the various pre-bond, post-bond, and package test modes. Die-level test wrappers, additional pads for pre-bond testing, and TestElevators for post-bond testing play a crucial role in this architecture. The architecture can be further augmented with DfT techniques like RPCT, TDC, and BIST. 3D-SICs open up the possibility to re-evaluate existing test resource partitions and to explore new ones.
Note that this chapter focused only on the electrical testing of 3D-SICs. There are also many challenges for 3D-SICs in related areas such as design validation, in-line inspection, metrology, diagnosis, failure analysis, redundancy and repair, but these are beyond the scope of this chapter.

Acknowledgments  We thank many colleagues at IMEC for stimulating discussions, especially Eric Beyne, Ingrid De Wolf, Luc Dupas, Mario Gonzalez, Anne Jourdain, Paresh Limaye, Pol Marchal, Nikolaos Minas, Dan Perry, Philippe Roussel, Geert Van der Plas, Bart Swinnen, Kris Vanstreels, Dimitrios Velenis, and Jouke Verbree.
References 1. Robert S. Patti. Three-Dimensional Integrated Circuits and the Future of System-on-Chip Designs. Proceedings of the IEEE, 94(6):1214–1224, 2006. 2. Eric Beyne and Bart Swinnen. 3D System Integration Technologies. Proceedings of IEEE International Conference on Integrated Circuit Design and Technology (ICICDT), pages 1–3, 2007. 3. Philip Garrou, Christopher Bower and Peter Ramm, Eds. Handbook of 3D Integration—Technology and Applications of 3D Integrated Circuits. Wiley-VCH, Weinheim, Germany, 2008. 4. Gabriel H. Loh, Yuan Xie and Bryan Black. Processor Design in 3D Die-Stacking Technologies. IEEE Micro, 27(3):31–48, 2007. ╇ 5. Roshan Weerasekera, Li-Rong Zheng, Dinesh Pamanuwa and Hannu Tenhunen. Extending Systems-on-Chip to the Third Dimension: Performance, Cost and Technological Tradeoffs. Proceedings International Conference on Computer-Aided Design (ICCAD), pages 212–219, 2007. ╇ 6. James W. Joyner and James D. Meindl. Opportunities for Reduced Power Dissipation Using Three-Dimensional Integration. Proceedings IEEE International Interconnect Technology Conference (IITC), pages 148–150, 2002. ╇ 7. Bart Swinnen. 3D Technologies: Requiring More Than 3 Dimensions from Concept to Product. Proceedings IEEE International Interconnect Technology Conference (IITC), pages 59– 62, 2009. ╇ 8. Eric Beyne et al. Through-Silicon Via and Die Stacking Technologies for Micro Systems Integration. Proceedings IEEE International Electron Devices Meeting (IEDM), pages 1–4, 2008. ╇ 9. Kaustav Banerjee et al. 3-D ICs: A Novel Chip Design for Improving Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integration. Proceedings of the IEEE, 89(5):602–633, 2001. 10. Shamik Das, Anantha Chandrakasan and Rafael Reif. Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools. Proceedings of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 13–18, 2003. 11. Shamik Das, Anantha Chandrakasan and Rafael Reif. Design Tools for 3-D Integrated Circuits. Proceedings IEEE Asia South Pacific Design Automation Conference (ASP-DAC), pages 53–56, 2003. 12. Uksong Kang et al. 8€Gb 3D DDR3 DRAM Using Through-Silicon-Via Technology. Proceedings International Solid State Circuits Conference (ISSCC), pages 130–132, 2009. 13. Christianto C. Liu et al. Bridging the Processor-Memory Performance Gap with 3D IC Technology. IEEE Design & Test of Computers, 22(6):556–564, 2005.
14. Bryan Black et al. 3D Processing Technology and its Impact on iA32 Microprocessors. Proceedings International Conference on Computer Design (ICCD), pages 316–318, 2004. 15. Bart Swinnen et al. 3D Integration by Cu–Cu Thermo-Compression Bonding of Extremely Thinned Bulk-Si Die Containing 10€μm Pitch Through-Si Vias. Proceedings IEEE International Electron Devices Meeting (IEDM), pages 1–4, 2006. 16. Jan Van Olmen et al. 3D Stacked IC Demonstration using a Through Silicon Via First Approach. Proceedings IEEE International Electron Devices Meeting (IEDM), pages 1–4, 2008. 17. Anne Jourdain et al. Electrically Yielding Collective Hybrid Bonding for 3D Stacking of ICs. Electronic Components and Technology Conference (ECTC), pages 11–13, 2009. 18. Rafael Reif et al. Fabrication Technologies for Three-Dimensional Integrated Circuits. Proceedings International Symposium on Quality of Electronic Design (ISQED), pages 33–37, 2002. 19. John U. Knickerbocker et al. 3D Silicon Integration. Electronic Components and Technology Conference (ECTC), pages 538–543, 2008. 20. Bioh Kim et al. Factors Affecting Copper Filling Process Within High Aspect Ratio Deep Vias for 3D Chip Stacking. Electronic Components and Technology Conference (ECTC), pages 1–6, 2006. 21. Twan Bearda et al. Post-Dicing Particle Control for 3D Stacked IC Integration Flows. Electronic Components and Technology Conference (ECTC), pages 1513–1516, 2009. 22. Yervant Zorian, Ed. Multi-Chip Module Test Strategies. Kluwer Academic Publishers, 1997. 23. Erik J. Marinissen and Yervant Zorian. Testing 3D Chips Containing Through-Silicon Vias. Proceedings IEEE International Test Conference (ITC), 2009. 24. Greg Smith et al. Yield Considerations in the Choice of 3D Technology. Proceedings International Symposium on Semiconductor Manufacturing (ISSM), pages 1–3, 2007. 25. Sherief Reda, Gregory Smith and Larry Smith. Maximizing the Functional Yield of Wafer-toWafer 3-D Integration. IEEE Transactions on VLSI Systems, 17:1357–1362, 2009. 26. Jouke Verbree et al. On the Cost-Effectiveness of Matching Repositories of Pre-Tested Wafers for Wafer-to-Wafer 3D Chip Stacking. Proceedings IEEE European Test Symposium (ETS), pages 36–41, 2010. 27. Michael Bushnell and Vishwani Agrawal. Essentials of Electronic Testing for Digital, Memory & Mixed-Signal VLSI Circuits. Wiley-VCH, Weinheim, Germany, 2000. 28. Dirk K. de Vries. Investigation of Gross Die Per Wafer Formulas. IEEE Transactions on Semiconductor Manufacturing, 18(1):136–139, 2005. 29. Yervant Zorian, Erik J. Marinissen and Sujit Dey. Testing Embedded-Core Based System Chips. Proceedings IEEE International Test Conference (ITC), pages 130–143, 1998. 30. Ozgur Sinanoglu et al. Test Data Volume Comparison: Monolithic vs. Modular SoC Testing. IEEE Design & Test of Computers, 26(3):25–37, 2009. 31. Rohit Kapur. CTL for Test Information of Digital ICs. Kluwer Academic Publishers, 2003. 32. Francisco da Silva, Teresa McLaurin and Tom Waayers. The Core Test Wrapper Handbook: Rationale and Application of IEEE Std. 1500. Springer, Berlin, Germany, 2006. 33. Sandeep K. Goel and Erik J. Marinissen. SOC Test Architecture Design for Efficient Utilization of Test Bandwidth. ACM Transactions on Design Automation of Electronic Systems, 8(4):399–429, 2003. 34. Erik J. Marinissen. The Role of Test Protocols in Automated Test Generation for Embedded-Core-Based System ICs. Journal of Electronic Testing: Theory and Applications, 18(4/5):435–454, 2002. 35. Akihiro Ikeda et al. 
Design and Measurements of Test Element Group Wafer Thinned to 10€μm for 3D System in Package. Proceedings IEEE International Conference on Microelectronic Test Structures, pages 161–164, 2004. 36. Dan Perry et al. Impact of Thinning and Packaging on a Deep Sub-Micron CMOS Product. Electronic Workshop Digest of DATE 2009 Friday Workshop on 3D Integration, page 282, 2009. (http://www.date-conference.com/files/file/09-workshops/date09-3dws-digestv2-090504.pdf).
74
E. J. Marinissen
37. Menglin Tsai et al. Through Silicon Via (TSV) Defect/Pinhole Self Test Circuit for 3D-IC. Proceedings IEEE International Conference on 3D System Integration (3DIC), pages 1–8, 2009. 38. Po-Yuan Chen, Cheng-Wen Wu and Ding-Ming Kwai. On-Chip TSV Testing for 3D IC Before Bonding Using Sense Amplification. Proceedings IEEE Asian Test Symposium (ATS), pages 450–455, 2009. 39. Erik J. Marinissen et al. Minimizing Pattern Count for Interconnect Test under a Ground Bounce Constraint. IEEE Design & Test of Computers, 20(2):8–18, 2003. 40. William H. Kautz. Testing of Faults in Wiring Networks. IEEE Transactions on Computers, C-23(4):358–363, 1974. 41. P. Goel and M.T. McMahon. Electronic Chip-in-Place Test. Proceedings IEEE International Test Conference (ITC), pages 83–90, 1982. 42. Paul T. Wagner. Interconnect Testing with Boundary Scan. Proceedings IEEE International Test Conference (ITC), pages 52–57, 1987. 43. William R. Mann et al. The Leading Edge of Production Wafer Probe Test Technology. Proceedings IEEE International Test Conference (ITC), pages 1168–1195, 2004. 44. Gil-Su Kim, Makoto Takamiya and Takayasu Sakurai. A Capacitive Coupling Interface with High Sensitivity for Wireless Wafer Testing. Proceedings IEEE International Conference on 3D System Integration (3DIC), pages 1–5, 2009. 45. Brian Moore et al. High Throughput Non-Contact SIP Testing. Proceedings IEEE International Test Conference (ITC), 2007. Paper 12.3. 46. Erik Jan Marinissen et al. Contactless Testing: Possibility or Pipe-Dream? Proceedings Design, Automation, and Test in Europe (DATE), pages 676–671, 2009. 47. Dimitrios Velenis et al. Impact of 3D Design Choices on Manufacturing Cost. Proceedings IEEE International Conference on 3D System Integration (3DIC), pages 1–5, 2009. 48. Janusz Rajski et al. Embedded Deterministic Test. IEEE Transactions on Computer-Aided Design (TCAD), 23(5):776–792, 2004. 49. Xiaoxia Wu, Paul Falkenstern and Yuan Xie. Scan Chain Design for Three-dimensional Integrated Circuits (3D ICs). Proceedings International Conference on Computer Design (ICCD), pages 208–214, 2007. 50. Xiaoxia Wu et al. Test-Access Mechanism Optimization for Core-Based Three-Dimensional SOCs. Proceedings International Conference on Computer Design (ICCD), pages 212–218, 2008. 51. Xiaoxia Wu et al. Test-Access Solutions for Three-Dimensional SOCs. Proceedings International Test Conference (ITC), page 1, October 2008. 52. Li Jiang, Lin Huang and Qiang Xu. Test Architecture Design and Optimization for ThreeDimensional SoCs. Proceedings Design, Automation, and Test in Europe (DATE), pages 220–225, 2009. 53. Harald Vranken et al. Enhanced Reduced Pin-Count Test for Full-Scan Design. Proceedings IEEE International Test Conference (ITC), pages 738–747, 2001. 54. Yervant Zorian. A Structured Testability Approach for Multi-Chip Modules Based on BIST and Boundary-Scan. IEEE Transactions on Components, Packaging and Manufacturing Technology—Part B, 17(3):283–290, 1994. 55. Yervant Zorian and Hakim Bederr. An Effective Multi-Chip BIST Scheme. Journal of Electronic Testing: Theory and Applications, 10(1–2):87–95, 1997.
Chapter 4
Design and Computer Aided Design of 3DIC

Paul D. Franzon, W. Rhett Davis and Thor Thorolfsson
4.1 Introduction

This chapter reviews the process of designing 3DICs that exploit Through-Silicon-Via (TSV) technology. It introduces the notion of re-architecting systems explicitly to exploit high-density TSV processes, with a particular focus on (redesigned) memory on top of logic. The chapter also serves as a tutorial for the design of 3D-specific systems. It is organized as follows. First, 3D technology is briefly reviewed from a designer's perspective. Then some of the 3D-specific optimizations that the designer can explore are described. The core of the chapter is the description of a 3D-specific design, a radar DSP application. 3D-specific CAD is then explored. Finally, the outstanding issues in 3D-specific design are discussed.
4.2 Technology Selection

Today there is a wide selection of potential 3DIC technologies, and choosing among them can be confusing. The details of the technology choice interact strongly with the design intent and design approach. The technologies can be categorized from a designer's perspective by considering the following questions:
4.2.1 What Am I Stacking (Fig. 4.1)?

The alternatives commonly available are wafer-on-wafer and die-on-wafer. Both of these exploit similar equipment and processes, though the latter obviously has less
Fig. 4.1 Stacking alternatives: wafer on wafer, chip (die) on wafer, die on substrate on die, and die on die
intrinsic parallelism and is likely to cost more. It is also possible (but not common) to stack die-on-substrate-on-die or die-on-die. Both of these last two processes are more specialized, requiring extra fixtures and steps. Die-on-substrate-on-die is one common configuration in 3D packaging, and can be done using laminate substrates or, potentially, silicon substrates. The best method to perform die-on-die stacking is to reconstitute the die into a wafer using adhesives and fills.

The decision as to which is best is fairly straightforward. If stacking wafers, then it is likely that you are working in a homogeneous technology, designing and fabricating complete wafers intended only for stacking with each other; a memory manufacturer fabricating a memory stack is one example. When stacking wafer on wafer, care must be taken with the cumulative yield loss. There is no way to prevent bad chips from being stacked on good chips, so the integrated yield will be less than the single-tier yield. (An example is given in Table 4.1.) Alternatively, if stacking die to wafer, there is more flexibility in the technology choice and mix. In addition, the die and wafer sites can be tested before integration, and yield can be maximized. Obviously, die-on-die and die-on-substrate-on-die avoid the compounded yield reduction issue, though at the expense of additional assembly cost.

Table 4.1 Reduction of integrated yield with stacking using wafer on wafer
Number of tiers | 1 | 2 | 3 | 4
Yield (%) | 95 | 91 | 85 | 81
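To make the compounded-yield trend in Table 4.1 concrete, here is a minimal sketch (assuming statistically independent per-tier yields of roughly 95%, which approximately reproduces the table; the function and values are illustrative, not from the original text):

    # Integrated yield of a wafer-on-wafer stack, assuming each tier yields
    # independently; bad die cannot be screened out before stacking.
    def stack_yield(tier_yield: float, tiers: int) -> float:
        return tier_yield ** tiers

    for n in range(1, 5):
        print(n, round(100 * stack_yield(0.95, n), 1))
    # -> 95.0, 90.2, 85.7, 81.5 (%), close to the 95/91/85/81 of Table 4.1

Die-on-wafer flows avoid this compounding because known-good die can be selected before integration.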
In many circumstances chips from different vendors will be integrated. It is very unlikely that the I/O on one chip (stack) will be physically aligned with the I/O on its soon-to-be mate. In that case some pitch matching is needed. The available alternatives are summarized in Fig. 4.2.

Fig. 4.2 If the I/O on two of the chips in a chip stack are not physically aligned, either a redistribution layer or an intermediate substrate is needed

If the only need is pitch matching, then a Redistribution Layer (RDL) will often suffice. This consists of one or more layers of thin metal, usually integrated via a spun-on dielectric. If an RDL is not possible (e.g. due to vendor limitations), or power and thermal management is needed between the chips, then an intermediate substrate is needed: a silicon, ceramic, or laminate substrate containing through vias and multiple metal layers. Note that if a substrate is used, the interconnectivity through it will be limited by its internal via pitch.
4.2.2 Do I Need Through-Silicon Vias (Fig. 4.3)?

Not all 3DIC stacks need Through-Silicon Vias (TSVs). If only two chips are being stacked, and peripheral connections are sufficient for signal I/O, power, and ground, then only face-to-face bonding is needed. Several technology options are available, including solder bumping and fine-pitch copper-to-copper bonding. TSVs add considerably to the cost of 3D stacking and integration. They also consume silicon area. If they are not needed, then avoid them. Many lower-power two-chip stacks can avoid the added costs of TSVs.
Fig. 4.3 Face-to-face bonding suffices for a two-chip stack if the current density is low enough to permit peripheral bonding

4.2.3 Through-Silicon Via (TSV) Technology

The TSV technology choice has a significant impact on the potential vertical interconnect density and on the silicon area overhead of the TSV interconnectivity. One way to categorize TSVs is by their source of manufacture, as summarized in Table 4.2 (the dimensions referred to are defined in Fig. 4.4).

Table 4.2 Different TSV technologies from a designer's perspective
Technology | Diameter (μm) | Pitch (μm) | Keepout (μm) | Capacitance
"Package scale" today | 30–50 | 50–150 | >5 | Up to 1 pF, typically <300 fF
"Package scale" tomorrow | 20–30 | 50 | >5 | Up to 300 fF
"Chip scale" | 1–10 | 2–50 | >5 (Cu), 0 (W) | Up to 40 fF

Fig. 4.4 Dimensions referred to in Table 4.2: TSV diameter, pitch, and transistor keepout, together with the wafer thickness, the TSV bond, and the silicon surface and backside

"Package scale" vias, as fabricated by a packaging vendor (e.g. [1]), are large and likely to consume significant area. They also add significantly to the vertical interconnect capacitance. These are likely to scale to smaller dimensions in the future, but not to the tight dimensions possible in "chip scale" TSV technologies. "Chip scale" technologies are usually integrated at the chip fabrication facility. They consume the smallest area, have the smallest capacitance, and can achieve the tightest pitch. In contrast to copper TSVs, tungsten (W) TSVs do not require a transistor keepout rule around them. Copper TSVs need a keepout rule because of the stress induced in the silicon by the different coefficients of thermal expansion of copper and silicon; the induced stress can change transistor properties. Chip-scale technologies provide the most potential for achieving 3D-specific optimizations.

From a designer's viewpoint, the differences between the TSV technologies are their pitch, capacitance, and total footprint. For example, when creating a 3D-specific memory interface, the pitch will determine the area over which a bus is spread out. With a 50 μm pitch technology, a 128-bit bus might be arranged in a 4 × 32 grid over an area of 0.2 × 1.6 mm. Obviously, for an ASIC interfacing with such a memory, there will be considerable wire length required to support this fanout to the (smaller) logic block interfacing with the memory. The capacitance of the TSV is determined by the isolating oxide thickness, the depletion depth in the silicon, and the length of the via. The capacitance determines the power and speed potential of the inter-chip interface. A commonly stated advantage of exploiting 3D technology is a reduction in wire length; some of this is lost with high-capacitance TSVs. At 32 nm, the capacitance of on-chip wiring is around 300 fF/mm. Thus a
TSV capacitance is equivalent to that of 0.1–1 mm of on-chip wiring, depending on the TSV technology. In many circumstances, the TSV footprint matters more than the pitch, as it adds to the total silicon area. For example, in a "package scale" TSV, the 128-bit bus described above would consume a total of 0.2 × 1.6 mm of silicon (0.32 mm²), roughly the same area it would consume with conventional packaging. In contrast, if built in a chip-scale TSV (with a 10 × 10 μm footprint), the area consumed would be only 0.0128 mm², assuming that useful circuits can be designed into the space between the TSVs.
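A quick back-of-the-envelope check of these figures is sketched below (illustrative only; the 50 μm pitch, 10 μm footprint, 300 fF/mm wiring capacitance, and TSV capacitances are the values quoted above and in Table 4.2):

    # Silicon footprint of a 128-bit TSV bus, and the length of on-chip wire
    # whose capacitance matches a single TSV.
    bits = 128

    # Package-scale: 50 um pitch, laid out as a 4 x 32 grid (dimensions in mm)
    area_pkg_mm2 = (4 * 0.050) * (32 * 0.050)     # 0.2 mm x 1.6 mm = 0.32 mm^2

    # Chip-scale: 10 um x 10 um footprint per TSV
    area_chip_mm2 = bits * (0.010 * 0.010)        # 0.0128 mm^2

    wire_cap_ff_per_mm = 300.0                    # on-chip wiring at 32 nm
    for tsv_cap_ff in (40.0, 300.0):              # chip-scale vs package-scale TSV
        print(tsv_cap_ff, "fF TSV ~", tsv_cap_ff / wire_cap_ff_per_mm, "mm of wire")

    print(area_pkg_mm2, area_chip_mm2)            # 0.32 and 0.0128 mm^2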
4.3 3D-Specific Optimizations

If the chip stack is redesigned to explicitly exploit 3DIC technologies, then 3D-specific optimizations are possible. Possible optimizations include the following (Table 4.3):

• Miniaturization, especially in sensors.
• Many studies (e.g. [2, 3]) demonstrate that 3D integration leads to shorter interconnects. Though valuable, the improvement is incremental and has to be judged against the added cost.
• Use of 3D stacking to increase memory bandwidth. With future multicore CPUs likely to require memory bandwidths of 1 TB/s or above [4], power-efficient methods of provisioning this bandwidth are needed. For example, 1,024 1-Gbps TSV-enabled channels are likely to be much more power efficient than 400 20-Gbps channels built in conventional packaging.
Table 4.3 Potential 3D-specific optimizations
Driving issue | Case for 3D | Caveats
Miniaturization | Stacked memories; "smart dust" sensors | For many cases, stacking and wirebonding is sufficient
Interconnect delay | When delay in critical paths can be substantially reduced through 3D integration | Not all applications will have a substantial advantage
Memory bandwidth | Logic on memory can dramatically improve memory bandwidth | While memory bandwidth can be improved dramatically, memory size can only be improved linearly
Power consumption | In certain cases, a 3D architecture might have substantially lower power than a 2D one; memory bandwidth can be provided at lower power in 3D | Limited domain; in many cases it does not
Mixed-technology (heterogeneous) integration | Mixing an advanced ASIC technology with an older or different analog technology; incorporating a passives layer | Though it might justify 3D integration, not all examples justify through-wafer vias
• Repartitioning the system to decrease power consumption is a unique opportunity for 3D. One possibility is to reorganize the memory stack, not just to decrease interconnect power, but also to decrease memory core power. This is explored further below.
• Mixed-technology (heterogeneous) integration is a unique opportunity in 3D. Examples include stacking processing with a sensor array; keeping hard-to-redesign analog circuits in an older (cheaper) technology node while moving the digital portion to an advanced node; and moving on-chip passives (inductors, capacitors, decoupling) to a low-loss stacked substrate.
4.3.1 Example of a 3D-Specific Optimized Design

A 3D-optimized synthetic aperture radar (SAR) processor has been designed and is currently in fabrication with Lincoln Labs, using their three-chip-stack 0.18 μm SOI technology [5]. The core of this processor is a 1,024-point, 32-bit floating-point FFT. Power was minimized by using small memories that minimize the energy per access. 3D interconnect was employed in order to reduce the length of connections to these highly partitioned memories. The chip layouts are shown in Fig. 4.5.

Fig. 4.5 Three-tier layout of the 3D synthetic aperture radar design

The overall system, shown in Fig. 4.6, consists of four different component types: eight processing elements, one controller, 32 SRAMs, and eight ROMs. The system performs 32 memory accesses per cycle (16 reads and 16 writes), completing a 1,024-point FFT in 653 cycles utilizing five pipeline stages. The system is based on a radix-2 Cooley-Tukey FFT algorithm, where every processing element implements a single two-point FFT butterfly.

Fig. 4.6 3D synthetic aperture radar system block diagram: a central controller surrounded by processing elements and interleaved even- and odd-parity memory banks

A radix-2 FFT has a data dependency that resembles a hypercube. This hypercube data dependency can be exploited in two ways. First, a radix-2 FFT will require two memory reads every cycle, one of which will have odd parity while the other will have even parity. As a result we can split the processing memory into two independent memory groups that never need to be accessed at the same time. Second, we can sub-divide
the even and odd groups into smaller subgroups where each processing element is only connected to the absolute minimum number of memory locations required to successfully compute the FFT. The benefit of splitting the memories into smaller subgroups is that smaller memories are faster, and since each memory subgroup can be accessed simultaneously, the system can perform a greater number of reads and writes per cycle. Conversely, a single memory would require less area, as only one set of peripheral logic (write driver and sense amp) is required. In this specific implementation the memory was divided into 32 smaller memories (16 even and 16 odd).

We use Cacti 4.1 [6] to assess the architectural benefit of the partitioned design by comparing the properties of a single 8 kByte memory to sixteen 512 Byte memories. The memory-core area savings of using a single memory would have been 67.6%. By using multiple smaller memories, the energy per read is reduced by 60.8% (from 68.205 to 26.718 pJ), the energy per write is reduced by 57.6% (from 14.48 to 6.142 pJ), and the bandwidth is increased by 854.9% (from 13.4 to 128.4 GBps). The number of wires interconnecting the memory to logic increases from 150 to 2,272. 3DIC stacking is used to minimize the area impact of the added wires. A summary of the differences between the highly partitioned small memories and the unpartitioned big memories is shown in Table 4.4.

Table 4.4 Comparison between the highly partitioned small memories and the unpartitioned big memory
Metric | Big memories | Small memories | %
Wires (#) | 150 | 2,272 | −1,414.7
Bandwidth (GBps) | 13.4 | 128.4 | 854.9
Energy per write (pJ) | 14.48 | 6.142 | 57.6
Energy per read (pJ) | 68.205 | 26.718 | 60.8

As can be seen in Table 4.4, partitioning the FFT in such a manner is greatly beneficial in terms of bandwidth and read and write energy consumption. However, this comes at the high cost of increasing the number of memory-to-logic interconnect wires from 150 to 2,272, leading to an interconnect-dominated architecture. Fortunately, moving this architecture to 3D ensures that the increase in interconnect power does not outweigh the bandwidth and memory-access energy gains. This design was implemented using "chip scale" 5 μm pitch high-density TSVs. The 2.6 × 3 mm design contains 17,634 TSVs, roughly equally split between signal and power/ground vias. If a coarser "package level" TSV were used, the silicon area impact would be substantially larger (Table 4.5). The optimizations achieved in this design require high-density, fine-scale vias.

Table 4.5 Area impact on SAR design of different TSV technologies
TSV technology | Area impact
6 μm pitch SOI TSV | 0.14 mm² (1.7%)
15 μm pitch (10 μm diameter) intermediate TSV | 2 mm² (18%)

To quantify how much the move to 3D benefits the architecture, we also place and route the design in 2D. In order to ensure a fair comparison between the two circuits, the circuit is not resynthesized; instead, the same synthesis output is used. For comparison's sake, a similar floorplan is used for both the 2D and the 3D design. The metrics of the two designs are summarized in Table 4.6.

Table 4.6 Comparison of the 3D-optimized FFT design in both 2D and 3D technologies
Metric | 2D | 3D | %
Total area (mm²) | 31.36 | 23.40 | 25.3
Core area (mm²) | 29.16 | 20.16 | 30.9
Mean net length (μm) | 836.0 | 392.9 | 53.0
Total wire length (m) | 19.107 | 8.238 | 56.9
Max speed (MHz) | 63.7 | 79.4 | 24.6
Critical path (ns) | 15.7 | 12.6 | 19.7
Logic power (mW) | 340.0 | 324.9 | 4.4
FFT logic energy (μJ) | 3.552 | 3.366 | 5.2

Due to increased congestion, the 2D design does not route successfully with the same total area as its 3D counterpart (4.8 by 4.8 mm). To remedy this, the area used for place and route is grown until the design routes without any design rule violations. Compared to the 3D version, the total area used must be expanded significantly. Also, the 3D design is 24.6% faster, has 56.9% lower total wire length, and uses 5.2% less energy per transform.

In order to complete this design using the CAD flow described in the next section, an important step was to determine an appropriate logic partition. Especially with today's tools, higher-level partitions are easier to manage than lower-level ones. For example, each Processing Element is contained within a single tier. Furthermore, there was only one entry point to the clock tree within each tier. That is, within each logic tier the design was treated as a 2D design, except for the memory interface. This high-level partitioning is useful for several reasons. First, since commercial-grade 3D place-and-route and clock distribution tools do not exist, we were able to complete the design with available commercial 2D tools. Second, it simplifies early design analysis considerably: 2D architectural evaluations can be used at the sub-module level. Third, such a partition enables each tier to be tested before integration. In contrast, if the partitioning were done at the logic cell level, or the clock tree were distributed between the tiers, integration would be required before testing and the yield would be degraded (see Table 4.1 and the related discussion). More detail about this design can be found in [3, 7].
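Returning to the even/odd memory split described earlier in this section: the hypercube structure of the radix-2 FFT guarantees that the two operands of every butterfly sit at addresses of opposite bit parity. The short sketch below (an illustration added here, not part of the original design description) checks this property for a 1,024-point transform:

    # In an in-place radix-2 FFT, the two operands of each butterfly are at
    # indices that differ in exactly one address bit, so their address-bit
    # parities always differ; hence even- and odd-parity memory banks are
    # never read by the same butterfly port.
    def bit_parity(x: int) -> int:
        return bin(x).count("1") & 1

    N = 1024
    for s in range(N.bit_length() - 1):      # log2(N) = 10 butterfly stages
        step = 1 << s
        for j in range(N):
            if j & step == 0:                # j is the lower-index operand
                assert bit_parity(j) != bit_parity(j | step)
    print("every butterfly pairs one even-parity and one odd-parity address")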
4.4 3D Design and CAD

An outline design flow for a 3DIC-enabled sub-system is shown in Fig. 4.7. This flow can be executed with available tools, after appropriate integration through scripts.

Fig. 4.7 3DIC CAD flow: technology choice, high-level chip-package floorplan, chip floorplans and TSV location assignment, detailed 2D design, test insertion, reassembly of the 3D chip stack, and 3D verification (DRC, LVS, performance, thermal)

The first step, described above, is technology selection. Until 3D options are offered as standard in semiconductor fabs, this will be a many-factored decision. However, even after TSV options are standard, there will be many complex decisions
to make. Almost concurrent with this step is the decision making about the overall high-level floorplan of the chips and package. There are a lot of issues that have to be concurrently managed in this step, and unfortunately, their analysis has to be largely human driven. The end decision is an overall floorplan: what goes on which layer, and approximately where. Factors in these decisions include the following:

• Technology set. Coarse or fine TSVs? Interposer or RDL? Each has different capabilities and thus different options.
• Signal flow. Where are the signals originating and terminating in each layer? How wide are the data paths (wider data paths tend to consume less power than narrower, faster ones)? What are the implications for interconnect power, wirelength, TSV placement, TSV density, and TSV footprint? Can the RDL or substrate support the signal flow?
• Power delivery. How is current delivered to the chips in the stack? In performance-driven designs, the total current levels can be 100 A or more, severely constraining the power delivery solutions.
• Performance and power enhancements. The TSV technology should be leveraged to deliver unique advantages. For example, wide memories with small memory banks consume considerably less power than large memory banks feeding narrow interfaces. Numerous other opportunities exist.
• Thermal performance. A key consideration is DRAM operating temperature and refresh time. DRAMs typically operate at 85°C or less in order to guarantee
refresh cycle time. On the other hand, high-performance ASICs often operate at 100°C or more. Care must be taken to ensure the DRAM is sufficiently cooled, and is not subject to co-heating by the ASIC. In addition, no component may be operated beyond its designed temperature. Unfortunately, there is a direct conflict between power delivery and thermal performance. In a 3D chip stack, power delivery is optimized by putting the highest-power chip closest to the Printed Circuit Board (PCB) delivering power. However, the highest-power part is best cooled by placing it farthest from the PCB and next to the heat sink. The latter solution would necessitate a large number of TSVs in the lower structure simply for power delivery.

The next step is detailed floorplanning and TSV assignment. Considerations here mainly center on wiring delay, thermal behavior, and routability. Today there are no true 3D floorplanners, so clean partitions are required to make this easy. An example of a clean partition is to keep all the logic in one layer of the chip stack. The other portion (e.g. analog or memory) is then separately designed and used to designate the TSV locations. These can be propagated through to the logic layer, which is then designed as normal using 2D tools. If there are hot spots in the non-logic layers, some thermal design might be needed to ensure that the hot spots are not vertically stacked.

The next step is detailed design. If the partitions are simple and clean, this is a relatively straightforward step. As discussed below, test insertion is a critical and complex step in 3D design. However, techniques are starting to emerge that can exploit existing tools to produce a testable design [8–10].

The final steps are verification. It is best to verify the 3D design as one integrated sub-assembly rather than trying to verify each layer independently. This requires obtaining verification decks that include all layers in the 3D stack. It also requires extraction tools that can include reasonable parasitic models for the TSVs. This is all manageable, but requires effort from the CAD team.
4.5 Outstanding Issues in 3D Design

Outstanding issues in 3DIC design include cost management, CAD tools, test, and thermal/power integrity management.
4.5.1 Cost Management

Since 3D integration increases the wafer cost by 5–15% or more, care must be taken to mitigate the added cost elsewhere in the system. Ideally this cost is compensated not just by a performance and/or power improvement but also by cost reduction elsewhere. Opportunities for cost reduction include using a lower-cost,
functionally optimized technology mix, such as moving passives to an interposer, or reduced packaging cost, e.g. by reducing pin count.
4.5.2 Computer Aided Design

With care, well-partitioned 3DICs can be designed and analyzed with the available tools. However, there are still identifiable pain points that today require heavy human intervention. Also, more complex partitions (e.g. an SOC split across numerous layers) cannot be managed with today's CAD tools. The biggest single need is better tools and sub-flows for early planning, especially for floorplanning and chip-package codesign. These issues are especially difficult in 3DIC due to the complex interaction between floorplanning, bandwidth provisioning, signal integrity, power integrity, and thermal management. This gets especially difficult when also evaluating different technology options, such as interposers, specialized cooling features, etc.
4.5.3 Thermal Design and Analysis

The root need in thermal design and analysis is to predict temperature sufficiently accurately that the resulting unpredictability in signal path delay, clock path delay (and thus skew), and leakage power can be brought within their budgets. The accuracy needed depends on the budgets allowed. Thermal analysis is complicated in 3D design because the thinned silicon tiers are no longer as good at spreading heat as an unthinned 2D die. This leads to local hot spots on the tiers away from the heat sink. For example, the design described above (Fig. 4.5) was taken through a full thermal analysis, as discussed in [11] and presented in Fig. 4.8. The relative coolness of the tier closest to the heat sink (tier A) is apparent, as is the temperature nonuniformity of the tier farthest from the heat sink (tier C). The heat spikes in tier C are located at the clock buffers, which have a high activity factor.
Fig. 4.8 Detailed thermal analysis of the three-tier FFT design, showing tiers A, B, and C on a temperature scale from 27.0 to 52.3°C

4.5.4 Test and Design for Test

When stacking die on wafer, it is highly desirable to know which singulated die are good and which locations are good on the wafer. This requires that the test flow yield "Known Good Die" (KGD): die that can be determined to meet speed and burn-in requirements with wafer test alone. While KGD tools and methods are available today, they are not in common use; most system integrators rely on package-level performance test. In a 3D design, KGD tools are essential. An additional complexity is the TSVs: probe cards will not be able to test large numbers of
closely spaced TSVs. Wafer-level self-test is needed, as described in [9, 10]. The situation is different if wafer-on-wafer integration is used. In this case, the solution to cope with compounded yield degradation must center on using redundancy and error correction.
4.6 Conclusions

Designing chips and systems of chips to exploit 3DIC technologies can bring specific advantages in terms of bandwidth and power consumption. A particularly attractive opportunity is redesigning memory for integration with logic functions. An example is given in which memory access energy is reduced by roughly 60% compared with an equivalent 2D design. Outstanding issues in 3DIC design include thermal/power codesign, cost, and test.
References

1. J. Kim, E. Song, J. Cho, J. Pak, J. Lee, H. Lee, K. Park, and J. Kim, "Through Silicon Via Equalizer," in Proc. IEEE EPEPS '09, Oct. 2009, pp. 13–16.
2. K. Banerjee, S. Souri, P. Kapur, and K. Saraswat, "3-D ICs: A Novel Chip Design for Improving Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integration," Proc. IEEE, Vol. 89, No. 5, 2001, pp. 602–633.
3. T. Thorolfsson, K. Gonsalves, and P. Franzon, "Design Automation for a 3DIC Processor for Synthetic Aperture Radar: A Case Study," in Proc. DAC 2009, July 2009, pp. 51–56.
4. H.P. Hofstee, "Future Microprocessors and Off-Chip SOP Interconnect," IEEE Trans. Adv. Packag., Vol. 27, No. 2, May 2004, pp. 301–303.
5. J.A. Burns, B.F. Aull, C.K. Chen, C.-L. Chen, C.L. Keast, J.M. Knecht, V. Suntharalinam, K. Warner, P.W. Wyatt, and D. Yost, "A Wafer-Scale 3-D Circuit Integration Technology," IEEE Trans. ED, Vol. 53, No. 10, Oct. 2006, pp. 2507–2516.
6. S. Wilton and N. Jouppi, "CACTI: An Enhanced Cache Access and Cycle Time Model," IEEE J. Solid-State Circuits, Vol. 31, No. 5, Oct. 1996, pp. 677–688.
7. T. Thorolfsson, S. Melamed, G. Charles, and P. Franzon, "Comparative Analysis of Two 3D Integration Implementations of a SAR Processor," in Proc. IEEE 3DIC, 2009, pp. 1–4.
8. E. Marinissen and Y. Zorian, "Testing 3D Chips Containing Through-Silicon Vias," in Proc. IEEE International Test Conference (ITC), 2009, pp. 1–11.
9. M. Tsai, A. Klooz, A. Leonard, J. Appel, and P. Franzon, "Through Silicon Via (TSV) Defect/Pinhole Self Test Circuit for 3D-IC," in Proc. IEEE 3DIC, 2009, pp. 1–8.
10. P.-Y. Chen, C.-W. Wu, and D.-M. Kwai, "On-Chip TSV Testing for 3D IC Before Bonding Using Sense Amplification," in Proc. ATS '09, 2009, pp. 450–455.
11. S. Melamed, T. Thorolfsson, A. Srinivasan, E. Cheng, P. Franzon, and R. Davis, "Junction-Level Thermal Extraction and Simulation of 3DICs," in Proc. IEEE 3DIC, 2009, pp. 1–7.
Chapter 5
Physical Analysis of NoC Topologies for 3-D Integrated Systems

Vasilis F. Pavlidis and Eby G. Friedman
5.1 Introduction

Several new topologies for on-chip interconnect networks are supported by vertical integration. These three-dimensional topologies improve the performance of an on-chip network primarily in two ways. The length of the physical links connecting the switches of the network is shorter. Additionally, the data can be routed across the on-chip network through a smaller number of switches. The three-dimensional (3-D) NoC topologies include two types of physical links implemented with horizontal and vertical interconnects. These links exhibit substantially different physical and electrical characteristics.

The different 3-D topologies and timing and power models that describe the performance of the resulting 3-D networks are discussed in this chapter. These models emphasize the physical characteristics rather than the architectural details of the network. These models provide useful bounds for improving the performance of on-chip networks by exploiting the third dimension. With these models, the topology that minimizes the latency or power consumption of a network can be determined. As described in this chapter, a network topology can typically enhance one of these two primary design objectives.

The thermal behavior of 3-D integrated systems is another important issue due to the increased power densities that can develop. To characterize the thermal effects on the performance of 3-D NoC topologies, the timing and power models are enhanced by including the dependence on temperature of specific parameters, such as the electrical resistance of an interconnect. Consequently, the topology that produces the minimum rise in temperature at the plane located farthest from the heat sink of the system can be selected while satisfying specific performance characteristics.
A first-order thermal model is utilized to determine the rise in temperature in each plane within a 3-D system. Elevated temperatures can affect the performance of the processing elements (PEs) in addition to the performance of the network. Consequently, an enhanced analysis methodology including thermal effects, which provides the interconnect architecture employed in the PEs, is described in this chapter. In this methodology, the number of metal layers, pitch of the interconnect in each metal layer, and number of physical planes are considered. In addition, the rise in temperature due to the heat generated by transistor switching and Joule heating of the wires is evaluated. In other words, the methodology described in this chapter provides a means to estimate early in the design cycle the behavior of a 3-D topology for an integrated system interconnected with an on-chip network. This holistic approach in the design of 3-D systems based on networks-on-chip is necessary, as demonstrated in this chapter. Neglecting the effects of the power consumption of the PEs and the related temperature rise can produce a misleading result when selecting the 3-D NoC topology that exhibits the highest performance. The analysis approach presented in this chapter is applied to homogeneous 3-D NoCs (i.e., all of the PEs are assumed identical) exploring diverse objectives, such as speed, power, and temperature. As demonstrated in this analysis, a primary criterion for the design of these 3-D systems is whether the third dimension is used for the on-chip network or the PEs.

In the next section, the 3-D NoC topologies investigated in this chapter are described and some notation is introduced. Timing and power models for the on-chip network are presented in Section 5.3. A technique for determining the wiring resources of the PEs within a 3-D system interconnected with an on-chip network and the resulting rise in temperature in the different 3-D NoC topologies is presented in Section 5.4. Several tradeoffs among different characteristics of the topologies including the network size, number of physical planes, and operating frequency of the PEs are discussed in Section 5.5. The primary objectives of the chapter are summarized in the last section of the chapter.
5.2 Three-Dimensional On-Chip Network Topologies

Primary topologies for 3-D networks are presented and related terminology is introduced in this section. Mesh structures have been a popular network topology for conventional 2-D NoC [1–3]. A fundamental element of a mesh network is illustrated in Fig. 5.1a, where each processing element (PE) is connected to the network through a switch. A PE can be integrated either on a single physical plane (2-D IC) or on several physical planes (3-D IC). Each switch in a 2-D NoC is connected to a neighboring switch in one of four directions. Consequently, each switch has five ports. Alternatively, in a 3-D NoC, the switch typically connects to two additional neighboring switches located on the adjacent physical planes. The switch architecture is considered here to be a canonical switch with input and output buffering [4].
Fig. 5.1 Different NoC topologies (not to scale): (a) 2-D IC – 2-D NoC, (b) 2-D IC – 3-D NoC, (c) 3-D IC – 2-D NoC, and (d) 3-D IC – 3-D NoC [8]. The figure labels the router and PE of a network node, the network dimensions n1, n2, and n3, the number of planes per PE np, and the vertical and horizontal link lengths Lv and Lh.
Although a network switch can also be designed in a 3-D manner [5], herein the switches are considered as two-dimensional (i.e., occupy a single physical plane). Note that if 3-D switches are utilized, the performance of each of the targeted topologies will be improved equally. Consequently, a comparison of the 3-D network topologies is independent of whether a 2-D or 3-D switch is used. The combination of a PE and switch is a network node. For a 2-D mesh network, the total number of nodes N is n_1 × n_2, where n_i is the number of nodes included in the ith physical dimension.

Integration in the third dimension introduces a variety of topological choices for NoCs. For a 3-D NoC, as shown in Fig. 5.1b, the total number of nodes is N = n_1 × n_2 × n_3, where n_3 is the number of nodes in the third dimension. In this topology, each PE is on a single yet possibly different physical plane (2-D IC – 3-D NoC). Alternatively, a PE can be implemented on only one of the n_3 physical planes. The 3-D system, therefore, contains n_1 × n_2 PEs on each of the n_3 physical planes, where the total number of nodes is N. This topology is discussed in [6, 7]. A 3-D NoC topology is proposed in Fig. 5.1c, where the interconnect network is contained within one physical plane (i.e., n_3 = 1), while each PE is integrated on multiple planes, notated as n_p (3-D IC – 2-D NoC). Finally, a hybrid 3-D NoC based on the two previous topologies is proposed in Fig. 5.1d. In such an NoC, both the interconnect network and the PEs can span more than one physical plane of the stack (3-D IC – 3-D NoC). In the following section, latency and power expressions for each of the NoC topologies are presented, assuming a zero-load model.
5.3 Timing and Power Model for 3-D NoCs

In this section both timing and power dissipation models for on-chip networks are described. Modeling specific parameters as a function of temperature is also discussed to investigate thermal effects on different 3-D NoC topologies. The latency and power dissipation models are presented in Sections 5.3.1 and 5.3.2, respectively.
5.3.1 Latency Model for 3-D NoC

In this section, analytic models of the zero-load latency of each of the 3-D NoC topologies are described. Zero-load network latency is widely used as a performance metric in traditional interconnection networks [10]. The zero-load latency of a network is the latency where only one packet traverses the network at any one time. Although this model does not consider contention among packets, the zero-load latency can be used to describe the effect of a topology on the performance of a network. The zero-load latency of an NoC with wormhole switching is [10]
T_{network} = hops \cdot t_r + t_c + L_p / b,    (5.1)
where the first term represents the routing delay, t_c is the propagation delay along the wires of the physical link (which is also called a buss here for simplicity), and the third term is the serialization delay of the packet. hops is the average number of switches that a packet traverses to reach the destination node, t_r is the switch delay, L_p is the length of the packet in bits, and b is the bandwidth of the buss, defined as b ≡ w_c f_c, where w_c is the width of the link in bits and f_c is the inverse of the delay of a bit propagating along the longest physical link. Since the number of planes that can be stacked in a 3-D NoC is constrained by the target technology, n_3 is also constrained. Furthermore, n_1, n_2, and n_3 are not necessarily equal. The average number of hops in a 3-D NoC is
hops = \frac{n_1 n_2 n_3 (n_1 + n_2 + n_3) − n_3 (n_1 + n_2) − n_1 n_2}{3 (n_1 n_2 n_3 − 1)},    (5.2)
assuming dimension-order routing to ensure that minimum-distance paths are used for routing packets between any source-destination node pair. Although the average number of hops provides a useful expression for a latency model of an NoC, (5.2) does not characterize all possible traffic scenarios within a network. For example, while uniform traffic among the PEs can be modeled by the average number of hops, applications that favor localized traffic will result in a different number of hops from (5.2). This situation does not lessen the applicability of the models described in this section as long as an expression for the number of hops can be determined. In addition, synthesis tools for on-chip networks typically
utilize a zero-load model to determine the number of hops within the network, thereby determining the latency of each synthesized topology [9]. In the case where the network traffic cannot accurately be described in closed form, the inherent characteristics of the topologies depicted in Fig. 5.1 can guide the selection process for a specific topology. For example, consider a highly localized traffic scenario. The vertical channels can be used primarily for high bandwidth data transfer since the vertical links exhibit a considerably lower delay as compared to the horizontal links. This difference in latency suggests that a 2-D IC – 3-D NoC is a better candidate than a 3-D IC – 2-D NoC topology, where the network includes only horizontal links. The criterion for choosing this topology, in this case, would be the short vertical links, not the reduction in the number of hops (due to the larger number of PEs connected to each switch). To describe this difference in latency between the two types of links, the average number of hops in (5.2) can be divided into two components: the average number of hops within the two dimensions n_1 and n_2, and the average number of hops within the third dimension n_3,
hops_{2-D} = \frac{n_3 (n_1 + n_2)(n_1 n_2 − 1)}{3 (n_1 n_2 n_3 − 1)},    (5.3)

hops_{3-D} = \frac{(n_3^2 − 1) n_1 n_2}{3 (n_1 n_2 n_3 − 1)}.    (5.4)
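As a quick sanity check of (5.2)–(5.4), the following sketch (an illustration only; the 4 × 4 × 2 network size is arbitrary) evaluates the average hop counts and verifies that the planar and vertical components sum to the total:

    # Average hop counts of a 3-D mesh NoC under dimension-order routing,
    # following (5.2)-(5.4).
    def avg_hops(n1, n2, n3):
        N = n1 * n2 * n3
        total = (N * (n1 + n2 + n3) - n3 * (n1 + n2) - n1 * n2) / (3 * (N - 1))
        h_2d  = n3 * (n1 + n2) * (n1 * n2 - 1) / (3 * (N - 1))
        h_3d  = (n3 ** 2 - 1) * n1 * n2 / (3 * (N - 1))
        return total, h_2d, h_3d

    total, h_2d, h_3d = avg_hops(4, 4, 2)
    assert abs(total - (h_2d + h_3d)) < 1e-12
    print(round(total, 2), round(h_2d, 2), round(h_3d, 2))   # 3.1, 2.58, 0.52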
The delay of the switch t_r is the sum of the delay of the arbitration logic t_a and the delay of the crossbar t_s, which is assumed to be implemented as a classic crossbar switch [10],
t_r = t_a + t_s.    (5.5)
The delay of the arbiter as described in [11] is
t_a = [21(1/4) \log_2 p + 14(1/12) + 9] τ,    (5.6)
where p is the number of ports of the switch and τ is the delay of a minimum sized inverter for the target technology. Note that (5.6) exhibits a logarithmic dependence on the number of switch ports. The length of the crossbar switch also depends upon the number of switch ports and the width of the buss,
l_s = 2 (w_t + s_t) w_c p,    (5.7)

where w_t and s_t are the width and spacing, respectively (or, alternatively, the pitch), of the interconnect and w_c is the width of the physical link in bits. Consequently, the worst case delay of the crossbar switch is determined by the longest path within the switch, which is equal to (5.7). The delay of the physical link t_c is

t_c = t_v \cdot hops_{3-D} + t_h \cdot hops_{2-D},    (5.8)
where t_v and t_h are the delay of the vertical and horizontal buss, respectively (see Fig. 5.1b). Note that if n_3 = 1, (5.8) describes the propagation delay of a 2-D NoC. Substituting (5.5) and (5.8) into (5.1), the overall zero-load network latency for a 3-D NoC is
T_{network} = hops (t_a + t_s) + hops_{2-D} t_h + hops_{3-D} t_v + \frac{L_p}{w_c} t_h.    (5.9)
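A small numerical sketch of (5.9) (purely illustrative; the hop counts are those of the 4 × 4 × 2 example above and the delay values are placeholders, not taken from the chapter):

    # Zero-load latency of a 3-D NoC per (5.9); delays in ns.
    def zero_load_latency(hops, hops_2d, hops_3d, t_a, t_s, t_h, t_v, L_p, w_c):
        return hops * (t_a + t_s) + hops_2d * t_h + hops_3d * t_v + (L_p / w_c) * t_h

    T = zero_load_latency(hops=3.10, hops_2d=2.58, hops_3d=0.52,
                          t_a=0.10, t_s=0.15,   # arbiter and crossbar delays
                          t_h=0.30, t_v=0.02,   # horizontal and vertical link delays
                          L_p=256, w_c=32)      # 256-bit packet on a 32-bit buss
    print(round(T, 2))                          # ~3.96 ns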
To characterize ts, th, and tv, the models described in [12] are adopted, where repeaters implemented as simple inverters are inserted along the interconnect. According to these models, the propagation delay and rise time of a single interconnect stage for a step input, respectively, are
t_{di} = 0.377 \frac{r_i c_i l_i^2}{k_i^2} + 0.693 \left( R_{d0} C_0 + \frac{R_{d0} c_i l_i}{h_i k_i} + \frac{r_i l_i C_{g0} h_i}{k_i} \right),    (5.10)

t_{ri} = 1.1 \frac{r_i c_i l_i^2}{k_i^2} + 2.75 \left( R_{r0} C_0 + \frac{R_{r0} c_i l_i}{h_i k_i} + \frac{r_i l_i C_{g0} h_i}{k_i} \right),    (5.11)
where r_i (c_i) is the per unit length resistance (capacitance) of the interconnect and l_i is the total length of the interconnect. The index i is used to notate the interconnect delays included in the network (i.e., i ∈ {s, v, h}). h_i and k_i denote the size and number of repeaters, respectively, and C_{g0} and C_0 represent the gate and total input capacitance of a minimum-sized device, respectively. C_0 is the summation of the gate and drain capacitance of the device. R_{r0} and R_{d0} describe the equivalent output resistance of a minimum-sized device used to determine the transition time and the propagation delay of a minimum-sized inverter, respectively, where the output resistance is approximated as
R_{r(d)0} = K_{r(d)} \frac{V_{dd}}{I_{dn0}}.    (5.12)
K denotes a fitting coefficient and Idn0 is the drain current of an NMOS device where both Vds and Vgs are equal to Vdd. The saturation current for a MOSFET assuming the alpha-power law model [13, 14] is
I_{dsat} = I_{d0} \left( \frac{V_{gs} − V_{th}}{V_{dd} − V_{th}} \right)^a,    (5.13)

where

I_{d0} = \frac{μ_0 C_{ox} V_{D0} [V_{dd} − V_{th} − (η/2) V_{D0}]}{[1 + θ (V_{dd} − V_{th})][1 + V_{dd}/(E_C L)]},    (5.14)
and V_{D0} is the drain-source voltage at saturation with V_{gs} equal to V_{dd}. The parameters θ and μ_0 are described in [13], the technology-related constants are from the ITRS report [15], and the MOSFET models [16, 17] are for a 45 nm technology node. The a_{n(p)} parameter of the model is

a_{n(p)} = \frac{1}{\ln 2} \ln \frac{2 V_{D0} [V_{dd} − V_{th\_n(p)} − (η/2) V_{D0}]}{V_{Da} [V_{dd} − V_{th\_n(p)} − η V_{Da}]},    (5.15)

where V_{Da} is the drain-source voltage at saturation with V_{gs} equal to (V_{dd} + V_{th\_n(p)})/2 [13]. To include the effect of the input slew rate on the total delay of an interconnect stage, (5.10) and (5.11) are further refined by including an additional coefficient as in [18],

γ_r = \frac{1}{2} − \frac{1 − V_{tn}/V_{dd}}{1 + a_n}.    (5.16)

By substituting the subscript n with p, the corresponding value γ_f for a falling input transition is obtained. The average value of γ_r and γ_f is γ, which is used to determine the effect of the input transition time on the interconnect delay. The overall interconnect delay can therefore be described as
t_i = k_i (t_{di} + γ t_{ri}) = b_1 \frac{r_i c_i l_i^2}{k_i} + b_2 \left( R_0 C_0 k_i + \frac{R_0 c_i l_i}{h_i} + r_i l_i C_{g0} h_i \right),    (5.17)
where R_0, b_1, and b_2 are described in [19] and the index i denotes the interconnect structures, such as the crossbar switch (i ≡ s), horizontal buss (i ≡ h), and vertical buss (i ≡ v). The interconnect delay also depends upon the temperature during circuit operation. To capture these dependencies, the resistance of the interconnects is
r_i = ρ_0 [1 + β_{Cu} (T − T_{ref})] / A_{wi},    (5.18)
where T_{ref} and T are the reference and operating temperature of the circuit, respectively. The resistivity of copper at T_{ref} is ρ_0, where a different resistivity is used for each tier according to the ITRS. β_{Cu} is the temperature coefficient of resistance for copper, β_{Cu} = 3.9 × 10^{−3} 1/°C. A_{wi} is the area of the cross-section of the wire. The MOSFET current described by (5.13) and (5.14) also varies with temperature. Analytic expressions for V_{D0} and V_{th} as a function of temperature have been adapted from the BSIM User's Manual [20]. The dependence of a_{n(p)} on temperature is implicitly captured through those analytic expressions describing V_{Da} and V_{th} as a function of temperature [13, 20]. For minimum delay, the size h_i and number k_i of repeaters are determined by setting the partial derivative of t_i with respect to h_i and k_i, respectively, equal to zero and solving for h_i and k_i [21],
k_i^* = \sqrt{\frac{a_1 r_i c_i l_i^2}{a_2 R_0 C_0}},    (5.19)

h_i^* = \sqrt{\frac{R_0 c_i}{r_i C_{g0}}}.    (5.20)
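A small sketch of the repeater optimization in (5.19) and (5.20), combined with the temperature-dependent wire resistance of (5.18) (all device and wire parameters below are illustrative placeholders, not values from the chapter):

    import math

    # Optimal repeater count and size per (5.19)-(5.20), with the
    # temperature-dependent wire resistance of (5.18).
    rho0, beta_cu, t_ref = 1.9e-8, 3.9e-3, 25.0   # ohm*m, 1/degC, degC
    a_w  = 50e-9 * 100e-9                         # wire cross-section (m^2)
    c    = 200e-12                                # wire capacitance (F/m)
    R0, C0, Cg0 = 2.0e3, 1.0e-15, 0.6e-15         # min-sized device R and C
    a1, a2 = 0.4, 0.7                             # delay-model fitting coefficients
    l = 1e-3                                      # 1 mm horizontal link

    def wire_resistance_per_m(temp_c):
        return rho0 * (1.0 + beta_cu * (temp_c - t_ref)) / a_w        # (5.18)

    for temp in (25.0, 85.0):
        r = wire_resistance_per_m(temp)
        k_opt = math.sqrt(a1 * r * c * l ** 2 / (a2 * R0 * C0))       # (5.19)
        h_opt = math.sqrt(R0 * c / (r * Cg0))                         # (5.20)
        print(temp, round(k_opt, 1), round(h_opt, 1))
    # the hotter wire is more resistive, so more (and smaller) repeaters are optimal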
The expression in (5.17) only considers RC interconnects. An RC model is sufficiently accurate to characterize the delay of a crossbar switch since the length of the longest wire within the crossbar switch and the signal frequencies are such that inductive behavior is not prominent. For buss lines, however, inductive behavior can appear. For this case, suitable expressions for the delay and repeater insertion characteristics can be adopted from [22, 23]. Additionally, for the vertical buss, k_v = 1 and h_v = 1, meaning that no repeaters are inserted and minimum-sized drivers are utilized. Repeaters are not necessary due to the short length of the vertical buss. Note that the latency expressions include the effect of the input slew rate and temperature. Additionally, since a repeater insertion methodology for minimum latency is applied, any further reduction in latency is due to the network topology. The length of the vertical communication channel for the 3-D NoC shown in Fig. 5.1 is

l_v = L_v,   for 2-D IC – 3-D NoC,    (5.21a)
l_v = n_p L_v,   for 3-D IC – 3-D NoC,    (5.21b)
l_v = 0,   for 2-D IC – 2-D NoC and 3-D IC – 2-D NoC,    (5.21c)

where L_v is the length of a through-silicon (interplane) via connecting two switches on adjacent physical planes and n_p is the number of physical planes used to integrate each PE. The length of the horizontal buss is
l_h = \sqrt{A_{PE}},   for 2-D IC – 2-D NoC and 2-D IC – 3-D NoC,    (5.22a)
l_h = coef \cdot \sqrt{A_{PE}/n_p},   for 3-D IC – 2-D NoC and 3-D IC – 3-D NoC (n_p > 1),    (5.22b)
where A_{PE} is the area of the processing element. The area of all of the PEs and, consequently, the length of each horizontal link are assumed to be equal. For those cases where the PE is implemented in multiple physical planes (n_p > 1), a coefficient coef is used to consider the effect of the interplane vias on the reduction in wirelength due to utilization of the third dimension. This coefficient is based on the layout of a crossbar switch designed in [24] with the FDSOI 3-D technology from MIT Lincoln Laboratory (MITLL) [25]. In the following section, expressions for the power consumption of a network with delay constraints are presented.
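To illustrate how (5.21) and (5.22) differ across the four topologies, the sketch below evaluates both link lengths (the PE area, via length, and wirelength coefficient are illustrative placeholders):

    import math

    A_PE = 1.0       # PE area (mm^2), placeholder
    L_v  = 0.05      # interplane via length (mm), placeholder
    coef = 0.9       # wirelength-reduction coefficient for multi-plane PEs
    n_p  = 2         # planes per PE when the PE itself is 3-D

    def link_lengths(topology):
        # returns (l_v, l_h) in mm per (5.21)-(5.22)
        if topology == "2-D IC - 3-D NoC":
            return L_v, math.sqrt(A_PE)
        if topology == "3-D IC - 3-D NoC":
            return n_p * L_v, coef * math.sqrt(A_PE / n_p)
        if topology == "3-D IC - 2-D NoC":
            return 0.0, coef * math.sqrt(A_PE / n_p)
        return 0.0, math.sqrt(A_PE)                 # 2-D IC - 2-D NoC

    for t in ("2-D IC - 2-D NoC", "2-D IC - 3-D NoC",
              "3-D IC - 2-D NoC", "3-D IC - 3-D NoC"):
        print(t, link_lengths(t))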
5.3.2 Power Consumption Model for 3-D NoC

Power dissipation is a critical issue in three-dimensional circuits. Although the total power consumption of a 3-D system is expected to be lower than that of an equivalent 2-D circuit (since the global interconnects are shorter [26]), the increased power density is a challenging issue for this novel design paradigm. Therefore, those 3-D NoC topologies that offer low-power characteristics are of significant interest. The different power consumption components for interconnects with repeaters are briefly discussed in this section. Due to specified performance characteristics, a low-power design methodology with delay constraints for the interconnect within an NoC is adopted from [19]. An expression for the power consumption per bit of a packet transferred between a source-destination node pair is used as the basis for characterizing the power consumption of an NoC for the proposed topologies. The power consumption components of an interconnect line with repeaters are:

(a) Dynamic power consumption is the dissipated power due to the charge and discharge of the interconnect and input gate capacitance during a signal transition, and can be described by
P_{di} = a_{s\_noc} f (c_i l_i + h_i k_i C_0) V_{dd}^2,    (5.23)
where f is the clock frequency and a_{s\_noc} is the switching factor [27].

(b) Short-circuit power is due to the DC current path that exists in a CMOS circuit during a signal transition when the input signal voltage changes between V_{tn} and V_{dd} + V_{tp}. The power consumption due to this current is described as short-circuit power and is modeled in [28] by
P_{si} = \frac{4 a_{s\_noc} f I_{d0}^2 t_{ri}^2 V_{dd} k_i h_i^2}{V_{dsat} (G C_{effi} + 2 H I_{d0} t_{ri} h_i)},    (5.24)
where I_{d0} is the average drain current of the NMOS and PMOS devices operating in the saturation region, and the values of the coefficients G and H are described in [29]. Due to resistive shielding of the interconnect capacitance, an effective capacitance is used rather than the total interconnect capacitance. Note that resistive shielding results in a smaller capacitive load as seen from the interconnect driver (i.e., C_{eff} ≤ C_{total}). This effective capacitance is determined from the methodology described in [30, 31].

(c) Leakage power comprises two components, the subthreshold and gate leakage currents. Subthreshold power consumption is due to current flowing when the transistor operates in the cut-off region (below threshold), causing the current I_{sub} to flow. The gate leakage component is due to current flowing through the gate oxide, denoted as I_g. The total leakage power can be described as
P_{li} = h_i k_i V_{dd} (I_{sub0} + I_{g0}),    (5.25)
where the average subthreshold (I_{sub0}) and gate (I_{g0}) leakage currents of the NMOS and PMOS transistors are considered in (5.25).
The total power consumption with delay constraint T_0 for a single line of a crossbar switch, P_{stotal}, horizontal buss, P_{htotal}, and vertical buss, P_{vtotal}, is, respectively,

P_{stotal}(T_0 − t_a) = P_{di} + P_{si} + P_{li},    (5.26)

P_{htotal}(T_0) = P_{di} + P_{si} + P_{li},    (5.27)

P_{vtotal}(T_0) = P_{di} + P_{si} + P_{li}.    (5.28)
The power consumed by the arbitration logic is not considered in (5.26)–(5.28) since most of the power is consumed by the crossbar switch and the buss interconnect, as discussed in [32]. Note that for a crossbar switch, the additional delay of the arbitration logic poses a stricter delay constraint on the switch. The minimum power consumption with delay constraints is determined by the methodology described in [19], which is used to determine the optimum size h^*_{pow,i} and number k^*_{pow,i} of repeaters for a single interconnect line. Consequently, the minimum power consumption per bit between a source-destination node pair in an NoC with a delay constraint is
P_{bit} = hops \cdot P_{stotal} + hops_{2-D} P_{htotal} + hops_{3-D} P_{vtotal}.    (5.29)
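Analogous to the latency sketch above, (5.29) can be evaluated once the per-hop switch and link powers are known (the power figures below are illustrative placeholders):

    # Power per bit between a source-destination pair, per (5.29); mW per line.
    def power_per_bit(hops, hops_2d, hops_3d, p_switch, p_hbus, p_vbus):
        return hops * p_switch + hops_2d * p_hbus + hops_3d * p_vbus

    print(power_per_bit(3.10, 2.58, 0.52, p_switch=0.8, p_hbus=1.2, p_vbus=0.1))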
Note that the proposed power expression includes all of the power consumption components in the network, not only the dynamic power. The effect of resistive shielding is also considered in determining the effective interconnect capacitance. Additionally, the effect of temperature on each of the power dissipation components is considered. Furthermore, since the repeater insertion methodology described in [19] minimizes the power consumed by the repeater system, any additional decrease in power consumption is only due to the network topology. In the following section, a technique for analyzing the power dissipation of a PE and the related rise in temperature within an NoC system is described.
5.4 Thermal-Aware Analysis Methodology

The 3-D NoC topologies and related timing and power models presented in the previous section emphasize performance improvements achieved by including the third dimension within the on-chip network. Vertical integration, however, can significantly enhance the performance of the PEs [33, 34], in addition to the performance of the network. Redesigning a planar PE into several physical planes can greatly decrease the power consumed by the PEs [35]. This reduction, in turn, lowers the temperature rise within the stack, leading to tangible benefits in the overall behavior of the system. For example, since the temperature rise is limited, the corresponding increase in the interconnect resistance is lower, decreasing the interconnect delay within the NoC. Excessive increases in leakage power will also be avoided. In this
section, a methodology for analyzing the overall behavior of a 3-D system interconnected with an on-chip network is presented. Since the systems described in this chapter are presumed to be homogeneous, the power consumed by the entire system can be straightforwardly determined if the power dissipated by an elemental unit, as depicted in Fig. 5.1, is known. The power consumed by a PE and the two physical links that surround the PE is another useful metric for characterizing the different 3-D NoC topologies,
P_{total} = P_{PE} + 2 L_p P_{bit},    (5.30)
where P_{PE} is the power consumed by a single PE. P_{PE} includes all of the different energy components described in Section 5.3.2. Different expressions, however, can be used to describe these components. The dynamic power consumption P_{PE\_dyn} is separated into the power constituents, P_g and P_{int}, for driving the capacitance of the logic gates and interconnects, respectively. The dynamic power consumption can therefore be written as

P_{PE\_dyn} = P_g + P_{int} = f_{PE} \cdot a_s (N_{tot} C_{gate} + C_{int\_loc} + C_{int\_semi} + C_{int\_glob}) V_{dd}^2,    (5.31)
where Ntot is the number of gates within a PE, αs is the switching activity within a PE, and fPE is the operating frequency of the PE. Note that this frequency is typically different from the clock frequency fc of the NoC. Cint_loc, Cint_semi, and Cint_glob are, respectively, the total capacitance of the local, semi-global, and global interconnects driven by the Ntot gates within the PE. The leakage and short-circuit power dissipation of a PE can be determined similarly to (5.24) and (5.25). The leakage power of the PE is
PPE_leak = Weff (Isub0 + Ig0) · Vdd ,  (5.32)
where Weff is the total width of the transistors within a PE. Assuming that the PEs consist of four-transistor gates and that the transistors are sized to equalize the rise and fall transitions, the width of the four devices is a function of the minimum feature size. Multiplying this width by the number of gates within the PE yields Weff. To determine the interconnect capacitance described in (5.31), the distribution of the interconnects within a PE is required. An enhanced wire distribution model for either 2-D or 3-D circuits based on [36] is utilized in this analysis. This model is further integrated into the methodology described in [37] to produce the number of tiers (i.e., local, semi-global, global) and the pitch of each metal layer. A pair of metal layers routed orthogonally is assumed to comprise a tier. A specific constraint on the interconnect delay is set for each tier to determine the maximum interconnect length that can be placed within that tier. This constraint is set to 25% and 90% of the clock period corresponding to fPE for the local and other tiers, respectively. An interconnect is placed on the next tier whenever its length does not satisfy this constraint. Although the effect of temperature is considered in these power expressions, the methodology described in [37] does not consider thermal issues.
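The tier-assignment rule just described can be sketched as follows; the maximum lengths per tier are placeholders, since in the actual methodology they follow from the delay models and the pitch chosen for each tier.

```python
# Illustrative sketch of the tier-assignment rule: a wire is promoted to the next
# (coarser-pitch) tier when its length exceeds the maximum length that still
# meets the delay constraint of the current tier. The limits below are placeholders.

def assign_tier(wire_length_um, max_length_per_tier_um):
    """Return the index of the lowest tier whose delay constraint the wire meets."""
    for tier, max_length in enumerate(max_length_per_tier_um):
        if wire_length_um <= max_length:
            return tier
    return len(max_length_per_tier_um) - 1  # the longest wires stay on the top tier

# Example: tiers ordered local -> semi-global -> global (lengths in micrometers)
print(assign_tier(350.0, [100.0, 500.0, 5000.0]))  # -> 1 (semi-global)
```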
[Fig. 5.2 An example of the duality of thermal and electrical systems. For a three-plane stack with heat flows Q1, Q2, Q3 and thermal resistances R1, R2, R3 (plane 1 closest to the heat sink): ΔT1 = R1(Q1 + Q2 + Q3), ΔT2 = Q3(R2 + R1) + Q2(R2 + R1) + Q1R1, ΔT3 = Q3(R3 + R2 + R1) + Q2(R2 + R1) + Q1R1. The dual electrical network obeys V1 = R1(I1 + I2 + I3), V2 = I3(R2 + R1) + I2(R2 + R1) + I1R1, V3 = I3(R3 + R2 + R1) + I2(R2 + R1) + I1R1.]
To consider these important 3-D circuit effects, a first-order thermal model of a 3-D circuit has been integrated into this methodology. This model assumes a one-dimensional (1-D) flow of heat throughout a 3-D stack [38, 39]. Note that the model includes both the heat generated by the devices and the interconnect. The assumption of a 1-D heat flow is justified as the path towards the heat sink exhibits the lowest thermal resistance, facilitating the removal of heat from the circuit to the ambient. This path is the primary path for the heat to flow in a 3-D circuit [39]. By exploiting the electro-thermal duality, as illustrated in Fig. 5.2, the temperature in each plane and each metal layer within a physical plane of a 3-D circuit can be determined (a short sketch of this calculation is given after the list below). The number of metal layers, metal pitch within these layers, temperature rise, and power dissipation for a PE can all be determined early in the design cycle by utilizing this thermal-aware interconnect architecture design methodology. The analysis method is depicted in Fig. 5.3. The input parameters are the target technology node, the number of gates Ntot within the PE, the number of physical planes np used for the PE, and the clock frequency fPE. Most of the interconnect and related parameters can be extracted for the target technology node. The complete methodology proceeds as follows:
1. For the clock frequency fPE and the nominal temperature, an initial number of metal layers and the interconnect pitch are determined to satisfy the aforementioned delay constraints.
2. For this interconnect architecture, and assuming the same average current density for the wires on each metal layer, the rise in temperature is determined.
3. Based on this increase in temperature, the electrical wire resistance, leakage power, and other temperature dependent parameters are updated. The interconnect delay is again evaluated against the input delay constraints. If these specifications are not satisfied, a new interconnect architecture is produced.
4. The iterative process terminates once the circuit has reached a steady-state temperature and the delay constraints for each tier are satisfied. The output is the number of metal layers, interconnect pitch, and temperature of the circuit.
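A minimal sketch of the 1-D electro-thermal duality of Fig. 5.2 is given below: the heat flow of each plane plays the role of a current and the thermal resistance between adjacent planes plays the role of an electrical resistance, so the temperature rise of a plane is obtained by summing the heat flowing through each resistance between that plane and the heat sink. The numerical values are placeholders.

```python
# Sketch of the thermal ladder of Fig. 5.2 (plane 0 is closest to the heat sink).
# heat_per_plane[j] is the heat generated in plane j [W]; thermal_resistances[j]
# is the resistance between plane j and the plane (or heat sink) below it [K/W].

def plane_temperature_rises(heat_per_plane, thermal_resistances):
    rises, cumulative = [], 0.0
    for j in range(len(heat_per_plane)):
        heat_through_rj = sum(heat_per_plane[j:])   # all heat from plane j upwards
        cumulative += heat_through_rj * thermal_resistances[j]
        rises.append(cumulative)                    # rise of plane j over ambient
    return rises

# Three-plane example matching the structure of Fig. 5.2
print(plane_temperature_rises([2.0, 1.5, 1.0], [0.5, 0.7, 0.9]))
```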
[Fig. 5.3 Analysis flow diagram that produces a first-order interconnect architecture and the steady-state temperature of a PE consisting of np physical planes within a 3-D stack of n planes (np ≤ n). The flow iterates between determining the number of tiers, interconnect pitch, and wire distribution, calculating the temperature, and updating the wire resistance, temperature-dependent parameters, and leakage power, until the delay specifications are met and the temperature change between iterations falls below 10^-5 °C.]
Once the temperature and interconnect length at each tier have been determined, the power consumed by the PE is readily determined from (5.30)–(5.32). In this manner, the topology that produces the lowest power dissipation for the entire 3-D system rather than the physical link of the on-chip network is determined. Following this procedure, the different 3-D topologies discussed in Section 5.2 are evaluated in terms of the latency and power dissipation of the NoC, the total power dissipation of the system, and the rise in temperature. Various tradeoffs inherent to these topologies are also discussed for a NoC-based 3-D system.
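The fixed-point character of the flow in Fig. 5.3 can be illustrated with the toy loop below. The leakage and thermal models used here are deliberately crude placeholders (a linear leakage-temperature dependence and a single lumped thermal resistance); they only serve to show the iteration and the convergence criterion, not the chapter's detailed models.

```python
# Toy fixed-point iteration between power and temperature, in the spirit of Fig. 5.3.
# All model parameters are placeholders.

def steady_state_temperature(p_dynamic_w, r_thermal_c_per_w,
                             t_nominal_c=27.0, leak_nominal_w=0.1,
                             leak_per_deg=0.01, tol_c=1e-5):
    """Iterate temperature <-> power until the temperature change is below tol_c."""
    t = t_nominal_c
    while True:
        # leakage grows with temperature (placeholder linear model)
        p_leak = leak_nominal_w * (1.0 + leak_per_deg * (t - t_nominal_c))
        t_new = t_nominal_c + (p_dynamic_w + p_leak) * r_thermal_c_per_w
        if abs(t_new - t) < tol_c:
            return t_new, p_dynamic_w + p_leak
        t = t_new

print(steady_state_temperature(p_dynamic_w=2.0, r_thermal_c_per_w=5.0))
```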
5.5 Latency, Power, and Temperature Tradeoffs in 3-D NoC Topologies

The improvement in the performance of traditionally planar on-chip networks by introducing the third dimension is discussed in this section. Those topologies that produce the highest network performance, the lowest power dissipation of the network and system, and the lowest rise in temperature are evaluated. Different topologies are demonstrated that satisfy each of these objectives. Consequently, the analysis methodology presented in the previous section can be a useful tool to evaluate the improvement in a specific objective offered by a 3-D topology. The effect of the 3-D topologies on the speed, power, and temperature rise of a network is discussed in Sections 5.5.1, 5.5.2, and 5.5.3, respectively. The latency, power, and rise in temperature of a 2-D mesh network with the same number of nodes is used as a reference for the comparisons throughout this section. In all of the on-chip networks, a 45 nm technology node, as described in the ITRS, is assumed [15]. Consequently, specific interconnect and device parameters, such as the minimum pitch of the horizontal interconnects, the maximum power density, and the dielectric constant k, are adopted from this report. A TSV technology is assumed to provide the vertical interconnects, with a TSV pitch of 5 μm and an aspect ratio of five [33]. Finally, a synchronous on-chip network is assumed with a clock frequency of fc = 1 GHz. A small set of parameters is considered as variables throughout the analysis of the 3-D topologies. These variables include the network size, the number of physical planes making up the 3-D system, the clock frequency of the PEs fPE, and the area of the PEs APE. The range of these variables is listed in Table 5.1. For multi-processor SoC networks, sizes of up to N = 256 are expected to be feasible within the near future [7, 40], whereas for NoCs with a finer granularity, where each PE corresponds to a hardware block of approximately 100 thousand gates, network sizes over a few thousand nodes are predicted at the 45 nm technology node [41]. Furthermore, to apply the thermal model discussed in [38, 39], the nominal temperature is considered to be 300 K and the side and topmost surfaces of the 3-D stack are assumed to be adiabatic. In other words, the heat is assumed to flow from the uppermost to the lowest plane of the multi-plane system.
Table 5.1 Parameters of the investigated NoCs

Parameter                        Values
N                                16, 32, 64, 128, 256, 512
APE (mm2)                        0.64, 0.81, 1.00, 2.25
fPE (GHz)                        1, 2, 3
Max. number of planes, nmax      8
5.5.1 Latency of 3-D NoCs

Utilizing the third dimension to implement an on-chip network (i.e., the 2-D IC – 3-D NoC topology) by simply stacking the PEs (i.e., n3 > 1, np = 1) and using network switches with seven ports decreases the average number of hops for packet switching, thereby improving the network latency. Alternatively, utilizing the third dimension to decrease the length of the physical links between adjacent network switches (i.e., the 3-D IC – 2-D NoC topology) also reduces the latency of the network. The reduction in buss length is achieved by implementing the PEs in multiple physical planes (i.e., n3 = 1, np > 1), thereby reducing the PE area. Finally, a hybrid topology (i.e., the 3-D IC – 3-D NoC topology), which uses the third dimension both for the on-chip network and the PEs (i.e., n3 > 1, np > 1), can result in the greatest reduction in latency. Note that only the effect of the various topologies on the speed of the network, described by fc, is considered, while the operating frequency of the PEs fPE can be different. Although this approach is convenient from an architectural perspective, certain physical design issues can arise due to the multiple clock domains that can co-exist in these topologies. Each of these topologies faces different synchronization challenges. A multi-plane on-chip network can be implemented with various synchronization schemes ranging from a fully synchronous approach (as assumed herein) to an asynchronous network. A potent non-synchronous approach that has been used for PE-to-network communication in planar systems is the globally asynchronous locally synchronous (GALS) approach. An immediate extension of the GALS approach into three physical dimensions could include a number of clock domains equal to the number of planes comprising a 3-D circuit. This synchronization scheme is suitable for the 2-D IC – 3-D NoC topology. Alternatively, for the 3-D IC – 2-D NoC topology, where the network is planar, a synchronous clocking scheme is a simpler and efficient solution. The synchronization challenge for this topology is related to the multi-plane PEs. The primary issue is how to efficiently propagate the clock signal across a PE occupying several planes. Recently, several synthesis techniques that produce bounded skew and testable clock trees prior to and after bonding have been reported [42]. In addition, preliminary experimental results on multi-plane clock networks demonstrate that operating frequencies in the gigahertz regime are feasible, while the clock skew is manageable [24]. Although functional testing for each plane of a multi-plane PE remains an important problem, early results are encouraging. For the 3-D mesh networks discussed in this chapter, fully synchronous schemes are assumed due to the simplicity of these approaches. In this way, the effect of the topological characteristics rather than the synchronization mechanism on the performance of the network is evaluated, which is the objective of the physical analysis flow discussed in this chapter. The latency for different network sizes and a PE area of 0.64 mm2 and 2.25 mm2 is illustrated in Fig. 5.4a, b, respectively.
[Fig. 5.4 Zero-load latency for various network sizes, comparing 2D ICs – 2D NoCs, 2D ICs – 3D NoCs, 3D ICs – 2D NoCs, and 3D ICs – 3D NoCs, where (a) APE = 0.64 mm2 and (b) APE = 2.25 mm2 for fPE = 1 GHz.]
Note that the temperature rise due to stacking or folding the PEs is also considered by the methodology presented in Section 5.4. The improvement in latency offered by the 2-D IC – 3-D NoC topology increases with network size. For example, for N = 16, the third dimension does not improve the latency, since the number of hops required for packet switching is small. The increase in the delay of the network switch (due to the increase in the number of switch ports) can outweigh the decrease in latency due to the reduction in the number of hops, which is negligible for this network size. As the network size increases, however, the decrease in latency offered by this topology progressively increases. The improvement in latency increases from 1.08% for N = 32 to 6.79% for N = 256, where APE = 0.64 mm2. In addition, the area of the PE has no significant effect on this improvement, as depicted in Fig. 5.4, since this topology only alters the number of hops for packet switching. Note, however, that the absolute network latency increases for this topology, since the length of the busses increases with the area of the PEs. The 3-D IC – 2-D NoC exhibits the opposite behavior, since this topology reduces the length of the physical links. Thus, the improvement in latency increases in those networks with a large PE area. The latency decreases by 48.97% for APE = 0.64 mm2 and 60.39% for APE = 2.25 mm2 for a network size of N = 256. The reduction in network latency for this topology decreases with increasing network size. As the network size increases, the greatest portion of the latency as described by (5.9) is due to the larger number of hops rather than the buss delay. Consequently, the benefits offered by the reduction in length of the busses decrease with network size for the 3-D IC – 2-D NoC topology. For example, the improvement in latency decreases from 54.12% for N = 32 to 50.83% for N = 128, where APE = 0.64 mm2. The hybrid 3-D IC – 3-D NoC topology demonstrates the greatest improvement in latency as compared to a 2-D network, since the third dimension decreases both the length of the busses and the number of hops [43]. Depending upon the network size, the area of the PE, and the interconnect impedance characteristics of the busses,
n3 and np can be adjusted such that the 3-D IC – 3-D NoC topology reduces to either the 2-D IC – 3-D NoC (i.e., n3 = nmax) or the 3-D IC – 2-D NoC (i.e., np = nmax). The results shown in Fig. 5.4 include the increase in delay caused by the rise in temperature within a 3-D stack. Based on the methodology discussed in the previous section, the resulting temperature rise does not significantly affect the improvement in latency provided by the 3-D topologies. The resulting higher temperatures within the 3-D system cause a small, roughly 3%, increase in the interconnect latency for all of the investigated networks. The thermal effects are similar to those discussed in [44, 45]. Consequently, the overall effect of the 3-D topologies is that the network latency is significantly decreased, although the inevitable increase in temperature consumes a small portion of this improvement. The effect of the third dimension on the power consumed by the network and the PEs is discussed in the following section.
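The hop-count component of these latency trends can be reproduced with a small brute-force calculation, assuming uniform traffic and minimal (dimension-ordered) routing; this is only an illustration of the topological effect, not the zero-load latency model of (5.9).

```python
# Average hop count of a 64-node mesh, 2-D (8x8) versus 3-D (4x4x4), assuming
# uniform traffic and minimal routing (hops = Manhattan distance).

from itertools import product

def average_hops(dims):
    nodes = list(product(*[range(d) for d in dims]))
    total = pairs = 0
    for src in nodes:
        for dst in nodes:
            if src != dst:
                total += sum(abs(s - d) for s, d in zip(src, dst))
                pairs += 1
    return total / pairs

print(average_hops((8, 8)))      # 2-D mesh, 64 nodes
print(average_hops((4, 4, 4)))   # 3-D mesh, 64 nodes (fewer average hops)
```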
5.5.2 Power Dissipation of 3-D NoCs

The decrease in the power of a conventional 2-D on-chip network achieved by the 3-D topologies is presented in this section. Two different power consumption metrics are used to characterize the benefits of these topologies. First, the 3-D topology that minimizes the power consumed by the network, described by (5.29), ignoring the power of the PEs, is considered. For the second metric, the overall power dissipation of the system, including both the power of the network and the PEs, is described by (5.30). Those topologies that minimize each of these two metrics are determined. Furthermore, the distribution of the network nodes in terms of the physical dimensions (i.e., n1, n2, n3, and np) can be quite different for the same 3-D topology. The power consumed by these 3-D topologies is illustrated in Fig. 5.5, where the power dissipated by the PEs is ignored.
[Fig. 5.5 Power consumed by the network (Pbit) with delay constraints (fc = 1 GHz) for several network sizes, comparing 2D ICs – 2D NoCs, 2D ICs – 3D NoCs, 3D ICs – 2D NoCs, and 3D ICs – 3D NoCs, where (a) APE = 0.64 mm2 and (b) APE = 2.25 mm2 for fPE = 1 GHz.]
Similar to the discussion related to latency, the 2-D IC – 3-D NoC and 3-D IC – 2-D NoC topologies lower the power dissipated by the network through a reduction in the number of hops and the capacitance of the wires, respectively. Note that the y-axis in Fig. 5.5 corresponds to the power required to transfer a single bit over an average distance within the network, where this distance is determined by the number of hops for packet switching, as described by (5.2). Comparing Fig. 5.5a, b, the power consumed by a planar on-chip network increases with the area of the PEs interconnected by this network. For example, the power almost doubles for the same network size as the area of the PE increases from 0.64 mm2 to 2.25 mm2. Similar to the network latency, the power consumption decreases in the 2-D IC – 3-D NoC topology by reducing the number of hops for packet switching. Again, the increase in the number of ports adds to the power consumed by the crossbar switch; however, the effect of this increase in power is not as significant as the corresponding increase in the latency of the network. A three-dimensional network can therefore reduce power even in small networks. The power savings achieved with this topology is greater in larger networks. This situation occurs because the reduction in the average number of hops for a three-dimensional network increases for larger network sizes. With the 3-D IC – 2-D NoC topology, the number of hops in the network is the same as for a two-dimensional network. The horizontal buss length, however, is shortened by implementing the PEs in more than one physical plane. The greater the number of physical planes that can be integrated in a 3-D system, the larger the power savings; meaning that the optimum value of np for this topology is always nmax regardless of the network size and operating frequency (if temperature is not the target objective). The savings is practically limited by the number of physical planes that can be integrated in a 3-D technology. For this type of NoC, the topology resulting in the maximum speed is identical to the topology minimizing the power consumption, as the key element of either objective originates solely from the shorter buss length. Finally, the 3-D IC – 3-D NoC can achieve the minimum power consumption for a 3-D on-chip network by properly adjusting n3 and np depending upon the interconnect impedance characteristics, the available number of physical planes, and the clock frequency of the network. Interestingly, when the power metric described in (5.30) is utilized, the topologies that minimize the total power are different, as illustrated in Fig. 5.6. The distribution of the network nodes within those topologies also changes. The total power of a network-based 3-D system is plotted in Fig. 5.6, where the clock frequency of the network is fc = 1 GHz and the area of the PE is APE = 0.64 mm2 and APE = 2.25 mm2. The clock frequency of the PE is equal to fc in this case. A common characteristic of Fig. 5.6a, b is that for larger network sizes, the topology that produces the lowest power dissipation changes from the 3-D IC – 2-D NoC to the 2-D IC – 3-D NoC topology (for this specific example, the 3-D IC – 3-D NoC coincides with either of these topologies). For small networks and PE area (see Fig. 5.6a), the reduction in power originates from the shorter buss length of the network and the shorter interconnects within the PEs, since the PEs are implemented in multiple planes.
[Fig. 5.6 Total power consumed by a PE and the adjacent network busses according to (5.30) with delay constraints (fc = 1 GHz) for several network sizes, comparing 2D ICs – 2D NoCs, 2D ICs – 3D NoCs, 3D ICs – 2D NoCs, and 3D ICs – 3D NoCs, where (a) APE = 0.64 mm2 and (b) APE = 2.25 mm2 for fPE = 1 GHz.]
As the network size increases, however, the number of hops increases considerably, making the power dissipated by the network the dominant power component consumed by the system. Consequently, the 3-D IC – 2-D NoC does not offer the maximum power savings; the maximum savings is now achieved with the 2-D IC – 3-D NoC topology. If the PE is larger, the network size at which the optimum topology changes increases. This behavior occurs since larger PEs include a greater number of gates, leading to additional, longer interconnections within the PEs, as described by the interconnect distribution model presented in Section 5.4. The greater number and length of the wires within a PE are the primary power component of the entire system. The 3-D IC – 2-D NoC topology therefore offers a greater improvement for even larger network sizes before the power caused by the increasing number of hops starts to dominate. This behavior occurs since the 3-D IC – 2-D NoC topology reduces the length of the interconnects within the PEs in addition to the length of the network busses. Another interesting result is that the clock frequency of the PE fPE affects the overall power dissipation, a factor typically ignored when evaluating the performance of a network-based integrated system. In Fig. 5.7, fPE increases from 1 to 3 GHz. This increase has a profound effect on the overall power of the system. To satisfy this aggressive timing specification while limiting the interconnect power consumption, the 3-D IC – 2-D NoC topology exhibits the best results for most of the network sizes depicted in Fig. 5.7a. Note that this behavior is more pronounced for larger PE areas, as depicted in Fig. 5.7b, where the 3-D IC – 2-D NoC topology performs better than the 2-D IC – 3-D NoC topology for any network size. Furthermore, the 3-D IC – 3-D NoC topology can lead to the lowest power consumption with appropriate adjustment of the parameters n3 and np. To demonstrate the distribution of the network nodes within the three physical dimensions, in addition to the effect on the topology, the node distribution is listed in Table 5.2 for specific network and PE characteristics. From Table 5.2, both the operating frequency and the area of the PEs affect the distribution of nodes within the NoC.
[Fig. 5.7 Total power consumed by a PE and the adjacent network busses according to (5.30) with delay constraints (fc = 1 GHz) for several network sizes, comparing 2D ICs – 2D NoCs, 2D ICs – 3D NoCs, 3D ICs – 2D NoCs, and 3D ICs – 3D NoCs, where (a) APE = 0.64 mm2 and (b) APE = 2.25 mm2 for fPE = 3 GHz.]
Table 5.2 Node distribution for minimum power consumption of different network sizes

             fPE = 1 GHz                              fPE = 3 GHz
     APE = 0.64 mm2      APE = 2.25 mm2      APE = 0.64 mm2      APE = 2.25 mm2
N    n1  n2  n3  np      n1  n2  n3  np      n1  n2  n3  np      n1  n2  n3  np
16    4   4   1   8       4   4   1   8       4   4   1   8       4   4   1   8
32    8   4   1   8       8   4   1   8       8   4   1   8       8   4   1   8
64    8   8   1   8       8   8   1   8       8   8   1   8       8   8   1   8
128   4   4   8   1      16   8   1   8      16   8   1   8      16   8   1   8
256   8   4   8   1      16  16   1   8      16  16   1   8      16  16   1   8
512   8   8   8   1       8   8   8   1       8   8   8   1      32  16   1   8
Large PE areas (APE = 2.25 mm2) and high operating frequencies (fPE = 3 GHz) require 3-D NoC topologies where some of the physical planes are used for the PEs (i.e., np > 1). For small PEs (APE = 0.64 mm2) and low operating frequencies (fPE = 1 GHz), a simple 3-D network (i.e., n3 > 1 and np = 1) is typically the best choice. Note that the selection of the optimum topology for either a latency or power objective depends strongly on the interconnect and device characteristics of the specific technology node. Consequently, even for system-level exploratory design, the analysis methodology presented in Section 5.4 provides a first estimate of the behavior of a network-based 3-D system. The related temperature rise for these 3-D topologies, which is another design objective for this type of integrated system, is discussed in the following section.
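The node distributions of Table 5.2 are the outcome of an exhaustive comparison of candidate topologies. A sketch of such a search is shown below; the cost function is a stand-in for (5.30) evaluated with the full delay, power, and thermal models, so only the enumeration of the candidates, not the numbers it returns, reflects the chapter's methodology.

```python
# Enumerate node distributions (n1, n2, n3) with n1*n2*n3 = N and planes per PE
# np with n3*np <= nmax, and keep the lowest-power candidate. The cost model is a
# placeholder; in the chapter it is (5.30) with the full physical models.

from itertools import product

def candidate_topologies(n_nodes, n_max):
    for n1, n2, n3 in product(range(1, n_nodes + 1), repeat=3):
        if n1 * n2 * n3 != n_nodes:
            continue
        for n_p in range(1, n_max + 1):
            if n3 * n_p <= n_max:
                yield n1, n2, n3, n_p

def best_topology(n_nodes, n_max, total_power):
    return min(candidate_topologies(n_nodes, n_max),
               key=lambda cand: total_power(*cand))

def toy_power(n1, n2, n3, n_p):
    # placeholder: network power grows with the network radix, PE power shrinks
    # when a PE is folded onto more planes
    return 0.02 * (n1 + n2 + n3) + 1.0 / n_p

print(best_topology(64, 8, toy_power))
```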
5.5.3 Temperature in 3-D NoCs

Elevated temperatures are expected to become an important challenge in vertical integration, specifically where several high-performance circuits form a multi-plane
integrated system. The increased power densities per unit volume can potentially increase the operating temperature of the system to prohibitive levels, greatly affecting the performance characteristics and severely degrading the reliability of this system. Consequently, the temperature rise resulting from these 3-D topologies is of primary interest. Based on the methodology described in Section 5.4, the temperature of the substrate and each metal layer within a physical plane is determined assuming a one-dimensional flow of heat towards the heat sink. The heat sink is assumed to be attached to the lowest plane within the 3-D stack. The change in the temperature rise due to the different 3-D topologies, the area and operating frequency of the PEs, and the number of physical planes is discussed in this section. Considering the 3-D NoC topologies discussed in this chapter, the 2-D IC – 3-D NoC topology results in higher temperatures as compared to the 3-D IC – 2-D NoC topology, since the former topology simply stacks the PEs, while the latter topology utilizes more than one plane to implement a PE. The 2-D IC – 3-D NoC topology leads to higher temperatures for two reasons. First, several PEs, determined by n3, are stacked adjacently in the vertical direction. Consequently, the power density generated by both the devices and the metal layers increases. In addition, each of these PEs is implemented in one physical plane (i.e., np = 1) and, hence, no reduction in power density is possible. Alternatively, the 3-D IC – 2-D NoC topology utilizes more than one plane for each PE, reducing the interconnect load capacitance and, consequently, the temperature within the 3-D system. The temperature rise resulting from the different 3-D topologies is illustrated in Fig. 5.8 for different numbers of planes. These temperatures correspond to the temperature rise at the topmost metal layer of the uppermost physical plane over the nominal temperature (here assumed to be 27°C). From Fig. 5.8, as the number of planes increases, the temperature naturally increases for the 2-D IC – 3-D NoC topology (for this topology np = 1).
[Fig. 5.8 Temperature rise within the 3-D topologies for different combinations of n3 and np (curves for np = 1, 2, 4, and 8). For all of the topologies, n3 × np = n, where n is the number of planes within the 3-D stack. A maximum number of planes nmax = 8 is assumed, according to Table 5.1. The clock frequency of the PE is fPE = 1 GHz and the area is (a) APE = 0.64 mm2 and (b) APE = 2.25 mm2.]
[Fig. 5.9 Temperature rise within the 3-D topologies for different combinations of n3 and np (curves for np = 1, 2, 4, and 8). For all of the topologies, n3 × np = n, where n is the number of planes in the 3-D stack. A maximum number of planes nmax = 8 is assumed according to Table 5.1. The area of the PE is APE = 0.64 mm2 and the clock frequency is (a) fPE = 1 GHz and (b) fPE = 3 GHz.]
Alternatively, when some or all of the physical planes are used to implement the PEs, as occurs for the 3-D IC – 3-D NoC and 3-D IC – 2-D NoC topologies, the temperature rise is considerably smaller. Note, for example, that a 3-D system consisting of eight planes exhibits temperatures comparable to another system comprised of only four planes, as long as the former system uses two physical planes for the PE. This behavior is more pronounced for PEs with a larger area, as depicted in Fig. 5.8b. For this case, using more than one plane for the PE significantly reduces the power density per plane. Most importantly, however, the number of metal layers required for a two-plane PE can be smaller [36]. This construction decreases the length and resistance of the vertical thermal path used to remove the heat within the 3-D stack. An increase in temperature also occurs when the operating frequency of the PEs increases, as illustrated in Fig. 5.9. This behavior can be explained by noting that an increase in frequency produces a linear increase in the (dynamic) power consumed by the 3-D system. Note that the temperature rise for higher frequencies within the PEs is comparable to the increase observed for PEs with larger areas. In Fig. 5.8, the area of the PE is almost quadrupled, while in Fig. 5.9 the operating frequency is tripled, resulting in approximately the same rise in temperature. This behavior can be explained as follows. A larger PE includes additional gates that require additional wiring resources. Alternatively, tighter timing constraints can be satisfied, in this example, by increasing the wire pitch. If, in either case, an additional tier is required, the thermal resistance of the heat flow path increases. Additionally, in both cases, the power consumption increases, resulting in higher temperatures. The increase in temperature shown in Figs. 5.8 and 5.9 is for the highest metal layer of the uppermost physical plane within a 3-D system. Although this increase may not be catastrophic, the timing specifications for the PE or the network may not be satisfied if temperature is ignored. To better explain this situation, the metal pitches for the PEs with and without thermal effects are listed in Table 5.3 for the 2-D IC – 3-D NoC topology where n3 = 8.
Table 5.3 Pitch of the interconnect layers for each plane for the 2-D IC – 3-D NoC topology where n3 = 8, APE = 1 mm2, and fPE = 3 GHz. Two cases are considered, where the system operates at the nominal temperature T0 and at temperature T0 + ΔT. At the uppermost plane ΔT = 20.1°C

                 T0                                               T0 + ΔT
Plane   # of    Metal pitch (nm)                          # of    Metal pitch (nm)
        tiers   Tier 1  Tier 2  Tier 3  Tier 4  Tier 5    tiers   Tier 1  Tier 2  Tier 3  Tier 4  Tier 5
1       5       90      270     900     1,440   2,000     5       90      270     900     1,280   2,250
2       5       90      270     900     1,440   2,000     5       90      270     900     1,280   2,500
3       5       90      270     900     1,440   2,000     5       90      270     900     1,280   2,500
4       5       90      270     900     1,440   2,000     5       90      270     900     1,280   2,750
5       5       90      270     900     1,440   2,000     5       90      270     900     1,280   2,750
6       5       90      270     900     1,440   2,000     5       90      270     900     1,280   3,000
7       5       90      270     900     1,440   2,000     5       90      270     900     1,280   3,000
8       5       90      270     900     1,440   2,000     5       90      270     900     1,280   3,000
In columns 2 to 7, thermal effects are not considered in the analysis flow diagram depicted in Fig. 5.3, while in columns 8–13, thermal effects are considered. Note that a different temperature is determined for each tier in a plane according to the flow diagram shown in Fig. 5.3. For the uppermost plane, the maximum rise in temperature is ΔT = 20.1°C. As reported in Table 5.3, neglecting the rise in temperature, particularly in the upper planes, results in a smaller interconnect pitch that is insufficient to satisfy the timing requirements. Another tier (not shown in Table 5.3) should be used for the network and, therefore, separate timing specifications would apply for this tier. The pitch of this global interconnect tier is not determined by the analysis procedure described in Section 5.4; rather, a small pitch is selected to constrain the area allocated to the physical links within the network. The power consumption and related temperature rise also depend upon the switching activity of both the network, αs_noc, and the PEs, αs. The relative magnitude of these two parameters can greatly affect the behavior of a 3-D topology. In these examples, αs_noc = 0.25 and αs = 0.15 have been assumed. These parameters do not affect those traits of the 3-D topologies that improve the performance of a conventional 2-D network, but they can considerably affect the extent to which each of the 3-D topologies can improve a specific design objective.
5.6 Summary

3-D NoCs are a natural evolution of 2-D NoCs, exhibiting superior performance. Several 3-D NoC topologies are discussed. Models for the zero-load latency and power consumed by a network are presented for these 3-D topologies. Expressions for the power dissipation of the entire system, including the PEs, are also provided. A methodology that predicts the distribution of the interconnects within a system based on an on-chip network is extended to accommodate the 3-D nature of the investigated topologies. Thermal effects of the interconnect distribution are also considered in this analysis methodology.
In 3-D NoCs, the minimum latency and power consumption can be achieved by reducing both the number of hops per packet and the length of the communication channels. The topology that best achieves this reduction, however, changes according to the design objective. The network size, speed, and gate count of the PEs, as well as the particular 3-D technology are some important aspects that need to be considered when a 3-D topology is chosen. Selecting a topology that minimizes the power dissipated by an on-chip network does not necessarily guarantee that the power dissipated by the overall system will be minimized. Consequently, the analysis methodology described in this chapter can be a useful tool for exploring early in the design cycle the topological and architectural choices of a 3-D NoC-based system.
References

1. G. De Micheli and L. Benini, Networks on Chips: Technology and Tools, Morgan Kaufmann, San Francisco, CA, 2006.
2. A. Jantsch and H. Tenhunen, Networks on Chip, Kluwer Academic, San Francisco, CA, 2003.
3. M. Millberg et al., "The Nostrum Backbone—A Communication Protocol Stack for Networks on Chip," Proceedings of the IEEE International Conference on VLSI Design, pp. 693–696, January 2004.
4. J. M. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, San Francisco, CA, 2003.
5. D. Park et al., "MIRA: A Multi-Layered On-Chip Interconnect Router Architecture," Proceedings of the IEEE International Symposium on Computer Architecture, pp. 251–261, June 2008.
6. C. Addo-Quaye, "Thermal-Aware Mapping and Placement for 3-D NoC Designs," Proceedings of the IEEE International System-on-Chip Conference, pp. 25–28, September 2005.
7. F. Li et al., "Design and Management of 3D Chip Multiprocessors Using Network-in-Memory," Proceedings of the IEEE International Symposium on Computer Architecture, pp. 130–142, June 2006.
8. V. F. Pavlidis and E. G. Friedman, "Three-Dimensional (3-D) Topologies for Networks-on-Chip," Proceedings of the IEEE International System-on-Chip Conference, pp. 285–288, September 2006.
9. C. Seiculescu, S. Murali, L. Benini, and G. De Micheli, "SunFloor 3D: A Tool for Networks on Chip Topology Synthesis for 3D Systems on Chips," ACM/IEEE Design, Automation and Test in Europe Conference and Exhibition, pp. 9–14, April 2009.
10. W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, San Francisco, CA, 2004.
11. L.-S. Peh and W. J. Dally, "A Delay Model for Router Microarchitectures," IEEE Micro, Vol. 21, No. 1, pp. 26–34, January/February 2001.
12. T. Sakurai, "Closed-Form Expressions for Interconnection Delay, Coupling, and Crosstalk in VLSI's," IEEE Transactions on Electron Devices, Vol. 40, No. 1, pp. 118–124, January 1993.
13. K. A. Bowman et al., "A Physical Alpha-Power Law MOSFET Model," IEEE Journal of Solid-State Circuits, Vol. 34, No. 10, pp. 1410–1414, October 1999.
14. S. L. Garverick and C. G. Sodini, "A Simple Model for Scaled MOS Transistors that Includes Field-Dependent Mobility," IEEE Journal of Solid-State Circuits, Vol. SC-22, No. 2, pp. 111–114, February 1987.
15. The International Technology Roadmap for Semiconductors Reports, 2009 [Online]. Available: http://www.itrs.net/Links/2008ITRS/Home2008.htm
16. Predictive Technology Model [Online]. Available: http://www.eas.asu.edu/~ptm
17. W. Zhao and Y. Cao, "New Generation of Predictive Technology Model for Sub-45 nm Design Exploration," Proceedings of the IEEE International Symposium on Quality Electronic Design, pp. 585–590, March 2006.
18. T. Sakurai and A. R. Newton, "Alpha-Power Law MOSFET Model and Its Applications to CMOS Inverter Delay and Other Formulas," IEEE Journal of Solid-State Circuits, Vol. 25, No. 2, pp. 584–594, April 1990.
19. G. Chen and E. G. Friedman, "Low-Power Repeaters Driving RC and RLC Interconnects with Delay and Bandwidth Constraints," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 12, No. 2, pp. 161–172, February 2006.
20. X. Xi et al., BSIM4.5.0 MOSFET Model User's Manual, University of California, Berkeley, CA, 2004.
21. H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, Reading, MA, 1990.
22. Y. I. Ismail, E. G. Friedman, and J. L. Neves, "Equivalent Elmore Delay for RLC Trees," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 19, No. 1, pp. 83–97, January 2000.
23. Y. I. Ismail, E. G. Friedman, and J. L. Neves, "Figures of Merit to Characterize the Importance of On-Chip Inductance," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 7, No. 4, pp. 442–449, December 1999.
24. V. F. Pavlidis, I. Savidis, and E. G. Friedman, "Clock Distribution Networks for 3-D Integrated Circuits," Proceedings of the IEEE International Conference on Custom Integrated Circuits, pp. 651–654, September 2008.
25. Massachusetts Institute of Technology Lincoln Laboratory, FDSOI Design Guide, Cambridge, 2006.
26. H. Hua et al., "Performance Trends in Three-Dimensional Integrated Circuits," Proceedings of the International IEEE Interconnect Technology Conference, pp. 45–47, June 2006.
27. K. Banerjee and A. Mehrotra, "A Power-Optimal Repeater Insertion Methodology for Global Interconnects in Nanometer Design," IEEE Transactions on Electron Devices, Vol. 49, No. 11, pp. 2001–2007, November 2002.
28. H. J. M. Veendrick, "Short-Circuit Dissipation of Static CMOS Circuitry and Its Impact on the Design of Buffer Circuits," IEEE Journal of Solid-State Circuits, Vol. SC-19, No. 4, pp. 468–473, August 1984.
29. K. Nose and T. Sakurai, "Analysis and Future Trend of Short-Circuit Power," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 19, No. 9, pp. 1023–1030, September 2000.
30. G. Chen and E. G. Friedman, "Effective Capacitance of Inductive Interconnects for Short-Circuit Power Analysis," IEEE Transactions on Circuits and Systems I: Brief Papers, Vol. 55, No. 1, pp. 26–30, January 2008.
31. P. R. O'Brien and T. L. Savarino, "Modeling the Driving-Point Characteristic of Resistive Interconnect for Accurate Delay Estimation," Proceedings of the International IEEE/ACM Conference on Computer-Aided Design, pp. 512–515, April 1989.
32. H. Wang, L.-S. Peh, and S. Malik, "Power-Driven Design of Router Microarchitectures in On-Chip Networks," Proceedings of the IEEE International Symposium on Microarchitecture, pp. 105–116, December 2003.
33. V. F. Pavlidis and E. G. Friedman, Three-Dimensional Integrated Circuit Design, Morgan Kaufmann, San Francisco, CA, 2009.
34. V. F. Pavlidis and E. G. Friedman, "Interconnect-Based Design Methodologies for Three-Dimensional Integrated Circuits," Proceedings of the IEEE, Special Issue on 3-D Integration Technology, Vol. 97, No. 1, pp. 123–140, January 2009.
35. J. W. Joyner and J. D. Meindl, "Opportunities for Reduced Power Distribution Using Three-Dimensional Integration," Proceedings of the IEEE International Interconnect Technology Conference, pp. 148–150, June 2002.
36. J. W. Joyner et al., "Impact of Three-Dimensional Architectures on Interconnects in Gigascale Integration," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 9, No. 6, pp. 922–927, December 2000.
37. R. Venkatesan, J. A. Davis, K. A. Bowman, and J. D. Meindl, "Optimal n-tier Multilevel Interconnect Architectures for Gigascale Integration (GSI)," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 9, No. 6, pp. 899–912, December 2001.
38. T.-Y. Chiang, S. J. Souri, C. O. Chui, and K. C. Saraswat, "Thermal Analysis of Heterogeneous 3D ICs with Various Integration Scenarios," Proceedings of the IEEE International Electron Device Meeting, pp. 681–684, December 2001.
39. T.-Y. Chiang, K. Banerjee, and K. C. Saraswat, "Analytical Thermal Model for Multilevel VLSI Interconnects Incorporating Via Effect," IEEE Electron Device Letters, Vol. 23, No. 1, pp. 31–33, January 2002.
40. C. Marcon et al., "Exploring NoC Mapping Strategies: An Energy and Timing Aware Technique," Proceedings of the ACM/IEEE Design, Automation and Test in Europe Conference and Exhibition, Vol. 1, pp. 502–507, March 2005.
41. P. P. Pande et al., "Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect Architectures," IEEE Transactions on Computers, Vol. 54, No. 8, pp. 1025–1039, August 2005.
42. X. Zhao, D. L. Lewis, H.-S. H. Lee, and S. K. Lim, "Pre-Bond Testable Low-Power Clock Tree Design for 3D Stacked ICs," Proceedings of the IEEE/ACM International Conference on Computer Aided Design, pp. 184–190, November 2009.
43. V. F. Pavlidis and E. G. Friedman, "3-D Topologies for Networks-on-Chip," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 15, No. 10, pp. 1081–1090, October 2007.
44. A. H. Ajami, K. Banerjee, and M. Pedram, "Modeling and Analysis of Nonuniform Substrate Temperature Effects on Global ULSI Interconnects," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24, No. 6, pp. 849–861, June 2005.
45. J. C. Ku and Y. Ismail, "Thermal-Aware Methodology for Repeater Insertion in Low-Power VLSI Circuit," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 15, No. 8, pp. 963–970, August 2007.
Chapter 6
Three-Dimensional Networks-on-Chip: Performance Evaluation

Brett Stanley Feero and Partha Pratim Pande
6.1 Introduction

The current trend in System-on-Chip (SoC) design in the ultra deep sub-micron (UDSM) regime and beyond is to integrate a huge number of functional and storage blocks in a single die [1]. The possibility of this enormous degree of integration gives rise to new challenges in designing the interconnection infrastructure for these big SoCs. Extrapolating from the existing CMOS scaling trends, traditional on-chip interconnect systems have been projected to be limited in their ability to meet the performance needs of SoCs at the UDSM technology nodes and beyond [2]. This limit stems primarily from global interconnect delay significantly exceeding gate delays. While copper and low-k dielectrics have been introduced to decrease the global interconnect delay, they only extend the lifetime of conventional interconnect systems by a few technology generations. According to the International Technology Roadmap for Semiconductors (ITRS) [2], for the longer term, material innovation with traditional scaling will no longer satisfy the performance requirements. New interconnect paradigms are needed. Continued progress of interconnect performance will require employing approaches that introduce materials and structures beyond the conventional metal/dielectric system, and one of the promising approaches is 3D integration. Shown in Fig. 6.1, three-dimensional (3D) ICs, which contain multiple layers of active devices, have the potential for enhancing system performance [3–6]. According to [3], three-dimensional ICs allow for performance enhancements even in the absence of scaling. A clear way to reduce the burden of high frequency signal propagation across monolithic ICs is to reduce the line length needed, and this can be done by employing stacking of active devices using 3D interconnects.
B. S. Feero, ARM Inc., 3711 S. Mopac Expy., Austin, TX 78731, USA. e-mail: [email protected]
P. P. Pande, School of Electrical Engineering and Computer Science, Washington State University, PO BOX-642752, Pullman, WA 99164-2752, USA. e-mail: [email protected]
[Fig. 6.1 3D IC from an SOI process: a second device layer stacked above the first device layer on silicon, connected by vertical vias.]
Here, the multiple layers of active devices are separated by a few tens of micrometers. Consequently, 3D interconnects allow communication among these active devices with smaller distances required for signal propagation. Three-dimensional ICs will have a significant impact on the design of multi-core SoCs. Recently, Network-on-Chip (NoC) has emerged as an effective methodology for designing big multi-core SoCs [7, 8]. However, the conventional two-dimensional (2D) IC has limited floor-planning choices and, consequently, limits the performance enhancements arising out of NoC architectures. The performance improvement arising from the architectural advantages of NoCs will be significantly enhanced if 3D ICs are adopted as the basic fabrication methodology. The amalgamation of two emerging paradigms, namely NoCs in a 3D IC environment, allows for the creation of new structures that enable significant performance enhancements over more traditional solutions. With freedom in the third dimension, on-chip network architectures that were impossible or prohibitive due to wiring constraints in planar ICs are now possible [9, 10]. However, 3D ICs are not without limitations. Thermal effects are already impacting interconnect and device reliability in 2D circuits [11]. Due to the reduction of chip size in a 3D implementation, 3D integrated circuits exhibit a profound increase in power density. Consequently, increases in heat dissipation will give rise to circuit degradation and chip cracking, among other side-effects [12]. As a result, there is a real need to keep the temperature low for reliable circuit operation. Furthermore, in ICs implementing NoCs, the interconnect structure dissipates a large percentage of energy. In certain applications [13], this percentage has been shown to approach 50%. As a result, the interconnection network has a significant contribution to the thermal performance of 3D NoCs. This chapter introduces multiple NoC architectures that are enabled through 3D integration. The following sections characterize the performance of these 3D NoC architectures in terms of five metrics: throughput, latency, energy dissipation, silicon area, and thermal profile. Through the application of realistic traffic patterns in cycle-accurate simulation, this chapter demonstrates that three-dimensional integration can facilitate topology choices that dramatically outperform two-dimensional topologies in terms of throughput, latency, energy, and area, and it demonstrates that these improvements begin to mitigate some of the thermal concerns presented by 3D ICs.
6.2 3D NoC Architectures

Enabling design in the vertical dimension permits a large degree of freedom in choosing an on-chip network topology. Due to wire-length constraints and layout complications, the more conventional two-dimensional integrated circuits have placed limitations on the types of network structures that are possible. With the advent of 3D ICs, a wide range of on-chip network structures that were not explored earlier are being considered [9, 10]. This chapter investigates five different topologies in 3D space and compares them with three well-known NoC architectures from 2D implementations. This analysis considers a SoC with a 400 mm2 floor plan and 64 functional IP blocks. This system size was selected to reflect the state of the art of emerging SoCs. At ISSCC 2007, the design of an 80-core processor arranged in an 8 × 10 regular grid built on fundamental NoC concepts was demonstrated [14]. Moreover, Tilera Corp. has recently announced the design of a multi-core platform with 100 cores [15]. Therefore, the system size assumed in this work is representative of the latest trends. IP blocks for a 3D SoC are mapped onto four 10 × 10 mm layers, in order to occupy the same total area as a single-layer, 20 × 20 mm layout.
6.2.1 Mesh-Based Networks

One of the well-known 2D NoC architectures is the 2D Mesh, as shown in Fig. 6.2a. This architecture consists of an m × n mesh of switches interconnecting IP blocks placed along with them. It is known for its regular structure and short inter-switch wires. From this structure, a variety of three-dimensional topologies can be derived. The straightforward extension of this popular planar structure is the 3D Mesh. Figure 6.2b shows an example of a 3D Mesh NoC. It employs 7-port switches: one port to the IP block, one each to the switches above and below, and one in each cardinal direction (North, South, East, and West), as shown in Fig. 6.3a. A second derivation, the 3D Stacked Mesh (Fig. 6.2c), takes advantage of the short inter-layer distances that are characteristic of a 3D IC, which can be around 20 μm [3]. The 3D Stacked Mesh architecture is a hybrid between a packet-switched network and a bus. It integrates multiple layers of 2D Mesh networks by connecting them with a bus spanning the entire vertical distance of the chip. As the distance between the individual 2D layers in a 3D IC is extremely small, the overall length of the bus is also small, making it a suitable choice for communicating in the z-dimension [9]. Furthermore, each bus has only a small number of nodes (i.e., equal to the number of layers of silicon), keeping the overall capacitance on the bus small and greatly simplifying bus arbitration. For consistency with [9], this analysis considers the use of a dynamic, time-division multiple-access (dTDMA) bus, although any other type of bus may be used as well. A switch in a 3D Stacked Mesh network has, at most, 6 ports: one to the IP, one to the bus, and four for the cardinal directions (Fig. 6.3b).
[Fig. 6.2 Mesh-based NoC architectures: a 2D Mesh, b 3D Mesh, c Stacked Mesh, and d Ciliated 3D Mesh, with e the coordinate axes and legend (interconnect, IP block, bus, switch, bus node).]
[Fig. 6.3 Switches for mesh-based NoCs: a 3D Mesh, b Stacked Mesh, and c Ciliated 3D Mesh.]
Additionally, it is possible to utilize ultra-wide buses, similar to the approach introduced in [16], to implement cost-effective, high-bandwidth communication between layers. A third method of constructing a 3D NoC is by adding layers of functional IP blocks and restricting the switches to one layer or a small number of layers, such as in the 3D Ciliated Mesh structure. This structure is essentially a 3D Mesh network with multiple IP blocks per switch. The 3D Ciliated Mesh is a 4 × 4 × 2 3D mesh-based network with 2 IPs per switch, where the two functional IP blocks occupy, more or less, the same footprint but reside at different layers. This is shown in Fig. 6.2d. In a Ciliated 3D Mesh network, each switch contains seven ports (one for each cardinal direction, one either up or down, and one to each of two IP blocks), as shown in Fig. 6.3c. This architecture will clearly exhibit lower overall bandwidth than a complete 3D Mesh due to multiple IP blocks per switch and reduced connectivity; however, Sect. 6.4 will show that this type of network offers an advantage in terms of energy dissipation, especially in the presence of specific traffic patterns.
[Fig. 6.4 Tree architectures: a Butterfly Fat Tree, b SPIN, c 2D BFT floorplan, d 3D BFT floorplan for the first two layers, and e the first three layers of a 3D BFT floorplan in elevation view.]
[Fig. 6.5 Switches for tree networks: a BFT and b SPIN, each with ports to parent and child nodes.]
6.2.2 Tree-Based Networks

Two types of tree-based interconnection networks that have been considered for network-on-chip applications are the Butterfly Fat Tree (BFT) [17, 18] and the generic Fat Tree, or SPIN [19]. Unlike the work with mesh-based NoCs, this chapter does not propose any new topologies for tree-based systems. Instead, it investigates the achievable performance benefits by instantiating already-existing tree-based NoC topologies in a 3D environment. The BFT topology considered is shown in Fig. 6.4a. For a 64-IP SoC, a BFT network will contain 28 switches. Each switch (Fig. 6.5a) in a Butterfly Fat Tree network consists of 6 ports, one to each of four child nodes and two to parent nodes, with the exception of the switches at the topmost layer. When mapped to a 2D structure, the longest inter-switch wire length for a BFT-based NoC is l2DIC/2, where l2DIC is the die length on one side [18, 20]. If the NoC is spread over a 20 × 20 mm die, then the longest inter-switch wire is 10 mm [20], as shown in Fig. 6.4c. Yet, when the same BFT network is mapped onto a four-layer 3D SoC, wire routing becomes simpler, and the longest inter-switch wire length is reduced by at least a factor of two, as can be seen in Fig. 6.4d. This will lead to reduced energy dissipation as well as less area overhead. The fat tree topology of Fig. 6.4b will have the same advantages when mapped onto a 3D IC as the BFT.
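The 28-switch figure quoted above follows directly from the BFT construction assumed here (N/4 switches at the lowest level, halving at each level above, stopping at a top level of four switches); the short check below is illustrative only.

```python
# Switch count of a Butterfly Fat Tree under the construction assumed above.

def bft_switches(num_ips):
    count, level_switches = 0, num_ips // 4   # N/4 switches at the lowest level
    while level_switches >= 4:                # stop at the topmost level of 4 switches
        count += level_switches
        level_switches //= 2                  # each higher level has half as many
    return count

print(bft_switches(64))   # -> 28, as stated for a 64-IP SoC
```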
6.3 Performance Evaluation

6.3.1 Performance Metrics

In order to properly analyze the various 3D network-on-chip topologies, a standard set of metrics must be used [21]. Wormhole routing [22] is assumed as the data
transport mechanism, where the packet is divided into fixed-length flow control units, or flits. The header flit holds the routing and control information. It establishes a path, and subsequent payload or body flits follow that path. This comparative analysis focuses on the four established benchmarks [21] of throughput, latency, energy, and area overhead. Throughput is a metric that quantifies the rate at which message traffic can be sent across a communication fabric. It is defined as the average number of flits arriving per IP block per clock cycle, so the maximum throughput of a system is directly related to the peak data rate that a system can sustain. For purposes of a message-passing system, throughput T is given by the equation

T = [(Total Messages Completed) × (Message Length)] / [(Number of IP Blocks) × (Time)] .  (6.1)
Total Messages Completed is the number of messages that successfully traverse the network from source to destination. Message Length refers to the number of flits a message consists of, and Number of IP Blocks signifies the number of intellectual property units that send data over the network. Time is the length of time in clock cycles between the generation of the first packet and the reception of the last. It can be seen that throughput is measured in flits/IP block/cycle, where a throughput of 1 signifies that every IP block is accepting a flit in each clock cycle. Accordingly, throughput is a measure of the maximum amount of sustainable traffic. Throughput depends on a number of parameters, including the number of links in the architecture, the average hop count, the number of ports per switch, and the injection load. Injection load is measured by the number of flits injected into the network per IP block per cycle. Consequently, it has the same unit as throughput, and an injection load of 1 signifies that every IP block is injecting a flit in each clock cycle. Next, latency refers to the length of time elapsed between the injection of a message header at the source node and the reception of the tail flit at the destination. Latency is defined as the time in clock cycles elapsed from the transfer of the header flit by the source IP to the acceptance of the tail flit by the destination IP block. Latency is characterized by three delays: sender overhead, transport latency, and receiver overhead.
Li = Lsender + Ltransport + Lreceiver .    (6.2)
Flits must traverse a network while traveling from source to destination. With different routing algorithms and switch architectures, each packet will experience a unique latency. As a result, network topologies will be compared by average latency. Let P be the number of packets received in a given time period, and let Li be the latency of the ith packet. Average latency is therefore given by the equation:
Lavg = (1/P) · Σ_{i=1..P} Li .    (6.3)
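As an illustration of how these two metrics can be extracted from a simulation trace, the short sketch below computes T from (6.1) and Lavg from (6.3). It is only a hedged example: the packet-record fields and the numbers used are assumptions for illustration, not part of the simulator used in this chapter.

```python
# Minimal sketch: throughput (6.1) and average latency (6.3) from a trace.
# The Packet fields below are illustrative assumptions, not the actual
# data structures of the cycle-accurate simulator used in this chapter.
from dataclasses import dataclass

@dataclass
class Packet:
    inject_cycle: int    # cycle the header flit left the source IP
    accept_cycle: int    # cycle the tail flit reached the destination IP
    length_flits: int    # message length in flits

def throughput(packets, num_ip_blocks, sim_cycles):
    """Average accepted flits per IP block per clock cycle, Eq. (6.1)."""
    total_flits = sum(p.length_flits for p in packets)
    return total_flits / (num_ip_blocks * sim_cycles)

def average_latency(packets):
    """Mean source-to-destination latency in clock cycles, Eq. (6.3)."""
    return sum(p.accept_cycle - p.inject_cycle for p in packets) / len(packets)

# Toy usage for a 64-IP network observed over 100,000 cycles:
trace = [Packet(10, 42, 8), Packet(12, 60, 8), Packet(15, 51, 8)]
print(throughput(trace, num_ip_blocks=64, sim_cycles=100_000))
print(average_latency(trace))
```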
Additionally, the transport of messages across a network leads to a quantifiable amount of energy dissipation. Activity in the logic gates of the network switches, as well as the charging and discharging of interconnection wires, leads to energy consumption. The analysis in this chapter examines two types of energy: cycle energy and packet energy. Cycle energy is defined as the amount of energy (in Joules) dissipated by the entire network in one clock cycle. Packet energy, on the other hand, is defined as the amount of energy incurred by a single packet as it traverses the network from source to destination over many clock cycles. It will be shown that each of these quantities reveals unique information about the behavior of the different network architectures. Lastly, the amount of silicon area used by an interconnection network is a necessary consideration. As the network switches form an integral part of the infrastructure, it is important to determine the relative silicon area they consume. Additionally, the area overhead arising from layer-to-layer vias, inter-switch wires, and the buffers required by relatively long wires needs to be considered. The evaluation of area in this chapter includes each of these forms of overhead.
6.3.2 Performance Analysis of 3D Mesh-Based NoCs

Here, the performance of the 3D mesh-based NoC architectures is analyzed in terms of the parameters mentioned above: throughput, latency, energy dissipation, and area overhead. Throughput is given in the number of accepted flits per IP per cycle. This metric, therefore, is closely related to the maximum amount of sustainable traffic in a certain network type. Any throughput improvements in 3D networks are principally related to two factors: the number of physical links and the average number of hops. In general, for a mesh-based NoC, the number of links is given as follows:
links = N1·N2·(N3 − 1) + N1·N3·(N2 − 1) + N2·N3·(N1 − 1) ,    (6.4)
where Ni represents the number of switches in the ith dimension. For instance, in an 8 × 8 2D Mesh NoC, this yields 112 links. In a 4 × 4 × 4 3D Mesh NoC, the number of links is 144. With a greater number of links, a 3D Mesh network, for example, is able to contain a greater number of flits and therefore transmit a greater number of messages. However, considering only the number of links will not characterize the overall throughput of a network. The average hop count also has a definitive effect on throughput. A lower average hop count will also allow more flits to be transmitted through the network. With a lower hop count, a wormhole-routed packet will utilize fewer links, thus leaving more room to increase the maximum sustainable traffic. It is important to note that hop count is also very application-dependent. For instance, if a particular application produces more localized traffic, where the majority of traffic is between source and destination nodes that are spatially close, the average hop count will be reduced. It is easier to first approach average hop count by considering a uniform spatial traffic distribution. The case of localized traffic is discussed in detail in Sect. 6.3.7. Following [10], assuming a uniform spatial traffic distribution, the average number of hops in a mesh-based NoC is given by
hopsMesh = [n1·n2·n3·(n1 + n2 + n3) − n3·(n1 + n2) − n1·n2] / [3·(n1·n2·n3 − 1)] ,    (6.5)
where ni is the number of nodes in the ith dimension. This equation applies both to the 4 × 4 × 4 3D Mesh and the 4 × 4 × 2 3D Ciliated Mesh networks. The number of hops for the 3D Stacked Mesh is equal to
hopsStacked = (n1 + n2)/3 + (n3 − 1)/n3 .    (6.6)
For the 4 × 4 × 4 3D Mesh and the 8 × 8 2D Mesh, the average hop counts are 3.81 and 5.33, respectively. There are 40% more hops in the 2D Mesh than in the 3D Mesh. Consequently, flits in the 3D Mesh traverse fewer stages between source and destination than in the 2D counterpart, and a corresponding increase in throughput is expected. Transport latency, like throughput, is also affected by the average hop count. The number of links and the injection load also affect it heavily. In 3D architectures, a decrease in latency is expected due to the lower hop count and the increased number of links. In the System-on-Chip realm, the energy dissipation characteristics of the interconnect structures are crucial, as the interconnect fabric can consume a significant portion of the overall energy budget [13]. The energy dissipation in a NoC depends on the energy dissipated by the switch blocks and the inter-switch wire segments. Both of these factors depend on the network architecture. Additionally, the injection load has a significant contribution, as it is the cause of any activity in the switches and inter-switch wires. Intuitively, it is clear that with more packets traversing the network, power will increase. This is why packet energy is an important attribute for characterizing NoC structures. The energy dissipated per flit per hop is given by
Ehop = Eswitch + Ewire ,    (6.7)
where Eswitch and Ewire are the energy dissipated by each switch and by each inter-switch wire segment, respectively. The energy of a packet of length n flits that completes h hops is given by
Epacket = n · Σ_{j=1..h} Ehop,j .    (6.8)
From this, a formula for the average packet energy can be derived. If P packets are transmitted, then the average energy dissipated per packet is given as
Epacket = (Σ_{i=1..P} Epacket,i) / P = (Σ_{i=1..P} ni · Σ_{j=1..hi} Ehop,j) / P .    (6.9)
Now, it is clear that a strong correlation exists between packet energy and the number of hops from source to destination. Consequently, a network topology that exhibits smaller hop counts will also exhibit correspondingly lower packet energy. As all 3D mesh-based NoC architectures exhibit a lower hop count, they should also dissipate less energy per packet. Lastly, the area overhead for mesh-based NoCs must be established. Area overhead for a NoC includes switch overhead and wiring overhead. Switch area is affected by the overall number of switches and by the area per switch, which is highly correlated with the number of ports. Since the switches in all 3D mesh-based NoCs have more ports, the area per switch will increase. However, wire overhead is reduced when moving to a 3D IC. In the case of mesh-based NoCs, this is not due to reductions in the length of most inter-switch wires. Horizontal wire length is given by lIC/nside, where nside represents the number of IPs in one dimension of the IC and lIC is the die length on one side, as shown earlier in Fig. 6.2a, b. For the 8 × 8 2D Mesh, this evaluates to 20 mm/8, or 2.5 mm, and for all 3D mesh-based architectures, the expression evaluates to 10 mm/4, also 2.5 mm. With this in mind, the reductions in wire overhead come from the interlayer wires. The 3D structures have a reduced number of horizontal links due to the presence of interlayer wires. These interlayer wires are very short and hence are the source of the wire overhead savings in mesh-based 3D NoCs. These effects are quantified in Sect. 6.4.4.
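As a quick cross-check of the link-count and hop-count arithmetic above, the following sketch evaluates (6.4), (6.5), and (6.6) for the topologies discussed in this section; it simply restates the formulas in executable form.

```python
# Minimal sketch: link count (6.4) and average hop counts (6.5)/(6.6).
def mesh_links(n1, n2, n3):
    """Inter-switch links in an n1 x n2 x n3 mesh, Eq. (6.4)."""
    return n1*n2*(n3 - 1) + n1*n3*(n2 - 1) + n2*n3*(n1 - 1)

def mesh_avg_hops(n1, n2, n3):
    """Average hop count of a mesh under uniform traffic, Eq. (6.5)."""
    n = n1 * n2 * n3
    return (n*(n1 + n2 + n3) - n3*(n1 + n2) - n1*n2) / (3*(n - 1))

def stacked_mesh_avg_hops(n1, n2, n3):
    """Average hop count of the bus-stacked mesh, Eq. (6.6)."""
    return (n1 + n2)/3 + (n3 - 1)/n3

print(mesh_links(8, 8, 1), mesh_links(4, 4, 4))     # 112 and 144 links
print(round(mesh_avg_hops(8, 8, 1), 2))             # 5.33 hops, 2D 8x8 Mesh
print(round(mesh_avg_hops(4, 4, 4), 2))             # 3.81 hops, 3D 4x4x4 Mesh
print(round(stacked_mesh_avg_hops(4, 4, 4), 2))     # 3.42 hops, 3D Stacked Mesh
```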
6.3.3 Performance Analysis of 3D Tree-Based NoCs

Unlike the previous discussion pertaining to mesh-based NoCs, the tree-based networks considered for 3D implementation have topologies identical to their 2D counterparts. The only variable is the inter-switch wire length. As a result, there are significant improvements both in terms of energy and area overhead. In 2D space, the longest inter-switch wire length in a BFT or SPIN network is equal to l2DIC/2 [18, 20], where l2DIC is the die length on one side. This inter-switch wire length corresponds to the top-most level of the tree. In a 3D IC, however, this changes significantly. For instance, as shown in Fig. 6.4d, e, the longest wire length for a 3D tree-based NoC is equal to the length of horizontal travel plus the length of the vertical via. Considering a 20 × 20 mm 2D die, the longest inter-switch wire length is equal to 10 mm, whereas with a 10 × 10 mm stack of four layers, the maximum wire length is equal to the sum of l3DIC/4, or 2.5 mm, and the span of two layers, 40 μm. This is almost a factor-of-4 reduction compared to 2D implementations. Similarly, mid-level wire lengths are reduced by a factor of 2. This reduction in wire length, shown in Table 6.1, causes a significant reduction in energy.
Table 6.1 Inter-switch wire lengths in 3D tree-based NoCs

Tree level    2D NoC             4-layer 3D NoC
1st Level     ≤ l/8 = 2.5 mm     ≤ l/4 = 2.5 mm
2nd Level     l/4 = 5 mm         l/4 = 2.5 mm
3rd Level     l/2 = 10 mm        l/4 = 2.5 mm
In addition to benefits in terms of energy, 3D ICs effect area improvements for tree-based NoCs. Again, as with energy, the area gains pertain only to the inter-switch wire segments; there is neither a change in the number of switches nor in the design of the switch. As with the 3D mesh-based NoCs, wire overhead in a 3D tree-based NoC consists of the horizontal wiring plus the area incurred by the vertical wires and vias. Also, the longer inter-switch wires, which are characteristic of 2D tree-based NoCs, require repeaters, and this is taken into account. For a Butterfly Fat Tree, the number of wires in an arbitrary tree level l, as defined in [17], is
wireslayer l = wlink · N / 2^(l−1) ,    (6.10)
where N is the number of IP blocks and wlink is the link width in bits. For a generic Fat Tree, the number of wires in a tree level l is given by
wireslayer l = wlink · N .    (6.11)
For instance, in a 64-IP BFT network with 32-bit wide bi-directional inter-switch links, there are 2,048 wires in the first level, 1,024 wires in the second level, and 512 wires in the third. Similarly, a 64-IP Fat Tree will have 2,048 wires in every level.
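The wire counts quoted above follow directly from (6.10) and (6.11); the sketch below reproduces them. The function names are illustrative only.

```python
# Minimal sketch: wires per tree level for BFT (6.10) and Fat Tree (6.11).
def bft_wires(level, num_ips, link_width_bits):
    """Butterfly Fat Tree: wires in tree level `level`, Eq. (6.10)."""
    return link_width_bits * num_ips // 2**(level - 1)

def fat_tree_wires(level, num_ips, link_width_bits):
    """Generic Fat Tree (SPIN): wires in tree level `level`, Eq. (6.11)."""
    return link_width_bits * num_ips

# 64-IP network with 32-bit wide bi-directional inter-switch links:
print([bft_wires(l, 64, 32) for l in (1, 2, 3)])       # [2048, 1024, 512]
print([fat_tree_wires(l, 64, 32) for l in (1, 2, 3)])  # [2048, 2048, 2048]
```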
6.3.4 Simulation Methodology

To model the performance of the different NoC structures, a cycle-accurate network simulator is employed that can also simulate dTDMA buses. The simulator is flit-driven, uses wormhole routing, and assumes a self-similar injection process [21–24]. This type of traffic has been observed in MPEG-2 video applications [25], as well as in various other networking applications [24], and it has been shown to closely model real traffic [25]. In terms of spatial distribution, the simulator is capable of producing both uniform and localized traffic patterns for injected packets. In order to acquire energy and area characteristics, the network switches, the dTDMA arbiter, and the FIFO buffers were modeled in VHDL. The network switches were designed in such a way that their delay can be constrained within the limit of one clock cycle. The clock cycle is assumed to be equal to 15FO4 (fan-out-of-4) delay units. With a 90 nm standard cell library from CMP [26], this corresponds to a clock frequency of 1.67 GHz. As the switches were designed with differing numbers of ports, their delays vary with one another.
Table 6.2 Wire delays

Wire type                 Wire length   Delay (ps)    Architectures used
Interlayer                20 µm         16            All 3D mesh-based
Vertical bus              60 µm         110/450 (a)   3D Stacked Mesh
Horizontal                2.5 mm        219           Mesh-based, 2D tree-based
Horizontal + Interlayer   2.54 mm       231           All 3D tree-based
Horizontal                5 mm          436 (b)       Mid-level in all 2D tree-based
Horizontal                10 mm         550 (b)       Top-level in all 2D tree-based

(a) Bus arbitration included. (b) Repeaters necessary.
However, it was important to ensure that all the delay numbers were kept within the 15FO4 timing constraint. Consistent with [20], the longest delays were in the 2D/3D Fat Tree switches, as they have the highest number of ports. Yet even these can be run with a clock period of 11FO4, well within the 15FO4 limit. To provide a consistent comparison, all the switches were run with a 15FO4 clock. Similarly, all inter-switch wire delays must be held within the same constraint. As shown in Table 6.2, the wire RC delays remain within the clock period of 600 ps [26]. For the Stacked Mesh, even considering the bus arbitration, the delay is constrained within one clock cycle. For the vertical interconnects, the via resistance and capacitance are included in the analysis. As such, all network architectures are able to run at the same clock frequency of 1.67 GHz. Additional architectural parameters for each topology are shown in Table 6.3. Each switch was designed with 4 virtual channels per port and 2-flit-deep virtual channel buffers, as discussed in [21]. Synopsys Design Vision was used to synthesize the hardware description, and Synopsys PrimePower was used to gather energy dissipation statistics. To calculate Eswitch and Ewire from (6.7), the methodology discussed in [21] is followed. The energy dissipated by each switch, Eswitch, is determined by running its gate-level netlist through Synopsys PrimePower using large sets of input data patterns. In order to determine the interconnect energy, Einterconnect, the interconnect capacitance is estimated, taking into account each inter-switch wire's specific layout, by the following expression [21]:
Cinterconnect = Cwire · wa+1;a + n · m · (CG + CJ) ,    (6.12)

Table 6.3 Architectural parameters

Topology           Port count              Switch area (mm2)   Switch static energy (pJ)   Longest wire delay (ps)
2D Mesh            5                       0.0924              65.3                        219
3D Mesh            7                       0.1385              91.4                        219
3D Stacked Mesh    6 (+ bus arbitration)   0.1225              81.3                        219
Ciliated 3D Mesh   7                       0.1346              91.2                        219
2D BFT             6                       0.1155              78.3                        550
3D BFT             6                       0.1155              78.3                        231
2D Fat Tree        8                       0.1616              104.5                       550
3D Fat Tree        8                       0.1616              104.5                       231
where Cwire represents the capacitance per unit length of the wire, wa+1;a is the wire length between two consecutive switches, n is the number of repeaters, m represents the size of those repeaters with respect to minimum-size devices, and lastly, CG and CJ represent the gate and junction capacitance, respectively, of a minimum-size inverter. While determining Cwire, the worst-case scenario is considered, where adjacent wires switch in opposite directions [27]. The simulation was initially run for 10,000 cycles to allow the 64-IP network to stabilize, and it was subsequently run for 100,000 more cycles, reporting statistics for energy, throughput, and latency.
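The following sketch shows the bookkeeping behind (6.12) for a single inter-switch wire, together with the usual switching-energy estimate E = a·C·Vdd². All numerical values (per-mm wire capacitance, repeater count and sizing, device capacitances, supply voltage, activity factor) are placeholder assumptions for illustration, not figures from the 90 nm library used in this chapter.

```python
# Minimal sketch of Eq. (6.12): lumped capacitance of one inter-switch wire.
# Every constant below is an illustrative placeholder, not library data.
def interconnect_capacitance(c_wire_per_mm, length_mm, n_repeaters,
                             repeater_size, c_gate_min, c_junction_min):
    """C = Cwire * w(a+1;a) + n * m * (CG + CJ), in Farads."""
    return (c_wire_per_mm * length_mm
            + n_repeaters * repeater_size * (c_gate_min + c_junction_min))

def wire_energy(capacitance_f, vdd=1.0, activity=0.5):
    """Approximate switching energy per transfer, E = a * C * Vdd^2."""
    return activity * capacitance_f * vdd**2

# Hypothetical 2.5 mm horizontal wire driven through 2 repeaters of 40x
# minimum size, with an assumed 0.2 pF/mm wire capacitance:
c = interconnect_capacitance(c_wire_per_mm=0.2e-12, length_mm=2.5,
                             n_repeaters=2, repeater_size=40,
                             c_gate_min=1e-15, c_junction_min=1e-15)
print(c, wire_energy(c))
```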
6.3.5 Experimental Results for Mesh-Based Networks

This chapter first considers the performance of the 3D mesh-based NoC architectures. Figure 6.6a shows the variation of throughput as a function of the injection load. A network cannot accept more traffic than is supplied, and limitations in routing and collisions cause saturation before throughput reaches unity. From Fig. 6.6a, it is clear that both the 3D Mesh and Stacked Mesh topologies exhibit throughput improvements over their two-dimensional counterparts. It is also clear that the Ciliated 3D Mesh network shows only a small throughput improvement. However, this is not where a ciliated structure exhibits its best performance; it will be shown later that this topology has significant benefits both in terms of energy dissipation and silicon area. These results coincide with the analysis of a 3D mesh-based NoC provided earlier. Equation (6.4) shows that a 3D mesh has 29% more interconnection links than a 2D version; the hop count calculations have shown that a flit in a 2D mesh network traverses 40% more hops than a flit navigating a 3D mesh (see Table 6.4); and 3D mesh switches have higher connectivity due to the increased number of ports. These all account for the throughput improvements. In general, the lower hop count allows a packet to occupy fewer resources, freeing up links for additional packets. Consequently, there is a corresponding increase in throughput. Next, the 3D Stacked Mesh architecture is considered. An increase in throughput is evident, as shown in Fig. 6.6a. However, with a 32-bit bus (corresponding to the flit width) connecting the layers of the NoC, the throughput improvements are not as substantial as with the 3D Mesh. Contention in the bus limits the attainable performance gains. Yet, since communication between layers is bus-based, the bus width is easily increased to 128 bits, without modifying the switch architectures, in order to limit contention. Any further increase does not have a significant impact on throughput, except to increase the capacitance on the bus. With this improvement, the 3D Stacked Mesh saturates at a slightly higher injection load than a 3D Mesh network. The 3D Stacked Mesh topology also offers a lower hop count in comparison to a strict 3D Mesh; from (6.6), the average hop count is equal to 3.42. With the lower hop count in addition to the wide, 128-bit bus for vertical transmission, this architecture offers the highest throughput among all the 3D mesh-based networks.
[Fig. 6.6 Experimental results for mesh-based NoCs: a Throughput vs. injection load, b Latency vs. injection load, c Cycle energy vs. injection load, and d Packet energy]
Table 6.4 Average hop count in mesh-based NoCs

2D Mesh            5.33
3D Mesh            3.81
Stacked Mesh       3.42
Ciliated 3D Mesh   3.10
The throughput characteristics of the Ciliated 3D Mesh topology differ significantly from the other 3D networks. This network has a saturating throughput that is slightly higher than a 2D Mesh network and considerably less than both the 3D Mesh and Stacked Mesh networks, despite having the lowest hop count at an average of 3.10 hops. However, with only 64 inter-switch links, compared to 144 in the 3D Mesh and 112 in the 2D Mesh, the throughput improvements due to hop count are negated by the reduced number of links. The fact that there are multiple functional IP blocks for every switch is also responsible for the considerably lower throughput, due to contention in the switches. Figure 6.6b depicts the latencies for the architectures under consideration. Here, it is seen that 3D mesh-based NoCs have superior latency characteristics over the 2D versions. This is a product of the reduced hop count characteristic of 3D mesh-based topologies. The energy dissipation characteristics of three-dimensional mesh-based NoCs reveal a substantial improvement over planar NoCs. The energy dissipation profiles of the mesh-based NoC architectures under consideration are shown in Fig. 6.6c. Energy dissipation is largely dependent on two factors: architecture and injection load. These two parameters are considered as independent factors in this analysis. As shown in (6.7), the energy dissipation in a NoC depends on the energy dissipated by the switch blocks and the inter-switch wire segments. Both of these factors depend on the architecture: the design of the switch varies with the architecture, and the inter-switch wire length is also architecture-dependent [21]. Besides the network architecture, the injection load has a clear effect on the total energy dissipation of a NoC, in accordance with Fig. 6.6c. Intuitively, it is clear that with more packets traversing the network, power will increase. This is why packet energy, in Fig. 6.6d, is an important attribute for characterizing NoC structures. Notice that, at saturation, a 2D Mesh network dissipates less power than both the 3D Stacked Mesh and 3D Mesh networks. This is the result of the lower 2D Mesh throughput: the 3D networks consume more energy because they transmit more flits at saturation. Packet energy is a more accurate representation of the cost of data transmission. With packet energy in mind, it can be seen that every 3D topology provides a very substantial improvement over a 2D Mesh. Also, the energy dissipation of the ciliated mesh topology is lower still than that of the 3D Mesh network. These results follow closely the hop count calculations summarized in Table 6.4, with the exception of the packet energy for a 3D Stacked Mesh network. Packet energy is heavily dependent on interconnect energy, and this is where the 3D Stacked Mesh suffers: since vertical communication takes place through wide busses, the capacitive loading on those busses results in a significant amount of energy dissipation.
[Fig. 6.7 Area overhead for mesh-based NoCs: switch and wiring area as a percentage of SoC area, per topology]
As a result, though the 3D Stacked Mesh has a lower hop count than the 3D Mesh, it dissipates more packet energy on average. Regardless, the profound energy savings possible in these 3D architectures provide serious motivation for a SoC designer to consider a three-dimensional integrated circuit. The final performance metric considered in this study is the overall area overhead incurred by instantiating the various networks in the 3D environment. Figure 6.7 shows the area penalty of each NoC design, both in terms of switch area and interconnect area. It shows that while the 3D Mesh and 3D Stacked Mesh NoCs reduce the amount of wiring area, the switch overhead is increased. For both the 3D Mesh and 3D Stacked Mesh NoCs, the number of longer inter-switch links in the x-y plane is reduced: there are 96 x-y links in both topologies, plus 16 vertical buses for the 3D Stacked Mesh and 48 vertical links for the 3D Mesh. In comparison, the conventional 2D mesh-based NoC has 112 links in the horizontal plane. As the 3D NoCs have fewer long horizontal links, they incur less wiring area overhead. Although there are a large number of vertical links, the area incurred by them is very small due to the 2 × 2 μm interlayer vias. However, the increased number of ports per switch results in larger switch overhead for both of these NoC architectures, ultimately causing the 3D Mesh and 3D Stacked Mesh topologies to incur more silicon area in spite of the wiring improvements. On the other hand, the 3D Ciliated Mesh shows a significant improvement in terms of area. The 4 × 4 × 2 3D Ciliated Mesh structure involves half as many switches as the other mesh-based architectures, and only 64 links. As a result, its area overhead is accordingly smaller.
6.3.6 Experimental Results for Tree-Based Networks

Here, the performance of the three-dimensional tree-based NoCs is evaluated. It has already been established that the 2D and 3D versions of the tree topologies should have identical throughput and latency characteristics, and Fig. 6.8a, b support this.
[Fig. 6.8 Experimental results for tree-based NoCs: a Throughput vs. injection load, b Latency vs. injection load, c Cycle energy vs. injection load, and d Packet energy]
Consistent with the analysis of mesh-based NoCs, Fig. 6.8a shows the variation of throughput as a function of injection load, and Fig. 6.8b shows the effect of injection load on latency. The assumption here is that the switches and the inter-switch wire segments are driven by the same clock, as explained earlier. Consequently, under this assumption there is no advantage in terms of throughput and latency to choosing a 3D IC over a traditional planar IC for a tree-based NoC. However, this is eclipsed by the superior performance achieved in terms of energy and area overhead. The energy profiles for 3D tree-based NoCs (Fig. 6.8c) reveal significant improvements over 2D implementations. Both BFT and Fat Tree (SPIN) networks show a very large reduction in energy when 3D ICs are used. Once again, energy dissipation is largely dependent both on architecture and on injection load. Each NoC shows that energy dissipation increases with injection load until the network becomes saturated, similar to the throughput curve shown in Fig. 6.8a. The energy profiles show that the Fat Tree networks cause higher energy dissipation than the Butterfly Fat Tree instantiations, but this is universally true only at high injection load. Again, this is the motivation to consider the packet energy of the networks as a relevant metric for comparison, shown in Fig. 6.8d. Energy savings in excess of 45% are achievable by adopting 3D ICs as a manufacturing methodology, and both BFT and Fat Tree networks show similar improvements. In the case of tree-based NoCs, where the basic network topology remains unchanged in 3D implementations, all improvements in energy dissipation are caused by the shorter wires. As shown earlier in Table 6.1, a three-dimensional structure greatly reduces the inter-switch wire length. The overall energy dissipation in a NoC is heavily dependent on the interconnect energy, and this reduction in inter-switch wire length effects very large savings. Besides the advantages in terms of energy, three-dimensional ICs enable tree-based NoCs to reduce silicon area overhead by a sizable margin. Figure 6.9 shows the overall area overhead of tree-based NoCs. Although no improvements are made in terms of switch area, the reductions in inter-switch wire lengths and in the number of repeaters are responsible for substantial reductions in wiring overhead.
[Fig. 6.9 Area overhead for tree-based NoCs: switch and wiring area as a percentage of SoC area, per topology]
This is especially true of the Fat Tree network, which has more interconnects in the higher levels of the tree; its wiring overhead is reduced by more than 60% by instantiating the network in a 3D IC.
6.3.7 Effects of Traffic Localization

Until this point, a uniform spatial distribution of traffic has been assumed. In a SoC environment, different functions map to different parts of the chip, and the traffic patterns are expected to be localized to different degrees [28]. This section therefore considers the effect of traffic localization on the performance of the 3D NoCs, and in particular the illustrative case of spatial localization where local messages travel from a source to the set of nearest destinations. In the case of BFT and Fat Tree, localized traffic is constrained to within a cluster consisting of a single sub-tree, while in the case of the 3D Mesh, it is constrained to the destinations placed at the shortest Manhattan distance [21]. The 3D Stacked Mesh architecture, on the other hand, is created specifically to take advantage of the inexpensive vertical communication. The research pursued by Li et al. in [9] suggested that in a 3D multi-processor SoC, much of the communication should take place vertically, taking advantage of the short inter-layer wire segments. This follows from a large proportion of network traffic occurring between a processor and its closest cache memories, which are often placed along the z-dimension. Consequently, in these situations the traffic will be highly localized, and this study therefore considers localized traffic to be constrained to within a pillar for the 3D Stacked Mesh. Figure 6.10 summarizes these effects, revealing the benefits of traffic localization. More packets can be injected into the network, improving the throughput characteristics of each topology as shown in Fig. 6.10a, c, which also show the throughput profiles of the 2D topologies for reference. Analytically, increasing localization reduces the average number of hops that a flit must travel from source to destination. Figure 6.10a reveals that the 3D Stacked Mesh network provides the best performance in terms of throughput in the presence of localized traffic. However, this is achieved by using a wide bus for vertical communication. Consider what occurs when the bus size is equal to the flit width of 32 bits. With low localization, the achieved throughput is higher than that of a 2D Mesh network. However, when the fraction of localized traffic in the vertical pillars is increased, a huge performance degradation is seen, due to contention in the bus. When the bus width is increased to 128 bits, throughput increases significantly with increasing localized traffic, owing to less contention in the wider communication channel. Figure 6.10b, d depict the effects of localization on packet energy, and, unsurprisingly, there is a highly linear relationship between these two parameters. Packet energy is highly correlated with the number of hops from source to destination, and the resulting reduction of packet energy with localization supports this correlation. For the mesh-based networks, the 3D Ciliated Mesh exhibits the lowest packet energy due to its low hop count and very short vertical wires.
[Fig. 6.10 Localization effects on mesh-based NoCs in terms of: a Throughput and b Packet energy; and on tree-based NoCs in terms of c Throughput and d Packet energy]
In fact, at the highest localization, the packet energy for a 3D Ciliated Mesh topology is less than 50% of that of the next-best-performing topology, the 3D Mesh. For the tree-based NoCs, both 3D networks have much-improved packet energy with traffic localization. As can be seen from Fig. 6.10, there are tradeoffs between packet energy and throughput. For instance, the best-performing topology in terms of energy, the Ciliated Mesh, operates at the lowest throughput even when traffic is highly localized. On the other hand, although a 3D Stacked Mesh network with a wider bus achieves superior throughput without necessitating a highly local traffic distribution, it incurs more energy dissipation than the other structures under local traffic due to the capacitive loading on the interlayer busses. The other topologies lie in a middle ground between these two extremes, and in general, it is clear that 3D ICs continue to effect improvements on NoCs under localized traffic.
6.3.8 Effects of Wire Delay on Latency and Bandwidth

In NoC architectures, the inter-switch wire segments, along with the switch blocks, constitute a pipelined communication medium. The overall latency is governed by the slowest pipelined stage. Table 6.2 showed earlier that the maximum wire delays for the network architectures are different. Though the vertical wire delays are very small, the overall latency still depends on the delay of the switch blocks. Though the delays of the switch blocks were constrained within the 15FO4 limit, they were still the limiting stages in the pipeline, especially when compared to the fast vertical links. Yet, consider a hypothetical case, ignoring the implications of switch design, where the clock period of the network is equal to the inter-switch wire delay; the clock frequency can then be increased and, as a result, the latency can be reduced significantly. With this in mind, latency in nanoseconds (instead of latency in clock cycles) and bandwidth (instead of throughput) are calculated. All other network parameters are kept consistent with the previous analysis. A plot of latency for all network topologies is shown in Fig. 6.11, and Table 6.5 reports the network bandwidth in units of Terabits per second. Bandwidth is calculated with the following expression:
BW = TPmax · f · wflit · N ,    (6.13)
where TPmax represents the throughput at saturation, f represents the clock frequency, wflit is the flit width, and N is the number of IP blocks. Table 6.5 shows the performance difference achieved by running the NoC with a clock as fast as the inter-switch wire, disregarding the switch design constraints. It is evident that the tree-based architectures show the greatest performance improvement in this scenario when going from 2D to 3D implementations, as their horizontal wire lengths are also reduced.
[Fig. 6.11 Latency in ns at hypothetical clock frequencies, vs. injection load, for all mesh- and tree-based topologies]

Table 6.5 Bandwidth of network architectures at simulated and hypothetical frequencies (Terabits/s)

Topology           f = 1.67 GHz   f = 1/(max wire delay)   % increase
2D Mesh            1.357          3.711                    173.5
3D Mesh            2.412          6.596                    173.5
Ciliated 3D Mesh   1.457          3.983                    173.5
3D Stacked Mesh    2.488          6.804                    173.5
2D BFT             0.9543         1.039                    8.9
2D Fat Tree        2.515          2.738                    8.9
3D BFT             0.9543         2.474                    159.2
3D Fat Tree        2.515          6.520                    159.2
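The entries of Table 6.5 follow from (6.13); the sketch below shows the computation for an assumed saturation throughput of 0.4 flits/IP/cycle (an illustrative value, not one of the reported simulation results) in a 64-IP, 32-bit-flit network.

```python
# Minimal sketch of Eq. (6.13): bandwidth from saturation throughput.
def bandwidth_tbps(tp_max, clock_hz, flit_width_bits, num_ips):
    """BW = TPmax * f * w_flit * N, returned in Terabits per second."""
    return tp_max * clock_hz * flit_width_bits * num_ips / 1e12

# Assumed saturation throughput of 0.4 flits/IP/cycle, 64 IPs, 32-bit flits:
print(bandwidth_tbps(0.4, 1.67e9, 32, 64))       # ~1.37 Tbit/s at 1.67 GHz
print(bandwidth_tbps(0.4, 1 / 219e-12, 32, 64))  # ~3.74 Tbit/s at 1/(219 ps)
```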
6.3.9 Network Aspect Ratio

The ability to stack layers of silicon is not without nuances. Upcoming 3D processes have a finite number of layers due to manufacturing difficulties and yield issues [3]. Furthermore, it is speculated [3] that the number of layers in a chip stack is not likely to scale with transistor geometries. This has a nontrivial effect on the performance of 3D NoCs. Consequently, future NoCs may have a greater number of intellectual property blocks in the horizontal dimensions than in the vertical dimension. The effect of this changing aspect ratio must be characterized. For a more in-depth illustration of these effects, the overall performance of a mesh-based NoC in a 2-layer IC is evaluated in comparison to the previously analyzed 3D 4 × 4 × 4 Mesh and 2D 8 × 8 Mesh. Here, a 64-IP 8 × 4 × 2 Mesh is considered to match the 64-IP network size, in order to make the comparison of latency and energy as fair as possible, along with a 60-IP 6 × 5 × 2 Mesh to show a network which is similar in size and results in a more square overall footprint than the 8 × 4 × 2 Mesh. Figure 6.12 summarizes the analysis of these 2-layer ICs.
[Fig. 6.12 Comparing two 2-layer NoCs: a Throughput vs. injection load, b Latency vs. injection load, c Cycle energy vs. injection load, and d Packet energy]
Throughput characteristics are seen in Fig. 6.12a. It shows clearly that the 6 × 5 × 2 Mesh achieves a significantly higher throughput than the 2D 8 × 8 Mesh and the 8 × 4 × 2 Mesh, which suffers from a high average hop count (4.44 vs. 4.11 for the 6 × 5 × 2 Mesh), while achieving a lower maximum throughput than the 4-layer mesh. Likewise, the 2-layer mesh NoCs outperform the 2D Mesh in terms of latency, shown in Fig. 6.12b, without exceeding the performance of the 4-layer 3D instantiation. This trend continues when considering cycle energy (Fig. 6.12c) and packet energy (Fig. 6.12d). These results are as expected: with even one additional layer, significant improvements are apparent in terms of each performance metric over the 2D case. Though the multi-layer NoC exhibits superior performance characteristics compared to a 2D implementation, it will have to circumvent significant manufacturing challenges. Yet, even if implementations are limited to two-layer 3D realizations, they will still significantly outperform the planar NoCs.
6.3.10 Multi-Layer IPs

Throughout this chapter, each IP block has been assumed to be instantiated in one layer of silicon. However, as discussed in [10], it is possible for the IP blocks to be designed using multiple layers, although major issues like clock synchronization must be addressed. Accordingly, each network architecture is also analyzed with multi-layer IPs. The pipelined communication shown in Fig. 6.13 is assumed; i.e., the NoCs are constrained by the switch delay and cannot be driven as fast as the inter-switch wires. Considering this, multi-layer IPs have no effect on either throughput or latency (assuming the same clock frequency for all networks), but there are nontrivial effects on the energy dissipation profile. This effect on packet energy is depicted in Fig. 6.14. The energy savings come from reduced horizontal wire lengths.
[Fig. 6.13 The pipelined nature of NoCs: source IP, switches, destination IP]

[Fig. 6.14 The effect of multi-layer IPs: packet energy vs. number of layers per IP, for all mesh- and tree-based topologies]
For instance, if a 2.5 × 2.5 mm IP block is instantiated in 2 layers, the IP's circuitry is spread over 2 layers, and the footprint diagonal reduces by a factor of 1.414. Similarly, if instantiated in 3 layers, the footprint diagonal reduces by a factor of 1.732, and with 4 layers, the factor is 2. Although the vertical wire lengths are increased 2, 3, and 4 times, respectively, in order to span the entire multi-layer IP, the negative effects on energy incurred by this are eclipsed by the significant reductions in horizontal wire lengths. However, multi-layer IPs increase the number of layers in a 3D IC, placing an increased burden on manufacturability.
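The geometric argument above can be checked with a few lines of code: spreading a square IP over k layers divides its footprint area by k, so the footprint diagonal shrinks by √k while the vertical span grows linearly. The 2.5 mm block size comes from the text; the 20 µm layer span is borrowed from Table 6.2 and used here only as an assumed layer pitch.

```python
# Minimal sketch: footprint diagonal vs. vertical span of a multi-layer IP.
import math

def footprint_diagonal_mm(side_mm, layers):
    """Diagonal of a square IP whose area is split evenly over `layers` layers."""
    return side_mm * math.sqrt(2) / math.sqrt(layers)

def vertical_span_um(layers, layer_pitch_um=20):
    """Assumed vertical distance spanned by the multi-layer IP."""
    return layers * layer_pitch_um

for k in (1, 2, 3, 4):
    print(k, round(footprint_diagonal_mm(2.5, k), 3), vertical_span_um(k))
# The diagonal shrinks by factors of 1.414, 1.732 and 2 for 2, 3 and 4 layers.
```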
6.4 Heat Dissipation Profile of 3D NoCs

Heat dissipation is an extremely important concern in 3D ICs. Thermal effects are already known to have significant implications for device reliability and interconnects in traditional 2D circuits [29]. With the reduced footprint inherent to 3D ICs, this problem is exacerbated, as the energy dissipated throughout the entire chip is now constrained to a smaller area, thereby increasing the energy density of these circuits. As a result, it is imperative that thermal issues be addressed in any system involving 3D integration. Accordingly, an analysis of 3D NoCs is incomplete without an examination of temperature. It is especially important since the interconnect structure of a NoC can consume close to 50% of the overall power budget [13]. As temperature is closely related to the energy dissipation of the IC, this analysis draws heavily upon the discussion of energy from Sects. 6.3.1 and 6.3.2. This section considers the 2D and 3D NoC architectures introduced in Sect. 6.3 and evaluates them in the presence of realistic traffic patterns. Furthermore, Sect. 6.3 has shown that the energy dissipated by the interconnection infrastructure, i.e. the communication energy, can be reduced compared to a 2D implementation by virtue of the inherent nature of the network architecture. Consequently, this has a positive effect on heat dissipation.
6.4.1 Temperature Analysis

Temperature in a 3D IC is related to a variety of factors, including power dissipation and power density. In an integrated circuit, according to [30], the steady-state temperature distribution is given by the following Poisson equation:
∇²T(r) = −g(r) / kl(r) .    (6.14)
Here, r is the three-dimensional coordinate (x, y, z). T(r) is the temperature inside the chip at point r, g(r) is the volume power density at that point, and kl(r) is the thermal conductivity. An important fact to note is that kl(r), the thermal conductivity, is
constrained by the manufacturing process, and a SoC designer has little or no control over it. Therefore, the volume power density, g(r), is the parameter over which a designer has the most control, and the challenge facing all designers of 3D ICs is to exercise control over this parameter. In a 3D integrated circuit, the volume power density of the chip is increased. The lateral dimensions are significantly smaller, and as a result, the total power of the circuit is dissipated in a much smaller area. For instance, in a four-layer 3D IC, the floor area is reduced by a factor of 4; in an eight-layer 3D IC, that area is reduced by a factor of 8; and so on. Clearly, the energy of the entire chip is now constrained to a much smaller footprint, and the volume power density increases accordingly.
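A short sketch of the footprint argument; the total chip power used is an assumed illustrative value, and only the 20 × 20 mm to 10 × 10 mm footprint reduction is taken from the chapter.

```python
# Minimal sketch: areal power density rises as the footprint is folded.
def areal_power_density(total_power_w, footprint_mm2):
    """Power per unit footprint area, in W/mm^2."""
    return total_power_w / footprint_mm2

total_power = 10.0  # assumed total chip power in watts (illustrative)
print(areal_power_density(total_power, 20 * 20))  # 2D: 20 x 20 mm footprint
print(areal_power_density(total_power, 10 * 10))  # 4-layer 3D: 10 x 10 mm
# The same power over one quarter of the footprint gives 4x the density,
# unless the communication energy itself is reduced by the 3D topology.
```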
6.4.2 The Relationship Between Temperature and Energy

According to (6.14), it is clear that lower energy dissipation corresponds to lower heat: with an increase in volume power density, there is a corresponding increase in temperature. With 3D integration, however, the density of the chip is increased, a factor that leads to higher heat in 3D NoCs. On the other hand, in 3D NoCs the communication energy can be reduced compared to a 2D implementation due to the various factors explained earlier, which tends to lower the heat dissipation. To quantify the overall effect of these two opposing factors, the heat dissipation profile of the aforementioned 3D NoC architectures is evaluated in the presence of realistic traffic patterns.
6.4.3 Simulation Methodology

The temperature profiles of the 3D NoCs were obtained through simulations following the methodology shown in Fig. 6.15. First, the network architecture is chosen. Subsequently, the network switches are instantiated in VHDL and synthesized using Synopsys DesignVision and a 90-nm standard cell library from CMP [26]. Synopsys PrimePower is then run to generate the energy profiles of the network switches. Next, the overall floorplan of the NoC is created. In order to generate the energy profile of the entire NoC, it is necessary to incorporate the energy dissipated by each inter-switch stage; this is calculated taking into account the specific layout of each topology, following the method elaborated in [20]. Following this, the NoC simulator is run in order to generate the overall energy profile of the NoC. Finally, with a complete floorplan and power profile, the HotSpot tool, developed by a research team at the University of Virginia [31], is used to generate the temperature profile. HotSpot takes the floorplan and applies the power profile to it, and with this information it calculates the volume power density, from which the temperature profile is generated.
[Fig. 6.15 Design flow: network topology → instantiate in VHDL → synthesize using DesignVision and analyze switch power using PrimePower → energy profile of switches → run NoC simulator → energy profile of entire NoC; in parallel, generate floorplan → complete floorplan; finally, temperature analysis using HotSpot → temperature profile of NoC]
6.4.4 Experimental Results

In accordance with the prescribed methods, 64-IP instantiations of each 3D NoC architecture were analyzed for thermal performance, with temperature taken as a function of injection load. As explained in Sect. 6.4.1, temperature is closely related to power density, so these temperature profiles are very similar in form to the energy profiles, shown in Fig. 6.16a. The analysis begins with the 2D topologies. A plot of the temperature characteristics of the two architectures is shown in Fig. 6.16b, with the temperature normalized to the maximum temperature of the 2D Mesh, considered as the baseline case. Figure 6.16b shows temperature saturating at different values and at different injection loads for each topology, like the communication energy dissipation profiles. With a 3D network implementation, this chapter has shown significant improvements in terms of energy dissipation, particularly packet energy, which is revisited in Fig. 6.16e. These significant improvements in energy have substantial effects on the temperature characteristics of these 3D networks. Let us first consider the hypothetical case where 3D implementations of these topologies dissipate the same communication energy per packet as the 2D versions. This case is shown by the dotted lines in Fig. 6.16c. It is very clear that in the absence of any packet energy gains, the result is a much hotter network. As discussed in Sect. 6.2, when moving to a 3D NoC, the overall chip area remains constant while the footprint is reduced. In these 10 × 10 mm 4-layer 3D implementations, the entire energy dissipation of the chip is constrained to an area one quarter the size of the 20 × 20 mm 2D implementations. As a result, the power density should be significantly increased. However, the actual temperature profiles of the 3D networks, depicted by the solid lines in Fig. 6.16d, show a marked difference.
[Fig. 6.16 Experimental results for 3D NoCs: a Cycle energy, b Maximum temperature in 2D architectures, c Hypothetical temperature for 3D architectures, d Maximum temperature for 3D architectures, e Packet energy, and f the normalized contribution to temperature per packet]
This highlights a very important characteristic of a NoC in a 3D environment: the savings in communication energy obtained by choosing a 3D NoC implementation partially mitigate what would otherwise be a drastic increase in temperature. To help describe this effect, Fig. 6.16f presents the normalized temperature contribution per packet, using the 2D mesh architecture again as the baseline case. The dotted bars represent the hypothetical case discussed earlier. The contribution-to-temperature-per-packet metric follows a similar idea to that of packet energy: each packet sent through the network is responsible for a certain amount of energy dissipation, which in turn causes a rise in temperature. Therefore, just as packet energy quantifies the energy efficiency of a NoC, the temperature contribution per packet quantifies the temperature efficiency of a NoC. All topologies show real improvements over the hypothetical case, and, in fact, the 3D version of the BFT
network has a lower temperature than its 2D counterpart. This can be attributed, in part, to the very high (49%) decrease in packet energy that a 3D BFT implementation achieves over a 2D BFT instantiation.
6.5 Conclusion

This chapter has demonstrated that, besides reducing the footprint of a fabricated design, three-dimensional network structures offer better performance than traditional 2D NoC architectures. Both mesh- and tree-based NoCs are capable of achieving better performance when instantiated in a 3D IC environment compared to more traditional 2D implementations. The mesh-based architectures show significant performance gains in terms of throughput, latency, and energy dissipation with a small area overhead. The 3D tree-based NoCs, on the other hand, achieve significant gains in energy dissipation and area overhead without any change in throughput and latency. However, if the NoC switches are designed to be as fast as the interconnect, even the 3D tree-based NoCs will exhibit performance benefits in terms of latency and bandwidth. Furthermore, 3D NoCs are efficient in addressing the temperature issues characteristic of 3D integrated circuits. The Network-on-Chip (NoC) paradigm continues to attract significant research attention in both academia and industry. With the advent of 3D ICs, the achievable performance benefits of the NoC methodology will be more pronounced, as this chapter has shown. Consequently, this will facilitate the adoption of the NoC model as a mainstream design solution for larger multi-core system chips.
References

1. P. Magarshack and P. G. Paulin, "System-on-Chip Beyond the Nanometer Wall," Proceedings of the 40th Design Automation Conference (DAC 03), ACM Press, 2003, pp. 419–424.
2. International Technology Roadmap for Semiconductors 2005: Interconnect, [online] http://www.itrs.net/
3. A. W. Topol et al., "Three-Dimensional Integrated Circuits," IBM Journal of Research & Development, vol. 50, no. 4/5, July/Sept. 2006, pp. 491–506.
4. W. R. Davis et al., "Demystifying 3D ICs: The Pros and Cons of Going Vertical," IEEE Design and Test of Computers, vol. 22, no. 6, Nov. 2005.
5. Y. Deng et al., "2.5D System Integration: A Design Driven System Implementation Schema," Proceedings of the Asia South Pacific Design Automation Conference, 2004.
6. M. Ieong et al., "Three Dimensional CMOS Devices and Integrated Circuits," Proceedings of the IEEE Custom Integrated Circuits Conference, 2003.
7. L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm," IEEE Computer, Jan. 2002, pp. 70–78.
8. W. J. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," Proceedings of the 2001 DAC, June 18–22, 2001, pp. 683–689.
9. F. Li et al., "Design and Management of 3D Chip Multiprocessors Using Network-in-Memory," Proceedings of the 33rd International Symposium on Computer Architecture (ISCA'06), pp. 130–141.
10. V. F. Pavlidis and E. G. Friedman, "3-D Topologies for Networks-on-Chip," IEEE Transactions on Very Large Scale Integration (VLSI), October 2007, pp. 1081–1090.
11. J. Srinivasan et al., "Exploiting Structural Duplication for Lifetime Reliability Enhancement," Proceedings of the 32nd International Symposium on Computer Architecture (ISCA'05), pp. 520–531.
12. J. Tsai, C. C. Chen, G. Chen, B. Goplen, H. Qian, Y. Zhan, S. Kang, M. D. F. Wong, and S. S. Sapatnekar, "Temperature-Aware Placement for SoCs," Proceedings of the IEEE, vol. 94, no. 8, Aug. 2006, pp. 1502–1518.
13. T. Theocharides, G. Link, N. Vijaykrishnan, and M. Irwin, "Implementing LDPC Decoding on Network-on-Chip," Proceedings of the International Conference on VLSI Design, 2005 (VLSID 2005), pp. 134–137.
14. S. Vangal et al., "An 80-Tile 1.28TFLOPS Network-on-Chip in 65 nm CMOS," Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), 2007, pp. 98–99.
15. Tilera Co. http://www.tilera.com
16. P. Jacob et al., "Predicting the Performance of a 3D Processor-Memory Stack," IEEE Design and Test of Computers, Nov. 2005, pp. 540–547.
17. R. I. Greenberg and L. Guan, "An Improved Analytical Model for Wormhole Routed Networks with Application to Butterfly Fat Trees," Proceedings of the International Conference on Parallel Processing (ICPP 1997), pp. 44–48.
18. C. Grecu et al., "A Scalable Communication-Centric SoC Interconnect Architecture," Proceedings of the 5th International Symposium on Quality Electronic Design, 2004, pp. 343–348.
19. P. Guerrier and A. Greiner, "A Generic Architecture for On-Chip Packet-Switched Interconnections," Proceedings of Design and Test in Europe (DATE), Mar. 2000, pp. 250–256.
20. C. Grecu, P. P. Pande, A. Ivanov, and R. Saleh, "Timing Analysis of Network on Chip Architectures for MP-SoC Platforms," Microelectronics Journal, Elsevier, vol. 36, no. 9, Mar. 2005, pp. 833–845.
21. P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Performance Evaluation and Design Trade-offs for Network on Chip Interconnect Architectures," IEEE Transactions on Computers, vol. 54, no. 8, Aug. 2005, pp. 1025–1040.
22. J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks—An Engineering Approach, Morgan Kaufmann, San Francisco, CA, 2002.
23. K. Park and W. Willinger, Self-Similar Network Traffic and Performance Evaluation, John Wiley & Sons, New York, 2000.
24. D. R. Avresky, V. Shubranov, R. Horst, and P. Mehra, "Performance Evaluation of the ServerNetR SAN under Self-Similar Traffic," Proceedings of the 13th International and 10th Symposium on Parallel and Distributed Processing, April 12–16, 1999, pp. 143–147.
25. G. V. Varatkar and R. Marculescu, "On-Chip Traffic Modeling and Synthesis for MPEG-2 Video Applications," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, no. 3, June 2000, pp. 335–339.
26. Circuits Multi-Projects. http://cmp.imag.fr/
27. K. C. Saraswat et al., "Technology and Reliability Constrained Future Copper Interconnects—Part II: Performance Implications," IEEE Transactions on Electron Devices, vol. 49, no. 4, Apr. 2002, pp. 598–604.
28. P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Effect of Traffic Localization on Energy Dissipation in NoC-based Interconnect," Proceedings of the IEEE International Symposium on Circuits and Systems, 23rd–26th May 2005, pp. 1774–1777.
29. J. A. Davis, R. Venkatesan, A. Kaloyeros, M. Beylansky, S. J. Souri, K. Banerjee, K. C. Saraswat, A. Rahman, R. Reif, and J. D. Meindl, "Interconnect Limits on Gigascale Integration (GSI) in the 21st Century," Proceedings of the IEEE, vol. 89, no. 3, Mar. 2001, pp. 305–324.
30. D. Meeks, "Fundamentals of Heat Transfer in a Multilayer System," Microwave Journal, vol. 1, no. 1, Jan. 1992, pp. 165–172.
31. W. Huang, K. Sankaranarayanan, R. J. Ribando, M. R. Stan, and K. Skadron, "An Improved Block-Based Thermal Model in HotSpot 4.0 with Granularity Considerations," Proceedings of the Workshop on Duplicating, Deconstructing, and Debunking, in conjunction with the 34th International Symposium on Computer Architecture (ISCA), June 2007.
Part III

System and Architecture Design
Chapter 7
Asynchronous 3D-NoCs Making Use of Serialized Vertical Links

Abbas Sheibanyrad and Frédéric Pétrot
7.1 Introduction

3D-Integration, a breakthrough in increasing transistor density by vertically stacking multiple dies with a high-speed die-to-die interconnection [1], is becoming a viable solution for the consumer electronics market segment. 3D-Integration results in a considerable reduction in the length and the number of long global wires, which are dominant factors in delay and power consumption, and allows stacking dies of different technologies (e.g. DRAM, CMOS, MEMS, RF) in a single package. However, even though this new technology introduces a whole new set of application possibilities, it also aggravates some current problems in VLSI design and introduces several new ones. Delivering the clock signal to each die and dealing with clock synchronization are critical design problems for 3D integrated circuits [2]. The so-called GALS (Globally Asynchronous, Locally Synchronous) paradigm seems indispensable in this context [3]. In that situation, and in order to provide the necessary computational power and communication throughput, NoCs offer a structured solution for constructing GALS architectures. Since a NoC spans the entire chip, the network can be the globally asynchronous part of the system, while the subsystem modules are the locally synchronous parts. While the utilization of 3D-Integration technology is at the moment very ad hoc, the innovative exploitation of this major new key technology together with the Networks-on-Chip paradigm for the design and fabrication of advanced integrated circuits will allow a more generic use. The introduction of the NoC concept in early 2000 [4] was a big paradigm shift and opened a new, active, and practical area of research and development in academia and industry. Supposing a NoC with a complete 3D-Mesh (3D-Cube) topology, the number of vertical channels is equal to 2(N − ∛(N²)), where N is the number of network nodes. As each channel of a NoC generally consists of tens, and in some architectures even
hundreds of physical wire links, such a network with a large number of nodes requires a large number of physical vertical interconnections. In a 3D-integrated circuit the die-to-die interconnect pitch (mainly due to the bonding pads, which need to be large enough to compensate for the misalignment of dies) imposes a larger area overhead than the corresponding horizontal wires. Moreover, fabricating such a circuit involves several extra and costly manufacturing steps, and each extra manufacturing step adds a risk of defects, resulting in potential yield reduction. The approach can be cost-effective only for very high yields. As a consequence, looking for cost-efficiency introduces a trade-off between the great benefit of exploiting high-speed, short-length vertical interconnects and a serious limitation on their number. In order to reduce the number of vertical links to be exploited by the network, we suggest making use of serialized vertical links in asynchronous 3D-NoCs. Section 7.2 of this chapter elaborates on 3D-Integration technologies and Through-Silicon-Vias (TSVs), which, at present, are the most promising vertical interconnects. Then in Sect. 7.3 we explain how a three-dimensional design space leads to the design of 3D-NoCs, and describe how the incorporation of the third dimension provides a major improvement in the network performance. Section 7.4 details the advantages of exploiting asynchronous circuits, and Sect. 7.5 explains how the use of asynchronous serialized vertical links minimizes the number of die-to-die interconnects and maximizes the exploitation of the potentially high bandwidth of these vertical connections. Finally, we conclude the chapter in Sect. 7.6.
7.2 3D-Integration Technology

The shrinking of process technology into the deep submicron domain aggravates the imbalance between gate delays and wire delays: while gate delays decrease, global wire delays increase because of the growth of wire resistance [5]. However, since the length of local wires usually shrinks with traditional scaling, the impact of their delay on performance is minor. On the contrary, as the die size does not necessarily scale down, global wire lengths do not reduce. Global wires connect different functional units of a system and spread over the entire chip. The largest part of the delay is now due to global wires. Whereas the operating frequency and transistor density need to continue to grow, global wires are likely to have propagation delays largely exceeding the required clock period. Furthermore, the total amount of energy dissipated by long global wires is not negligible. While Networks-on-Chip [6] systematically tackle these challenges by differentiating between local and global interconnections, 3D-Integration results in a considerable reduction in the length and the number of long global wires, by folding the die into multiple layers and using short vertical links instead of long horizontal interconnects.
There are several methods for die stacking and various vertical interconnection technologies. A comparison in terms of vertical density and practical limits can be found in Chap. 1 of the present book. Through-Silicon-Via (TSV) has the potential to offer the greatest vertical interconnect density and is therefore the most promising of the vertical interconnect technologies. Furthermore, it features an extremely small inter-wafer distance of about 50 µm. Such a short distance guarantees a low interconnect resistance, about 50 times smaller than that of a typical Metal 8 horizontal wire in 0.13 µm technology [7]. The authors of [7] have also indicated that the capacitance of a whole via is about 10 times smaller than that of a typical 1.5 mm Metal 2/3 horizontal wire in 0.13 µm technology. They have shown that while the delay of a 1.5 mm horizontal link is around 200 ps, the delay of a whole vertical interconnect is 16–18.5 ps, making it substantially faster and more energy efficient than moderate-size planar links.
7.2.1 TSV Technology Challenges

Figure 7.1 shows a side view of the inter-die connection using Through-Silicon-Vias. TSV technologies are assembled at the wafer level rather than the die level. Since assembly cannot be performed with known-good dies, the fabrication yield with this approach drops quickly as more dies are added. Furthermore, additional processing steps are required, so new defects can be generated, including misalignment, void formation during the bonding phase, dislocation and defects of copper grains, oxide film formation over the Cu interface, partial or full pad detaching due to thermal stress, etc. [8].
Fig. 7.1  Side view of the inter-die vertical connections using Through-Silicon-Via (TSV) technology (vias crossing the silicon substrate and metal layers; the pitch is set by the bonding pads)
Although 3D-Integration using TSV technology is not limited in the number of layers that can be assembled, the yield can be a limiting factor. The approach can be cost-effective only for a very high yield. According to results from the leading 3D technology owners (shown in Fig. 9.4), the yield is an exponential function of the TSV defect rate (DBI) and the number of TSVs, and thus decreases exponentially when the number of TSVs reaches a technology-dependent limit, from 1,000 to 10,000 in the sample technologies of the figure. Looking for cost-efficiency, the TSV defect rate, and consequently the chip fabrication yield, is a major limiting factor and introduces a significant trade-off between the great benefit of exploiting high-speed, short-length vertical TSV interconnects and a serious limitation on the maximum number of them that should (or could) be exploited.
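The qualitative behaviour described above can be made concrete with a classical first-order yield model. The sketch below is illustrative only: it assumes independent TSV failures with a per-TSV defect probability d, so that the stack yield scales as (1 − d)^n ≈ e^(−d·n); the defect rates and TSV counts used are made-up examples, not data from the technologies referenced in Fig. 9.4.

```python
# First-order TSV-limited yield model (illustrative assumption:
# independent TSV failures with per-TSV defect probability d).
def tsv_yield(n_tsv: int, defect_rate: float) -> float:
    """Probability that all n_tsv vertical links are defect-free."""
    return (1.0 - defect_rate) ** n_tsv

if __name__ == "__main__":
    for d in (1e-5, 1e-4):                  # hypothetical per-TSV defect rates
        for n in (100, 1_000, 10_000):      # hypothetical TSV counts per stack
            print(f"d={d:.0e}  n={n:>6}  yield={tsv_yield(n, d):.3f}")
    # With d = 1e-4 the yield is ~99% at 100 TSVs but only ~37% at 10,000 TSVs,
    # the kind of exponential fall-off that motivates limiting the TSV count.
```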
7.3 NoC and 3D-Design Contribution

In order to take full advantage of 3D-Integration, the decision on the use of TSVs must come upfront in the architecture planning process rather than as a packaging decision after circuit design is completed. This requires taking the 3D design space into account right from the start of the system design. In fact, a 3D integrated complex system contains different partitions of the whole piled-up system connected by a three-dimensional NoC architecture [9, 10], allowing massive integration of processing elements, SRAMs, DRAMs, I/Os, etc. The incorporation of the third dimension (offered by the 3D-Integration paradigm) in the design of the integrated networks allows the exploitation of three-dimensional topologies [11]. By dividing a system into more dies and stacking them up, we can provide a major improvement in the network performance. Table 7.1 presents a formal comparison between the 2D-Mesh and 3D-Cube topologies.

Table 7.1  Formal comparison between 2D-Mesh and 3D-Cube
Topology | Number of nodes | Switch degree | Network diameter | Number of channels | Number of vertical channels | Number of bisection channels | Load of the busiest channels^a
2D-Mesh | N = n² | 5 | 2√N | 6N − 4√N | 0 | 2√N | C × ¼√N
3D-Cube | N = m³ | 7 | 3∛N | 8N − 6∛(N²) | 2N − 2∛(N²) | 2∛(N²) | C × ¼∛N
C: average load injected into the network by each node
^a Assuming uniform destination distribution and dimension-ordered routing

While a 3D-Cube is more complex, as its switch degree is 7 (compared with 5 for a 2D-Mesh), it offers a lower average communication latency because the network diameter is smaller. Furthermore, given the higher number of channels, and especially the higher number of bisection channels, we can anticipate a higher network saturation threshold for a 3D-Cube topology. The number of channels determines the maximum number of simultaneous communications, and the number of bisection channels determines the possibility of concurrent global communications (i.e. it
is a measure of the bottleneck channels that could be used concurrently by members of a subnet communicating with members of other subnets). In order to experimentally verify the latency and saturation gain obtained when using networks with 3D-Cube rather than 2D-Mesh topologies, we have developed cycle-accurate SystemC simulation models of a general NoC architecture (i.e. an input-buffered, wormhole packet-switching network using a dimension-ordered routing algorithm). Assuming a uniform random traffic pattern, Fig. 7.2 shows the average communication latencies of a system with 64 cores generating traffic. As can be seen, while an 8×8 2D-Mesh topology saturates when the offered load is about 25%, a 4×4×4 3D-Cube saturates at about 40%.

Fig. 7.2  Saturation threshold of a 64-core system arranged in two different network topologies (latency in cycles versus offered load in %, for the 8×8 and 4×4×4 arrangements)

The question which may arise here is: from the network performance point of view, how many dies should be stacked [12]? Table 7.2 shows some figures for a system with 900 cores generating traffic (which can likely be expected in the near future [13]), when arranged in (1) one, (2) four, and (3) nine layers.

Table 7.2  Formal comparison between three different network topologies of a 900-core system
Topology | Number of nodes | Switch degree | Network diameter | Number of channels | Number of vertical channels | Number of bisection channels | Load of the busiest channels^a
30×30 | 900 | 5 | 60 | 5,280 | 0 | 60 | C × ¼ × 30
4×15×15 | 900 | 7 | 34 | 6,510 | 1,350 | 120 | C × ¼ × 15
9×10×10 | 900 | 7 | 29 | 6,640 | 1,600 | 180 | C × ¼ × 10
C: average load injected into the network by each node
^a Assuming uniform destination distribution and dimension-ordered routing

This formal comparison shows that the value of the third dimension (i.e. the number of dies a system is divided into) has a direct influence on the network performance, namely on the communication latencies and the saturation threshold. However, for a given operating point, there is an optimum number of dies to be exploited. Figure 7.3 depicts the average communication latency in a 900-core
Fig. 7.3  Average packet latency of a 900-core system arranged in three different network topologies (latency in cycles versus offered load in %, for the 30×30, 4×15×15 and 9×10×10 arrangements)
system arranged in the three different topologies. We can see that, for example, while at the average offered load of 12% a 30â•›×â•›30 network is saturated, a 9â•›×â•›10â•›×â•›10 network works properly. We can also see that this offered load could be an operating point for a 4â•›×â•›15â•›×â•›15 network, whose fabrication would result in a better yield.
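For readers who want to reproduce the bookkeeping behind Tables 7.1 and 7.2, the sketch below recomputes the channel counts of an l×m×n mesh (l stacked layers, l = 1 for the planar case). It simply restates the table formulas under the same assumptions (one local port pair per node, two unidirectional channels per link, bisection taken across one horizontal dimension); the function and field names are ours.

```python
def mesh_metrics(layers, rows, cols):
    """Channel bookkeeping for a layers x rows x cols mesh NoC, following the
    conventions of Tables 7.1 and 7.2: one local channel pair per node, two
    unidirectional channels per inter-node link, bisection taken across the
    column dimension, and diameter/degree counted as in the table formulas."""
    n = layers * rows * cols
    dims = [d for d in (layers, rows, cols) if d > 1]
    inter_node = 2 * ((layers - 1) * rows * cols
                      + layers * (rows - 1) * cols
                      + layers * rows * (cols - 1))
    return {
        "nodes": n,
        "degree": 2 * len(dims) + 1,              # 5 for a 2D-Mesh, 7 for a 3D-Cube
        "diameter": sum(dims),                    # 2*sqrt(N) / 3*cbrt(N) convention
        "channels": 2 * n + inter_node,           # local pairs + inter-node links
        "vertical": 2 * (layers - 1) * rows * cols,
        "bisection": 2 * layers * rows,           # cut across the column dimension
    }

# Reproduce the 900-core rows of Table 7.2.
for shape in ((1, 30, 30), (4, 15, 15), (9, 10, 10)):
    print(shape, mesh_metrics(*shape))
# (1, 30, 30) -> 5280 channels, 0 vertical,    diameter 60, bisection 60
# (4, 15, 15) -> 6510 channels, 1350 vertical, diameter 34, bisection 120
# (9, 10, 10) -> 6640 channels, 1600 vertical, diameter 29, bisection 180
```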
7.4 Asynchronous Circuit Exploitation

Delivering the clock to each die and dealing with clock synchronization is a critical problem in the design of 3D integrated circuits. Remembering that even in two-dimensional chips, building a highly symmetric, fractal-structured clock distribution network (e.g. an H-tree), which is essential to route the clock signal to all parts of a chip with equal propagation delays, is very hard to achieve [14], we can imagine that constructing a three-dimensional clock distribution network (tree) is almost infeasible, as a huge number of vertical links (TSVs) would be needed. Furthermore, the run-time temperature variations and stresses and the thermal concerns (due to the increased power densities), which are significant issues of 3D integration, will introduce additional, uncontrollable, time-varying clock skew. In consequence, the GALS approaches seem to be the best solution for 3D integrated systems. A GALS system is divided into several physically independent clusters and each cluster is clocked by a different clock signal. The advantage of this method is the reduction of the clock distribution problem to a number of smaller sub-problems: since the different clocks do not need to be related in frequency and phase, the clock distribution problem on each planar cluster (which is much smaller than the whole system) becomes feasible. Moreover, the use of the GALS paradigm enables the implementation of various forms of DPM (Dynamic Power Management) and DVFS (Dynamic Voltage and Frequency Scaling) methods which, because of heat extraction and energy dissipation, seem essential to be exploited in future
SoCs. DPM and DVFS comprise a set of techniques that achieve energy-efficient computation by selectively turning off or reducing the performance of system components when they are idle or underutilized. In these methods the need for a physical distinction between system power and frequency domains has emerged. NoCs are naturally compatible with the GALS idea of clustering the chip into several physically independent subsystems, but the questions that remain are how the network itself must be clocked, and how we can deal with the problem of synchronization and metastability at clock boundaries. Since one obvious way to eliminate the problem of clock skew is the utilization of asynchronous logic, a network with a fully asynchronous circuit design is a natural approach to construct GALS architectures [15]. A large number of locally planar synchronous islands can communicate together via a global three-dimensional asynchronous network (see Fig. 7.4). An asynchronous NoC (which, itself, does not involve synchronization issues) limits the synchronization failure (metastability [16]) to the network interfaces in the first and last steps of a packet path, i.e. in the source cluster where the synchronous data enters the asynchronous network and in the destination cluster where the asynchronous data goes into the synchronous subsystem. Due to the fact that robust synchronization is often accompanied by an unavoidable latency penalty, the absence of synchronizing elements at each hop of the path leads to a much lower communication latency for an asynchronous NoC, compared with a GALS-compatible multi-synchronous one [17]. The instantiation of two special types of FIFO (Sync-to-Async and Async-to-Sync) at the network boundaries, between the synchronous Network Interface Controller (NIC) and the asynchronous network, provides the requested synchronous-compliant interfaces [18]. Each link of an asynchronous network works at its own speed, as fast as possible, as opposed to the synchronous approach in which the slowest element determines the highest operating frequency. The highest possible clock frequency in a synchronous system is limited by the worst-case combination of several parameters, such as power supply variation, temperature variation, transistor speed variation (e.g. due to fabrication variability), data-dependent operations, and data
Fig. 7.4  Multi-clocked 2D clusters in a GALS system using an asynchronous 3D network (each cluster has its own clock CK0–CK5 and connects its IPs, local interconnect and NIC through Async/Sync interface FIFOs to the asynchronous network)
propagation delays. Typically, the worst-case combination is encountered very infrequently, and the system performance is usually less than what it could be. Asynchronous circuits automatically adjust their speed of operation according to the current conditions and are not restricted by a fixed clock frequency. Their operating speed is determined by actual local latencies rather than the global worst case. In a 3D-NoC this property offers the opportunity to fully exploit the potentially high bandwidth of the vertical links. The average load of a link in a NoC depends on several parameters, including the traffic distribution pattern of the system application, the network routing algorithm, and the flit injection rate. The last columns of Tables 7.1 and 7.2 show the average load of the busiest links in a general NoC with 2D-Mesh and 3D-Cube topologies when the destination distribution is uniform (i.e. each traffic generator randomly sends packets to all other network nodes with a fixed, identical load in all clusters). The nominal estimated average load of a link is much less than its maximum capacity, especially in a GALS-compatible asynchronous NoC in which the flit injection rate is much lower than the network throughput [17]. Figure 7.5 shows the average packet latency of a system with 256 cores. The curve whose points are marked + corresponds to the system arranged in a 16×16 mesh, and the curve whose points are marked × to the system arranged in a cube with 4 layers (4×8×8). In these two cases the network is synchronous. In contrast, the curves with points marked * and □ display the average packet latency of the same system (corresponding to the + and × curves, respectively) when the network is asynchronous. We should mention here that from the system point of view the physical characteristics of an asynchronous network are not meaningful, so for this level of simulation we have used cycle-accurate SystemC models of the network for both the synchronous and asynchronous cases. In the asynchronous case we have symbolically used two different clocks with a given frequency ratio, one for the network and the other for the system (i.e. the traffic generators). In fact, in a system-level simulation, and in order to obtain system-level latencies, the only parameter of an asynchronous network that has to be taken into account is its speed ratio with respect to the clock frequency of the clusters.
Fig. 7.5  Average packet latency of a 256-core system using synchronous (speed ratio of 1) and asynchronous (speed ratio of 2) networks arranged in two- and three-dimensional topologies (latency in cycles versus load in %, for 2D-Sync, 3D-Sync, 2D-Async and 3D-Async)
The curves of Fig. 7.5 are the results when the speed ratio between the system clock frequency and the network throughput is two. This means, for example, that if the average link throughput of the network is 1,000 Mflits/s, the cores inject flits at a rate of 500 Mflits/s. According to [19] the maximum clock frequency of usual SoCs using the STMicroelectronics 90 nm GPLVT technology is about 400 MHz, while as stated in [20] the throughput of an asynchronous implementation of the same NoC using the same technology is about 1,000 Mflits/s. Hence, in our simulations the speed ratio of two can be considered a worst-case assumption. In a packet-switching network, when a packet is traversing the network all resources on the path between the header (first flit) and the trailer (last flit) are allocated to the packet and no other packet can use those resources. The other packets must wait until the path is released, that is, after the packet trailer has passed. In a synchronous NoC the flits injected into the network (as well as the trailer) move through the hops cycle by cycle, with a throughput of one flit per cycle. In contrast, in the asynchronous approach flits propagate as fast as possible. When the speed ratio between the asynchronous network and the flit injectors (subsystems) is larger than one, the trailer releases the path faster than in a synchronous network that works at the speed of the subsystems. We believe that this fast path liberation is why asynchronous NoCs have a better saturation threshold. All told, we can state that in an asynchronous 3D-NoC making use of high-bandwidth TSVs, the vertical links exploit only a small fraction of their capacity. Moreover, this fraction is lower than that of the horizontal links, as the vertical bandwidth is much higher, encouraging the search for solutions that make more efficient use of TSVs.
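As a back-of-the-envelope illustration of the path-liberation argument above, the sketch below compares how long a wormhole path stays occupied by one packet when the network forwards flits at the injection rate (synchronous case) versus at a multiple of it (asynchronous case). It is a deliberately simplified model: it ignores contention, buffering and router pipeline depth, and the packet length and path length are made-up values.

```python
def packet_network_time(packet_flits: int, hops: int, speed_ratio: float) -> float:
    """Injector cycles from header injection until the trailer leaves the last hop.

    Simplified model: flits are injected one per injector cycle, and each hop
    forwards a flit in 1/speed_ratio injector cycles; contention, buffering and
    router pipeline depth are ignored.
    """
    injection_time = packet_flits - 1          # time until the trailer enters the network
    traversal_time = hops / speed_ratio        # the trailer then crosses the path
    return injection_time + traversal_time

packet, hops = 16, 6                            # hypothetical packet and path length
print(packet_network_time(packet, hops, 1.0))   # 21.0 cycles (synchronous)
print(packet_network_time(packet, hops, 2.0))   # 18.0 cycles (speed ratio of 2)
# The faster network drains the trailer sooner and frees the path earlier,
# which is the effect credited here for the higher saturation threshold.
```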
7.5 Serialized Vertical Links

Serializing the data communication [21] of vertical links is an innovative solution for better utilization of these high-speed connections in an asynchronous 3D-NoC, particularly because 3D integrated circuits are strictly TSV-limited in order to ensure an acceptable yield and area overhead. Serialization minimizes the number of die-to-die interconnects while maximizing the exploitation of the high bandwidth of these vertical connections, hence addressing the cost-efficiency trade-off of 3D-Integration using TSV technologies. Additionally, reducing the number of TSVs to be exploited allows for hardware redundancy, which is often used to improve the yield. As the principal cause of TSV defects is misalignment, a simple and effective way to add redundancy and improve yield is to use larger pads. Serialization, and consequently the use of fewer TSVs, increases the vertical interconnection pitch, so that in the same area we can use square pads larger than the standard ones. Another efficient example of hardware redundancy is the use of redundant TSVs, which can replace defective TSVs [22]. Figure 7.6 depicts a conceptual architecture of an asynchronous 3D-NoC and the inter-router vertical connections using an asynchronous serializer/deserializer
Fig. 7.6  Inter-router vertical connections using asynchronous circuits (routers on adjacent dies connected through a serializer, a Through-Silicon-Via and a deserializer)
(instantiated just before and after the vertical connections). The communication throughput of serialized links depends on the speed of data serializing and deserializing, as well as on the serialization level (i.e. the number of parts a flit must be divided into). If the serialization degrades the vertical throughput to a value much lower than that of the horizontal links, these serial vertical links may become bottlenecks for all paths crossing them. As a consequence, the circuit design optimization of the serializer and deserializer plays a key role. It is also very important to properly determine the optimum serialization level. Figure 7.7 shows the packet latency of a 256-core system with a 3D topology of 4×8×8 using networks with serialized vertical links. These results are obtained from cycle-accurate SystemC simulations. The name of each curve indicates the type of the network (i.e. synchronous or asynchronous) and the level of serialization. For example, "3D-Async-V3to1" means an asynchronous network using serialized vertical links that divide a parallel data word into three serial parts. The serialization level has a direct impact on the network performance. The interesting point of this figure is that the performance of a normal synchronous 3D-NoC lies between that of an asynchronous network with a vertical serialization level of 3 and one with a level of 4, which means a reduction in the number of vertical links of between 66% and 50% when using the asynchronous dual-rail data encoding
Fig. 7.7  Average packet latency of a 256-core system using synchronous (speed ratio of 1) and asynchronous (speed ratio of 2) 3D networks with serialized vertical links at different serialization levels (latency in cycles versus offered load in %, for 3D-Sync, 3D-Async, 3D-Async-V2to1, 3D-Async-V3to1 and 3D-Async-V4to1)
method. Dual-rail data encoding is a delay-insensitive code which uses two wires to represent a single bit of data; the validity information is carried along with the data, so the receiver is able to unambiguously detect word completion regardless of delays. Given that these system-level simulations do not take into account that the bandwidth of the vertical links (TSVs) is higher than that of the horizontal links, we can expect the performance of such asynchronous 3D networks to be even better than presented in the figure.
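To make the dual-rail idea concrete, here is a minimal sketch of the encoding and of completion detection for one data word. It only illustrates the code itself (two wires per bit, with the all-zero spacer separating words in a four-phase protocol); the actual QDI circuits of Sect. 7.5.1 implement this with C-elements rather than software, and the function names are ours.

```python
# Dual-rail, four-phase encoding: each bit b is carried on a wire pair
# (t, f) with 1 -> (1, 0), 0 -> (0, 1); (0, 0) is the spacer between words.
def encode(word_bits):
    return [(1, 0) if b else (0, 1) for b in word_bits]

def complete(pairs):
    """A word is valid once every pair holds exactly one asserted wire."""
    return all(t ^ f for t, f in pairs)

def decode(pairs):
    assert complete(pairs), "word not yet complete"
    return [t for t, f in pairs]

word = [1, 0, 1, 1]
lines = encode(word)                            # [(1,0), (0,1), (1,0), (1,0)]
print(complete(lines), decode(lines) == word)   # True True
print(complete([(0, 0)] * 4))                   # False: spacer phase, no data yet
```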
7.5.1 Implementation

Figure 7.8 shows the scheme for vertical communication serialization. The router output data going down (or up), encoded on n bits at a transfer rate of f, is serialized into p (< n) bits transferred at a rate of g. The link bandwidth is therefore reduced by the ratio

R = (n × f) / (p × g)
In order for the serialization to have a minimum impact on the link bandwidth (i.e. for R to be as small as possible), g should be as large as possible. In a synchronous NoC, f and g are equal and are determined by the clock frequency, which depends on the slowest part of the whole system. In contrast, in an asynchronous network g is determined only by the local circuit design of the serializer/deserializer themselves and can be larger than f. Hence, our primary goal when designing the serializer and deserializer was to optimize the communication throughput. To serialize n bits of incoming data into p bits we put in parallel p separate and independent serializers. The input data coming from the queue of n-bit width is divided into p segments of m (= n ÷ p) bits and each segment is connected to an
Fig. 7.8  n bits of data coming at the transfer rate of f, divided into p segments of m bits, each serialized to one bit at the rate of g (serialization level m = n/p)
m-to-1 serializer. In the deserializer the scheme is reversed: p separate and independent 1-to-m deserializers are put in parallel to collect the n bits of the original data. Figure 7.9 presents the proposed scheme for the serialization of m bits to 1. The serializer consists of a tree of multiplexors and the deserializer of a tree of demultiplexors. In order to obtain the minimum overhead of the control part of the serialization/deserialization on the rest of the circuits, the multiplexors and demultiplexors are autonomous. We called them "self-controlled multiplexor" and "self-controlled demultiplexor". A self-controlled multiplexor alternately switches its inputs and a self-controlled demultiplexor alternately switches its outputs. Figure 7.9 shows an example with a serialization level of 8; the numbers shown at the inputs of the serializer and the outputs of the deserializer represent the order of transmission. In the figure the squares with a "P" in the center represent pipeline stages. These pipeline stages are included in the design of the multiplexors and demultiplexors and reduce the communication overhead between each two stages.
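The transmission order labelled in Fig. 7.9 follows directly from this alternating behaviour. Below is a minimal behavioural sketch (ours, not the QDI circuit) of a tree of self-controlled 2:1 multiplexors, each assumed to start on its first input and to toggle after every word it forwards; under that assumption, for a serialization level of 8 the bit in physical position i is transmitted in the time slot given by the bit-reversal of i, reproducing the ordering 1, 5, 3, 7, 2, 6, 4, 8 labelled on the serializer inputs in the figure.

```python
class Leaf:
    def __init__(self, label):
        self.label = label
    def pull(self):
        return self.label

class SelfControlledMux:
    """Behavioural model: forwards one word from the currently selected
    input, then toggles the selection (the role of the RS flip-flop in Fig. 7.10)."""
    def __init__(self, in0, in1):
        self.inputs, self.sel = (in0, in1), 0
    def pull(self):
        word = self.inputs[self.sel].pull()
        self.sel ^= 1
        return word

# Build an 8:1 serializer as a balanced tree of 2:1 self-controlled muxes.
leaves = [Leaf(i) for i in range(8)]          # bit positions 0..7 of one flit
level1 = [SelfControlledMux(leaves[i], leaves[i + 1]) for i in range(0, 8, 2)]
level2 = [SelfControlledMux(level1[i], level1[i + 1]) for i in range(0, 4, 2)]
root = SelfControlledMux(level2[0], level2[1])

emission = [root.pull() for _ in range(8)]
print(emission)   # [0, 4, 2, 6, 1, 5, 3, 7]: which position is sent at each step
order = [emission.index(i) + 1 for i in range(8)]
print(order)      # [1, 5, 3, 7, 2, 6, 4, 8], matching the labels of Fig. 7.9
```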
Fig. 7.9  Circuit implementation for data serialization over a TSV (serializer tree of self-controlled multiplexors, the Through-Silicon-Via, and the deserializer tree of self-controlled demultiplexors, with pipeline stages P)
Figure 7.10 presents the circuit implementation of the 2:1 self-controlled multiplexor using the quasi-delay-insensitive (QDI), dual-rail, four-phase asynchronous communication protocol. In the figure, the AND gates with a C label are Muller C-gates (also called C-elements), whose output is 1 when both inputs are 1, is 0 when both inputs are 0, and otherwise (i.e. when one input is 0 and the other is 1) keeps its preceding value. The responsibility of the RS flip-flop in the circuit is to determine the state of the selection. Immediately after receiving a word on the line on which the multiplexor is waiting for data, the value of the flip-flop changes, but this value
Fig. 7.10  The circuit implementation of the 2:1 self-controlled multiplexor
does not take effect until the current communication terminates. For example, imagine that the multiplexor is waiting for data on InBit0 (i.e. S, the value of the flip-flop, and I0, the select command for InBit0, are 1). As soon as the data arrives (i.e. Req0 rises) the flip-flop is reset but, because Req0 is 1, I1 does not yet go to 1. When the communication terminates, that is, when Req0 and Out_Ack are both 0, I1 is set and InBit1 will be selected as the next input. Note that the point of changing the flip-flop value early is to gain yet more bandwidth. The QDI, dual-rail, four-phase asynchronous circuit implementation of the 1:2 self-controlled demultiplexor is shown in Fig. 7.11. The circuit functions in the same manner as explained for the self-controlled multiplexor: an RS flip-flop, saving the current selection state of the demultiplexor, changes its value immediately when the data arrives, and this new value takes effect on the selection when the current data transmission terminates. We have developed generic VHDL models of the circuits and have also evaluated the designs electrically with SPICE simulations under Cadence (Spectre), using the STMicroelectronics 90 nm GPLVT technology. Table 7.3 shows the simulation results. These promising results show that in our architecture g, the communication throughput of the serialized links, is much higher than f, the
Fig. 7.11  The circuit implementation of the 1:2 self-controlled demultiplexor
communication throughput of the network router (according to [17], where the results were obtained for the same technology). As a consequence, in a link using our asynchronous serializer and deserializer, the effect of the serialization is not as significant as the serialization level suggests. For example, while the router throughput in a cluster of 2×2 mm² is about 710 Mflits/s [17], if the propagation delay of a TSV is taken to be about 20 ps, the throughput of a serialized link with a serialization level (m) of 8 is about 2,080 Mflits/s, resulting in a ratio R = (n ÷ p) × (f ÷ g) not of 8, but of 2.73 (= 8 × (710 ÷ 2,080)).

Table 7.3  SPICE simulation results
Component | Transistor count | Latency (ps) | Throughput (Gflits/s)
Self-controlled multiplexer 2:1 | 130 | 80 | 2.9
Self-controlled demultiplexor 1:2 | 132 | 70 | 3.2
Serializer 4:1 | 390 | 150 | 2.5
Deserializer 1:4 | 396 | 130 | 2.8
Serializer 8:1 | 910 | 220 | 2.5
Deserializer 1:8 | 924 | 190 | 2.8
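The 2.73 figure above is simple arithmetic on the definition of R; the sketch below just restates it so the numbers can be checked. The serialized-link throughput follows the chapter's own example (the roughly 2,080 Mflits/s of an 8:1 link given a ~20 ps TSV); only the function name is ours.

```python
def bandwidth_ratio(serialization_level: float, f_router: float, g_link: float) -> float:
    """R = (n / p) * (f / g): effective bandwidth penalty of a serialized
    vertical link, where n/p is the serialization level, f the router (flit
    injection) throughput and g the serialized-link throughput."""
    return serialization_level * (f_router / g_link)

f = 710.0     # router throughput in a 2 x 2 mm^2 cluster, Mflits/s [17]
g = 2080.0    # throughput of the 8:1 serialized link (~20 ps TSV), Mflits/s
print(bandwidth_ratio(8, f, g))   # ~2.73, not 8: the fast asynchronous link
                                  # hides most of the nominal 8x penalty
```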
7.6 Conclusion

In this chapter we have elaborated on the strategic exploitation of two key technologies: 3D-Integration and Networks-on-Chip. We have explained, and experimentally verified by presenting simulation results, that the contribution of the third dimension offered by the 3D-Integration paradigm to the design of integrated networks provides a major improvement in network performance, and that this improvement is even more significant in an asynchronous network. Moreover, we have made a case for asynchronous three-dimensional NoCs benefiting from serialized vertical links. We have explained that such an innovative architecture maximizes the utilization of the high-speed vertical interconnects and minimizes the number of these connections, which are expensive in terms of yield reduction and large vertical pitch. Innovative asynchronous circuit implementations of the serializer and deserializer [23] have also been presented, showing that the expected gain in TSV reduction is obtained. We have shown that an asynchronous network making use of serialized vertical links with a serialization level of 8 performs even better than a synchronous 3D-NoC without any serialized links, which means a reduction of at least 75% in the number of TSVs. Nevertheless, considering that the main objective of using asynchronous circuits and data serialization in the vertical links is to provide a strategic design concept addressing the cost-efficiency trade-offs of 3D-Integration, and since lowering the cost actually degrades the network performance, what would be a good compromise? What would be the acceptable point of performance reduction? A study of area and power consumption would therefore be necessary to convince designers to use serialized vertical links. Moreover, after developing each new design concept it is necessary to analyze the network performance and the saturation threshold. In fact these performance analyses will attempt to answer some fundamental questions: Can all vertical links be serialized? How many serialized vertical links can a 3D-NoC have? Where is the best place for these serial vertical links? What would be the maximum serialization level and the minimum bandwidth of each serial vertical link? And so on.
References

1. B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, H. L. Gabriel, D. McCaule, P. Morrow, W. N. Donald, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb, "Die stacking (3D) microarchitecture," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture: IEEE Computer Society, 2006.
2. I. Loi, F. Angiolini, and L. Benini, "Developing mesochronous synchronizers to enable 3D NoCs," in Proceedings of the Conference on Design, Automation and Test in Europe, Munich, Germany: ACM, 2008.
3. D. M. Chapiro, "Globally-asynchronous locally-synchronous systems," Ph.D. Thesis, Stanford University, CA, Dept. of Computer Science, 1984.
4. P. Guerrier and A. Greiner, "A generic architecture for on-chip packet-switched interconnections," in Proceedings of the Conference on Design, Automation and Test in Europe, Paris, France: ACM Press, 2000.
5. R. Ho, K. W. Mai, and M. A. Horowitz, "The future of wires," Proceedings of the IEEE, vol. 89, pp. 490–504, 2001.
6. A. Jantsch and H. Tenhunen, Eds., Networks on Chip, Kluwer Academic Publishers, p. 303, 2003.
7. I. Loi, F. Angiolini, and L. Benini, "Supporting vertical links for 3D networks-on-chip: toward an automated design and analysis flow," in International Conference on Nano-Networks, Catania, Italy, 2007.
8. A. Young, "Perspectives on 3D-IC technology," Presentation at the 2nd Annual Conference on 3D Architectures for Semiconductor Integration and Packaging, Tempe, AZ, USA, June 2005.
9. D. Park, S. Eachempati, R. Das, A. K. Mishra, Y. Xie, N. Vijaykrishnan, and C. R. Das, "MIRA: a multi-layered on-chip interconnect router architecture," SIGARCH Computer Architecture News, vol. 36, pp. 251–261, 2008.
10. F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, and M. Kandemir, "Design and management of 3D chip multiprocessors using network-in-memory," SIGARCH Computer Architecture News, vol. 34, pp. 130–141, 2006.
11. V. F. Pavlidis and E. G. Friedman, "3-D topologies for networks-on-chip," IEEE Transactions on Very Large Scale Integration Systems, vol. 15, pp. 1081–1090, 2007.
12. A. Y. Weldezion, M. Grange, D. Pamunuwa, Z. Lu, A. Jantsch, R. Weerasekera, and H. Tenhunen, "Scalability of network-on-chip communication architecture for 3-D meshes," in Proceedings of the 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip: IEEE Computer Society, 2009.
13. http://www.itrs.net/Links/2005ITRS/Home2005.htm
14. E. G. Friedman, "Clock distribution networks in synchronous digital integrated circuits," Proceedings of the IEEE, vol. 89, issue 5, pp. 665–692, 2001.
15. E. Beigne, F. Clermidy, P. Vivet, A. Clouard, and M. Renaudin, "An asynchronous NOC architecture providing low latency service and its multi-level design framework," in Proceedings of the 11th IEEE International Symposium on Asynchronous Circuits and Systems: IEEE Computer Society, 2005.
16. R. Ginosar, "Fourteen ways to fool your synchronizer," in Proceedings of the 9th International Symposium on Asynchronous Circuits and Systems: IEEE Computer Society, 2003.
17. A. Sheibanyrad, I. Miro Panades, and A. Greiner, "Systematic comparison between the asynchronous and the multi-synchronous implementations of a network on chip architecture," in Design, Automation and Test in Europe, Nice, France, 2007, pp. 1090–1095.
18. A. Sheibanyrad and A. Greiner, "Two efficient synchronous <--> asynchronous converters well-suited for networks-on-chip in GALS architectures," Integration, the VLSI Journal, vol. 41, pp. 17–26, 2008.
19. http://www.laas.fr/RTP21-SdF/Documents/03-11-05/02-Schoellkopf.pdf
20. A. Sheibanyrad, A. Greiner, and I. Miro Panades, "Multisynchronous and fully asynchronous NoCs for GALS architectures," IEEE Design & Test of Computers, vol. 25, pp. 572–580, 2008.
21. S. Ogg, E. Valli, B. Al-Hashimi, A. Yakovlev, C. D. Alessandro, and L. Benini, "Serialized asynchronous links for NoC," in Proceedings of the Conference on Design, Automation and Test in Europe, Munich, Germany: ACM, 2008.
22. A. C. Hsieh, T. T. Hwang, M. T. Chang, M. H. Tsai, C. M. Tseng, and H. C. Li, "TSV redundancy: architecture and design issues in 3D IC," Conference on Design, Automation and Test in Europe, Dresden, Germany, 2010.
23. A. Sheibanyrad and F. Pétrot, "Sérialiseur et desérialiseur asynchrone pour circuit intégré tridimensionnel," French Patent 09/53637, 2009.
Chapter 8
Design of Application-Specific 3D Networks-on-Chip Architectures Shan Yan and Bill Lin
8.1 Introduction

Network-on-Chip (NoC) architectures have been proposed as a scalable solution to the global communication challenges in nanoscale SoC designs [1, 2]. The use of NoCs with standardized interfaces facilitates the reuse of previously-designed and third-party-provided modules in new designs (e.g. RISC CPU cores, DSP cores, etc.). Besides design and verification benefits, NoCs have also been advocated to address increasingly daunting clocking, signal integrity, and wire delay challenges. Indeed, tremendous progress has been made in recent years on the design of 2D NoC architectures, both on regular topologies like 2D mesh networks for chip-multiprocessor applications [3–6] and on application-specific network architectures for custom SoC designs [7–10]. However, the advent and increasing viability of 3D silicon integration technology have opened a new horizon for on-chip interconnect design innovations. In particular, there has been considerable discussion in recent years on the benefits of three-dimensional (3D) silicon integration, in which multiple device layers are stacked on top of each other with direct vertical interconnects tunneling through them using through-silicon vias [11–14] (Fig. 8.1). 3D integration promises to address many of the key challenges that arise from the semiconductor industry's relentless push into the deep nano-scale regime. First, as feature sizes continue to shrink and integration densities continue to increase, interconnect delays have become a critical bottleneck in chip performance. By providing a third dimension of interconnect, wire delays can be substantially reduced by enabling greater spatial locality. Second, for many high-performance applications, such as video or graphics processing, the performance bottleneck is often in the chip-to-chip or chip-to-memory communication. Three-dimensional integration offers the compelling advantage that massive amounts of bandwidth can be provided between device layers without incurring the usual latency penalty, leading potentially to new architectures that can achieve much
Fig. 8.1  3D silicon integration [15] (top view of a ~4×4 μm TSV pad with a ~1.05×1.05 μm via; adjacent device layers, about 50 μm apart, are connected by vias passing through the silicon substrate and metal layers)
higher performance. Third, fabrication technologies specific to functions such as RF circuits, memories, or optoelectronic devices are often incompatible with the processing steps needed for high-performance logic devices. Three-dimensional interconnect provides a flexible way to integrate these disparate technologies into a single system-on-chip (SoC) design. Recent advances in 3D technology in the area of heat dissipation and micro-cooling mechanisms have alleviated earlier thermal viability and reliability concerns regarding stacked device layers. In this work, we investigate the problem of designing application-specific 3D-NoC architectures for custom SoC designs. NoC design in 3D chips brings new constraints and opportunities compared to 2D NoC design. The current literature has focused on regular 3D mesh NoC architectures [15–17], which are appropriate for regular 3D processor designs [18–20]. However, in the case of designing application-specific 3D NoC architectures for custom SoC designs, there are many choices that depend on the 3D floorplanning of cores, traffic requirements, and accurate power and delay models for 3D wiring. Our problem formulation is based on a rip-up and reroute procedure for routing flows and a router merging procedure for optimizing network topologies. A key idea in our new formulation is a rip-up and reroute concept that has been successfully used in the VLSI routing problem [21–23]. The rip-up and reroute concept provides us with a heuristic iterative mechanism to identify successively improving solutions. There are two central differences between our on-chip network routing and design problem and the VLSI routing problem. The first is the ability to share network resources in our problem, and the second is the difference in cost models. In the latter case, the costs of routers and links are not simple linear costs, and the sharing of network resources further complicates the optimization process. In particular, we propose a very efficient algorithm called Ripup-Reroute-and-Router-Merging (RRRM) that synthesizes custom 3D-NoC architectures. In order to obtain the best topology solutions with minimum power consumption, accurate power models for 3D interconnects and routers are derived. The RIPUP-REROUTE
algorithm for routing flows and the ROUTER-MERGING algorithm for optimizing topologies use these power costs of network links and router ports as evaluation criteria. Our design flow integrates floorplanning, and our synthesis process is aware of both performance and power consumption. The rest of this chapter is organized as follows. Section 8.2 outlines related work. Section 8.3 presents our design flow. Section 8.4 presents the problem description and formulation. Section 8.5 derives accurate power and delay models for 3D wiring and routers. Section 8.6 describes our 3D synthesis algorithm RRRM. Section 8.7 addresses deadlock considerations. Experimental results and conclusions are presented in Sects. 8.8 and 8.9, respectively.
8.2 Related Work

Research on 3D NoCs has only recently begun to emerge. Several works address 3D floorplanning and 3D placement and routing. Cong et al. [24, 25] proposed thermal-driven design flows for 3D ICs, including 3D floorplanning and 3D placement and routing algorithms. In [26], the thermal effect was formulated as another force in a force-directed approach to direct the placement procedure. In [27], a thermal-aware Steiner routing algorithm for 3D ICs is proposed by constructing a delay-oriented Steiner tree under a given thermal profile. On the problem of designing NoC architectures for 3D ICs, the current literature has focused on regular 3D mesh NoC architectures [15–17], which are appropriate for regular 3D processor designs [18–20]. Addo-Quaye [28] presented an algorithm for the thermal-aware mapping and placement of 3D NoCs, including regular mesh topologies. Li et al. [19] proposed a similar 3D NoC topology targeting multi-processor systems by employing a bus structure for communications between different device layers. In [16], various possible topologies for 3D NoCs are presented and analytic models for the zero-load latency and power consumption of these networks are described. However, all these 3D topologies are based on regular 3D mesh networks. For designing custom NoC architectures without assuming an existing network architecture, a number of techniques have been proposed in the 2D domain [7–10]. In [7], Srinivasan et al. presented NoC synthesis algorithms that consider system-level floorplanning. In [8], Murali et al. presented a NoC synthesis flow with detailed back-end integration that also considers the floorplanning process. In [9, 10], Yan et al. formulated the custom NoC synthesis problem based on set partitioning of traffic flows and finding good network topologies using a Steiner tree formulation. However, to the best of our knowledge, a fully automated 3D-NoC synthesis solution remains an open problem. Multicasting in wormhole-switched networks has been explored in the context of chip multiprocessors based on the methods in parallel machines for supporting cache coherency, acknowledgement collection, and synchronization, etc. [29, 30]. In the NoC works of [31, 32], it is reported that multicast service can be implemented in their NoC architectures. However, the methods for providing multicast routing and services have not been presented in detail. In [33], a novel multicast
scheme in wormhole-switched NoCs, using a connection-oriented technique to realize QoS-aware multicasting in a best-effort network, was proposed to support SoC applications. In [34], a router architecture supporting unicast and multicast services was proposed using a mechanism for managing broadcast flows so that the communication links in an on-chip network can be shared. In [35], the dual-path multicast algorithm, used in multicomputers, was adapted to wormhole-switched NoCs to support deadlock-free multicast routing. In this chapter, we present a custom 3D-NoC synthesis approach based on rip-up and rerouting of flows and router merging to construct good network topologies. Accurate power and delay models for 3D wiring are also proposed. To the best of our knowledge, our approach is among the first to consider custom NoC synthesis for 3D IC designs. We believe our work provides an interesting direction in this research area.
8.3 Design Flow

Our NoC synthesis design flow is depicted in Fig. 8.2. The major elements in the design flow are elaborated below.
8.3.1 Floorplanning

The input specification to our design flow consists of a list of modules and their communications. In general, a mixture of network-based communications and
Fig. 8.2  Application-specific 3D NoC design flow (the input specification and the objective and constraints feed 3D floorplanning and application-specific 3D NoC synthesis, which produce the 3D NoC architecture with routers and links; design parameters and 3D NoC power and area estimation guide the synthesis, which is followed by detailed design)
conventional wiring may be utilized as appropriate, and not all inter-module communications are necessarily carried over the on-chip network. Our design flow allows for both interconnection models. The floorplanning problem for 3D ICs has been studied recently and many solutions exist (e.g. [24, 26]). These floorplanners can handle a variety of constraints. The thermal issues and power consumption effects that are especially important for 3D ICs are also considered in these floorplanners. In our design flow, an initial 3D floorplanning step is performed before 3D-NoC synthesis to obtain a placement of the modules. This is important because the floorplanning of modules is often influenced by non-network-based interconnections, and the floorplan locations of modules can have a significant influence on the NoC architecture. With the module locations available from the initial floorplanning step, 3D-NoC synthesis can better account for wiring delays and power consumption. After 3D-NoC synthesis, the actual routers and links in the synthesized 3D-NoC architecture can be fed back to the floorplanner to update the floorplan, and the refined floorplanning information can be used to obtain more accurate power and area estimates. 3D-NoC synthesis can also be re-invoked with the refined floorplan.
8.3.2 3D Networks-on-Chip Synthesis

Given the floorplanning information, the 3D-NoC synthesis step then proceeds to synthesize a 3D-NoC architecture that is optimized for the given specification and floorplan. Consider Fig. 8.3a, which depicts a small illustrative example. Figure 8.3a only shows the portion of the input specification that corresponds to the network-attached modules and their traffic flows. It utilizes a graph representation called a communication demand graph, which is discussed in more detail in Sect. 8.4. Multicast traffic flows are represented with directed hyperedges, which are shown graphically in Fig. 8.3a as a bundle of directed edges in a shaded region. For example, the traffic flow from v4 to v2, v5 and v6 is a multicast flow. An example 3D floorplan is shown in Fig. 8.3b. The unlabeled rectangles represent modules that are not attached to the network and are connected by conventional wiring. The communication demand graph with the floorplan positions annotated is illustrated in Fig. 8.3c. Figure 8.3d, e show two examples of synthesized 3D network topologies.
8.3.3 NoC Objectives and Constraints

Our 3D-NoC design flow allows different user-defined objectives and constraints. As power dissipation becomes a critical issue in 3D stacked IC designs due to the increased power density, we focus in this chapter on the problem of minimizing network power consumption under performance constraints. Another possible design
Fig. 8.3  Illustration of the 3D-NoC synthesis problem. a Example. b Floorplan. c CDG. d One architecture. e Alternative architecture
objective is the minimization of hop counts for data routing under power consumption constraints. Other possible constraints can be design area, total wire length, or some combinations of them.
8.3.4 NoC Design Parameters

In addition to user-defined objectives and constraints, NoC design parameters such as the operating voltage, target clock frequency, and link widths are provided to the 3D-NoC synthesis step as well. Operating voltage and clock frequency parameters
are usually dictated by the design, and link widths are often dictated by IP interface standards. However, if the design allows for different voltages or clock frequencies, or if the IP modules allow for different link widths, then 3D-NoC synthesis can be invoked to synthesize solutions for a range of design parameters specified by the user.
8.3.5 Detailed Design

Finally, the synthesized NoC architecture with the rest of the design specification can be fed to a detailed RTL design flow where design tools like RTL optimization and detailed 3D placement and routing [25, 27] are used.
8.4 Problem Description and Formulation

8.4.1 Problem Description

The input to our 3D-NoC synthesis problem is a communication demand graph (CDG), defined as follows:

Definition 1  A communication demand graph (CDG) is an annotated directed hypergraph H(V, E, π, λ), where each node vi ∈ V corresponds to a module, and each directed hyperedge ek = s → D ∈ E represents a traffic flow from source s ∈ V to one or more destinations D = {d1, d2, …}, D ⊆ V. The position of each node vi is given by π(vi) = (xi, yi). The data rate requirement for each communication flow ek is given by λ(ek).

In general, traffic flows can be either unicast or multicast flows. Multicast flows are flows with |D| > 1. For example, in Fig. 8.3c, e7 corresponds to a multicast flow from source v4 to destinations v2, v5 and v6. Based on the optimization goals and cost functions specified by the user, the output of our 3D-NoC architecture synthesis problem is an optimized custom network topology with pre-determined routes for the specified traffic flows on the network such that the data rate requirements are satisfied. For example, Fig. 8.3d, e show two different topologies for the CDG shown in Fig. 8.3c. Figure 8.3d shows a network topology where all flows share a common network. In this topology, the pre-determined route for the multicast flow e7 travels from v4 to v2 to first reach v2, and then bifurcates at v2 to reach v5 and v6. Figure 8.3e shows an alternative topology comprising two separate networks. In this topology, the multicast flow e7 bifurcates at the source node to reach v6, is transferred over the network link between v4 and v2 to reach v2, and then bifurcates to reach v5. Observe that in both cases, the amount of network resources consumed by the routing of multicast traffic is less than what would be required if the traffic were sent to each destination as a separate unicast flow.
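As a concrete companion to Definition 1, the following sketch shows one possible in-memory representation of a CDG and of the multicast test |D| > 1. The class layout and the two illustrative flows (one unicast flow, plus the multicast e7 from v4 to {v2, v5, v6} described above) are ours; the rates and 2D positions are placeholder values, not the annotations of Fig. 8.3.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Flow:                       # a directed hyperedge e_k = s -> D with rate lambda(e_k)
    source: str
    destinations: frozenset       # D, one or more module names
    rate: float                   # data-rate requirement lambda(e_k)
    @property
    def is_multicast(self) -> bool:
        return len(self.destinations) > 1   # |D| > 1

@dataclass
class CDG:                        # H(V, E, pi, lambda)
    positions: dict = field(default_factory=dict)   # pi: module -> (x, y)
    flows: list = field(default_factory=list)       # E

cdg = CDG(positions={"v4": (3.0, 1.0), "v2": (5.0, 0.0)})     # placeholder coordinates
cdg.flows.append(Flow("v0", frozenset({"v2"}), rate=200.0))   # a unicast flow (illustrative rate)
cdg.flows.append(Flow("v4", frozenset({"v2", "v5", "v6"}), rate=100.0))  # e7-style multicast
print([f.is_multicast for f in cdg.flows])   # [False, True]
```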
8.4.2 Problem Formulation

In general, the solution space of possible application-specific network architectures is quite large. Depending on the communication demand requirements of the specific application under consideration, the best network architecture may indeed be comprised of multiple networks, within each of which many flows share the same network resources. To address the 3D-NoC synthesis problem, we formulate it as a combination of a rip-up and reroute procedure for routing flows and a router merging procedure for optimizing network topologies. The key part of the algorithm is a rip-up and reroute procedure that routes multicast flows by finding the optimum multicast tree on a condensed multicast routing graph, using the directed minimum spanning tree formulation and its efficient algorithms [36, 37]. A router merging procedure then follows to further optimize the implementation and reduce cost. The router merging algorithm iteratively considers all possible mergings of two routers connected with each other and merges them if the cost of the resulting topology after merging is reduced. In order to obtain the best topology solutions with minimum power consumption, accurate power models for 3D interconnects and routers are derived in Sect. 8.5. They are provided to the synthesis design flow as a library and are utilized by the synthesis algorithm as evaluation criteria. The RIPUP-REROUTE algorithm for routing flows and the ROUTER-MERGING algorithm for optimizing topologies use these power costs of network links and router ports as edge weights. The application-specific 3D-NoC synthesis problem can be formulated as follows:

Input:
• The communication demand graph H(V, E, π, λ) of the application.
• The 3D-NoC network component library Φ(I, J), where I provides the power and area models of routers with different sizes, and J provides power models of physical links with different lengths.
• The target clock frequency, which determines the delay constraint for links between routers.
• The floorplanning of the cores.

Output:
• A 3D-NoC architecture T(R, L, C), where R denotes the set of routers in the synthesized architecture, L represents the set of links between routers, and C: V → R is a function that represents the connectivity of a core to a router.
• A set of ordered paths P, where each pij ∈ P = (ri, rj, …, rk), ri, …, rk ∈ R, represents a route for a traffic flow e(vi, vk) ∈ E.

Objective:
• The minimization of the power consumption of the synthesized 3D-NoC architecture.
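To fix ideas, here is a highly simplified skeleton of the optimization loop just described: rip up one flow at a time, reroute it on the cheapest tree, then greedily merge adjacent routers while the power cost improves. It is only a structural sketch under our own simplifying assumptions (the cost functions, data structures, and termination test are placeholders), not the RRRM algorithm of Sect. 8.6.

```python
def rrrm_skeleton(flows, topology, cost, reroute, merge_candidates, apply_merge,
                  max_iters=10):
    """Structural sketch of rip-up-and-reroute followed by router merging.

    All callbacks are placeholders: `reroute(flow, topology)` is assumed to
    return the cheapest (possibly multicast) route for `flow` on the current
    topology, e.g. via a directed minimum spanning tree on a condensed routing
    graph, and `cost(...)` evaluates the power cost of a route or a topology.
    """
    routes = {f: reroute(f, topology) for f in flows}      # initial routing

    for _ in range(max_iters):                              # rip-up and reroute phase
        improved = False
        for f in flows:
            old = routes[f]                                  # rip up this flow ...
            new = reroute(f, topology)                       # ... and reroute it
            if cost(new) < cost(old):
                routes[f], improved = new, True
        if not improved:
            break

    merging = True                                           # router merging phase
    while merging:
        merging = False
        for a, b in merge_candidates(topology):              # pairs of connected routers
            merged = apply_merge(topology, a, b)
            if cost(merged) < cost(topology):                # keep the merge if cheaper
                topology, merging = merged, True
                break
    return topology, routes
```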
8.5 3D Design Models

Power dissipation is a critical issue in 3D circuits due to the increased power density of stacked ICs and the low conductivity of the dielectric layers between the device layers. Therefore, designing custom 3D-NoC topologies that offer low power characteristics is of significant interest. The power consumption components of 3D-NoC topologies are the routers, the horizontal interconnects that connect modules in the same 2D layer, and the through-silicon vias (TSVs) that connect modules or horizontal interconnects on different layers. We discuss the details of modelling these components in the following sections.
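Before turning to the 3D specifics, it may help to recall the classical 2D repeater-insertion arithmetic that Sect. 8.5.1 below builds on. The sketch uses the standard delay-optimal closed-form expressions for the number and size of repeaters on a distributed RC wire; it is only an illustrative baseline with made-up driver parameters, whereas the flow described below sizes repeaters to minimize power under a delay constraint, following [38].

```python
from math import sqrt

# Horizontal global-wire parameters from Table 8.1 (70 nm technology).
R_PER_MM = 46.0          # ohm / mm
C_PER_MM = 192.5e-15     # farad / mm

# Hypothetical minimum-sized repeater characteristics (placeholder values).
R0 = 5_000.0             # ohm, output resistance of a minimum repeater
C0 = 1.0e-15             # farad, input capacitance of a minimum repeater

def delay_optimal_repeaters(length_mm: float):
    """Classical closed-form repeater insertion for a distributed RC wire:
    k equal segments with repeaters s times the minimum size."""
    Rw, Cw = R_PER_MM * length_mm, C_PER_MM * length_mm
    k = sqrt(0.4 * Rw * Cw / (0.7 * R0 * C0))   # number of segments
    s = sqrt(R0 * Cw / (Rw * C0))               # repeater sizing factor
    return max(1, round(k)), s

k, s = delay_optimal_repeaters(8.0)             # the 8 mm wire of the example below
print(k, round(s))   # ~8 segments of ~145x repeaters under these placeholder values
```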
8.5.1 3D Interconnect Modelling

In 3D-NoCs, interconnect design poses new constraints and opportunities compared to 2D NoC designs. There is an inherent asymmetry between the delay and power costs of the vertical and the horizontal interconnects in a 3D architecture, due to the differences in wire lengths. The vertical TSVs are usually a few tens of μm in length, whereas the horizontal interconnects can be thousands of μm in length. Consequently, extending a traditional 2D NoC fabric to the third dimension by simply adding routers at each layer and connecting them using vertical vias is not a good option, as router latencies may dominate the fast vertical interconnect. Hence, we explore an alternative option: a 3D interconnect structure that connects modules on different layers, as shown in Fig. 8.4a, and we derive an accurate model for it. As discussed in Sect. 8.3, the target clock frequency is provided to our 3D-NoC synthesis design flow as a design parameter. However, depending on the network topology, long interconnects may be required to implement network links between routers, whose wire delays may exceed the target clock period. To achieve the target frequency, repeaters may need to be inserted. In the 2D design problem, interconnects can be modelled as distributed RC wires. One way to optimize the interconnect delay is to evenly divide the interconnect into k segments with repeaters inserted between them that are s times as large as a minimum-sized repeater. When minimizing power consumption is the objective, the optimum size sopt and number kopt of repeaters that minimize power consumption while satisfying the
Fig. 8.4 3D interconnect model. a 3D interconnect. b Distributed RC model with repeaters
Table 8.1 Interconnect parameters

Interconnect structure | Electrical parameters | Physical parameters
Horizontal bus | ρ = 2.53 μΩ × cm, rh = 46 Ω/mm, ch = 192.5 fF/mm | w = 500 nm, s = 500 nm, t = 1,100 nm, h = 800 nm, kILD = 2.7
Vertical bus | ρ = 5.65 μΩ × cm, rv = 51.2 Ω/mm, cv = 600 fF/mm | w = 1,050 nm, Lvia = 50 μm
For the 3D interconnect structure, we extended this distributed RC model. As shown in Fig. 8.4b, a 3D interconnect is divided into k segments by repeaters. Among the k segments, k − 1 segments are part of the horizontal interconnect and have the same structure; the remaining one has a different structure, with two horizontal parts connected by a vertical via. The delay and power consumption per bit of this interconnect can be modelled using the Elmore model, as in [16, 38, 39]. In order to take the vertical via into account in the delay and power calculation of the entire interconnect, we first consider the interconnect as k segments with the same structure. We use the methodology described in [38] to find sopt and kopt for an interconnect of a specific length that minimize power while satisfying the delay constraint.1 After that, the delay and power of each segment are known. Given the fixed length and the physical parameters of the via, the detailed structure of the segment including the via that yields the same delay as the original segment without the via can be determined by properly choosing the lengths of the horizontal wire parts in this segment. Finally, the total length of the 3D interconnect can be adjusted back to the original length by evenly adjusting the length of each segment.

Besides deciding the structure of the segment containing the vertical via, the via position on the interconnect also needs to be determined, i.e., which wire segment is selected to include the via. To determine the influence of via positioning on the delay and power of the entire 3D interconnect, we performed experiments evaluating the delay and power of an 8 mm 3D interconnect with a via length of 150 μm under different via positions. In the experiments, the physical and electrical parameters of a 70 nm technology, listed in Table 8.1, are used. The horizontal wires are implemented on the global metal layers and their parameters are extracted from the ITRS [40]. The parameters of the vertical vias are obtained from [16]; a via length of 50 μm is assumed for a via connecting adjacent layers. For a 3D interconnect of 8 mm in length, if the target frequency is 1 GHz, the power-optimal solution using the methodology of [38] is to divide the interconnect into three segments. There are thus three possible via positions, with three corresponding interconnect structures, which are shown in Fig. 8.5. The optimization result of each structure, together with the result for the interconnect without a vertical via (labelled 2D-wire), is shown in Table 8.2, and the differences in delay and power of all structures relative to the 2D-wire results are also listed.

1 Since inserting a TSV adds delay, we tighten the delay constraint somewhat to obtain valid solutions.
Fig. 8.5 Different structures for an 8 mm 3D interconnect. a 2D-wire. b Model A. c Model B. d Model C

Table 8.2 Power and delay comparison of 3D interconnect models

Model | Power (mW) | % diff to 2D-wire | Delay (ns) | % diff to 2D-wire
2D-wire | 0.3909 | 0.00 | 0.1951 | 0.00
A | 0.4020 | 2.85 | 0.1956 | 0.25
B | 0.4020 | 2.85 | 0.1956 | 0.25
C | 0.4020 | 2.85 | 0.1956 | 0.25
The results show that the influence of the vertical via on the total delay and power consumption of the entire interconnect is very small: the 150 μm via results in a 0.25% increase in delay and a 2.85% increase in power over the 8 mm interconnect. The results also show that the position of the via on the interconnect has little effect on the delay and power; all the structures result in the same total delay and power. Thus, for our 3D-NoC synthesis algorithm, we can safely position the via in the first segment of the interconnect for all 3D interconnects in the synthesized NoC topology for the purpose of computing the interconnect power costs.

In our 3D-NoC synthesis design flow, we use the above 3D interconnect model to evaluate the optimum power consumption of interconnects with different wire lengths under the given design frequency and delay constraint. These results are provided to the design flow in the form of a library. We emphasize that the focus of this chapter is on new 3D-NoC synthesis algorithms; 3D interconnect optimization is a complex problem and a subject of separate research. New or alternative 3D interconnect models can easily be used with our synthesis algorithms and design flow.
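To make the repeater-insertion step concrete, the sketch below sweeps the number of segments k and the repeater size s of a repeated RC wire and keeps the lowest-power combination that meets a 1 ns delay target, in the spirit of the methodology of [38]. The Elmore-style delay expression and all device constants (R0, C0, leakage current, activity factor) are illustrative assumptions rather than values from the chapter; the default wire parameters are the horizontal-bus values of Table 8.1.

```python
import itertools

# Illustrative 70 nm-style constants (assumptions made for this sketch only)
R0, C0, ILEAK = 10e3, 0.1e-15, 10e-9     # min. repeater: output resistance (ohm), input cap (F), leakage (A)
VDD, F_CLK, ACT = 1.0, 1e9, 0.15         # supply (V), clock (Hz), switching activity factor

def elmore_delay(length_mm, k, s, r, c):
    """Total Elmore delay of a wire split into k equal segments with size-s repeaters."""
    l = length_mm / k
    seg = 0.69 * (R0 / s) * (s * C0 + c * l) + 0.69 * r * l * s * C0 + 0.38 * r * c * l * l
    return k * seg

def wire_power(length_mm, k, s, c):
    """Per-bit dynamic plus leakage power of the repeated wire."""
    c_total = c * length_mm + k * s * C0
    return ACT * c_total * VDD ** 2 * F_CLK + k * s * ILEAK * VDD

def optimal_repeaters(length_mm, r=46.0, c=192.5e-15, t_max=1.0 / F_CLK):
    """Sweep (k, s) and keep the lowest-power point that satisfies the delay constraint."""
    best = None
    for k, s in itertools.product(range(1, 20), range(1, 200)):
        if elmore_delay(length_mm, k, s, r, c) <= t_max:
            p = wire_power(length_mm, k, s, c)
            if best is None or p < best[0]:
                best = (p, k, s)
    return best   # (power in W, k_opt, s_opt), or None if the constraint cannot be met

print(optimal_repeaters(8.0))   # e.g. an 8 mm horizontal wire at 1 GHz
```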
8.5.2 Modelling Routers

To evaluate the power of the routers in the synthesized NoC architecture, we extended the two-dimensional router power model to three dimensions.
Table 8.3 Power consumption of routers using Orion [42]

Ports (in × out) | 2 × 2 | 3 × 2 | 3 × 3 | 4 × 3 | 4 × 4 | 5 × 4 | 5 × 5
Leakage power (W) | 0.0069 | 0.0099 | 0.0133 | 0.0172 | 0.0216 | 0.0260 | 0.0319
Switching bit energy (pJ/bit) | 0.3225 | 0.0676 | 0.5663 | 0.1080 | 0.8651 | 0.9180 | 1.2189
The routers are still located on a 2D layer. The ports of routers on the same layer are connected by horizontal interconnects, whereas the ports of routers on different layers are connected by 3D interconnects. We use a state-of-the-art NoC power-performance simulator called Orion [41, 42], which provides detailed power characteristics for the different power components of a router for different input/output port configurations. It accurately accounts for leakage power as well as dynamic switching power. The per-bit power values are also used as the basis for estimating the power of the entire router under different configurations. The leakage power and switching bit energy of some example router configurations with different numbers of ports in a 70 nm technology are shown in Table 8.3.
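A minimal sketch of how such table-based router models can be used during synthesis is shown below. The leakage and per-bit switching numbers are those of Table 8.3, but the way they are combined with the aggregate traffic of a router is an assumption made only for illustration.

```python
# (in_ports, out_ports): (leakage power in W, switching energy in pJ/bit), from Table 8.3
ROUTER_TABLE = {
    (2, 2): (0.0069, 0.3225), (3, 2): (0.0099, 0.0676), (3, 3): (0.0133, 0.5663),
    (4, 3): (0.0172, 0.1080), (4, 4): (0.0216, 0.8651), (5, 4): (0.0260, 0.9180),
    (5, 5): (0.0319, 1.2189),
}

def router_power(in_ports, out_ports, traffic_bits_per_s):
    """Leakage plus dynamic power of a router carrying `traffic_bits_per_s` in total."""
    leakage_w, energy_pj_per_bit = ROUTER_TABLE[(in_ports, out_ports)]
    return leakage_w + energy_pj_per_bit * 1e-12 * traffic_bits_per_s

# Example: a 4 x 4 router carrying an aggregate 2 Gb/s of traffic
print(router_power(4, 4, 2e9))   # ~0.0216 W leakage + ~0.0017 W dynamic
```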
8.6 Design Algorithms

In this section, we present algorithms for the 3D topology synthesis process. The entire process is decomposed into the inter-related steps of constructing an initial network topology, ripping up and rerouting flows to design the network topology, inserting the corresponding network links and router ports to implement the routing, and merging routers to optimize the network topology based on the design objectives. In particular, we propose an algorithm called Ripup-Reroute-and-Router-Merging (RRRM), whose details are discussed below.
8.6.1 Initial Network Construction

The details of RRRM are described in Algorithm 1. RRRM takes a communication demand graph (CDG) and an evaluation function as inputs and generates an optimized network architecture as output. It starts by initializing a network topology with a simple router allocation and flow routing scheme. It then uses a rip-up and reroute procedure to refine and optimize the network topology. After that, a router merging step further optimizes the topology to obtain the best result.

In the initialization, every flow is routed using its own network. To construct the initial network topology, router allocation is considered at each core: a router is allocated to a core if there are more than two flows coming into that core or more than two flows going out of it. After router allocation, a Routing Cost Graph (RCG) is generated (Algorithm 1, line 2). The RCG is a central data structure used throughout the rip-up and reroute procedure of the RRRM algorithm.
Definition 2 The RCG(R, E) is a weighted directed complete graph (a full mesh) in which each vertex ri ∈ R represents a router and each directed edge eij = (ri, rj) ∈ E corresponds to a connection from ri to rj. A weight w(eij) is attached to each edge, representing the incremental cost of routing a flow f through eij.

Please note that the RCG does not represent the actual physical connectivity between routers, and its edge weights change throughout the RIPUP-REROUTE procedure for different flows. The actual physical connectivity between the routers is established during the RIPUP-REROUTE procedure, as explained in the following sections. Before RIPUP-REROUTE, the initial network topology is constructed using the InitialNetworkConstruction() procedure. Each flow ek = (sk, dk) in the CDG is routed using a direct connection from router rsk to router rdk, where ri denotes the router that core i connects to, and the path is saved in path(ek). Multicast flows are routed as a sequence of unicast flows from the source to each of their destinations. If either the source or the destination core is not connected to a router, a direct connection is added between that core and the router of the other endpoint, if one exists. The links and router ports are configured and saved. If a connection between routers cannot meet the delay constraints, its corresponding edge weight in the RCG is set to infinity; this guides the rerouting of flows to use other valid links instead in the RIPUP-REROUTE procedure. As an example, after initial network construction, the connectivity of routers for the example shown in Fig. 8.3a is given in Fig. 8.6a. In this initial solution, each core is connected to a dedicated router.
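The following sketch illustrates, in simplified form, the kind of bookkeeping this initialization performs; the function and variable names are hypothetical, and the handling of multicast flows and of delay-infeasible links is omitted.

```python
from collections import defaultdict

def initial_network(flows):
    """Sketch of the initialization: allocate a router at a core that has more than
    two incoming or outgoing flows, then route every flow over a direct connection
    between the endpoint routers.  `flows` holds (src_core, dst_core, bandwidth)
    tuples; all names here are illustrative."""
    fan_in, fan_out = defaultdict(int), defaultdict(int)
    for s, d, _ in flows:
        fan_out[s] += 1
        fan_in[d] += 1
    cores = set(fan_in) | set(fan_out)
    routers = {c for c in cores if fan_in[c] > 2 or fan_out[c] > 2}

    links, paths = set(), {}
    for s, d, bw in flows:
        # Use the endpoint routers; fall back to the other endpoint's router
        # when a core did not get a router of its own.
        rs = s if s in routers else (d if d in routers else None)
        rd = d if d in routers else rs
        if rs is not None and rd is not None and rs != rd:
            links.add((rs, rd))                       # direct router-to-router connection
        paths[(s, d)] = list(dict.fromkeys(r for r in (rs, rd) if r is not None))
    return routers, links, paths

# Example: three flows converging on core 9 give core 9 a router of its own
print(initial_network([(1, 9, 100), (2, 9, 200), (3, 9, 100), (9, 4, 400)]))
```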
Fig. 8.6 Illustration of the RIPUP-REROUTE procedure. a Initial connectivity. b RCG. c MRG. d MRTree. e Connectivity before rerouting e7. f Connectivity after rerouting e7
8.6.2 Flow Ripup and Rerouting

Once the initial network is constructed and the initial flow routing is done, the key procedure of the algorithm, RIPUP-REROUTE, is invoked to route flows and find an optimized network topology. The details are described in Algorithm 2. In RIPUP-REROUTE, each multicast routing step is formulated as a minimum directed spanning tree problem. Two important graphs, the Multicast Routing Graph (MRG) and the Multicast Routing Tree (MRTree), are used to facilitate the rip-up and rerouting procedure. They are defined as follows.
Definition 3 Let f be a multicast flow with source s ∈ V and one or more destinations D ⊆ V, i.e., D = {d1, d2, ..., d|D|}, each di ∈ V. A Multicast Routing Graph (MRG) is a complete graph (N, A) defined for f as follows:

• N = {s} ∪ D.
• There is a directed arc between every pair of nodes (i, j) in N. Each arc ai,j ∈ A corresponds to a shortest path pi,j between the same nodes in the corresponding RCG, pi,j = e1 → e2 → … → ek.
• The weight of arc ai,j, w(ai,j), corresponds to the path weight of the corresponding shortest path pi,j in the RCG, i.e., w(ai,j) = Σ_{e ∈ pi,j} w(e).
Definition 4 A Multicast Routing Tree (MRTree) is the minimum directed spanning tree of the multicast routing graph (N, A) with s ∈ N as the root.

When a flow is ripped up and rerouted, its current path is deleted and the link and router port resources it occupies are released (line 3). Then, based on the current network connectivity and resource occupation, the RCG related to this flow is built and the weights of all its edges are updated (line 4). In particular, for every pair of routers in the RCG, the cost of this flow using those routers and the link connecting them is evaluated. This cost depends on the sizes of the routers, the traffic already routed on the routers, and the connectivity of the routers to other routers. It also depends on whether an existing physical link can be reused or a new physical link needs to be installed: if there are already router ports and links that can support the traffic, the marginal cost of reusing those resources is calculated; otherwise, the cost of opening new router ports and installing new physical links to support the traffic is calculated. This cost is assigned as the weight of the edge connecting the pair of routers in the RCG. If the physical links used to connect the routers cannot satisfy the delay constraints, a weight of infinity is assigned to the corresponding edges in the RCG.

Once the RCG is constructed, the multicast routing graph (MRG) for the flow is generated from it (line 5). The MRG is built by including every source and destination router of the flow as its nodes. For each pair of nodes in the MRG, the least-cost (lowest-power) directed path in the RCG is found for the corresponding routers using the Floyd-Warshall all-pairs shortest path algorithm, and the path cost is assigned as the weight of the edge connecting the two nodes in the MRG. Then the Chu-Liu/Edmonds algorithm [36, 37] is used to find the rooted directed minimum spanning tree of the MRG with the source router as root. A rooted directed minimum spanning tree of a graph connects, without any cycle, all n nodes of the graph with n − 1 arcs such that the sum of the arc weights is minimized, and each node, except the root, has exactly one incoming arc. This directed minimum spanning tree is taken as the multicast routing tree (MRTree), so that the routes of the multicast flow follow the structure of this tree. The details of the Chu-Liu/Edmonds algorithm are summarized in Algorithm 3.
The multicast routing for flow f in the RCG can be obtained by projecting the MRTree back onto the RCG by expanding the corresponding arcs into paths. A special case arises when f is a unicast flow with source s and destination d. In this case, the MRG consists of just two nodes, s and d, and one directed arc from s to d; the routing between s and d in the RCG is then simply a shortest path between them. After the path is determined, the routers and links on the chosen path are updated.

As an example, Fig. 8.6b shows the RCG for rerouting the multicast flow e7. For clarity, only part of the edges of the RCG are shown. The MRG and MRTree for e7 are shown in Fig. 8.6c, d, respectively. By projecting the MRTree back onto the RCG, the routing path for e7 is determined: e7 bifurcates at the source router R4 to reach R6 and v6, is then transferred over the network link from R4 to R2 to reach v2, and then bifurcates again to reach R5 and v5. The actual physical connectivity between routers before and after ripping up and rerouting e7 is shown in Fig. 8.6e, f. We observe that the link between R4 and R5 and their corresponding ports are saved, and thus power consumption is reduced after rerouting e7, by reusing the network resources already installed for other flows.

This RIPUP-REROUTE process is repeated for all the flows. The results of this procedure depend on the order in which the flows are considered, so the entire procedure can be repeated several times to reduce the dependency of the results on flow ordering2. Once the path of each flow is decided, the size of each router and the links that connect the routers are determined. Those routers and links constitute the network topology. The total implementation cost of all the routers and links in this topology is evaluated and the network topology is obtained.
2 In the experiments, we tried several flow ordering strategies, such as largest flow first, smallest flow first, and random ordering, and we found that ordering by smallest flow first gave the best results; we therefore used this ordering in our experiments. We also observed that repeating the whole RIPUP-REROUTE procedure twice is enough to generate good results.
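The sketch below illustrates one rip-up-and-reroute step for a single multicast flow, assuming the RCG is available as a weighted networkx digraph whose edge weights already encode the marginal costs described above. It uses networkx for the Floyd-Warshall all-pairs shortest paths and for the minimum spanning arborescence, as a library stand-in for the Chu-Liu/Edmonds algorithm of Algorithm 3; forcing the root by deleting its incoming arcs is an implementation shortcut of this sketch, not the chapter's formulation.

```python
import networkx as nx

def reroute_multicast(rcg, src, dsts):
    """One RIPUP-REROUTE step: build the multicast routing graph (MRG) from shortest
    paths in the routing cost graph `rcg`, then extract the multicast routing tree
    (MRTree) as a minimum spanning arborescence rooted at `src`."""
    pred, dist = nx.floyd_warshall_predecessor_and_distance(rcg, weight="weight")

    nodes = [src] + list(dsts)
    mrg = nx.DiGraph()
    for i in nodes:
        for j in nodes:
            if i != j:
                mrg.add_edge(i, j, weight=dist[i][j])   # arc weight = shortest-path cost in RCG

    # Removing the root's incoming arcs forces any spanning arborescence to be rooted at src
    mrg.remove_edges_from(list(mrg.in_edges(src)))
    mrtree = nx.minimum_spanning_arborescence(mrg, attr="weight")

    # Project each MRTree arc back onto the RCG by expanding it into its shortest path
    routes = {(u, v): nx.reconstruct_path(u, v, pred) for u, v in mrtree.edges()}
    return mrtree, routes
```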
8.6.3 Router Merging

After the physical network topology has been generated using RIPUP-REROUTE, a router merging step is used to further optimize the topology and reduce the power consumption cost. A router merging step was first proposed by Srinivasan et al. in [43]; their merging was based on the distance between routers. In this work, we propose a new router merging algorithm aimed at reducing the power consumption of the network and improving performance. Routers that connect to each other can be merged to eliminate router ports and links, and thus possibly the corresponding costs; routers that connect to the same common routers can also be merged to reduce ports and costs.

We propose a greedy router merging algorithm, shown in Algorithm 4. The algorithm works iteratively by considering all possible mergings of two routers connected to each other. In each iteration, each router's list of adjacent routers is constructed and sorted by distance in increasing order; these are the candidate mergings. The routers are then considered for merging in decreasing order of the number of neighbors they have. For each candidate merging, if the topology resulting from the merge is valid, the total power consumption of the merged topology is evaluated using the power models. Two routers are merged if neither has already been merged in this iteration and the cost improves. After all routers have been considered in the current iteration, the router list is updated by replacing the merged routers with the newly generated ones, and those routers are reconsidered in the next iteration. The algorithm keeps merging routers until no further improvement can be made. After router merging, the optimized topology is generated and the routing paths of all flows are updated. Since router merging always reduces the number of routers in the topology, it does not increase the hop count of any flow and thus does not worsen the performance of the application.
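A condensed sketch of such a greedy merging loop is given below. The power and validity checks are left as caller-supplied callbacks, and using networkx node contraction to model a merge is an assumption of this illustration rather than a description of the chapter's implementation.

```python
import networkx as nx

def merge_routers(topology, power_of, valid):
    """Greedy ROUTER-MERGING sketch: repeatedly try to merge pairs of directly
    connected routers and keep a merge whenever the merged topology is valid and
    its evaluated power decreases.  `topology` is an nx.Graph of routers;
    `power_of` and `valid` are caller-supplied evaluation callbacks."""
    improved = True
    while improved:
        improved = False
        # Visit routers with many neighbours first; candidate partners sorted by distance
        for r in sorted(list(topology.nodes), key=lambda n: -topology.degree(n)):
            if r not in topology:                      # already removed by an earlier merge
                continue
            for n in sorted(topology.neighbors(r),
                            key=lambda m: topology.edges[r, m].get("dist", 0.0)):
                merged = nx.contracted_nodes(topology, r, n, self_loops=False)
                if valid(merged) and power_of(merged) < power_of(topology):
                    topology = merged                  # accept the merge and move on
                    improved = True
                    break
    return topology
```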
The topology generated after router merging represents the best solution with the minimum power consumption. It is returned as the final solution for our NoC synthesis algorithm.
As an example, the connectivity graphs before and after the ROUTER-MERGING procedure for the example of Fig. 8.3a are shown in Fig. 8.7a, b. After router merging, the network resources are reduced from four routers to three, and the total power consumption is reduced as well.
Fig. 8.7 Illustration of the ROUTER-MERGING procedure. a Before router merging. b After router merging
8.6.4 Complexity of the Algorithm

For an application with |V| IP cores and |E| flows, the initial network construction step needs O(|E|) time. In the rip-up and reroute procedure, each flow is ripped up and rerouted once. The edge weight calculation for the routing cost graph takes O(|V|²). For a multicast flow with m destinations, the construction of the multicast routing graph takes O((m + 1)²|V|²), by finding the shortest path between each pair of nodes. It then takes O(|V|²) to find the rooted directed minimum spanning tree, used as the multicast tree, with the Chu-Liu/Edmonds algorithm. The overall complexity of our algorithm is therefore O(|E||V|²).
8.7 Deadlock Considerations

Deadlock-free routing is an important consideration for the correct operation of custom NoC architectures. In our previous work [9, 10], we proposed two mechanisms to ensure deadlock-free operation in our NoC synthesis results. In this section, we adopt the same mechanisms in our new NoC synthesis algorithm to ensure deadlock freedom for the deterministic routing problem we consider.

The first method is statically scheduled routing. For our NoC solutions, the required data rates are specified and the routes are fixed. In this setting, data transfers can be statically scheduled along the pre-determined paths with resource reservations to ensure deadlock-free routing [44, 45].

The second method is virtual channel insertion. As shown in [46], a necessary and sufficient condition for deadlock-free routing is the absence of cycles in a channel dependency graph. In particular, we use an extended channel dependency graph construction to find resource dependencies between multicast trees3 and break the cycles by splitting a channel into two virtual channels (or by adding another virtual channel if the physical channel has already been split). The added virtual channels are implemented in the corresponding routers. We applied this method in our NoC synthesis procedure and found that virtual channels are rarely needed to resolve deadlocks in practice for custom networks. In all the benchmarks that we tested in Sect. 8.8, no deadlocks were found in the synthesized solutions; therefore, we did not need to add any virtual channels.

3 This extended channel dependency graph construction treats unicast flows as a special case.
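The cycle test at the heart of the second method can be sketched as follows. The construction shown handles plain ordered paths only, so it is a simplification of the extended channel dependency graph used for multicast trees.

```python
import networkx as nx

def needs_virtual_channels(routes):
    """Deadlock check sketch: build a channel dependency graph whose nodes are
    directed channels (router pairs) and whose edges connect consecutive channels
    used by the same route, then test it for cycles [46].  `routes` is a list of
    ordered router paths, e.g. [[4, 2, 5], ...]."""
    cdg = nx.DiGraph()
    for path in routes:
        channels = list(zip(path, path[1:]))             # consecutive (r_i, r_j) hops
        cdg.add_nodes_from(channels)
        cdg.add_edges_from(zip(channels, channels[1:]))   # dependency between successive hops
    return not nx.is_directed_acyclic_graph(cdg)          # True => cycles => split channels / add VCs

# Two flows traversing the same routers in opposite directions create no cycle here
print(needs_virtual_channels([[0, 1, 2], [2, 1, 0]]))     # False
```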
8.8 Results

8.8.1 Experimental Setup

We have implemented our proposed algorithm RRRM in C++. In our experiments, we aim to evaluate the performance of RRRM on benchmarks with the objective of minimizing the total power consumption of the synthesized NoC architectures under the specified performance constraint for the traffic flows. The performance constraint is specified in the form of the average hop count of all the traffic flows in the benchmarks. The total power consumption includes both the leakage power and the dynamic switching power of all network components. As discussed in Sect. 8.5, we use Orion [41, 42] to estimate the power consumption of the router configurations generated. We applied the design parameters of 1 GHz clock frequency, four-flit buffers, and 128-bit flits. For the link power parameters, we use the 3D interconnect models to evaluate the optimum power of links with different lengths under the given delay constraint of 1 ns. Both routers and links are evaluated in the 70 nm technology and are provided in a library.

All existing published benchmarks target 2D architectures. However, their sizes are not large enough to take advantage of 3D network topologies. In the absence of published 3D benchmarks with a large number of cores and traffic flows, we generated a set of synthetic benchmarks by extending the NoC-centric bandwidth-version of Rent's rule proposed by Greenfield et al. [47]. They showed that the traffic distribution of NoC applications should follow a Rent's rule distribution similar to that of conventional VLSI netlists. The bandwidth-version of Rent's rule states that the relationship between the external bandwidth B across a boundary and the number of blocks G within the boundary obeys B = kG^β, where k is the average bandwidth of each block and β is the Rent's exponent. We used this NoC-centric Rent's rule [47, 48] to generate large 3D NoC benchmarks for 3D circuits with varying numbers of cores in each layer and flows with varying data rate distributions. The average bandwidth k for each block and the Rent's exponent β are specified by the user. In our experiments, we generated large NoC benchmarks by varying k from 100 to 500 kb/s and β from 0.65 to 0.75. We formed multicast traffic with varying group sizes for about 10% of the flows. Thus our multicast benchmarks cover a large range of applications with mixed unicast/multicast flows and varying hop count and data rate distributions. The benchmarks are generated for 3D circuits with three layers and four layers, respectively, with face-to-back bonding between layers. The total number of cores in these benchmarks ranges from 48 to 120, and the total number of flows ranges from 101 to 280.

Our work is among the first in the area of application-specific NoC synthesis for 3D network topologies. In the absence of previously published work in this area, direct comparison with others' work is not possible. To evaluate the effectiveness of our proposed algorithm, we have generated a full 3D mesh implementation of each benchmark for comparison. In a full 3D mesh implementation, each module is connected to a router with seven input/output ports: one local port, four ports connecting to the four directions in the same layer, and two ports connecting to the upper and lower adjacent layers. Packets are routed over the mesh from source to destination using XYZ routing. We also generated a variant of the basic mesh topology, called optimized mesh (opt-mesh), by eliminating router ports and links that are not used by the traffic flows.
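For reference, the bandwidth-version of Rent's rule that drives the benchmark generation can be evaluated directly. The snippet below only illustrates how the external bandwidth of a group of cores scales under the chapter's parameter ranges; it is not the benchmark generator itself.

```python
def rent_bandwidth(k, beta, g):
    """Bandwidth-version of Rent's rule [47]: external bandwidth B = k * G**beta
    across the boundary of a group of G blocks, where k is the average per-block
    bandwidth and beta is the Rent's exponent."""
    return k * g ** beta

# With k in 100..500 kb/s and beta in 0.65..0.75, the traffic crossing the
# boundary of a 16-core region would be roughly:
for k in (100, 300, 500):
    print(k, "kb/s per block ->", round(rent_bandwidth(k, beta=0.70, g=16), 1), "kb/s external")
```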
All experimental results were obtained on a 1.5 GHz Intel P4 processor machine with 512 MB memory running Linux.
Table 8.4 3D NoC synthesis results

Bench. | Layers | Cores | Flows | RRRM power (W) | Ratio to mesh | Ratio to opt-mesh | RRRM avg. hops | Hops ratio to mesh/opt | Time (s) | Mesh power (W) | Mesh avg. hops | Opt-mesh power (W) | Opt-mesh avg. hops
B1 | 3 | 48 | 101 | 0.497 | 0.25 | 0.50 | 2.48 | 0.98 | 4 | 1.951 | 2.54 | 0.990 | 2.54
B2 | 3 | 60 | 133 | 0.654 | 0.26 | 0.49 | 2.73 | 0.99 | 15 | 2.530 | 2.77 | 1.326 | 2.77
B3 | 4 | 64 | 149 | 0.788 | 0.28 | 0.46 | 2.05 | 0.60 | 37 | 2.769 | 3.40 | 1.712 | 3.40
B4 | 3 | 75 | 169 | 0.850 | 0.26 | 0.51 | 2.38 | 0.88 | 79 | 3.269 | 2.69 | 1.678 | 2.69
B5 | 4 | 80 | 177 | 0.867 | 0.24 | 0.53 | 2.25 | 0.82 | 137 | 3.550 | 2.75 | 1.637 | 2.75
B6 | 3 | 90 | 203 | 1.089 | 0.27 | 0.48 | 2.65 | 0.85 | 235 | 4.032 | 3.11 | 2.254 | 3.11
B7 | 4 | 100 | 228 | 1.083 | 0.24 | 0.44 | 2.68 | 0.88 | 443 | 4.599 | 3.06 | 2.461 | 3.06
B8 | 3 | 108 | 248 | 1.335 | 0.27 | 0.45 | 2.22 | 0.63 | 446 | 4.975 | 3.52 | 2.979 | 3.52
B9 | 4 | 120 | 280 | 1.426 | 0.25 | 0.44 | 2.59 | 0.81 | 1,144 | 5.670 | 3.21 | 3.233 | 3.21
8.8.2 Comparison of Results

The synthesis results of our algorithm on all benchmarks at 70 nm, compared with results using the mesh and opt-mesh topologies, are shown in Table 8.4. For each algorithm, the power results and the average hop counts are reported. In the experiments, we used the average hop count results of the 3D mesh topologies as the performance constraints fed to RRRM for each benchmark. The average hop counts of the 3D mesh topologies reported in Table 8.4 are small for all benchmarks, all under 3.5.

The average hop count results of RRRM on all benchmarks relative to the opt-mesh/mesh implementations are compared graphically in Fig. 8.8b. The results show that RRRM satisfies the constraints for all the benchmarks. On average, RRRM achieves a 17% reduction in average hop count over the mesh topology.
Fig. 8.8 Comparisons of all algorithms on benchmarks B1-B9. a Power (power ratio over mesh, for RRRM, opt-mesh and mesh). b Hop count (average hops ratio over opt-mesh/mesh, for RRRM and opt-mesh/mesh)
The power consumption results of RRRM and opt-mesh relative to the mesh implementations are compared graphically in Fig. 8.8a. The results show that RRRM can efficiently synthesize NoC architectures that minimize power consumption under the performance constraint. It achieves substantial reductions in power consumption over the standard mesh and opt-mesh topologies in all cases: on average, a 74% reduction over the standard mesh topologies and a 52% reduction over the optimized mesh topologies. The execution times of RRRM are also reported in Table 8.4. The results show that RRRM runs very fast; for the largest benchmark, with 120 cores and 280 flows, it finishes within 20 min.
8.9 Conclusions

In this chapter, we proposed a very efficient algorithm called Ripup-Reroute-and-Router-Merging (RRRM) that synthesizes custom 3D-NoC architectures. The algorithm is based on a rip-up and reroute formulation for routing flows to find the network topology, followed by a router merging procedure to optimize it. Our algorithm takes into consideration both unicast and multicast traffic, and our objective is to construct an optimized 3D interconnection architecture such that the communication requirements are satisfied and the power consumption is minimized. We also derived accurate power and delay models for 3D wiring. For the network topology derived, the routes of the corresponding flows and the bandwidth requirements of the corresponding network links are determined, and the implementation cost is evaluated based on the design objectives and constraints. Experimental results on a variety of benchmarks using the accurate power consumption cost model show that our algorithm produces effective solutions compared to 3D mesh implementations.
References

1. W. J. Dally, B. Towles, “Route packets, not wires: On-chip interconnection networks,” DAC, 2001.
2. L. Benini, G. De Micheli, “Networks on chips: A new SoC paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.
3. M. B. Taylor et al., “The RAW microprocessor: A computational fabric for software circuits and general-purpose programs,” IEEE Micro, vol. 22, no. 6, pp. 25–35, Mar./Apr. 2002.
4. K. Sankaralingam et al., “Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture,” ISCA, 2003.
5. J. Hu, R. Marculescu, “Energy-aware mapping for tile-based NoC architectures under performance constraints,” ASP-DAC, 2003.
6. S. Murali, G. De Micheli, “Bandwidth constrained mapping of cores onto NoC architectures,” DATE, 2004.
7. K. Srinivasan, K. S. Chatha, G. Konjevod, “Linear-programming-based techniques for synthesis of network-on-chip architectures,” IEEE Transactions on VLSI Systems, vol. 14, no. 4, pp. 407–420, Apr. 2006.
8. S. Murali et al., “Designing application-specific networks on chips with floorplan information,” ICCAD, 2006.
9. S. Yan, B. Lin, “Application-specific network-on-chip architecture synthesis based on set partitions and Steiner trees,” ASP-DAC, 2008.
10. S. Yan, B. Lin, “Custom networks-on-chip architectures with multicast routing,” IEEE Transactions on VLSI Systems, accepted for publication, 2008.
11. K. Lee et al., “Three-dimensional shared memory fabricated using wafer stacking technology,” IEDM Technical Digest, Dec. 2000.
12. L. Xue et al., “Three dimensional integration: Technology, use, and issues for mixed-signal applications,” IEEE Transactions on Electron Devices, vol. 50, pp. 601–609, May 2003.
13. W. R. Davis et al., “Demystifying 3D ICs: The pros and cons of going vertical,” IEEE Design & Test of Computers, vol. 22, no. 6, pp. 498–510, 2005.
14. M. Kawano et al., “A 3D packaging technology for 4Gbit stacked DRAM with 3Gbps data transfer,” IEEE International Electron Devices Meeting, pp. 1–4, 2006.
15. J. Kim et al., “A novel dimensionally-decomposed router for on-chip communication in 3D architectures,” ISCA, 2007.
16. V. F. Pavlidis, E. G. Friedman, “3-D topologies for networks-on-chip,” IEEE Transactions on VLSI Systems, vol. 15, no. 10, pp. 1081–1090, Oct. 2007.
17. H. Matsutani, M. Koibuchi, H. Amano, “Tightly-coupled multi-layer topologies for 3-D NoCs,” ICPP, 2007.
18. T. Kgil et al., “PICOSERVER: Using 3D stacking technology to enable a compact energy efficient chip multiprocessor,” ASPLOS-XII, 2006.
19. F. Li et al., “Design and management of 3D chip multiprocessors using network-in-memory,” ISCA, 2006.
20. P. Morrow et al., “Design and fabrication of 3D microprocessor,” Material Research Society Symposium, 2007.
21. W. A. Dees, Jr., P. G. Karger, “Automated rip-up and reroute techniques,” DAC, 1982.
22. H. Shin, A. Sangiovanni-Vincentelli, “A detailed router based on incremental routing modifications: Mighty,” IEEE Transactions on CAD of Integrated Circuits and Systems, vol. CAD-6, no. 6, pp. 942–955, Nov. 1987.
23. H. Shirota, S. Shibatani, M. Terai, “A new rip-up and reroute algorithm for very large scale gate arrays,” ICICC, May 1996.
24. J. Cong, J. Wei, Y. Zhang, “Thermal-driven floorplanning algorithm for 3D ICs,” ICCAD, 2004.
25. J. Cong, Y. Zhang, “Thermal-driven multilevel routing for 3-D ICs,” ASP-DAC, 2005.
26. B. Goplen, S. Sapatnekar, “Efficient thermal placement of standard cells in 3D ICs using a force directed approach,” ICCAD, 2003.
27. M. Pathak, S. K. Lim, “Thermal-aware Steiner routing for 3D stacked ICs,” ICCAD, 2007.
28. C. Addo-Quaye, “Thermal-aware mapping and placement for 3-D NoC designs,” IEEE International SOC Conference, 2005.
29. X. Lin, P. K. McKinley, L. M. Ni, “Deadlock-free multicast wormhole routing in 2-D mesh multicomputers,” IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 8, pp. 793–804, Aug. 1994.
30. M. P. Malumbres, J. Duato, J. Torrellas, “An efficient implementation of tree-based multicast routing for distributed shared-memory,” IEEE Symposium on Parallel and Distributed Processing, 1996.
31. K. Goossens, J. Dielissen, A. Radulescu, “The Ethereal network on chip: Concepts, architectures, and implementations,” IEEE Design & Test of Computers, vol. 22, no. 5, pp. 414–421, 2005.
32. M. Millberg et al., “Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip,” DATE, 2004.
33. Z. Lu, B. Yin, A. Jantsch, “Connection-oriented multicasting in wormhole-switched networks on chip,” Emerging VLSI Technologies and Architectures (ISVLSI), 2006.
34. F. A. Samman, T. Hollstein, M. Glesner, “Multicast parallel pipeline router architecture for network-on-chip,” DATE, 2008.
35. E. A. Carara, F. G. Moraes, “Deadlock-free multicast routing algorithm for wormhole-switched mesh networks-on-chip,” ISVLSI, 2008.
36. Y. J. Chu, T. H. Liu, “On the shortest arborescence of a directed graph,” Science Sinica, vol. 14, pp. 1396–1400, 1965.
37. J. Edmonds, “Optimum branchings,” Research of the National Bureau of Standards, vol. 71B, pp. 233–240, 1967.
38. G. Chen, E. G. Friedman, “Low-power repeaters driving RC and RLC interconnects with delay and bandwidth constraints,” IEEE Transactions on VLSI Systems, vol. 14, no. 2, pp. 161–172, Feb. 2006.
39. L. Zhang et al., “Repeated on-chip interconnect analysis and evaluation of delay, power, and bandwidth metrics under different design goals,” ISQED, 2007.
40. The International Technology Roadmap for Semiconductors, 2007.
41. H. Wang et al., “Orion: A power-performance simulator for interconnection networks,” MICRO 35, Nov. 2002.
42. X. Chen, L.-S. Peh, “Leakage power modeling and optimization in interconnection networks,” ISLPED, 2003.
43. K. Srinivasan, K. S. Chatha, G. Konjevod, “Application specific network-on-chip design with guaranteed quality approximation algorithms,” ASP-DAC, Jan. 2007.
44. E. Rijpkema et al., “Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip,” DATE, 2003.
45. N. Enright-Jerger, M. Lipasti, L.-S. Peh, “Circuit-switched coherence,” IEEE Computer Architecture Letters, vol. 6, no. 1, pp. 5–8, Mar. 2007.
46. W. J. Dally, C. L. Seitz, “Deadlock-free message routing in multiprocessor interconnection networks,” IEEE Transactions on Computers, vol. C-36, no. 5, pp. 547–550, May 1987.
47. D. Greenfield et al., “Implications of Rent's rule for NoC design and its fault-tolerance,” NOCS, May 2007.
48. D. Stroobandt, P. Verplaetse, J. van Campenhout, “Generating synthetic benchmark circuits for evaluating CAD tools,” IEEE Transactions on CAD of Integrated Circuits and Systems, vol. 19, no. 9, pp. 1011–1022, Sep. 2000.
Chapter 9
3D Network on Chip Topology Synthesis: Designing Custom Topologies for Chip Stacks
Ciprian Seiculescu, Srinivasan Murali, Luca Benini and Giovanni De Micheli
9.1 Introduction

Today, many integrated circuits contain several processor cores, memories, hardware cores and analog components integrated on the same chip. Such Systems on Chips are widely used in high-volume and high-end applications, ranging from multimedia, wired and wireless communication systems to aerospace and defense applications. As the number of cores integrated on a SoC increases with technology scaling, the two-dimensional chip fabrication technology faces many challenges in utilizing the exponentially growing number of transistors. As the number of transistors and the die size of the chip increase, the length of the interconnects also increases. With smaller feature sizes, the performance of the transistors has increased dramatically. However, the performance improvement of interconnects has not kept pace with that of the transistors [1]. With shrinking geometries, the wire pitch and cross section also shrink, thereby increasing the RC delay of the wires. This, coupled with increasing interconnect length, leads to long timing delays on global wires. For example, in advanced technologies, long global wires could require up to 10 clock cycles for traversal [2]. Another major impact of increased lengths and RC values is that the power consumption of global interconnects becomes significant, thereby posing a big challenge for system designers.
9.1.1 3D-Stacking

Recently, 3D-stacking of silicon layers has emerged as a promising solution that addresses some of the major challenges in today's 2D designs [1, 3–8].
In the 3D stacked technology, the design is partitioned into multiple blocks, with each block implemented on a separate silicon layer. The silicon layers are stacked on top of each other, and each silicon layer has multiple metal layers for routing horizontal wires. Unlike the 3D packaging solutions that have been around for a long time (such as the traditional system-in-package), the different silicon layers are connected by means of on-chip interconnects. The 3D-stacking technology has several major advantages: (1) the footprint on each layer is smaller, leading to more compact chips; (2) smaller footprints lead to shorter wires within each layer, and inter-layer connections are obtained using efficient vertical connections, leading to lower delay and power consumption in the interconnect architecture; (3) it allows integration of diverse technologies, as each could be designed as a separate layer. A detailed study of the properties and advantages of 3D interconnects is presented in [1] and [9].

There are several methods for performing 3D integration of silicon layers, such as the Die-to-Die, Die-to-Wafer and Wafer-to-Wafer bonding processes. In the Die-to-Die bonding process, individual dies are glued together to form the 3D-IC. In the Die-to-Wafer process, individual dies are stacked on top of dies that have not yet been cut from the wafer. The advantage of these processes is that the wafers on which the different layers of the 3D stack are produced can be of different sizes. Another advantage is that the individual dies can be tested before the stacking process and only “known-good-dies” can be used, thereby increasing the yield of the 3D-IC. In Wafer-to-Wafer bonding, full wafers are bonded together.

The vertical interconnection is usually achieved using Through Silicon Vias (TSVs). For a connection from one layer to another, a TSV is created in the upper layer and the vertical interconnect passes through the via from the top layer to the bottom layer. Connections across non-adjacent layers can also be achieved by using TSVs at each intermediate layer. The integration of the different layers can be done with either face-to-face or face-to-back topologies [10]. A die's face is considered to be the metal layers and the back the silicon substrate. The copper half of the TSV is deposited on each die and the two dies are bonded using thermal compression. Typically, the dies are thinned to reduce the distance between the stacked layers.

Several research efforts have addressed 3D technology and manufacturing issues [1, 4, 11]. Several industrial labs, CEA-LETI [12], IBM [13], IMEC [14] and Tezzaron [15], to name a few, are also actively developing methods for 3D integration. In Fig. 9.1, we show a set of vertical wires using TSVs implemented in SOI and bulk silicon technologies [11]. We also show the schematic representation of a bundle of TSV vias in Fig. 9.2. In [11], a 4 × 4 µm via cross section, 8 µm via pitch, 1 µm oxide thickness and 50 µm layer thickness are used.

The use of 3D technology introduces new opportunities and challenges. The technology should achieve a very high yield, and commercial CAD tools should evolve to support 3D designs. Another major concern in 3D chips is managing heat dissipation. In 2D chips, the heat sink is usually placed at the top of the chip; in 3D designs, the intermediate layers may not have direct access to the heat sink to effectively dissipate the heat they generate.
Several researchers have been working on all these issues and several methods have been proposed to address them. For example, the problem of partitioning and floorplanning of designs for 3D integration has been
Fig. 9.1 An example set of nine vertical links
Fig. 9.2 3D bundle cross-section
addressed in [4–8]. Today, several 3D technologies have matured to provide high yield [16]. Many methods have been presented for achieving thermally efficient 3D systems, spanning architectural to technology-level choices. At the architectural level, works have addressed efficient floorplanning to avoid thermal hot-spots in 3D designs [17]. At the circuit level, thermal vias that specifically conduct heat across the different silicon layers have been used [4]. In [13], the use of liquid cooling across the different layers is presented.
9.1.2 Networks on Chips for 3D ICs

One of the major challenges that designers face today in 3D integration is how to achieve the interconnection across the components within a layer and across the layers
in a scalable and efficient manner. The use of Networks on Chips (NoCs) has emerged as the solution to this 3D integration problem. The NoC paradigm has recently evolved to address the communication issues on a chip [2, 18, 19]. NoCs consist of switches and links and use circuit- or packet-switching technology to transfer data inside a chip. NoCs have several advantages, including faster design closure, higher performance and modularity. An example NoC architecture is shown in Fig. 9.3. A NoC consists of a set of switches (or routers), links and interfaces that packetize data from the cores. A detailed introduction to NoC principles can be found in [19].

NoCs differ from macro-networks (such as wide area networks) because of local proximity and predictable behavior. On-chip networks should have low communication latency and power consumption, and can be designed for particular application traffic characteristics. Unlike in a macro-network, the latency inside a chip should be on the order of a few clock cycles. The use of complex protocols would lead to large latencies; NoCs therefore require streamlined protocols. Power consumption is a major issue for SoCs, so the on-chip network routers and links should be highly power efficient and occupy low area.

The use of NoCs in SoCs has been a gradual process, with the interconnects evolving from single bus structures to multiple buses with bridges, crossbars and packet-switching networks. Compared to traditional bus-based systems, a network is clearly more scalable: additional bandwidth can be obtained by adding more switches and links. Networks are inherently parallel in nature, with distributed arbitration for resources. Thus, multiple transactions between cores take place in parallel in different parts of a NoC, whereas a bus-based system uses centralized arbitration, thereby leading to large congestion.
Fig. 9.3 Example NoC design
Also, the structure and wiring complexity can be well controlled in NoCs, leading to timing predictability and fast design closure. The switches segment long global wires, and the load on the links is smaller due to their point-to-point nature.

NoCs are a natural choice for 3D chips. A major feature of NoCs is the large degree of freedom that can be exploited to meet the requirements. For example, the number of wires in a link (i.e. the link data width) can be tuned according to the application and architecture requirements, and the data can be efficiently multiplexed on a small set of wires if needed. This is unlike bus-based systems, which require separate sets of wires for address, data and control. Thus, communication across layers can be established with fewer vertical interconnects and TSVs. NoCs are also scalable, making the integration of different layers easy. Several different data streams from different sources and destinations can be transferred in parallel in a NoC, thereby increasing performance.

The combined use of 3D integration technologies and NoCs introduces new opportunities and challenges for designers. Several researchers are working on building NoC architectures for 3D SoCs. Router architectures tuned specifically for 3D technologies have been presented in [20] and [21]. The use of NoCs for 3D multi-processors has been presented in [22]. Analytically computed cost models for 3D NoCs have been presented in [23]. Designing regular topologies for 3D has been addressed in [24].
9.1.3 Designing NoCs for 3D ICs

Designing NoCs for 3D chips that are application-specific, with minimum power-delay, is a major challenge. Successful deployment of NoCs requires dedicated solutions that are tailored to specific application needs. Thus, the major challenge is to design hardware-optimized, customizable platforms for each application domain. The designed NoC should satisfy the bandwidth and latency constraints of the different flows of the application. Several works have addressed the design of bus-based and other interconnect architectures for 2D ICs [25, 26]. Several methods have addressed the mapping of cores onto NoC architectures [27–31]. Custom topology synthesis for 2D designs has been addressed in [32–39].

Compared to the synthesis of NoCs for 2D designs, the design for 3D systems presents several unique challenges. The design process needs to support constraints on the number of TSVs that can be established across any two layers. In some 3D technologies, only connections across adjacent layers can be supported, which also needs to be considered. Finally, the layer assignment and placement of the switches in each layer need to be performed.

The yield of a 3D IC can be affected by the number of TSVs used, depending on the technology. In Fig. 9.4, we show how the yield of different processes varies with the number of TSVs used across two layers. The graphs show a trend that, after a threshold, the yield decreases with increasing number of TSVs.
Fig. 9.4 Yield vs. TSV count (yield curves for the HRI-JP, IMEC and IBM/SOI processes)
Thus, the topology synthesis process should be able to design valid topologies meeting a specific yield and hence a TSV constraint. Moreover, with increasing TSV count, more area needs to be reserved for the TSV macros in each layer. Thus, to reduce area, a bound on the allowed TSVs is important. In [3], the pitch of a TSV is reported to be between 3 and 5 µm. Reserving area for too many TSVs can cause a considerable reduction in the active silicon area remaining for transistors.

The TSV constraint can significantly impact the outcome of the topology synthesis process. Intuitively, when more TSVs are permitted, more vertical links (or larger data widths) can be deployed. The resulting topologies could have lower latencies, while using more area for the TSVs. On the other hand, a tight TSV constraint forces fewer inter-layer links, thereby increasing congestion on such links and affecting performance. In Figs. 9.5 and 9.6, we show the best topologies synthesized by our flow for the same benchmark under two different TSV constraints: in the first case 13 inter-layer links are used, and in the second only eight.

Building power-efficient NoCs for 3D systems that satisfy the performance requirements of applications, while satisfying the technology constraints, is an important problem. To address this issue, new architectures and design methods are needed. In this chapter, we present synthesis methods for designing NoCs for 3D ICs. The objective of the synthesis process is to obtain the most power-efficient NoC topology for the 3D design. The process has to meet the 3D design constraints and the application performance requirements. We also take the floorplan of each layer of the 3D design, without the interconnect, as an optional input to the design process. In our design process, we also compute the positions of the switches in the floorplan and place them while minimally perturbing the positions of the other cores. We apply our methods to several SoC benchmarks, which show large power and latency improvements compared to the use of standard topologies.
Fig. 9.5 Topology using 13 inter-layer links
While many works have addressed architectural issues in the design of NoCs for 3D ICs, relatively few works have focused on the design aspects. In [17], the authors have presented methods for mapping cores onto 3D NoCs under thermal constraints. We have presented our methods for designing NoCs for 3D ICs in [40, 41]. We also make a comparative study of NoC designs for the corresponding 2D implementations of the benchmarks. The objective is to evaluate the actual power and latency advantages of moving to a 3D implementation. For this study, we apply a flow we developed earlier for NoC design for 2D systems [39]. Our results show that a 3D design can significantly reduce the interconnect power consumption (38% on average) when compared to the 2D designs. However, the latency savings are lower (13% on average), as the number of additional links that required pipelining in 2D was small.

We use TSVs to establish the vertical interconnections. In Fig. 9.7, an example of how a vertical link is established across two layers is presented. In our architecture, the TSV needs to be drilled only in the top layer, and the interconnect uses horizontal metal in the bottom layer. In our synthesis flow, we allocate area for a TSV macro in the top layer for the link during the floorplanning phase. The TSV macro is placed directly at the output port of the corresponding switch. For links that go across multiple silicon layers, we also place TSV macros in each intermediate layer. In Fig. 9.8, we show an example link that spans multiple layers.
Fig. 9.6 Topology using eight inter-layer links
9.2 3D Architecture and Design Flow

If a core is connected to a switch that is in a layer below the core's layer, then the network interface that translates the core's communication protocol to the network protocol is the one that contains the necessary TSV macros. If there are intermediate layers between the core's network interface and the switch it is connected to, then TSV macros are added in the intermediate layers, just as in the case of the switch-to-switch link of Fig. 9.8.
Fig. 9.7 Example vertical link
Fig. 9.8 Example vertical link spanning multiple layers
Active silicon area is lost every time a TSV macro is placed, as the area reserved by the macro is used to construct the TSV.
9.3 Design Flow Assumptions

In our design approach we make several realistic assumptions:

• The computation architecture is designed separately from the communication architecture. Several works (such as [42]) have shown the need to separate computation from communication design to tame the complexity.
For the communication architecture design, we assume that the hardware/software partitioning of the application tasks onto the processor/hardware cores has been performed and that tasks are statically assigned to the cores.
• The assignment of the cores to the different layers of the 3D stack is performed using existing methods/tools. There have been several works that address this issue, and our work is complementary to them.
• The floorplan of the cores in each layer (without the NoC) has been produced by existing methods. We use the floorplan estimates as inputs to our design flow to better estimate wire delay and power consumption.
• Though the synthesis methods presented in this chapter are general and applicable to a wide range of NoCs, we use a particular architecture ([43]) to validate the approach.
9.4 Design Approach

Our design flow for topology synthesis is presented in Fig. 9.9. The topology synthesis procedure produces a set of valid design points that meet the application performance and 3D technology constraints, with different power, delay and area values. From the Pareto curves, the designer can then choose the best design point.
Fig. 9.9 Design flow (inputs: the communication specification with bandwidth and latency constraints and message types of the traffic flows, the core specification with layer assignment and optional per-layer floorplan, NoC area/power/timing models, vertical link power and latency models, and the technology constraints on TSVs and inter-layer links; output: an application-specific 3D NoC)
The placement and floorplan of the switches and TSVs on each layer are also produced as outputs. The different steps in the topology synthesis process are outlined as follows. In the outermost loop, the architectural parameters, such as the NoC frequency of operation and data width, are varied, and for each architectural point the topology synthesis process is repeated. Then, the number of switches in the design is varied. When fewer switches are used to connect the cores, the size of each switch is large and the inter-switch links are longer, but the packets traverse a shorter path. On the other hand, when more switches are used, the size of each switch is smaller, but packets travel more hops. Depending on the application characteristics, the optimal power point in terms of the number of switches varies. Hence, our synthesis tool sweeps a large design space, building topologies with different switch counts.

For a chosen architectural point and switch count, we establish connectivity across the cores and switches. We also determine the layer assignment for each of the switches. If there is a tight constraint on the TSVs, or when the design objective is to minimize the number of vertical connections, a core in a layer can be constrained to be connected to a switch in the same layer. In this way, core-to-switch links do not require vertical connections. On the other hand, this requires inter-layer traffic flows to traverse at least two switches, thereby increasing latency and power consumption. This choice is application-specific and should be made carefully. To address this issue, we present a two-phase method for establishing core-to-switch connectivity. In the first phase, a core can be connected to a switch in any layer, while in the second phase, cores can be connected only to switches in the same layer. The second phase is invoked when the TSV constraints are not met in Phase 1 or when the objective is to minimize the number of vertical interconnections used. These phases are explained in detail in the next sections.

Several inputs are obtained for the topology synthesis process: the name, size and position of the different cores, the assignment of cores to the 3D layers, and the bandwidth and latency constraints of the different flows. A constraint on the number of TSVs that can be established is also taken as an input. In some 3D technologies, vertical interconnects can be established only across adjacent layers; this is also taken as an input. We model the maximum TSV constraint by constraining the number of NoC links that can be established across adjacent layers. We denote this by max_ill. For a chosen link width, the value of max_ill can be computed easily from the TSV constraint (see the sketch below). For the synthesis procedure, the power, area and timing models of the NoC switches and links are also taken as inputs. For the experimental validation, we use the library of network components from [43], and the models are obtained from layout-level implementations of the library components. The design process is general, and models for other NoC architectures can also be easily integrated with the design flow. The delay and power consumption values of the vertical interconnects are also taken as inputs; we use the models from [11] for the vertical interconnects.
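As an illustration of how max_ill can be derived, the fragment below divides a TSV budget between two adjacent layers by the number of TSVs a single vertical link would need; the per-link flow-control overhead is an assumption of this sketch, not a value from the chapter.

```python
def max_inter_layer_links(max_tsvs, link_width_bits, control_wires=4):
    """Derive max_ill from a TSV budget, assuming one TSV per data bit of a
    vertical link plus a few TSVs for flow-control signals (illustrative)."""
    return max_tsvs // (link_width_bits + control_wires)

# e.g. a budget of 2,000 TSVs between adjacent layers and 128-bit links
print(max_inter_layer_links(2000, 128))   # -> 15 inter-layer links (max_ill)
```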
9.5 Algorithm

We now describe the algorithm for synthesizing application-specific NoC topologies for 3D ICs. We start by formally defining the inputs to the algorithm. The first input is the core specification, which describes the number of cores, their positions and their layer assignment:

Definition 1 For a design with n cores, the x and y coordinates of core i are represented by xc_i and yc_i respectively, ∀i ∈ 1 · · · n. The assignment of core i to a 3D layer is represented by layer_i.

The second input is the communication specification, which describes the communication characteristics of the application and is represented by a graph [28, 30, 31]:

Definition 2 The communication graph is a directed graph G(V, E), with each vertex v_i ∈ V representing a core and each directed edge (v_i, v_j) representing the communication between cores v_i and v_j. The bandwidth of the traffic flow from core v_i to v_j is represented by bw_{i,j}, and the latency constraint of the flow by lat_{i,j}.

Several NoC architectural parameters can be explored, such as the frequency at which the topology has to run and the data width of the links. The ranges in which these design parameters are varied are taken as inputs. The algorithm sweeps the parameters in steps and designs the best topology for each setting. For each architectural point, the algorithm performs the steps shown in Algorithm 1.

The algorithm first creates a list of all the switch counts for which topologies will be generated (steps 2–5). By default, the switch count is varied from one to the number of cores in the design (or in each layer). However, the designer can also set the range of switch counts to be explored manually. The objective function of topology synthesis is initially set to minimize power consumption. However, for each topology point, if the 3D technology constraints are not met, the objective function is gradually driven towards minimizing the number of vertical interconnections. For this purpose, we use the scaling parameter θ. To obtain designs with fewer inter-layer links, θ is varied from θ_min to θ_max in steps of θ_scale, until the constraint on the maximum number of inter-layer links is met. After several experimental runs, we determined that varying θ from 1 to 15 in steps of 3 gives good results.

In step 7, the algorithm tests whether inter-layer links can cross multiple layers; if not, Phase 1 is skipped and Phase 2 is used directly. In step 8, the parameter θ, which sets the importance of the 3D constraints, is set to its minimum value so as to optimize for power first. The function that builds topologies is called in step 10 on the initial list of switch counts to be explored; a detailed description of BuildTopologyGeneral is given in Sect. 9.5.1. If the Unmet set is not empty, some topology points have not met the technology constraints; θ is then increased and the function is called again. Phase 2 of the algorithm, detailed in Sect. 9.5.2, is more restricted, as cores can only be connected to switches in the same layer of the 3D stack. Topologies built
using this restriction usually consume more power, as more switches are required. The average hop count also increases, as flows between different layers have to go through at least two hops. The advantage of the Phase 2 method is that it can build topologies under a very tight constraint on the number of inter-layer links. In step 15, the algorithm tests whether there are entries in the Unmet set for which topologies were not built in Phase 1. This can happen either because the constraint on the maximum number of inter-layer links was too tight, or because the technology does not allow inter-layer links to cross more than one layer and Phase 1 was skipped completely. If Unmet is not empty, the algorithm calls the BuildTopologyLayerByLayer function in step 16, which tries to build topologies using the restrictive approach.
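To make the control flow of Algorithm 1 easier to follow, the Python-style sketch below summarizes the outer loop described above. It is a minimal illustration under our reading of the text, not the authors' implementation; the helper functions passed in (build_general, build_layer_by_layer) and the exact θ bookkeeping are assumptions introduced here.

    def synthesize_arch_point(switch_counts, max_ill, multilayer_links_ok,
                              build_general, build_layer_by_layer,
                              theta_min=1, theta_max=15, theta_step=3):
        # Unmet holds the switch counts for which no feasible topology exists yet.
        unmet = set(switch_counts)
        topologies = {}

        if multilayer_links_ok:                   # Phase 1: cores may attach to any layer
            theta = theta_min
            while unmet and theta <= theta_max:
                built = build_general(unmet, theta, max_ill)
                topologies.update(built)
                unmet -= set(built)               # keep only the still-unmet design points
                theta += theta_step               # give more weight to the 3D constraints

        if unmet:                                 # Phase 2: layer-by-layer fallback
            topologies.update(build_layer_by_layer(unmet, max_ill))
        return topologies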
9.5.1 Phase 1

Since different switch counts are explored and the number of switches rarely equals the number of cores, the first problem that arises is to decide how to connect the cores to switches. The assignment of cores to switches has a big impact on the power consumption, but also on the number of inter-layer links required, as cores from different layers can be assigned to the same switch. As multiple cores have to be assigned to the same switch, we partition the cores into as many blocks as there are switches. For this, we define the partitioning graph as follows:

Definition 3 The partitioning graph is a directed graph PG(U, H, α) that has the same set of vertices and edges as the communication graph. The weight of the edge (u_i, u_j), defined by h_{i,j}, is set to a combination of the bandwidth and the latency constraints of the traffic flow from core u_i to u_j:

h_{i,j} = α × bw_{i,j} / max_bw + (1 − α) × min_lat / lat_{i,j},
where max_bw is the maximum bandwidth value over all flows, min_lat is the tightest latency constraint over all flows and α is a weight parameter.
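As an illustration, the short Python sketch below computes these partitioning-graph weights directly from the formula in Definition 3. The flow table and the α value are hypothetical and only serve to show the computation.

    def partitioning_weights(flows, alpha):
        # flows maps a (source, destination) pair to (bandwidth, latency constraint)
        max_bw = max(bw for bw, _ in flows.values())     # largest bandwidth over all flows
        min_lat = min(lat for _, lat in flows.values())  # tightest latency constraint
        return {pair: alpha * bw / max_bw + (1 - alpha) * min_lat / lat
                for pair, (bw, lat) in flows.items()}

    # Example: with a large alpha, the high-bandwidth pair gets the heaviest edge,
    # so the partitioner tends to place those two cores on the same switch.
    weights = partitioning_weights({("ARM", "MEM1"): (400.0, 4.0),
                                    ("DSP", "MEM2"): (100.0, 2.0)}, alpha=0.8)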
The weights on the edges of the partitioning graph are calculated as a linear combination of the bandwidth required by the communication flow and its latency constraint. The parameter α can be used to trade off power against latency. The intuition is that when α is large, cores that have high-bandwidth communication flows are assigned to the same switch; this minimizes the switching activity in the NoC and therefore reduces the power consumption. On the other hand, when α is small, cores that have tight latency constraints are assigned to the same switch, minimizing the hop count. The parameter α is given as an input, or it can be varied over a range to explore the trade-off between power consumption and latency. However, the partitioning graph carries no information on the layer assignment of the cores and cannot be used when the number of inter-layer links has to be reduced. For this purpose, we define the scaled partitioning graph:

Definition 4 A scaled partitioning graph with scaling parameter θ, SPG(W, L, θ), is a directed graph that has the same set of vertices as PG. A directed edge l_{i,j} exists between vertices i and j if (u_i, u_j) ∈ PG or layer_i = layer_j.

In the scaled partitioning graph, the edges that connect vertices corresponding to cores in different layers are scaled down. In this way, cores that are on
different layers will be assigned to different switches. This can lead to a reduction in the number of inter-layer links, because links that connect switches can be reused by many flows, while a link that connects a core to a switch is used only by the communication flows of that core. As θ grows, to drive the partitioner to cluster cores that are on the same layer, edges are added between vertices corresponding to cores in the same layer. It is important that these added edges have lower weight than the real edges: if too much weight is given to them, the clustering is no longer communication-based and the power consumption increases. Equation 9.1 shows how the weights are calculated in the SPG. The weight of the new edges is based on the maximum edge weight in PG, denoted by max_wt.
l_{i,j} =
    h_{i,j}                               if (u_i, u_j) ∈ PG and layer_i = layer_j
    h_{i,j} / (θ × |layer_i − layer_j|)   if (u_i, u_j) ∈ PG and layer_i ≠ layer_j
    θ × max_wt / (10 × θ_max)             if (u_i, u_j) ∉ PG and layer_i = layer_j
    0                                     otherwise                              (9.1)
From the definition, we can see that the newly added edges have at most one-tenth of the maximum edge weight of any edge in PG; this factor was obtained experimentally after trying several values.

The steps of the BuildTopologyGeneral function are presented in Algorithm 2. In the first step, the partitioning graph is built. If θ is larger than its initial value (step 2), feasible topologies could not be built for all switch counts using a core-to-switch assignment based on power and latency alone. Therefore, in step 3, the scaled partitioning graph is built from the partitioning graph using the current value of θ, and it replaces the partitioning graph in step 4. The design points from the Unmet set are explored in step 7. For each switch count that is explored, the cores are partitioned into as many blocks as the value of the switch count for the current point (step 8). Once the cores are connected to switches, the switch layer assignment can be computed. Switches are assigned to layers in the 3D stack based on the layer assignment of the cores they connect to: a switch is placed at the average position in the third dimension of all the cores it connects (steps 11–13). For the current core-to-switch assignment, the inter-switch flows then have to be routed (steps 14, 15). The function CheckConstraints(cost) enforces the constraints imposed by the upper bound on inter-layer links; a more detailed description of how the constraints are enforced and how the routes are found is provided in Sect. 9.5.3. If paths for all the inter-switch flows are found under the given constraints, the topology for the design point is saved and the entry corresponding to the current switch count is removed from the Unmet set (steps 18 and 19).
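The following sketch turns the scaled-partitioning-graph weights of Eq. 9.1, as reconstructed above, into code. It assumes the PG weights h have already been computed (for example with the partitioning_weights sketch earlier); the function and variable names are illustrative only.

    def spg_weight(u, v, h, layer, theta, theta_max, max_wt):
        if (u, v) in h:
            if layer[u] == layer[v]:
                return h[(u, v)]                                   # intra-layer PG edge: unchanged
            return h[(u, v)] / (theta * abs(layer[u] - layer[v]))  # inter-layer PG edge: scaled down
        if layer[u] == layer[v]:
            return theta * max_wt / (10 * theta_max)               # added edge, at most max_wt / 10
        return 0.0                                                 # different layers, no PG edge

Raising θ simultaneously weakens the inter-layer edges and strengthens the artificial intra-layer edges, which is what pushes the partitioner towards layer-local clusters.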
9.5.2 Phase 2

As previously stated, Phase 2 is more conservative, in the sense that cores can only be connected to switches in the same layer. To ensure that the blocks resulting from the partitioning do not contain cores assigned to different layers of the 3D stack, the partitioning is done layer by layer. For this, we define the local partitioning graph as follows:

Definition 5 A local partitioning graph LPG(Z, M, ly) is a directed graph with the set of vertices Z and edges M. Each vertex represents a core in layer ly. An edge connecting two vertices corresponds to the edge connecting the corresponding cores in the communication graph. The weight of the edge (m_i, m_j), defined by h_{i,j}, is set to a combination of the bandwidth and latency constraints of the traffic flow from core m_i to m_j:

h_{i,j} = α × bw_{i,j} / max_bw + (1 − α) × min_lat / lat_{i,j},

where max_bw is the maximum bandwidth value over all flows, min_lat is the tightest latency constraint over all flows and α is a weight parameter.

For cores that do not communicate with any other core in the same layer, edges with a weight close to zero are added from the corresponding vertices to all other vertices in the layer. This allows the partitioning process to still consider such isolated vertices. A local partitioning graph is built for each layer, and the partitioning, and therefore the core-to-switch assignment, is done layer by layer. A further restriction is imposed on the switches: besides switches in their own layer, in the third dimension they can only be connected to switches in adjacent layers.

Since different layers can contain different numbers of cores, it is important to be able to distribute the switches unevenly over the layers of the 3D stack. Therefore, the algorithm in Phase 2 starts by determining the minimum number of switches needed in each layer to connect the cores. The operating frequency determines the maximum size of a switch, as the critical path in a switch depends on the number of input/output ports. The maximum switch size is determined in step 1 based on switch frequency models given as inputs. In steps 2–5, the minimum number of switches in each layer is determined from the number of cores in the layer and the maximum switch size supported at the desired operating frequency. In step 5, the local partitioning graphs are built, one per layer. Then, for each design point remaining in the Unmet set, the algorithm distributes the switches over the different layers (step 8), and calculates the actual number of switches to be used in each layer, starting from the minimum number previously calculated (steps 12–16). We also make sure that the number of switches in a layer does not grow beyond the number of cores in that layer. For the calculated switch count on each layer, the local partitioning graphs are partitioned into as many blocks as the switch count (step 17). Once the cores are assigned to switches, CheckConstraints(cost) is called to enforce the routing constraints, and paths are found for the inter-switch flows (steps 19, 20). If paths are found for all flows, the topology for the design point is saved.
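A minimal sketch of the per-layer switch-count initialization described above is given below. The frequency-to-switch-size model is an assumption (passed in as a callable); only the ceiling computation follows directly from the text.

    import math

    def min_switches_per_layer(cores_per_layer, freq, max_switch_size):
        # max_switch_size(freq) returns the largest switch arity that still meets 'freq';
        # each layer then needs enough switches to attach all of its cores.
        size = max_switch_size(freq)
        return {layer: math.ceil(n_cores / size)
                for layer, n_cores in cores_per_layer.items()}

    # Example with a toy model: switches of up to 8 ports assumed feasible at 400 MHz.
    start_counts = min_switches_per_layer({0: 10, 1: 9, 2: 7}, 400e6,
                                          max_switch_size=lambda f: 8 if f <= 500e6 else 5)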
9.5.3 Find Paths

When routing the inter-switch flows, new physical links have to be opened between switches, since at the beginning the switches are not connected among themselves. To establish the paths and generate the physical connectivity for the inter-switch flows, a procedure similar to the 2D case [39] is used. The procedure finds minimum-cost paths, where the cost is based on the power increase caused by routing the new flow on a path; by using this marginal power as the cost metric, the algorithm minimizes the overall power consumption. The full description of the path-finding procedure is beyond the scope of this work, and we refer the reader to [39], which also shows how to find deadlock-free routes; the same approach can be used in 3D. In 3D, however, we must additionally take care of the constraint on the maximum number of inter-layer links (max_ill), together with the constraint on the maximum switch size imposed by the operating frequency. Therefore, in this section we focus on how these constraints can be enforced by modifying the cost on which the paths are calculated. The routine to check and enforce the constraints is presented in Algorithm 4. Before describing the algorithm, we need the following definitions:
Definition 6 Let n_sw be the total number of switches used across all the layers, and let layer_i be the layer in which switch i is placed. Let ill(i, j) be the number of vertical links established between layers i and j. Let switch_size_inp_i and switch_size_out_i be the number of input and output ports of switch i. Let cost_{i,j} be the cost of establishing a physical link between switches i and j.
In the algorithm, we use two types of thresholds. One type is the hard thresholds, which are given by the constraints; the other is the soft thresholds, which are set slightly below the hard constraints. While violating a hard threshold means that it is impossible to build a topology, soft thresholds can be violated. They allow the algorithm to keep in reserve the possibility of opening new links for special flows that otherwise could not be routed because of other constraints (for example, to enforce deadlock freedom). The algorithm tests whether a link can be opened between every pair (i, j) of switches (steps 3, 4). First, the constraint on the number of vertical links is checked. In the case of Phase 2, when inter-layer links cannot cross multiple layers and the distance in the third dimension between switch i and switch j is larger than 1, the cost for that pair is set to INF. Likewise, if the number of inter-layer links between the layers containing switches i and j has reached the max_ill value, the cost is set to INF (steps 7, 8). By setting the cost to INF, we make sure that the path-finding step will not open a new link between switches i and j. If only the hard threshold were used, the algorithm would be able to open links right up to the limit and then abruptly hit
an infeasible point. In a similar manner to the hard constraints, the soft constraints are enforced by setting the cost to SOFT_INF when the number of inter-layer links is already close to the hard constraint (steps 9, 10). The SOFT_INF value is chosen to be several orders of magnitude larger than the normal power-based cost. The constraints that limit the size of the switches are very similar to the constraints on the maximum number of inter-layer links and are enforced in steps 11–15. When paths are computed, if it is not feasible to meet the max_switch_size constraints, we introduce new switches into the topology that are used only to connect other switches together. These indirect switches help in reducing the number of ports needed on the direct switches. Due to space limitations, we do not explain here how the indirect switches are established.

Looking at Algorithm 1, we can see that many design points are explored, especially when the constraint on the maximum number of inter-layer links is tight. Several methods can be employed to stop exploring a design point as soon as it becomes clear that a feasible topology cannot be built. To prune the search space, we propose three strategies. First, as the number of input/output ports of a switch increases, the maximum frequency it can support decreases, because the combinational path inside the crossbar and arbiter grows with size. For a required NoC operating frequency, we first determine the maximum switch size (denoted max_sw_size) that can support that frequency and, from it, the minimum number of switches needed; design points with a smaller switch count can be skipped. Second, for Phase 2 we initialize the number of switches layer by layer as described above, so the starting design point can have a different number of switches in each layer. The third strategy is applied after partitioning: before finding the paths, the number of inter-layer links used just to connect the cores to the switches is evaluated, and if the topology already requires more inter-layer links than the threshold, the design point is discarded directly.
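A compact sketch of the cost adjustment performed by CheckConstraints is shown below. It follows the hard/soft threshold scheme described above; the concrete soft margin and the SOFT_INF value are assumptions made for illustration.

    INF = float("inf")
    SOFT_INF = 1e9      # several orders of magnitude above any power-based cost

    def link_cost(i, j, layer, ill, base_cost, max_ill, single_hop_only, soft_margin=1):
        # ill[(a, b)] is the number of inter-layer links already opened between layers a and b;
        # base_cost is the marginal-power cost of opening a link between switches i and j.
        li, lj = layer[i], layer[j]
        if li != lj:
            key = (min(li, lj), max(li, lj))
            if single_hop_only and abs(li - lj) > 1:
                return INF                          # technology forbids links spanning layers
            if ill.get(key, 0) >= max_ill:
                return INF                          # hard threshold reached
            if ill.get(key, 0) >= max_ill - soft_margin:
                return SOFT_INF                     # soft threshold: keep a link in reserve
        return base_cost

The same pattern (INF once a hard limit is reached, SOFT_INF when approaching it) would be applied to the switch-size constraint.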
9.5.4 Switch Position Computation

In modern technology nodes, a considerable amount of power is spent driving the wires. To evaluate the power consumption of a topology point more accurately, we have to estimate the power used to drive the links, and to evaluate the link lengths we have to find positions for the switches and place them in the floorplan. While the positions of the cores are given as input, the switches are added by the algorithm and their positions have to be calculated. Since a switch is connected to several cores, one way to calculate the switch position is to minimize the distance between the switch and the cores it is connected to. This can easily be done by averaging the coordinates of those cores. This simple strategy already provides good results, and it can be further improved by weighting the distance between the switch and each core by the bandwidth that the core generates, so that links that carry more bandwidth are made shorter. However, this strategy does not take into account that a switch can be connected to
other switches as well, and minimizing the distance between switches is also desirable. To achieve this, a strategy based on a linear-program formulation that minimizes the distances between cores and switches and between switches at the same time is presented in [40]. If an inter-layer link crosses more than one layer, macros have to be placed on the floorplan to reserve space for the TSVs. Finding the position of a TSV macro is much easier, because the macro is connected to only two components (core to switch, or switch to switch); it can therefore be placed anywhere in the rectangle defined by the two components without increasing the wire length (Manhattan distance is considered). Placing the switches and TSV macros at the computed positions may result in overlap with the existing cores. For most real designs, moving the cores from their relative positions is not desirable, as there are many constraints to be satisfied.
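The bandwidth-weighted averaging of core positions mentioned above can be written in a few lines; the sketch below is illustrative only and ignores the switch-to-switch term handled by the linear program of [40].

    def switch_position(attached):
        # attached: list of (x, y, bandwidth) tuples for the cores connected to one switch.
        total_bw = sum(bw for _, _, bw in attached)
        x = sum(xc * bw for xc, _, bw in attached) / total_bw
        y = sum(yc * bw for _, yc, bw in attached) / total_bw
        return x, y

    # Example: the 400 MB/s core pulls the switch towards itself, shortening the busiest link.
    print(switch_position([(0.0, 0.0, 400.0), (2.0, 1.0, 100.0), (1.0, 3.0, 100.0)]))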
Fig. 9.10 D26_media communication graph
A standard floorplanner can be used, but it can produce poor results if it is not allowed to swap the cores. A custom routine designed to insert the NoC components into the existing floorplan can give better results in removing the overlap. The routine considers one switch or TSV macro at a time and tries to find free space near its ideal location in which to place it.
Fig. 9.11 D38_tvopd communication graph
Fig. 9.12 D36_4 communication graph (cores distributed over three layers; flow bandwidths of 100, 300 and 400 MB/s)
Fig. 9.13 Power consumption in 2D (switch power, core-to-switch link power, switch-to-switch link power and total power versus switch count)
Fig. 9.14 Power consumption in 3D (same breakdown versus switch count)
Fig. 9.15 Wire length distributions (number of wires per length bin for the 2D, 3D and layer-by-layer designs)
Fig. 9.16 Comparison with layer-by-layer (relative power consumption of the 3D application-specific and 3D layer-by-layer topologies for the different benchmarks)
If no space is available, we displace the already placed blocks from their positions in the x or y direction by the size of the new component, creating space. Moving a block to create space can in turn cause overlap with other already placed blocks; we therefore iteratively move the necessary blocks in the same direction as the first block until all overlaps are removed. As more components are placed, they can reuse the gaps created by the earlier components (see Figs. 9.10 to 9.18).
Fig. 9.17 Most power-efficient topology
Fig. 9.18 Resulting 3D floorplan with switches
9.6 Experiments and Case Studies

To show how the algorithm performs, we present results for a realistic multimedia and wireless communication benchmark. We also compare topologies built with the two phases of the described algorithm, to show the advantages and disadvantages of each phase, and we compare NoCs designed for 2D-ICs and 3D-ICs for different applications. Experiments showing the advantage of custom topologies over regular ones are also presented. In order to estimate power accurately, the NoC component library from [43] is used. The power and latency values of the switches and links of the library are
determined from post-layout simulations, based on 65 nm low-power libraries. The vertical links have been shown to have an order of magnitude lower resistance and capacitance than a horizontal link of the same dimensions [11]. This translates into a traversal delay of less than 10% of the clock cycle at 1 GHz operation, and negligible power consumption on the vertical links.
9.6.1 Multimedia SoC Case Study

For the case study, a realistic multimedia and wireless communication SoC was chosen as the benchmark. The communication graph of the benchmark (denoted D_26_media) is shown in Fig. 9.10; it contains 26 cores. One part of the SoC, built around an ARM processor, is used for multimedia applications and is aided by hardware video accelerators and controllers to access external memory. The other part, built around a DSP, is used for wireless communication. The communication between the two parts is facilitated by several large on-chip memories and a DMA core. A multitude of peripherals is also present for off-chip communication.

To compare NoCs designed for 2D-ICs and 3D-ICs, we performed a 2D floorplan of the cores as well as a 3D floorplan with the cores distributed over three layers, using existing tools [44]. The assignment of cores to the layers of the 3D stack was performed manually, but 3D floorplanning solutions exist that can also produce the assignment of cores to layers. The tool from [39] was used to generate the application-specific topologies for the 2D case. The data width of the links was set to 32 bits, to match the data width of the cores, and the frequency was set to 400 MHz (the lowest frequency at which topologies supporting the required bandwidth can be designed for the chosen data width). The max_ill constraint was set to 25; the impact of this constraint on power is analyzed later.

The power consumption of the different NoC components, as well as the total power consumption, is presented in Fig. 9.13 for the 2D-IC and in Fig. 9.14 for the 3D-IC. The plots show the power consumption of topologies generated for different switch counts. Both plots start with topologies containing three switches: because there are 26 cores in the design, topologies with fewer than three switches could not be built, as they would require switches too large to support the operating frequency. These design points are pruned from the exploration to reduce the run time, as explained in Sect. 9.5. The power consumption of the individual components (switches, switch-to-switch links and core-to-switch links) is also reported in the figures. Several trends can be observed. The switch power grows as the number of switches grows. The core-to-switch link power goes down with more switches, as the switches are placed closer to the cores and the wire length decreases. The switch-to-switch link power grows with the number of switches, as the number of such links increases; however, it does not increase as fast as the switch power. The trends are similar for the 2D and 3D cases, but the absolute link power values in the 3D case are lower than those
for 2D, as the long and power-hungry links of the 2D layout are replaced by short and efficient vertical links. For this particular benchmark, a power saving of 24% is achieved in 3D over 2D thanks to the shorter wires. To give a better picture, the wire-length distributions of the links in the 2D and 3D cases are shown in Fig. 9.15; as expected, the 2D design has many long wires. Figure 9.17 shows the topology designed using Phase 1 for the best power point, and Fig. 9.18 shows the corresponding 3D floorplan of the cores and network components.

For a more complete comparison between topologies for 2D-ICs and 3D-ICs, we designed topologies for several SoC benchmarks. We consider three distributed benchmarks with 36 cores (18 processors and 18 memories): D_36_4 (communication graph in Fig. 9.12), D_36_6 and D_36_8, where each processor has 4, 6 and 8 traffic flows to the memories, respectively; the total bandwidth is the same in the three benchmarks. We also consider a benchmark, D_35_bot, that models bottleneck communication, with 16 processors, 16 private memories (each processor connected to one private memory) and 3 shared memories with which all the processors communicate. Finally, we consider two benchmarks in which all the cores communicate in a pipeline fashion: a 65-core design (D_65_pipe) and a 38-core design (D_38_tvopd, communication graph in Fig. 9.11); in these two benchmarks, each core communicates with only one or a few other cores.

We selected the best power points for both the 2D and the 3D case and report the power and zero-load latency in Table 9.1. As most of the power difference comes from the reduced wire length in 3D, the power savings differ from benchmark to benchmark. For the benchmarks with spread traffic and many communication flows, the power savings are considerable, as they benefit from the elimination of the long wires of the 2D design. In the bottleneck benchmark, there are many long wires going to the shared memories in the 2D case; even though the traffic to the shared memories is small, we still see a reasonable power saving when moving to 3D. For the pipeline benchmarks, most of the links are between neighbouring cores, so the links are short even in 2D, and going to a 3D design does not lead to large savings. Over the different benchmarks, the average power reduction is 38% and the average zero-load latency reduction is 13% when comparing the 3D implementation to the 2D one.
Table 9.1 2D vs 3D NoC comparison

Benchmark      Link power (mW)    Switch power (mW)   Total power (mW)   Latency (cyc)
               2D       3D        2D       3D         2D       3D        2D      3D
D_36_4         150      41.5      65       70.5       215      112       3.28    3.14
D_36_6         154.5    43.5      76.5     82         230      125.5     3.57    3.5
D_36_8         215      55.5      105      104.5      320      160       4.37    3.65
D_35_bot       68       36.2      48       43.3       116      79.5      6.04    4.2
D_65_pipe      106      104       63       58         169      162       2.53    2.57
D_38_tvopd     52.5     22.67     37       38.11      89.5     60.78     4       3.6
Figure 9.16 shows the power consumption of the topologies synthesized using Phase 2 of the algorithm, relative to the topologies synthesized using Phase 1, for the different benchmarks. Since in Phase 2 the cores in a layer are connected only to switches in the same layer, the inter-layer traffic needs to traverse more switches to reach its destination, which increases power consumption and latency. As seen from Fig. 9.16, Phase 1 generates topologies with a 40% reduction in NoC power consumption compared to Phase 2; however, Phase 2 can generate topologies under a much tighter inter-layer link constraint.
9.6.2 Impact of Inter-layer Link Constraint and Comparisons with Mesh
Limiting the number of inter-layer links has a great impact on power consumption and average latency. Reducing the number of TSVs is desirable for improving the yield of a 3D design, but a very tight constraint on the number of inter-layer links can lead to a significant increase in power consumption. To see the impact of this constraint, we varied the value of max_ill and performed topology synthesis for each value on the D_36_4 benchmark. The resulting power and latency values for the different max_ill design points are shown in Figs. 9.19 and 9.20. When there is a tight constraint on the inter-layer links, the cores are connected to switches in the same layer, so that only switch-to-switch links need to cross layers; this results in more switches being used in each layer, increasing the switch power consumption and the average latency. Note that our synthesis algorithm allows designers to perform such power and latency trade-offs for yield early in the design cycle.

Custom topologies that match the application characteristics can result in large power-performance improvements when compared to standard topologies such as mesh and torus [39].
Fig. 9.19 Impact of max_ill on power (minimum power consumption versus the maximum number of inter-layer links)
Fig. 9.20 Impact of max_ill on latency (minimum latency in cycles versus the maximum number of inter-layer links)
A detailed comparison between a custom topology and several standard topologies for different benchmarks in the 2D case has been presented in [39]. For completeness, we compared the application-specific topologies generated by our algorithm with an optimized 3D mesh topology (3D Opt-mesh), in which the core placement is optimized so that cores that communicate are connected to nearby switches. The power consumption of the topologies for the different benchmarks is presented in Fig. 9.21. The custom topologies result in a large power reduction (51% on average) when compared to the 3D mesh topology.
Fig. 9.21 Comparisons with mesh (power consumption of the 3D application-specific and 3D Opt-mesh topologies for the different benchmarks)
9.7 Conclusions

Networks on Chips (NoCs) are a necessity for achieving 3D integration. One of the major design issues when using NoCs in 3D is the synthesis of the NoC topology and architecture. In this chapter, we presented synthesis methods for designing power-efficient NoC topologies. The presented methods address not only the classic 2D issues, such as meeting application performance requirements and minimizing power consumption, but also the 3D technology constraints. We showed two flavours of the general algorithm, one aimed at achieving a low-power solution and the other at achieving tight control over the number of vertical connections established. We also presented comparisons with 2D designs to validate the benefits of 3D integration for interconnect delay and power consumption.

Acknowledgment We would like to acknowledge the financial contribution of CTI under project 10046.2 PFNM-NM and the ARTIST-DESIGN Network of Excellence.
References

1. K. Banerjee et al., "3-D ICs: A Novel Chip Design for Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integration", Proc. of the IEEE, vol. 89, no. 5, p. 602, 2001.
2. L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm", IEEE Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.
3. E. Beyne, "The Rise of the 3rd Dimension for System Integration", International Interconnect Technology Conference, pp. 1–5, 2006.
4. B. Goplen and S. Sapatnekar, "Thermal Via Placement in 3D ICs", Proc. Intl. Symposium on Physical Design, p. 167, 2005.
5. J. Cong et al., "A Thermal-Driven Floorplanning Algorithm for 3D ICs", ICCAD, Nov. 2004.
6. W.-L. Hung et al., "Interconnect and Thermal-Aware Floorplanning for 3D Microprocessors", Proc. ISQED, March 2006.
7. S. K. Lim, "Physical Design for 3D System on Package", IEEE Design & Test of Computers, vol. 22, no. 6, pp. 532–539, 2005.
8. P. Zhou et al., "3D-STAF: Scalable Temperature and Leakage Aware Floorplanning for Three-Dimensional Integrated Circuits", ICCAD, Nov. 2007.
9. R. Weerasekara et al., "Extending Systems-on-Chip to the Third Dimension: Performance, Cost and Technological Tradeoffs", ICCAD, 2007.
10. G. H. Loh, Y. Xie, and B. Black, "Processor Design in 3D Die-Stacking Technologies", IEEE Micro Magazine, vol. 27, no. 3, pp. 31–48, May–June 2007.
11. I. Loi, F. Angiolini, and L. Benini, "Supporting Vertical Links for 3D Networks on Chip: Toward an Automated Design and Analysis Flow", Proc. Nanonets, 2007.
12. C. Guedj et al., "Evidence for 3D/2D Transition in Advanced Interconnects", Proc. IRPS, 2006.
13. http://www.zurich.ibm.com/st/cooling/interfaces.html
14. IMEC, http://www2.imec.be/imec_com/3d-integration.php
15. http://www.tezzaron.com
16. N. Miyakawa, "A 3D Prototyping Chip Based on a Wafer-level Stacking Technology", ASPDAC, 2009.
17. C. Addo-Quaye, "Thermal-Aware Mapping and Placement for 3-D NoC Designs", Proc. SOCC, 2005.
18. P. Guerrier and A. Greiner, "A Generic Architecture for On-Chip Packet Switched Interconnections", Proc. DATE, 2000.
19. G. De Micheli and L. Benini, "Networks on Chips: Technology and Tools", Morgan Kaufmann, San Francisco, CA, First Edition, July 2006.
20. J. Kim et al., "A Novel Dimensionally-Decomposed Router for On-Chip Communication in 3D Architectures", ISCA, 2007.
21. D. Park et al., "MIRA: A Multi-Layered On-Chip Interconnect Router Architecture", ISCA, 2008.
22. F. Li et al., "Design and Management of 3D Chip Multiprocessors Using Network-in-Memory", ISCA, 2006.
23. V. F. Pavlidis and E. G. Friedman, "3-D Topologies for Networks-on-Chip", IEEE TVLSI, 2007.
24. B. Feero and P. P. Pande, "Performance Evaluation for Three-Dimensional Networks-on-Chip", Proc. ISVLSI, 2007.
25. J. Hu et al., "System-Level Point-to-Point Communication Synthesis Using Floorplanning Information", Proc. ASPDAC, 2002.
26. S. Pasricha et al., "Floorplan-Aware Automated Synthesis of Bus-Based Communication Architectures", Proc. DAC, 2005.
27. S. Murali and G. De Micheli, "An Application-Specific Design Methodology for STbus Crossbar Generation", Proc. DATE, 2005.
28. S. Murali and G. De Micheli, "SUNMAP: A Tool for Automatic Topology Selection and Generation for NoCs", Proc. DAC, 2004.
29. S. Murali and G. De Micheli, "Bandwidth Constrained Mapping of Cores onto NoC Architectures", Proc. DATE, 2004.
30. J. Hu and R. Marculescu, "Exploiting the Routing Flexibility for Energy/Performance Aware Mapping of Regular NoC Architectures", Proc. DATE, 2003.
31. S. Murali et al., "Mapping and Physical Planning of Networks on Chip Architectures with Quality-of-Service Guarantees", Proc. ASPDAC, 2005.
32. A. Pinto et al., "Efficient Synthesis of Networks on Chip", ICCD, Oct. 2003.
33. W. H. Ho and T. M. Pinkston, "A Methodology for Designing Efficient On-Chip Interconnects on Well-Behaved Communication Patterns", HPCA, 2003.
34. T. Ahonen et al., "Topology Optimization for Application Specific Networks on Chip", Proc. SLIP, 2004.
35. K. Srinivasan et al., "An Automated Technique for Topology and Route Generation of Application Specific On-Chip Interconnection Networks", ICCAD, 2005.
36. J. Xu et al., "A Design Methodology for Application-Specific Networks-on-Chip", ACM TECS, 2006.
37. A. Hansson et al., "A Unified Approach to Mapping and Routing on a Combined Guaranteed Service and Best-Effort Network-on-Chip Architectures", Technical Report No. 2005/00340, Philips Research, Apr. 2005.
38. X. Zhu and S. Malik, "A Hierarchical Modeling Framework for On-Chip Communication Architectures", ICCD, 2002.
39. S. Murali et al., "Designing Application-Specific Networks on Chips with Floorplan Information", ICCAD, 2006.
40. S. Murali et al., "Synthesis of Networks on Chips for 3D Systems on Chips", ASPDAC, 2009.
41. C. Seiculescu, S. Murali, L. Benini, and G. De Micheli, "SunFloor 3D: A Tool for Networks on Chip Topology Synthesis for 3D Systems on Chip", Proc. DATE, 2009.
42. K. Keutzer et al., "System-Level Design: Orthogonalization of Concerns and Platform-Based Design", IEEE TCAD, 2000.
43. S. Stergiou et al., "×pipes Lite: A Synthesis Oriented Design Library for Networks on Chips", Proc. DATE, 2005.
44. S. N. Adya and I. L. Markov, "Fixed-outline Floorplanning: Enabling Hierarchical Design", IEEE TVLSI, 2003.
Chapter 10
3-D NoC on Inductive Wireless Interconnect
Hiroki Matsutani, Michihiro Koibuchi, Tadahiro Kuroda and Hideharu Amano
10.1 Introduction: Wired vs. Wireless

Three-dimensional chip implementation has been used in real commercial products to increase the available real estate without increasing the footprint or the wiring delay. Most commercial 3-D chips are implemented using wafer-bonding technologies: face-to-face micro-bump [8, 2] or face-to-back through-silicon via [3, 4]. On the other hand, wireless approaches have received a lot of attention because of their flexibility. Wireless interconnects can be achieved using capacitive coupling [5] or inductive coupling [12, 13]. The former approach can only be used for surface-to-surface interconnection, while a number of dies can be stacked with the latter approach. The latter approach, inductive coupling, has the following advantages: (1) dies can be stacked after fabrication and testing, so only known-good dies are used; (2) high-speed and low-power data transfer can be achieved [11]; (3) it is highly scalable, in that from two to at least sixteen dies can easily be stacked; and (4) addition, removal and swapping of dies are all possible after fabrication. However, there are also several limitations: (1) in spite of remarkable improvements, the area of the inductors is still larger than that of through-silicon vias; and (2) the locations of the inductors and the timing of the data transfers must be arranged so as to avoid interference. These advantages and limitations of inductive coupling create challenges for building 3-D NoCs on top of it: scalable and adaptable network protocols are required to avoid any interference. In this chapter, we introduce 3-D NoCs built on inductive-coupling wireless interconnects.
Fig. 10.1 Inductive coupling (a square coil implemented in the metal layers of each die; the TX coil on one die faces the RX coil on the stacked die)
10.2 Wireless Interconnect with Inductive Coupling

In the inductive-coupling wireless approach, an inductor is implemented as a square coil of metal in a standard CMOS layout. The data, modulated by a driver, are transferred between two inductors placed at exactly the same position on two stacked dies and received on the other die by a receiver, as shown in Fig. 10.1. This method allows the stacking of at least 16 dies, provided the power consumption of each die is low enough that no dedicated heat-dissipation facilities are needed. Although more than two inductors can be stacked, multiple transmitters at the same location cannot send data simultaneously, in order to avoid interference. The techniques for inductive coupling have recently been improved: a contact-less interface without an electrostatic-discharge (ESD) protection device achieves a speed of more than 1 GHz with a low energy dissipation (0.14 pJ per bit) and a low bit-error rate (BER < 10^-14) [13]. A 1-Tbit/s inductive-coupling inter-chip clock and data link has been developed by using 1,024 transceivers arranged with a pitch of 30 µm.
10.3 Wireless 3-D NoC for Building Block 3-D ICs

The three-dimensional Network-on-Chip (3-D NoC) is an emerging research topic exploring the network architecture of 3-D ICs, which stack several smaller wafers or dies to reduce wire length and wire delay. Regarding the 3-D NoC architecture, the network topology [10, 15], router architecture [9, 7, 14] and routing strategy [16] have already been studied extensively. Various interconnection technologies have been developed as a medium for 3-D NoCs: wire-bonding, micro-bump [8, 2] and contactless (i.e., wireless) [6, 4, 12, 13] connections between stacked dies, and through-vias [3, 4] between stacked wafers. A comparison of these interconnection technologies is given in [4]. Many studies on the 3-D NoC architecture focus on the through-wafer via technology, which offers the greatest interconnect density but also the highest cost. On the other hand,
Fig. 10.2 Concept of wireless 3-D NoC (application chips, in an arbitrary number and order, stacked over a service chip that provides a 2-D NoC and I/O pads and connected to it through wireless 3-D crossbars)
we focus on the flexibility of the contactless approach, since it enables us to stack known-good dies to build the desired SoC, like building blocks. To exploit this flexibility, we propose a wireless 3-D NoC architecture that guarantees network connectivity between all cores in different tiers (a tier refers to a die in a 3-D IC), regardless of the number and order of the tiers. Figure 10.2 shows the concept of the proposed wireless 3-D NoC architecture. By attaching a single service chip that provides a planar network infrastructure and I/O pads as the bottom layer of the 3-D IC, the other tiers (application chips) do not have to provide network connectivity for their cores on their own. Such features reduce the difficulties encountered in application chip design and provide great flexibility for building customized 3-D ICs. The service and application chips communicate with each other via wireless vertical crossbars in a time-division multiplex (TDM) manner. For the proposed wireless 3-D NoC architecture, we present two TDM communication schemes: static and dynamic. In the static scheme, a time-slot for transmitting data flits is periodically assigned to every tier, while in the dynamic scheme an arbitration is performed during each time-slot to select the single tier that can use the next time-slot. Obviously, the static scheme simplifies the wireless crossbar, while the dynamic one improves the throughput. In this chapter, the two types of wireless crossbars are introduced to show their feasibility for building-block 3-D ICs.
10.3.1 Toward Building Block 3-D ICs

A wireless 3-D NoC architecture that guarantees network connectivity between all cores in different tiers, regardless of the number and order of the tiers, is proposed here in order to reduce the difficulties encountered in application chip design and to provide greater flexibility for building customized 3-D ICs, in the same manner as building blocks. Figure 10.3 illustrates an example of a wireless 3-D IC in which a single service chip and three application chips are wirelessly connected (the side view is shown in Fig. 10.3d).
Fig. 10.3 Example of wireless 3-D NoC. Layer #0 is a service chip and the others are application chips. (a) Layer #0 (top view). (b) Layer #1 (top view). (c) Layers #2 and #3 (top view). (d) Side view of the four layers
The application and service chip designs are described in the following sections; the wireless crossbar architecture for their inter-chip communication is described in Sect. 10.3.2.

10.3.1.1 Application Chip Design

Application logic, such as microprocessor cores and cache cores, is implemented on the application chips. As shown for layers #2 and #3 in the example (Fig. 10.3c), the application chips do not have to provide network connectivity between cores on the same tier. The only requirement for an application chip is to place transmitter (TX) and receiver (RX) modules at a pre-defined pitch for the vertical communication. The TX and RX modules consist of a TX or RX channel array, its driver, and buffers; they are described in more detail in Sect. 10.3.2. Note that application chips can also have their own local network, as shown in Fig. 10.3b, to improve the communication performance.

10.3.1.2 Service Chip Design

The service chip provides a planar network infrastructure and I/O pads, as shown in Fig. 10.3a. At least one service chip is required in a wireless 3-D NoC to provide network reachability between any pair of cores in the 3-D IC. Wireless crossbars are implemented on the service chip in addition to the RX and TX modules. The RX modules of the service chip are placed just under the TX modules of the application chips, to receive the wireless data from the upper layers; similarly, the TX modules of the service chip lie just under the RX modules of the application chips. The planar network infrastructure of the service chip can use an arbitrary network topology and packet routing algorithm, but it must provide full connectivity between all of its TX and RX modules.
10.3.2 Wireless 3-D Crossbar Architecture

This section introduces two communication schemes for the wireless crossbar. It also illustrates the wireless crossbar architecture and its communication protocols.

10.3.2.1 Time-Division Multiplex Transfer

Reachability between any pair of cores in a 3-D IC must always be guaranteed, even if the number and order of the tiers are changed. To allow the addition, removal,
and swapping of application chips, the layout pattern of the TX and RX modules must be the same for all application chips (Fig. 10.3). This means that multiple TX modules on different application chips are stacked at exactly the same horizontal location. A time-division multiplex (TDM) protocol is therefore required for the wireless crossbar, to avoid interference between multiple TX modules sharing the same horizontal location. The TDM communication within a vertical crossbar is classified into downstream and upstream transfers.

• Downstream transfer is a data transfer from the TX module of an application chip to an RX module of the service chip. In a downstream transfer, a time-slot is assigned to one of the TX modules in a pillar (a pillar refers to a vertical connection between the service and application chips, via a vertical crossbar and the corresponding TX/RX modules; see Fig. 10.3d). The selected TX module can communicate with the RX module of the service chip during the assigned time-slot.

• Upstream transfer is a data transfer from a TX module of the service chip to one or more RX modules of the application chips (i.e., unicast or multicast). Since only the TX module of the service chip acts as a sender in an upstream communication, there is no need to restrict its data transfers: the TX module of the service chip can communicate with the RX modules of the application chips whenever it has data to send.

Here, we focus on the time-slot assignment for the downstream transfer.

10.3.2.2 Static TDM Scheme

The static TDM scheme periodically allocates a time-slot to every TX module in a pillar, regardless of whether it has data flits to send. The order of assignment is static and no arbitration is required between the TX modules in a pillar. The time-slot assignment is simply calculated as follows.
i = ⌊t / l⌋ mod n,    (10.1)
where i is the granted layer ID, t is the clock counter, l is the time-slot length, and n is the number of application chips. A TX module can transmit data flits when its layer ID is equal to i. Each TX module has a clock counter and knows its own layer ID, the time-slot length l, and the total number of application chips n; this information should be given to each TX module at boot time. Figure 10.4a shows a timing chart of the static time-slot assignment in the proposed 3-D NoC. In this example, layer #0 is the service chip and the others are application chips. Layer #3 and layer #2 transmit data flits to the service chip in the first and second time-slots, respectively. Note that the third time-slot is assigned to layer #1, even though layer #1 has no data to send. The static TDM scheme can simplify the wireless crossbar, since it does not require arbitration even when multiple TX modules have data flits to send. It evenly
2╇
10â•… 3-D NoC on Inductive Wireless Interconnect
Layer #3
Layer #2
Layer #1
Layer #0
TX
DATA
231 DATA
RX DATA
TX
DATA
RX TX RX TX RX 1
2
Time slot (l cycles)
3
4
5
6
DATA
DATA
5
6
Elapsed time [cycles]
a Layer #3
Layer #2
Layer #1
Layer #0
TX
DATA
DATA
DATA
DATA
RX TX RX TX RX TX RX 1
2
Time slot (l cycles)
3
4
Elapsed time [cycles]
Fig. 10.4 Two time-slot assignment schemes for the wireless 3-D crossbar. (a) Static TDM scheme. (b) Dynamic TDM scheme
allocates a time-slot to every TX module of the application chips. However, since a time-slot is also assigned to TX modules with no data, communication bandwidth is often wasted.

10.3.2.3 Dynamic TDM Scheme

In the dynamic TDM scheme, every time-slot is assigned to one of the TX modules of the application chips that have data to send. To select a single TX module for
the time-slot (i+1), every TX module in turn indicates whether it has data during time-slot i. The vertical crossbar at the service chip then selects, in a round-robin manner, the single TX module that can use the time-slot (i+1). At the end of time-slot i, the vertical crossbar of the service chip asserts a GRANT signal that indicates the layer ID allowed to use the next time-slot. Figure 10.4b illustrates a timing chart of the dynamic time-slot assignment. Time-slots 1 to 4 are allocated to layer #3, and then time-slots 5 and 6 are granted to layer #2. No time-slot is allocated to layer #1, since it has no data to send. Because the dynamic scheme does not allocate time-slots to TX modules with no data to send, it makes effective use of the wireless crossbar bandwidth. However, it requires more complicated transactions between the application chips and the service chip (i.e., request and grant), and therefore more hardware than the static scheme.
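The two slot-assignment rules can be summarized in a few lines of Python. The static function is a direct transcription of Eq. 10.1; the dynamic function is a plain round-robin pick among requesting layers, which is one possible reading of the arbitration described above (the exact starting point of the round-robin is an assumption of this sketch).

    def static_grant(t, slot_len, n_app_chips):
        # Eq. 10.1: the granted layer depends only on the free-running clock counter.
        return (t // slot_len) % n_app_chips

    def dynamic_grant(requests, n_app_chips, last_granted):
        # requests[k] is True if layer k asserted REQ during the current time-slot.
        for offset in range(1, n_app_chips + 1):
            layer = (last_granted + offset) % n_app_chips
            if requests[layer]:
                return layer
        return None        # no layer has data: the next slot stays idle

    assert static_grant(t=7, slot_len=4, n_app_chips=3) == 1
    assert dynamic_grant([False, True, True], n_app_chips=3, last_granted=2) == 1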
10.3.2.4 Hardware Structure

This section illustrates the architecture of the TX/RX modules on the application chip and of the vertical crossbar on the service chip. Note that the service chip also has TX and RX modules, but they are included in the vertical crossbar.
10.3.2.5 TX Module Architecture

TX modules are implemented on the application chips at a pre-defined pitch (Fig. 10.3). A TX module receives data flits from the application logic, such as a processor core or a cache core, on the same layer and forwards them to the service chip. Since a flit is typically wider than the wireless channel array, each data flit is divided into several data cells, each of which can be sent over the wireless link in a single cycle. Figure 10.5 shows the architecture of the TX module on an application chip. First, an f-bit input flit is transformed into multiple c-bit data cells in the TX module.
Fig. 10.5 TX module on an application chip (wired line from the horizontal network, serializer splitting an f-bit flit into c-bit cells, TX array, PORT/GRANT handling, wireless line to the vertical crossbar)
Fig. 10.6 RX module on an application chip (wireless line from the vertical crossbar, c-bit RX array, deserializer reassembling the f-bit flit, wired line to the horizontal network)
In the first time-slot, the TX module sends the layer ID of the destination (denoted PORT in the figure) and waits for the GRANT signal from the service chip. After receiving the GRANT signal, it can send the data cells in the next time-slots.

10.3.2.6 RX Module Architecture

RX modules are also implemented on the application chips at a pre-defined pitch. An RX module receives wireless data cells from the service chip and forwards them to the application logic on the same layer. Figure 10.6 shows the architecture of the RX module on an application chip. It receives c-bit wireless data (i.e., a data cell) from the service chip in a given cycle. An output data flit is formed by combining f/c data cells and is then forwarded to the application logic. Note that the PORT tag, which indicates the destination layer ID, is attached to each data cell at the service chip. The RX module of the application chip whose ID matches the destination layer ID receives the data cells from the service chip; the other RX modules ignore the incoming data cells.

10.3.2.7 Vertical Crossbar Architecture

Vertical crossbars are implemented on the service chip at the pre-defined pitch. A vertical crossbar receives wireless data cells from an application chip and forwards them to the planar on-chip network, or it receives data flits from the planar network and forwards them to an application chip. Figure 10.7 shows the architecture of the vertical crossbar on the service chip. It has an input FIFO buffer for each source layer, to store the data cells coming from the TX module of that layer. In a downstream communication, the source layer (i.e., an application chip) sends the destination layer ID, and the vertical crossbar asserts the GRANT signal if there is free space in the FIFO buffer. The vertical crossbar receives the data cells
Fig. 10.7 Vertical crossbar at service chip
The vertical crossbar receives the data cells
from the TX module of the granted application chip and forwards them to the planar on-chip network. In an upstream transfer, on the other hand, the vertical crossbar receives data flits from the horizontal network and forwards them to the destination layer. In this case, the destination layer ID (denoted as PORT) is attached to each cell in order to indicate which layer should receive the data.
10.3.2.8 Communication Protocol
In this section, we illustrate the communication protocols of the downstream and upstream transfers.
10.3.2.9 Downstream Transfer
Figure 10.8a shows a detailed timing chart of the static TDM scheme for the downstream transfer. A time-slot is periodically allocated to each TX module of the application chips. The selected TX module first transmits the destination layer ID before sending the body data cells, and then the wireless crossbar asserts a GRANT signal if there is free space in its buffer (time-slot 1). Once the GRANT signal is asserted, the TX module can send body data cells in its next time-slots (see time-slot 4). Figure 10.8b shows the detailed timing chart of the dynamic TDM scheme. To select a single TX module that can use time-slot (i+1), an arbitration among all the TX modules in a pillar is performed during time-slot i. In the dynamic TDM scheme, in each time-slot, every TX module in turn indicates whether it has data to send. That is, layer #i asserts a REQ signal in the i-th cycle of every time-slot if it is ready to transmit. After obtaining the status of all the TX modules, the vertical crossbar selects one of the TX modules in a round-robin manner. Then, it asserts the GRANT signal that notifies the layer ID of the selected TX module. Once the GRANT signal is asserted, the selected TX module can send body data cells in the next time-slot.
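Tying the TX/RX descriptions and the downstream protocol together, the following minimal Python sketch (names such as flit_to_cells and the callables wait_grant/send_cell are illustrative assumptions, not part of the chapter) shows how an f-bit flit could be split into c-bit cells and sent with the PORT-then-GRANT ordering described above.

    def flit_to_cells(flit, f, c):
        """Split an f-bit flit (an integer) into ceil(f/c) c-bit cells, LSB first."""
        return [(flit >> i) & ((1 << c) - 1) for i in range(0, f, c)]

    def cells_to_flit(cells, c):
        """Reassemble the flit on the RX side by concatenating the c-bit cells."""
        flit = 0
        for i, cell in enumerate(cells):
            flit |= cell << (i * c)
        return flit

    def tx_send(flit, dest_layer, f, c, wait_grant, send_cell):
        """Downstream send: PORT first, then wait for GRANT, then the body cells."""
        send_cell(dest_layer)      # destination layer ID (PORT) in the first slot
        wait_grant()               # proceed only after the crossbar grants a slot
        for cell in flit_to_cells(flit, f, c):
            send_cell(cell)        # body data cells in the following slots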
Fig. 10.8 Time slot assignment (downstream transfer). a Static TDM scheme. b Dynamic TDM scheme
10.3.2.10 Upstream Transfer
In the case of an upstream transfer, the same communication protocol is used in both the static and dynamic schemes. Figure 10.9 shows the timing chart of the upstream transfer. In each time-slot, all the RX modules in turn indicate whether they have enough buffer space. That is, layer #i asserts the GRANT signal in the i-th cycle of every time-slot if it is ready to receive.
Fig. 10.9 Time slot assignment (upstream transfer)
The vertical crossbar transmits data cells to the destination layer in the next time-slot if the destination can accept the next data cells.
10.4 An Implementation Example: MuCCRA-Cube
10.4.1 PE Array Structure in a Die
Here, a 3-D dynamically reconfigurable processor called MuCCRA-Cube is developed as a first step toward building-block 3-D SoCs. Figure 10.10 shows the PE array structure in each die of MuCCRA-Cube. It is composed of a 4 x 4 24-bit PE array and four 24-bit distributed memory modules (MEMs) at the bottom edge of the PE array. Like common FPGAs, an island-style interconnection network is used: a two-dimensional array of PEs with routing channels located between the rows and columns. The input or output of each PE can be connected to the adjacent channels via a connection block. Data can be transferred along multiple routing channels in any direction, horizontally or vertically, by using the Switching Elements (SEs) at the channel intersections. An SE is a set of simple programmable switches that can be connected to adjacent SEs; it has two 24-bit links, (d0, d1), running side to side and up and down. Although it is omitted in Fig. 10.10, write data can be transferred into the MEMs via feedback wires from the uppermost SEs. Each MEM is a 24-bit x 256-word two-ported SRAM module, and all modules are double-buffered so that operations of the PE array and input/output can be performed independently.
Fig. 10.10 PE array structure in a die
While one SRAM module is used by the PE array, the other can communicate with the outside via a 32-bit interface. The array structure has the following benefits as a basis for dynamically reconfigurable processors to be stacked: (1) it achieves a sufficient level of performance at a low clock frequency [1]; that is, there is no power dissipation problem when the arrays are stacked; (2) the three-dimensional interconnection is a point-to-point link, and in this case the island-style network used in the MuCCRAs provides enough flexibility to move data between distant Processing Elements (PEs) in the two-dimensional directions.
10.4.1.1 PE Structure
Each PE consists of a 24-bit PE Core, connection blocks for the connections to the global routing resources, and a context memory. The detailed architecture of the PE Core is illustrated in Fig. 10.11.
Fig. 10.11 PE core architecture
Since one of the target applications of MuCCRA-Cube is image processing, such as a JPEG encoder for embedded systems, the PE Core is designed as a simply-structured 24-bit processor that can efficiently treat RGB image data. Just like common dynamically reconfigurable processors, it is composed of a Shift and Mask Unit (SMU), an Arithmetic and Logic Unit (ALU), and a register file (RFile). The SMU supports various types of shift-and-mask operations and supplies a constant value, and the ALU provides addition/subtraction, multiplication, comparison, and logical operations. The RFile is a 24-bit wide, eight-entry deep register file with a read/write port (A) and a read-only port (B). To avoid combinational loops, an output of the ALU can be connected to an input of the SMU, but the opposite connection is not allowed. On the other hand, the RFile can connect to all of the inputs and outputs of the ALU and the SMU. The context memory is a 64-bit x 32-entry memory that holds the configuration data containing the operational instructions of the PE Core and the connection instructions to the global routing resources. The context memory is read out according to the context pointer from the Context Switching Controller (CSC), and the context switching is done within a clock cycle. In MuCCRA-Cube, the context switching is controlled independently in each PE array; thus, parallel processing between PE arrays often requires synchronization between them. For this purpose, the CSC provides a handshaking mechanism that delays the context switch until the signal from the connected PE arrays satisfies a certain condition. The other context switching mechanisms are the same as in other MuCCRA processors [1] and are omitted here. Configuration data are loaded into the central configuration data memory and transferred to the context memory of each PE and SE by using the RoMultiC [18] multicast scheme. As in previous MuCCRA processors [1], the Task Configuration Controller (TCC) manages the loading of configuration data from outside the chip and its delivery. The configuration data for the next task can be loaded into the empty space of the context memory during execution.
10.4.2 Homogeneous Stacking
The inductors of a transmitter and a receiver on adjacent chips must be aligned for inductive coupling. We introduced the stacking method shown in Fig. 10.12 in order to stack PE arrays with the same structure. First, an axis of rotation is set at the center of the core area (Fig. 10.12b). Every even die is then rotated 180 degrees about this axis and stacked. The inductors are placed symmetrically with respect to the axis of rotation, and therefore they remain aligned after the rotation. Due to the symmetry, two links are formed; they are used as an uplink and a downlink. This stacking method also solves the problem of power supply to the intermediate dies: due to the rotation, the dies are shifted, which makes space for bonding wires to provide power. In MuCCRA-Cube, clock signals and the configuration data are transferred to the intermediate dies through these bonding wires in order to simplify the TCC control. Since the clock delivery causes a certain degree of skew, the data communication between dies is performed using dedicated clocks for the inductive-coupling channel, as described later. The top die exposes all four sides for the bonding wires, providing a sufficient number of pins for external communication; thus, the data to be processed are fed in through this die. The stack structure also makes it possible to form an extended datapath without using long feedback links. As shown in Fig. 10.13, the computational results from a PE array are transferred to the starting points of the datapath formed on the upper stacked PE array through the 3-D interconnection, without using long feedback links.
Fig. 10.12 Stacking method. a Stacking scheme. b Top view
Fig. 10.13 Datapaths using two PE arrays. a 2D. b 3D
Since only a one-way data transfer is supported between two stacked PEs in this method, we assigned the left half of the PEs to communication in the upward direction and the other half to the downward direction, as shown in Fig. 10.14. Using this interconnection, the computation starts from the PEs in the bottom left half of the array, and the results are transferred in the upward direction.
Fig. 10.14 3-D data direction in MuCCRA-Cube
When the results reach the uppermost PE array, the computation goes back down using the right-half PEs.
10.4.3 3-D Inductive Connection Channels
In MuCCRA-Cube, a three-dimensional interconnection is provided for each PE output, as shown in Fig. 10.15. The 26-bit data, consisting of 24-bit data and a 2-bit carry, are sent from the ALU, SMU, and RFile through a multiplexer from the PE in the lower die, and received at the PE in the upper die through the inductive-coupling channels. Data are transferred serially by using two inductive-coupling channels: TxData and TxCLK. At the rising edge of the clock (PE-CLK) of the sender PE array, TxCLK and TxData start a serial data transfer, as shown in Fig. 10.16. Thus, a 2-clock delay is needed for sending the 26-bit data through the 3-D connection channels.
Fig. 10.15 Connections between PEs
Fig. 10.16 Data transfer in 3-D interconnection
10.4.4 Three-Phase Interleaving Scheme
The largest problem when stacking multiple homogeneous arrays is receiving data from the desired transmitter while avoiding the crosstalk from other transmitters placed at the same location. In order to address this problem, the three-phase interleaving (TPI) scheme shown in Fig. 10.17 is introduced. Here, chip00 transmits to chip01 in phase 0, chip01 to chip10 in phase 1, and chip10 to chip11 in phase 2, sequentially, as shown in Fig. 10.17a. In the TPI scheme, it is important for all chips to share the phase accurately. A local clock (CLK) is generated on each chip and broadcast by the clock transmitter to share the phase, as shown in Fig. 10.17b. All the chips can change the phase at the same time, but inter-chip skew in the System Clocks causes the beginning of phase 0 to differ slightly on each chip. In order to guarantee the timing of a data transfer, an idle time and a time margin are inserted. A block diagram of the inductive-coupling interface used in MuCCRA-Cube is shown in Fig. 10.18. The phase control unit (PCU) consists of a counter and a register file.
Fig. 10.17 Three-phase interleaving scheme. a Data Link. b Clock Link
Fig. 10.18 Inductive-coupling interface
In every system clock cycle, the PCU initializes the state for transmitting both the clock and data, receiving both the clock and data, or receiving only the clock, based on a 2-bit chip number provided through bonding wires. According to this state, the transmitters or the receivers are activated. In the transmitting phase, the counter in the PCU counts the local Oscillator CLKs until the number of data bits is reached and then changes the phase. Similarly, during the receiving phase, it counts the Received CLKs until the number of data bits is reached. The TPI is fast enough to transfer the 26-bit data serially between the PE arrays, as will be shown in the evaluation results.
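As a rough sketch of the TPI phase assignment of Fig. 10.17a (my own illustration with hypothetical names; the real control is done by the PCU counter and register file described above), the data-link state of each chip follows directly from its chip number and the current phase.

    def tpi_state(chip_number, phase):
        """Data-link state of a chip in the given TPI phase.

        chip_number: 0..3 (provided through bonding wires in MuCCRA-Cube)
        phase: 0, 1 or 2
        """
        if chip_number == phase:
            return "TX"        # e.g., chip00 transmits in phase 0
        if chip_number == phase + 1:
            return "RX"        # the chip directly above receives
        return "OFF"           # other transceivers stay off to avoid crosstalk

    # Phase schedule for four stacked chips, as in Fig. 10.17a:
    for phase in range(3):
        print(phase, [tpi_state(chip, phase) for chip in range(4)])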
10.4.5 Prototype Chip
MuCCRA-Cube was implemented using ASPLA's 90 nm CMOS process. Cadence NC-Verilog, Synopsys Design Compiler, and Synopsys Astro were used for simulation, synthesis, and layout, respectively. The die size is 2.5 x 5.0 mm, and the PE array operates with a clock of at least 15 MHz; the frequency can be improved depending on the application mapping. Figure 10.19 shows the layout of the chip. The central part is occupied by the PE array, and the inductive-coupling links are aligned on both sides to allow the 180-degree rotation. The memory modules are located on the leftmost side. Four chips are stacked as shown in Fig. 10.20, and thus a total of 64 PEs can work at a time. Table 10.1 lists the area breakdown of the chip. Compared with previous MuCCRAs, the ratio of the TCC is increased in order to store the additional configuration data bits for controlling the 3-D interconnection. The PE array area occupies 53.83% of the total chip area.
Fig. 10.19 Layout of MuCCRA-Cube
Fig. 10.20 Stacked chips
Table 10.1 Breakdown of MuCCRA-Cube chip area

Module   Num   Area (um2)    Total (um2)    Ratio (%)
PE        16    39,538.9       632,622.6      24.87
SE        25    13,047.3       326,181.2      12.82
TCC        1   146,456.5       146,456.5       5.76
CSC        1     8,937.8         8,937.8       0.35
MEM        4    62,835.8       251,343.2       9.88
Etc.       -     3,921.7         3,921.7       0.15
Total      -           -     1,369,463.0      53.83
10.4.6 Communication Link
Table 10.2 lists the specifications of the inductive-coupling links. The maximum frequency is 1.5 GHz, and a bandwidth of 7.2 Gb/s is achieved. In this implementation, since a conservative scale is used, the coils for the data and clock are relatively large, 100 x 100 um2 and 300 x 300 um2, respectively. On the other hand, the active Si area for the link is only 0.031 mm2. The energy consumption of an inductive-coupling transceiver and PCU is quite small, 7.6 pJ/b. The communication links were confirmed to work with a 1.5 GHz local clock. A successful TPI operation was measured, and the results are shown in Fig. 10.21.
Table 10.2 Inductive-coupling link

Local clock                           1.5 GHz
Bandwidth                             7.2 Gb/s/chip
Energy consumption                    7.6 pJ/b
Active Si area                        0.031 mm2
Inductor size for data                100 x 100 um2
Inductor size for clock               300 x 300 um2
Communication distance (thickness)    95 um (chip: 85 um, glue: 10 um)

Fig. 10.21 Measured waveforms of three-phase interleaving
The measured waveforms demonstrate that each chip generates a local clock (CLK) and broadcasts it to share the phase while each PE array works with a 15 MHz PE clock.
10.4.7 Execution Performance
Three applications are implemented using the BlackDiamond [17] retargetable compiler: the Discrete Cosine Transform used in JPEG (DCT), a secure hash algorithm for encryption (SHA-1), and the Discrete Wavelet Transform (DWT) used in JPEG2000. In DCT, the target image block is divided among the stacked PE arrays and executed in parallel; the transformation of the matrix makes the best use of the 3-D data transfer. In DWT, the total job is divided into two or four tasks and executed on the PE arrays in a pipelined manner. In SHA-1, the iterative execution process is divided among the PE arrays, and streaming processing is performed. Figure 10.22 shows the normalized execution time estimated by RTL simulation. Although a 2-clock delay is needed for each transfer between PE arrays, the execution time can be reduced by using multiple PE arrays. In DWT, the synchronization overhead between PE arrays for the pipelined operation degrades the performance improvement, especially when two PE arrays are used. Figure 10.23 shows the utilization ratio, which is computed with the following equation:
    Util. Ratio = (Data amount to 3-D direction / (Exec. cycles x 16 Ch/Plane)) x 100    (10.2)
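For clarity, Eq. (10.2) can be evaluated as in the short sketch below (the variable names are mine; 16 is the number of 3-D channels per plane, as in the equation).

    def utilization_ratio(data_cells_3d, exec_cycles, channels_per_plane=16):
        """Utilization of the 3-D channels in percent, per Eq. (10.2)."""
        return data_cells_3d * 100.0 / (exec_cycles * channels_per_plane)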
Fig. 10.22 Simulated performance of MuCCRA-Cube
Fig. 10.23 Utilization ratio of 3-D channels (%)
As shown in Fig. 10.23, the utilization ratio is not very high for the applications implemented here. In particular, in SHA-1, the pipelined operation requires only a single result to be passed between two neighboring PE arrays, and thus the total amount of data exchanged is small. We implemented the applications so as to minimize the amount of 3-D communication, since the algorithms were originally designed for a 2-D array. Attempts should be made to develop algorithms that take the 3-D array implementation into account in order to make the best use of the wide bandwidth in the 3-D direction.
10.5 Research Perspective
3-D NoCs using wireless inductive coupling enable building-block SoCs that can add or replace functions or performance on demand. If the problem of power delivery is solved, "field-stackable" SoCs will be realized.
A trial to supply power using inductive coupling has been made [19], and such prototype chips have been implemented. The heat dissipation from stacked dies will become a major barrier for such field-stackable building-block SoCs; thus, it should be addressed as a matter of priority.
References
1. H. Amano and Y. Hasegawa and S. Tsutsumi and T. Nakamura and T. Nishimura and V. Tanbunheng and A. Parimala and T. Sano and M. Kato. MuCCRA Chips: Configurable Dynamically-Reconfigurable Processors. Proceedings of the IEEE Asian Solid-State Circuits Conference (ASSCC'07), pages 384-387, 2007.
2. B. Black and M. Annavaram and N. Brekelbaum and J. DeVale and L. Jiang and G. H. Loh and D. McCaule and P. Morrow and D. W. Nelson and D. Pantuso and P. Reed and J. Rupley and S. Shankar and J. P. Shen and C. Webb. Die Stacking (3D) Microarchitecture. Proceedings of the International Symposium on Microarchitecture (MICRO'06), pages 469-479, 2006.
3. J. Burns and L. McIlrath and C. Keast and C. Lewis and A. Loomis and K. Warner and P. Wyatt. Three-Dimensional Integrated Circuits for Low-Power High-Bandwidth Systems on a Chip. Proceedings of the International Solid-State Circuits Conference (ISSCC'01), pages 268-269, 2001.
4. W. R. Davis and J. Wilson and S. Mick and J. Xu and H. Hua and C. Mineo and A. M. Sule and M. Steer and P. D. Franzon. Demystifying 3D ICs: The Pros and Cons of Going Vertical. IEEE Design and Test of Computers, 22(6):498-510, 2005.
5. A. Fazzi and L. Magagmni and M. Mirandola and B. Charlet and L. D. Cioccio and E. Jung and R. Canegallo and R. Guerrieri. 3-D Capacitive Interconnections for Wafer-Level and Die-Level Assembly. IEEE Journal of Solid-State Circuits, 42(10):2270-2282, 2007.
6. K. Kanda and D. D. Antono and K. Ishida and H. Kawaguchi and T. Kuroda and T. Sakurai. 1.27-Gbps/pin, 3mW/pin Wireless Superconnect (WSC) Interface Scheme. Proceedings of the International Solid-State Circuits Conference (ISSCC'03), pages 186-187, 2003.
7. J. Kim and C. Nicopoulos and D. Park and R. Das and Y. Xie and N. Vijaykrishnan and M. Yousif and C. Das. A Novel Dimensionally-Decomposed Router for On-Chip Communication in 3D Architectures. Proceedings of the International Symposium on Computer Architecture (ISCA'07), pages 138-149, 2007.
8. K. Kumagai and C. Yang and S. Goto and T. Ikenaga and Y. Mabuchi and K. Yoshida. System-in-Silicon Architecture and its Application to an H.264/AVC Motion Estimation for 1080HDTV. Proceedings of the International Solid-State Circuits Conference (ISSCC'06), pages 430-431, 2006.
9. F. Li and C. Nicopoulos and T. Richardson and Y. Xie and V. Narayanan and M. Kandemir. Design and Management of 3D Chip Multiprocessors Using Network-in-Memory. Proceedings of the International Symposium on Computer Architecture (ISCA'06), pages 130-141, 2006.
10. H. Matsutani and M. Koibuchi and H. Amano. Tightly-Coupled Multi-Layer Topologies for 3-D NoCs. Proceedings of the International Conference on Parallel Processing (ICPP'07), 2007.
11. M. Miura and H. Ishikuro and K. Niitsu and T. Sakurai and T. Kuroda. A 0.14 pJ/b Inductive-Coupling Transceiver with Digitally-Controlled Precise Pulse Shaping. IEEE Journal of Solid-State Circuits, 43(1):285-291, 2008.
12. N. Miura and D. Mizoguchi and M. Inoue and K. Niitsu and Y. Nakagawa and M. Tago and M. Fukaishi and T. Sakurai and T. Kuroda. A 1Tb/s 3W Inductive-Coupling Transceiver for Inter-Chip Clock and Data Link. Proceedings of the International Solid-State Circuits Conference (ISSCC'06), pages 424-425, 2006.
13. N. Miura and H. Ishikuro and T. Sakurai and T. Kuroda. A 0.14pJ/b Inductive-Coupling Inter-Chip Data Transceiver with Digitally-Controlled Precise Pulse Shaping. Proceedings of the International Solid-State Circuits Conference (ISSCC'07), pages 358-359, 2007.
14. D. Park and S. Eachempati and R. Das and A. K. Mishra and V. Narayanan and Y. Xie and C. R. Das. MIRA: A Multi-layered On-Chip Interconnect Router Architecture. Proceedings of the International Symposium on Computer Architecture (ISCA'08), pages 251-261, 2008.
15. V. F. Pavlidis and E. G. Friedman. 3-D Topologies for Networks-on-Chip. IEEE Transactions on Very Large Scale Integration Systems, 15(10):1081-1090, 2007.
16. R. S. Ramanujam and B. Lin. Randomized Partially-Minimal Routing on Three-Dimensional Mesh Networks. IEEE Computer Architecture Letters, 7(2):37-40, 2008.
17. V. Tunbunheng and H. Amano. DisCounT: Disable Configuration Technique for Representing Register and Reducing Configuration Bits in Dynamically Reconfigurable Architecture. Proceedings of the Workshop on Synthesis and System Integration of Mixed Information Technologies (SASIMI'07), pages 412-419, 2007.
18. V. Tunbunheng and M. Suzuki and H. Amano. RoMultiC: Fast and Simple Configuration Data Multicasting Scheme for Coarse Grain Reconfigurable Devices. Proceedings of the International Conference on Field Programmable Technology (ICFPT'05), pages 129-136, 2005.
19. Y. Yuan and Y. Yoshida and N. Yamaguchi and T. Kuroda. Chip-to-Chip Power Delivery by Inductive Coupling with Ripple Canceling Scheme. Japanese Journal of Applied Physics, 47(4):2797-2800, 2008.
Chapter 11
Influence of Stacked 3D Memory/Cache Architectures on GPUs Ahmed Al Maashri, Guangyu Sun, Xiangyu Dong, Yuan Xie and Narayanan Vijaykrishnan
11.1 Introduction
Graphics Processing Units (GPUs) are highly parallel processing units that offload graphics rendering from microprocessors. For over 20 years, these units were used exclusively for graphics processing, and over that period many technological breakthroughs improved the computational power of GPUs. One of the latest advances is the ability to program the GPU pipeline, allowing non-graphics applications and algorithms to run on top of the GPU. This has opened a whole new research area concerned with utilizing GPUs for running general-purpose applications. Consequently, this has put more pressure on manufacturers to seek innovative ways of improving GPUs even further. For instance, 3D die-stacking, another emerging technology, can be considered for improving GPU performance. In this chapter, we discuss how 3D technology can be implemented in GPUs. We also investigate the problems and constraints of implementing such a technology and propose and assess solutions to these problems. Moreover, we propose architectural designs for a GPU that implements 3D technology and evaluate these designs in terms of cost, power consumption and thermal profile. However, before we delve into that discussion, it is important for the reader to understand how a GPU works and what its architecture looks like. Therefore, the next section introduces GPU technology and its hardware architecture. We also briefly discuss 3D technology and the benefits that it offers for this demanding application.
11.2 Background
The first part of this section presents GPU technology, how it works and its architecture. It is important to point out that the intention is not to discuss every detail of GPU architectures, but rather to introduce a number of concepts that will be needed later when describing the proposed architecture for the GPU. The second part of the section gives a brief discussion of 3D technology.
11.2.1 How Do GPUs Work?
We start by explaining a canonical graphics pipeline, and later in the chapter we discuss some actual pipeline implementations in modern GPUs. Figure 11.1 [1, 29] shows a representation of a canonical graphics pipeline. The first stage of the canonical pipeline is called Application; this is where user programs generate the primitives used to describe the graphics scene. There are two common libraries for writing applications that produce computer graphics, namely OpenGL (current version 4.0) [19] and Microsoft DirectX (current version 11.0) [20]. The example in Fig. 11.2 shows how OpenGL can be used to create a blue square. Next is the Command stage. This stage buffers the commands (primitives) coming from the Application stage. It may also interpret some of these commands if necessary and perform format conversion to ensure that what is passed is compatible with the lower stages. The Geometry stage receives a stream of vertices and commands from the previous stage. This is where vertices are assembled together, forming triangles in a process known as primitive assembly.
Fig. 11.1 A logical representation of a canonical pipeline of the GPU showing the stages involved in rendering graphics
Fig. 11.2 OpenGL describes a blue square by using primitives that define the coordinates of the square vertices and a filling color
glBegin(GL_QUADS);
  glColor3f(0.0f, 0.0f, 1.0f);   /* blue fill color */
  glVertex3i(0, 0, 0);
  glVertex3i(1, 0, 0);
  glVertex3i(1, 1, 0);           /* vertices listed in winding order */
  glVertex3i(0, 1, 0);
glEnd();
Also, this stage performs Clipping (i.e., removing portions of triangles that lie outside the viewable scene) and Culling (i.e., back-facing triangles are eliminated). These triangles are then passed to the Rasterization stage, where they are transformed into fragments (pixels). This is also where color, coordinate and depth attributes are interpolated for the fragments. In the Texture stage, images called textures are draped over the geometry to give the illusion of detail, adding more realism to a scene. When a texture is applied to geometry, it may need to be zoomed in or out to match the texture with the geometry size. This requires either interpolation or filtering. A number of filtering techniques exist, including linear, bilinear, trilinear and quadlinear; the details of how these filters work are beyond the scope of this chapter. Moreover, at this stage the texture address is calculated for each fragment. Once the texture address is known, both texture and fragment are combined at the Fragment stage. A number of other operations are performed at this stage, such as the depth, alpha and stencil tests and color blending. The depth test, for instance, is performed using a depth buffer, which is a special region in memory. This buffer stores the distance from each pixel to the viewer. Each time a new pixel arrives, its z-coordinate is compared to that of the existing pixel in the buffer. The new pixel will replace the existing one only if the former is closer to the viewer. Lastly, the Display stage, as the name implies, is responsible for displaying the output. The basic function of this stage is performing Digital-to-Analog Conversion (DAC) and gamma correction, which is the process of linearizing the non-linear signal-to-light-intensity response. The reader is reminded that what has been described so far is a canonical pipeline. Actual pipelines may vary in terms of stage order and the functionalities assigned to each stage. The next subsection describes a pipeline architecture that is closer to actual GPUs.
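The depth-buffer test described above can be summarized by the following sketch (a simplified illustration of my own, not an actual GPU implementation), assuming a smaller z value means closer to the viewer.

    def depth_test(depth_buffer, x, y, new_z):
        """Keep the incoming fragment only if it is closer than the stored one."""
        if new_z < depth_buffer[y][x]:     # closer to the viewer than the stored pixel
            depth_buffer[y][x] = new_z     # the new fragment replaces the old one
            return True                    # fragment passes and is written
        return False                       # fragment is discarded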
11.2.2 GPU Hardware Architecture
The actual hardware architectural specifications of a modern GPU vary depending on the manufacturer. This subsection describes a generic architecture, and in the following subsection an example of an actual implementation of a commodity GPU is discussed. Figure 11.3 illustrates the GPU pipeline [2] assumed throughout this chapter. The streamer unit fetches the stream of vertices, generated by the user application, from memory through the Memory Controller. This unit has a cache associated with it to capture localities. The streamer then passes these vertices to a pool of graphics processors called shaders, as illustrated in Fig. 11.4. At this stage, these shaders act as Vertex shaders: they execute a kernel that transforms the attributes that came with the streamed vertices (e.g., coordinates, lighting, etc.) into a format that is understood by Primitive Assembly. The next unit performs three operations: Primitive Assembly, Clipping, and Triangle Setup.
Fig. 11.3 A generic GPU pipeline showing the hardware components of each stage
Fig. 11.4 Shader architecture; the texture hardware components are only utilized when the shader pool acts as a fragment shader
This unit represents the Geometry stage discussed in the canonical pipeline. The Rasterization unit transforms the vertices into fragments (i.e., pixels). After that, the Hierarchical Z (HZ) buffer removes fragments that are marked as "cull" by the Rasterization stage and lie outside the viewable window. The Z and Stencil Test (ZST) unit removes non-visible fragments by comparing them against the stencil and depth buffers; this reduces the computational load on the shaders at a later phase. The ZST unit utilizes a cache called the ZST cache. The visible fragments are then passed to the Interpolator, which uses perspective-corrected linear interpolation to generate the fragment attributes from the assembled triangle attributes. Fragments are arranged into quads (groups of four fragments) and are fed into the shader pool, which now acts as a Fragment shader.
Figure 11.4 illustrates a shader [2] consisting of a pipeline and a number of Texture Units (TU). Each TU has a cache associated with it. The Color Write (CW) stage comes right after that; there the fragments are blended with color and then passed to the frame buffer for display. Note that each CW unit has its own local cache. Note also that each shader in the pool acts as a vertex shader at some point in the rendering process and as a fragment shader at a later stage. However, a few years ago this was not the case: older GPUs had separate specialized processors for vertex shaders and fragment shaders. In contrast, modern GPUs use a unified shader architecture that provides one large grid of data-parallel floating-point processors that runs the shaders mentioned above. Therefore, instead of flowing through a fixed pipeline, vertices, triangles and pixels re-circulate through the grid, as shown in Fig. 11.5. This new architecture offers better overall utilization, since the demand for the various shaders varies between applications [21]; the unified architecture accommodates this variation by allocating a varying percentage of its pool of processors to each shader type. In addition, modern GPU manufacturers have replaced the traditional fixed-function 3D graphics pipeline with a flexible general-purpose computational engine. In other words, the GPU pipeline is transformed into an application-programmable pipeline. This ability to program the pipeline, complemented by the continuous and dramatic increase in GPU computational power, has given birth to a new research field known as GPGPU [3], where general-purpose applications are run on top of GPUs to exploit their high computational power. Some GPU manufacturers already support GPGPU; for example, NVIDIA supports GPGPU through its CUDA platform [4]. Similarly, AMD supports GPGPU through the "Stream SDK" platform [5]. In the next subsection, we discuss an actual modern GPU.
Fig. 11.5 Unified GPU architecture uses the same pool of shaders to process elements such as vertices and pixels. The scheduler takes care of allocating a free shader to the element at hand
11.2.3 An Example of a Modern GPU: NVIDIA GeForce GTX280
The GTX280 is NVIDIA's second generation of unified-shader architecture. It is based on the Scalable Processor Array (SPA) framework. The SPA architecture consists of a number of TPC units, as shown in Fig. 11.6. Note that TPC stands for Texture Processing Cluster when the GPU is used in graphics processing mode; in parallel compute mode the "T" stands for Thread. Each TPC contains three Streaming Multiprocessors (SM), eight Texture Filtering processors (TF) and an L1 cache. Each SM contains a single Instruction Unit (IU), eight Streaming Processors (SP) and a local memory. SPs are hardware-multithreaded processors with multiple pipeline stages, so that an instruction can be executed for each thread every clock cycle. As illustrated in Fig. 11.6, the Local Memory in an SM allows the SPs to share data with each other without being forced to read from or write to external memory, which has a great impact on computational speed. Figure 11.7 shows a high-level view of the GTX280 architecture. The Thread Scheduler manages scheduling of threads across the TPCs. The atomic units at the bottom of the figure are responsible for performing atomic read-modify-write operations to memory; these atomic operations prevent race conditions in multithreaded environments. A MIMD (Multiple Instruction Multiple Data) architecture is used across TPCs, and a SIMT (Single Instruction Multiple Thread) architecture is used within each SM. The SIMT unit creates, manages, schedules and executes threads, through the IU, in groups of 32 parallel threads called "warps". In total, the GTX280 can support up to 30,720 concurrent threads in hardware. Table 11.1 lists the main features of the GTX280 GPU.
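The figures in Table 11.1 are mutually consistent, as the following quick sanity calculation (my own, not part of the original text) shows.

    tpcs, sms_per_tpc, sps_per_sm, threads_per_sm = 10, 3, 8, 1024
    total_sps = tpcs * sms_per_tpc * sps_per_sm          # 10 * 3 * 8 = 240
    total_threads = tpcs * sms_per_tpc * threads_per_sm  # 30 SMs * 1,024 = 30,720
    print(total_sps, total_threads)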
Fig. 11.6 TPC architecture in GTX280. Each group of SPs shares a local memory. (Adapted from [6])
Fig. 11.7 GTX280 parallel computing architecture. (Adapted from [6])

Table 11.1 Main features of the NVIDIA GTX280 GPU

Number of TPCs                10
Number of SMs per TPC          3
Number of SPs per SM           8
Total number of SPs          240
Number of threads per SM   1,024
Total number of threads   30,720
GFLOPs                       933
11.2.4 3D Technology
3D TSV technology allows stacking previously distant blocks on top of each other. Dies can be stacked in a Face-To-Face (F2F) or a Face-To-Back (F2B) organization. Figure 11.8a shows F2F stacking, which consists of depositing via stubs on the top-level metal of each die; the two dies are then placed face-to-face such that the stubs of the first die are pressed against those of the second die. The via stubs fuse to hold the two dies together and at the same time form the die-to-die vias needed to connect them. This process allows F2F to offer a high die-to-die via density. On the other hand, F2B stacking, shown in Fig. 11.8b, requires vias to be etched through the back side of the die, a process that reduces the effective density of the vias. However, F2B has an advantage over F2F in that it can be repeated an arbitrary number of times for dies stacked together. Wafer-to-Wafer (W2W) and Die-to-Wafer (D2W) are the most common techniques for bonding dies. Unlike W2W, D2W allows stacking individual dies onto another wafer, resulting in higher flexibility and higher yield. However, this also increases cost because of the additional tests that are required to verify functionality (e.g., Known-Good-Die testing).
Fig. 11.8 Technologies used to stack dies. a F2F. b F2B
Fig. 11.9 A number of cache/core layouts. a 2D planar. b Cache stacked atop core. c Cache and core are partitioned and stacked on themselves. d Same as case c, but this time the partitioned cache is stacked atop the core
3D technology offers a number of benefits, including reduced circuit latency, reduced wire length (which results in a reduction in power consumption), and a reduction in footprint [27]. 3D technology also enables heterogeneous integration, which allows stacking dies fabricated with different process technologies. One of the direct applications of 3D technology is cache stacking. A cache unit can be stacked on itself, improving latency since the vertical vias shorten the critical path. Figure 11.9 shows a number of cache/core stacking layouts. Note that although the figure shows a single core, the concept can be extended to a multi-core system. Caches can be partitioned by the word line (3DWL) or around the bit line (3DBL). Empirical results show that 3DWL is better in terms of access latency [7]. Figure 11.10 illustrates the 3DWL and 3DBL techniques.
11.2.5 MRAM
Magnetic Random Access Memory (MRAM) is another emerging technology [28]. MRAM is a non-volatile memory that has zero standby leakage power and is denser than SRAM.
Fig. 11.10 Two techniques used in partitioning caches. a 2D planar view of SRAM cache. b 3D divided wordline partitioning (3DWL). c 3D divided bitline partitioning (3DBL)
The lower power consumption makes MRAM a better alternative to SRAM when implementing caches for power-aware applications. In this chapter we investigate the implementation of GPU caches using MRAM technology and study the benefits of using such a technology. MRAM uses a Magnetic Tunnel Junction (MTJ) to store data instead of electrical charges. As shown in Fig. 11.11, an MTJ contains two ferromagnetic layers and one tunnel barrier layer (MgO). The direction of one ferromagnetic layer, called the reference layer, is fixed, while the direction of the other layer (the free layer) can be changed by passing a driving current. The relative magnetization direction of the two ferromagnetic layers determines the resistance of the MTJ. Usually, if the two ferromagnetic layers have the same direction, the resistance of the MTJ is low, indicating a "0" state; if the two layers have different directions, the resistance of the MTJ is high, indicating a "1" state. Figure 11.12 shows the most popular way to form an MRAM cell, known as the "one-transistor-one-MTJ" (1T1J) structure, where each MTJ is connected in series with an NMOS transistor.
Fig. 11.11 A conceptual view of the MTJ structure. a High resistance, indicating "1" state. b Low resistance, indicating "0" state
Fig. 11.12 Structural view of an MRAM cell
The gate of the NMOS is connected to the word line (WL), and the NMOS is turned on if the MTJ needs to be accessed during read or write operations. The source of the NMOS is connected to the source line (SL), and the free ferromagnetic layer is connected to the bit line (BL). MRAM requires a special fabrication process that would potentially incur extra cost when integrating MRAM with CMOS technology. However, as discussed earlier, one of the benefits of using 3D technology is heterogeneous integration, which allows stacking MRAM on top of CMOS logic. Figure 11.13 shows a conceptual view of this 3D integration.
Fig. 11.13 A conceptual view of 3D integration where an MRAM die is stacked on top of a CMOS die. The two dies are fabricated individually and connected together by using TSVs
11.3 Employing 3D Technology in GPUs
This section discusses the implementation of 3D technology in graphics processors. We start by showing why 3D technology is needed. Then, we develop the necessary models to evaluate a GPU system built using 3D technology.
11.3.1 Why 3D Technology?
Earlier, we discussed the dramatic increase in the computational power that GPUs offer. We also presented GPGPU and how this new field has increased the demand for more powerful GPUs. However, a number of studies have identified some limitations of the GPU; we list some of them below:
- Since GPUs have relatively small cache sizes compared to CPU caches, fewer blocks can be brought into the GPU cache. This has an impact on the hit rate, since current workloads are becoming more cache/memory intensive.
- The limited floating-point buffer size may prevent the GPU from reaching its theoretical capacity.
- Caches pose significant performance limitations in some GPU workloads.
Interested readers can refer to [8-10] and [24, 25] for more information. The designer needs to be aware of the following constraints when proposing alternate designs to address the limitations stated above:
- As the cache size increases, the access latency also increases. This increase in latency may have a negative impact on performance.
- Adding more resources is usually associated with higher power consumption. Some applications may be affected by this increase (e.g., a laptop running on battery). In addition, the thermal profile of the system would also rise due to the increased power consumption; a higher thermal profile may have a negative impact on performance and may damage hardware components.
- The designer should also consider the chip footprint, since it is directly related to cost.
Employing 3D technology may provide a sound solution to some of the problems and limitations stated above. Therefore, we evaluate the impact of implementing 3D technology on GPU performance. To do so, the following tools are needed:
- A cost model that determines the overall cost of bare die fabrication
- A power model to compute the system's overall power consumption
- A thermal profiling tool
However, the cost and power models require knowledge of the die area. Therefore, we start by developing an area estimation model, followed by a description of the models and tools stated above.
Table 11.2 Examples of area estimation of some of the GPU components

Unit                 Estimated gate count   Area (m2), 65 nm   Area (m2), 45 nm
Streamer                      320,488          8.33E-07           4.17E-07
Primitive assembly            295,912          7.69E-07           3.85E-07
Clipper                        10,216          2.66E-08           1.33E-08
Triangle setup                 47,080          1.22E-07           6.12E-08
Hierarchical-Z              2,589,696          6.73E-06           3.37E-06
Fragment shader               606,404          1.58E-06           7.88E-07
ZST                           836,584          2.18E-06           1.09E-06
Color write                   836,584          2.18E-06           1.09E-06
11.3.2 Area Estimation Model
We estimate the design area of the caches and memory units using the CACTI simulator [11]. Estimating the design area of the other GPU units and stages requires an estimate of the gate count of those units. We use the methodology described in [12] for estimating the gate count: we identify the components and stages in the GPU pipeline (according to our generic pipeline discussed earlier) and estimate the gate count of each component. Then, we use the "logic gate area with overhead" constant as defined by ITRS [13]. For example, the gate area (with overhead) for the 65 and 45 nm technologies is 2.6 and 1.3 um2, respectively. Table 11.2 shows examples of area estimation of some of the components in the GPU pipeline. Using this area estimation model, we can estimate the cost and the power consumption of the designed chips, as described in the following subsections.
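As a minimal sketch of this estimation step (the function name is mine; the gate counts and the per-node gate-area constants are those quoted above), the area of a logic unit is simply its estimated gate count multiplied by the ITRS "logic gate area with overhead" constant of the target node.

    GATE_AREA_UM2 = {"65nm": 2.6, "45nm": 1.3}   # ITRS logic gate area with overhead

    def unit_area_m2(gate_count, node):
        """Estimated silicon area of a pipeline unit in m^2."""
        return gate_count * GATE_AREA_UM2[node] * 1e-12   # um^2 -> m^2

    # Example: the streamer unit of Table 11.2
    print(unit_area_m2(320488, "65nm"))   # ~8.33e-07 m^2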
11.3.3 Cost Model
One of the major constraints when designing any system is cost. In this subsection we develop a model that estimates the cost of bare die fabrication. The cost model presented here is based on the cost model developed in [14] and illustrated in Fig. 11.14. As shown in the figure, a number of factors affect the cost of the design. One of these factors is the die yield, which is computed using the following equation [15]:

    Y_die = Y_wafer x (1 + (Defect_Density x A_die) / alpha)^(-alpha)

where Y_wafer is the wafer yield, A_die is the die area, and alpha is the critical masking level. The cost of the die is another factor. To compute it, we need the number of dies per wafer, obtained from the following equation [15]:
Fig. 11.14 3D cost model
    Dies per wafer = (pi x (wafer_diameter / 2)^2) / A_die - (pi x wafer_diameter) / sqrt(2 x A_die)
Now we can plug the die yield and the dies per wafer into the following equation to compute the cost of a single die before 3D bonding [15]:

    C_die = C_wafer / (Dies per wafer x Y_die)
where C_wafer is the cost of a processed wafer. The total cost of the 3D design can be computed based on the bonding option selected (i.e., W2W or D2W) [14]:

    C_W2W = (Sum_{i=1..N} C_die_i + (N - 1) x C_bonding) / ((Prod_{i=1..N} Y_die_i) x Y_bonding^(N-1))

    C_D2W = (Sum_{i=1..N} (C_die_i + C_KGDtest) / Y_die_i + (N - 1) x C_bonding) / Y_bonding^(N-1)
where C_W2W is the cost of the bare die stack when using W2W bonding, C_D2W is the cost when using D2W bonding, C_bonding is the bonding cost, Y_bonding is the bonding yield, and C_KGDtest is the Known-Good-Die (KGD) test cost. Table 11.3 lists the assumed values of the parameters that appear in the equations above.
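As an illustration, the whole cost model can be sketched in a few lines of Python (function names are mine; the parameters correspond to the symbols above and are meant to be filled with the assumed values of Table 11.3, with units kept consistent).

    import math

    def die_yield(y_wafer, defect_density, a_die, alpha):
        """Die yield model of the equation above."""
        return y_wafer * (1 + defect_density * a_die / alpha) ** (-alpha)

    def dies_per_wafer(wafer_diameter, a_die):
        """Gross dies per wafer, accounting for edge loss."""
        return (math.pi * (wafer_diameter / 2) ** 2 / a_die
                - math.pi * wafer_diameter / math.sqrt(2 * a_die))

    def die_cost(c_wafer, wafer_diameter, a_die, y_die):
        """Cost of a single bare die before 3D bonding."""
        return c_wafer / (dies_per_wafer(wafer_diameter, a_die) * y_die)

    def cost_w2w(die_costs, die_yields, c_bonding, y_bonding):
        """Wafer-to-wafer stack cost for N dies."""
        n = len(die_costs)
        yield_product = 1.0
        for y in die_yields:
            yield_product *= y
        return ((sum(die_costs) + (n - 1) * c_bonding)
                / (yield_product * y_bonding ** (n - 1)))

    def cost_d2w(die_costs, die_yields, c_bonding, y_bonding, c_kgd_test):
        """Die-to-wafer stack cost: dies are KGD-tested before bonding."""
        n = len(die_costs)
        tested = sum((c + c_kgd_test) / y for c, y in zip(die_costs, die_yields))
        return (tested + (n - 1) * c_bonding) / y_bonding ** (n - 1)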
Table 11.3 Cost model parameters and assumed values

Parameter        Description                Assumed value
Y_wafer          Wafer yield                95%
Defect_density   Defects per unit area      3,716
alpha            Critical masking levels    2
Wafer_diameter   Wafer diameter             300 mm
C_wafer          Cost of processed wafer    5,500
Y_bonding        Bonding yield              99%
C_bonding        Bonding cost               US $150 per wafer [22]
C_KGDtest        KGD test cost              1.5 x cost of bare die for a yield of 90%
11.3.4 Power Model and Thermal Profiling
Once again we use results from the CACTI simulator to estimate the power consumption of the caches and memory units. We calculate the dynamic power of the other pipeline components using the following equation:

    P_dynamic = (1/2) x Capacitive_Load x Voltage^2 x Frequency

The nominal voltage and capacitive load can be obtained from ITRS depending on the technology used. As for the interconnect layers, we estimate the dynamic power consumption using the Power Index constant introduced by ITRS: we multiply the number of metal layers and the area of the design by the Power Index. The power indices used in the experiments are 16,000 and 18,000 W/(GHz x m2) for the 65 and 45 nm technologies, respectively. The last step towards computing the total power consumption is estimating the static power; this can be estimated based on its projected contribution to overall power consumption [16, 23]. The next step is generating the thermal profile of the design. Since we have an estimate of the power consumption, we can feed this data to the HotSpot simulator [17] to generate the thermal profile of each of the proposed 3D GPU designs [26]. Now we have all the necessary tools to start designing and implementing a 3D-based GPU architecture.
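The two dynamic-power estimates above can be written as the short sketch below. This is my own reading of the text: in particular, the way the ITRS Power Index (quoted in W per GHz per m2) is scaled by the operating frequency is an assumption, since the chapter only states that the index is multiplied by the number of metal layers and the design area.

    def unit_dynamic_power(cap_load, voltage, freq_hz):
        """Switching power of a logic unit: 0.5 * C * V^2 * f."""
        return 0.5 * cap_load * voltage ** 2 * freq_hz

    def interconnect_power(power_index, metal_layers, area_m2, freq_ghz):
        """ITRS power-index estimate: index [W/(GHz*m^2)] * layers * area * f [GHz]."""
        return power_index * metal_layers * area_m2 * freq_ghz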
11.4 Evaluating a 3D-Based GPU
In this section we investigate a number of approaches that employ 3D technology in GPUs. The first step is to set the scope of what needs to be evaluated and the type of experiments that need to be conducted.
Fig. 11.15 The design space investigated in this section focuses on modifying the organization of the caches and investigating the impact of changing the cache technology on performance
After that, we discuss the results obtained
from these experiments. Finally, we propose architectural designs for a GPU system and assess the performance of each design.
11.4.1 The Design Space
In this chapter we focus only on cache stacking techniques. Therefore, we consider the Streamer, Texture Unit, Z & Stencil Test and Color Write caches. We start by modifying the organization of each of these caches and observe the impact on performance. Modifying the organization includes increasing the number of sets, the line size and the associativity. Based on the preliminary results, we determine which of the caches can potentially improve overall system performance. We also investigate the impact of using a non-volatile memory such as MRAM as a replacement for the SRAM caches. Since the MRAM write time is much longer than that of SRAM, an MRAM implementation is expected to perform better in caches that are seldom written to, such as the texture cache. Figure 11.15 summarizes the design space explored in this section.
11.4.2 Simulation Environment of the GPU Architecture
Attila [2] is a cycle-accurate GPU simulator and is used to run the experiments conducted in this subsection. The workloads used in the experiments are listed in Table 11.4. The baseline GPU configuration assumed in the experiments consists of 64 shader units and an operating frequency of 1.3 GHz. Table 11.5 summarizes the cache organization used in these experiments.
Table 11.4 Workloads used in the experiments [18]; they are diverse in their characteristics (e.g., depth complexity, texture intensiveness, etc.)

Workload   Resolution   No. frames
3Dspace    640 x 480       128
DPS        788 x 540       200
Doom3      320 x 240        40
Prey5      320 x 240        40
Quake4     320 x 240        40
Terrain    788 x 540        15

Table 11.5 The baseline cache organization for the caches considered in the experiments

Cache      Total size (KB)   Organization (# lines x line size x associativity)
Streamer          2           32 x 64 x 1
TU               16           16 x 256 x 4
ZST              16           16 x 256 x 4
CW               16           16 x 256 x 4
11.4.3 Experiments and Results [30]
We start by modifying the caches' organization and observing the improvement in the hit rate. The results show a noticeable improvement only in the TU and ZST caches. Based on these results, we narrow down the design space exploration to the TU and ZST caches exclusively. Figures 11.16 and 11.17 show the hit rate improvement in the TU and ZST caches, respectively.
Fig. 11.16 Hit rate of different TU cache configurations. The notation NL + MW means N lines and M-way associativity. The figure shows how increasing the number of lines increases the hit rate
Fig. 11.17 Hit rate of different ZST cache configurations. The notation NB means N-byte block size. The figure shows that increasing the block size results in higher hit rates in ZST caches
Fig. 11.18 Access time (in cycles) vs. number of lines as observed in the TU cache. The figure shows that latency decreases when partitioning the cache into two or four layers. The 3DWL technique is used
However, increasing the cache size has a negative impact on latency. To quantify latency, we use the CACTI simulator to estimate the access latency of the caches for single-layer, two-layer and four-layer cache configurations. Note that the two-layer and four-layer caches were die-stacked by dividing the word lines (3DWL) [7], since we seek to optimize latency. We also assume that TSV technology can be used to replace the wires connecting the cache partitions across the layers. Figures 11.18 and 11.19 show the access latency for the TU and ZST caches. The figures show how cache stacking has reduced the access latency. These results motivate us to investigate the benefits gained from using 3D technology. The next step is to compare the impact of cache stacking on overall system performance. We conduct a set of experiments while maintaining the same cycle time across all architectures (Iso-Cycle). Figures 11.20 and 11.21 show the outcome of the simulation runs.
Fig. 11.19 Access time (in cycles) vs. block size (in bytes) as observed in the ZST cache. The figure shows that latency decreases when partitioning the cache into two or four layers. The 3DWL technique is used
Fig. 11.20 Execution time speedup with increasing number of layers in the TU cache. All values are normalized to a one-layer design. The number of cache lines is 2,048, using a 2 MB four-way associative cache with 256-byte word size
Fig. 11.21 Execution time speedup with increasing number of layers in the ZST cache. All values are normalized to a one-layer design. A 256 KB four-way associative cache with 4,096-byte word size is used
We observe that 3D technology has improved overall performance. For example, Fig. 11.20 shows that the two-layer layout improved execution time by up to 22%, whereas the four-layer cache improved execution time by up to 55%. Figure 11.21 shows similar trends for the ZST cache, where the two-layer and four-layer cache implementations improve performance by up to 15 and 37%, respectively. When we merge the four-layer stacked TU and ZST caches into a single design and run another set of experiments, we obtain a 53% geometric mean speedup over the 2D design.
11.4.4 Iso-Cost Designs
So far, the experimental results show that 3D technology improves performance when used in GPUs. However, 3D fabrication comes with an extra cost.
Table 11.6 Organization of the TU and ZST caches used in the Iso-Cost scenarios

Cache   Number of lines   Line size (bytes)   Associativity   Total size (KB)
TU      256               128                 4               128
ZST     16                2,048               4               128
Also, stacking dies on top of each other increases the power density, and as a result the thermal impact on performance and hardware is exacerbated. Therefore, in this section we use the tools developed in Sect. 11.3 to propose two design scenarios. In each scenario we compare two GPU architectures, one designed in 2D and the other designed using 3D technology, and we match the cost of the 2D and 3D implementations (Iso-Cost comparison). Table 11.6 shows the TU and ZST cache organizations used in both scenarios; the other caches keep the organization presented in Table 11.5. In the first scenario we have a 2D GPU and a 3D stacked-cache GPU. Both GPUs contain 128 shaders, and both are fabricated using 65 nm technology. In the 2D design, all caches and GPU processing units are in the same layer. In contrast, the first layer of the 3D GPU contains the GPU processing units, while the other two layers contain the partitioned ZST and TU caches, as illustrated in Fig. 11.22a. The cost of each configuration is matched by adjusting architectural parameters such as the size of the queues that interface with the caches. The GPU die area is about 170 mm². This is a relatively large area, and hence we use D2W bonding, since W2W costs more due to yield deteriorating with increasing die area [14]. The cost is computed and found to be approximately US $235. Figure 11.22b shows that for this iso-cost scenario, the 3D architecture achieves up to 45% speedup over the 2D planar architecture, with a geometric mean of 16%. In the second scenario we exploit one of the benefits that 3D technology offers, namely heterogeneous integration, by combining two different process technologies. In the first layer of the 3D design we lay out the GPU units in 65 nm technology, while the second layer is fabricated in 45 nm technology, as illustrated in Fig. 11.23a. Working with a smaller feature size allows cramming all the
Fig. 11.22 Iso-Cost design (Scenario I): a the cache is divided into two layers and stacked atop the GPU unit layer; b the 3D design improves performance by up to 45%, with a geometric mean of 16% (results are normalized to the 2D design)
Fig. 11.23 Iso-Cost design (Scenario II): a 45 nm technology is used to fabricate the caches in a single layer; b the 3D design improves performance by up to 46%, with a geometric mean of 19% (results are normalized to the 2D design)
caches into a single layer, saving the cost otherwise incurred by bonding an additional layer. Figure 11.23b shows that the 3D design outperforms the 2D design by up to 46%, with a geometric mean of 19%. Note that the die area of this 3D design is about 175 mm² and the cost is US $166.
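The iso-cost matching relies on a 3D cost model in the spirit of [14]: per-layer die cost derived from wafer cost and an area-dependent yield, plus a bonding cost and bonding yield per stacked interface. The sketch below is a simplified, hypothetical version of such a model; every constant is invented, and the printed dollar values are illustrative only, not the $235 and $166 figures reported above. It merely shows why large dies are penalized by yield and how a denser 45 nm cache layer can cut the layer count.

```python
# Simplified 3D-IC cost sketch in the spirit of [14]; every constant here is
# a made-up placeholder, used only to show the structure of the comparison.
import math

def die_yield(area_mm2, d0_per_mm2=0.002, alpha=3.0):
    """Negative-binomial yield model: larger dies yield worse."""
    return (1.0 + d0_per_mm2 * area_mm2 / alpha) ** (-alpha)

def dies_per_wafer(area_mm2, wafer_diam_mm=300.0):
    r = wafer_diam_mm / 2.0
    return math.floor(math.pi * r * r / area_mm2
                      - math.pi * wafer_diam_mm / math.sqrt(2.0 * area_mm2))

def layer_cost(area_mm2, wafer_cost):
    """Cost of one known-good die of the given area (hypothetical wafer cost)."""
    return wafer_cost / (dies_per_wafer(area_mm2) * die_yield(area_mm2))

def d2w_stack_cost(layer_areas, wafer_costs, bond_cost=5.0, bond_yield=0.98):
    """Die-to-wafer stacking: dies are tested before bonding (known-good die)."""
    dies = sum(layer_cost(a, w) for a, w in zip(layer_areas, wafer_costs))
    n_bonds = len(layer_areas) - 1
    return (dies + n_bonds * bond_cost) / (bond_yield ** n_bonds)

# Scenario I-like stack: logic layer plus two 65 nm cache layers, ~170 mm^2 each.
print(f"3-layer, 65 nm stack : ${d2w_stack_cost([170, 170, 170], [15000, 15000, 15000]):.0f}")
# Scenario II-like stack: logic at 65 nm plus one denser 45 nm cache layer.
print(f"2-layer, 65/45 nm    : ${d2w_stack_cost([175, 175], [15000, 20000]):.0f}")
```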
11.4.5 Power and Thermal Constraints
In this subsection, we use the power model developed earlier to compute the power consumption of the 3D designs from both scenarios described in the previous subsection. We find that the power consumption of the 3D GPU from the first scenario is 106.4 W, while the power consumption of the 3D GPU from the second scenario is 82.1 W.
Fig. 11.24 Thermal profiling of the Iso-Cost 3D designs (temperature in °C over the die area in mm). a Scenario I: maximum temperature is 121.55°C, average temperature is 105.41°C. b Scenario II: maximum temperature is 82.24°C, average temperature is 70.78°C
Fig. 11.25 Comparing system-level improvement using four-layer SRAM and MRAM caches (speedup normalized to SRAM for each benchmark: 3DSpace, DPS, Doom3, Prey5, Quake4, Terrain). The TU cache is 2 MB four-way associative and the ZST cache is 256 KB four-way associative
We use the power consumption of each component in the design to generate a thermal profile with the HotSpot simulator. We find that the average temperature of the 3D design from the first scenario is 105.41°C, while the maximum temperature is 121.55°C. In contrast, the average temperature of the 3D design from the second scenario is 70.78°C, while the maximum temperature is 82.24°C. Figure 11.24 shows the thermal profiles of both designs. The thermal profile of the second scenario looks reasonable. However, both the maximum and the average temperature of the first scenario violate the ITRS recommendation of a 100°C operating temperature. Therefore, in the next subsection we discuss an alternative to SRAM caches with lower power consumption and a better thermal profile.
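The pass/fail judgment against the ITRS guideline can be previewed with a crude one-resistor estimate before running a full HotSpot simulation. In the sketch below, the two power figures come from the scenarios above, but the ambient temperature and the junction-to-ambient thermal resistance are hypothetical package values, so the temperatures it prints are not the HotSpot results.

```python
# Crude thermal sanity check: junction temperature approximated as
# T_ambient + R_theta_ja * total power. The chapter's actual profiles come
# from a HotSpot simulation; ambient and R_theta_ja below are hypothetical.

ITRS_LIMIT_C = 100.0
T_AMBIENT_C = 45.0
R_THETA_JA = 0.55   # degC per watt, hypothetical package/heatsink value

for name, power_w in (("Scenario I (3-layer, 65 nm)", 106.4),
                      ("Scenario II (2-layer, 65/45 nm)", 82.1)):
    t_junction = T_AMBIENT_C + R_THETA_JA * power_w
    verdict = "violates" if t_junction > ITRS_LIMIT_C else "meets"
    print(f"{name}: ~{t_junction:.1f} degC -> "
          f"{verdict} the {ITRS_LIMIT_C:.0f} degC ITRS guideline")
```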
11.4.6 MRAM as an Alternative to SRAM
In the previous subsection, the thermal profile of the 3D GPU design from Scenario I was not satisfactory. This motivates us to seek alternatives to SRAM caches with lower power consumption. In this subsection, we run a set of experiments to compare MRAM and SRAM caches, using two metrics: execution speedup and power consumption. Figure 11.25 shows the simulation output when running iso-cycle designs. The figure shows that workloads with fewer writes than reads perform better with MRAM. However, due to the slow write latency of MRAM compared to SRAM, we observe a performance degradation in workloads where write transactions are more frequent. Turning to power consumption, Fig. 11.26 shows that as the cache size increases, the power consumption of the MRAM cache becomes relatively lower than that of the SRAM cache. Therefore, although MRAM may not always outperform SRAM across all workloads, its power benefits make it an attractive alternative to SRAM for power-aware applications.
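The read/write-mix argument can be made concrete with a simple average-latency model: MRAM reads are roughly competitive with SRAM, but writes are several times slower, so the effective access time crosses over as the write fraction grows. The latencies and the 25% tolerance threshold in the sketch below are hypothetical placeholders; the chapter's conclusions rest on the full simulations of Fig. 11.25.

```python
# Effective cache access time as a function of the write fraction.
# Latency numbers (in cycles) are hypothetical placeholders.

SRAM = {"read": 2.0, "write": 2.0}
MRAM = {"read": 2.0, "write": 6.0}   # the slow write is the MRAM penalty

def effective_latency(tech, write_fraction):
    """Weighted average of read and write latency for a given access mix."""
    return (1.0 - write_fraction) * tech["read"] + write_fraction * tech["write"]

for wf in (0.05, 0.10, 0.20, 0.40):
    s, m = effective_latency(SRAM, wf), effective_latency(MRAM, wf)
    # Tolerate a modest slowdown given MRAM's density/power benefits
    # (the 25% threshold is an arbitrary illustration).
    verdict = "MRAM competitive" if m <= 1.25 * s else "SRAM favoured"
    print(f"write fraction {wf:.2f}: SRAM {s:.1f} cyc, MRAM {m:.1f} cyc -> {verdict}")
```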
Fig. 11.26 Normalized power consumption of MRAM and SRAM caches for cache sizes from 16 KB to 2 MB. As the cache size increases, the MRAM cache consumes less power than the SRAM cache
11.5 Summary
We started this chapter by introducing a number of technologies, including GPUs, 3D integration, and MRAM. We then discussed how and why 3D technology can be used in GPU implementations, and developed the tools and models needed to evaluate the proposed architectures. We explored different approaches to employing cache stacking to improve overall system performance. A number of experiments were conducted, and the results show that 3D technology improves the overall performance of the system. We also compared 2D and 3D GPU designs of the same cost and showed that 3D technology still offers a performance improvement. We computed the power consumption of the proposed 3D designs and generated their thermal profiles by simulation. Finally, we compared MRAM to SRAM in terms of performance gain and power consumption.
Acknowledgment The work presented in this chapter was supported in part by NSF grants 0903432 and 0702617.
References
1. Stanford University CS448a Spring 2007 Real-Time Graphics Architecture, available at: http://graphics.stanford.edu/cs448-07-spring/
2. V. M. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and R. Espasa, "ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures," in Proc. International Symposium on Performance Analysis of Systems and Software, 2006, pages 231–241
3. General-Purpose Computation Using Graphics Hardware, available at: www.gpgpu.com
4. Nvidia: CUDA Homepage, available at: http://www.nvidia.com/object/cuda_home.html
5. ATI Stream Software Development Kit (SDK), available at: http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx
6. GeForce GTX 200 Technical Brief, available at: http://www.nvidia.com/docs/IO/55506/GeForce_GTX_200_GPU_Technical_Brief.pdf
7. Y.-F. Tsai, Y. Xie, N. Vijaykrishnan, and M. J. Irwin, "Three-Dimensional Cache Design Exploration Using 3DCacti," in Proc. International Conference on Computer Design, 2005, pages 519–524
8. N. Govindaraju, S. Larsen, J. Gray, and D. Manocha, "A Memory Model for Scientific Algorithms on Graphics Processors," in Proc. Conference on High Performance Networking and Computing, 2006, Article No. 89
9. N. Goodnight, C. Woolley, G. Lewin, D. Luebke, and G. Humphreys, "A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware," in Proc. SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2003, pages 102–111
10. K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication," in Proc. SIGGRAPH, 2004, pages 133–137
11. CACTI Cache Simulator, available at: http://www.hpl.hp.com/research/cacti/
12. V. K. Kodavalla, "IP Gate Count Estimation Methodology During Micro-Architecture Phase," in IP based Electronic System Conference and Exhibition, Dec. 5–6, 2007, Grenoble, France, available at: http://www.design-reuse.com/ipbasedsocdesign/slides_2007-32_01.html
13. ITRS, "International Technology Roadmap for Semiconductors," available at: www.itrs.net
14. X. Dong and Y. Xie, "System-Level Cost Analysis and Design Exploration for 3D ICs," in Proc. Asia and South Pacific Design Automation Conference, 2009, pages 234–241, Yokohama, Japan
15. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition, Wiley, San Francisco, CA, 2010
16. M. S. S. Govindan, S. W. Keckler, S. R. Nassif, and E. Acar, "A Temperature Aware Power Estimation Methodology," ASPDAC, January 2008
17. K. Skadron, M. R. Stan, W. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-Aware Microarchitecture," in Proc. International Symposium on Computer Architecture, 2003, pages 2–13
18. Attila Project: AttilaWiki, available at: https://attila.ac.upc.edu/wiki/index.php/Main_Page, 2008
19. OpenGL, available at: http://www.opengl.org/
20. DirectX Library, available at: http://www.microsoft.com/games/en-US/aboutGFW/pages/directx.aspx
21. D. Luebke and G. Humphreys, "How GPUs Work," in IEEE Computer, vol. 40, no. 2, pages 126–130, 2007
22. S. Jones, "2008 IC Economics Report," IC Knowledge LLC, 2008, available at: http://www.icknowledge.com/
23. S. Rodriguez and B. Jacob, "Energy/Power Breakdown of Pipelined Nanometer Caches (90nm/65nm/45nm/32nm)," in Proc. International Symposium on Low Power Electronics and Design, 2006, pages 25–30
24. J. D. Hall, N. Carr, and J. Hart, "Cache and Bandwidth Aware Matrix Multiplication on the GPU," Technical Report UIUCDCS-R-2003-2328, University of Illinois Urbana-Champaign, 2003
25. M. Silberstein, A. Schuster, D. Geiger, A. Patney, and J. D. Owens, "Efficient Computation of Sum-Products on GPUs Through Software-Managed Cache," in Proc. International Conference on Supercomputing, 2008, pages 308–318
26. G. L. Loi, B. Agrawal, N. Srivastava, S.-C. Lin, T. Sherwood, and K. Banerjee, "A Thermally-Aware Performance Analysis of Vertically Integrated (3-D) Processor-Memory Hierarchy," in Proc. Design Automation Conference, 2006, pages 991–996
27. K. Puttaswamy and G. H. Loh, "Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors," in Proc. HPCA, 2007, pages 193–204
28. M. Hosomi, H. Yamagishi, and T. Yamamoto, "A Novel Nonvolatile Memory with Spin Torque Transfer Magnetization Switching: Spin-RAM," in International Electron Devices Meeting, 2005, pages 459–462
29. J. Owens, "GPU Architecture Overview," in Proc. International Conference on Computer Graphics and Interactive Techniques, 2007, Article No. 2
30. A. Al Maashri, G. Sun, X. Dong, V. Narayanan, and Y. Xie, "3D GPU Architecture Using Cache Stacking: Performance, Cost, Power, and Thermal Analysis," in Proc. International Conference on Computer Design (ICCD), 2009
Index
2D mesh, 117, 122–124, 127, 129, 130, 133, 138, 141, 142, 167 3D Ciliated mesh, 119, 123, 130, 133, 135 3D Cube, 149, 152, 153, 156 3D Floorplanning, 168, 169, 171, 218 3D Integrated Circuit (3D IC), 5, 116, 140, 143, 149, 150, 154, 157 3D Integration technology (3DI), 28, 43, 149–152 3D Multi-processor, 133,169, 197 3D Opt-mesh, 221 3D Packaging, 11, 190, 194 3D Stacked Integrated Cisrcuit (3D SIC), v, vi, 47–62, 65–71 3D Stacking, 5, 24, 26, 73, 77, 79, 190 3D Technology, 23, 24, 73, 75, 78, 152, 168, 194, 202, 204, 222, 249, 250, 255, 256, 258, 259, 262, 265–267, 270 3D Topology, 129, 178 A Accumulation Algorithm, 55, 60, 80, 121, 153, 156, 168–170, 174, 177, 178, 180, 181, 183–186, 188–191, 204, 205, 207–211, 217, 220–222, 229, 245, 246, 249, 271 Alignment, 11–16, 59, 150, 151, 157 Analytical model, 25, 29, 144 Application chip, 227–234 Application Specific 3D-NoC Architecture, 3, 5, 7, 9, 27, 28, 30, 31, 35, 36, 40–43, 47, 48, 58, 65–67, 70, 71, 73–75, 82, 89, 90, 100, 112–117, 119, 121–124, 126, 127, 129, 130, 132, 133, 135, 138–145, 147, 149, 152, 153, 155, 157, 162, 164, 165, 167–171, 173–175, 177–179, 181, 183, 185–187, 189–191, 193, 194, 196–203, 222, 223, 225–229,
232, 233, 237, 247–251, 253–255, 257, 259, 261–263, 265, 267, 269–271, 273 Area, 11, 14, 24, 27, 29–33, 35, 36, 39–43, 49, 50, 55, 59, 61, 64, 69, 70, 72, 77–79, 82, 83, 95, 96, 102–104, 106, 107, 109–111, 116, 117, 120–122, 124, 125, 127, 130, 132, 139–141, 143, 149, 150, 157, 164, 168–172, 174, 186, 196, 198, 199, 201–203, 225, 239, 243, 244, 249, 259, 260, 262, 267, 268 Arithmetic, 238 Arithmetic and Logic Unit (ALU), 238 Assembly, 9, 12, 15, 26, 50, 52, 76, 85, 151, 247, 251, 252 Asynchronous Circuit, 150, 154–156, 162, 164, 165 Asynchronous NoC, 155–157, 165 Async-to-Sync FIFO, 155 Average number of hops, 92, 93, 103, 106, 122, 123, 133 B Back End Of Line (BEOL), 50 Back-to-face, 12, 14 Bandwidth, 8–10, 23, 27, 28, 47, 65, 73, 79, 82, 86, 87, 92, 93, 113, 119, 135, 143, 150, 156, 157, 159, 162, 164, 167, 186, 189–191, 196, 197, 203–206, 208, 211, 218, 219, 223, 231, 232, 244, 246, 247, 271 BFT, 120, 124, 125, 132, 133, 142, 143 Bisection channel, 152 Bonding, 10, 12–26, 49, 51, 59, 61, 73, 74, 77, 88, 103, 150, 151, 194, 225, 226, 239, 243, 255, 261, 267, 268 Boundary Scan Test, 60 Breakdown voltage Built in self test (BIST), 65, 70, 71
274 Bus, 3, 4, 8, 9, 27, 33–36, 47, 58, 65, 73, 75, 78, 79, 89, 92–96, 98, 103, 104, 106, 107, 115, 117, 119, 125–127, 129, 130, 133, 135, 149, 155, 156, 167, 169, 193, 196, 197, 223, 225, 249, 273 C Cache, 28, 30, 31, 36, 40, 87, 133, 169, 229, 232, 249, 251–257, 259–271 Capacitance, 16, 34, 78, 79, 94, 97–99, 106, 109, 113, 117, 126, 127, 151, 218 Capacitive coupling, 11, 74, 225 Cardinal direction, 117, 119 Centralized memory, 29, 37 Characterization, 25, 26, 42 Chip, 3, 4, 7–13, 15, 16, 21, 23–26, 30, 31, 33–37, 39–43, 47–49, 51, 52, 54, 58, 61, 62, 65, 69–74, 76–80, 82, 84–90, 92, 98, 99, 101–103, 105, 106, 111–117, 119–121, 123, 125, 127, 129, 131, 133, 135–137, 139–141, 143–145, 149, 150, 152, 154, 155, 164, 165, 167–171, 173, 175, 177, 179, 181, 183, 185, 187, 189–191, 193– 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 217–219, 221–223, 225–234, 238, 239, 242, 243, 245, 247, 248, 259, 260 Chip design, 65, 72, 87, 191, 222, 227–229 Circuit, 3–5, 7, 9–13, 15, 17, 19–21, 23, 25–27, 35, 43–45, 47–49, 56, 59, 66, 67, 71–75, 79, 80, 82, 85, 87–89, 95, 97, 99, 100, 103, 108, 112–117, 130, 139, 140, 143, 144, 149, 150, 152, 154–162, 164, 165, 167, 168, 175, 186, 189–191, 193, 195, 196, 222, 225, 247–249, 256, 273 Clock, 8, 40, 60, 61, 83, 86, 97, 99, 100, 102, 103, 106, 107, 113, 114, 121, 122, 125, 126, 132, 135, 138, 149, 150, 154–157, 159, 165, 167, 172–175, 186, 193, 196, 218, 226, 230, 237–239, 241–245, 247, 254 Clock delivery, 239 Clock skew, 103, 154, 155 C-Muller, 161 Communication, 27, 28, 30, 31, 34, 36, 43, 48, 96, 103, 112, 116, 119, 121, 127, 129, 133, 135, 138–142, 144, 149, 152, 153, 155, 157–163, 165, 167, 169–171, 173, 174, 178, 189, 190, 193, 196, 197, 200–202, 204–208, 217–219, 223, 227, 229–231, 233–235, 239, 240, 244, 246, 247 Communication Demand Graph (CDG), 173, 178
Index Communication Graph, 204, 205, 208, 218, 219 Computer Aided Design (CAD), 75, 86 Congestion, 83, 197, 198 Contact, 11, 14, 18, 20, 22, 23, 61, 62, 65, 74, 226, 227 Contactless wafer probing, 62 Context Switching Controller (CSC), 238 Cooling, 86, 168, 195, 222 Copper, 8, 13, 18, 22, 23, 25, 26, 34, 35, 49, 51, 61, 73, 77, 78, 95, 115, 144, 151, 194 Cost model, 63, 64, 168, 189, 197, 259, 260 Cost-efficiency trade-off, 157, 164 Crosstalk, 112, 242 Cu-Cu Bonding, 17, 23, 61 Cycle-Accurate, 116, 125, 153, 263 D Delay model, 112, 168–170, 189 Delay-Insensitive, 159, 161 Demultiplexor, 160, 162 Depletion, 78 Deserializer, 157–160, 163, 164 Design, 7, 15, 25, 26, 43, 45, 48, 49, 56, 57, 65, 70, 72–87, 89–91, 96–98, 100, 103, 108, 111–117, 125, 126, 129, 130, 135, 138, 140, 143, 144, 147, 149, 150, 152, 154, 155, 158–160, 162, 164, 165, 167–175, 177–179, 181, 183, 185–187, 189–191, 193–204, 207–209, 211–213, 217–220, 222, 223, 227–229, 238, 243, 246, 247, 249, 259–264, 266–271 Design-for-Test (DfT), 65 Diameter, 30, 49, 51, 61, 152 Die, 4, 12, 13, 16, 21, 22, 26, 28, 30, 31, 33–35, 47–56, 58–73, 75–77, 79, 86, 102, 115, 120, 124, 149–154, 157, 164, 167, 171, 175, 190, 193, 194, 217, 222, 225–227, 236, 239, 241, 243, 247, 249, 253, 255, 256, 259–261, 265, 267, 268 Dielectric, 12, 13, 16, 21, 26, 34, 35, 77, 102, 115, 175 Die-to-die, 51, 149, 150, 157, 194, 255 Die-to-wafer, 51, 194, 255 Distributed memory, 29, 30, 236 Dual-rail, 158, 159, 161, 162 Dynamic power consumption, 97, 99, 262 Dynamic Power Management (DPM), 154, 155 Dynamic Random Access Memory (DRAM), 8, 9, 24, 28, 30, 32, 32, 33, 35, 40, 59, 70, 84, 85, 149, 152 Dynamic Voltage and Frequency Scaling (DVFS), 154, 155
Index E Effective Computational Efficiency (ECE), 31, 42 Electronic Design Automation (EDA), 24, 58 Electrostatic discharge (ESD), 226 Encryption, 245 Energy consumption, 31, 82, 244 Energy dissipation, 116, 119, 120, 122, 123, 126, 127, 129, 132, 135, 138, 139, 141–144, 154, 226 Etching, 16, 49, 225 Exploration, 113, 211, 218, 264, 270, 271 F Fabrication, 11, 16, 17, 25, 49, 51–53, 59, 73, 78, 80, 116, 149, 151, 152, 154, 155, 168, 190, 193, 225, 258–260, 266 Face-to-face, 12, 14, 18, 22, 50, 51, 225, 255 Feature size, 7, 33–35, 99, 167, 193, 267 Field stackable, 246, 247 Filling, 16, 17, 73 Flit, 121–123, 125–127, 129, 133, 135, 156–158, 163, 186, 227, 230, 232–234 Floorplan, 83–86, 140, 168–171, 174, 190, 194, 195, 198, 199, 202, 203, 211–213, 218, 219, 222, 223 Floorplanning, 85, 86, 168–171, 174, 190, 194, 195, 199, 218, 222, 223 FO4, 125, 126, 135 Footprint, 48, 49, 78, 79, 84, 119, 136, 139–141, 143, 194, 256, 259 Four-phase asynchronous communication, 161 FPGA, 27, 236 Front End Of Line (FEOL) G Geometry, 27, 33, 36, 250, 251 Global interconnect, 7, 33, 97, 99, 111, 113, 115, 150, 193 Global wires, 8, 49, 149, 150, 193, 197 Globally Asynchronous Locally Synchronous (GALS), 103 Graphics Processing Unit (GPU), 249 H Heat dissipation, 15, 116, 139, 140, 168, 194, 247 Heterogeneous integration, 4, 13, 24, 256, 258, 267 Homogeneous 3D NoCs, 90 Homogeneous integration
275 Hop, 35, 71, 73, 92, 93, 103, 104, 106, 107, 112, 121–124, 127, 129, 130, 133, 136, 145, 155, 157, 172, 183, 186, 188, 203, 205, 206, 248 Hot spot, 85, 86 Hyper cube, 80 I Impedance, 104, 106 Inductance, 113 Inductive coupling, 11, 62, 225, 226, 246–248 Information Technology (IT), 4 Integrated Circuit (IC), v, 3, 4, 10, 12, 17, 48, 59, 116, 130, 139, 140, 143, 149, 150, 154, 157, 193 Intellectual property (IP), 10, 121, 136 Inter layer link, 198, 204–207, 209–212, 220 Interconnects, 8, 10, 11, 17, 19, 26, 47–49, 55, 58–60, 69, 71, 79, 89, 95–97, 99, 102, 106, 107, 111, 113–116, 126, 130, 133, 144, 150, 152, 157, 164, 167, 168, 174, 175, 177, 178, 191, 193, 194, 196, 197, 203, 222, 223, 225 Interface, 13–15, 18, 20, 23, 26, 61, 65, 69, 74, 78, 83, 84, 151, 155, 167, 173, 196, 200, 222, 226, 237, 242, 247, 267 International Roadmap for Semiconductors (ITRS), 95, 102, 115, 260, 262, 269 Intrinsic Computational Density (ICD), 29 Intrinsic Computational Efficiency (ICE), 27, 29, 42 Inversion K Known-Good Die, 53, 54, 71, 151 L Landing pad, 51, 61, 63 Latency, 7, 8, 10, 27, 28, 42, 89, 91–94, 96, 101–106, 108, 111, 112, 116, 121–123, 127, 129, 130, 132, 135, 136, 138, 143, 152, 153, 155, 156, 158, 165, 167, 169, 196–199, 203–208, 217, 219, 220, 256, 259, 265, 269 Leakage, 6, 7, 16, 25, 35, 59, 86, 97–100, 178, 186, 191, 222, 256 Leakage power, 86, 97–100, 178, 186, 191, 256 Library, 125, 140, 174, 177, 186, 203, 217, 223, 271 Liquid cooling, 195 Localized traffic, 92, 93, 122, 123, 125, 133, 135
276 M Magnetic Random Access Memory (MRAM), 256 Manhattan distance, 133, 212 Materials, 4, 14, 22, 26, 59, 63, 115 Measurements, 73 Memory controller, 28, 251 Memory Modules (MEMs), 236 Memory subsystem Metal, 4, 7, 12, 13, 16–18, 22–26, 50, 77, 90, 99, 100, 109, 110, 115, 151, 176, 194, 199, 226, 255, 262 Micro bump, 24, 51, 59, 61, 62, 71, 225, 226 Micro cooling, 168 Microprocessor, 13, 73, 87, 189, 190, 222, 229, 249 Mixed technology, 80 Modeling, 6, 25, 92, 113, 114, 144, 191, 223 Moore’s law, 3, 4 MUCCRA Cube, 236, 237–239, 241–243 Multi Chip Modules, 276 Multi Chip Package (MCP), 47 Multicast Routing Graph (MRG), 180, 181 Multicast Routing Tree (MRT), 180–182 Multicast traffic, 171, 173, 186, 189 Multiple instruction multiple data (MIMD), 254 Multiplexor, 160–162 Multi-Processor System on Chip (MPSoC) N Netlist, 126, 186 Network diameter, 152 Network Interface Controller (NIC), 155 Network on Chip (NoC), 116, 120, 143, 167, 193, 226 NoC topologies, 89–93, 95, 97–99, 101–103, 105, 107, 109–111, 113, 120, 175, 204, 222 NoC-based system, 112 Noise O Off-chip memory, 30, 31, 33, 37, 39 On-chip memory, 31, 39 P Package-on-Package (PoP), 47 Packaging, 9, 11, 23, 24, 52–54, 71, 73, 74, 76, 78, 79, 86, 113, 152, 165, 190, 194 Packet latency, 156, 158 Parasitic, 6, 34, 35, 85 Partitioning, 69, 70, 80, 82, 83, 169, 194, 202, 205–208, 211 Partitioning graph, 205–208
Index Performance analysis, 43, 122, 124, 270, 271 Performance evaluation, 43, 114, 115, 117, 119–121, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143–145, 223 Phase Control Unit (PCU), 242 Physical analysis, 89, 91, 93, 95, 97, 99, 101, 103, 105, 107, 109, 111, 113 Physical design, 103, 222 Physical link, 89, 92, 93, 99, 101, 103, 104, 111, 122, 174, 181, 209, 210 Pin, 9, 14, 18, 58–60, 74, 77, 80, 86, 88, 99, 112, 114, 117, 120, 124, 132, 164, 169, 189, 190, 194, 197, 199, 222, 223, 225, 230, 239, 243, 247, 248, 251, 252, 259, 271 Pipeline, 80, 135, 138, 160, 191, 219, 245, 246, 249–254, 260, 262, 271 Pitch, 6, 8, 9, 15, 17, 35, 49, 51, 61, 73, 76–79, 82, 90, 93, 99, 100, 102, 110, 111, 150, 157, 164, 193, 194, 198, 226, 229, 232, 233 Place and route, 82, 83 Placement, 84, 112, 144, 169, 171, 173, 190, 197, 203, 221, 222, 263 Planning, 85, 86, 116, 152, 168–171, 174, 190, 194, 195, 199, 218, 222, 223 Platform, 3, 13, 16, 117, 144, 197, 223, 253 Plug, 261 Post-bond testing, 54, 65, 67, 71 Power, 3, 4, 6, 8–11, 24, 27–29, 31, 35, 36, 39–44, 47–50, 60–62, 72, 77–80, 82, 84–87, 89–92, 94, 96–102, 105–114, 116, 123, 126, 129, 139–141, 149, 154, 155, 164, 168–172, 174–178, 181–184, 186, 188, 189, 191, 193, 194, 196–199, 202–207, 209, 211, 217–222, 225, 226, 237, 239, 246–249, 253, 256, 257, 259, 260, 262, 267–271 Power consumption, 8–10, 27–29, 36, 40, 80, 87, 89, 90, 96–99, 105–107, 110–112, 149, 164, 168, 169, 171, 172, 174–177, 181–184, 186, 189, 193, 194, 196, 199, 202–207, 209, 211, 218, 220–222, 226, 249, 256, 257, 259, 260, 262, 268–270 Power delivery, 62, 84, 85, 246, 248 Power density, 43, 97, 102, 109, 110, 116, 139–141, 171, 175, 267 Power distribution, 113 Power grid Pre-bond testing, 54, 55, 61, 63, 67, 71 Printed Circuit Board (PCB), 47, 66, 85 Process variability Processing elements (PE), 80, 90, 152, 237
Index Processor, 4, 8, 9, 13, 27, 28, 36, 40, 43, 44, 72, 73, 80, 87, 88, 102, 112, 117, 133, 144, 165, 167–169, 186, 189–191, 193, 197, 202, 218, 219, 222, 223, 229, 232, 236–238, 247, 249, 251, 253, 254, 259, 271 Propagation delay, 92, 94, 150, 154, 155, 163 Q QDI, 161, 162 R RC delay, 7, 8, 126, 193 Redundancy, 60, 72, 87, 157, 165 Redundant TSVs, 60, 157 Register File (RFile), 238 Regular topology Resistance, 7, 8, 17, 34, 89, 94, 95, 98, 100, 110, 126, 150, 151, 218, 257 Resistivity, 18, 19, 34, 95 Resolution Enhancement Technique (RET), 7 RGB image, 237 Rip up, 168, 174, 178, 180–182, 185 Ripup Reroute and Router Merging (RRRM), 168, 178, 189 Router merging, 168, 174, 178, 183, 184, 189 Routing, 92, 120, 121, 125, 127, 153, 156, 168–170, 172–174, 178–183, 185, 186, 189–191, 194, 208, 209, 223, 226, 229, 236–238, 248 Routing Cost Graph (RCG), 178 RS Flip-Flop, 161, 162 RTL, 173, 245, 249 S Saturation threshold, 152, 153, 157, 164 Scalable Processor Array (SPA), 254 Scaled Partitioning Graph (SPG), 206, 207 Self-inductance Serialization, 92, 157–160, 163, 164 Serialization delay, 92 Serialization level, 158, 160, 163, 164 Serialized vertical link, 149–151, 153, 155, 157–159, 161, 163–165 Serializer, 157–160, 163, 164 Shader, 251–254, 263, 267 Shared memory, 190, 219 Shift and Mask Unit (SMU), 238 Signal, 8, 34, 56, 60, 61, 67, 73, 77, 82, 84, 86, 96, 97, 103, 115, 116, 149, 154, 167, 190, 232–235, 238, 239, 242, 251 Silicon area, 27, 30, 31, 49, 64, 77, 79, 82, 116, 122, 127, 130, 132, 198, 201 Single instruction multiple thread, 254
277 Solder balls Spacing, 21, 34, 35, 93 SPIN, 18, 120, 124, 132, 271 Square coil, 226 Stack, 4, 5, 7, 8, 10–12, 14, 15, 17, 21–26, 28, 30, 33, 41–43, 47, 49–51, 53–67, 69–73, 75–77, 79, 80, 82, 84–86, 91, 92, 98, 100, 102–105, 109, 110, 112, 114, 115, 117, 123, 124, 126, 127, 129, 130, 133, 135, 136, 144, 149, 151–153, 164, 167, 168, 171, 175, 190, 193, 194, 204, 207, 208, 218, 222, 225–227, 230, 237, 239, 240, 242, 243, 245–247, 249, 251, 253, 255–259, 261, 263, 265–267, 269–271 Stacked mesh, 117, 123, 126, 127, 129, 130, 133, 135 Stacking, 5, 7, 11, 12, 14, 15, 21, 24, 26, 28, 49, 51, 54–56, 60, 63, 65, 71–73, 75–77, 79, 80, 82, 86, 103, 104, 115, 149, 151, 152, 164, 190, 193, 194, 222, 226, 239, 242, 247, 249, 255, 256, 258, 263, 265, 267, 270, 271 Stacking method, 239 Static power consumption, 262 Static Random Access Memory (SRAM), 30, 31, 33, 35, 40, 80, 152, 236, 237, 257, 263, 269, 270 Static TDM scheme, 230, 234 Streaming multiprocessors (SM), 254 Substrate, 11, 13, 15, 16, 21, 23, 26, 35, 47, 49, 50, 59, 76, 77, 80, 84, 109, 114, 194 Supply voltage, 34 Synchronization, 60, 103, 138, 149, 154, 155, 169, 238, 245 Synchronizer, 164, 165 Sync-to-Async FIFO, 155 System, 3–5, 8–10, 13, 18, 23–31, 33, 35, 39–44, 47, 48, 70, 72–75, 80, 82, 83, 85–87, 89–91, 93, 95, 97–99, 101–103, 105–115, 117, 120, 121, 123, 139, 143–145, 147, 149, 150, 152–159, 164, 165, 167–169, 189–191, 193–199, 222, 223, 225, 237, 242, 243, 247–249, 256, 259, 260, 263, 265, 270, 271, 273 System in Package (SiP), 10, 47, 194 System C, 153, 156, 158 T Task Configuration Controller (TCC), 238 Temperature, 16–20, 22, 23, 25, 26, 51, 84–86, 89, 90, 92, 95, 96, 98–102, 104–106, 108–111, 114, 116, 139–144, 154, 155, 222, 269, 271
278 Test Probe Thermal analysis, 86, 114, 271 Thermal conductivity, 139 Thermal variability Thermo-compression bonding, 17, 19, 20, 22, 51, 73 Thermo-fluidic Thickness, 7, 14, 16, 34, 35, 49, 78, 194 Thinning, 12, 14–16, 49, 54, 59, 62, 67–69, 71, 73 Three-dimensional integration, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 43, 72, 113, 116 Through Silicon Via (TSV), v, vi, 5, 10, 12, 16, 34, 47, 49, 75, 77, 150, 151, 167, 175, 194, 225 Throughput, 11, 16, 42, 74, 116, 121–123, 127, 129, 130, 132, 133, 135, 136, 138, 143, 149, 156–159, 162, 163, 227 Tier, 55, 56, 76, 83, 86, 95, 99–101, 110, 111, 114, 227–229 Tile, 7, 11, 14, 27, 33, 36, 39, 43, 61, 114, 117, 144, 189, 190, 256, 263, 271 Tilera processor, 36 Time Division Multiple Access (TDMA), 117 Time Division Multiplex (TDM), 227, 229, 230 TSV-based technologies, 48–52 Tungsten, 78 U Uniform traffic, 92 V Vertical bus, 95, 96, 98 Vertical channel, 93, 149 Via first, 12, 14, 73 Via last, 12, 14 Via middle, 12, 14
Vias, 34, 43, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 77, 78, 82, 88, 96, 122, 125, 130, 150, 151, 167, 175, 176, 194, 195, 225, 255, 256 Void, 17, 20, 21, 23, 26, 51, 55, 59, 76, 77, 98, 151, 155, 195, 225, 226, 230, 238, 242 W Wafer, 11–20, 22, 23, 25, 26, 48, 49, 51–56, 58–65, 67–69, 71, 73–76, 85–87, 151, 190, 194, 222, 225, 226, 247, 255, 260, 261 Wafer-to-wafer, 11–13, 15, 51, 71, 73, 194, 255 Width, 8–10, 23, 27, 28, 34, 35, 47, 65, 69, 73, 79, 82, 86, 87, 92, 93, 99, 113, 119, 125, 127, 133, 135, 143, 150, 156, 157, 159, 162, 164, 167, 172, 173, 186, 189–191, 196–198, 203–206, 208, 211, 218, 219, 223, 231, 232, 244, 246, 247, 271 Wire bounding Wire length, 34, 49, 78, 83, 120, 124, 127, 129, 132, 135, 138, 139, 150, 172, 175, 177, 212, 218, 219, 226 Wire-bonds, 47, 51 Wireless inductive coupling, 226, 246 Wormhole, 92, 120, 122, 125, 144, 153, 169, 170, 190, 191 Y Yield, 7, 39, 48, 49, 52, 53, 55, 56, 59, 63, 64, 71, 73, 76, 83, 86, 87, 122, 136, 150–152, 154, 157, 164, 194, 195, 197, 198, 220, 255, 260, 261, 267 Z Zero-load latency, 92, 111, 169