Retargetable Processor System Integration into Multi-Processor System-on-Chip Platforms
Andreas Wieferink · Heinrich Meyr · Rainer Leupers
Dr. Andreas Wieferink
CoWare, Inc.
Gruener Weg 1
52070 Aachen
Germany
[email protected]

Prof. Dr. Heinrich Meyr
ISS, RWTH Aachen
Templergraben 55
52056 Aachen
Germany
[email protected]

Prof. Dr. Rainer Leupers
SSS, RWTH Aachen
Templergraben 55
52056 Aachen
Germany
[email protected]
ISBN: 978-1-4020-8574-1
e-ISBN: 978-1-4020-8652-6
Library of Congress Control Number: 2008930276

© 2008 Springer Science+Business Media B.V.

No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Dedicated to my wife Julia, to my brother Juergen, and to my parents Harm-Hindrik and Johanna.
Contents

Foreword
Preface

1. INTRODUCTION
   1.1 Challenge: From Board to SoC
   1.2 Degrees of SoC Customization
       1.2.1 Computation
       1.2.2 Communication
   1.3 Organization of this Book

2. SOC DESIGN METHODOLOGIES
   2.1 Traditional HW/SW Co-Design
       2.1.1 HW/SW Co-Simulation
       2.1.2 Automatic Synthesis
   2.2 System Level Design
       2.2.1 Motivation
       2.2.2 Standardization
       2.2.3 Design Flows
   2.3 Current Research on SoC Design Methodologies
       2.3.1 Bottom-Up SoC Design
       2.3.2 Top-Down SoC Design
   2.4 Contribution of this Work

3. COMMUNICATION MODELING
   3.1 Transaction Level Modeling
       3.1.1 Use Cases
       3.1.2 Abstraction Levels
   3.2 Generic Communication Modeling
       3.2.1 Architect's View Framework (AVF)
       3.2.2 Generic TLM Simulation Modules
   3.3 Communication Customization
       3.3.1 Communication IP Providers
       3.3.2 Protocol Specific TLM Interfaces
   3.4 The BusCompiler Tool
       3.4.1 Cycle Accurate Communication Modeling
       3.4.2 BusCompiler Input Specification

4. PROCESSOR MODELING
   4.1 Generic Processor Modeling
       4.1.1 Native Execution on the Simulation Host
       4.1.2 Generic Assembly Level
   4.2 Processor Customization Techniques
       4.2.1 Selectable Processor Core IP
       4.2.2 (Re-)Configurable Processor Architectures
       4.2.3 ADLs
   4.3 LISA
       4.3.1 LISA Processor Design Platform
       4.3.2 Abstraction Levels
       4.3.3 LISA 2.0 Input Specification

5. PROCESSOR SYSTEM INTEGRATION
   5.1 Simulator Structure
       5.1.1 Standalone Processor Simulator
       5.1.2 The LISA Bus Interface
       5.1.3 SystemC Wrapper
   5.2 Adaptors: Bridging Abstraction Gaps
       5.2.1 LISA Bus/Memory API
       5.2.2 TLM Communication Module API
       5.2.3 API Mapping
       5.2.4 Bus Interface State Machine
   5.3 Commercial SoC Simulation Environments
       5.3.1 CoWare PlatformArchitect System Simulator
       5.3.2 Synopsys SystemStudio SoC Simulator

6. SUCCESSIVE TOP-DOWN REFINEMENT FLOW
   6.1 Phase 1: Standalone
       6.1.1 SoC Communication
       6.1.2 LISA Standalone
   6.2 Phase 2: IA ASIP ↔ AVF Communication Models
   6.3 Phase 3: IA ASIP ↔ CA TLM Bus
   6.4 Phase 4: CA ASIP ↔ CA TLM Bus
   6.5 Phase 5: BCA ASIP ↔ CA TLM Bus
   6.6 Phase 6: RTL ASIP ↔ CA TLM Bus
   6.7 Phase 7: RTL ASIP ↔ RTL Bus

7. AUTOMATIC RETARGETABILITY
   7.1 MP-SoC Simulator Generation Chain
   7.2 Structure of the Generated Simulator
       7.2.1 Creating the Communication Infrastructure
       7.2.2 Generating SystemC Processor Models
       7.2.3 Generating Adaptors
   7.3 Bus Interface Specification
       7.3.1 Overview
       7.3.2 Feeding Data into the State Machine
       7.3.3 Characterizing the State Machine
       7.3.4 Getting Data Out of the State Machine
       7.3.5 Advantages

8. DEBUGGING AND PROFILING
   8.1 Multi-Processor Debugger
       8.1.1 Retargetable Standalone Simulation
       8.1.2 Multi-Processor Synchronization
       8.1.3 Dynamic Connect
       8.1.4 Source Code Level Debugging with GNU gdb
   8.2 TLM Bus Traffic Visualization
       8.2.1 Message Sequence Charts (MSC)
       8.2.2 Word Level Data Display
   8.3 Bus Interface Analysis
       8.3.1 CoWare PlatformArchitect Analysis
       8.3.2 Bus Interface Optimization

9. CASE STUDY
   9.1 Multi Processor JPEG Decoding Platform
       9.1.1 The JPEG Application
       9.1.2 Platform Topologies
       9.1.3 Platform Performance Indicators
   9.2 Phase 2: IA + AVF Platform
   9.3 Phase 3: IA + BusCompiler Platform
   9.4 Phase 4: CA + BusCompiler Platform
   9.5 Phase 5: BCA + BusCompiler Platform

10. SUMMARY

Appendices
A Businterface Definition Files
   A.1 Generic AMBA 2.0 Protocol
   A.2 Derived AMBA 2.0 Protocols
   A.3 AMBA 2.0 Bus Interface Specification
B Extended CoWare Tool Flow

List of Figures
References
Index
Foreword
Computer architecture presently faces an unprecedented revolution: the step from monolithic processors towards multi-core ICs, motivated by the ever increasing need for power and energy efficiency in nanoelectronics. Whether you prefer to call it MPSoC (multi-processor system-on-chip) or CMP (chip multiprocessor), this revolution doubtless affects large domains of both computer science and electronics, and it poses many new interdisciplinary challenges. For instance, efficient programming models and tools for MPSoC are largely an open issue: "Multi-core platforms are a reality – but where is the software support" (R. Lauwereins, IMEC). Solving this will require enormous research efforts as well as the education of a whole new breed of software engineers who bring the results from universities into industrial practice. At the same time, the design of complex MPSoC architectures is an extremely time-consuming task, particularly in the wireless and multimedia application domains, where heterogeneous architectures are predominant. Due to the exploding NRE and mask costs, most companies are now following a platform approach: invest a large (but one-time) design effort into a proper core architecture, and create easy-to-design derivatives for new standards or product features. Needless to say, only the most efficient MPSoC platforms have a real chance to enjoy a multi-year lifetime on the highly competitive semiconductor market for embedded systems. This book, based on the Ph.D. thesis of Andreas Wieferink, presents a novel approach to disciplined MPSoC platform design. It builds on a proven processor architecture exploration methodology (LISATek, alias CoWare Processor Designer) and shows how the traditional exploration scope can be extended beyond the processor boundaries. The proposed stepwise refinement methodology enables the connection to new de-facto MPSoC modelling standards such as SystemC and TLM, so as to encourage processor/communication co-design. The inherent retargetability of the tool flow facilitates, and partially automates, MPSoC design space exploration. Moreover, in our view, Dr. Wieferink makes
a significant contribution to MPSoC software development, too, by means of a novel retargetable multicore debugging concept embedded into standard SystemC based simulation. We expect that this book will help both engineers and managers gain insight into state-of-the-art system design methodologies. Furthermore, we hope to stimulate further researchers to pick up this exciting area, and to join us in taking on the challenges of the MPSoC revolution!

Rainer Leupers and Heinrich Meyr, April 2008
Preface
This book reflects more than five years of research performed at the Institute for Integrated Signal Processing Systems (ISS) at the RWTH Aachen University. It combines the tools and methodologies developed during three large electronic design automation projects, each of them alone already comprising more than 15 man-years of effort by graduate engineers.

The first basis is the GRACE++/NoC project, also undertaken at the ISS institute. It targets the early phases of the chip design flow. Abstract, manually written or parameterizable generic simulation models are used for architecture exploration on platform level. This project is documented in an earlier Springer book [1], and has meanwhile largely been commercialized by CoWare Inc. My work carries on this flow for designing highly optimized programmable platforms, which in the most general case contain customized processors and customized communication modules. For exploring the individual platform elements on a finer grained level, automated approaches are necessary to provide suitable simulators.

The second basis is the LISA project, whose origin is in the ISS institute as well. This project provided high expertise in the design of customized processors, which is reflected in multiple earlier Springer/Kluwer books [2, 3]. The LISA 2.0 tools are now commercially available as ProcessorDesigner at CoWare. This book deals with the system integration capabilities of such LISA 2.0 processors, which have been developed in the course of my work. Finally, the BusCompiler project at CoWare Inc. is the third basis for this book. It enables modeling of on-chip communication modules, similar to what LISA allows for processor modules.

Bringing all this together would not have been possible without the good cooperation of the brilliant people working on these projects, and the guidance of the supervisors. First of all, I would like to thank my PhD advisor and co-author Prof. Heinrich Meyr for his support, and for providing a creative working atmosphere
with critical, inspiring discussions, but leaving enough space for implementing the projects as well. Also, I would like to thank my second reviewer and co-author Prof. Rainer Leupers for his deep interest in my work and for the valuable feedback. Furthermore, my thanks go to Prof. Jens-Rainer Ohm for his commitment as third thesis reviewer. Of course, I am very thankful to the colleagues and project partners at the university and in industry for their great work and their good cooperation. Concerning the GRACE++/NoC project, I would like to thank Tim Kogel, Malte Doerper and Torsten Kempf. For the LISA project, I am especially thankful to Andreas Hoffmann, Achim Nohl, Gunnar Braun and Oliver Schliebusch. On the BusCompiler side, my thanks are directed to Tom Michiels and Niels Vanspauwen. Furthermore, I would like to thank Lisa Rotenberg and Torsten Kempf for reviewing the entire thesis. Last but not least, my very special thanks go to my family: to my wife Julia for her support and understanding during the hard writing period, and to my parents Harm-Hindrik and Johanna for making an academic career possible in the first place.

Andreas Wieferink, March 2008
Chapter 1 INTRODUCTION
1.1 Challenge: From Board to SoC
Starting with the invention of the transistor in 1947 and the development of the first working Integrated Circuit (IC) about ten years later, a rapid evolution in semiconductor technology has taken place. Gordon Moore predicted as early as 1965 that integrated circuit density would double roughly every 1.5 to 2 years [4]. Until today, this forecast has turned out to be true. According to the International Technology Roadmap for Semiconductors (ITRS) [5], this trend can even keep up at least until 2020.¹ Currently, about a billion transistors can already be implemented on a single chip, offering an enormous potential for electronic functionality. And this functionality is increasingly demanded by the consumer markets. Multimedia applications, networking devices and the wireless communication domain, for example, are experiencing rapidly increasing bandwidths as well as functionality. Due to the heuristic Logarithmic Law of Usefulness [6], the exponentially increasing circuit density and thus complexity is actually needed in order to obtain a linear increase in usefulness.

¹ Although the 2003 edition of ITRS reported on a deceleration, the 2005 and 2007 editions observed a reacceleration of scaling.

With these huge growths on the technology as well as on the demand side, it becomes an enormous challenge to design the new, innovative electronic systems of rapidly increasing complexity. Furthermore, fierce competition in this market leads to ever shrinking time-to-market and product lifetimes. Traditional chip design techniques do not scale to the complexity demanded today.

Decades ago, new systems were typically built by using new combinations of off-the-shelf chips, soldered onto a specifically designed printed-circuit board
(PCB). This approach worked fine for the system complexity and device volumes of those days. Compared to the production costs, development and prototyping effort was almost negligible. In the last decade, semiconductor technology reached such densities that complex systems could be implemented on a single chip rather than a board (System-on-Chip, SoC). Thus, the task of designing chips moved closer to the product designers. Development costs increased drastically. However, a specialized chip, once designed, could be produced relatively cheaply, amortizing the higher development costs over high chip volumes.

With the highly increased complexity, today's system designers face two typically contradictory goals more than ever: amortizing the development costs over as many products and devices as possible by heavy reuse on a suitable level, while at the same time always delivering those chips that best satisfy the performance requirements for the respective end product. Furthermore, energy efficiency is more and more becoming a limiting criterion, especially for mobile devices. Despite the ever increasing computational performance demanded by modern applications, battery capacity has stayed nearly constant over recent years. The key to these problems is to find an optimal level of customization for the SoC [7]. The next section presents today's techniques to implement functionality on a chip, each of them representing a specific trade-off between these constraints.
1.2 Degrees of SoC Customization

1.2.1 Computation

The ultimate goal is to implement complex algorithms optimally in an SoC, while retaining a maximum of flexibility for executing additional or modified standards, as well as for reusing the same basic design for future product generations. For each component of a complex SoC design, an implementation technique exists which best fulfills the respective constraints concerning flexibility, required computation power and energy efficiency.

The system modules having the highest flexibility constraints, but usually lower performance requirements, are best implemented in software (Figure 1.1) [8]. Most generally, this software runs on an embedded general purpose processor or a microcontroller as provided by ARM [9] or MIPS [10]. However, this implementation technique typically implies poor energy efficiency.

Figure 1.1. The Energy-Flexibility Gap [8] (log-scale plot of flexibility and power dissipation versus performance, spanning general purpose processors such as the StrongARM110 at 0.4 MIPS/mW, digital signal processors such as the TMS320C54x at 3 MIPS/mW, application specific instruction set processors and field programmable devices on the SW design side, and application specific and physically optimized ICs on the HW design side; source: T. Noll, RWTH Aachen, modified)

The first step in providing more customized processing power is moving to a domain specific processor. Digital Signal Processors (DSPs), as available from Texas Instruments [11] or Sandbridge [12], offer additional instructions for operations typically occurring in the signal processing domain, e.g. Multiply-Accumulate (MAC). Other popular domain specific processors are Network
Processing Units (NPUs), which are targeted at rather simple operations on huge amounts of data.

In contrast, system modules with very high performance requirements are usually fully implemented by hardware designers. The highest performance is obtained by physically optimized ICs. But the design of those is a very tedious and error-prone task, which leads to a very long time-to-market. Even worse, there is no flexibility left at all: a change in the specification may lead to a complete re-design of the entire system module. The first step in providing more flexibility is Application Specific ICs (ASICs), which are automatically synthesized from a more abstract hardware model written in a textual HDL (Hardware Description Language). By modifying this HDL model, changes can be put onto silicon much faster. But still, once the silicon is produced, no subsequent modifications are possible. Thus, neither of these techniques offers any way of extending or even just modifying the functionality of an existing chip.

In order to solve this flexibility problem, field programmable devices like FPGAs (Field Programmable Gate Arrays) are offered by companies like Xilinx [13]. They are still based on an HDL hardware description, but here the synthesized circuit information is not used to produce the silicon; instead, it configures a previously produced generic silicon area. This technique is much more flexible, since the device can even be reconfigured at any time. However, this flexibility comes at the cost of performance.
In order to obtain one reconfigurable gate, roughly ten real gates are necessary on the silicon, which leads to roughly an order of magnitude worse chip area as well as timing. Nevertheless, this technique is already widely accepted for fast prototyping and for low and medium volume production.

As can be seen in Figure 1.1, there is still a relatively large gap between the areas covered by the software and the hardware designer. This is the field of Application Specific Instruction set Processors (ASIPs). Since they are optimally tailored to the respective target applications rather than to generic benchmark suites, they offer higher performance to the embedded software designer than more general processors. On the other hand, being programmable rather than only configurable, they provide higher flexibility to the system designer than pure hardware solutions. Multi-Processor System-on-Chip (MP-SoC) platforms with customized processors are currently one of the most promising approaches to meeting flexibility and efficiency constraints at the same time [14]. Of course, designing and programming ASIPs requires a totally new design methodology, combining expertise in software design as well as in processor design. Recently, powerful tools have been developed which enable the system designer to succeed in this challenging task (cf. Section 4.2.3).
1.2.2 Communication
Not only the complexity of the SoC building blocks is currently increasing, but also their number. Future (MP-)SoC designs will contain dozens or even hundreds of system modules, and the resulting communication issues will be one of the limiting system bottlenecks [15, 16]. Thus, bus topologies will not necessarily be the best suited communication infrastructure any more, at least for the global interconnect. An important design challenge in the development process of tomorrow's large scale SoCs is to design the complex communication architecture by which the system modules are connected. The disciplined exploration of alternative communication infrastructures has recently been subsumed under the Network-on-Chip (NoC) design paradigm.

The SoC designer can differentiate or tweak the communication part of the design by two means: first, by changing the implementation of the communication modules or nodes, and second, by changing their interconnect topology. In the early days, the SoC communication topology was limited to point-to-point connections, or at most to a bus-based interconnect. What was adapted or even specifically developed for the respective use were the communication protocols and the bus nodes themselves. In contrast, today's NoC design methodologies adapt the interconnect topology to the application's needs, using a fixed and limited set of communication modules and protocols. However, the largest degree of freedom for optimal SoC implementation is available if both means of customization are applied.
Figure 1.2. Main Book Chapters (the designer, guided by the methodology of Chapter 6, writes a specification; the tool and platform generators of Chapter 7 generate the platform and its instrumentation; the system simulation of Chapter 5 and the debugging and profiling facilities of Chapter 8 feed results back to the designer)

1.3 Organization of this Book
After this introduction into the MP-SoC topic in the first chapter of the book, the following three chapters summarize the prior art in the fields of SoC design methodologies, communication modeling and processor modeling. The fifth chapter leads over to the author's own work (Figure 1.2). Here, the general structure of an MP-SoC system simulator is described, combining the generated simulators for the processors with those for the communication modules. This simulator structure is basically the same for all involved abstraction levels and SoC simulation environments.

The new interfacing capabilities on multiple levels of abstraction enable a successive top-down design flow with joint optimization of the processor modules and their communication (Chapter 6). While one module is being refined, it can directly be verified for correctness in the very same system environment. This co-exploration methodology with customized modules bears the potential for much better optimized MP-SoC platforms than a bottom-up design flow that assembles predefined, fixed IP (Intellectual Property) blocks. Such a flow with customized modules is only feasible when the simulation adaptors are automatically tailored to the processor as well as to the SoC communication side. Also, the adaptors must be able to bridge an abstraction gap if necessary. The techniques to accomplish this retargetability to customized processors and generated TLM (Transaction Level Modeling) communication modules are presented in Chapter 7.

Of course, interaction between the designer and the design tools is still necessary. First, the user is assisted in debugging the SoC modules as well as the
embedded software. Second, the system simulation is instrumented to collect profiling information, which guides the designer's decisions. The newly developed debugging and profiling capabilities of the system simulators are presented in Chapter 8. A case study of a JPEG decoding platform is described in Chapter 9. As one focus, the influence of the number of processors, the cache configuration, the pipeline lengths, the bus node types and their interconnect topology is evaluated. Also, the inaccuracy introduced by more abstract models is examined. Finally, Chapter 10 summarizes and concludes the book.
Chapter 2 SOC DESIGN METHODOLOGIES
This chapter reviews current trends and approaches in the area of SoC design. First, the research done on traditional HW/SW co-design is summarized. Second, the need for system level design is motivated, and an overview of the standardization and a rough categorization of possible design approaches is given. The third section reviews current research on SoC design methodologies. Finally, the motivation for this book is derived. This survey focuses on general SoC design research which is relevant in the context of this book. For the main disciplines, only a few important representatives are named. More detailed overviews can be found in [1] and [17]. Related work in the areas of communication and processor modeling is given in Chapters 3 and 4.
2.1 Traditional HW/SW Co-Design
During the 1990s, Moore's Law step by step led to chip integration densities that allowed electronic systems to be constructed from modules on a chip rather than from chips on a board. Thus, the hardware (HW) and software (SW) of a system moved closer together. The emerging research on hardware/software co-design focused on two fields: HW/SW co-simulation as a means of system analysis, and synthesis methods [17]. A lot of research in this area has been done in academia and industry. This chapter gives a condensed overview of the research during the 1990s. It focuses on a few early representatives and on those environments which were able to gain commercial relevance.
2.1.1 HW/SW Co-Simulation
Typically, at the implementation level, the hardware portions of a system are simulated with an HDL (Hardware Description Language) simulator, while the target software is executed on Instruction Set Simulators (ISSs). It was recognized early that the simulation of the entire SoC functionality is a compulsory analysis method for system verification before fabrication [18]. Thus, the different kinds of simulators and the different Models of Computation (MoC) had to be combined. In the 1990s, system simulation was typically done on a very low abstraction level. This led to tedious and error-prone modeling, as well as to very low simulation speeds. Furthermore, the complexity of the SoCs was rising faster than the performance of the simulation hosts.
Ptolemy (Berkeley)
The Ptolemy project [19] at the University of California at Berkeley is an early approach to co-simulation of different MoCs. This environment allows simulating heterogeneous systems by introducing a separate domain for every involved MoC. Examples of these domains are Synchronous Data-Flow (SDF), Dynamic Data-Flow (DDF) and Discrete-Event (DE). The coordination framework provides interfaces that allow different domains to interact during simulation.
Seamless CVE (Mentor)
Mentor Seamless [20] is one of the commercialized HW/SW co-simulation environments. It is still widely used for verification of HW/SW interfaces prior to silicon fabrication. HW/SW co-verification is enabled by providing a library of instruction accurate processor simulators. The cycle accurate pin wiggling information is determined using a Bus Interface Model (BIM) suitable for the respective processor. Visualization of the traffic on the HW/SW interface helps verify the HW/SW communication. However, this co-simulation technique is far too slow to be used in earlier design phases.
Early GRACE/LISA Co-Simulation (Aachen University)
An approach to increasing system simulation speed was made by linking together two design frameworks developed at Aachen University [21, 22]. LISA processor simulators [23], applying fast compiled simulation, were coupled to GRACE hardware models [24], which allow higher simulation speed by being modeled on a higher level of abstraction than Register-Transfer-Level (RTL). However, the interface protocol was modeled only pin and cycle accurately, and the coupling could not yet be generated automatically.
2.1.2 Automatic Synthesis
Traditional HW/SW co-design research also aimed for an automatic path from a formalized system specification to an optimal implementation. The most ambitious approaches (full System Synthesis) targeted implementing the complete SoC automatically. Many research projects were initiated in this field. The main subtasks research has focused on are:

Automatic Partitioning [25, 26]. One of the most challenging tasks is HW/SW partitioning. For automated design decisions on system level, the input specification needs to be analyzed and evaluated based on typically very vague cost functions. On this level, permanent worst case assumptions often lead to inefficient results.

Behavioral Synthesis [27]. For the computation modules implemented in hardware, behavioral synthesis targets automatic implementation based on a more abstract input specification. Often, complex state machines are generated from just a few lines of input code. For many algorithm types, the product quality is far behind handcrafted RTL code.

Communication Synthesis [28]. With an increasing number of system modules, optimized communication between them is becoming more important. Thus, researchers have also started to customize the communication infrastructure to the application's requirements automatically. However, in the 1990s these frameworks were limited to point-to-point and at most to bus-based communication.

Interface Synthesis [29, 30]. This field deals with the synthesis of the HW/SW interfaces. Only the tedious and error-prone process of implementing the basic driver software as well as the glue logic on the hardware side is automated.

Due to the mentioned problems, the original goal of a fully automatic synthesis as known from RTL has not been reached satisfactorily. Virtually none of the ambitious all-embracing co-design environments have gained industrial relevance.
Vulcan (Stanford) and Cosyma (TU Braunschweig)
Already in the early 1990s, very similar approaches to automatic partitioning were developed in parallel: Vulcan from Stanford [25] and Cosyma from TU Braunschweig [26]. Both take a C-like program as input and partition it automatically into a critical part executed in hardware and a non-critical part executed on a standard CPU. Cosyma starts with a full SW implementation on a single processor and extracts the critical parts onto a hardware accelerator,
while Vulcan starts with a full implementation in hardware and takes the opposite direction.
Behavioral Compiler (Synopsys)
Synopsys, the market leader in RTL synthesis, put a lot of effort into developing behavioral synthesis during the 1990s. The Behavioral Compiler [27] takes an extended VHDL subset as input. The necessary hardware operations are extracted, and their earliest and latest possible execution times are determined. According to constraints given by the user, the needed functional units are allocated, and a state machine is generated which is capable of performing the scheduling. The resulting quality¹ depends heavily on the type of application as well as on the user's coding style. The Behavioral Compiler was discontinued by Synopsys in 2003.

¹ For RTL implementations, the main quality indicators are the resulting maximum clock speed, the chip area covered, as well as energy efficiency.

N2C (CoWare)
In 1996, CoWare was founded to commercialize the interface synthesis technology [30, 31] developed at IMEC [32]. In the Napkin-to-Chip (N2C) framework [33], the target application needs to be specified and partitioned manually, typically using the proprietary CoWareC language. Communication between different modules is modeled on a high abstraction level, based on Remote Procedure Calls (RPC). A large library of integrated commercial processor IP, as well as the HW/SW and HW/HW interface generation capabilities, supports the SoC designer in the tedious and error-prone design tasks. However, CoWare did not yet support customized processors at that time (cf. Section 4.3).
2.2 System Level Design
This section motivates system level design and presents the standards and de-facto standards that are already established in this area. Finally, general concepts for system level design flows are introduced.
2.2.1 Motivation
The HW/SW co-design research of the 1990s provided an important basis for today's research in SoC design. However, it also showed the limitations of what is achievable, even for the simpler applications of that time. In general, the approaches either suffered from a restrictive input formalism, or the synthesis results were too inefficient. Thus, most IC designers kept following the traditional flow, where the application phase is largely decoupled from the implementation phase. Mainly
based on the experience and intuition of the application specialists, a specification document in natural (i.e. English) language is created. A team of HDL coders reads it, interprets it, and then manually implements the extracted functionality on the detailed RT-Level.

By now, it is commonly agreed that today's applications are too complex for both of these approaches. Neither a fully automatic nor a completely manually performed design flow down to RT-Level leads to an optimal final SoC implementation. To solve this problem, a new design phase has been introduced, which is called System Level Design (SLD). On a higher abstraction level, the SoC designer does the creative work, while tedious and error-prone tasks are automated or reduced by reusing blocks. The key aspects of SLD are the following:

Consider a Higher Abstraction Level. In the history of electronic design, complexity has always increased, and this has always forced the designer to consider a higher abstraction level. Software development started with punch cards and moved via assembly programming to today's high level programming languages like C++. Similarly, for hardware development, the considered components initially were transistors and gates, and later RTL constructs. The step to a higher abstraction level for system design is natural. It enables higher modeling efficiency and simulation speed. However, due to the limited system synthesis automation capabilities, a higher abstraction level alone does not solve the emerging design issues.

Orthogonalization of Concerns. On the higher abstraction level, the architecture needs to be modeled according to specific formalisms. Ideally, the modeling attributes Functionality (Behavior), Timing, Data, Structure and Communication should be separated in order to keep models of different abstraction levels reusable, interchangeable and interoperable. At least the separation of Interfaces and Behavior according to the interface based design paradigm [34] is mandatory for efficient modeling and reusable models.

Effective HW and SW Modeling. SLD covers the SoC design phases between the high level application specification and the implementation of the RTL hardware and the embedded software, respectively. Thus, an SLD specification mechanism needs to support both hardware specific and software specific concepts: on the one hand, properties like concurrency and determinism; on the other hand, features like SW process handling and object oriented programming.
Ideally, an SLD flow supports not only SW implementation on an off-the-shelf processor as the one extreme and fully customized HW implementation as the other, but also the customization techniques in between (cf. Section 1.2.1). This book focuses on customized processors as a promising compromise between these two.

Increase Design Reuse. Since full design automation is not feasible, new complex architectures cannot be implemented completely from scratch. Ideally, SLD supports design reuse over multiple abstraction levels, for several building block granularities, as well as between design groups and even between different companies. Integrating fixed processor IP is already a widely applied manner of design reuse. The disciplined creation of easily reusable system modules or even complete MP-SoC platforms, however, is not yet widely adopted. The standardization of SLD languages and concepts (cf. next section) targets this issue.
2.2.2 Standardization
VSI Alliance
The Virtual Socket Interface Alliance (VSIA) [35] was formed as early as 1996 and tried to establish standards that enable the integration of IP cores (virtual components) from different sources. Its IP core interface is called the Virtual Component Interface (VCI). However, this standard has not been widely adopted for SLD.

SystemC
The emerging EDA standard language for System Level Design is the SystemC library [36]. Since 1999, the language has been promoted by the Open SystemC Initiative (OSCI), a non-profit organization backed by a broad range of companies. In December 2005, SystemC version 2.1 was approved as IEEE Standard 1666. This C++ class library enables efficient modeling of hardware by providing an event-driven simulation engine as well as support for the TLM abstraction level (cf. next section). In this section, the basic concepts of SystemC 2.1 based System Level Design are briefly introduced. A thorough treatment of this topic area is given by Groetker et al. [37].

As illustrated in Figure 2.1, SystemC 2.1 follows the Interface Method Call (IMC) principle to achieve high modularity in system level modeling. Processes (Behavior) are wrapped into modules (Structure) and access Communication services through ports. The available methods are declared in the interface specification and implemented (Behavior) by the channel. Alternatively, a module port may be directly connected to another module via an inverse port called an export.
Figure 2.1. Interface Method Call Principle (a process inside a module accesses a channel through a port; the channel implements the methods declared in the interface)
Graphically, a port is represented as a square containing two diametrical arrows, and an interface as a circle containing a U-turn arrow.
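As a minimal illustration of these concepts, the following sketch shows a SystemC 2.1 module accessing a channel through a port via an interface method call. The names my_if, my_channel and producer are invented for this example; only sc_interface, sc_channel, sc_port and the SC_MODULE/SC_CTOR/SC_THREAD macros come from the SystemC library.

#include <systemc.h>

// Interface: declares the communication services (methods) only.
class my_if : public virtual sc_interface {
public:
    virtual void write(int value) = 0;
};

// Channel: implements the behavior behind the interface methods.
class my_channel : public sc_channel, public my_if {
public:
    my_channel(sc_module_name name) : sc_channel(name), buffer(0) {}
    virtual void write(int value) { buffer = value; }
private:
    int buffer;
};

// Module: wraps the process and accesses communication through a port.
SC_MODULE(producer) {
    sc_port<my_if> out;                  // port, typed by the interface
    void run() { out->write(42); }       // interface method call via the port
    SC_CTOR(producer) { SC_THREAD(run); }
};

int sc_main(int, char*[]) {
    my_channel ch("ch");
    producer p("p");
    p.out(ch);                           // bind the port to the channel
    sc_start();
    return 0;
}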
Transaction Level Modeling (TLM)
The IMC principle as provided by SystemC is the basis for applying the Transaction Level Modeling (TLM) paradigm. TLM enables efficient modeling of SoC communication above RTL. It is presented in detail in Section 3.1. This abstraction level has already been widely applied in recent years. However, a multitude of proprietary interface classes still hinders the interoperability and reusability of TLM modules. In order to push forward standardization on this level as well, the Open SystemC Initiative released the OSCI TLM 1.0 library [38] in 2005. Basically, this proposal contains a set of primitives for sending and receiving data packets (cf. left half of Figure 2.2).² The data to be transferred, however, is a generic template parameter and is not defined in version 1.0 of the TLM standard.

The main contribution of OSCI TLM 2.0 [36], released in 2008, is a set of additional core interfaces (cf. right half of Figure 2.2): extra transport interfaces for bi-directional data transfer, a direct memory interface allowing ultra-fast platform simulation, as well as debug and analysis interfaces which serve their purpose without any side effect on the simulation. Furthermore, TLM 2.0 provides a default data type tlm_generic_payload, which can be used as a template parameter for the core interfaces. Other classes provided in this standard address specific tasks like analysis, as well as temporal decoupling of system modules for higher simulation speed.

OCP-IP
The Open Core Protocol International Partnership (OCP-IP) was established in 2002. It is a non-profit organization that provides an open and complete IP core communication protocol (OCP) [39]. In addition to the data flow, OCP can also be configured to support control and test flows.
² All code examples in Chapters 2, 3 and 4 are simplified in order to illustrate the basic concepts.
/** OSCI TLM 1.0 interfaces */

/** bi-directional (transport), blocking */
template <typename REQ, typename RSP>
class tlm_transport_if : public sc_interface {
public:
    RSP transport(const REQ& arg_Req);
    …
};

/** uni-directional (get/put), blocking */
template <typename T>
class tlm_blocking_get_if : public sc_interface {
public:
    T get( … );
    …
};

template <typename T>
class tlm_blocking_put_if : public sc_interface {
public:
    void put(T& arg_T);
    …
};

/** uni-directional (get/put), non-blocking */
template <typename T>
class tlm_nonblocking_get_if : public sc_interface {
public:
    bool nb_get(T& arg_T);
    bool nb_can_get( … );
    …
};

template <typename T>
class tlm_nonblocking_put_if : public sc_interface {
public:
    bool nb_put(T& arg_T);
    bool nb_can_put( … );
    …
};

/** peek interface */ …
/** combined interfaces */ …

/** OSCI TLM 2.0 interfaces */

/** reuse TLM 1.0 interfaces as subset */

/** define default data types */
enum tlm_phase { … };
enum tlm_sync_enum { … };

class tlm_generic_payload {
public:
    uint64 get_address();
    void   set_address(uint64 arg_Addr);
    uchar* get_data_ptr();
    void   set_data_ptr(uchar* arg_pData);
    …
};

/** bi-directional (transport), blocking */
template <typename TRANS>
class tlm_blocking_transport_if : public sc_interface {
public:
    void b_transport(TRANS& arg_Trans);
    …
};

/** bi-directional (transport), non-blocking */
template <typename TRANS, typename PHASE>
class tlm_nonblocking_transport_if : public sc_interface {
public:
    tlm_sync_enum nb_transport(TRANS& arg_Trans,
                               PHASE& arg_Phase,
                               sc_time& t);
    …
};

/** direct memory interface */ …
/** debug transaction interface */ …
/** analysis interface */ …
/** combined interfaces */ …

Figure 2.2. Code Examples: OSCI TLM Core Interfaces
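To illustrate how the TLM 1.0 core interfaces of Figure 2.2 are used from the initiator side, the following sketch shows a master issuing a read through a port bound to a tlm_transport_if channel. The payload types bus_req and bus_rsp are invented for this example, since TLM 1.0 deliberately leaves the transferred data a template parameter.

#include <systemc.h>
#include <tlm.h>

// Hypothetical request/response payloads; TLM 1.0 leaves these to the user.
struct bus_req { sc_dt::uint64 addr; bool is_read; unsigned data; };
struct bus_rsp { bool ok; unsigned data; };

SC_MODULE(master) {
    // Port typed by the bi-directional, blocking TLM 1.0 core interface.
    sc_port< tlm::tlm_transport_if<bus_req, bus_rsp> > bus_port;

    void run() {
        bus_req req;
        req.addr = 0x1000; req.is_read = true; req.data = 0;
        // One blocking call carries the request and returns the response.
        bus_rsp rsp = bus_port->transport(req);
        if (rsp.ok)
            cout << "read data: " << rsp.data << endl;
    }
    SC_CTOR(master) { SC_THREAD(run); }
};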
On RT-Level, OCP is a protocol with a small set of mandatory signals and a wide range of optional signals. Also, the width of address and data words as well as the protocol timing are configurable. An IP core is only considered OCP compliant when it is shipped with a specific RTL configuration file, which indicates the selected parameters. Additionally, a synthesis configuration file is
needed to describe the timing parameters for each OCP interface. Only if these configurations are compatible can OCP compliant cores be hooked together directly on RT-Level.

Recently, OCP-IP has also defined standards for communication modeling on higher levels of abstraction. Altogether, four Transaction Levels (TL) have been defined [40]:

RTL-Layer (TL-0). This is the traditional pin, bit and cycle accurate abstraction level, typically applying modeling languages like VHDL and Verilog.

Transfer Layer (TL-1). Compared to TL-0, the interface pins are abstracted away. Instead, every phase of an ongoing transaction can be invoked separately by IMC function calls. This abstraction level is also called Cycle Callable (CC).

Transaction Layer (TL-2). Communication occurs event-driven; such an event is generated at the end of a burst transaction, for example. On this level, a complete transaction can be invoked by a single, blocking function call.

Message Layer (TL-3). This is OCP's highest level of abstraction. TL-3 models are often untimed; data is typically process mapped, not yet address mapped. In revised SystemC packages, TL-3 is rather seen as a generic timed protocol without any OCP specifics at all [41].

The higher levels, TL-3 and TL-2, are both used as generic protocols as well, while the lower transaction levels are already so protocol specific that eventually a bridge is necessary to map the communication to any other OCP or non-OCP protocol. The sketch below contrasts the TL-1 and TL-2 calling styles.
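The difference between the transfer layer and the transaction layer becomes most apparent in the shape of the interfaces. The following sketch contrasts the two calling styles; the interface and method names are invented for illustration and are not the actual OCP-IP channel API.

#include <systemc.h>

struct req_t { sc_dt::uint64 addr; bool is_read; };
struct rsp_t { unsigned data; };

// TL-1 ("cycle callable"): every phase of a transaction is a separate
// interface method call, so the protocol timing is visible cycle by cycle.
class tl1_if : public virtual sc_interface {
public:
    virtual bool  send_request(const req_t& req) = 0;  // request phase
    virtual bool  response_ready() = 0;                // polled once per cycle
    virtual rsp_t get_response() = 0;                  // response phase
};

// TL-2: a complete (burst) transaction is one blocking call; the phases
// and handshakes are hidden inside the channel.
class tl2_if : public virtual sc_interface {
public:
    virtual void read_burst(sc_dt::uint64 addr,
                            unsigned* data, unsigned length) = 0;
};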
IP-XACT
Meanwhile, many companies provide SoC IP blocks, while others offer tools which apply simulation models of those IP blocks. Interoperability is often very limited because each company uses its own, mostly proprietary interfaces. Thus, in 2006, the SPIRIT consortium was founded to develop an XML-based specification standard called IP-XACT [42]. Compliant IP providers do not need to change their interfaces, but they can ship an IP-XACT characterization together with their IP. This enables compliant tools to adapt to the specifics of the respective blocks during IP import.
2.2.3 Design Flows

A multitude of SLD approaches bridges the gap between specification and implementation. A rough distinction can be made between bottom-up and top-down approaches. Basically, bottom-up approaches start with fixed architecture blocks, while top-down approaches start from the application and perform design space exploration in order to identify an optimal architecture for the given application.
Another way of looking at it is the size of the considered SoC building blocks: the smaller their size, the more important it becomes to assemble or synthesize these blocks automatically. Figure 2.3 depicts several approaches to provide a suitable architecture (an instance in the architecture space) for a specific application (an instance in the application space). In practice, current design flows are a mixture of these basic concepts.

Figure 2.3. Basic Design Methodologies (four flows from an application instance in the application space to an architecture (block) instance in the architecture space: pure bottom-up via IP block selection & integration; pure top-down via design space exploration; platform based design, where API usage and API implementation meet at a common platform API; and the Y-chart approach, with application partitioning & mapping into a mapping space followed by IP block selection & mapping; in each case the identified target instance lies within the possible target space of the full application/architecture space)
Pure Top-Down Design
The starting point of this approach is a specific application instance to be implemented. For that, an optimal SoC implementation is to be identified. This process is called Design Space Exploration (DSE). The more customization the applied technology allows, the larger the design space becomes and the more likely it is to identify an optimal implementation. One of the basic principles of top-down design space exploration is the abstraction pyramid (Figure 2.4) [43]. It fosters a stepwise refinement flow from a very abstract application specification down to a very detailed microarchitectural implementation, with a suitable modeling abstraction level for every step. Based on the design decisions made in the previous DSE step, the selected design point becomes the top vertex of the next pyramid covering the remaining architectural design space.
Figure 2.4. Abstraction Pyramid [43] (successive explore steps lead from back-of-the-envelope models over abstract executable models and cycle accurate models down to synthesizable VHDL; from top to bottom, the abstraction opportunities and the number of alternative realizations decrease from high to low, while the cost of modeling and evaluation increases from low to high)
This technique bears the best potential for optimal implementation, which is worthwhile especially for high volume and heavily reused SoCs. It needs a high degree of tool support to obtain results in reasonable design times.
Pure Bottom-Up Design
This approach starts with a set of existing coarse grained computation and communication IP blocks, from which the suitable ones are selected and integrated into the SoC. Since customization is only possible to a limited degree, SoC platforms can be composed quickly, but in most cases they will not be targeted optimally for a specific application instance. Especially data-flow intensive applications will most likely still need specially designed SoC blocks in order to satisfy the performance requirements.

Y-Chart Approach
The Y-Chart Approach [44] puts a mapping process between the architecture space and the application space into the main focus. The basis is a set of abstract models for architectural blocks as well as a set of possible partitionings of the given application. The goal is to find an optimal temporal and spatial mapping of the application tasks to the processing elements.

Platform Based Design
Platform based Design (PBD) [45] improves reuse by defining a common platform API.³ Starting from that, the platform designer implements the API using a suitable architecture instance, while the application designer uses the API to implement the targeted application. This is a kind of meet-in-the-middle approach. On the one hand, the separation enables mapping different applications to the same platform API. On the other hand, it also allows providing an alternative platform architecture for the same application software. In a more general sense, PBD means reusing a once designed, generally complex architectural instance for multiple applications.

³ API = Application Programming Interface.
2.3 Current Research on SoC Design Methodologies

2.3.1 Bottom-Up SoC Design

Bottom-up SoC design flows consider the design of MP-SoC platforms as the composition of selectable and parameterizable IP library elements. The two main groups are:
Component Based Design [46, 47, 48, 49]. The focus of IP reuse is on the computation modules. In some projects, IP libraries containing processor IP with preconfigured SW stacks have been created. Other projects built up a multi-processor platform that is configured rather than specifically assembled to match the given requirements.

Communication Based Design [50]. This approach, in contrast, focuses on the optimization of the emerging NoC communication. Similarly, the communication infrastructure is created by instantiating and configuring existing IP communication nodes.
ROSES (Jerraya, TIMA Laboratory)
The ROSES design framework [46] contains a library of processors and parameterizable communication modules. Automatic integration is possible by using a protocol library for putting together the SoC hardware simulator [51], as well as a software library containing building blocks for a preconfigured operating system [52]. The ROSES approach enables MP-SoC platforms to be composed quickly. The disadvantage is the time-consuming manual creation of the libraries and a narrowed design space due to the limited customization capabilities.

Eclipse (Philips Research)
The Eclipse architecture template [47] developed by Philips Research is optimized for building data-intensive media-processing SoCs. Abstract applications modeled as Kahn Process Networks (KPN) [53] can be mapped onto the Eclipse platform elements. Specific hardwired modules or coprocessors execute the behavior of the KPN modules, while HW FIFOs or memories perform the KPN FIFO queuing. The temporal task scheduling mechanism even allows an interleaved execution of multiple tasks on a single processing element. However, the architecture template is targeted at data streaming applications and thus cannot be applied effectively beyond that application domain.

NetChip (Stanford and Bologna)
The NetChip project [50] is a prominent example of communication-based design. The approach is based on a library of scalable network components called X-Pipes [54], which are automatically instantiated and configured by a tool named SUNMAP [55]. An optimal communication architecture can be detected conveniently, but only within the limited design space spanned by the relatively fixed communication modules. Also, NetChip does not consider optimization of the computation modules at all.
MPARM (Bologna)
The MPARM multi-processor platform [48] developed at the University of Bologna originally consists of a configurable number of ARM processor cores and either an AMBA or an STBus interconnect. The system level platform model is used to evaluate the design space opened by these base components. Recently, retargetable LISA processor simulators have been integrated into MPARM [56]. This integration is based on the work presented in this book.
StepNP (ST Microelectronics)
ST Microelectronics developed a multi-processor platform for network applications, which is called StepNP [49]. The SystemC based construction kit contains wrapped ISSs for ARM, PowerPC and DLX processor cores. The wrappers even allow emulating the execution of a multi-threaded processor by instantiating multiple instances of the ISS inside a SystemC module. Additionally, the original framework contains a generic network-on-chip communication channel which uses the proprietary SOCP channel interface model [49]. ST also developed an associated network application development toolchain. Its most important part is MIT's open source framework Click [57], which enables rapid development of embedded routing application software. Recently, some efforts have been made to incorporate customizable processors into the platform as well. A configurable Xtensa processor by Tensilica [58] as well as a LISA 2.0 processor model have been integrated. The latter coupling is based on the blocking interface presented in this book.
2.3.2 Top-Down SoC Design
Top-down SoC design approaches put design space exploration (DSE) into the main focus. This bears the best potential for finding an optimal SoC architecture for a given application. In order to evaluate many design alternatives in a reasonable time, an efficient design style and a high degree of automation are necessary. This approach is advantageous especially for high volume designs. In this case, the higher design costs for the optimized solution are amortized over more devices.
ARTEMIS
The Dutch universities of Delft, Leiden and Amsterdam, as well as Philips Research, cooperated on the ARTEMIS project [43], which is based on the predecessor project SPADE [59]. On system level, the target application is modeled as a Kahn Process Network (KPN) [53]. Design space exploration is performed following the Y-Chart and Abstraction Pyramid design principles. The output of the architecture-independent annotated KPN application model is either a
data-dependent simulation trace, or a more general Control-Data-Flow-Graph (CDFG). Both can be transformed and fed into the architecture simulator. The ARTEMIS branch Sesame [60] focuses on transformations like Integer-controlled Data-Flow (IDF) [61] on the traces, while the Archer branch [62] focuses on transforming the CDFG. The latter approach is fine-grained enough to allow investigating the benefit of Instruction-Level Parallelism (ILP) by executing symbolic programs [63]. However, due to the KPN formalism, the ARTEMIS project is still limited to dataflow applications. Thus, their investigations tend more toward customizing the architectures with reconfigurable logic [43] instead of using customized processors.
Metropolis
The University of California at Berkeley, Politecnico di Torino and the Cadence Berkeley Labs cooperated on the Metropolis framework [64]. The project is based on the predecessor project POLIS [65], but the input specification is no longer limited to Co-design Finite State Machines (CFSMs). Instead, the Metropolis Meta Model (MMM) also allows modeling data flow applications. The MMM is still formalized enough to enable formal analysis as well as synthesis. There is no obvious direct link to Berkeley's Mescal/Tipi project [66], or to any other approach to implement the abstract application models based on customized processors.

SpecC
SpecC [67] is a specification language and design methodology developed at U.C. Irvine. It is a C derivative with additional keywords for architecture modeling. Because of the formalized modeling style of SpecC, general hardware modules can be synthesized automatically. The SpecC synthesis flow consists of two sequential steps: architecture exploration including automatic model refinement [68], followed by communication synthesis [69]. For processor modules, system level RTOS⁴ modeling and automatic task refinement are possible [70]. Communication synthesis also includes generating implementations of upper ISO/OSI network protocol layers for more complex NoC structures [71]. However, the processor and communication modules, as well as the protocols themselves, cannot be customized easily. They are selected from IP libraries and are at most parameterized.

⁴ Real Time Operating System.
Architect's View Framework (AVF)
CoWare [72] recently commercialized the Architect's View Framework (AVF) as part of the PlatformArchitect product line. AVF is a SystemC based system level exploration workbench [1] which has been developed at the Institute for Integrated Signal Processing Systems (ISS) at the RWTH Aachen University. Both communication and computation modules are modeled on a high abstraction level. Using generic communication modules, alternative Network-on-Chip configurations can be explored quickly [73]. On the computation side, a Virtual Processing Unit (VPU) [74] allows early investigation of possible mappings of abstractly modeled tasks to computation modules.⁵ By annotating timing to the tasks and the generic communication modules, the performance to be expected on lower abstraction levels can be estimated. The successive refinement of these abstract models into possibly customized processor and communication modules is presented in this book.

⁵ The VPU is not yet part of the releases V2005.2.x and V2006.1.x.
2.4 Contribution of this Work
It is commonly agreed that raising the level of abstraction considered by the system designer is absolutely necessary to cope with the ever increasing SoC complexity. As described earlier in this chapter, many SoC specification and modeling languages exist on system level. Each of them has an associated abstraction level and a specific degree of formalism. Several of the approaches already provide a path to RTL implementation. But typically, a computation module is either implemented as fixed RTL hardware, or as software running on a non-tailored, externally licensed COTS (commercial-off-the-shelf) processor. Thus, a system module is either not flexible at all, or it is very likely implemented too inefficiently. As indicated in Section 1.2.1, customized processors are a good compromise between these two extremes. In this work, a processor design flow is linked into an SoC design flow.

Similarly, the on-chip communication is most efficient when tailored to the characteristics of the application which is meant to run on the platform. Several system level approaches already allow thorough exploration of complex Network-on-Chip (NoC) communication architectures. However, typically only the topology and the parameterization of external communication IP modules are determined this way. In a more general approach, the overall flow presented in this work also allows customizing the communication modules and protocols themselves to the application's needs. Even though mainly targeting full customization in a top-down flow, the methodology and tooling presented in this book also allow heavy IP reuse.
5 The VPU is not yet part of the releases V2005.2.x and V2006.1.x.
The approach can be seen as a generalization of bottom-up flows based on static, at most parameterizable IP. In the simplest case of the new approach, existing abstract specifications of processors and communication modules can be reused as they are. But they are also easily modifiable. An optimization in an abstract model can be propagated relatively quickly to the models and simulators on the lower levels of abstraction. The communication adaptors are generated automatically from the same database which already served for building up the involved SoC modules. Thus, consistency is ensured easily.

Figure 2.5. Abstraction Levels (processor modules: Functional Specification, Instruction Accurate ISS, Cycle Accurate ISS, RTL Processor Model; communication abstraction: Packet Level TLM, Byte/Word Level TLM, RTL; communication modules: Generic Bus/NoC Model, TLM Bus/NoC Model, RTL Bus/NoC Model)
In this work, an SoC design flow is intertwined with design flows for processor modules as well as for communication modules. Every flow has its own set of abstraction levels. In Figure 2.5, the basic abstraction levels used in this book are depicted separately for processor modules, communication modules and the communication abstraction. The next two chapters introduce existing modeling techniques for communication modules and for processors, respectively, in more detail.
Chapter 3 COMMUNICATION MODELING
This chapter presents current techniques to model communication in system level design in more detail. First, use cases and abstraction levels covered by the TLM paradigm are introduced. The second section focuses on higher abstraction levels, where communication is typically modeled in a generic, architecture independent way. In later design stages, after the decision in favor of a specific type of infrastructure has been made, more detailed communication modeling is necessary. Common commercial communication IP as well as cycle level TLM simulation APIs are presented in the third section. The fourth section introduces CoWare’s BusCompiler. It is the only known tool that allows generating complex cycle accurate TLM simulators from an abstract specification automatically.
3.1 Transaction Level Modeling
3.1.1 Use Cases
Transaction Level Modeling (TLM) has been introduced to close the abstraction gap between high level, mostly domain specific algorithm specification languages, and the very detailed Register Transfer Level (RTL). Thus, TLM covers a broad range of abstraction levels, and it can be used for several purposes. This is the use case classification proposed by the OCP-TLM working group [41, 75] (extended):
Requirement          | Functional View | Programmer's View | Architect's View | Verification View
Simulation Speed     | ++              | ++                | +                | 0
Functional Accuracy  | +               | +                 | 0/+              | ++
Timing Accuracy      | –               | 0                 | +/++             | ++
Flexibility          | 0               | –                 | ++               | 0
Functional View (FV) The main focus of this TLM use case is on the functional correctness of the algorithms. It serves as an executable specification. In contrast to many domain specific specification formalisms, using SystemC already structures the application into modules which execute concurrently.
Programmer’s View (PV) This TLM use case targets early embedded SW design. For a reasonable time-to-market, embedded designers need to start developing the SW as soon as possible and in a realistic system context. Fast simulation speed is compulsory. Thus SoC properties not absolutely needed for embedded SW (eSW) design should be abstracted away.
Architect’s View (AV) This use case focuses on early architecture exploration. Basic decisions are made like which interconnect topology is optimal, or how to map the application tasks to what kind of processing elements. In order to quickly evaluate many architectural alternatives in a given time, high flexibility of the models is compulsory.
Verification View (VV) In later design steps, a more refined TLM model is required that enables HW/SW and system verification. After design space exploration is finished and a promising eSW implementation already exists, this needs to be verified on a higher level of accuracy.
3.1.2 Abstraction Levels
For efficient design space exploration, communication is modeled on a higher abstraction level than RTL. There are several abstraction mechanisms, which are more or less orthogonal to each other. In the context of this book, the communication refinement is done in a certain order (Figure 3.1). Of course, for less complex designs, specific steps may be skipped.
Figure 3.1. Communication Refinement (the refinement steps 0, 1a, 1b, 2, 3 and 4 plotted against a timing abstraction axis, ranging from causality only/no timing over event driven to clock driven, and a data abstraction axis, ranging from packet/ADT over bit/byte/word to pin+wire, spanning the space between TLM and RTL)
Step 0 – Untimed Packet Level TLM: Functional View (FV) Use Case. On this level, a model of the target application is already partitioned into a set of coarse grained system modules, which implement the application's tasks. The modules exchange information over channels, using compound data types. Such a data packet contains associated data from the target application's point of view, e.g. an ATM cell or an IP packet. No timing is annotated to the communication yet; only the causality of the packets is already correct. On this level, the functional correctness of the partitioning is verified.

Step 1a – Timed Packet Level TLM: Architect's View (AV) Use Case. For early design space exploration, timing budgets are annotated to the communication channels as well as to the computation modules. On this level, alternative NoC communication architectures are explored by annotating specific topologies and parameterizations to generic communication modules. Also, suitable temporal and spatial mappings of the application's tasks to processing elements are examined (cf. Section 3.2.1, page 29).

Step 1b – Untimed Word/Byte Level TLM: Programmer's View (PV) Use Case. Alternatively, if it has already been decided to implement the tasks on customized processor cores, or if early embedded software development has high priority, then the designer may first refine the data accuracy. The communication is still modeled by packet exchange, but on this level the packets are already fragmented from the architecture's point of view: they are data cache lines or instruction cache lines rather than IP packets or ATM cells. In order to obtain this fragmentation, memory models, memory maps and cache models need to be added, and peripheral models are equipped with a register layer. This abstraction level already allows integrating Instruction Set Simulators (ISSs), e.g. for eSW development. Only when such an ISS is present can the traffic caused by instruction cache refills be modeled correctly.

Step 2 – Timed Word/Byte Level TLM: Refined AV/PV Use Case. Independent of the path taken (1a or 1b), the next step is refining whichever kind of information is still missing. Thus, the PV models are equipped with timing information (PV with Timing, PV-T); the accuracy of the AV use case is improved by the data abstraction refinement and by the ability to integrate ISSs, for example. If the events generated by the communication modules are not based on timing annotation, but already on accurate cycle-by-cycle simulation, then the TLM communication is even Bus Accurate (BA).

Step 3 – Cycle Callable (CC) TLM: Verification View (VV) Use Case. Instead of annotating cycle counts in order to advance simulation time, in a cycle callable model the communicating modules are all driven by a clock. TLM function calls control every single phase of a transaction. The standardized OCP TL-1 models communication on this level.

Step 4 – Synthesizable RTL. The final step is creating a fully cycle, bit and pin accurate model. Only on this level can reliable information concerning chip area, power consumption and maximum clock speed be derived.
Figure 3.2. Optimal Abstraction Level for the Use Cases (each use case FV, PV, AV and VV mapped against the abstraction levels 0, 1a/1b, 2, 3 and 4)
In this flow, the most suitable use case is associated with the respective abstraction level. Strictly speaking, there is no 1:1 association of abstraction levels to use cases. Figure 3.2 indicates that the use cases FV, PV and AV are applicable starting at a specific level of detail. More detailed abstraction levels also deliver the necessary information, of course, but they are less suitable for these use cases, mainly due to the lower simulation speed. The VV use case, in contrast, obtains the most accurate information from cycle accurate models; the more abstract the models are, the less suitable they are for this use case.
3.2 Generic Communication Modeling
On higher abstraction levels, typically no architecture specific communication models are applied yet. This section first introduces the AVF workbench, which approximates architecture specific properties by annotating timing to rather generic modules. Then it presents some simpler TLM models and APIs widely applied in early design stages.
3.2.1 Architect's View Framework (AVF)
With the TLM paradigm introduced in the previous section, system simulations can be set up on different levels of abstraction. In early phases of complex SoC designs (phase 1a in terms of Figure 3.1), basic decisions about the NoC communication architecture are made. These decisions concern (a) the basic properties of the communication modules to use, as well as (b) their optimal interconnect topology. For these examinations, no architecture specific simulators are necessary yet. Ideally, for this Architect's View (AV) use case, generic communication models are applied, which are combined into a generic parameterizable Network-on-Chip model.
Figure 3.3. NoC Channel Overview (M master modules (initiators) and N slave modules (targets) communicate exclusively through the generic NoC channel, whose behavior is determined by the selected network engine)
In the context of this book, the AVF performance simulation environment has been used [73, 1]. This NoC exploration framework is implemented on top of the SystemC library [36]. The application's functionality is captured by SystemC modules, which communicate with each other by exchanging data packets (Figure 3.3). Master modules like processor cores actively initiate transactions, while slave modules, e.g. memories, can only react passively. The entire communication between these functional modules is encapsulated by the generic NoC channel module. The actual communication infrastructure can easily be configured by selecting the respective network engine and parameterizing it accordingly (e.g., for ARM AMBA AHB or customized engines). These network engines determine the latency and throughput during simulation according to the selected network topology (point-to-point, bus, crossbar, etc.).
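To make this modeling style concrete, the following SystemC fragment sketches a minimal master module in such a setup. It is an illustration only: the noc_channel_if interface and the packet type are hypothetical stand-ins for the actual AVF classes, which are not reproduced in this book.

#include <systemc.h>
#include <vector>

// Hypothetical compound data type exchanged over the generic NoC channel.
struct packet {
  unsigned int target_id;                  // addressed slave module
  std::vector<unsigned char> payload;      // application data, e.g. an IP packet
};

// Hypothetical channel interface offered by the generic NoC channel module.
class noc_channel_if : virtual public sc_interface {
public:
  virtual void send(const packet& p) = 0;  // initiated by master modules
  virtual void receive(packet& p) = 0;     // response path back to the master
};

// A master module (initiator) actively initiates transactions; the latency
// and throughput observed here are determined by the selected network engine.
SC_MODULE(master) {
  sc_port<noc_channel_if> noc;             // bound to the generic NoC channel

  void run() {
    packet req;
    req.target_id = 1;                     // e.g. a memory slave
    req.payload.assign(64, 0);             // one 64-byte data packet
    noc->send(req);
    packet resp;
    noc->receive(resp);                    // wait for the slave's answer
  }

  SC_CTOR(master) { SC_THREAD(run); }
};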
/** PV interface */

/** define request class (PVReq) */
template <class DT, class AT> class PVReq {
public:
  void setAddress(AT address);
  AT getAddress() const;
  void setWriteDataSource(DT* writeData);
  DT* getWriteDataSource();
  DT getWriteData(unsigned int index = 0);
  PVResp<DT> obtainResp() const;
  ...
};

/** define response class (PVResp) */
template <class DT> class PVResp {
public:
  void setReadDataDestination(DT* readData);
  DT* getReadDataDestination();
  ...
};

/** PV interface definition (based on the OSCI TLM 1.0 transport interface) */
template <class DT, class AT>
class PV_if : public tlm_transport_if<PVReq<DT,AT>, PVResp<DT> > {
public:
  PVResp<DT> transport(const PVReq<DT,AT>& arg_Req);
};

/** simple_bus master interface */

/** define data types */
enum simple_bus_status { ... };
...

/** simple_bus master interface definition */
class simple_bus_master_if : public sc_interface {
public:
  /** direct/debug bus interface */
  bool direct_read(int *data, unsigned int address);
  bool direct_write(int *data, unsigned int address);
  /** blocking bus interface */
  simple_bus_status burst_read(unsigned int unique_priority, int *data,
                               unsigned int start_addr, ...);
  simple_bus_status burst_write(unsigned int unique_priority, int *data,
                                unsigned int start_addr, ...);
  /** non-blocking interface */
  void read(unsigned int unique_priority, int *data, unsigned int start_addr, ...);
  void write(unsigned int unique_priority, int *data, unsigned int start_addr, ...);
  simple_bus_status get_status(unsigned int unique_priority);
};

Figure 3.4. Code Examples: PV Interface and simple_bus Master Interface
3.2.2 Generic TLM Simulation Modules
PV Models
For the Programmer's View (PV) use case, the TLM interface is not bus or processor dependent. Basically, only a means of reading or writing a word or a burst of words must be available. The OSCI TLM 1.0 proposal [38] defines a generic transport() function to transfer data between initiator and target. However, the data types to be transferred are template parameters and not defined in version 1.0 of this standard. The CoWare PV API uses a PVReq object to transport information from initiator to target, and a PVResp object for the opposite direction (cf. left half of Figure 3.4).1 Both objects contain fields to store addresses, data pointers, and identifiers for access type and direction. The PV transport() function called by the initiator blocks until the response object is available. If timing is modeled as well (PV with Timing, PV-T), this is done either explicitly or implicitly: in the former case, every module executes SystemC wait() calls itself; in the latter case, peripherals return their access latency in a specific field of the PVResp object. PV interfaces have become very common for peripheral models; they are often connected to a generic AVF or OCP bus model via a so-called transactor.

The simple bus Library
As an early example of a TLM bus model, the simple bus model has been shipped with the official OSCI SystemC releases for several years already (cf. right half of Figure 3.4). The bus model provides a simple blocking, a non-blocking and a debug API to the master side. Even though cycle callable TLM communication is possible using the non-blocking interface, no corresponding RTL implementation is available.

OCP TL-2 and TL-3 Models
The SystemC OCP TL-2 interface has been designed mainly to model OCP-IP compliant communication on a higher level of abstraction (cf. Section 3.3.2, page 33). However, the TL-2 interface is generic enough to model any kind of SoC communication on a higher abstraction level. Thus, the example bus channel provided by OCP-IP or CoWare's simple OCP bus library is often used for SoC system simulation before any decision concerning the actual implementation is made. The TL-3 interfaces, provided since OCP package version 2.1, reside on an even higher level of abstraction and do not have any OCP specifics any more. Instead of providing template parameters for the data and addresses to be transferred, TL-3 allows free specification of the entire request and response data packets. The generic AVF communication modules (cf. Section 3.2.1) support TL-2 and TL-3 interfaces as well.

1 The PVReq and the embedded PVResp classes are very similar to the tlm_generic_payload structure recently defined in the OSCI TLM 2.0 standard.
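As a usage illustration of the PV interface from Figure 3.4, the fragment below sketches a blocking write on the initiator side. It relies only on calls shown in the figure; the free-standing helper function and the ignored response status are simplifications of this sketch.

// Minimal sketch of a blocking PV write (initiator side), using the
// PVReq/PVResp classes of Figure 3.4.
template <class DT, class AT>
void pv_write(sc_port<PV_if<DT, AT> >& pv_port, AT address, DT value) {
  PVReq<DT, AT> req;
  req.setAddress(address);
  req.setWriteDataSource(&value);    // the target fetches the data from here
  // transport() blocks until the response object is available; in the PV-T
  // case, the peripheral may return its access latency in a response field
  // instead of consuming simulation time itself.
  PVResp<DT> resp = pv_port->transport(req);
  (void)resp;                        // status evaluation omitted in this sketch
}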
3.3 Communication Customization
After the basic design decisions are made, more accurate SoC communication models are necessary. Depending on the degree of customization, these models originate from different sources. Currently, the most common approach is purchasing commercial IP blocks; this section first gives some examples of companies offering communication IP. For modeling customized communication modules instead, no standardized cycle accurate TLM API suitable for any protocol is available. The second part of this section presents some prominent exemplary APIs.
3.3.1 Communication IP Providers
Since the case study in Chapter 9 applies some generated TLM simulators for ARM AMBA communication infrastructures, this kind of communication IP is presented in the greatest detail in this section.
ARM AMBA Families
Advanced Risc Machines (ARM) [76], in addition to its main business of processor core IP, also offers SoC infrastructure IP. The most widely employed communication IP belongs to their two main AMBA families. However, the protocol between the cores and the bus is ARM proprietary.

AMBA 2.0 [77] is a family of well established bus IP. The AHB (Advanced High Performance Bus) acts as a high performance, high clock frequency, pipelined system backbone bus. AHB provides priority based arbitration of up to 16 bus masters. As a lightweight AHB without arbitration capabilities, AHB-Lite bus nodes can be used for a single master infrastructure. For less performance critical system parts, the APB (Advanced Peripheral Bus) can be selected; it is optimized for power consumption and is typically accessed from the AHB subsystem via a bridge.

A single AHB bus node allows only one master to access the bus per time slot, so it does not scale well for multiple masters. As one optimization, multi-layer matrices can be set up to provide multiple paths from the master side to the slave side. Every master is connected to an input stage, and every slave owns an output stage. A connection between the respective input and output stages is instantiated only for master-slave combinations that are actually used. The bandwidths are parameterizable.
The new AMBA 3.0 family also provides the AXI (Advanced eXtensible Interface) [78] bus, which addresses evolving SoC communication requirements. To improve throughput, AXI connections consist of up to 5 independent channels: read address channel, read data channel, write address channel, write data channel, and write response channel. This way, the different phases of read and write accesses are decoupled, enabling a better utilization of the resources. AXI supports multiple outstanding transactions at a time, as well as enhanced protection features.
Sonics SMART Interconnects
The SonicsLX and SonicsMX SMART interconnect solutions provided by Sonics [79] target high performance applications and thus already contain NoC-like features as well. Similar to the AMBA 2.0 family, an additional low power bus for interconnecting less demanding peripherals exists, namely Sonics3220. Specifically for optimizing accesses to off-chip DRAM memory, Sonics also offers the MemMax Memory Scheduler. Every connected core uses OCP or ARM AMBA protocols for communication with the bus; a so-called Agent at every bus port translates this to the Sonics proprietary protocol.
Arteris Danube
Danube is a NoC IP library provided by Arteris [80]. In contrast to the more bus- or crossbar-like architectures offered by ARM and Sonics, Danube uses a packet based protocol (NoC Transaction and Transport Protocol, NTTP). The IP library contains switches, width converters, endianness converters, long distance links containing repeaters, as well as different NoC interface units (NIUs) for the communication end points. Via the NIUs, local buses like AMBA can be connected (GALS, Globally Asynchronous Locally Synchronous communication). Alternatively, cores can be connected directly using the AHB, AXI or OCP communication protocols. The Arteris NoC solution consists of the Danube IP library together with a suite of design tools, namely NOCexplorer and NOCcompiler.
3.3.2 Protocol Specific TLM Interfaces
OCP SystemC TLM
OCP-IP has developed a SystemC TLM package [39, 41] containing the OCP TLM interface classes together with an exemplary generic bus channel. As shown in the code fragments in Figure 3.5, the OCP abstraction levels TL-1 and TL-2 both use predefined compound data types for requests and responses, respectively. These data structs are templated with the C++ data types that carry the data and address information. Both kinds of master interface provide functions to send a request packet and to receive a response packet. Since OCP TL-2 data packets carry pointers to the actual data words to be exchanged, a single request-response sequence is sufficient to transfer a whole burst of data at once. OCP TL-1, in contrast, transfers only one data word at a time; a third compound data type is therefore needed to process the data handshake phase for bursts in a fully cycle accurate manner. The TL-2 interface definition can also be used as a generic interface, implementing any protocol on top of it (cf. Section 3.2.2). TL-1, in contrast, already contains many OCP-specific implementation details, so that only OCP communication can be represented accurately.

/** OCP TL2 interface */

/** define data types */
enum OCPMCmdType {
  OCP_MCMD_IDLE,  // idle command
  OCP_MCMD_WR,    // write command
  OCP_MCMD_RD,    // read command
  OCP_MCMD_RDEX,  // exclusive read
  OCP_MCMD_RDL,   // read linked
  OCP_MCMD_WRNP,  // write non-posted
  OCP_MCMD_WRC,   // write conditional
  OCP_MCMD_BCST   // broadcast
};
enum OCPSRespType {
  OCP_SRESP_NULL, // null response
  OCP_SRESP_DVA,  // data valid/accept
  OCP_SRESP_FAIL, // request failed
  OCP_SRESP_ERR   // error response
};
...

/** define OCPTL2RequestGrp */
template <class DataType, class AddrType> class OCPTL2RequestGrp {
public:
  OCPMCmdType MCmd;
  AddrType MAddr;
  DataType *MDataPtr;
  unsigned int DataLength;
  ...
};
/** define OCPTL2ResponseGrp */
template <class DataType> class OCPTL2ResponseGrp {
public:
  OCPSRespType SResp;
  DataType *SDataPtr;
  unsigned int DataLength;
  ...
};

/** OCP TL2 master interface definition */
template <class DataType, class AddrType>
class OCP_TL2_MasterIF : virtual public sc_interface {
public:
  /** request phase */
  bool requestInProgress(void);
  bool sendOCPRequest(const OCPTL2RequestGrp<DataType, AddrType>& req);
  ...
  /** response phase */
  bool responseInProgress(void);
  bool getOCPResponse(OCPTL2ResponseGrp<DataType>& resp);
  bool acceptResponse(int cycles);
  ...
};

/** OCP TL1 interface */

/** define data types */
// same as TL2 (see above)

/** define TL1 data handshake group */
template <class DataType> class OCPDataHSGrp {
public:
  DataType MData;
  ...
};
/** define OCPRequestGrp */
template <class DataType, class AddrType> class OCPRequestGrp {
public:
  OCPMCmdType MCmd;
  AddrType MAddr;
  bool HasMData;
  DataType MData;
  ...
};
/** define OCPResponseGrp */
template <class DataType> class OCPResponseGrp {
public:
  OCPSRespType SResp;
  DataType SData;
  ...
};

/** OCP TL1 master interface definition */
template <class DataType, class AddrType>
class OCP_TL1_MasterIF : virtual public sc_interface {
public:
  /** request phase */
  bool getSBusy(void);
  bool sendOCPRequest(const OCPRequestGrp<DataType, AddrType>& req);
  bool getSCmdAccept(void);
  ...
  /** data handshake phase */
  bool getSBusyDataHS(void);
  bool startOCPDataHS(OCPDataHSGrp<DataType>& datahs);
  bool getSDataAccept(void);
  ...
  /** response phase */
  bool getOCPResponse(OCPResponseGrp<DataType>& resp);
  bool putMRespAccept();
  ...
};

Figure 3.5. Code Examples: OCP TL-2 and OCP TL-1 Interfaces
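To illustrate the TL-2 request/response style, the following fragment sketches a burst read on the master side using the calls of Figure 3.5. The polling scheme with zero-time waits is a simplification; a real master would typically be event or clock driven.

// Sketch: OCP TL-2 burst read on the master side (interface of Figure 3.5).
template <class DataType, class AddrType>
void tl2_burst_read(sc_port<OCP_TL2_MasterIF<DataType, AddrType> >& ocp,
                    AddrType addr, DataType* buffer, unsigned int words) {
  OCPTL2RequestGrp<DataType, AddrType> req;
  req.MCmd = OCP_MCMD_RD;               // read command
  req.MAddr = addr;
  req.DataLength = words;               // one request describes the whole burst
  while (!ocp->sendOCPRequest(req))     // retry until the channel accepts it
    wait(SC_ZERO_TIME);

  OCPTL2ResponseGrp<DataType> resp;
  while (!ocp->getOCPResponse(resp))    // poll for the response packet
    wait(SC_ZERO_TIME);
  for (unsigned int i = 0; i < resp.DataLength; ++i)
    buffer[i] = resp.SDataPtr[i];       // copy the returned burst to the caller
}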
ARM AMBA CLI
For modeling the AMBA 2.0 bus family, ARM released the Cycle Level Interface (CLI) specification [81] as an open standard (cf. right half of Figure 3.6). This SystemC TLM interface contains four sub-interfaces: a configuration interface, a non-blocking interface for cycle accurate communication modeling, a blocking interface for transport TLM modeling, and a direct interface for debug accesses. The Synopsys DesignWare AMBA TLM library uses this interface definition; the library is applied within Synopsys SystemStudio (cf. Section 5.3.2).

ARM RealView ESL API
Recently, ARM released the ARM RealView ESL API [82]. It consists of three parts: the Cycle Accurate Simulation Interface (CASI), the Cycle Accurate Debug Interface (CADI), and the Cycle Accurate Profiling Interface (CAPI). Recent cycle accurate ARM ESL bus and processor simulators support only this API.

CoWare's Cycle Accurate TLM Modules
In its portfolio, CoWare offers fast, cycle accurate TLM simulators for existing third party communication IP, e.g. the AMBA 2.0 and AMBA 3.0 bus families. They enable fast cycle accurate platform simulation in the PlatformArchitect [83] environment. CoWare generates these communication libraries automatically from an abstract formal specification using a tool called BusCompiler (cf. next section). The very formal generated TLM API is derived directly from the communication protocol definition. Thus, even though they also model ARM AMBA 2.0 communication, for example, the simulator interfaces do not follow the AMBA CLI specification (cf. Figure 3.6).
/** AHB interface: CoWare TLM */

/** define data types */
enum tlmTTransactionType {
  tlmIdle, tlmReadAtAddress, tlmWriteAtAddress, tlmReadWriteAtAddress, ...
};
...

/** define AHB transfer: AddrTrf */
class AHBInitiator_AddrTrf_if {
public:
  void setAddress(unsigned int v);
  void setType(tlmTTransactionType v);
  ...
};
/** define AHB transfer: WriteDataTrf */
class AHBInitiator_WriteDataTrf_if {
public:
  void setWriteData(unsigned long long v);
  ...
};
...

/** AHBInitiator interface definition */
class AHBInitiator_master_if : public sc_interface {
public:
  /** API calls for AddrTrf */
  AHBInitiator_AddrTrf_if* getAddrTrf();
  bool canSendAddrTrf();
  void sendAddrTrf();
  ...
  /** API calls for WriteDataTrf */
  AHBInitiator_WriteDataTrf_if* getWriteDataTrf();
  bool canSendWriteDataTrf();
  void sendWriteDataTrf();
  ...
};

/** AHB interface: ARM Cycle Level Interface (CLI) */

/** define data types */
enum ahb_hwrite { RD, WR };
...

/** AHB master interface definition */
class ahb_bus_if : public sc_interface {
public:
  /** configuration interface */
  int bus_addr_width();   // get address width
  int bus_data_width();   // get data width
  bool priority(int);     // set priority
  int priority();         // get priority
  ...
  /** non-blocking interface (cycle level) */
  void request();         // request bus access
  bool has_grant();       // check if granted
  void init_transaction(ahb_hwrite type, unsigned int addr,
                        unsigned int* data, ...);  // initiate transaction
  void set_data();        // initiate data phase
  ...
  /** blocking interface (transaction level) */
  bool burst_read(unsigned int addr, unsigned int* data,
                  int burstlen, ...);   // do full read access
  bool burst_write(unsigned int addr, unsigned int* data,
                   int burstlen, ...);  // do full write access
  ...
  /** direct (debug) interface */
  bool direct_read(unsigned int addr, unsigned int* data, int burstlen);
  bool direct_write(unsigned int addr, unsigned int* data, int burstlen);
  ...
};

Figure 3.6. Code Examples: CoWare-TLM API and ARM AMBA CLI
3.4 The BusCompiler Tool
CoWare's BusCompiler not only generates fast cycle accurate simulators for existing communication IP; it is also helpful for the top-down development of new communication modules and protocols. As usual, the interconnect topology can still be adapted to the application's needs. But customizing the communication modules themselves as well bears the greatest potential for designing optimal platforms.
Figure 3.7. Cycle Accurate Bus Communication Modeling (an initiator sends ReqTrf, AddrTrf, WriteDataTrf, ... transfers to an AMBA AHB bus node and receives GrantTrf, EotTrf, ReadDataTrf, ... via the AHBInitiator protocol interface; the bus node forwards AddrTrf, WriteDataTrf, ... to the target and receives EotTrf, ReadDataTrf, ... via the AHBTarget protocol)
3.4.1 Cycle Accurate Communication Modeling
The generated TLM bus nodes allow cycle accurate communication modeling by exchanging small data packets, called transfers, with their environment in every cycle of an ongoing transaction. The nodes can be connected to active initiator modules, to reactive target modules, or to additional communication nodes to form crossbars or other NoC interconnect hierarchies. On every connection, information exchange takes place according to a specific communication protocol. As shown for the AMBA AHB [77] node in Figure 3.7, the protocols for the initiator side and the target side of the AHB node at TLM can exchange a limited set of transfers in fixed directions. This transfer exchange replaces the detailed pin wiggling simulated on RTL. Every transfer carries a set of attributes, which is the information associated with this transfer. An address transfer (addrTrf), for example, carries the address and an access direction flag.

In Figure 3.8, the simulation of an exemplary AHB write is depicted; the TLM transfer sequences are shown in parallel with the RTL pin wiggling they model. The transaction consists of three phases: arbitration, address phase and data phase. The initiator initiates the transaction by sending a reqTrf transfer. The bus model performs the arbitration; as soon as bus access is granted, the initiator can send an addrTrf.2 After that, the initiator tries to send a writeDataTrf until it is accepted by the bus model. This way, the transaction is simulated fully cycle accurately by exchanging transfers between the modules at specific points in time. The generated state machine inside the bus simulator ensures that transfers can only be exchanged in valid time slots.
2 The initiator could also receive a grantTrf at this point in time, but this does not deliver any additional information.
Figure 3.8. AMBA AHB Write Example, Initiator Side (the transfer sequence reqTrf, addrTrf, writeDataTrf, eotTrf exchanged over time, aligned with the RTL signals HREQ, HGRANT, HADDR, HBURST/HWRITE/HSIZE/HPROT, HWDATA, HREADY and HRESP across the arbitration, address and data phases of the transaction)
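The transfer sequence of Figure 3.8 maps directly onto the generated TLM API of Figure 3.6. The fragment below sketches the initiator side of such a write; the reqTrf accessors are assumed by analogy with the AddrTrf/WriteDataTrf calls (the figure elides them), and the clock port is an assumption of this sketch.

// Sketch: initiator side of the AHB write of Figure 3.8, driven through the
// generated CoWare-style TLM interface of Figure 3.6. The ReqTrf call and
// the clock port are assumptions by analogy; the figure does not show them.
sc_port<AHBInitiator_master_if> bus;    // bound to the AHB bus node
sc_in<bool> clk;                        // bus clock (assumed)

void ahb_write(unsigned int addr, unsigned long long data) {
  // Arbitration phase: request the bus, then wait until sending the
  // address transfer becomes legal.
  bus->sendReqTrf();                    // assumed by analogy with sendAddrTrf()
  while (!bus->canSendAddrTrf())
    wait(clk.posedge_event());          // re-evaluate every bus cycle

  // Address phase: fill in and send the address transfer.
  AHBInitiator_AddrTrf_if* addrTrf = bus->getAddrTrf();
  addrTrf->setAddress(addr);
  addrTrf->setType(tlmWriteAtAddress);
  bus->sendAddrTrf();

  // Data phase: sending the write data becomes legal one cycle later.
  do {
    wait(clk.posedge_event());
  } while (!bus->canSendWriteDataTrf());
  AHBInitiator_WriteDataTrf_if* wrTrf = bus->getWriteDataTrf();
  wrTrf->setWriteData(data);
  bus->sendWriteDataTrf();
  // The end of the transaction is signaled by a received eotTrf (not shown).
}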
3.4.2 BusCompiler Input Specification
The abstract textual bus model is a formal description which consists of three parts (Figure 3.9). First, a generic protocol section defines all transfers of any protocol of a bus family, with the maximum possible set and width of their attributes. Second, a protocol definition section introduces all protocols, each with the subset of generic transfers and attributes it comprises. And third, a node definition section formally defines the pipelines and state machines inside every node of the bus family. The generated bus simulator allows sending and receiving the protocol transfers only in valid time slots.

/** generic AMBA protocol */

/** define data types */
enum TTransactionType, values = [ idle, readAtAddress, writeAtAddress,
                                  readWriteAtAddress, ... ];

/** define superset of AMBA transfers */
transaction genericTransaction {
  /** define address transfer (addrTrf) */
  transfer addrTrf, sender = initiator, receiver = target {
    attribute address, type = bits<32>;
    attribute type, type = TTransactionType;
    ...
  };
  /** define write data transfer (writeDataTrf) */
  transfer writeDataTrf, sender = initiator, receiver = target {
    attribute writeData, type = bits<32>;
    ...
  };
};

/** derived protocols */

/** AHBInitiator protocol */
protocol AHBInitiator {
  parameter address_width, values = [16,32];
  parameter data_width, values = [8,16,24,32];

  /** refine addrTrf for the AHBInitiator protocol */
  transfer addrTrf ... {
    attribute address, width = address_width;
    attribute type, values = [readAtAddress, writeAtAddress];
    ...
  };
  /** refine writeDataTrf for AHBInitiator */
  transfer writeDataTrf ... {
    attribute writeData, width = data_width;
  };
  ...
};

/** bus node state machine specification */

/** define bus node AHB */
node AHB, ... {
  configuration,
    initiatorProtocols = [ AHBInitiator ],
    targetProtocols = [ AHBTarget, AHBLiteTarget ] { ... };
  partition bus {
    ...
    /** variable definitions, for convenience */
    variable setupWrite, type bool,
      compute = (addrTrf.sent & addrTrf.type == writeAtAddress);
    ...
    /** define conditions when writeDataTrf can be sent (by the initiator)
        or received (by the target) */
    transfer writeDataTrf, ...
      canSend = setupWrite,
      canReceive = delay( setupWrite );
    ...
  };
  ...
};

Figure 3.9. BusCompiler Input Specification Example

The exemplary node definition fragment at the bottom of Figure 3.9 defines an AHB node which implements the protocol AHBInitiator on the initiator side. A writeDataTrf transfer can only be sent (canSend = ...) by the initiator if an addrTrf indicating a write transaction has been sent in the current cycle. The transfer is forwarded to the target side one cycle later (canReceive = delay(...)).

The formal input specification of either the AMBA 2.0 or the AMBA AXI bus family consists of roughly 4000 lines of code. Supplemental C++ code is necessary for decoders, arbiters, and other more complex internal functional units; in the case of AMBA 2.0 and AMBA AXI, this amounts to 1000 and 4000 lines of code, respectively. For every bus node, an adequate simulator class is generated by the BusCompiler tool. These nodes contain interfaces to the external initiators and slaves, as well as internal interfaces which allow connecting other communication nodes of the same family. The intermediate C++ source code generated by the BusCompiler tool has a line count in the range of 100k; thus, it is at least 10 times larger than the specification. It contains a precomputed, optimized state machine that enables fast cycle accurate simulation of the communication.
Chapter 4 PROCESSOR MODELING
This chapter gives an overview of existing techniques for modeling processors on several levels of abstraction. First, generic module simulators are introduced that compile and execute the application code directly on the simulation host. The second section presents techniques to customize processors to the application's needs; the most flexible customization is based on Architecture Description Languages (ADLs). The third section then introduces the LISA ADL, which is used within this book.
4.1 Generic Processor Modeling
4.1.1 Native Execution on the Simulation Host
In very early stages of the MP-SoC design flow, an executable functional specification of the application tasks is used, which already reflects partitioning decisions made by the designer [84]. The next step is mapping these tasks onto a set of possible computation units. For these examinations, no ISSs or even HDL models are necessary yet. The application task models are compiled on the host workstation and executed natively, which results in a high simulation performance. According to the interface based design principle [34], the application specific tasks access other modules or tasks using a generic communication interface. In compliance with an extended Y-Chart approach, general decisions like the mapping of application tasks onto architectural units can be made on this level [74]. In a dynamically configurable simulation environment, the architecture independent application tasks and the application independent architecture models are mapped onto each other. As depicted in Figure 4.1b, different mapping alternatives can be explored without modifying any of the modules.
Figure 4.1. Virtual Architecture Mapping ((a) Functional Model: Tasks 1-3 communicating over the NoC; (b) Virtual Task Mapping, temporally and spatially: the tasks mapped onto VPUs 1 and 2; (c) Final Mapping: the tasks mapped onto PEs 1 and 2. NoC = Network on Chip, VPU = Virtual Processing Unit, PE = Processing Element)
The generic processor model applied here is called Virtual Processing Unit (VPU). It also allows modeling the interleaved execution of multiple tasks on a single processing resource, as is possible on multithreaded processing elements. Multithreading is often applied to hide the high memory access latencies which may occur during the execution of a single task. The task activations and processing delays are controlled and manipulated by the VPU to simulate the tasks as if they were executed on a shared processing resource; a rough sketch of this interleaving idea follows below. This is a very suitable technique for performing the mapping of the tasks to the processing resources. For a more elaborate evaluation of the expected performance, more accurate simulators are required.
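As announced, the following is an illustration of the interleaving idea only; the actual VPU interface is not reproduced in this excerpt, and all names and delay values are hypothetical.

#include <systemc.h>

// Hypothetical sketch of VPU-style interleaving: two abstract tasks share one
// processing resource, and the VPU serializes their annotated delays.
SC_MODULE(vpu) {
  sc_time delay[2];                  // per-task timing annotation

  void run_task_segment(int id) {
    // placeholder for natively executing the next segment of task <id>
  }

  void schedule() {
    for (;;) {
      for (int id = 0; id < 2; ++id) {
        run_task_segment(id);        // execute the task behavior on the host
        wait(delay[id]);             // account for its annotated processing time
      }
    }
  }

  SC_CTOR(vpu) {
    delay[0] = sc_time(100, SC_NS);  // illustrative annotations
    delay[1] = sc_time(40, SC_NS);
    SC_THREAD(schedule);
  }
};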
4.1.2 Generic Assembly Level
Eventually, the exploration described above suggests executing a specific set of tasks on a customized processor in order to meet performance, efficiency and flexibility constraints at the same time. Furthermore, the mapping made above needs to be confirmed using more accurate application models than coarse grained annotated tasks. Both requirements can be satisfied by applying an automatic conversion into a model with a finer granularity of the memory accesses and the delay annotations. The LANCE compiler [85] converts ANSI C code into an assembly-like generic Three-Address Code (3-AC). This makes many implicit C operators visible and already allows many standard compiler optimizations like constant propagation and dead code elimination. A microprofiler tool [86] annotates the fine granular code with counters; this way, properties like the number of executed operations and the memory accesses are determined. The execution frequencies collected during simulation serve two purposes. First, they give valuable hints about worthwhile instructions to implement in a customized processor. Second, with the help of a lookup table containing possible cycle counts for any 3-AC operator, the overall task execution delay can be calculated more accurately. The data memory accesses, which appear in the three-address code as pointer dereferencing operators, can be forwarded to the TLM communication port to also account for memory access delays as accurately as possible [87]. The next steps toward implementation depend on the chosen technique for implementing the computation module.
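For illustration (the concrete LANCE output format differs in detail), a C statement like sum += a[i]; could be lowered into three-address code along the following lines, making the address computation and the memory access explicit and therefore countable:

/* Illustrative three-address lowering of: sum += a[i];
   Every statement performs at most one operation, so operator and
   memory access counters can be attached to each of them. */
t1 = i * 4;          /* scale the index to a byte offset */
t2 = a + t1;         /* explicit address computation */
t3 = *t2;            /* load: now visible as a memory access that can be
                        forwarded to the TLM communication port */
sum = sum + t3;      /* the actual addition */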
4.2 Processor Customization Techniques
In order to get reliable information concerning the performance of the application software, its execution on the target processor platform must be simulated instruction by instruction. Also, such an ISS is needed to start the implementation of the final application software as early as possible. Depending on the degree of processor customization, different modeling techniques are typically applied. These techniques are presented in this section.
4.2.1 Selectable Processor Core IP
If a Commercial-off-the-Shelf (COTS) processor core is chosen for executing the embedded software, the vendor usually also provides the tool suite to generate the application binary, as well as an ISS for debugging and profiling it. Customization thus takes place only by choosing the most suitable processor core out of the vendor's portfolio. Hence, customization is at most possible toward the respective application domain, but not toward the application itself. Configurability is limited to basic items like the size of the 1st-level caches. Besides the lower efficiency to be expected because of this, the IP vendor demands high royalties, typically on a per-device basis.
ARM
Advanced Risc Machines (ARM) [76] has existed as a company since 1990, with its basic RISC architectures going back to the mid-1980s. ARM has grown into the world's largest semiconductor IP supplier. Currently they mainly provide 32-bit processor cores for three system categories:
- Embedded (real-time) Systems, used mainly for control applications.
- Embedded Application Platforms, able to run operating systems like Linux and Windows CE.
- Secure Applications like Smart Cards and SIMs.

Even though there are up to around 10 cores available per category, it cannot be assumed that they perfectly match a specific application. Thus, ARM recently introduced application specific extensibility to some of their cores, called Optimal Data Engine Technology (OptimoDE).
MIPS
MIPS Technologies [88], formerly MIPS Computer Systems, was founded in 1985 as a vendor of microprocessor chips. In the mid-1990s, the company also started to license processor IP. Currently, MIPS offers several families of 32-bit (MIPS32) as well as 64-bit (MIPS64) architectures. Some of the MIPS cores can also be supplemented with customized instructions; this mechanism is called CorExtend.
4.2.2 (Re-)Configurable Processor Architectures
Other companies provide a basic fixed RISC processor core which can be extended with application specific instructions and functional units. The additional hardware can either be defined once and fabricated as hardwired silicon (configurable processor), or it can be defined as configuration data loaded into a dynamically configurable FPGA area (reconfigurable processor). Although the flexibility is already higher than that offered by the IP core providers outlined in Section 4.2.1, the basic processor architecture still stays the same.
Tensilica
Tensilica [89] was incorporated in 1997 and offers configurable processor cores for SoC designs [7]. Currently the main products are the Xtensa 6 and Xtensa LX processor templates; the former targets low-power, the latter high-end applications. The Xtensa 6 template, for example, is a 32-bit architecture and consists of a 5-stage pipeline with alternatively 16-bit or 24-bit instructions. User defined registers and execution data paths may have a width of up to 1024 bits. Several additional units such as memory blocks, MMU, multiply and MAC units are optional and/or configurable. The extensions are defined using the TIE (Tensilica Instruction Extension) language. Either they are defined manually, or the XPRESS compiler automatically determines promising extension candidates, taking the application C code as input. Recently, Tensilica introduced the Diamond processor family; these processor cores are preconfigured to compete directly against widely spread processor IP as provided by ARM.
Stretch
Stretch Inc. [90] was founded in 2002 and offers reconfigurable processor cores. Its main reconfigurable processor family, the S5000, is based on the S5 engine, which is basically a Tensilica Xtensa RISC processor template. Around it, the Instruction Set Extension Fabric (ISEF) provides a reconfigurable area and the infrastructure to dynamically configure application specific processor extensions. Stretch also offers a tool suite which supports the designer in performing the necessary tasks, from analyzing and compiling the target application to reconfiguring the ISEF.
4.2.3 ADLs
For very high volume SoC applications and platforms, it is advantageous to have even more degrees of freedom when designing a customized processor. On the one hand, with a much larger design space available for developing the processor architecture and its SoC communication, the target application's needs can be matched even better. On the other hand, no per-unit royalties have to be paid for third party IP; especially for very high volumes, these royalties would significantly decrease the overall profit.

For the processor IP and the processor templates presented in Sections 4.2.1 and 4.2.2, the vendor provides the tool suite for generating the executables, for generating the hardware model, and for enabling early simulation in system context. Due to the limited flexibility, these tools are either fixed or configurable. Designing a fully customized processor, in contrast, used to be a very tedious and error-prone task. Keeping all application development tools, HDL models and system models in sync manually over multiple design iterations could only be afforded by very few companies. Doing everything manually, even those houses could not actually perform a thorough design space exploration; it was thus almost impossible to really exploit the huge design space effectively. That is why, in the last decade, there have been several approaches to automate this task. These approaches are based on Architecture Description Languages (ADLs), whose abstraction level lies somewhere between a functional specification and the RT level languages Verilog and VHDL. Being designed for modeling processors, they generally provide a means of specifying instruction sets. From such a processor model, the respective tools are generated automatically. Since C compiler generation, (system) simulation and RTL hardware generation put very different requirements onto a unified ADL processor model, not all of these approaches are applicable in all these fields. Only a few of them have gained commercial importance so far.
Traditionally, the ADLs are subdivided into three groups:

- Instruction-set centric languages like ISDL [91] and nML [92]. These languages target the instruction set and instruction behavior rather than the hardware structure. Their strength is in generating software tools; especially generating a C compiler needs semantical rather than structural information about the architecture. Their weaknesses are the missing ability to perform fully accurate cycle-by-cycle simulation and to generate efficient RTL hardware.

- Architecture centric languages like MIMOLA [93]. This category, in contrast, focuses on the hardware structure. In general, the user constructs a processor model by instantiating and interconnecting basic building blocks. The strengths of these languages lie in the areas of hardware generation and cycle accurate simulation. However, the high level software tools are in general difficult to generate automatically, due to the very detailed description style.

- Instruction-set and architecture oriented languages like EXPRESSION [94] and LISA [3]. Languages of this group are also called mixed level machine description languages. They provide concepts for both domains and thus have the potential for efficient software tool and RTL hardware generation at the same time.
nML
The nML language was initially developed at the Technical University of Berlin in the early 1990s for describing instruction sets [92]. The group also developed an ISS called SIGH/SIM and a code generator (compiler backend) called CBC [95]. Several other groups adopted the nML language and created their own tools around it. IMEC [32] in Leuven, Belgium, developed a retargetable C compiler called Chess and an ISS generator called Checkers, both based on nML. These tools, together with a linker (Bridge), an assembler/disassembler (Darts), a test-program generator (Risk) and a hardware generator (Go), have been commercialized by Target Compiler Technologies [96]. Their Checkers ISS also offers system integration capabilities by implementing a well defined API in the generated C++ simulator class. However, integrating the simulator class into an SoC system environment still has to be done manually; depending on the complexity of the connected buses and the type of system simulation environment, this is most likely a non-trivial task. Furthermore, system integration seems to be possible on one abstraction level only. Fully cycle accurate simulation of bus accesses and multi-core simulation do not seem to be feasible.

Cadence and the Indian Institute of Technology also adopted the nML language. They extended nML to the Sim-nML language [97], eliminating some shortcomings, especially concerning pipeline modeling. They also developed a set of software tools [98, 99] and a hardware generator [100]. However, no publication about Sim-nML's system integration capabilities is available.
ISDL
The Instruction Set Description Language (ISDL) [91] originates from the Massachusetts Institute of Technology (MIT). The group also developed a simulator generator (GENSIM) and a hardware generator (HGEN). ISDL is an instruction-set oriented language, where costs (delay cycles, instruction word size, stall cycles) as well as timing (result latency, resource usage) are annotated to the instructions. This way, performance evaluation is relatively easy to accomplish. However, more complex or multiple pipelines, and especially external effects like bus arbitration delay, cannot be evaluated accurately. Significant system integration of the ISDL simulators does not seem to have been performed at all.
MIMOLA The MIMOLA [93] ADL has been developed at Dortmund University in the early 1990s and is architecture based. Processor hardware structures are described as a set of modules and a detailed interconnect scheme. MIMOLA supports C-compiler generation (MSSQ, RECORD), HDL hardware generation, ISS generation as well as test-generation. However, due to the high level of detail, the generated tools are not suitable for fast processor simulation and early SoC system integration.
EXPRESSION
EXPRESSION [94, 101] has been developed at U.C. Irvine and contains both instruction-set information and structural information about a processor. The former is captured by operations, the latter by components. The group developed an optimizing compiler (EXPRESS), an ISS (SIMPRESS), and recently also a hardware generation tool. EXPRESSION processor simulators are capable of performing processor-memory co-exploration [102]. However, no information about simulation in a multi-processor system context seems to be publicly available.
LISA
LISA [3] has been developed at the Institute for Integrated Signal Processing Systems (ISS) at the RWTH Aachen University. It also contains both instruction-set and structural information.

An early LISA version [103], without hardware generator and without C compiler support, was commercialized by Axys Design Automation in the late 1990s. In 2004, the company was acquired by ARM. The main products are an SoC simulation environment (MaxSim) and a processor simulator generator (MaxCore). The processor system integration is based on their non-standardized proprietary environment, and the SoC communication of the processors is limited to fixed bus IP provided by ARM itself. Since the processor design tools compete against ARM's main business, it can be assumed that they will not be promoted very intensively.

In 2002, an extended LISA version (LISA 2.0) was commercialized by LISATek, which was acquired by CoWare in 2003. The associated ProcessorDesigner tools include HDL hardware synthesis as well as C compiler generation. The system integration capabilities are based on the work presented in this book.
4.3 LISA
4.3.1 LISA Processor Design Platform
Figure 4.2. LISA Processor Design Platform [3] (from a single LISA 2.0 architecture specification, the LISATek Processor Designer generates C compiler, assembler, linker, simulator and profiler for architecture exploration and architecture implementation; the LISATek Software Designer provides the application software design tools, i.e. C compiler, assembler/linker and simulator/debugger; the LISATek System Integrator covers the integration and verification of the processor within the System on Chip)
The LISA processor design platform (LPDP) [3] is an environment that allows generating hardware and software development tools automatically from one single specification of the target architecture in the LISA language. Figure 4.2 shows the components of the LPDP environment.
Hardware Designer Platform – for Exploration and Processor Generation
Architecture design requires the designer to work in two fields: first, the development of the software part, including compiler, assembler, linker and simulator; and second, the development of the target architecture itself. The software simulator produces profiling data and thus may answer questions concerning the instruction set, the performance of an algorithm, and the required size of memories and registers. The required silicon area or power consumption can only be determined in conjunction with a synthesizable HDL model. To accommodate all these requirements, the LISA hardware designer platform automatically generates a full exploration tool suite for the respective target processor, consisting of C compiler, assembler, linker and ISS (cf. Section 6.1.2). Besides the set of software development tools, synthesizable HDL code (both VHDL and Verilog) can be generated automatically from the LISA processor description [104, 2]; this comprises both the control path and the data path. Deriving both the software tools and the hardware implementation model from one single specification of the architecture in the LISA language has obvious and significant advantages: only one model needs to be maintained, and changes to the architecture are applied automatically to the software tools and the implementation model. Thus, the consistency problem among the software tools, and between software tools and implementation model, is reduced significantly.

Software Designer Platform – for Software Application Design
To cope with the requirements of functionality and speed in the software design phase, the tools generated for this purpose are an enhanced version of the tools generated during the architecture exploration phase. The generated simulation tools are enhanced in speed by applying the compiled simulation principle where applicable, and are faster by one to two orders of magnitude than the interpretive simulators often provided by architecture vendors. The Just-In-Time Cache-Compiled (JIT-CC) simulation principle [105] automatically uses precompiled simulation tables where possible and only switches back to interpretive simulation when necessary. Besides the architecture specific assembler and linker, a fully featured C compiler is also generated automatically [106].

System Integrator Platform – for System Integration and Verification
Once the processor software simulator is available, it must be integrated and verified in the context of the entire SoC system, which can include a mixture of different processors, memories, and interconnect components.
The communication of the integrated processor simulator (a) with its SoC environment [107] and (b) with the debugger frontend and the SoC designer [108], respectively, is described in more detail in the following chapters of this book.
4.3.2 Abstraction Levels

Figure 4.3. Instruction Accurate Processor Model (the instructions LDR R[0],@1000 and ADD R[0],@1001 execute strictly sequentially in the time steps n, n+1 and n+2, each performing fetch, decode and memory read in one step)
The LISA [3] ADL allows modeling processor architectures on two main levels of abstraction: Instruction Accurate (IA) and Cycle Accurate (CA). Instruction accurate models comprise the full processor instruction set, and the generated processor simulator executes assembly instructions sequentially (Figure 4.3). The interleaved execution occurring in most real processors' pipelines is neglected on this level. Every instruction of a LISA processor model is composed of a number of (atomic) operations.1 In every simulation step, the behavior associated with each operation of an instruction is executed. Eventually, the model returns information about the number of cycles the execution of every instruction takes.

Figure 4.4. Cycle Accurate Processor Model (the same two instructions LDR R[0],@1000 and ADD R[0],@1001 proceed through the pipeline stages FE, DE, AG, RD and EX in an overlapped fashion across the cycles n to n+6)
The instruction accurate model is generally not cycle accurate, since pipeline effects are not considered at all. In contrast, cycle accurate processor models fully simulate the processor pipeline by assigning the individual operations of the instruction to the respective pipeline stages (Figure 4.4) and by implementing an operation scheduling according to the processor's execution scheme. Generally, moving from an IA to a CA model implies a restructuring of the model (cf. Section 6.4).

1 This allows describing the behavior of a regular instruction set efficiently.
4.3.3 LISA 2.0 Input Specification
/** processor resources */
RESOURCE {
  /** register resources */
  PROGRAM_COUNTER U32 pc;
  REGISTER U32 R[0..15];
  ...
  /** pin resources */
  PIN bool pin1;
  ...
  /** memories and buses */
  RAM U32 prog_rom {...};
  RAM U32 data_mem {...};
  BUS U32 common_bus {...};
  ...
  /** memory topology */
  MEMORY_MAP {
    BUS(common_bus), RANGE(0x0000,0x7FFF) -> prog_rom[...];
    BUS(common_bus), RANGE(0x8000,0xFFFF) -> data_mem[...];
    ...
  }
  ...
}

/** processor instructions */

/** instruction: store register (str) */
OPERATION str {
  DECLARE {
    INSTANCE reg32;
    INSTANCE addr16;
  }
  SYNTAX { "STR" reg32 "," addr16 }
  CODING { 0b1000 reg32 addr16 }
  BEHAVIOR {
    U32 data = reg32;
    common_bus.write(addr16, &data);
  }
}
...
/** operations referred to by str */
OPERATION reg32 { ... }
OPERATION addr16 { ... }
...

Figure 4.5. LISA Code Example
A LISA 2.0 processor specification consists of two main parts (Figure 4.5): First, the RESOURCE section, which defines all HW resources like memories, buses, registers and pins to the outside world. It also defines the interconnect between the buses and the memories in the MEMORY_MAP sub-section. The second part of the specification defines the OPERATION graph, from which the processor instructions are constructed.

As illustrated by the exemplary store instruction on the right hand side of Figure 4.5, the respective str operation contains all associated information in sub-sections: The DECLARE section introduces sub-operations which are used by the current operation. The str instruction applies the reg32 operation to determine the write data operand, and the addr16 operation to determine the target address operand. These sub-operations are also used by other instructions to evaluate their operands. The SYNTAX section defines the assembly representation of the store instruction. Terminals are directly given in quotes; non-terminals refer to the SYNTAX section of sub-operations. The CODING section determines the binary representation of the instruction in an executable. Again, terminals are directly given; non-terminals refer to inferior operations. The BEHAVIOR section contains a fragment of C code to be executed in the generated simulator when the respective instruction is encountered. The C code can access global resources defined in the RESOURCE section as well as values returned by the sub-operations. In the str example, the write is directed to a bus resource.

For simplicity, this IA example model invokes the complete instruction specific behavior of the str instruction from inside one single operation. In order to distribute the str instruction over the processor pipeline in a cycle accurate LISA model, this behavior must be split into several operations (see the sketch below). The size of a simple IA processor specification is about 1k lines of code. A more complex processor, especially when already modeled cycle accurately, quickly exceeds 10k lines of code. Behavioral features not natively contained in LISA 2.0 can be specified using additional C/C++ code.
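Schematically, such a split could assign the address computation and the actual bus write to different pipeline stages. The following fragment is only an illustrative sketch, not code from a real model: the stage names, the pipeline register access notation and the auxiliary str_write operation are hypothetical, and coding details are omitted.

OPERATION str IN pipe.AG {
  DECLARE { INSTANCE reg32; INSTANCE addr16; }
  BEHAVIOR {
    /* address phase: buffer address and write data in pipeline registers */
    PIPELINE_REGISTER(pipe, AG/RD).addr  = addr16;
    PIPELINE_REGISTER(pipe, AG/RD).wdata = reg32;
  }
  ACTIVATION { str_write }   /* schedule the data phase one stage later */
}

OPERATION str_write IN pipe.RD {
  BEHAVIOR {
    U32 data = PIPELINE_REGISTER(pipe, AG/RD).wdata;
    common_bus.write(PIPELINE_REGISTER(pipe, AG/RD).addr, &data);
  }
}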
Chapter 5 PROCESSOR SYSTEM INTEGRATION
This chapter introduces the LISA processor system integration [109]. Starting with a standalone processor simulator, the first section stepwise builds up the structure of an SoC system simulator. The main building block for the coupling is the adaptor, which forwards the LISA bus/memory accesses to any kind of external communication module. This adaptor is presented in more detail in the second section. The last section of this chapter shows the integration of generated LISA processors into some commercial SoC system simulation environments, namely CoWare PlatformArchitect and Synopsys SystemStudio.
5.1 Simulator Structure

5.1.1 Standalone Processor Simulator

Figure 5.1. Standalone LISA Processor Simulator (the processor simulator core accesses prog_rom, prog_cache, data_cache, prog_mem and data_mem through the memory map of the common_bus; all resource models implement the same API)
The LISA simulator structure is outlined in Figure 5.1. Not only the processor simulator core, but also the internal processor memory hierarchy is generated by the processor compiler according to the ADL specification. Several memory resources (RAM, ROM, cache, write buffer) as well as buses can be freely instantiated and parameterized using pre-defined or user-defined models. The resource models applicable here all implement the same generic LISA memory/bus API (Application Programming Interface) [110]. Additionally, a LISA specification contains a memory map, which defines the interconnections and bus address ranges of these modules. Thereby, the LISA behavioral description only needs to access the respective bus object, which then possesses the necessary intelligence to perform the address decoding and the communication with the selected slave module autonomously (Figure 5.1).

The generated LISA processor simulator is a C++ class, which can be instantiated and controlled by the generic LISA debugger ldb (cf. Section 8.1.1). In this standalone use case, the processor simulator allocates all memory internally and does not get any stimuli on its input pins.
Figure 5.2. LISA Processor Simulator with Bus Interface (each internal resource can alternatively be configured as a shadow resource whose accesses are redirected over the external API)

5.1.2 The LISA Bus Interface
In order to integrate the generated LISA simulator into arbitrary environments, the simulator generator also builds a bus interface for the processor simulator (Figure 5.2). Every pin resource (keyword PIN in the LISA resource section) as well as every bus resource (keyword BUS in the LISA resource section) gets a link to the outside world. In the simplest case, the pins are passive C/C++ datatypes, accessible from outside the LISA simulator class as well. The extended bus class incorporates additional functionality which allows directing requests to an additional bus/memory API. Since the implementation of this API is located outside the LISA class, basically any externally modeled memory or peripheral is accessible. Every resource connected to a bus can be (dynamically) configured either to be modeled inside the LISA model (the only option in the standalone simulator), or to have all its accesses directed to the external port class. A bus access invoked by a LISA operation first performs the address decoding as before, but then the enhanced bus module consults the identified internal module to determine whether it is modeled inside the LISA model (e.g. for simulation performance reasons), or whether it is a shadow resource. The latter means that the real resource is located outside the LISA simulator, so the access is forwarded to the associated external port instead (Figure 5.2). However, the exact nature of the bus/memory API implementation depends on the system simulation environment the LISA processor is embedded in, and especially on the bus model it is connected to.
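As a minimal C++ sketch of this dispatch logic — with simplified single-word signatures and entirely hypothetical class and member names, since the generated code itself is not shown here:

#include <cstdint>
#include <vector>
using U32 = std::uint32_t;   // word type as in the LISA API

// simplified single-word view of the generic API of Figure 5.4
struct lisa_memory_api { virtual int read(U32 addr, U32* data) = 0; };

struct decoded_target {
  U32 lo, hi;                 // address range from the MEMORY_MAP
  lisa_memory_api* internal;  // resource model inside the LISA simulator
  bool shadow;                // true: the real resource lives outside
};

struct enhanced_bus {
  std::vector<decoded_target> map;   // filled from the memory map
  lisa_memory_api* external_port;    // adaptor provided by the environment

  int read(U32 addr, U32* data) {
    for (decoded_target& t : map)
      if (addr >= t.lo && addr <= t.hi)     // address decoding as before
        return t.shadow ? external_port->read(addr, data)  // shadow resource
                        : t.internal->read(addr, data);    // modeled internally
    return -1;                              // no slave module decoded
  }
};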
Figure 5.3. SystemC Processor Simulators with External Data Memory (two LISA processor modules, each with a TLM adaptor, share a 2 MB data_mem over a SystemC TLM bus model; the remaining memories stay processor-internal)
5.1.3 SystemC Wrapper
This extended LISA simulator is prepared to be embedded in arbitrary system environments. Recently, the SystemC language and the TLM paradigm have become popular. That is why most couplings developed so far are for commercial or non-commercial SystemC environments (other, non-SystemC environments LISA has been coupled to include CoWare N2C and Mentor CVE). Figure 5.3 shows the LISA processor simulator embedded in a SystemC environment. The LISA tools generate a wrapper for the processor, so that it appears to the outside as a SystemC module with TLM ports as well as SystemC pins. Such a TLM port is controlled by a TLM adaptor. This adaptor implements the LISA bus/memory API in a way that forwards the requests from the LISA simulator to the SystemC TLM port. Once in the SystemC world, the requests are further processed by whichever module is connected to the processor port. In the case of Figure 5.3, a common data memory is shared by two LISA processor simulators.

Analogously, the SystemC wrapper also provides a mechanism to map LISA pin resources to respective SystemC ports. SystemC input ports quite efficiently declare themselves sensitive to incoming signal changes; only in case of a signal state change does the wrapper need to update the internal LISA pin resource. A straightforward output pin mapping, in contrast, would need to copy all LISA output pin states to the respective SystemC ports after every invocation of the LISA processor. Hence, pin resources are implemented using a more intelligent class than passive C/C++ datatypes. This class detects all assignments from the LISA behavioral code automatically. On a value change, a callback function is invoked, which in case of a SystemC environment updates the associated ports immediately.

Using the automatically generated SystemC processor simulator, the freely available OSCI SystemC and the enclosed simple bus class library, simple SoC platforms as depicted in Figure 5.3 can be composed easily. As described in Section 5.3, commercial SystemC variants provide additional benefits like a graphical user interface, large SoC libraries, or extended debugging and analysis capabilities.
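The output pin mechanism described above can be sketched as follows. This is an illustrative fragment with hypothetical names, not the actual generated class:

#include <functional>
#include <utility>

// a pin class that detects assignments from the LISA behavioral code;
// on a value change, a callback updates the associated SystemC port
template <typename T>
class observable_pin {
public:
  using callback = std::function<void(const T&)>;
  void on_change(callback cb) { notify = std::move(cb); }

  observable_pin& operator=(const T& v) {   // assignment from LISA code
    if (!(v == value)) {                    // propagate real changes only
      value = v;
      if (notify) notify(value);
    }
    return *this;
  }
  operator const T&() const { return value; }

private:
  T value{};
  callback notify;
};

// wrapper side (sketch): pin.on_change([&](const bool& v){ sc_pin.write(v); });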
5.2 Adaptors: Bridging Abstraction Gaps
An adaptor maps generic LISA bus/memory API calls to the bus specific API of the respective bus node simulator. In the simplest case, this is just a 1:1 mapping. In the general case, an abstraction gap needs to be bridged as well. First, this section introduces both APIs: the one used and the one provided. Then, possible mappings between them are discussed. Some of them require a bus interface state machine, which is presented in more detail in the final part of this section.

class lisa_memory_api {
  /** ideal/debug interface */
  int dbg_read (U32 addr, U32* data, int burst_size=1, int sub_block=-1);
  int dbg_write(U32 addr, U32* data, int burst_size=1, int sub_block=-1);

  /** functional/blocking interface */
  int read (U32 addr, U32* data, int burst_size=1, int sub_block=-1);
  int write(U32 addr, U32* data, int burst_size=1, int sub_block=-1);

  /** cycle accurate interface */
  int request_read (U32 addr, U32* data, int burst_size=1, int sub_block=-1);
  int request_write(U32 addr, U32* data, int burst_size=1, int sub_block=-1);
  int try_read   (U32 addr, U32* data, int burst_size=1, int sub_block=-1);
  int could_write(U32 addr, U32* data=0, int burst_size=1, int sub_block=-1);

  /** command extension */
  int command(U32 opcode, U32 arg1=0, U32 arg2=0, U32 arg3=0, U32 arg4=0);
};

Figure 5.4. LISA Bus/Memory API Definition

5.2.1 LISA Bus/Memory API
The generic LISA bus/memory API is capable of modeling communication on several levels of abstraction. In total, it offers three sub-APIs: an ideal, a functional and a cycle accurate communication interface (Figure 5.4). Whenever a LISA processor needs to access a memory in an IA simulator, the ideal or the functional memory interface is applied. Both interfaces basically offer two access methods to the processor core: one for reading and one for writing. To enable easy processor modeling on this level, these interface calls are expected to return successfully in any case. In a standalone processor, the requested access is performed immediately. A functional memory implementation can additionally provide accumulated latency information.

In a CA processor simulator, the LISA bus and memory accesses can be refined to use the cycle accurate bus/memory interface. Since resource accesses generally take more than one cycle, at least portions of this latency should be hidden in the processor's pipeline. The cycle accurate memory interface therefore offers the possibility to invoke and finalize a memory or bus access from within different pipeline stages. As long as a read data word is not yet available in the respective pipeline stage, for example, the processor simulator dynamically reacts by inserting stall cycles (cf. Figure 6.6, page 77).
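As a sketch of this two-phase usage — assuming, purely for illustration, that try_read() returns nonzero once the data word is available and that pipe.stall() requests a stall cycle; neither convention is defined by Figure 5.4:

#include <cstdint>
using U32 = std::uint32_t;

// reduced view of the cycle accurate sub-API of Figure 5.4
struct lisa_memory_api {
  virtual int request_read(U32 addr, U32* data, int burst_size = 1, int sub_block = -1) = 0;
  virtual int try_read    (U32 addr, U32* data, int burst_size = 1, int sub_block = -1) = 0;
};
struct pipeline { void stall(); };   // hypothetical stall request

// address generation stage: invoke the access as early as possible
void ag_stage(lisa_memory_api& bus, U32 addr) {
  bus.request_read(addr, nullptr);
}

// read stage: finalize the access; if the word is not there yet, stall
void rd_stage(lisa_memory_api& bus, U32 addr, U32& dest, pipeline& pipe) {
  U32 data;
  if (bus.try_read(addr, &data))
    dest = data;          // latency completely hidden in the pipeline
  else
    pipe.stall();         // insert a stall cycle and retry next cycle
}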
Additional parameters in the API functions (the defaulted arguments in Figure 5.4) allow burst reads (burst_size > 1) as well as sub-block accesses (sub_block ≠ -1). For more specific bus features and a finer timing granularity (Figure 5.5), the API provides a generic command() function. It is invoked with an opcode indicating the kind of extension and can carry up to four integer arguments.
Figure 5.5. LISA Bus/Memory API Capabilities

Basic bus/memory interface — supported bus features: basic word access; burst access; subblock (byte) access. Timing accuracy: ideal/debug interface; functional interface; 2-phase cycle accurate interface.

command() extension — supported bus features: extended burst features (custom address increment, incremental/wrap, precise/imprecise); split/retry transactions; cancellation; locking (i.e. atomic read-modify-write); arbitration (i.e. dynamic priority). Timing accuracy: fully cycle accurate interface (any transaction phase can be invoked separately).
To increase the timing granularity of a LISA model, a set of command() function calls with specific opcodes precedes the actual request_read() or request_write() calls. This way, the arbitration phase, address phase and data phase of a transaction can be invoked separately, possibly delivering addresses or write data for an already running transaction. Alternatively, the command() function can be used with opcodes delivering information such as the burst address increment to use, whether an access should lock the bus, or a dynamic arbitration priority. This information only takes effect if the bus model implements it; otherwise it is ignored.

The generic LISA memory interface enables modeling bus and memory specific properties with a fixed set of functions. A LISA processor core is tailored for a specific bus or memory by distributing the function calls to the cycle accurate memory interface suitably over the pipeline stages and by issuing commands for special communication features like wrapped bursts or read-modify-write. Such a specialized LISA simulator core can still be used with other buses, but it will probably no longer be optimal for those. Some very specific communication features are not supported by the generic LISA memory interface. As soon as the interface is bypassed or extended because of this, the respective simulator core is no longer retargetable to other buses.
5.2.2 TLM Communication Module API
A TLM communication module generally interacts with three kinds of modules: active initiators, reactive targets, or additional communication nodes to construct crossbars or other NoC interconnect hierarchies. Processor cores are always connected as initiators, so they only need to support the initiator protocol. Hence, in the following, only the initiator TLM API is considered. The initiator TLM API of communication modules depends heavily on the model provider, but also on the specific bus or NoC family. Currently, a multitude of APIs exist (cf. Sections 3.2.2 and 3.3.2). However, the functions which perform data transfer typically fall into these categories:

Debug Interface. This interface offers methods for reading and writing data to memories without any side effects for the system simulation. A typical use case is a processor debugger which needs to display memory data without actually invoking transactions on the bus.

Transport TLM Interface. In terms of Figure 3.1 (page 27), this interface type covers the causal and event driven timing abstraction levels. Often, the interface already contains blocking convenience functions for data read and write, respectively. Such a blocking interface function call advances system simulation time as far as necessary to succeed with the function call. During this time, the respective initiator module simulator is suspended. In terms of a SystemC system simulation, the IMC implementation executes wait() until the memory access is actually done. This includes time for bus arbitration, memory access request and memory device reply; in case of cache misses, also the full latency of the next-layer memory accesses. Sometimes these convenience functions are not available explicitly. Nevertheless, on transport TLM level, they can easily be constructed from three parts, using non-blocking TLM API function calls:

– Send the request.
– Suspend the thread until the transaction is finished. Ideally, the function does not poll for the response data, but declares itself sensitive to an event notification sent on transaction end.
– Read the response.
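In SystemC, this three-step construction might look as follows; the transaction and port types here are hypothetical placeholders rather than a real TLM API, and the function must run inside an SC_THREAD since it suspends via wait():

#include <systemc>
using U32 = unsigned int;

// hypothetical request type carrying a completion event
struct read_request {
  explicit read_request(U32 a) : addr(a) {}
  U32 addr;
  U32 data = 0;
  sc_core::sc_event* eot = nullptr;   // notified at end of transaction
  void notify_on_end_of_transaction(sc_core::sc_event& e) { eot = &e; }
};
struct initiator_port { void send(read_request&); };  // non-blocking send

U32 blocking_read(initiator_port& port, U32 addr) {
  read_request req(addr);
  sc_core::sc_event done;
  req.notify_on_end_of_transaction(done);
  port.send(req);        // 1. send the request
  sc_core::wait(done);   // 2. suspend, sensitive to the completion event
  return req.data;       // 3. read the response
}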
Since the transport API deals with the communication on transaction level, it is often not bus or protocol specific. Examples are the PV API or the OCP TL-2 and TL-3 APIs (cf. Section 3.2.2). The TL-2 API is already tailored for the OCP protocol, but can in principle also be used for any other communication protocol on this abstraction level.

Transfer TLM Interface. In terms of Figure 3.1 (page 27), this interface type covers the cycle callable abstraction level. Since the transfer interface simulates the protocol cycle-by-cycle, it is highly dependent on the respective protocol.
5.2.3 API Mapping
A TLM adaptor has to provide an implementation of the LISA bus/memory API. The implementation uses the respective communication API of the module the LISA processor is connected to. Possibly, the TLM communication module does not provide all three sub-APIs described before; thus, not all mappings shown in Figure 5.6 are always possible. However, every TLM adaptor should provide an implementation for all three LISA sub-APIs.

Figure 5.6. API Mapping: LISA → TLM
– ideal/debug interface (dbg_read/dbg_write) → TLM debug interface
– functional interface (read/write) → blocking API (transport TLM)
– cycle accurate interface (request_read/request_write, try_read/could_write) → non-blocking API (transfer TLM)
Ideal LISA Interface Mapping

The ideal LISA interface is needed for debugging purposes only and can directly be mapped to the respective debug interface of the bus model the processor is connected to. If the TLM model does not provide a dedicated debug API, the tool environment should offer a means of accessing the memories directly. Synopsys SystemStudio [111] is an example of the first, CoWare PlatformArchitect [83] of the second alternative.

Functional LISA Interface Mapping

The functional LISA interface has to make sure that the LISA behavioral model can rely on successful memory accesses even in case of memory resources external to the respective processor module. Two possible simple mappings are onto the debug TLM API and the blocking TLM API. The ideal interface performs a memory access without delays. In contrast, a blocking interface function call blocks the entire processor simulator until the transaction is completed. Thus, the first mapping causes no latency, leading to a too optimistic simulation result. In case of the second mapping, the simulation does not account for parallelism possible inside a processor simulator during a pending memory access. In general, this causes the simulation to report too pessimistic latency and throughput values.

Alternatively, the adaptor contains a bus interface state machine driven by its own clock input. The state machine has to play through the initiator protocol phases for every transaction autonomously. For that, the transfer TLM API of the connected communication module is applied.
Cycle Accurate LISA Interface Mapping

There are two options for mapping the cycle accurate LISA interface to the transfer TLM API. A simple 1:1 adaptor assumes that the LISA model already has the bus protocol completely incorporated into the processor pipeline. Since developing such a model is very tedious and error-prone, a more intelligent adaptor is useful which implements its own bus interface state machine (cf. next section).
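A condensed C++ sketch of the two simple mappings of Figure 5.6 — the bus-side method names are placeholders for whatever the TLM model provider offers, and the LISA API methods are assumed to be virtual here:

#include <cstdint>
using U32 = std::uint32_t;

struct tlm_bus_port {   // placeholder for the provider's TLM initiator port
  int debug_read   (U32 addr, U32* data, int burst);
  int blocking_read(U32 addr, U32* data, int burst);
};

struct lisa_memory_api {   // reduced view of Figure 5.4
  virtual int dbg_read(U32 addr, U32* data, int burst_size = 1, int sub_block = -1) = 0;
  virtual int read    (U32 addr, U32* data, int burst_size = 1, int sub_block = -1) = 0;
};

class tlm_adaptor : public lisa_memory_api {
  tlm_bus_port* bus;
public:
  // ideal interface -> TLM debug interface: no timing, no side effects
  int dbg_read(U32 addr, U32* data, int burst_size, int sub_block) override {
    return bus->debug_read(addr, data, burst_size);
  }
  // functional interface -> blocking transport: suspends the whole core
  int read(U32 addr, U32* data, int burst_size, int sub_block) override {
    return bus->blocking_read(addr, data, burst_size);
  }
  // the cycle accurate sub-API is served by the bus interface state
  // machine of Section 5.2.4 rather than by a direct 1:1 call mapping
};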
5.2.4 Bus Interface State Machine
The task of the adaptor is to couple a processor model with a bus model, independent of whether an IA or a CA processor simulator is involved (or, to be more precise, whether the functional or the cycle accurate memory API is applied). On the fully cycle accurate level, in case both sides are driven by the same system clock, a simple adaptor maps the API calls from the processor pipeline one-to-one to the bus pipeline. As shown in Figure 5.7 for an exemplary data memory store instruction, an optimally tailored processor pipeline can hide multiple cycles of bus latency. In the decode (DE) stage, the processor already requests access to the data memory, without knowing the address or even the write data yet. In the next cycle, the address generation (AG) stage calculates the write address. After the bus access has been granted, the adaptor immediately fills in the address attribute in the addrTrf transfer and sends it off. The write data is read from a processor register one cycle later in the read (RD) stage, which is exactly the time it is expected by the bus pipeline.

Fully integrating the bus interface protocol into the processor pipeline is a very tedious and error-prone task and prevents the designer from early system simulation on this level. Thus, an adaptor that implements its own bus interface state machine is necessary.
Figure 5.7. 1:1 Adaptor Example: STR @0x1000, R[0] (the processor pipeline stages FE/DE/AG/RD/EX drive the AMBA AHB pipeline phases req/addr/data/resp one-to-one through the adaptor)

Figure 5.8. Bus Interface State Machine Example: STR @0x1000, R[0] (an IA processor model is suspended while the state machine autonomously plays through the reqTrf, addrTrf, writeDataTrf and eotTrf phases)
Thereby, cycle accurate API calls that are not yet optimally distributed over the pipeline, or even abstract calls to the functional API, still lead to correctly working communication. Also, independent clocks for the processor core and the bus node call for a more intelligent adaptor.
The bus interface state machine automatically buffers data arriving too early from one side, and vice versa, stalls a pipeline in case a data item arrives too late. As depicted in Figure 5.8, an IA processor model applying the functional API is blocked until the bus protocol has completely finished the requested transaction. Either this bus interface state machine is synthesized to an RTL module later on, or, more efficiently, the timing statistics of the TLM adaptor guide the designer while fully integrating the bus protocol manually into the processor pipeline (cf. Section 8.3).
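A heavily reduced C++ sketch of such a state machine for a single write transaction — the states and the bus_port interface only loosely follow the AMBA AHB example of Figure 5.8 and are not taken from the generated adaptors:

#include <cstdint>
using U32 = std::uint32_t;

struct bus_port {          // hypothetical transfer-level bus interface
  bool granted();
  void send_addr(U32 a);
  void send_wdata(U32 d);
  bool done();
};

enum class phase { idle, request, address, data, response };

struct bus_if_state_machine {
  phase state = phase::idle;
  bool  have_addr = false, have_wdata = false;  // buffers for early data
  U32   addr = 0, wdata = 0;

  // called from the processor side, possibly several cycles early
  void start_write()    { state = phase::request; }
  void put_addr(U32 a)  { addr = a;  have_addr  = true; }
  void put_wdata(U32 d) { wdata = d; have_wdata = true; }

  // driven by the adaptor's own clock input
  void clock_tick(bus_port& bus) {
    switch (state) {
    case phase::idle:
      break;
    case phase::request:
      if (bus.granted()) state = phase::address;
      break;
    case phase::address:
      if (!have_addr) return;          // address arrived too late: wait
      bus.send_addr(addr);
      state = phase::data;
      break;
    case phase::data:
      if (!have_wdata) return;         // write data not delivered yet
      bus.send_wdata(wdata);
      state = phase::response;
      break;
    case phase::response:
      if (bus.done()) state = phase::idle;
      break;
    }
  }
};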
5.3 Commercial SoC Simulation Environments
This section shows the integration of LISA processors into commercially available SystemC simulation environments. CoWare’s PlatformArchitect has its strengths in its large library of TLM components. Synopsys SystemStudio, in contrast, offers a smoother integration into the stream driven algorithm simulation domain.
5.3.1 CoWare PlatformArchitect System Simulator
The CoWare PlatformArchitect product family [83] is a SystemC environment with a large TLM IP model library. The extended LISA tools enable the SoC designer to integrate any LISA model into this framework [109]. In Figure 5.9, the graphical platform creator frontend is depicted. The example platform contains two LISA processors with two memory ports each, connected to a set of memory blocks via an AMBA AXI crossbar. The TLM adaptors are located in each of the four LISA AXI ports. During simulation, they translate the LISA bus requests into the respective AXI TLM requests and forward them through the crossbar. Additionally, the generated SystemC wrappers support the CoWare analysis and debugging capabilities, like memory access statistics for the processor core and bus interface timing analysis for the adaptors (cf. Chapter 8).

The AXI crossbar is composed using IP modules from the CoWare model library. For CoWare's customers, AXI is one of the IP libraries shipped with the tools. Internally, CoWare creates these fast communication modules using the BusCompiler (cf. Section 3.4). In the future, the BusCompiler could also be used to generate arbitrary communication modules. The tooling presented in this book is capable of creating a coupling between arbitrary LISA models and these communication modules automatically (cf. Chapter 7). In general, memory modules as well as helpers like clock and reset generators are relatively simple on TLM level and thus can be created manually by the SoC designer. However, the most common modules of this kind are already available in CoWare's generic SystemC Modeling Library (scml).
Figure 5.9. LISA–PlatformArchitect Integration

5.3.2 Synopsys SystemStudio SoC Simulator
SystemStudio offered by Synopsys [111] is a system level design tool that supports both stream driven algorithmic simulation, as inherited from COSSAP [112], and SystemC based event-driven simulation. In Figure 5.10, a simple QPSK (Quadrature Phase Shift Keying) receiver is depicted, which implements the FIR (Finite Impulse Response) filtering on a LISA processor. All other receiver functionality is simulated more abstractly in SystemStudio's algorithmic domain.

Figure 5.10. LISA–SystemStudio Integration

In order to co-simulate the different Models of Computation (MoC), a fifo2bus module converts the complex-valued data stream coming from the channel model to integer values, which are then written into a TLM memory module over a simple bus instance (cf. Section 3.2.2). After a new set of values has been written into the memory, the LISA processor gets a signal to apply FIR filtering on them. Then, the LISA processor informs the bus2fifo module that a new set of output samples is ready to be converted to a complex-valued data stream again. The succeeding receiver modules SampleDown, QPSKDemodulator, GrayDecoder and SymbolToBits perform the further receiver processing. Finally, additional system blocks compare the received bits to the reference and determine the bit error rate.
Chapter 6 SUCCESSIVE TOP-DOWN REFINEMENT FLOW
This chapter presents an iterative system level processor/communication co-exploration methodology [107, 113, 114], as depicted in Figure 6.1. In the most flexible case, LISA is used for the system level description of the processor architectures, while AVF or BusCompiler deliver SystemC based TLM communication models. On a specific abstraction level, an arbitrary number of iterations on either side is possible, until debugging is finished and profiling has provided enough information for solid design decisions towards the next abstraction level.
Figure 6.1. Co-Exploration Methodology (LISA side: application + architecture, code generation, ISS and processor profiling; SystemC side: AVF + BusCompiler, TLM module generation, platform creation, bus/NoC model + configuration and communication profiling; both sides integrated in PlatformArchitect for retargetable MP-SoC integration)
Having several abstraction levels available for the SoC modules as well as for their intercommunication, this chapter describes a 7-phase flow to successively refine the processor modules and their communication (Figure 6.2). While keeping one side constant, the refinement of the other side is directly verified for correctness. This basic schematic is also applicable multi-dimensionally, where more than one LISA processor is integrated into an MP-SoC with multiple heterogeneous communication domains. This book focuses on the refinement down to fully cycle accurate TLM models. The RTL-near phases 6 and 7 are described only for the sake of completeness.
Figure 6.2. Successive Refinement Flow (phases 1–7: LISA processor models are refined from instruction accurate via cycle accurate and BCA down to RTL; SystemC communication models from AVF token level via cycle accurate TLM down to RTL; the sides are coupled through LISA2AVF, LISA2TLM and RTL2TLM converters, and finally via RTL pins)
6.1 Phase 1: Standalone
In the flow, phase 1 means standalone exploration of the respective modules on both sides (Figure 6.2). On the SystemC side, the NoC exploration framework introduced in Section 3.2.1 is applied standalone. The LISA side of the flow starts with an instruction accurate processor module as outlined in Section 5.1.1 [110].
6.1.1 SoC Communication
AVF Design Space Exploration

For an early and thorough exploration of the full design space, the AVF workbench models the data exchange on the abstraction level of packet transfers (cf. Section 3.2.1). This temporal and data abstraction is one of the key techniques to enable the high simulation performance.
An efficient design space exploration also needs the simulation environment to be flexible. Dynamic platform configurability allows the designer to quickly iterate through several platform alternatives without repetitively recompiling the simulation. Strict orthogonalization between the communication and computation models allows setting up different communication models and topologies quickly without modifying the system modules themselves.

On the left hand side of Figure 6.3, abstract AVF simulation is illustrated with a simple example. A task module performs the computations. The latency values n and m determine the next simulation event for the simulation scheduler. Whenever external memory accesses are necessary, the task module is suspended and the NoC simulator performs the communication. In order to achieve high simulation performance, such a communication phase is preferably inserted only for transferring larger data packets at once instead of just single words at a time.

Design space exploration for the overall system on this high abstraction level delivers, among other results, an optimal number of system modules that should be implemented in software. Additionally, possible special instructions that should be offered by customized processors can be derived at this level. According to these requirements, the designer can adapt an existing LISA processor model or develop a new customized processor standalone (cf. Section 6.1.2).
Figure 6.3. Phase 1 + Phase 2 Simulation (left: phase 1.1, AVF standalone — task model computation phases alternate with NoC communication phases; right: phase 2, IA ASIP + AVF — the processor simulator is suspended during AVF communication)
Limitations

The NoC exploration framework is well suited for debugging, analyzing and profiling all on-chip communication issues [73]. However, the abstract C++ task models applied in the standalone NoC simulation are often not sufficient for high simulation accuracy, even if equipped with data cache models or refined to generic assembly level. Especially in case of complex, MMU and cache equipped processors like the MIPS32 core, the program code also needs to be transferred over the NoC if and only if a program cache miss occurs. The cache behavior highly affects the performance of a processor, and it can have a high influence on the network traffic and thus on the performance of the remaining modules. Even more, the impact of the processor architecture itself, e.g. whether it is a VLIW processor or an optimally tailored ASIP, and the impact of the RTOS (Real Time Operating System) on the software execution time can hardly be estimated using the abstract C++ system modules. This makes it nearly impossible to annotate accurate timing budgets, either manually or automatically.
6.1.2 LISA Standalone
LISA Design Space Exploration

Depending on the application tasks to be implemented on the processor, an already existing LISA processor model can be selected, or a new customized processor is developed, possibly starting from a basic template. Promising instructions and functional units to be incorporated into a new processor may also be delivered by the microprofiler tool (cf. Section 4.1.1).

This standalone processor model is refined iteratively, as shown in Figure 6.4. Taking the abstract model as input, the LISA processor designer tools generate a tool suite for that processor, in general consisting of C-compiler, assembler, linker and simulator. The target application, ideally derived from the abstract SystemC task models already used in Section 6.1.1, is processed by this tool chain. The generic debugger frontend then allows debugging and profiling the application. Depending on the results, the LISA model or the application is refined, and a new iteration with a new set of generated tools is started.

Figure 6.4. LISA Standalone Processor Design Space Exploration (generate the tool suite from the LISA 2.0 model, build the application, debug and profile the architecture, analyze, refine)

Simulation Accuracy

Such an instruction accurate processor model can be developed quickly, but simulation can only give a rough estimate of the number of cycles n_cycles consumed by a given application. With n_instr being the number of instructions executed by the application and n_lat being the latency accumulated by the memory accesses, the estimation is:

  n_cycles = n_instr · CPI_int + n_lat    (6.1)
           = n_instr · CPI_ext            (6.2)
CPI_int is the average number of cycles per instruction depending only on control and data hazards in the pipeline itself, while CPI_ext also considers the cycles additionally lost due to non-ideal memory accesses. Although n_instr can be determined almost exactly in this phase already (of course, if the control flow depends on external events or data, those have to be provided by the designer; cf. the limitations paragraph below), n_cycles is still inaccurate. First, there is no information about CPI_int; a common estimation is the best case value CPI_int = 1. Second, the latencies n_lat caused by the memory accesses do not consider parallelism at all on this abstraction level. On the one hand, parallelism within the processor may hide a portion of the latencies contributing to n_lat in this phase. On the other hand, in the real SoC system, the processor in general does not have exclusive access to its buses and memories, as assumed in standalone simulation. The competition for resource access can drastically increase n_lat.
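As a purely illustrative calculation (the numbers are invented, not measured): for an application with n_instr = 1,000,000, the best case estimation CPI_int = 1 and an accumulated memory latency of n_lat = 200,000 cycles, equation (6.1) yields n_cycles = 1,000,000 + 200,000 = 1,200,000, which corresponds to CPI_ext = 1.2 in equation (6.2).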
Limitations

Besides the limited cycle count accuracy, a standalone processor in general does not get the same stimuli it would get in a system simulation. Simple information can be incorporated into the application executable, or entered manually during simulation using the debugger frontend. The LISA processor simulators are very accurate in simulating the execution of the final embedded software on the processor, but the entire system environment cannot be considered properly in a standalone processor simulator. Neither more complex bus models, nor additional processor cores of an MP-SoC can be incorporated into a single LISA processor model. Thus, it is beneficial to integrate the processor into its SoC system context as early as possible.
6.2 Phase 2: IA ASIP ↔ AVF Communication Models
Simulator Coupling

As described in the previous section, the LISA side as well as the AVF side have specific strengths and weaknesses when applied standalone. As soon as the basic system configuration has been determined on the SystemC side, and the LISA simulators work fine standalone, it is time to couple the two sides for the first time. Thereby, LISA's high SW execution accuracy and AVF's communication modeling efficiency are combined to some degree. A newly developed generic adaptor enables integrating LISA processors into the SystemC environment already on the high packet TLM abstraction level. A precondition is that the data accuracy of the AVF platform already includes explicit memories and memory maps (see Section 3.1.2, phase 1b). In case system modules based on generic assembly have already been used in the AVF standalone phase 1, most likely no further refinement on either side is necessary.

The right hand side of Figure 6.3 illustrates this simulator coupling with a data write example. The IA processor simulator executes the instruction sequence of the target application. Whenever a LISA operation accesses an external memory, the LISA simulator is suspended for the full duration of the communication. The AVF communication itself is performed as before.
Simulation Efficiency

Of course, incorporating a clock driven LISA module into a generally event driven SoC simulation slows down AVF simulation speed. The IA processor simulator typically runs in a SystemC thread. This implies that normally, on every active clock edge, a thread switch is necessary on the simulation host. However, during external memory accesses, the thread is not clocked any more.
Instead, it is completely suspended ('sleeping') until the full access is finished. Thus, cached data transfers are still simulated efficiently. A data packet then has the granularity of a cache line. Furthermore, since the LISA simulators apply the fast just-in-time cache-compiled (JIT-CC) simulation paradigm [105], the overhead for processor instruction decoding is minimized.
Simulation Accuracy

Without modifying the processor model at all, redirecting the accesses to the outside already gives a valuable estimation of the performance to be expected in system context. Referring to equation (6.1), the accuracy of the n_lat term has significantly improved, since on-chip communication, shared resources and multi-processor issues are now basically considered. The simulation delivers rather an upper bound of the actual latency, since the entire processor simulator is suspended during every memory access instead of allowing parallelism within the processor.
Limitations
The accuracy of n_lat in this phase still highly depends on the accuracy of the abstract AVF framework. Due to the packet oriented communication modeling, simulation events are only generated for the beginning and the end of transactions, completely based on annotations or parameterizations. For higher accuracy, the communication needs to be modeled cycle-by-cycle in a refined communication model. Thus, after exploration and identification of an optimal communication infrastructure, the abstract NoC models are replaced with more accurate TLM versions of the selected communication modules, leaving the LISA processor side untouched.
6.3 Phase 3: IA ASIP ↔ CA TLM Bus
Cycle-by-Cycle Communication Modeling

As one output of the AVF exploration, the decision is made whether customized communication modules are necessary. In many cases, existing communication IP like AMBA, Sonics or Arteris is able to satisfy the performance requirements of the target application. If such fixed communication IP is selected, then the IP provider typically also delivers the more accurate TLM models for the succeeding phases. However, if AVF analysis discovers that the performance of external IP is not satisfactory, or that the royalties to pay are too high, then the BusCompiler tool may be used to create customized communication IP. In that case, the AVF module parameterizations serve as requirements for the new BusCompiler model, rather than being back-annotations of existing IP. The BusCompiler tool generates a cycle accurate TLM simulator from the node and protocol specifications automatically (cf. Section 3.4).

Figure 6.5. Phase 3 + Phase 4 Simulation (left: phase 3, IA ASIP + CA TLM — the IA processor is suspended while the adaptor plays through the REQ/ADDR/DATA/RESP bus phases; right: phase 4, CA ASIP + CA TLM — the pipelined processor model is suspended during the blocking accesses)
On the left hand side of Figure 6.5, system simulation as performed in phase 3 is illustrated. As before, the IA processor simulator is suspended during external communication. The adaptor takes care that the respective communication protocol is executed cycle-by-cycle. This means that associated information obtained from the processor simulator is possibly buffered for some cycles until it is actually needed by the bus simulator.
Simulation Efficiency

Cycle-by-cycle communication modeling implies that the communication modules are clocked as well. This model of computation is slower than an abstract model which calculates events in the future and completely skips clock events irrelevant for packet level communication. However, the generated TLM communication modules contain a precomputed and highly optimized state machine. Often, especially towards the target side, communication with the attached modules can still be event based, which avoids that the modules need to poll for changes every cycle. The IA processor models depicted in Figure 6.2 are also woken up only on completion of a pending transaction. However, the adaptor connecting the initiator and the bus node on this level contains a bus interface state machine which also needs to be clocked (cf. Section 5.2.4).

The actual simulation speed loss compared to AVF heavily depends on the typical data packet size. AVF is especially efficient for large packets, as present in fully cached applications with large cache line sizes. In non-cached, already word-accurate system models, AVF is likely to be even slower than cycle accurate TLM models. This is because AVF generates a sequence of events for every packet, even if it only carries a payload of one single word.
Simulation Accuracy

Without modifying the processor side at all, the n_lat value improves due to the more accurately calculated communication timestamps. However, on this level, processor pipeline effects are not yet considered at all. In general, we still assume the optimistic case CPI_int = 1. In the following phase, the processor model is refined for more accuracy, leaving the system environment as it is.
6.4 Phase 4: CA ASIP ↔ CA TLM Bus
Processor Model Refinement

Having identified the instructions that are reasonable to implement in the processor module, the instruction accurate model is refined to a cycle accurate model. For one, this step is necessary to gain more detailed information about the processor's temporal behavior. Second, a cycle accurate processor model is a precondition for automatically generating an RTL implementation of the processor in phase 6. The effort necessary for this refinement heavily depends on the coding style applied when developing the initial IA model. It is advantageous to have the further refinement to CA already in mind during IA model creation, especially when splitting the processor instructions into operations, and when implementing the data exchange between the operations. Generally, the following steps are necessary for the refinement (a schematic sketch of the resource section additions follows at the end of this subsection):

Define a Pipeline in the Resource Section. In addition to the pipeline itself, a set of pipeline registers is defined which allows associated operations to transfer data.

Assign the Operations to Pipeline Stages. Every operation whose behavior can be activated needs to be associated to a pipeline stage (Figure 4.4).

Create Activation Chains. Instead of invoking an operation's behavior directly from within the behavior of the parent operation, the target operation now gets activated. This way, the target behavior is not executed immediately, but according to the spatial distance in the pipeline. The designer has to make sure that the operations transfer the associated data properly. This is done elegantly by using the pipeline registers instead of global variables.

Invoke Pipeline Execution and Shifting. The designer has to execute and shift the pipelines explicitly instead of just invoking the decoded instruction directly. Especially, the designer has to stall or even flush the pipeline explicitly when necessary in order to preserve the behavior of the former IA model.

Optimize the Architecture. Later on, profiling results give valuable hints for further optimization. Bypass hardware helps to reduce costly pipeline stalls. Dynamic interlock mechanisms ease assembly programming and compiler construction, since data dependencies between assembly instructions are resolved automatically by the processor hardware.

Despite the restructuring of the model due to the pipeline presence, the memory interface is not yet modified at all in this phase. The functional LISA bus/memory API accesses are assigned to a pipeline stage as a whole. In order to check whether the processor's functionality is preserved, the model is first debugged standalone (as done in phase 1). However, in order to obtain realistic stimuli and results, the refined model should be applied in the system simulation context again as soon as possible.

The right hand side of Figure 6.5 illustrates the system simulation mechanism on this level. Every blocking API call to the bus interface is assigned as a whole to one operation and thus to one pipeline stage. That is why every memory access, independent of which pipeline stage it is invoked in, can suspend execution of the entire processor simulator. The functional behavior is still correct, since every external memory access is guaranteed to return the current value. But the processor pipeline is not clocked every cycle any more. At the point in time when the processor model is suspended, some operations belonging to the current clock cycle have already been executed, while others have not. It depends on simulator internals whether other instructions in other pipeline stages have already been processed in a suspended model. Furthermore, a second bus access from a different pipeline stage again suspends the simulator for the full duration of that transaction.
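The first refinement step can be sketched as follows; the fragment is schematic (resource names invented, syntax abbreviated) and complements the operation split already sketched in Section 4.3.3:

RESOURCE {
  ...
  /** the pipeline with its stages */
  PIPELINE pipe = { FE; DE; AG; RD; EX };
  /** pipeline registers for data exchange between operations */
  PIPELINE_REGISTER IN pipe {
    U32 addr;     /* e.g. written in AG, consumed in RD */
    U32 wdata;
  };
}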
Simulation Accuracy

Referring to equation (6.1) again, from this phase on we obtain an accurate value for CPI_int. The only inaccuracy is caused by the term n_lat, because memory access concurrency is still not modeled correctly. For example, two accesses invoked from within different pipeline stages and accessing different buses will in reality be executed in parallel. However, system simulation on this abstraction level executes them sequentially.
This specific error due to missing parallelism could be reduced by modifying the simulation paradigm of the CA processor. Instead of executing all pipeline stages sequentially in a common SystemC thread, one SystemC thread could alternatively be instantiated for every pipeline stage. However, this would introduce a large amount of additional SystemC context switch overhead, and it would not even solve the problem in general. Instead, it is advantageous for the user to consider an upper and a lower bound for the cycle count n_cycles. Mapping all LISA bus and memory accesses to the functional and thus blocking TLM API delivers a reliable upper bound. Since the pipeline stalls and flushes are modeled correctly in the current phase, the source of underestimation that was present in the earlier phases is gone. By mapping the accesses to the ideal TLM interface instead, a lower bound for the cycle count n_cycles is obtained, since all additional latencies are neglected (n_lat = 0). For more reliable cycle count information, the remaining simulation errors need to be removed by moving forward to phase 5.
6.5 Phase 5: BCA ASIP ↔ CA TLM Bus
TLM Interface Refinement

To get more accurate information about the timing behavior, not only the processor itself, but also the communication to the outside has to be modeled cycle accurately (BCA = bus cycle accurate). Typical bus accesses which take longer than one cycle are no longer assigned to one single pipeline stage as a whole (cf. Figure 4.4, page 50). Instead, the function calls to the cycle accurate memory/bus API are now distributed suitably over the pipeline. The processor simulator is never suspended, since the API used is no longer blocking. When a memory access cannot be completely hidden in the pipeline, the simulator explicitly invokes stalls (Figure 6.6), which is exactly the behavior of the associated real processor hardware.
Figure 6.6. Bus Cycle Accurate Processor Model (read_req invoked in AG, try_read finalized in RD; a stall cycle is inserted when the read data is not yet available)
The TLM adaptor contains its own protocol specific bus interface state machine. Even if specific data is delivered too early or too late, communication works functionally correctly. The adaptor stores information arriving too early, and stalls the processor or bus pipeline when information arrives too late. Since these issues indicate room for optimization, the adaptor reports the automatic corrections during simulation (cf. Section 8.3). This way, the designer is assisted in optimally incorporating the bus protocol into the processor pipeline. Alternatively, the adaptor's bus interface state machine can be implemented in RTL in the remaining steps as well. Figure 6.7 shows two exemplary implementations on this level: processor pipeline A needs some timing corrections by the adaptor; version B, in contrast, is optimally tailored to the connected bus module.
Figure 6.7. Phase 5 Simulation
Simulation Accuracy

From now on, cycles that so far may have contributed multiple times to n_lat or CPI_int in equation (6.1) are properly considered in n_instr · CPI_ext (equation (6.2)). Thus, on this abstraction level, the simulation no longer has any temporal inaccuracy concerning the cycle count n_cycles.
6.6 Phase 6: RTL ASIP ↔ CA TLM Bus
Processor Implementation

Having fully reached the cycle accurate abstraction level, it is soon desirable to also switch to pin accurate models on RT-Level. For the LISA side, there already exists a methodology and tooling to refine a CA processor model for generating an HDL implementation on RT-Level automatically [2]. The methodology includes an intermediate representation [104] on which optimization techniques like resource sharing [115] are applied. The type and semantics of the processor pins that are used to perform the bus accesses on RT-Level are defined in a small, processor model independent XML file [116]. Currently, more complex bus interface state machines, as necessary for e.g. AMBA protocols, are not yet supported. In principle, the interface specification presented in Section 7.3 would also be suitable to generate the adaptor implementation on RT-Level. However, automated RTL implementation of LISA processor models and bus interfaces is far beyond the scope of this book.
Co-Simulation Adaptor

In phase 6, no changes are made yet on the bus side. In order to couple the RTL processor pins to the TLM API of the bus model, a simulation adaptor is necessary. Such an adaptor is already generated automatically by the BusCompiler tool, based on a section in the protocol specification, where the type of the pins and their semantics are also defined in a condensed manner. Transfers received from the bus model update specific pins, and vice versa, modifying pins from outside may trigger a transfer being sent to the bus model.
6.7 Phase 7: RTL ASIP ↔ RTL Bus
Communication Module Implementation

Finally, the TLM bus model also needs to be replaced by an RTL implementation. Unfortunately, this step is not automated yet. Due to the strongly formalized specification of the protocols and of the state machines inside every bus node, there should be no general limitation. However, implementing the communication nodes on RT-Level is also far beyond the scope of this book.

Full RTL Implementation

Simulating the SoC on RT-Level is roughly 100x slower than the fully cycle accurate TLM simulation available in phase 5. After synthesis, gate level simulation is another factor of 100 slower. Thus, RTL and gate level simulation is compulsory for verifying the HW/SW interfaces in detail, but it is far too slow to simulate a significant amount of system behavior.
Chapter 7 AUTOMATIC RETARGETABILITY
In order to make a successive refinement flow as described in the previous chapter feasible, not only the simulators for the processors and communication nodes, but also the simulation adaptors have to be generated automatically [117]. In the first section, this chapter describes the code generators and their relationship to each other. The following section presents the automatic generation of the processor wrappers and of the co-simulation adaptors. Finally, the bus interface definition section of the bus specification is presented, which is the basis for generating the adaptors.
7.1 MP-SoC Simulator Generation Chain
As shown on the left hand side of Figure 7.1, the LISA processor compiler tool automatically generates a processor simulator on multiple possible abstraction levels, taking an abstract textual processor specification as input. Analogously, as shown on the right hand side of Figure 7.1, the BusCompiler tool generates simulators for the communication modules. These bus node simulators apply the Transaction Level Modeling (TLM) communication paradigm, which allows efficient but still fully cycle accurate simulation (cf. Section 3.1).

Figure 7.1. Simulator Generation Flow (left: the processor compiler generates instruction accurate and cycle accurate processor simulators from the processor specification; right: the BusCompiler generates TLM bus simulators from the bus specification; in between, a PSP compiler generator derives the PSP compilers that produce the Processor Support Packages)

With P being the number of different processor cores and C being the number of different communication protocols, in a large heterogeneous Multi-Processor SoC (MP-SoC), P > C > 1 generally holds. Even if only a single family of communication protocols like OCP or AMBA 2.0 is applied in the MP-SoC, typically multiple incarnations or at least configurations exist. In order to address specific performance requirements efficiently, the protocols differ in features like word width or protocol timing. Thus, they are most likely incompatible with each other. In the context of this chapter, two protocols that vary in more than just configuration parameters are counted separately. However, since processors are always integrated as initiators into the SoC, only initiator protocols need to be considered. As an example, the AMBA 2.0 family provides AHBInitiator, AHBLiteInitiator and APBInitiator as configurable initiator protocols, thus C_AMBA2 ≤ 3.

The LISA processor compiler takes care of generating the P different processor simulators, while the BusCompiler tool generates a number of communication node classes, implementing the slave side of the C initiator protocols. In the simplest case, such a communication node is a single bus, connected to all its initiators and all its targets directly. In a more general case, multiple communication nodes are combined to form more complex communication structures like crossbars (independent of the actual complexity of the attached communication structure, this book refers to the protocol dependent coupling as a bus interface).

In order to do meaningful overall SoC exploration, the processor simulators and bus nodes have to be coupled, which at maximum leads to P · C combinations. During the design flow, the adaptors often also have to bridge a gap in model abstraction, which calls for implementing a bus interface state machine in the adaptor. Thus, manually developing and maintaining these P · C couplings is much too tedious and error-prone. This chapter describes a methodology and the tooling for automatically generating the P · C couplings, which are called Processor Support Packages (PSP). As shown in Figure 7.1, this is done in a two-step approach. A PSP compiler generator generates a set of C PSP compilers, one for each communication protocol. A PSP compiler, in turn, generates the simulator coupling of any of the P processors for a specific communication protocol (in a more general approach, a single PSP compiler supporting all communication protocols would be generated, so that not all bus ports of a processor would need to apply the same protocol; however, this feature has not been implemented so far).

In the beginning, some PSP compilers for very common bus protocols have been developed manually and shipped with the ProcessorDesigner releases. This already enabled automatic retargetability to the processor side. The main focus of this chapter is on generating a bus interface state machine, which is the main task of the PSP compiler generator. This enables automatic retargetability also to the bus side.
7.2 Structure of the Generated Simulator
The system simulator structure is outlined in Figure 7.2. This section describes the automatic generation of the individual components in more detail.
7.2.1 Creating the Communication Infrastructure
The BusCompiler tool generates simulation classes for the communication nodes, together with a definition file that provides a GUI3 tool with the necessary information on how the nodes can be parameterized and interconnected.

2 In a more general approach, a single PSP compiler that supports all communication protocols is generated. This way, not all bus ports of a processor necessarily apply the same protocol any more. However, this feature has not been implemented so far.
3 GUI = Graphical User Interface.
Using the platform creator GUI, the SoC designer opens such a generated bus library. He instantiates the contained communication modules, for example AHB and AHBLite bus nodes in case of the AMBA 2.0 bus family. Many communication properties are set by configuring the modules after instantiating them; more general communication features are defined by selecting a specific communication node.4 An exemplary crossbar constructed from AXI input and output stages is depicted in Figure 5.9 (page 64).
7.2.2 Generating SystemC Processor Models
The LISA processor simulator including the bus interface is generated by the processor compiler, as outlined in Section 5.1. As indicated in Figure 7.2, the main task of the PSP compiler is to generate the SystemC processor wrapper module by instantiating and connecting the LISA processor simulator to one LISA TLM adaptor per TLM port. Additionally, the wrapper is equipped with the necessary code to enable powerful analysis and debugging capabilities for the processor core in its system environment (cf. Chapter 8). The PSP compiler also generates a module definition file for the platform creator GUI. However, the TLM adaptor class itself is a fixed building block of every PSP compiler. Due to the generic LISA bus/memory API, the adaptor does not depend on the processor side. As outlined in Section 5.2.1, only the pipeline stages from which the generic cycle accurate function calls are invoked, as well as the kind of command() function invocations, make a processor model more or less suitable for a specific bus. The adaptor maps the generic LISA bus/memory API calls to the bus specific API of the respective bus simulator. Thus, the main challenge for the PSP compiler generator is to build up these TLM adaptors for arbitrary communication protocols.
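For illustration only, a compact sketch of what such a generated wrapper could look like in SystemC. The LisaSimulator and LisaTlmAdaptor classes are hypothetical stand-ins for the generated code, not the actual tool output:

#include <systemc.h>
#include <vector>

// Hypothetical stand-ins for the generated simulator classes:
struct LisaBusPort {};
struct LisaSimulator {
    int numBusPorts() const { return 1; }
    LisaBusPort* busPort(int) { return &port_; }
    void cycle() { /* execute one processor clock cycle */ }
    LisaBusPort port_;
};
struct LisaTlmAdaptor { explicit LisaTlmAdaptor(LisaBusPort*) {} };

// Sketch of a generated SystemC processor wrapper: it owns the LISA
// simulator core and one LISA TLM adaptor per TLM bus port.
SC_MODULE(ProcessorWrapper) {
    sc_in<bool> clk;
    LisaSimulator* core;
    std::vector<LisaTlmAdaptor*> adaptors;

    SC_CTOR(ProcessorWrapper) {
        core = new LisaSimulator();
        for (int i = 0; i < core->numBusPorts(); ++i)
            adaptors.push_back(new LisaTlmAdaptor(core->busPort(i)));
        SC_METHOD(onClock);
        sensitive << clk.pos();
        dont_initialize();
    }
    void onClock() { core->cycle(); }   // advance the core by one cycle
};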
7.2.3 Generating Adaptors
Basically, the TLM adaptor must provide an implementation for the ideal, the functional and the cycle accurate parts of the generic bus/memory API. These are the methods the generated processor simulators need to access. Figure 7.3 displays a simplified version of the respective API methods. The implementation is made in two parts. In a first API class derivation, the protocol independent methods are implemented. Since the ideal interface basically bypasses the bus, it is implemented once, using the generic access methods of the simulation environment. The functional interface itself is also implemented once for all generated bus protocols. For the bus specific parts, it
4 The nodes belonging to one family are too dissimilar to be differentiated by parameterization only, but their protocols are similar enough to inherit from one common generic protocol.
class lisa_memory_api
{
  /** ideal/debug interface */
  virtual int dbg_read(U32 addr, U32* data)=0;
  virtual int dbg_write(U32 addr, U32* data)=0;

  /** functional/blocking interface */
  virtual int read(U32 addr, U32* data)=0;
  virtual int write(U32 addr, U32* data)=0;

  /** cycle true interface */
  virtual int request_read(U32 addr, U32* data)=0;
  virtual int request_write(U32 addr, U32* data)=0;
  virtual int try_read(U32 addr, U32* data)=0;
  virtual int could_write(U32 addr)=0;
};

class adaptor_base: public lisa_memory_api
{
  /** bus independent methods */
  int dbg_read(U32 addr, U32* data);
  int dbg_write(U32 addr, U32* data);
  int read(U32 addr, U32* data);
  int write(U32 addr, U32* data);
};

class bus_adaptor: public adaptor_base
{
  /** bus dependent methods */
  int request_read(U32 addr, U32* data);
  int request_write(U32 addr, U32* data);
  int try_read(U32 addr, U32* data);
  int could_write(U32 addr);
};

Figure 7.3. Implementing the ADL API
uses the methods of the cycle accurate interface, which are declared virtual in the base class and thus accessible [118]. In the second implementation step, the bus specific functions are implemented in a further class derivation. Basically, these functions define how to feed data into the bus interface state machine and how to get the results back. This C++ class is generated automatically from the abstract bus definition. It makes sense to keep the flexibility to generate alternative C++ or SystemC APIs for accessing a bus node as well, e.g. the standardized OCP abstraction levels TL-2 and TL-1 [40] or the Programmer's View (PV) [75] APIs (cf. Sections 3.2.2 and 3.3.2). Thus, the properties of the bus specific API are defined in a special section of the bus specification: the interface specification.
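As an illustration of this first, protocol independent derivation step, the blocking read() could be layered on the cycle accurate primitives roughly as follows. This is a sketch, not the actual generated code; wait_cycle() is a hypothetical helper, while U32 and MA_SIG_WAIT are used as in the API above:

extern void wait_cycle();   // hypothetical: advance one clock cycle

// Sketch: the blocking read() of adaptor_base expressed via the pure
// virtual cycle accurate methods declared in lisa_memory_api.
int adaptor_base::read(U32 addr, U32* data)
{
    if (request_read(addr, data) != 0)   // feed the request into the state machine
        return -1;
    while (try_read(addr, data) == MA_SIG_WAIT)
        wait_cycle();                    // poll once per clock cycle
    return 0;
}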
7.3 Bus Interface Specification

7.3.1 Overview

The same formal syntax which is used to specify the bus nodes and their protocols (cf. Section 3.4.2) is also applied to specify the C/C++ interface functions of the adaptors. One interface specification defines adaptors for all initiator protocols of the entire bus family.
busInterface LISA_AMBA2,
  connect     = [AHBInitiator, AHBLiteInitiator, APBInitiator],
  parentClass = LISA_Adaptor_Base
{
  stateMachine, ...
  {
    ...
    sequence seq_single_read, ...
    catch burst_continue, ...
  };
  ...
  function request_write, ...
  {
    ...
  };
};

This example interface LISA_AMBA2 generates adaptors for the protocols AHBInitiator, AHBLiteInitiator and APBInitiator of the AMBA 2.0 [77] bus family. Basically, it consists of two parts. The first is a characterization of the state machine; here, the state sequences to traverse as well as extra code to execute in addition to the default behavior of a state can be defined. The second part specifies the API functions that the processor simulator invokes in order to feed data into the state machine or to get information back.
7.3.2 Feeding Data into the State Machine
The following fragment defines the signature of the request_write() method and specifies how it feeds a write burst into the state machine.

function request_write, returnType = int
{
  parameter addr, type = U32;
  parameter data, type = U32*, value = 0;
  parameter n,    type = int,  value = 1;

  section feed_data
  {
    buffer, type = allocateNew, size = n,
      behaviorFailed = [ return = -1; ];
    ...
After the function name, the return type and the function parameters have been declared, the buffer on which the implementation works is defined. This function allocates a new buffer with one entry per burst item.
    behavior = [
      buffer[current].reqTrf.reqMode         = reqUntilUnreq;
      buffer[current].addrTrf.type           = writeAtAddress;
      buffer[current].addrTrf.address        = addr + 4*current;
      buffer[current].writeDataTrf.writeData = data[current];
      buffer[current].sequence =
        if     (n == 1)           then seq_single_write
        elseif (current == 0)     then seq_first_write
        elseif (current == n-1)   then seq_last_write
        else                           seq_burst_write;
      return = 0;
    ]
  };
};
In the corresponding behavior section, the buffer is filled for every burst item. This function already provides attribute values for the AMBA 2.0 transfers reqTrf, addrTrf and writeDataTrf, which are stored in the buffer. If the API does not specify properties, fixed default values can be set here. Additionally, a state sequence is selected which has to be played through by the state machine in order to process the request.
7.3.3 Characterizing the State Machine
The state sequences are defined in the central section of the interface definition.
stateMachine, extraStates = [ burstContState, finishState ]
{
  sequence seq_single_write,
    value = [ reqTrf, addrTrf, writeDataTrf, eotTrf, unreqTrf, finishState ];
  sequence seq_first_write,
    value = [ reqTrf, addrTrf, writeDataTrf, finishState ];
  sequence seq_burst_write,
    value = [ burstContState ; addrTrf, writeDataTrf, finishState ];
  sequence seq_last_write,
    value = [ burstContState ; addrTrf, writeDataTrf, eotTrf, unreqTrf, finishState ];
};

The first state sequence is to be used for a single word write access. The other three sequences are for the first burst item, the middle burst items, and the last item of a write burst, respectively. Every bus transfer has an associated state. When a state is entered during simulation, the respective transfer is sent to or received from the bus, the attribute values are taken from or written to the buffer, and the next state according to the sequence definition is entered. If a transfer could not yet be exchanged with the bus, the generated adaptor will try again in the next cycle. On the right hand side of Figure 5.7 (page 62), a simplified write sequence is depicted. Here, the state sequence to traverse is reqTrf, addrTrf, writeDataTrf, eotTrf. Independent of when exactly the processor model delivers the write address or the write data, the respective transfers will be exchanged in the correct time slots. Information arriving too early is buffered in the adaptor; information arriving too late causes a stall of the bus pipeline. Additional states can be defined that do not have an associated transfer and thus do not forward automatically. In the example, burstContState is the state in which burst items remain while waiting for their predecessor to finish the first access phase. In order to define the behavior of these additional states, or to alter the default behavior of the transfer states, catch statements can be inserted into the state machine.

catch burst_continue,
  state     = addrTrf,
  condition = (buffer[next].state == burstContState),
  behavior  = [ buffer[next].state = addrTrf; ];

Since the state sequences for all items of a burst except the first start with the passive burstContState, they are explicitly forwarded to the addrTrf state as soon as the preceding burst item has succeeded in sending the addrTrf. In a similar manner, a burst can be completely re-sent if the target reported a retry in the eotTrf response transfer. After the bus interface state machine has gone through the given state sequence, the result can be returned.
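To make the execution model concrete, the following is a minimal C++ sketch of such a sequence-driven state machine; it is illustrative only, not the generated code, and all names are hypothetical:

#include <vector>

// A buffer entry walks through its assigned state sequence, one transfer
// attempt per clock cycle; passive extra states only advance via catches.
enum class State { reqTrf, addrTrf, writeDataTrf, eotTrf, unreqTrf,
                   burstContState, finishState };

struct BufferEntry {
    std::vector<State> sequence;  // e.g. the states of seq_single_write
    size_t pos = 0;               // current position within the sequence
};

class BusInterfaceStateMachine {
public:
    void clock() {
        for (BufferEntry& e : pending_) {
            if (e.pos >= e.sequence.size()) continue;   // sequence finished
            State s = e.sequence[e.pos];
            if (!hasTransfer(s)) continue;   // passive state: wait for a catch
            if (sendTransfer(s, e)) ++e.pos; // success: enter the next state
            // on failure the same transfer is retried in the next cycle
        }
    }
private:
    std::vector<BufferEntry> pending_;
    static bool hasTransfer(State s) {
        return s != State::burstContState && s != State::finishState;
    }
    // stub: in a real adaptor this exchanges the transfer with the bus node
    bool sendTransfer(State, BufferEntry&) { return true; }
};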
7.3.4 Getting Data Out of the State Machine
The implementation of the try_read() function, for example, works on a buffer of type searchFinished and checks whether the address matches and whether the corresponding state sequences have succeeded. In case of success, the read data is copied into the target array; otherwise a respective error code is returned.

behavior = [
  buffer, type = searchFinished,
    condition = ( ( buffer[first].addrTrf.address == addr)
                & ( buffer[first].addrTrf.type    == readAtAddress) ),
    size = n,
    behaviorFailed = [ return = -1; ];

  implementation,
    condition = ( ( buffer[first].state      != finishState )
                | ( buffer[index(n-1)].state != finishState) ),
    behavior  = [ return = MA_SIG_WAIT; ];

  implementation,
    behavior  = [ data[current] = buffer[current].readDataTrf.readData;
                  return = MA_OK; ],
    discardOlderBuffers = true,
    discardBurstUntil   = (n-1);
]

SearchFinished refers to a specific range of pending transactions: the state sequence of at least one burst item of the respective burst has already finished. On the other hand, bursts which have already been discarded are not searched any more. As can be seen at the end of the behavior, every successful try_read() call discards all previous transactions, as well as the already read burst items of the current burst. The parameter n of the try_read() function refers to the last burst item that has to be valid already. By calling try_read() with a burst length smaller than the requested size, a once requested burst may be read word by word. The discardBurstUntil command takes care that every word is read at most once.
Similarly, by using a buffer of type searchProcessing, an API function can supplement an already running write transaction with the write data. The adaptor will not send the writeDataTrf until the write data has been provided.
7.3.5 Advantages
This single condensed interface specification5 is suitable to customize the generation of the bus interface for all initiator protocols of a bus family. Given information can be verified automatically, e.g. whether the state sequences form a valid path through the state graph of the respective bus nodes. Information that is superfluous for a specific bus protocol is skipped. In AMBA 2.0, for example, all information concerning the bus request phase is used for generating the AHBInitiator interface, but it is ignored for generating the AHBLiteInitiator and the APBInitiator bus interfaces. Manually implementing these state machines specifically for every bus protocol would be a very tedious and error-prone task. Using the new approach, working adaptors are obtained very quickly. Additional features like bursts and sub-block accesses are added successively by refining the initial specification. Further annotations indicate the applicability of specific optimizations. For example, remaining pending items of a burst can be skipped completely once an addrTrf could not yet be sent. Using such optimizations, the simulation speed of platforms with generated and with handwritten adaptors is roughly the same (cf. case study, Chapter 9).
5 500–800 lines of code for the AMBA 2.0 [77] and AMBA AXI [78] bus families.
Chapter 8 DEBUGGING AND PROFILING
This chapter focuses on the interaction between the designer and the design tools. The main tasks are debugging and profiling of the components in system context. Debugging helps to get the HW and the SW functionally correct, while profiling provides useful information for future design decisions. First, the multi-processor debugger is described, which is useful for both tasks. The second section presents the Message Sequence Chart (MSC) debugger, which is used to observe SoC communication on word level. Finally, the analysis instrumentation of the generated bus interfaces is introduced.
8.1 Multi-Processor Debugger
For user-friendly debugging and online profiling of the embedded SW and its platform, the user should always be able to get the full SW centric view of an arbitrary SW block at simulation runtime [108, 119]. This section stepwise introduces the multi-processor debugger, which can be dynamically connected to any processor simulator that is currently of interest (Figure 8.1). First, the underlying standalone processor simulators are introduced. The tool-set is in principle not restricted to automatically generated LISA simulators only. In the second part of this section, an SoC simulator with multiple processor cores is considered, still offering the same observability and controllability as known from standalone simulation. After that, the dynamic connect tooling is introduced. Finally, an extension is presented which enables full source code level debugging.
Figure 8.1. Multi-Processor Debugging

8.1.1 Retargetable Standalone Simulation
The flexible multi-processor debugging environment is based on retargetable standalone simulators, satisfying several constraints. This section presents the necessary or at least recommended features of a well suited underlying processor simulator environment.
ISS Requirements
Any Instruction Set Simulator (ISS) to be applicable here must be strictly separated into two parts: an architecture specific simulator backend and a generic, mostly graphical debugger frontend (Figure 8.2). The simulator backend has to encapsulate all its functionality into a well defined architecture independent Application Programming Interface (API). The debugger frontend can then – either directly or via Inter Process Communication (IPC) – access the backend through the API to dynamically retarget to the respective backend architecture and display its state. The API functions can be separated into two main groups: (a) simulation control and (b) processor observation/manipulation. These requirements are fulfilled by state of the art ISS like the ARMulator [76]. Automatically generated retargetable processor simulator backends as provided by the LISA tool-suite (cf. Section 4.3) meet these constraints even
Figure 8.2. Standalone Simulation with Debugger GUI
for arbitrary processor architectures and for multiple abstraction levels (i.e. instruction accurate, cycle accurate).
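To make this separation concrete, the following is a minimal sketch of what such an architecture independent backend API could look like; all names are illustrative, not the actual LISA tool-suite interface:

#include <cstdint>
#include <string>

// Hypothetical uniform resource descriptor: memory blocks, GP registers,
// pipeline registers and ports are all described by the same structure.
struct ResourceInfo {
    std::string name;      // e.g. "R0", "data_ram", a pipeline register
    int         kind;      // memory / GP register / pipeline register / port
    int         bitWidth;
};

class SimulatorBackend {
public:
    virtual ~SimulatorBackend() = default;

    // (a) simulation control
    virtual void stepCycle() = 0;         // advance one clock cycle
    virtual void stepInstruction() = 0;   // advance one assembly instruction
    virtual void run() = 0;               // run until breakpoint/watchpoint

    // (b) processor observation/manipulation
    virtual int          numResources() const = 0;
    virtual ResourceInfo resourceInfo(int id) const = 0;
    virtual uint64_t readResource(int id, uint64_t addr) const = 0;
    virtual void     writeResource(int id, uint64_t addr, uint64_t value) = 0;
    virtual void     setBreakpoint(uint64_t progAddr) = 0;
};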
Observation/Manipulation API with Generic Resource Access
The observation/manipulation part of the remotely accessed API contains methods for generic resource access, breakpoint and watchpoint handling, profiling data access etc. The generic resource access mechanism allows the frontend GUI (Graphical User Interface) to adapt its resource windows and memory display to the respective processor hardware. When opening an architecture, the debugger frontend initially does not know anything about the simulator backend. A set of API functions allows the frontend to get to know the number of resources and then to request the type and
properties for each of them. Memory blocks, General Purpose (GP) registers, pipeline registers, ports: all these types are managed by the same uniform resource data structure received through the backend’s API. Having this information, the debugger’s windows (Figure 8.2) are configured: each memory resource is added as a tab to the memory window; the general purpose registers and the pipeline registers are appended to a respective window. These windows can display and edit the current resource state as well as manage watchpoints. The disassembly window always highlights the current instruction and allows the user to set and reset breakpoints. When debug information is provided by the current application executable, a source code window offering the same breakpoint features is created as well. If the respective features are enabled in the current simulator backend, further windows for profiling and debugging the application and the architecture can be created, or alternatively, the existing windows are extended. A state of the art debugger frontend also enables the user to modify the displayed values and it can manipulate the current state of the processor simulator backend accordingly.
Control API
The control part of the remotely accessed API allows the user to manually or automatically invoke and stop processor simulation cycles at several degrees of granularity. For standalone processor simulation, several simulation control modes are known. The main distinction can be made between interactive and non-interactive mode. In this context, interactive mode means that the user has full control over the respective processor simulator. The control can be regularly handed back to the user at several levels of granularity: clock cycles, assembly instructions or source code instructions (i.e. C/C++ source). Simulation is continued only on an explicit new invocation by the user. In contrast, non-interactive mode means here that the processor simulator does not expect any special user input, but is solely invoked automatically by the system environment it is embedded into. There are several non-interactive modes, differing only in the degree of observability still possible: automatic frontend display refresh on completion of every instruction, frontend display refresh only on demand, or the simulator running in the background with no frontend connected at all. A state of the art debugger enables free choice of the simulation mode at any time, depending on the user's current interest. Besides a manual switch between the simulation modes, this can also be initiated automatically: e.g. if a breakpoint is hit while in non-interactive mode, the simulator is turned to interactive mode automatically.
8.1.2 Multi-Processor Synchronization
In this section, we generalize the single-debugger-frontend/single-simulator-backend constellation to multiple debugger frontends and multiple processor simulator backends.
Goal
For user friendly multi-processor debugging, the user should be able to open as many instances as needed of a generic debugger frontend which – at simulation runtime – can be retargeted to those processors the user selects. To keep the potential for highest possible simulation speed in a complex SoC simulation, the data exchange between the processors and their respective SoC environment – which is always necessary independent of the user's system observation – must be highly optimized. This can only be guaranteed if all processor simulators reside within one executable on the host so that no slow Inter Process Communication (IPC) is necessary for this task. IPC is applied if and only if the user decides to observe or take control over a certain processor simulator.
System Integration Concept
By encapsulating the entire ISS functionality into a unified API, instantiating any number of different ISS within the same simulation executable is possible simply by linking the respective architecture libraries to the system simulator. This guarantees the high simulation speed: no IPC is necessary for communication within the system simulation if no frontend, i.e. user interaction, is desired (Figure 8.3). On the other side, the user may decide to observe and control the execution of a certain processor. For that purpose, he can create an IPC connection between a generic frontend instance and the processor backend to access its API (Figure 8.4). This IPC connection transfers Control and Observation/manipulation information (CaO). Thus, the remote retargetable debugger frontend instance offers all observability features for multi-processor simulation as known from standalone simulation. Even resources external to the processor simulator backend like peripheral registers and external memories can be visualized and modified through the observation/manipulation API. For good observability, every frontend can request backend data to display the current processor state without influencing the remaining processors or the system simulation synchronization. Analogously, via the manipulation API, the user can modify the state of a processor simulator backend without directly influencing the others. For user controllability, in contrast, some more sophisticated synchronization between the user control coming from the frontend and
the system simulation control coming from the simulation scheduler is needed (Figure 8.4).

Figure 8.3. System Simulation Executable without Debugger GUI

Figure 8.4. Synchronizing User and Scheduler Control
The Synchronizer Unit
The standalone simulation control modes described in Section 8.1.1 are available for every processor core during initial development. When system integration starts, it is reasonable to still be able to choose all the previously introduced interactive and non-interactive simulation modes individually for each processor of the system. The black boxes in Figure 8.4 indicate the problem that two instances can access every Control API and thus need to be synchronized.

Figure 8.5. Synchronizer Control Flow

The control flow for the most common synchronizing mode is sketched in Figure 8.5. From the upper side, control is handed over from both sources: the simulation scheduler at the left, and the debugger frontend (respectively the user) at the right. Both instances have to cope with the situation that the call can be blocking, i.e. it may take some time until control is handed back to the respective unit. For frontend updates or socket watching, the synchronizer can be configured to initiate callbacks while being in blocking state. The synchronization is done such that the debugger frontend only grants permission to actually execute cycles on the backend simulator. Depending on the user command, this permission can cover one cycle, an entire assembly instruction, or even more. The permission can be revoked again by the user if it has not yet been consumed by the scheduler's activation, e.g. because the system simulation scheduler is in pause mode. When the simulation scheduler activates the processor module, it is first checked whether a frontend is connected at all (Figure 8.5). If this is not the case, the
simulator backend can advance a cycle since it is running in the background now. If a breakpoint or watchpoint is hit, all frontend instances can be notified about that. Thus, the user can connect to the respective processor to analyze the target that hit the breakpoint. Otherwise, if a debugger frontend is connected on the scheduler’s activation, it has to be checked whether the frontend already gave permission to execute the next simulation cycle. In case this is not true, the execution thread blocks until this permission is available. Then the processor cycle actually can be invoked. After that, it is checked if the permission given by the frontend is expired. This decision depends on the kind of permission that is obtained. If the permission is expired, if a breakpoint is hit, or if the frontend should refresh its display, then the frontend instance is notified about that. In all of the above cases, the control is handed back to the simulation scheduler so it can invoke the other system modules, i.e. the remaining processor backends, in the same manner.
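A condensed sketch of this permission handshake is given below, assuming the hypothetical backend interface sketched in Section 8.1.1. It is illustrative only; the real synchronizer additionally handles permission revocation, frontend callbacks and socket watching:

#include <condition_variable>
#include <mutex>

// The frontend deposits permissions; the scheduler consumes one
// permission per executed cycle while a frontend is connected.
class Synchronizer {
public:
    // called from the debugger frontend on a user command (step, run, ...)
    void grantPermission(int cycles) {
        std::lock_guard<std::mutex> lock(m_);
        granted_ += cycles;
        cv_.notify_all();
    }
    // called from the simulation scheduler on every module activation
    template <typename Backend>
    void onSchedulerActivation(Backend& backend) {
        bool expired = false;
        if (frontendConnected_) {
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return granted_ > 0; }); // block until permitted
            --granted_;
            expired = (granted_ == 0);
        }
        backend.stepCycle();          // actually advance one processor cycle
        if (expired)
            notifyFrontend();         // permission expired: back to the user
    }
private:
    void notifyFrontend() { /* e.g. send a message over the CaO socket */ }
    std::mutex              m_;
    std::condition_variable cv_;
    int  granted_ = 0;
    bool frontendConnected_ = false;  // sketch: updated on (dis)connect
};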
Multi-Processor Simulation Modes
By attaching such a synchronizer unit to every processor simulator backend, all simulation modes known from standalone simulation are still available for multi-processor simulation and can be selected independently for each processor. If one processor is in interactive mode and new user input is expected at the respective frontend, the entire system simulation will be stopped. If there are multiple processor backends in interactive mode at the same time, the control will be passed to the respective debugger frontends in turn. Even if no processor simulator backend is in interactive mode at a time, all breakpoints and watchpoints, independent of which processor they belong to, are considered and can interrupt system simulation.
8.1.3 Dynamic Connect
At any point in time during simulation, and individually for every processor core, the user should be able to dynamically choose the optimal trade-off between good observability/controllability and high simulation speed. For enabling this dynamic connect capability, the tooling described so far needs some extensions.
TCP/IP
The synchronizing mechanism described above handles both situations: a debugger frontend being connected, or the simulator backend running without user interaction. To switch between these two states at runtime, it is necessary to dynamically create and destroy the CaO IPC connection. By choosing TCP/IP for the IPC, implemented using the platform independent Qt library [120], the user
can launch the frontend instances not only on a different host, but even on a different host operating system (OS).

Figure 8.6. Dynamic Connect Debugging
State Information Exchange
To manage the information about how to set up and tear down the TCP/IP connections, a global connection management is necessary (Figure 8.6). For that purpose, a frontend controller is added to the simulation executable. The controller maintains global information about all simulator backends that should be published to all frontend instances: host/port information needed to connect to the CaO socket of the respective processor simulator, the current availability for dynamic connect, and global backend information like the current processor load. Additionally, a system simulation control panel is managed by this controller, which allows the user to start and stop the system simulation scheduler. The frontend controller offers this information to a simulation state socket, which is permanently connected to every debugger frontend. Thereby, every frontend instance can display the current global state of every processor backend, provide a system simulation control panel, and enable the user to connect to the CaO socket of any simulator backend currently available.

Socket Watching
At any time, the debugger frontends must have the possibility to access the simulation executable's TCP/IP sockets, even if only for refreshing the display.
Thus, all sockets of the simulation executable must be observed all the time, independent of whether the system simulation is running, a processor backend in interactive mode is blocking, or some other component, e.g. an external HW simulator or an external system simulation control panel, is holding the control. In the latter case, the sockets cannot necessarily be served by the main event loop. To still be able to integrate the simulator backends into a third party system simulation environment (e.g. SystemStudio provided by Synopsys), it is necessary to make the simulator backend thread-safe and to control the sockets from within a separate thread or directly from the host operating system. In terms of Figure 8.5, the control flow on the right hand side, which serves one CaO socket of the simulation executable, is possibly not performed from within the main thread that executes the system simulation itself, but by an additional socket watcher thread.
8.1.4 Source Code Level Debugging with GNU gdb
More and more embedded software is written in High Level Languages (HLL) like C/C++ in order to increase the software developer’s productivity. This section describes how the HLL debugger GNU gdb [121] is integrated into the multiprocessor debugging tool-set.
The GNU Debugger gdb
At least on Linux hosts, the most commonly used C/C++ software debugger is the GNU debugger (gdb). It allows attaching to a running process on the host, or invoking an executable to become such a target process (Figure 8.7). The gdb debugger can then control and observe the target process on source code level, as the LISA debugger does on machine level (cf. Section 8.1.1). Similarly, this is done using a suitable API for exchanging control and observation information. The control part of the API allows advancing simulation time over or into a C/C++ line, or until a breakpoint or watchpoint is hit. The observation part of the API enables observing/modifying the contents of C/C++ variables, the current source code line and the execution stack, for example. To the user, gdb offers a Command Line Interface (CLI) as well as a complex API. The user can directly access the CLI using a terminal, but in most cases, a graphical debugger frontend (e.g., ddd [122]) enables a much more intuitive use of gdb. Normally, gdb is used to debug native software directly executing on the host. However, as shown in the following subsection, gdb can also enable debugging embedded multiprocessor software executing on a set of customized processor simulators.
Figure 8.7. GNU gdb Connected to a Host Process
Multiprocessor Debugging with gdb
For the embedded software developer, the dynamic connect multiprocessor debugging environment outlined in Sections 8.1.1–8.1.3 covers the processor architecture dependent issues. But in order to enable comfortable debugging also on the architecture independent source code level, the gdb debugger has been integrated into the tool-set (Figure 8.8). The extended LISA debugger frontend offers source code debugging features as known from debugging host software. Similar to ddd, the user input is used to control and configure the gdb instance that the debugger frontend owns. The gdb control commands are directed to the control API of the processor simulator backend. These do not need to be synchronized with the direct user input since they are only alternative forms of user input. Instead of forwarding simulation for one clock cycle or one assembly instruction, the user can now also step into or step over a C/C++ source code line, for example. The gdb observation commands need to be forwarded to the processor simulator backend in the same manner as is done for the LISA observation commands. Thus, the automatically generated processor simulator backends need to offer an additional generic observation API for the gdb debugger. If the user requests a C/C++ variable lookup, for example, the GNU debugger directs a suitable memory or register lookup command to the gdb observation API of the respective processor simulator.
Figure 8.8. Multi-Level Multiprocessor Debugging
Full SoC Debugging
Having a gdb available, it is also possible to consider the entire simulation executable as a target process and connect gdb to this platform simulator (Figure 8.7). When hiding the threads serving the sockets and neglecting the LISA processor simulators, the remaining SystemC system modules can be debugged on SystemC level using gdb as well.
8.2 TLM Bus Traffic Visualization

8.2.1 Message Sequence Charts (MSC)

Early on in the AVF project, a graphical frontend has been used to display the packet level communication between the abstract system modules [84]. This frontend visualizes the communication according to the Message Sequence Chart (MSC) paradigm. As depicted in Figure 8.9, every system module is displayed as a vertical line, and every packet transfer appears as a horizontal arrow with the time stamp annotated in brackets. In the AVF environment, MSC visualization is typically applied to display functionally associated data packets like IP packets or ATM cells. This is
encountered in very early phases of SoC design (steps 0 and 1a in Figure 3.1, page 27).

Figure 8.9. Transaction Accurate MSC Visualization
8.2.2 Word Level Data Display
The MSC debugger is also capable of visualizing the communication already based on architecturally associated data, covering several levels of timing accuracy. In terms of Figure 3.1 (page 27), the phases 1b, 2 and 3 are also visualized.
Transaction Level Traffic Display
Figure 8.9 shows the data display for phases 1b and 2. The difference between these two phases is the reliability of the time stamps. The generic request/response communication protocol is the same as applied for functionally associated data. The only difference is the data payload. As depicted in Figure 8.9, a packet contains a burst of words, i.e. a cache line. This packet type is application and architecture independent. A current restriction of the generic MSC packet type is the maximum number of data words that can be exchanged per packet.
Figure 8.10. Cycle Accurate MSC Visualization
Cycle Accurate Traffic Display
The same MSC frontend is also useful to display the same transactions fully cycle accurately. As shown in Figure 8.10, a single cache line refill possibly invokes dozens of simulation events in a cycle accurate system simulation. All successful and unsuccessful transfers introduced in Figure 3.8 (page 38) appear as a dedicated arrow in the MSC. Obviously, the more complex initiator side actively tries to invoke every transfer explicitly. The target, in contrast, only declares itself sensitive to those events it needs to react on. It does not explicitly receive the addrTrf. Instead, the associated information is obtained implicitly before sending the response. The cycle accurate transfers and their attributes are highly protocol dependent and thus cannot be visualized using a generic MSC packet type any more. Since
the BusCompiler input specification contains all needed information about the protocol transfers and their attributes, it is reused to generate the MSC packet classes automatically as well. The extended BusCompiler generates extended bus libraries which also contain the MSC packet types as well as a respective bus simulator instrumentation (cf. Appendix B). The packet filtering mechanisms built into the MSC debugger help the designer in quickly identifying the communication issues which are currently of interest.
8.3 Bus Interface Analysis

8.3.1 CoWare PlatformArchitect Analysis

The analysis capabilities [123] are an important part of CoWare's PlatformArchitect product family. IP blocks shipped by CoWare are generally instrumented with additional code for profiling and analysis. This enables the SoC designer to observe and trace properties like bus traffic, bus contention, cache usage or software behavior. The collected data can either be viewed dynamically while the simulation is running, or after postprocessing. The analysis library includes generic visualization capabilities to display any kind of collected data as a trace, graph, pie chart or table.
8.3.2 Bus Interface Optimization
The bus interfaces between the processors and the communication nodes are critical for the overall system performance. Especially if both sides are customizable, the interface needs to be designed very carefully. Otherwise, the gain of customizing the modules would quickly be lost again by introducing a new bottleneck. As described in Chapter 7, a bus interface state machine is generated automatically on TLM level. However, the designer still has to manually assign the LISA bus/memory API invocations to suitable pipeline stages of the processor (cf. phase 5, Section 6.5). In order to ease this task, the generated bus interface state machine is automatically instrumented with analysis capabilities. As displayed exemplarily in Figure 5.8 (page 62), the generated bus interface state machine buffers data delivered too early by the processor. This in most cases indicates room for optimization: either possible critical paths in the processor could be avoided by delivering specific data later in the processor pipeline, or the associated address and data information may still be delivered in time when a request is initiated one cycle earlier.
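Conceptually, this instrumentation amounts to counters accumulated per bucket of simulation cycles. A minimal sketch of the idea (illustrative only, not the actual CoWare analysis library):

#include <cstdint>
#include <vector>

// Bucketized transfer statistics: successful and unsuccessful transfer
// attempts are accumulated per bucket of 1000 simulation cycles, matching
// the granularity of the bus interface analysis views.
class TransferStats {
public:
    explicit TransferStats(uint64_t bucketCycles = 1000)
        : bucketCycles_(bucketCycles) {}

    void record(uint64_t cycle, bool success) {
        uint64_t bucket = cycle / bucketCycles_;
        if (bucket >= ok_.size()) {
            ok_.resize(bucket + 1);
            fail_.resize(bucket + 1);
        }
        (success ? ok_ : fail_)[bucket]++;   // count into the right bucket
    }
private:
    uint64_t bucketCycles_;
    std::vector<uint64_t> ok_, fail_;        // successful / unsuccessful
};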
Figure 8.11. Bus Interface Analysis. Each set of bars refers to 1000 cycles of simulation time.
Example
An example of an uncached processor memory port is depicted in Figure 8.11. Since this data port has been assigned a lower priority, the actual arbitration time for each transaction depends on the current bus traffic. The topmost graph gives an accumulated view over successful and unsuccessful addrTrf transfers. While the MSC debugger depicts each of these transfers separately (cf. Section 8.2.2), the upper analysis view sums them up for buckets of 1000 simulation cycles. The second graph considers only successful transfers of multiple kinds. Obviously, every addrTrf is either accompanied by a writeDataTrf, or by a readDataTrf. The third graph considers the timing of the addrTrf phase in more detail. While the topmost view only shows the ratio between successful and unsuccessful transfers, this graph displays the time between obtaining the address
information from the processor and successfully forwarding it to the bus side. The time unit is 1/1000 of a cycle. The groups of bars display the minimum, the maximum and the average time period, respectively. On invocation of the memory access, the applied processor delivers the address to the adaptor immediately. However, during the arbitration phase, the adaptor does not yet need this information. Obviously, it takes at least two cycles until the addrTrf can be forwarded to the bus. If the bus is busy due to requests from higher prioritized ports, it takes up to five cycles until bus access is granted.1 This kind of information gives the designer an overview of what is going on in the different parts of the SoC system. Issues which bear potential for optimization can be detected relatively easily.
1 This simple example platform is completely uncached, thus there are no burst transactions on the bus.
Chapter 9 CASE STUDY
This chapter illustrates the design space exploration and the successive refinement flow up to phase 5 on a multi-processor image processing platform. In all involved phases, the automatically generated processor simulators, bus simulators and couplings have been applied.
9.1 Multi Processor JPEG Decoding Platform

9.1.1 The JPEG Application
The considered application throughout this chapter is a JPEG decompression algorithm. Both the original C source code of the algorithm and sample bitmaps are freely available from the JPEG group's web page [124]. The testcase for all profiling measurements in this chapter is a small 150 × 100 sample bitmap. Initially, the JPEG algorithm is compiled for a MIPS32 4K processor [10], using a respectively retargeted GNU C++ compiler [125] and the dietlibc standard C library [126]. The initial profiling on a standalone instruction accurate MIPS32 4K LISA simulator (Figure 9.1) clearly shows the algorithmic core of JPEG decoding: the Inverse Discrete Cosine Transformation (IDCT). By far most of the simulator steps are spent in the respective C function. It transforms the image data from the frequency domain back to the spatial domain. This task is performed for the luminance information as well as, with lower resolution, for the chrominance information. In order to build up an optimized platform, an IDCTcore coprocessor has been designed, optimized for performing this transformation task. The respective C code has been compiled using a C compiler for a similar architecture, appending
a manual optimization step. All other application tasks, i.e., the control and file I/O tasks, are still performed on the MIPS32 4K main processor.

Figure 9.1. Phase 1, C-Profile of the JPEG Application
9.1.2 Platform Topologies
Since multiple processor models are involved from now on, the LISA standalone phase 1 is not capable of delivering meaningful profiling data any more. It is necessary to switch over to platform simulations. The basic platform topologies are depicted in Figure 9.2. All involved processors share a common data memory, and each of them accesses a personal program memory. Since every IDCT invocation works independently from the others, the case study also considers extending the approach to multiple IDCTcore coprocessors. Furthermore, this case study evaluates different alternatives for the interconnect, several cache configurations, as well as, in later exploration phases, different microarchitectures for the processors1 . However, both the JPEG algorithm as the target application and the sample bitmap as the testcase always stay the same.
9.1.3 Platform Performance Indicators
A multitude of indicators is commonly used to evaluate the performance of a platform. For the overall system, and also for single communication and computation components, the main metrics are the latency (time between start
and completion of a task) and the throughput (accomplished tasks per time). The former metric does not account for the interleaved execution of several tasks (resp. transactions or instructions), while the latter does. For the processor cores, the main indicator is the throughput, typically measured in Million Instructions Per Second (MIPS). The instruction latency on pipeline level is indirectly contained in this metric.2 The throughput of the communication infrastructure is typically measured in bits per second. This is an important metric especially for larger systems with a lot of traffic between modules which are relatively independent of each other. In smaller systems with more module dependencies, like the considered JPEG decoding platform, the limiting factor is rather the communication latency. The latency is typically measured in initiator clock cycles. In the context of this case study, the most important performance metric is the number of clock cycles it takes to decode the sample bitmap. It is the overall latency finally observed by the platform user. Since this number to some extent contains all related performance values, weighted by their relevance, most performance comparisons in this chapter use this value. The accuracy comparisons throughout this chapter are also mostly based on this value. Thus, the cycle error refers to an instruction window of several million instructions. In many cases, for such large windows, it is very likely that positive and negative errors cancel each other out. That is why cycle approximation research also considers smaller instruction windows, down to a single instruction. However, these considerations are outside the scope of this case study.

1 Since the flow is based on abstract specifications of the system blocks, even MIPS- and AMBA-like modules are customizable.

Figure 9.2. Co-Processor Configurations
2 Reasons for a long instruction latency are either a long clock period, or many stall cycles. These directly worsen the throughput as well. Longer pipelines are another reason for higher latencies. In order to avoid hazards, additional NOPs or dynamic interlock stall cycles often need to be inserted, which also have a negative influence on throughput.
Phases 3 and 4 are not based on annotation and thus rather contain a systematic error whose sign is constant over the whole simulation run. Multiple effects that cause a different sign of the error are isolated by considering different platform configurations which trigger either the one or the other error. As shown in the next section, the applied AVF communication node also contains a systematic error. This case study does not quantitatively consider the architecture metrics cost (i.e. chip area) and power dissipation (energy consumed per time). Reliable information of this kind can only be back-annotated from the final phases 6 and 7.
9.2 Phase 2: IA + AVF Platform
In phase 2, the basic bus traffic characteristics of the application are analyzed, using a generic AVF communication module which is parameterized to match the behavior of a single AMBA AHB bus node.
Design Space Exploration
In this section, different cache configurations are investigated more thoroughly. The caches have the most obvious influence on the shape and amount of the bus traffic. Since the caches are located close to the processor, a hit does not cause any traffic on the platform interconnect infrastructure.
Figure 9.3. Cache Configurations
The examined cache configurations are depicted in Figure 9.3. In the full-cache system configuration, cache consistency is ensured by flushing and invalidating the data cache after a processor's task is performed. The excl-cache configuration avoids this problem by only caching memory regions exclusive to one processor. However, to exchange data with a coprocessor, the MIPS32 4K processor needs to copy the data explicitly into and from the shared area. The prog-cache configuration, in contrast, only contains simple read-only caches for the instruction memories.
The cache sizes applied for the exploration are as follows:

cache size   MIPS32 4K (lines / size)   IDCTcore (lines / size)
small        2 / 128 Bytes              2 / 128 Bytes
medium       32 / 2 KB                  8 / 512 Bytes
large        256 / 16 KB                16 / 1 KB
The small caches contain just two lines (16 words each). Thus, the cache miss probability is very high in this case, causing much traffic on the bus. The simulation results for three alternative platforms are depicted in Figure 9.4. A single processor solution is compared with two dual-processor solutions. The multi-processor platforms differ in the way the cores are synchronized. In one case, every core polls at a specific memory location to detect if it is activated by the other processor; the other solution causes less bus traffic by using interrupts for activating the processors. For every platform, several cache configurations and cache sizes are applied. As depicted in Figure 9.4, the cache size has a high impact on overall system performance. The difference between small caches and large caches is up to a factor of 6; the program cache size impact alone still accounts for a factor of two. These factors would be even higher for slower memories behind the AMBA bus. The program cache size is an example of a property which is very difficult to annotate to an abstract processor module. Comparing the configurations having program caches only and those having additional data caches (Figure 9.4), it becomes obvious that small and even medium sized data caches do not improve the overall system performance. Only large data caches lead to an improvement. Obviously, locality on the data side is not very high. Since data caches furthermore require a write path as well as significant cache consistency overhead, the following exploration phases do not consider data caches any more. Instead, alternative ways of optimizing the data memory access are analyzed when exploring the processor microarchitectures in later phases.
Simulation Accuracy
The profiling data depicted in Figure 9.4 has been collected using an IA processor model connected to an AVF communication module. In order to evaluate the accuracy of the parameterized AVF module, it has been replaced by a TLM AHB bus module in the reference platform. Since it cannot be the goal to model processor or even application specific properties in an AVF communication module, the IA processor simulator has not been replaced by a more accurate one for this comparison. The generic AVF bus module shipped with the applied CoWare ConvergenSC release 2005.2.1 does not yet easily support some AMBA AHB specific features.
Figure 9.4. Design Space Exploration, Phase 2 (IA+AVF). For a MIPS standalone system and for dual processor systems with polling and with interrupt based synchronization, the charts show the JPEG decoding latency (MCycles/Image) and the approximation error (%) of the AVF model versus the cycle accurate reference, over the examined cache configurations and cache sizes.
The reference bus model supports a default master mechanism. As long as no initiator requested access to the bus resource, the arbiter always decides in favor of a certain initiator, the default master. In case the default master actually starts a transaction, this happens after zero arbitration cycles.
The reference bus model applies two different request modes: an optimized one for single word accesses and a regular one for longer bursts. This leads to non-linear latencies as well, which cannot easily be captured by the AVF modules. These are the main reasons for the inaccuracy displayed in the bottom graphs of Figure 9.4. The annotation task is to find a single parameter set which best matches completely different traffic scenarios. In order to keep the average error low, especially the short transactions of only one word need to be considered very carefully. An additional one or two cycles for these transactions easily introduces an overall error larger than a factor of two. Due to the annotation in favor of these cases, the leftmost configurations show the smallest error. The other configurations have fewer default master accesses and more burst accesses. Both facts lead to an underestimation in the approximation, causing an error of up to 20% in the worst case.
Figure 9.5. Simulation Speed, Phase 2 (IA+AVF). AVF simulation performance (kCycles/s) over the traffic shape (cache configuration including line size), for a MIPS standalone system and for a dual processor system with interrupt based synchronization; the simulation speed of the cycle accurate reference system is marked as well.
Simulation Speed
As already explained in Chapter 6, the typically high simulation speed of a pure AVF platform model is mainly based on the absence of clocked modules. Communication events are modeled on the granularity of packets, while other
periods of time are just skipped during simulation. By introducing clocked processor models, these remaining time periods cannot be skipped any more. Furthermore, the AVF simulation performance depends heavily on the number and thus indirectly on the size of the transferred packets. Larger but fewer packets cause fewer communication events and thus lead to higher simulation speed. In Figure 9.5, the simulation speed is displayed for several cache line sizes.3 The black lines mark the simulation speed of the respective cycle accurate TLM communication models. Being fully based on clocked communication modules, the reference simulation speed is relatively independent of the actual cache line size. Instead, cycle accurate simulation performance decreases with an increasing number of clocked bus nodes. Figure 9.5 shows several effects. Only for a single processor system does the AVF platform have the potential for high simulation speed. For multiple processors, it is rather unlikely that all of them are in a communication phase at the same time, which would allow clock cycles to be skipped completely during simulation. Furthermore, only fully cached communication with large line sizes enables efficient AVF simulation. Uncached processors lead only to about 25% of the reference simulation performance. Obviously, simulating the transfer of a single word is more efficient when applying a generated optimized state machine than when using the generic AVF handshake mechanism. The left graph in Figure 9.5 also shows that for a line size of roughly 16 words, the AVF simulation performance becomes independent of the cache size and thus of the packet count. Thus, the effort for simulating a packet transfer in AVF is about the same as clocking a LISA processor model 16 times. In contrast to AVF, the simulation performance of cycle accurate TLM communication models decreases significantly with an increasing number of clocked communication nodes.4 For the current platform complexity, however, automatically generated cycle accurate communication modules are superior in terms of simulation accuracy as well as simulation performance.
9.3 Phase 3: IA + BusCompiler Platform

In phase 3, the AVF communication modules are replaced by cycle accurate models, generated automatically using the BusCompiler tool.

Design Space Exploration

In this section, several module types and topologies are evaluated on the interconnect side, as well as the number of coprocessors on the computation side.
Figure 9.6. Interconnect Topologies
(Three interconnect alternatives for the MIPS32/IDCTcore platform with shared data_ram and per-processor prog_rom, p_cache and d_cache: a single bus node (ahb-1), a crossbar (ahb-c; axi-c), and multiple bus nodes (sb-n).)
The three examined interconnect topologies are displayed in Figure 9.6. The module types AMBA AHB (ahb) and AMBA AXI (axi) are popular communication infrastructures provided by ARM (cf. Section 3.3.1). The additional simple bus (sb) module is an academic TLM bus model which shows the platform performance under rather ideal conditions. In contrast to the AMBA nodes, the simple bus simulator is not generated automatically by the BusCompiler tool, and no RTL implementation is available for it either.

The simulation results are depicted in Figure 9.7. Both graphs on the left hand side evaluate the impact of additional IDCTcore coprocessor instances. According to the upper graph, the first IDCTcore leads to an overall task acceleration of 25%. Only for systems without program caches is the overall decoding performance further improved, by another 20%, when adding a second coprocessor. Otherwise, the performance gain is always below 5%, which is negligible. Further analysis (Figure 9.7, bottom left graph) shows that the IDCT phases themselves are accelerated by a factor of two by the first IDCTcore coprocessor, and by a further 25% by the second one. This observation follows Amdahl's Law [127]: the coprocessors are only useful during the IDCT phases of the applied JPEG implementation. One coprocessor seems to be the best choice for this platform.

The graph on the top right of Figure 9.7 visualizes the influence of alternative interconnects on the overall JPEG decoding latency. The groups of bars show the performance of an AHB bus node, an AHBLite crossbar, an AXI crossbar, and a set of simple bus instances, respectively. The AHB node is optimized for latency; the AHBLite crossbar provides a better throughput but a worse latency. That is why processors without caches perform better on an AHB node, while cache bursts are relayed more efficiently by the AHBLite crossbar.
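The coprocessor numbers are consistent with Amdahl's Law. As a back-of-the-envelope check (the IDCT fraction f is inferred here for illustration; it is not a number reported by the measurements), an overall speedup of 1.25 at an IDCT speedup of s = 2 implies:

\[
S_{\text{overall}} = \frac{1}{(1-f) + \frac{f}{s}}
\qquad\Longrightarrow\qquad
1.25 = \frac{1}{(1-f) + \frac{f}{2}}
\;\Rightarrow\; f \approx 0.4 .
\]

So roughly 40% of the decoding time would be spent in the IDCT phases, which bounds the achievable overall speedup at 1/(1-f), i.e. about 1.67, even for arbitrarily many coprocessors. This is consistent with the small gains observed for the second and third coprocessor.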
Figure 9.7. Design Space Exploration, Phase 3 (IA+BusCompiler)
(Four panels. Top left: JPEG decoding latency in MCycles/Image over the number of coprocessors (mips only, 1-copro, 2-copro, 3-copro) for I-cache sizes none, small, medium and large; D-cache none, interconnect ahb1, IA LISA models. Top right: JPEG decoding latency over the interconnect type (ahb1, ahbc, axic, sbn) for the 1-copro platform. Bottom left: IDCT interarrival time in kCycles/IDCT over the number of coprocessors. Bottom right: simulation error IA vs. BCA in % over the interconnect types.)
The AXI infrastructure targets larger platforms. It provides five independent channels which completely decouple reads from writes as well as address from data phases. Furthermore, it can be configured for a data width of up to 128 bit. The current JPEG platform does not really take advantage of these advanced features, but it suffers from the disadvantages, like an even higher latency for single words and bursts due to the reassociation of the transactions.5

5 The default burst timing of the CoWare AXI bus model (version 2005.1.0 alpha) is two clock cycles per burst beat.
The simple bus is analyzed as well since it is a well-known TLM bus model representative. It performs at roughly half the latency of the other infrastructures. However, this is rather a theoretical lower limit. The large cache configuration hardly spends any additional cycles on platform communication at all. As a result of this exploration phase, the dual processor platform with a single AHB node interconnect is one of the most promising candidates for the further investigations.
Simulation Accuracy

The instruction accurate LISA model does not yet contain any assumptions concerning the pipeline; it simply executes one instruction per activation. In order to evaluate the simulation error, the question arises which microarchitecture to take as reference. The error displayed in Figure 9.7 refers to an implementation with the shortest possible pipelines.6

Basically, the simulation error is caused by two effects. An overestimation is caused by the missing modeling of parallelism; a negative error is introduced by the missing consideration of pipeline effects. Thus, in long simulation runs, the errors often partially cancel each other out. In configurations with much concurrent communication, the overall error reaches up to +75%. In cases with relatively few communication events, the missing pipeline modeling leads to an overall error of up to –10%.

Simulation Speed

As depicted in Figure 9.8, the simulation performance does not primarily depend on the cache size and thus the traffic shape. Instead, it mostly depends on the number and complexity of clocked platform modules. Crossbars with multiple communication nodes simulate more slowly than single node infrastructures. Complex modules like AXI nodes cause up to a factor of 10 more simulation effort than simple bus modules.

6 MIPS32 4K: 5 stages, IDCTcore: 4 stages; cf. exploration phases 4 and 5.
9.4 Phase 4: CA + BusCompiler Platform

Design Space Exploration

The next refinement step on the LISA side is introducing pipelines into the processor models. The initial implementations foresee a 5-stage pipeline for the MIPS32 4K and a 4-stage pipeline for the IDCTcore. Both pipeline layouts allow hiding one cycle of memory access latency without causing any stall cycles. This is fully sufficient for the instruction side, since the instruction caches can deliver the instruction words in one cycle in case of a hit.
Figure 9.8. Simulation Speed, Phase 3 (IA+BusCompiler)
(Two panels plotting the simulation performance in kCycles/s; left: over the interconnect type (ahb1, ahbc, axic, sbn) for the 1-copro platform; right: over the number of coprocessors (mips, 1co, 2co, 3co) on ahb1; I-cache sizes none, small, medium and large, D-cache none, IA LISA models.)
However, since the data memory accesses are not cached, alternative processor implementations contain elongated pipelines. They allow hiding multiple cycles of data access latency in the pipeline without causing stalls (cf. the following table).

config name    MIPS32 4K pipeline               IDCTcore pipeline
               stages  I-fetch  D-access        stages  I-fetch  D-access
                       stages   stages                  stages   stages
std+0            5       1        1               4       1        1
std+1            6       1        2               5       1        2
std+2            7       1        3               6       1        3
std+3            8       1        4               7       1        4
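The benefit of the elongated variants can be made concrete with a small C++ sketch (illustrative only; the bus latency value is an assumption): the number of stall cycles per uncached data access is the part of the bus latency that does not fit into the pipeline's D-access stages.

#include <cstdio>

// Illustrative only: stall cycles per uncached data access when a bus with
// the given latency is attached to a pipeline hiding dAccessStages cycles.
static unsigned stallCycles(unsigned busLatency, unsigned dAccessStages) {
    return busLatency > dAccessStages ? busLatency - dAccessStages : 0;
}

int main() {
    const unsigned busLatency = 2;   // assumed: AMBA AHB needs >= 2 cycles
    const char* cfg[] = { "std+0", "std+1", "std+2", "std+3" };
    for (unsigned i = 0; i < 4; ++i) // D-access stages grow from 1 to 4
        std::printf("%s: %u stall cycle(s) per uncached data access\n",
                    cfg[i], stallCycles(busLatency, i + 1));
    return 0;
}

With an assumed 2-cycle bus, std+0 pays one stall cycle per access while std+1 and longer hide the latency entirely; the price is the longer pipeline itself and its higher hazard probability.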
Simulation Accuracy

In exploration phase 4, only the pipeline is introduced into the LISA model; the memory accesses are still assigned to one pipeline stage as a whole. This leads to functionally correct platform simulation, but the overall accuracy does not improve yet. As shown by the error indicators added to the graphs in Figure 9.9, specific trends are even misleading: when elongating the pipeline, the overall latency observed on this abstraction level increases, while in reality it decreases.

By introducing a pipeline model in this exploration phase, the source of underestimation is gone, while parallelism during communication is still not considered correctly. Instead of hiding the access latencies in the longer pipelines, the blocking interface still accounts for them separately. Thus, this simulation technique only delivers a reliable upper bound for the latencies.
Figure 9.9. Design Space Exploration, Phase 4 (CA+BusCompiler)
(Top: block diagrams of the MIPS standalone and dual processor systems (MIPS32, IDCTcore, prog_rom, data_ram, interconnect). Middle: JPEG decoding latency in MCycles/Image over the pipeline length (IA, std, std+1, std+2, std+3) for the interconnects ideal, ahb1, ahbc and axic; D-cache none, I-cache medium; left: mips only, right: 1-copro. Bottom: simulation error CA vs. BCA in % over the same pipeline lengths and interconnects.)
A lower bound for the overall access latencies, in contrast, is achieved by neglecting all memory access latency cycles using an ideal interconnect model (cf. leftmost bar in every group of bars). For longer pipelines, the ideal memory interface delivers the more reliable performance data.
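In other words, the two abstractions bracket the true latency. Writing C for the cycle count of the fully cycle accurate platform (notation introduced here only to summarize the argument; it is not used by the book itself):

\[
C_{\text{ideal}} \;\le\; C \;\le\; C_{\text{blocking}},
\]

where C_ideal neglects all memory access latency cycles and C_blocking accounts for every latency cycle separately instead of hiding it in the pipeline.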
Simulation Speed

Figure 9.10. Simulation Speed, Phase 4 (CA+BusCompiler)
(Two panels plotting the simulation performance in kCycles/s over the pipeline length (IA, std, std+1, std+2, std+3) for the interconnects ahb1, ahbc and axic; D-cache none, I-cache medium. Left: mips only; right: 1-copro.)
As displayed in Figure 9.10, the simulation performance (measured in kCycles/s) does not decrease significantly after moving from IA to CA processor models or after increasing the number of pipeline stages. On the one hand, additional effort is necessary to compute the pipeline behavior; on the other hand, additional processor cycles are introduced during simulation without executing additional instructions. The simulation performance measured in kInstructions/s (kIPS), however, does decrease significantly, since the same constant number of instructions is now spread over more simulated cycles.
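The two metrics are linked by the instructions-per-cycle rate of the executed application (the numbers below are purely illustrative):

\[
\text{kIPS} = \text{kCycles/s} \times \text{IPC} .
\]

If, for example, a pipeline elongation keeps the simulator at 300 kCycles/s but lowers the application's IPC from 0.9 to 0.6, the effective rate drops from 270 kIPS to 180 kIPS.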
9.5 Phase 5: BCA + BusCompiler Platform

Design Space Exploration

In order to model the platform fully cycle accurately, the processor interface needs to be refined such that the different phases of a transaction can be invoked from within different pipeline stages.

The processor pipeline of the MIPS is outlined in Figure 9.11. The initial 5-stage pipeline depicted on the left hand side contains the stages Instruction Fetch (I), Execution (E), Memory Fetch (M), Align/Accumulate (A) and Writeback (W). The instruction fetch takes place between the I and E stages. Due to the instruction cache, the one cycle latency is fully sufficient in most cases. As decided in earlier exploration phases, the data memory is not cached. Since an AMBA AHB or AXI bus features a latency of 2 cycles or even more, many stall cycles are executed when such a bus is connected directly to the MIPS pipeline.
Figure 9.11. Data Memory Access from the MIPS Pipeline
(Pipeline diagrams of the MIPS32 variants std+0, std+0 with early request (ereq), std+1 and std+1 with early request, showing for data loads and stores in which stages request, address and write data are issued and in which stages read data and status are sampled.)

Figure 9.12. Data Memory Access from the IDCTcore Pipeline
(The corresponding diagrams for the IDCTcore variants std+0, std+0 with early request, std+1 and std+1 with early request, based on the stages FE, AG, EX and WB plus additional memory stages.)
In order to hide this access latency, the pipeline is elongated by introducing additional memory fetch stages M1, M2, etc. The MIPS bypass and interlock mechanism takes care that the same application software can still be executed without hazards.

A further optimization, which can only be analyzed on this abstraction level, is the early request (ereq). Already in the E stage it is known whether the data memory is about to be accessed. Even if the address has not been determined yet, many buses can already start the arbitration phase in this case, eliminating at least one clock cycle of latency.

Analogously, Figure 9.12 depicts alternative pipeline structures of the IDCTcore coprocessor. The basic pipeline contains the stages Fetch (FE), Address Generation (AG), EXecute (EX) and WriteBack (WB). Program memory accesses take place between the first two stages. Data memory reads occur between stages AG and EX, whereas writes are invoked between stages EX and WB. Thus, one cycle of data memory latency is also hidden in this pipeline. In order to simplify the bypass mechanism, the memory access stages are different for loads and stores. An elongated pipeline increases the penalty-free latency for both types of memory access.
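Conceptually, the bus cycle accurate interface splits one data load across pipeline stages. The sketch below reuses the request_read()/try_read() function names and the MA_OK/MA_SIG_WAIT status codes of the LISA bus/memory API listed in Appendix A; the MockBus class, its latency counter, the enum values and the polling loop are illustrative assumptions, not the actual adaptor or MIPS model code.

#include <cstdio>

// Hedged sketch of a data load split across pipeline stages (cf. Figure 9.11).
// Only the function names and MA_* status codes follow the LISA bus/memory
// API of Appendix A; MockBus is an illustrative stand-in for the adaptor,
// and the numeric enum values are assumptions.
enum { MA_OK = 0, MA_SIG_WAIT = 1 };

struct MockBus {
    int delay = 2;                                // assumed bus latency
    int request_read(unsigned long addr, unsigned long* data = 0, int n = 1) {
        (void)addr; (void)data; (void)n;          // issue request and address
        return MA_OK;
    }
    int try_read(unsigned long addr, unsigned long* data, int n = 1) {
        (void)addr; (void)n;                      // data ready after 'delay'
        if (delay-- > 0) return MA_SIG_WAIT;      // still pending
        *data = 42ul;
        return MA_OK;
    }
};

int main() {
    MockBus bus;
    bus.request_read(0x1000);       // issued from the M stage (or E, for ereq)
    unsigned long data;
    while (bus.try_read(0x1000, &data) == MA_SIG_WAIT)   // sampled in A stage
        std::puts("stall cycle");   // latency not hidden by the pipeline yet
    std::printf("loaded 0x%lx\n", data);
    return 0;
}

Elongating the pipeline or requesting early simply moves the request_read() call further ahead of the try_read() call, so fewer polling iterations end in a stall.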
The JPEG decoding performance of the microarchitectural alternatives is depicted for an AMBA AHB interconnect in Figure 9.13. In the AMBA AHB protocol, bus accesses start with an arbitration phase. Thus, the early request approach is more beneficial than elongating the pipeline by one stage. The optimal pipeline length, in general, is a trade-off between hiding memory access delay and avoiding data hazards. The early request mechanism, if applicable, allows hiding more latency cycles in the pipeline without introducing additional penalty. The very same processor model can also be connected to an AHBLite or AXI crossbar, using the automatically generated adaptors. Since these protocols expect address and/or data information already at the beginning of a transaction, the early request pipeline version is no longer advantageous for these interconnects.7

7 Graphs not shown.
Simulation Accuracy

The cycle counts presented in this section originate from a fully cycle accurate simulation and have thus been taken as the reference for the earlier phases. When using a hardware synthesis toolchain for processors and communication modules, the cycle counts should remain the same. However, additional information concerning the critical path and power consumption may make it necessary to revise design decisions made so far.

Simulation Performance

In Figure 9.14, the simulation performance achieved on the different levels of processor abstraction is depicted. Both BCA and optBCA are associated with phase 5; however, the optBCA level applies the refined processor memory interface. It is not just capable of invoking and finishing transactions from different pipeline stages (2-phase CA interface), but also associates any information delivered in between with the respective pending transaction. Since all depicted platforms include a cycle accurate communication model, the performance generally differs by no more than a factor of two for any interconnect type.

Outlook

This chapter illustrated the refinement flow up to phase 5. Many design alternatives have been explored on different abstraction levels, down to full cycle accuracy. This should give the designer the confidence to move forward to the RTL phases 6 and 7 with the best possible starting point.
Figure 9.13. Design Space Exploration, Phase 5 (BCA+BusCompiler)
(Two bar charts of the JPEG decoding latency in MCycles/Image over the pipeline setup (IA, std+0, std+0 ereq, std+1, std+1 ereq, std+2, std+2 ereq, std+3, std+3 ereq) on a single AHB node without data cache, for I-cache sizes none, small, medium and large. Top: MIPS standalone; the labeled values range from 9.39 down to 7.30 MCycles/Image. Bottom: dual processor platform with one IDCTcore; the labeled values range from 6.15 down to 5.00 MCycles/Image.)
Figure 9.14. Simulation Speed, Phase 5 (BCA+BusCompiler)
(Two panels plotting the simulation performance in kCycles/s over the processor abstraction level (IA, CA, BCA, optBCA) for the interconnects ahb1, ahbc and axic; D-cache none, I-cache medium. Left: mips only; right: 1-copro.)
Chapter 10

SUMMARY
The ever increasing complexity of modern electronic devices, together with continually shrinking time-to-market and product lifetimes, poses enormous SoC design challenges: flexibility, performance and energy efficiency constraints have to be met with good design efficiency. Programmable platforms best fulfill today's and tomorrow's flexibility constraints, and tailoring them specifically to the target application domain is the key to meeting the performance demands with good energy efficiency. Designing these platforms requires a systematic methodology and suitable tooling to obtain optimal results in a reasonable design time.

In order to identify a suitable platform for a specific application or application domain, design space exploration on a higher level of abstraction is mandatory. The better the tools support a thorough investigation of the full design space, the more likely it is that optimal design decisions are made in the early stages of the design flow. This avoids the high costs of long redesign cycles and the risk of placing suboptimal products on the market.

The largest design space is opened by heterogeneous platforms with Application Specific Instruction set Processor (ASIP) cores on the computation side, together with a highly optimizable interconnect structure (Network-on-Chip, NoC) on the communication side. To achieve the best possible performance and energy efficiency, in this book processor cores as well as communication modules can be specified freely instead of instantiating fixed IP (Intellectual Property) blocks.
Contribution

This book presents a methodology and the associated tooling for enabling design space exploration as well as a successive refinement flow for the design of such optimized MP-SoCs (Multi-Processor SoCs) with a high degree of automation. In particular, the contributions are:

Processor System Integration. For customizable LISA processor models, flexible tooling for early integration into the SoC system context has been developed. The tooling supports various bus models to connect to, it covers several abstraction levels, and it enables integration into multiple SoC simulation environments.

Successive Top-Down Refinement Flow. Based on this tooling, a new SoC processor-communication co-exploration methodology has been developed. According to the abstraction pyramid principle, the huge design space opened by fully customized processor and communication models can be explored successively and thus exploited.

Automatic Retargetability. In the optimal case, processor modules are developed using the LISA processor compiler, while customized communication modules are created using CoWare's BusCompiler. The necessary simulation adaptors supporting several levels of abstraction are generated automatically as well, based on a new, condensed section of the BusCompiler input specification.

Versatile Debugging and Profiling. Furthermore, the generated MP-SoC platform simulators are equipped with various debugging and profiling capabilities. By analyzing and verifying the customized modules in system context, efficient HW+SW development including solid design decisions becomes much easier for the SoC designer.

Together, these building blocks form an important basis for successfully coping with the demanding challenges of the omnipresent MP-SoC era.
Outlook

Of course, the MP-SoC topic remains subject to further investigation. Future research and development activities are possible, for example, in the following areas:

Bus Interface Customization on RT-Level. This book focused on SoC interfaces between customized processors and communication modules on several abstraction levels down to fully cycle accurate TLM. The remaining path down to customized hardware on RT-Level does not offer this degree of automation yet. So far, only the processor models themselves can be refined automatically to synthesizable HDL code. More complex bus interface state machines, as necessary e.g. for AMBA protocols, are not supported yet. However, a specification mechanism for this kind of state machine has already been developed in this book. It could be reused to automatically generate the more sophisticated bus interfaces on RT-Level as well. Furthermore, the BusCompiler tool is not yet capable of generating RTL code for the communication nodes either. But due to the strongly formalized specification of protocols and state machines, there should be no general limitation.

Software Flow. This book focused on the hardware side of MP-SoC design. The developed tools do enable debugging and profiling the embedded software in system context, but partitioning and compiling it effectively for the MP-SoC target platform is not in the focus of this book. Furthermore, techniques like the automated customization of Operating Systems are not considered yet either.

Simulation Speed. This work did not focus on techniques to optimize simulation speed. However, high simulation performance is desirable, especially for the Programmer's View use case. The most important current approaches include temporal decoupling of system modules as well as a direct memory access mechanism (cf. the sketch below).

Cycle Approximation. In this book, performance estimation on higher levels of abstraction is based on very rough microarchitectural information. The main use case is a top-down flow, where the microarchitecture of the customized processors is not yet known at all. However, if the tooling is applied in a bottom-up flow, it may be desirable for the faster simulators on higher abstraction levels to deliver more accurate performance information. Modern SoC processor cores apply many techniques to accelerate the execution of embedded software in the average case. However, processor features like long execution pipelines, large caches, branch prediction and MMU-TLBs1 do not guarantee any worst case execution time. In either case, they make it difficult to estimate the actual cycle count for executing a specific portion of embedded software. Yet this kind of information is absolutely necessary for the execution of real time software. And also for embedded application software development it is useful to know whether it takes rather 2 or 20 seconds until, for example, a PDA2 calendar application is launched. The goal of Cycle Approximation is to deliver performance information as accurate as necessary at a simulation speed acceptable for embedded software development.

1 MMU = Memory Management Unit; TLB = Translation Lookaside Buffer.
2 PDA = Personal Digital Assistant.
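As an illustration of the temporal decoupling idea mentioned under Simulation Speed, the following minimal SystemC sketch (illustrative only; the quantum size, module name and all members are assumptions, and this is not the tooling developed in this book) lets a processor model run ahead against a local time offset and synchronize with the scheduler only once per quantum:

#include <systemc.h>

// Illustrative temporal decoupling sketch: the core accumulates time in a
// local offset and calls wait() only once per quantum instead of per cycle.
SC_MODULE(DecoupledCore) {
    sc_time cycle, quantum, localOffset;

    void execute_one_instruction() { /* placeholder for one ISS step */ }

    void run() {
        while (true) {
            execute_one_instruction();        // functional execution, untimed
            localOffset += cycle;             // account time locally only
            if (localOffset >= quantum) {     // synchronize at quantum end
                wait(localOffset);            // single scheduler event
                localOffset = SC_ZERO_TIME;
            }
        }
    }

    SC_CTOR(DecoupledCore)
        : cycle(10, SC_NS), quantum(1, SC_US), localOffset(SC_ZERO_TIME) {
        SC_THREAD(run);
    }
};

int sc_main(int, char*[]) {
    DecoupledCore core("core");
    sc_start(sc_time(10, SC_US));             // simulate a bounded interval
    return 0;
}

The speedup comes from replacing many small scheduler events by one large one, at the cost of temporarily inaccurate inter-module timing.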
Appendix A

Businterface Definition Files

A.1 Generic AMBA 2.0 Protocol
The file genprot.bci defines all transfers that can be part of any protocol of the AMBA 2.0 bus family, with the maximum possible set and width of their attributes.
//file genprot.bci
enum TStatus, values = [ok, error, retry, abort, split];
enum TTransactionType, values = [idle, control, addr, read, write, readWrite, readAtAddress, writeAtAddress, readWriteAtAddress];
enum TKind, values = [opcode, data];
enum TGroup, values = [single, burstStart, burstCont, burstIdle];
enum TBurstIncDec, values = [increment, decrement];
enum TBurstComp, values = [burstInitiator, burstTarget, burstBus];
enum TBurstWrap, values = [incremental, wrapBurstSize, wrapDataSize];
enum TProtectionType, values = [user, privileged];
enum TReqMode, values = [reqOneCycle, reqUntilGrant, reqUntilUnreq];

transaction GPTransaction
{
  transfer addrTrf,
    sender = initiator,
    receiver = target
  {
    attribute address, type = bits<32>;
    attribute type, type = TTransactionType;
    attribute accessSize, type = bits<32>;
    attribute kind, type = TKind;
    attribute group, type = TGroup;
    attribute burstLength, type = bits<32>;
    attribute burstAddr, type = bits<32>;
    attribute burstIncDec, type = TBurstIncDec;
    attribute burstWrap, type = TBurstWrap;
    attribute cacheable, type = bool;
    attribute bufferable, type = bool;
    attribute protectionType, type = TProtectionType;
    attribute masterId, type = bits<32>;
  };
  transfer readDataTrf,
    sender = target,
    receiver = initiator
  {
    attribute readData, type = bits<64>;
  };
  transfer writeDataTrf,
    sender = initiator,
    receiver = target
  {
    attribute writeData, type = bits<64>;
  };
  transfer eotTrf,
    sender = target,
    receiver = initiator
  {
    attribute status, type = TStatus;
  };
  transfer splitResumeTrf,
    sender = target,
    receiver = bus
  {
    attribute splitResume, type = bits<32>;
  };
  transfer lockTrf,
    sender = initiator,
    receiver = target
  {
    attribute lock, type = bool;
  };
  transfer reqTrf,
    sender = initiator,
    receiver = bus
  {
    attribute reqMode, type = TReqMode;
  };
  transfer unreqTrf, sender = initiator, receiver = bus;
  transfer grantTrf, sender = bus, receiver = initiator;
  transfer cancelTrf, sender = initiator, receiver = target;
  transfer shiftTrf, sender = target, receiver = initiator;
  transfer earlyAddrTrf,
    sender = bus,
    receiver = target
  {
    attribute outstandingData, type = bits<64>;
  };
};
A.2 Derived AMBA 2.0 Protocols

The file amba_protocols.bci introduces all protocols, each with the subset of generic transfers and transfer attributes it comprises. In the case of the AMBA 2.0 family, the protocols defined in this section are DefSlaveTarget, AHBArbiter, BMArbiter, APBInitiator, APBTarget, AHBInitiator, AHBTarget, AHBLiteInitiator, AHBLiteTarget and MLTarget. This section does not define new transfers or attributes; it just specializes the generic protocol for the AMBA protocols named before. Thus, the file amba_protocols.bci is not explicitly printed here.
A.3 AMBA 2.0 Bus Interface Specification

This is the new section which specifies how the generated TLM adaptor implements the cycle accurate LISA bus/memory API functions request_read(), request_write(), try_read() and could_write(). Below, the specification for the basic non-blocking 2-phase TLM interface is shown. Once a request is invoked, no delayed information like address or data can be delivered any more. An extended interface specification, not shown here, uses the command() extension to allow supplementing an already running transaction with additional data. The file businterface.bci is the same for all AMBA 2.0 initiator protocols, namely AHBInitiator, AHBLiteInitiator and APBInitiator (cf. Section 7.3).
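Before the listing itself, a brief usage sketch may help. From the processor simulator's perspective, the generated adaptor is driven as follows (a hedged illustration of the call protocol; only the function names and MA_* codes appear in the specification below, while the Adaptor stand-in, its numeric enum values and the polling loop are assumptions):

// Hedged usage sketch of the 2-phase LISA bus/memory API (cf. the listing
// below). The Adaptor interface is an illustrative stand-in for the
// generated adaptor; the enum values are assumptions.
enum { MA_OK = 0, MA_SIG_WAIT = 1, MA_ERR_BUSY = -1 };

struct Adaptor {                       // stand-in for the generated adaptor
    virtual int request_write(unsigned long addr, unsigned long* d, int n) = 0;
    virtual int could_write (unsigned long addr, unsigned long* d, int n) = 0;
    virtual void clock() = 0;          // advances the bus interface one cycle
    virtual ~Adaptor() {}
};

int write_word(Adaptor& bus, unsigned long addr, unsigned long value)
{
    // Phase 1: allocate a buffer entry and start the state machine sequence.
    if (bus.request_write(addr, &value, 1) != MA_OK)
        return MA_ERR_BUSY;            // no free buffer entry this cycle

    // Phase 2: poll once per clock cycle until the transaction has finished.
    int status;
    while ((status = bus.could_write(addr, &value, 1)) == MA_SIG_WAIT)
        bus.clock();                   // processor stalls for one bus cycle
    return status;                     // MA_OK, or an MA_ERR_* code
}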
//file businterface.bci
busInterface LTCWR_AMBA,
  connect = [AHBInitiator, AHBLiteInitiator, APBInitiator],
  headerFiles = [ "LTCWR_PSP_TLM_Base.h" ],
  parentClass = CLTCWR_TLM_Adaptor_Base
{
  type DType, template = true, forward = true, value = unsigned long int;
  type AType, template = true, forward = true, value = unsigned long int;

  parameter pPSP_API, forward = true, type = CLTCWR_PSP_TLM_Base *, value = 0;
  parameter cwrCoreId, forward = true, type = const char *, value = 0;
  parameter name_port, forward = true, type = const char *, value = 0;

  function sub_block_to_mas, extern = true;
  function record_transaction, extern = true;
  function base_command, extern = true;
  function LTCWR_GET_BYTE_ADDRESS_IN_BURST, extern = true;
  function reset_statemachine, generateResetFunction = true;
  function post_cycle_processing, generateStatemachine = [send, extra];
  function pre_cycle_processing, generateStatemachine = [receive, extra];
  function full_cycle_processing, generateStatemachine = [send, receive, extra];

  const MA_SIG_WAIT, type = int, extern = true;
  const MA_OK, type = int, extern = true;
  const MA_ERR_BUSY, type = int, extern = true;
  const MA_ERR_NORQ, type = int, extern = true;
  const MA_ERR_RANGE, type = int, extern = true;

  variable m_TlmStatus, type = int, extern = true;
  variable m_plain, type = unsigned long, extern = true;
  variable m_mask, type = cwrMas, extern = true;
  variable m_shift, type = unsigned long, extern = true;
  variable m_AdaptorID, type = string, extern = true;

  protocolParameter
  {
    parameter address_width, value = 32;
    parameter data_width, forward = data_width, value = 32;
  };

  stateMachine default1,
    normalOrderProcessing = true,
    extraStates = [ burstContState, finishState ]
  {
    // state sequences
    sequence seq_single_read,
      value = [ reqTrf, addrTrf; eotTrf, readDataTrf, finishState ],
      ready = eotTrf;
    sequence seq_first_read,
      value = [ reqTrf, addrTrf; eotTrf, readDataTrf, finishState ];
    sequence seq_burst_read,
      value = [ burstContState; addrTrf; eotTrf, readDataTrf, finishState ];
    sequence seq_last_read,
      value = [ burstContState; addrTrf; eotTrf, readDataTrf, unreqTrf, finishState ],
      ready = eotTrf;
    sequence seq_single_write,
      value = [ reqTrf, addrTrf, writeDataTrf, eotTrf, finishState ],
      ready = eotTrf;
    sequence seq_first_write,
      value = [ reqTrf, addrTrf, writeDataTrf, eotTrf, finishState ];
    sequence seq_burst_write,
      value = [ burstContState; addrTrf, writeDataTrf, eotTrf, finishState ];
    sequence seq_last_write,
      value = [ burstContState; addrTrf, writeDataTrf, eotTrf, unreqTrf, finishState ],
      ready = eotTrf;

    // additional buffer members
    extraBufferEntry mask, type = cwrMas;
    extraBufferEntry shift, type = unsigned long;

    // debugging & analysis
    debugging,
      namePathVariable = m_AdaptorID,
      states = [ addrTrf, writeDataTrf, readDataTrf ],
      print = [ addrTrf.address, writeDataTrf.writeData, readDataTrf.readData ];
    analysis,
      namePathVariable = m_AdaptorID,
      counter = [ addrTrf ],
      timing = [ addrTrf.address, writeDataTrf.writeData, readDataTrf.readData ];

    // state machine rules
    rules
    {
      feedData, behavior = [
        port.writeDataTrf.writeData =
          buffer[current].writeDataTrf.writeData >> buffer[current].shift;
        buffer[current].readDataTrf.readData =
          port.readDataTrf.readData << buffer[current].shift;
      ];
      catch eot_retry, state = eotTrf,
        condition = (buffer[current].eotTrf.status == retry),
        behavior = [
          buffer[first].state = cancelTrf;
          buffer[others].state = burstContState;
        ];
      catch eot_error, state = eotTrf,
        condition = (buffer[current].eotTrf.status == error),
        behavior = [ m_TlmStatus = MA_ERR_RANGE; ];
      catch cancel_behavior, state = cancelTrf,
        condition = true,
        behavior = [ buffer[current].state = addrTrf; ];
      catch addr_extra, state = addrTrf,
        oncePerCycle = true,
        condition = (buffer[next].state == burstContState),
        behavior = [ buffer[next].state = addrTrf; ];
      catch write_trace, state = writeDataTrf,
        crossCheck = [ addrTrf.address ],
        behavior = [
          void = record_transaction(
            buffer[current].addrTrf.address,
            buffer[current].mask,
            buffer[current].writeDataTrf.writeData,
            false);
        ];
      catch read_trace, state = readDataTrf,
        crossCheck = [ addrTrf.address ],
        behavior = [
          void = record_transaction(
            buffer[current].addrTrf.address,
            buffer[current].mask,
            buffer[current].readDataTrf.readData,
            true);
        ];
      catch burst_optimize, state = burstContState,
        proceed = skip;
    };
  };

  function request_read, returnType = int
  {
    parameter addr, type = AType;
    parameter data, type = DType*, value = 0;
    parameter n, type = int, value = 1;
    parameter sb, type = int, value = -1;
    parameter len, type = int, value = -1;
    section main, condition = true
    {
      buffer, type = allocateNew, size = n,
        behaviorFailed = [ return = MA_ERR_BUSY; ];
      implementation, behavior = [
        void = sub_block_to_mas(sb, len);
        buffer[current].addrTrf.accessSize = m_plain;
        buffer[current].mask = m_mask;
        buffer[current].shift = m_shift;
        buffer[current].reqTrf.reqMode =
          if ( n==1 ) then reqUntilGrant else reqUntilUnreq;
        buffer[current].addrTrf.address =
          LTCWR_GET_BYTE_ADDRESS_IN_BURST(addr,current);
        buffer[current].readDataTrf.readData = data[current];
        buffer[current].addrTrf.burstLength = n;
        buffer[current].addrTrf.burstWrap = incremental;
        buffer[current].addrTrf.burstIncDec = increment;
        buffer[current].addrTrf.type = readAtAddress;
        buffer[current].addrTrf.group =
          if (current == 0) then burstStart else burstCont;
        buffer[current].burstitem = current;
        buffer[current].sequence =
          if ( n==1 ) then seq_single_read
          elseif ( current == 0 ) then seq_first_read
          elseif ( current == n-1 ) then seq_last_read
          else seq_burst_read;
        m_TlmStatus = MA_OK;
        return = MA_OK;
      ];
    };
  };

  function request_write, returnType = int
  {
    parameter addr, type = AType;
    parameter data, type = DType*, value = 0;
    parameter n, type = int, value = 1;
    parameter sb, type = int, value = -1;
    parameter len, type = int, value = -1;
    section main, condition = true
    {
      buffer, type = allocateNew, size = n,
        behaviorFailed = [ return = MA_ERR_BUSY; ];
      implementation, behavior = [
        void = sub_block_to_mas(sb, len);
        buffer[current].addrTrf.accessSize = m_plain;
        buffer[current].mask = m_mask;
        buffer[current].shift = m_shift;
        buffer[current].reqTrf.reqMode =
          if ( n==1 ) then reqUntilGrant else reqUntilUnreq;
        buffer[current].addrTrf.address =
          LTCWR_GET_BYTE_ADDRESS_IN_BURST(addr,current);
        buffer[current].writeDataTrf.writeData = data[current];
        buffer[current].addrTrf.burstLength = n;
        buffer[current].addrTrf.burstWrap = incremental;
        buffer[current].addrTrf.burstIncDec = increment;
        buffer[current].addrTrf.type = writeAtAddress;
        buffer[current].addrTrf.group =
          if (current == 0) then burstStart else burstCont;
        buffer[current].burstitem = current;
        buffer[current].sequence =
          if ( n==1 ) then seq_single_write
          elseif ( current == 0 ) then seq_first_write
          elseif ( current == n-1 ) then seq_last_write
          else seq_burst_write;
        m_TlmStatus = MA_OK;
        return = MA_OK;
      ];
    };
  };

  function try_read, returnType = int
  {
    parameter addr, type = AType;
    parameter data, type = DType*, value = 0;
    parameter n, type = int, value = 1;
    parameter sb, type = int, value = -1;
    parameter len, type = int, value = -1;
    section main, condition = true
    {
      buffer, type = searchPending,
        condition = ( ( buffer[first].addrTrf.address == addr)
                    & ( buffer[first].addrTrf.type == readAtAddress) ),
        size = n;
      implementation,
        condition = ( (buffer[first].state != finishState )
                    | (buffer[index(n-1)].state != finishState) ),
        behavior = [ return = MA_SIG_WAIT; ];
      implementation,
        behavior = [
          data[current] = buffer[current].readDataTrf.readData;
          return = MA_OK;
        ],
        discardOlderBuffers = true,
        discardBurstUntil = (n-1);
    };
    section reverse, condition = true
    {
      buffer, type = searchReverse, steps = 3,
        condition = ( ( buffer[first].addrTrf.address == addr)
                    & ( buffer[first].addrTrf.type == readAtAddress) ),
        size = n,
        behaviorFailed = [ return = MA_ERR_NORQ; ];
      implementation,
        condition = ( (buffer[first].state != finishState )
                    | (buffer[index(n-1)].state != finishState) ),
        behavior = [ return = MA_SIG_WAIT; ];
      implementation,
        behavior = [
          data[current] = buffer[current].readDataTrf.readData;
          return = MA_OK;
        ];
    };
  };

  function could_write, returnType = int
  {
    parameter addr, type = AType;
    parameter data, type = DType*, value = 0;
    parameter n, type = int, value = 1;
    parameter sb, type = int, value = -1;
    parameter len, type = int, value = -1;
    section main, condition = true
    {
      buffer, type = searchPending,
        condition = ( ( buffer[first].addrTrf.address == addr)
                    & ( buffer[first].addrTrf.type == writeAtAddress) ),
        size = n;
      implementation,
        condition = ( (buffer[first].state != finishState )
                    | (buffer[index(n-1)].state != finishState) ),
        behavior = [ return = MA_SIG_WAIT; ];
      implementation,
        behavior = [ return = MA_OK; ],
        discardOlderBuffers = true,
        discardBurstUntil = (n-1);
    };
    section reverse, condition = true
    {
      buffer, type = searchReverse, steps = 3,
        condition = ( ( buffer[first].addrTrf.address == addr)
                    & ( buffer[first].addrTrf.type == writeAtAddress) ),
        size = n,
        behaviorFailed = [ return = MA_ERR_NORQ; ];
      implementation,
        condition = ( (buffer[first].state != finishState )
                    | (buffer[index(n-1)].state != finishState) ),
        behavior = [ return = MA_SIG_WAIT; ];
      implementation,
        behavior = [ return = MA_OK; ];
    };
  };

  function command, returnType = int
  {
    parameter opcode, type = MA_COMMAND;
    parameter p1, type = uint, value = 0;
    parameter p2, type = uint, value = 0;
    parameter p3, type = uint, value = 0;
    parameter p4, type = uint, value = 0;
    section main, condition = true
    {
      implementation,
        condition = ( opcode == 12000 ),
        behavior = [
          void = reset_statemachine();
          return = 1;
        ];
      implementation,
        behavior = [
          return = base_command(opcode, p1, p2, p3, p4);
        ];
    };
  };
};
Appendix B

Extended CoWare Tool Flow

The bus interface generator presented in Chapter 7 is not only useful for a top-down refinement flow for fully customized MP-SoC platforms as described in Chapter 6. It also brings the CoWare products ProcessorDesigner, BusCompiler and PlatformArchitect closer together, which is especially useful for CoWare as a TLM IP and tools provider.

As depicted in Figure B.1, the BusCompiler tool generates TLM bus libraries (BL), e.g. for ARM AMBA 2.0 or ARM AXI. ProcessorDesigner is capable of generating processor IP (Processor Support Packages, PSP). In the succeeding PlatformArchitect tools PlatformCreator and SystemCExplorer, the processors and the communication modules have to fit together smoothly. So far, the PSPcompilers for ProcessorDesigner had to be developed manually for every bus protocol. Especially creating the protocol specific bus interface state machines is a very tedious and error-prone task.

The current BusCompiler tool has been extended by three components:

The MSC Library Generator. This tool generates MSC packet classes for the Bus Library. This enables MSC visualization of cycle accurate TLM communication (cf. Section 8.2).

The Bus Interface Generator. This tool builds up a bus API implementation according to a given specification (cf. Section 7.3). One use case is generating the bus interface state machine as needed for the ProcessorDesigner coupling. The generated file is then forwarded to the succeeding PSPcompiler Generator. However, alternative interface specifications, e.g. for AMBA CLI, PV or OCP, can be used to provide additional APIs for the bus library. So far, the only API supported by the communication modules is the CoWare proprietary TLM API.

The PSPcompiler Generator. This tool takes a bus API generated for the LISA bus/memory API as input. It is compiled together with a set of additional files to create the bus protocol specific PSPcompiler.
Figure B.1. CoWare Tool Flow
(Block diagram. Bus family dependent inputs written by CoWare (genprot.bci, ~100 lines; interface.bci, ~500 lines) and bus or processor dependent inputs written by the customer (protocols.bci, ~100 lines/protocol; nodes.bci, ~500 lines/node; processor.lisa, >1000 lines/processor) feed the extended BusCompiler, consisting of the current BusCompiler plus the MSC Library Generator, the BusInterface Generator and the PSPcompiler Generator, as well as ProcessorDesigner. The generated artifacts (BusLibrary (BL), MscLibrary, additional bus APIs, the PSPcompiler (csc_*.so) and the PSPs) are combined with additional SoC modules in PlatformArchitect's PlatformCreator and SystemCExplorer for MP-SoC simulation, programming, analysis/profiling and verification/debugging.)
List of Figures

1.1 The Energy-Flexibility Gap
1.2 Main Book Chapters
2.1 Interface Method Call Principle
2.2 Code Examples: OSCI TLM Core Interfaces
2.3 Basic Design Methodologies
2.4 Abstraction Pyramid
2.5 Abstraction Levels
3.1 Communication Refinement
3.2 Optimal Abstraction Level for the Use Cases
3.3 NoC Channel Overview
3.4 Code Examples: PV Interface and simple bus Master Interface
3.5 Code Examples: OCP TL-2 and OCP TL-1 Interfaces
3.6 Code Examples: CoWare-TLM API and ARM AMBA CLI
3.7 Cycle Accurate Bus Communication Modeling
3.8 AMBA AHB Write Example, Initiator Side
3.9 BusCompiler Input Specification Example
4.1 Virtual Architecture Mapping
4.2 LISA Processor Design Platform
4.3 Instruction Accurate Processor Model
4.4 Cycle Accurate Processor Model
4.5 LISA Code Example
5.1 Standalone LISA Processor Simulator
5.2 LISA Processor Simulator with Bus Interface
5.3 SystemC Processor Simulators with External Data Memory
5.4 LISA Bus/Memory API Definition
5.5 LISA Bus/Memory API Capabilities
5.6 API Mapping: LISA → TLM
5.7 1:1 Adaptor Example: ST R @0x1000, R[0]
5.8 Bus Interface State Machine Example: ST R @0x1000, R[0]
5.9 LISA–PlatformArchitect Integration
5.10 LISA–SystemStudio Integration
6.1 Co-Exploration Methodology
6.2 Successive Refinement Flow
6.3 Phase 1 + Phase 2 Simulation
6.4 LISA Standalone Processor Design Space Exploration
6.5 Phase 3 + Phase 4 Simulation
6.6 Bus Cycle Accurate Processor Model
6.7 Phase 5 Simulation
7.1 Simulator Generation Flow
7.2 Simulator Structure
7.3 Implementing the ADL API
8.1 Multi-Processor Debugging
8.2 Standalone Simulation with Debugger GUI
8.3 System Simulation Executable without Debugger GUI
8.4 Synchronizing User and Scheduler Control
8.5 Synchronizer Control Flow
8.6 Dynamic Connect Debugging
8.7 GNU gdb Connected to a Host Process
8.8 Multi-Level Multiprocessor Debugging
8.9 Transaction Accurate MSC Visualization
8.10 Cycle Accurate MSC Visualization
8.11 Bus Interface Analysis
9.1 Phase 1, C-Profile of the JPEG Application
9.2 Co-Processor Configurations
9.3 Cache Configurations
9.4 Design Space Exploration, Phase 2 (IA+AVF)
9.5 Simulation Speed, Phase 2 (IA+AVF)
9.6 Interconnect Topologies
9.7 Design Space Exploration, Phase 3 (IA+BusCompiler)
9.8 Simulation Speed, Phase 3 (IA+BusCompiler)
9.9 Design Space Exploration, Phase 4 (CA+BusCompiler)
9.10 Simulation Speed, Phase 4 (CA+BusCompiler)
9.11 Data Memory Access from the MIPS Pipeline
9.12 Data Memory Access from the IDCTcore Pipeline
9.13 Design Space Exploration, Phase 5 (BCA+BusCompiler)
9.14 Simulation Speed, Phase 5 (BCA+BusCompiler)
B.1 CoWare Tool Flow
References
[1] Kogel, T., Leupers, R. and Meyr, H. Integrated System-Level Modeling of Network-onChip Enabled Multi-Processor Platforms. Springer, June 2006. ISBN 1-4020-4825-4. [2] Schliebusch, O., Meyr, H. and Leupers, R. Optimized ASIP Synthesis from Architecture Description Language Models. Springer, March 2007. ISBN 1-4020-5685-0. [3] Hoffmann, A., Meyr, H. and Leupers, R. Architecture Exploration for Embedded Processors with LISA. Kluwer Academic Press, December 2002. ISBN 1-4020-7338-0. [4] Moore, G. Cramming more components onto integrated circuits. In Electronics Magazine, April 1965. [5] International Technology Roadmap for Semiconductors. 2005 Edition, Executive Summary, http://public.itrs.net/ [6] Claasen, T.A.C.M. High speed: not the only way to exploit the intrinsic computational power of silicon. In Solid-State Circuits Conference, February 1999. [7] Rowen, C. Engineering the Complex SoC. Prentice Hall, 2004. [8] Blume, H., Hubert, H., Feldkamper, H.T. and Noll, T.G. Model-based exploration of the design space for heterogeneous systems on chip. In IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2002. [9] ARM Processor Cores. Advanced Risc Machines (ARM), http://www.arm.com/ [10] MIPS Processor Cores. MIPS Technologies, http://www.mips.com/ [11] DSP Products. Texas Instruments (TI), http://www.ti.com/ [12] DSP Products. Sandbridge Technologies, http://www.sandbridgetech.com/ [13] Xilinx Virtex FPGA Platforms. Xilinx, http://www.xilinx.com/ [14] Meyr, H. System-on-chip for communications: The dawn of ASIPs and the dusk of ASICs. Keynote Speech of IEEE International Workshop on Signal Processing Systems (SIPS), August 2003. [15] Benini, L. and De Micheli, G. Networks on chips: a new SoC paradigm. IEEE Computer, 35:70–78, January 2002. [16] Sgroi, M., Sheets, M., Mihal, A., Keutzer, K., Malik, S., Rabaey, J. and Sangiovanni-Vincentelli, A. Addressing the system-on-a-chip interconnect woes through communication-based design. In Design Automation Conference, June 2001. [17] Wolf, W. A decade of hardware/software codesign. IEEE Computer, 36:38–43, April 2003. [18] Rowson, J.A. Hardware/software co-simulation. In Design Automation Conference (DAC), June 1994.
[19] Buck, J.T., Ha, S., Lee, E.A. and Messerschmitt, D.G. Ptolemy: a framework for simulating and prototyping heterogeneous systems. In International Journal of Computer Simulation, April 1994. [20] Seamless C.V.E. Mentor Graphics, http://www.mentor.com [21] Wieferink, A. Verification and performance analysis of telecommunication systems by fast HW/SW co-simulation. Diploma Thesis, June 2000. [22] Hoffmann, A., Kogel, T. and Meyr, H. A framework for fast hardware-software cosimulation. In Design, Automation and Test in Europe Conference (DATE), 2001. [23] Pees, S., Hoffmann, A. and Meyr, H. Retargetable compiled simulation of embedded processors using a machine description language. IEEE Transactions on Design Automation of Electronic Systems, 5(4), October 2000. [24] Mueller, A., Kogel, T. and Post, G. Methodology for ATM-cell processing system design. In 12th Annual 1999 IEEE International ASIC/SOC Conference, Washington, DC, September 1999. [25] Gupta, R.K. and De Micheli, G. Hardware-software cosynthesis for digital systems. IEEE Design & Test of Computers, 10:29–41, September 1993. [26] Ernst, R., Henkel, J. and Benner, T. Hardware-software cosynthesis for microcontrollers. IEEE Design & Test of Computers, 10:64–75, December 1993. [27] Camposano, R. Behavioral synthesis. In Design Automation Conference (DAC), 1996. [28] Daveau, J.-M., Ismail, T.B. and Jerraya, A.A. Synthesis of system-level communication by an allocation-based approach. In International Symposium on System Synthesis (ISSS), 1995. [29] Chou, P.H., Ortega, R.B. and Borriello, G. The Chinook hardware/software co-synthesis system. In International Symposium on System Synthesis (ISSS), 1995. [30] Van Rompaey, K., Verkest, D., Bolsens, I. and De Man, H. CoWare-a design environment for heterogeneous hardware/software systems. In European Design Automation Conference (EURO-DAC), 1996. [31] Bolsens, I., De Man, H.J., Lin, B., Van Rompaey, K., Vercauteren, S. and Verkest, D. Hardware/software co-design of digital telecommunication systems. IEEE Proceedings, 85:391–418, March 1997. [32] Interuniversity Micro Electronics Center (IMEC). http://www.imec.be [33] Napkin-to-Chip (N2C). CoWare, http://www.coware.com [34] Rowson, J.A. and Sangiovanni-Vincentelli, A. Interface-based design. In Proceedings of the Design Automation Conference (DAC), 1997. [35] Virtual Component Interface (VCI). Virtual Socket International Alliance (VSIA), http://www.vsia.org [36] Open SystemC Initiative (OSCI). http://www.systemc.org [37] Groetker, T., Liao, S., Martin, G. and Swan, S. System Design with SystemC. Kluwer Academic Publishers, 2002. [38] Rose, A., Swan, S., Pierce, J. and Fernandez, J.M. Transaction Level Modeling in SystemC. White Paper, Open SystemC Initiative (OSCI), June 2005. http://www.systemc.org [39] Open Core Protocol (OCP). Open Core Protocol International Partnership (OCP-IP), http://www.ocpip.org [40] Haverinen, A., Leclercq, M., Weyrich, N. and Wingard, D. SystemC based SoC communication modeling for the OCP protocol. White Paper, OCP-IP, October 2002. http://www.ocpip.org [41] Kogel, T., Haverinen, A. and Aldis, J. SystemC based SoC communication modeling for the OCP protocol. White Paper, OCP-IP, September 2005. http://www.ocpip.org
[42] IP-XACT. The SPIRIT Consortium, http://www.spiritconsortium.org [43] Pimentel, A.D., Hertzbetger, L.O., Lieverse, P., van der Wolf, P. and Deprettere, E.E. Exploring embedded-systems architectures with Artemis. IEEE Computer, 34:57–63, November 2001. [44] Kienhuis, B., Deprettere, E., Vissers, K. and Van Der Wolf, P. An approach for quantitative analysis of application-specific dataflow architectures. In IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 1997. [45] Keutzer, K., Newton, A.R., Rabaey, J.M. and Sangiovanni-Vincentelli, A. System-level design: orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 19:1523–1543, December 2000. [46] Cesario, W.O., Lyonnard, D., Nicolescu, G., Paviot, Y., Sungjoo Yoo, Jerraya, A.A., Gauthier, L. and Diaz-Nava, M. Multiprocessor SoC platforms: a component-based design approach. IEEE Design and Test of Computers, 19:52–63, November 2002. [47] Rutten, M.J., van Eijndhoven, J.T.J., Jaspers, E.G.T., van der Wolf, P., Gangwal, O.P., Timmer, A. and Pol, E.-J.D. A heterogeneous multiprocessor architecture for flexible media processing. IEEE Design and Test of Computers, 19:39–50, July 2002. [48] Loghi, M., Angiolini, F., Bertozzi, D., Benini, L. and Zafalon, R. Analyzing on-chip communication in a MPSoC environment. In Design, Automation and Test in Europe Conference (DATE), 2004. [49] Paulin, P.G., Pilkington, C. and Bensoudane, E. StepNP: a system-level exploration platform for network processors. IEEE Design and Test of Computers, 19:17–26, November 2002. [50] Bertozzi, D., Jalabert, A., Murali, S., Tamhankar, R., Stergiou, S., Benini, L. and De Micheli, G. NoC synthesis flow for customized domain specific multiprocessor systemson-chip. IEEE Transactions on Parallel and Distributed Systems, 16:113–129, February 2005. [51] Sarmento, A., Cesario, W. and Jerraya, A.A. Automatic building of executable models from abstract SoC architectures made of heterogeneous subsystems. In 15th IEEE International Workshop on Rapid System Prototyping (RSP), 2004. [52] Yoo, S., Bacivarov, I., Bouchhima, A., Paviot, Y. and Jerraya, A.A. Building fast and accurate SW simulation models based on hardware abstraction layer and simulation environment abstraction layer. In Design, Automation and Test in Europe Conference (DATE), 2003. [53] Kahn, G. The semantics of a simple language for parallel programming. In Proceedings of Information Processing, 1974. [54] Jalabert, A., Murali, S., Benini, L. and De Micheli, G. X-pipes compiler: a tool for instantiating application specific networks on chip. In Design, Automation and Test in Europe Conference (DATE), 2004. [55] Murali, S. and De Micheli, G. SUNMAP: a tool for automatic topology selection and generation for NoCs. In Design Automation Conference (DAC), 2004. [56] Angiolini, F., Ceng, J., Leupers, R., Ferrari, F., Ferri, C. and Benini, L. An integrated open framework for heterogeneous MPSoC design space exploration. In Design, Automation and Test in Europe Conference (DATE), 2006. [57] Kohler, E., Morris, R., Chen, B., Jannotti, J. and Kaashoek, M.F. The click modular router. ACM Transactions on Computer Systems, 18(3):263–297, August 2000. [58] Quinn, D., Lavigueur, B., Bois, G. and Aboulhamid, M. A system level exploration platform and methodology for network applications based on configurable processors. In Design, Automation and Test in Europe Conference (DATE), 2004.
[59] Lieverse, P., van der Wolf, P., Deprettere, E. and Vissers, K. A methodology for architecture exploration of heterogeneous signal processing systems. In IEEE Workshop on Signal Processing Systems (SiPS), 1999. [60] Pimentel, A.D., Erbas, C. and Polstra, S. A systematic approach to exploring embedded system architectures at multiple abstraction levels. IEEE Transactions on Computers, 55:99–112, Feb 2006. [61] Pimentel, A.D. and Erbas, C. An IDF-based trace transformation method for communication refinement. In Design Automation Conference, 2003. [62] Zivkovic, V.D., Deprettere, E., van der Wolf, P. and de Kock, E. Design space exploration of streaming multiprocessor architectures. In IEEE Workshop on Signal Processing Systems, 2002. [63] Zivkovic, V.D., Deprettere, E., de Kock, E. and van der Wolf, P. Fast and accurate multiprocessor architecture exploration with symbolic programs. In Design, Automation and Test in Europe Conference (DATE), 2003. [64] Balarin, F., Watanabe, Y., Hsieh, H., Lavagno, L., Passerone, C. and Sangiovanni-Vincentelli, A. Metropolis: an integrated electronic system design environment. Computer, 36:45–52, April 2003. [65] Balarin, F. et al. Hardware/Software Co-Design of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, 1999. [66] Gries, M. and K. Keutzer (Eds.). Building ASIPs: The Mescal Methodology. Springer, 2005. [67] Gajski, A.D., Zhu, J., Doemer, R., Gerstlauer, A. and Zhao, S. SpecC: Specification Language and Methodology. Kluwer Academic Publishers, 2000. [68] Peng, J., Abdi, S. and Gajski, D. Automatic model refinement for fast architecture exploration. In Asia and South Pacific Design Automation Conference (ASP-DAC), 2002. [69] Abdi, S., Shin, D. and Gajski, D. Automatic communication refinement for system level design. In Design Automation Conference (DAC), 2003. [70] Gerstlauer, A., Haobo Yu and Gajski, D.D. RTOS modeling for system level design. In Design, Automation and Test in Europe Conference (DATE), 2003. [71] Shin, D., Gerstlauer, A., Doemer, R. and Gajski, D.D. Automatic network generation for system-on-chip communication design. In Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2005. [72] CoWare Inc. http://www.coware.com [73] Kogel, T., Doerper, M., Wieferink, A., Leupers, R., Ascheid, G., Meyr, H. and Goossens, S. A modular simulation framework for architectural exploration of on-chip interconnection networks. In The First IEEE/ACM/IFIP International Conference on HW/SW Codesign and System Synthesis, Newport Beach (California USA), October 2003. [74] Kempf, T., Doerper, M., Leupers, R., Ascheid, G., Meyr, H., Kogel, T. and Vanthournout, B. A modular simulation framework for spatial and temporal task mapping onto multi-processor SoC platforms. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE), Munich, Germany, March 2005. [75] Vanthournout, B., Goossens, S. and Kogel, T. Developing Transaction-level Models in SystemC. White Paper, CoWare Inc., August 2004. www.coware.com [76] Advanced Risc Machines (ARM) Ltd. http://www.arm.com [77] AMBA Specification, Rev. 2.0. ARM IHI 0011A, http://www.arm.com [78] AMBA AXI Protocol, v1.0, Specification. ARM IHI 0022B, http://www.arm.com
[79] SMART Interconnects. Sonics Inc., http://www.sonicsinc.com [80] NoC Solution. Arteris SA, http://www.arteris.com [81] Cochrane, A., Lennard, C., and Topping K. et al. AMBA AHB Cycle Level Interface (AHB CLI) Specification, 2003. [82] ARM RealView ESL API v2.0 Developer’s Guide. ARM DUI 0359B, http://www.arm.com [83] PlatformArchitect. CoWare, http://www.coware.com [84] Kogel, T., Wieferink, A., Leupers, R., Ascheid, G., Meyr, H., Bussaglia, D. and Ariyamparambath, M. Virtual architecture mapping: a systemC based methodology for architectural exploration of system-on-chip designs. In International Workshop on Systems, Architectures, Modeling and Simulation (SAMOS), Samos (Greece), July 2003. [85] Leupers, R., Wahlen, O., Hohenauer, M., Kogel, T. and Marwedel, P. An executable intermediate representation for retargetable compilation and high-level code optimization. In International Workshop on Systems, Architecturs, Modeling and Simulation(SAMOS), Samos(Greece), July 2003. [86] Karuri, K., Al Faruque, M.A., Kraemer, S., Leupers, R., Ascheid, G. and Meyr, H. Finegrained application source code profiling for ASIP design. In 42nd Design Automation Conference, Anaheim, California, USA, June 2005. [87] Kempf, T., Karuri, K., Wallentowitz, S., Ascheid, G., Leupers, R. and Meyr, H. A SW performance estimation framework for early system-level-design using fine-grained instrumentation. In Design, Automation and Test in Europe Conference (DATE), 2006. [88] MIPS Technologies (MIPS) Inc. http://www.mips.com [89] Tensilica Inc. http://www.tensilica.com [90] Stretch Inc. http://www.stretchinc.com [91] Hadjiyiannis, G. and Devadas, S. Techniques for accurate performance evaluation in architecture exploration. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11:601–615, August 2003. [92] Fauth, A., Van Praet, J. and Freericks, M. Describing instruction set processors using nML. In Proceedings of the European Design and Test Conference (ED&TC), March 1995. [93] Leupers, R. HDL-based modeling of embedded processor behavior for retargetable compilation. In 11th International Symposium on System Synthesis (ISSS), September 1998. [94] Grun, P., Halambi, A., Dutt, N. and Nicolau, A. Rtgen-an algorithm for automatic generation of reservation tables from architectural descriptions. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11:731–737, August 2003. [95] Fauth, A. Beyond tool specific machine descriptions. In P. Marwedel, G. Goossens (eds), Code Generation for Embedded Processors. Kluwer Academic Publishers, 1995. [96] Target Compiler Technologies N.V. http://www.retarget.com [97] Rajesh, V. and Moona, R. Processor modeling for hardware software codesign. In 12th International Conference On VLSI Design, January 1999. [98] Chandra, S. and Moona, R. Retargetable functional simulator using high level processor models. In 13th International Conference on VLSI Design, January 2000. [99] Moona, R. Processor models for retargetable tools. In Workshop on Rapid System Prototyping (RSP), June 2000. [100] Basu, S. and Moona, R. High level synthesis from Sim-nML processor models. In 16th International Conference On VLSI Design, January 2003.
[101] Halambi, A., Grun, P., Ganesh, V., Khare, A., Dutt, N. and Nicolau, A. EXPRESSION: a language for architecture exploration through compiler/simulator retargetability. In Proceedings of the Design, Automation and Test in Europe Conference (DATE), March 1999.
[102] Mishra, P., Grun, P., Dutt, N. and Nicolau, A. Processor-memory co-exploration driven by an architectural description language. In 14th International Conference on VLSI Design, 2001.
[103] Zivojnovic, V., Pees, S. and Meyr, H. LISA – machine description language and generic machine model for HW/SW co-design. In Proceedings of the IEEE Workshop on VLSI Signal Processing, San Francisco, October 1996.
[104] Schliebusch, O., Chattopadhyay, A., Kammler, D., Ascheid, G., Leupers, R., Meyr, H. and Kogel, T. A framework for automated and optimized ASIP implementation supporting multiple hardware description languages. In Asia and South Pacific Design Automation Conference (ASP-DAC), Shanghai, China, January 2005.
[105] Nohl, A., Braun, G., Hoffmann, A., Schliebusch, O., Meyr, H. and Leupers, R. A universal technique for fast and flexible instruction-set architecture simulation. In Proceedings of the Design Automation Conference (DAC), New Orleans, June 2002. ACM.
[106] Hohenauer, M., Scharwaechter, H., Karuri, K., Wahlen, O., Kogel, T., Leupers, R., Ascheid, G., Meyr, H., Braun, G. and van Someren, H. A methodology and tool suite for C compiler generation from ADL processor models. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE), Paris, France, February 2004.
[107] Wieferink, A., Kogel, T., Braun, G., Nohl, A., Leupers, R., Ascheid, G. and Meyr, H. A system level processor/communication co-exploration methodology for multi-processor system-on-chip platforms. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE), Paris, France, February 2004.
[108] Wieferink, A., Kogel, T., Nohl, A., Hoffmann, A., Leupers, R. and Meyr, H. A generic toolset for SoC multiprocessor debugging and synchronisation. In IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), The Hague, Netherlands, June 2003.
[109] Wieferink, A., Kogel, T., Hoffmann, A., Zerres, O. and Nohl, A. SoC integration of programmable cores. In International Workshop on IP-Based SoC Design, Grenoble, France, November 2003.
[110] Braun, G., Wieferink, A., Schliebusch, O., Leupers, R., Meyr, H. and Nohl, A. Processor/memory co-exploration on multiple abstraction levels. In Design, Automation and Test in Europe Conference (DATE), Munich, March 2003.
[111] CoCentric System Studio. Synopsys, http://www.synopsys.com
[112] Kunkel, J., Meyr, H. and Ascheid, G. COSSAP: communication system simulation and analysis package. In 2nd IEEE Workshop on Computer-Aided Modeling, Analysis and Design (CAMAD) of Communication Links and Networks, University of Massachusetts, Amherst, October 1988.
[113] Wieferink, A., Doerper, M., Kogel, T., Leupers, R., Ascheid, G. and Meyr, H. Early ISS integration into network-on-chip designs. In International Workshop on Systems, Architectures, Modeling and Simulation (SAMOS), Samos, Greece, July 2004.
[114] Wieferink, A., Doerper, M., Kogel, T., Braun, G., Nohl, A., Leupers, R., Ascheid, G. and Meyr, H. A system level processor/communication co-exploration methodology for multi-processor system-on-chip platforms. IEE Proceedings: Computers & Digital Techniques, 152(1):3–11, January 2005.
[115] Witte, E.M., Chattopadhyay, A., Schliebusch, O., Kammler, D., Leupers, R., Ascheid, G. and Meyr, H. Applying resource sharing algorithms to ADL-driven automatic ASIP implementation. In IEEE International Conference on Computer Design (ICCD), San Jose, California, USA, October 2005.
[116] Bauwens, B. Specification and implementation of automatic memory interface generation within the framework of the RTL-processor-synthesis. Diploma Thesis, December 2004.
[117] Wieferink, A., Michiels, T., Kogel, T., Nohl, A., Leupers, R., Ascheid, G. and Meyr, H. Retargetable generation of TLM bus interfaces for MP-SoC platforms. In 3rd IEEE/ACM/IFIP International Conference on HW/SW Codesign and System Synthesis (CODES+ISSS), Jersey City, NJ, USA, September 2005.
[118] Stroustrup, B. The C++ Programming Language. Addison Wesley, third edition, 1997.
[119] Wieferink, A., Kogel, T., Zerres, O., Nohl, A., Leupers, R. and Meyr, H. SoC multiprocessor debugging and synchronization using generic dynamic-connect debugger frontends. International Journal on Embedded Systems (IJES), 1, September 2005.
[120] The Qt Library. http://www.trolltech.com
[121] The GNU Project Debugger (gdb). http://www.gnu.org/software/gdb/
[122] GNU Data Display Debugger (ddd). http://www.gnu.org/software/ddd/
[123] Analysis manual: instrumenting the code, running the simulation, postprocessing. CoWare PlatformArchitect Product Family, http://www.coware.com
[124] Official JPEG homepage. http://www.jpeg.org
[125] GNU Compiler Collection (gcc). Available at: http://gcc.gnu.org
[126] The dietlibc library. Available at: http://www.fefe.de/dietlibc/
[127] Amdahl, G.M. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, Vol. 30, pp. 483–485, April 1967.
Index
Abstraction Levels, 11
  LISA, 50
  Transaction Level Modeling (TLM), 26
Abstraction Pyramid, 17
Adaptor, 56, 84
AMBA
  AHB, 32, 84, 112, 117
  AHBLite, 32, 84, 117
  APB, 32
  AXI, 33, 84, 117
Amdahl's Law, 117
API
  LISA Bus/Memory API, 57
  LISA Simulator API, 93
  Mapping, 60
  Platform API, 18
  TLM Module API, 59
Architect's View (AV), 26
Architect's View Framework (AVF), 22, 29
  Example Platforms, 112
  Exploration, 68
  LISA Coupling, 72
Architecture Description Language (ADL), 45
ARM, 43
  AMBA CLI, 35
  AMBA Families, 32, 112
  ARMulator, 92
  Axys Design, 48
  Processor Cores, 43
  RealView ESL API, 35
ARTEMIS, 20
Arteris, 33
ASIC, 3
ASIP, 4
Bottom-Up Design, 18
BusCompiler Tool, 36
  Cycle Accurate Communication Modeling, 37
  Example Platforms, 116
  Generated API, 35
  Input Specification, 38, 131
  LISA Coupling, 73
  Tool Flow, 147
Bus Interface
  Analysis, 105
  Optimization, 105
  Specification, 85, 134
  State Machine, 61, 85, 147
  Synthesis, 128
Co-Exploration Methodology, 67
Communication Based Design, 19
Component Based Design, 19
Cosyma, 9
CoWare
  Analysis Tools, 105
  AVF, 22
  BusCompiler, 36, 147
  LISATek, 48
  N2C, 10
  PlatformArchitect, 63, 147
  PlatformCreator GUI, 84
  ProcessorDesigner, 48, 147
  SCML, 63
Cycle Approximation, 129
Design Reuse, 12
Design Space Exploration, 17
Domain Specific Processor, 2
Dynamic Connect, 98
Eclipse, 19
EXPRESSION, 47
FPGA, 3, 44
Functional View (FV), 26
General Purpose Processor, 2
Generic Assembly Level, 42
GNU
  ddd, 100
  gcc, 109
  gdb, 100
Hardware Description Language, 3, 8, 11, 49
HW/SW Co-Design, 7
  Automatic Synthesis, 9
  HW/SW Co-Simulation, 8
IDCT, 109
Interface Method Call (IMC), 12
IP-XACT, 15
ISDL, 47
ISS, 8, 28, 43, 49, 92
JPEG Decompression, 109
LANCE Compiler, 43
Latency, 111
LISA, 47
  Abstraction Levels, 50
  AVF Coupling, 72
  BusCompiler Coupling, 73
  Bus Cycle Accuracy (BCA), 77, 122
  Bus Interface, 54
  Bus/Memory API, 57
  Code Example, 51
  Design Space Exploration, 70
  Example Platforms, 112
  Introducing a Pipeline, 75, 119
  LISA Processor Design Platform (LPDP), 48
  Processor Synthesis, 49, 78
  Simulator API, 93
  Standalone Simulator, 54, 92
  SystemC Simulator, 56, 84
Mentor Seamless, 8
Message Layer, 15
Message Sequence Chart (MSC), 102
Metropolis, 21
MIMOLA, 47
MIPS, 44, 109
Model of Computation (MoC), 8, 64
Moore's Law, 1
MPARM, 20
Multi-Processor Debugger, 91
MP-SoC, 4
  Synchronization, 95
Native Execution, 41
NetChip, 19
Network-on-Chip (NoC), 4
NML, 46
Open Core Protocol (OCP), 13, 33
  Generic Communication Models, 31
  OCP Specific Communication Models, 33
  Transaction Levels, 15
Open SystemC Initiative (OSCI), 12
Orthogonalization of Concerns, 11, 26
Performance Indicators, 110
Platform Based Design (PBD), 18
Processor Customization, 43
  ADLs, 45
  (Re-)Configurable Processor Architectures, 44
  Selectable Processor Core IP, 43
Processor Support Package (PSP), 82, 147
Programmer's View (PV), 26
Ptolemy, 8
PV Models, 31
Register Transfer Level (RTL), 8–9, 15, 28, 78
ROSES design framework, 19
Simple Bus Library, 31, 64, 117
Sonics, 33
SpecC, 21
StepNP, 20
Stretch Inc., 45
Successive Refinement
  Phase 1, 68, 109
  Phase 2, 72, 112
  Phase 3, 73, 116
  Phase 4, 75, 119
  Phase 5, 77, 122
  Phase 6, 78
  Phase 7, 79
Synopsys
  Behavioral Compiler, 10
  DesignWare, 35
  SystemStudio, 64
SystemC, 12, 56
System Level Design, 10
  Design Flows, 15
  Motivation, 10
  Standardization, 12
Tensilica, 44
Throughput, 111
Top-Down Design, 17, 20
Transaction Layer, 15
Transaction Level Modeling (TLM), 13, 25
  Abstraction Levels, 26
  Bus Traffic Visualization, 102
  Module API, 59
  Protocol Specific Interfaces, 33
  Standardization, 13
  Transfer TLM Interface, 60
  Transport TLM Interface, 59
  Use Cases, 25
Transfer, 37
Transfer Layer, 15
Verification View (VV), 26
Virtual Processing Unit (VPU), 22, 42
Virtual Socket Interface Alliance (VSIA), 12
Vulcan, 9
Y-Chart Approach, 18, 41