Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2325
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
B. Falsafi T. N. Vijaykumar (Eds.)
Power-Aware Computer Systems Second International Workshop, PACS 2002 Cambridge, MA, USA, February 2, 2002 Revised Papers
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Babak Falsafi
Carnegie Mellon University, Electrical and Computer Engineering, Computer Science
Hamerschlag Hall A305, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
E-mail:
[email protected]
T. N. Vijaykumar
Purdue University, School of Electrical and Computer Engineering
1285 Electrical Engineering Building, West Lafayette, Indiana 47907-1285, USA
E-mail:
[email protected]

Cataloging-in-Publication Data applied for
A catalog record for this book is available from the Library of Congress.
Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
CR Subject Classification (1998): B.7, B.8, C.1, C.2, C.3, C.4, D.4 ISSN 0302-9743 ISBN 3-540-01028-9 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2003 Printed in Germany Typesetting: Camera-ready by author, data conversion by DA-TeX Gerd Blumenstein Printed on acid-free paper SPIN: 10846717 06/3142 543210
Preface
Welcome to the proceedings of the Power-Aware Computer Systems (PACS 2002) workshop held in conjunction with the 8th International Symposium on High Performance Computer Architecture (HPCA-8). Improvements in computer system performance have been accompanied by an alarming increase in power and energy dissipation, leading to higher cost and lower reliability in all computer systems market segments. The higher power/energy dissipation has also significantly reduced battery life in portable systems. While circuit-level techniques continue to reduce power and energy, all levels of computer systems are being used to address power and energy issues. PACS 2002 was the second workshop in its series to address power-/energy-awareness at all levels of computer systems and brought together experts from academia and industry.

These proceedings include research papers spanning a wide spectrum of areas in power-aware systems. We have grouped the papers into the following categories: (1) power-aware architecture and microarchitecture, (2) power-aware real-time systems, (3) power modeling and monitoring, and (4) power-aware operating systems and compilers.

The first group of papers proposes power-aware techniques for the processor pipeline using adaptive resizing of power-hungry microarchitectural structures and clock gating, and power-aware cache design that avoids tag checks in periods when the tags have not changed. This group also includes ideas to adapt energy and performance dynamically by detecting regions of the application at runtime where the supply voltage may be scaled to reduce power with a bounded decrease in performance. Lastly, a paper on multiprocessor designs trades off computing capacity and functionality for improved energy per cycle by scheduling simple tasks on low-end, low-energy processors and complex tasks on high-end processors.

The second group of papers targets real-time systems, including a low-complexity heuristic which schedules real-time tasks such that no task misses its deadline and the total energy savings are maximized. The other papers in this group (1) tune the system-level parallelism to the current level of power/energy availability and optimize the system power utilization, and (2) perform adaptive texture mapping in real-time 3D graphics systems based on a model of human visual perception to achieve significant power savings without noticeable image quality degradation.

The third group of papers focuses on power modeling and monitoring, including statistical profiling to detect software hotspots of power, and using Petri Nets to model DRAM power policies. This group also includes a simulator for evaluating the performance and power of dynamic voltage scaling algorithms.

The last group concentrates on OS and compilers for low power. The first paper proposes application-issued directives to set the power modes in devices such as a disk drive. The second paper proposes policies for cluster-wide power management.
The policies employ combinations of dynamic voltage scaling and turning nodes on and off to reduce overall cluster power.

PACS 2002 was a highly successful forum due to the high-quality submissions, the enormous efforts of the program committee and the keynote speaker, and the attendees. We would like to thank Ronny Ronen for an excellent keynote speech, showing the technological scaling trends and their impact on energy/power consumption in general-purpose microprocessors, and pinpointing recent microarchitectural strategies to achieve more power-efficient microprocessors. We would also like to thank Antonio Gonzalez, Andreas Moshovos, John Kalamatianos, and other members of the HPCA-8 organizing committee who helped arrange for local accommodation and publicize the workshop.
February 2002
Babak Falsafi and T.N. Vijaykumar
PACS 2002 Program Committee
Babak Falsafi, Carnegie Mellon University (co-chair) T.N. Vijaykumar, Purdue University (co-chair) Dave Albonesi, University of Rochester Krste Asanovic, Massachusetts Institute of Technology Iris Bahar, Brown University Luca Benini, University of Bologna Doug Carmean, Intel Yuen Chan, IBM Keith Farkas, Compaq WRL Mary Jane Irwin, Pennsylvania State University Stefanos Kaxiras, Agere Systems Peter Kogge, University of Notre Dame Uli Kremer, Rutgers University Alvin Lebeck, Duke University Andreas Moshovos, University of Toronto Raj Rajkumar, Carnegie Mellon University Kaushik Roy, Purdue University
Table of Contents
Power-Aware Architecture/Microarchitecture

Early-Stage Definition of LPX: A Low Power Issue-Execute Processor . . . . . 1
  P. Bose, D. Brooks, A. Buyuktosunoglu, P. Cook, K. Das, P. Emma, M. Gschwind, H. Jacobson, T. Karkhanis, P. Kudva, S. Schuster, J. Smith, V. Srinivasan, V. Zyuban, D. Albonesi, and S. Dwarkadas

Dynamic Tag-Check Omission: A Low Power Instruction Cache Architecture Exploiting Execution Footprints . . . . . 18
  Koji Inoue, Vasily Moshnyaga, and Kazuaki Murakami

A Hardware Architecture for Dynamic Performance and Energy Adaptation . . . . . 33
  Phillip Stanley-Marbell, Michael S. Hsiao, and Ulrich Kremer

Multi-processor Computer System Having Low Power Consumption . . . . . 53
  C. Michael Olsen and L. Alex Morrow

Power-Aware Real-Time Systems

An Integrated Heuristic Approach to Power-Aware Real-Time Scheduling . . . . . 68
  Pedro Mejia, Eugene Levner, and Daniel Mossé

Power-Aware Task Motion for Enhancing Dynamic Range of Embedded Systems with Renewable Energy Sources . . . . . 84
  Jinfeng Liu, Pai H. Chou, and Nader Bagherzadeh

A Low-Power Content-Adaptive Texture Mapping Architecture for Real-Time 3D Graphics . . . . . 99
  Jeongseon Euh, Jeevan Chittamuru, and Wayne Burleson

Power Modeling and Monitoring

Energy-Driven Statistical Sampling: Detecting Software Hotspots . . . . . 110
  Fay Chang, Keith I. Farkas, and Parthasarathy Ranganathan

Modeling of DRAM Power Control Policies Using Deterministic and Stochastic Petri Nets . . . . . 130
  Xiaobo Fan, Carla S. Ellis, and Alvin R. Lebeck

SimDVS: An Integrated Simulation Environment for Performance Evaluation of Dynamic Voltage Scaling Algorithms . . . . . 141
  Dongkun Shin, Woonseok Kim, Jaekwon Jeon, Jihong Kim, and Sang Lyul Min
Power-Aware OS and Compilers

Application-Supported Device Management for Energy and Performance . . . . . 157
  Taliver Heath, Eduardo Pinheiro, and Ricardo Bianchini

Energy-Efficient Server Clusters . . . . . 179
  E.N. (Mootaz) Elnozahy, Michael Kistler, and Ramakrishnan Rajamony

Single Region vs. Multiple Regions: A Comparison of Different Compiler-Directed Dynamic Voltage Scheduling Approaches . . . . . 197
  Chung-Hsing Hsu and Ulrich Kremer

Author Index . . . . . 213
Early-Stage Definition of LPX: A Low Power Issue-Execute Processor

P. Bose1, D. Brooks1, A. Buyuktosunoglu2, P. Cook1, K. Das3, P. Emma1, M. Gschwind1, H. Jacobson1, T. Karkhanis4, P. Kudva1, S. Schuster1, J. Smith5, V. Srinivasan1, V. Zyuban1, D. Albonesi6, and S. Dwarkadas6

1 IBM T. J. Watson Research Center, Yorktown Heights, NY; [email protected]
2 University of Rochester, NY; summer intern at IBM Watson
3 University of Michigan, Ann Arbor; summer intern at IBM Watson
4 University of Wisconsin, Madison; summer intern at IBM Watson
5 University of Wisconsin, Madison; visiting scientist at IBM Watson
6 University of Rochester, NY
Abstract. We present the high-level microarchitecture of LPX: a low-power issue-execute processor prototype that is being designed by a joint industry-academia research team. LPX implements a very small subset of a RISC architecture, with a primary focus on a vector (SIMD) multimedia extension. The objective of this project is to validate some key new ideas in power-aware microarchitecture techniques, supported by recent advances in circuit design and clocking.
1 Introduction
Power dissipation limits constitute one of the primary design constraints in future high performance processors. Also, depending on the thermal time constants implied by the chosen packaging/cooling technology, on-chip power-density is a more critical constraint than overall power in many cases. In current CMOS technologies, dynamic (“switching”) power still dominates; but, increasingly, the static (“leakage”) component is threatening to become a major component in future technologies [6]. In this paper, we focus primarily on the dynamic component of power dissipation. Current generation high-end processors like the IBM POWER4 [3, 26], are performance-driven designs. In POWER4, power dissipation is still comfortably below the 0.5 watts/sq. mm. power density limit afforded by the package/cooling solution of choice in target server markets. However, in designing and implementing future processors (or even straight “remaps”) the power (and especially the power-density) limits could become a potential “show-stopper” as transistors shrink and the frequency keeps increasing. Techniques like clock-gating (e.g. [21, 13]) and dynamic size adaptation of on-chip resources like caches and queues (e.g. [1, 20, 4, 9, 12, 2, 15, 27]) have been either used or proposed as methods for power management in future processor cores. Many of these techniques, however, have to be used with caution in
server-class processors. Aspects like reliability and inductive noise on the power supply rails (Ldi/dt) need to be quantitatively evaluated prior to committing a particular gating or adaptation technique to a real design. Another issue in the design of next generation, power-aware processors, is the development of accurate power-performance simulators for use in early-stage design. University research simulators like Wattch [7] and industrial research simulators like Tempest [10] and PowerTimer [8] have been described in the recent past; however their use in real design environments is needed to validate the accuracy of the energy models in the context of power-performance tradeoff decisions made in early design. In the light of the above issues, we decided to design and implement a simple RISC “sub-processor” test chip to validate some of the key new ideas in adaptive and gated architectures. This chip is called: LPX, which stands for low-power issue-execute processor. This is a research project, with a goal of influencing real development groups. LPX is a joint university-industry collaboration project. The design and modeling team is composed of 10-12 part-time researchers spanning the two groups (IBM and University of Rochester) aided by several graduate student interns and visiting scientists recruited from multiple universities to work (part-time) at IBM. LPX is targeted for fabrication in a CMOS 0.1 micron high-end technology. RTL (VHDL) simulation and verification is scheduled for completion in 2002. Intermediate circuit test chips are in plan (mid- to late 2002) for early validation of the circuit and clocking support. LPX chip tapeout is slated for early 2003. In this paper, we present the microarchitecture definition with preliminary simulation-based characterization of the LPX prototype. We summarize the goals of the LPX project as follows: – To understand and assess the true worth of a few key ideas in power-aware microarchitecture design through simulation and eventually via direct hardware measurement. Based on known needs in real products of the future, we have set a target of average power density reduction by at least a factor of 5, with no more than 5% reduction in architectural performance (i.e. instructions per cycle or IPC). – To quantify the instantaneous power (current) swings incurred by the use of the adaptive resizing, throttling and clock-gating ideas that are used to achieve the targeted power reduction factors in each unit of the processor. – To use the hardware-based average and instantaneous power measurements for calibration and validation of energy models used in early-stage, powerperformance simulators. Clearly, what we learn through the “simulation and prototyping in the small” experiments in LPX, will be useful in influencing full-function, power-efficient designs of the future. The calibrated energy models will help us conduct design space exploration studies for high-end machines with greater accuracy. In this paper, we limit our focus to the microarchitectural definition process, with related simulation-based result snapshots, of the LPX prototype. (Note that LPX is a research test chip. It is not intended to be a full-function, production-quality
microprocessor. At this time, LPX is not directly linked to any real development project).

Fig. 1. Power profile: (a) relative unit-wise power; (b) power breakdowns: ISU
2 Background: Power-Performance Data
In an out-of-order, speculative super scalar design like each of the two cores in POWER4, a large percentage of the core power in the non-cache execution engine is spent in the instruction window or issue queue unit [26, 9, 20, 12]. Figure 1(a) shows the relative distribution of power across the major units within a single POWER4 core. Figure 1(b) zooms in on the instruction sequencing unit that contains the various out-of-order issue queues and rename buffers. Figure 2 shows the power density across some of the major units of a single POWER4 core. The power figures are non-validated pre-silicon projections based on unconstrained (i.e. without any clock-gating assumptions) “average/max” power projections using a circuit-level simulation facility called CPAM [19]. (Actual unit-wise power distribution, with available conditional clocking modes enabled, are not shown). This tool allowed us to build up unit-level power characteristics from very detailed, macro-level data. Here, the activity (utilization) factors of all units are assumed to be 100% (i.e. worst case with no clock-gating anywhere); but average, expected input data switching factors (based on representative test cases run at the RTL level, and other heuristics) are assumed for each circuit macro. Such average macro-level input data switching factors typically range from 4-15%. (From Figure 2, we note that although on a unit basis, the power density numbers are under 0.5 watts/sq. mm., there are smaller hotspots, like the integer FX issue queue within the ISU that are above the limit in an unconstrained mode). (Legend for Figs. 1-2: IDU: instruction decode unit; FXU: fixed point unit; IFU: instruction fetch unit; BHT: branch history table;
ISU: instruction sequencing unit; LSU: load-store unit (includes L1 data cache); FPU: floating point unit).

Fig. 2. Unconstrained power density profile

Another class of data that we used was the performance and utilization information obtained from pre-silicon performance simulators. Figure 3 shows the relative “active/idle” barchart plot across some of the major units for a POWER4-like pre-silicon simulation model. The data plotted is for a commercial TPC-C trace segment. This figure shows, for example, that the instruction fetch unit (IFU) is idle for approximately 47% of the cycles. Similar data, related to activities within other units like issue queues and execution unit pipes, were collected and analyzed.
3 Areas of Focus in Defining the LPX Processor
Based on microarchitecture level and circuit simulation level utilization, power and power-density projections, as above, we made a decision to focus on the following aspects of a super scalar processing core in our LPX test chip:

Power-efficient, Just-in-Time Instruction Fetch. Here, we wanted to study the relative advantages of conditional gating of the ifetch function, with a goal of saving power without appreciable loss of performance. The motivation for this study was clearly established after reviewing data like that depicted in Figures 1 and 2. In simulation mode, we studied the benefit of various hardware heuristics for determining the “gating condition” [18, 14, 5], before fixing on a particular set of choices (being reported in detail in [17]) to implement in LPX. Our emphasis here is on studying ifetch gating heuristics that are easy to implement and test, with negligible added power for the control mechanism.

Adaptive Issue Queues. The out-of-order issue queue structure inherent in today’s high-end super scalar processors is a known “hot-spot” in terms of power dissipation. The data shown in Figures 1 and 2, and also corroborative data from other processors (e.g. [1]), makes this an obvious area to focus on.
Fig. 3. Unit-wise utilization stack (TPC-C trace)
In LPX, our goal is also to compare the achieved power savings with a fixed issue queue design, but with fine-grain clock-gating support, where the valid-bit for each issue queue entry is used as a clock-gating control. A basic issue in this context is the extra power that is spent due to the presence of out-of-order execution modes. Is the extra power spent worth the performance gain that is achievable? We wish to understand the fundamental power-performance tradeoffs in the design of issue queues for next generation processors. Again, simplicity of the adaptive control and monitoring logic is crucial, especially in the context of the LPX prototype test vehicle.

Locally Clocked Execution Pipeline. Based on the data shown in Figures 1 and 2, a typical, multi-stage complex arithmetic pipeline is also a high power-density region within the chip. We wish to study the comparative benefit of alternate conditional clocking methods proposed in ongoing work in advanced circuit design groups ([21, 23, 16]). In particular, we wish to understand: (a) the benefit of simple valid-bit-based clock-gating in a synchronously clocked execution unit; and (b) the added power-savings benefit of using a locally asynchronous arithmetic unit pipeline within a globally synchronous chip. The asynchronously clocked pipeline structure is based on the IPCMOS circuit technology previously tested in isolation [23] by some in our research team. Such locally clocked methods offer the promise of low power at high performance, with manageable inductive noise (Ldi/dt) characteristics. In LPX, we wish to measure and validate these expectations as the IPCMOS pipe is driven by data in real computational loop kernels.

Power-Efficient Stalling of Synchronous Pipelines. In the synchronous regions of the design, we wish to quantify the amount of power that is consumed by pipeline stall (or “hold/recirculation”) conditions.
Fig. 4. Methodology for fine-tuning the LPX microarchitecture
Anticipating (from circuit simulation coupled with microarchitectural simulation data) such wastage to be significant, we wish to experiment with alternate methods to reduce or eliminate the “stall energy” by using a novel circuit technique called interlocked synchronous pipelines (ISP) that was recently invented by some members of our team [16].

Thus, a basic fetch-issue-execute super scalar processing element (see Sections 4 and 5) was decided upon as the study vehicle for implementation by our small research team. The goal is to study the power-performance characteristics of dynamic adaptation, in microarchitectural terms as well as in clocking terms, with the target of achieving significant power (and especially power density) reduction, with acceptable margins of IPC loss.
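To make the "just-in-time instruction fetch" focus area above a little more concrete, the following minimal sketch models one low-overhead, stall-driven fetch gate of the kind LPX is meant to evaluate (the specific gating-window heuristic and its measured effect are described in Section 6). The names, the window length, and the surrounding simulator state are illustrative assumptions, not the LPX RTL.

    /* Minimal sketch of a stall-driven ifetch gate with a hold window (GW).
     * All names and the window length are illustrative.
     */
    #include <stdbool.h>

    typedef struct {
        int gw;         /* gating window: extra cycles to hold fetch off      */
        int hold_left;  /* cycles of hold remaining after the stall went away */
    } ifetch_gate_t;

    /* Called once per cycle with the current ibuffer-stall signal.
     * Returns true when instruction fetch should be gated off this cycle. */
    static bool ifetch_gated(ifetch_gate_t *g, bool ibuffer_stall)
    {
        if (ibuffer_stall) {
            g->hold_left = g->gw;   /* re-arm the hold window                   */
            return true;            /* fetch is naturally inhibited while stalled */
        }
        if (g->hold_left > 0) {
            g->hold_left--;         /* keep fetch off a little longer while the
                                       full ibuffer drains                      */
            return true;
        }
        return false;               /* resume fetching */
    }

With the window set to zero this degenerates to the baseline behavior (fetch inhibited only while the stall is asserted); larger windows trade some IPC for fetch-unit power, which is the GW sweep reported in Section 6.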
4 Tuning the Microarchitecture
In this section, we outline the methodology adopted for defining the range of hardware design choices to be studied in the LPX test chip. Since we are constrained by the small size of our design team, and yet the ideas explored are targeted to influence real, full-function processor designs, we adopted the following general method. Figure 4 shows the iterative method used to decide what coarse-level features to add into the LPX test chip, starting from an initial, baseline “bare-bones” fetch-issue-execute model.

– A given, power-efficient microarchitectural design idea is first simulated in the context of a realistic, current generation super scalar processor model (e.g. POWER4-like microarchitectural parameters) and full workloads (like SPEC and TPC-C) to infer the power-performance benefit. Once a basic
hardware heuristic is found to yield tangible benefit - in other words, a significant power reduction at small IPC impact - it is selected for possible implementation in LPX.

– A detailed, trace-driven, cycle-by-cycle simulator for the baseline LPX processor is coded to run a set of application-based and synthetic loop test cases designed to test and quantify the LPX-specific power-performance characteristics of the candidate hardware power-saving feature. In order to get a measurable benefit, it may be necessary to further simplify the heuristic, or augment the microarchitecture minimally to create a new baseline. Once the power-performance benefit is deemed significant, we proceed to the next candidate idea.

In this paper we mainly focus on the second of these steps, i.e. understanding the fundamental power-performance tradeoff characteristics, using a simple, illustrative loop test case. However, we also refer briefly to example full-model super scalar simulation results to motivate the choice of a particular hardware heuristic.

Energy Models Used. The LPX cycle-by-cycle simulator used to analyze early-stage microarchitectural power-performance tradeoffs has integrated energy models, as in the PowerTimer tool [8]. These energy models were derived largely from detailed, macro-level energy data for POWER4, scaled for size and technology to fit the requirements of LPX. The CPAM tool [19] was used to get this data for most of the structures modeled. Additional experiments were performed at the circuit simulation level to derive power characteristics of newer latch designs (with and without clock- and stall-based gating). The energy-model-enabled LPX simulator is systematically validated using specially architected test cases. Analytical bounds modeling is used to generate bounds on IPC and unit-wise utilization (post-processed to form power bounds). These serve as reference “signatures” for validating the power-performance simulator. Since the LPX design and model are still evolving, validation exercises must necessarily continue throughout the high-level design process. Details of the energy model derivation and validation are omitted for brevity.
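As a rough illustration of how such energy models plug into a cycle-by-cycle simulator, the sketch below charges each unit a utilization-scaled fraction of its unconstrained per-cycle energy, with a small clock-gated floor for idle cycles. This is a hedged, PowerTimer-style sketch only; the structure, field names, and constants are assumptions made for illustration, not the actual LPX or PowerTimer code.

    /* Illustrative utilization-scaled energy accounting for one simulated cycle.
     * e_max_nj   : unconstrained per-cycle energy of the unit (from circuit data)
     * gate_floor : fraction of e_max still burned when the unit is clock-gated
     */
    typedef struct {
        const char *name;
        double e_max_nj;
        double gate_floor;   /* e.g. 0.10 => 10% residual (ungated clocks, leakage) */
        double e_total_nj;   /* accumulated energy over the run                     */
    } unit_energy_t;

    /* utilization: fraction of the unit's resources active this cycle (0.0 - 1.0) */
    static void charge_cycle(unit_energy_t *u, double utilization)
    {
        double e;
        if (utilization > 0.0)
            e = u->e_max_nj * utilization;     /* active cycle: scale with use  */
        else
            e = u->e_max_nj * u->gate_floor;   /* idle cycle: clock-gated floor */
        u->e_total_nj += e;
    }

Average power then follows by dividing the accumulated energy by the number of simulated cycles times the clock period; calibrating such a model against measured hardware is precisely what LPX is intended to enable.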
5 High-Level Microarchitecture of LPX
Figure 5 shows a very high-level block diagram of the baseline LPX processor that we started with before further refinement of the microarchitectural features and parameters through a simulation-based study. The function and storage units shown in dashed edges are ones that are modeled (to the extent required) in the simulation infrastructure, but are not targeted for implementation in the initial LPX design. The primary goal of this design is to experiment with the fetch-issue-execute pipe which processes a basic set of vector integer arithmetic instructions. These instructions are patterned after a standard 4x32-bit SIMD multimedia extension architecture [11], but simplified in syntax and semantics. The “fetch-and-issue” sub-units act together as a producer of instructions, which are consumed by the “execute” sub-unit.
Fig. 5. LPX Processor: High-Level Block Diagram
The design attempts to balance the dynamic complexity of the producer-consumer pair with the goal of maximizing performance while minimizing power consumption. The basic instruction processing pipeline is illustrated in Figure 6. The decode/dispatch/rename stage, which is shown as a lumped, dummy dispatch unit in Figure 5, is actually modeled in our simulator as an m-stage pipe, where m=1 in the nominal design point. The nominal VFXU execute pipe is n=4 stages deep. The LSFX execute pipe is p=2 stages (in infinite cache mode) and p=12 stages when a data cache miss stall is injected using the stall control registers (Figure 5); in particular, using a miss-control register (MCR). One of the functional units is the scalar FXU (a combined load-store unit and integer unit, LSFX) and the other is the vector integer unit (VFXU). The VFXU execution pipe is multi-cycle (nominally 4 cycles). The LSFX unit has a 1-cycle pipe plus (nominally) a 1-cycle (infinite) data cache access cycle for loads and stores. At the end of the final execution stage, the results are latched onto the result bus while the target register tags are broadcast to the instructions pending in the issue queue. As a substitute for instruction caching, LPX uses a loop buffer in which a loop (of up to 128 instructions) is pre-loaded prior to processor operation. The loaded program consists of predecoded instructions, with inline explicit specifiers of prerenamed register operands, in full out-of-order mode of execution. This avoids the task of designing explicit logic for the instruction decode and rename processes.
Fig. 6. LPX (simulator) pipeline stages
LPX also supports an “in-order” mode, without register renaming, as the lowest-performance design point for our tradeoff experiments. The instructions implemented in LPX are listed below in Table 1. For the most part, these are a set of basic vector (SIMD) mode load, store, and arithmetic instructions, following the general semantics of a PowerPC VMX (vector multimedia extension) architecture [11]. There are a few added scalar RISC (PowerPC-like) instructions to facilitate loading and manipulation of the scalar integer registers required in vector load-store instructions. The (vector) load and store instructions have an implied “update” mode in LPX, where the scalar address base register is auto-incremented to hold the address of the next sequential array data in memory.
Table 1. LPX Instruction Set

    Instruction       Example Syntax        Semantics
    Vector Load       VLD vr1, r2, 0x08     Load vr1; scalar base address register: r2
    Vector Store      VST vr1, r2, 0x08     Store vr1
    Vector Add        VADD vr1, vr2, vr3    vr1 <- vr2 + vr3
    Vector Sub/Mul/Div                      similar to VADD above
    Scalar Load       LD r1, r2, 0x08       Load scalar register r1
    Scalar Inc        INC r1                Increment r1 (scalar)
    Scalar Dec        DEC r1                Decrement r1 (scalar)
    Cond. Branch      BC +-0x08             Branch conditional (PC-relative jump)
    Uncond. Branch    BR +-0x08             Branch unconditional
6 Examples: LPX Microarchitecture Analysis
In this section, we illustrate the use of simple loop-based test cases in understanding the basic power-performance trade-offs of adaptive structures and clocking mechanisms that were chosen for study in LPX. The challenge is to determine the nominal sizes, adaptation windows and (in each case) a simple “monitor-and-control” mechanism that is appropriate in the context of building a small prototype engine like LPX. We started with the simplest baseline, where ideal cache effects were modeled, by architecting a single-stage LSFX pipe unit; but later we had to augment the specification to include a variable-length LSFX pipe, to simulate data cache miss latency. In the absence of real cache hardware (correspondingly, real cache hit/miss code in the simulator), we architect for programmable “miss” scenarios via a user-loadable miss control register (MCR). Details of how this works in the real hardware are not discussed in this paper. For brevity, we only show a few tradeoff analysis examples limited to the infinite (ideal) cache scenario. As described before in Section 4 (see Figure 4), each candidate power reduction idea is analyzed in the “large” (i.e. using a general out-of-order super scalar simulator) to ensure potential benefit. Then, a simpler hardware heuristic is used for trial and measurement “in the small” within the LPX simulation tool kit.

LPX experiments: an example loop test case: vect_add. We use a simple “vector add” loop trace, formed by execution of the following loop, to illustrate LPX tradeoff experiments:

    ->  VLD  vr1, r2 (0x4)
        VADD vr4, vr1, vr6
        VLD  vr6, r2 (0x8)
        VADD vr4, vr4, vr6
        VST  vr4, r3 (0x8)
        DEC  r7
        BRZ  r7, -0x7
The baseline LPX model parameters were fixed as follows, after initial experimentation. Instruction fetch (ifetch) bandwidth is up to four instructions/cycle, with no fetch beyond a branch on a given cycle. The instruction fetch buffer size is four instructions; dispatch bandwidth (into the issue queue) is up to two instructions/cycle; issue bandwidth (into the execution pipes) is up to two instructions/cycle; and completion bandwidth is also two instructions/cycle. Fetch and dispatch are in-order and issue can be in-order or out-of-order (switchable); instructions finish out-of-order. (LPX does not model or implement in-order completion for precise interrupt support using reorder buffers.)

Conditional Ifetch. Figures 7(a,b) show a snapshot of analysis data from a typical 4-way, out-of-order super scalar processor model. The data reported is for two benchmarks from the SPEC2000 suite. It shows that the ifetch stage/buffer, the front-end pipe and the issue queue/window can be idle for significant fractions of the program run.
Fig. 7. Idle, speculative and stall waste: (a) AMMP and (b) GCC
These are cycles where power can be saved by valid-bit-based clock-gating. In addition, the fraction of cycles that are wasted by useful (but stalled) instructions and by incorrectly fetched speculative instructions can also be significant. Gating off the ifetch process, using a hardware heuristic to compute the gating condition, is therefore a viable approach to saving energy. For LPX, we wish to experiment with the simplest of such heuristics that are easy to implement. The basic method used is to employ the “stall” or “impending stall” signals available from “downstream” consumer units to throttle back the “upstream” producer (ifetch). Figures 8(a,b) show results from an illustrative use of conditional ifetch while simulating the vect_add loop trace. We use the following simple hardware heuristic for determining the ifetch gating scenario. When a “stall” signal is asserted by the instruction buffer (e.g. when the ibuffer is full) the ifetch process is naturally inhibited in most designs; so this is assumed in the baseline model. However, additional power savings can be achieved by retaining the “ifetch-hold” condition for a fetch-gate cycle window, GW, beyond the negation of the ibuffer stall signal. Since the ibuffer was full, it would take a while to drain it; hence ifetch could be gated off for GW cycles. Depending on the size of the ibuffer, IPC would be expected to drop off to unacceptable levels beyond a certain value of GW; but increasing GW is expected to reduce IFU (instruction fetch unit) and overall chip power.

Adaptive Issue Queue. Figure 9 shows a snapshot of our generalized simulation-based power-savings projection for various styles of out-of-order issue queue design. (An 8-issue, super scalar, POWER4-like research simulator was used.) These studies showed potential power savings of more than 80% in the issue queue with at most a 2-3% hit in IPC on the average. However, the best power reductions were for adaptive and banked CAM/RAM based designs that are not easy to design and verify.
Fig. 8. Conditional ifetch in LPX: (a) power reduction and (b) CPI vs. gating window, GW (CPI of the baseline in-order (IO) configuration = 3.00; CPI of the out-of-order (OO) configurations = 2.29)

Fig. 9. Adaptive issue queue: power saving (generalized 8-issue super scalar)
For LPX, we started with a baseline design of the POWER4 integer issue queue [26], which is a latch-based design. It is organized as a 2-chunk structure, where, in adaptive mode, one of the chunks can be shut off completely (to eliminate dynamic and static power). Figure 10 illustrates the benefit of using a simple, LPX-specific adaptive issue queue heuristic that is targeted to reduce power without loss of performance; i.e. the size is adapted downwards only when “safe” to do so from a performance viewpoint, and the size is increased in anticipation of increased demand. (In the example data shown in this paper, we consider only the reduction of dynamic power via such adaptation.) The adaptive issue queue control heuristic illustrated is simpler than that proposed in the detailed studies reported earlier [9], for ease of implementation in the LPX context. The control heuristic in LPX is as follows:
    if (current-cycle-window-issuecount < 0.5 * last-cycle-window-issuecount)
        then increase-size (* if possible *);
        else decrease-size (* if possible *);

Fig. 10. Adaptive issue queue experiment in LPX (CPI vs. adaptation cycle window, AW; baseline non-adaptive power = 4.46 watts)

Discussion of Results. From Figure 8(a) we note that adding out-of-order (oo) mode to the baseline in-order (io) machine causes an IPC increase (CPI decrease) of 23.6%, but with a 12.5% overall power increase. The ISU, which contains the issue queue, increases in power by 27.5%. So, from an overall power-performance efficiency viewpoint, the out-of-order (oo) mode does seem to pay off in LPX for this loop trace, in infinite cache mode. However, from a power-density “hot-spot” viewpoint, even this basic enhancement may need to be carefully evaluated with a representative workload suite. Adding the valid-bit-based clock-gating (VB-CG) mode in the instruction buffer, issue queue and the execution unit pipes causes a sharp decrease in power (42.4% from the baseline oo design point). Adding a conditional ifetch mode (with a gating cycle window GW of 10 cycles over which ifetch is blocked after the ibuffer stall signal goes away) yields an additional 18.8% power reduction, without loss of IPC performance. As the gating cycle window GW is increased, we see a further sharp decrease in net power beyond GW=10, but with IPC degradation. For the adaptive issue queue experiment (Fig. 10) shown, we see that an 8% reduction in net LPX power is possible; but beyond an adaptation cycle window, AW, of 1, an 11% increase in CPI (cycles-per-instruction) is incurred. Thus, use of fine-grain, valid-bit-based clock-gating is simpler and more effective than adaptive methods. Detailed results, combining VB-CG and adaptation, will be reported in follow-up research.
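For illustration, the window-based control heuristic above can be written as a small piece of simulator logic: count issued instructions over an adaptation window of AW cycles and, at every window boundary, grow the queue when the issue count has fallen below half of the previous window's count (the drop may mean the smaller queue is throttling issue) and shrink it otherwise. The two-size model, the field names, and the helper below are illustrative assumptions; this sketch is not the LPX implementation.

    /* Sketch of an LPX-style adaptive issue queue controller.
     * The queue is modeled with two sizes (one chunk or both chunks enabled).
     */
    typedef struct {
        int aw;             /* adaptation window, in cycles              */
        int cycle_in_win;   /* position within the current window        */
        int issued_cur;     /* instructions issued in the current window */
        int issued_last;    /* instructions issued in the last window    */
        int full_size;      /* 1 = both chunks on, 0 = one chunk gated   */
    } iq_adapt_t;

    static void iq_adapt_cycle(iq_adapt_t *q, int issued_this_cycle)
    {
        q->issued_cur += issued_this_cycle;
        if (++q->cycle_in_win < q->aw)
            return;                         /* window not over yet */

        /* End of adaptation window: apply the heuristic. */
        if (2 * q->issued_cur < q->issued_last)
            q->full_size = 1;               /* increase size, if possible */
        else
            q->full_size = 0;               /* decrease size, if possible */

        q->issued_last  = q->issued_cur;
        q->issued_cur   = 0;
        q->cycle_in_win = 0;
    }

In a real design the resize would also be conditioned on the affected chunk being drained of valid entries; that "if possible" qualifier from the pseudocode is deliberately left abstract here.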
Stall-Based Clock-Gating. As previously alluded to, in addition to valid-bit-based clock-gating in synchronous (and locally asynchronous) pipelines, LPX uses a mode in which an instruction stalling in a buffer or queue for multiple cycles is clock-gated off, instead of the recirculation-based hold strategy often used in high performance processors. The stall-related energy waste is a significant fraction of queue/buffer power that can be saved if the stall signal is available in time to do the gating. Carefully designed control circuits [16] have enabled us to exploit this feature in LPX. In this version of the paper, we could not include the experimental results that show the additional benefits of such stall-based gating. However, suffice it to say that with the addition of stall-based clock-gating, simulations predict that we are well within the target of achieving a factor of 5 reduction in power and power density, without appreciable loss of IPC performance. The use of a locally asynchronous IPCMOS execution pipe [23] is expected to increase the power reduction even further. Detailed LPX-specific simulation results for these circuit-centric features will be available in subsequent reports.
7 Conclusions and Future Work
We presented the early-stage definition of LPX: a low-power issue-execute processor prototype that is designed to serve as a measurement and evaluation vehicle for a few new ideas in adaptive microarchitecture and conditional clocking. We described the methodology that was used to architect and tune simple hardware heuristics in the prototype test chip, with the goal of drawing meaningful conclusions of use in future products. We presented a couple of simple examples to illustrate the process of definition and to report the expected power-performance benefits of the illustrated adaptive features. The basic idea of fetch-throttling to conserve power is not new. In addition to work that we have already alluded to [18, 14, 5], Sanchez et al. [22] describe a fetch stage throttling mechanism for the G3 and G4 PowerPC processors. The throttling mode in the prior PowerPC processors was architected to respond to thermal emergencies. The work reported in [18, 14, 5] and the new gating heuristics described in this paper and in [17] are aimed at reducing average power during normal operation. Similarly, the adaptive issue queue control heuristics being developed for LPX are intended to be simpler adaptations of our prior general work [9]. We believe that the constraint of designing a simple test chip with a small design team forces us to experiment with heuristics that are easy to implement with low overhead. If some of these heuristics help create relatively simple power management solutions for a full-function, production-quality processor, then the investment in LPX development will be easily justified. In addition to the adaptive microarchitecture principles alluded to above, the team is considering the inclusion of other ideas in the simulation toolkit; some of these remain candidates for inclusion in the actual LPX definition: at least for LPX-II, a follow-on design. The following is a partial list of these other ideas:
– Adaptive, power-efficient cache and register file designs: these were not considered for implementation in the initial LPX prototype, due to lack of seasoned SRAM designers in our research team. In particular, as a candidate data cache design for LPX-II, we are exploring ideas that combine prior energy-efficient solutions [1, 4, 2, 15] with recently proposed, high performance split-cache architectures ([24, 25]).
– Exploiting the data sparseness of vector/SIMD-mode execution, through hardware features that minimize clocking waste in processing vector data that contains lots of zeroes.
– Newer features that reduce static (leakage) power waste.
– Adding monitoring hardware to measure current swings in clock-gated and adaptive structures.
Acknowledgement The authors are grateful to Dan Prener, Jaime Moreno, Steve Kosonocky, Manish Gupta and Lew Terman (all at IBM Watson) for many helpful discussions before the inception of the LPX design project. Special thanks are due to Scott Neely (IBM Watson), Michael Wang and Gricell Co (IBM Austin) for access to CPAM and the detailed energy data used in our power analysis. The support and encouragement received from senior management at IBM - in particular Mike Rosenfield and Eric Kronstadt - are gratefully acknowledged. The authors would like to thank Joel Tendler (IBM Austin) for his comments and suggestions during the preparation and clearance of this paper. Also, the comments provided by the anonymous reviewers were useful in preparing an improved final version; this help is gratefully acknowledged. The work done in this project at University of Rochester is supported in part by NSF grants CCR–9701915, CCR–9702466, CCR–9705594, CCR–9811929, EIA–9972881, CCR–9988361 and EIA–0080124; by DARPA/ITO under AFRL contract F29601-00-K-0182; and by an IBM Faculty Partnership Award. Additional support, at University of Wisconsin, for continuation of research on power-efficient microarchitectures, is provided by NSF grant CCR-9900610.
References [1] D. H. Albonesi. The inherent energy efficiency of complexity-effective processors. In Power-Driven Microarchitecture Workshop at ISCA25, June 1998. 1, 5, 15 [2] D. H. Albonesi. Selective cache ways: on-demand cache resource allocation. In Proceedings of the 32nd International Symposium on Microarchitecture (MICRO32), pages 248–259, Nov. 1999. 1, 15 [3] C. Anderson et al. Physical design of a fourth-generation power ghz microprocessor. In ISSCC Digest of Technical Papers, page 232, 2001. 1 [4] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general purpose architectures. In Proceedings of the 33rd International Symposium on Microarchitecture (MICRO-33), pages 245–257, Dec. 2000. 1, 15
[5] A. Baniasadi and A. Moshovos. Instruction flow-based front-end throttling for power-aware high performance processors. In Proceedings of International Symposium on Low Power Electronics and Design, August 2001. 4, 14 [6] S. Borkar. Design Challenges of Technology Scaling. IEEE Micro, 19(4):23–29, July-August 1999. 1 [7] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architecturallevel power analysis and optimizations. In Proceedings of the 27th International Symposium on Computer Architecture (ISCA-27), June 2000. 2 [8] D. Brooks, J.-D. Wellman, P. Bose, and M. Martonosi. Power-Performance Modeling and Tradeoff Analysis for a High-End Microprocessor. In Power Aware Computing Systems Workshop at ASPLOS-IX, Nov. 2000. 2, 7 [9] A. Buyuktosunoglu et al. An adaptive issue queue for reduced power at high performance. In Power Aware Computing Systems Workshop at ASPLOS-IX, Nov. 2000. 1, 3, 12, 14 [10] A. Dhodapkar, C. Lim, and G. Cai. TEM2 P2 EST: A Thermal Enabled MultiModel Power/Performance ESTimator. In Power Aware Computing Systems Workshop at ASPLOS-IX, Nov. 2000. 2 [11] K. Diefendorff, P. Dubey, R. Hochsprung, and H. Scales. AltiVec extension to PowerPC accelerates media processing. IEEE Micro, pages 85–95, April 2000. 7, 9 [12] D. Folegnani and A. Gonzalez. Energy-effective issue logic. In Proceedings of the 28th International Symposium on Computer Architecture (ISCA-28), pages 230–239, June 2001. 1, 3 [13] M. Gowan, L. Biro, and D. Jackson. Power considerations in the design of the Alpha 21264 microprocessor. In 35th Design Automation Conference, 1998. 1 [14] D. Grunwald, A. Klauser, S. Manne, and A. Pleszkun. Confidence estimation for speculation control. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA-25), pages 122–31, June 1998. 4, 14 [15] K. Inoue et al. Way-predicting set-associative cache for high performance and low energy consumption. In Proceedings of International Symposium on Low Power Electronics and Design, pages 273–275, August 1999. 1, 15 [16] H. Jacobson et al. Synchronous interlocked pipelines. IBM Research Report (To appear in ASYNC-2002) RC 22239, IBM T J Watson Research Center, Oct. 2001. 5, 6, 14 [17] T. Karkhanis et al. Saving energy with just-in-time instruction delivery. submitted for publication. 4, 14 [18] S. Manne, A. Klauser, and D. Grunwald. Pipeline gating: Speculation control for energy reduction. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA-25), pages 132–41, June 1998. 4, 14 [19] J. Neely et al. CPAM: A Common Power Analysis Methodology for High Performance Design. In Proc. 9th Topical Meeting on Electrical Performance of Electronic Packaging, Oct. 2000. 3, 7 [20] D. Ponomarev, G. Kucuk, and K. Ghose. Dynamic allocation of datapath resources for low power. In Workshop on Complexity Effective Design 2001 at ISCA28, June 2001. 1, 3 [21] J. Rabaey and M. Pedram, editors. Low Power Design Methodologies. Kluwer Academic Publishers, 1996. Proceedings of the NATO Advanced Study Institute on Hardware/Software Co-Design. 1, 5 [22] H. Sanchez et al. Thermal management system for high performance PowerPC microprocessors. Digest of Papers - COMPCON - IEEE Computer Society International Conference, page 325, 1997. 14
[23] S. Schuster et al. Asynchronous interlocked pipelined CMOS operating at 3.3-4.5 GHz. In ISSCC Digest of Technical Papers, pages 292–293, February 2000. 5, 14 [24] V. Srinivasan. Hardware Solutions to Reduce Effective Memory Access Time. PhD thesis, University of Michigan, Ann Arbor, February 2001. 15 [25] V. Srinivasan et al. Recovering single cycle access of primary caches. submitted for publication. 15 [26] J. M. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM J. of Research and Development, 46(1):5–26, 2002. 1, 3, 12 [27] S.-H. Yang et al. An energy-efficient high performance deep submicron instruction cache. IEEE Transactions on VLSI, Special Issue on Low Power Electronics and Design, 2001. 1
Dynamic Tag-Check Omission: A Low Power Instruction Cache Architecture Exploiting Execution Footprints

Koji Inoue1, Vasily Moshnyaga1, and Kazuaki Murakami2

1 Dept. of Electronics Engineering and Computer Science, Fukuoka University, 8-19-1 Nanakuma, Jonan-ku, Fukuoka 814-0133, Japan
2 Dept. of Informatics, Kyushu University, 6-1 Kasuga-Koen, Kasuga, Fukuoka 816-8580, Japan
Abstract. This paper proposes an architecture for low-power direct-mapped instruction caches, called the “history-based tag-comparison (HBTC) cache”. The HBTC cache attempts to detect and omit unnecessary tag checks at run time. Execution footprints are recorded in an extended BTB (Branch Target Buffer), and are used to determine the cache residence of target instructions before starting the cache access. In our simulation, it is observed that our approach can reduce the total count of tag checks by 90 %, resulting in a 15 % reduction in cache energy, with less than 0.5 % performance degradation.
1 Introduction
On-chip caches have been playing an important role in achieving high performance. In particular, instruction caches have a great impact on processor performance because one or more instructions have to be issued on every clock cycle. From an energy point of view, this also means that instruction caches consume a lot of energy, so it is strongly required to reduce the energy consumed by instruction-cache accesses. On a conventional cache access, tag checks and data read are performed in parallel. Thus, the total energy consumed for a cache access consists of two factors: the energy for tag checks and that for data read. In conventional caches, the height (or the total number of word-lines) of the tag memory and that of the data memory are equal, but not the width (or the total number of bit-lines). The tag-memory width depends on the tag size, while the data-memory width depends on the cache-line size. Usually, the tag size is much smaller than the cache-line size. For example, in the case of a 16 KB direct-mapped cache having 32-byte lines, the cache-line size is 256 bits (32 × 8), while the tag size is 18 bits (32 - 9-bit index - 5-bit offset). Thus, the total cache energy is dominated by data-memory accesses. Cache subbanking is one of the approaches to reducing the data-memory-access energy. The data-memory array is partitioned into several subbanks, and only the subbank including the target data is activated [6]. Figure 1 depicts the breakdown of cache-access energy of a 16 KB direct-mapped cache as the number of subbanks is varied.
Fig. 1. Effect of tag-check energy
We have calculated the energy based on Kamble’s model [6]. All the results are normalized to the conventional configuration denoted as “1(8)”. It is clear from the figure that increasing the number of subbanks significantly reduces the data-memory energy. Since the tag-memory energy remains unchanged, however, it becomes a significant factor. If the number of subbanks is 8, about 30 % and 50 % of the total energy are dissipated by the tag memory where the word size is 32 bits and 64 bits, respectively.

In this paper, we focus on the energy consumed for tag checks, and propose an architecture for low-power direct-mapped instruction caches, called the “history-based tag-comparison (HBTC) cache”. The basic idea of the HBTC cache has been introduced in [4]. The HBTC cache attempts to detect and omit unnecessary tag checks at run time. When an instruction block is referenced without causing any cache miss, a corresponding execution footprint is recorded in an extended BTB (Branch Target Buffer). All execution footprints are erased whenever a cache miss takes place, because the instruction block (or a part of the instruction block) might be evicted from the cache. The execution footprint indicates whether the instruction block currently resides in the cache. At and after the next execution of that instruction block, if the execution footprint is detected, all tag checks are omitted. In our simulation, it has been observed that
our approach can reduce the total count of tag checks by 90 %, resulting in a 15 % reduction in cache energy, with less than 0.5 % performance degradation.

The rest of this paper is organized as follows. Section 2 reviews related work, and explains the details of another tag-check omission technique proposed in [11] as a comparative approach. Section 3 presents the concept and mechanism of the HBTC cache. Section 4 reports evaluation results for the performance/energy efficiency of our approach, and Section 5 concludes this paper.
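For reference, the address-field arithmetic used in the introduction (a 16 KB direct-mapped cache with 32-byte lines and 32-bit addresses: 5-bit offset, 9-bit index, 18-bit tag) can be written out as a small helper. The names and output format are illustrative only.

    /* Address-field widths for a direct-mapped cache (illustrative helper). */
    #include <stdio.h>

    static int log2i(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    int main(void)
    {
        unsigned addr_bits = 32, cache_bytes = 16 * 1024, line_bytes = 32;

        unsigned offset_bits = log2i(line_bytes);                    /* 5  */
        unsigned index_bits  = log2i(cache_bytes / line_bytes);      /* 9  */
        unsigned tag_bits    = addr_bits - index_bits - offset_bits; /* 18 */

        printf("offset=%u index=%u tag=%u\n", offset_bits, index_bits, tag_bits);
        return 0;
    }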
2 Related Work
A technique to reduce the frequency of tag checks has been proposed in [11]. If successively executed instructions i and j reside in the same cache line, then we can omit the tag check for instruction j. Namely, the cache proposed in [11] performs tag checks only when i and j reside in different cache lines. We call this cache the interline tag-comparison (ITC) cache. This kind of traditional technique has been employed in commercial microprocessors, e.g., ARMs. The ITC cache detects unnecessary tag checks by monitoring the program counter (PC). In contrast to the ITC cache, our approach exploits an extended BTB in order to record instruction-access history, and can omit unnecessary tag checks even if successive instructions reside in different cache lines. In Section 4.2, we compare our approach with the ITC cache. Direct Addressing (DA) is another scheme to omit tag checks [13]. In DA, previous tag-check results are recorded in the DA register, and are reused for future cache accesses. The DA register is controlled by the compiler, whereas our HBTC cache does not need any software support. Note that the ITC cache and the DA scheme can be used for both instruction caches and data caches, while our HBTC cache can be used only for direct-mapped instruction caches. The extension to set-associative caches is discussed in Section 5. Ma et al. [9] have proposed a dynamic approach to omitting tag checks. In their approach, the cache line structure is extended for recording valid links, and a branch-link is implemented per two instructions. Their approach can be applied regardless of cache associativity. The HBTC cache is another alternative to implement their approach on direct-mapped instruction caches, and can be organized with smaller hardware overhead. This is because the HBTC cache records 1-bit cache-residence information for each instruction block, which could be larger than the cache line. The S-cache has also been proposed in [11]. The S-cache is a small memory added to the L1 cache, and has statically allocated address space. No cache replacements occur in the S-cache. Therefore, S-cache accesses can be done without tag checks. The scratchpad-memory [10], the loop-cache [3], and the decompressor-memory [5] also employ this kind of small memory, and have the same effect as the S-cache. In the scratchpad-memory and the loop-cache, application programs are analyzed statically, and the compiler allocates frequently executed instructions to the small memory. For the S-cache and the decompressor-memory, prior simulations using an input-data set are required to optimize the code
They differ from ours in two respects. First, these caches require static analysis. Second, the cache has to be separated into a dynamically allocated memory space (i.e., the main cache) and a statically allocated memory space (i.e., the small cache). The HBTC cache does not require these arrangements. The filter cache [8] achieves low power consumption by adding a very small L0 cache between the processor and the L1 cache. The advantage of the L0 cache largely depends on how many memory references hit the L0 cache. Block buffering can achieve the same effect as the filter cache [6]. Bellas et al. [2] proposed a run-time cache-management technique to allocate the most frequently executed instruction blocks to the L0 cache. On L0-cache hits, accesses to both the tag memory and the data memory of the L1 cache are avoided, so no tag checks are performed at the L1 cache. However, on L0-cache misses, the L1 cache is accessed with conventional behavior (tag checks are required). Our approach can be used in conjunction with L0 caches in order to avoid L1-cache tag checks.
3 History-Based Tag-Comparison Cache
3.1 Concept
On an access to a direct-mapped cache, a tag check is performed to determine whether the memory reference hits the cache. For almost all programs, instruction caches achieve very high hit rates. In other words, the state (or contents) of the instruction cache rarely changes. The state of the instruction cache changes only when a cache miss takes place, by filling the missed instruction (and some instructions residing in the same cache line as the missed instruction). Therefore, if an instruction is referenced once, it stays in the cache at least until the next cache miss occurs. We refer to the period between one cache miss and the next as a stable-time. Here, we consider the case where an instruction is executed repeatedly. At the first reference of the instruction, the tag check has to be performed. However, at and after the second reference, if no cache miss has occurred since the first reference, it is guaranteed that the target instruction currently resides in the cache. Therefore, for accesses to the same instruction within a stable-time, performing a tag check is required only at the first reference, not at the following references. We can omit tag checks if the following conditions are satisfied.
– The target instruction has been executed at least once.
– No cache miss has occurred since the previous execution of the target instruction.
Figure 2 shows how many unnecessary tag checks are performed in a conventional 16 KB direct-mapped cache for two SPEC benchmark programs. The simulation environment is explained in Section 4.1. The y-axis is the average reference count (up to one hundred) for each cache line per stable-time; cases where a cache line was never referenced in a stable-time are ignored. The x-axis is the cache-line address.
Fig. 2. Opportunity of tag-check omission: average number of reference counts per stable-time (capped at one hundred) versus cache-line address (0–511) for 129.compress and 132.ijpeg
It can be understood from the figure that the conventional cache wastes a great deal of energy on unnecessary tag checks. Almost all cache lines are referenced more than four times in a stable-time, and some cache lines are referenced more than one hundred times. In order to detect the conditions for omitting unnecessary tag checks, the HBTC cache records execution footprints in an extended BTB (Branch Target Buffer). An execution footprint indicates whether the target-instruction block or the fall-through-instruction block associated with a branch resides in the cache. An execution footprint is recorded after all instructions in the corresponding instruction block have been referenced. All execution footprints are erased, or invalidated, whenever a cache miss takes place. At the execution of an instruction block, if the corresponding execution footprint is detected, we can fetch instructions without performing tag checks.
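The decision can be summarized as a simple predicate. The following is a minimal sketch (the names are illustrative, not part of the proposed hardware interface), assuming one footprint bit per instruction block as described in the next subsections.

/* Illustrative only: a tag check may be omitted when the fetched block
 * has been executed before and no cache miss has occurred since then. */
typedef struct {
    int executed_before;   /* block referenced at least once         */
    int footprint_valid;   /* no I-cache miss since that reference   */
} block_history_t;

static int can_omit_tag_check(const block_history_t *h)
{
    return h->executed_before && h->footprint_valid;
}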
3.2 Organization
Figure 3 depicts the organization of the extended BTB. The following two 1-bit flags are added to each BTB entry.
Fig. 3. The organization of a direct-mapped HBTC cache: a branch target buffer extended with per-entry EFT and EFF execution-footprint bits, the PBAreg, and a mode controller that drives tag-check omission in the direct-mapped instruction cache
– EFT (Execution Footprint of Target instructions): This is the execution footprint of the branch-target-instruction block, whose beginning address is indicated by the target address of the current branch.
– EFF (Execution Footprint of Fall-through instructions): This is the execution footprint of the fall-through-instruction block, whose beginning address is indicated by the fall-through address of the current branch.
The end address of the branch-target- and fall-through-instruction blocks is indicated by another branch-instruction address already registered in the BTB, as shown in Figure 3. In addition, the following hardware components are required.
– Mode Controller: This component selects one of the following operation modes based on the execution footprints read from the extended BTB. The details of its operation are explained in Section 3.3.
• Normal-Mode (Nmode): The HBTC cache behaves as a conventional cache (tag checks are performed).
• Omitting-Mode (Omode): Tag checks for instruction-cache accesses are omitted.
• Tracing-Mode (Tmode): The HBTC cache behaves as a conventional cache (tag checks are performed). When a BTB hit is detected in this mode, the execution footprint indexed by the PBAreg is set to '1'.
– PBAreg (Previous Branch-instruction Address REGister): This register keeps the previous branch-instruction address. The prediction result (taken or not-taken) is also kept.
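The additions amount to two flag bits per BTB entry plus one small register. The sketch below is illustrative only (field names and widths are assumptions, not the paper's hardware layout).

/* Sketch of one extended BTB entry; only the EFT/EFF bits are additions
 * over a conventional BTB entry. */
typedef struct {
    unsigned int branch_addr;     /* branch-instruction address (tag)       */
    unsigned int target_addr;     /* predicted branch-target address        */
    unsigned int eft : 1;         /* footprint of the branch-target block   */
    unsigned int eff : 1;         /* footprint of the fall-through block    */
} hbtc_btb_entry_t;

/* PBAreg: previous branch-instruction address and its prediction result. */
typedef struct {
    unsigned int prev_branch_addr;
    unsigned int taken : 1;
} pbareg_t;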
Fig. 4. Operation-mode transition: on a BTB hit the cache enters Omode when the selected footprint (EFT or EFF) is '1' and Tmode when it is '0'; an I-cache miss, BTB replacement, RAS access, or branch misprediction returns it to Normal mode
3.3 Operation
Execution footprints (i.e., the EFT and EFF flags) are set or erased at run time. Figure 4 shows the operation-mode transitions. On every BTB hit, the HBTC cache works as follows:
1. Regardless of the current operation mode, both the EFT and EFF flags associated with the BTB-hit entry are read in parallel.
2. Based on the branch-prediction result, EFT (for taken) or EFF (for not-taken) is selected.
3. If the selected execution footprint is '1', the operation mode transitions to Omode.
4. Otherwise, the operation mode transitions to Tmode. At that time, the current PC (the branch-instruction address) and the branch-prediction result are stored in the PBAreg.
Whenever a cache miss takes place, the operation mode transitions to Nmode, as explained in the next paragraph. Therefore, a BTB hit in Tmode means that no cache miss has occurred since the previous BTB hit. In other words, the instruction block whose beginning address is indicated by the PBAreg and whose end address is indicated by the current branch-instruction address has been referenced without causing any cache miss. Thus, when a BTB hit occurs in Tmode, the execution footprint indexed by the PBAreg is validated (set to 1). If one of the following events takes place, execution footprints have to be invalidated, and the operation mode transitions to Nmode.
– Instruction-cache miss: The state of the instruction cache is changed by filling the missed instruction. The cache-line replacement might evict an instruction block (or a part of one) corresponding to valid execution footprints from the cache. Therefore, the execution footprints of the victim line have to be invalidated.
– BTB replacement: As explained in Section 3.2, the end address of an instruction block is indicated by another branch-instruction address already registered in the BTB. We lose this end-address information when the BTB entry is evicted. Thus, the execution footprints of the instruction block whose end address is indicated by the victim BTB entry have to be invalidated.
Although it is possible to invalidate only the execution footprints affected by the cache miss or the BTB replacement, we have employed a conservative scheme, i.e., all execution footprints in the extended BTB are invalidated. In addition, when an indirect jump is executed or a branch misprediction is detected, the HBTC cache operates in Nmode (tag checks are performed as in the conventional organization). These decisions make it possible to avoid area overhead and complex control logic; a sketch of the resulting control flow is given below.
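The following is a minimal sketch of this mode control, building on the BTB-entry sketch in Section 3.2; the helper functions are hypothetical stand-ins for the footprint read/write ports, and the code is illustrative rather than the paper's control logic.

typedef enum { NMODE, OMODE, TMODE } hbtc_mode_t;

static hbtc_mode_t mode = NMODE;

void set_footprint(pbareg_t *pba);        /* hypothetical: set EFT or EFF    */
void invalidate_all_footprints(void);     /* hypothetical: clear every flag  */

/* Called on every BTB hit. */
void on_btb_hit(hbtc_btb_entry_t *e, int predicted_taken,
                pbareg_t *pba, unsigned int branch_pc)
{
    /* A BTB hit in Tmode means no cache miss has occurred since the previous
     * branch, so the footprint of the block recorded in PBAreg is validated. */
    if (mode == TMODE)
        set_footprint(pba);

    if (predicted_taken ? e->eft : e->eff) {
        mode = OMODE;                     /* fetch without tag checks */
    } else {
        mode = TMODE;                     /* start tracing this block */
        pba->prev_branch_addr = branch_pc;
        pba->taken = (unsigned int)(predicted_taken != 0);
    }
}

/* Cache misses and BTB replacements conservatively invalidate all footprints. */
void on_icache_miss_or_btb_replacement(void)
{
    invalidate_all_footprints();
    mode = NMODE;
}

/* RAS accesses (indirect jumps) and branch mispredictions force Nmode only. */
void on_ras_access_or_misprediction(void)
{
    mode = NMODE;
}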
3.4 Advantages and Disadvantages
The total energy dissipated in the HBTC cache (E_TOTAL) can be expressed as follows:

    E_TOTAL = E_CACHE + E_BTBadd,

where E_CACHE is the energy consumed in the instruction cache and E_BTBadd is the additional energy for the BTB extension. The energy consumed in the conventional BTB organization is not included. E_CACHE can be approximated by the following equation:

    E_CACHE = E_tag + E_data + E_output + E_ainput,

where E_tag and E_data are the energy consumed in the tag memory and data memory, respectively, E_output is the energy for driving output buses, and E_ainput is that for address decoding. In this paper, we do not consider E_ainput, because other work has reported that it is about three orders of magnitude smaller than the other components [1] [8]. E_BTBadd can be expressed as follows:

    E_BTBadd = E_BTBef + E_BTBlogic,

where E_BTBef is the energy consumed for reading and writing execution footprints, and E_BTBlogic is that for the control logic (i.e., the mode controller and PBAreg). The logic portion can be implemented with simple and small hardware, so we do not account for E_BTBlogic. In Omitting-Mode (Omode), the energy consumed for tag checks (E_tag) is completely eliminated. However, the energy for accessing execution footprints (E_BTBef) appears as overhead on every BTB access. On the other hand, from a performance point of view, the HBTC cache causes performance degradation. Reading execution footprints can be performed in parallel with the normal BTB access from the microprocessor. However, for writing, the HBTC cache causes one processor-stall cycle, because the BTB entry accessed for execution-footprint writing and that accessed for branch-target reading are different. Whenever
a cache miss or BTB replacement takes place, execution-footprint invalidation is required. This operation also causes processor-stall cycles, because BTB access from the microprocessor has to wait until the invalidation is completed. The invalidation penalty largely depends on the implementation of the BTB. In Section 4.2, we discuss the effects of the invalidation penalty on processor performance.
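As a rough illustration of the cost model above (not the authors' evaluation code; the field names are assumptions), the bookkeeping can be written as follows, with values in whatever units the underlying energy model supplies.

/* E_ainput and E_BTBlogic are omitted, as in the text. */
typedef struct {
    double e_tag;       /* tag-memory accesses (eliminated in Omode)        */
    double e_data;      /* data-memory accesses                             */
    double e_output;    /* output-bus driving                               */
    double e_btb_ef;    /* reading/writing execution footprints in the BTB  */
} hbtc_energy_t;

double hbtc_total_energy(const hbtc_energy_t *e)
{
    double e_cache   = e->e_tag + e->e_data + e->e_output;   /* E_CACHE  */
    double e_btb_add = e->e_btb_ef;                           /* E_BTBadd */
    return e_cache + e_btb_add;                               /* E_TOTAL  */
}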
4 Evaluation
4.1 Simulation Environment
In order to evaluate the performance-energy efficiency of the HBTC cache, we have measured the total energy consumption (E_TOTAL) explained in Section 3.4 and the total number of clock cycles as the performance metric. We modified the SimpleScalar source code for this simulation [15]. To calculate energy consumption, the cache energy model assuming a 0.8 µm CMOS technology explained in [6] was used. We took the load capacitance for each node from [7, 12]. In this simulation, the following configuration was assumed: the instruction-cache size is 16 KB, the cache-line size is 32 B, the direct-mapped branch-prediction table has 2048 entries, the predictor type is bimod, the number of BTB sets is 512, the BTB associativity is 4, and the RAS size is 8. For the other parameters, the default values of the SimpleScalar out-of-order simulator were used. In addition, we assumed that all caches evaluated in this paper employ the subbanking approach; the 16 KB data memory is partitioned into 4 subbanks. The following benchmark programs were used in this evaluation.
– SPECint95 [16]: 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg (using the training input).
– Mediabench [14]: adpcm encode, adpcm decode, mpeg2 encode, mpeg2 decode.
4.2 Results
Tag-Check Count. Figure 5 shows the tag-check counts required for whole program executions. All results are normalized to a 16 KB conventional cache. The figure includes the simulation results for the ITC cache explained in Section 2 and for the combination of the ITC cache and the HBTC cache. Since sequential accesses are inherent in programs, the ITC cache works well for all benchmark programs, whereas the effectiveness of the HBTC cache is application dependent. The HBTC cache produces a larger tag-check count reduction than the ITC cache for two SPEC integer programs, 129.compress and 132.ijpeg, and for all media programs. In the best case, adpcm_dec, the tag-check count is reduced by about 90 %. However, for the other benchmark programs, the ITC cache is superior to our approach. This result can be understood by considering the characteristics of the benchmark programs: media application programs have relatively well-structured loops, and the HBTC cache attempts to avoid performing unnecessary tag checks by exploiting such iterative execution behavior.
Fig. 5. Tag-check count compared with other approaches: normalized tag-check count for each benchmark program (ITC: interline tag-comparison cache; HBTC: history-based tag-comparison cache; Comb: combination of ITC and HBTC)
Thus, if our main target is media applications, employing the HBTC cache yields energy advantages; otherwise, we should employ the ITC cache. The hybrid model of the ITC cache and the HBTC cache makes significant reductions: it eliminates between 80 % and 95 % of the unnecessary tag checks for all benchmark programs. Therefore, we conclude that combining the ITC and HBTC caches is the best approach to avoiding the energy dissipation caused by unnecessary tag checks.

Energy Consumption. Figure 6 reports the energy consumption of the HBTC cache and its breakdown for each benchmark program. All results are normalized to the conventional cache. As explained in Section 4.1, a 0.8 µm CMOS technology is assumed. The energy model used in this paper does not account for the energy consumed in sense amplifiers. However, we believe that the energy reduction reported in this section can be achieved even if sense amplifiers are considered, because tag-memory accesses can be completely eliminated when the HBTC cache works in Omitting-Mode, so the energy consumed in the tag-memory sense amplifiers is eliminated as well. As discussed above, the HBTC cache makes a significant tag-check count reduction for 129.compress, 132.ijpeg, and all media application programs.
Fig. 6. Cache-energy consumption: normalized cache-energy consumption for each benchmark program, broken down into E_data, E_tag, E_output, and E_BTBadd
Since the extension of each BTB entry for execution footprints is only 2 bits, the energy overhead for BTB accesses (E_BTBadd) does not have a large impact on the total cache energy. As a result, the HBTC cache reduces the total cache energy by about 15 %. However, for 099.go and 126.gcc, the energy reduction is only 2 % to 3 %, because the HBTC cache could not effectively eliminate unnecessary tag checks due to the irregular execution behavior of these programs.

Performance Overhead. As explained in Section 3.4, the HBTC cache causes processor stalls when the extended BTB is updated to record or invalidate execution footprints. Figure 7 shows the program-execution time in terms of the total number of clock cycles. All results are normalized to the conventional organization. From the simulation results, it is observed that the performance degradation is less than 1 % for all but three benchmark programs. However, for 126.gcc, the performance is degraded by about 2.5 %, which might not be acceptable if high performance is strictly required. The processor stalls are caused by BTB accesses from the processor conflicting with the update operations for execution footprints.
Fig. 7. Program execution time: execution time in clock cycles for each benchmark program, normalized to the conventional organization
In order to alleviate the negative effect of the HBTC cache, we can consider two approaches. The first is to pre-decode fetched instructions. Since a conventional BTB is accessed on every instruction fetch regardless of the instruction type, processor stalls occur whenever execution footprints are updated. By pre-decoding fetched instructions, we can determine whether an instruction needs to access the BTB before starting the normal BTB access. In this case, processor stalls occur only when a branch (or jump) instruction conflicts with an execution-footprint update. The second approach is to add decoder logic for accessing execution footprints, which makes it possible to access the BTB for branch-target reads and for execution-footprint updates simultaneously.

Effects of the Execution-Footprint-Invalidation Penalty. All execution footprints recorded in the extended BTB are invalidated whenever a cache miss or a BTB replacement takes place. So far, we have assumed that the invalidation can be completed in one processor-clock cycle. However, the invalidation penalty largely depends on the implementation of the extended BTB.
Fig. 8. Effect of execution-footprint invalidation penalty: execution time normalized to the conventional organization as the invalidation penalty varies from 1 to 32 clock cycles, for each benchmark program
Figure 8 depicts the performance overhead caused by the HBTC approach as the invalidation penalty is varied from 1 to 32 cycles. The y-axis indicates the program-execution time normalized to the conventional organization for all benchmark programs, and the x-axis shows the invalidation penalty in clock cycles. For all benchmark programs, it is observed that the performance degradation is trivial if the invalidation penalty is equal to or less than 4 clock cycles. We have analyzed the breakdown of the invalidations and found that more than 98 % are caused by cache misses (less than 2 % are caused by BTB replacements). The invalidation penalty can be hidden if it is smaller than the cache-miss penalty; in this evaluation, we have assumed that the cache-miss penalty is 6 clock cycles. However, the invalidation penalty clearly appears where it is greater than 6 clock cycles, so we see a large performance degradation for 099.go and 126.gcc. On the other hand, for 132.ijpeg, adpcm_enc, adpcm_dec, and mpeg2 decode, the performance degradation is small even if the invalidation penalty is large. This is because the cache-miss rates for these programs are low, resulting in a small number of invalidations. Indeed, the cache-miss rates of 099.go, 126.gcc, 132.ijpeg, and mpeg2 decode were 4.7 %, 5.5 %, 0.5 %, and 0.5 %, respectively.
5 Conclusions
In this paper, we have proposed the history-based tag-comparison (HBTC) cache for low energy consumption. The HBTC cache exploits the following two facts: first, instruction-cache hit rates are very high; second, almost all programs contain many loops. The HBTC cache records execution footprints and determines whether the instructions to be fetched currently reside in the cache, without performing tag checks. An extended branch target buffer (BTB) is used to record the execution footprints. In our simulation, it has been observed that the HBTC cache can reduce the total count of tag checks by about 90 %, resulting in a 15 % reduction in cache energy. In our evaluation, it has been assumed that the BTB size, i.e., the total number of BTB entries, is fixed. Our future work is to evaluate the effects of the BTB size on the energy reduction achieved by the HBTC cache; in addition, the effects of the branch-predictor type will be evaluated. Another direction for future work is to establish a microarchitecture for set-associative caches: by memorizing way-access information as proposed in [9], we can extend the HBTC approach to set-associative caches.
References
[1] Bahar, I., Albera, G., and Manne, S.: Power and Performance Tradeoffs using Various Caching Strategies. Proc. of the 1998 International Symposium on Low Power Electronics and Design, pp. 64–69, Aug. 1998.
[2] Bellas, N., Hajj, I., and Polychronopoulos, C.: Using Dynamic Cache Management Techniques to Reduce Energy in a High-Performance Processor. Proc. of the 1999 International Symposium on Low Power Electronics and Design, pp. 64–69, Aug. 1999.
[3] Bellas, N., Hajj, I., Polychronopoulos, C., and Stamoulis, G.: Energy and Performance Improvements in Microprocessor Design using a Loop Cache. Proc. of the 1999 International Conference on Computer Design: VLSI in Computers & Processors, pp. 378–383, Oct. 1999.
[4] Inoue, K. and Murakami, K.: A Low-Power Instruction Cache Architecture Exploiting Program Execution Footprints. International Symposium on High-Performance Computer Architecture, Work-in-progress session (included in the CD proceedings), Feb. 2001.
[5] Ishihara, T. and Yasuura, H.: A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors. Proc. of the Design, Automation and Test in Europe Conference, pp. 617–623, Mar. 2000.
[6] Kamble, M. and Ghose, K.: Analytical Energy Dissipation Models For Low Power Caches. Proc. of the 1997 International Symposium on Low Power Electronics and Design, pp. 143–148, Aug. 1997.
[7] Kamble, M. and Ghose, K.: Energy-Efficiency of VLSI Caches: A Comparative Study. Proc. of the 10th International Conference on VLSI Design, pp. 261–267, Jan. 1997.
[8] Kin, J., Gupta, M., and Mangione-Smith, W.: The Filter Cache: An Energy Efficient Memory Structure. Proc. of the 30th Annual International Symposium on Microarchitecture, pp. 184–193, Dec. 1997.
[9] Ma, A., Zhang, M., and Asanović, K.: Way Memorization to Reduce Fetch Energy in Instruction Caches. ISCA Workshop on Complexity Effective Design, July 2001.
[10] Panda, R., Dutt, N., and Nicolau, A.: Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications. Proc. of the European Design & Test Conference, Mar. 1997.
[11] Panwar, R. and Rennels, D.: Reducing the Frequency of Tag Compares for Low Power I-Cache Design. Proc. of the 1995 International Symposium on Low Power Electronics and Design, Aug. 1995.
[12] Wilton, S. and Jouppi, N.: An Enhanced Access and Cycle Time Model for On-Chip Caches. WRL Research Report 93/5, July 1994.
[13] Witchel, E., Larsen, S., Ananian, C., and Asanović, K.: Direct Addressed Caches for Reduced Power Consumption. Proc. of the 34th International Symposium on Microarchitecture, Dec. 2001.
[14] MediaBench, URL: http://www.cs.ucla.edu/~leec/mediabench/.
[15] SimpleScalar Simulation Tools for Microprocessor and System Evaluation, URL: http://www.simplescalar.org/.
[16] SPEC (Standard Performance Evaluation Corporation), URL: http://www.specbench.org/osg/cpu95.
A Hardware Architecture for Dynamic Performance and Energy Adaptation

Phillip Stanley-Marbell¹, Michael S. Hsiao², and Ulrich Kremer³
¹ Dept. of ECE, Carnegie Mellon University, Pittsburgh, PA 15213, [email protected]
² Dept. of ECE, Virginia Tech, Blacksburg, VA 24061, [email protected]
³ Dept. of Computer Science, Rutgers University, Piscataway, NJ 08854, [email protected]
Abstract. The energy consumption of any component in a system may sometimes constitute just a small percentage of that of the overall system, making it necessary to address the issue of energy efficiency across the entire range of system components, from memory, to the CPU, to peripherals. Presented is a hardware architecture for detecting, at runtime, regions of application execution for which there is an opportunity to run a device at a slightly lower performance level, by reducing the operating frequency and voltage, to save energy. The proposed architecture, the Power Adaptation Unit (PAU), may be used to control the operating voltage of various system components, ranging from the CPU core to memory and peripherals. An evaluation of the tradeoffs in performance versus energy savings and hardware cost of the PAU is conducted, along with results on its efficacy for a set of benchmarks. It is shown that, on average, a single-entry PAU provides energy savings of 27%, with a corresponding performance degradation of 0.75%, for the SPEC CPU 2000 integer and floating-point benchmarks investigated.
1 Introduction
Reduction of the overall energy usage and per-cycle power consumption in microprocessors is becoming increasingly important as device integration increases, since it increases the density of generated heat, which creates problems for reliability and packaging. Increased energy usage is likewise undesirable in applications with limited energy resources, such as mobile, battery-powered applications. Reduction in microprocessor energy consumption can be achieved through many means, from altering the transistor-level design and manufacturing process to consume less power per device, to modifying the processor microarchitecture to reduce energy consumption. It is necessary to address the issue of energy efficiency across the entire range of system components, from memory, to the CPU, to peripherals, since the CPU energy consumption may sometimes constitute only a small percentage of that of the complete system. It is no longer sufficient
for systems to be low power; they must also be energy aware, adapting to application behavior and user requirements. In applications in which there is an imbalance between the amount of computation and the time spent waiting for memory, it is possible to reduce the operating frequency and voltage of the CPU, memory, or peripherals, to reduce energy consumption at the cost of a tolerable performance penalty. Previous studies have shown that there is significant opportunity for slowing down system components such as the CPU without incurring significant overall performance penalties [9, 8, 7]. Compiler approaches rely on static analyses to predict program behavior. In many cases, static information may not be accurate enough to take full advantage of program optimization opportunities. However, static analyses often have a more global view of overall program structure, allowing coarse-grain program transformations in order to enable further optimizations at the fine-grain level by the hardware. Hardware approaches are often complementary to compiler-based approaches such as [9, 7]. The window of instructions that is seen by hardware may not always be large enough to make voltage and frequency scaling feasible. However, even though only the compiler can potentially have a complete view of the entire program structure, only the hardware has knowledge of runtime program behavior. Presented in this paper is a hardware architecture for detecting, at runtime, regions of application execution for which there is the possibility to run a device (e.g., the CPU core) with a bounded decrease in performance while obtaining a significant decrease in per-cycle power and overall energy consumption. The proposed architecture, the Power Adaptation Unit (PAU), appropriately sets the operating voltage and frequency of the device to reduce power dissipation and conserve energy, while not incurring more than a prescribed performance penalty. The PAU attempts to effectively identify such dynamic program regions and to determine when it would be beneficial to perform voltage and frequency scaling, given the inherent overheads. Because of the type of behavior the PAU captures, even a small, single-entry PAU is effective in reducing power consumption under a bounded performance penalty for the benchmark applications investigated. The additional hardware overhead due to the PAU is minimized by the fact that a majority of the facilities it relies on (e.g., performance counters) are already integrated into contemporary processor designs. The overhead of maintaining PAU state is shown to be small, and proposals are provided for using existing hardware to implement other functionality required by the PAU. The remainder of this paper is structured as follows. The next section describes opportunities for implementing dynamic resource scaling. Section 3 details the structure of the PAU architecture and describes how entries are created and managed in the PAU. Section 4 illustrates the action of the PAU with an example. Section 5 discusses the analytical limits of the utility of the PAU. Section 6 discusses the hardware overheads of the PAU. Section 7 presents simulation results for 8 benchmarks from the SPEC CPU 2000 integer and floating-point benchmark suites. Section 8 discusses related work, and Section 9 concludes the paper.
Fig. 1. Opportunities for energy/performance tradeoff in a single-issue architecture: a timeline of ALU instructions, memory accesses, and memory stalls at the rated CPU frequency and at a halved CPU frequency
2 Opportunities for Scaling
In statically scheduled single-issue processors, any decrease in the operating voltage or frequency will lead to longer execution times. However, if the application being executed is memory-bound in nature (i.e., it has a significant number of memory accesses which cause cache misses), the processor may spend most of its time waiting for memory. In such memory-bound applications, if the processor is run at a reduced operating voltage while memory remains at speed, the portions of the runtime performing computation (few) will take more time, while the portions of the runtime performing memory stalls (many) will remain the same, as illustrated in Figure 1. As illustrated in the figure, halving the operating voltage and frequency of the CPU while keeping that of memory constant can result in ideal-case energy savings of 87.5%, with a 43% degradation in performance for the example scenario. In practice, this can only be approximately achieved, as there are dependencies between the operating frequency of the CPU core and that of memory (one usually runs at a multiple of the frequency of the other). Dynamically scheduled, multiple-issue (superscalar) architectures permit the overlapping of computation and memory stalls, and would witness a smaller slowdown if the CPU core (or portions of it) were run at a lower voltage while the operating voltage of memory were kept constant. This initial work focuses on single-issue in-order processors, such as those typically employed in low-power embedded systems. The benefits to superscalar architectures will be pursued in future work.
3 Power Adaptation Unit
The power adaptation unit (PAU) is a hardware structure which detects regions of program execution with an imbalance between memory and CPU activity, such as code whose execution leads to frequent repeated memory stalls (e.g., memory-bound loop computations), or regions of execution that lead to significant CPU activity with little memory activity (e.g., CPU-bound loop computations). In both cases, the PAU outputs control signals suitable for adjusting the operating frequency and voltage of the unit it monitors, to values at which less power is consumed per cycle while similar performance is maintained.
Fig. 2. Typical implementation of the PAU: the PAU monitors the CPU and drives a programmable voltage/frequency controller that sets FREQ and Vdd for the CPU, memory, and peripherals
The PAU must determine an appropriate voltage and frequency at which to run the device it controls, such that a specified performance penalty (say, 1%) will not be exceeded. The PAU in a typical system architecture is shown in Figure 2. The next two subsections focus on controlling the CPU for memory-bound applications and on extending these ideas to controlling the cache and memory in CPU-bound applications, respectively.
Fig. 3. The Power Adaptation Unit: the PAU table is indexed by the low-order PC bits and tagged with the remaining bits; each entry holds STRIDE, NCLKS, NINSTR, and Q fields together with Init/Transient/Active/Valid state bits, and feeds a delta computation that drives the programmable voltage controller
Fig. 4. PAU entry state transition diagram: an entry is created in the INIT state; a stall moves it to TRANSIENT and then to ACTIVE once Q reaches HIH2O (on each stall, STRIDE := NCLK, NCLK := 0, Q++); on each clock cycle NCLK is incremented and, if NCLK > STRIDE, then STRIDE := NCLK, NCLK := 0, Q--; an ACTIVE entry decays once Q <= LOH2O, a TRANSIENT entry with Q = 0 becomes INVALID, and INIT entries time out when NCLK reaches STRIDE_MAX
3.1 PAU Table Entry Management
The primary component of the PAU is the PAU table. Figure 3 illustrates the construction of the PAU table for a direct-mapped configuration. The least significant bits of the program counter, the index, are used to select a PAU table entry. A hit in the PAU occurs when there is a match between the Tag from the PC (the most significant bits before the index) and the Tag in the PAU entry, as illustrated in Figure 3. The PAU operates on windows, which are ranges of program-counter values in the dynamic instruction stream. Windows are defined by a starting PC value, a stride in clock cycles (STRIDE), and a count of overall instructions executed (NINSTRS). An entry corresponding to a window is created on an event such as a cache miss. The STRIDE field is a distance vector [3, 14], specifying the distance between occurrences of events in the iteration space. The NCLKS field maintains the age of the entry representing a window. Four state bits, INIT, TRANSIENT, ACTIVE, and INVALID, are used by the PAU to manage entries. The Q field is a saturating counter that indicates the degree of confidence to be attached to the particular PAU entry.
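A minimal sketch of one such entry follows; field widths and names are illustrative, not the actual hardware layout.

typedef struct {
    unsigned int tag;       /* PC tag bits (PC minus index and offset bits)  */
    unsigned int stride;    /* clock cycles between successive stalls        */
    unsigned int nclks;     /* cycles elapsed since the last stall (age)     */
    unsigned int ninstrs;   /* instructions completed during the last stride */
    unsigned int q;         /* saturating confidence counter                 */
    enum { PAU_INIT, PAU_TRANSIENT, PAU_ACTIVE, PAU_INVALID } state;
} pau_entry_t;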
Figure 4 shows the state transition diagram for a PAU entry. There are four states in which an entry may exist, INIT, TRANSIENT, ACTIVE, and INVALID, corresponding to the state bits in the PAU entry described previously. Transitions between states occur either when there is a pipeline stall due to a cache miss, or are induced by the passage of a clock cycle. The two extremes of the state machine are the INVALID and ACTIVE states, with the INIT and TRANSIENT states providing hysteresis between these two. In the figure, the transitions between states are labeled with both the event causing the transition (circled) and the action performed in the transition. For example, the transition between the TRANSIENT and INVALID states occurs on the passage of a clock cycle if NCLK is greater than STRIDE and Q is zero. Entries created in the PAU table are initially in the INIT state, and move to the TRANSIENT state when there is a stall caused by the instruction which maps to the entry in question. On every clock cycle, the NCLKS fields of all valid entries are incremented, on the faith that the entries will be used in the future. For all valid entries, the NCLKS field is reset to 0 on a stall, after copying it to the STRIDE field and incrementing Q. The number of instructions that completed in that period is then recorded in the NINSTRS field for that PAU entry. The goal of a PAU entry is to track PC values for which repeated stalls occur, and the PAU will effectively track such cases even if the time between stalls is not constant. Whenever the NCLKS field of a TRANSIENT or ACTIVE entry reaches the value of the STRIDE field, it is reset to zero and the Q field is decremented; therefore, if the distance between stalls decreases monotonically, the STRIDE will be correctly updated with the new iteration distance. If the number of iterations for which the distance between stalls is increased is large, then the entry will eventually be invalidated and then recreated in the table with the new STRIDE. This purpose is served by the high and low water marks (HIH2O and LOH2O). The high and low water marks determine how many repeated stalls must occur before an entry is considered ACTIVE, and how many clock cycles must elapse before an entry is degraded from ACTIVE status, respectively. The values of HIH2O and LOH2O may either be hard-coded in the architecture, or be modified by software such as an operating system, or by applications as a result of code appropriately inserted by a compiler. It is also possible to have the values of HIH2O and LOH2O adapt to application performance, their values being controlled using the information already summarized in the PAU, with a minimal amount of additional logic. Such techniques are beyond the scope of this paper and are left for future research. Once Q reaches the high water mark, the entry goes into the ACTIVE state. If a PAU hit occurs on an ACTIVE entry, VDD and FREQ are altered as described in Section 5. If Q falls below the low water mark, LOH2O, this indicates that the repeated stalls with equal stride have stopped happening, and have not occurred for STRIDE*(HIH2O-LOH2O) cycles. In such a situation, VDD and FREQ are set back to their default values.
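The per-entry update rules can be sketched as follows, building on the entry sketch above. This is a reading of the state diagram and text rather than the authors' implementation; HIH2O, LOH2O, and PAU_STRIDE_MAX are assumed configuration constants (the values shown match the examples used later in the paper), and the state chosen when Q decays below LOH2O is shown as TRANSIENT.

#define HIH2O          3                     /* assumed high water mark   */
#define LOH2O          1                     /* assumed low water mark    */
#define PAU_STRIDE_MAX (1u << 24)            /* assumed INIT-state timeout */

/* Called once per clock cycle for every valid entry. */
void pau_on_clock(pau_entry_t *e)
{
    switch (e->state) {
    case PAU_INIT:
        if (++e->nclks == PAU_STRIDE_MAX)
            e->state = PAU_INVALID;          /* single-stall entries time out */
        break;
    case PAU_TRANSIENT:
    case PAU_ACTIVE:
        if (++e->nclks > e->stride) {        /* the expected stall did not recur */
            e->stride = e->nclks;
            e->nclks  = 0;
            if (e->q > 0)
                e->q--;
            if (e->state == PAU_ACTIVE && e->q <= LOH2O)
                e->state = PAU_TRANSIENT;    /* restore default VDD/FREQ here */
            else if (e->state == PAU_TRANSIENT && e->q == 0)
                e->state = PAU_INVALID;
        }
        break;
    default:
        break;
    }
}

/* Called when a stall occurs for the PC that maps to this entry. */
void pau_on_stall(pau_entry_t *e, unsigned int instrs_since_last_stall)
{
    e->stride  = e->nclks;
    e->nclks   = 0;
    e->ninstrs = instrs_since_last_stall;
    if (e->q < HIH2O)
        e->q++;
    if (e->state == PAU_INIT)
        e->state = PAU_TRANSIENT;
    else if (e->state == PAU_TRANSIENT && e->q == HIH2O)
        e->state = PAU_ACTIVE;
}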
for (x = 100;;) {
    if (x-- > 0)
        a = i;
    b = *n;
    c = *p++;
}

Fig. 5. Example
In a multiprogramming environment where several different processes, with different performance characteristics, are multiplexed onto one or more processing units, HIH2O and LOH2O must permit the PAU to respond quickly enough, and STRIDE*(HIH2O-LOH2O) must be significantly smaller than the length of a process quantum. Alternatively, an operating system could invalidate all entries in the PAU on a context switch. Addresses that cause only one stall could potentially tie up a PAU entry forever; to avoid this, PAU entries in the INIT state time out after PAU_STRIDE_MAX cycles.
3.2 Handling Cache and Memory
For real benefit across the board, both memory-bound and CPU-bound applications must be handled simultaneously: either the CPU is stalled for memory, or memory is idle while the CPU is busy, or both may be busy. It is desirable to use the same structure, if possible, to detect CPU-bound code regions as for memory-bound regions, in order to amortize the on-chip real estate used in implementing the PAU. The control signals generated by a PAU entry for a given PC value can also be applied to shutting down memory banks or shutting down sets in a set-associative cache, along the lines of [2] and [18]. Periods of memory inactivity are detected by identifying memory load/store instructions at PC values for which the corresponding PAU entries' NINSTR and STRIDE fields indicate a large ratio of computation to memory stalls. For example, if NINSTR is very close to the ratio of STRIDE to the average machine CPI for the given architecture, then repeated memory accesses at the corresponding address incur very few cache misses. In such a situation, since most activity is occurring in the cache as opposed to main memory, memory can be run at a lower voltage. Similar techniques have previously been applied to RAMBUS DRAMs [1] in [15]. The PAU ensures that such adjustments occur only when they will be of long enough duration to be beneficial.
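The detection condition above can be sketched as a simple threshold test on a PAU entry (building on the entry sketch in Section 3.1); the average CPI and the closeness margin are assumptions introduced here for illustration.

#define AVGCPI           1.0    /* assumed average CPI of the core */
#define CPU_BOUND_MARGIN 0.9    /* assumed closeness threshold     */

/* A region is treated as CPU-bound when the instructions completed per
 * stride are close to what a stall-free window of STRIDE cycles allows. */
int region_is_cpu_bound(const pau_entry_t *e)
{
    double stall_free_instrs = (double)e->stride / AVGCPI;
    return (double)e->ninstrs >= CPU_BOUND_MARGIN * stall_free_instrs;
}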
4 Example
At any given time, there may be more than one PAU entry in the ACTIVE state, i.e., during the execution of a loop, there may be several program counter values
that lead to repeated stalls. In the example illustrated in Figure 5, let us assume that the assignments to variables a, b, and c all cause repeated cache misses (e.g., the variables reside at memory addresses that map to the same cache line in a direct-mapped cache). After one iteration of the loop, there will be three PAU entries corresponding to the PC values of the memory-access instructions for the three assignments, and these will be placed in the INIT state with their Q fields set to 0. The NCLK fields of all entries are incremented once each clock cycle thereafter. On the second iteration, after all three memory references cause cache misses once more, the three PAU entries move from the INIT state to the TRANSIENT state, the Q fields of the entries are incremented, and the value of the NCLK field is copied to the STRIDE field. The value of the NCLK field at this point denotes the number of clock cycles that have elapsed since the last hit for each entry. Likewise, the NINSTR field denotes the number of instructions that have been executed since the last hit to the entry. If the architecture is configured with a LOH2O of 1 and a HIH2O of 3, then, following a process similar to that described above, the entries graduate to the ACTIVE state in the third iteration of the loop. On the fourth iteration, with all three entries in the ACTIVE state, the first PAU hit occurs due to the memory reference associated with the assignment to variable a. On a hit to an ACTIVE PAU entry, the values of the STRIDE and NINSTR fields are used to calculate the factor by which to slow down the device being controlled, which in these discussions is the CPU. Intuitively, the ratio of NINSTR to STRIDE provides a measure of the ratio of computation to time spent stalling for memory. A detailed analysis of this calculation is given in the next section. After 100 iterations of the loop, the variable x in the program decrements to zero, and the PAU entry corresponding to the memory access to variable a degrades from ACTIVE to TRANSIENT and eventually to INVALID. In the organization of the PAU described here, the other ACTIVE entries would only be able to influence the operating voltage after this degradation from ACTIVE has occurred, (HIH2O-LOH2O)*STRIDE cycles after the 100th iteration of the loop. Energy is saved when there is a PAU hit on an ACTIVE entry and the operating voltage is lowered. When the operating voltage is lowered, however, increased gate delays make it necessary to reduce the operating frequency as well, to maintain correct circuit behavior. At the new operating voltage, instructions will take longer to execute, but memory accesses will incur the same penalty in terms of absolute time, though the number of memory-stall cycles will be smaller. There is an overhead (in both time and energy) involved in lowering the operating voltage, as well as in bringing it back up. This makes it useful to lower the voltage only if it can be determined that the processor will run at the low voltage for a long enough time. A more formal analysis of the opportunities for lowering the operating voltage/frequency, and of the overheads involved, is presented in the next section.
5 Limits on Energy Savings
It is possible to incur no performance degradation if computation and memory accesses can be perfectly overlapped, the program being executed is memory-bound, and the CPU is run at a slower-than-default execution rate. For an ACTIVE PAU entry, we can determine the effective instruction execution rate as

    instructions / time = NINSTR / (STRIDE / FREQ) = FREQ / (STRIDE / NINSTRS).

In the above, STRIDE/NINSTRS is the effective CPI, and is similar to the inverse of the average-rate requirement defined in [23]. It is desired to find an appropriate frequency at which we can run while keeping the ratio instructions/time constant. The following analysis is performed in terms of the frequency, and the interdependence between operating voltage and frequency is not explicitly shown. The maximum value of the instructions/time ratio is

    (1 / AVGCPI) / CYCLETIME = RATED_FREQ / AVGCPI,
where AVGCPI is the theoretical average number of cycles necessary to execute an instruction on the architecture of interest, and RATED_FREQ is the processor's rated operating frequency. For an architecture in which memory operations can be perfectly overlapped with execution, it will be possible to lower the clock frequency until the peak rate at the new frequency matches the measured rate:

    F_new / AVGCPI = RATED_FREQ / (STRIDE / NINSTRS).

Therefore

    F_new = RATED_FREQ · AVGCPI · NINSTRS / STRIDE.

The slowdown factor, δ, is a number greater than 1 by which the original operating frequency is divided to obtain the scaled frequency. The slowdown factor for the case of ideal overlap of memory operations and computation is

    δ_ideal+overlap = RATED_FREQ / F_new = STRIDE / (NINSTRS · AVGCPI).

In the general case, it will not be possible to perfectly overlap computation and memory accesses; thus the slowdown of the processor will not be hidden by memory latency, since memory accesses will be sequential with computation.
In architectures that cannot overlap memory access and computation, the performance penalty can still be relatively small compared to the savings in energy, and per-cycle power will almost certainly be reduced. In such situations, since there will always be a performance degradation, it is necessary to define a limit to the acceptable degradation in performance. For the purposes of evaluation, a maximum degradation in performance of 1% will be used throughout the remainder of this paper.
The execution times without and with CPU slowdown are

    T_old = T_mem + T_cpu
    T_new = T_mem + δ_no-overlap · T_cpu.

For a < 1% slowdown:

    (T_new − T_old) / T_old ≤ 0.01.

Therefore

    (δ_no-overlap · T_cpu − T_cpu) / (T_mem + T_cpu) ≤ 0.01

    δ_no-overlap ≤ (0.01 · (T_mem + T_cpu) + T_cpu) / T_cpu.

Let

    mem_frac = T_mem / (T_mem + T_cpu)
    cpu_frac = T_cpu / (T_mem + T_cpu),

then

    δ_no-overlap ≤ (0.01 · (mem_frac + cpu_frac) + cpu_frac) / cpu_frac.

The slowdown factor can also be expressed in terms of the entries in the PAU structure:

    mem_frac = (STRIDE − NINSTR) / STRIDE
    cpu_frac = NINSTR / STRIDE.

Then

    δ_no-overlap ≤ (0.01 · STRIDE + NINSTR) / NINSTR.
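A small sketch of how these bounds might be computed for an ACTIVE entry follows (illustrative C, following the reconstruction of the two bounds above; the 1% limit appears as MAX_PERF_LOSS).

#define MAX_PERF_LOSS 0.01

/* Ideal overlap of computation and memory accesses. */
double delta_ideal_overlap(unsigned int stride, unsigned int ninstr,
                           double avg_cpi)
{
    double d = (double)stride / ((double)ninstr * avg_cpi);
    return d > 1.0 ? d : 1.0;   /* never "slow down" by less than 1x */
}

/* No overlap: slowdown bounded by the acceptable performance loss. */
double delta_no_overlap(unsigned int stride, unsigned int ninstr)
{
    double d = (MAX_PERF_LOSS * (double)stride + (double)ninstr)
               / (double)ninstr;
    return d > 1.0 ? d : 1.0;
}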
Fig. 6. Effect of PAU Table size on energy consumption
6 PAU Overhead and Tradeoffs
In this section, the overheads involved in adjusting the operating voltage and frequency are discussed, as well as the area cost of implementing the PAU table. The energy cost incurred by the PAU structure itself will be addressed in our future research. As will be shown in Section 7, a PAU of even a single entry is effective in reducing energy consumption while incurring a minimal degradation in performance. Besides the PAU table, most of the information needed for each PAU entry is already available in current state-of-the-art architectures such as the Intel XScale architecture [11] and the Berkeley lpARM processor [16]. For example, the Intel XScale microarchitecture maintains event counters to monitor instruction and data cache hit rates, instruction and data Translation Look-aside Buffer (TLB) hit rates, pipeline stalls, Branch Target Buffer (BTB) prediction hit rates, and instruction execution count. Furthermore, eight additional events may be monitored when using the Intel XScale microarchitecture as the basis for an application-specific standard product [11]. The largest real-estate overhead of the PAU is incurred by the PAU table and the δ calculation. It should be possible to use otherwise idle functional units for the δ computation, as the computation and the attendant voltage scaling can be postponed if resources are unavailable. For an m-entry direct-mapped PAU, in a b-bit architecture with i-byte instructions, the number of bits PAU_bits needed to implement the PAU table is given by:
    PAU_bits = m · ((b − log2(m) − log2(i)) + 3 · log2(PAU_STRIDE_MAX) + log2(HIH2O) + 2)

The terms on the right-hand side of the above equation correspond to (1) the Tag, (2) NCLK, STRIDE, and NINSTR, (3) Q, and (4) the FREQ and entry-state bits, respectively. Thus, a single-entry PAU table can be implemented with just 106 bits on an architecture with a 32-bit PC and a chosen PAU_STRIDE_MAX of 2^24, HIH2O of 4, and 4-byte instructions.

Altering the operating voltage by the DC-DC converter is neither instantaneous nor energy-cost-free. In general, the time t_RFG taken to reconfigure from a voltage V1 to V2, with a maximum current I_MAX at the output of the converter, converter efficiency η, and a supply smoothing capacitor with capacitance C, is given, from [6], by

    t_RFG ≈ (2 · C / I_MAX) · |V2 − V1|.

Likewise, the energy cost of reconfiguration, E_RFG, is given as

    E_RFG = (1 − η) · C · |V2² − V1²|.

With a DC-DC converter smoothing capacitance of 10 µF, which is twice the minimum suggested in [6], a transition from 3.3 V to 1.65 V, I_MAX of 1 A, and η of 90%, t_RFG equals 33 µs. Similarly, the energy cost of reconfiguration, E_RFG, is 8.1675 µJ. In the simulations, a reconfiguration penalty of 1024 clock cycles and 14 µJ was used. This penalty may be pessimistic, as it has been shown in [6] that it is possible to perform voltage scaling without halting computation for designs with a small die area, as is the case for embedded processors.
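The reconfiguration cost model above is simple enough to state directly in code; the following sketch reproduces the two formulas (with the worked numbers from the text, C = 10 µF, V1 = 3.3 V, V2 = 1.65 V, I_MAX = 1 A, η = 0.9, it gives roughly 33 µs and 8.17 µJ).

#include <math.h>

/* Reconfiguration time, in seconds. */
double t_rfg(double c, double v1, double v2, double i_max)
{
    return (2.0 * c / i_max) * fabs(v2 - v1);
}

/* Reconfiguration energy, in joules. */
double e_rfg(double c, double v1, double v2, double eta)
{
    return (1.0 - eta) * c * fabs(v2 * v2 - v1 * v1);
}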
7 Efficacy of the PAU
Beyond the overall architecture of the PAU, there are implementation parameters that determine the efficacy of the PAU in a system. This section investigates the effect of the size of the PAU table on the energy savings and performance degradation, and ultimately on the energy-delay product. In a practical implementation, it is unlikely that a fully associative PAU structure will be utilized, due to the hardware overhead involved; a real PAU implementation is more likely to employ a small, set-associative or direct-mapped structure. Eight different direct-mapped PAU sizes of 0, 1, 2, 4, 8, 16, 32, and 64 entries were investigated. In all of these configurations, a VDD reconfiguration penalty of 14 µJ and 1024 clock cycles was used, based on [6]. The overhead involved in performing voltage scaling was discussed in Section 6.
Fig. 7. Effect of PAU Table size on energy savings
7.1 Simulation Environment
The investigation was performed using the Myrmigki simulator, a power-estimating, execution-driven simulator that models a single-issue embedded processor [21]. The modeled architecture has a 5-stage in-order pipeline, a unified 8 KB 4-way set-associative L1 cache with 16-byte blocks, and a miss penalty to main memory of 100 cycles. The power estimation framework has been shown to provide accuracy within 6.5% of the hardware it models. The benchmarks were taken from the SPEC2000 benchmark suite and compiled with GCC [20] version 2.95.3 for the Hitachi SH architecture. The optimization flags used during compilation were the default flags specified for compiling each benchmark in the SPEC suite. Table 1 provides a summary of the benchmarks used and the number of dynamic instructions for which they were simulated.
Table 1. Summary of benchmarks used in experimental analysis

Benchmark      SPEC Suite       # of Instructions Simulated
164.gzip       Integer          200,000,000
175.vpr        Integer          200,000,000
197.parser     Integer          200,000,000
256.bzip2      Integer          200,000,000
176.gcc        Integer          200,000,000
181.mcf        Integer          122,076,300
183.equake     Floating Point   200,000,000
188.ammp       Floating Point   200,000,000
Fig. 8. Effect of PAU Table size on performance degradation
Each of the benchmarks was simulated for 200 million dynamic instructions, unless its execution was shorter, as was the case for 181.mcf. The inputs to the benchmarks were taken from the SPEC reduced simulation inputs [13], except for 176.gcc, where the reference input 166.i was used.
7.2 Effect of PAU Size on Energy Savings
Figure 6 illustrates the effect of the number of PAU entries, in a direct-mapped PAU organization, on the energy consumption for a targeted 1% performance degradation (the actual performance degradation observed is not exactly 1%, as discussed in the next section). The zero-sized PAU table is the baseline case and illustrates the energy consumption without the use of the PAU. The percentage reduction in energy consumption with increasing PAU table size is illustrated in Figure 7. The general trend is that the energy savings for the largest PAU configuration (64 entries) is only slightly better than that of a single-entry PAU. In the case of Gzip, the energy savings with a 64-entry PAU are actually less than those for a single-entry PAU. This non-monotonic increase in savings with increasing PAU size can also be witnessed for Ammp, Vpr, and Equake. The reason for this behavior is that as the number of entries in the PAU table is increased, there is a greater possibility that regions of recurrent cache misses of smaller duration will be allocated entries in the PAU. With an
increase in the number of potentially less beneficial occupants of the PAU table, there is a greater occurrence of the voltage and frequency being lowered due to short runs of repeated stalls. Since there is an overhead involved in changing the operating voltage, such short runs lead to a smaller benefit in energy savings. Adding more entries to the PAU increases the opportunity for voltage scaling to occur, but does not increase the chance that a more beneficial execution region (longer, with a larger proportion of memory stalls) will be captured. The trend in energy consumption in Figure 6 tracks that of the energy savings in Figure 7, and the benchmarks with a larger energy consumption witness a greater savings with the use of the PAU. The effect of an increased number of PAU table entries on the energy savings does not follow the same trend, with some benchmarks (e.g., Mcf) benefiting more from the use of larger PAU sizes than others (e.g., Equake). For the average over the 8 benchmarks, there is a steady increase in the energy savings with an increased number of PAU entries, except for the case of a 2-entry PAU table, where there is a slight degradation relative to the single-entry PAU. The additional energy savings from a 64-entry PAU are, however, not significant, with the 64-entry PAU having an energy saving of 31% versus 27% for the single-entry PAU.
7.3 Effect of PAU Size on Performance Degradation
Figure 8 shows the trend in performance degradation with an increasing number of PAU entries. As the number of PAU entries is increased, the number of times an entry takes control of the operating voltage increases, since there is a general increase in the number of ACTIVE entries. Due to the overhead involved in switching the operating voltage, there is a general increase in performance degradation, which eventually plateaus as the number of PAU entries approaches the number of stall-inducing memory references. The increase in performance degradation is not monotonic; for example, in going from a 4-entry to an 8-entry PAU table for Ammp, there is a decrease in the performance degradation. The reasons are similar to those previously discussed for the trend in the energy savings. As the number of PAU entries is increased, the PAU captures more dynamic execution regions. These lead to allocations of entries in the PAU table which will lead to increased occurrences of voltage scaling, but may or may not be beneficial to the overall energy consumption and performance degradation. It is important to note that, on average, with increasing PAU size, even though there is an increase in the energy savings, there is also an increase in performance degradation. Using larger PAU tables does not provide greater accuracy, but rather only provides greater opportunity to perform resource scaling. In choosing an appropriate PAU size, it is therefore necessary to trade off the energy saved against the hit in performance. This makes it desirable to use the energy-delay product as a metric rather than just performance or energy consumption alone.
Fig. 9. Effect of PAU Table size on Energy-Delay product
7.4 Effect of PAU Size on Energy-Delay Product
To evaluate the efficacy of each configuration, both the energy savings and the performance degradation must be considered together. An appropriate metric is the energy-delay product, with smaller values being better. Figure 9 shows the trend in energy-delay product with increasing PAU size. In the figure, the baseline case (no PAU) is listed as a PAU table size of zero. On average over all the benchmarks, there is little additional decrease in the energy-delay product after the addition of a single PAU entry. Even though there is a seemingly significant increase in performance degradation with increasing PAU table size (Figure 8), the contributions to energy savings far outweigh the performance penalty. One factor that is not accounted for in Figure 9 is the additional hardware cost of larger PAU sizes. If this cost is high, it would preclude using larger PAU table sizes, as it might lead to an increase in the energy-delay product.
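As a concrete illustration of how the metric combines the two effects, the sketch below computes normalized energy-delay products for a few PAU configurations. Only the single-entry figures (27% energy savings, 0.75% slowdown) and the 64-entry savings (31%) are quoted from the paper; the 64-entry slowdown value is an assumption for illustration.

```python
# Energy-delay product (EDP), normalized to the no-PAU baseline.
# Energy factor = 1 - savings; delay factor = 1 + slowdown.
def normalized_edp(energy_savings: float, slowdown: float) -> float:
    return (1.0 - energy_savings) * (1.0 + slowdown)

configs = {
    "no PAU":       (0.00, 0.0000),
    "1-entry PAU":  (0.27, 0.0075),   # averages reported in the text
    "64-entry PAU": (0.31, 0.0100),   # slowdown here is an assumed value
}
for name, (savings, slowdown) in configs.items():
    print(f"{name:12s} EDP = {normalized_edp(savings, slowdown):.3f}")
# A lower EDP indicates a better overall energy/performance trade-off.
```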
8
Related Work
Although hardware architectures aimed at improving application performance have been around for decades, hardware targeted at reducing per-cycle power and overall energy consumption has only recently begun to be proposed. In [22], it is observed that hardware techniques for shutting down unused hardware modules can provide significant (upwards of 20%) energy savings over software techniques, which themselves involve the execution of instructions that consume energy.
Hardware architectures which adapt to application needs by reconfiguring to match applications and save energy have been proposed in [2, 12, 18]. In [12], the authors detail a scheme for adjusting the number of hardware units in use, in this case resource reservation units, in a model of the SimpleScalar architecture, in order to reduce power consumption and overall energy consumption. The authors further propose applying the architecture to performing dynamic voltage scaling. In a manner similar to hardware architectures for performance, and similar also to the solutions proposed in [12], the PAU uses application history to determine opportunities for hardware reconfiguration. Furthermore, while [12] alters the hardware configuration on superscalar processors to save energy, the PAU performs dynamic voltage scaling and clock speed setting, and addresses the spectrum of hardware architectures ranging from single-issue processors to multiple-issue VLIW and superscalar architectures. In [2], ways in a set-associative cache are disabled to reduce the energy dissipation of the cache, with a small degradation in performance. The technique takes advantage of the cache sub-array partitioning that already exists in high-performance cache designs. However, even though the proposal is based on hardware structures, it requires software (the operating system, or applications with the help of a compiler) to perform the selection of the cache ways to be disabled. The proposed mechanism for this interface is the addition of two new instructions to the machine ISA for reading and writing a cache way select register. The Dynamically ResIzable i-cache (DRI i-cache) in [18] employs a combination of two novel techniques, gated-Vdd [17] and a purely hardware structure, to take advantage of the variation in i-cache usage to reduce the leakage power consumption of the instruction cache. The techniques introduced herein are complementary to those previously proposed in [2, 12, 18]. Like [12], one of the aims of the PAU is to reduce the power consumption of the CPU core. Unlike [18], the PAU does not address leakage power consumption, which is increasingly important as supply and threshold voltages are lowered. It should be possible to employ a combination of the PAU and the techniques proposed in [18, 17], either in concert with voltage scaling or replacing it altogether. Structures such as those described in [4, 19, 10] perform dynamic thermal management, reducing power consumption and saving energy while incurring only limited application performance degradation. Thermal management is indirectly achieved by the PAU through its attempts to reduce power consumption. The action of the PAU in this regard is pro-active as opposed to reactive; however, it will not be able to detect situations of "thermal crisis". The calculation of the CPU slowdown factor in Section 5 is based on previous efforts described in [9]. In [9], the slowdown factor determination was for processors in which it is possible to overlap computation and memory accesses, such as multiple-issue superscalar processors. The analysis presented in Section 5 builds upon and extends that of [9] to the general case of processors with and without the ability to overlap computations with memory accesses.
The work in [7] discusses a compiler that identifies program regions where the CPU can be slowed down without resulting in significant performance penalties. A trace-based prototype compiler was implemented as part of the SUIF2 compiler infrastructure and achieved up to 24% energy savings, with performance penalties of less than 2.7%, on the SPECfp95 benchmark suite. The PAU hardware is complementary to compiler techniques such as [9] and [7].
9
Summary and Future Work
This paper presented a hardware structure, the PAU, that detects dynamic execution regions of a program for which there is a mismatch between the number of computations occurring and the number of memory stalls, and, if feasible, lowers the operating voltage and frequency of the processor to obtain a savings in energy with a slight degradation in performance. A direct-mapped configuration of the PAU was investigated for 8 PAU sizes, ranging from a baseline configuration with no PAU to a 64-entry PAU. It was observed that a PAU of even a single entry provides an average of 27% savings in energy with a performance degradation of 0.75%. In general, it was observed that increased energy savings were accompanied by increased performance degradation with increasing PAU size, as more penalties of voltage scaling were incurred. The overall effect of using larger PAUs was, however, positive, with an overall decrease in the energy-delay product as the PAU size was increased. Lacking in the current analysis is an accurate estimate of the hardware cost of the PAU. This is the subject of our current research, and we are investigating various hardware implementations. The usefulness of a PAU in a superscalar architecture is also being investigated, with the implementation of the PAU in the Wattch [5] simulator. This will also permit preliminary analysis of the hardware cost of the PAU table, as it will be possible to model the PAU table in Wattch as an array structure. In addition to investigating the utility of the PAU in superscalar architectures, implementation in Wattch permits the analysis of the performance of the PAU in a machine with a different ISA. In this regard, it is also planned to implement the PAU in the SimplePower simulator [24] for further comparison. The proposed hardware structure only addresses dynamic power dissipation, through the use of voltage scaling. With decreasing feature sizes, leakage power is becoming increasingly important, and it is therefore necessary to investigate the possible impact of any proposal on the leakage power consumption. It should be straightforward to incorporate a PAU into current state-of-the-art low power architectures, given that most of the hardware required by the PAU is currently beginning to appear in some commercial and research microprocessor designs. In the short term, however, it should be possible to implement the PAU in a programmable logic device and use it as an additional board-level device in a system design.
References [1] RDRAM. http://www.rambus.com, 1999. 39 [2] D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. Journal of Instruction Level Parallelism, 2(2000):1–6, May 2000. 39, 49 [3] J. R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491– 542, Oct. 1987. 37 [4] D. Brooks and M. Martonosi. Dynamic Thermal Management for HighPerformance Microprocessors. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture, January 2001. 49 [5] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations. In 27th Annual International Symposium on Computer Architecture, pages 83–94, June 2000. 50 [6] T. D. Burd and R. W. Brodersen. Design issues for dynamic voltage scaling. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design, ISLPED’00, pages 9–14, July 2000. 44 [7] C.-H. Hsu and U. Kremer. Compiler-Directed Dynamic Voltage Scaling Based on Program Regions. Technical Report DCS-TR-461, Department of Computer Science, Rutgers University, November 2001. 34, 50 [8] C.-H. Hsu, U. Kremer, and M. Hsiao. Compiler-Directed Dynamic Frequency and Voltage Scaling. In Workshop on Power-Aware Computer Systems, ASPLOS-IX, November 2000. 34 [9] C.-H. Hsu, U. Kremer, and M. Hsiao. Compiler-Directed Dynamic Frequency/Voltage Scheduling for Energy Reduction in Microprocessors. In Proceedings of the 2001 International Symposium on Low Power Electronics and Design, ISLPED’01, pages 275–278, August 2001. 34, 49, 50 [10] M. Huang, J. Renau, S.-M. Yoo, and J. Torrellas. A Framework for Dynamic Energy Efficiency and Temperature Management. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, pages 202– 213, 2000. 49 [11] Intel Corporation. Intel XScale Microarchitecture Technical Summary. Technical report, 2001. 43 [12] A. Iyer and D. Marculescu. Power aware microarchitecture resource scaling. In Proceedings of 2000 Design Automation and Test in Europe, pages 190–196, 2001. 49 [13] A. KleinOsowski, J. Flynn, N. Meares, and D. J. Lilja. Adapting the SPEC2000 Benchmark Suite for Simulation-Based Computer Architecture Research. In Proceedings of the Workshop on Workload Characterization, International Conference on Computer Design, September 2000. 46 [14] D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. J. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth Annual ACM Symposium on the Principles of Programming Languages, Jan. 1981. 37 [15] A. R. Lebeck, X. Fan, H. Zeng, and C. Ellis. Power Aware Page Allocation. In Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 105–116, November 2000. 39 [16] T. Pering, T. Burd, and R. Brodersen. Voltage scheduling in the lparm microprocessor system. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design, ISLPED’00, pages 96–101, July 2000. 43
[17] M. D. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. GatedVdd: A circuit technique to reduce leakage in cache memories. In ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED’00)., pages 90–95, July 2000. 49 [18] M. D. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Reducing leakage in a high-performance deep-submicron instruction cache . IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 9(1):77 – 89, February 2001. 39, 49 [19] H. Sanchez, B. Kuttanna, T. Olson, M. Alexander, G. Gerosa, R. Philip, and J. Alvarez. Thermal Management System for High Performance PowerPC Microprocessors. In Proceedings IEEE Compcon, page 325, February 1997. 49 [20] R. M. Stallman. Using and Porting GNU CC, 1995. 45 [21] P. Stanley-Marbell and M. Hsiao. Fast, flexible, cycle-accurate energy estimation. In ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED’01., pages 141–146, August 2001. 45 [22] V. Tiwari and M. Lee. Power analysis of a 32-bit embedded microcontroller. In Proceedings, Asia and south Pacific DAC, pages (CD–ROM), August 1995. 48 [23] F. Yao, A. Demers, and S. Shenker. A scheduling model for reduced cpu energy. In Proceedings IEEE Symposium on Foundations of Computer Science, pages 374–382, October 1995. 41 [24] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool. In Proceedings of the 37th Conference on Design Automation, pages 340–345, 2000. 50
Multi-Processor Computer System Having Low Power Consumption
C. Michael Olsen and L. Alex Morrow
IBM Research Division, P.O. Box 218, Yorktown Heights, NY 10598, USA
{cmolsen, alex_morrow}@us.ibm.com
Abstract. We propose to improve battery life in pervasive devices by using multiple processors that trade off computing capacity for improved energy-per-cycle (EPC) efficiency. A separate scheduler circuit intercepts interrupts and schedules execution to minimize overall energy consumption. To facilitate this operation, software tasks are compiled and profiled for execution on multiple processors so that each task's computing capacity requirements may be evaluated realistically against system requirements and task response times. We propose a simple model for estimating the EPC of each processor. To optimize energy consumption, processors are designed to satisfy a particular usage model. Thus, the particular task suite that is anticipated to run on the device, in conjunction with user expectations of software reaction times, governs the design point of each processor. We show that the battery life of a wearable device may be extended by a factor of 3-18, depending on user activity.
1 Introduction
A major obstacle for the success of certain types of battery powered Pervasive Devices (PvD) is the battery life. Depending on the device and its usage model, the battery may last anywhere from hours to months. An important mode of device operation is the user idling mode. In this mode, the device is always "on" but is not being used by the user; "on" here means that the device is instantly responsive. Wearable devices fall into this category because they form an extension of the user and therefore may be expected to be always instantly available. Secondly, the lower bound of power consumption of the device may be limited by its need to keep time and perform periodic tasks, such as polling sensors and evaluating data, regardless of user activity. In other words, the main contributor to the accumulated "on" battery drain is the user idling mode rather than the user active mode, in which the user is actively using the device. Most PDAs follow this usage model. A user will turn the PDA on, press one or two buttons, and make a selection from the screen. The user then reads the information and either leaves the device idling or turns it off. In either case, the time the PDA spent idling is generally significantly larger than the time it spent executing instructions associated with the button and screen selections. PvDs with advanced power management capabilities, such as the Compaq Itsy [1] and the IBM Linux Watch [2], have several stages of power saving modes. In the most efficient "on" low power state, the Itsy may last for 215 hours on its 610mAh battery
while the Linux Watch may last for 64 hours on its 60mAh battery. However, if the Linux Watch, for example, had to perform small periodic tasks more frequently than once per second, it would largely be prevented from taking advantage of its most efficient low power state, and battery life would drop to 8 hours. A battery lifetime of this magnitude, or even a couple of days with a larger battery, is not satisfactory. Users may not be able to recharge or replace batteries at such short intervals. Further, users may be annoyed at the frequent charging requirements, especially if they feel they are not even using the device. Although the battery drains more quickly when the user does use the device, this is more reasonable since the user can develop a sense of how much a given action costs and make usage decisions accordingly. Another lesson we learned from the Linux Watch was that even if keeping and displaying time was the only task expected of it, the battery life of 64 hours still pales in comparison with commercial wrist watches. Although these devices also use processor chips, they can maintain and display time for several years on a single watch battery. This two-order of magnitude discrepancy in battery life was a primary motivation for this investigation. It led us to think there might be great benefits in off-loading simple repetitive tasks, such as time keeping and sensor polling, from the high-performance processor to, perhaps, a low-speed 8-bit processor with a small cache and a few necessary blocks in the I/O ring. The idea is that the low-speed processor would be specifically designed to execute small simple tasks in such a way as to consume much less active energy as compared to executing equal tasks on the high-performance processor. In other words, there must be a significant differential in energy-per-cycle (EPC) between low end and high end processors. Several means exist to widen this EPC differential, for example, by changing the architecture of the low end processor so that fewer transistors are involved in each cycle. Voltage scaling, transistor device scaling, and the switching scheme known as adiabatic switching are circuit techniques that improve EPC [3]. The concept of using more than one processor in a computer for power management is not new. The PC AT used a separate, small, battery-powered microprocessor to maintain time and date when the PC is powered off. The batteries for this function were often soldered in place in early PC’s, so it was clearly designed for very low current drain over a long period of time. Further, a number of mobile phone companies have filed patents on computer architectures which utilize multiple computational devices [4,5]. The common thread among these systems is that the systems represent static configurations with prescribed functionality. On the other hand, we are mainly interested in developing a power efficient dynamic, or general purpose, computer system with a functionality like the Palm Pilot, Compaq Itsy and Linux Watch. In other words, a computer platform for which a programmer with relative ease can write new application and driver code and in which said code is executed in the most power efficient manner. The multi-processor system we are going to propose can not readily be developed since many of the software and hardware components of the system are presently non-existing or require significant modification. In other words, it would take a
considerable effort to properly research and mature such a system. Nevertheless, we still believe that the system has merits from the perspective of Makimoto’s Figure of Merit formula [6], Figure of Merit = (Intelligence) / ( (Size)(Cost)(Power) ), which is a qualitative measure of the value of a nomadic device as perceived by the user. Even though the formula is crude, it does suggest that it may be acceptable to trade off Size and Cost for improved Power and Functionality/Intelligence. The paper is organized as follows. In Chapter 2, the multi-processor system is presented and we walk through a usage example. Next, in Chapter 3 and 4 we present the hypothetical target device to perform energy analysis on and the processor energy model for calculating EPC for each processor. Chapter 5 presents the task suite, user model and discusses the analytical results. Chapter 6 takes a broader look at the whole system. Chapter 7 is a summary.
2 A Low Power Multi-Processor Computer System
Architecture. In this and the next chapter we shall propose a low power multi-processor computer system. It is a first attempt to piece the whole system together in enough detail to facilitate some minimal analysis of the power savings potential. We wish to give readers enough appreciation for how the system may be connected and operated so they can improve or suggest alternatives to the system.
Fig. 1. Multi-processor computer system for power conscious task scheduling. (The diagram depicts P1, the slowest and most power efficient processor, and P2, the fastest and least power efficient, together with the governor GOV, the memory MEM, the I/O space, the SIGGP signal lines, BUSMEM, BUSI/O and the interrupt lines, as described in the text below.)
Figure 1 shows an example of a multi-processor system. It utilizes 2 processors, P1 and P2, and a governor circuit, GOV. MEM is the memory space and I/O is the I/O
space. SIGGP, BUSMEM and BUSI/O are the governor-processor signal lines, the memory bus and the I/O bus, respectively. P1 and P2 execute tasks. P1 is the most power efficient processor but has little computing performance. P2 is the least power efficient processor but has very high computing performance. All interrupts from I/O space and from the two processors are brought to GOV. GOV intercepts the interrupt signals and determines which of the 2 processors should handle the interrupt. Issues such as interrupt ownership, and which processor may execute the task associated with the interrupt in the most power efficient manner, are considered by GOV and are discussed next.
System Infrastructure. In the following we discuss some of the dynamic aspects of the system operation. The discussion is generic and is not limited to a 2-processor system. All static issues, such as initial setup, software loading, table establishment, and so forth, are not discussed. The discussion will shed some light on how the whole computer system may work together. At the end, we give an example of how a calculator application is launched and operated by using a touchscreen. The following assumptions about the system infrastructure are made:
- Processors execute tasks simultaneously in parallel.
- In general the processors do not share code nor data space, and code is never moved from one processor's code space to another processor's code space. The only memory spaces shared among the processors are device buffer areas, the before-mentioned tables in GOV, and space for passing parameters.
- Interrupt handlers and tasks have been individually profiled so their computing capacity requirements are known for each of the processors they may execute on.
- Four system tables are used to coordinate energy efficient scheduling of tasks (see Figure 2): the interrupt vector table (IVT), the peripheral device attribute table (DAT), the process task attribute table (TAT), and the processor capacity table (PCT). As shown in Figure 2, the tables are local to GOV. GOV can access the tables without stealing bus cycles from BUSMEM, and tables may be updated dynamically by the processors through BUSMEM. The IVT contains dynamic pointers to DATs and TATs so GOV can access the proper table upon reception of an interrupt. DAT and TAT structures are identical. The parameters are shown in Figure 2, most of which are self-explanatory. POWNER is the processor ID of the processor that currently owns the task or handler. NPH is the number of processors which may potentially host (execute) the task or handler. {P, CPS, ADDR}TID,i is the {processor ID, demand to processor bandwidth, code entry address} of the i'th most power efficient processor. Note that processors are listed in order of descending energy efficiency.
- Processors dynamically update the PCT on each launch or termination of a task or interrupt handler to reflect the processor's current instantaneous spare computing capacity. GOV needs this information to properly schedule the execution of tasks and handlers.
- An OS on one processor may utilize the governor to schedule a process task for execution on an OS on another processor. A file system may be shared among OSs.
- Each OS/processor utilizes a local timer interrupt mechanism. The processors share a common time base counter for agreeing on instantaneous time.
- The OS utilizes a work dependent timing scheme [2] in which the local hardware timer is dynamically programmed to interrupt the processor only when there is work to be done. Physical timer ticks that do not result in work are skipped, enabling the processor to save power and to shut down more effectively.
Fig. 2. System tables for energy efficient task/handler scheduling. (The diagram shows the IVT, DAT, TAT and PCT tables held locally in GOV, alongside MEM; the DAT/TAT parameters are reproduced below.)

Parameter      Description
TID            Task identification number.
POWNER         ID of current owner processor (if any).
NPH            Number of potential host processors.
PTID,1         ID of most energy efficient processor.
CPSTID,1       Required cycles/sec to sustain task.
ADDRTID,1      Pointer to task code.
...            ...
PTID,NPH       ID of least energy efficient processor.
CPSTID,NPH     Required cycles/sec to sustain task.
ADDRTID,NPH    Pointer to task code.
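To make the table layout concrete, here is a minimal sketch of how the DAT/TAT and PCT entries might be represented in software. The field names follow Figure 2, but the types, example values and the overall representation are illustrative assumptions rather than a specification from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HostOption:
    processor_id: int   # P_TID,i    - candidate host processor, most efficient first
    cps: float          # CPS_TID,i  - cycles/sec needed to sustain the task on it
    code_addr: int      # ADDR_TID,i - entry address of the task's code on it

@dataclass
class TaskAttributes:              # one DAT or TAT entry
    tid: int                       # task identification number
    p_owner: Optional[int]         # POWNER: processor currently owning the task/handler
    hosts: List[HostOption] = field(default_factory=list)   # NPH = len(hosts)

# PCT: instantaneous spare capacity (cycles/sec) per processor, updated by the
# processors on every task/handler launch or termination (example values assumed).
pct = {1: 25_000.0, 2: 6_000_000.0}
```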
Usage Example. The following example demonstrates how the whole system could work together. Assume a process called user interface (UI) is running on the processor PBIG. Assume that UI has opened the touchscreen device, and the system has therefore updated POWNER in the touchscreen DAT located in GOV so that POWNER=PBIG. Now, the user uses the touchscreen to select the icon representing a calculator application. The touch interrupt is detected by GOV which uses the IVT to find the associated attribute table. GOV then checks in the table if/which processor currently owns the touchscreen, finds POWNER in the processor list, puts the touchscreen interrupt handler address into a predefined memory slot, and finally signals/interrupts PBIG which in turn jumps to the interrupt handler address. UI may now launch the calculator application, which is yet another process task. But let’s assume that the launcher software first peeks into the calculator applications TAT and discovers that it requires very little computing capacity to execute. In this case, the launcher decides not to launch the calculator on PBIG but rather passes the calculator request to GOV for execution on a more power efficient processor. PBIG now updates
its own interrupt entry in the IVT in GOV with the address of the calculator applications TAT and clears the POWNER field in the attribute table. Then, PBIG interrupts GOV. Upon reception of the interrupt, GOV, via the IVT, finds the associated attribute table and determines that it is not owned by any processor. GOV will then schedule the calculator application on the most power efficient processor, let’s call it PLITTLE, on which another UI process is also running. PLITTLE now changes the owner to POWNER=PLITTLE in the calculators TAT and then launches the calculator application (say from FLASH). The next time PLITTLE receives a calculator interrupt, it’s probably due to the user entering data. So PLITTLE must determine the proper address to jump to in the calculator application upon future calculator inputs. PLITTLE then updates the address in the calculator TAT accordingly. Since it is now likely that the next screen interrupt will be associated with the calculator application, PLITTLE further opens a touchscreen driver and updates the driver address and POWNER in the touchscreen DAT accordingly. In this fashion, the next touchscreen interrupted is routed directly to PLITTLE instead of the original owner PBIG which can then be put to sleep for a longer period. Though the shift in ownership of the touchscreen interrupt should only be done if PLITTLE has the spare processor bandwidth as specified in the touchscreen DAT. It is also important that the touch handler and/or the UI manager on PLITTLE can determine if the (x,y)-coordinates belong to the calculator. If the coordinates do not belong to the calculator application, it is equally important that PLITTLE can determine to which application the coordinates do belong, if any, so that it can reflect the interrupt to the proper processor via GOV (assuming the application is not already running on PLITTLE). PLITTLE would do this by putting the (x,y) data in a shared buffer somewhere, update the jump address in PLITTLE‘s IVT entry to point to the applications TAT and then finally interrupting GOV. PLITTLE should also be able to launch a new application in the same fashion that PBIG originally launched the calculator application. In order to save power effectively, it’s important that the UI on PLITTLE itself is capable of updating the screen whenever the calculator is being used. This has two consequences. First, it requires the use of an external display controller. Secondly, it requires the ability of several UI managers to coordinate access and share information about screen content and contexts. This may be accomplished through the shared file or buffer system.
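The routing decision illustrated by the calculator example can be summarized in a few lines: on an interrupt, the governor consults the DAT/TAT-style record and hands the work to the owning processor if there is one, otherwise to the most energy-efficient candidate that still has spare capacity in the PCT. The sketch below is an illustrative reconstruction of that behavior, not code from the paper; the record layout, processor IDs and capacity numbers are assumed.

```python
def route_interrupt(entry: dict, pct: dict) -> int:
    """entry: a DAT/TAT-style record; pct: spare capacity (cycles/s) per processor."""
    if entry.get("p_owner") is not None:
        return entry["p_owner"]                  # the owning processor services it directly
    for host in entry["hosts"]:                  # hosts ordered from most to least energy efficient
        if pct.get(host["pid"], 0.0) >= host["cps"]:
            return host["pid"]                   # most efficient processor with spare capacity
    return entry["hosts"][-1]["pid"]             # fall back to the least efficient (fastest) host

# Example: the calculator task can be hosted on P_LITTLE (id 1) or P_BIG (id 2).
calc = {"p_owner": None,
        "hosts": [{"pid": 1, "cps": 20_000.0}, {"pid": 2, "cps": 20_000.0}]}
print(route_interrupt(calc, pct={1: 25_000.0, 2: 6_000_000.0}))   # -> 1 (P_LITTLE)
```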
3 The Target Device
In this section, we introduce the SensorWatch, a small wearable device with several sensors intended to help it infer its wearer's condition. For example, our hypothetical SensorWatch is able to measure its wearer's body temperature and pulse. When the
user takes the SensorWatch off, putting it on his bedside table, the device infers, from the lack of a pulse and temperature, that the user is not wearing the device. This is used to enter a power saving mode, disabling interfaces and tasks which are not required when the watch is not worn. On the other hand, it maintains the integrity of the watch, keeping the time with the most power efficient processor. Note that, in this case, time is not the only task that must be run in detached mode, since the watch will want to sample sensors periodically to determine when the wearer puts the watch on again. For simplicity in exposition and analysis, we restrict the analysis to static scheduling of tasks. In other words, a task's characteristics, such as the processor on which it should run, are established at task creation time and do not vary. A task, whenever it is invoked, will always run on the same processor.
Assumptions. The SensorWatch has time and sensor monitoring functions which must take place continuously, and at the lowest possible power. It also has on-demand functions requested in various ways by the user, which have response time requirements that may make it necessary to run them on higher powered processors. SensorWatch is a hypothetical device. For clarity we ignore other power consuming devices, such as sensors, memory, network interface and display. It is assumed that:
1. We have a wearable device with multiple processors.
2. The device has multiple sensors it must monitor at the lowest possible power.
3. Wearers create a predictable mix of events.
4. Each processor is maximally duty cycled.
5. The CPU cycles required to enter and exit SLEEP mode are negligible, relative to the task CPU cycles.
6. The CPU cycles required by the first level interrupt handlers are negligible, relative to the task CPU cycles.
7. The power consumed by GOV is negligible.
8. Each processor is able to accommodate the worst case combination of tasks which run concurrently under a multi-tasking operating system on the processor.
We consider SensorWatches with one, two and three processors. We first describe our hypothetical task characteristics and review certain task scheduling issues. Next we present a processor energy model and define energy related task parameters. We then give a suite of tasks to be considered for analysis. Finally, we give the results. Task Characteristics. We characterize tasks as either CPU-bound or I/O-bound. CPU-bound tasks run to completion as quickly as their CPU can process them, never entering SLEEP mode. I/O-bound tasks run until they must issue an I/O request through some interface, which they then wait for by putting the processor in SLEEP mode. When the I/O completes, the SLEEP mode is interrupted and the task resumes execution. Task events arrive in two ways: either randomly or predictably. A randomly scheduled task is characterized by how many times per day, NIPD, the user, or some other random-like process, triggers the task. A predictably scheduled task is characterized by an interrupt frequency, F=1/T, where T is the scheduling interval. Thus, we can now define the following 4 task types.
Type A: CPU Bound, randomly scheduled.
Type B: CPU Bound, predictably scheduled.
Type C: I/O Bound, randomly scheduled.
Type D: I/O Bound, predictably scheduled.
Scheduling. As a first approximation to the scheduling algorithm outlined earlier, we are going to assume a static distribution of tasks. Thus, for any interrupt received by GOV, it always results in the same processor selection for task execution.
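A minimal way to encode these task types and the static-scheduling approximation is sketched below. The enum values mirror the four types defined above; the interrupt-to-processor map follows the low/medium/high suite split used in the 3-processor analysis later in the paper, but the specific keys and the code structure are only illustrative.

```python
from enum import Enum

class TaskType(Enum):
    A = "CPU bound, randomly scheduled"
    B = "CPU bound, predictably scheduled"
    C = "I/O bound, randomly scheduled"
    D = "I/O bound, predictably scheduled"

# Static distribution: every interrupt source is bound to one processor up front,
# so GOV always makes the same selection for a given interrupt.
STATIC_ASSIGNMENT = {
    "TimeDate": 1,        # low-performance suite    -> P1
    "UpdateDisplay": 2,   # medium-performance suite -> P2
    "VoiceCommand": 3,    # high-performance suite   -> P3
}

def select_processor(interrupt_source: str) -> int:
    return STATIC_ASSIGNMENT[interrupt_source]
```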
4 Processor Energy Model
We will assume a simple energy model for the processors. It is assumed that a processor dissipates the same amount of energy in each cycle. This enables us to represent a processor's energy efficiency by its Energy Per clock Cycle, EPC. The energy model is based on the assumption that the energy efficiency improves as the processor clock frequency decreases. This may be achieved by voltage scaling, and by optimizing transistor design parameters [3]. Further, the overall size of the chip may be reduced by making the caches smaller, by shrinking register and bus widths, and by ensuring that the tasks that run on a low-end processor limit themselves to the native register widths. These constraints would further reduce EPC by lowering the number of switching elements per operation and by reducing wiring capacitance. Finally, we mention the technique of adiabatic switching [3]. The notion of adiabatic switching is to charge up the switching capacitor slowly by ramping up the supply voltage in synchronization with the change in output bit value, thus effectively minimizing heat loss in the resistive path. In conventional CMOS switching technology, a transistor state is changed by instantaneously applying or removing the supply voltage, Vdd, across the RC element. Adiabatic switching promises EPC ∝ f_clk; in other words, the slower the processor is running, the better the energy efficiency. However, implementing adiabatic switching requires additional control circuitry, which increases capacitance and complexity. Thus, some of the advantage is lost. Adiabatic switching circuits appear to be most promising in low-speed circuits with clocking frequencies smaller than 10 MHz or so [9,10]. By lumping together all the techniques mentioned above, and being somewhat conservative about the net result, we assume that a processor's energy efficiency may be characterized by the equation

EPC = K √f_clk ,    (1)

where K is a proportionality constant.
Task Related Energy Parameters. Next, we define the following parameters:

NC: Number of Cycles to complete task.
CPS: Cycles Per Second [Hz] required to complete task in time, T.
NIPD: Number of Interrupts Per Day.

For periodic tasks, i.e. type B, NIPD may be calculated as NIPD_i = 86,400 s / T_i, where T_i is the maximum duration the task may take to complete, which in case of a type B task is identical to the interrupt interval, and 86,400 is the number of seconds in a day. For type A and type C tasks, NIPD is based on the User Activity Level, UAL, or how frequently the user uses his PvD. When discussing specific tasks later, we are going to assign typical values of NIPD to these tasks and then consider what happens if the user is either a more active or less active user. The Number of Cycles Per Day, NCPD, for task i on processor j may be calculated as

NCPD_{i,j} = NC_{i,j} * NIPD_i .    (2)

The Energy Per Day, EPD, for task i on processor j may be calculated as

EPD_{i,j} = NCPD_{i,j} * EPC_j .    (3)

The Energy Per Day, EPD, for processor j may be calculated as

EPD_j = Σ_{i=1}^{NT_j} EPD_{i,j} ,    (4)

where NT_j is the number of tasks on processor j. The total Energy Per Day, EPD_TOT, for all NP processors may be calculated as

EPD_TOT = Σ_{j=1}^{NP} EPD_j .    (5)
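The parameter chain of Eqs. 2-5 maps directly onto a few small functions; the sketch below is a straightforward transcription of those equations, and the check at the end reproduces the TimeDate entry of Table 1 (function names are our own).

```python
SECONDS_PER_DAY = 86_400

def nipd_periodic(period_s: float) -> float:
    """NIPD for a type B task with interrupt interval T."""
    return SECONDS_PER_DAY / period_s

def ncpd(nc_cycles: float, nipd: float) -> float:
    """Eq. 2: cycles per day for one task on one processor."""
    return nc_cycles * nipd

def epd_task(ncpd_cycles: float, epc_joules: float) -> float:
    """Eq. 3: energy per day for one task on processor j."""
    return ncpd_cycles * epc_joules

def epd_processor(task_epds: list) -> float:
    """Eq. 4: sum over the NT_j tasks hosted on processor j."""
    return sum(task_epds)

def epd_total(processor_epds: list) -> float:
    """Eq. 5: sum over all NP processors."""
    return sum(processor_epds)

# The TimeDate task (T = 1 s, NC = 500) gives NCPD = 43.2e6 cycles/day,
# matching the 43 x 10^6 entry in Table 1.
assert abs(ncpd(500, nipd_periodic(1.0)) - 43.2e6) < 1e-6
```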
5 Task Suite
We now create a hypothetical mix of tasks, grouped into three categories depending on their requirements for processor performance. The task mix and the associated computational characteristics are listed in Tables 1-3. The names of the tasks should be self-explanatory. The second column accounts for the basic demands on response time (or periodicity), T, the number of cycles, NC, to run to completion (if applicable), and the number of times per day the task is toggled (user dependent). The third column contains the demand on processor bandwidth required to sustain the task. At the very bottom (in bold font) is the total demand on processor bandwidth, CPS_TOT, assuming the worst case mix of tasks executing simultaneously. (Note, some tasks may be mutually exclusive.) The fourth, and last, column contains the total number of cycles per day for each task. At the very bottom (in bold font) is the accumulated total number of cycles per day, NCPD_TOT, for the particular task suite.
The low-performance tasks (shown in Table 1) are all CPU bound and periodic (type B) in that they are timer interrupted tasks which in turn poll a sensor interface (except TimeDate, which just updates time and date), update some variables and then determine if the new values of the variables have exceeded a threshold value, or if the evolution of the values signifies some interesting change. The purpose of the low-performance tasks is largely to determine whether to initiate/enable or disable other tasks and hardware components for the sake of power management and to infer about the state of the user and the user's surroundings. All tasks may run concurrently.

Table 1. Characteristics of low-performance tasks (NTlow=8).

Task Name      Basic Demands and Task Properties   CPS [Hz]   NCPD [10^6]
TimeDate       T=1s, NC=500                        500        43
UserTemp       T=60s, NC=500                       8          0.7
UserPulse      T=10s, NC=500                       50         4.3
UserAudio      T=100ms, NC=500                     5 k        433
AmbTemp        T=1s, NC=500                        500        43
AmbHumid       T=100ms, NC=500                     5 k        433
DeviceOrient   T=50ms, NC=500                      10 k       864
DeviceAccel    T=50ms, NC=500                      10 k       864
Total                                              31 k       2,685

Table 2. Characteristics of medium-performance tasks (NTmed=5).

Task Name      Basic Demands and Task Properties              CPS [Hz]   NCPD [10^6]
EvaluateWorld  NIPD=250/day, T=50ms, NC=10000                 0.2 M      2.5
UpdateDisplay  NIPD=1000/day, T=50ms, NC=50000                1 M        50
FetchDbRec     NIPD=250/day, T=50ms, NC=10000                 0.2 M      2.5
UINavigation   NIPD=500/day, T=50ms, NC=5000                  0.1 M      2.5
SyncDb         NIPD=10/day, T=10ms/rec x 1000 rec = 10s,      0.1 M      10
               NC=1000 cycles/rec x 1000 rec = 10^6
Total                                                         1.5 M      67.5

Table 3. Characteristics of high-performance tasks (NThigh=2).

Task Name      Basic Demands and Task Properties              CPS [Hz]   NCPD [10^6]
VoiceCommand   NIPD=100/day, T=1s, NC=7.5x10^6                7.5 M      750
AudioMemo      NIPD=20/day, T=5s, NC=12.5x10^6                2.5 M      250
Total                                                         7.5 M      1,000
The middle-performance tasks (shown in Table 2) have user-centric real-time requirements: acceptable behavior for them is governed by reaction time requirements based on user experience considerations. EvaluateWorld, UpdateDisplay, FetchDbRec and UINavigation are of type A since they have to run to
completion within a time acceptable to a user. On the other hand, SyncDb (synchronize database) is a task that may incorporate network resources. It will typically send out requests and information and then sit and wait for a reply of sorts. Thus, it is of type C. The reply, once it arrives, may not be continuous but rather arrive in multiple chunks. This task may put the processor into the SLEEP state while waiting for the network interface to generate an interrupt. All tasks may run concurrently. The high-performance tasks (shown in Table 3) are similar to most of the medium-performance tasks in that they are randomly interrupted and, once interrupted, run as fast as they need to sustain their function. Both tasks are of type A. The two tasks are mutually exclusive.
User Activity Level. As mentioned earlier, the User Activity Level, UAL, will impact the total energy performance, thus UAL must be included in the analysis. If UAL=1, then the user is assumed to use the system exactly as described above. For example, he would issue 100 voice commands per day, where each command lasts 1 sec, and he would synchronize his databases 10 times per day. Now, if the user is twice as active, i.e. UAL=2, he will issue 200 voice commands per day and synchronize his databases 20 times per day. More generally, we'll assume that the user's Activity Level only affects middle- and high-performance tasks.
Processor Speeds and Energy Efficiencies. First, we need to make an assumption about energy efficiency at some given clock frequency. So let's assume a good mobile processor, such as the StrongARM SA-1110, as our reference candidate. This processor dissipates 240 mW at 133 MHz. Thus, we can calculate the Energy Per Cycle for this reference point, EPC(f_clk = 133 MHz) = 0.24 W / 133 MHz = 1.8 nJ, from which the proportionality constant in Eq. 1 can be calculated, and thus Equation 1 now becomes

EPC = 1.8 nJ * √(f_clk / 133 MHz) = 156 fJ·s^(1/2) * √f_clk .    (6)

As mentioned earlier, we assume that each processor is designed to support exactly the worst case combination of tasks that may conceivably run on each processor. The requirement on processor j's clock frequency, f_clk,j, is found by appropriately summing CPS_TOT from Tables 1-3 according to how many processors are considered and on which processor each task suite is executing. In turn, we can then calculate EPC_j. The results follow.

1-processor system: All tasks run on P1.
  P1: f_clk,1 = 9.131 MHz => EPC1 = 0.471 nJ
2-processor system: Low-performance tasks run on P1 and other tasks on P2.
  P1: f_clk,1 = 0.031 MHz => EPC1 = 0.028 nJ
  P2: f_clk,2 = 9.1 MHz => EPC2 = 0.471 nJ
3-processor system: Low-, medium- and high-performance tasks run on P1, P2 and P3, respectively.
  P1: f_clk,1 = 0.031 MHz => EPC1 = 0.028 nJ
  P2: f_clk,2 = 1.6 MHz => EPC2 = 0.197 nJ
  P3: f_clk,3 = 7.5 MHz => EPC3 = 0.427 nJ
Results: Assuming the user activity level varies from very inactive (UAL=0.001) to very active (UAL=10), we calculated the energy performance, EPD_TOT, for 1-, 2- and 3-processor systems. The results are shown in Figure 3. When comparing the 2- and 3-processor cases with the 1-processor case, it may be seen that the processor energy consumption is reduced by a factor of 18 when the user is very inactive (UAL=0.001), by a factor of 3 when the user activity is average (UAL=1), and by a factor of 1.25-1.4 when the user is very active. The reason why the 3-processor system does not offer much improvement over the 2-processor system is that the task suite that runs on P2 does not significantly contribute to the total amount of task cycles consumed by the entire system.
Fig. 3. Total Energy Per Day versus User Activity Level for 1-, 2- and 3-processor systems. UAL only affects medium- and high-performance tasks. (The plot shows EPD_TOT [Joules], over roughly 0.01 to 10 J, versus UAL from 0.001 to 10, with one curve each for NP=1, NP=2 and NP=3.)
Keeping the processor design constant in the 3 cases, and considering a usage model in which the user does not use the high-performance task suite at all (say, if he uses the device in a noisy environment) and in which he toggles the medium-performance tasks at a five times higher rate, the 3-processor case would clearly improve the efficiency over the 2-processor case in the UAL=0.1-10 range, where the energy consumption is dominated by the medium- and high-speed processors.
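The curves of Figure 3 can be approximated from the NCPD totals of Tables 1-3 and the EPC values above: the low-performance suite contributes a fixed number of cycles per day, while the medium- and high-performance totals scale with UAL. The sketch below reproduces the roughly 18x (very inactive) and 3x (average) reductions quoted earlier; it is a simplified reconstruction, not the authors' exact computation.

```python
# Cycles per day per task suite at UAL = 1 (NCPD totals from Tables 1-3).
NCPD_LOW, NCPD_MED, NCPD_HIGH = 2685e6, 67.5e6, 1000e6
# Energy per cycle (J) of the processor hosting each suite in each configuration.
CONFIGS = {
    1: {"low": 0.471e-9, "med": 0.471e-9, "high": 0.471e-9},
    2: {"low": 0.028e-9, "med": 0.471e-9, "high": 0.471e-9},
    3: {"low": 0.028e-9, "med": 0.197e-9, "high": 0.427e-9},
}

def epd_tot(num_procs: int, ual: float) -> float:
    epc = CONFIGS[num_procs]
    return (NCPD_LOW * epc["low"]
            + ual * NCPD_MED * epc["med"]
            + ual * NCPD_HIGH * epc["high"])

for ual in (0.001, 1.0):
    ratio = epd_tot(1, ual) / epd_tot(3, ual)
    print(f"UAL={ual}: EPD(NP=1)/EPD(NP=3) = {ratio:.1f}")
# Gives ratios of roughly 17 and 3.4, in line with the factor 18 and factor 3
# improvements reported for very inactive and average users.
```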
6 The Big Picture: A Discussion Hardware Systems Perspective. The above results look very promising. But there are several factors that currently make it difficult, if not impossible, to reap the benefits of a low-power multi-processor system. Most importantly, energy-efficient low-speed (say <5 MHz) processors do not exist and there is no ongoing commercial effort to fabricate them. Furthermore, the multi-processor system is a part of a bigger system which may have several peripheral components and several analog interfaces. Current state-of-the-art peripheral components and interfaces such as memory, network interface, sensors, etc and their associated power management operation may limit the minimally achievable energy consumption by the whole system. At least for the low-speed system considered above. To put the minimum EPDTOT = 75 mJ in Fig. 3 in perspective, consider that a state-of-the-art 1.8V 2MB SRAM [7] consumes around 2 uW in its Standby mode. Keeping 2MB data alive for the whole day would then consume 173 mJ = 2.3EPDTOT . Next, let’s put the EPC1 = 0.028 nJ of the most power efficient processor in the 2-processor system in some perspective. Consider that the same SRAM consumes around 45 mW during a 70 ns read operation, then a single read operation alone would consume around 3 nJ = 113EPC1 . With respect to retrieving code, this is best performed out of FLASH (at low speeds). A read operation in a state-of-the-art 1.8V 8MB FLASH [8] dissipates around 0.9 nJ = 31EPC1 . From the above perspective it may appear there is not much power savings to be gained with the multi-processor system. After all, a useful computer system needs to read/write memory and keep the data alive. However, it turns out the same techniques that may be used to reduce the EPC in processors, in particular transistor device scaling, voltage scaling and adiabatic switching, are also applicable to the design of low-power memory architectures [3,11,12] as well as for driving output pads and busses [3]. What really may limit power consumption in the end in low-speed systems, is not the central components (e.g. memory and processors) but rather the analog interfaces, such as sensor, visual, audio and wireless interfaces. With respect to visual, audio and wireless interfaces, these may be strategically power managed by trading off usability/performance for reduced power consumption. How effectively this can be done depends on how frequently and for how long they are toggled, and on their power consumption which will depend on their position relative to the receiver (i.e. the eye, ear and wireless receiver). Their relative impact on battery life will increase as the energy efficiency of the central components decrease. With respect to sensor interfaces, their power consumption, size and complexity is quite dependent on the type of sensor [13]. Furthermore, sensors have to be strategically positioned to perform optimally. Obviously, supplying power to these sensors from a central energy source (e.g. battery) is impractical. With the recent advancement in MEMS technology (micro electromechanical systems) [13,14], many of these sensors will become much more power efficient as well as smaller and
cheaper. Also, certain remote sensors may be self-powered. For example, any sensor or I/O device that is in use when placed on the skin (e.g. heart rate monitor, mini-speaker in ear) may in principle be self-powered (at least partially) from the heat differential between the skin and the ambient air which can be converted into electrical power by means of the thermoelectric Seebeck effect. This effect was recently used to self-power a watch [15]. Systems Issues. Due to the probably different instruction set in each processor, each task associated with an interrupt must be compiled for each of those processors that it makes sense, from a power savings perspective, to execute the task on. One consequence of this is that both more ROM and RAM is needed to hold the extra task, OS and library binaries. In turn, this may result in increased cost and increased footprint. Another thing that will impact cost is the longer time required to develop and profile mature code for multiple processors. The fact that there is more than one processor in the system will also boost cost and physical size. Finally, in order for the programmer to properly code and profile the application code, a significant understanding and appreciation for the power and performance characteristics of the device is needed. We have proposed a system in which GOV is able to schedule tasks dynamically and power efficiently by selecting the most optimal processor from a range of processors. A static system is the special case of the dynamic system in which each task only has one processor entry in the DAT/TAT. The dynamic approach allows for more freedom to expand and configure the system. Though the static approach is simpler and may be a quite sufficient solution for devices that are designed to offer zero to limited expandability and/or which are designed to perform only a well known set of functions. In a static system the user/programmer would have to determine on which processor a new application should run on, whereas in a dynamic system, the user would just load the software and the system would figure this out by itself.
7 Summary We have proposed a low-power multi-processor computer system. It was shown how such a system may be constructed and operated to exhibit much lower power consumption than a single processor system by taking advantage of the energy-per-cycle differential between high-end and properly-designed low-end processors. It was pointed out that energy-efficient low-speed (say <5 MHz) processors do not exist and that there is no ongoing commercial effort to fabricate them either. The same may be said about memory components. This is not surprising as there hasn’t been an incentive to develop such products; there hasn’t been an application which demands such parts, and which would merit the significant investment required to design, test and mature them. We believe, however, that the applications and thus the motivation are presently emerging in that there is a continuing effort to make pervasive devices smaller, lighter, more wearable and/or more distributed. These devices accordingly will have limited energy capacity. Smart
Dust [16] is an extreme example of this trend. If the application is important enough, the excess investment required to develop and mature a low-power multi-processor system may be justified.
Acknowledgements We would like to thank Chandra Narayanaswami and Mandayam Raghunath of IBM Research for support and helpful discussions. We also thank Jaime Moreno, David J. Frank, Gheorghe Almasi and Jose Castanos of IBM Research for useful discussions.
References
1. W.R. Hamburgen et al, "Itsy: Stretching the Bounds of Mobile Computing," IEEE Computer, 4/2001.
2. N. Kamijoh et al, "Energy trade-offs in the IBM Wristwatch computer," Int'l Symp. Wearable Computing (ISWC2001), Zurich, 10/8-9/2001.
3. A.P. Chandrakasan, R.W. Brodersen, "Low Power Digital CMOS Design," Kluwer Press, 1995.
4. F. Inagami, "Mobile telephone terminal having selectively used processor unit for low power consumption," US Patent #5,058,203, 1991.
5. T.E. DAiley, "Low power architecture for portable and mobile two-way radios," US Patent #5,487,181, 1996.
6. T. Makimoto et al, "The Cooler the Better: New Directions in the Nomadic Age," IEEE Computer, 4/2001.
7. Samsung K6F1616R6M 16Mbit SRAM data sheet, 6/2001.
8. Intel 28F320W18 32Mbit FLASH data sheet, 8/2001.
9. J.-H. Kwon et al, "A three-port nRERL register file for ultra-low-energy applications," Int'l Symp. Low Power Electronics Devices (ISLPED'00), 7/2000.
10. D.J. Frank and P.M. Solomon, "Electroid-oriented adiabatic switching circuits," Int'l Symp. Low Power Electronics & Devices (ISLPED'95), 4/1995.
11. R.H. Dennard and D.J. Frank, "Memory with adiabatically switched bit lines," US Patent 5,526,319, 1996.
12. S. Avery and M. Jabri, "A three-port adiabatic register file suitable for embedded applications," Int'l Symp. Low Power Electronics Devices (ISLPED'98), 8/1998.
13. R. Frank, "Understanding smart sensors," 2nd Ed., Artech House, 2000.
14. R. Allan, "MEMS designs gear up for greater commercialization," Electronic Design, 6/2000.
15. http://jin.jcic.or.jp/trends98/honbun/ntj990207.html, 2/1999.
16. J.M. Kahn et al, "Next century challenges: Mobile networking for Smart Dust," ACM Int'l Conf. Mobile Computing & Networking (MOBICOM'99), 8/1999.
An Integrated Heuristic Approach to Power-Aware Real-Time Scheduling
Pedro Mejia (1), Eugene Levner (2), and Daniel Mossé (3)
1 CINVESTAV-IPN, Sección de Computación, Av. IPN 2508, México DF, [email protected]
2 Holon Academic Institute of Technology, Department of Computer Science, 52 Golomb St, Holon 58102, Israel, [email protected]
3 Computer Science Department, University of Pittsburgh, Pittsburgh, PA 15260, [email protected]
Abstract. In this paper we propose a novel scheduling framework for a dynamic real-time environment that experiences power consumption constraints. This framework is capable of dynamically adjusting the voltage/speed of the system, such that no task in the system misses its deadline and the total energy savings of the system are maximized. Each task in the system consumes a certain amount of energy, which depends on a speed chosen for execution. The process of selecting speeds for execution while maximizing the energy savings of the system requires the exploration of a large number of combinations, which is too time consuming to be computed on-line. Thus, we propose an integrated heuristic methodology which executes an optimization procedure and an approximate greedy algorithm in a low computation time. This scheme allows the scheduler to handle power-aware real-time tasks with low cost while maximizing the use of the available resources and without jeopardizing the temporal constraints of the system. Simulation results show that our heuristic methodology achieves a performance with near-optimal results.
1
Introduction
Power management is increasingly becoming a design factor in portable and hand-held computing/communication systems. Energy minimization is critically important for devices such as laptop computers, PCS telephones, PDAs and other mobile and embedded computing systems simply because it leads to extended battery lifetime. The problem of reducing and managing energy consumption has been addressed in the last decade with a multi-dimensional effort by the introduction of engineering components and devices that consume less power, low power techniques involving VLSI/IC designs, algorithm and compiler transformations, and by the design of computer architectures and software with power as a primary
source of performance. Recently, hardware and software manufacturers have introduced standards such as the ACPI (Advanced Configuration and Power Interface) [8] for energy management of laptops, desktops and servers that allow several modes of operation, turning off some parts of the computer (e.g., the disk) after a preset period of inactivity. Energy management is also achieved by variable voltage scheduling (VVS), which involves dynamically adjusting the voltage and frequency (hence, the CPU speed). By reducing the frequency at which a component operates, a specific operation will consume less energy but may take longer to complete. Although reducing the frequency alone will reduce the average energy used by a processor over that period of time, it may not always deliver a reduction in energy consumption overall, because the power consumption is linearly dependent on the increased time and quadratically dependent on the increased/decreased voltage. In the context of dynamic voltage scaled processors, VVS in real-time systems is a problem that assigns appropriate clock speeds to a set of periodic tasks, and adjust the voltage accordingly such that no task misses its predefined deadline while the total energy savings in the system is maximized. The aim in this work is to study the problem of maximizing energy savings during the scheduling of dynamic real-time tasks in a single processor environment. In a dynamic environment, we must compute a solution for our power optimization problem at every task arrival (and departure). The identification of feasible options that maximize our optimality criteria (expressed as the total energy savings of the system) requires the exploration of a large combinatorial space of solutions. This optimization problem is stated in this paper as a linear (0/1) multiple-choice knapsack optimization problem [16]. In order to cope with the highly computation costs of the dynamic real-time environment, we have developed a low-cost power-aware scheduling paradigm. Our Power-Optimized Real-Time Scheduling Server (PORTS) consists of four stages: (a) an acceptance test for deciding if and when dynamically arriving tasks can be accepted in the system, (b) a reduction procedure which transforms the original multiple-choice knapsack optimization problem into a standard knapsack problem, (c) a greedy heuristic algorithms used to solve the transformed optimization problem, and (d) a restoration algorithm which restores the solution of the original problem from the transformed problem. The optimization procedure developed (b,c and d above) are novel mathematical formulations which provide a near-optimal solution for the problem of selecting speeds of execution of all tasks in the system. The solution developed satisfies the condition of maximizing the energy savings of the system while guaranteeing the deadlines of all tasks in the system. The performance of the PORTS Server and its heuristic algorithms will be compared with the performance of several known algorithms. The remainder of this paper is organized as follows. In Section 2 related models and previous work are reviewed. In Section 3, the system and energy models used in this paper are defined. In Section 4, the power-optimized scheduling is formulated as an optimization problem. In Section 5, the Power-Optimized Real-Time Scheduling Sever (PORTS) is described and in Section 6 we describe
a methodology for handling power-aware real-time tasks. In Section 7, simulation results are presented to show the performance of the PORTS Server. Finally, Section 8 presents concluding remarks.
2 Related Work on Variable Voltage Scheduling
Broadly speaking, there are two methods to reduce power consumption of processors through OS-directed energy management techniques. The first is to bring a processor into a power-down mode, where only certain parts of the computer system such as the clock generation and the timer circuits are kept running when the processor is in idle state. Most power-down modes have a trade-off between the amount of power savings and the latency overhead incurred during mode change. For an application that cannot tolerate latency, as those in real-time systems, the applicability of power-down modes is limited. The second method is to dynamically change the speed of a processor by varying the clock frequency along with the supply voltage. Power Reduction via variable voltage can be classified as static and dynamic techniques. Static techniques, such as static scheduling, compilation for low power [17] and synthesis of systems-on-a-chip [7], are applied at design time. In contrast, dynamic techniques use runtime behavior to reduce power when systems are serving dynamically arriving real-time tasks, light workloads or the system is idle. Static (or off-line) scheduling methods to reduce power consumption in realtime systems were proposed in [24, 10, 5]. These approaches address task sets with a single period or aperiodic tasks. Heuristics for on-line scheduling of aperiodic tasks while not hurting the feasibility of off-line periodic requests are proposed in [6]. Non-preemptive power-aware scheduling is investigated in [5]. Recent work on VVS includes the exploitation of idle intervals in the context of the Rate Monotonic and Earliest Deadline First (EDF) scheduling frameworks [19, 11, 2, 15]. Most of the above research work on VVS assumes that all tasks have identical power functions. Using an alternate assumption, efficient power-aware scheduling solutions are provided where each real-time tasks have different power consumption characteristics [1, 4]. Although systems which are able to operate on an almost continuous voltage spectrum are rapidly becoming a reality thanks to advances in power-supply electronics [3], it is a fact nowadays that most of the microprocessors that support dynamic voltage scaling use a few discrete voltage levels. Some examples of processors that support discrete voltage scaling are: (a) the Crusoe processor [23] which is able to dynamically adjust clock frequency from 200 to 700 MHz and from 1.1 V to 1.6 V, in 33 MHz steps; (b) the ARM7D processor [22] which can run at 33MHz and 5V as well as at 20MHz and 3.3V; and (c) the Intel StrongARM SA1100 processor, which supports 11 clock speeds: 59-221 MHz in 14.7 MHz Steps [9].
3 System and Energy Models
We consider a set T = {T1, . . . , Tn} of n periodic preemptive real-time tasks running on one processor. Tasks are independent (i.e., do not share resources) and have no precedence constraints. Each task Ti arrives in the system at time ai. The Earliest Deadline First (EDF) [13] scheduling policy will be considered. The life-time of each task Ti consists of a fixed number of instances ri, that is, after the execution of ri instances, the task leaves the system. The period of Ti is denoted by Pi, which is equal to the relative deadline of the task. Examples of event-driven real-time systems exhibiting this behavior include: (1) Internet video conferencing and multimedia systems, where media streams are generated aperiodically; each stream contains a fixed number of periodic instances which are transmitted over the network, and (2) digital signal processing, where each task processes source data that often arrives in a bursty fashion. Given a CPU speed determined by a voltage/frequency pair, the worst-case workload is represented by the traditional worst-case execution time (WCET) value. Note, however, that in a VVS framework, where the actual execution time depends on the CPU speed, the worst-case number of required CPU cycles is a more appropriate measure of the workload. We denote by Ci the number of processor cycles required by Ti in the worst case. Under a constant speed SPi (given in cycles per second), the execution time of the task is ti = Ci / SPi. A schedule of periodic tasks is feasible if each task Ti is assigned at least Ci CPU cycles before its deadline at every instance. The utilization of a task denotes the amount of processor load, in percentage, that the task demands for execution. Ui = ti / Pi (or Ci / (SPi · Pi)) denotes the utilization of task Ti. According to EDF, a set of tasks is feasible (no task misses its deadline) if the utilization of the system is less than or equal to the total capacity of the system, Σ_{i} Ui ≤ c. For EDF, c = 1; that is, the achievable capacity is 100%. We assume that at the arrival of any task, the CPU speed can be changed at discrete levels between a minimum speed SPmin (corresponding to a minimum supply voltage level necessary to keep the system functional) and a maximum speed SPmax. SPij denotes the speed of execution of an instance of task Ti when it executes at speed level j, and Uij denotes the utilization of task Ti executing at speed j. The power consumption of task Ti is denoted by gi(SP), assumed to be a strictly increasing convex function [3], specifically a polynomial of at least second degree. If the task Ti occupies the processor during the time interval [t1, t2], then the energy consumed during this interval is E(t1, t2) = ∫_{t1}^{t2} gi(SP(t)) dt. The total energy consumed in the system from t = 0 up to t = t2 is therefore E(0, t2). We assume that the speed remains the same during the execution of a single instance. Finally, a schedule is energy-optimal if it is feasible and the total energy consumption for the entire execution of the system is minimal. While applying voltage-clock scaling under EDF scheduling, we make the following additional assumptions: (1) The time overhead associated with voltage switching is negligible. According to [23] the time overhead associated with
voltage switching in the Transmeta Crusoe microprocessor is less than 20 microseconds per step. The worst-case scenario of a full swing from 1.1 V to 1.6 V takes 280 microseconds. (2) Different tasks have different power consumptions. This assumption is based on the real-life fact that the power dissipation depends on the nature of the software each task runs. The assumption is clearly justified by the following examples: some tasks will use more of the memory system (in addition to the cache), some tasks will use the floating point unit more than others, and some will ship work to specialized processors (e.g., DSPs, micro-controllers, or FPGAs).
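For concreteness, the following short Python sketch (not from the paper; all speed, cycle and period values are illustrative assumptions) shows how the quantities defined in this section — execution time, utilization, per-instance energy under a convex power function, and the EDF feasibility condition — can be computed.

def execution_time(cycles, speed):
    """t_i = C_i / SP_i: execution time of one instance at a speed given in cycles/second."""
    return cycles / speed

def utilization(cycles, speed, period):
    """U_ij = C_i / (SP_ij * P_i): fraction of the processor demanded by task T_i."""
    return cycles / (speed * period)

def instance_energy(cycles, speed, k=1.0, x=2.0):
    """Energy of one instance, assuming a convex power function g(SP) = k * SP**x."""
    return k * speed ** x * execution_time(cycles, speed)

def edf_feasible(tasks, capacity=1.0):
    """EDF feasibility: the sum of utilizations must not exceed the capacity c (= 1 for EDF)."""
    return sum(utilization(c, s, p) for (c, s, p) in tasks) <= capacity

# Illustrative (cycles, assigned speed, period) triples for three hypothetical tasks.
tasks = [(2.0e6, 2.0e7, 0.5), (1.0e6, 1.0e7, 0.25), (3.0e6, 3.0e7, 1.0)]
print(edf_feasible(tasks))   # True: utilizations 0.2 + 0.4 + 0.1 <= 1.0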
4 Formulation of the Problem
In a real-time system with energy constraints, the scheduler should be able to guarantee the timing constraints of all tasks in the system and to select the speed of execution of each task such that the energy consumption of the system is minimized, or equivalently, the energy savings of the system are maximized. Therefore, the problem can be formulated as follows. Each time a new task Ti arrives or leaves the system, the problem is to determine the speed of execution for each task in the system such that no task misses its deadline and the energy savings of the system are maximized. Note that a solution to this problem must be computed each time a new task arrives or leaves the system; therefore we cannot allow a solution with high computation time.
4.1 The Optimization Problem
For each task Ti in the system we define a set of speeds of execution which will be called class Ni. Each level of speed j ∈ Ni has an energy saving computed by
Sij = Ei1 − Eij    (1)
where Ei1 is the energy consumed by task Ti executing at its maximum speed and Eij denotes the energy consumption of Ti executing at speed j. Furthermore, each task running at speed SPij will have utilization Uij = Ci / (SPij · Pi). Note that the size of class Ni is ni and the total number of items is m = Σ_{i=1}^{n} ni. It is assumed that the items j ∈ Ni for all tasks are arranged in non-decreasing order, so that Si1 and Ui1 are the items with the smallest values in Ni. Each task Ti in the system accrues an accumulated energy saving Si^k upon executing a number of instances during the interval of time between arrivals ak and ak+1. S^k denotes the amount of energy savings accrued by all the tasks in the system during ak+1 − ak:
S^k = Σ_{i=1}^{n} Si^k    (2)
The aim of this optimization problem is to find a speed level j ∈ Ni for each task Ti, such that the sum of energy savings for all tasks is maximized without the utilization sum exceeding the capacity of the system c. That is,
maximize Z0 = Σ_{i=1}^{n} Σ_{j∈Ni} Sij xij
subject to Σ_{i=1}^{n} Σ_{j∈Ni} Uij xij ≤ c,
Σ_{j∈Ni} xij = 1, i = 1, ..., n,
xij = 1 if speed j ∈ Ni for task Ti is chosen, and xij = 0 otherwise.
We call this problem Problem P0. By achieving the optimality criteria, whenever a new task arrives or departs from the system, we intend to maximize the accumulated energy savings S^k for each arrival, and therefore to maximize the accumulated energy savings obtained after scheduling the entire set of tasks for the complete duration of the schedule. We have formulated the power saving problem as a Multiple-Choice Knapsack Problem (MCKP) with 0-1 variables [16]. According to the real-life requirements of dynamic power-aware real-time systems, any instance of the medium-size MCKP containing 10 to 80 tasks with 5 to 40 different speed levels is to be solved within a few milliseconds. However, the MCKP is known to be NP-hard [16], which implies that it is very unlikely that a fast (polynomial-time) exact method can be designed for its solution. From a practical point of view this means that some of the available exact methods for power-aware scheduling that solve our optimization problem, such as dynamic programming [16], Lagrange multipliers [1], mixed-integer linear programming [21] and enumeration schemes [6], do not satisfy the above realistic requirements for solving the problem.
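As an illustration, a minimal sketch (an assumed data layout, not the authors' implementation) of how an instance of Problem P0 can be represented: one class per task, each item holding the energy saving Sij and utilization Uij of one speed level, with item 0 corresponding to the maximum speed.

from dataclasses import dataclass

@dataclass
class Item:
    task: int      # index of task T_i
    level: int     # speed level j within class N_i (0 = maximum speed)
    saving: float  # S_ij = E_i1 - E_ij
    util: float    # U_ij = C_i / (SP_ij * P_i)

def build_classes(cycles, periods, speeds, k=1.0, x=2.0):
    """Build classes N_1..N_n; speeds are ordered from maximum to minimum.
    The energy model g(SP) = k * SP**x is an assumption for illustration."""
    classes = []
    for i, (C, P) in enumerate(zip(cycles, periods)):
        e_max = k * speeds[0] ** x * (C / speeds[0])          # E_i1: energy at maximum speed
        items = [Item(i, j, e_max - k * s ** x * (C / s), C / (s * P))
                 for j, s in enumerate(speeds)]
        classes.append(items)
    return classes

classes = build_classes(cycles=[2e6, 1e6], periods=[0.5, 0.25],
                        speeds=[3e7, 2e7, 1e7])   # three hypothetical speed levels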
5 PORTS: Power-Optimized Real-Time Scheduling Server
The Power-Optimized Real-Time Scheduling Server (PORTS) is an extension of the Earliest Deadline First scheduling algorithm (EDF [13]). The PORTS Server is capable of handling dynamic real-time tasks with power constraints, such that the energy savings of the system are maximized and the deadlines of the tasks are always guaranteed. In order to meet our optimality criteria, when new tasks arrive in the system, the PORTS Server adjusts the load of the system by controlling the speed of execution of the tasks. The PORTS Server is activated whenever a new task arrives in the system. The PORTS Server first executes a Feasibility Test (FT) to decide whether or not the new task can be accepted for execution in the system. If the new task is accepted, an optimization procedure is executed to calculate the speeds of execution of all tasks in the system. This optimization procedure consists of three parts:
1. A reduction algorithm, which converts the original MCKP to a standard KP.
2. An approximation algorithm (e.g., the Enhanced Greedy Algorithm) capable of finding an approximate solution to the reduced KP, and
3. A restoration algorithm, which reconstructs the solution of the MCKP from the KP.
The solution provided by the optimization procedure is such that no task in the system misses its deadline and the speeds of execution chosen for all tasks maximize the energy savings of the system. After the optimization procedure is executed, the Total Bandwidth Server [14] is used to compute the start time of the new task. Finally, with the start time of the new task computed and the solution provided by the optimization procedure (the set of speeds for execution), the PORTS Server will schedule the new task in the system. The PORTS Server is also activated when a task leaves the system, in which case the Feasibility Test is not executed.
6 Handling Power-Aware Real-Time Tasks
The proposed method consists of five basic parts, or stages, as illustrated in Figure 1 and described in detail in the following subsections.
6.1 Activating the PORTS Server and Feasibility Test
The two conditions for activating the PORTS Server, and their procedures, are:
1. Task Arrival. When a new task Tj arrives in the system, the feasibility test is executed. The task is rejected if, when running all tasks (including Tj) at the maximum speed (minimum utilization), the system is not feasible. Otherwise, the new task is accepted:
Feasibility Test (FT): Tj is accepted if Umin = Σ_{i=1}^{n} Ui1 ≤ 100%; otherwise Tj is rejected.
After a new task has been accepted in the system, the next problem is to choose the speed of execution of each task in the system. This problem is related to our optimization problem because, by choosing a speed of execution for task Ti, we obtain its corresponding energy savings. Obviously, energy savings are minimal when all tasks execute at their maximal speeds. Therefore, our goal is to choose the speed of execution of each task such that our optimization criterion is met.
2. Task Departure. The PORTS Server is also activated when a task leaves the system. In this case, the optimization procedure is executed to satisfy the optimality criteria for the new set of tasks in the system, and the Feasibility Test is clearly not needed.
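A minimal sketch of the feasibility test above (the utilization numbers are illustrative assumptions), given the minimum utilizations Ui1 of the tasks already in the system:

def feasibility_test(min_utilizations, new_task_min_util, capacity=1.0):
    """Accept the arriving task iff the sum of U_i1 (all tasks at maximum speed,
    including the new task) does not exceed the capacity (100%)."""
    return sum(min_utilizations) + new_task_min_util <= capacity

print(feasibility_test([0.2, 0.3, 0.25], 0.2))   # True  (0.95 <= 1.0): accepted
print(feasibility_test([0.2, 0.3, 0.25], 0.3))   # False (1.05 >  1.0): rejected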
6.2 Reduction Scheme from MCKP to the Classical KP
Our approximation algorithm is based on the reduction of the MCKP to the equivalent KP using the convex hull concept [16]. In order to reduce the MCKP, denoted by P0, the following auxiliary problems will be used:
[Figure: the stages of the PORTS methodology for a newly arriving task — activating the PORTS Server and the feasibility test; reduction of the MCKP to the KP through Problems P0 (initial), P1 (truncated), P2 (truncated relaxed), P3 (convex hull) and P4 (equivalent); the greedy algorithm; the Total Bandwidth Server, which computes the scheduling time; and restoration of the solution (the set of speeds) to the original problem P0 before scheduling the new task.]
Fig. 1. Methodology for Handling Power-Aware Real-Time Tasks
P1: The Truncated MCK Problem
Problem P1 is constructed from P0 by extracting the lightest item from each class and assuming that all these items are inserted into the knapsack. The sum of the lightest items from each class is denoted by S0 = Σ_{i=1}^{n} Si1 and U0 = Σ_{i=1}^{n} Ui1. When formulating P1, we have to write Σ_{j∈Ni} xij ≤ 1 (instead of Σ_{j∈Ni} xij = 1) because the lightest items are assumed to be already inserted into the knapsack. Therefore, some or even all classes in Problem P1 may contain no items, that is, it is allowed that Σ_{j∈Ni} xij = 0 in the optimal solution of Problem P1.
Problem P1:
Maximize Z1 = Σ_{i=1}^{n} Σ_{j∈Ni} (Sij − Si1) xij
subject to Σ_{i=1}^{n} Σ_{j∈Ni} (Uij − Ui1) xij ≤ (c − U0),
Σ_{j∈Ni} xij ≤ 1, i = 1, ..., n,
xij = 0 or 1, for j ∈ Ni, i = 1, ..., n.
P2: The Truncated Relaxed MCK Problem
Problem P2 is formulated from Problem P1 by allowing a relaxation of the variable integrality condition: 0 ≤ xij ≤ 1. Let Z2 be the objective function of Problem P2. The reason for introducing this problem is that its exact solution can be found in low computation time, which in turn provides a good approximate solution to Problem P1 and hence a good approximate solution to P0. The algorithm for exactly solving Problem P2 [12, 20, 16, 18] can be obtained by solving the following P3 and P4 problems.
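A small sketch of the truncation step P0 → P1 described above (illustrative code, not the authors'): the lightest item of every class is committed to the knapsack, the remaining items are re-weighted relative to it, and the capacity shrinks to c − U0.

def truncate(classes, capacity):
    """classes[i] is a list of (S_ij, U_ij) pairs, item 0 being the lightest (maximum speed).
    Returns the truncated classes, the residual capacity c - U0, and the committed saving S0."""
    S0 = sum(cls[0][0] for cls in classes)
    U0 = sum(cls[0][1] for cls in classes)
    truncated = [[(s - cls[0][0], u - cls[0][1]) for (s, u) in cls[1:]] for cls in classes]
    return truncated, capacity - U0, S0

classes = [[(0.0, 0.10), (3.0, 0.15), (5.0, 0.25)],   # task 1: (saving, utilization) per speed
           [(0.0, 0.20), (4.0, 0.30), (7.0, 0.50)]]   # task 2 (illustrative numbers)
print(truncate(classes, capacity=1.0))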
P3: The Relaxed MCK Problem on the Convex Hull
Given P2, a convex hull of items in each class can be found [16]. The elements constituting the convex hull will be called P-undominated and denoted by (Rij, Hij) (this notion will be explained below in more detail). Let us start by denoting (Sij − Si1) in P2 by pij and (Uij − Ui1) by wij.
Definition 1. (Sinha and Zoltners [20]). If two items r and s in the same class Ni in Problem P2 satisfy pir ≤ pis and wir ≥ wis, then item r is said to be dominated by s.
Definition 2. (Sinha and Zoltners [20]). In every optimal solution of P3, xir = 0, that is, the dominated items do not enter into the optimal solution.
Definition 3. (Sinha and Zoltners [20]). If some items r, s, t from the same class Ni are such that pir ≤ pis ≤ pit, wir ≤ wis ≤ wit, and
(pis − pir) / (wis − wir) ≤ (pit − pis) / (wit − wis),    (3)
then xis = 0 in every optimal solution of P2.
The item s ∈ Ni is called P-dominated [16]. In what follows, we exclude P-dominated points from each class Ni when solving the relaxed Problem P3 to optimality. The items remaining after we have excluded all the P-dominated points are called P-undominated. All these items belonging to the same class, if depicted as points in the two-dimensional space (R, H), form the upper convex hull of the set Ni [16]. Note that R denotes energy savings and H denotes utilization. The set of all P-undominated items may be found by examining all the items in each class Ni in increasing order, according to Equation 3. Because of the ordering of the items, the upper convex hull can be found in O(m log m) time [20]. Recall that m = Σ_{i=1}^{n} ni. The obtained Multiple-Knapsack Problem on the Upper Convex Hull is denoted as Problem P3.
Problem P3:
Maximize Z3 = Σ_{i=1}^{n} Σ_{j∈Ni} Rij yij
subject to Σ_{i=1}^{n} Σ_{j∈Ni} Hij yij ≤ (c − U0),
Σ_{j∈Ni} yij ≤ 1, i = 1, ..., n,
0 ≤ yij ≤ 1, for j ∈ Ni, i = 1, ..., n.
As described in [20], some items belonging to class Ni (i.e., with yij = 1) can be included into the solution entirely; they are called integer variables. On the other hand, some items may exceed the constraint Σ_{i=1}^{n} Σ_{j∈Ni} Hij yij ≤ (c − U0), so that only part of them can be included into the solution. These items are called fractional variables.
P4: The Equivalent Knapsack Problem (EKP)
The equivalent Knapsack Problem P4 is constructed from P3. In each class, slices, or increments, are defined as follows:
Pij = Rij − Ri,j−1,  i = 1, . . . , n;  j = 2, . . . , CHi    (4)
Wij = Hij − Hi,j−1,  i = 1, . . . , n;  j = 2, . . . , CHi    (5)
where CHi is the number of P-undominated items in the convex hull of class Ni. When solving the (continuous) Problem P3, we may now discard the condition Σ_{j∈Ni} xij ≤ 1, i = 1, ..., n, and solve the problem of selecting slices in each class.
Problem P4:
Maximize Z4 = Σ_{i=1}^{n} Σ_{j∈Ni} Pij zij
subject to Σ_{i=1}^{n} Σ_{j∈Ni} Wij zij ≤ (c − U0),
0 ≤ zij ≤ 1, for j ∈ Ni, i = 1, ..., n.
From the analysis of Problem P4 [20, 12] it follows that, in all integer classes: if some variable is equal to 1 (i.e., the variable is chosen) then all preceding variables are also 1; if some variable is equal to zero (i.e., the variable is not chosen) then all subsequent variables are also zero. From this fact the following important properties of Problem P4 follow.
Property 1. The sum of several slices in Problem P4 corresponds to a single item in Problem P3, and in each class all the slices are numbered in decreasing order of their ratios Pij/Wij.
Property 2. There should not be a gap in the set of slices corresponding to a solution in any class. To exemplify this property, let us consider the class Nj containing the slices r, s and t. According to Property 2, the following solutions are valid: {}, {r}, {r, s} and {r, s, t}, while {s}, {t}, {r, t} and {s, t} are invalid. In particular, {r, t} is invalid because slice s is not included, causing a gap in the solution.
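The following sketch (illustrative; it assumes the items of a truncated class are given as (p, w) = (Sij − Si1, Uij − Ui1) pairs, including the zero baseline item) removes dominated and P-dominated items with the test of Equation 3 and then cuts the surviving upper-convex-hull items into the slices of Problem P4.

def upper_convex_hull(items):
    """items: list of (p, w). Keep only the P-undominated items (upper convex hull)."""
    items = sorted(items, key=lambda it: (it[1], -it[0]))   # by weight; prefer larger saving
    hull = []
    for p, w in items:
        if hull and p <= hull[-1][0]:
            continue                       # dominated (Definition 1): more weight, no more saving
        while len(hull) >= 2:
            p1, w1 = hull[-2]
            p2, w2 = hull[-1]
            # Equation 3: the middle point is P-dominated if its incremental ratio
            # is no better than that of the point that would replace it
            if (p2 - p1) * (w - w2) <= (p - p2) * (w2 - w1):
                hull.pop()
            else:
                break
        hull.append((p, w))
    return hull

def slices(hull):
    """Incremental slices (P_ij, W_ij) between consecutive convex-hull items (Eqs. 4, 5)."""
    return [(hull[j][0] - hull[j - 1][0], hull[j][1] - hull[j - 1][1]) for j in range(1, len(hull))]

hull = upper_convex_hull([(0, 0.0), (3, 0.05), (4, 0.15), (7, 0.20)])
print(hull, slices(hull))   # the item (4, 0.15) is P-dominated; slice ratios decrease (Property 1)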
6.3 Enhanced Greedy Algorithm
In order to solve the equivalent knapsack Problem P4, we may collect all slices from all classes (following a decreasing order of their ratios Pij/Wij) as candidates for inclusion in a single class PW. With all slices in the single class PW, the problem now becomes the standard knapsack problem. The main idea of the Standard Greedy Algorithm (SGA) for solving the standard knapsack is to insert the slices {pi, wi} (obtained from the single class PW) into the available capacity of the knapsack (c − U0) in order of decreasing ratio pi/wi, until the knapsack capacity is completely full, or until no more slices can be included. If the knapsack is filled to its full capacity (c − U0) in the mentioned order, then this is the optimal solution. While inserting slices into the knapsack, one of them may not fit into the available capacity of the knapsack. This slice is called the break-slice [16], and its corresponding class is called the break-class.
Enhanced Greedy Algorithm (EGA)
input:  a set of slices pj and wj from P4, ordered by the ratio pj/wj;
        c: size of the knapsack; n̂: number of items in Problem P4
output: xj: solution set; (p∗, u∗): resulting energy savings and utilization
begin
  ĉ = c − U0; p∗ = 0; u∗ = 0;
  for j = 1 to n̂ do
    if wj > ĉ then
      xj = 0; break-slice = j;
      exit;   (stopping condition of the SGA algorithm)
    else
      xj = 1; ĉ = ĉ − wj;
      p∗ = p∗ + pj; u∗ = u∗ + wj;
end;

Fig. 2. Greedy Algorithms: SGA, EGA
Contrary to the solution proposed by Pisinger [18], our method does not consider fractional items to be part of the solution. Therefore, we will discard the break-slices, and consequently (following Property 2) all subsequent slices from the same break-class. To the greedy scheme of [18] we add the following two rules.
– Rule 1. When computing the solution of P4, take into account Z4 = max{pmax, Ẑ4}, where pmax = max{pi} is the maximal energy-saving item in the truncated MCKP P2 and Ẑ4 = p1 + p2 + . . . + pk−1 is the approximate solution obtained by the Standard Greedy Algorithm (SGA).
– Rule 2. After finding the break-slice, the remaining empty space is filled in by slices from the non-break classes in decreasing order of the ratios pi/wi.
The SGA algorithm is executed until the first break-slice is found. The Enhanced Greedy Algorithm (EGA) is executed for all slices in the single class PW. According to Rule 2, break-slices are not considered to be part of the solution in the EGA algorithm. The SGA and EGA algorithms are illustrated in Figure 2.
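A runnable sketch of the greedy step on the single slice list PW (an illustration, assuming the slices are already sorted by decreasing pi/wi as produced by the convex-hull reduction). The SGA variant stops at the break-slice; the EGA variant keeps filling with slices from other classes and, to respect Property 2, skips every class in which some slice has already failed to fit. Rule 1 (taking the maximum with the single best item pmax) is omitted for brevity.

def greedy(pw, capacity, enhanced=True):
    """pw: list of (p, w, class_id) slices sorted by decreasing p/w.
    Returns (total saving, indices of the chosen slices)."""
    chosen, total, free = [], 0.0, capacity
    blocked = set()                      # classes with a slice that did not fit
    for idx, (p, w, cls) in enumerate(pw):
        if blocked and not enhanced:
            break                        # SGA: stop at the first break-slice
        if cls in blocked:
            continue                     # never create a gap inside a class (Property 2)
        if w > free:
            blocked.add(cls)             # the first such slice is the break-slice
            continue
        chosen.append(idx)
        total += p
        free -= w
    return total, chosen

pw = [(3.0, 0.05, 0), (4.0, 0.10, 1), (6.0, 0.20, 1), (2.0, 0.10, 0)]  # illustrative slices
print(greedy(pw, capacity=0.30, enhanced=False))   # SGA: (7.0, [0, 1])
print(greedy(pw, capacity=0.30, enhanced=True))    # EGA: (9.0, [0, 1, 3])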
6.4 Restoring the Solution from the EKP to the MCKP
An approximate solution to Problem P4 is obtained as follows:
– SGA Algorithm: Z4 = max{pmax , (p1 + p2 + . . . + pk−1 )}
– EGA Algorithm: Z4 = max{pmax, (p1 + p2 + . . . + pk−1 + α)}
The term α is a possible increment caused by using Rule 2, that is, the profits of additional items from non-break classes. The approximate solution to Problem P0 is defined as Z4 + S0. Recall that S0 = Σ_{i=1}^{n} Si1 is the sum of the elements truncated in Problem P1. From the definition of the slices (described in Equations 4 and 5) and Property 1, it follows that if several slices (for example s, r and t, in that order) belonging to the same class Nj are chosen to be part of the solution of the greedy algorithm, then the item corresponding to slice t is considered to be part of the solution of P0. On the other hand, if no slice is chosen from class Nj to be part of the solution, then the truncated item considered in Problem P1 (Sj,1 and Uj,1) is chosen to be part of the solution. The above criteria allow us to construct, from each class of Problem P4, the corresponding items (speeds) that are part of the solution of Problem P0. The solutions of Problems P1, P2 and P4 can be obtained in O(m) time, while the EGA Algorithm obtains solutions in O(m log m) time.
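A sketch of the restoration step (an illustrative interpretation of the rule above, not the authors' code): within each class, the number of chosen slices selects the corresponding P-undominated item, and a class with no chosen slice falls back to the truncated maximum-speed item Si1/Ui1.

def restore(num_classes, pw, chosen):
    """pw: list of (p, w, class_id) slices; chosen: indices returned by the greedy step.
    Returns, per class, the index of the selected convex-hull item (0 = maximum speed)."""
    taken = [0] * num_classes              # 0 slices taken = keep the truncated item (max speed)
    for idx in chosen:
        taken[pw[idx][2]] += 1             # chosen slices in a class are consecutive (Property 2)
    return taken

pw = [(3.0, 0.05, 0), (4.0, 0.10, 1), (6.0, 0.20, 1), (2.0, 0.10, 0)]
print(restore(2, pw, chosen=[0, 1, 3]))    # [2, 1]: task 0 at hull item 2, task 1 at hull item 1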
6.5 Scheduling the New Task
After the optimization procedure is executed, the Total Bandwidth Server (TBS) [14] calculates the start time of the new task. It is well known that the TBS provides low response times for handling aperiodic tasks. It is important to note that the newly arrived task may not be scheduled immediately at its arrival time, because doing so may cause some deadlines to be missed. The resulting utilization, after executing the optimization procedure, may not be immediately subtracted from the total processor load because, at the arrival time, some tasks may already have delayed the execution of other tasks. Finally, with the start time of the new task computed and the solution provided by the optimization procedure (the set of speeds for execution), the PORTS Server will schedule the new task in the system.
7 Simulation Experiments
The following simulation experiments have been designed to test the performance of the PORTS Server and its ability to achieve our optimality criteria using synthetic task sets. The goals in this simulation experiments are: (1) to measure the quality of the results over a large set of dynamic tasks that arrive and leave the system at arbitrary instants of time, and (2) to measure and compare the performance and run-time of our algorithms against known algorithms. The algorithms used for comparison are: Dynamic Programming (DP) [16], Static Discrete Algorithm (SD) and the Optimal Discrete Algorithm OP(d) [1]. Each plot in the graphs represents the average of a set of 5000 task arrivals. The results shown in the graphs are compared with the SD Algorithm and the size of the knapsack used in the experiments is 1000 (100% of the load).
Each task has a life-time (ri) that follows a uniform distribution between 30 and 200 instances (periods). At the end of its life-time, the task leaves the system. The period Pi of each task follows a uniform distribution between 1000 and 16000 time units, such that the LCM of the periods is 32000. The arrival time of task Ti+1 is computed by ai+1 = (Pi+1 · ri+1) / nt, where nt is the actual number of tasks in the system. It is assumed that, for a given number of speeds, each speed level is computed proportionally between the maximum speed (SPmax = 1.0) and the minimum speed (SPmin = 0.2). For example, if there are 5 speed levels, the speed levels will be {1.0, 0.8, 0.6, 0.4, 0.2}. The utilization of task Ti under minimum speed, Uin, is chosen as a random variable with uniform distribution between 20% and 30%. Ci is computed by Ci = Uin · SPmin · Pi. For each speed, the utilization Uij is computed by Uij = tij / Pi, with tij = Ci / SPij. The power functions used for each task Ti [11, 19, 21] are of the form ki · SP^xi, where ki and xi are random variables with uniform distributions between 2 and 10, and between 2 and 3, respectively. Then, the energy consumption for each task and each speed SPj is computed by Eij = I · (ki · SPj^xi · Ci / (SPj · Pi)), where I is a fixed interval, given by I = LCM. Finally, the input to our Optimization Problem P0 is computed by Equation 1. The performance of our algorithms is measured at each task arrival (and departure) according to the following metrics: – Percentage (%) of Energy Savings: This metric is computed as follows. The solution obtained (in terms of energy consumption) by each algorithm for all tasks gives us the total energy consumption Etot = Σ_{i=1}^{n} Ei. The solution provided by each algorithm is then compared with the solution obtained by algorithm SD, and the percentage of improvement is plotted in the graphs. – Run-Time: This metric denotes the execution time of each algorithm, measured as physical time in microseconds, using an Intel 233 MHz PC with 48 MB of RAM running the Linux operating system. The function used for the measurements is gettimeofday(). We show here two cases to demonstrate the performance of our algorithms. The first case (Figure 3) executes the simulations considering 10 speed levels, and the number of tasks is varied from 5 to 80. In the second case (Figure 4), the number of tasks is set to 30, and the number of speed levels is varied from 3 to 60. The results obtained by algorithm EGA (shown in Figure 3) are 95% to 99% of optimal (when compared with the DP Algorithm), with energy savings ranging from 23% to 25%, while the SGA provides solutions that are 92% to 96% of optimal, with energy savings ranging from 19% to 22%. These results give an improvement of over 80% with respect to the results obtained by the OP(d) algorithm. It is important to note that the continuous OP(c) algorithm was also simulated, giving energy savings between 26% and 30%. The results shown in Figure 3 indicate the low cost of the enhanced greedy algorithms. For the SGA and EGA algorithms the run-time varies from 56 to 853 microseconds. Note the large difference in run-time obtained by the EGA
[Figure: % of energy savings and run-time (microseconds) of the DP, SGA, EGA and OPT(d) algorithms as the number of tasks varies.]
Fig. 3. % of Energy Savings and Run-Time (Microseconds)
algorithms when compared with the DP and the OP(d) algorithms. For our simulation settings, the OP(d) algorithm varies from 155 to 102500 microseconds, and the DP algorithm varies from 2529 to 49653 microseconds. The results shown in Figure 4 indicate how important it is to consider an appropriate number of speed levels for achieving a high percentage of energy savings. As shown in Figure 4, with a low number of speed levels, between 3 and 30, the EGA algorithm gives better performance than the OPT(d) algorithm. However, for more than 30 speed levels the OPT(d) algorithm outperforms the EGA algorithm. For this experiment, the run-time computed (shown in Figure 4) indicates that the OPT(d) algorithm has very little sensitivity to the number of speed levels (i.e., its run-time varied from 6900 to 7100 microseconds). In contrast, our greedy algorithms increase their run-time as the number of speed levels grows. For these experiments, the run-time of the greedy algorithms varied from 99 to 1800 microseconds.
[Figure: % of energy savings and run-time (microseconds) of the DP, SGA, EGA and OP(d) algorithms as the number of speed levels varies.]
Fig. 4. % of Energy Savings and Run-Time (Microseconds)
Further tests were conducted (increasing the number of speed levels), leading to the conclusion that the EGA and OPT(d) algorithms have similar run-times when the number of speed levels approaches 100. The results obtained in our simulations indicate that the Enhanced Greedy Algorithms are a low-cost and effective solution for scheduling power-aware real-time tasks with discrete speeds.
8 Conclusions
In this paper we proposed a power optimization method for real-time applications running on a variable-speed processor with discrete speeds. The proposed solution is based on the use of a Power-Optimized Real-Time Scheduling Server (PORTS), which comprises two parts: (a) a feasibility test, for testing the admission of new dynamic tasks arriving in the system, and (b) an optimization procedure used for computing the speed level of each task in the system, such that the energy savings of the system are maximized. The process of selecting voltage/speed levels for each task while meeting the optimality criteria requires the exploration of a potentially large number of combinations, which is infeasible to do on-line. The PORTS Server finds near-optimal solutions at low cost by using approximate solutions to the knapsack problem. Our simulation results show that our PORTS Server has low overhead and, most importantly, generates near-optimal solutions for the scheduling of real-time systems running on variable-speed processors. We will extend the PORTS Server with algorithms for multiple processors and for real-time tasks with precedence and resource constraints.
References
[1] H. Aydin, R. Melhem, D. Mosse, P. Mejia. "Determining Optimal Processor Speeds for Periodic Real-Time Tasks with Different Power Characteristics". EuroMicro Conference on Real-Time Systems, June 2001.
[2] H. Aydin, R. Melhem, D. Mosse, P. Mejia. "Dynamic and Aggressive Scheduling Techniques for Power-Aware Real-Time Systems". IEEE Real-Time Systems Symposium, Dec. 2001.
[3] T. D. Burd, T. A. Pering, A. J. Stratakos, R. W. Brodersen. "A Dynamic Voltage Scaled Microprocessor System". IEEE J. of Solid-State Circuits, Vol. 35, No. 11, Nov. 2000.
[4] F. Gruian, K. Kuchcinski. "LEneS: Task Scheduling for Low Energy Systems Using Variable Supply Voltage Processors". In Proc. Asia South Pacific DAC Conference, June 2001.
[5] I. Hong, D. Kirovski, G. Qu, M. Potkonjak and M. Srivastava. "Power Optimization of Variable Voltage Core-Based Systems". In Design Automation Conference, 1998.
[6] I. Hong, M. Potkonjak and M. B. Srivastava. "On-line Scheduling of Hard Real-Time Tasks on Variable Voltage Processor". In Computer-Aided Design (ICCAD) '98, 1998.
[7] I. Hong, G. Qu, M. Potkonjak and M. Srivastava. "Synthesis Techniques for Low-Power Hard Real-Time Systems on Variable Voltage Processors". In Proc. of the 19th IEEE Real-Time Systems Symposium, Madrid, December 1998.
[8] Intel, Microsoft, Compaq, Phoenix and Toshiba. "ACPI Specification", developer.intel.com/technology/IAPC/tech.
[9] Intel StrongARM SA-1100 microprocessor developer's manual.
[10] T. Ishihara and H. Yasuura. "Voltage Scheduling Problem for Dynamically Varying Voltage Processors". In Proc. Int'l Symposium on Low Power Electronics and Design, 1998.
[11] C. M. Krishna and Y. H. Lee. "Voltage Clock Scaling Adaptive Scheduling Techniques for Low Power in Hard Real-Time Systems". In Proc. of the IEEE Real-Time Technology and Applications Symposium, 2000.
[12] E. Lawler. "Fast Approximation Algorithms for Knapsack Problems". Mathematics of Operations Research, Nov. 1979.
[13] C. L. Liu, J. Layland. "Scheduling Algorithms for Multiprogramming in Hard Real-Time Environments". J. ACM, 20(1), Jan. 1973.
[14] G. Lipari, G. Buttazzo. "Schedulability Analysis of Periodic and Aperiodic Tasks with Resource Constraints". J. of Systems Architecture, (46), 2000.
[15] J. R. Lorch, A. J. Smith. "Improving Dynamic Voltage Scaling Algorithms with PACE". In Proc. of ACM SIGMETRICS Conference, Cambridge, MA, June 2001.
[16] S. Martello and P. Toth. "Knapsack Problems: Algorithms and Computer Implementations". Wiley, 1990.
[17] D. Mosse, H. Aydin, B. Childers, R. Melhem. "Compiler Assisted Dynamic Power-Aware Scheduling for Real-Time Applications". In Workshop on Compilers and Operating Systems for Low Power (COLP'00), October 2000.
[18] D. Pisinger. "A Minimal Algorithm for the Multiple-Choice Knapsack Problem". European Journal of Operational Research, 83, 1995.
[19] Y. Shin and K. Choi. "Power Conscious Fixed Priority Scheduling for Hard Real-Time Systems". In Proc. of the Design Automation Conference, 1999.
[20] P. Sinha, A. Zoltners. "The Multiple Choice Knapsack Problem". Operations Research, May-June 1979.
[21] V. Swaminathan, K. Chakrabarty. "Investigating the Effect of Voltage-Switching on Low-Energy Task Scheduling in Hard Real-Time Systems". In Proc. Asia South Pacific DAC Conference, 2001.
[22] www.arm.com.
[23] www.transmeta.com.
[24] F. Yao, A. Demers, S. Shenker. "A Scheduling Model for Reduced CPU Energy". IEEE Annual Foundations of Computer Science, 1995.
Power-Aware Task Motion for Enhancing Dynamic Range of Embedded Systems with Renewable Energy Sources Jinfeng Liu, Pai H. Chou, and Nader Bagherzadeh Department of Electrical and Computer Engineering University of California, Irvine, CA 92697-2625, USA {jinfengl,chou,nader}@ece.uci.edu
Abstract. New embedded systems are being built with new types of energy sources, including solar panels and energy scavenging devices, in order to maximize their utility when battery or A/C power is unavailable. The large dynamic range of these unsteady energy sources is giving rise to a new class of power-aware systems. They are similar to low-power systems when energy is scarce; but when energy is abundant, they must be able to deliver high performance and fully exploit the available power. To achieve the wide dynamic range of power/performance trade-offs, we propose a new task motion technique, which tunes the system-level parallelism to the power/timing constraints as an effective way to optimize power utility. Results on real-life examples show an energy reduction of 24% with a 49% speedup over best previous results on the entire system. Keywords: power-aware scheduling/task motion, timing/power constraint modeling, power/performance range, system-level design
1 Introduction
Recent years have seen the emergence of power-aware embedded systems. They are characterized by not only low power consumption, but more generally by their ability to support a wide range of power/performance trade-offs. That is, these systems can be viewed as providing “knobs” that can be turned one direction to reduce power consumption, or the other direction to increase performance. The ability to adapt the range of power/performance trade-offs is driven by new applications that demand very high performance while under stringent timing and power constraints. One example that fits this description is the Mars rover by NASA/JPL [1]. It was designed to roam on Mars to take digital photographs and perform scientific experiments over several hundred days. Its energy sources consist of a battery pack and a solar panel, and future versions are expected to incorporate nuclear generators, thermal batteries, and energy scavenging devices. Besides the Mars rover, many new emerging embedded systems are also following this trend towards new types of heterogeneous, renewable energy sources. Future personal
digital assistants (PDAs) will likely include solar panels as found in many calculators today. Yet another example is the distributed sensors. They are being built today to draw energy from solar power, wind power, or even ocean waves. They represent a great improvement because they enable the system’s continued operation for useful or critical tasks when the traditional energy sources like battery and A/C become unavailable. These new types of energy sources are posing new challenges to designers of power-aware systems. What they all have in common is that many of these new energy sources are far from being ideal power supplies. For example, the output of a portable solar panel today can be up to 15W under direct sunlight, or down to 1mW under incandescent light. Similarly, other sources will be determined by the wind or ocean wave, which can also cause the available power to vary by several orders of magnitude. Embedded systems powered by such sources must be designed to operate in as wide a range as possible. Indeed, new emerging components such as the Intel XScale are able to scale their power/performance over 20×, and this dynamic range will likely to increase. While low power operation is clearly important, the ability to fully exploit the available power when energy is abundant is equally important. However, today’s systems let much free energy go to waste, because they are designed for fixed budgets. For example, a system with an XScale draws approximately 1W of power, but when the solar panel outputs 15W in direct sunlight, up to 1400% of the power will be wasted. Even if there is a rechargeable battery, when it becomes fully charged, the extra power turns into waste heat. This is also the case with the Mars rover, which accomplishes its low-power property by serializing all tasks, including mechanical and heating as well as computation. However, it also discards excess power as waste heat. One way to take advantage of the excess power is to increase parallelism. In fact, parallelism is in general an effective way for both high performance and low power. By operating additional processors at their peak rate, they will be able to take advantage of the abundant energy. Parallelism can also enable a set of processors to operate at a lower power level than a single processor with the same performance. Although it is difficult to parallelize algorithms in general, systems with many concurrent activities present many opportunities for parallelism-based trade-offs. Peak-power poses new challenges to such a power-aware architecture with multiple processors. Today’s systems satisfy the peak-power constraint by construction, that is, each component is given a budget that is guaranteed never to be exceeded according to their data sheet. However, by using multiple processors to fully utilize the available power when abundant, a multi-processor architecture would risk exceeding the total budget when the supply power is low, if it is not designed carefully. Therefore, it is of utmost importance that the proposed scheme be able to fully respect the maximum power as a hard constraint. In this paper, we propose to enhance the dynamic range of these embedded systems by means of task motion and power-aware scheduling. It transforms tasks within their timing constraints and their precedence dependencies in order
to match the parallelism to the available power level. Furthermore, we exploit domain-specific knowledge about the power-consuming tasks to achieve additional significant power/performance improvements over existing schedulers. The enhanced dynamic range and power-awareness enable the system to accomplish more tasks in a shorter amount of time while respecting all timing constraints. The benefits must ultimately be translated into application-specific metrics, but as power-aware systems are deployed in more mission-critical applications, the saving from reduced mission time or enhanced quality may translate into a saving of millions of dollars. Section 2 reviews related work. Section 3 uses an example showing a counterintuitive result when some of the well-known techniques will fail at the system level. However, this problem can be successfully addressed by our new technique, which is presented in Section 4. We discuss experimental results in Section 5.
2 Related Work
To explore the power/performance range in power-aware embedded systems, we can draw from many techniques developed for low power and high performance. This section surveys related work in these areas with a discussion on their integration at the system level. Low power can be achieved in many ways. For system-level designs, where the components are largely off-the-shelf or already designed, the applicable techniques include subsystem shutdown and dynamic voltage scaling (DVS). In the first case, the subsystem shutdown decision can be based on fixed idle times, adaptive timeout, or prediction based on a mix of profile and runtime history [15, 14, 4]. Similarly, power-up may be either event-driven or predictive, in an attempt to minimize the timing or power penalty. In the second case, DVS techniques have been developed for variable-voltage processors (introduced by [16], with follow-up work by [5, 12] and more). Because energy is a quadratic function of voltage, lowering the voltage can result in significant savings while still enabling the processor to continue making progress, unlike the shutdown case. Lowering the voltage will also require a reduction in frequency, which has the effect of reducing dynamic switching power. In addition to low power, the power/performance range can also be increased towards high performance by drawing from previous work on retiming or pipelining and applying it to the system level. Leiserson et al. first established the theoretical foundation for retiming synchronous circuits [8], and this has been extended to loop scheduling for VLIW processors [13, 2, 6]. Shifting tasks in a data flow graph (DFG) across the iteration boundary can result in a shorter execution time or alleviate the resource pressure (e.g., the number of registers and functional units). Such techniques are also used in power minimization by reducing switching activities [7, 17]. Existing techniques need significant enhancements before they can be correctly applied to a system-level power management problem. First, most techniques to date treat either power or timing as an objective, rather than a constraint.
In real systems, the max power budget is a real, hard constraint, whose violation can lead to malfunction. Max power was not of central concern previously, but as we consider additional power sources such as solar, whose power output can vary, max power constraints must be strictly enforced. This becomes especially important as we increase the range of power and performance trade-offs by tuning the parallelism. Second, the tasks to be scheduled are related to each other not only by precedence, data dependency or deadline, but also across different components by dependencies like co-activation, which must be correctly modeled for system-level power management, or else anomalies can occur. Co-activation means the execution of one task requires the power consumption of other dependent services or tasks. A simple example is that when the CPU is running, it imposes a co-activation dependency on the memory. Techniques such as DVS are designed mainly for minimizing CPU power, but they have not considered other components that have dependencies on the CPU. In fact, energy saved on the CPU may be more than offset by the increased energy consumed by the rest of the system. The following section presents a simple example to illustrate such an anomaly when applying DVS without system-level considerations.
3 DVS Anomaly
We present a simple example in Fig. 1 to illustrate an anomaly that arises when DVS is applied without considering system-level dependencies, resulting in an incorrect system. It will be further used to explain our new system model and scheduling technique in the ensuing text. In this example, five tasks a, b, c, x, y are to be scheduled on four execution resources A, B, X, Y. The constraints are:
1. The overall deadline is at time 3.
2. The max power budget is 10W.
3. Tasks a, b and c must be serialized.
4. The execution resources A, B are not voltage-scalable.
5. Only task x can be voltage-scaled on resource X (e.g., a processor), and it has some slack time to finish before time 2.
6. Task y must co-activate with task x, and its resource Y is also not voltage-scalable (e.g., memory, I/O).
Note that task y need not start and finish at the same time as x, but it must envelop x, i.e., start no later than x starts and finish no sooner than x finishes. For simplicity, this example assumes x and y start and finish together. We present schedules as power-aware Gantt charts, where the horizontal and vertical axes represent time and power, respectively. Each chart also consists of a pair of views: time view organizes tasks by horizontal tracks that correspond to power consuming resources (processors, peripherals), and power view stacks the tasks over time to show the power breakdown by tasks. The curve that traces the height of the power view is the power profile for the entire system.
[Figure: power-aware Gantt charts (time view and power view) of the example schedules. (a) The schedule is not valid since the max power budget is exceeded in time slot [0,1] due to parallel tasks x, y and a. (b) The DVS technique reduces the power and energy consumption of task x, but fails to produce a valid schedule for the entire system; the energy consumption of the whole system is increased by co-activation. (c) Our task motion technique shifts task x and its co-activated task y to the previous iteration such that the max power budget is satisfied.]
Fig. 1. An example where DVS fails to reduce power and energy at the system level, while our new technique succeeds
Fig. 1(a) shows a time-valid schedule with a max power violation during time [0, 1]. Rescheduling x and y in [1, 2] will be time-valid but still violates max power. Fig. 1(b) shows the case when DVS was used to slow down task x until its deadline of time 2. Intuitively, reducing both power and energy of task x should eliminate the max power violation, but instead it not only does not reduce max power, but actually increases total energy at the system level. Because x runs more slowly, its co-activated task y must also consume power for longer on a device that is not voltage scalable. As a result, the execution of x and y overlaps that of task b, thereby leading to higher system-level power. Furthermore, energy saved by slowing down x is more than offset by the additional energy consumed by the lengthened y. This anomaly is an example where DVS should not be applied in isolation. Fig. 1(c) shows a feasible solution obtained by our new power-aware task motion technique on iterative tasks. Task x and y are shifted (or promoted) to the previous iteration to overlap task c instead of a or b. As a result, both the max power and the deadline are satisfied. However, the optimal solution cannot be obtained unless we exploit domain-specific knowledge about the task set by eliminating a precedence dependency and replacing it with a utilization constraint. The details will be explained in later sections.
4 Task Motion under Timing and Power Constraints
We propose power-aware task motion for exploring power/performance trade-offs in embedded systems. We first define our constraint model and introduce our representations based on a timing constraint graph, where we capture two classes of constraints: intra-iteration and inter-iteration timing constraints. Task motion shifts tasks across iteration boundaries and relaxes timing constraints to achieve more scheduling opportunities. We also define utilization constraints to support more aggressive but provably correct design space exploration. We close this section by sketching an algorithm that combines power-aware scheduling [9, 10] and task motion as a new “knob” for power-aware designs.
4.1 Constraint Graph and Schedule
The input to the scheduler is a (timing) constraint graph G(V, E), where the vertices V represent tasks, and the edges E ⊆ V × V represent timing constraints between tasks. Each vertex v ∈ V has three attributes, d(v), p(v) and r(v), representing task v’s execution delay, power consumption and resource mapping, respectively. Each edge (u, v) ∈ E has two attributes, δ(u, v) and λ(u, v). δ(u, v) specifies the min/max timing constraints [3]. For any function σ that assigns the start times to tasks u and v as σ(u) and σ(v), σ(v) − σ(u) ≥ δ(u, v). If δ(u, v) ≥ 0, then the edge (u, v) is called a forward edge, and it specifies a min timing constraint. If δ(u, v) < 0, then it is a backward edge indicating a max timing constraint. λ(u, v) is called the dependency depth, which specifies constraints across iterations. An iteration is a full pass of executing each task once in a valid
order. δ(u, v) and λ(u, v) indicate that the execution of task u in iteration i must precede task v in iteration i + λ(u, v) by δ(u, v) time units. If λ(u, v) = 0, edge (u, v) specifies an intra-iteration constraint. Otherwise, it is an inter-iteration constraint. We assume that inter-iteration constraints are only precedence dependencies (forward edges) and their dependency depths are positive integers. For backward edges, the dependency depths are always zero. A schedule σ assigns a start time σ(v) to each task v ∈ V. It has a finish time τσ when all tasks complete their execution. Schedule σ is called time-valid if all the start time assignments satisfy all timing constraints, and tasks that share the same resource are serialized. If G represents an iteration of a loop, σ must also satisfy inter-iteration constraints, which must hold across iterations when multiple instances of σ are concatenated. A schedule σ has a power profile function of time Pσ(t), 0 ≤ t ≤ τσ, representing the instantaneous power consumption of all tasks during the execution of σ (illustrated by the power view of the Gantt chart in Fig. 1). The power profile is constrained by two parameters, Pmax and Pmin, such that Pmax ≥ Pσ(t) ≥ Pmin ≥ 0. The max power constraint Pmax specifies the maximum level of power that can be supplied by the power sources. The min power constraint Pmin specifies the level of power consumption to maintain a preferred level of activity. The max power constraint is a hard constraint. At any given time t, the value of the power profile function Pσ(t) must not exceed Pmax. Schedule σ is called power-valid (or simply, valid) if it is time-valid and its power profile does not exceed the max power constraint. However, we treat the min power constraint as a soft constraint that may be violated occasionally in a valid schedule. In cases where the min power constraint Pmin represents the free power level (e.g., solar), the energy drawn from the non-renewable energy sources is defined as the energy cost Ecσ(Pmin) of a schedule σ. It distinguishes between costly power and free power in such a way that any power consumption below the free power level does not contribute to the energy cost on non-renewable energy sources, and therefore should be utilized maximally.
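The definitions above can be made concrete with a short sketch (an illustrative data layout, not the authors' implementation) that samples the power profile Pσ(t) of a schedule, checks the max power constraint, and computes the energy cost above a free power level Pmin. The example numbers are the task powers of the Fig. 1/Fig. 2 example (3, 6, 2, 4 and 4 W, unit delays); the first sample reproduces the 11W peak over the 10W budget discussed later in the text.

def power_profile(schedule, horizon, step=1):
    """schedule: list of (start, delay, power) per task; sample P_sigma at the start of each
    unit slot (a coarse discretization that suits the unit-delay example)."""
    return [sum(p for (s, d, p) in schedule if s <= t < s + d)
            for t in range(0, horizon, step)]

def is_power_valid(profile, p_max):
    return all(p <= p_max for p in profile)

def energy_cost(profile, p_min, step=1):
    """Energy drawn from non-renewable sources: only consumption above Pmin is charged."""
    return sum(max(p - p_min, 0) * step for p in profile)

# Schedule of Fig. 1(a): a, b, c serialized; x and y co-activated in the first slot.
schedule = [(0, 1, 3), (1, 1, 6), (2, 1, 2), (0, 1, 4), (0, 1, 4)]
profile = power_profile(schedule, horizon=3)
print(profile, is_power_valid(profile, p_max=10), energy_cost(profile, p_min=0))
# [11, 6, 2] False 19  -> the max power budget is violated in slot [0, 1]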
4.2 Task Motion under Timing Constraints
Task motion obtains different versions of a scheduling problem by converting between intra-iteration and inter-iteration constraints. We first construct an iteration graph G'(V, E'): it has the same vertices as those of the constraint graph G(V, E), but the edges E' consist of only intra-iteration constraints. Formally, E' = {(u, v) : (u, v) ∈ E such that λ(u, v) = 0, δ'(u, v) = δ(u, v)}. The edges in E' do not carry dependency depths λ, since they are always zero. The expected loop duration τ is obtained from the original schedule computed from the initial iteration graph G'. Without loss of generality, we focus our discussion on task promotion, by which the execution of a task is shifted to the previous iteration of the loop and the instance of the same task in the next iteration is promoted into the new loop body. The inverse procedure for task demotion can be similarly defined.
A task v is promotable if either vertex v ∈ V does not have any incoming forward edges, or all of v's incoming forward edges in G have at least one dependency depth. If σ is a valid schedule of one iteration, we can promote a task v according to the expected loop duration, which is the finish time τσ of σ. Given τ = τσ, promoting a task v entails the following transformations on G and G':
1. For each of v's incoming forward edges (u, v) in graph G, decrease λ(u, v) by one. If (u, v) becomes an intra-iteration constraint (λ(u, v) = 0), edge (u, v) is added to graph G' if it is not already present in G'.
2. For each of v's outgoing forward edges (v, u) in graph G, increase λ(v, u) by one.
3. For each of v's incoming backward edges (u, v) in graph G', increase δ'(u, v) by τ, that is, δ'(u, v) = δ'(u, v) + τ.
4. For each of v's outgoing edges (v, u) in graph G', decrease δ'(v, u) by τ, that is, δ'(v, u) = δ'(v, u) − τ.
Steps 1 and 2 push one dependency depth from v's incoming forward edges to its outgoing forward edges. Step 1 also adds any new intra-iteration edges to graph G', which tracks only intra-iteration constraints. Step 3 transforms the incoming backward edges of v for the promotion (its incoming forward edges are managed in step 1). Step 4 transforms the outgoing edges of v, both forward and backward. Steps 3 and 4 can be validated as follows. When a task v is promoted in graph G', vertex v represents the execution of task v in the next iteration. Therefore, the new start time assignment is σ'(v) = σ(v) + τ. In step 3, before promoting v, edge (u, v) indicates σ(v) − σ(u) ≥ δ'(u, v). Thus, after the promotion, σ'(v) − σ(u) = (σ(v) + τ) − σ(u) ≥ δ'(u, v) + τ. Therefore, the new constraint in G' is δ'(u, v) + τ. Similarly, in step 4, edge (v, u) means σ(u) − σ(v) ≥ δ'(v, u) before promotion. Thus, σ(u) − σ'(v) = σ(u) − (σ(v) + τ) ≥ δ'(v, u) − τ. The constraint becomes δ'(v, u) − τ after the promotion. When a task v is being promoted, its corresponding min timing constraints (zero or positive values) become max timing constraints (negative values) by step 4; and vice versa, its corresponding max timing constraints transform into new min timing constraints by step 3. Promotion effectively reduces the values of min constraints and makes the problem easier to solve by exposing more scheduling opportunities. We say that the constraint is relaxed, and this is a key technique for increasing the system's dynamic range. Fig. 2 illustrates task promotion on the example previously shown in Fig. 1. Fig. 2(a) shows the initial constraint graph G consisting of five vertices representing the five tasks a, b, c, x, y. They all have the same execution delay of one time unit, and their power consumption is p(a) = 3W, p(b) = 6W, p(c) = 2W, p(x) = p(y) = 4W. Therefore the most power-consuming task is b and the least power-consuming one is c. Tasks a, x, y have dedicated execution resources A, X, Y (r(a) = A, r(x) = X, r(y) = Y), respectively, while tasks b and c share the execution resource B (r(b) = r(c) = B). For brevity, these task attributes are not shown in the graph. The edges in the constraint graph G represent timing constraints. They are denoted as (λ, δ), corresponding to the dependency depths and the values of the timing constraints.
For example, the forward edge (a, b) represents an intra-iteration constraint with dependency depth λ(a, b) = 0, and it is a min constraint with δ(a, b) = 1, indicating σ(b) − σ(a) ≥ 1. Since task a's delay d(a) = 1, this constraint can be paraphrased as "task b cannot start until task a completes," that is, tasks a and b must be serialized. Similarly, tasks b and c are also serialized by edge (b, c). Edge (x, a) with δ(x, a) = 0 indicates that task a cannot start before task x starts, because σ(a) − σ(x) ≥ 0. Edge (x, c) with δ(x, c) = 2 specifies a min separation between task x and task c, that is, σ(c) − σ(x) ≥ 2. Therefore, task c must wait until task x has started for two time units. Edge (c, a) with δ(c, a) = −2 is a backward edge representing a max constraint: σ(c) − σ(a) ≤ 2. It defines the deadline to start task c relative to the start time of task a. This deadline is equal to the start time of task a plus two time units. In addition to these intra-iteration timing constraints, there is an inter-iteration timing constraint (b, x), indicating that the start time of task b precedes task x in the next iteration (λ(b, x) = 1) by one time unit (δ(b, x) = 1). Inter-iteration constraints are marked as dashed arrows. There is a co-activation dependency between tasks x and y, denoted as a pair of special timing constraints. As mentioned previously, we assume each iteration must finish within three time units.

The initial iteration graph G' has the same set of vertices representing tasks a, b, c, x, y. The edges in G' represent only intra-iteration constraints, so only the constraint value δ' is shown on each edge; the dependency depth λ is not shown since it is always zero in graph G'. For example, the inter-iteration edge (b, x) does not appear in the initial G'. The co-activation dependency is still denoted as a special constraint in G'. The initial schedule σ computed from the iteration graph G' is also shown in Fig. 2(a). It is the same as Fig. 1(a). Although all timing constraints are satisfied, the schedule σ is not valid, since during time [0, 1] the power consumption of the whole system is 11W, exceeding the max power constraint Pmax = 10W. No valid solution is possible even if we try voltage scaling for task x.

In Fig. 2(b), task x and its co-activated task y are promoted to produce a valid schedule (the same as Fig. 1(c), except that the prolog is not shown). Tasks x and y are promoted together due to co-activation, but they are scheduled as separate tasks because they may not start and finish at the same time. The constraint graph G only updates the dependency depths λ of the timing constraints corresponding to x. Since the original schedule finishes at time 3, the timing constraints δ' in G' are transformed using τ = 3. By step 1, edge (b, x) ∈ G becomes an intra-iteration edge (solid arrow) and is inserted into G'. By step 2, edges (x, a) and (x, c) ∈ G become inter-iteration edges (dashed arrows). By step 4, edges (x, a) and (x, c) ∈ G' reduce their constraint values by τ = 3. Accordingly, task x's outgoing min constraints are transformed into more relaxed max constraints (δ'(x, a) = −3, δ'(x, c) = −1, compared to 0 and 2 in Fig. 2(a)). As a result, task x can be rescheduled in time slot [2, 3] without violating any timing constraints, and the max power constraint is also satisfied. Without task motion, this valid solution cannot be achieved.
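The four promotion steps can be pictured with a small Python sketch. The dictionary-based graph encoding, the convention that forward edges carry non-negative δ and backward edges negative δ (as in Fig. 2), and the helper names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of task promotion (Section 4.2), under assumed data structures:
#   G  : dict mapping edge (u, v) -> (lam, delta)   -- full constraint graph
#   Gp : dict mapping edge (u, v) -> delta_p        -- iteration graph G' (intra-iteration only)
# Forward edges are assumed to carry delta >= 0, backward edges delta < 0.

def promotable(v, G):
    """v is promotable if every incoming forward edge carries at least one dependency depth."""
    return all(lam >= 1
               for (u, w), (lam, delta) in G.items()
               if w == v and delta >= 0)

def promote(v, G, Gp, tau):
    """Shift task v's execution to the previous iteration; tau is the expected loop duration."""
    assert promotable(v, G)
    for (u, w), (lam, delta) in list(G.items()):
        if w == v and delta >= 0:             # step 1: v's incoming forward edges
            G[(u, w)] = (lam - 1, delta)
            if lam - 1 == 0:
                Gp.setdefault((u, w), delta)  # new intra-iteration constraint enters G'
        elif u == v and delta >= 0:           # step 2: v's outgoing forward edges
            G[(u, w)] = (lam + 1, delta)
    for (u, w) in list(Gp):
        if w == v and Gp[(u, w)] < 0:         # step 3: incoming backward edges in G'
            Gp[(u, w)] += tau
        elif u == v:                          # step 4: all outgoing edges of v in G'
            Gp[(u, w)] -= tau
```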
4.3 Utilization Constraints
Task motion is based on the classification of timing constraints into intra-iteration and inter-iteration ones. However, in some cases it is difficult, or unnecessary, to decide whether a timing constraint should be intra-iteration or inter-iteration. Such cases are present in the Mars rover: for example, for the timing constraints between a heater and the motor it warms periodically, it is not clear whether to model them as intra-iteration or inter-iteration. In fact, whether the heaters and the motors stay in the same iteration does not matter. In the computation domain, these correspond to background, preemptible tasks that need not synchronize with the main control loop but must be given a share of the CPU time to avoid starvation. We call such constraints utilization-based timing constraints. They can be expressed as either intra-iteration or inter-iteration constraints. A utilization constraint between two tasks u and v is also represented as an edge (u, v) ∈ E in the constraint graph G, with its dependency depth denoted as λ(u, v) = ∗, indicating that it can be either zero or non-zero.
Fig. 2. Task motion under timing constraints. Each part shows the constraint graph G, the iteration graph G', and the resulting schedule σ with its power profile (annotated with Pmax: 10 and Energy: 19): (a) before task motion, no valid solution can be found; (b) after promoting task x and co-activating task y, a valid solution is found; (c) after promoting task a with utilization constraints, a new solution with better performance is found.
Now we examine task motion under utilization constraints. It needs only minor modifications to the procedure we defined in Section 4.2.

(a) The initial iteration graph G' includes both intra-iteration constraints and utilization constraints in its edges. (Treat utilization constraints as intra-iteration.)
(b) A task v is promotable if either vertex v ∈ V does not have any incoming forward edges, or the dependency depths λ of all of v's incoming forward edges are positive values or ∗. (Treat utilization constraints as inter-iteration.)
(c) The modified procedure for promoting a task v is as follows.
1. For each of v's incoming forward edges (u, v) in graph G, decrease λ(u, v) by one if λ(u, v) ≠ ∗. If λ(u, v) becomes 0, add edge (u, v) to graph G' if it is not already present in G'. (No update for utilization constraints in step 1.)
2. For each of v's outgoing forward edges (v, u) in graph G, increase λ(v, u) by one if λ(v, u) ≠ ∗. (No update for utilization constraints in step 2.)
3. For each of v's incoming backward edges (u, v) in graph G', set δ'(u, v) = δ'(u, v) + τ if λ(u, v) ≠ ∗; otherwise δ'(u, v) remains unchanged. (No update for utilization constraints in step 3.)
4. For each of v's outgoing edges (v, u) in graph G', set δ'(v, u) = δ'(v, u) − τ. (Same as the previous step 4.)

Since utilization constraints can be either intra-iteration or inter-iteration, the modified procedure simply gives them special treatment; only steps 3 and 4 need more explanation. In step 3, if edge (u, v) represents a utilization constraint, δ'(u, v) could be transformed into either of two forms, δ'(u, v) or δ'(u, v) + τ, since the edge can be either intra-iteration or inter-iteration. That is, the transformation is valid if either σ'(v) − σ(u) ≥ δ'(u, v) or σ'(v) − σ(u) ≥ δ'(u, v) + τ holds. The solution to these two inequalities combined with an OR relation is σ'(v) − σ(u) ≥ δ'(u, v), which means the constraint with the smaller value applies. Therefore, the value of a utilization constraint is not increased by τ in step 3. Likewise, in step 4, the value of the new constraint is the smaller one between δ'(v, u) − τ and δ'(v, u), which is δ'(v, u) − τ.

In summary, if the promoted task v has any incoming utilization-constraint edges, these edges remain the same in the iteration graph G' during the promotion. For v's outgoing utilization-constraint edges, the values of the constraints in G' are decreased by the loop duration τ. As a result, utilization constraints are always relaxed to produce more scheduling opportunities.

For example, if resource A is a heater, a motor, or a CPU running a preemptible background task, then we can model task a with utilization constraints (x, a), (a, b) and (c, a). The initial graphs G, G' and schedule σ look very similar to Fig. 2(a), except that the utilization constraints (x, a), (a, b) and (c, a) in G are denoted by a new type of arrow and their dependency depths are λ = ∗ (as seen in Fig. 2(c)). After promoting tasks x and y, graphs G, G' and schedule σ also look similar to Fig. 2(b), except that the utilization constraints (x, a), (a, b) and (c, a) are not changed by task motion.
Fig. 2(c) shows the resulting graphs G, G' and schedule σ after promoting task a with utilization constraints, which are marked as a different type of dashed arrow in graph G. By the modified step 3, the value of the constraint δ'(c, a) in G' remains −2; otherwise it would become 1 (−2 + τ) if it were not a utilization constraint. The same rule also applies to the utilization constraint (x, a), such that δ'(x, a) = −3 instead of 0. Since the serialization chain formed by min constraints is broken, tasks a, b, c (after promoting a, the chain becomes b, c, a in Fig. 2(c)) no longer need to be serialized. Now task a, a small power consumer, can overlap with b, such that an unexpected solution with a shorter execution time (τ_σ = 2) is discovered, and it also satisfies the max power constraint. This optimal solution could not have been obtained without using utilization constraints, which enable more aggressive, provably correct relaxation of the timing constraints.
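Folding these rules into the promotion sketch from Section 4.2 gives the following variant; it is again only an illustrative sketch under the same assumed data structures, with λ = ∗ encoded as the string "*".

```python
# Sketch of the modified rules of Section 4.3, extending promote() from the earlier sketch.
# A utilization constraint is stored in G with lam == "*" (either intra- or inter-iteration).

def promotable_util(v, G):
    # Treat utilization constraints as inter-iteration for the promotability test.
    return all(lam == "*" or lam >= 1
               for (u, w), (lam, delta) in G.items()
               if w == v and delta >= 0)

def promote_util(v, G, Gp, tau):
    assert promotable_util(v, G)
    for (u, w), (lam, delta) in list(G.items()):
        if lam == "*":
            continue                           # steps 1-2: utilization constraints untouched
        if w == v and delta >= 0:              # step 1
            G[(u, w)] = (lam - 1, delta)
            if lam - 1 == 0:
                Gp.setdefault((u, w), delta)
        elif u == v and delta >= 0:            # step 2
            G[(u, w)] = (lam + 1, delta)
    for (u, w) in list(Gp):
        lam = G.get((u, w), (0, None))[0]
        if w == v and Gp[(u, w)] < 0 and lam != "*":
            Gp[(u, w)] += tau                  # step 3 (skipped for utilization constraints)
        elif u == v:
            Gp[(u, w)] -= tau                  # step 4 applies to all outgoing edges
```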
4.4 Scheduling Algorithms for Power-Aware Task Motion
We combine power-aware scheduling with system-level task motion as a way to discover a wider range of power/performance trade-offs. Our core scheduling algorithms consist of (a) transforming the problem into its new versions by task motion, and (b) power-aware scheduling of each version. As illustrated in Sections 4.2 and 4.3, the implementation of (a) is straightforward. Algorithm (b) is derived from [10] by applying the power-aware scheduler to the iteration graph G' after each task motion. For brevity, details of the scheduling algorithms are omitted in this paper but can be found in [11].
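The overall flow of (a) and (b) can be sketched as follows; schedule(), is_valid(), finish_time(), battery_energy() and pick_promotable_task() are placeholders standing in for the scheduler of [10] and its supporting heuristics, not actual interfaces from that work.

```python
# Sketch of the exploration loop of Section 4.4: generate problem versions by task
# motion and run a power-aware scheduler on each iteration graph G'.

def explore(G, Gp, tasks, tau0, max_moves=10):
    solutions = []
    tau = tau0
    for _ in range(max_moves):
        sigma = schedule(Gp, tasks)            # power-aware scheduling of one version
        if sigma is not None and is_valid(sigma):
            solutions.append((finish_time(sigma), battery_energy(sigma), sigma))
            tau = finish_time(sigma)           # expected loop duration for the next move
        v = pick_promotable_task(G, tasks)     # e.g., a high-power task on the critical path
        if v is None:
            break
        promote(v, G, Gp, tau)                 # transform into the next problem version
    return solutions                           # alternative power/performance trade-off points
```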
5 Experimental Results
We use the NASA/JPL Mars rover [1] to evaluate the effectiveness of our power-aware task motion technique. We construct a system-level representation that includes the computational, mechanical and thermal subsystems. The timing constraints on the heaters and preemptible background computation tasks can be modeled with utilization constraints. We also consider dual energy sources: a solar panel and a non-rechargeable battery. We consider three scenarios with different solar power output levels: 14.9W (noon time), 12W, and 9W (dusk). The min power constraints are set to the respective solar output levels, while the max power constraints are set to the solar power plus 10W, which is the maximum battery power rating. Table 1 compares the results of four techniques, using the energy cost to the non-rechargeable battery and the execution time of each iteration as metrics:

(0) the existing manual solution (fully serialized),
(I) power-aware scheduling [10],
(II) power-aware task motion without utilization constraints,
(III) power-aware task motion with utilization constraints.
– For scenario 1 (14.9W solar power), all schedulers except JPL’s (0) compute fast schedules (i.e., short τ ), but these three solutions vary in energy cost.
Solutions by schedulers I and II are eliminated, because they must draw more energy from the battery in addition to the solar panel in order to achieve the same performance as solution III. Scheduler III could not have achieved this solution without exploiting utilization constraints.
– For scenario 2 (12W solar power), schedulers I and II produce the same solution, which is slower than in scenario 1 due to the limited power budget. Scheduler III produces a fast schedule at a higher energy cost than I and II, but it is still within the max power constraint. No one solution is strictly better than the others, and they represent different trade-off points.
– In scenario 3 (9W solar power), the low power budget rules out all but the fully serialized solution, and all schedulers produce the same solution as JPL's manual schedule (0).

The results show that our technique not only yields a larger dynamic range by being able to operate at different power levels, but, more importantly, it uses the available energy more effectively for actual useful work. This is not easy due to complex timing constraints, but the improvement can translate into significant savings in application-specific metrics, as shown in Table 2. Suppose the rover is traveling to a target location at a distance of 48 steps. Since the rover moves two steps during each iteration, it needs 24 iterations to reach the destination. The mission starts with maximum solar power at 14.9W
Table 1. Comparison in three scenarios (√ = keep, × = drop)

Scenario   (0) JPL's low-power (hand-crafted)   (I) Power-aware            (II) Power-aware + task motion   (III) Power-aware + task motion + utilization constraints
1          τ = 75s, Ec = 0J   (√)               τ = 50s, Ec = 79.5J (×)    τ = 50s, Ec = 16.5J (×)          τ = 50s, Ec = 4.5J (√)
2          τ = 75s, Ec = 55J  (√)               τ = 60s, Ec = 147J  (√)    same as (I)                      τ = 50s, Ec = 208J (√)
3          τ = 75s, Ec = 388J (√)               same as (0)                same as (0)                      same as (0)
Table 2. Comparison in a comprehensive scenario (each entry: distance in steps / time in s / battery energy cost in J)

Time frame (s)   Scenario   JPL (0-0-0)           Task motion A (III-I-0)   Task motion B (III-III-0)
0 - 599          1          16 / 600 / 0          24 / 600 / 129            24 / 600 / 129
600 - 1199       2          16 / 600 / 440        20 / 600 / 1470           23 / 600 / 2482
1200 -           3          16 / 600 / 3114       4 / 150 / 776             1 / 10 / 85
Total                       48 / 1800 / 3554      48 / 1350 / 2375          48 / 1210 / 2696
Improvement                 -                     33% / 33%                 49% / 24%

(Improvement is reported as speedup / energy saving relative to the JPL schedule.)
(Scenario 1). Then, it drops to 12W (Scenario 2) after 10 minutes, and falls to 9W (Scenario 3) 10 minutes later. If the existing low-power, serial schedule is applied, the rover spends 10 minutes in each of the three scenarios at a fixed, slow moving speed. This results in a long execution time and a high energy cost in Scenario 3. On the other hand, our technique can produce two schemes. Both schemes use more free solar energy to speed up in scenarios 1 and 2 (while satisfying timing and power constraints) so that they can finish the mission earlier and avoid the costly scenario 3. Schemes A and B differ only in scenario 2, where A uses solution I while B uses the faster but more expensive solution III. As a result, scheme A achieves a 33% speedup and a 33% energy saving, and scheme B further speeds up by 49% with a 24% energy reduction. These two alternative designs with different energy/performance trade-offs are discovered by our power-aware task motion technique. They could not have been found by the existing techniques.
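As a quick arithmetic check, the quoted improvements follow directly from the totals in Table 2 (nothing beyond the table's numbers is assumed here):

```python
# Recomputing the improvements quoted above from the totals in Table 2.
totals = {"JPL": (1800, 3554), "A": (1350, 2375), "B": (1210, 2696)}  # (time s, battery energy J)
t0, e0 = totals["JPL"]
for scheme in ("A", "B"):
    t, e = totals[scheme]
    speedup = t0 / t - 1.0           # A: ~0.33, B: ~0.49
    energy_saving = 1.0 - e / e0     # A: ~0.33, B: ~0.24
    print(f"Scheme {scheme}: {speedup:.0%} speedup, {energy_saving:.0%} energy saving")
```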
6 Conclusion
We have presented a power-aware task motion technique for enhancing the dynamic range of embedded systems powered by heterogeneous energy sources that include renewable, unsteady ones like solar panels. Such systems must be able not only to operate as low-power devices when the supply power is low, but, equally importantly, to use the free abundant energy for useful work while respecting power and timing constraints. We used a DVS Anomaly example to show the pitfalls of applying existing power management techniques without considering system-level dependencies like co-activation, which resulted in not only higher energy consumption but also violation of max power constraints. We then presented our constraint formulation and task motion technique, which safely transform the tasks while respecting these system-level dependencies. We further enhanced task motion by exploiting utilization-based constraints that exposed additional scheduling opportunities for preemptible background tasks or even non-computational power consumers such as heaters. These all serve to enhance the dynamic range while ensuring all transformations are safe and provably correct. Experimental results on the Mars rover demonstrated the effectiveness of our approach for the solar- and battery-powered system. We expect the benefits to transfer to a whole emerging class of new embedded systems that must draw energy from many renewable but unsteady sources.
Acknowledgement This research was sponsored by DARPA under contract F33615-00-1-1719. It represents a collaboration between the University of California at Irvine and the NASA/Cal Tech Jet Propulsion Laboratory. Special thanks to Dr. N. Aranki, Dr. B. Toomarian, Dr. M. Mojarradi and Dr. J. U. Patel at JPL and Kerry Hill at AFRL for their discussion and assistance.
References [1] NASA/JPL’s Mars Pathfinder home page. http://mars3.jpl.nasa.gov/MPF/index0.html. 84, 95 [2] L.-F. Chao, A. LaPough, and E. H.-M. Sha. Rotation scheduling: A loop pipelining algorithm. IEEE Transactions on Computer Aided Design, 16(3):229–239, March 1997. 86 [3] P. Chou and G. Borriello. Software scheduling in the co-synthesis of reactive real-time systems. In Proc. Design Automation Conference, pages 1–4, June 1994. 89 [4] E.-Y. Chung, L. Benini, and G. De Micheli. Dynamic power management using adaptive learning tree. In Proc. International Conference on Computer-Aided Design, pages 274–279, 1999. 86 [5] I. Hong, D. Kirovski, G. Qu, and M. Potkonjak. Power optimization of variablevoltage core-based systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(12):1702–1714, 1999. 86 [6] M. Jacome, G. de Veciana, and C. Akturan. Resource constrained dataflow retiming heuristics for VLIW ASIPs. In Proc. International Symposium on Hardware/Software Codesign, pages 12–16, May 1999. 86 [7] K. Lalgudi and M. Papaefthymiou. Fixed-phase retiming for low power design. In Proc. International Symposium on Low Power Electronics and Design, pages 259–264, August 1996. 86 [8] C. Leiserson and J. Saxe. Retiming synchronous circuitry. Algorithmica, 6(1):5– 35, 1990. 86 [9] J. Liu, P. H. Chou, N. Bagherzadeh, and F. Kurdahi. A constraint-based application model and scheduling techniques for power-aware systems. In Proc. International Symposium on Hardware/Software Codesign, pages 153–158, April 2001. 89 [10] J. Liu, P. H. Chou, N. Bagherzadeh, and F. Kurdahi. Power-aware scheduling under timing constraints for mission-critical embedded systems. In Proc. Design Automation Conference, pages 840–845, June 2001. 89, 95 [11] J. Liu, P. H. Chou, N. Bagherzadeh, and F. Kurdahi. Power-aware task motion: Dynamic range enhancement for power-aware embedded systems. Technical Report IMPACCT-01-09-01, University of California, Irvine, September 2001. 95 [12] T. Okuma, T. Ishihara, and H. Yasuura. Real-time task scheduling for a variable voltage processor. In Proc. International Symposium on System Synthesis, pages 24–29, November 1999. 86 [13] F. Sanchez and J. Cortadella. Time-constrained loop pipelining. In Proc. International Conference on Computer-Aided Design, pages 592–596, November 1995. 86 [14] T. Simunic, L. Benini, and G. De Micheli. Event-driven power management of portable systems. In Proc. International Symposium on System Synthesis, pages 18–23, 1999. 86 [15] M. Srivastava, A. Chandrakasan, and R. Brodersen. Predictive system shutdown and other architectural techniques for energy efficient programmable computation. IEEE Transactions on VLSI Systems, 4(1):42–55, March 1996. 86 [16] F. Yao, A. Demers, and S. Shenker. A scheduling model for reduced CPU energy. In IEEE Annual Foundations of Computer Science, pages 374–382, 1995. 86
[17] T. Z. Yu, F. Chen, and E. H.-M. Sha. Loop scheduling algorithms for power reduction. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3073–6, May 1998. 86
A Low-Power Content-Adaptive Texture Mapping Architecture for Real-Time 3D Graphics

Jeongseon Euh, Jeevan Chittamuru, and Wayne Burleson

Department of Electrical and Computer Engineering, University of Massachusetts Amherst
{jeuh,jchittam,burleson}@ecs.umass.edu
Abstract. The effect of texture mapping in enhancing the realism of computer-generated images has made support for real-time texture mapping a critical part of 3D graphics pipelines. However, texture mapping is one of the major power consumers in 3D graphics pipelines due to the intensive interpolation computation and high memory bandwidth. This power consumption motivates an increased emphasis on low-power design as 3D graphics systems migrate into portable and future user interface devices. In this paper, we present a dynamically adaptive hardware texture mapping system that can perform adaptive texture mapping based on a model of human visual perception, which is less sensitive to the details of moving objects. This flexibility may result in significant power savings without noticeable quality degradation. Our work shows that power savings of up to 33.9% come from the reduced off-chip memory accesses resulting from an adaptive texel interpolation algorithm. Additional power savings of up to 73.8% come from using variable clock and supply voltage scaling in the adaptive computing unit.
1 Introduction
3D computer graphics has become an increasingly important aspect of many applications including games, virtual reality (VRML), e-commerce, visualization and advanced CAD. Due to intensive floating point and fixed point computation and high memory bandwidth, the 3D graphics pipeline is a major power consumer in computing devices. This problem becomes more acute with the migration of graphics systems into portable and future user interface devices which support high quality displays, such as notebook computers, wearable computers, and 3D graphics heads-up displays. Among the stages in the 3D graphics pipeline, texture mapping has been used widely in real-time 3D graphics systems. It is used to enhance the realism of computer-generated images by exploiting regularity to reduce geometric complexity. However, texture mapping is one of the pipeline stages with the highest workload in terms of computation and memory accesses [11, 16]. In [11], it is mentioned that texture mapping is the most computation- and memory-intensive stage of the rasterization engine, with 40%-60% of
its total computations and 100% of its memory accesses. Hence texture mapping contributes significantly to the total power consumption of the rasterization engine. Texture mapping is one of several parts of the 3D graphics rendering pipeline that can exploit human visual perception to reduce power consumption. As described in [14, 5], the human visual perception (HVP) model shows that visual sensitivity for an object varies with the spatial frequency in cycles per degree (cpd) and the velocity in degrees per second (dps). These characteristics of the HVP have been used as the basis for video and computer graphics applications to reduce the number of computations and enhance the efficiency of systems. In 3D computer graphics, there has been research applying HVP, such as Level of Detail generation algorithms [17, 6, 10, 12], perceptually-based rendering [21], and perceptually-based texture caching [8]. The Level of Detail technique, based on the distance between the object and the view point, has been explored to reduce the rendering computation and the amount of polygonal information to be stored and transmitted. In [21], both HVP and visual attention concepts are adapted to accelerate rendering speed by reducing computations in the global illumination algorithm. The HVP model is used to construct a spatio-temporal error tolerance map and the visual attention model is used to track the behavior of the eye. In our previous work [9], an object's screen velocity and depth from a viewer are used as the criteria for the adaptive shading algorithm. In this paper, we present a content adaptive texture mapping architecture based on the HVP model and the dynamic voltage scaling (DVS) [4] technique to reduce power consumption. Two interpolation pipelines, bilinear and trilinear, are implemented for the adaptive control. Bilinear interpolation requires about half of the computations and memory accesses required by trilinear interpolation. However, bilinear interpolation introduces more aliasing artifacts than trilinear interpolation. The adaptive control is based on the criteria described in our previous work [9]. Two parameters, namely cycles per degree (cpd) and degrees per second (dps), are used as control parameters. cpd and dps represent the complexity of a texture image and the velocity of an object, respectively. While the dps of the object varies dynamically, the cpd of the texture image mapped onto the object is predefined. These two control parameters are supplied to the adaptive texture mapping architecture from a previous stage of the graphics pipeline, such as the geometry engine. The adaptive controller selects one of the two implemented interpolation algorithms depending on the supplied cpd and dps parameters of the object to be textured. Within trilinear interpolation, the adaptive controller shifts objects with less sensitivity to bilinear interpolation. In general, trilinear interpolation computations are performed using two bilinear interpolation blocks and bilinear interpolation computations use one of these blocks. This work uses the DVS technique to reduce power consumption in the bilinear mode. The two bilinear interpolation blocks are used together under reduced clock and supply voltage to achieve the original throughput, obtaining a quadratic reduction in power consumption. The applied DVS scheme uses only two pairs of Vdd and clock, such
as 2.5V Vdd with a 100 MHz clock and 1.6V Vdd with a 50 MHz clock. The results of a Quake II game simulation on the presented architecture show power savings of up to 73.8% due to computations and up to 33.9% due to external memory accesses.
2 Texture Mapping
To realize realistic images or a very complex image, the number of required polygons (or triangles) is increased and thus the number of calculations is increased. If, however, the computing power of the given system is not enough to perform the required calculations in time, the number of polygons should be reduced so that the rendering system can keep up with the number of input polygons. Consequently, quality degradation in terms of scene complexity is introduced, and in some cases this degradation is not tolerable. Hence, to produce more realistic images with less geometric data, texture mapping [3, 2] has been used commonly in 3D computer graphics. As its name suggests, texture mapping refers to the process of applying a texture to an object in the 3D world, e.g. brick texture mapping to a wall, wood texture mapping to a floor, and terrain mapping for flight simulation. Because there is no one-to-one mapping between texels (texture pixels) and pixels, an interpolation calculation is necessary for high quality mapping. Higher quality requires computation-intensive interpolation. Fig. 1 (a) shows the relationship between a pixel and a texel. The most commonly used texture interpolation algorithms are point sampling, bilinear interpolation, trilinear interpolation, and anisotropic interpolation with mip-mapped [20] texture maps. The term mip-mapping refers to the technique of using multiple resolutions of pre-interpolated texture maps in order to improve the quality of texture mapping. The number of texels used for point sampling, bilinear, trilinear, and anisotropic interpolation is 1, 4, 8, and non-fixed, respectively. Fig. 1 (b) shows the mip-mapped brick wall texture.
Fig. 1. (a) Pixel on 3D vs. texel on 2D (b) Mip-mapped brick wall texture map
Table 1. Calculations required for each interpolation

Interpolation    Required no. of texels   Required memory bandwidth
Point sampling   1                        37 MB/sec
Bilinear         4                        147 MB/sec
Trilinear        8                        295 MB/sec
Anisotropic      32                       1.2 GB/sec
Level 0 is the original texture map and level 1 is half the size in both dimensions of level 0. The last level has only one texel. In this paper, we focus on two interpolation algorithms for 3D graphics systems: bilinear and trilinear interpolation. For mip-mapped textures, bilinear interpolation (Eq. 1) uses four texels retrieved from the one mip-map level closest to the pixel position. Trilinear interpolation (Eq. 2) uses eight texels retrieved from two adjacent levels, four texels from each level.

I(x,y) = Σ_{i=0}^{3} T_i × W_i    (1)

I(x,y) = W_l T_l + W_{l+1} T_{l+1}    (2)

where I(x,y) is the interpolated texture color at pixel position (x, y), T_i is the texel color value at the neighboring position, W_i is the weight of each texel, W_l is the weight of level l, and T_l is the bilinearly interpolated texel color value of level l. To estimate the number of texel memory accesses, assume a 640 × 480 display resolution, one layer of full-screen texturing, 4 bytes of texel data for RGBA, and a 30 Hz frame rate. The resulting texture memory bandwidth requirements are shown in Table 1 for each interpolation algorithm. When textures are applied to more than one layer, the memory bandwidth is a multiple of the numbers shown in the memory bandwidth column of Table 1.
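A compact sketch of Eq. 1, Eq. 2 and the Table 1 bandwidth figures is given below. The texel accessor, the coordinate handling at the coarser mip-map level, and the single-channel simplification are assumptions made for illustration, not the paper's hardware design.

```python
# Sketch of bilinear (Eq. 1) and trilinear (Eq. 2) interpolation.
# texel(level, u, v) is an assumed accessor returning one color-channel value.

def bilinear(texel, level, u, v):
    u0, v0 = int(u), int(v)
    fu, fv = u - u0, v - v0
    w = [(1 - fu) * (1 - fv), fu * (1 - fv), (1 - fu) * fv, fu * fv]        # W_i
    t = [texel(level, u0, v0), texel(level, u0 + 1, v0),
         texel(level, u0, v0 + 1), texel(level, u0 + 1, v0 + 1)]            # T_i
    return sum(wi * ti for wi, ti in zip(w, t))                             # Eq. 1

def trilinear(texel, level, u, v):
    l = int(level)
    wl = 1.0 - (level - l)                                                  # W_l, with W_{l+1} = 1 - W_l
    # Eq. 2: blend the bilinear results of two adjacent mip-map levels
    # (coordinates at level l+1 are simply halved here as a simplification).
    return wl * bilinear(texel, l, u, v) + (1 - wl) * bilinear(texel, l + 1, u / 2, v / 2)

# Memory bandwidth for one full-screen texture layer (matches Table 1):
for name, texels in [("Point", 1), ("Bilinear", 4), ("Trilinear", 8), ("Anisotropic", 32)]:
    print(name, 640 * 480 * 30 * 4 * texels / 1e6, "MB/s")   # ~37, 147, 295, ~1180
```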
3 Previous Work
Texture memory latency and bandwidth have become a bottleneck for real-time texture mapping in low-end systems. To overcome this bottleneck, Beers et al. [1] adapted image compression techniques to texture mapping. The technique they introduced was to render directly from compressed textures. They could obtain compression ratios of up to 35:1 by using vector quantization. Since vector quantization is a lossy compression technique, some visible artifacts can be seen. Kugler [15] used compression techniques to reduce the required bandwidth, and also discussed a local space-variant filter in the texturing unit to minimize artifacts.
A high degree of locality in texture memory access patterns has been observed, and the working set sizes were found to be small across a wide variety of cases studied. Hakura [11] showed that a small on-chip texture cache of about 16 KB was enough to reduce the texture memory bandwidth requirement by 3-15 times. Igehy et al. [13] used a prefetching cache scheme to overcome the texture memory latency. Cox et al. [7] introduced the use of an L2 cache in addition to an L1 cache, mainly to exploit inter-frame texture locality. Their results showed that a 2-8 MB L2 cache could significantly exploit the inter-frame texture locality, while stemming the growth of the local texture memory requirement. Dumont et al. [8] adapted the HVP model for texture memory resource allocation. Their algorithm allocates more texture memory to the textures that are determined by a decision algorithm to have more visual importance. The textures that have less visual importance are forced to use low-resolution textures, so that they occupy a small portion of the texture memory. Their primary goal of providing high frame rates is different from our objective. Rosman et al. [18] used trilinear interpolation only for pixels that fall in a predefined range of the level of depth, and used bilinear interpolation otherwise. For example, if a pixel position is in the middle of two adjacent texture levels, trilinear interpolation is used, since the two interpolated texel values from each level contribute evenly to the pixel color value. If a pixel position is much closer to one level than the other, then bilinear interpolation is used instead. This adaptive scheme does not utilize both bilinear interpolation units when the texture mapping system is in bilinear mode.
4 Content Adaptive Texture Mapping

4.1 Proposed Approach
For adaptive control, object velocity and texture spatial frequency are used as the input parameters. According to the values of these two parameters, the adaptive controller decides which interpolation will be used for the incoming data. Fig. 2 (a) shows the contrast sensitivity function graph [14], which is the basis of the decision rule; an example is depicted in Fig. 2 (b). In Fig. 2 (b), if the velocity and the texture spatial frequency of an object fall in the shaded area, the mode controller sets the texture mapping system to bilinear interpolation. Decision making in the mode controller can be implemented with a memory map that encodes the decision-rule graph. Thus the power savings ratio can be controlled by changing the contents of the memory map. More power savings can be achieved by increasing the shaded area, at the cost of a greater chance of noticeable quality degradation. Fig. 3 (a) shows the quality difference between bilinear and trilinear interpolation. Although there are some noticeable aliasing artifacts in Fig. 3 (a), it is hard to distinguish the difference in Fig. 3 (b). As described in the introduction, DVS is applied to the interpolation block in the adaptive texture mapping system. In general, the trilinear interpolation function block consists of two bilinear interpolation function blocks in parallel.
Fig. 2. (a) Spatiotemporal Contrast Sensitivity Function (CSF) plot (sensitivity vs. spatial frequency in cpd and temporal frequency in dps) (b) Shaded area: sensitivity less than 50
Fig. 3. (a) Bilinear interpolated cube (left) and Trilinear interpolated cube (right) (b) Motion blurred by 10 pixels
If we assume that inputs from the previous stage of the 3D graphics pipeline are fed at a constant rate, one of the two bilinear interpolation function blocks would be idle when the texture mapping system operates in bilinear mode. Since the input data feeding rate from the previous stage is constant in both modes, half-clock operation in bilinear mode does not reduce the throughput but does reduce the power consumption.
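The decision rule described above can be summarized as follows. The csf_sensitivity lookup stands in for the memory map mentioned earlier, and the threshold of 50 corresponds to the shaded region of Fig. 2(b); the voltage/clock pairs come from the text, but the function itself is only an illustrative sketch.

```python
# Sketch of the adaptive mode controller of Section 4.1.

SENSITIVITY_THRESHOLD = 50
TRILINEAR_MODE = ("trilinear", 2.5, 100e6)   # (mode, Vdd in volts, clock in Hz)
BILINEAR_MODE  = ("bilinear", 1.6, 50e6)     # both bilinear blocks share the work at half clock

def select_mode(cpd, dps, csf_sensitivity):
    """cpd: spatial frequency of the texture; dps: screen velocity of the object."""
    if csf_sensitivity(cpd, dps) < SENSITIVITY_THRESHOLD:
        return BILINEAR_MODE     # viewer unlikely to notice the extra aliasing
    return TRILINEAR_MODE
```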
4.2 Proposed Architecture
The proposed architecture is depicted in Fig. 4 and consists of a mode controller, two bilinear interpolators, a memory controller, a 16 KB on-chip texture cache, and a trilinear MAC. The mode controller sets the texture mapping system in one of the two modes, bilinear or trilinear, according to the mode control input from the previous stage of the 3D graphics pipeline, such as the application and geometry steps. It is also responsible for distributing the proper data and clock for the DVS technique. In trilinear mode, the full clock and supply voltage are provided to all function blocks. In the other case, the half-rate clock and the proper supply voltage
according to the selected clock rate are provided to the two bilinear interpolators, two TAGs, and two Fragment FIFOs. The bilinear interpolator contains a weight generation unit, an address generation unit, and a MAC unit. The weight FIFO is also included in it to accommodate the texture data fetching latency. The main operation of the memory controller is to receive texel address requests from the interpolator and to supply the associated texel values back to it. To exploit the locality of texture data accesses, a 16 KB on-chip texture cache is implemented. Since a direct-mapped cache exhibits very low conflict miss rates when a separate cache is used for the even and odd mip-map levels, this work uses two 4-way interleaved direct-mapped caches. The proposed architecture is implemented and simulated at the RTL level using a TSMC 0.25µ library and Synopsys Power Compiler. Since the TSMC 0.25µ library supports only 2.5V as the supply voltage, we use the Tp vs. Vdd relationship graph in Fig. 5 (from [19]) to derive the power consumption for half-clock operation. From Fig. 5, we picked 1.6V Vdd for 50 MHz operation and 0.41 as the power reduction ratio, since TSMC 0.25µ CMOS technology operates at 100 MHz with 2.5V Vdd.
Fig. 4. Content adaptive texture mapping architecture
(The plotted curves follow Tp = K · Vdd / (Vdd − Vt)² for Vt = Vtn, Vtavg, and |Vtp|, with 1.6V marked for the 50 MHz clock and 2.5V for the 100 MHz clock.)
Fig. 5. Propagation delay (Tp ) vs. Vdd :TSMC 0.25µm, 2.5V technology library (from [19])
Thus, the architecture uses two modes: 2.5V Vdd at a 100 MHz clock for trilinear mode and 1.6V Vdd at a 50 MHz clock for bilinear mode. It is assumed that these voltages and clocks are supplied by a system-level source simultaneously, and a couple of multiplexers are used to select the appropriate voltage and clock values without any significant delay. For the memory access activity simulation, the Quake II PC game (Fig. 6) and Mesa, an OpenGL-like graphics library, are used to trace the traffic between the on-chip cache and the external texture memory. The size of the texture data block to be transferred from external memory is 64 bytes for each cache miss. Due to the huge amount of data involved, the numbers in the results section are obtained from 10 frames of simulation.
5 Results
Table 2 shows the power consumption of the adaptive texture mapping system for various interpolation conversion ratios. Interpolation conversion means that bilinear interpolation is performed for a pixel that would otherwise be interpolated in trilinear mode. Since we do not have the source code of Quake II, the various conversion ratios are manually selected. As shown in Table 2, bilinear operation with DVS and without the HVP model yields up to 18.4% power savings. Table 2 also shows that up to 73.8% power savings is possible by setting the adaptive texture mapping system completely in bilinear mode using DVS (no trilinear interpolation). From Table 3, it can be seen that the number of memory accesses decreases linearly with an increase in the conversion ratio. If the adaptive texture mapping system without a texture cache is set for low-power operation in the 'bilinear only' mode, a 33.9% improvement can be obtained with some image quality degradation.
Fig. 6. Scene from Quake II game
However, the use of the texture cache reduces the improvement to 13.1%. The decrease in improvement obtained with the cache is due to the small size of the texture maps used in Quake II. Since few texture maps are bigger than 64×64, the 16 KB on-chip cache in the proposed architecture already reduces external memory accesses significantly in the Quake II texture mapping function. For an application using very large textures, it is expected that the cache miss rate will be higher and the proportional savings will resemble the savings obtained without the cache. From the two tables in this section, we show that a significant power consumption reduction is possible by implementing the proposed content adaptive texture mapping system. We expect power savings at various ratios for applications other than Quake II.
Table 2. Power saving from adaptive interpolation: Quake II 10 frames

Converted ratio (%)   Bilinear computations   Trilinear computations   Power consumption (mW)   Power savings (%)
0                     3619749                 3808894                  454.68                   0
0, DVS only           3619749                 3808894                  371.19                   18.36
5                     3810193                 3618450                  358.59                   21.13
10                    4000638                 3428005                  345.98                   23.91
15                    4191083                 3237560                  333.37                   26.68
20                    4381527                 3047116                  320.77                   29.45
25                    4571972                 2856671                  308.16                   32.22
100                   7428643                 0                        119.06                   73.81
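Incidentally, the reported power numbers fall almost exactly on a straight line between the DVS-only and fully bilinear operating points, so the power column of Table 2 can be approximated by a simple linear interpolation. This is an observation about the reported data, not a model proposed in the paper.

```python
# Power consumption vs. conversion ratio r (in %), read off Table 2:
P_DVS_ONLY, P_BILINEAR_ONLY = 371.19, 119.06   # mW, from Table 2

def estimated_power(r):
    return P_DVS_ONLY - (r / 100.0) * (P_DVS_ONLY - P_BILINEAR_ONLY)

print(estimated_power(25))   # ~308.2 mW, matching the 25% row
```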
Table 3. Memory access pattern: Quake II 10 frames

Converted ratio (%)   Texel requests   Reduction (%)   Cache misses   Reduction (%)
0                     45037756         0.0             65950          0.0
5                     44251272         1.74            65792          0.24
10                    43488244         3.58            64918          1.56
15                    42745716         5.23            64363          2.41
20                    41952328         6.99            63947          3.04
25                    41205784         8.64            63098          4.32
50                    37412060         17.05           60595          8.12
100                   29803164         33.92           57301          13.11
6 Conclusion
This paper presents a content adaptive texture mapping system that improves the power efficiency of a 3D graphics system. Unlike previous work, the HVP model and the DVS technique are applied to a conventional texture mapping system to achieve power savings without noticeable quality degradation. Also, the proposed system can be used to control power in energy-constrained systems, such as notebook computers and future user interface devices. For future work, we plan to develop more precise adaptive control based on the HVP model and to apply reconfiguration techniques to the on-chip cache.
References [1] A. Beers, M. Agrawala, and N. Chaddha. Rendering from compressed textures. In SIGGRAPH ’96, 1996. 102 [2] J. Blinn. Simulation of wrinkled surfaces. In SIGGRAPH ’78, pages 286–292, 1978. 101 [3] J. Blinn and M. Newell. Texture and reflection in computer generated image. Communications of the ACM, pages 542–547, October 1976. 101 [4] T. Burd, T. Pering, A. Stratakos, and R. Brodersen. A dynamic voltage scaled microprocessor system. IEEE Journal of Solid-State Circuits, 35(11):68–75, November 2000. 100 [5] C. J. van den Branden Lambrecht. A working spatio-temporal model of human visual system for image restoration and quality assesment applications. In International Conference on Acoustics Speech and Signal Processing, May 1996. 100 [6] J. Cohen, M. Olano, and D. Manocha. Appearance-preserving simplication. In SIGGRAPH ’98, July 1998. 100 [7] M. Cox, N. Bhandari, and M. Shantz. Multi-level texture caching for 3D graphics hardware. In 25th International Symposium on Computer Architecture, June 1998. 103
[8] R. Dumont, F. Pellacini, and J. Ferwerda. A perceptually-based texture caching algorithm for hardware-based rendering. In Eurographics Workshop on Rendering, pages 62–71, June 2001. 100, 103 [9] J. Euh and Wayne Burleson. Exploiting Content Variation and Perception in Power-aware 3D Graphics Rendering. Springer, 2000. 100 [10] A. Gueziec, G. Taubin, F. Lazarus, and W. Horn. Simplical maps for progressive transmission of polygonal surfaces. In VRML ’98, February 1998. 100 [11] Z. Hakura and A. Gupta. The design and analysis of a cache architecture for texture mapping. In 24th International Symposium on Computer Architecture, 1997. 99, 103 [12] H. Hoppe. Progressive meshes. In SIGGRAPH ’96, July 1998. 100 [13] H. Igehy, M. Eldridge, and K. Proudfoot. Prefetching in texture cache architecture. In Eurographics/SIGGRAPH Workshop on Graphics Hardware, 1998. 103 [14] D. Kelly. Motion and vision. II. stabilized spatio-temporal threshold surface. Journal of the Optical Society of America, 79(10):1340–1349, October 1979. 100, 103 [15] A. Kugler. A high-performance texturing circuit. In 25th International Symposium on Computer Architecture, pages 302–311, 1997. 102 [16] T. Mitra and T. Chiueh. Dynamic 3d graphics workload charaterization and the architectural implications. In The 32nd Annual International Symposium on Microarchitecture, pages 62–71, November 1999. 99 [17] R. Pajarola and J. Rossignac. Compressed progressive meshes. IEEE Transactions on Visualization and Computer Graphics, 6(1):79–93, 2000. 100 [18] A. Rosman and M. Pimpalkhare. United States Patent No.: US 6,184,894 B1, February 2001. 103 [19] S. Venkatraman. A Power-aware Synthesizable Core for the Discrete Cosine Transform. Master Thesis, ECE Dept., UMASS Amherst, 2001. 105, 106 [20] L. Williams. Pyramidal parametrics. ACM Computer Graphics, pages 1–11, 1983. 101 [21] H. Yee, S. Pattanaik, and D. Greenberg. Spatiotemporal sensitivity and visual attention for efficient rendering of dynamic environments. ACM Transactions on Graphics, 20(1), January 2001. 100
Energy-Driven Statistical Sampling: Detecting Software Hotspots

Fay Chang¹, Keith I. Farkas², and Parthasarathy Ranganathan²

¹ Systems Research Center, Compaq Computer Corporation, 130 Lytton Ave., Palo Alto, California 94301
² Western Research Lab, Compaq Computer Corporation, 250 University Ave., Palo Alto, California 94301
{fay.chang,keith.farkas,partha.ranganathan}@compaq.com
Abstract. Energy is a critical resource in many computing systems, motivating the need for energy-efficient software design. This work proposes a new approach, energy-driven statistical sampling, to help software developers reason about the energy impact of software design decisions. We describe a prototype implementation of this approach built on the Itsy pocket computing platform. Our experimental results for 14 benchmark programs show that when multiple power states are exercised, energy-driven statistical sampling provides greater accuracy than existing time-driven statistical sampling approaches. Furthermore, if instruction-level energy attribution is desired, energy-driven statistical sampling may provide better resolution. On simple handheld systems, however, many applications may exercise only a single power state other than idle mode. In this case, time profiling may sufficiently approximate energy profiling for the purpose of assisting programmers, without requiring any hardware support.
1 Introduction
Energy is a critical resource for many computing systems, spurring the desire for energy-efficient software. Unfortunately, most current systems expose only coarse-grained information about energy consumption. Therefore, when programmers reason about the energy impact of their design decisions, they tend to rely on intuition. This intuition may be misleading if programmers have not internalized an accurate model of the energy cost of different operations or properly accounted for the relative frequency of the operations that take place. Researchers have proposed a variety of tools that, by exposing more information about a system’s energy consumption, could help programmers develop software that is more energy-efficient. Many of these tools estimate energy consumption by multiplying activation counts for various activities (e.g. executing a particular type of instruction) with estimates of the energy consumed to perform each activity [3, 4, 10]. One advantage of this approach is that it can leverage activation counts for background activities to more easily attribute energy
consumed asynchronously. However, the approach has the disadvantage of requiring system-specific knowledge about what activities should be counted, and system-specific estimates of the energy consumed to perform each such activity. In this paper, we describe and assess a new tool for helping programmers evaluate the energy impact of their design decisions. Our tool relies on statistical sampling, an alternative approach that does not require any system-specific information. Statistical sampling tools introduce a periodic source of interrupts. When each such interrupt is serviced by the system, the tool records a sample, which contains information about the system’s state, such as the program counter of the interrupted instruction. The samples gathered during a given time period can then be analyzed to produce estimates of the system’s behavior during that time period. In particular, our tool enables programmers to quickly determine the energy required to execute some software and the energy hot spots within that software. With this information, programmers may then employ techniques, such as those described by Simunic et al. [11], to improve energy efficiency. Previous statistical sampling tools use sampling interrupts that are separated by a fixed amount of time (possibly with some small variation to diminish the risk of synchronization) [1, 6, 7]. In one approach used by these tools [1, 6], the distribution of the samples recorded by the tool constitute a time profile (an estimate of the proportion of time spent executing each instruction in the program). Previous work has demonstrated that time profiles can greatly assist programmers in reducing execution times [1]. However, since duration and energy consumption are not always proportional, time profiles can mislead programmers that are trying to reduce energy consumption. An alternate time-driven sampling approach, PowerScope [7], adjusts for incongruities between duration and energy consumption by including in each sample an estimate of the system’s instantaneous power consumption. This feature allows PowerScope to weight each sample by its instantaneous power measurement in order to produce an energy profile (an estimate of the proportion of energy spent executing each instruction). In contrast, our tool uses sampling interrupts that are separated by a fixed amount of energy consumption (which we refer to as the energy quanta). Thus, the distribution of samples recorded by our energy-driven sampling tool directly constitutes an energy profile. Conceptually, with a sufficiently small energy quanta for our tool and a sufficiently high sampling rate for PowerScope, both of these tools should produce the same energy profile. However, because the sampling rate of our tool varies with the system’s power consumption, our tool can deliver more accurate energy profiles for a given amount of overhead. Moreover, PowerScope’s weighting scheme for generating energy profiles assumes that instantaneous power consumption approximates the average power consumption since the last sample. Measurements of our test system reveal large and rapid variations in power consumption, suggesting that this assumption may introduce significant inaccuracies. Finally, since PowerScope will continue to collect samples at the same rate even when the system is consuming relatively little energy,
such as when the processor is in a light sleep mode, PowerScope's sample collection can significantly affect the actual energy profile of the system.

The key contributions of this paper are threefold.

– First, we introduce a new approach – energy-driven statistical sampling – for evaluating the energy impact of software design decisions.
– Second, we describe a prototype implementation of this approach for the Itsy pocket computing platform [9], and present experimental data that validates this implementation.
– Third, we present the results of an energy-consumption study of 14 benchmarks comparing energy-driven statistical sampling to existing time-driven statistical sampling approaches (both simple time profiling and the PowerScope approach). Our results show that, for a given sampling rate, energy-driven statistical sampling provides better resolution at an instruction level. It also provides greater accuracy when there are multiple power states, such as would exist in a system with dynamic voltage/frequency scaling. On the other hand, when there is only a single power state other than idle mode (which was the case for most of our benchmarks on our test system), time profiles may sufficiently approximate energy profiles for the purpose of assisting programmers.

The rest of this paper is organized as follows. In Section 2, we describe our prototype implementation of energy-driven statistical sampling and an error analysis of this prototype. Then, in Section 3, we describe our prototype implementation of the time-driven statistical sampling approaches, the methodology we used in our empirical comparison, and its results. In Section 4, we expand on the differences between sampling-based approaches and activation-model approaches. Finally, in Section 5, we conclude and discuss ongoing and future work.
2 Energy-Driven Statistical Sampling Prototype
We have built a prototype implementation of energy-driven statistical sampling for the Itsy version 2.0 pocket computing platform [9]. In this section, we describe the hardware and software components of our prototype and the experimental results validating this prototype.
2.1 Hardware
As illustrated in Figure 1, we replaced the Itsy’s battery with a power supply, and interposed an energy counter between the power supply and the Itsy electronics. We used a power supply rather than the battery to simplify our experimental procedure and to ensure that the Itsy’s power efficiency stays constant during experiments; this issue is discussed further in Section 2.3. The purpose of the energy counter is to generate an interrupt whenever a predetermined amount of energy has been consumed. The energy counter operates as follows.
zero
interrupt
Fig. 1. The Itsy energy-driven statistical sampling prototype. The energy counter is interposed between the power supply and the Itsy pocket computer. Voltage Vx is generated by the Itsy and equals 3.04 V.

As current i_s is drawn from the power supply by the Itsy, a current mirror comprising a resistor and a current-sense amplifier generates a current i_m = α × i_s (where α = 1.5 × 10⁻³ in our implementation). This current i_m deposits charge on the positive plate of a capacitor, which acts as a current integrator. When the voltage across the capacitor plates reaches 2/3 of the supply voltage of the monostable multivibrator (Vx), the multivibrator generates an output pulse P, and then discharges the capacitor to a voltage of Vx/3. The capacitor then begins to accumulate charge again via i_m. Thus, each pulse P indicates that the capacitor (with capacitance C Farads) has accumulated Qc = C × Vx/3 Coulombs, during which time the Itsy has consumed Qi = Qc/α Coulombs. Since the voltage powering the Itsy stays constant at Vs, and the voltage drop across the resistor is small (as further discussed in Section 2.3), the energy consumed by the Itsy during this time is approximately Vs × Qi. In our implementation, this quantity, which we refer to as the minimum energy quanta, is approximately 450 µJ. Each pulse decrements the value of the count-down counter. When this counter's value reaches zero, the counter generates an interrupt request by asserting a signal that is connected to one of the Itsy's general purpose I/O lines (GPIOs). The value of the counter is reset by software, as described in the next section. Thus, the energy counter allows the amount of energy consumption that will generate an interrupt to be varied dynamically. For the experiments reported in this paper, we set the counter value to 1, and hence, our energy quanta is approximately 900 µJ.
Software
The software for energy-driven statistical sampling is divided into three components – (i) a kernel-level device driver, which includes an interrupt service routine, (ii) a user-level data-collection daemon, and (iii) user-level data-processing
tools. This software has been written to run on top of Linux (version 2.0.30), the operating system run by the Itsy. Calls to the device driver control profiling (e.g. turn it on or off). When profiling is turned on, the interrupt service routine handles interrupts raised via the above-noted GPIO line. On each interrupt, this routine performs two tasks. First, it records a sample in a circular buffer. A sample consists of the identity of the process, the software module (executable or dynamically linked library), and the instruction within that module that was interrupted. Second, before the routine ends, it resets the value of the count-down counter discussed above. (Calls to the device driver can designate a single reset value, or a small range of values from which the interrupt service routine can randomly select a value.) The user-level data-collection daemon issues calls to the device driver to obtain sample data, and writes the sample data to the sample data files. At any subsequent time, these files can be processed using the user-level data-processing tools to produce human-readable energy profiles.
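The final data-processing step can be pictured as a simple aggregation over the recorded samples. The tuple layout below follows the sample contents described above, but the function itself is only an illustrative sketch, not the actual tool.

```python
# Sketch of turning energy-driven samples into a human-readable energy profile.
# With energy-driven sampling, every sample represents one energy quantum, so the
# profile is just the normalized sample count per (module, instruction).
from collections import Counter

def energy_profile(samples):
    """samples: iterable of (pid, module, pc) tuples recorded by the interrupt routine."""
    counts = Counter((module, pc) for _pid, module, pc in samples)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}   # fraction of total energy
```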
2.3 Error Analysis and Validation
In this section, we discuss the sources of error inherent in our energy-driven sampling prototype. Broadly, the sources of error can be classified into two categories: (i) energy measurement related, and (ii) attribution and analysis related. The rest of this section discusses each of these in detail.

Measurement-Related Errors Energy-driven sampling operates on the assumption that each interrupt signifies that the Itsy has consumed a fixed amount of energy. Since our energy counter integrates only the current being drawn by the Itsy, an equivalent assumption is that each interrupt signifies that a fixed amount of charge has been consumed. The degree to which this assumption holds depends on the degree to which the Itsy's battery-terminal voltage varies. As this voltage drops, the Itsy's power consumption also drops owing to improved voltage-regulation efficiency. This decrease in power consumption is not sufficient, however, to offset the decrease in voltage, and thus, the current drawn by the Itsy increases. Yet, in our implementation, which uses a bench power supply, this supply-voltage dependence problem contributes no significant error. First, we use a well-regulated bench supply, and second, for our benchmarks, the maximum voltage drop across the sense resistor was 52 mV, a value too small to significantly change the current-power relationship for the Itsy. If, on the other hand, the Itsy were powered by its battery, our tool would require an additional mechanism that accounts for the current-power relationship. One possibility would be to characterize this relationship, and, using periodic sampling of the battery voltage, increase the value written into the count-down counter as the battery voltage decreases. A second source of measurement-related error is the inherently non-ideal characteristics of electrical components. Of particular note is the time required for the multivibrator to discharge the capacitor, which we determined empirically
to be approximately 290 ns. For our benchmarks, the total time spent during such discharges of the capacitor represents less than 0.1% of the total time. The other component-related errors are also negligible due to careful selection of components and operating ranges. A third measurement-related error arises from the energy counter measuring the energy being consumed by some of the components (the monostable multivibrator and the count-down counter) of which it is built. However, in practice, these components consume a negligible amount of energy compared to that consumed by the Itsy. Finally, a fourth source of error is the degree to which the sampling software perturbs the resulting energy profiles. All the sampling software except the interrupt service routine is profiled. Therefore, the percentage of samples attributed to this software serves as an estimate of the error it induces. For each of our benchmarks, this percentage is given in column 2 of Table 1. (The benchmarks are described later in Table 2.) Although the interrupt service routine cannot be profiled, we can measure the time spent in the routine. We then estimate the number of samples that would have been attributed to the routine by assuming that the routine's power consumption is approximately the same as our zip benchmark, which (like the interrupt service routine) is memory-intensive, does not include any idling, and does not use any devices. This estimate, as a percentage of the total number of samples collected, is given in column 1. That the percentages in columns 1 and 2 are so low indicates that the profiling software did not incur significant error.
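The column-1 estimate reduces to a small calculation, sketched below under the assumption (one plausible reading of the procedure just described) that the interrupt service routine draws power at roughly the zip benchmark's average rate; the function and parameter names are ours.

```c
/* Estimate of the samples the interrupt service routine (ISR) itself
 * would have received: one sample is generated per energy quantum, so
 * samples = (time in ISR * assumed ISR power) / energy quantum.       */
double isr_overhead_percent(double isr_time_s,       /* total time in ISR     */
                            double zip_power_w,      /* avg power during zip  */
                            double energy_quantum_j, /* ~900e-6 J in this setup */
                            double total_samples)
{
    double isr_samples = isr_time_s * zip_power_w / energy_quantum_j;
    return 100.0 * isr_samples / total_samples;
}
```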
Table 1. Estimates of the impact of sampling software on energy profiles

Benchmark     ISR overhead (est.) (1)   Daemon & driver overhead (2)
cpu-bound-1   0.35%                     0.25%
cpu-bound-2   0.49%                     0.20%
idle          0.66%                     0.01%
mpeg          1.32%                     0.50%
synth         1.62%                     0.11%
voice         0.77%                     0.15%
midi          1.08%                     0.02%
wave          0.74%                     0.01%
zip           1.13%                     0.29%
ftp           0.92%                     1.49%
qpe1          1.54%                     0.14%
qpe2          1.58%                     0.17%
qpe3          1.48%                     0.20%
edemo         1.45%                     0.17%
Attribution and Analysis Errors The other major class of errors is attribution and analysis errors. To accurately attribute a sample to an instruction, it is important to minimize the delay between the following two events: the monostable multivibrator generating a pulse P that will decrement the counter to zero (see Section 2.1), and the interruption of the program in execution so that the interrupt can be serviced. This delay is composed of the time required for the interrupt to be conveyed to the processor core, and the time required for the processor to begin servicing the interrupt. For our prototype, the first component is on the order of nanoseconds, and is thus a small fraction of the time between interrupts, which is on the order of milliseconds. Given that the power consumed during such small time intervals does not vary significantly, the first component's impact is a marginal variation in the energy quantum. Regarding the second component, deferred interrupt handling could incur delay in some specific situations (because the processor is handling a higher priority interrupt, for example), but such situations occur infrequently. In addition, the Itsy pocket computer uses a StrongARM SA1100 processor. One characteristic of this processor is that, before servicing a trap or exception, the processor first completes execution of all instructions in the decode and later pipeline stages. Thus, when the energy-counter interrupt is processed, the sample will be attributed to the instruction that was fetched after the instruction that was in execution at the time that the interrupt was delivered to the core. Detailed information about instruction delays, pipeline scheduling rules and interrupt handling on the SA1100 processor could be leveraged to adjust for this processor-induced skew in attribution. Without this information, however, this skew will be common to all interrupt-driven statistical sampling tools for the SA1100 processor (including the ones discussed in this paper). A second source of error arises if phases of the application being profiled are synchronized exactly with the discharging of the capacitor. Although none of our benchmarks exposed this error, our prototype allows synchronization errors to be avoided by dynamically varying the energy quantum after which an interrupt will be generated (as described in Section 2.2). Finally, a third source of error concerns sensitivity to the sampling rate. The accuracy with which a sample distribution reflects the true allocation of energy consumption for an application is proportional to the square root of the number of samples [5]. However, a larger number of samples can lead to greater processing overhead and increased profiler intrusiveness. For our benchmarks, our default parameters resulted in the collection of 12,000 to 125,000 samples at sampling frequencies of 130 Hz to 1100 Hz. Sensitivity experiments with longer execution times and higher sampling frequencies indicated that this setup did not have appreciable errors.
[Figure 2 appears here. It shows the Itsy powered through a 0.33-Ohm sense resistor, with a data acquisition system and clock generator that drive both power-sample collection at the battery terminals and the sample-clock interrupt delivered to the SA1100 processor.]
Fig. 2. Our implementation of PowerScope. The data acquisition system has its own power supply (not shown)
3
Comparison of Sampling Approaches
This section presents a study comparing energy-driven statistical sampling to existing time-driven statistical sampling approaches. We begin by describing our experimental methodology, with Section 3.1 describing our time-driven statistical sampling prototypes, and Section 3.2 describing our 14 benchmarks. Section 3.3 then presents the results of our comparison study.
3.1 Time-Driven Statistical Sampling Prototypes
We compared our energy-driven statistical sampling tool with two versions of time-driven statistical sampling, a simple time profiler and our implementation of the PowerScope approach [7]. The simple time profiler requires no hardware support and is the existing tool available with the Itsy public distribution [9]. Both our energy-driven sampler and our PowerScope implementation use the same software, with minimal modifications. Figure 2 depicts our PowerScope prototype. Our prototype differs from the original PowerScope implementation in that we use a data acquisition system rather than a digital meter to measure the instantaneous current of the system. The data acquisition system generates a clock that drives both time-driven sampling on the Itsy and the collection of instantaneous current measurements. The current measurements are correlated off-line with the samples recorded by the interrupt service routine. Since the battery-terminal voltage of the Itsy varied by only approximately 2%, the current measurements were directly proportional to the Itsy's instantaneous power consumption. Energy profiles were generated by weighting each sample by its correlated current measurement, under the assumption that the instantaneous power consumption of the system sufficiently approximates the average power consumption since the last sample.
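The off-line correlation step can be sketched in a few lines of C. The names and the fixed-size module table are illustrative assumptions, not the actual PowerScope post-processing code.

```c
#include <stddef.h>

#define NUM_MODULES 64

/* Off-line weighting used by our PowerScope-style prototype: each
 * time-driven sample is weighted by the instantaneous current measured
 * at (approximately) the same moment.  With a nearly constant supply
 * voltage and a fixed sampling period, current-weighted sample counts
 * are proportional to energy.                                          */
void weight_samples(const int *sample_module,   /* module hit by sample i      */
                    const double *sample_amps,  /* correlated current reading  */
                    size_t nsamples,
                    double energy_weight[NUM_MODULES])
{
    for (size_t m = 0; m < NUM_MODULES; m++)
        energy_weight[m] = 0.0;
    for (size_t i = 0; i < nsamples; i++)
        energy_weight[sample_module[i]] += sample_amps[i];
    /* Each module's share of the summed weights is its share of the
     * energy profile; multiplying by voltage and the sampling period
     * would convert the weights to Joules.                            */
}
```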
Error Analysis for Prototypes For the PowerScope prototype, calibration experiments indicate that the current measurement error introduced by the data acquisition system is significantly lower than 1%. The data acquisition system is powered by a separate power supply to minimize other perturbations.¹ To avoid synchronization between sample collection and the application, our simple time profiler jitters the time between successive interrupts (by 1/16 of the base frequency). Note that the instruction-level sample-attribution skew discussed in Section 2.3 is intrinsic to the SA1100 processor, and is therefore common across all our prototypes.
3.2 Benchmarks
Table 2 describes the fourteen benchmarks we studied, their execution times, and their average sampling frequencies for our energy-driven sampler. The benchmarks were chosen to cover a wide range of activities. They include two contrived CPU-intensive benchmarks (cpu-bound-1 and cpu-bound-2), a variety of media processing benchmarks (mpeg, synth, voice, midi, and wave), file handling and file transfer benchmarks (zip and ftp), and a few complex benchmarks (qpe and edemo) that include a combination of the functionality of several of the other benchmarks. To ensure repeatability, the experiments with qpe (the only benchmark that needed a lot of user input through the touch screen) were automated by modifying the touch-screen driver to play back pre-recorded traces. In addition, most of the benchmarks were chosen to include multiple instances of an application, or to take enough time to ensure a large number of samples. To ensure consistency, we used a single Itsy for all our experiments. In addition, for each benchmark, we set the sampling frequency of the time-driven approaches to the average sampling frequency with energy-driven sampling, thereby ensuring that all approaches obtain a similar number of samples. However, for two benchmarks (mpeg and qpe), matching frequencies caused the PowerScope prototype to generate trace files that exceeded the available storage on the Itsy. Therefore, in these two cases, we used the highest frequency possible that did not exceed the available storage: 500 Hz for mpeg and 100 Hz for qpe. The running time and periodicity of these benchmarks was sufficient to ensure that the lower sampling rate did not have a significant impact on our results.
3.3 Experimental Results
Figure 3 summarizes the profiles for the fourteen benchmarks.
¹ We also implemented an online version of PowerScope using the hardware provided by the Itsy for measuring its own instantaneous current consumption. However, the power consumed by these components was sufficiently large to cause significant perturbations in the profiles.
Table 2. Benchmarks studied

Benchmark     Time (sec)  Avg samp freq (Hz)  Description
cpu-bound-1   193         441                 Two interleaved loop-based compute-intensive (i.e., addition) procedures
cpu-bound-2   444         282                 Six instances of a loop-based compute-intensive procedure interleaved with sleep periods
idle          299         130                 Itsy in idle mode
mpeg          171         1113                Two MPEG movies (bigtoy.mpg and swars2.mpg) played sequentially for six iterations
synth         375         592                 Speech synthesis of a 6KB text document (samples scaled by 0.25)
voice         41          297                 Recording dictation of two paragraphs
midi          342         411                 MIDI file (3-spain.mid) played to completion using the Timidity player
wave          141         298                 Itsy.wav sound file played four times
zip           30          782                 Three instances of tarring, zipping, and removing a 573KB directory
ftp           55          291                 FTP'ing of a 573KB file
qpe1          245         493                 Windowing environment trace involving operations on the calculator program, the tux toy program, and the window canvas program
qpe2          240         570                 Windowing environment trace involving operations on the datebook program (read, edit, delete, add)
qpe3          259         668                 Windowing environment trace involving typing a few paragraphs on the keyboard
edemo         344         524                 Itsy E-mail demo, which uses command-and-control speech recognition to process e-mail; the test included voice-directed and button-based navigation, speech synthesis of several messages, and recording audio of a response
For each benchmark, the bar on the left (Time) summarizes the profile generated by the simple time profiler, the bar in the middle (PowerScope) summarizes the profile generated by the PowerScope prototype, and the bar on the right (Energy) summarizes the profile generated by the energy-driven sampling tool. Each profile is divided into four components: the application modules (Application), the kernel-idle loop (Kernel-idle), other kernel modules (Kernel-misc), and miscellaneous library modules (Misc).

Profile Differences Figure 3 shows that the different tools often produce significantly different results. Eight of the fourteen benchmarks in our study show substantial differences between the time and energy profiles. While the differences between the energy profiles generated by our PowerScope prototype and our energy-driven sampling
Fig. 3. Module-level comparison of time and energy profiles
tool are smaller, there are still substantial differences for six of the fourteen benchmarks. The differences between profiles are particularly evident for benchmarks that cycle through multiple power states. In our benchmarks, the most dominant example of such multiple power states is the difference between the power consumed while executing the kernel-idle loop and any other code. For example, the synthetic cpu-bound-2 benchmark consumes about 100 mW when in the idle loop, but about 400 mW when in the cpu-intensive portions of the benchmark. Figure 4 illustrates this difference. This figure plots the time since the last sample for the sequence of samples obtained by our energy-driven sampler. Since one energy quantum was consumed during each sampling interval, larger time values correspond to periods of lower power consumption.² Notice that the time values while the processor is idle are about four times those when the processor is busy. The simple time profiler consistently over-estimates the energy consumption of the kernel-idle loop.
² Thus, since the average power consumption during each sampling interval can be determined by dividing the energy quantum by the time between successive samples, energy-driven sampling may also be used to obtain a time-sequenced set of power measurements. Similarly, with energy-driven sampling, the relative power consumption between two tasks can be compared simply by comparing the average sampling rate during the two tasks. Of course, such information could also be obtained using other means (e.g., digital meters or a data acquisition system).
Fig. 4. Variation in time between interrupts for the cpu-bound-2 benchmark
PowerScope detects some of the difference between the idle and non-idle modes, but still attributes too much energy to the kernel-idle loop. This discrepancy arises because the current drawn by the Itsy while idle is not constant. Rather, due to background system activity (e.g., real-time clock updates and daemon activations), there are occasional pronounced current peaks of short duration. This is illustrated in Figure 4 by the multiple bands of sampling intervals associated with the idle periods. Since PowerScope only samples instantaneous power, it would require a much higher sampling rate to capture such transient effects accurately.

To investigate the impact of power states other than idle mode, Figure 5 shows profiles for the fourteen benchmarks that were generated by ignoring samples in the kernel-idle loop. As in Figure 3, for each benchmark, the bars on the left, middle and right summarize the profile generated by our simple time profiler, our PowerScope prototype and our energy-driven sampling tool, respectively. In addition, to compare the profiles generated by the different approaches at a finer granularity, each profile is now divided into seven categories. The profiles for each benchmark distinguish the four application procedures (App-1 to App-4) and the (non-idle) kernel procedure (Kernel-1) that received the greatest number of samples using our energy-driven sampler. Figure 5 reveals that there continue to be differences between the profiles even after idle time is factored out. However, the differences are both smaller and less common. Indeed, aside from the qpe1 and qpe3 benchmarks, it is not apparent whether the differences between time and energy profiles at this granularity would be significant to a programmer who would use a profile to steer development
Fig. 5. Procedure-level comparison of time and energy profiles that exclude idle-routine samples
effort focused on reducing energy consumption. Note that the profiles for the idle, wave, and ftp benchmarks also show variations, but the significance of these variations is questionable since 80% or more of the samples for these benchmarks were attributed to the kernel-idle loop, as shown in Figure 3.

Profile Resolution Considering the variations we observed in instantaneous power measurements, we were surprised by the absence of more significant differences in the profiles obtained by our energy-driven sampler and our PowerScope prototype. For example, Figure 6 shows instantaneous current readings obtained by sampling at three different frequencies (100 Hz, 800 Hz, and 12.5 kHz) during the mpeg benchmark. Recall that the PowerScope approach assumes that the instantaneous power consumption associated with each sample approximates the average power consumption since the prior sample. Comparing each sample value at 100 or 800 Hz with the average across 12.5 kHz sample values since the prior 100 or 800 Hz sample reveals more than a few instances in which this assumption does not hold. This observation suggests that the profile generated by the PowerScope approach could contain substantial inaccuracies, a hypothesis that would explain the approach's over-attribution of energy to the kernel-idle loop. Interestingly, however, when the samples attributed to this loop are ignored (Figure 5), the profiles generated by the two approaches do not differ substantially.
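One direct way to test the PowerScope assumption is to compare each coarse reading with the mean of the fine-grained readings collected since the previous coarse sample. The sketch below assumes the coarse samples coincide with every ratio-th fine reading; the function name and tolerance parameter are ours.

```c
#include <math.h>
#include <stddef.h>

/* Count coarse samples whose instantaneous reading differs from the
 * average of the high-rate (e.g., 12.5 kHz) readings collected since
 * the previous coarse sample by more than a relative tolerance 'tol'. */
size_t count_assumption_violations(const double *fine, size_t nfine,
                                   size_t ratio, double tol)
{
    size_t violations = 0;
    for (size_t i = ratio; i <= nfine; i += ratio) {
        double avg = 0.0;
        for (size_t j = i - ratio; j < i; j++)
            avg += fine[j];
        avg /= (double)ratio;
        double inst = fine[i - 1];      /* reading at the coarse sample */
        if (fabs(inst - avg) > tol * avg)
            violations++;
    }
    return violations;
}
```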
Fig. 6. Variation in instantaneous current samples during the mpeg benchmark
Fig. 7. Differences in instruction-level profiles for the mpeg benchmark
To help understand this somewhat surprising result, we compared the instruction-level profiles produced by our energy-driven sampler and our PowerScope prototype. Figure 7 illustrates this comparison for the mpeg benchmark. Each point on this graph corresponds to a single instruction. The location of a point is determined by the percentage of samples attributed to that instruction by energy-driven sampling (X-axis value) and the PowerScope approach (Y-axis value). Hence, if the instruction-level profiles of the two approaches were nearly identical, all data points would lie on or near the x=y diagonal. The presence, in Figure 7, of a significant number of outliers indicates that there are indeed substantial differences in where the two approaches attribute energy consumption at an instruction level. It appears that instruction-level differences simply average out with increasing profile granularity. To validate this hypothesis, we estimated how the difference between the profiles produced by these two approaches changes as the granularity of the profiles increases. In particular, within each instruction-level profile, we distributed the samples attributed to each instruction I uniformly across the instructions with PC addresses within n of I. Then, for different values of n, we calculated the difference between (redistributed) profiles as the sum of the absolute differences between the number of samples attributed to each instruction. This method ignores control flow, but provides a first order estimate of the impact of granularity. In support of our hypothesis, we observed that, for increasing values of n between 0 and 16, the difference between the profiles decreased. Furthermore, for most of the benchmarks, there was a 55% to 75% drop in this difference between n = 0 and n = 16. Impact of Other Power States As noted already, in our experiments, the most significant variations in the profiles were due to multiple power states caused by the kernel-idle component. In this section, we examine two other important power states: those due to cache misses, and those due to frequency/voltage scaling. Empirical data [12] indicates that the SA1100 may not exhibit multiple power states at the processor level for most instructions. However, cache misses and the resulting memory accesses may produce another power state that may lead to dominant differences between the different tools. The experiments reported in this paper were obtained with the StrongARM SA1100 run at 133 MHz with clock switching enabled. Clock switching allows the core clock frequency to drop to the bus speed until the required data is read from memory. The reduction in the core power offsets the increased power for the memory access leading to comparable power consumption for cache hits and cache misses (less than 10% variation [8]). However, we performed several experiments with a synthetic cache-miss workload at a higher frequency (206 MHz) where cache misses took twice as much power as cache hits [8]. Our results showed marked differences between time and energy profiles, with the differences proportional to the cache miss ratios.
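The first-order granularity estimate described above (redistributing each instruction's samples over a window of nearby PC addresses and then taking the sum of absolute differences between the two redistributed profiles) can be sketched as follows. The function and parameter names are ours, and control flow is ignored exactly as in the text.

```c
#include <stdlib.h>
#include <math.h>

/* a[i], b[i]: samples attributed to instruction i by the two profilers.
 * Each instruction's samples are spread uniformly over instructions
 * within n positions of it; the return value is the L1 difference of
 * the redistributed profiles (or -1.0 on allocation failure).          */
double profile_difference(const double *a, const double *b,
                          size_t ninsns, size_t n)
{
    double *ra = calloc(ninsns, sizeof *ra);
    double *rb = calloc(ninsns, sizeof *rb);
    if (!ra || !rb) { free(ra); free(rb); return -1.0; }

    for (size_t i = 0; i < ninsns; i++) {
        size_t lo = (i >= n) ? i - n : 0;
        size_t hi = (i + n < ninsns) ? i + n : ninsns - 1;
        double share_a = a[i] / (double)(hi - lo + 1);
        double share_b = b[i] / (double)(hi - lo + 1);
        for (size_t j = lo; j <= hi; j++) {
            ra[j] += share_a;
            rb[j] += share_b;
        }
    }
    double diff = 0.0;
    for (size_t i = 0; i < ninsns; i++)
        diff += fabs(ra[i] - rb[i]);
    free(ra);
    free(rb);
    return diff;
}
```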
Table 3. Impact of frequency scaling

First experiment (206 MHz then 59 MHz):
Module    Time     Energy
App-206   22.01%   41.13%
App-59    77.44%   57.50%
Other      0.56%    1.37%

Second experiment (six frequencies):
Module    Time     Energy
App-59    31.36%   20.52%
App-89    20.84%   17.56%
App-118   15.67%   16.32%
App-148   12.47%   15.57%
App-176   10.33%   14.92%
App-206    8.85%   14.56%
Other      0.46%    0.54%
Additionally, the SA1100 is a fairly simple processor with a limited number of power states [12]. More complex processors are likely to have more instructions that produce additional power states, which would increase the differences between profiling approaches. Frequency and voltage scaling, like cache misses, also offer the potential for additional power states that can be cycled through rapidly. The impact that frequency and voltage scaling may have is suggested by the following two experiments. In the first experiment, we obtained time and energy profiles for a workload composed of running a long benchmark twice, first at a clock frequency of 206 MHz, then at a clock frequency of 59 MHz. In the second experiment, we obtained time and energy profiles for a workload composed of running a short benchmark six times, each time with a different clock frequency (206 MHz, 176 MHz, 148 MHz, 118 MHz, 89 MHz, and 59 MHz). In both cases, the energy profile was obtained using our energy-driven sampler (rather than our PowerScope prototype). Table 3 summarizes the results of these two experiments. While these workloads are synthetic, the results indicate the inappropriateness of time profiles for estimating the energy consumption of applications that exploit frequency scaling. Moreover, since PowerScope assumes that instantaneous power consumption approximates the average power consumption between successive samples, rapid cycling through multiple energy states may increase the difference between energy-driven sampling and the PowerScope approach.
3.4 Observations
Based on our results, we can make several observations.
– Simple time profiles do not accurately reflect energy consumption in many cases. Such inaccuracies are particularly pronounced when software cycles through multiple power states (e.g., idle mode versus CPU-busy).
– Energy-driven profiles give better instruction-level resolution than previously proposed time-driven power sampling approaches (e.g., our PowerScope prototype) at the same sampling frequency. Furthermore, when multiple power states are exercised, they may provide greater accuracy with lower sampling frequencies.
– However, on simple handheld systems which do not exercise multiple power states other than idle mode, time-based profiling approaches may sufficiently approximate energy profiling for the purposes of assisting programmers – without requiring any additional hardware support.
4
Comparison with Activation-Model Approaches
Prior work has also explored approaches to exposing more information about a system's energy consumption that are not based on statistical sampling. These approaches, which we refer to as activation-model approaches, require two components: (1) energy-use estimates for specific system activities, such as executing a specific type of instruction, idling the processor, accessing memory or a disk, and sending or receiving messages over a network; and (2) activation counts for each activity, which are obtained by counting the number of times each activity occurs when an application runs. Energy consumption for the application can then be obtained by multiplying the activation counts by the energy cost per activation. Activation-model approaches differ widely in how they obtain activation counts. They can be classified as counting-based, sampling-based, or simulation-based approaches. With counting-based approaches, activation counts are obtained by either running instrumented versions of applications (e.g., Millywatt [4]), or running (unmodified) applications on an operating system that is modified to leverage hardware event counters (e.g., Joule Watcher [2]). A sampling-based approach is put forth by PowerMeasure/StateProfiler [10], which periodically samples the system's state to estimate some activation counts. Finally, with simulation-based approaches (e.g., SimplePower [13] and Wattch [3]), a specially-designed simulator is used to collect the activations of interest for an application. One advantage of activation-model approaches is that they can leverage activation counts for background activities to more easily attribute energy consumed asynchronously. In contrast, both the PowerScope and energy-driven samplers we evaluate in this paper attribute the energy consumed by all system activity to the software modules that are executing while this activity is on-going. For example, should a software module initiate DMA, the samplers would attribute the energy consumed by the DMA engine to all the software modules that execute while the engine is busy. Thus, while this approach to attributing energy consumption gives programmers insight into system-level energy consumption, it does not identify the modules that are actually responsible for such background activity. We are currently exploring ways to augment statistical sampling to enable more sophisticated attribution. Section 5 briefly describes one of the approaches we are considering. The main disadvantage of activation-model approaches is the difficulty of ensuring their accuracy. In particular, the accuracy of these approaches depends on tracking all important activation counts, and on obtaining accurate energy-use estimates for all important system states, often through detailed simulation or
application instrumentation. Obtaining such estimates is non-trivial for a single platform, let alone many platforms. In contrast, our tool requires neither prior intuition about what activities may be of interest nor energy-use estimates for specific activities.
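The final step shared by all activation-model approaches is a simple weighted sum, sketched below with invented names; as discussed above, the hard part is obtaining trustworthy counts and per-activation energy costs, not this computation.

```c
#include <stddef.h>

/* Activation-model estimate: total energy is the sum over activities of
 * (activation count * energy per activation).  How the counts and the
 * per-activation costs are obtained differs between the cited tools.    */
double activation_model_energy(const unsigned long *count,   /* per activity */
                               const double *energy_per_act, /* Joules       */
                               size_t nactivities)
{
    double total = 0.0;
    for (size_t i = 0; i < nactivities; i++)
        total += (double)count[i] * energy_per_act[i];
    return total;
}
```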
5
Summary and Future Work
While the issues with designing applications to reduce execution time are fairly well understood, a similar understanding about how to design applications to reduce their energy consumption is lacking. This paper presented a new approach, energy-driven statistical sampling, to exposing information about energy consumption. Tools developed with this approach can help the software designer both reason about the energy impact of software design decisions and identify application energy hotspots. Energy-driven statistical sampling uses a small amount of hardware to trigger an interrupt at pre-defined quanta of energy consumption. The interrupt is used to collect information about the program currently executing, and the information thus collected is processed off-line to generate a profile of where energy was spent during the program's execution. We have developed a prototype of this approach for the Itsy mobile computing platform, and our studies on the prototype indicate that energy-driven statistical sampling can provide an accurate system-level software energy profile with very little overhead. We compare energy-driven statistical sampling to two time-driven statistical sampling approaches by comparing the profiles generated by these approaches for 14 benchmark programs. Our results show that energy-driven statistical sampling can provide better instruction-level resolution and greater accuracy than existing time-driven statistical sampling approaches. In particular, there are often significant differences between the profiles generated by energy- and time-driven statistical sampling when the workload cycles through multiple power states. On simple handheld systems, many applications may exercise only a single power state other than idle mode, in which case time profiling may sufficiently approximate energy profiling for the purpose of assisting programmers. However, preliminary investigations indicate that emerging functionality, like frequency and voltage scaling, will increase the differences between time and energy profiles, and therefore the benefit of energy-driven statistical sampling. As part of our ongoing efforts, we plan to extend our work to account for background power usage. One solution we are exploring is to augment the energy-counting hardware of a system to include software-readable counters that track the amount of energy consumed by background energy consumers (e.g., the backlight, a wireless radio). These counters could then be used by our interrupt service routine to assign fractions of the sample (based on energy usage) to system energy-use categories (e.g., backlight on, wireless active). In this way, a programmer may identify major energy consumers at the hardware level and, with further investigation, the software modules responsible.
We are also repeating this comparison study in the PocketPC environment on the iPAQ pocket computer, and are considering doing so on systems that support frequency and voltage scaling.
Acknowledgment A number of people assisted us with the research described in this paper and with the larger effort of which it is a part. We thank Deborah Wallach and Marc Viredaz for helping us with the Itsy Pocket Computer hardware and software and Wayne Mack for helping with soldering and component assembly. Also, for supporting our research in numerous ways, we thank Ken Nicholas, Alan Eustace, Jim Mann, Ramakrishna Anne, George Bold, Scott Briggs, Tim Kamrath, and Tu Nguyen. Finally, we thank the anonymous reviewers for their comments.
References [1] J. Anderson, L. Berc, J. Dean, S. Ghemawat, M. Henzinger, S. Leung, D. Sites, M. Vandevoorde, C. Waldspurger, and W. Weihl. Continuous profiling: where have all the cycles gone. In Proceedings of the 16th Symposium on Operating Systems Principles, October 1997. 111 [2] F. Bellosa. The benefits of event-driven energy accounting in power sensitive systems. In Proceedings of the 9th ACM SIGOPS European Workshop, September 2000. 126 [3] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architecturallevel power analysis and optimizations. In Proceedings of the 27th International Symposium on Computer Architecture (ISCA), June 2000. 110, 126 [4] T. L. Cignetti, K. Komarov, and C. Ellis. Energy estimation tools for the Palm. In Proceedings of the ACM MSWiM’2000: Modeling, Analysis and Simulation of Wireless and Mobile Systems, August 2000. 110, 126 [5] J. Dean, J. E. Hicks, C. Waldspurger, B. Weihl, and George Chrysos. ProfileMe: Hardware support for instruction-level profiling on out-of-order processors. In Proceedings of the 30th Annual International Symposium on Microarchitecture, December 1997. 116 [6] X. Zhang et al. Operating system support for automated profiling and optimization. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, October 1997. 111 [7] J. Flinn and M. Satyanarayanan. PowerScope: A tool for profiling the energy usage of mobile applications. In Proceedings of the Workshop on Mobile Computing Systems and Applications (WMCSA), pages 2–10, February 1999. 111, 117 [8] Jason Flinn, Keith I. Farkas, and Jennifer Anderson. Power and energy characterization of the itsy pocket computer (version 1.5). Technical Report Technical Note TN-56, Compaq Western Research Laboratory, February 2000. 124 [9] W. R. Hamburgen, D. A. Wallach, M. A. Viredaz, L. S. Brakmo, C. A. Waldspurger, J. F. Bartlett, T. Mann, and K. I. Farkas. Itsy: Stretching the bounds of mobile computing. IEEE Computer, 34(4), April 2001. 112, 117 [10] J. Lorch and A. J. Smith. Energy consumption of Apple Macintosh computers. IEEE Micro Magazine, 18(6), November/December 1998. 110, 126
[11] T. Simunic, L. Benini, and G. De Micheli. Energy-efficient design of batterypowered embedded systems. In Proceedings of the International Symposium on Low-Power Electronics and Design ’98, June 1998. 111 [12] Amit Sinha and Anantha P. Chandrakasan. Jouletrack - a web based tool for software energy profiling. In Design Automation Conference (DAC 2001), June 2001. 124, 125 [13] W. Ye, N. Vijaykrishan, M. Kandemir, and M. J. Irwin. The design and use of SimplePower: A cycle-accurate energy estimation tool. In Proceedings of the Design Automation Conference, June 2000. 126
Modeling of DRAM Power Control Policies Using Deterministic and Stochastic Petri Nets Xiaobo Fan, Carla S. Ellis, and Alvin R. Lebeck Department of Computer Science Duke University, Durham, NC 27708, USA {xiaobo,carla,alvy}@cs.duke.edu
Abstract. Modern DRAM technologies offer power management features for trading off performance and energy consumption. This paper employs Petri nets to model and evaluate memory controller policies for manipulating multiple power states. The model has been validated against the analysis and simulation used in our previous work. We extend it to model more complex policies, and our results show that a DRAM chip should always immediately transition to standby and never transition to powerdown, provided that it exhibits typical exponential access behavior.
Keywords: Control Policy, DRAM, Memory Controller, Modeling, Petri Nets
1
Introduction
Energy efficiency is becoming increasingly important in system design. It is desirable both from the point of view of battery life in mobile devices, and from environmental and economic points of view in all computing platforms. With the introduction of low-power processors, novel displays, and systems without hard disks, main memory is consuming a growing proportion of the system power budget. Modern DRAM technologies provide memory chips with multiple power states to offer power management capability. Usually there is an Active state to service requests and several degraded states (Standby, Nap, and Powerdown) with decreasing power consumption but increasing time to transition back to Active. We must design a power control policy to manipulate these states effectively to improve energy efficiency without sacrificing too much performance. Our work to date has adhered relatively closely to the specifications of RDRAM [4], thus giving our new power-aware memory management ideas the credibility of being based on actual hardware. However, our experience with this one design point suggests that alternatives to the current set of RDRAM power states may offer better management possibilities. There is a large space of potential DRAM power states and associated memory controller policies to explore. Identifying the most productive design points is valuable to influence future power-aware DRAM development.
Since the design space is so large, it is impractical to use detailed simulations. Therefore, we need models to allow rapid exploration of the space. As part of our previous work we investigated memory controller policies for manipulating DRAM power states in cache-based systems [3]. We developed an analytic model that approximates the idle time of DRAM chips using an exponential distribution, and validated our model against trace-driven simulations. Our results show that, for our benchmarks, the simple policy of immediately transitioning a DRAM chip to a lower power state when it becomes idle is superior to more sophisticated policies that try to predict DRAM chip idle time [2]. Although this model served our purpose, it is not easily extended to more power states. Therefore, we propose developing a Petri net model of DRAM power states. The first step is to create a model that mimics our previous analytic model for validation. Then we can extend it to more power states, finally using it to explore the space. We use the Petri net toolkit, DSPNexpress [6], which supports graphical development and performance analysis of deterministic and stochastic Petri nets. In this paper, we first establish a DSPN (Deterministic and Stochastic Petri Nets) model for the 2-state control policy we used in [3]. We use DSPNexpress [6] to solve the model. In [3] the analysis is done by assuming an exponentially distributed gap (the idle time between clustered accesses), while here the model is driven by an exponentially distributed inter-access time. By measuring gap from the DSPN model, we verify that it is equal to the exponentially distributed inter-access time. From the DSPNexpress solution, we can derive other results of interest. Using the same DRAM configuration, all results are consistent with those in [3]. Then we extend this model to 4 states and explore other threshold-based policies. Based on the available DRAM configuration, we have the following conclusions for typical exponential memory access patterns: 1) a simple Active to Nap immediate transition policy is the best 2-state policy if the average gap is large enough, 2) chips should always transition to Standby if it is available, 3) chips should not transition to Powerdown during execution. The rest of this paper is organized as follows. In the next section, we introduce how our control policy works on a multi-state memory chip and develop a DSPN model for the 2-state power control policy. Section 3 discusses how to derive performance and energy data from DSPN solutions and validates these results against probability-based analysis. In Section 4, we extend the model to 4 states and investigate the effect of two other thresholds on memory performance and energy consumption. Finally, Section 5 concludes and describes future work.
2
Methodology
Figure 1 is an example illustrating how a 2-state control policy works. In this case, we use the Active and Nap power states. When a memory chip has any outstanding access, it stays in the Active state. After the last access completes and before the start of the next access, the memory chip is idle, and we denote this interval as the gap. If the gap is greater than a threshold value, the chip
transitions down to the low power state, Nap; otherwise, it remains in Active.

Fig. 1. Power Management

This is a typical state/event dynamic system that can be modeled by a Petri net. Each state of a memory chip (i.e., Active, Nap, etc.) can be mapped to a marking in the Petri net model. Each state change or transition (i.e., power degradation, resynchronization, etc.) can be described by a Petri net transition. By associating timing information with these transitions, we can model system performance. Furthermore, knowing the performance results and the power consumption of each state and transition, we can evaluate the system's energy efficiency. Due to space limitations, we skip the background material on Petri nets and stochastic Petri nets; detailed information can be found in [10], [7], [8], [1], and [9].

Based on this 2-state control policy, we can develop a Petri net model, as shown in Figure 2. There are three situations a memory chip could be in when a request arrives: 1) in the Active state and currently servicing other requests, 2) in the Active state and idle-waiting, or 3) in the Nap state. In the first case, it only takes a small amount of additional time to service this request, because most DRAM technologies offer "bursting" optimizations and we assume requests are serviced by independent internal DRAM banks. In the Petri net model, we use the place labelled active to represent this state and a deterministic transition service with delay serviceDelay to represent the service of the request. In the second case, represented by the place labelled idle, the initial access incurs a delay to initiate the access. We use the transition activate and its delay activateDelay to depict this initial access cost. In the last case, a resynchronization cost is incurred to transition out of the low power state, nap, denoted by transition resync2 with resync2Delay as its cost. After completing the last access, the memory chip goes to idle through the immediate transition rest, and further into nap through a timed transition sleep with delay equal to the waiting threshold, napTh. Both of these degrading transitions can be disabled or interrupted by an inhibitor arc from place buffer, where outstanding requests are buffered. This means that when there is an outstanding request, if the chip is in active, it stays in active servicing the request; if
it is in idle and waiting for the threshold, the timer is canceled and the transition activate will fire after activateDelay.

Fig. 2. DSPN model for the 2-state power control policy

Based on analysis from our previous studies, we use an exponentially distributed memory access pattern to drive this power management model. The transition arrival is an exponential stochastic transition with mean firing delay arrivalDelay. Since we assume all requests come from a cache hierarchy with 8 outstanding misses, there is an inhibitor arc with multiplicity 8 from buffer to arrival to model this aspect. We allocate one token in place request to generate accesses and one token in the set of places {active, idle, nap} to simulate power states. Table 1 gives the parameter values for the multi-state DRAM that we use for all the following computations. Transitions resync1 and resync3 will be used later in the 4-state model.
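As an intuition check on the policy the DSPN captures, the following deliberately simplified Monte Carlo sketch draws exponentially distributed gaps and accumulates time and energy for one access cluster per gap cycle, using the Table 1 values. It assumes a single outstanding request at a time and is not a substitute for the DSPN model, which also handles buffered requests and bursting; the mean gap and threshold below are illustrative choices.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double exp_sample(double mean)
{
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* u in (0,1) */
    return -mean * log(u);
}

int main(void)
{
    const double Pa = 300.0, Pn = 30.0, Pna = 165.0;  /* mW, from Table 1  */
    const double access = 60.0, resync = 60.0;        /* ns                */
    const double mean_gap = 200.0, napTh = 0.0;       /* ns (illustrative) */
    const long cycles = 1000000;
    double energy = 0.0, time = 0.0;                  /* mW*ns and ns      */

    srand(1);
    for (long i = 0; i < cycles; i++) {
        double g = exp_sample(mean_gap);
        if (g <= napTh) {             /* stayed in Active while idle        */
            time   += g + access;
            energy += Pa * (g + access);
        } else {                      /* napped after napTh, resync on use  */
            time   += g + resync + access;
            energy += Pa * (napTh + access) + Pn * (g - napTh) + Pna * resync;
        }
    }
    printf("mean e*d per gap cycle: %.3g mW*ns^2\n",
           (energy / cycles) * (time / cycles));
    return 0;
}
```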
3
Validation
After running the DSPNexpress steady-state solution, we obtain the throughput for all deterministic and stochastic transitions, and the average token number for all places.
Table 1. DRAM power state and transition values

Power State   Power (mW)     Time (ns)
Active        Pa = 300       serviceDelay = 60
Standby       Ps = 180
Nap           Pn = 30
Powerdown     Pp = 3

Transition    Power (mW)     Time (ns)
activate      Pa = 300       activateDelay = 60
resync1       Ps→a = 240     resync1Delay = 6
resync2       Pn→a = 165     resync2Delay = 60
resync3       Pp→a = 152     resync3Delay = 6000
We use X_T to denote the throughput of transition T and N_P to denote the average token number of place P. Then we can derive other performance and power consumption values. Since we define gap as the time interval between two clustered accesses, we want to determine the period of each gap cycle and the average gap. Then, for each gap cycle, we can compute how much time is spent in each state and how much energy is consumed. Because each firing of transition activate terminates the current gap cycle and creates a new gap cycle, the mean period of the gap cycle is

Ttotal = 1 / Xactivate

Then we can compute the time spent in each place (all the following computations are for one gap cycle with mean period Ttotal):

Tactive = Ttotal · Nactive,   Tidle = Ttotal · Nidle,   Tnap = Ttotal · Nnap

Since the time elapsed in place nap also includes time spent on transition resync2, where the power consumption is different from that in the nap state, we need to determine how much time is spent on resynchronization:

Tresync2 = Ttotal · Xresync2 · resync2Delay

Therefore the energy consumed per gap cycle is (assuming Pa, Pn, and Pn→a are the power consumptions for the active power state, the nap power state, and resynchronization, respectively)

Etotal = Pa (Tactive + Tidle) + Pn (Tnap − Tresync2) + Pn→a Tresync2

Finally, we have the per-gap energy•delay product and the relative product (compared to the always-active policy):

e•d = Etotal · Ttotal
∆(e•d) = Etotal · Ttotal − Pa · (Ttotal − Tresync2)²
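The derivation maps directly onto the DSPNexpress outputs. The C sketch below (our own names; parsing the solver output is omitted) computes the per-gap-cycle metrics from the transition throughputs and mean token counts.

```c
#include <stdio.h>

/* X_* are transition throughputs (1/ns) and N_* are mean token counts,
 * read from the DSPNexpress steady-state solution; power and delay
 * values are those of Table 1 (mW and ns).                             */
struct dspn_out { double X_activate, X_resync2, N_active, N_idle, N_nap; };

void gap_cycle_metrics(struct dspn_out s,
                       double Pa, double Pn, double Pna, double resync2Delay)
{
    double Ttotal   = 1.0 / s.X_activate;          /* mean gap-cycle period */
    double Tactive  = Ttotal * s.N_active;
    double Tidle    = Ttotal * s.N_idle;
    double Tnap     = Ttotal * s.N_nap;
    double Tresync2 = Ttotal * s.X_resync2 * resync2Delay;

    double Etotal = Pa * (Tactive + Tidle)
                  + Pn * (Tnap - Tresync2)
                  + Pna * Tresync2;                 /* mW*ns per gap cycle */

    double ed        = Etotal * Ttotal;             /* mW*ns^2 */
    double ed_active = Pa * (Ttotal - Tresync2) * (Ttotal - Tresync2);
    printf("e*d = %g, delta(e*d) = %g (mW*ns^2)\n", ed, ed - ed_active);
}
```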
[Figure 3 appears here. It plots ∆(e•d) in mW·ns² against gap (ns), with curves for napTh = 0, 50, and 100 from both the analysis and the DSPN model.]
Fig. 3. ∆(e•d) computed from analysis and modeling (lower is better)

Before solving the model and comparing it with the analytical results in [3], there is one more problem we need to deal with. The memory access pattern we use to drive this model, as shown in Figure 2, follows an exponential distribution on the inter-access-arrival time instead of the inter-clustered-access idle time (gap), which we used to develop the analytical model in [3]. In theory we can prove that if the inter-access-arrival time follows an exponential distribution, gap should follow the same distribution. As we can see from Figure 1, gap is the interval between the completion of the last access and the arrival of the next access. Because of the memoryless property of the exponential distribution, the time elapsed from the arrival to the completion of the last access does not affect the probability of the next access's arrival. Therefore gap follows the same distribution as the inter-arrival time. To verify this, we can also compute the average gap µ from the DSPN model using one of the following equations:

µ = Ttotal − Tactive − Tresync2 − activateDelay
µ = Tidle − activateDelay + Tnap − Tresync2

In fact, each gap value computed from the above equations is equal to the corresponding arrivalDelay value used in the model. Using the same DRAM parameters as in our previous analysis [3], we run DSPNexpress to obtain results for this DSPN model. These results are shown in Figure 3 as the points, together with the solid lines derived from our previous analysis. As we can see, they match exactly, validating the DSPN-based modeling and the probability-based analysis against each other. Recall that our probability-based analysis was validated against simulations [3]. Because the DSPN model is much easier to develop and extend than mathematical analysis, we propose to use it to investigate more complex policies and
to explore a larger parameter space. In particular, we plan to explore the impact of: 1) changing the threshold values for transitioning to lower power states, 2) changing the power consumed during each DRAM power state, 3) changing the power consumed and delay incurred to transition between power states, and 4) changing the number of available power states. This paper only covers the first case.
4
Model Extension
In the previous section, we validated the DSPN model against an analytical model, which is validated in [3] against simulation. In this section, we will extend the 2-state model to 4 states and investigate how the other two thresholds affect energy efficiency. Figure 4 is the DSPN model for a 4-state power control policy. We add two more low power states – standby and powerdown. When the memory chip is in idle for time standbyT h, it first transitions down to standby. Then it transitions down to nap if it stays in standby for napT h, and further down to powerdown if it stays in nap for powerdownT h. If a memory access arrives during any of these downward transitions, the transiton is disabled by one of the three inhibitor arcs. Then the relevant upward resynchronization and/or activation is fired. This is consistent with the real hardware. Since we already know the immediate active to nap transition is the best 2-state policy when average gap is greater than 75ns and no nap transition should be made when average gap is less than 75ns, we want to know what the appropriate standby threshold should be in both of the two cases. Therefore, for memory access patterns with average gap greater than 75ns, we use 0 as napT h and, for those with average gap less than 75ns, we use infinity. In order to avoid the effect of the powerdown transition, we set powerdownT h to infinity. Then we observe how the energy•delay product changes as we vary standbyT h. Figure 5 shows the relative energy•delay product for different standbyT h values when gap increases from 15ns to 375ns. Unlike the nap-based transition, zero threshold is always the best even when the gap is very small. Therefore even with a very high cache miss rate the memory chip should always transition down to standby right after the completion of outstanding accesses. For contrast only when the cache hierarchy generates high enough hit rate so that the average gap is large enough (e.g. 75ns) should the memory chip transition down to nap immediately. This justifies the fact that standby is the default power state for Rambus DRAM when the chip is idle. Knowing standbyT h should always be 0 and napT h should be 0 when gap is greater than 75ns, we next want to know what powerdownT h should be for our exponential memory access pattern. P owerdown is an extreme state in that memory chip consumes very low power (3mW) and it incurs huge delay to get out of the state (6000ns). We run the model with gap starting from 75ns, because this is the boundary beyond which transitioning down to nap starts giving benefit and powerdown
[Figure 4 appears here. It extends the 2-state DSPN model with places standby and powerdown, timed transitions T6 (standbyTh), T7 (napTh), and T8 (powerdownTh) for the downward transitions, and resynchronization transitions resync1, resync2, and resync3.]
Fig. 4. DSPN model for 4-state power control policy
[Figure 5 appears here. It plots ∆(e•d) in mW·ns² against gap (ns) for standbyTh = 0, 10, 20, and 50.]
Fig. 5. ∆(e•d) computed from the 4-state model for different gap and standbyTh values

We run the model with gap starting from 75 ns, because this is the boundary beyond which transitioning down to nap starts giving benefit and powerdown becomes reachable. Figure 6 shows the relative energy•delay product for different powerdownTh values. We did not include results for powerdownTh values less than 500 ns because they are out of the graph range, making other parts indistinguishable. When powerdownTh is small (less than 2000 ns), the chip can more easily get to the powerdown state and incur a very long resynchronization delay when an access comes. Therefore it performs much worse than the always-active policy. The larger the average gap, the higher the probability that the chip goes to powerdown and the greater the penalty. When powerdownTh is large enough (greater than 5000 ns), the probability of getting into powerdown becomes so small that the policy is similar to the immediate nap transition. The larger the powerdownTh, the smaller this probability and the better the policy performs. Therefore, when the memory access pattern imposed on one memory chip exhibits an exponential or close-to-exponential distribution and has an average gap on the order of 100 ns, we need to set an infinite powerdown threshold to prevent any powerdown transition. The above conclusion does not hold for the unoccupied chips that are intentionally created by some page allocation or page movement algorithms [5]. The gaps of these idle chips are usually several orders of magnitude larger than those of the "active" chips we have studied above and no longer follow an exponential distribution. Hence in those cases we still need a powerdown threshold to shut down the "idle" chips to get maximum energy efficiency.
5
Conclusion
Power management features provided by modern DRAM technologies can be exploited to develop power-efficient memory systems.
[Figure 6 appears here. It plots ∆(e•d) in mW·ns² against gap (ns) for powerdownTh = 500, 1000, 2000, 5000, and 10000 ns.]
Fig. 6. ∆(e•d) computed from the 4-state model for different gap and powerdownTh values

Petri nets are a powerful tool that can be used to model and evaluate different memory power control policies. In this paper, we consider a 2-state control policy model, derive our energy efficiency metric, and validate the model against probabilistic analysis. Then we extend our model to 4 states and investigate the effects of these additional states on energy efficiency. The results reveal that memory chips should always immediately transition to standby, and that chips which exhibit an exponential-like access pattern without extremely long average gap values should not transition to the powerdown state. All our studies to date are investigations of the appropriate control policy based on a certain available DRAM platform that provides power management features. As future work, in order to obtain some valuable power-aware memory design points, we plan to explore the design space of alternative potential DRAM power/performance features (i.e., the number of available power states, the power consumption of each power state, the power consumption and delay of transitions between power states, etc.).
Acknowledgments This work was supported in part by NSF Grants CCR-0082914, EIA-99-72879, EIA-99-86024, NSF CAREER Award MIP-97-02547, Duke University, and equipment donations from Intel and Microsoft. We thank Professor Christoph Lindemann and his group at the University of Dortmund for providing the DSPNexpress software.
References [1] G. Chiola, M. Ajmone Marsan, G. Balbo, and G. Conte. Generalized stochastic petri nets: A definition at the net level and its implications. IEEE Transactions on Software Engineering, 19(2):89–107, February 1993. 132 [2] V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. J. Irwin. DRAM Energy Management Using Software and Hardware Directed Power Mode Control. In HPCA 2001, January 2001. 131 [3] Xiaobo Fan, Carla S. Ellis, and Alvin R. Lebeck. Memory controller policies for dram power management. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), August 2001. 131, 135, 136 [4] D. Lammers. IDF: Mobile Rambus spec unveiled. EETimes Online, February 1999. //www.eetimes.com/story/OEG19990225S0016. 130 [5] Alvin R. Lebeck, Xiaobo Fan, Heng Zeng, and Carla S. Ellis. Power aware page allocation. In Proceedings of Ninth International Conference on Architectural Support for Programming Languages and Operating System (ASPLOS IX), November 2000. 138 [6] Christoph Lindemann. Performance Modelling with Deterministic and Stochastic Petri Nets. John Wiley and Sons, 1998. 131 [7] M. Ajmone Marsan, G. Balbo, and G. Conte. Modeling with Generalized Stochastic Petri Nets. J. Wiley, 1995. 132 [8] M. Ajmone Marsan and G. Chiola. On petri nets with deterministic and exponential transition firing times. In Proceedings of the 7th European Workshop on application and Theory of Petri Nets, pages 151–165, June 1986. 132 [9] T. Murate. Petri nets: properties, analysis, and applications. Proceedings of IEEE, 77(4):541–580, April 1989. 132 [10] J. L. Peterson. Petri Net Theory and the Modeling of Systems. Prentice-Hall, 1981. 132
SimDVS: An Integrated Simulation Environment for Performance Evaluation of Dynamic Voltage Scaling Algorithms

Dongkun Shin, Woonseok Kim, Jaekwon Jeon, Jihong Kim, and Sang Lyul Min

School of Computer Science and Engineering, Seoul National University, Seoul, Korea
Abstract. We describe SimDVS, a unified simulation environment for evaluating dynamic voltage scaling (DVS) algorithms, and present the evaluation results for three case studies using SimDVS. In recent years, DVS has received a lot of attention as an effective low-power design technique, and many research groups have proposed various DVS algorithms. However, these algorithms have not been quantitatively evaluated, making it difficult to understand the performance of a new DVS algorithm objectively relative to the existing DVS algorithms. The SimDVS environment provides a framework for objective performance evaluations of various DVS algorithms. Using SimDVS, we compare the energy efficiency of the intra-task DVS algorithm and inter-task DVS algorithms, and evaluate various heuristics for a hybrid DVS approach. We also show that more efficient DVS algorithms may incur higher system overheads, degrading the overall energy efficiency of the DVS algorithms.
1 Introduction
For battery-operated mobile embedded devices such as personal digital assistants (PDAs) and cellular phones, power consumption is an important design constraint. As an effective low-power design technique, dynamic voltage scaling (DVS) has recently received a lot of attention. For example, several commercial variable-voltage microprocessors [1, 2, 3] were introduced in the last two years, and many DVS algorithms applicable to these microprocessors [4, 5, 6, 7, 8, 9, 10, 11, 12] have been proposed, especially targeting hard real-time systems. Although the proposed DVS algorithms are shown to be effective in reducing the energy consumption of a target system under their own experimental scenarios, these algorithms have not been quantitatively evaluated against each other under a unified evaluation framework. The lack of comprehensive evaluation studies makes it difficult to understand the energy efficiency of a new DVS algorithm relative to that of the existing DVS algorithms. In this paper, we describe SimDVS, an
This work was supported by grant No. R01-2001-00360 from the Korea Science & Engineering Foundation. Woonseok Kim and Sang Lyul Min were supported in part by the Ministry of Science and Technology under the National Research Laboratory program.
integrated simulation environment for DVS algorithms, which can be used to compare the energy efficiency of various DVS algorithms.

1.1 DVS Algorithms for Hard Real-Time Systems
For hard real-time systems, DVS algorithms can be categorized into two classes, inter-task DVS (InterDVS) and intra-task DVS (IntraDVS). InterDVS algorithms determine the supply voltage on a task-by-task basis, while IntraDVS algorithms adjust the supply voltage within an individual task boundary. InterDVS algorithms are further divided depending on the scheduling policy employed, say, the earliest-deadline-first (EDF) or rate-monotonic (RM) scheduling policies. Table 1 summarizes the recent DVS algorithms proposed for hard real-time systems, six EDF InterDVS algorithms and two RM InterDVS algorithms. InterDVS algorithms under the same scheduling policy differ mainly in how slack times are estimated. For example, lppsEDF conservatively estimates available slack times at scheduling points. On the other hand, lpSHE employs an aggressive technique in estimating the available slack times. IntraDVS algorithms can be divided into two subcategories, path-based IntraDVS algorithms [11] and stochastic IntraDVS algorithms [12], depending on how slack times are estimated and how speeds are adjusted. In the path-based IntraDVS algorithms, the voltage and clock speed are statically determined by computing the differences between the execution cycles of the predicted execution path (e.g., the worst case execution path (WCEP)) and the execution cycles of the execution path actually taken. When the actual execution deviates from the predicted execution path (say, by a branch instruction), the change in workload is detected and the clock speed is adjusted. The stochastic IntraDVS algorithms adjust the execution speed within a task boundary based on the probability density function of the execution times of a task. This method is based on the observation that, from the energy consumption point of view, it is better to execute with a lower speed in the beginning and increase the execution speed later when necessary. Under the stochastic method, the clock
Table 1. Recent DVS algorithms proposed for hard real-time systems

Category   Scheduling Policy   DVS Algorithms
InterDVS   EDF                 lppsEDF [7], ccEDF [9], laEDF [9], DRA [8], AGR [8], lpSHE [13]
           RM                  lppsRM [7], ccRM [9]
IntraDVS   Path-based Method   intraShin [11]
           Stochastic Method   intraGruian [12]
speed is raised at specific times, regardless of the execution paths taken. Unlike the path-based IntraDVS algorithms, which can utilize all the slack times from the task execution in scaling the execution speed, the stochastic IntraDVS algorithms may leave some slack times unused when the actual execution takes a short execution path (other than the WCEP).
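As an illustration of the path-based adjustment described above (this is only a sketch of the general idea, not the exact intraShin algorithm; all names and numbers are hypothetical), the clock speed can be lowered in proportion to the remaining predicted execution cycles whenever a taken branch uncovers slack:

    # Illustrative sketch of path-based intra-task speed scaling.
    def rescale_speed(current_speed_hz, rpec_before, rpec_taken):
        """Return the new clock speed at a voltage scaling edge.

        current_speed_hz: clock speed before the branch
        rpec_before:      remaining predicted execution cycles before the branch
        rpec_taken:       remaining predicted execution cycles on the taken path
        """
        if rpec_taken >= rpec_before:
            return current_speed_hz                     # no slack uncovered
        # The remaining time budget is unchanged, so the speed (and hence the
        # voltage) can be reduced by the ratio of the remaining work.
        return current_speed_hz * rpec_taken / rpec_before

    # Example: a branch that skips half of the predicted remaining cycles
    # allows the clock to be roughly halved.
    print(rescale_speed(200e6, 1_000_000, 500_000))     # 100000000.0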
1.2 Our Contribution
The SimDVS simulation environment was developed to support quantitative performance analysis and evaluation of DVS algorithms by providing a unified evaluation framework. The current version of SimDVS supports all the DVS algorithms listed in Table 1. In addition to the DVS algorithms, SimDVS includes utility programs that are useful for DVS comparative studies. For example, SimDVS provides a tool that automatically generates a task set with specific task characteristics. In order to demonstrate the effectiveness of SimDVS, we present three case studies in this paper. First, we compare the energy efficiency of an IntraDVS algorithm and InterDVS algorithms. Second, we evaluate whether hybrid DVS algorithms (that adopt both the IntraDVS approach and the InterDVS approach) can perform better than pure IntraDVS algorithms or pure InterDVS algorithms. Third, we show that more efficient DVS algorithms may experience higher system overheads (e.g., more context switches), possibly degrading the overall energy efficiency of the DVS algorithms. The rest of the paper is organized as follows. In Section 2, we present an overview of SimDVS. A detailed description of the main modules of SimDVS is given in Section 3. We present three case studies in Section 4 and conclude with a summary in Section 5.
2 Overview of SimDVS

2.1 Design Goals
In order to effectively evaluate various DVS algorithms under a variety of simulation scenarios, SimDVS was architected to meet the following design goals:

1. New DVS algorithms based on different DVS approaches should be easily integrated into SimDVS.
2. Variations in simulation scenarios should be easily supported. Simulation scenarios can differ, for example, in task workloads, variations in executed paths, and task set specifications.
3. Different types of variable-voltage processors should be easily supported.

Fig. 1 shows three examples of using SimDVS for evaluating DVS algorithms. As shown in Fig. 1(a), SimDVS can be used to compare the energy efficiency of different DVS algorithms using the same task set specification under the same machine configuration. Using this evaluation, one can identify the best DVS algorithm for the given application on the given hardware platform.
Fig. 1. Three SimDVS usage examples: (a) DVS algorithm comparison, (b) performance variations with different task sets, and (c) performance variations with different machine configurations
SimDVS can also be used to evaluate a given DVS algorithm under various evaluation conditions. For example, one can evaluate how the energy efficiency of the DVS algorithm changes with different task sets (as shown in Fig. 1(b)). A similar evaluation can be performed with different machine configurations (as shown in Fig. 1(c)). If properly instrumented, SimDVS can collect information on various system events or performance parameters other than energy consumption. These extra profiling data are useful for understanding how DVS algorithms affect the general behavior of a target system. The current version of SimDVS can collect the frequency of speed changes and the number of context switches.

2.2 Architectural Organization
Fig. 2 shows an architectural overview of SimDVS. SimDVS consists of three main modules: 1) the InterDVS Module, 2) the IntraDVS Module and 3) the IntraDVS Preprocessing Module. SimDVS takes two inputs: 1) a task set specification or a DVS-aware control flow graph (CFG) of an input binary program, respectively, for an InterDVS algorithm or an IntraDVS algorithm, and 2) a target machine specification. As outputs, the energy consumption of the input task(s) is estimated. If required, other profiling data are also collected. The InterDVS Module is responsible for the whole operation of SimDVS. It simulates a given task set under a selected scheduling policy using a slack estimation and distribution heuristic. The IntraDVS Module simulates IntraDVS algorithms using the Intra-Task Simulator. The input to the IntraDVS Module is pre-processed by the tools available in the IntraDVS Preprocessing Module. For faster simulations of path-based IntraDVS algorithms, we simulate the CFG of the input program instead of the input program itself. For a comparison study, the current version of SimDVS supports ten DVS algorithms listed in Table 1.
Fig. 2. An overview of the SimDVS simulation environment
3 Main Modules of SimDVS
In this section, we describe the main modules of SimDVS (shown in Fig. 2) in detail.

3.1 SimDVS Inputs
The task set specification describes various task set characteristics that affect the energy efficiency of a DVS algorithm, while the machine specification describes the machine characteristics that affect the energy efficiency of a DVS algorithm.

Task Set Specification  The energy efficiency of DVS algorithms can be affected by the characteristics of a given task set, such as the number of tasks, the task execution time distributions, and the worst case processor utilization (WCPU). Therefore, when evaluating DVS algorithms, it is necessary to understand how the performance of the DVS algorithms varies depending on task sets with different characteristics. In SimDVS, the characteristics of a task set T = (τ1, τ2, ..., τn) are specified in a script file that contains the following information on each task τi of the task set T:

– IDi: the identifier of τi.
– Pi: the period of τi.
– Di: the deadline of τi.
– WCETi: the worst case execution time (WCET) of τi.
– BCETi: the best case execution time (BCET) of τi.
– Distributioni: the execution time distribution of τi.
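The script file syntax itself is not shown here; purely as an illustration, the per-task fields above could be rendered as follows (all identifiers and numbers are hypothetical):

    # Hypothetical rendering of a task set specification with the fields above.
    task_set = [
        {"id": 1, "period": 50,  "deadline": 50,  "wcet": 10, "bcet": 2,
         "distribution": "uniform"},   # execution times drawn between BCET and WCET
        {"id": 2, "period": 120, "deadline": 120, "wcet": 30, "bcet": 15,
         "distribution": "normal"},
    ]
    # Worst case processor utilization (WCPU) of the set:
    wcpu = sum(t["wcet"] / t["period"] for t in task_set)   # 0.2 + 0.25 = 0.45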
In order to automatically generate a task set with specific characteristics, the Task Set Generator is used. The Task Set Generator takes the following information as inputs and generates as an output the corresponding script file for a task set satisfying the requirements:

– The number of tasks.
– The range and variation of periods.
– The ratio between BCETi and WCETi.
– The worst case processor utilization of a task set.
– The scheduling policy (e.g., EDF or RM).
The Task Set Generator creates only schedulable task sets under the specified scheduling policy. For EDF scheduling, the task set is schedulable if its worst case processor utilization is lower than or equal to 1.0. For RM, the schedulability is verified using the exact schedulability condition described in [14].

Machine Specification  A machine specification includes the available voltage and clock levels of a target variable-voltage processor. The machine specification reflects the characteristics of the target variable-voltage processor. Using SimDVS, with the target task sets and DVS algorithms fixed, DVS-related architectural exploration is possible when designing variable-voltage processors. By default, SimDVS uses the machine specification described in [15]. If necessary, other machine specifications are easily supported. The current version of SimDVS includes the specifications of Intel's XScale [3], AMD's K6-2+ [2], and Transmeta's Crusoe [1] processors.
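To make the schedulability tests used by the Task Set Generator concrete, the sketch below shows the EDF utilization test and one standard exact fixed-priority test (response-time analysis, which gives the same verdict as the exact condition of [14] when deadlines equal periods). This is only an illustration; the task values are made up.

    import math

    def edf_schedulable(tasks):
        """EDF: schedulable iff the worst case processor utilization <= 1.0."""
        return sum(c / p for (p, c) in tasks) <= 1.0

    def rm_schedulable(tasks):
        """RM: exact test via response-time analysis (deadline = period)."""
        tasks = sorted(tasks, key=lambda t: t[0])        # rate-monotonic priority order
        for i, (p_i, c_i) in enumerate(tasks):
            r = c_i
            while True:
                r_next = c_i + sum(math.ceil(r / p_j) * c_j for (p_j, c_j) in tasks[:i])
                if r_next > p_i:
                    return False                          # deadline missed
                if r_next == r:
                    break                                 # response time converged
                r = r_next
        return True

    tasks = [(50, 10), (120, 30)]                         # (period, WCET) pairs
    print(edf_schedulable(tasks), rm_schedulable(tasks))  # True True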
3.2 InterDVS Module
The InterDVS Module, responsible for scheduling tasks, plays the role of a real-time scheduler in a hard real-time system. It takes as an input a task specification for periodic tasks and simulates each task based on the specified scheduling policy (e.g., RM or EDF). To simulate an InterDVS scheduling algorithm, the InterDVS Module consists of two submodules, one for estimating available slack times and the other for distributing the slack times to each task. The slack estimation is done by the Slack Estimation Module, which computes the total available time for the scheduled task, while the slack distribution is done by the Task Execution Module, which determines the operating speed for the scheduled task and simulates the execution of the task. For a new InterDVS algorithm, these two submodules should be re-defined.

Slack Estimation Module  The implementation of this module differs depending on how the target InterDVS algorithm estimates the available slack times. This module is integrated with the InterDVS Module using the getAvailableTime function. This function receives the task identifier and the start time of the task as inputs, and returns the total available time for the task. Some
DVS algorithms (e.g., [12]) may need off-line pre-processing steps for computing total available times during run time. For these algorithms, the Slack Estimation Module can take off-line slack analysis results as an additional input.

Task Execution Module  This module has two roles. First, it determines the voltage level and clock speed based on the available time for the current task and the WCET of the task. Although most existing DVS algorithms employ a greedy approach in distributing the available slack times, if a DVS algorithm adopts a different slack distribution method, it can be supported in this module. Using the available voltage levels specified in the machine specification input, this module sets the voltage level and clock speed. With the assigned clock speed, the activated task instance consumes the entire assigned time interval if its execution takes the WCEP. Second, this module simulates the task execution itself. In this module, a real workload for each task is generated based on the input workload variation factors (i.e., Distributioni), and the unused time as well as the elapsed time is calculated out of the available time interval. This module also sends the execution time and speed information to the Energy Estimation Module. When an IntraDVS algorithm is used, this module calls the Intra-Task Simulator of the IntraDVS Module to simulate IntraDVS.

Energy Estimation Module  This module takes the timing and speed information from the Task Execution Module, and computes the energy consumption of the current task execution using the current machine specification. Energy consumption is calculated using the equation E ∝ Ncycle · Vdd^2, where Ncycle and Vdd denote the number of execution cycles and the supply voltage, respectively.
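A minimal sketch of the greedy speed selection and the energy metric just described (the machine levels and numbers are hypothetical, not taken from any of the processor specifications above):

    # Pick the slowest available level that still fits the WCET into the
    # available time, then charge energy proportional to Ncycle * Vdd^2.
    def pick_level(machine_levels, wcet_cycles, available_time_s):
        """machine_levels: list of (clock_hz, vdd) pairs for the target processor."""
        for clock_hz, vdd in sorted(machine_levels):      # slowest level first
            if wcet_cycles / clock_hz <= available_time_s:
                return clock_hz, vdd
        return max(machine_levels)                        # fall back to full speed

    def relative_energy(executed_cycles, vdd):
        """E is proportional to Ncycle * Vdd^2 (the constant factor is omitted)."""
        return executed_cycles * vdd ** 2

    levels = [(150e6, 0.8), (400e6, 1.1), (600e6, 1.3)]
    print(pick_level(levels, 8_000_000, 0.05))            # (400000000.0, 1.1)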
3.3 IntraDVS and Its Preprocessing Module
In order to support the simulation of the IntraDVS algorithms, voltage scaling points within a task boundary should be determined during the off-line phase. The submodules in the IntraDVS Preprocessing Module are responsible for making intra-task voltage scaling decisions, which are passed to the IntraDVS Module using a DVS-aware CFG or a Speed Transition Table. To reflect the execution behavior of real applications, the CFG Generator produces a CFG from SimpleScalar 2.0 [16] binary programs. Each node of the CFG has several node attributes (e.g., the number of instructions in a basic block) that are necessary for simulation.

Voltage Scaler  This module is used for the path-based IntraDVS algorithms. It takes the CFG of a target application, and extracts the timing information from the CFG. By analyzing the CFG, this module computes the remaining predicted execution times (RPETs) for each basic block. Based on the RPETs computed, the voltage scaling edges in the CFG are selected using the algorithm described in [11]. As an output, this module generates the DVS-aware CFG which includes the voltage scaling information.
Speed Transition Table  To simulate the stochastic IntraDVS algorithm, the stochastic data, i.e., the cumulative distribution function of task execution times, is either provided by the user or collected using profiling runs. Based on the stochastic data, the Speed Transition Table, which describes when and to which speed the execution speed is changed, is constructed.

Intra-Task Simulator  This module simulates the task execution using a given DVS-aware CFG and a Speed Transition Table for the path-based IntraDVS algorithms and the stochastic IntraDVS algorithm, respectively. During simulation, it adjusts the voltage and clock speed based on the voltage scaling information specified as part of the input CFG or the Speed Transition Table. To simulate the path-based IntraDVS algorithms, the Intra-Task Simulator requires information on the execution path actually taken. The Intra-Task Simulator generates an execution path trace from the input CFG by randomly selecting one of two branching edges and setting the number of loop iterations to a random number between 0 and Nmaxloop, where Nmaxloop is the maximum number of loop iterations. To change the execution times of the selected execution paths, SimDVS controls two parameters, α and β, which can be specified by the user. α represents the probability of selecting the branching edge with a longer remaining time, while β indicates the ratio of the average number of loop iterations to the maximum number of loop iterations for each loop. For example, if β is set to 0.5, the average number of loop iterations becomes close to half of the maximum number of loop iterations. When both α and β are set to 1.0, the execution time of the selected path is close to the WCET. Fig. 3 illustrates how the execution cycles of a task change when α and β vary. As α and β increase, the execution time tends to increase. Once the execution time t for the simulated task is determined by the Task Execution Module (in the InterDVS Module), the Intra-Task Simulator generates an execution path trace whose execution time is close to the task execution time t. To help the path generation step, the IntraDVS Preprocessing Module maintains a database of execution path traces with their (α, β) values and corresponding execution times.
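One way to realize the α and β parameters during path generation is sketched below (the paper does not specify the exact sampling scheme, so this is only an assumed illustration that matches the stated semantics of α and β):

    import random

    def choose_branch(longer_edge, shorter_edge, alpha):
        """alpha = probability of taking the branch with the longer remaining time."""
        return longer_edge if random.random() < alpha else shorter_edge

    def loop_iterations(n_max_loop, beta):
        """Draw an iteration count in [0, n_max_loop] whose mean is beta * n_max_loop."""
        if beta <= 0.5:
            low, high = 0, 2 * beta * n_max_loop
        else:
            low, high = (2 * beta - 1) * n_max_loop, n_max_loop
        return round(random.uniform(low, high))

With this scheme, β = 0.5 yields an average iteration count of half the maximum, and α = β = 1.0 drives every trace toward the WCEP, as described above.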
4 Case Studies
In this section, we present three case studies that demonstrate how SimDVS can be used in evaluating various DVS algorithms.

4.1 Performance Evaluation of InterDVS and IntraDVS
First, we compared the energy efficiency of InterDVS algorithms and an IntraDVS algorithm. For the evaluation study, we compared the energy efficiency of four EDF InterDVS algorithms, the lppsEDF, ccEDF, laEDF and DRA algorithms, with that of the intraShin algorithm. (Before the intraShin algorithm is applied to each task instance, the time slot for each task instance is assigned by the off-line InterDVS algorithm described in [9].) (These algorithms are listed in
Table 1.) As test cases, we used two different task sets, A and B. Task set A is homogeneous (i.e., the tasks in A have similar periods and WCETs), while task set B is heterogeneous (i.e., the tasks in B have large variations in their periods and WCETs). Fig. 4 shows the normalized energy consumption of the InterDVS algorithms over that of intraShin. Except for a few cases, the intraShin algorithm outperforms the InterDVS algorithms tested. We can observe that the relative performance of the InterDVS algorithms worsens as the worst case processor utilization gets smaller. This is because, in the InterDVS algorithms, unused slack times increase as the WCPU becomes smaller. On the other hand, intraShin utilizes all the slack times, resulting in higher energy reductions. The DRA algorithm's performance differs significantly between the two task sets. As shown in Fig. 4(a), DRA outperforms both laEDF and intraShin when task set A is used. However, when task set B is used, Fig. 4(b) shows that DRA's performance is inferior to that of laEDF and intraShin. This is because the slack estimation method used in DRA does not work well when the task utilizations are not uniform.

Fig. 3. Changes in the number of execution cycles when α and β vary

4.2 Performance Evaluation of Hybrid Methods
We have compared the energy efficiency of the InterDVS algorithms and the IntraDVS algorithm in the previous section. However, there are cases where pure IntraDVS or pure InterDVS does not work well. Fig. 5 illustrates such cases. In Fig. 5(a), when an InterDVS algorithm is used, the slack time generated by task τ1 cannot be used by task τ2 because the release time of task τ2 is the same as the deadline of task τ1. This slack time could be used if task τ1 were scheduled using an IntraDVS algorithm. On the other hand, in Fig. 5(b), when an IntraDVS algorithm is used, all the slack times generated by task τ1 are used by task τ1 itself. However, this slack distribution is unbalanced. If we
Fig. 4. Normalized energy consumption of InterDVS algorithms over intraShin: (a) task set A, (b) task set B
used InterDVS, we could get a more efficient schedule by distributing the slack time of τ1 to task τ2. In this section, we investigate whether hybrid DVS algorithms (HybridDVS algorithms) with both IntraDVS and InterDVS features perform better than pure IntraDVS algorithms or pure InterDVS algorithms. Although both intraShin and intraGruian can be used for performance comparison, we use intraShin as the base IntraDVS algorithm. This is because intraShin is less likely to generate dynamic slack times, thus making the distinctions among the different HybridDVS methods clearer. HybridDVS algorithms select either the intra mode or the inter mode when slack times are available during the execution of the current task. In the inter mode, the slack time is used not for the current task but for the following tasks. Therefore, the speed of the current task is not changed by the slack time produced by the current task. In the intra mode, all the slack time is used for the current task, reducing its own execution speed.
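A minimal sketch of the two modes just described (the function and parameter names are illustrative, not part of SimDVS):

    def apply_slack(mode, base_time_s, slack_s):
        """Return (time budget for the current task, slack passed to later tasks)."""
        if mode == "intra":
            # All slack is consumed by the current task: a larger time budget
            # lets it run at a proportionally lower clock speed (and voltage).
            return base_time_s + slack_s, 0.0
        else:  # "inter"
            # The current task keeps its original budget and speed; the slack
            # is handed back to the scheduler for the following tasks.
            return base_time_s, slack_s

    budget_s, carried_s = apply_slack("intra", 0.02, 0.01)
    required_speed_hz = 10_000_000 / budget_s   # about 333 MHz instead of 500 MHz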
Fig. 5. Cases where pure InterDVS or pure IntraDVS performs poorly: (a) the case where InterDVS cannot utilize the slack time because deadline(τ1) = release(τ2); (b) the case where the slack distribution is not balanced due to IntraDVS
Table 2. Heuristics for HybridDVS algorithms

Heuristic  Description
H0         always uses the inter mode (i.e., the pure InterDVS approach).
H1         uses the inter mode as a default but uses the intra mode if no activated task exists.
H2         uses the inter mode at first, but changes into the intra mode when the unused slack time is more than a predefined amount of slack time.
H3         alternates the intra mode and the inter mode, keeping the balance of slack consumption in each mode.
H4         uses the intra mode at first, but changes into the inter mode when the current task has used a predefined amount of slack time.
H5         always uses the intra mode.
Table 2 summarizes the six heuristics for HybridDVS algorithms we consider in this section. The heuristics differ in how close they are to the pure IntraDVS approach or the pure InterDVS approach. H0 is identical to the pure InterDVS approach. H1 and H2 are closer to the pure InterDVS approach, while H4 and H5 are closer to the pure IntraDVS approach. H1 uses the intra mode only when there is no following task which can utilize the slack time from the current task. We have evaluated the six heuristics in Table 2 with the six InterDVS algorithms in Table 1. Fig. 6 shows the energy efficiency of the HybridDVS algorithms relative to the power-down method for varying WCPUs. In the power-down method, active tasks execute at full speed. When there is no active task, the system enters the power-down mode. The HybridDVS algorithms H1, H2, H3 and H4 generally reduce the energy consumption by 5∼20% relative to the pure DVS algorithms, H0 and H5. Fig. 6 shows that the energy efficiency of the HybridDVS algorithms is strongly affected by the efficiency of the on-line slack estimation method used by each InterDVS algorithm. In laEDF, DRA, AGR, and lpSHE, where slack times are aggressively identified, it is a good idea that some (or all) of the slack time produced by the current task be passed to the following tasks (as in Fig. 5(b)). However, in lppsEDF/RM and ccEDF/RM, where slack times are less aggressively identified, there are many cases where the current slacks are wasted unless used by the current task (as in Fig. 5(a)). In this case, it is better for the current task to utilize most of the slack time generated. Therefore, if a HybridDVS algorithm is based on laEDF, DRA, AGR, or lpSHE, H1 and H2 are better choices. On the other hand, for lppsEDF/RM and ccEDF/RM, H4 and H5 are better choices. Fig. 7 shows the spectrum of HybridDVS heuristics, and summarizes well-matching hybrid heuristics for each InterDVS algorithm. For example, if laEDF is extended to a HybridDVS algorithm, H1 is a good candidate for a matching hybrid heuristic. However, if lppsRM is modified into a hybrid DVS algorithm, H4 is a better hybrid heuristic.
Fig. 6. Energy efficiency comparison results of the HybridDVS algorithms: (a) lppsRM, (b) ccRM, (c) lppsEDF, (d) ccEDF, (e) AGR, (f) laEDF
Fig. 7. Spectrum of HybridDVS heuristics (H0–H5) between the pure InterDVS and pure IntraDVS approaches, annotated with the matching InterDVS algorithms (laEDF, AGR, DRA, ccEDF/ccRM, and lppsEDF/lppsRM)
4.3 Overhead Measurement of InterDVS Algorithms
In designing an InterDVS algorithm, it is common to assume that the voltage scaling overhead is negligible. However, since efficient InterDVS algorithms generally lengthen the active execution intervals of tasks, InterDVS may affect other system performance factors, possibly causing negative side effects on the overall energy efficiency. In this section, using SimDVS, we evaluate how InterDVS algorithms affect the number of context switches. In particular, we investigate whether it increases significantly. The example task set in Table 3 illustrates that DVS can increase the number of context switches due to preemption. As shown in Fig. 8(a), the example task set can be scheduled at the maximum frequency fmax under the EDF scheduling policy. In this case, there is no preemption. When the same task set is scheduled by an InterDVS algorithm, the schedule may look like the one shown in Fig. 8(b). If we assume that there is no system overhead or energy overhead due to extra preemption, the schedule in Fig. 8(b) consumes less energy than the one in Fig. 8(a) because it operates at the slower speed flower. However, the energy-efficient schedule in Fig. 8(b) increases the number of context switches due to extra preemption. For example, since the execution time of τ1 is increased by the InterDVS algorithm, τ1 cannot complete its execution before t = 2, and thus it is preempted by the second instance of τ2. Fig. 9 shows how the number of context switches changes with the InterDVS algorithms. The number of context switches is measured while varying the number of tasks and the number of voltage levels available in the target machine. The results are normalized by the number of context switches under the power-down method. The laEDF, DRA and lpSHE algorithms show high rates of increase in preemption. When a task is preempted, the number of cache misses and memory accesses may
Table 3. An example task set

Task  Period  WCET  Actual Execution Time
τ1    6       1     1
τ2    2       1     0.5
Fig. 8. Extra preemption due to DVS: (a) schedule at the maximum frequency fmax, (b) schedule at the slower speed flower

Fig. 9. Changes in the number of context switches: (a) varying the number of tasks, (b) varying the number of voltage levels
increase as well. Therefore, more energy is consumed in the memory and bus. Furthermore, as the number of context switches increases, the live length of a task (the time duration between the arrival time and the completion time of a task instance) is lengthened. If the live length of a task is increased, more memory blocks are simultaneously required, increasing the probability of page faults. Our experimental results indicate that when the preemption cost is considered, DVS algorithms need to be evaluated differently. For example, although lppsEDF is less energy-efficient than other InterDVS algorithms under the assumption of no context switching overhead, it might be more efficient when a context switch consumes a significant amount of energy, because tasks under lppsEDF preempt each other less frequently.
5 Conclusion
In this paper, we have described SimDVS, a unified simulation environment for performance comparison of dynamic voltage scaling algorithms, and demonstrated its effectiveness as a DVS evaluation tool using three case studies. Based
on the modular design structure, SimDVS supports both IntraDVS algorithms and InterDVS algorithms and allows an easy integration of new DVS algorithms such as HybridDVS algorithms. Using SimDVS, we compared the energy efficiency of the IntraDVS algorithm and the InterDVS algorithms. Although the IntraDVS algorithm generally outperformed the InterDVS algorithms, the relative energy efficiency was dependent on the task set characteristics. As the second case study, we also evaluated various heuristics for the HybridDVS algorithms, which use both IntraDVS and InterDVS features. The heuristics close to the pure InterDVS algorithm worked better when they were based on the aggressive InterDVS algorithms, while the heuristics close to the pure IntraDVS algorithm performed better when they were based on the non-aggressive InterDVS algorithms. Finally, we showed that more efficient DVS algorithms may suffer from more system overhead, such as context switches.
Acknowledgements

The RIACT at Seoul National University provided research facilities for this study.
References [1] M. Fleischmann. Crusoe Power Management: Reducing The Operating Power with LongRun. In Proc. of HotChips 12 Symposium, 2000. 141, 146 [2] Advanced Micro Devices, Inc. AMD PowerNow Technology, 2000. 141, 146 [3] Intel, Inc. The Intel(R) XScale(TM) Microarchitecture Technical Summary, 2000. 141, 146 [4] F. Yao, A. Demers, and S. Shenker. A Scheduling Model for Reduced CPU Energy. In Proc. of 36th Annual Symposium on Foundations of Computer Science, pages 374–382, 1995. 141 [5] I. Hong, G. Qu, M. Potkonjak, and M. B. Srivastava. Synthesis Techniques for Low-Power Hard Real-Time Systems on Variable Voltage Processor. In Proc. of Real-Time Systems Symposium, pages 178–187, 1998. 141 [6] T. Ishihara and H. Yasuura. Voltage Scheduling Problem for Dynamically Variable Voltage Processors. In Proc. of International Symposium On Low Power Electronics and Design, pages 197–202, 1998. 141 [7] Y. Shin, K. Choi, and T. Sakurai. Power Optimization of Real-Time Embedded Systems on Variable Speed Processors. In Proc. of International Conference on Computer-Aided Design, pages 365–368, 2000. 141, 142 [8] H. Aydin, R. Melhem, D. Mosse, and P. M. Alvarez. Dynamic and Aggressive Scheduling Techniques for Power-Aware Real-Time Systems. In Proc. of RealTime Systems Symposium, 2001. 141, 142 [9] P. Pillai and K. G. Shin. Real-Time Dynamic Voltage Scaling for Low-Power Embedded Operating Systems. In Proc. of 18th ACM Symposium on Operating Systems Principles (SOSP’01), 2001. 141, 142, 148
[10] G. Quan and X. Hu. Energy Efficient Fixed-Priority Scheduling for Real-Time Systems on Variable Voltage Processors. In Proc. of Design Automation Conference, pages 828–833, 2001. 141 [11] D. Shin, J. Kim, and S. Lee. Intra-Task Voltage Scheduling for Low-Energy Hard Real-Time Applications. IEEE Design and Test of Computers, 18(23):20– 30, Mar. 2001. 141, 142, 147 [12] F. Gruian. Hard Real-Time Scheduling Using Stochastic Data and DVS Processors. In Proc. of International Symposium on Low Power Electronics and Design, pages 46–51, 2001. 141, 142, 147 [13] W. Kim, J. Kim, and S. L. Min. A Dynamic Voltage Scaling Algorithm for Dynamic-Priority Hard Real-Time Systems Using Slack Time Analysis. In Proc. of Design, Automation and Test in Europe (DATE’02), pages 788–794, 2002. 142 [14] J. Lehoczky, L. Sha, and Y. Ding. The Rate Monotonic Scheduling Algorithm: Exact Characterization and Average Case Behavior. In Proc. of Real-Time Systems Symposium, pages 166–171, 1989. 146 [15] T. Burd, T. Pering, A. Stratakos, and R. Brodersen. A Dynamic Voltage Scaled Microprocessor System. In Proc. of International Solid-State Circuits Conference, pages 294–295, 2000. 146 [16] D. Burger and T. M. Austin. The SimpleScalar Tool Set, version 2.0. Technical Report 1342, University of Wisconsin-Madison, CS Department, Jun. 1997. 147
Application-Supported Device Management for Energy and Performance

Taliver Heath, Eduardo Pinheiro, and Ricardo Bianchini

Department of Computer Science, Rutgers University, Piscataway, NJ 08854-8019
{taliver,edpin,ricardob}@cs.rutgers.edu
Abstract. Energy conservation without performance degradation is an important goal for battery-operated computers, such as laptops and hand-held assistants. In this paper we determine the potential benefits of application-supported device management for optimizing energy and performance. In particular, we consider application transformations that increase device idle times and inform the operating system about the length of each upcoming period of idleness. We use modeling and experimentation to assess the potential energy and performance benefits of this type of application support for a laptop disk. Our main modeling results show that these benefits are significant. Our experimental results demonstrate that unless applications are transformed, they cannot accrue any of the predicted benefits. Overall, we find that the transformations can reduce disk energy consumption by as much as 89% with only a small degradation in performance.
1 Introduction
Recent years have seen a substantial increase in the amount of research directed towards battery-operated computers. The main goal of this research is to develop hardware and software that can improve energy efficiency and, as a result, lengthen battery life. The most common approach to achieving energy efficiency is to put idle resources or entire devices in low-power states until they have to be accessed again. The transition to a lower power state usually occurs after a period of inactivity (an inactivity threshold), and the transition back to active state usually occurs on demand. Unfortunately, the transitions to and from the low-power state can consume significant time and energy. Nevertheless, this strategy works well when there is enough idle time to justify incurring such costs. Previous studies of device control for energy efficiency have shown that some workloads do exhibit relatively long idle times. However, these studies were limited to interactive applications (or their traces), slow microprocessors, or both. Recent advances in fast, low-power microprocessors and their use in battery-operated computers are increasing the number of potential applications for these computers. For instance, non-interactive applications, such as movie playing, decompression, or encryption, are now commonly run on laptop computers. For
most of these non-interactive applications, fast processors reduce device idle times, which in turn reduce the potential for energy savings. Furthermore, incurring re-activation delays in the critical path of the microprocessor now represents a more significant overhead (in processor cycles), as re-activation times are not keeping pace with microprocessor speed improvements. Thus, to maximize our ability to conserve energy without degrading performance under these new circumstances, we need ways to increase device idle times, eliminate inactivity thresholds, and start re-activations in advance of device use. Device idle times can be increased in several ways and at many levels, such as by energy-aware scheduling or prefetching in the operating system, by performing loop transformations during compilation, etc. The set of possibilities for achieving the other two goals is more limited. In fact, those goals can only be achieved with fairly accurate predictions of future application behavior, which can be produced consistently with programmer or compiler involvement. For these reasons, we advocate that programmers or compilers, i.e. applications, should be directly involved in device control in single-user, battery-operated systems such as laptops. To demonstrate the benefits of this involvement, in this paper we evaluate the effect of transforming explicit I/O-based applications to increase their idle times. These transformations can be performed by a sophisticated compiler, but can also be implemented by the programmer after a sample profiling run of the application. For greater benefits, the transformations must involve an approximate notion of the original and target idle times. Thus, we also evaluate the effect of having the application report the duration of each idle period (hereafter referred to as a CPU run-length, or simply run-length) to the operating system. With this information, the operating system can apply more effective device control policies. (For simplicity, we focus on the common laptop or hand-held scenario in which only one application is ready to run at a time; other applications, such as editors or Web browsers, are usually blocked waiting for user input.) In particular, we study two kernel-level policies, direct deactivation and pre-activation, that rely on run-length information to optimize energy and performance. We develop simple analytical models that describe the isolated and combined energy and performance benefits of program transformations and these energy management policies. The models allow a quick assessment of these benefits, as a function of device and application characteristics such as the overhead of device re-activation, and the distribution of run-lengths. Essentially, the models can assess the implications of any application and device with more than one power state. As a concrete example of the use of the models, we apply them towards evaluating application-supported control of the disk of a commercial laptop. Interestingly, our experiments with this disk indicate that our models are not enough to accurately approximate the behavior of the disk. The reason is that the disk does not behave exactly as described in its own manuals. As a result, we develop behavior-adjustment models to extend our initial models.
The execution of several applications on our laptop provides the run-length information we plug into the models. Unfortunately, our experiments show that several common laptop applications do not exhibit long enough run-lengths to allow for energy savings. To evaluate the potential of application transformations and application/operating system interaction, we manually transform applications, implement the policies in the Linux kernel, and collect experimental energy and performance results. These results show that our adjusted models can accurately predict disk energy consumption and CPU time. Furthermore, the results demonstrate that the transformed applications can conserve a significant amount of disk energy without incurring substantial performance degradation. Compared to the unmodified applications, the transformed applications can achieve disk energy reductions of up to 89% under our most sophisticated energy management policy with small performance degradation. Based upon our modeling and experimental results, we conclude that application-supported device management can be very useful in terms of energy and performance. These results should motivate compiler designers to develop the infrastructure required by our application-supported policies. The remainder of this paper is organized as follows. The next section discusses the related work and highlights the aspects that distinguish our contributions. Section 3 details the different policies we consider and presents models for them. Section 4 describes the disk and application workload we consider, and presents the results of our analyses and experiments. The section also describes the type of application transformation we advocate. Finally, section 5 summarizes the conclusions we draw from this research.
2 Related Work
Application Support in the Control of Devices. There have been several previous proposals for giving applications greater control of power states [1, 2, 3, 4, 5, 6]. Carla Ellis [1] articulated the benefits of involving applications in energy management, but did not study specific techniques or policies. Lu et al. [2] suggested an architecture for dynamic energy management that encompassed application control of devices, but did not evaluate this aspect of the architecture. In a more recent paper, Lu et al. [3] studied the benefit of allowing applications to specify their device requirements with a single operating system call. Microsoft’s OnNow project [4] suggests that applications should be more deeply involved, controlling all power state transitions. Our work differs from these previous approaches in that we propose a different form of application support: one in which the application is transformed to increase its run-lengths and informs the operating system about each upcoming run-length, after a device access. This strategy allows us to handle short-run-length and irregular applications. Our approach also simplifies programming/compiler construction (with respect to OnNow) without losing any energy conservation opportunities.
Delaluz et al. [5] and Hom and Kremer [6] are developing compiler infrastructure for similar approaches to application support. These developments suggest that our proposed transformations can indeed be implemented by compilers. Delaluz et al. transform array-based benchmarks to cluster array variables and conserve DRAM energy, whereas Hom and Kremer transform such benchmarks to cluster page faults and conserve wireless interface energy. Both groups implement their energy management policies in the compiler. We implement our policies in the operating system for two reasons: (1) the kernel is traditionally responsible for managing all devices; and (2) the kernel can actually reduce any inaccuracies in the run-length information provided by the application, according to previously observed run-lengths or current system conditions; the compiler does not have access to that information. Nevertheless, our work is complementary to theirs in that we map out the potential benefits of different policies, determine where in the parameter space unmodified applications lie, and experimentally demonstrate the effect of a different form of application transformation. Analytical Modeling. We are only aware of two previous analytical studies of energy management strategies. Greenawalt [7] developed a statistical model of the interval between disk accesses using a Poisson distribution. Fan et al. [8] developed a statistical model of the interval between DRAM accesses using an exponential distribution. In both cases, modeling the arrival of accesses as a memoryless distribution does not seem appropriate, as suggested by the success of history-based policies [9, 5]. Greenawalt [7] also considered the issue of the lifetime of a disk as a function of the spin up/down frequency. This issue is beyond the scope of our paper. However, note that transforming applications to increase run-lengths and providing the operating system with more information for device control should only increase lifetimes. Direct Deactivation and Pre-activation. As far as we know, only recently have application-supported policies for device deactivation and pre-activation been proposed [5, 6]. Other works, such as [10], simulate idealized policies that are equivalent to having perfect knowledge of the future and applying both direct deactivation and pre-activation. Rather than simulate, we model, implement, and experimentally evaluate direct deactivation and pre-activation. Conserving Disk Energy. Disks have been a frequent focus of energy conservation research, e.g. [11, 12, 10, 9, 13, 2, 14, 7, 15]. The vast majority of the previous work has been on history-based, adaptive-threshold policies, such as the one used in IBM disks. Because our application-supported policies can use information about the future, they can conserve more energy and avoid performance degradation more effectively than history-based strategies. Furthermore, in contrast with previous studies, we focus on non-interactive applications and application-supported disk control.
3 Models
In this section, we develop simple and intuitive models of five device control policies: Energy-Oblivious (EO), Fixed-Thresholds (FT), Direct Deactivation (DD), Pre-Activation (PA), and Combined DD + PA (CO). The models predict device energy consumption and CPU performance, based on device and application parameters such as the power consumption at each device state and the application run-lengths. For the purpose of terminology, we define the power states to start at number 0, the active state, in which the device is being actively used and consumes the most power. The next state, state 1, consumes less power than state 0. In state 1, there is no energy or performance overhead to use the device. Each of the next (larger or deeper) states consumes less power than the previous state, but involves more energy and performance overhead to re-activate. Re-activations bring the device back to state 1. Before presenting the models, we state our assumptions:

– We assume that run-lengths are exact. This assumption means that we are investigating the upper bound on the benefits of the policies that exploit DD and PA. In practice, we expect that run-lengths can be approximated with good enough accuracy to accrue most of these benefits.
– We assume that the application calls to the operating system have negligible time and energy overheads. Our experiments show that these overheads are indeed insignificant in practice. For example, the implementation of the DD policy for disk control takes on the order of tens of microseconds to execute on a fast processor, compared to run-lengths on the order of milliseconds.
– We assume that run-lengths are delimited by device operations that occur in the critical path of the CPU processing (e.g. blocking disk reads that miss in the buffer cache). Extending the models to consider non-blocking accesses is straightforward.

3.1 Modeling the Energy-Oblivious Policy
The EO control policy keeps the device at its highest idle power state, so that an access can be immediately started at any time. Thus, this policy promotes performance, regardless of energy considerations. We model a device under the EO policy to use energy per run-length (E^eo) that is the product of the run-length (R) and the power consumption when the device is in state 1 (P^1), i.e.

E^eo = R · P^1

The CPU time per run-length under the EO policy (T^eo) is simply the run-length, i.e.

T^eo = R
3.2 Modeling the Fixed-Thresholds Policy
The FT control policy recognizes the need to conserve energy in battery-operated computers. It determines that a device should be sent to consecutive lower-power states after fixed periods of inactivity. We refer to these fixed periods as their inactivity thresholds. For example, the device could be put in state 2 from state 1 after an inactivity period of 4 seconds (the inactivity threshold for state 1), and later be sent to state 3 after another 8 seconds (the inactivity threshold for state 2), etc. Thus, after 12 seconds the device would have gone from state 1 to state 3. We define the energy consumed by the device under the FT policy per run-length (E^ft) as the sum of three components: the energy spent going from state 1 to the state before the final state f, the energy spent at state f, and the energy necessary to re-activate the device starting at state f. Thus,

E^ft = Σ_{s=1}^{f-1} (P^s · T^s + E_deact^{s,s+1}) + (R − Σ_{s=1}^{f-1} T^s) · P^f + E_act^f

In this equation, P^s represents the power consumed at state s, T^s represents the amount of time spent at state s (equal to the inactivity threshold for this state), E_deact^{s,s+1} represents the energy consumed when going from state s to the lower power state s+1, and E_act^f represents the re-activation energy from state f to state 1. The final state f is the lowest power state that can be reached within the run-length, i.e. the largest state such that

Σ_{s=1}^{f-1} T^s < R

The CPU time per run-length (T^ft) is then the run-length plus the time to re-activate from state f (T_act^f), i.e.

T^ft = R + T_act^f

Note that E_act^1 = 0 and T_act^1 = 0, because in state 1 the device is ready to be used. In addition, the time consumed by the transition from state s to a lower power state s', T_deact^{s,s'}, does not appear in the time equations because it is not in the critical path of the CPU. The FT policy can be implemented by the operating system (according to the ACPI standard) or by the device itself. These implementation differences are of no consequence to our model. In fact, since the FT policy can conserve energy in a fairly straightforward fashion, we use it as the baseline when evaluating the other policies.
3.3 Modeling the Direct Deactivation Policy
The FT policy we described above is based on the assumption that if the device is not accessed for a certain amount of time, it is unlikely to be accessed for a while
longer. If we knew the run-lengths a priori, we could save even more energy by simply putting the device in the desired state right away. This is exactly the idea behind the DD control policy, i.e. using application-level knowledge to maximize the energy savings. We model the device energy consumed per run-length under the DD policy (E^dd) as the energy consumed to get to the low-power state, plus the energy consumed at the low-power state and the energy required to re-activate the device, i.e.

E^dd = E_deact^{1,f'} + P^{f'} · R + E_act^{f'}

Note that we differentiate between states f and f', as the FT and DD policies do not necessarily reach the same final state for the same run-length. In fact, f' is defined to be the lowest power state for which going to the next state would consume more energy, i.e. the largest state such that

(E_deact^{1,f'} + R · P^{f'} + E_act^{f'}) < (E_deact^{1,f'+1} + R · P^{f'+1} + E_act^{f'+1})

The CPU time per run-length for the DD policy (T^dd) is then similar to, but different from, that for the FT policy:

T^dd = R + T_act^{f'}
3.4 Modeling the Pre-activation Policy
In both the FT and DD policies, the time overhead of bringing the device back from a low-power state to state 1 is exposed to applications, as the transition is triggered by the device access itself. However, with run-length information from the application, we can try to hide the re-activation overhead behind useful computation. This is the idea behind PA, i.e. to allow energy savings (through FT or DD) while avoiding performance degradation. For maximum energy savings, the pre-activated device should reach state 1 "just before" it will be accessed. The specific version of PA we model uses the FT policy to save energy. The PA policy should achieve the same performance as the EO policy, but with a lower energy consumption. We model the device energy consumed per run-length under PA (E^pa) as

E^pa = Σ_{s=1}^{f''-1} (P^s · T^s + E_deact^{s,s+1}) + (R − Σ_{s=1}^{f''-1} T^s − T_act^{f''}) · P^{f''} + E_act^{f''}

Again, we highlight that the final low-power state f'' need not be the same as for the FT and DD policies for the same run-length, because re-activation occurs earlier with device pre-activation. f'' is defined as the highest power state such that

Σ_{s=1}^{f''} T^s + T_act^{f''+1} > R
The CPU time per run-length under this policy (T^pa) is

T^pa = R

3.5 Modeling the Combined Policy

We can achieve the greatest energy savings without performance degradation by combining the PA and DD policies. This is the idea behind the CO policy. We model the device energy (E^co) and the CPU time (T^co) under the CO policy as

E^co = E_deact^{1,f*} + (R − T_act^{f*}) · P^{f*} + E_act^{f*}   and   T^co = R

State f* is again different from the previous final states, since the choice of state needs to take the energy overhead of pre-activation into account. f* is defined as the lowest power state such that

(E_deact^{1,f*} + (R − T_act^{f*}) · P^{f*} + E_act^{f*}) < (E_deact^{1,f*+1} + (R − T_act^{f*+1}) · P^{f*+1} + E_act^{f*+1})
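To make these per-run-length models concrete, the sketch below evaluates the FT equations for a hypothetical three-state device; the other policies follow the same pattern. All state parameters are made up for illustration only.

    # Hypothetical device description. For each idle state s: P[s] = power,
    # T[s] = inactivity threshold, E_act[s]/T_act[s] = re-activation cost,
    # E_deact[s] = energy for the transition s -> s+1.
    P       = {1: 1.0, 2: 0.3, 3: 0.1}
    T       = {1: 2.0, 2: 8.0}            # no threshold beyond the deepest state
    E_act   = {1: 0.0, 2: 1.5, 3: 4.0}    # state 1 is ready to use (zero cost)
    T_act   = {1: 0.0, 2: 0.4, 3: 1.5}
    E_deact = {1: 0.5, 2: 0.3}

    def ft_energy_time(R):
        # f = largest state such that the sum of thresholds up to f-1 is below R
        f = 1
        while (f + 1) in P and sum(T[s] for s in range(1, f + 1)) < R:
            f += 1
        idle_before = sum(P[s] * T[s] + E_deact[s] for s in range(1, f))
        E = idle_before + (R - sum(T[s] for s in range(1, f))) * P[f] + E_act[f]
        return E, R + T_act[f]

    print(ft_energy_time(1.0))    # short run-length: never leaves state 1
    print(ft_energy_time(15.0))   # reaches state 3 after 10 s of thresholds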
Table 1 summarizes the parameters to the models and Table 2 summarizes all equations.

3.6 Modeling Whole Applications
So far we presented models that compute energy and time based on a single run-length. These models could be applied directly to determine the energy and time consumed by an application, if we could somehow find a run-length that represented all run-lengths of the application. This is easy to do for an application in which all run-lengths are of the same size. Unfortunately, few applications are this well-behaved. Another option would be to use the average run-length. However, the average run-length is not a good choice for two reasons: (1) applications may exhibit widely varying run-lengths, making average calculations
Table 1. Parameters for the models

Parameter          Explanation
E^pol              Energy consumed by policy pol
T^pol              CPU time consumed by policy pol
R                  Run-length
P^s                Average power consumed at state s
T^s                Inactivity threshold for state s
E_act^s            Average device energy to re-activate from state s
E_deact^{s,s'}     Average device energy to transition from state s to lower power state s'
T_act^s            Average time to re-activate from state s
Table 2. Energy and time equations for all policies

Policy  Equation
EO      E = R · P^1
        T = R
FT      E = Σ_{s=1}^{f-1} (P^s · T^s + E_deact^{s,s+1}) + (R − Σ_{s=1}^{f-1} T^s) · P^f + E_act^f
        T = R + T_act^f
        f is the largest state such that Σ_{s=1}^{f-1} T^s < R
DD      E = E_deact^{1,f'} + R · P^{f'} + E_act^{f'}
        T = R + T_act^{f'}
        f' is the largest state such that
        (E_deact^{1,f'} + R · P^{f'} + E_act^{f'}) < (E_deact^{1,f'+1} + R · P^{f'+1} + E_act^{f'+1})
PA      E = Σ_{s=1}^{f''-1} (P^s · T^s + E_deact^{s,s+1}) + (R − Σ_{s=1}^{f''-1} T^s − T_act^{f''}) · P^{f''} + E_act^{f''}
        T = R
        f'' is the smallest state such that Σ_{s=1}^{f''} T^s + T_act^{f''+1} > R
CO      E = E_deact^{1,f*} + (R − T_act^{f*}) · P^{f*} + E_act^{f*}
        T = R
        f* is the largest state such that
        (E_deact^{1,f*} + (R − T_act^{f*}) · P^{f*} + E_act^{f*}) < (E_deact^{1,f*+1} + (R − T_act^{f*+1}) · P^{f*+1} + E_act^{f*+1})
meaningless; and (2) the models are non-linear, so modeling energy and time based on the average run-length would not be accurate, even if the average could be meaningfully computed. Instead of using the average run-length, we model applications by separating run-lengths into ranges or groups for which the models are linear, i.e. groups are associated with power states. For instance, under the FT policy, we define the groups according to the inactivity thresholds, i.e. group 1 is the set of run-lengths such that R < T^1, group 2 is the set of run-lengths such that T^1 < R < T^2, and so on. This grouping scheme allows us to work with the run-length exactly in the middle of each range, as the average run-length for the group. However, using these run-lengths we would not know whether we were underestimating or overestimating energy and time. Instead of doing that, we find it more interesting to bound the energy and time consumed by applications below and above, using the minimum and maximum values in the groups, respectively. Besides, focusing on upper and lower bounds obviates the need to determine the distribution of run-lengths, which has been shown to be a complex proposition [8]. For a whole application, the upper and lower bounds on energy and time depend solely on the fraction of the run-lengths that fall within each group. More specifically, the overall potential of each policy is lower bounded by the sum of the minimums and upper bounded by the sum of the maximums. For instance, under the FT policy, if all run-lengths for an application are in group 2, the minimum energy consumption occurs when all run-lengths are T^1.
Application transformations that increase run-lengths have the potential to increase energy savings. The effect of these transformations translates into changes in the fraction of run-lengths that fall in the different groups we just defined. Increasing run-lengths increases the fraction of run-lengths in higher-numbered groups, with a corresponding decrease in the fraction of run-lengths in lower-numbered groups.
4 Evaluation for a Laptop Disk
This section applies our models to an evaluation of application transformations and the policies we consider for controlling the Fujitsu MHK2060AT laptop disk. This drive can be found in a number of commercial laptops. The section also describes the details of the application transformations and the implementations of the policies. The section proceeds as follows. First, we adjust the models for our disk and instantiate their parameters. Next, we measure the run-lengths of several laptop-style applications on a 667 MHz Pentium III-based system to find the policies’ potential benefits in practice. After that, we discuss the benefits achievable for a subset of the applications that we have transformed to increase run-lengths. Last, to corroborate the models, we run both original and modified applications, and measure the resulting energy and performance gains.
4.1 The Fujitsu Disk
The Fujitsu disk we study is a 6-Gbyte, 4200-rpm drive with ATA-5 interface. This particular disk only implements four power states, according to its manual:

0. Active – the disk is performing a read or a write access.
1. Idle – all electronic components are powered on and the storage medium is ready to be accessed. This state is entered after the execution of a read or write. This state consumes 0.92 W.
2. Standby – the spindle motor is powered off, but the disk interface can still accept commands. This state consumes 0.22 W.
3. Sleep – the interface becomes inactive and the disk requires a software reset to be re-activated. This state consumes 0.08 W.

However, our experiments with the disk demonstrate that there are two hidden transitional states. The first occurs before a transition from active to idle. Right after the end of an access, the disk moves to the hidden state. There it consumes 1.75 W for at most 1.1 secs, regardless of policy. The second hidden state occurs when we transition the disk from idle (FT and PA) or the first hidden state (DD and CO) to standby or sleep state. Before arriving at the final state, the disk consumes 0.74 W at this hidden state for at most 5 secs. We do not number these extra states. To make the models more accurate, we extend them to take the hidden states into consideration. The model adjustment factors for energy are listed in Table 3. No time adjustments are needed.
Table 3. Energy adjustment models for all policies

Policy    Equation                                    Condition
All       Adj(R) = min(1.1, R) · (1.75 − 0.92)        R < T^1
DD        Adj(R) = 0.4 · (1.75 − P^f)                 R > T^1
CO        Adj(R) = 0.4 · (1.75 − P^{f*})              R > T^1
FT, PA    Adj(R) = (5 − R + T^1) · (0.22 − 0.74)      T^1 < R < T^1 + 5
Table 4 lists the measured value for each of the parameters of our models. The measurements include the hidden states, obviously. The values marked with “†” were picked assuming that the disk should stay at a higher power state only as long as it has consumed the same energy it would have consumed at the next lower power state, i.e. (P^s · T^s) + E_act^s = E_deact^{s,s+1} + (P^{s+1} · T^s) + E_act^{s+1}. The rationale for this assumption is similar to the famous competitive argument about renting or buying skis [16].
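Solving this break-even condition for the threshold is immediate (the rearrangement below is ours, but it reproduces the listed values):

T^s = (E_deact^{s,s+1} + E_act^{s+1} − E_act^s) / (P^s − P^{s+1})

For instance, with the Table 4 values, T^2 (FT) = (0.0 + 3.7 − 1.4) / (0.22 − 0.08) ≈ 16.43 secs, which matches the listed threshold.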
4.2 Model Predictions
Given the energy and time values of the disk, we can now evaluate the policies with our models’ predictions. Figure 1 plots the disk energy (left) and CPU time
Table 4. Model parameters and measured values for the Fujitsu disk. Values marked with “†” were picked so that (P^s · T^s) + E_act^s = E_deact^{s,s+1} + (P^{s+1} · T^s) + E_act^{s+1}

Parameter          Measured Value
P^1                0.92 W
P^2                0.22 W
P^3                0.08 W
T^1 (FT)           9.222 secs†
T^2 (FT)           16.429 secs†
T^1 (PA)           8.712 secs†
T^2 (PA)           17.276 secs†
T^3 (FT and PA)    Not applicable
E_act^1            0 J
E_act^2            1.4 J
E_act^3            3.7 J
E_deact^{1,2}      5.0 J
E_deact^{1,3}      5.0 J
E_deact^{2,3}      ∼0.0 J
T_act^1            0 ms
T_act^2            1.600 secs
T_act^3            2.900 secs
Fig. 1. Disk energy (left) and CPU time (right) for Fujitsu values, as a function of run-length
(right) for each policy as a function of run-length. Figure 2 plots the energy (left) and CPU time (right) difference between each policy and FT, again as a function of run-length. All figures assume our adjusted models. The energy graphs show that the FT and PA policies consume the most energy out of the energy-conscious policies. In fact, FT consumes significantly more energy than even the EO policy for run-lengths between about 9 (T^1, i.e. the first inactivity threshold) and 18 seconds. PA consumes more energy than EO for run-lengths between 10 (T^1 + T_act^2) and 17 seconds. This result is a consequence of the high energy penalty for re-activating the disk in FT. The DD and CO policies consume significantly less energy than FT and PA for run-lengths that are longer than 9 seconds. For a run-length of 26 seconds, for instance, this difference is almost 50%. For run-lengths that are longer than 17 seconds, DD and CO send the disk directly to the sleep state. FT and PA only reach the sleep state for run-lengths that are longer than about 26 seconds (T^1 + T^2) and 29 seconds (T^1 + T^2 + T_act^3), respectively. Thus, for run-lengths in these ranges, energy consumption differences increase slightly. It is interesting to note that CO conserves slightly more energy than DD, as the former policy takes advantage of the (small) energy benefit of pre-activating the disk. This benefit also explains the small difference between the PA and the FT policies for most of the parameter space. The CPU time graphs show that the EO, PA, and CO policies perform better than the DD and FT policies for all run-lengths greater than 9 seconds. Just beyond this threshold, the EO, PA, and CO policies become about 15% better than DD and FT. DD and FT exhibit the same performance for run-lengths in the 9 to 17 seconds range and for run-lengths that are longer than 26 seconds. For run-lengths between 17 and 26 seconds, DD exhibits worse performance than FT
because T_act^3 > T_act^2. At a run-length of 50 seconds, the performance difference between the policies is approximately 5%.

Fig. 2. Disk energy (left) and CPU time (right) with respect to FT, as a function of run-length
4.3 Benefits for Applications
As mentioned in Section 3.6, we group run-lengths with respect to power states to model whole applications. Under FT, for instance, we divide run-lengths into 3 groups: [1, T^1 − 1], [T^1, T^2 − 1], and [T^2, 49999], expressed in milliseconds. As is apparent in this breakdown, we use 1 millisecond and 50 seconds as the shortest and longest possible run-lengths, respectively. (We have found experimentally that only a few run-lengths in our original and modified applications do not fall in this range.) Figures 3 and 4 plot the average power (energy/sec) consumed by applications under the DD and PA policies, respectively, as a function of the percentage of run-lengths that fall within the groups associated with states 1 (idle) and 2 (standby). The run-lengths associated with state 3 (sleep) are the remaining ones. In both figures, lower (higher) planes represent minimum (maximum) average power consumptions. (We will soon explain what the points in the figures represent.) Recall that CO has roughly the same power behavior as DD, whereas FT has almost the same power behavior as PA. Consequently, we do not present energy results for CO and FT explicitly. The figures show that run-length distributions that are skewed towards long run-lengths can bring energy consumption to a small fraction of the power consumption of state 1, regardless of the policy used. Under an extreme scenario in which all run-lengths are in state 3, i.e. coordinates (0,0,*) in the figures, the minimum average power is roughly 29% (DD) and 50% (PA) lower than the power consumption of state 1. Note that the minimum average power is the energy consumed by run-lengths of 49.999 seconds plus the energy to re-activate
Fig. 3. Average power for DD policy, as a function of the percentage of run-lengths in states 1, 2, and 3. Lower (higher) plane represents min (max) consumption
Fig. 4. Average power for PA policy, as a function of the percentage of run-lengths in states 1, 2, and 3. Lower (higher) plane represents min (max) consumption
1 millisecond later, divided by 50 seconds. In contrast, the maximum average power is the result of run-lengths of about 17 (DD) or 29 (PA) seconds and re-activating after 1 millisecond. Another interesting observation is the significant difference between maximum and minimum consumptions, especially when the percentage of run-lengths in state 1 is large. The largest differences between maximum and minimum consumptions occur at coordinates (1,0,*). For DD, these discrepancies represent the difference between having all run-lengths in state 1 be 1 millisecond (maximum) or 9 (minimum) seconds. This difference in the distribution of state 1 run-lengths can cause a factor of almost 2 difference in average power. These observations demonstrate that, even when run-lengths are relatively short, their actual distribution may affect the energy consumption significantly. Finally, the figures confirm that the DD policy consumes less energy than the PA policy across the whole parameter space. The only exception is when all run-lengths are in state 1, i.e. at coordinates (1,0,*) in the figures; with this distribution, the two policies consume the same amount of energy. Also, note that at these coordinates the average power is always higher than P 1 , due to the significant energy overhead of the first hidden state. Figure 5 illustrates the percentage CPU time overhead under the DD policy. Results for FT are similar to those in the figure, whereas EO, PA, and CO all exhibit no overhead under our modeling assumptions. The figure shows that minimum overheads are low (< 6%) for most of the parameter space. In contrast, maximum overheads quickly become significant as the fraction of run-lengths
Fig. 5. Average % CPU time overheads of DD policy. Lower (higher) plane represents min (max) overheads
corresponding to states 2 and 3 is increased. Again, we see the importance of the distribution of run-lengths within each group. Figures 3, 4, and 5 present the absolute behavior of our policies. However, it is also important to determine the benefits of our policies in comparison to more established policies such as FT. DD can achieve significant energy gains with respect to FT for most of the parameter space. Even the minimum gains are substantial in most parts of the parameter space. Gains are especially high when most of the run-lengths are within the bounds of state 3. Reducing the percentage of these run-lengths decreases the maximum savings slightly when in favor of state 2 run-lengths and more significantly when in favor of state 1 run-lengths. In terms of CPU time, PA performs at least as well as FT for the whole parameter space, even in the worst case scenario. The maximum gains can reach 15%, especially when the distribution of run-lengths is tilted towards state 2. Reducing the percentage of these run-lengths in favor of state 3 run-lengths decreases the maximum savings, but not as quickly as increasing the fraction of state 1 run-lengths. Figures 3 to 5 visualize the potential benefit of our policies for the entire range of run-length distributions. However, it is also important to determine where applications actually lie. Table 5 lists the non-interactive applications we consider and their inputs. We measured the applications’ run-lengths by instrumenting the operating system kernel (Linux) on a 667 MHz Pentium III-based system to compute that information. The results of these measurements show that all run-lengths in these applications fall in state 1 and, thus, prevent any energy savings. In terms of the figures we just discussed, this means that applications lie in the right extreme of the graphs, i.e. coordinates (1,0,*). To accrue energy savings, we need to increase
run-lengths, moving the applications towards the left side of the graphs. That is the purpose of our main proposed application transformations. Our transformed applications are represented in Figures 3, 4, and 5 by different point marks. Each application is represented by two points, one for each plane. As we can see, the modified applications do lie in a more profitable part of the parameter space in terms of energy savings. The next subsection discusses our proposed transformations in detail.
4.4 Application Transformations
As mentioned above, for non-interactive applications to permit energy savings, we need to be able to increase run-lengths. We propose that run-lengths can be easily increased by modifying the applications’ source codes. In particular, the codes should be modified to cluster disk read operations, so that the processor could process a large amount of data in between two clusters of accesses to disk. If the reads are for consecutive parts of the same file, a cluster of reads can be replaced by a single large read. Intuitively, and supported by Figures 3 and 4, one might think that the best approach would then be to increase run-lengths to the extreme by grouping all reads into a single cluster. However, one must realize that increasing run-lengths in this way will correspondingly increase buffer requirements. Given this direct relationship between run-length and buffer space, we propose that applications should be modified to take advantage of as much buffer space as possible, as long as that does not cause unnecessary disk activity, i.e. swapping. Unfortunately, this approach does not work well for all applications. Streaming applications should have the additional restriction that a cluster of reads (or large read) should take no longer than 300 millisecs, to avoid human-perceptible pauses of the stream. To determine the amount of memory that is available, we propose the creation of a system call. The operating system can then decide how much memory is available for the application to consume and inform the application. The following example illustrates the transformations on a simple (non-streaming) application based on explicit I/O. Assume that the original application looks roughly like this:

i = 1;
while i <= N {
    read chunk[i] of file;
    compute on chunk[i];
    i = i + 1;
}

After we transform the application to increase its run-length:

// ask OS how much memory can be used
available = how_much_memory();
num_chunks = available/sizeof(chunks);
i = 1;
while i <= N {
    // cluster read operations
    for j = i to min(i+num_chunks, N)
        read chunk[j] of file;
    // cluster computation
    for j = i to min(i+num_chunks, N)
        compute on chunk[j];
    i = j + 1;
}

A streaming application can be transformed similarly, but the number of chunks of the file to read (num_chunks) should be min(available/sizeof(chunks), (disk bandwidth x 300 millisecs)/sizeof(chunks)). Regardless of the type of application, the overall effect of this transformation is that the run-lengths generated by the computation loop are now num_chunks times as long as the original run-lengths. As a further transformation, the information about the run-lengths can be passed to the operating system to enable the policies we consider. The sample code above can then be changed to include the following system call in between the read and computation loops:

next_R(appl_specific_func(available));

Note that for regular applications, such as streaming audio and video, the operating system itself could predict run-lengths based on past history, instead of being explicitly informed by the programmer or the compiler. However, the approach we advocate is more general; it can handle these applications, as well as applications that exhibit irregularity. Our image smoothing application, for instance, smooths all images under a certain directory. As the images are fairly small (can be loaded to memory with a single read call) and of different sizes, each run-length has no relationship to previous ones. Thus, it would be impossible for the operating system to predict run-lengths accurately for this application. In contrast, the compiler or the programmer can easily approximate each run-length based on the image sizes, which is what we do in our experiments.
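For readers who prefer a runnable version, here is a compact sketch (ours) of the same read-clustering transformation, including the streaming bound. The helpers how_much_memory() and next_R(), the chunk size, and the disk bandwidth are stand-ins for the proposed system calls and for application-specific values, not part of any existing interface.

# Sketch (ours): read clustering with the 300-millisec streaming bound (Python).
CHUNK = 64 * 1024                   # assumed chunk size in bytes
DISK_BW = 20 * 1024 * 1024          # assumed sustained disk bandwidth in bytes/sec

def how_much_memory(): return 19 * 1024 * 1024   # stand-in for the proposed system call
def next_R(seconds): pass                          # stand-in: run-length hint to the OS
def read_chunk(j): return bytes(CHUNK)             # stand-in for reading chunk j of the file
def compute(chunk): pass                           # stand-in for the per-chunk computation

def process(n_chunks, streaming=False, secs_per_chunk=0.5):
    cluster = how_much_memory() // CHUNK
    if streaming:                                  # keep each read burst under ~300 millisecs
        cluster = min(cluster, int(DISK_BW * 0.3) // CHUNK)
    i = 0
    while i < n_chunks:
        end = min(i + cluster, n_chunks)
        data = [read_chunk(j) for j in range(i, end)]   # cluster of reads
        next_R((end - i) * secs_per_chunk)              # inform the OS of the upcoming run-length
        for d in data:                                   # cluster of computation
            compute(d)
        i = end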
4.5 Experiments
Methodology. To support and validate our models, we experimented with real non-interactive applications running on a Linux-based laptop. We implemented the FT, DD, PA, and CO policies in the Linux kernel. FT is implemented with a kernel timer that goes off according to the T^1 and T^2 thresholds. When the timer goes off, the kernel sends the disk to the next available lower power mode. DD, PA, and CO were implemented by creating a system call that can be called by applications to inform the kernel about the run-length that is about to
Table 5. Applications, their inputs, and the grouping of run-lengths in their modified versions (assuming CO states). We consider two streaming (top) and two non-streaming applications (bottom)

Application       Input                          Modified Rs {s1, s2, s3}
MP3 player        2.49-MByte song                {0, 0, 1}
MPEG player       12.75-MByte movie              {0, 0.83, 0.17}
Image smoother    30 images, 2.46 MBytes each    {0, 0, 1}
MPEG encoder      800 files, 115 KBytes each     {0, 0.5, 0.5}
start. With that information, the kernel can implement DD by determining f according to our model and putting the disk in that state. The kernel can also implement PA by starting a timer to go off when the disk should be re-activated, again according to our model. Recall that the PA policy assumes FT for energy conservation. The kernel implements the CO policy by combining PA with DD, rather than FT. Unmodified non-interactive applications usually exhibit very short run-lengths. To achieve greater energy gains, we need applications with longer run-lengths and that call the kernel informing their approximate run-lengths. One of our main goals is to inspire compiler writers to produce the infrastructure required by these transformations. Because the infrastructure does not exist at the moment, we transformed a subset of our applications manually. To determine a reasonable read buffer size for a laptop with 128 MBytes of memory, we determined the amount of memory consumed by a “common laptop environment”, with Linux, the KDE window manager, 1 Web browser window, 1 slide presentation window, 1 emacs window, and 2 xterms. To this amount, we added 13 MBytes (10% of the memory) as a minimum kernel-level file cache size. The remaining memory allowed for 19 MBytes of read buffer space. The transformed streaming and non-streaming applications exhibit the run-length distributions listed in the third column of Table 5. The groups in the table are defined with respect to the run-lengths that delimit CO states. Recall that the original distributions were all {1,0,0} for our applications. The disk energy consumed by the applications is monitored by a multimeter directly connected to the disk device. The multimeter collects instantaneous power measurements 3-4 times per second and sends these measurements to another computer, which stores them in a log for later use.

Results. Figures 6 to 9 present the measured and modeled results for our applications. Each figure plots two groups of bars, disk energy (left) and CPU time (right), with results for all policies. The rightmost bar in each group (labeled “UM”) presents the results for the original, unmodified versions of the applications. Each bar is divided into 4 different parts. The bottom part represents the
Fig. 6. Energy (left) and time (right) for image smoother
Fig. 7. Energy (left) and time (right) for MP3 player
energy/time associated with actual disk accesses. Stacked on top of it, we have the energy/time as predicted by the simple models, the energy/time as predicted by the adjusted models, and the experimentally measured energy/time. The modeling results were computed using the actual run-lengths observed during the applications’ runs. The only bars that are not divided into these components are those for the unmodified MPEG encoder. The reason is that our current kernel implementation cannot handle the huge amount of run-length information generated by this application. We can make several interesting observations from these figures. First, the figures confirm that the adjusted models can predict energy/time more accurately than the simple models. The difference between simple and adjusted predictions approaches a factor of 2 in some cases, namely energy for the unmodified MP3 and MPEG players. The figures demonstrate that our adjusted models can indeed approximate the behavior of the applications in all cases. Second, the figures demonstrate that the application support we propose indeed conserves energy. The transformation to increase run-lengths reduces energy consumption even under EO, an energy-oblivious policy. When the modified applications are run under FT, energy consumption is further reduced in most cases. The exception here is the MPEG player application, for which run-lengths are exactly in the range where FT performs worse than EO, namely between 9 and 18 seconds. PA conserves either a little more or a little less energy than FT, as one would expect. Exploiting run-length information to conserve energy provides even more gains, as shown by the DD and CO results. Modified applications under DD and CO can consume as much as 89% less energy than their unmodified counterparts, as in the case of the MP3 player. The CO policy usually consumes a little more energy than DD. The main reason is that run-length mispredictions may cause
Fig. 8. Energy (left) and time (right) for MPEG encoder
Fig. 9. Energy (left) and time (right) for MPEG player
the disk to be idle longer than necessary under CO. The same problem is not as severe for PA (percentage-wise) because re-activations under this policy usually come from a shallower state than in CO for these applications. Third, the CPU time bars show that FT and DD usually exhibit the worst performance, as one would expect. The disk re-activations are the main cause for the performance degradation under these policies. Furthermore, the figures show that PA and CO are effective at limiting performance degradation. Performance under these policies is always about the same as under EO, except in the case of the MPEG player for which the performance degradation is 5%. This small discrepancy is a consequence of a few run-length mispredictions. Overall, the experimental results indicate that our adjusted models can accurately predict energy conservation and CPU time, whereas the simple models sometimes fail to do so. Moreover, the results demonstrate that the application transformations we propose are extremely effective at conserving energy. Finally, our results confirm that CO is the best policy in that it conserves significant energy without degrading performance. The main difficulty with CO (and PA) is coming up with accurate run-length predictions.
5 Conclusions
This paper studied the potential benefits of application-supported device management for optimizing energy and performance. We proposed simple application transformations that increase device idle times and inform the operating system about the length of each upcoming period of idleness. Using modeling, we showed that there are significant benefits to performing these transformations for large regions of the application space. Using operating system-level implementations
and experimentation, we showed that current non-interactive applications lie in a region of the space where they cannot accrue any of these benefits. Furthermore, we experimentally demonstrated the gains achievable by performing the proposed transformations.
Acknowledgements

We would like to thank Enrique V. Carrera, Uli Kremer, and the anonymous referees for comments that helped improve this paper. We are also grateful to Uli Kremer for lending us the power measurement infrastructure of the Energy Efficiency and Low-Power (EEL) lab at Rutgers.
References

[1] Carla Ellis. The case for higher level power management. In Proceedings of Hot-OS, March 1999. 159
[2] Yung-Hsiang Lu, Tajana Simunic, and Giovanni De Micheli. Software controlled power management. In Proceedings of the IEEE Hardware/Software Co-Design Workshop, May 1999. 159, 160
[3] Yung-Hsiang Lu, Luca Benini, and Giovanni De Micheli. Requester-aware power reduction. In Proceedings of the International Symposium on System Synthesis, September 2000. 159
[4] OnNow and Power Management. http://www.microsoft.com/hwdev/onnow/. 159
[5] V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. J. Irwin. DRAM energy management using software and hardware directed power mode control. In Proceedings of the International Symposium on High-Performance Computer Architecture, January 2001. 159, 160
[6] Jerry Hom and Uli Kremer. Energy management of virtual memory on diskless devices. In Proceedings of the Workshop on Compilers and Operating Systems for Low Power, September 2001. 159, 160
[7] Paul Greenawalt. Modeling power management for hard disks. In Proceedings of the Conference on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, January 1994. 160
[8] Xiaobo Fan, Carla Ellis, and Alvin Lebeck. Memory controller policies for DRAM power management. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED’01), August 2001. 160, 165
[9] Fred Douglis and P. Krishnan. Adaptive disk spin-down policies for mobile computers. Computing Systems, 8(4):381–413, 1995. 160
[10] Fred Douglis, P. Krishnan, and Brian Marsh. Thwarting the power-hungry disk. In Proceedings of the 1994 Winter USENIX Conference, 1994. 160
[11] J. Wilkes. Predictive Power Conservation. Technical Report HPL-CSP-92-5, Hewlett-Packard, May 1992. 160
[12] Kester Li, Roger Kumpf, Paul Horton, and Thomas Anderson. A quantitative analysis of disk drive power management in portable computers. In Proceedings of the 1994 Winter USENIX Conference, pages 279–291, 1994. 160
[13] David P. Helmbold, Darrell D. E. Long, and Bruce Sherrod. A dynamic disk spin-down technique for mobile computing. In Proceedings of the 2nd International Conference on Mobile Computing, pages 130–142, 1996. 160
[14] Chi-Hong Hwang and Allen C.-H. Wu. A predictive system shutdown method for energy saving of event-driven computation. ACM Transactions on Design Automation and Electronic Systems, 5(2):226–241, April 2000. 160
[15] Yung-Hsiang Lu, Eui-Young Chung, Tajana Simunic, Luca Benini, and Giovanni De Micheli. Quantitative comparison of power management algorithms. In Proceedings of the Design Automation and Test Europe, March 2000. 160
[16] A. Karlin, M. S. Manasse, L. Rudolph, and D. D. Sleator. Competitive snoopy caching. Algorithmica, 3(1):79–119, 1988. 167
Energy-Efficient Server Clusters

E.N. (Mootaz) Elnozahy, Michael Kistler, and Ramakrishnan Rajamony

Low-Power Computing Research Center, IBM Research, Austin TX 78758, USA
http://www.research.ibm.com/arl
Abstract. This paper evaluates five policies for cluster-wide power management in server farms. The policies employ various combinations of dynamic voltage scaling and node vary-on/vary-off (VOVO) to reduce the aggregate power consumption of a server cluster during periods of reduced workload. We evaluate the policies using a validated simulator that calculates the energy usage and response times of a Web server cluster serving traces culled from real-life Web server workloads. Our results show that a relatively simple policy of independent dynamic voltage scaling on each server node can achieve savings ranging up to 29% and is competitive with more complex schemes for some workloads. A policy that brings nodes online and takes them offline depending on the workload intensity also produces significant savings up to 42%. The largest savings are obtained by using a coordinated voltage scaling policy in conjunction with VOVO. This policy provides up to 18% more savings than just using VOVO in isolation. All five policies maintain server response times within acceptable norms. Keywords: Power Management, Clusters, Voltage Scaling, Web Servers
1 Introduction
Power consumption is rapidly becoming a key design issue for servers deployed in large data centers and web hosting facilities. Anecdotal evidence from data center operators indicates that a significant fraction of the operation cost of these centers is due to power consumption and cooling. Computing nodes in these densely packed systems also often overheat, leading to intermittent failures. These problems are likely to worsen as newer server-class processors offer higher levels of performance at the expense of increased power consumption. This paper explores five policies for reducing the energy consumption of server clusters with varying degrees of implementation complexity. The first policy, independent voltage scaling (IVS), simply uses voltage scaled processors that independently vary their voltage and frequency according to the server workload. The second policy also uses voltage scaled processors, but coordinates the processor voltage scaling algorithms. We call this coordinated voltage scaling (CVS).
This research has been supported in part by The Defense Advanced Research Projects Agency under contract F33615-00-C-1736.
The third policy uses non-voltage scaled processors and turns entire servers on or off depending on the workload intensity. This policy, which we call vary-on vary-off (VOVO), was originally proposed by Pinheiro et. al. [11], and has been evaluated for web-server clusters [4, 11] and compute-server clusters [11]. The fourth policy combines IVS and VOVO, while the fifth uses a combination of coordinated voltage scaling and VOVO. We use workloads constructed from the server access logs of the 1998 Nagano Winter Olympics website and those of a financial services web site to evaluate the five policies in terms of both their response time and energy savings. The evaluation is performed using a simulation model of a web server cluster. This simulation model is an extension of a previously developed single-node web server simulator that has been extensively validated for accuracy in both energy consumption and response times against energy and response time measurements from a commodity server [1]. Our findings show that independent voltage scaling, the simplest of all policies in terms of implementation complexity1 , offers energy savings ranging from 20% to 29%. Coordinated voltage scaling offers slightly better savings than IVS, but this benefit is probably not sufficient to justify the increased implementation complexity. The energy savings afforded by VOVO are workload dependent. For the Finance workload, VOVO saves more energy than IVS. However, IVS saves more energy than VOVO for Olympics98. Combining voltage scaling with VOVO offers the most energy savings with VOVO-IVS saving more energy than either voltage scaling or VOVO in isolation. VOVO-CVS saves the most energy (up to 18% more than VOVO) at the expense of a more complicated implementation. All five policies can be engineered to keep server response times within acceptable norms. The remainder of this paper is organized as follows. In Section 2 we present five policies for power management in server clusters. In Section 3, we analytically derive an optimal operating frequency range for servers in a cluster with voltagescaled processors. This derivation forms the basis for the VOVO-CVS policy. Section 4 describes the evaluation methodology for the various cluster power management policies using workloads based on logs from real web servers and presents the results of our evaluation. A comparison to related work is presented in Section 5 and conclusions in Section 6.
2 Cluster Power Management
We explore the benefits of five processor power management policies for clusters. The policies use two basic power management mechanisms, dynamic voltage scaling and node vary-on/vary-off. We begin by briefly describing these mechanisms, and then explain how these mechanisms are employed in the policies.

1 We use the term “implementation complexity” to refer to the additional work required in integrating commodity components to implement a policy. More specifically, we do not consider the complexity of implementing or designing the commodity components.
Dynamic voltage scaling adjusts the operating voltage and frequency of the CPU to match the intensity of the workload. Voltage scaling reduces energy consumption by virtue of the fact that the energy consumed by a processor is typically directly proportional to V 2 , where V is the operating voltage. To ensure reliable operation, processor frequency must also be reduced proportionally with voltage. Setting the processor frequency (and thence the voltage) to the lowest point where the workload can be executed with adequate responsiveness results in reduced energy consumption. In most high performance server systems, the CPU provides the greatest opportunity to reduce power consumption, since other components consume a smaller fraction of the total system power, exhibit much less variation in energy consumption with workload, and could substantially impact system performance and/or reliability if power managed. Voltage scaling works particularly well in Web servers since typically, their configured capacity is significantly larger than the average encountered workload [1]. Dynamic voltage scaling is an intra-node power management mechanism. Node vary-on/vary-off takes whole nodes offline when the workload can be adequately served by a subset of the nodes in the cluster. Machines that are taken off-line may be powered off or placed in a low power state. Machines that are off-lined are placed back online should the workload increase. Node varyon/vary-off is an inter-node power management mechanism. All five policies assume that the incoming workload is balanced across cluster nodes using a mechanism such as weighted round-robin request distribution, with the servers weighted according to the average response time of recent requests. Independent Voltage Scaling (IVS): In this policy each node independently manages its own power consumption using dynamic voltage scaling. This policy performs no inter-node power management, so all the nodes in the cluster stay in the active state even during periods of low workload. Each node may operate at different frequencies and voltages due to workload variations and the differences in the computational demands of individual requests. However, since the request distribution mechanism balances the workload across all nodes, on average, each node will operate at roughly the same frequency and voltage. While IVS requires that the cluster nodes have processors and infrastructure that support dynamic voltage scaling, no other software support is necessary. In particular, a cluster composed of nodes using Transmeta CrusoeTM processors implements this policy. Coordinated Voltage Scaling (CVS): This policy uses dynamic voltage scaling in a coordinated manner to reduce cluster power consumption. In contrast to IVS, the cluster nodes coordinate their voltage scaling actions so that all nodes operate very close to the average frequency setting across the cluster. A centralized monitor periodically computes the average frequency setting of all active nodes and broadcasts it to all servers in the cluster. Each node then restricts its voltage scaling policy to a small interval of settings around this average frequency. As in the IVS policy, there is no inter-node power management. Hence, all the nodes in the cluster are active even during periods of low workload. The CVS policy is expected to save more energy than IVS because a cluster where the nodes operate at a particular frequency/voltage setting S is more efficient
than one where the nodes operate at multiple settings whose average is S. To implement the CVS policy, cluster processors must support software controlled dynamic voltage scaling, such as that afforded by AMD’sTM PowerNowTM technology. In addition, CVS requires a central facility that monitors the frequency settings of all nodes and disseminates the average setting to the cluster. This monitoring facility can be implemented as a software service running on one of the cluster nodes (or a separate support server) with software probes running on each of the nodes. Vary-On Vary-Off (VOVO): This policy, originally proposed by Pinheiro et. al. [11], turns off server nodes so that only the minimum number of servers required to support the workload are kept active. Nodes are brought online as and when required. VOVO does not use any intra-node voltage scaling, and can therefore be implemented in a cluster that uses standard high-performance processors without dynamic voltage scaling. However, some hardware support, such as a Wake-On-LAN network interface, is needed to signal a server to transition from inactive to active state. Node vary-on/vary-off can be implemented as a software service running on one of the cluster nodes (or a separate support server) to determine whether nodes must be taken offline or brought online and requires software probes running on each of the cluster nodes to provide information on the utilization of that node. The load distribution mechanism must be aware of the state of the servers in the cluster so that it does not direct requests to inactive nodes, or to nodes that have been selected for vary-off but have not yet completed all received requests. Combined Policy (VOVO-IVS): This policy is a combination of the VOVO policy to reduce the number of active servers and the IVS policy to reduce power consumption on individual nodes. VOVO-IVS can be easily implemented using Transmeta CrusoeTM processors in a VOVO cluster. Thus, the implementation complexity of VOVO-IVS is similar to that of VOVO. Coordinated Policy (VOVO-CVS): Conceptually, this policy is a combination of the VOVO policy to use the fewest number of active servers and the CVS policy to reduce the power consumption of individual active nodes. However, in this policy the power management actions of the two policies are integrated to use the most effective mechanism whenever the maximum cluster capacity is not needed. In particular, VOVO-CVS places a larger emphasis on voltage scaling than VOVO-IVS, since voltage scaling provides a quadratic benefit whereas node vary-on/vary-off provides only a linear benefit. The trade-off between these two mechanisms is heavily dependent on the power consumption characteristics of the processor and system components, and is discussed in detail in 3. Operationally, the policy achieves this integration by constraining the range of frequency and voltage settings based on the number of active nodes in the cluster. This range also dictates when a node is brought online or offline. A new node is brought online when the existing nodes find they have to operate above the allowed frequency range. Similarly, an existing node is taken offline when the nodes attempt to operate below their allowed frequency range.
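A rough sketch (ours) of one round of the CVS coordination just described; frequencies are expressed as fractions of the maximum, and the width of the interval around the cluster average is an assumption, since the text only calls for a small interval.

# Sketch (ours): one CVS coordination round.
def cvs_round(node_freqs, f_min=0.5, f_max=1.0, window=0.05):
    avg = sum(node_freqs) / len(node_freqs)          # central monitor: average active setting
    lo, hi = max(f_min, avg - window), min(f_max, avg + window)
    # each node clamps its locally chosen setting into the broadcast interval
    return [min(max(f, lo), hi) for f in node_freqs]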
VOVO-CVS is more complicated to implement than VOVO. The voltage scaling component requires processors that support software controlled dynamic voltage scaling such as that afforded by AMD’sTM PowerNowTM technology. In addition, the CVS component requires a central facility that communicates the optimal operating range to the software that sets the individual processor frequency/voltage settings. The VOVO component requires a central monitoring service to determine whether nodes must be taken offline or brought online and probes to monitor the utilization on each node. Finally, as with the VOVO policy, the load distribution mechanism must be aware of the state of the servers in the cluster so that it does not direct requests to inactive nodes, or to nodes that have been selected for vary-off but have not yet completed all received requests.
3 Details of the Coordinated Policy
In this section we explore in detail how to best integrate the VOVO policy, which expands and contracts the set of active servers based on workload, and the CVS policy, which uses software managed dynamic voltage scaling to ensure all nodes operate at a low average frequency and voltage. We construct a simple model for the power consumption of the cluster and use this model to determine how the two policies should cooperate to minimize power consumption. This theory will then be used as the basis for our coordinated power management policy. As previously discussed, CPU power is the largest and most variable component of system power consumption in typical server systems. Therefore, the model assumes that power consumption of all other system components is essentially constant regardless of system activity. Our earlier empirical observations support the validity of this assumption [1]. Since our processors support dynamic voltage scaling, CPU power consumption depends on the CPU voltage and frequency as well as CPU utilization. To simplify the model, we will assume the CPU is fully utilized. Another way to view this assumption is that during periods of low workload, CPU frequency can be lowered to the point that the processor is kept fully utilized. The dynamic component of CPU power consumption – the only part affected by system load – is given by the following formula [10]:

Dynamic CPU Power = A · C · V^2 · f

where A is an activity factor that accounts for how frequently gates switch, C is the total capacitance at the gate outputs, V is the voltage of the processor, and f is the operating frequency. In voltage scaled processors, decreasing the voltage requires that frequency also be reduced in approximately the same proportion. Furthermore, we can express voltage as a linear function of the frequency, V = αf. Substituting for V and combining constants, our formula for the dynamic power of the CPU can be reduced to a function of processor frequency:

Dynamic CPU Power = c_1 · f^3

For convenience, we express f in terms of a ratio to the maximum CPU frequency, and thus 0 ≤ f ≤ 1. When the CPU runs at a reduced frequency,
it takes longer to complete a given task. We assume that this slow-down is proportional to frequency – that is, the number of cycles required to perform any given task is fixed regardless of processor frequency. This will generally be true for workloads where the CPU is the critical resource, such as web server workloads where the data mostly fits in the filesystem cache. Even for CPU-intensive workloads, this is a conservative estimate, since at lower frequencies the processor spends less time stalled on accesses to memory, network, disk, etc. Since the power consumption of all other components in the system is essentially constant, we now have the following simple model of the power consumed by one node in the cluster running at frequency f:

System Power = P(f) = c_0 + c_1 · f^3     (1)
where c_0 is a constant that includes the power consumption of all components except the CPU, plus the base power consumption of the CPU. The power consumption of a cluster is just the sum of the system power consumed by its active nodes. The c_0 and c_1 terms should also include power loss due to power supply inefficiency. This power loss is typically more significant at low power levels, and thus may have a greater effect on the c_0 term than on the c_1 term. If a single power supply is used for all systems in the cluster, power loss also depends on the number of active servers. However, power supply efficiency increases quickly with load until it reaches a nearly steady level, so for simplicity we treat power supply loss as a fixed fraction of overall power consumption.

Now consider a cluster of n systems operating at frequency f_1. Using the model of system power stated above, the power consumed by these n systems is:

n × P(f_1) = n × (c_0 + c_1 · f_1^3)

When the frequency f_1 is low, it can be beneficial to vary off (turn off) a server and consolidate the workload on the remaining n − 1 servers. This would allow us to eliminate one system’s fixed power consumption at the cost of slightly increased variable energy consumption on the remaining n − 1 systems. In this scenario, to maintain the same response time and performance in the remaining n − 1 systems, we must increase the processor frequency in each system to (n/(n−1)) · f_1, causing the power consumed by these n − 1 systems to become:

(n − 1) × P((n/(n−1)) · f_1) = (n − 1) × (c_0 + c_1 · ((n/(n−1)) · f_1)^3)

Thus, the configuration with n − 1 servers consumes less energy than the one with n servers when

(n − 1) × (c_0 + c_1 · ((n/(n−1)) · f_1)^3) < n × (c_0 + c_1 · f_1^3)
(n^3/(n−1)^2) · f_1^3 < c_0/c_1 + n · f_1^3
f_1^3 < ((n − 1)^2 / (2n^2 − n)) · (c_0/c_1)
Thus, if the cluster has n active servers, we can reduce power consumption in the cluster by varying off one member of the cluster when the average CPU frequency of the backend servers falls below

f_varyoff(n) = ((c_0/c_1) · (n − 1)^2 / (2n^2 − n))^(1/3)

Likewise, when the frequency f_1 is high, it can be beneficial to vary on (turn on) a server and spread the workload across n + 1 systems. The reduced workload on each server allows us to decrease the processor frequency in each system to (n/(n+1)) · f_1 while maintaining the same response time. In this configuration, the power consumed by the n + 1 systems is:

(n + 1) × P((n/(n+1)) · f_1) = (n + 1) × (c_0 + c_1 · ((n/(n+1)) · f_1)^3)

Constructing an inequality with the power consumed by n systems and solving for f_1 in the same manner as above, we find that when the average CPU frequency of the backend servers rises above f_varyon(n), we can reduce power consumption in the cluster by varying on a new member of the cluster:

f_varyon(n) = ((c_0/c_1) · (n + 1)^2 / (2n^2 + n))^(1/3)

Thus, given the constants c_0 and c_1, the optimal average frequency range for a cluster of n systems is simply

f_varyoff(n) ≤ CPU frequency ≤ f_varyon(n)

Now we apply this theory to determine the optimal operating frequency for our simulated cluster nodes. The simulated CPU has a frequency range of 600 MHz to 1.2 GHz, a corresponding operating voltage range of 1.15 V to 1.4 V, and the rest of the system consumes 8.5 W regardless of workload. When completely idle, the server consumes approximately 13.5 Watts (5 Watts for the constant processor power, and 8.5 Watts for the rest of the system), and when fully loaded (CPU 100% busy), the system consumes 36.2 Watts. The idle power roughly corresponds to system power at CPU frequency f = 0.0, from which we can estimate c_0 = 13.5 Watts. (This estimate is not entirely accurate, since Linux places the CPU in the Halted state when the system is idle. However, for the purpose of this illustration, this small error can be ignored.) Likewise, the fully loaded power corresponds to the system power at CPU frequency f = 1.0, from which we can estimate c_1 = 22.7 Watts. Using these parameters, we can compute the “vary-on” and “vary-off” threshold frequencies for any value of n, and thence the ideal operating frequency for a given cluster. Table 1 gives some representative values, which are also illustrated graphically in Figure 1. There are two points to note. First, for a particular
Fig. 1. Optimal Power Consumption for a Server Cluster
implementation, the table can be precomputed and stored to control the operating range of the cluster and the activity state of each node. Furthermore, for a particular cluster, this theory suggests the power management policy presented as VOVO-CVS in Section 2. Our derivation of the optimal frequency range assumes that inactive servers consume no power, but this can easily be adapted for a case where inactive servers are placed in a low power state. For example, a server could be placed in standby mode, where the CPU consumes almost no power, but memory and peripherals remain powered so that state is not lost. A server in standby mode can be brought online much more rapidly than a server that is powered off. In this case, the power consumed in standby mode becomes a fixed cost of the cluster that is independent of the number of servers currently active. To determine the vary-on and vary-off frequencies for the optimal range, we can simply subtract
Table 1. Optimal average frequency range for a cluster with 1200 MHz processors

n     f_varyoff   MHz_varyoff   f_varyon   MHz_varyon
1     0.0000      0             0.9256     1111
2     0.4628      555           0.8119     974
3     0.5413      650           0.7681     922
4     0.5761      691           0.7447     894
5     0.5958      715           0.7302     876
10    0.6329      760           0.6998     840
20    0.6505      781           0.6839     821
the standby power from c_0, and then use the originally derived formulas. In other words, the c_0 constant really represents the additional fixed cost of the system incurred by placing it into the active state.

Bounds on Operating Voltage/Frequency

In computing the vary-on and vary-off points for the optimal range, we also must consider the bounds on operating voltage and frequency of the processor. There are two cases that must be considered:

• (n/(n−1)) · f_varyoff(n) > f_max: A cluster with n nodes operating at the varyoff frequency f_varyoff(n) cannot set the frequency of the remaining n − 1 nodes to the prescribed frequency because it exceeds the maximum processor frequency.
• (n/(n+1)) · f_varyon(n) < f_min: A cluster with n nodes operating at the varyon frequency f_varyon(n) cannot set the frequency of n + 1 nodes to the prescribed frequency because it is below the minimum processor frequency. In Table 1, this happens with a system containing two active nodes.
In the first case, we must reduce the varyoff frequency for n nodes to the point at which n − 1 nodes, running at f_max, can handle the work of n nodes running at the varyoff frequency. In other words, we simply set

f_varyoff(n) = f_max × (n − 1)/n
In the second case, starting a new node at f_varyon(n) will result in higher energy consumption, since the processors will be running at a higher frequency than specified by the optimal range. On the other hand, waiting to start a new node until the expected new frequency is f_min will also typically waste energy, since the corresponding frequency of the n-node cluster could be significantly above the varyon point. Thus, the optimal transition point lies somewhere between these two extremes. At the transition point, we should have

n × P(f) = (n + 1) × P(f_min)

We should expect that our n + 1 systems are not fully utilized, and therefore we include a utilization factor u in the power equation for the n + 1 systems:

n × (c_0 + c_1 · f^3) = (n + 1) × (c_0 + u · c_1 · f_min^3)
n × c_1 · f^3 = c_0 + (n + 1) · u · c_1 · f_min^3
f^3 = (c_0/c_1 + (n + 1) · u · f_min^3) / n
Table 2. Characteristics of two web server workloads

Workload                  Olympics98           Finance
Avg requests / sec        97                   16
Peak requests / sec       171                  46
Avg requests / conn       12                   8.5
Files                     61,807               16,872
Total file size           705 MB               171 MB
Requests                  8,370,093            1,360,886
Total response size       49,871 MB            2,811 MB
97%/98%/99% (MB)          24.8 / 50.9 / 141    3.74 / 6.46 / 13.9
If the two configurations perform equal work, then

f = ((n + 1)/n) × u × f_min     (2)

Substituting into the previous equation:

((n + 1)^3/n^3) × u^3 × f_min^3 = (c_0/c_1 + (n + 1) · u · f_min^3) / n
((n + 1)^3/n^2) × f_min^3 × u^3 − (n + 1) × f_min^3 × u − c_0/c_1 = 0

At this point we have a cubic equation in u, which has a closed-form solution². Solving for u and substituting into Equation 2 gives the optimal value of f_varyon.
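As a sanity check on the derivation, the short sketch below (ours) reproduces the basic, unbounded thresholds of Table 1 from the closed-form expressions, using the c_0 = 13.5 W and c_1 = 22.7 W estimates and the 1200 MHz maximum frequency.

# Sketch (ours): vary-off/vary-on thresholds from the closed-form expressions.
C0, C1, F_MAX_MHZ = 13.5, 22.7, 1200.0

def f_varyoff(n):
    return (C0 / C1 * (n - 1) ** 2 / (2 * n * n - n)) ** (1.0 / 3.0)

def f_varyon(n):
    return (C0 / C1 * (n + 1) ** 2 / (2 * n * n + n)) ** (1.0 / 3.0)

for n in (1, 2, 3, 4, 5, 10, 20):
    print(n, round(f_varyoff(n), 4), round(f_varyoff(n) * F_MAX_MHZ),
          round(f_varyon(n), 4), round(f_varyon(n) * F_MAX_MHZ))
# n = 2, for example, gives about 0.4628 (555 MHz) and 0.8119 (974 MHz), as in Table 1.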
4 Evaluation
4.1 Methodology
We constructed workloads using web server logs obtained from several production Internet servers. The first workload, Olympics98, is derived from the requests received on Feb 19, 1998 at the Nagano Winter Olympics servers at Columbus [2]. The second workload, Finance, is derived from the requests recorded on Oct 19, 1999 at the web site of a major financial services company. Table 2 summarizes the characteristics of the two workloads. The table shows the average and peak request rates per second over the entire day. Additionally, the table shows the average requests per connection, which is based on grouping 2
2 A closed form equation for the roots of a general cubic equation was first published by Girolamo Cardano (1501–1576) in his book on Algebra titled Ars Magna
Table 3. Comparison of measured to simulated CPU energy for the two workloads. Correlation coefficients were computed based on the energy used in 30-second intervals over the length of the run

                            Olympics98    Finance
Measured CPU Energy (J)     1,232,710     711,415
Simulator CPU Energy (J)    1,253,652     739,200
Error in Total Energy       1.70%         3.91%
Correlation Coefficient     0.9846        0.9960
requests into connections. The table includes statistics about the data served, including the number of files requested, their total size, and the total response size in bytes, which represents the total size of response data sent back to clients (excluding HTTP headers). The 97%/98%/99% data is the amount of memory needed to hold the unique data for 97%/98%/99% of all requests and gives an indication of the potential benefits of a RAM-based cache. For example, in the Olympics98 workload, the total size of the data required to serve 99% of requests is only 141 MB. Relatively speaking, modern systems can cache this amount entirely in memory, reducing the disk activity required to serve the workload. From the two base workloads, we generate workloads of different intensities by scaling the inter-arrival time of connections (but not requests within a connection) by a constant amount. A scalefactor of “2×” corresponds to reducing the inter-arrival time of connections by 50%. This results in an increase in connection arrival rate corresponding to the scalefactor, but maintains the same basic pattern of connection arrivals. Inter-arrival times of requests within a connection are not scaled since these times represent user think time or network and client overhead involved in retrieving multiple components of a web page. We use this scaling mechanism to evaluate server performance for a range of client load intensities. Thus we are able to scale the intensity of the workloads to “2.5×”, “4×”, etc. This methodology has been used by other researchers [4]. We extended our simulation model for energy consumption and response time of a web server [1] to model a cluster of web servers. The energy consumption reported by the simulator was validated against actual measurements performed on a “commodity” web server system with a 600MHz processor; these results are summarized in Table 3. A complete description of the methodology is presented elsewhere [1]. Our simulated CPU supports dynamic voltage scaling with a frequency range of 600MHz to 1.2GHz, and a corresponding voltage range of 1.15 V to 1.4 V. These parameters are loosely based on currently available processors designed for mobile systems. We present results for two different values of base system power consumption (the c_0 component of Equation 1): 8.5W, corresponding to current commodity servers, and 5W, which we present as representative of a future power-efficient server system.
The form of dynamic voltage scaling we use permits the CPU frequency to be set to a limited number of discrete settings. The processor voltage is set to the lowest value that allows reliable operation at a given frequency. The processor voltage and frequency are adjusted at regular intervals based on CPU utilization. At the beginning of each interval, the system determines the average CPU utilization over some set of recent intervals, and then uses a simple thresholding scheme to select the CPU frequency for the next interval. If the current CPU frequency is below the maximum setting and CPU utilization exceeds a high threshold, the CPU frequency is set to the next highest discrete frequency setting. Likewise, if the CPU is not currently at the minimum frequency and CPU utilization is less than a low threshold, the frequency is decreased one step. When the processor utilization is between the high and low thresholds, the frequency is left unchanged. In our study we use a low and high threshold of 80% and 95%, respectively. This form of dynamic voltage scaling is commonly referred to as PAST and was originally proposed by Weiser et al. [16].

Servers placed in the inactive state consume no energy. When a server is selected to be varied off, the distribution mechanism stops sending new requests to this server. When the server completes all its outstanding requests, it transitions to the inactive state after a "shutdown" time. When a server is brought online, it transitions to the active state after a "startup" time. We set both startup and shutdown times to 30 seconds in the simulations. When a vary-on or vary-off is initiated, our policy inhibits subsequent node power management actions for the next 60 seconds (twice the transition time), to allow the node to transition and the cluster to adjust to its new configuration. In particular, this time should be sufficient for a newly started web server to warm up its file cache.
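A minimal sketch of the interval-based threshold scheme described above. The discrete frequency table and the function name are illustrative assumptions, not the authors' code.

    /* Discrete frequency settings (MHz), assumed here for illustration. */
    static const int freq_steps[] = { 600, 800, 1000, 1200 };
    #define NSTEPS (sizeof(freq_steps) / sizeof(freq_steps[0]))

    #define LOW_THRESHOLD  0.80   /* step down below 80% utilization  */
    #define HIGH_THRESHOLD 0.95   /* step up above 95% utilization    */

    /* Called once per interval with the average utilization over recent
       intervals; returns the index of the frequency step to use next.    */
    int select_next_step(int cur_step, double avg_util)
    {
        if (avg_util > HIGH_THRESHOLD && cur_step < (int)NSTEPS - 1)
            return cur_step + 1;            /* one discrete step up   */
        if (avg_util < LOW_THRESHOLD && cur_step > 0)
            return cur_step - 1;            /* one discrete step down */
        return cur_step;                    /* otherwise unchanged    */
    }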
4.2 Results
Table 4 summarizes the results of our simulations. For both the Finance and Olympics98 workloads, the table shows the energy and response times for the five policies described in Section 2. The total energy data is also presented graphically in Figure 2.

For both workloads, the IVS policy exhibits significant savings over the case without any power management. However, while the CVS policy does save more energy than IVS, the extra savings only amount to about 1%. Since the implementation complexity of CVS is far greater than that of IVS, the findings indicate that a first step in saving energy is to simply populate a server cluster with voltage-scaled processors. Note that the response time afforded by both policies is well within acceptable limits. Compared to desktop and mobile workloads, relatively larger response times are acceptable for web workloads because the response time perceived by the client is typically much larger once other components such as WAN delays are factored in [12].

Comparing IVS with VOVO yields an interesting result. While VOVO saves more energy for the Finance workload, the opposite is true for Olympics98. The switch comes about because of the nature of the two workloads [1]. After a steep increase in intensity at 9:30am, the Finance workload maintains a high intensity
Table 4. Simulated energy and response time for the different cluster power management policies. Energy1 uses a fixed cost of 8.5 W, and Energy2 uses a fixed cost of 5 W. Except for VOVO-CVS, response times are independent of the fixed cost. For VOVO-CVS, the fixed cost plays a role in determining when nodes are brought online and taken offline, causing the response times to vary; here, the variation is minor. 90% of responses were satisfied below the 90%-ile response time numbers.

                             Energy1                      Energy2                      Response Time
    Workload    Policy       kJ     %/None  %/VOVO        kJ     %/None  %/VOVO        Avg (ms)  90%-ile (ms)
    Finance     None         17879  --      --            14785  --      --            5         8
                IVS          14404  19.4    --            11310  23.5    --            10        22
                CVS          14216  20.5    --            11122  24.8    --            12        26
                VOVO         10313  42.3    --            9133   38.2    --            14        26
                VOVO-IVS     9237   48.3    10.4          8062   45.5    11.7          16        34
                VOVO-CVS     8985   49.7    12.9          7469   49.5    18.2          19        38
    Olympics98  None         20977  --      --            17925  --      --            5         9
                IVS          15821  24.6    --            12769  28.8    9.2           13        29
                CVS          15445  26.4    0.8           12393  30.9    11.9          15        34
                VOVO         15818  24.6    --            14064  21.5    --            13        24
                VOVO-IVS     14605  30.4    7.7           12852  28.3    8.6           18        34
                VOVO-CVS     13968  33.4    11.7          11533  35.7    18.0          36        38

    (%/None and %/VOVO denote the percentage energy savings relative to the no-power-management and VOVO configurations, respectively.)
level until about 6pm. Afterward, the Finance workload decreases in intensity. In comparison, the Olympics98 workload is more stable. VOVO saves more energy than IVS for Finance only because during the periods of low intensity (late evening and night), VOVO is able to place machines offline. IVS becomes more favorable as the fixed costs of the cluster components decrease.

VOVO-IVS always performs better than VOVO. For the set of workloads and fixed costs, the improvement ranges from about 8% to 12%. At the same time, the response time degradation is well within acceptable norms. These findings suggest that when building web clusters, using voltage-scaled processors (for example, Transmeta processors) in combination with VOVO provides increased energy savings with minor additional implementation cost.

The policy that yields the most energy savings is VOVO-CVS. Compared to VOVO, VOVO-CVS is able to provide savings ranging from 12% to 18%. When compared to VOVO-IVS, the savings range from 3% to 10%. While VOVO-CVS has an increased implementation complexity compared to VOVO, the savings it affords increase as the fixed costs of the system decrease. VOVO-CVS saves up to 50% of the energy expended by the cluster with no power management.
Fig. 2. Total energy consumed, by workload and base power consumption, for our five cluster power management policies
Figures 3 through 5 present the request rate, power consumption, and number of active servers for the VOVO and VOVO-CVS policies over the course of both workloads. Note that for both policies, power consumption closely follows the workload on the cluster, with the VOVO-CVS policy consuming less power at almost all workload intensities. The graphs show two or three instances in which VOVO-CVS consumes more power for a short period, all of which occur when the workload levels off after a period of rapid increase. Figure 5 shows that VOVO-CVS often uses more servers than VOVO; this is the fundamental source of the power savings provided by VOVO-CVS. By using more servers, the variable component of power consumption is reduced quadratically by voltage scaling, to the point where it balances against the additional fixed cost of a larger active set.

Figure 6 presents the power savings from the CVS, VOVO, and VOVO-CVS policies over a server cluster without power management over the course of both workloads. These graphs indicate that at low workload intensities, the VOVO policy provides large savings since the workload can be adequately served by a small set of machines. However, these savings drop significantly as workload increases, and VOVO provides relatively little benefit at moderately heavy workloads. In contrast, power savings from CVS are modest at low workload intensities, since the CPU is idle most of the time and thus obtains no benefit from voltage scaling. As workload increases, average CPU utilization increases, yielding increased savings from the CVS policy. Finally, by employing both vary-on/vary-off and voltage scaling, the combined VOVO-CVS policy achieves significant power savings across a broad range of workload intensities.
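A rough back-of-the-envelope calculation makes this trade-off concrete. It assumes, for illustration only, that per-node power decomposes into a fixed cost c0 plus a dynamic term proportional to f·V² with V roughly proportional to f; the paper's actual power model may differ in its constants. Serving a fixed aggregate request rate on n identical voltage-scaled nodes lets each node run at roughly 1/n of the single-node frequency, so per-node dynamic power falls as 1/n³ and

    P_cluster(n) ≈ n · c0 + P_dyn(1) / n²,

i.e., the dynamic component shrinks quadratically in n while the fixed component grows only linearly. Adding servers therefore pays off until the n·c0 term dominates, which is why VOVO-CVS sometimes keeps more nodes online than plain VOVO.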
Fig. 3. Request rate (two panels, Olympics98 and Finance; requests/sec vs. time in hours)

Fig. 4. Power consumption (VOVO and VOVO-CVS; energy in Joules vs. time in hours, for both workloads)

Fig. 5. Number of active servers (VOVO and VOVO-CVS; servers vs. time in hours, for both workloads)
Fig. 6. Energy savings from three policies over time (CVS, VOVO, and VOVO-CVS; % savings over base vs. time in hours, for both workloads)
Another way to interpret these results is that node vary-on/vary-off is a coarse-grained mechanism that provides power savings in a few large increments, with correspondingly large effects on cluster capacity. In contrast, voltage scaling is a fine-grained mechanism, since each individual power management action saves only a small amount of power, but also results in a very small decrease in cluster capacity. Furthermore, many components, particularly disk drives, are adversely affected by frequent power cycles, and thus node vary-on/vary-off must be constrained in the frequency of power management actions. Voltage scaling has no such adverse effect, and thus can be applied much more liberally. Thus, VOVO-CVS represents a combination of coarse- and fine-grained power management that outperforms either mechanism used alone.
5 Related Work
Reducing power consumption by reducing the clock frequency of the processor has been widely studied [16, 8]. Until recently, this mechanism was employed almost exclusively in portable, battery-powered systems as a means of extending battery life. However, power consumption and the related costs of cooling and reliability have led to a new focus on server systems with voltage-scaled processors. Mudge [10] observed that the combination of voltage scaling and parallelism (e.g., clusters) could be applied to reduce energy consumption. Indeed, there is an emerging class of servers designed around processors with dynamic voltage scaling mechanisms [14], though to the best of our knowledge, there has been no formal study of the potential energy savings in clusters of such systems.

Flautner et al. [5] explored a software-managed dynamic voltage scaling policy that sets CPU speed on a task basis rather than by time intervals. Lorch et al. [9] describe how a task-based, software-managed voltage-scaling policy can be optimized when task completion times cannot be predicted accurately. Both of these techniques perform well for desktop application workloads, but it is unclear how they could be applied to server environments. Our work demonstrates how
the same basic mechanism of software-managed dynamic voltage scaling can be used to achieve significant power savings in server clusters.

Pinheiro et al. [11] proposed a simple policy for managing energy use in server clusters by powering machines on and off (similar to the VOVO policy). They examined this policy in the context of both web-server and compute-server clusters. Chase et al. [4] have employed this mechanism in the context of an economic framework in which web sites "bid" on resources based on their current workload. They report savings of 29%, 38%, and 78% for three different workloads. We believe the high energy savings they report for the third workload (from the World Cup trace) can be attributed to the significant variation in intensity of that workload. Our work complements and extends these studies by comparing VOVO to two forms of dynamic voltage scaling, and by extending VOVO with voltage scaling mechanisms.

A wide range of techniques have been explored for power management in microprocessors, and many microprocessor architectures and microarchitectures incorporate power-saving features: examples include the mobile processors available from Intel with its SpeedStep(TM) technology and the Transmeta Crusoe processor with LongRun [6]. More recently developed, and less widely deployed today, are new memory chip architectures that incorporate similar "spin down" states so that the system can actively manage the power used by main memory [3]. In addition, a number of current research efforts are focusing on new power management mechanisms employed at the operating system [15] and application layers [7] of the system. Techniques for dynamically controlling processor temperature [13] can also be applied to web servers; this results in power savings because CPU activity is decreased to lower processor temperature. Some of these mechanisms could be used to extend the gains from power management demonstrated in this paper.
6 Conclusions
In this paper, we have described and analyzed five distinct policies for power management in server clusters with varying degrees of implementation complexity. The policies employ various combinations of intra-node (dynamic voltage scaling) and inter-node (node vary-on/vary-off) power management mechanisms to reduce the aggregate power consumption of a server cluster during periods of reduced workload. We use a validated simulator to study the potential benefits of these policies for workloads derived from the server access logs of the 1998 Nagano Winter Olympics and a financial services web site. We find that a relatively simple policy of independent dynamic voltage scaling on each server node yields energy savings between 20% and 29%. A policy that varies nodes on or off based on the workload intensity achieves between 22% and 42% savings at the expense of cluster-wide coordination. However, for some cases, we find that independent voltage scaling is more efficient without the need for cluster-wide coordination. The most energy savings are obtained by a policy that combines node vary-on/vary-off and voltage scaling with pre-computed
optimal transition points. This combined policy achieves up to 18% more savings than pure vary-on vary-off, albeit at the expense of increased implementation complexity. Compared to a cluster that is not power managed, the combined policy saves between 33% and 50% of the cluster energy.
References

[1] Pat Bohrer, Mootaz Elnozahy, Tom Keller, Michael Kistler, Charles Lefurgy, Chandler McDowell, and Ramakrishnan Rajamony. The Case for Power Management in Web Servers. In Robert Graybill and Rami Melhem, editors, Power-Aware Computing, Kluwer/Plenum Series in Computer Science, to appear, January 2002.
[2] Jim Challenger, Paul Dantzig, and Arun Iyengar. A Scalable and Highly Available System for Serving Dynamic Data at Frequently Accessed Web Sites. In Proceedings of ACM/IEEE Supercomputing (SC98), Orlando, Florida, November 1998.
[3] Rambus Corporation. Rambus Technology Overview, February 1999.
[4] Jeff Chase et al. Managing Energy and Server Resources for a Hosting Center. In 18th Symposium on Operating Systems Principles (SOSP), October 2001.
[5] K. Flautner, S. Reinhardt, and T. Mudge. Automatic Performance Setting for Dynamic Voltage Scaling. In Proceedings of the 7th ACM International Conference on Mobile Computing and Networking (MOBICOM), July 2001.
[6] M. Fleischmann. Crusoe Power Management: Cutting x86 Operating Power Through LongRun. Embedded Processor Forum, June 2000.
[7] J. Flinn and M. Satyanarayanan. Energy-Aware Adaptation for Mobile Applications. In 17th ACM Symposium on Operating Systems Principles (SOSP'99), pages 48-63, 1999.
[8] K. Govil, E. Chan, and H. Wasserman. Comparing Algorithms for Dynamic Speed-Setting of a Low-Power CPU. In Mobile Computing and Networking, 1995.
[9] Jacob R. Lorch and Alan Jay Smith. Improving Dynamic Voltage Scaling Algorithms with PACE. In ACM SIGMETRICS 2001, June 2001.
[10] Trevor Mudge. Power: A First Class Architectural Design Constraint. IEEE Computer, 34(4):52-57, April 2001.
[11] E. Pinheiro, R. Bianchini, E. V. Carrera, and T. Heath. Load Balancing and Unbalancing for Power and Performance in Cluster-Based Systems. In Workshop on Compilers and Operating Systems for Low Power, September 2001.
[12] R. Rajamony and M. Elnozahy. Measuring Client-Perceived Response Times on the WWW. In Proceedings of the 3rd USENIX Symposium on Internet Technologies and Systems (USITS'01), March 2001.
[13] Erven Rohou and Michael D. Smith. Dynamically Managing Processor Temperature and Power. In 2nd Workshop on Feedback-Directed Optimization, November 1999.
[14] RLX Technologies. http://www.rlxtechnologies.com/home.html.
[15] A. Vahdat, A. Lebeck, and C. Ellis. Every Joule is Precious: The Case for Revisiting Operating System Design for Energy Efficiency. In 9th ACM SIGOPS European Workshop, September 2000.
[16] M. Weiser, B. Welch, A. Demers, and S. Shenker. Scheduling for Reduced CPU Energy. In First Symposium on Operating Systems Design and Implementation, pages 13-23, Monterey, California, 1994.
Single Region vs. Multiple Regions: A Comparison of Different Compiler-Directed Dynamic Voltage Scheduling Approaches

Chung-Hsing Hsu and Ulrich Kremer
Department of Computer Science, Rutgers University, Piscataway, New Jersey, USA
{chunghsu,uli}@cs.rutgers.edu

This research was partially supported by NSF CAREER award CCR-9985050.
Abstract. This paper discusses the design and implementation of a profile-based power-aware compiler using dynamic voltage scaling. The compiler identifies program regions where the CPU can be slowed down without resulting in a significant overall performance loss. Two strategies have been implemented in SUIF2. The single-region strategy slows down a single region for energy savings, while the multiple-region strategy slows down as many regions as needed. A comparison of both strategies based on six SPECfp95 benchmarks shows that in five cases the energy-delay product was comparable; in the remaining case the multiple-region strategy was significantly better. Both strategies achieved energy savings of up to 48% for the five programs, at slowdowns between 1% and 16%, and for the remaining program energy savings of 74% for the multiple-region strategy vs. 50% for the single-region strategy, at slowdowns of up to 21%.
1 Introduction
With the advances in technology, power is becoming a first-class architecture design constraint not only for embedded/portable electronic devices but also for high-end computer systems [18]. In this paper, we focus on software-controlled power management using dynamic voltage scaling [19]. Dynamic voltage scaling is a technique that varies the CPU supply voltage and frequency on-the-fly to provide multiple power modes with different performance levels. Energy-efficient computation can be achieved by dynamically adapting the CPU performance level to the current needs, i.e., reducing the CPU supply voltage and frequency when the CPU is not being fully utilized. The presented approach uses whole-program analyses at compile time to identify CPU slack that can be exploited through dynamic voltage scheduling without resulting in significant performance penalties. The target applications are not real-time systems with hard deadlines, but applications that may tolerate a small performance loss in exchange for power and energy savings. Given a soft execution deadline, i.e., a user-supplied acceptable performance penalty, our compilation strategy determines CPU slowdown factors for program regions that
are expected to yield the highest energy savings within the performance penalty range.

One factor that distinguishes our work from others is the type of CPU slack being exploited. Some work (e.g. [19, 15]) identifies the slack between the processing time and human perception time, while others (e.g. [13, 21]) take advantage of the difference between a conservative performance estimate and the real execution time for applications with hard performance deadlines. In this paper we exploit a third type of CPU slack, in which memory accesses are on the critical path for performance. Since the CPU processing is not on the critical path, it can be slowed down without introducing significant performance loss. However, the CPU may not be slowed down too much, since it issues the memory instructions and may alter the criticality of the memory bottleneck.

The presented work is one of the first to use a compiler approach for voltage scheduling. Compilers have the advantage that the entire program can be analyzed, and in addition be modified to exhibit a desired characteristic, thereby enabling further optimizations. Compilation strategies work well if the program behavior can be derived at compile time. For such applications, more aggressive optimizations can be performed, and the performance and energy overhead introduced by operating system or hardware approaches can be avoided. However, not all programs allow static analyses that yield sufficient information about their runtime characteristics. In such cases, operating system and/or hardware techniques (e.g. [23]) are more promising strategies. We believe that hybrid approaches for voltage scheduling, consisting of a combination of compiler, operating system, and hardware strategies, will be most effective and necessary, for instance in multiprogramming environments. A discussion of such hybrid approaches is beyond the scope of this paper.

This paper presents the design and implementation of a profile-based power-aware compiler using dynamic voltage scaling. The compiler identifies memory-bound program regions and implements two strategies in SUIF2. The single-region strategy slows down a single region for energy savings, while the multiple-region strategy slows down as many regions as needed. A comparison of both strategies based on six SPECfp95 benchmarks shows that in five cases the energy-delay product was comparable; in the remaining case the multiple-region strategy was significantly better. Both strategies achieved energy savings of up to 48% for the five programs, at slowdowns between 1% and 16%, and energy savings of 74% for the multiple-region strategy vs. 50% for the single-region strategy for the remaining program, at slowdowns of up to 21%.
2 Basic Compilation Strategy
Our compilation strategy tries to find memory-bound program regions where the CPU may be slowed down without significantly affecting the overall program performance. The basic idea is to "hide" the degraded CPU performance behind the memory hierarchy accesses which are on the critical path. Within each such region, a slowdown factor will be selected by the compiler. A slowdown factor
δ is defined as the ratio of the peak CPU frequency to the desired frequency. For example, δ = 2 on a 1 GHz machine indicates a desired frequency of 500 MHz. In our model, we assume that dynamic frequency and voltage changes only occur between regions with different slowdown factors.

Our compilation strategy considers the entire program P as the union of program regions R_i, i.e., P = ∪_i R_i, each of which is characterized by a quadruple (W_i^c, W_i^b, W_i^m, v_i). The values W_i^c, W_i^b, W_i^m represent the workload (in cycles) of different parts of region R_i, and v_i represents the number of times R_i is accessed during the entire program execution. The total workload for region R_i is then defined as W_i = W_i^c + W_i^b + W_i^m, and the total workload for program P is defined as W = Σ_i W_i. A program region is split into three parts to better estimate the performance impact of the CPU slowdown for a region [9]. Specifically, if region R_i is slowed down by a factor of δ, the resulting performance will become

    W_i(δ) ≝ δ · W_i^c + max(δ · W_i^b, W_i^b + W_i^m)
where

– W_i^c is the number of cycles in region R_i during which the CPU is busy while the memory is idle (cpu busy); this includes CPU pipeline stalls due to hazards and activities of both the level-one and level-two caches,
– W_i^m is the number of cycles in region R_i during which the CPU is stalled while waiting for data from memory (memory busy),
– W_i^b is the number of cycles in region R_i during which both CPU and memory are active at the same time (both busy).

A slowdown factor δ, by its definition, is never less than one, i.e., δ ≥ 1. In addition, the total workload of region R_i, W_i, is treated as an abbreviation of W_i(1). The model assumes that CPU cycles that did not overlap with memory activities before the slowdown, W_i^c, will also not overlap with memory activities after the CPU slowdown, and that CPU cycles that did overlap with memory activities before the slowdown, W_i^b, will maintain that property after the slowdown. As a result, if the entire W_i^b workload can be hidden behind the memory activity workload (W_i^b + W_i^m), the only performance cost is that the CPU-busy part grows to δ · W_i^c; if only partial hiding is possible, an additional performance penalty is accounted for.

The dynamic voltage scheduling based on program regions is formulated as follows: given a program P = ∪_i R_i, we solve the mixed-integer nonlinear programming (MINLP) problem (P) for the variables δ_i:

    (P)  minimize    E = (1/W) · Σ_i W_i / δ_i²
         subject to  Σ_i W_i(δ_i) + s · Σ_{i,j} v_ij · θ(δ_i, δ_j) ≤ (1 + r) · W
                     1 ≤ δ_i,  δ_i ∈ ℝ
Fig. 1. W = 1068 million cycles, s = 10000, and r = 0.01. Workloads are given in million cycles. The graph on the left represents the control flow between regions R1–R6. On the right are the workload characteristics for each region R_i, as recorded by the profile:

    Region   W_i^c    W_i^m    W_i^b   v_i
    R1       116.98    27.35    8.27    1
    R2        52.82   200.67   54.46   10
    R3        59.70   207.52   38.34   10
    R4         2.81     4.43    1.22    1
    R5         0.06    27.98    2.72    1
    R6         9.79   201.25   52.06    8

Here, s represents the performance cost of each transition, v_ij represents the number of transitions between regions R_i and R_j, and r represents the user-defined performance penalty. The function θ(·,·) indicates whether a transition occurs or not and is defined as follows:

    θ(δ_i, δ_j) ≝ 0 if δ_i = δ_j, and 1 otherwise.

The first inequality models the resulting performance of all regions after their respective slowdowns, plus the transition costs introduced by switching between different voltages/frequencies. Problem (P) searches for the appropriate δ_i values such that the performance penalty of the voltage-scaled program does not exceed the user-specified value and its (relative) energy consumption E is minimized. For example, the real execution of swim on the training input indicates the following transitions for Figure 1:
    R1 → R2 (×1), R2 → R3 (×10), R3 → R4 (×1), R3 → R5 (×1), R3 → R6 (×8), R5 → R2 (×1), R6 → R2 (×8),

where the count on each edge represents v_ij. Given a performance tolerance of 1% (r = 0.01) and an assumed voltage scaling overhead of 10,000 cycles (s = 10000), the optimal solution for problem (P) is E = 77.3% with the following δ assignment: δ1 = 1, δ2 = 1.03, δ3 = 1.03, δ4 = 1, δ5 = 4.58, δ6 = 1.75.
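As a quick sanity check of the numbers above, the following small program (not from the paper; written here for illustration) plugs the Figure 1 workloads and the reported δ assignment into the objective of problem (P). It prints E ≈ 0.774, matching the 77.3% quoted in the text up to the rounding of the published δ values.

    #include <stdio.h>

    /* Figure 1 workloads in million cycles (cpu-busy, memory-busy, both-busy)
       together with the optimal slowdown factors reported for problem (P).   */
    static const double Wc[6]    = {116.98,  52.82,  59.70, 2.81,  0.06,   9.79};
    static const double Wm[6]    = { 27.35, 200.67, 207.52, 4.43, 27.98, 201.25};
    static const double Wb[6]    = {  8.27,  54.46,  38.34, 1.22,  2.72,  52.06};
    static const double delta[6] = {  1.00,   1.03,   1.03, 1.00,  4.58,   1.75};

    int main(void)
    {
        double W = 0.0, scaled = 0.0;
        for (int i = 0; i < 6; i++) {
            double Wi = Wc[i] + Wb[i] + Wm[i];     /* total workload of region i */
            W      += Wi;
            scaled += Wi / (delta[i] * delta[i]);  /* objective term W_i/delta^2 */
        }
        printf("W = %.1f Mcycles, E = %.3f\n", W, scaled / W);
        return 0;
    }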
Using the optimal δ assignment, the compiler can then insert DVS instructions at the appropriate places. In our example, the entries of regions R2, R4, R5, and R6 are "guarded" with DVS instructions setting the desired CPU frequency. The simulation result showed a performance degradation of 2.2% with a relative energy consumption of 75.1%.

A special case of the region-based compilation strategy is to find a single program region that minimizes the energy consumption within the same (1 + r) · W deadline. Issues related to the single-region strategy have been discussed in [8]. An important difference between the single-region strategy and the multiple-region strategy is that the single-region strategy considers combined regions as well, i.e., R_{i&j} = R_i ∪ R_j. For example, the single-region strategy not only evaluates regions R1–R6 but also examines the combined regions R_{2&3}, R_{5&6}, and R_{1–6}. With the same performance tolerance of 1% and DVS overhead of 10,000 cycles, the compiler found that the combined region R_{5&6} with δ_{5&6} = 2.07 gives the best energy saving. That is, the δ assignment derived by the single-region strategy can be described as follows: δ1 = 1, δ2 = 1, δ3 = 1, δ4 = 1, δ5 = 2.07, δ6 = 2.07. Experimental results showed that this selection resulted in an energy consumption of 75.7% and a performance penalty of 2.7%.

The experimental results showed that, for five of the six SPECfp95 benchmarks we tested, the multiple-region strategy and the single-region strategy were similar in their effectiveness; the multiple-region strategy did much better on one benchmark. While conceptually the multiple-region strategy should be at least as effective as the single-region strategy, it is more complicated to implement and raises different issues than the single-region strategy. More detailed comparisons are discussed in Section 4.
3 Implementation
The prototype of the dynamic voltage scheduling based on program regions is implemented as part of SUIF2 [24] using a profile-driven approach. Both the multiple-region strategy and the single-region strategy are implemented. The prototype implementation has four phases. The first phase instruments the original C program at appropriate program locations. The instrumented code is then executed (the second phase), collecting the information needed in the third phase. The third phase uses both the instrumented program and the profile information to determine the best δ assignment for the program. Once the slowdown factors of the regions are determined, the final phase restores the instrumented program back to the original one and inserts speed-setting instructions at the appropriate region boundaries. The output of the prototype is the original program with a few additional DVS instructions. Figure 2 shows the phases of the implementation.

Phase 1: Instrumentation – Two kinds of program constructs are instrumented in our implementation, namely call sites and explicit loop structures.
Fig. 2. The flow diagram of the compiler implementation (original C program → SUIF2 passes → instrumented C program → profiling machine → SUIF2 passes → DVS'ed C program)

Explicit loop structures include for, while, and do-while loops. Loops based on goto's are not instrumented in the current implementation.

Phase 2: Profiling – The information collected for each instrumented program construct R is the quadruple (W_R^c, W_R^m, W_R^b, v_R). While our experimental results rely on a simulator, hardware performance counters may also be used, if such counters are available.

Phase 3: Region Selection – The choice of program regions is implementation dependent. A program region can be defined as small as a basic block or as large as a procedure body. While program regions of small granularity may expose more opportunities for energy reduction, their large number may prevent the solver from finding the optimal δ assignment efficiently. Our current implementation assumes a program region to be of single entry and single exit. For the multiple-region strategy, only perfect loop nests are considered as program regions. For the single-region strategy, since combined regions are also taken into account, our implementation considers loop nests, call sites, if-statements, and sequences of regions as program regions.

As part of the third phase, the multiple-region strategy is required to solve the MINLP problem (P). Currently, we rely on a MINLP solver, namely MINLP [14], on the NEOS server [6] to solve the problem. Since the solver will run for a long time if the problem size is too large, we adopted various techniques to either reformulate the problem or cut down the problem size by approximation. In addition, the profiling phase only records the quadruple of each region, not the transitions between regions. A reaching-definition analysis sub-phase has been implemented to estimate the values of v_ij. More details are discussed in the following section. For the single-region strategy, a pre-analysis is needed to compute the quadruple of each combined region, since it is not profiled. Once all the quadruples of
the candidate regions are available, an enumeration process begins to find the δ assignment (or the region) that has the smallest objective function value E in problem (P_s). The details of the implementation can be found in [8].

Phase 4: Code Generation – Finally, the speed-setting instructions are placed at the appropriate program locations. For the single-region strategy, the entry and exit of the slowed-down region are "guarded" with these instructions. The speed at the entry of the region is set according to the slowdown factor; at the exit, the speed is restored to the original, non-scaled frequency. For the multiple-region strategy, only the entry of a region is considered as a candidate location for inserting the speed-setting instruction. The region is "guarded" if there is an immediately preceding region that has a different slowdown factor.
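As an illustration of the code the final phase emits, the fragment below shows a region guarded by speed-setting calls. The function name set_cpu_speed() and the frequency values are placeholders for whatever speed-setting instruction the target provides; they are not the names used by the authors' prototype.

    /* Hypothetical speed-setting primitive: requests the given CPU frequency
       (in MHz); on the simulator this would map to the speed-setting instruction. */
    void set_cpu_speed(int mhz) { (void)mhz; /* stub for illustration */ }

    void compute(double *a, const double *b, int n)
    {
        set_cpu_speed(400);              /* entry of a memory-bound region:    */
        for (int i = 0; i < n; i++)      /* run at the region's slowed-down    */
            a[i] = a[i] * 0.5 + b[i];    /* frequency (slowdown factor > 1)    */
        set_cpu_speed(1000);             /* exit: restore the peak frequency   */
    }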
3.1 Estimate the Transition Graph
For the multiple-region strategy, information about the number of transitions v_ij between regions is needed. While it can certainly be obtained through edge/path profiling techniques (e.g. [3]), the current prototype implements a reaching-definition analysis pass to estimate the values from the profiles of the regions. The idea is to treat each region as an assignment to a global variable shared by all the regions. In doing so, reaching-definition analysis captures the control flow between basic regions. The number of transitions between two regions is then determined by the minimum of the number of visits to both regions, i.e., v_ij = min(v_i, v_j).

This analysis may over-estimate the values of the v_ij. For example, the analysis of transitions between the regions in Figure 1 derives the following transition graph:

    R1 → R2 (×1), R2 → R3 (×10), R3 → R4 (×1), R3 → R5 (×1), R3 → R6 (×8),
    R4 → R5 (×1), R4 → R6 (×1), R5 → R2 (×1), R6 → R2 (×8).

Compared with the real execution behavior, the estimated transition graph introduces two unrealizable transitions, R4 → R5 (×1) and R4 → R6 (×1). A more precise, possibly more costly, analysis to estimate the v_ij is to solve it as a network flow problem, i.e., the total incoming flow always equals the total outgoing flow. In this case, the extra transitions introduced by our analysis do not make a big impact because R4 is executed only once. In general, the analysis may under-estimate the potential energy reduction achievable with the multiple-region strategy.
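A compact sketch of the estimate v_ij = min(v_i, v_j), applied to every region pair that the reaching-definition analysis reports as connected. The adjacency representation is an assumption for illustration; the prototype's actual data structures are not described in the paper.

    #define NREGIONS 6

    /* visits[i] is v_i from the profile; reaches[i][j] is nonzero if the
       reaching-definition analysis says control can flow from R_i to R_j. */
    void estimate_transitions(const long visits[NREGIONS],
                              const int  reaches[NREGIONS][NREGIONS],
                              long vij[NREGIONS][NREGIONS])
    {
        for (int i = 0; i < NREGIONS; i++)
            for (int j = 0; j < NREGIONS; j++)
                vij[i][j] = reaches[i][j]
                          ? (visits[i] < visits[j] ? visits[i] : visits[j])
                          : 0;   /* v_ij = min(v_i, v_j) on reported edges */
    }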
3.2 Reformulate the Problem
Problem (P) is difficult to describe in modeling languages such as AMPL and GAMS, which are in turn needed by the MINLP solver. To eliminate the maximum function and the binary function θ(·, ·), problem (P) is reformulated as follows:
    (P')  minimize    E = (1/W) · Σ_i W_i / δ_i²
          subject to  Σ_i (W_i^c · δ_i + z_i) + s · Σ_{i,j} v_ij · θ_ij ≤ (1 + r) · W
                      z_i ≥ δ_i · W_i^b,   z_i ≥ W_i^b + W_i^m
                      θ_ij · (1 − u) ≤ δ_i − δ_j ≤ θ_ij · (u − 1)
                      1 ≤ δ_i ≤ u
                      δ_i ∈ ℝ,   θ_ij ∈ {0, 1}

The variables z_i and θ_ij are introduced to model the results of the maximum function, z_i = max(δ_i · W_i^b, W_i^b + W_i^m), and the binary function, θ_ij = θ(δ_i, δ_j), respectively. (Since z_i appears only on the left-hand side of the performance constraint, a feasible solution can always lower z_i to this maximum without changing the objective; likewise, when θ_ij = 0 the pair of inequalities on δ_i − δ_j forces δ_i = δ_j, so θ_ij must be 1 whenever the two regions use different slowdown factors.) The upper bound u on a slowdown factor is also introduced to support specifying θ_ij. In practice, this upper bound always exists due to hardware limits; it is defined to be u = 5 throughout the paper.

The size of problem (P') can be characterized by a pair (n, m) that specifies a transition graph of n regions and m transitions. MINLP problems are in general considered extremely hard, since they combine the numerical difficulties of nonlinear programming with the combinatorial aspect of integer programming. Experience tells us that when the number of transitions m exceeds 50, the solver has a hard time solving the problem efficiently. As a result, various techniques have been applied, when necessary, to reduce the problem size. In particular, two techniques have been used to identify large v_ij and to enforce that regions R_i and R_j have the same slowdown factor.
3.3 Other Issues
There are a couple of other issues involved in the design of the multiple-region strategy, for example:

1. How to model the δ's of regions that behave like a wild card? In our formulation, each region has a specific δ value. Alternatively, regions could be left "open" without any specific δ assignment.
2. How to model the "conflict" between certain regions? For example, once a procedure has δ assignments for regions inside it, it does not make sense to assign δ's to all call sites of this procedure.

In our problem formulation for the multiple-region strategy, all regions are assumed to be independent, i.e., no region can prohibit the rest from being slowed down. In practice, however, this may not be the case. Consider the C code in Figure 3 (a sketch of its structure is given below). In this code, there are five regions R1–R5 that can be slowed down, including two call sites, R2 and R3, to the same procedure h(). Suppose the call sites R2 and R3 are chosen to be slowed down with slowdown factors δ2 and δ3, respectively, and δ2 ≠ δ3. It then does not make sense to assign slowdown factors to the regions inside procedure h(), i.e., R4 and R5. On the other hand, we may slow down regions in a procedure and exclude all the calls to that procedure from the basic region candidates. In summary, there are "conflicts" between regions being slowdown candidates.
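A minimal sketch of the structure in Figure 3, reconstructed from the region labels visible in the figure; the loop bound N and the loop bodies work1() .. work3() are hypothetical placeholders, since the original listing is truncated in the source.

    #define N 1000
    static volatile double sink;                    /* hypothetical loop bodies   */
    static void work1(int i) { sink += i; }
    static void work2(int i) { sink += 2.0 * i; }
    static void work3(int i) { sink += 3.0 * i; }

    void h(void);                                   /* forward declaration        */

    void f(void) {
        for (int i = 0; i < N; i++)                 /* R1: loop nest in f()       */
            work1(i);
        h();                                        /* R2: first call site of h() */
    }

    void g(void) {
        h();                                        /* R3: second call site of h()*/
    }

    void h(void) {
        for (int i = 0; i < N; i++)                 /* R4: first loop in h()      */
            work2(i);
        for (int i = 0; i < N; i++)                 /* R5: second loop in h()     */
            work3(i);
    }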
Fig. 3. A possible code structure that complicates the choice of basic regions. Regions R1–R5 are potential candidates for our compiler strategy, but not all of them can be slowed down at the same time.

Our current implementation only considers perfect loop nests without calls inside as candidate regions. For Figure 3, this means that only regions R4 and R5 are considered. The conflicts can be modeled by introducing a control binary variable for each case in which there are multiple call sites to a procedure, and this extension integrates smoothly into our multiple-region problem formulation. All these issues will be addressed in the future.
4 Experiments
The experimental setting is as follows. Six SPECfp95 benchmarks were used as input programs. The SimpleScalar out-of-order issue processor timing simulator [5], with memory hierarchy extensions and DVS extensions, served as the underlying machine model. The transition costs, i.e., the voltage switching overheads, were modeled in the simulator. The training data sets (train.in) provided with the benchmark distribution were used during the profiling phase of our compiler. To reduce the simulation time, the reduced reference data sets (std.in) developed by Burger [4] were used instead of the original reference data inputs. The user-specified relative performance penalty r was varied from 1% to 10% in order to expose energy/performance trade-offs. Finally, a simple analytical energy model was used to estimate the energy consumption of a program.

SimpleScalar provides a cycle-accurate simulation environment for a modern out-of-order superscalar processor with a 5-stage pipeline and a fairly accurate branch prediction mechanism. The memory extensions model the limits of non-blocking caches through a finite number of miss status holding registers (MSHRs) [12]. Bus contention and arbitration at all levels are also taken into account. Table 1 gives the simulation parameters used in the experiments.

The DVS extensions introduce a new speed-setting instruction. The speed-setting instruction takes as argument an integer that specifies the desired CPU
Table 1. System simulation parameters

    Simulation parameter   Value
    fetch width            4 instructions/cycle
    decode width           4 instructions/cycle
    issue width            4 instructions/cycle, out-of-order
    commit width           4 instructions/cycle
    RUU size               64 instructions
    LSQ size               32 instructions
    FUs                    4 intALUs, 1 intMULT, 4 fpALUs, 1 fpMULT, 2 memports
    branch predictor       gshare, 17-bit wide history
    L1 D-cache             32KB, 1024-set, direct-mapped, 32-byte blocks, LRU, 1-cycle hit, 8 MSHRs, 4 targets
    L1 I-cache             as above
    L1/L2 bus              256-bit wide, 1-cycle access, 1-cycle arbitration
    L2 cache               512KB, 8192-set, direct-mapped, 64-byte blocks, LRU, 10-cycle hit, 8 MSHRs, 4 targets
    L2/mem bus             128-bit wide, 4-cycle access, 1-cycle arbitration
    memory                 100-cycle hit, single bank, 64-byte/access
    TLBs                   128-entry, 4096-byte page
    compiler               gcc 2.7.2.3 -O3
frequency. Its semantics was implemented in the following way: (1) stop fetching new instructions and wait until the CPU enters the ready state, i.e., the speed-setting instruction is not speculative, the pipeline is drained, all functional units are idle, and all pending memory requests are satisfied; (2) wait a fixed number of cycles to model the process of scaling up/down to the new frequency; and (3) resume execution using the new frequency. Each step has an associated performance penalty. In the simulation we set the cost of step (2) to 10,000 cycles (10 µs for a 1 GHz processor) if the desired CPU frequency differs from the current one, and zero otherwise.

For our profile-based compilation strategy, it was assumed that the underlying machine provides a marker instruction. A marker instruction takes as argument an integer that specifies the marker value. When it is executed, the hardware starts to collect the quadruple for the associated marker value. At any given cycle, only one marker value is live.

Due to long simulation times, a simple analytical energy model was used to estimate the energy consumption of an entire program execution. It models total CPU energy usage, including both active and idle CPU cycles. The model is based on associating an energy cost with each CPU cycle. Specifically, given a program in which region R is slowed down by δ, the total CPU energy E is
defined as:

    E = E1 + E2,   where   E1 = (W^c + W^b) − (1 − 1/δ²) · (W_R^c + W_R^b)   and   E2 = ρ · W^m,

where E1 and E2 model the CPU energy consumed by active cycles and by idle cycles, respectively. In our experiments, an idle cycle was assumed to consume 30% of the energy cost of an active cycle, i.e., ρ = 30%; this accounts for the energy consumption of clocked components such as the clock tree [10]. Finally, we assume that the operating system uses the following simple formula to determine the appropriate CPU frequency f(δ) from a compiler-provided slowdown factor δ:

    f(δ) ≝ ⌊ f_peak / (l_mem · δ) ⌋ · l_mem

where f_peak is the peak CPU frequency and l_mem is the memory latency in peak CPU cycles. The reason l_mem is involved in the speed setting is that we had observed clock skew effects due to a mismatch of the memory and CPU cycle times [9]. This simple formula guarantees that the selected frequency is a multiple of the memory latency. In our experiments, f_peak was set to 1000 and l_mem to 100. Since the compiler sets a limit on the lowest CPU frequency to be used (in terms of u in problem (P')), this amounts to saying that we considered a DVS system whose CPU frequency ranges from 200 MHz to 1000 MHz with discrete frequency/voltage levels.

The compilation time for both the single-region strategy and the multiple-region strategy is on the order of minutes. This does not include the time needed to perform the profiling (phase 2), which may take up to several hours. We are currently investigating compile-time models to derive the information generated by phase 2. Table 2 lists more detailed timings of the various phases for both strategies.

The user-provided performance tolerance ratio r defines a soft deadline, i.e., in some cases the real performance may exceed this limit. As a result, both
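A minimal sketch of the analytical energy model and the frequency-selection formula as reconstructed above. The rounding in f(δ) is inferred from the statement that the selected frequency is a multiple of the memory latency; the helper names are illustrative only.

    #include <math.h>
    #include <stdio.h>

    /* Relative CPU energy when a single region R (with workloads WRc, WRb) is
       slowed down by delta; Wc, Wb, Wm are whole-program workloads in cycles,
       rho is the idle-cycle energy ratio (0.3 in the paper).                  */
    double cpu_energy(double Wc, double Wb, double Wm,
                      double WRc, double WRb, double delta, double rho)
    {
        double active = (Wc + Wb) - (1.0 - 1.0 / (delta * delta)) * (WRc + WRb);
        double idle   = rho * Wm;
        return active + idle;
    }

    /* Frequency (in MHz) chosen by the OS for slowdown factor delta, rounded
       down to a multiple of the memory latency l_mem (in peak CPU cycles).   */
    double cpu_frequency(double f_peak, double l_mem, double delta)
    {
        return floor(f_peak / (l_mem * delta)) * l_mem;
    }

    int main(void)
    {
        /* With f_peak = 1000 MHz and l_mem = 100, delta = 2.07 maps to 400 MHz. */
        printf("f(2.07) = %.0f MHz\n", cpu_frequency(1000.0, 100.0, 2.07));
        return 0;
    }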
Table 2. The compilation time (in seconds) of both the single-region and the multiple-region strategy for r = 1%

    Benchmark   phase 1   phase 2   phase 3&4 (single)   phase 3&4 (multiple)
    swim        4         6452      8                    16
    tomcatv     3         95591     5                    9
    mgrid       6         96138     11                   17
    turb3d      19        120721    1313                 79
    applu       48        4757      88                   95
    apsi        64        4317      572                  244
the single-region and multiple-region strategies have different sets of performance (T) and energy consumption (E) values for the same r value. One way to compare them is in terms of the energy-delay product (E · T) [7]. Figure 4 shows the energy-delay product for the six SPECfp95 benchmarks for various r values. It can be seen that, except for benchmark swim, both strategies are very similar to each other in the energy-delay product. In other words, when the multiple-region strategy results in more performance degradation than the single-region strategy, it is able to cut down enough energy consumption to compensate for the additional performance loss. For benchmark swim, the multiple-region strategy is clearly more beneficial than the single-region strategy.
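For reference, the quantity plotted in Figure 4 is the energy-delay product normalized to the unscaled program (our reading of the figure caption; the exact normalization is not spelled out elsewhere in the text):

    EDP_rel = (E · T) / (E_orig · T_orig).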
5 Related Work
The closest work we are aware of is the family of intra-task DVS techniques using checkpoints. Intra-task scheduling is based on reclaiming the slack between a compile-time (over-)estimate, such as the worst-case execution time (WCET), and the actual execution time. Checkpoints are inserted into the original program at compile time to indicate where the CPU speed (and voltage) should be re-calculated. While more checkpoints allow finer performance tuning, the accumulated overheads of performing the re-calculation may become significant and reverse the improvement into degradation.
Fig. 4. The comparison of single-region vs. multiple-region approaches for various SPECfp95 benchmarks and user-defined performance penalties. The energy-delay product is shown relative to the original program running at full speed
One important issue is then where to insert these checkpoints. Lee and Sakurai [13] placed checkpoints at equally sized time slots of the WCET of a task. In [22, 21], Shin et al. put the checkpoints at selected CFG edges of a task to capture the slack arising from run-time variations across different execution paths. Mossé et al. proposed inserting checkpoints at the boundaries of program constructs such as loops and call sites [17]. In a follow-up paper [1], they assumed equally spaced checkpoints in time and tried to determine the optimal number. Azevedo et al. [2] initially placed checkpoints at every branch, loop, and call site, and then used profile information to guide the pruning of some of the checkpoints.

The most significant difference between our work and the above is that we identify a different type of CPU slack: we focus on memory-bound regions, while the others consider run-time variations from an estimate. Our work also explicitly takes into account the overheads induced by the checkpoints, to keep them from being overdone. Our checkpoints are defined at the boundaries of program constructs in the source program, rather than at points on the time line, which we feel is more appropriate for compiler-directed dynamic voltage scheduling.

Some of the task-based algorithms formulate dynamic voltage scheduling as a (mixed-)integer linear/nonlinear programming problem. For example, Ishihara and Yasuura [11] gave an ILP formulation for a set of tasks and a set of discrete voltage levels. In contrast, the paper by Manzak and Chakrabarti [16] assumed continuous voltages and a single voltage/frequency per task. Raje and Sarrafzadeh [20] formulated the problem for an acyclic task graph and discrete voltage levels. None of the above took the transition costs into account. Swaminathan and Chakrabarty [25] incorporated transition costs into the problem formulation; they assumed a single dual-speed CPU executing a set of periodic non-preemptive real-time tasks.

Our work differs at least in the analytical performance model. Most of the prior work assumes pure CPU processing time (i.e., W^m = W^b = 0). In contrast, our model breaks down the total execution time into three parts with respect to the memory system. While our model estimates the performance impact of dynamic voltage scaling on a program more precisely, it makes the estimation much harder due to the non-continuity of the maximum function. In addition, the problem formulation in this paper assumes continuous performance levels in terms of the slowdown factors. We expect that a formulation with discrete performance levels is cleaner (the binary function θ(·, ·) is "inlined") and may be easier to solve. However, it remains to be seen how much difference the accuracy of the performance model and the difficulty of the problem formulation would make for the real energy-performance trade-offs.
6 Conclusions
Compile-time directed frequency and voltage scaling is an effective technique to reduce CPU power dissipation. We have discussed a trace-based compiler
approach that identifies regions in the program that can be slowed down without significant performance penalties. Two strategies were implemented within SUIF2: one that selects a single program region for CPU slowdown (single-region approach), and one that allows multiple regions to be executed at different CPU frequencies. Based on cycle-accurate simulation using the SimpleScalar tool set, the resulting energy-delay products and the energy savings due to voltage scaling were comparable for five out of our six SPECfp95 benchmark programs. Both approaches achieved energy savings of up to 48%, with performance penalties of up to 16%. For the remaining benchmark program, the multiple-regions approach was significantly better in terms of the energy-delay product, yielding energy savings of up to 74% vs. 50% for the single-region approach, with maximal performance penalties of 21% vs. 17%, respectively.

Although these results are very encouraging, more work needs to be done to improve the compilation efficiency of the single- and multiple-regions approaches. For some benchmark programs, the MINLP solver used in the multiple-regions compiler was not able to compute the optimal solution within a reasonable time; we therefore were not able to include these numbers in this study. We are currently investigating fast approximation strategies that will only insignificantly degrade the quality of the determined voltage schedules.
References

[1] N. AbouGhazaleh, D. Mossé, B. Childers, and R. Melhem. Toward the placement of power management points in real time applications. In Proceedings of the Workshop on Compilers and Operating Systems for Low Power, September 2001.
[2] A. Azevedo, I. Issenin, R. Cornea, R. Gupta, N. Dutt, A. Veidenbaum, and A. Nicolau. Profile-based dynamic voltage scheduling using program checkpoints in the COPPER framework. In Proceedings of Design, Automation and Test in Europe Conference, March 2002.
[3] T. Ball and J. Larus. Using paths to measure, explain, and enhance program behavior. IEEE Computer, 31(7), 2000.
[4] D. Burger. Hardware Techniques to Improve the Performance of the Processor/Memory Interface. PhD thesis, Computer Science Department, University of Wisconsin-Madison, 1998.
[5] D. Burger and T. Austin. The SimpleScalar tool set version 2.0. Technical Report 1342, Computer Science Department, University of Wisconsin-Madison, June 1997.
[6] J. Czyzyk, M. Mesnier, and J. Moré. The NEOS server. IEEE Journal on Computational Science and Engineering, 5(3), July–September 1998.
[7] R. Gonzalez and M. Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, 31(9):1277–1284, September 1996.
[8] C.-H. Hsu and U. Kremer. Compiler-directed dynamic voltage scaling based on program regions. Technical Report DCS-TR-461, Department of Computer Science, Rutgers University, November 2001.
[9] C.-H. Hsu, M. Hsiao, and U. Kremer. Compiler-directed dynamic frequency and voltage scheduling. In Workshop on Power-Aware Computer Systems, November 2000.
[10] M. Irwin. Low power design: From soup to nuts. Tutorial at the International Symposium on Computer Architecture, June 2000.
[11] T. Ishihara and H. Yasuura. Voltage scheduling problem for dynamically variable voltage processors. In International Symposium on Low Power Electronics and Design, pages 197–202, August 1998.
[12] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 18th International Symposium on Computer Architecture, pages 81–87, May 1981.
[13] S. Lee and T. Sakurai. Run-time voltage hopping for low-power real-time systems. In Proceedings of the 37th Conference on Design Automation, pages 806–809, June 2000.
[14] S. Leyffer. Integrating SQP and branch-and-bound for mixed integer nonlinear programming. Computational Optimization and Applications, 18(3):295–309, March 2001.
[15] J. Lorch and A. Smith. Improving dynamic voltage algorithms with PACE. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, June 2001.
[16] A. Manzak and C. Chakrabarti. Variable voltage task scheduling algorithms for minimizing energy. In Proceedings of the International Symposium on Low-Power Electronics and Design, August 2001.
[17] D. Mossé, H. Aydin, B. Childers, and R. Melhem. Compiler-assisted dynamic power-aware scheduling for real-time applications. In Workshop on Compiler and Operating Systems for Low Power, October 2000.
[18] T. Mudge. Power: A first class design constraint for future architectures. In Proceedings of the International Conference on High Performance Computing, December 2000.
[19] T. Pering, T. Burd, and R. Brodersen. The simulation and evaluation of dynamic voltage scaling algorithms. In Proceedings of the 1998 International Symposium on Low Power Electronics and Design, pages 76–81, August 1998.
[20] S. Raje and M. Sarrafzadeh. Variable voltage scheduling. In International Symposium on Low Power Electronics and Design, pages 9–14, August 1995.
[21] D. Shin and J. Kim. A profile-based energy-efficient intra-task voltage scheduling algorithm for hard real-time applications. In Proceedings of the International Symposium on Low-Power Electronics and Design, August 2001.
[22] D. Shin, J. Kim, and S. Lee. Intra-task voltage scheduling for low-energy hard real-time applications. IEEE Design and Test of Computers, 18(2), March/April 2001.
[23] P. Stanley-Marbell, M. Hsiao, and U. Kremer. A hardware architecture for dynamic performance and energy adaptation. In Workshop on Power-Aware Computer Systems, February 2002.
[24] SUIF. Stanford University Intermediate Format.
[25] V. Swaminathan and K. Chakrabarty. Investigating the effect of voltage switching on low-energy task scheduling in hard real-time systems. In Asia South Pacific Design Automation Conference, January/February 2001.
Author Index
Albonesi, D. 1
Bagherzadeh, Nader 84
Bianchini, Ricardo 157
Bose, P. 1
Brooks, D. 1
Burleson, Wayne 99
Buyuktosunoglu, A. 1
Chang, Fay 110
Chittamuru, Jeevan 99
Chou, Pai H. 84
Cook, P. 1
Das, K. 1
Dwarkadas, S. 1
Ellis, Carla S. 130
Elnozahy, E.N. (Mootaz) 179
Emma, P. 1
Euh, Jeongseon 99
Fan, Xiaobo 130
Farkas, Keith I. 110
Gschwind, M. 1
Heath, Taliver 157
Hsiao, Michael S. 33
Hsu, Chung-Hsing 197
Inoue, Koji 18
Jacobson, H. 1
Jeon, Jaekwon 141
Karkhanis, T. 1
Kim, Jihong 141
Kim, Woonseok 141
Kistler, Michael 179
Kremer, Ulrich 33, 197
Kudva, P. 1
Lebeck, Alvin R. 130
Levner, Eugene 68
Liu, Jinfeng 84
Mejia, Pedro 68
Min, Sang Lyul 141
Morrow, L. Alex 53
Moshnyaga, Vasily 18
Mossé, Daniel 68
Murakami, Kazuaki 18
Olsen, C. Michael 53
Pinheiro, Eduardo 157
Rajamony, Ramakrishnan 179
Ranganathan, Parthasarathy 110
Schuster, S. 1
Shin, Dongkun 141
Smith, J. 1
Srinivasan, V. 1
Stanley-Marbell, Phillip 33
Zyuban, V. 1