ABSTRACT

Title of dissertation: CROSS-LAYER CUSTOMIZATION FOR LOW POWER AND HIGH PERFORMANCE EMBEDDED MULTI-CORE PROCESSORS

Chenjie Yu, Doctor of Philosophy, 2010

Dissertation directed by: Assistant Professor Peter Petrov, Electrical and Computer Engineering

Due to physical limitations and design difficulties, computer processor architecture has shifted to multi-core and even many-core approaches in recent years. Such architectures offer the potential for sustainable performance scaling into future peta-scale/exa-scale computing platforms at affordable power budgets, design complexity, and verification effort. To date, multi-core processors have been replacing uni-core processors in almost every market segment, including embedded systems, general-purpose desktops and laptops, and supercomputers. However, many issues with multi-core processor architectures still need to be addressed before their potential can be fully realized. Researchers in both academia and industry are still seeking effective and efficient ways to use these processors. The open issues involve hardware architecture trade-offs, system software services, run-time management, and user application design, all of which demand more research effort in this field.

Because of the architectural characteristics of multi-core based computers, a Cross-Layer Customization framework is proposed in this work, which combines application-specific information and system platform features, along with the necessary operating system support, to achieve exceptional power and performance efficiency on targeted multi-core platforms. Several topics are covered, each with specific optimization goals, including the snoop cache coherence protocol, inter-core communication for producer-consumer applications, synchronization mechanisms, and off-chip memory bandwidth limitations.

Benchmark program execution with conventional mechanisms is analyzed to reveal the overheads in terms of power and performance. Specific customizations are proposed to eliminate these overheads with support from hardware, system software, the compiler, and user applications. Experiments show significant improvements in system performance and power efficiency.

CROSS-LAYER CUSTOMIZATION FOR LOW POWER AND HIGH PERFORMANCE EMBEDDED MULTI-CORE PROCESSORS

by Chenjie Yu

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2010

Advisory Committee: Assistant Professor Peter Petrov, Chair/Advisor; Professor Shuvra Bhattacharyya; Associate Professor Gang Qu; Associate Professor Manoj Franklin; Associate Professor Chau-Wen Tseng

© Copyright by Chenjie Yu, 2010

Acknowledgments

I would like to express my gratitude to all the people who have made this thesis possible and who have helped me in my Ph.D. program. I would like to thank my advisor, Dr. Peter Petrov, for teaching and directing my study and research. He has lent me tremendous help through the entire course of all my research projects, from idea formation and experiment implementation all the way to publications and presentations. I also want to thank Dr. Shuvra Bhattacharyya, Dr. Gang Qu, Dr. Manoj Franklin, and Dr. Chau-Wen Tseng for serving on my thesis defense committee and for sparing their invaluable time to review the manuscript of my dissertation.
I owe my thanks to my colleagues and fellow graduate students at the Electrical and Computer Engineering Department. Among them, Dr. Xiangrong Zhou participated in some of my projects and shared his experience with me. Most of my research projects use the M5 simulator, which originally came from Dr. Donald Yeung's group. Dr. Yeung has also been generous enough to allow me to use his group's computing facilities for simulations. Things would have been far more difficult without their support. I owe my deepest thanks to my family, my mother and father. Without their support I couldn't have gone this far. It is impossible to remember everyone, and I apologize to those I've inadvertently left out.

Table of Contents

List of Tables  vi
List of Figures  viii
1 Introduction and Motivation  1
  1.1 Embedded Systems  1
  1.2 Embedded System Design  2
  1.3 Embedded Multi-Core Processors  3
  1.4 Embedded Multi-Core Platform Design Challenges  5
    1.4.1 User Application Layer  5
    1.4.2 System Software Layer  6
    1.4.3 Hardware Layer  6
  1.5 System-Level Customization in Embedded Systems  7
  1.6 Dissertation Outline  11
2 Background and Related Work  12
  2.1 Embedded Systems  12
  2.2 Power-Saving Techniques for Embedded Systems  13
    2.2.1 DVFS and Clock-Gating  13
    2.2.2 Architecture-Level Power Reduction  15
    2.2.3 System Software Techniques for Power Saving  16
  2.3 Cross-Layer Customization for Embedded Multi-Cores  17
    2.3.1 Cross-Layer Customization Approach for Embedded Systems  17
    2.3.2 Improving Snoop Protocol Power Efficiency  19
    2.3.3 Inter-Core Communication Based on Producer-Consumer Patterns  20
    2.3.4 Hardware-Based Synchronization Mechanisms  20
    2.3.5 Cache Partitioning for Memory Bandwidth Minimization  21
3 Low-Power Snoop Architecture for Synchronized Producer-Consumer Communication in Embedded Multiprocessors  22
  3.1 Overview  22
  3.2 Related Work  23
  3.3 Functional Overview  26
    3.3.1 Synchronized Producer-Consumer Communication  28
    3.3.2 Snoop-Phases in Producer-Consumer Communication  29
    3.3.3 Snoop-Phase Detection  32
    3.3.4 Shared Buffer Identification  34
  3.4 Passive SPoT Detection  36
  3.5 Active SPoT Migration  40
  3.6 Experimental Results  43
4 Energy and Performance Efficient Communication Framework for Embedded MPSoCs through Application-Driven Release Consistency  57
  4.1 Overview  57
  4.2 Related Work  63
  4.3 Motivation and Overview  66
    4.3.1 Inter-Core Data Sharing: To Invalidate or to Update?  67
    4.3.2 Cross-Layer Integration for Data Communication  69
    4.3.3 Cache Way Partitioning for Low-Power Data Sharing  74
  4.4 Compiler Support  76
    4.4.1 Shared Memory Identification  77
    4.4.2 Loop Transformations for Software-Triggered Remote Updates  78
  4.5 System Software Support  84
    4.5.1 Memory Reference Identification  84
    4.5.2 Multi-Tasking Support and False Sharing Avoidance  86
  4.6 Cache Partitioning for Low-Power Data Sharing  88
    4.6.1 Functional Overview  89
    4.6.2 Cache Way Partitioning: Advantages and Pitfalls  90
  4.7 Hardware Support  92
    4.7.1 Shared Data Communication Support  93
    4.7.2 Cache Way Partitioning Hardware  96
  4.8 Experimental Results  97
  4.9 Conclusion  120
5 Low-Cost and Energy-Efficient Distributed Synchronization for Embedded Multiprocessors  131
  5.1 Overview  131
  5.2 Related Work  135
  5.3 Functional Overview  139
    5.3.1 Distributed Queue Abstraction Model  142
    5.3.2 Synchronization Efficiency with Distributed Queues  144
  5.4 System Architecture  147
    5.4.1 Synchronization Variable Identification  147
    5.4.2 Distributed Synchronization Controller  150
    5.4.3 Lock Implementation  153
    5.4.4 Barrier Implementation  155
    5.4.5 Power Management  158
  5.5 Compiler and OS Support  160
    5.5.1 OS Power Management Role  162
    5.5.2 Multi-Tasking Support per Core  165
  5.6 Experimental Results  165
  5.7 Conclusions  181
6 Off-Chip Memory Bandwidth Minimization through Cache Partitioning for Multi-Core Processors  182
  6.1 Overview  182
  6.2 Related Work  185
  6.3 Memory Bandwidth and Last Level Caches in Multi-Core Systems  187
    6.3.1 Bandwidth Demand and Cache Resources  188
    6.3.2 Cache Sharing in Multi-Core Processor Systems  191
  6.4 Partitioning Algorithm  192
    6.4.1 Cache Partitioning Basics  192
    6.4.2 Algorithm Overview  193
    6.4.3 Intuitive and Formal Description  196
  6.5 Experimental Results and Discussion  199
    6.5.1 Experimental Setup  199
    6.5.2 Benchmark Applications  200
    6.5.3 Results and Analysis  201
    6.5.4 Comparison Between Heuristic Algorithm and Exhaustive Search  203
  6.6 Conclusions  204
7 Conclusions  206
  7.1 Embedded Multi-Core Architecture Challenges  206
  7.2 Cross-Layer Customization for Embedded Multi-Cores  207
Bibliography  208

List of Tables

3.1 Snoop-induced cache lookups for 16K shared data buffers; Passive SPoT vs. baseline snoop protocol.  46
3.2 Snoop-induced cache lookups for 64K shared data buffers; Passive SPoT vs. baseline snoop protocol.  47
3.3 Energy consumption (µJ) for 16K shared data buffers; Passive SPoT vs. baseline  48
3.4 Energy consumption (µJ) for 64K shared data buffers; Passive SPoT vs. baseline  49
3.5 Snoop-induced cache lookups for 16K shared data buffers; Active SPoT vs. baseline snoop protocol.  53
3.6 Snoop-induced cache lookups for 64K shared data buffers; Active SPoT vs. baseline snoop protocol.  54
3.7 Energy consumption (µJ) for 16K shared data buffers; Active SPoT vs. baseline  55
3.8 Energy consumption (µJ) for 64K shared data buffers; Active SPoT vs. baseline  56
4.1 Cache misses: baseline vs. achieved reductions (%); 32K D-caches  98
4.2 Cache misses: baseline vs. achieved reductions (%); 32K D-caches (continued)  99
4.3 Cache misses: baseline vs. achieved reductions (%); 32K D-caches (continued)  100
4.4 Cache misses: baseline vs. achieved reductions (%); 64K D-caches  101
4.5 Cache misses: baseline vs. achieved reductions (%); 64K D-caches (continued)  102
4.6 Cache misses: baseline vs. achieved reductions (%); 64K D-caches (continued)  103
4.7 Bus transactions: baseline vs. achieved reductions (%); 32K D-caches  104
4.8 Bus transactions: baseline vs. achieved reductions (%); 32K D-caches (continued)  105
4.9 Bus transactions: baseline vs. achieved reductions (%); 32K D-caches (continued)  106
4.10 Bus transactions: baseline vs. achieved reductions (%); 64K D-caches  107
4.11 Bus transactions: baseline vs. achieved reductions (%); 64K D-caches (continued)  108
4.12 Bus transactions: baseline vs. achieved reductions (%); 64K D-caches (continued)  109
4.13 Cache way partitioning: cache energy (mJ) and reductions  121
4.14 Cache way partitioning: cache energy (mJ) and reductions (continued)  122
4.15 Cache way partitioning: cache misses and impact on miss-rate  123
4.16 Cache way partitioning: cache misses and impact on miss-rate (continued)  124
4.17 Average memory access latency reduction (32K D-cache)  125
4.18 Average memory access latency reduction (32K D-cache) (continued)  126
4.19 Average memory access latency reduction (64K D-cache)  127
4.20 Average memory access latency reduction (64K D-cache) (continued)  128
4.21 Average memory access latency reduction with cache way allocation  129
4.22 Average memory access latency reduction with cache way allocation (continued)  130
5.1 Performance characteristics (in number of cycles) and DSC reductions, increasing data set  167
5.2 Performance characteristics (in number of cycles) and DSC reductions, increasing data set (continued)  167
5.3 Bus bandwidth characteristics (in number of bus transactions) and DSC reductions, increasing data set  168
5.4 Bus bandwidth characteristics (in number of bus transactions) and DSC reductions, increasing data set (continued)  168
5.5 Performance characteristics (in number of cycles) and DSC reductions, fixed computational workload  171
5.6 Performance characteristics (in number of cycles) and DSC reductions, fixed computational workload (continued)  172
5.7 Bus bandwidth characteristics (in number of bus transactions) and DSC reductions, fixed computational workload  173
5.8 Bus bandwidth characteristics (in number of bus transactions) and DSC reductions, fixed computational workload (continued)  174
5.9 Thread load imbalance: 4-processor system  177
5.10 Energy characteristics: 4-processor system  178
5.11 Thread load imbalance: 8-processor system  180
5.12 Energy characteristics: 8-processor system  181
6.1 Benchmark Workloads  199

List of Figures

1.1 General Purpose System  3
1.2 Feed-Back and Customization Design  10
3.1 Synchronized producer-consumer communication with shared memory. At any moment, access to the shared buffer is exclusively required by the producer or the consumer only.  26
3.2 Producer-Consumer cache snooping activities.  29
3.3 Transferring shared-buffer information to the OS  35
3.4 Hardware architecture for SPoT detection.  36
3.5 Hardware architecture for Active SPoT migration.  41
3.6 Application benchmark organization.  43
3.7 Energy reduction for 16K shared data buffers  50
3.8 Energy reduction for 64K shared data buffers  51
3.9 Active vs. Passive: energy reductions for 16K shared data buffers  55
3.10 Active vs. Passive: energy reduction for 64K shared data buffers  56
4.1 Shared memory multiprocessor organization  66
4.2 Bus transactions involved in communicating data  68
4.3 Synchronized inter-task communication  71
4.4 Propagating updates with explicit store.update  73
4.5 MPSoC cache partitioning  75
4.6 Transformations for row-wise array traversal with st.update support  79
4.7 Transformations for row-wise traversal with "irregular" row sizes  79
4.8 Transformations for column-wise traversal with st.update support  81
4.9 Loop peeling for st.update support  81
4.10 Transformation for while loops with upper bounds unknown at compile time  82
4.11 Cache controller support for bus-based systems  93
4.12 Shared/private-data cache way allocation architecture  95
4.13 Application benchmark organization.  97
5.1 Distributed lock queue information  142
5.2 Local lock queue management  144
5.3 Distributed Synchronization Controller (DSC) organization  152
5.4 Local barrier queue management  155
5.5 Overall system organization  157
5.6 Example parallel application  161
5.7 Data-streaming benchmark organization.  166
6.1 Memory bandwidth requirement curves  188
6.2 Cache miss curves  189
6.3 Cache miss-rate curves  190
6.4 Partitioning heuristic pseudocode  194
6.5 Algorithm walkthrough on example  198
6.6 Achieved bandwidth vs. baseline: APP1, APP2, and APP3  201
6.7 Achieved bandwidth vs. baseline: APP4, APP5, and APP6  201
6.8 Heuristic algorithm compared to exhaustive search  204

Chapter 1
Introduction and Motivation

1.1 Embedded Systems

Embedded systems have a wide range of applications. The products span from day-to-day household and consumer electronics, such as microwave ovens, digital TVs, set-top boxes, mobile phones, PDAs, and vehicular devices, to industrial devices and equipment such as wireless communication base stations, robots, and aviation equipment, to high-end military and scientific devices such as missiles and devices used in space missions. It is estimated that over 10.76 billion embedded systems/devices were shipped worldwide in 2009 [5]. Because of this wide application range, there is no dominating system architecture for embedded systems. Very simple microcontrollers, such as Atmel's 8051 and AVR [56], and advanced massively parallel architectures, such as the PC102 from picoChip [3] and the Tile64 from Tilera [4], coexist among today's design choices. About 50% of embedded system products shipped use no formal operating system or an in-house one, whereas the other half use commercial or open-source operating systems.
In general, however, embedded system designs are moving from the board level to System-on-Chip (SoC), and from single processors to multiprocessors [49] with more powerful operating systems, driven by the increasing demand for performance and power efficiency.

1.2 Embedded System Design

In general, embedded system design features fast time-to-market, very low cost, strict performance specifications, and tight power budgets. Embedded system designers have to make trade-offs among all these factors and fine-tune the system at the hardware architecture, operating system, and application software layers.

System function partitioning is very important in embedded systems. Application functions can be implemented in either hardware or software. A pure hardware design, such as a custom ASIC (Application-Specific IC), gives the best performance, hardware cost, and power efficiency, but suffers from long time-to-market, excessive design cost, and little flexibility for future product upgrades. Pure software running on a general-purpose computer system, in contrast, trades performance, cost, and power efficiency for short time-to-market, small design cost, and more flexibility. A designer needs to balance between the two extremes.

Most microprocessor-based designs follow the layered system model borrowed from the general-purpose domain [108], as shown in Figure 1.1. The hardware layer sits at the very bottom, including the processor unit, the off-chip memory, and the peripherals. The operating system layer lies in the middle; it directly controls and manages the hardware resources and provides services for user applications. The application layer consists of user programs that implement the desired functionality. Off-the-shelf hardware components are used to shorten the time-to-market, and the operating system is likewise chosen from open-source or commercially available ones.

[Figure 1.1: General Purpose System]

However, to meet design goals, an embedded system design needs to go through a series of customization processes to achieve the design specifications at the lowest cost. This is achieved by the system-level hardware-software co-design method, which tries to meet system-level objectives by exploiting the synergism of hardware and software throughout the design process. Thus, the designers need to be knowledgeable in both the hardware and software domains to make good trade-offs.

1.3 Embedded Multi-Core Processors

In recent years, the scaling of silicon fabrication technologies has enabled unprecedented levels of chip integration, which provides rich on-chip resources for powerful processor designs. However, the traditional design methodology with monolithic single-core architectures has met tremendous difficulties in frequency scaling, power consumption, design complexity, verification effort, and so on. Frequency scaling stopped around 3 GHz and has not advanced since. Yet there is constant need for more powerful processors at affordable energy consumption and cost. This conflict has led to the proliferation of multi-core processor architectures, which naturally address many of these problems. With multiple but simpler cores running at lower frequencies, the chip voltage and power consumption are well contained, while very good performance scaling is achieved. As a result, multi-core processor products have been widely adopted in today's computing systems,
including industrial embedded system applications, general-purpose desktops and servers, and the building blocks of supercomputers. In fact, embedded systems were among the first to benefit from multi-core based designs [1].

Although the market has seen a flourishing of multi-core processor products, the proper use of such architectures is still far from satisfactory and is thus under intensive study in both the academic and industrial research communities. The fundamentals of the hardware architectures are evolving in different directions, including key topologies such as on-chip memory structures, coherence mechanisms, interconnect technologies, etc. Proper system software mechanisms that manage the hardware resources and task scheduling are being explored to meet future many-core processor demands. Even the programming paradigms for multi-core/many-core architectures are evolving into many varieties, including OpenMP, MPI, pthreads, Threading Building Blocks (TBB), and many others. Although embedded systems had an early start with multi-core processors, because of the wide spectrum of embedded system applications and areas, as well as the vast variation of implementation methods, the problems are even more profound and complex in embedded multi-core systems. The fact that so little has been settled on so many issues with multi-core architectures, which have already been in commercial use for about seven years, is urging much more involvement from researchers in essentially all sub-areas of computer engineering.

1.4 Embedded Multi-Core Platform Design Challenges

As discussed in the previous section, while multi-core architectures help keep system performance scaling at affordable hardware, power, and design cost, they also bring a large number of challenges and a whole new world of design trade-offs. The challenges span essentially every sub-area of the system and are shared by embedded system and general-purpose system designs alike.

1.4.1 User Application Layer

There are programming challenges at the user application layer, which have attracted a great deal of attention and research interest in both academia and industry. There has been significant investment in compiler auto-parallelization and code transformation techniques, as well as research into how to program these highly integrated parallel machines. There is also significant industry interest in techniques that could help legacy code benefit from future multi-core platforms. For embedded systems, an important topic is maintaining hard real-time constraints under significantly higher system complexity. Currently, no solution in these areas has presented itself as general enough and effective enough.

The current de facto standard of parallel programming on mid/large-scale multi-core processors leaves it up to the programmers to parallelize the sequential solutions and perform specific tunings according to the hardware and OS. Such effort requires heavy software customization and generally does not provide portable performance across platforms. Furthermore, as will be shown, even perfectly parallelized applications may suffer execution inefficiencies from the underlying platforms. This is becoming more prominent as system integration scales to larger core counts.

1.4.2 System Software Layer

There are also special requirements on system software for embedded multi-core architectures, including operating systems and hypervisor software.
With an increasing number of processor cores and application threads in the system, the role of the operating system is becoming more important in task scheduling, resource allocation and partitioning, and architecture-specific system services. It is extremely important for system software on a multi-core platform to provide system-level resource monitoring to prevent serialization on certain bottlenecks. There have already been attempts to address this need in academic research and commercial practice [2, 6]. However, such mechanisms are often bound to entirely different programming models and paradigms, which are yet to be accepted by the vast majority of programmers and the market. Like user applications, the system software also needs to be tailored to specific multi-core hardware architectures, with possibly different API interfaces to the programmers.

1.4.3 Hardware Layer

Most of the problems in the application and system software layers have their roots in the hardware architectures. With multi-core architectures, the processor cores are getting cheaper and simpler, but the shared resources become expensive and the interactions between cores become more frequent and complicated, which brings in a large number of design trade-offs. Current multi-core/many-core hardware architectures are evolving in all kinds of directions. Vendors like Intel, IBM, Nvidia, and Tilera all have drastically different multi-core processors. The differences are often in fundamental aspects of the architecture design, such as being heterogeneous or homogeneous, the interconnect technology, the cache structure, etc. Such differences make program performance generally not portable across platforms. Even within the same vendor, new generations of architectures can be very different in topology organization, especially as systems scale into large-scale chip multiprocessors (CMPs). All of these have significant impact on system performance and often require significant restructuring of the existing software stack.

Such challenges must be addressed before the potential of multi-core/many-core systems can be truly realized in embedded applications. On the other hand, this also means tremendous research opportunity in this area.

1.5 System-Level Customization in Embedded Systems

All embedded systems need to be optimized for specific application needs and purposes, such as power efficiency and performance requirements. For example, a smartphone design may have chosen a processor and operating system that promise sufficient computing power as well as media and communication processing capabilities. Yet it can still fall short of the power budget, which is often the utmost important factor for such devices to succeed in the marketplace. It is common practice for embedded system designers to fine-tune every design element and parameter to meet the design specifications. The complexities of embedded multi-core systems argue strongly for such a cross-layer customization approach.

The customization methods span the different layers of the entire system. Various embedded multi-core processors can be selected for different computing and control needs. A chosen processor platform then needs further customization. For example, the processor architecture parameters need to be carefully examined and adjusted to satisfy system demand.
The speed of the processors, the register file sizes, and the size and way-associativity of the cache subsystem or scratchpad memory all have significant impact on system performance and run-time power consumption. Sometimes such customization goes down to the micro-architecture level: the designer may need to implement certain critical functions in hardware blocks so as to optimize critical-path behavior, and add instructions to the ISA [1, 7].

The operating system (OS) also needs significant modification while it is being ported to the target multi-core system. Compared with hardware components, which are more often off-the-shelf commercial products, operating systems in embedded systems traditionally show a much greater variety. It is reported that, of all the embedded system products shipped in 2007, more than half came with in-house-built operating systems [5], and the remaining, smaller half came with open-source or commercial operating systems such as VxWorks [119], embedded Linux [121], ThreadX [58], and WinCE [118]. The use of bigger operating systems is growing fast. More often, the layered system structure in Figure 1.1 is relaxed somewhat to give application software direct access to certain hardware resources, such as I/O peripherals. Certain system services, such as scheduling and memory management, need to be changed for more efficient execution. The use of application-specific instructions and certain micro-architecture facilities also needs operating system support that is not included in standard releases. As embedded multi-core processors scale to large-scale CMPs, there is also an added risk of violating the traditional symmetric multiprocessor (SMP) model, which needs special attention on future many-core platforms.

Likewise, user programs also need a great deal of tuning to achieve the best performance and power efficiency. This usually comes in the form of compiler- or user-directed runtime support for specific optimization goals. The compiler or the programmer extracts application information and passes it down to the OS and hardware layers via special system APIs. At times, for certain critical data paths, the programmer may need to hand-optimize the code at the assembly language level. This process requires knowledge of the underlying hardware and operating system platforms.

All of the above customization is achieved by a cross-layer, system-level customization framework. Application-specific information is extracted by profiling and code examination. This information is fed to the operating system and hardware design, and the outcome can be profiled again, until all system specifications are met. The whole process usually forms a feedback loop, as shown in Figure 1.2.

[Figure 1.2: Feed-Back and Customization Design]

Such a need for system-level customization is fundamental to multi-core architectures. The philosophy is as follows: future processors are not going to be faster, but much wider. The performance gains will come solely from architectural improvement, so application software will have to capture most of the architectural features of the underlying hardware to justify the cost. The decoupled layered design of the general-purpose system design method can hardly fit the upcoming large-scale multi-core/many-core processors. Even supercomputer designs are moving toward hardware-software co-design approaches.
Because of its vast design space, however, the specific co-design techniques of such a customization approach need to be explored on specific system architectures and applications. This is where this work makes its contribution.

1.6 Dissertation Outline

This dissertation covers several embedded multi-core platform-specific issues, including cache coherence protocols, inter-core communication, synchronization mechanisms, and off-chip bandwidth. Most of these techniques also apply to general-purpose multi-core platforms at small to medium scale. For each of these topics, benchmark studies on embedded application kernels are performed to reveal the inefficiencies of conventional design methodologies. New approaches are then proposed under the cross-layer customization framework to address these inefficiencies, often bringing modifications to multiple system layers. Extensive experiments show that such customization often improves system performance and power efficiency significantly compared to the baseline.

Chapter 2
Background and Related Work

2.1 Embedded Systems

Embedded processors have been reported to occupy more than 90% of the processor market, while a great amount of money and manpower has been invested in the related research and development work [102]. The major concerns of many embedded systems are power consumption and real-time performance constraints. While the performance specification requires worst-case real-time guarantees, power consumption has a greater impact on battery life, microprocessor thermal effects, cooling cost, etc. This is especially true for battery-powered products like smartphones and portable media players. This research addresses these problems from different perspectives of embedded multiprocessor systems, while causing little or no performance and cost overhead.

There has been a large body of research on power-aware design techniques for computer systems, much of which applies to both general-purpose and embedded computing. Moreover, embedded system design has the unique advantage of a known target application, and thus can exploit application-specific information to achieve very efficient customization and optimization. As for high-end embedded multiprocessor systems, because of their special architectural properties and requirements, even more power-saving opportunities are exposed to researchers, enabling greater power reductions.

2.2 Power-Saving Techniques for Embedded Systems

Embedded system design has traditionally adopted the layered system structure, comprising the hardware layer, the operating system layer, and the user application layer. Various techniques have been developed to reduce system power consumption at the different layers. All power-saving techniques are based on the following equation:

P = (1/2) * C * V_DD^2 * f * N + Q_SC * V_DD * f * N + I_leak * V_DD    (2.1)

where P denotes the total power, V_DD is the supply voltage, and f is the clock frequency. The first term represents the dynamic power, where C is the load capacitance and N is the switching activity, i.e., the number of gate output transitions per clock cycle. The second term represents the short-circuit power, where Q_SC is the quantity of charge carried by the short-circuit current per transition. The last term is the static power dissipation due to the leakage current I_leak. A large body of research tries to minimize each variable in this equation so as to reduce system power consumption.
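To make Equation 2.1 concrete, the short C program below evaluates its three terms for a hypothetical core; all parameter values are illustrative assumptions for this sketch, not measurements from this work.

    #include <stdio.h>

    /* Evaluate Eq. 2.1: dynamic + short-circuit + static (leakage) power.
     * All parameter values below are hypothetical, chosen only to show
     * the relative magnitude of the three terms. */
    int main(void) {
        double C     = 1.0e-9;   /* switched load capacitance (F)           */
        double Vdd   = 1.1;      /* supply voltage (V)                      */
        double f     = 1.0e9;    /* clock frequency (Hz)                    */
        double N     = 0.15;     /* avg. gate output transitions per cycle  */
        double Qsc   = 2.0e-11;  /* short-circuit charge per transition (C) */
        double Ileak = 5.0e-3;   /* leakage current (A)                     */

        double p_dyn   = 0.5 * C * Vdd * Vdd * f * N;  /* first term  */
        double p_short = Qsc * Vdd * f * N;            /* second term */
        double p_stat  = Ileak * Vdd;                  /* third term  */

        printf("dynamic: %.4f W, short-circuit: %.4f W, static: %.4f W\n",
               p_dyn, p_short, p_stat);

        /* DVFS intuition: scaling Vdd and f together shrinks the dynamic
         * term roughly cubically, but leaves the static term untouched. */
        double s = 0.8;
        printf("dynamic at %.0f%% V/f: %.4f W\n", s * 100.0,
               0.5 * C * (Vdd * s) * (Vdd * s) * (f * s) * N);
        return 0;
    }

As the second printout suggests, voltage/frequency scaling attacks only the frequency- and voltage-dependent terms, which motivates the DVFS technique discussed next.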
2.2.1 DVFS and Clock-Gating

The DVFS (Dynamic Voltage and Frequency Scaling) method has been a very effective technique for keeping system power consumption at a minimum. The key idea is to lower the clock frequency and scale down the supply voltage when the system is running lightweight tasks. It is clear from the above equation that, by lowering the clock frequency and supply voltage, one can reduce the dynamic switching power significantly. For embedded systems that go idle once in a while, such as a cell phone in standby mode, this method proves very effective. Various DVFS techniques have been proposed in academia [87, 103] and developed in industry [46, 113]. Although DVFS works well for dynamic power, it has little impact on static power. As transistor density continues to increase and transistor feature size continues to shrink, static power is gradually becoming the dominant part of the total power consumption. This is especially true for high-end multiprocessor systems, which have more transistors on a chip and larger cache areas.

Another technique, called clock-gating, also targets dynamic power. The idea is to temporarily cut off the clock signal to structures that are inactive for some period of time, and to resume the clock signal later. For example, a processor pipeline may decide to clock-gate part of its functional units on seeing certain instructions coming through. This technique usually requires additional gates that control the clock signal of certain units and dynamically apply clock-gating under predefined conditions. As the clock signal contributes a very large part of a microprocessor's total power consumption, this technique can effectively reduce the energy spent driving the clock signal over the clock tree and reduce the load capacitance on the clock signal drivers; it should have an even greater effect on high-end embedded multiprocessors. However, clock-gating at coarser granularity, such as an entire cache subsystem or other large structures, may suffer from delays in resuming the clock signal, which may potentially harm real-time performance. Such techniques have been discussed in various papers [23].

Both DVFS and clock-gating are becoming available in newer versions of microprocessors. Programmers are given the flexibility to adjust higher-level policies and protocols to form various power-saving modes. Designers can also customize processors to apply these techniques to more micro-architectural components, depending on application-specific requirements. Such circuit-level power-saving techniques could also find significant application in future many-core processors. Due to the complexity of such systems, however, DVFS and clock-gating are often applied at coarse granularity, such as to individual cores or groups of cores. Finding appropriate transistor sizes [106] or redesigning complex gates [90] changes the load capacitance or reduces the total transistor count, and thus reduces the power consumption. These techniques apply to both embedded systems and general-purpose computer systems.
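As an illustration of the kind of higher-level DVFS policy mentioned above, the sketch below implements a trivial utilization-driven governor over a table of voltage/frequency operating points; the table values, threshold constants, and function names are assumptions made for this example, not taken from any particular processor.

    #include <stdio.h>

    /* One voltage/frequency operating point (values are hypothetical). */
    struct vf_point { double vdd_volts; double freq_mhz; };

    static const struct vf_point OPP[] = {
        {0.8, 400.0}, {0.9, 600.0}, {1.0, 800.0}, {1.1, 1000.0}
    };
    #define NUM_OPP ((int)(sizeof OPP / sizeof OPP[0]))

    /* Pick an operating point from recent CPU utilization (0.0 - 1.0):
     * jump to the top when saturated, step down slowly when idle. */
    static int select_opp(int current, double utilization) {
        if (utilization > 0.85 && current < NUM_OPP - 1)
            return NUM_OPP - 1;   /* race to max when the core is busy */
        if (utilization < 0.30 && current > 0)
            return current - 1;   /* step down one level when idle     */
        return current;           /* otherwise hold the current point  */
    }

    int main(void) {
        double trace[] = {0.95, 0.90, 0.40, 0.20, 0.10, 0.25, 0.88};
        int opp = 0;
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
            opp = select_opp(opp, trace[i]);
            printf("util %.2f -> %4.0f MHz @ %.1f V\n",
                   trace[i], OPP[opp].freq_mhz, OPP[opp].vdd_volts);
        }
        return 0;
    }

Real governors add hysteresis, per-task deadlines, and transition-cost accounting; the point here is only that the policy layer reduces to choosing a row of such a table from run-time feedback.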
2.2.2 Architecture-Level Power Reduction

A large body of work has addressed processor power consumption at the architecture level, breaking the processor down into individual components, such as pipelines, cache subsystems, register files, TLBs, etc., which are analyzed according to run-time data. Customizations at this level often give much greater flexibility and finer granularity, usually with more complicated algorithms. Various power analysis and estimation techniques have been discussed [60, 89], and more power-efficient designs have been proposed.

Some techniques maintain transparency to the programmers. In [80], the authors introduced the RegionScout technique to dynamically monitor cache data-sharing status at a coarse grain in shared-memory multiprocessor systems. Shared data activities are recorded in special hardware structures, and the RegionScout controller can thus filter out unnecessary cache snooping from remote processors and save power. For embedded systems, however, similar optimizations work much more effectively when put in a hardware-software co-design framework: application-specific information is used to cut unnecessary modules and significantly reduce unnecessary complexity. Optimization techniques at this level sometimes require programmer assistance, which is common practice in the embedded system domain. For example, in Application-Specific Instruction-set Processors (ASIPs), a designer can choose to implement certain application functions in hardware and insert new instructions into the ISA. The programmer and the compiler need to work together to utilize such facilities for the best system efficiency.

2.2.3 System Software Techniques for Power Saving

Depending on the physical resources and requirements of an embedded system, its system software ranges from very small, with just basic interrupt-handling routines, to medium-sized, with relatively more OS functionality for embedded processing, to full-fledged, large OSs that provide all of a general-purpose computer's functionality. Some embedded OSs are also required to support hard real-time guarantees. Small OSs lack many OS functions, which forces software developers to write much low-level code against the hardware layer. Medium-sized OSs, such as VxWorks [119], INTEGRITY [38], and Neutrino [101], are relatively larger but provide more system functions. With embedded systems trending toward more complexity and intelligence, "bigger" operating systems are also growing faster than the embedded system market in general. The use of an operating system itself may impose a big overhead in embedded systems, and much effort has been devoted to analyzing this problem [107].

Techniques at the operating system level usually involve the operating system software, the user application software, and compiler support. Application-specific information needs to be carried down to the processor architecture level by the operating system, usually when the programs are loaded into the system. The operating system also needs to be modified to reflect changes in the underlying hardware platform. More often, power-saving protocols and policies are implemented at the operating system level so as to give the programmer flexibility in adapting to specific application requirements.

2.3 Cross-Layer Customization for Embedded Multi-Cores

2.3.1 Cross-Layer Customization Approach for Embedded Systems

Although the above techniques work well for both embedded systems and general-purpose computer systems, they can become much more effective for embedded systems when put in a system-level customization framework. This is because embedded systems are dedicated to certain applications. By analyzing the application environment, the designer can remove significant redundancy from standard hardware parts and software routines and make the entire design extremely efficient.
This, however, requires that the designers work concurrently at all system layers so that the extracted application-specific information can be passed across the layers.

For example, in ASIP systems, the register file size and the register allocation schemes are designed during system synthesis [7]. The power-aware scheduling approach in [132] is combined with the DVS technique and the OS scheduling algorithm to reduce the overall energy consumption. In [88], the authors apply application memory access information to the compiler and the cache architecture to reduce the total cache access energy. In [116], the authors introduce the Temporal Streaming technique to dynamically identify sequences of memory accesses that correspond to a data stream; by moving the data stream to the requesting processor in advance, the overall performance is improved.

With multi-core processors being used for embedded applications, the systems are becoming much more complex, which creates even more need for customization to achieve satisfactory power and performance efficiency. This study covers several specific topics in multi-core system customization.

2.3.2 Improving Snoop Protocol Power Efficiency

Current small- to medium-scale CMPs often adopt the shared memory model, which provides a very intuitive programming model. Most of them give individual cores private caches, backed by a much larger last-level shared cache pool. This creates the problem of coherency among all the private caches on the same chip, which is resolved by cache coherence protocols. The general-purpose cache coherence protocols, normally snoop protocols, are known to be overly general and to consume a significant amount of power [33, 69, 85]. This not only puts a heavy burden on embedded system applications, but also significantly limits scaling to larger systems, making coherent caches unavailable in many large-scale CMPs for both the embedded and general-purpose markets. A number of research projects on this topic have tried to augment traditional snoop protocols and their implementations for better power efficiency. The basic idea is to categorize different cache areas according to data usage and sharing patterns [81, 80, 117, 21, 33]. Being general purpose, however, many of these approaches can meet adversarial cases that yield very little improvement in power efficiency. In [130], the authors propose to tag shared data regions in virtual memory with information provided by the programmer. This information is used to filter out unnecessary snooping into the caches. Application-specific information precludes the speculation effort needed in the general-purpose domain, while the effects are much more pronounced and little hardware overhead is incurred.

2.3.3 Inter-Core Communication Based on Producer-Consumer Patterns

As multi-core systems scale to larger numbers of cores, the inter-core communication and cache subsystems are becoming much more important than the CPUs, and often hold the key to performance and power efficiency [33, 70, 71]. The general-purpose shared memory model gives a good high-level abstraction but does not guarantee the best performance. For explicit producer-consumer types of communication, data sharing based on passive shared memory and cache coherence is not efficient in multi-core systems. A number of approaches have been explored to improve such mechanisms for better efficiency [105, 19, 117, 21].
Some of these approaches exploit common producer-consumer communication patterns [8, 45, 26] and modify the system consistency model to allow more compact communication operations [27, 55], often by modifying the cache structure and coherence protocols, with compiler and OS support for the augmented hardware mechanisms [52, 66, 73].

2.3.4 Hardware-Based Synchronization Mechanisms

Synchronization in embedded multi-core processors is also becoming an important issue, especially in mid- to large-scale systems. Synchronization is fundamental in parallel computing systems, and the conventional atomic-instruction-based implementation bears too much overhead in modern system designs in terms of both performance and power [10, 12, 122, 115, 64, 35]. For embedded applications, dedicated hardware synchronization controllers have been proposed to address these problems [78, 9, 98, 82, 74, 135, 100, 114, 36, 77, 84, 28, 43, 51]. They often change the underlying synchronization mechanism significantly, with sizable performance improvements. Additionally, such dedicated controllers enable the use of power-saving modes in modern multi-core processors while individual cores are waiting for synchronization [64].

2.3.5 Cache Partitioning for Memory Bandwidth Minimization

As multi-core architectures scale to mid-to-large scale at a rapid pace, various bottlenecks are emerging that prevent such platforms from being fully utilized. Among them, the off-chip memory bandwidth limitation is becoming a very common problem in modern multi-core processor applications. While this problem is well recognized and stems from the speed gap between processors and memory systems [20, 95, 42, 68], there are ways to alleviate it, such as the cache partitioning technique [67, 37, 91, 54, 92, 83]. Cache partitioning is well known and well studied from the uni-core era, but its effect on multi-core system memory bandwidth has been inadequately studied, which makes it an interesting research topic for this work.

Chapter 3
Low-Power Snoop Architecture for Synchronized Producer-Consumer Communication in Embedded Multiprocessors

3.1 Overview

Snoop-based cache coherence protocols are the most widely deployed, as they rely on the inherent broadcast nature of the common bus connecting the processor nodes to the memory. Each cache controller "snoops" the bus for memory transfers, and for each transfer a cache lookup is performed in order to determine whether a cache block state should be changed in the local cache. Easily extendable multiprocessor structures and software-transparent implementation have made snoop protocols easy to understand, deploy, and reuse, with minimal impact on the performance of the memory subsystem [76]. However, these protocols tend to be overly conservative for many real-world programs, especially embedded applications. Quite often in embedded systems, data are cached in just a few nodes, and snooping in the others wastes energy. Previous research [85] has shown that only around 10% of application memory references actually require cache coherence tracking, and it has been reported that snoop-related cache activities can contribute up to 40% of the total cache power [33, 69].

In contrast to general-purpose computing platforms, in embedded system designs, fine-tuning the hardware, compiler, and system software has been a common practice to maximize performance and achieve energy efficiency [96, 72, 50].
In this work [130, 134, 125], we propose a methodology that aggressively eliminates the majority of unnecessary snoop-induced cache lookups and thus achieves significant power reduction. The proposed technique explicitly exploits application-specific information regarding the exact producer-consumer relationships between tasks, as well as information regarding the precise timing of synchronized accesses to shared memory buffers by their corresponding producers and/or consumers. This program knowledge is used to eliminate a large number of snoop-induced cache lookups, even for references to the shared memory buffers. The conventional snoop controllers are augmented with a small amount of additional hardware controlled by the system software. This hardware dynamically identifies accesses to the relevant memory areas, detects the timing of temporal sharing between producers and consumers, and thus precludes snooping on those regions when it is safe to do so. The end result of the proposed methodology is a significant reduction of the power spent on unnecessary snoop-induced cache lookups.
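As a software illustration of the filtering decision the augmented snoop controller makes, the sketch below checks a snooped bus address against an OS-programmed table of shared-buffer regions; the table layout, field names, and phase encoding are illustrative assumptions for this sketch only, and the actual hardware design is described in Section 3.3.

    #include <stdint.h>
    #include <stdbool.h>

    /* Communication phase of a shared producer-consumer buffer, as seen
     * by one node; the encoding is illustrative. */
    enum phase { PHASE_PRODUCER_OWNS, PHASE_CONSUMER_OWNS };

    /* One OS-programmed entry describing a shared buffer region. */
    struct region_entry {
        uint32_t   base, size;    /* physical range of the shared buffer */
        enum phase phase;         /* current snoop-phase of the buffer   */
        bool       i_am_producer; /* role of this node for this buffer   */
    };

    /* Return true if the local cache must be looked up for a snooped bus
     * address; false means the lookup (and its energy) can be skipped.
     * Addresses outside any registered region snoop conservatively. */
    bool must_snoop(const struct region_entry *tbl, int n, uint32_t addr) {
        for (int i = 0; i < n; i++) {
            const struct region_entry *r = &tbl[i];
            if (addr - r->base < r->size) {
                /* During the phase in which the other party exclusively
                 * owns the buffer, synchronized access guarantees our
                 * cache needs no coherence action, so skip the lookup. */
                bool i_own_phase = (r->phase == PHASE_PRODUCER_OWNS)
                                       ? r->i_am_producer
                                       : !r->i_am_producer;
                return i_own_phase;
            }
        }
        return true; /* unknown address: fall back to normal snooping */
    }

The key design point, developed in the following sections, is that the synchronization protecting the buffer is what makes skipping the lookup safe: phase transitions happen only at synchronization points, where the OS updates the region table.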
3.2 Related Work

The emergence of multi-core processors in embedded applications has exacerbated the power concerns of these applications, but it has also exposed more opportunities to reduce power consumption. Many research projects have addressed power-aware coherence protocols. However, most of them are in the domain of general-purpose systems, such as desktop and server computers. Few of them have been in the embedded system domain, which oftentimes exhibits specific hardware architectures and program behaviors, and usually imposes more stringent power constraints. The contribution presented in this work enables the application of snoop-based cache coherence solutions in energy-efficient embedded applications.

Jetty [81] is the name of a family of snoop filters designed to reduce energy consumption in snoopy bus-based multiprocessor systems. Jetty observes activities in the local caches and records them in special cache-like hardware structures. By doing this, Jetty can dynamically tell the snoop controller what is present in the local cache and what is not. The snoop controllers of the local caches can subsequently decide whether it is safe to filter out certain cache lookups, which would otherwise consume a lot of power. The authors report an average of 29% energy reduction for L2 caches. RegionScout [80] is another technique that exploits coarse-grain data-sharing information to reduce cache energy and bus traffic in server applications. In RegionScout, the memory space is divided into a number of chunks (regions). An additional hardware structure dynamically records which regions hold useful data and which regions are blank. By identifying memory references to the different regions, the snoop controller can filter out remote references that are not relevant to its local activities, thus saving power in the cache system. In [117] the authors introduce the Temporal Streaming technique to dynamically identify sequences of memory accesses that correspond to a data stream. By moving the data stream to the requesting processor in advance, a large number of coherence misses are eliminated; additionally, program performance is improved in directory-based snoop protocols. In [21] the authors propose Coarse-Grain Coherence Tracking, which monitors the coherence status of large memory areas so as to avoid unnecessary broadcasts and thereby enhance performance for commercial, scientific, and multiprogrammed workloads. In [33], the authors target low-power chip multiprocessors and try to reduce TLB energy consumption by using virtually addressed data caches, as well as to reduce snoop energy loss by keeping track of the sharing set of each memory page.

While the techniques for general-purpose applications report significant power benefits, their hardware and power overhead would be non-trivial in most portable embedded devices. Moreover, in a general-purpose environment, system designers have limited software information and need to assume worst-case situations, which typically leads to hardware-only methods that dynamically try to uncover important program properties. For example, in Jetty and RegionScout, additional hardware structures observe the local caches and extract possible data-sharing information for the snoop controllers. This may be less effective for programs that have shared data scattered in the address space, or may lead to a large overhead for the monitor hardware as compared to the energy budget of an embedded application. In embedded applications, however, programs naturally have specific and deterministic behaviors according to their application environment, and system designers usually have much more detailed control over software and hardware integration. By exploiting this advantage, we can avoid the speculative mechanisms usually present in general-purpose approaches and minimize area and power overhead by precisely determining when, and for which specific regions, to block snooping.

[Figure 3.1: Synchronized producer-consumer communication with shared memory; accesses to the shared buffer S alternate exclusively between the producer's writes (S[i], S[j]) and the consumer's reads (S[m], S[n]).]

[...]

[Figure 6.4: Partitioning Heuristic Pseudocode]
1.  FOR k = 1 to K: P[k] = p_min;           // start every task at the minimum partition size
2.  SUM = K * p_min;
3.  res = S - SUM;                           // unassigned cache resources
4.  fill BWdrop_tbl[k][j] from the bandwidth-cache curves;
5.  WHILE (res > 0) do
6.    MAX = 0;
7.    FOR k = 1 to K, j = 1 to N             // find max bandwidth drop and record position
8.      IF SIZE[k][j] < res and BWdrop_tbl[k][j] > MAX
9.        THEN MAX = BWdrop_tbl[k][j];
10.            p = k; q = j;
11.   P[p] = p_min * (2**q);
12.   BWdrop_tbl[p][q] = 0;
13.   SUM = SUM + P[p] - p_min;              // calculate remaining resources
14.   res = S - SUM;

The algorithm relies on the bandwidth-cache characteristics of each task, which can be obtained through offline profiling or simulations. The nature of such an algorithm is NP-hard. This can be shown by reducing the knapsack problem to an instance of our problem, as follows. The knapsack problem consists of P objects {p_i}, each associated with a weight w_i and a value v_i; one wants to maximize the total value V under a certain capacity W. If we let the weight be the partition size of a particular task among all possible configurations, and the value be the bandwidth reduction achieved with that partition size, we get an instance of our partitioning problem, with the objective of maximum bandwidth reduction under the constraint of limited total cache space and a total of P tasks. Because of the huge search space of such a problem, we have developed a heuristic algorithm. The algorithm iteratively attempts distributions of cache resources to the different tasks in order to identify a mapping that minimizes the total bandwidth demand.

Due to the power-of-two constraint of the hardware-based partitioning technique, a partition can only grow to twice its size between consecutive configurations. Thus, for a total cache size of S, there are only N = log S possible partition sizes.
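Stated formally, the partitioning problem that the heuristic approximates can be written as the following optimization; the notation B_k for the profiled bandwidth-cache function of task k is introduced here for convenience and is ours, not taken verbatim from the original formulation.

    \begin{align*}
    \min_{P_1,\dots,P_K}\quad   & \sum_{k=1}^{K} B_k(P_k) \\
    \text{subject to}\quad      & \sum_{k=1}^{K} P_k \le S, \\
                                & P_k = p_{\min} \cdot 2^{\,j_k},\qquad
                                  j_k \in \{0, 1, \dots, N-1\},\ \ N = \log_2(S / p_{\min}),
    \end{align*}

where $B_k(\cdot)$ denotes the off-chip bandwidth demand of task $k$ as a function of its cache partition size, taken from the profiled bandwidth-cache curves.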
The bandwidth-cache functional relation is used to generate a K×N look-up table that records the bandwidth changes (drops) between consecutive steps for each constituent application in the workload. The algorithm works in the following iterative steps. It begins by assigning each task the minimum partition size, p_min, as the initial partitioning scheme. In each iteration, it calculates the difference between the total cache size, S, and the sum of the partition sizes of the current partitioning scheme, SUM(P_k). This difference is the amount of available cache resources the algorithm can explore. This number is also used to calculate the maximum number of partition-size steps, n, by which an individual partition can grow from the minimum size. Subsequently, the algorithm searches across all K tasks for the maximum bandwidth drop that can be reached under the current cache resource constraint. Suppose the maximum bandwidth drop belongs to the curve of task p at the q-th partition size; in this case, task p is assigned the partition size p_min * 2^q. Note that in each step, the algorithm finds and selects the task that exhibits the steepest drop in bandwidth demand when moving from one partition size to the next larger one. For that task, the algorithm selects the cache size at which this maximal bandwidth drop is achieved. The algorithm continues in the same way to identify the partition sizes for the rest of the tasks within the remaining cache resources. Because the partition sizes must be a power of two, the algorithm is guaranteed to terminate after a bounded number of selections, at most K*N = O(K log2(S)) of them.

6.4.3 Intuitive and Formal Description

Although this heuristic algorithm is based on a greedy search, it has proven to be very effective for the benchmark workloads in our experiments, and there is a good intuitive reason for this. The heuristic exploits a characteristic that is typical for the majority of the tasks' bandwidth-cache functions. As can be seen from Figure 6.1 in the previous section, most functions start (at the smallest partition size) from very high bandwidth requirements, misses, and miss rates, stay on that plateau until they reach a threshold size, and then dive quickly to another, much lower plateau. Furthermore, most functions have only one such dramatic diving phase, which represents the biggest bandwidth/miss drop near a favorable partition size. In the heuristic algorithm, this property shows up in the bandwidth-drop table as the few largest bandwidth-drop numbers for each task among all possible partition sizes. Selecting the smallest cache size that causes the largest bandwidth drop is clearly very beneficial, because the smaller partition sizes before this sharp drop are still too small, and the bandwidth demand there is close to the worst case of severe cache contention. The larger partition sizes beyond the sharp decline, however, result in a waste of cache resources, since the gains from the increased partition size remain moderate. In this case, it may be better to spend the available resources wisely on the other tasks if they can achieve a larger drop in bandwidth.

When the total cache size is relatively small, the algorithm will run out of cache resources shortly after finding a good partition size for only a few tasks. The rest of the tasks will keep their initially assigned minimum partition size. Although it may seem unfair to satisfy the needs of the few, such schemes often turn out to achieve a much lower overall system bandwidth. Judging from the bandwidth-cache functions, most applications stay very close to the high plateau before they reach a certain threshold; taking or giving any resources before that point does not make a sizable difference to them. When the total cache size is very large, the algorithm could end up with extra cache space not assigned to any task. Again referring to the bandwidth-cache curves, keeping or evenly distributing this space will not make much difference either, since most tasks will be sitting on their lower plateaus. The really interesting region between these two scenarios, however, is very short. Most applications exhibit only one or two very significant numbers in the bandwidth-drop table, a situation that is well suited to, and directly targeted by, this greedy algorithm.
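To complement the intuitive picture with the formal statement promised by this section's title, the problem that the greedy heuristic approximates can be written compactly as follows, where B_k denotes task k's bandwidth-cache function; the symbols are our shorthand for the quantities defined above rather than notation taken from the figures:

\min_{P_1,\dots,P_K} \; \sum_{k=1}^{K} B_k(P_k)
\quad \text{subject to} \quad
\sum_{k=1}^{K} P_k \le S,
\qquad
P_k \in \{\, p_{\min} \cdot 2^{j} : j = 0, 1, \dots, N-1 \,\}.

The knapsack reduction given earlier shows that even this restricted, discrete form is NP-hard, which is precisely what motivates greedily selecting the steepest affordable drop instead of searching the space exhaustively.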
Figure 6.4 gives a detailed description of this algorithm. Figure 6.5 provides an example of an input bandwidth table with different partitioning sizes; as can be seen, the algorithm finds the best partitioning scheme in two steps.

Figure 6.5: Algorithm Walkthrough on Example

Because the nature of the problem is NP-hard, the greedy heuristic algorithm does have adversarial cases. These often arise for particular combinations of tasks and are very dependent on the specific workloads. This can be improved significantly, however, by further tuning the heuristic with more hints from the programmers. In the experiment section below, a comparison is made between the simplest form of the heuristic algorithm and an exhaustive search method to show such differences.

6.5 Experimental Results and Discussion

APP1: MMUL MMUL MMUL MMUL TRI  TRI  TRI  TRI
APP2: MMUL MMUL MMUL MMUL SOR  SOR  SOR  SOR
APP3: EJ   EJ   EJ   EJ   FFT  FFT  FFT  FFT
APP4: EJ   EJ   FFT  FFT  SOR  SOR  TRI  TRI
APP5: EJ   EJ   LU   LU   SOR  SOR  TRI  TRI
APP6: MMUL MMUL FFT  FFT  TRI  TRI  SOR  SOR

Table 6.1: Benchmark Workloads

6.5.1 Experimental Setup

We have conducted detailed experiments to evaluate the effectiveness of the proposed cache partitioning methodology. We have used the M5 [18] simulator platform to conduct our experiments. M5 is a cycle-accurate full-system simulator and has been extensively enhanced for our study. The simulated machines are configured as 8-processor systems running at 1 GHz, with 16 kB split L1 caches and a shared L2 cache. We have evaluated the proposed technique for L2 cache sizes from 256 kB to 16 MB, in power-of-two increments. At each configuration, the proposed memory-bandwidth-based cache partitioning is applied and compared with a baseline architecture in which the cores share the entire L2 cache.

6.5.2 Benchmark Applications

Each benchmark workload constitutes a set of parallel applications. Each of these applications is in turn data-parallel and is naively parallelized into several identical threads, each running on its own part of the input data set. The input data size and the memory access patterns are the biggest factors determining their execution. The individual tasks execute one of the following computational kernels, which are broadly used in many numerical and signal processing applications: EJ, FFT, FDCT, LU, MMUL, SOR, TRI. The combinations that we have used as multi-tasked benchmarks are listed in Table 6.1.
Among the individual kernels in the workloads, matrix multiplication (MMUL) executes the multiplication of two square matrices; the successive over-relaxation method (SOR) solves a linear system of equations; the fast Fourier transform (FFT) computes the discrete Fourier transform of input signals; LU decomposition (LU) is a matrix decomposition algorithm used in many communications applications; EJ is the extrapolated Jacobi method; and TRI is a transformation that converts a matrix into upper triangular form. Each of the computational kernels operates on an input data buffer with a size from 0.5 MB to 1.5 MB. The off-chip bandwidth requirements as well as the cache miss characteristics for each task have been reported in a previous section of this work. The entire applications cover data arrays ranging from 4 MB to 12 MB. All caches are warmed up before the main execution in order to exclude cold cache misses.

Figure 6.6: Achieved bandwidth vs. baseline: APP1, APP2 and APP3
Figure 6.7: Achieved bandwidth vs. baseline: APP4, APP5 and APP6

6.5.3 Results and Analysis

Figure 6.6 and Figure 6.7 report the results of the six benchmarks compared to the baseline architecture in terms of bandwidth requirements. The yellow lines are the bandwidth requirements of the baseline executions; the blue lines represent the overall bandwidth demand when the proposed cache partitioning is applied. The bandwidth reductions achieved by the proposed technique are clearly demonstrated by the experimental data: reductions of up to 50% are achieved by the proposed cache partitioning.

The experimental results clearly demonstrate three distinct situations with respect to the total L2 cache size. The first situation corresponds to the initial points in the L2 cache size space, i.e., when the entire L2 cache is significantly smaller than 1) the total workload data size and 2) any individual kernel's earliest bandwidth-drop point on its bandwidth-cache function. In this situation, cache partitioning cannot help with such limited resources, and the algorithm ends up with a configuration whose bandwidth requirements are close to the baseline - sometimes worse, due to the enforced cache partition size limitations.

The proposed cache partitioning exhibits its benefits in the second situation, in which it is possible to allocate sufficiently large fractions of the cache resources to most of the kernels while the rest may remain in their worst scenarios. In this stage, the cache partitioning algorithm is able to exploit the opportunities to move individual kernels off the high plateaus of their bandwidth demand functions. It should be noted that the algorithm minimizes the overall system bandwidth requirement, but it does not do so uniformly across all the tasks. In this way, a particular task that can provide the most bandwidth benefit to the system will be selected for most of the cache resources, especially when such resources are somewhat limited. Replicas of the same task are treated as different tasks with the same bandwidth-cache curves. For L2 caches with relatively large sizes, i.e., between 1 MB and 8 MB, the algorithm can afford to make more flexible choices. The largest reductions can be seen for such L2 caches, as the partitioning algorithm is able to move toward, and achieve, the fastest bandwidth drops for all the tasks in the workload.

The third distinct situation is when the L2 cache size approaches 16 MB.
In this case, the bandwidth reductions compared to the baseline are minimal. This can be easily explained by the fact that such large L2 caches can be easily shared by all the tasks, as they can then fit their working sets with very few or no inter-task conflicts.

The difference between these three situations is often determined by the earliest significant bandwidth-reduction point of a task with sizable bandwidth requirements. This is also the earliest point at which the algorithm can identify an effective partitioning scheme. Of course, it is also possible to have workloads for which the baseline may slightly outperform any cache partitioning. This is largely due to the increased contention when separate cache partitions are enforced. In these cases, the multiple kernels do not interfere significantly and actually benefit from the larger shared L2 resources. However, even in these cases, the actual bandwidth requirements of the baseline and of the cache-partitioned configuration tend to be very close. This case is illustrated by benchmark APP5. Overall, the experimental results clearly demonstrate that our partitioning technique is effective in achieving significant off-chip bandwidth reductions.

6.5.4 Comparison Between Heuristic Algorithm and Exhaustive Search

As mentioned in the previous section, the heuristic algorithm is greedy-based and thus can encounter adversarial cases for a genuinely NP-hard problem. A study comparing the heuristic algorithm with an exhaustive search algorithm is presented in Figure 6.8.

Figure 6.8: Heuristic Algorithm Compared to Exhaustive Search

Shown in Figure 6.8 is the additional total bandwidth incurred by the heuristic algorithm as compared to the exhaustive search. From this chart, one can see that the heuristic algorithm achieves most of the benefit of the exhaustive one. It becomes less effective as the total cache space grows, which corresponds to a significantly larger search space of possible partitioning schemes. It is also very clear that different workloads lead to different heuristic effectiveness. Overall, the heuristic algorithm often arrives at a reasonably good partitioning scheme in near-constant time, while the exhaustive search, even with simple tuning, runs considerably longer as the search space grows.

6.6 Conclusions

In this work we have studied the relation between the off-chip memory bandwidth requirement and L2 cache partitioning for multi-core processor systems. The off-chip memory bandwidth limitation is becoming a pressing problem for multi-core architectures, one that significantly undermines the scaling trend of future platforms. We have shown that L2 cache partitioning can be an effective technique for reducing system bandwidth pressure. The proposed partitioning algorithm utilizes the bandwidth-cache functions of the individual programs in a workload and identifies a partitioning scheme that significantly reduces the overall bandwidth requirement. We have shown convincing experimental data that our bandwidth-based cache partitioning approach is effective in alleviating overall memory bandwidth requirements.

Chapter 7 Conclusions

7.1 Embedded Multi-Core Architecture Challenges

In this thesis, a number of issues concerning embedded multi-core architectures and their applications have been covered, including cache coherence protocols, shared-memory-based inter-core communication, synchronization mechanisms, and last-level cache space partitioning for bandwidth reduction.
For all these topics, benchmark studies show the inefficiencies of the conventional implementations in terms of power and performance. Much of this inefficiency stems from the general-purpose architectures for which such mechanisms were first developed.

There is strong motivation to improve on these inefficiencies. On the one hand, chip multiprocessors are developing toward unprecedented and unanticipated levels of integration. The quick pace of system scaling to much larger core counts is exposing the conventional implementations to very serious challenges. Such challenges must be addressed before further scaling can be meaningful to system designers.

On the other hand, many of the multi-core mechanisms have been developed for general-purpose platforms. The trade-offs in the embedded system domain make it necessary to tailor such mechanisms to specific applications, thereby reducing overhead and improving power and performance efficiency.

This study shows how much can be achieved by such a hardware-software customization method. There are certainly more areas to explore and more techniques to integrate into this framework.

7.2 Cross-Layer Customization for Embedded Multi-Cores

This work also shows the effectiveness of cross-layer customization, or hardware-software co-design, for embedded multi-core systems. Such heavy system-level tuning often exists in embedded system designs, in order to extract the most potential from the existing platform.

Embedded systems traditionally adopt much from general-purpose architectures. However, their dedicated nature makes highly specialized optimizations possible, and the efficiency improvement over the general-purpose counterparts is often very significant.

While general-purpose system designs do not modify the hardware layer, they can still tune the software layer to the specifications of the hardware platform. Such practice is becoming more common with large-scale CMPs even in general-purpose systems and in the design of future supercomputer systems. In that regard, this work could also provide some insights into the proper co-design method for such systems.

Bibliography

[1] Cavium OCTEON multi-core processor family.
[2] Grand Central Dispatch (GCD) Reference.
[3] http://www.picochip.com/products/pc102.
[4] http://www.tilera.com.
[5] http://www.vdc-corp.com/ehw/default.asp.
[6] Intel Threading Building Blocks.
[7] Xtensa architecture and performance, http://www.tensilica.com.
[8] H. Abdel-Shafi, J. Hall, S. V. Adve, and V. S. Adve. An evaluation of fine-grain producer-initiated communication in cache-coherent multiprocessors. In Third International Symposium on High-Performance Computer Architecture (HPCA), pages 204–215, Feb 1997.
[9] B. Akgul, J. Lee, and V. Mooney. A system-on-a-chip lock cache with task preemption support. In Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pages 149–157, 2001.
[10] B. Akgul and V. Mooney. PARLAK: Parameterized lock cache generator. In Design Automation and Test in Europe (DATE), pages 1138–1139, 2003.
[11] D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. In International Symposium on Microarchitecture (MICRO), pages 248–259, November 1999.
[12] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6–16, January 1990.
[13] James Archibald and Jean-Loup Baer. Cache coherence protocols: evaluation using a multiprocessor simulation model. ACM Transactions on Computer Systems (TOCS), 4(4):273–298, 1986.
[14] ARM Ltd. ARM11 Family.
[15] C. Ballapuram, A. Sharif, and H-H. Lee. Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors. In ASPLOS, pages 60–69, 2008.
[16] Rizwan Bashirullah, Wentai Liu, and Ralph K. Cavin. Low-power design methodology for an on-chip bus with adaptive bandwidth capability. In Design Automation Conference (DAC), pages 628–633, 2003.
[17] M. Berndl, O. Lhotak, F. Qian, L. Hendren, and N. Umanee. Points-to analysis using BDDs. In Conference on Programming Language Design and Implementation (PLDI), pages 103–114, 2003.
[18] N. Binkert, R. Dreslinski, L. Hsu, K. Lim, A. Saidi, and S. Reinhardt. The M5 simulator: Modeling networked systems. IEEE Micro, 26(4):52–60, 2006.
[19] J. Brown, R. Kumar, and D. Tullsen. Proximity-aware directory-based coherence for multi-core processor architectures. In ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 126–134, 2007.
[20] Doug Burger, James R. Goodman, and Alain Kägi. Memory bandwidth limitations of future microprocessors. In International Symposium on Computer Architecture (ISCA), pages 78–89, 1996.
[21] Jason F. Cantin, Mikko H. Lipasti, and James E. Smith. Improving multiprocessor performance with coarse-grain coherence tracking. SIGARCH Computer Architecture News, 33(2):246–257, 2005.
[22] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and performance of Munin. In Symposium on Operating Systems Principles (SOSP), pages 152–164, 1991.
[23] Anantha P. Chandrakasan. Low-Power Digital CMOS Design. PhD thesis, University of California at Berkeley, 1994.
[24] Jichuan Chang and Gurindar S. Sohi. Cooperative cache partitioning for chip multiprocessors. In International Conference on Supercomputing (ICS), pages 242–252, 2007.
[25] G. Chen and M. Kandemir. An approach for enhancing inter-processor data locality on chip multiprocessors. In Springer-Verlag Lecture Notes in Computer Science, pages 214–233, 2007.
[26] L. Cheng, J. Carter, and D. Dai. An adaptive cache coherence protocol optimized for producer-consumer sharing. In International Symposium on High Performance Computer Architecture (HPCA), pages 328–339, 2007.
[27] L. Cheng, N. Muralimanohar, K. Ramani, R. Balasubramonian, and J. Carter. Interconnect-aware coherence protocols for chip multiprocessors. In International Symposium on Computer Architecture (ISCA), pages 339–351, 2006.
[28] Liqun Cheng and John B. Carter. Fast barriers for scalable ccNUMA systems. In ICPP, pages 241–250, 2005.
[29] P. Cumming. The TI OMAP platform approach to SoC. In Winning the SoC Revolution. Kluwer Academic Publishers, 2003.
[30] M. Das. Unification-based pointer analysis with directional assignments. In Conference on Programming Language Design and Implementation (PLDI), pages 35–46, 2000.
[31] David Tam, Reza Azimi, Livio Soares, and Michael Stumm. Managing shared L2 caches on multicore systems in software. In WIOSCA '07, 2007.
[32] C. Ding, X. Shen, K. Kelsey, C. Tice, R. Huang, and C. Zhang. Software behavior oriented parallelization. In Conference on Programming Language Design and Implementation (PLDI), pages 223–234, 2007.
[33] M. Ekman, F. Dahlgren, and P. Stenstrom. TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors. In ISLPED, pages 243–246, August 2002.
[34] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In International Symposium on Computer Architecture (ISCA), pages 15–26, May 1990.
[35] O. Golubeva, M. Loghi, and M. Poncino. On the energy efficiency of synchronization primitives for shared-memory single-chip multiprocessors. In Great Lakes Symposium on VLSI (GLSVLSI), pages 489–492, 2007.
[36] James R. Goodman, Mary K. Vernon, and Philip J. Woest. Efficient synchronization primitives for large-scale cache-coherent multiprocessors. SIGARCH Computer Architecture News, 17(2):64–75, 1989.
[37] A. Gordon-Ross and F. Vahid. A self-tuning configurable cache. In Design Automation Conference (DAC), pages 234–237, 2007.
[38] Green Hills Software. INTEGRITY: The most advanced RTOS technology.
[39] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In WWC-4: Workshop on Workload Characterization, pages 3–14, December 2001.
[40] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In WWC, pages 3–14, Dec 2001.
[41] Michael Hind. Pointer analysis: Haven't we solved this problem yet? In ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE), 2001.
[42] Lisa R. Hsu, Steven K. Reinhardt, Ravishankar Iyer, and Srihari Makineni. Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 13–22, 2006.
[43] William Tsun-Yuk Hsu and Pen-Chung Yew. An effective synchronization network for hot-spot accesses. ACM Transactions on Computer Systems, 10(3):167–189, 1992.
[44] http://www.gimp.org. The GNU Image Manipulation Program.
[45] J. Huh, J. Chang, D. Burger, and G. Sohi. Coherence decoupling: making use of incoherence. ACM SIGARCH Computer Architecture News, 32(5):97–106, 2004.
[46] Intel. Intel Xeon Processor at 1.40 GHz, 1.50 GHz, 1.70 GHz and 2 GHz.
[47] Intel Corporation. Intel XScale Microarchitecture.
[48] Vadim Iosevich and Assaf Schuster. A comparison of sequential consistency with home-based lazy release consistency for software distributed shared memory. In International Conference on Supercomputing (ICS), pages 306–315, New York, NY, USA, 2004. ACM.
[49] Ahmed A. Jerraya. Long term trends for embedded system design. In DSD '04: Proceedings of the Digital System Design, EUROMICRO Systems, pages 20–26, Washington, DC, USA, 2004. IEEE Computer Society.
[50] V. Kathail, S. Aditya, R. Schreiber, B. R. Rau, D. C. Cronquist, and M. Sivaraman. PICO: Automatically designing custom computers. IEEE Computer, 35(9):39–47, 2002.
[51] S. Keckler, W. Dally, D. Maskit, N. Carter, A. Chang, and W. Lee. Exploiting fine-grain thread level parallelism on the MIT Multi-ALU processor. In International Symposium on Computer Architecture (ISCA), pages 306–317, 1998.
[52] Ken Kennedy and John R. Allen. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.
[53] B. Khailany, W. Dally, U. Kapasi, P. Mattson, J. Namkoong, J. Owens, B. Towles, A. Chang, and S. Rixner. Imagine: Media processing with streams. IEEE Micro, 21(2):35–46, 2001.
[54] Seongbeom Kim, Dhruba Chandra, and Yan Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 111–122, 2004.
[55] L. Kontothanassis, M. Scott, and R. Bianchini. Lazy release consistency for hardware-coherent multiprocessors. In Supercomputing Conference (SC), pages 61–61, 1995.
[56] Claus Kuhnel. AVR RISC Microcontroller Handbook. Elsevier, 1998.
[57] R. Kumar, K. Farkas, N. Jouppi, P. Ranganathan, and D. Tullsen. Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In MICRO, pages 81–92, 2003.
[58] Edward L. Lamie. Real-Time Embedded Multithreading: Using ThreadX and ARM. CMP Books, 2004.
[59] William Landi. Undecidability of static analysis. ACM Letters on Programming Languages and Systems, 1(4):323–337, December 1992.
[60] P. Landman and J. Rabaey. Black-box capacitance models for architectural power analysis. In International Workshop on Low Power Design, pages 65–170, April 1994.
[61] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In International Symposium on Microarchitecture (MICRO), pages 330–335, December 1997.
[62] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In MICRO, pages 330–335, Dec 1997.
[63] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. In International Symposium on Computer Architecture (ISCA), pages 148–159, 1990.
[64] J. Li, J. Martinez, and M. Huang. The thrifty barrier: Energy-aware synchronization in shared-memory multiprocessors. In International Symposium on High Performance Computer Architecture (HPCA), 2004.
[65] M-L. Li, R. Sasanka, S. Adve, Y-K. Chen, and E. Debes. The ALPBench benchmark suite for complex multimedia applications. In International Symposium on Workload Characterization, pages 34–45, October 2005.
[66] A. W. Lim and M. S. Lam. Maximizing parallelism and minimizing synchronization with affine transforms. In Symposium on Principles of Programming Languages, pages 201–214, 1997.
[67] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In International Symposium on High Performance Computer Architecture (HPCA), pages 367–378, 2008.
[68] Chun Liu, Anand Sivasubramaniam, and Mahmut Kandemir. Organizing the last line of defense before hitting the memory wall for CMPs. In International Symposium on High Performance Computer Architecture (HPCA), page 176, 2004.
[69] M. Loghi, M. Letis, L. Benini, and M. Poncino. Exploring the energy efficiency of cache coherence protocols in single-chip multi-processors. In GLSVLSI, pages 276–281, 2005.
[70] M. Loghi and M. Poncino. Exploring energy/performance tradeoffs in shared memory MPSoCs: Snoop-based cache coherence vs. software solutions. In Conference on Design, Automation and Test in Europe (DATE), pages 508–513, 2005.
[71] M. Loghi, M. Poncino, and L. Benini. Cache coherence tradeoffs in shared-memory MPSoCs. ACM Transactions on Embedded Computing Systems, 5(2):383–407, 2006.
[72] D. Lyonnard, S. Yoo, A. Baghdadi, and A. Jerraya. Automatic generation of application-specific architectures for heterogeneous multiprocessor system-on-chip. In Design Automation Conference (DAC), pages 518–523, 2001.
[73] S. A. Mahlke, W. Y. Chen, J. C. Gyllenhaal, and W.-M. W. Hwu. Compiler code transformations for superscalar-based high performance systems. In Conference on Supercomputing, pages 808–817, 1992.
[74] A. Marongiu, L. Benini, and M. Kandemir. Lightweight barrier-based parallelization support for non-cache-coherent MPSoC platforms. In Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pages 145–149, 2007.
[75] Grant Martin. Overview of the MPSoC design challenge. In DAC '06: Proceedings of the 43rd Annual Design Automation Conference, pages 274–279, 2006.
[76] M. K. Martin, M. D. Hill, and D. A. Wood. Token coherence: decoupling performance and correctness. In International Symposium on Computer Architecture (ISCA), pages 182–193, 2003.
[77] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21–65, 1991.
[78] M. Monchiero, G. Palermo, C. Silvano, and O. Villa. Efficient synchronization for embedded on-chip multiprocessors. IEEE Transactions on Very Large Scale Integration Systems, 14(10):1049–1062, October 2006.
[79] J. Montanaro et al. A 160 MHz, 32 b, 0.5 W CMOS RISC microprocessor. In IEEE ISSCC, pages 214–229, February 1996.
[80] A. Moshovos. RegionScout: Exploiting coarse grain sharing in snoop-based coherence. In International Symposium on Computer Architecture (ISCA), pages 234–245, 2005.
[81] A. Moshovos, G. Memik, A. Choudhary, and B. Falsafi. JETTY: Filtering snoops for reduced energy consumption in SMP servers. In International Symposium on High-Performance Computer Architecture (HPCA), pages 85–96, 2001.
[82] A. Nacul and T. Givargis. Lightweight multitasking support for embedded systems using the Phantom serializing compiler. In Conference on Design, Automation and Test in Europe (DATE), pages 742–747, 2005.
[83] Kyle J. Nesbit, James Laudon, and James E. Smith. Virtual private caches. SIGARCH Computer Architecture News, 35(2):57–68, 2007.
[84] Dimitrios S. Nikolopoulos and Theodore S. Papatheodorou. Fast synchronization on scalable cache-coherent multiprocessors using hybrid primitives. In Proceedings of the 14th International Symposium on Parallel and Distributed Processing, pages 711–719, 2000.
[85] Jim Nilsson, Anders Landin, and Per Stenstrom. The coherence predictor cache: A resource-efficient and accurate coherence prediction infrastructure. In International Symposium on Parallel and Distributed Processing, pages 10–17, 2003.
[86] A. Patel and K. Ghose. Energy-efficient MESI cache coherence with pro-active snoop filtering for multicore microprocessors. In International Symposium on Low Power Electronics and Design (ISLPED), pages 247–252, 2008.
[87] T. Pering, T. Burd, and R. Brodersen. Dynamic voltage scaling and the design of a low-power microprocessor system, 1998.
[88] P. Petrov, D. Tracy, and A. Orailoglu. Energy-efficient physically tagged caches for embedded processors with virtual memory. In Design Automation Conference, pages 17–22, June 2005.
[89] S. Powell et al. Estimating power dissipation of VLSI signal processing chips: The PFA technique. In VLSI Signal Processing, pages 250–259, 1990.
[90] S. C. Prasad and K. Roy. Circuit optimization for minimization of power consumption under delay constraint. In International Workshop on Low Power Design, pages 15–20, April 1994.
[91] M. Qureshi and Y. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In International Symposium on Microarchitecture (MICRO), pages 423–432, 2006.
[92] Nauman Rafique, Won-Taek Lim, and Mithuna Thottethodi. Architectural support for operating system-driven CMP cache management. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 2–12, 2006.
[93] G. Ramalingam. The undecidability of aliasing. ACM Transactions on Programming Languages and Systems (TOPLAS), 16(5):1467–1471, 1994.
[94] R. M. Ramanathan. Intel multi-core processors: Making the move to quad-core and beyond. Technology@Intel Magazine, (1):2–4, 2006.
[95] Brian M. Rogers, Anil Krishna, Gordon B. Bell, Ken Vu, Xiaowei Jiang, and Yan Solihin. Scaling the bandwidth wall: challenges in and avenues for CMP scaling. SIGARCH Computer Architecture News, 37(3):371–382, 2009.
[96] C. Rowen. Engineering the Complex SOC: Fast, Flexible Design with Configurable Processors. Prentice Hall, New Jersey, 2004.
[97] R. Rugina and M. Rinard. Pointer analysis for multithreaded programs. SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 34(5):77–90, 1999.
[98] B. Saglam and V. Mooney. System-on-a-chip processor synchronization support in hardware. In Design, Automation and Test in Europe (DATE), pages 633–641, 2001.
[99] A. Salcianu and M. Rinard. Pointer and escape analysis for multithreaded programs. In Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 12–23, 2001.
[100] Jack Sampson, Ruben Gonzalez, Jean-Francois Collard, Norman P. Jouppi, Mike Schlansker, and Brad Calder. Exploiting fine-grained data parallelism with chip multiprocessors and fast barriers. In MICRO, pages 235–246, 2006.
[101] D. C. Sastry and M. Demirci. The QNX operating system. Computer, 28(11):75–77, 1995.
[102] M. Schlett. Trends in embedded microprocessor design. Computer Magazine, 31(8):44–49, August 1998.
[103] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. De Micheli. Dynamic voltage scaling and power management for portable systems. In 38th DAC, pages 524–529, June 2001.
[104] J. Singh, W-D. Weber, and A. Gupta. SPLASH: Stanford parallel applications for shared-memory. SIGARCH Computer Architecture News, 20(1):5–44, 1992.
[105] K. Strauss, X. Shen, and J. Torrellas. Flexible snooping: Adaptive forwarding and filtering of snoops in embedded-ring multiprocessors. In International Symposium on Computer Architecture (ISCA), pages 327–338, 2006.
[106] C. H. Tan and J. Allen. Minimization of power in VLSI circuits using transistor sizing, input ordering, and statistical power estimation. In International Workshop on Low Power Design, pages 75–80, April 1994.
[107] T. Tan, A. Raghunathan, and N. Jha. Embedded operating system energy analysis and macro-modeling, 2002.
[108] Andrew Tanenbaum. Modern Operating Systems. Prentice Hall, 2001.
[109] D. Tarjan, S. Thoziyoor, and N. Jouppi. CACTI 4.0: An integrated cache timing, power and area model. Technical report, HP Laboratories Palo Alto, June 2006.
[110] W. Thies, V. Chandrasekhar, and S. Amarasinghe. A practical approach to exploiting coarse-grained pipeline parallelism in C programs. In International Symposium on Microarchitecture (MICRO), pages 356–369, 2007.
[111] M. Thompson, H. Nikolov, T. Stefanov, A. Pimentel, C. Erbas, S. Polstra, and E. Deprettere. A framework for rapid system-level exploration, synthesis, and programming of multimedia MP-SoCs. In International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 9–14, 2007.
[112] S. Thoziyoor, N. Muralimanohar, J. Ahn, and N. Jouppi. CACTI 5.3. Technical report, HP Laboratories Palo Alto, April 2008.
[113] Transmeta. Crusoe LongRun Power Management White Paper.
[114] Dean M. Tullsen, Jack L. Lo, Susan J. Eggers, and Henry M. Levy. Supporting fine-grained synchronization on a simultaneous multithreading processor. In HPCA '99: Proceedings of the 5th International Symposium on High Performance Computer Architecture, page 54, Washington, DC, USA, 1999. IEEE Computer Society.
[115] O. Villa, G. Palermo, and C. Silvano. Efficiency and scalability of barrier synchronization on NoC based many-core architectures. In International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 81–90, 2008.
[116] T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. Temporal streaming of shared memory. In ISCA, 2005.
[117] T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. Temporal streaming of shared memory. In International Symposium on Computer Architecture (ISCA), pages 222–233, 2005.
[118] James Y. Wilson and Aspi Havewala. Building Powerful Platforms with Windows CE. Addison-Wesley Professional, 2001.
[119] Wind River. VxWorks, http://www.windriver.com.
[120] W. Wolf. The future of multiprocessor systems-on-chips. In DAC, pages 681–685, June 2004.
[121] Karim Yaghmour. Building Embedded Linux Systems. O'Reilly Media, Inc., 2003.
[122] C. Yang and A. Orailoglu. Light-weight synchronization for inter-processor communication acceleration on embedded MPSoCs. In Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES), pages 150–154, 2007.
[123] S-H. Yang, B. Falsafi, M. D. Powell, and T. N. Vijaykumar. Exploiting choice in resizable cache design to optimize deep-submicron processor energy-delay. In Symposium on High-Performance Computer Architecture (HPCA), page 151, 2002.
[124] C. Yu and P. Petrov. Aggressive snoop reduction for synchronized producer-consumer communication in energy-efficient embedded multi-processors. In CODES+ISSS, pages 245–250, 2007.
[125] C. Yu and P. Petrov. Low-power snoop architecture for synchronized producer-consumer embedded multiprocessing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17(9):1362–1366, September 2009.
[126] C. Yu and P. Petrov. Low-cost and energy-efficient distributed synchronization for embedded multiprocessors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 18(8):1257–1261, August 2010.
[127] Chenjie Yu and Peter Petrov. Distributed and low-power synchronization architecture for embedded multiprocessors. In CODES+ISSS '08: Proceedings of the 6th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 73–78, New York, NY, USA, 2008. ACM.
[128] Chenjie Yu and Peter Petrov. Latency and bandwidth efficient communication through system customization for embedded multiprocessors. In DAC '08: Proceedings of the 45th Annual Design Automation Conference, pages 766–771, New York, NY, USA, 2008. ACM.
[129] Chenjie Yu and Peter Petrov. Off-chip memory bandwidth minimization through cache partitioning for multi-core platforms. In DAC '10: Proceedings of the 47th Design Automation Conference, pages 132–137, New York, NY, USA, 2010. ACM.
[130] Chenjie Yu and Peter Petrov. Aggressive snoop reduction for synchronized producer-consumer communication in energy-efficient embedded multi-processors. In CODES+ISSS '07: Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, pages 245–250, New York, NY, USA, 2007. ACM.
[131] Chenjie Yu, Xiangrong Zhou, and Peter Petrov. Low-power inter-core communication through cache partitioning in embedded multiprocessors. In SBCCI '09: Proceedings of the 22nd Annual Symposium on Integrated Circuits and System Design, pages 1–6, New York, NY, USA, 2009. ACM.
[132] W. Yuan and K. Nahrstedt. Integration of dynamic voltage scaling and soft real-time scheduling for open mobile systems, 2002.
[133] C. Zhang, F. Vahid, and W. Najjar. A highly configurable cache architecture for embedded systems. In International Symposium on Computer Architecture (ISCA), pages 136–146, 2003.
[134] X. Zhou, C. Yu, A. Dash, and P. Petrov. Application-aware snoop filtering for low-power cache coherence in embedded multiprocessors. ACM Transactions on Design Automation of Electronic Systems (TODAES), 13(1), January 2008.
[135] Weirong Zhu, Vugranam C. Sreedhar, Ziang Hu, and Guang R. Gao. Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures. In ISCA, pages 35–45, 2007.