ABSTRACT Title of dissertation: ARCHITECTURAL-PHYSICAL CO-DESIGN OF 3D CPUS WITH MICRO-FLUIDIC COOLING Caleb Serafy, Doctor of Philosophy, 2016 Dissertation directed by: Professor Ankur Srivastava Department of Electrical Engineering The performance, energy efficiency and cost improvements due to tradition- al technology scaling have begun to slow down and present diminishing returns. Underlying reasons for this trend include fundamental physical limits of transis- tor scaling, the growing significance of quantum effects as transistors shrink, and a growing mismatch between transistors and interconnects regarding size, speed and power. Continued Moore’s Law scaling will not come from technology scaling alone, and must involve improvements to design tools and development of new disruptive technologies such as 3D integration. 3D integration presents potential improve- ments to interconnect power and delay by translating the routing problem into a third dimension, and facilitates transistor density scaling independent of technology node. Furthermore, 3D IC technology opens up a new architectural design space of heterogeneously-integrated high-bandwidth CPUs. Vertical integration promises to provide the CPU architectures of the future by integrating high performance proces- sors with on-chip high-bandwidth memory systems and highly connected network- on-chip structures. Such techniques can overcome the well-known CPU performance bottlenecks referred to as memory and communication wall. However the promising improvements to performance and energy efficiency offered by 3D CPUs does not come without cost, both in the financial investments to develop the technology, and the increased complexity of design. Two main limi- tations to 3D IC technology have been heat removal and TSV reliability. Transistor stacking creates increases in power density, current density and thermal resistance in air cooled packages. Furthermore the technology introduces vertical through silicon vias (TSVs) that create new points of failure in the chip and require development of new BEOL technologies. Although these issues can be controlled to some exten- t using thermal-reliability aware physical and architectural 3D design techniques, high performance embedded cooling schemes, such as micro-fluidic (MF) cooling, are fundamentally necessary to unlock the true potential of 3D ICs. A new paradigm is being put forth which integrates the computational, elec- trical, physical, thermal and reliability views of a system. The unification of these diverse aspects of integrated circuits is called Co-Design. Independent design and optimization of each aspect leads to sub-optimal designs due to a lack of under- standing of cross-domain interactions and their impacts on the feasibility region of the architectural design space. Co-Design enables optimization across layers with a multi-domain view and thus unlocks new high-performance and energy efficient con- figurations. Although the co-design paradigm is becoming increasingly necessary in all fields of IC design, it is even more critical in 3D ICs where, as we show, the inter- layer coupling and higher degree of connectivity between components exacerbates the interdependence between architectural parameters, physical design parameters and the multitude of metrics of interest to the designer (iByB power, performance, temperature and reliability). 
In this dissertation we present a framework for multi- domain co-simulation and co-optimization of 3D CPU architectures with both air and MF cooling solutions. Finally we propose an approach for design space explo- ration and modeling within the new Co-Design paradigm, and discuss the possible avenues for improvement of this work in the future. ARCHITECTURAL-PHYSICAL CO-DESIGN OF 3D CPUs WITH MICRO-FLUIDIC COOLING by Caleb Serafy Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2016 Advisory Committee: Professor Ankur Srivastava, Chair/Advisor Professor Donald Yeung Professor Joseph JaJa Professor Manoj Franklin Professor Alan Sussman © Copyright by Caleb Serafy 2016 Acknowledgments I would like to thank my advisor, Professor Ankur Srivastava for the support and guidance he has provided throughout my time in the Ph.D. program at the U- niversity of Maryland. Professor Srivastava has always been very available to meet and discuss research while at the same time allowing his students to foster self suffi- ciency and creative critical thinking on their own. Professor Srivastava demands the highest quality of work from his students, but in return offers reliable support both financially and technically, resulting in a very strong and fruitful advisor-student relationship that facilitates significant contributions to the research community. I would also like to thank Donald Yeung for the many hours we have spent together discussing research and for his many insights and suggestions regarding how to apply our EDA research base with problems of interest in the architectural community. Identifying and advancing the state of the art at the crossover between the two disciplines is the fundamental motivation behind this dissertation. Furthermore I would like to thank Professor Ankur Srivastava, Professor Don- ald Yeung, Professor Joseph JaJa, Professor Manoj Franklin and Professor Alan Sussman for their time to serve on this committee and their valuable technical feed- back on the content of this dissertation. I would also like to thank Professor Avram Bar-Cohen, Professor Uzi Vishkin, Professor Yogendra Joshi, Professor Sudhakar Yalamanchili and all of their respective students for their technical contributions to the work put forth in this dissertation. ii I would be remiss not to thank my wonderful colleagues. First I should thank my senior colleagues Dr. Bing Shi and Professor Domenic Forte for their guidance and friendship as I began by academic career and now as I transition into the industry. Second I thank my current colleagues, Tiantao Lu, Chongxi Bao, Zhiyuan Yang, Yang Xie and Yuntao Liu. I thank you for all the great technical work we have collaborated on, and the fruitful and interesting research discussions we have had. I am grateful for the lifelong friendships and professional relationships I have developed during my time in this group. Finally I thank my lovely wife Kacee for all her encouragement, support and self-sacrifice to make this dissertation possible. While I worked long hours at the lab Kacee has done more than her share to help provide for our family and take care of our two beautiful daughters. I thank my parents for raising me to appreciate academia, inspiring me to pursue doctoral studies, and providing moral and financial support throughout my studies. 
iii Table of Contents List of Tables vii List of Figures viii List of Abbreviations xi List of Publications xi 1 Introduction 1 1.1 Advantages of 3D Integration . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Thermal and Reliability Issues . . . . . . . . . . . . . . . . . . . . . . 6 1.3 3D IC Co-Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2 3D CPUs: Background and Motivation 12 2.1 Three-Dimensional Integration . . . . . . . . . . . . . . . . . . . . . . 12 2.2 Memory Wall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 3D Memories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1 Wide-IO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.2 Hybrid Memory Cube . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Memory-on-Logic 3D CPU . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.1 Capacity Limitations . . . . . . . . . . . . . . . . . . . . . . . 19 2.5 3D Super-Mesh NOC . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5.1 3D Super-Mesh TSV Requirements . . . . . . . . . . . . . . . 23 2.5.2 3D NOC-Bus Hybrid . . . . . . . . . . . . . . . . . . . . . . . 24 2.6 Thermal Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.7 Reliability Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.8 Micro-Fluidic Cooling . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3 3D CPU Co-Simulation Co-Optimization Flow 31 3.1 Architectural Design Space . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2 Performance Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 iv 3.3 DRAM Latency Model . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3.1 MC Queuing Delay . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3.1.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . 37 3.4 Power/Area Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4.1 Pumping Power . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.5 Core Netlist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.6 Wire Delay Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.7 Reliability Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.8 Thermal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.8.1 Leakage Model . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.9 Floorplan Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.9.1 Floorplan Representation . . . . . . . . . . . . . . . . . . . . . 50 3.9.2 Simulated Annealing Approach . . . . . . . . . . . . . . . . . 51 3.9.3 Speeding Up Simulation Time . . . . . . . . . . . . . . . . . . 52 3.9.4 Core Tiling and NOC Design . . . . . . . . . . . . . . . . . . 53 3.9.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.10 Cooling Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.10.1 Microchannel Placement Representation . . . . . . . . . . . . 57 3.10.2 Simulated Annealing Approach . . . . . . . . . . . . . . . . . 58 3.10.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.10.4 Microchannel Cost Model . . . . . . . . . . . . . . . . . . . . 61 3.11 Simultaneous Optimization . . . . . . . . . . . . . . . . . . . . . . . . 
64 4 Architectural Opportunities of Micro-Fluidically Cooled 3D CPUs 64 4.1 2D vs. 3D CPUs and the need for MF cooling . . . . . . . . . . . . . 65 4.1.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.1.2 Temperature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.1.3 Thermally Feasible Performance . . . . . . . . . . . . . . . . . 74 4.1.4 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.2 Frequency Scaling with Micro-Fluidics . . . . . . . . . . . . . . . . . 78 4.2.1 Design Space and Benchmarks and Metrics . . . . . . . . . . . 79 4.2.2 Core and Frequency Scaling . . . . . . . . . . . . . . . . . . . 80 4.2.3 Scaling Trends . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5 Architectural-Physical Co-Design of Micro-Fluidically Cooled 3D CPUs 86 5.1 Thermal-Reliability Aware Architectural-Physical DSE . . . . . . . . 87 5.1.1 Feasibility Region . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.1.2 Optimal Performance . . . . . . . . . . . . . . . . . . . . . . . 92 5.1.3 Reliability Constraint Sensitivity . . . . . . . . . . . . . . . . 94 5.2 Thermal-Bandwidth Trade-offs in MF Cooled 3D CPUs . . . . . . . . 96 5.2.1 Bandwidth Requirements . . . . . . . . . . . . . . . . . . . . . 99 5.2.2 Memory Controller TSV Density . . . . . . . . . . . . . . . . 99 5.2.3 Router TSV Density . . . . . . . . . . . . . . . . . . . . . . . 100 5.2.4 TSV Density Requirement . . . . . . . . . . . . . . . . . . . . 100 v 5.2.5 Bandwidth Capacity . . . . . . . . . . . . . . . . . . . . . . . 100 5.2.6 Pin Fin Thermal Model . . . . . . . . . . . . . . . . . . . . . 101 5.2.7 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 104 5.2.8 Architectural Parameter Sensitivity . . . . . . . . . . . . . . . 106 5.2.9 Heatsink Parameter Sensitivity . . . . . . . . . . . . . . . . . 106 5.2.10 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 6 Design Space Modeling for Physically Constrained 3D CPUs 114 6.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 6.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 6.3 Modeling and Simulation Technique . . . . . . . . . . . . . . . . . . . 121 6.3.1 SS-ANOVA Modeling . . . . . . . . . . . . . . . . . . . . . . . 122 6.3.2 Choosing Model Terms . . . . . . . . . . . . . . . . . . . . . . 123 6.3.3 Adding Simulation Points . . . . . . . . . . . . . . . . . . . . 125 6.3.4 Stopping Criteria . . . . . . . . . . . . . . . . . . . . . . . . . 126 6.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 6.4.1 Architectural Design Space . . . . . . . . . . . . . . . . . . . . 127 6.4.2 Software Benchmarks . . . . . . . . . . . . . . . . . . . . . . . 128 6.4.3 Discovery Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 129 6.4.4 Modeling and Simulation Parameters . . . . . . . . . . . . . . 130 6.4.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 132 6.4.6 Comparison to Other Techniques . . . . . . . . . . . . . . . . 133 6.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 6.5.1 Design Space Characterization . . . . . . . . . . . . . . . . . . 135 6.5.2 “Optimal” Discovery . . . . . . . . . . . . . . . . . . . . . . . 137 6.5.2.1 Robustness to Constraint Tightness . . 
. . . . . . . . 139 6.5.3 “Pareto” Discovery . . . . . . . . . . . . . . . . . . . . . . . . 142 6.5.4 Overhead of modeling approach . . . . . . . . . . . . . . . . . 143 6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 7 Conclusions and Future Work 145 7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 7.1.1 Expansion of Co-Design Scope . . . . . . . . . . . . . . . . . . 148 7.1.1.1 Power Delivery . . . . . . . . . . . . . . . . . . . . . 149 7.1.1.2 Signal Integrity . . . . . . . . . . . . . . . . . . . . . 150 7.1.2 Fine-Grained Design and Integration . . . . . . . . . . . . . . 151 7.1.3 Runtime Management . . . . . . . . . . . . . . . . . . . . . . 152 Bibliography 154 vi List of Tables 2.1 Comparison of 3D mesh and 3D super-mesh NOC [1] . . . . . . . . . 22 3.1 Architectural parameters . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2 2D vs. 3D DRAM Bus . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Micro-fluidic system parameters . . . . . . . . . . . . . . . . . . . . . 40 3.4 CPU core component properties . . . . . . . . . . . . . . . . . . . . . 42 3.5 Transistor and interconnect parameters for 45 nm technology [2] . . . 43 3.6 Thermal model material properties . . . . . . . . . . . . . . . . . . . 47 4.1 Study 1: Architectural Design Space . . . . . . . . . . . . . . . . . . 67 4.2 Study 2: Architectural Design Space . . . . . . . . . . . . . . . . . . 79 4.3 Maximum benchmark performance s.t. thermal constraint . . . . . . 81 5.1 Study 3: Architectural Design Space . . . . . . . . . . . . . . . . . . 87 5.2 Micro-fluidic pin-fin heatsink dimensions . . . . . . . . . . . . . . . . 97 5.3 Micro-fluidic pin-fin thermal model parameters . . . . . . . . . . . . . 103 5.4 Study 4: Architectural Design Space . . . . . . . . . . . . . . . . . . 105 5.5 Normalized Co-design Results . . . . . . . . . . . . . . . . . . . . . . 111 6.1 Architectural design space (baseline architecture shown in bold). . . . 128 6.2 Simulated Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 vii List of Figures 1.1 (a) Transistor cost [3] (b) wire/gate delay [4] (c) wire/gate power [5] . 4 1.2 Relationship graph for 3D CPU metrics and design variables . . . . . 7 2.1 3D IC cross section . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Memory wall [6]. Multi-core trends plotted for different amounts of workload parallelism. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3 Stacked DRAM architecture . . . . . . . . . . . . . . . . . . . . . . . 19 2.4 NOC (left) 2D mesh (right) 3D mesh [7] . . . . . . . . . . . . . . . . 21 2.5 Vertical connections in a column of 3D super-mesh routers . . . . . . 23 2.6 Trapped heat effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.7 Thermal map of (a) processor layer, (b) bottom DRAM layer and (c) top DRAM layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.8 TSV CTE miss-match stress field . . . . . . . . . . . . . . . . . . . . 27 2.9 Micro-fluidic heatsink in memory-on-logic 3D CPU . . . . . . . . . . 30 3.1 Simulation flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2 CPU core component netlist with net widths notated. . . . . . . . . . 41 3.3 TSV EM reliability model . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4 Thermal resistance grids for fluid and solid materials . . . . . . . . . 47 3.5 Thermal-leakage relationship . . . . . . . . . . . . . . . . . . . . . . . 
48 3.6 Example thermally unaware floorplan with MF cooling . . . . . . . . 54 3.7 Example thermally aware floorplan with MF cooling . . . . . . . . . . 55 3.8 Temperature and power density of air cooled floorplan . . . . . . . . 59 3.9 Temperature and channel distribution using uniform MF heatsink. . . 59 3.10 Temperature and channel distribution using optimized MF heatsink. . 60 3.11 Microchannel cost model example . . . . . . . . . . . . . . . . . . . . 62 4.1 Average DRAM latency vs. number of memory controllers [8] . . . . 67 4.2 Performance vs. MCs and frequency (a) 2D CPU (c) 3D CPU . . . . 69 4.3 Temperature vs. MCs and frequency of air cooled 2D CPU . . . . . . 71 4.4 Temperature vs. MCs and frequency (a) air cooled 3D CPU (b) MF cooled 3D CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.5 Best achievable performance subject to thermal constraints . . . . . . 73 viii 4.6 Power dissipation vs. MCs and frequency of air cooled 2D CPU . . . 76 4.7 Power dissipation vs. MCs and frequency (a) air cooled 3D CPU (b) MF cooled 3D CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.8 3D CPU (a) performance (b) energy efficiency vs. frequency with air cooling and MF cooling . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.9 3D CPU (a) temperature (b) power vs. frequency with air cooling and MF cooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.1 3D CPU design space performance . . . . . . . . . . . . . . . . . . . 88 5.2 Thermal feasibility region (shown in white) . . . . . . . . . . . . . . . 89 5.3 Reliability feasibility region (shown in white) . . . . . . . . . . . . . . 89 5.4 Thermal-reliability feasibility region (shown in white) . . . . . . . . . 90 5.5 Co-design results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.6 Performance improvement due to reliability-aware FP . . . . . . . . . 95 5.7 Micro-fluidic pin-fin cooling of a single layer in a 3D-IC . . . . . . . . 97 5.8 Control volume around one pin . . . . . . . . . . . . . . . . . . . . . 102 5.9 Normalized metrics of 3D CPU architectural design space . . . . . . . 105 5.10 Maximum feasible performance and energy efficiency vs. pin pitch . . 107 5.11 Thermal feasibility region (shown in white) . . . . . . . . . . . . . . . 109 5.12 Bandwidth feasibility region (shown in white) . . . . . . . . . . . . . 109 5.13 Thermal-bandwidth feasibility region (shown in white) . . . . . . . . 110 6.1 Modeling and simulation technique . . . . . . . . . . . . . . . . . . . 121 6.2 Distribution of (a) performance (b) temperature in design space . . . 135 6.3 Temperature vs. performance of entire design space . . . . . . . . . . 136 6.4 Optimality of identified design. . . . . . . . . . . . . . . . . . . . . . 138 6.5 Additional simulations required when ivOURatOUT is reduced from 85 ◦C to 65 ◦C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.6 Accuracy of identified Pareto set. . . . . . . . . . . . . . . . . . . . . 141 7.1 PDN model in a 3D IC . . . . . . . . . . . . . . . . . . . . . . . . . . 149 7.2 TSV-TSV coupling circuit model . . . . . . . . . . . . . . . . . . . . 
151 ix List of Abbreviations BEOL Back End of Line RC Resistance/Capacitance HMC Hybrid Memory Cube PRAM Phase-Change RAM MRAM Magnetic RAM MC Memory Controller NUMA Non-Uniform Memory Access CMP Chip Multi-Processor NOC Network on Chip CTE Coefficient of Thermal Expansion MF Micro-Fluidic PPAT Performance, Power, Area and Timing M2S Multi2Sim IPC Instructions per Clock AR Aspect Ratio RAT Register Alias Table ALU Arithmetic Logic Unit IFU Instruction Fetch Unit LSU Load Store Unit MMU Memory Management Unit TLB Translation Look-aside Buffer EM Electromigration PDN Power Delivery Network PDF Probability Density Function TCG Transitive Closure Graph ROUT Router EX Execution Unit IPnS Instructions per Nanosecond BIPS Billion Instructions per Second EDP Energy Delay Product Freq Frequency T Thermal R Reliability BW Bandwidth DSE Design Space Exploration SS-ANOVA Smoothing Spline Analysis of Variance ROI Region of Interest x List of Publications Hmsplal Nsblgaargmls H1. Y. Xie, C. Bao, A. Scpadw, T. Lu, A. Srivastava and M. Tehranipoor, “Secu- rity and Vulnerability Implications of 3D ICs”, AEEE Lrafkaclagfk gf Emdla- Kcade Cgehmlafg Kqkleek, Accepted March 2016 H0. A. Scpadw, Z. Yang, Y. Hu, A. Srivastava and Y. Joshi, “Thermo-Electric Co-design of 3D CPUs and Embedded Micro-fluidic Pin-fin Heatsinks”, AEEE Dekagf afd Lekl, February 2016 H1. A. Scpadw, A. Bar-Cohen, A. Srivastava and D. Yeung, “Unlocking the True Potential of 3D CPUs with Micro-Fluidic Cooling”, AEEE Lrafkaclagfk gf NDKA Kqkleek, July 2015 H2. A. Scpadw and A. Srivastava, “TSV Placement and Shield Insertion for TSV- TSV Coupling Reduction in 3-D Global Placement”, AEEE Lrafkaclagfk gf CAD: Khecaad Akkme gf Phqkacad Dekagf Lechfaimek fgr Adnafced Lechfgdggq Fgdek, January 2015 H3. A. Scpadw, B. Shi and A. Srivastava, “A Geometric Approach to Chip-Scale TSV Shield Placement for the Reduction of TSV Coupling in 3D-ICs”, Af- legralagf, lhe NDKA Jgmrfad bq Edkenaer: NDKA fgr lhe Feo Era, December 2013 Hmsplal Nsblgaargmls (Slbcp Pctgcu) P1. T. Lu, A. Scpadw, Z. Yang, S.K. Lim and A. Srivastava, “3D ICs: Design Methods and Tools”, AEEE Lrafkaclagfk gf CAD, Submitted March 2016 Amldcpclac Nsblgaargmls A1. T. Lu, A. Scpadw, Z. Yang and A. Srivastava, “Voltage Noise Induced DRAM Soft Error Reduction Technique for 3D-CPUs”, Aflerfalagfad Kqehgkame gf Dgo Pgoer Edeclrgfack afd Dekagf (AKDPED), August 2016 A0. Z. Yang, A. Scpadw and A. Srivastava, “ECO Based Placement and Routing Framework for 3D FPGAs with Micro-fluidic Cooling”, AEEE Aflerfalagfad Kqehgkame gf Faedd-Prggraeeabde Cmklge Cgehmlafg Eachafek (FCCE), May 2016 A1. A. Scpadw, T. Lu and A. Srivastava, “Thermal-Reliability Physical Co- Optimization During Architectural Design Space Exploration of 3D-CPUs”, Ggnerfeefl Eacrgcarcmal Ahhdacalagfk afd Cralacad Lechfgdggq Cgfferefce (GGEACLech), March 2016 A2. A. Scpadw, A. Srivastava, A. Bar-Cohen and D. Yeung, “Design Space Ex- ploration of 3D CPUs and Micro-Fluidic Heatsinks with Thermo-Electrical- Physical Co-Optimization”, Aflerfalagfad Lechfacad Cgfferefce afd Ephaba- lagf gf Paccagafg afd Aflegralagf gf Edeclrgfac afd Phglgfac Eacrgkqkleek (AflerPACC), July 2015 xi A3. A. Scpadw, A. Srivastava and D. Yeung, “Unlocking the True Potential of 3D CPUs with Micro-Fluidic Cooling”, Aflerfalagfad Kqehgkame gf Dgo Pgoer Edeclrgfack afd Dekagf (AKDPED), August 2014 A4. A. Scpadw, A. Srivastava and D. 
Yeung, “Continued Frequency Scaling in 3D ICs through Micro-fluidic Cooling”, AEEE Aflerkgcaelq Cgfferefce gf Lheread afd Lheregeechafacad Phefgeefa af Edeclrgfac Kqkleek (ALher- e), May 2014 A5. A. Scpadw and A. Srivastava, “Coupling-Aware Force Driven Placemen- t of TSVs and Shields in 3D-IC layouts”, ACE Aflerfalagfad Kqehgkame gf Phqkacad Dekagf (AKPD), April 2014 A6. A. Scpadw, B. Shi, A. Srivastava and D. Yeung, “High Performance 3D Stacked DRAM Processor Architectures with Micro-Fluidic Cooling”, AEEE Aflerfalagfad 3D Kqkleek Aflegralagf Cgfferefce (3D-AC), October 2013 A7. A. Scpadw and A. Srivastava, “Online TSV Health Monitoring and Built- in Self-Repair to Overcome Aging”, AEEE Kqehgkame gf Defecl afd Famdl Lgderafce (DFL), October 2013 A1..B. Shi, A. Scpadw and A. Srivastava, “Co-Optimization of TSV Assignment and Micro-Channel Placement for 3D-ICs”, Greal Dacek Kqehgkame gf NDKA (GDKNDKA), May 2013 A11.A. Scpadw, B. Shi and A. Srivastava, “A Geometric Approach to Chip-Scale TSV Shield Placement for the Reduction of TSV Coupling in 3D-ICs”, Greal Dacek Kqehgkame gf NDKA (GDKNDKA), May 2013 Maeaxglc Aprgalcs M1. A. Scpadw and A. Srivastava, “Leakage Power: Physical Mechanisms and Possible Solutions”, Edeclrgfack Cggdafg, December 2014 xii Chapter 1: Introduction CMOS technology has for the last half century taken advantage of aggressive technology scaling, resulting in faster and more densely packed transistors that have provided exponential increases in computing capacity. Over the years, the consumer market for semiconductors has come to expect such a rate of growth to continue far into the future. However, today transistor scaling is approaching fundamental physical and economic limits, and already the rate of increase in computing power and performance has begun to slow. Vertical integration (3D ICs) is an emerging technology which promises to rein- vigorate Moore’s Law performance scaling by reducing interconnect power and delay, and facilitating new heterogeneous computer architectures such as stacked memory- on-logic CPUs [9–11]. Additionally, logic-on-logic stacking can create more highly connected circuits and increase inter-core communication bandwidth in multi-core CPUs [7, 12, 13]. Stacking memory-on-logic can provide a high-bandwidth memory interface to the processor [9, 14], overcoming the memory wall [6] and facilitating the processing in memory paradigm [11]. 1 Thus 3D integration brings the potential of many advantages both at the cir- cuit and architectural level. However these advantages come with a cost in terms of physical constraints and increased dependencies between CPU components and across metric domains. The chief limitation associated with 3D ICs is thermal in nature [8,14–16]. Vertical stacking inherently increases power flux while inter-layer dielectrics significantly increase the thermal resistance of the stack. Other limita- tions come from the introduction of through silicon vias (TSVs) which introduce new failure modes [17–19] and sources of noise coupling [20–24] while increasing the impedance of the power delivery network [25,26]. Increased thermal insulation makes 3D IC temperature a much more highly coupled function of CPU architecture, performance and power [8, 27]. Furthermore it is well known that critical path delay, leakage power and reliability are strong functions of temperature, creating an interconnected network of metrics that all in- fluence each other. 
Although the same fundamental relationships exist in 2D ICs, the higher connectivity, and spatial coupling between stacked components exacer- bate these inter-dependencies in 3D to such an extent that simultaneous modeling and optimization is a must [27–32]. In this dissertation we explore the potential of 3D CPU architectural oppor- tunities and evaluate the associated challenges (yBgB, thermal and reliability issues) and their implications on the architectural feasibility space. We propose a co-design paradigm to design 3D CPUs to maximize their performance and/or energy ef- 2 ficiency under physical constraints and finally propose a modeling and simulation methodology for high dimensionality design space exploration of the 3D CPU design space. 1.1 Advantages of 3D Integration As transistor sizes approach atomic scale, quantum effects that have tradi- tionally been insignificant begin to significantly effect behavior. Moreover transistor size is fundamentally limited by the dimensions of the atoms used to construct them. Additionally, the traditional scaling trend of manufacturing cost per transis- tor (Figure 1.1(a)) is expected to stall out very soon, removing a significant economic incentive to invest in future technology nodes [3]. Another issue causing Moore’s Law scaling to end is the growing gap in perfor- mance and power efficiency of transistors vs. interconnect [4,5]. Figures 1.1(b) and 1.1(c) show the trends of transistor and interconnect delay and power respectively as technology has advanced. Transistors are clearly increasing in speed due to smaller input capacitance whereas interconnect is decreasing in speed due to smaller more resistive wires, and more wire-wire parasitic capacitance [33]. For similar reasons, chip-scale transistor power remains nearly flat over time while interconnect power is increasing at a much faster rate [5]. Closing the gap between transistors and wires is necessary to continue historical scaling trends of power and performance over time. 3 180 130 90 65 40 28 20 16 0 5 10 15 20 25 30 35 40 Technology (nm) Co st p er M illi on T ra ns ist or (c en ts) (v) 650 500 350 250 180 130 1000 10 20 30 40 D e la y (ps ) Technology (nm) Interconnect (Al + SiO2) Interconnect (Cu + low-k) Gate Delay (w) 150 130 100 90 80 70 65 45 35 20 0 1 2 3 4 5 6 N or m al iz ed P ow er Technology (nm) Interconnect Power Gate Power (x) Figure 1.1: (a) Transistor cost [3] (b) wire/gate delay [4] (c) wire/gate power [5] 4 Engineers are aggressively investigating new technologies and paradigm shifts that can continue to provide the market with the growth it expects, even as technol- ogy scaling has begun to stall out. Transistors have traditionally been laid out in a two dimensional plane on a silicon wafer. One technique to improve transistor and interconnect density without the use of technology scaling is to pack transistors into three dimensional space, resulting in what are called three-dimensional integrated circuits (3D ICs). In addition to increasing transistor density, which can increase circuit performance and reduce power consumption, 3D integration can theoretical- ly reduce interconnect length by a factor of √ c where c is the number of stacked layers [34]. Assuming optimal buffer insertion, this would reduce wire delay and power proportionally [35]. Another advantage of vertical integration is chip level integration of circuits manufactured in disparate technologies, referred to as heterogeneous integration. 
This allows circuits such as analog sensors, MEMs, RF, DRAM, and CMOS to all be integrated together, extending the system on a chip (SoC) paradigm to many new applications. Not only can heterogeneous integration make new SoC designs feasible, it can improve the quality of current SoC designs, by allowing different components of the design to be fabricated in a manufacturing process optimized for that specific component. Circuits that are traditionally fabricated as separate chips and connected using an interposer or PCB can be vertically integrated with TSVs, greatly increasing the bandwidth between these chips, and opening up oppor- tunity to redesign how such circuits interact with one another, possibly increasing performance and/or decreasing power consumption. 5 1.2 Thermal and Reliability Issues Temperature and reliability are two of the most important challenges associ- ated with 3D ICs. Other challenges include signal integrity and power delivery [26]. Thermal challenges arise from the increased power flux inherent to 3D stacking. High temperatures can cause timing violations by increasing transistor and inter- connect resistance, and excessively high temperatures can even cause permanent physical damage to the chip. Thus chip temperature plays a critical roll in both soft and hard error reliability. Temperature significantly effects leakage power. In- creased power leads to higher current density which can cause electromigration and IR voltage drop in the power delivery network (PDN). Furthermore temperature fluctuations can cause TSV defect formation from thermal cycling and so called TSV pop-out and delamination [36]. Although traditional 2D circuits can address the thermal and related reliability issues by attaching a large heatsink to the back side of the chip to dissipate the heat to the environment, this approach is not applicable to 3D ICs. An attached heatsink can only remove significant heat from the top layer, as other layers are sandwiched between electrical isolation layers composed of SiO2 which block heat dissipation and cause high temperatures [27, 28]. We refer to this as the trapped heat effect (Figure 2.6). Micro-fluidic cooling is a promising technology for localized embedded cooling that can overcome the trapped heat effect and scale cooling capacity with 6 Reliability Temperature Power Distribution Floorplan Cooling Distribution Power Archiecture Heatsink Design TSV Density Performance Wire LengthCurrent Density Net Activity TSV Count Frequency Design Variable Constraint Target Metric Stress Figure 1.2: Relationship graph for 3D CPU metrics and design variables number of layers. In our work we examine the power, performance, thermal and reliability interdependence and show the massive potential of micro-fluidically cooled and multi-objective co-design in 3D CPUs. 1.3 3D IC Co-Design In the previous sections we have discussed the physical design challenges (yBgB, temperature and reliability) and the architectural opportunities of 3D integration. Traditionally the physical and architectural designs are performed independently in sequence using different levels of abstraction. Moreover, even within the physical design domain, design problems are tackled sequentially, and cross-domain opti- 7 mizations are not usually considered. A new paradigm which integrates the compu- tational, electrical, physical, thermal and reliability views of the system is gaining steam. This unification of diverse aspects of the overall integrated system is called Co-design. 
Co-design enables optimizations across different layers of the design hi- erarchy which are not possible through a conventional top down design approach thereby unlocking new high performance configurations. In the remainder of this dissertation we use 3D CPUs as a case study to exemplify the interdependence of the physical and architectural design spaces. We use a novel simulation flow which integrates placement, temperature and reliability design challenges into a unified framework for architectural-physical optimization and analysis (Chapter 3). Figure 1.2 illustrates the cause and effect relationships from some chosen design variables to the optimization and constraint metrics of interest. The figure clearly illustrates the interdependence between the terminal and intermediate nodes, and no metric of interest can be determined without simultaneous consideration of all design variables. The interconnectedness of this relationship graph strongly motivates the need for the co-design paradigm. Isolating any subset of graph nodes from Figure 1.2 requires cutting many edges. In other words estimates calculated from a subset of design metrics, variables and objective functions suffer from comprised accuracy due to the high connectivity in the graph and large loss of information when graph edges are removed. 8 Furthermore, we observe that the relationship graph contains cycles, which imply nested loops within a simulation flow. An example is the interdependence of temperature and leakage power. Leakage power increases as temperature elevates, and likewise temperature will rise when leakage power increases. Iterative simula- tions are required to accurately capture such inter-dependencies. Co-design design space exploration (DSE) is a computationally intensive problem due to both opti- mization loops and nested simulation loops within the evaluation flow of a single design candidate. 1.4 Thesis Outline In this thesis we first provide some in depth background information on 3D CPUs in Chapter 2. This includes details on the architectural advantages of 3D integration, the physical design issues and micro-fluidic cooling. In Chapter 3 we introduce the simulation flow used to estimate metrics of interest for a given 3D CPU architecture, including performance, power, temperature and reliability. Fur- thermore we introduce here the physical design optimization loops evaluated in Chapter 5. Chapter 4 evaluates the advantages in performance and energy efficiency that can be achieved by 3D CPUs. Our first study shows significant performance poten- tial, but this potential is not realized with traditional air cooling, and MF cooling is required to unlock the benefits of high-bandwidth stacked memory. In our second study we consider how micro-fluidic cooling and 3D memory-on-logic stacking can 9 revitalize the classic frequency scaling paradigm in parallel with the current core scaling model. Some of the major reasons frequency scaling came to an end was temperature and memory bandwidth issues, which are largely overcome by memory- on-logic stacking and MF cooling. Chapter 5 evaluates the effectiveness of physical co-design towards expanding the 3D CPU architectural design space feasibility region and thus unlocking new high-performance high-energy-efficient CPU architectures of the future. Physical design of both the logic and the heatsink are explored subject to simultaneous and interrelated temperature and reliability constraints. 
One interesting result is that temperature and reliability optimization can be at conflict with one another, which seems counter-intuitive, and further justifies the need for a co-design approach that is aware of the intricate trade-offs between multiple design variables. Another study reported in this chapter investigates the fundamental trade- off between cooling capacity and inter-layer bandwidth (iByB TSV density) in a MF cooled 3D IC. We show that using a generic heatsink design geared towards minimiz- ing temperature or maximizing TSV density only leads to significant performance sub-optimality, and a co-design approach is necessary to discover the best heatsink parameters for each architectural design point. Chapter 6 introduces a modeling and simulation scheme to bring the co- design framework discussed in previous chapters into practical use on large multi- dimensional problems. The 3D CPU co-simulation framework introduced in Chap- ter 3 covers a wide array of different simulations and model, and thus consumes a non-trivial amount of compute resources. Exhaustive application of this simu- 10 lation flow over a large industry-scale design space may not be computationally feasible. Thus we propose a methodology to accurately predict the design space and identify regions of interest (yBgB, optimal-feasible region or Pareto optimal front) while simulating only a small percentage of the design space. Our results show high accuracy compared to randomized or modeling-only approaches, and makes the co- design paradigm developed in this dissertation practically applicable to real design problems. Finally Chapter 7 concludes the dissertation with a summary of the work com- pleted, and some recommendations for future work. Avenues for continuation of the work begun in this dissertation include integration of additional design metrics and models, a hierarchical co-design framework to progress from high-level to detailed design, efficient methods of cutting the co-design graph to balance design time with quality, and the integration of runtime management approaches into the co-design framework. 11 Chapter 2: 3D CPUs: Background and Motivation 3D Stacking is an emerging technology which offers many new opportunities for high performance CPU architectures. The memory wall [9] is a known hurtle to future performance and power scaling, and 3D integration is a promising technology to overcome it. Stacked memory circuits are already in commercial production [37,38] and heterogeneous memory-on-logic CPUs are being aggressively researched and prototyped [14,27,39]. Moreover, communication overheads in both power and delay have become more and more significant as we have entered the age of big data. This is the so-called communication wall [40]. 3D CPUs offer new solutions such as high-bandwidth on-chip processing-in-memory [11, 41, 42] and highly connected 3D NOC topologies [13, 27, 43]. Finally we discuss some of the physical challenges associated with 3D CPUs, potential solutions, and the need for a co-design paradigm to optimize for strong architectural-physical interactions inherent to 3D CPUs. 2.1 Three-Dimensional Integration 3D ICs are formed by stacking multiple layers of traditional (2D) ICs one atop the other. Some nets in the 3D circuit span multiple layers, and must be connected with vertical interconnects. 
The most prominent type of vertical interconnect is 12 Metal Layer Substrate Wire Transistor Top Layer Bottom Layer T S V KOZ L in e r Figure 2.1: 3D IC cross section called the through silicon via (TSV). TSVs are vertical columns of metal that pass through the silicon substrate and connect the horizontal metal wires in adjacent IC layers, as shown in Figure 2.1. TSVs are used to deliver both signals and power between layers of a 3D IC. Because a TSV passes through the substrate, transistors and TSVs cannot coexist at that same location in the same layer. Hence TSV place- ment effects the positions of transistors and the length of wires, which determine the overall delay of a circuit. TSVs pass through the electrically charged and conductive silicon substrate, and so they must be surrounded by a layer of insulating material to decouple them from the substrate. This layer of insulation is called the liner, and is typically made of silicon dioxide (SiO2). There exists a minimum spacing between TSVs and other features such as transistors and other TSVs, which must be enforced in order to guarantee proper functionality of the chip. This minimum spacing is called the keep 13 out zone (KOZ) and is determined by the precision of the manufacturing process and TSV effects such as thermally-induced stress around a TSV due to the mismatch in thermal expansion of the silicon, the liner, and the TSV [44]. Vertical integration is a promising new technology and can continue transistor density scaling as technology scaling slows down due to physical limitations. Beyond transistor density scaling, 3D integration brings other unique advantages. Because each layer in a chip stack is manufactured independently, 3D integration can fa- cilitate heterogeneous integration by manufacturing different layers with disparate manufacturing processes. Vertical integration also increases the overall connectivity of a system by decreasing the average distance between system components, thus decreasing global wirelengths, critical path delays and interconnect power. By im- plementing a circuit in c layers, the global wirelength can be reduced by up to a factor of √ c [34]. 2.2 Memory Wall The so-called memory wall describes the limitation put on processor perfor- mance and energy efficiency due to a lack of high-bandwidth, high-density low-power DRAM circuits. The term was originally coined to describe the gap in CPU and memory performance, as shown in Figure 2.2. An initial solution to this gap was the addition of cache memory on chip to hide the DRAM latency, but caches are limited in size due to silicon area and leakage power constraints. Moreover as the multi-core paradigm has matured, memory bandwidth has become a limitation not 14 1980 1989 1998 2007 2016 1 10 100 1k 10k 100k R e la tiv e Pe rfo rm a n ce Year Multi-Core Single Core Memory Paralellism (%) [100, 90, 75, 50] Figure 2.2: Memory wall [6]. Multi-core trends plotted for different amounts of workload parallelism. just due to DRAM speed, but also due to increased memory access rates as more cores operate in parallel. The memory wall is a key obstacle in the climb towards next generation computing: both mobile and exascale supercomputing. 2.3 3D Memories 3D integration is an enabling technology to further the three memory design goals: higher density, higher bandwidth, and lower power. 
Vertical stacking inher- ently increases memory density within a fixed footprint area, and heterogeneous integration facilitates high speed, and/or very wide TSV memory buses which dis- sipate considerably less power than their off-chip counterparts. Two main strategies have been employed towards bringing 3D memory into the commercial market. One focuses on speed using very high speed differentially signaled serial interconnects. Although this strategy increases absolute power, the power efficiency (bandwidth per Watt) is much improved. An example of such an 15 architecture is Micron’s Hybrid Memory Cube (HMC) [37]. Alternatively a wide parallel bus can be pursued taking advantage of the tremendous interconnect den- sity offered by TSV technology [37]. This strategy can massively improve memory bandwidth without increasing power, or alternatively provide very low power op- eration at nominal performance. An example of such an architecture is Samsung’s Wide-IO DRAM [38]. 2.3.1 Wide-IO The Wide-IO memory architecture consists of 4 independent channels each with a 128 bit data bus. Each channel contains four 64 Mb arrays, for a total capacity of 1Gb per layer. The Wide-IO memory can deliver peak bandwidth up to 12.8 GB s−1, 4x higher than the equivalent LPDDR2 device, while increasing bandwidth per Watt of IO power by more than 10x [38]. The Wide-IO 2 specification has been released by JEDEC and makes many significant improvements [45]. The number of channels can be increased from 4 to 8, the density ranges from 8 to 32 Gb and the peak bandwidth tops out at 34 (4 channel) or 68 (8 channel) GB s−1. Moreover the operating voltage is reduced from 1.2 to 1.1 V, providing even lower power. Wide-IO 2 is expected to surpass the performance of LPDDR4 in 3D stacked devices [45]. 16 Wide-IO memory is intended to be integrated directly on top of logic using TSVs. This approach is ideal for density and power, but has thermal implications. Wide-IO is expected to be used in high-end smart phones, but in the absence of embedded active cooling schemes may not be thermally feasible in a server or super- computer environment [46]. 2.3.2 Hybrid Memory Cube The HMC is connected to the CPU through a board-level high speed differ- ential serial interface [37]. However the cube itself is composed of stacks of DRAM on top of a layer of CMOS. This heterogeneous integration allows for optimized common logic circuits such as decoders and memory controllers while maintaining the memory density characteristics of stacked DRAM. HMC facilitates a distributed architecture called “Far” mode [37] where multiple HMCs are connected together to form a memory network for scalable high capacity memory systems. HMC moves the memory controller to the DRAM module itself rather than the core in order to efficiently realize such a scaled architecture. The HMC significantly improves DRAM latency by reducing memory con- troller queuing delays and providing more memory parallelism though independent bank operation. Experimental data from first generation HMC prototype reports DRAM bandwidth of 128GB s−1while dissipating 11 W, improving bandwidth per Watt more than 3.5x over DDR4 [37]. 17 Analysis by TSMC [46] shows that Wide-IO 2 brings the best of both worlds by providing performance parity with DDR4 while matching LPDDR4 in power dissipation. On the other hand the HMC is a revolutionary new memory architecture that pushes performance, power and price to new extremes. 
2.4 Memory-on-Logic 3D CPU Heterogeneous 3D integration can provide massive bandwidth improvements between CPU core logic and memory. Non-CMOS technologies such as DRAM, phase-change RAM (PRAM) and magnetic RAM (MRAM) [47] can be stacked di- rectly on top of logic cores. Stacked memory-on-logic DRAM architectures are a natural solution to the memory wall problem as they can offer high-bandwidth, low- latency, low-power interconnects between memory and CPU. Increases in bandwidth and power efficiency come from reduction in interconnect length (iByB RC parasitics) and massively increased integration density of TSVs as compared to off-chip PCB traces [9, 27]. TSV integration can facilitate many more memory controller (MC) modules to increase memory access parallelism at the expense of increased power, temperature and area [8, 9, 12].Studies have shown that the performance improve- ments due to main memory stacking can be up to 2x [8, 9]. Stacked DRAM is considered to be one of the primary advantages of 3D CPUs [9,39]. A cross section of a stacked DRAM memory-on-logic 3D CPU is shown in Figure 2.3. 18 DRAM Rank 3 Rank 2 Rank 1 Rank 0 Logic Package Substrate TSVs Figure 2.3: Stacked DRAM architecture 2.4.1 Capacity Limitations The capacity of on-chip DRAM is limited to only a few GB [11, 27]. Thus most computing systems require both on and off-chip DRAM. On-chip DRAM could be leveraged as cache or a non-uniform memory access (NUMA) paradigm can be applied [48] to manage both on and off-chip DRAM as a unified main memory. Even within a stacked DRAM module, non-uniform access constraints may need to be applied due to non-uniform power delivery capacity in the 3D stack [49]. Such NUMA systems require memory swap controllers to keep hot memory pages in low-latency portions of the memory [48,49]. Studies have shown the effectiveness of using stacked DRAM for additional cache rather than main memory. DRAM cache can offer large capacity compared to an SRAM cache of the same area [50] while maintaining higher bandwidth and lower latency compared to main memory [51]. Moreover hot page migration into a DRAM 19 cache can be done at the cache line granularity whereas NUMA stacked memory systems must swap memory at the page granularity, which is both inefficient and requires OS support [48]. However there are two main limitations to DRAM cache: the tag array would be unreasonably large for standard (yBgB, 64 MB) cache line sizes, and off-chip main memory cannot provide the necessary bandwidth to use significantly larger cache line sizes. Jiang yt ulB [51] proposed a hot-page filtering technique to efficiently manage the DRAM bandwidth to leverage performance improvements of up to 25% from a 128 MB DRAM cache. Loh [50] leveraged the DRAM row buffer hardware to further increase DRAM cache performance by 29% by employing an adaptive multi-queue policy. On the other hand, Chou yt ulB [48] presented a low overhead technique that allows NUMA stacked memory to achieve cache-line level data mi- gration, outperforming both DRAM cache and traditional NUMA stacked memory. 2.5 3D Super-Mesh NOC Traditionally, communication between caches, cores and IO devices has been accomplished using a bus architecture. A bus is a shared communication fabric where communication is broadcast to all bus nodes. While such an architecture is fast, it has been shown to scale poorly when the number of bus nodes surpasses roughly 10 [13] due to bus contention in the shared fabric. 
Today’s chip multiproces- sors (CMPs) already have more than 10 cores, and are expected to continue scaling to hundred or even thousands of nodes [52]. Thus the network on chip (NOC) has 20 3D Mesh Link Layer 1 Layer 2 Layer 4 Layer 3 3D Super-Mesh Link 2D Mesh Link 2D Torus Link 3D-Torus Link Figure 2.4: NOC (left) 2D mesh (right) 3D mesh [7] become standard communication fabric in modern multi-core architectures. NOC- s use a packetized routing network. Thus many communication packets can be simultaneously passed through the network across independent router links. The standard NOC topology has been a 2D mesh where nodes are spread uniformly in two dimensions and each router connects to its four Manhattan neigh- bors as well as its local node [7, 13]. However in many-core systems, whether dis- tributed or integrate on chip, inter-core communications delays have begun to dom- inate [11,53–55]. This is called the communication wall. The extension of the mesh topology into 3D has been shown to provide significant improvements in latency, throughput and energy efficiency [7, 43]. However, due to the mismatch in vertical (hundreds of microns) and horizontal (millimeters) length of inter-core router links, more innovative NOC topologies that provide higher connectivity in the vertical direction have also been proposed [7, 12,13]. 21 One simple extension that can be applied to either 2D or 3D mesh topologies is the torus ring. The torus adds a connection between the first and last node in each row and column of a mesh. This modification reduces the diameter (iByB worst case distance) of the NOC, but introduces non-uniform delay hops which complicate routing algorithms. However this can be significantly offset by use of a folded torus topology. In general torus topology has less latency but consumes more power [56]. In the vertical direction, the motivation behind the torus architecture can be further extended to include connecting all nodes in a vertical column due to the relatively small distance between nodes on adjacent layers. Circuit analysis estimates that multilayer routing channels can traverse up to four layers in the vertical direction with the same delay as a horizontal connection between adjacent cores [1,57]. The 3D super-mesh topology was introduced in [27] which connects each pair of network nodes in a vertical column with a dedicated router link. Performance improvements and power and area overheads versus standard 3D-NOC are shown in Table 2.1. Mesh, torus and super-mesh topologies are illustrated in Figure 2.4. Table 2.1: Comparison of 3D mesh and 3D super-mesh NOC [1] Metric 3D super-mesh 3D mesh Ratio IPC 29.3 25.3 1.16 Average Latency (cycles) 42.9 49.4 0.87 Total CPU Power (W) 315 284 1.11 Total CPU Area (m2m) 1580 1516 1.04 22 Router n Router n-1 Router n-2 Router 3 Router 2 Router 1 ... ... ... ... ... n - 2 s e ts n -1 s e ts n-2 links n-1 links ... ... ... ... ... Figure 2.5: Vertical connections in a column of 3D super-mesh routers 2.5.1 3D Super-Mesh TSV Requirements In a 3D CPU with a 3D super-mesh NOC on n logic layers, each router requires n−1 vertical links to directly connect to all routers above and below it. Each vertical connection between layer i and layer j requires a TSV between all adjacent layers from i to j. Hence, the total number of TSVs that passes between layer i and layer i+ 1 in a vertical column of 3D super-mesh NOC routers is given in Equation (2.1) as iROUT and illustrated in Figure 2.5. lROTQ is the bit width of the router link. 
In the studies presented in this dissertation lROTQ = 128 bits. iROUT (i) = lROTQi(n− i) (2.1) 23 2.5.2 3D NOC-Bus Hybrid A hybrid structure for 3D NOC has been proposed in [13]. A traditional 2D mesh is used in each layer, but a subset of the routers on each layer are connected to a vertical bus that allows broadcast communication between all routers in a vertical column. This approach achieves full communication between all layers in the vertical direction while minimizing the number of ports (and thus the power and area) of each router. The number of nodes on each vertical bus is equal to the number of layers in the NOC which is typically less than 10 [58], implying that bus is a reasonable communication fabric in the vertical direction. Results show that the proposed 3D NOC-bus hybrid structure applied to a shared banked L2 cache outperforms a 2D NOC. Moreover it is shown that cache line mitigation is much less common in the 3D NOC due to higher connectedness between nodes, and even with cache line mitigation turned off in the 3D NOC, it still outperforms 2D [13]. 2.6 Thermal Issues The chief challenge associated with 3D integration is thermal management. Thermal challenges in 3D ICs are twofold. Unlike technology scaling, 3D integration increases transistor density without reducing the power per transistor. This results in increased power flux as more layers are stacked. Exacerbating this problem, the dielectrics between functional layers have relatively low thermal conductivity, and significantly diminish heat flow from stacked layers to the heat sink in traditional air- cooling schemes. The cooling capacity on each layer of an air-cooled 3D IC degrades 24 Trapped Heat Free Heat Si SiO2 Insulation Heatsink Top Layer Middle Layer Figure 2.6: Trapped heat effect as the layer moves farther away from the heatsink, therefore large thermal gradients form in the vertical direction [27]. We call this phenomenon the trapped heat effect (Figure 2.6) and it can result in extremely high peak temperatures [59,60]. Figure 2.7 shows an example thermal profile for a 3D CPU with two DRAM layers stacked on a 16-core multiprocessor layer (Section 2.4). We observe a large thermal gradient both within a layer and across vertical layers. We also observe significant thermal coupling from the processor layer to the neighboring DRAM layer, even though the DRAM layer has very low power density. This phenomenon leads to increased DRAM leakage and requires shorter refresh periods in memory- on-logic 3D CPUs [61], which has performance implications. 25 Processor Layer (mm)3 6 9 12 3 6 9 12 15 18 Te m pe ra tu re (° C) 40 45 50 (v) Bottom DRAM Layer (mm)3 6 9 12 3 6 9 12 15 18 (w) Top DRAM Layer (mm)3 6 9 12 3 6 9 12 15 18 Te m pe ra tu re (° C) 40 45 50 (x) Figure 2.7: Thermal map of (a) processor layer, (b) bottom DRAM layer and (c) top DRAM layer The high temperatures associated with air cooled 3D ICs cause high leakage power (thus reducing the energy efficiency and possibly resulting in thermal runaway [62]), increased transistor and wire delay (thus degrading performance), and reduced chip reliability (Section 2.7). A promising solution to the thermal issue comes from embedded active cooling technology such as micro-fluidic cooling (Section 2.8). 26 Si Diffusion Barrier TSV CMOS ∆T<0 Residual Stress Figure 2.8: TSV CTE miss-match stress field 2.7 Reliability Issues Most reliability concerns specific to 3D ICs are related to TSVs, which intro- duce several new failure modes. 
Many TSV reliability degradations are fundamentally caused by thermal and stress issues [17, 18, 63]. The thermal issue comes from the fact that the stacked structure increases the power density without providing a sufficient heat removal path (Section 2.6). The stress issue is due to the significant difference in the coefficient of thermal expansion (CTE) between TSVs (e.g., copper, 17.7 ppm/K) and the silicon substrate (3.05 ppm/K). When TSVs are cooled down from the high manufacturing temperature to room temperature, a negative thermal load is applied, creating compressive and tensile stress inside TSVs and the neighboring substrate [44]. This phenomenon is illustrated in Figure 2.8. TSV stress not only affects reliability, but is also shown to influence transistor mobility and thus circuit performance [64].

TSV-induced reliability losses include TSV electromigration [19, 65, 66], TSV stress migration [17, 18, 63, 67], TSV oxide breakdown [68], TSV thermal cycling [69-71] and TSV stress-induced material fracture [72-74]. TSV electromigration and stress migration cause the TSV's metal atoms to migrate, gradually altering material density and resistance and eventually causing TSVs to form short or open circuits. Electromigration moves atoms by transfer of momentum from flowing electrons, whereas stress migration moves atoms along stress gradients. TSV oxide breakdown occurs when the electric field inside the TSV barrier layer exceeds its threshold, destroying the electrical isolation between TSVs and the substrate. Thermal cycling shortens a TSV's lifetime by introducing TSV defects through thermal fatigue. Material fracture, initiated by manufacturing imperfections (e.g., voids inside TSVs) and accelerated in high-stress environments, may lead to delamination or cracks around the TSV structure. All of the above TSV failures are exacerbated at elevated temperature [63].

2.8 Micro-Fluidic Cooling

Micro-fluidic (MF) cooling is a promising technology for cooling ICs with high power flux. DARPA's Intra/Interchip Enhanced Cooling (ICECool) Program [75] has been investigating and prototyping such cooling systems for both high-flux 2D ICs (e.g., high gain RF amplifier arrays) and 3D CPUs. By pumping coolant into the substrate of the chip, the resistive path through the oxide layers and chip package is short-circuited, providing significantly lower transistor junction temperatures [27, 59]. Moreover, MF cooling channels can be etched into the substrate of each layer in a 3D stack before bonding, providing equal cooling capacity to all layers and removing vertical thermal gradients [27, 60]. Finally, the high conductance of water coupled with the active heat movement due to fluid pumping velocity provides massively increased cooling capacity compared to traditional air cooling [16].

Although general purpose CPUs have not generally required active cooling in the past, 3D stacking and the trapped heat effect will significantly increase thermal resistance. Enhanced cooling will be necessary to sustain the high power density of modern CPU architectures implemented in 3D IC technology [8]. Solutions such as DVFS have been proposed to control temperature in air-cooled 3D CPUs, but at the expense of performance [14, 76].

A MF heatsink is created by fabricating microchannels in the silicon substrate of each layer in a 3D IC. A microchannel is a small channel (generally tens to hundreds of microns in its dimensions [77]) etched into the silicon substrate.
These microchannels are created with the intention of pumping fluid through them in order to cool each layer of the chip [60]. The fluid enters the system at a low temperature and, as it flows through each channel, heat is conducted through the silicon substrate into the fluid and then pumped out of the system. This concept is illustrated in Figure 2.9.

Micro-fluidic cooling comes with some overheads. One such overhead is the additional power required to pump the fluid. In previous work, methods for reducing pumping power have been investigated, such as nonuniform microchannel distribution [59] and dynamic control of fluid flow rate [78, 79]. The results of the studies presented in this dissertation [8, 27-29] show that the pumping power used to implement a MF heatsink is more than accounted for by the leakage power reduction that results from the temperature reduction.

[Figure 2.9: Micro-fluidic heatsink in memory-on-logic 3D CPU]

Another overhead of MF cooling is that adding microchannels to a 3D IC requires a thicker substrate. This requires both the length and diameter of TSVs to increase in order to maintain a specific TSV aspect ratio defined by the manufacturing process, which increases the area overhead of TSVs. Typical 3D IC thinned silicon substrates have thickness in the 50 um range, while microchannels would require a thicker substrate (in the 150-200 um range) [59]. TSVs and microchannels cannot coexist in the same space, so adding micro-fluidic cooling to a design also constrains where TSVs can be placed, and the placement of microchannels and TSVs must be co-designed [30, 31, 80]. We investigate this trade-off between cooling capacity and vertical interconnect density (i.e., vertical signal bandwidth) in Section 5.2.

Chapter 3: 3D CPU Co-Simulation Co-Optimization Flow

3D integration technology brings the opportunity for new computer architectures; however, such drastic changes to the conventional computing paradigm require new architectural models of 3D CPU performance, power, area and timing (PPAT). The 3D PPAT modeling challenges can be broadly broken down into the following categories.

- Memory Hierarchy: Stacked memory architectures have significantly different memory hierarchy topologies due to more fine-grained integration with TSV technology. CPU-DRAM communication may take place over multiple independent communication channels, which could be point-to-point, bus or a hybrid of both [27]. Each communication channel can be wider and/or clocked faster using high-density low-impedance on-chip interconnects. PPAT simulations must be configured to model the power and performance of such unconventional memory hierarchies. Moreover, heterogeneous integration facilitates on-chip cache and/or main memory technologies such as DRAM, MRAM and PRAM, all of which require complex memory controller designs [47]. Models of these technologies and their controllers are not included in most 2D PPAT simulation frameworks, which assume on-chip SRAM and off-chip DRAM. Finally, due to drastically reduced parasitics, memory-on-chip integration could facilitate a reemergence of large parallel interfaces as opposed to high-speed serial communication for low-power designs [38]. The whole spectrum of interface implementations must have models available within a 3D PPAT simulator for proper trade-off analysis.
- Communication Networks: Like the memory hierarchy, inter-core communication can leverage similar benefits from 3D integration. NOCs in 2D CPUs usually follow typical topologies such as the 2D mesh and torus, but the expansion of cores into the third dimension in logic-on-logic architectures introduces new 3D NOC topologies. These 3D networks are more highly connected, offering higher bandwidth and reduced logical distance between nodes (i.e., number of hops), but require more complex routers and thus dissipate more power and may introduce larger router delays. Additionally, the vertical distance between nodes is often much smaller (e.g., 10x) than the horizontal distance. Asymmetric NOC topologies with larger router radix in the vertical direction can take advantage of this physical asymmetry (e.g., the 3D super-mesh [27]). Thus a 3D PPAT simulator must be capable of simulating customized asymmetric NOCs and the associated physical implementations of the routers and drivers.

- Fine Grained Integration: One of the main advantages of 3D integration is the reduction in wire length due to fine-grained integration. The reduction in length of the longest wires in a large circuit (e.g., a CPU function block) can approach sqrt(n), where n is the number of layers across which the circuit is split [34]. Power, delay and area for circuits with regular structure (e.g., memory elements) can be estimated analytically using technology and topology parameters (although 3D implementation significantly increases the design space of topology parameters to be considered [81]). However, highly complex and customized circuits (e.g., an ALU) are hard to estimate analytically. For 2D CPU analysis, empirical models have been fit to real CPU circuits in the market [2]. Since 3D CPUs are still in the research and development stage, similar data does not exist. Developing models for 3D function unit PPAT is a challenging and open problem.

The simulation flow used to evaluate the 3D CPU design space explored in the following chapters is shown in Figure 3.1. We provide a detailed description of each step in the simulation flow in the following sections.

[Figure 3.1: Simulation flow - architecture parameters drive Multi2Sim (performance) and McPAT (power and area), followed by floorplan and cooling optimization with wire delay, leakage, thermal and reliability models]

3.1 Architectural Design Space

The studies presented in Chapters 4 and 5 involve exhaustive simulation across a set of computer architectural variables. Table 3.1 enumerates the fixed architectural parameters across all studies. The three study variables (number of cores, CPU clock rate and number of memory controllers) take on different ranges in different studies, and are thus enumerated in their respective sections. In these chapters we maintain a relatively small architectural design space to accommodate exhaustive simulation. However, in Chapter 6 we expand the scope and dimensionality of our architectural design space and apply modeling techniques to feasibly estimate the metrics of interest across a large combinational space of architectural variables.
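As a sketch of how the flow in Figure 3.1 is driven over the study variables of Section 3.1, the loop below enumerates the Cartesian product of core count, clock rate and MC count. evaluate_design is a hypothetical stand-in for the Multi2Sim/McPAT/floorplan/thermal tool chain, and the specific value lists shown are illustrative, not the exact sweep of any one study.

    from itertools import product

    def evaluate_design(cores, clock_ghz, num_mcs, benchmark):
        # Placeholder for the real flow: performance simulation, power/area
        # estimation, floorplan and cooling optimization, thermal/reliability models.
        return {"ipns": 0.0, "power_w": 0.0, "peak_temp_c": 0.0}

    CORES   = [16, 32]                       # study variables; ranges differ per study
    CLOCKS  = [2.4, 2.6, 3.0, 3.2, 3.4]      # GHz
    MCS     = [1, 2, 4, 8, 16, 32]
    BENCHES = ["fft", "lu", "ocean"]         # subset of SPLASH-2/PARSEC workloads

    results = {cfg: evaluate_design(*cfg)
               for cfg in product(CORES, CLOCKS, MCS, BENCHES)}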
3.2 Performance Simulation

Performance simulation is performed by Multi2Sim (M2S) [82], a cycle-accurate CPU simulator. Architectural parameters are passed to the simulator through configuration files that include the number of cores, number of function units within cores, pipeline width, buffer/queue/register sizes, cache size/associativity/latency, network-on-chip (NOC) topology/latency, branch predictor size and type, etc. Cache and register (e.g., register file, register alias table (RAT) and branch target buffer) latencies are determined using CACTI [81, 83] to provide realistic architectural setups to the simulator. DRAM latency is calculated as explained in Section 3.3, and NOC topology/latency is calculated as explained in Section 3.9. M2S simulates the execution of an x86 binary on the described CPU. The simulator outputs a list of performance statistics such as IPC, memory reads, writes, hits and misses, branch prediction rate, number of instructions that access each type of execution unit, reads and writes to buffers, queues and the RAT, etc.

Table 3.1: Architectural parameters
  Cores                  See study details
  Clock Rate             See study details
  Memory Controllers     See study details
  Technology             45 nm
  Branch Predictor       4k-entry 2-level
  Issue                  Out of order
  Reorder Buffer         64 entries
  Fetch/Dec/Issue Width  4
  Functional Units       4 IALU, 1 IMult, 2 FPALU, 1 FPMult
  Physical RF            80 Int, 40 FP
  BTB Size               1024 entries
  Return Addr. Stack     32 entries
  Load/Store Queue       20 entries
  Private L1 I/D Cache   256 sets per core, 2-way, 64 B block (32 kB per core) @ 2 cycles
  Shared L2 Cache        512 sets per core, 16-way, 64 B block (512 kB per core) @ 7 cycles
  NOC type               3D super-mesh
  NOC link latency       3 cycles
  DRAM bus width         64 B
  DRAM bus speed         Core clock rate
  DRAM capacity          1 GB/layer x 4 layers = 4 GB

3.2.1 Benchmarks

The studies presented in the subsequent chapters evaluate an architectural-physical design space across a suite of benchmark workloads. All benchmarks used in our work come from the SPLASH-2 [84] and PARSEC [85] benchmark suites. These benchmarks are standard for evaluating the results of architectural research on CMPs [14, 86-90].

3.3 DRAM Latency Model

Although DRAM latency depends on many transient factors, many performance simulators, including M2S, simply model memory latency as a constant average value. We propose a model for the average memory latency, comprised of five steps in the DRAM access procedure, starting at the time a last level cache (L2 in this work) miss is detected. We estimate the average duration of each step as a function of the architectural parameters. The five steps are: (1) MC queuing delay, (2) memory address translation, (3) address transfer delay, (4) DRAM core access and (5) data transfer delay. Step (1) is the only step that is a strong function of the architectural variables considered in these studies. Steps (2) through (5) are modeled as constant delays of 5 cycles [91], 1 DRAM bus cycle [57], 32 ns [9] and w DRAM bus cycles [57] respectively, where w is the cache line width divided by the DRAM bus width. DRAM bus width and frequency are given in Table 3.2.

Table 3.2: 2D vs. 3D DRAM bus integration
                     Bus Width   Bus Frequency
  2D Off-Chip DRAM   64 bits     200 MHz
  3D Stacked DRAM    512 bits    Core frequency

3.3.1 MC Queuing Delay

The memory controller queuing delay represents the amount of time a memory request spends waiting in the memory controller queue.
This value depends on the number of memory controllers (i.e., consumers of memory requests) and the number of cores (i.e., producers of memory requests). The work by Awasthi et al. [86] reports that the increase in queuing delay from a single core to a 16-core processor is about 8x. Dong et al. [91] reported that a configuration with 4 cores and one MC has a queuing latency of 116 cycles. We linearly extrapolate these two observations to model queuing delay as a function of #cores, and assume that memory requests are uniformly distributed across the address space (Footnote 1), such that queuing delay is inversely proportional to the number of MCs. Thus we model the MC queuing delay T_Q with Equation (3.1).

Footnote 1: This assumption was validated in prior work.

    T_Q = (388 ns / #MC) x [1 + (#cores - 16) x (1 - 1/8) / (16 - 1)]    (3.1)

3.3.1.1 Derivation

We can solve T_Q(#cores) = T_Q(y) + m(#cores - y) as a linear function of #cores using the following two observations:

1. T_Q(4) = 116 ns
2. T_Q(16) / T_Q(1) = 8

Observation 2 can be rearranged as T_Q(1) = (1/8) T_Q(16). Thus m = [T_Q(16) - T_Q(1)] / (16 - 1) = T_Q(16) (1 - 1/8) / (16 - 1). Setting y = 16 we can write T_Q(#cores) = T_Q(16) + T_Q(16) (1 - 1/8)/(16 - 1) (#cores - 16) = T_Q(16) [1 + (1 - 1/8)/(16 - 1) (#cores - 16)]. All that is left is to solve for T_Q(16), using m = [T_Q(4) - T_Q(1)] / (4 - 1) = [T_Q(16) - T_Q(4)] / (16 - 4). Substituting Observation 1 (T_Q(4) = 116 ns) and the rearranged Observation 2 (T_Q(1) = (1/8) T_Q(16)) yields m = [116 ns - (1/8) T_Q(16)] / (4 - 1) = [T_Q(16) - 116 ns] / (16 - 4), which when solved yields T_Q(16) = 388 ns.
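A direct transcription of Equation (3.1), useful as a sanity check of the derivation above (this helper is mine, not part of the simulation infrastructure):

    def mc_queuing_delay_ns(n_cores, n_mcs):
        """Average MC queuing delay T_Q from Equation (3.1).

        Linear in the core count (anchored at T_Q(4 cores, 1 MC) ~= 116 ns and
        T_Q(16)/T_Q(1) = 8) and inversely proportional to the number of MCs,
        assuming requests are uniformly spread over the address space.
        """
        slope = (1 - 1 / 8) / (16 - 1)          # per-core increment relative to T_Q(16)
        return 388.0 / n_mcs * (1 + slope * (n_cores - 16))

    assert abs(mc_queuing_delay_ns(4, 1) - 116) < 1                                # Observation 1
    assert abs(mc_queuing_delay_ns(16, 1) / mc_queuing_delay_ns(1, 1) - 8) < 1e-9  # Observation 2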
3.4 Power/Area Estimation

Dynamic and leakage power are estimated, along with the total area of each CPU component, by McPAT [2], a power and area estimation tool commonly used in computer architecture research [14, 92-95]. The architectural parameters are used to estimate the leakage power at nominal temperature using internal transistor-level models of CPU components. These transistor models also estimate the energy-per-access (e.g., read, write or decode) and the total area of each component. The combination of access counts from Multi2Sim and energy-per-access estimates from McPAT yields dynamic power. Dynamic and leakage power estimates are applied to an optimized floorplan topology to generate a power density map. The power density map is consumed by the thermal model, which internally applies thermal-leakage scaling (Section 3.8.1).

Transistor-level power and area models of regular structures such as caches and registers are provided internally through CACTI [83]. Power and area models of complex combinational logic such as ALUs and decoders are generated by applying curve fitting to empirical data collected from real CPUs. CACTI has been expanded to estimate 3D memory implementations [81], but development of fine-grain 3D combinational logic blocks is an area of future work, and in this dissertation 2D function blocks are used (Footnote 2).

Footnote 2: We do allow the memory controller and execution unit to be split across two layers at sub-component boundaries (e.g., an internal boundary within the execution unit or the front-end/back-end boundary of the memory controller [2]). The effects on power and area of such a coarse-grained split are assumed to be negligible.

3.4.1 Pumping Power

The micro-fluidic heatsinks simulated for this work consist of straight microchannels with non-uniform spacing between channels. The minimum pitch between channels is double the channel width W; however, many channels are spaced considerably farther apart than the minimum pitch. The power required to pump fluid through the microchannels, P_pump, is defined in Equations (3.2) through (3.6) [59], where N is the number of microchannels, f is the fluid flow rate, Δp is the pressure drop across each microchannel, γ is a function of the microchannel aspect ratio (AR = W/H), µ is the viscosity of the fluid, L is the length of the channel, v is the fluid velocity, D_h is the hydraulic diameter of the channel, W is the width and H is the height of the microchannel. Specific values used in the work reported here are given in Table 3.3.

Table 3.3: Micro-fluidic system parameters
  W        100 um           Width
  H        200 um           Height
  µ        653 uPa s        Viscosity
  P_pump   2 mW per layer   Pumping power

In our study we assume a constant pumping power P_pump. Thus a reduction in the number of channels N results in increased pressure drop and fluid velocity in the remaining channels, which increases the local heat transfer coefficient of each channel [96]. Our heatsink optimization scheme (Section 3.10) finds the optimal trade-off between the number (and location) of channels and the heat transfer coefficient of each channel. The pumping power used to provide micro-fluidic cooling in our studies is more than made up for by reductions in thermally induced leakage power due to reduced chip temperatures [27-29].

    P_pump = N f Δp                               (3.2)
    f = W H v                                     (3.3)
    Δp = 2 γ µ L v / D_h^2                        (3.4)
    γ = 4.7 + 19.64 (AR^2 + 1) / (AR + 1)^2       (3.5)
    D_h = 2 W H / (W + H)                         (3.6)
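The hydraulic relations (3.2)-(3.6) can be combined into a single helper as below. This is only a sketch: the channel length and velocity arguments are illustrative placeholders, not values from the dissertation (which instead fixes P_pump at 2 mW per layer and lets velocity rise as channels are removed).

    def pumping_power_w(n_channels, velocity_m_s, w_m=100e-6, h_m=200e-6,
                        length_m=12e-3, mu_pa_s=653e-6):
        """P_pump = N * f * dp, per Equations (3.2)-(3.6)."""
        ar = w_m / h_m                                    # aspect ratio AR = W/H
        gamma = 4.7 + 19.64 * (ar**2 + 1) / (ar + 1)**2   # Eq. (3.5)
        d_h = 2 * w_m * h_m / (w_m + h_m)                 # hydraulic diameter, Eq. (3.6)
        dp = 2 * gamma * mu_pa_s * length_m * velocity_m_s / d_h**2  # Eq. (3.4)
        flow = w_m * h_m * velocity_m_s                   # flow rate per channel, Eq. (3.3)
        return n_channels * flow * dp                     # Eq. (3.2)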
3.5 Core Netlist

Each CPU core consists of a set of interconnected components as shown in Figure 3.2. The bit width of each connection in the netlist is annotated in the figure, and the associated utilization of each net is calculated from the Multi2Sim performance statistics (Section 3.2). Details of each CPU component are given in Table 3.4. The execution unit and memory controller are large components, and are allowed to be pipelined and/or split into two sub-components which can be placed on separate layers of the 3D stack (multi-layer; see Footnote 2). The instruction fetch unit (IFU) contains the branch predictor and the instruction cache. The execution unit contains the integer and floating point function units along with the register file and the reorder buffer. The load store unit (LSU) contains the load store queues and the data cache, and the memory management unit (MMU) contains the translation look-aside buffers (TLBs). Core routers are connected in a 3D super-mesh topology (Section 2.5). More detailed descriptions of each CPU component can be found in [2].

[Figure 3.2: CPU core component netlist with net widths annotated, where i = issue_width x size(word), p = num_cache_ports x size(word), c = num_cache_ports x size(cache_line), r = size(cache_line), d = size(word), f = noc_width]

Table 3.4: CPU core component properties
  Name   Description             Comments
  IFU    Instruction Fetch Unit
  REN    Rename Unit
  EX     Execution Unit          Multi-layer
  LSU    Load Store Unit
  ROUT   Router                  Inter-core
  L2     L2 Cache                Shared
  MMU    Memory Mgmt. Unit
  MC     Memory Controller       Multi-layer, Inter-core, Shared

As shown in the figure, the router and the memory controller are the only components that communicate outside of the core (inter-core), either with other cores or with the DRAM. The L2 cache and memory controller components are slices of a larger component that services multiple cores (shared). The L2 cache is a single shared cache with a local slice associated with each core, whereas each memory controller can service two, four, or eight L2 cache slices, depending on the total number of memory controllers. Using the wire delay model (Section 3.6), we calculate the maximum allowed center-to-center distance between each pair of connected components at the target clock frequency to prevent timing violations. These distance constraints are used to create a timing-feasible floorplan (Section 3.9).

3.6 Wire Delay Model

We calculate the wire delay per unit length using Equation (3.7) from [35]. The variables a = 0.4 and b = 0.7 are fitting parameters taken from [35], and the variables r, c, r0, c0 and cp are respectively the wire resistance per unit length, the wire capacitance per unit length, the output resistance of a minimum-size inverter, the input capacitance of a minimum-size inverter and the parasitic output capacitance of a minimum-size inverter. These values were extracted from the McPAT source code and are given in Table 3.5. Given these parameters, the delay per unit length calculated by Equation (3.7) is 81 ps/mm. The wire delay model is used to ensure timing feasibility during floorplan creation (Section 3.9).

    delay / length = 2 sqrt(r c r0 c0) [ b + sqrt( a b (1 + cp / c0) ) ]    (3.7)

Table 3.5: Transistor and interconnect parameters for 45 nm technology [2]
  variable   value           variable   value
  r          0.36 ohm/um     c          0.28 fF/um
  r0         10.9 kohm       c0         0.85 fF
  cp         0.31 fF
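Equation (3.7) with the Table 3.5 values reduces to a one-line calculation; the following sketch (mine, with units converted to SI internally) reproduces the 81 ps/mm figure quoted above.

    from math import sqrt

    def wire_delay_ps_per_mm(r_ohm_per_um=0.36, c_ff_per_um=0.28,
                             r0_kohm=10.9, c0_ff=0.85, cp_ff=0.31, a=0.4, b=0.7):
        """Delay per unit length of a repeated wire, Equation (3.7)."""
        r = r_ohm_per_um / 1e-6            # ohm per metre
        c = c_ff_per_um * 1e-15 / 1e-6     # farad per metre
        r0, c0, cp = r0_kohm * 1e3, c0_ff * 1e-15, cp_ff * 1e-15
        delay_s_per_m = 2 * sqrt(r * c * r0 * c0) * (b + sqrt(a * b * (1 + cp / c0)))
        return delay_s_per_m * 1e12 * 1e-3  # convert s/m -> ps/mm

    print(round(wire_delay_ps_per_mm()))    # ~81 ps/mm, matching Section 3.6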
3.7 Reliability Model

Our reliability model focuses on TSV electromigration (EM), one of the 3D CPU's critical failure modes [18, 19, 63, 65-67, 69, 97]. As more power-dissipating device layers are stacked vertically, power flux increases dramatically. However, the 3D power delivery network (PDN) is limited by the number of power pins (i.e., C4 bumps), which is a function of the footprint area of the chip and does not increase as more layers are stacked [25, 26]. This leads to a significant increase in the PDN's current density in 3D CPUs. Furthermore, the stacking structure generates thermal hotspots in areas of high power (and current) density [59]. The increases in both current density and temperature accelerate TSV EM. In addition, the immature TSV fabrication process induces structural defects such as voids inside TSVs [97], which also degrade TSV EM reliability. As TSVs consume many placement/routing resources, it is hard to make post-layout EM fixes (i.e., redundant wires/vias) without significant area overhead and redesign effort [18, 30, 31, 63, 98].

In the proposed reliability model each TSV's EM lifetime is considered a random variable, where the randomness is caused by TSV manufacturing [99]. We model each TSV's failure probability density function (PDF) using a Weibull distribution. Each Weibull distribution is determined by a shape parameter k and a scale parameter λ. We assume that the TSV EM failure rate is constant over time (therefore k = 1). The scale parameter λ is determined by the TSV's mean-time-to-failure (MTTF). Specifically, λ is calculated based on the classic Black's equation [100] as shown in Equation (3.8).

    λ = MTTF_EM ∝ (J_avg)^(-2) exp( E_a / (k_b T) )    (3.8)

J_avg is the average DC current density, E_a is the activation energy, k_b is Boltzmann's constant, and T is the absolute temperature in Kelvin. In cases where an AC signal is concerned, J_avg is its equivalent DC current density [101]. Higher current density and temperature shorten the expected EM lifetime of TSVs, according to Equation (3.8).

For reliability estimation, each TSV must be assigned a point in space at which to measure the temperature. Signal TSVs within a 3D net are uniformly distributed inside its feasible region. A 3D net's feasible region is determined such that the interconnect timing constraint between the connecting blocks is not violated, using the 3D net wirelength model from [21].

[Figure 3.3: TSV EM reliability model]

Figure 3.3 illustrates our system-level EM reliability modeling approach. Based on typical 3D-CPU applications, TSV activities (messaging between logic blocks and/or memory blocks) can be acquired from performance simulation (Section 3.2). Combined with voltage/frequency information, the TSV activities are translated into transient currents by modeling the capacitive load's charging/discharging behavior. The transient current is subsequently converted to its equivalent DC current density distribution [101]. This DC current density distribution and the thermal profile define a failure PDF for each TSV.

The system's EM reliability (R_EM) is defined as the probability that none of the TSVs fail before the target lifetime has elapsed. R_EM can be expressed using Equation (3.9), where P_EM is the probability that the 3D CPU fails before the target lifetime, and P_EM^i is the probability that the i-th TSV fails before the target lifetime.

    R_EM = 1 - P_EM = Π_{i ∈ TSV} (1 - P_EM^i)    (3.9)
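A minimal sketch of the reliability evaluation described in Section 3.7. The activation energy and the proportionality constant in Black's equation are illustrative assumptions (the text does not list them); with the Weibull shape k = 1 each TSV's lifetime is exponential, so R_EM is the product of per-TSV survival probabilities as in Equation (3.9).

    from math import exp

    K_B = 8.617e-5     # Boltzmann constant in eV/K
    E_A = 0.9          # activation energy (eV) -- assumed value for illustration
    A_FIT = 1.0e5      # Black's-equation prefactor -- assumed fitting constant

    def tsv_mttf(j_avg, temp_k):
        """Black's equation (3.8): MTTF proportional to J^-2 * exp(Ea / (kB*T))."""
        return A_FIT * j_avg**-2 * exp(E_A / (K_B * temp_k))

    def system_em_reliability(tsvs, target_lifetime):
        """R_EM (Equation 3.9): probability that no TSV fails before the target lifetime."""
        r = 1.0
        for j_avg, temp_k in tsvs:          # (DC current density, temperature) per TSV
            p_fail = 1.0 - exp(-target_lifetime / tsv_mttf(j_avg, temp_k))
            r *= 1.0 - p_fail
        return r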
3.8 Thermal Model

Once the chip floorplan has been constructed (Section 3.9) and component power estimation is complete (Section 3.4), we have a power density map for each tier of the 3D stack. Power density maps are converted into thermal maps using our compact thermal model [59]. A 3D grid is constructed representing the physical structure of the 3D IC. Each tier in the chip stack is divided into sub-layers: silicon substrate (with or without microchannels), active silicon, interconnect and passivation. Likewise, the power map is discretized into a 3D grid and the total power of each power grid is assigned to the respective physical grid in the active silicon sub-layer (all other sub-layers have zero power).

Each physical grid is then converted to an electrical circuit representation as shown in Figure 3.4. Power is modeled as a current source and thermal resistance is modeled as electrical resistance. The voltage at the center of each circuit grid represents the temperature of the respective physical grid. This technique takes advantage of the thermal-electrical duality, similar to HotSpot [102]. Thermal resistances are evaluated based on the material properties and dimensions of the respective physical grid using the technique in [59]. Material properties and dimensions of the different sub-layers are listed in Table 3.6. When modeling a MF heatsink, the circuit model contains both solid and fluid grids. The resistance of a fluid grid depends on material properties and the fluid flow rate [96].

[Figure 3.4: Thermal resistance grids for solid (Rcond) and fluid (Rconv, Rflow) materials]

Table 3.6: Thermal model material properties
  Sub-Layer                Thickness (um)   Material   Conductivity (W/(m K))
  Top Substrate            995              Si         148
  Microchannel Substrate   200              Si         148
  Microchannel Fluid       200              H2O        0.58
  Thinned Substrate        55               Si         148
  Active Silicon           5                Si         148
  Interconnect             15               SiO2+Cu    2.25
  Passivation              15               SiO2       1.4

3.8.1 Leakage Model

McPAT reports a base leakage value for each CPU component, estimated at a fixed temperature T0. To obtain more accurate leakage power estimates, which take into account leakage power's strong dependence on temperature, we iteratively solve our thermal model and then scale the leakage estimate at each grid based on the estimated temperature of that grid after the previous iteration. We repeat this process until the change in temperature between two iterations is less than some threshold (e.g., 1 °C). The thermal-leakage scaling model is extracted from the McPAT source code [2] (Figure 3.5).

[Figure 3.5: Thermal-leakage relationship (normalized leakage power vs. temperature)]
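The leakage-temperature iteration of Section 3.8.1 can be summarized as the fixed-point loop below; thermal_solve and leak_scale are stand-ins for the compact thermal model and the McPAT-derived curve of Figure 3.5, and the per-block list representation is an assumption made for brevity.

    def converge_leakage(dynamic_w, leak_nominal_w, thermal_solve, leak_scale,
                         t_nominal_c=25.0, tol_c=1.0, max_iters=50):
        """Alternate thermal solve and leakage scaling until temperatures settle."""
        temps = [t_nominal_c] * len(dynamic_w)
        power = list(dynamic_w)
        for _ in range(max_iters):
            # Re-scale leakage to the latest temperature estimate, then re-solve.
            power = [d + l * leak_scale(t)
                     for d, l, t in zip(dynamic_w, leak_nominal_w, temps)]
            new_temps = thermal_solve(power)
            if max(abs(a - b) for a, b in zip(new_temps, temps)) < tol_c:
                return new_temps, power
            temps = new_temps
        return temps, power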
3.9 Floorplan Optimization

For each architectural configuration, we run a thermal- and reliability-aware floorplanner to create an optimized CPU floorplan for that architecture (Footnote 3). Floorplans are optimized iteratively using feedback from the thermal (Section 3.8) and reliability (Section 3.7) models, while timing feasibility is estimated using the netlist (Figure 3.2) and the wire delay model (Section 3.6).

Footnote 3: Some of the studies herein disable floorplan optimization and use a fixed topology, while others use modified objective functions. The algorithm presented here is the fully comprehensive method proposed in this dissertation at large, while other versions are considered for comparison and sensitivity analysis.

A fundamental trade-off exists between timing, reliability and temperature. Placing high power components closer together can reduce wire delay and negative slack, but will increase hot-spot temperatures [27]. Likewise, splitting components across layers can reduce power density and thus remove hotspots, but introduces additional TSVs, which increase the probability of failure [103]. Thus the timing, reliability and thermal profiles must be simultaneously co-optimized during floorplanning. The power dissipation and net activity of each component are averaged across all benchmark workloads when evaluating the thermal and reliability profile for floorplan optimization. The area of each component is given by McPAT (Section 3.4) and each component is assumed to be laid out as a rectangle. Net activities are derived from Multi2Sim (Section 3.2) and net widths are annotated in Figure 3.2.

Our approach optimizes the floorplan of a single CPU core, and then tiles that single-core floorplan to generate a chip-level floorplan with the correct number of cores. Floorplan optimization at chip scale would be computationally infeasible, so the problem is reduced to floorplan optimization of a single core. However, the thermal effects of core tiling and stacking are captured in the embedded thermal and reliability models. Cores are allowed (but not required) to be distributed across multiple layers.

Thermally aware floorplan optimization reduces peak temperature by optimizing the vertical and planar power density to reduce hot-spots, as well as by moving high power components closer to the fluid inlets, where the maximum cooling potential exists [27]. However, timing violations are modeled (Section 3.6) throughout the optimization flow, and only timing-feasible floorplans are accepted. Reliability-aware floorplan optimization improves MTTF by preventing high activity nets from spanning across layers, and by minimizing the number of TSVs in general [103].

3.9.1 Floorplan Representation

We use transitive closure graphs (TCGs) [104] to represent the physical relationship between CPU components on each logic layer. A 3D floorplan can be represented as a set of n TCGs, where n is the number of layers in the 3D stack. We call such a set a 3DTCG. A simulated annealing approach is used to search the solution space of 3DTCGs, and a nested simulated annealing loop is used to optimize the component aspect ratios (AR) for each 3DTCG considered.

Given a 3DTCG with the area and AR of each component, a unique 3D floorplan is constructed. Then the chip area, thermal profile, MTTF and netlist wirelengths of that floorplan are evaluated. The objective of the floorplanning algorithm is to find, for each architecture, an optimized floorplan which minimizes area, peak temperature and negative slack and maximizes lifetime. It may be hard or even impossible to find a floorplan that meets the thermal, reliability and timing constraints simultaneously when considering an aggressive 3D CPU architectural design. High-quality physical design optimization of the floorplan can significantly increase the feasibility region of an evaluated architectural design space, which ultimately results in the selection of more optimal design points [1, 103].

3.9.2 Simulated Annealing Approach

Simulated annealing is used to search the solution space of 3DTCG topologies and CPU component aspect ratios. The annealing operations used for the simulated annealing of the 3DTCG are the original four intra-layer annealing operations from [104] (rotate, swap, move and reverse), plus the inter-layer swap from [105] and the inter-layer move from [106] (referred to as "Change Layer" in that paper). The objective function used for simulated annealing of the 3DTCGs is given in Equation (3.10), where A is the total area of the core (Section 3.4), S is the total negative slack, T is the maximum temperature from the thermal model (Section 3.8) and R is the reliability metric (Section 3.7). The negative slack on each net is the wire delay (Section 3.6) on that net minus one cycle delay. Wirelength between two components is measured as the Manhattan distance between the center points of the components.

    OBJ = c1 A + c2 S + c3 T - c4 R    (3.10)

The nested simulated annealing loop for determining the aspect ratio of each component chooses a random component and scales its AR by a value randomly chosen from a normal distribution with µ = 1 and σ = 0.1. The aspect ratio of each component is constrained to 1/5 < AR < 5. The objective function used for the aspect-ratio simulated annealing is OBJ = c1 A + c2 S.
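For reference, the annealing cost of Equation (3.10) and the nested aspect-ratio move amount to the two small helpers below; the weights c1-c4 are illustrative defaults, since their actual values are not listed in the text.

    import random

    def floorplan_objective(area, neg_slack, peak_temp, reliability,
                            c1=1.0, c2=1.0, c3=1.0, c4=1.0):
        """OBJ = c1*A + c2*S + c3*T - c4*R, Equation (3.10)."""
        return c1 * area + c2 * neg_slack + c3 * peak_temp - c4 * reliability

    def perturb_aspect_ratio(ar, rng, sigma=0.1, lo=0.2, hi=5.0):
        """Nested-loop move: scale AR by a sample of N(1, 0.1), clamped to 1/5 < AR < 5."""
        return min(hi, max(lo, ar * rng.gauss(1.0, sigma)))

    rng = random.Random(0)
    print(perturb_aspect_ratio(1.0, rng))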
3.9.3 Speeding Up Simulation Time

Because a temperature profile is required to evaluate the objective function at each iteration of the 3DTCG simulated annealing algorithm, the thermal model must be evaluated many times. The full chip-scale thermal model would be too time consuming to evaluate on each iteration, so instead we evaluate the thermal profile of a 2x2xk core tiling, where k is the number of core layers, and use it as an indicator of the true chip-scale temperature profile. This approach can make thermal simulation 30-50x faster than evaluation of the full chip-scale model while still modeling the thermal effects of core stacking and of the junction where cores abut in the horizontal direction. The correlation coefficient between the maximum temperatures observed with the chip-scale and reduced models is 80%. Thus thermal simulation of a reduced core tiling is a practical and accurate way of approximating temperature in the thermally aware floorplanning algorithm. Likewise, the reliability model is applied to the same 2x2xk tiling of the floorplan. The thermal and reliability estimates of this reduced tiling do not provide reliable estimates of absolute temperature and lifetime, but they do provide accurate estimates of the relative ordering between floorplan candidates, making the technique suitable for unconstrained optimization.

Removing the thermal and reliability terms from the objective function and reformulating them as constraints would invalidate the proposed simulation speed-up technique and significantly increase the optimization runtime. However, it would remove the need to choose weighting factors to drive the trade-off between conflicting optimization terms. The comparison and trade-offs of these two schemes are left to future work.

3.9.4 Core Tiling and NOC Design

To generate the final chip floorplan, the core floorplan is replicated on an i x j x k grid such that ijk = n, where n is the total number of cores. The dimensions of a single core floorplan are defined as width_core and height_core respectively (determined by single-core floorplan optimization). The values i, j and k are chosen such that:

- Total area per layer (i*width_core x j*height_core) is less than A_max = 400 mm^2.
- The total number of layers is minimized.
- The layer aspect ratio (i*width_core)/(j*height_core) is close to unity (a selection sketch follows at the end of this section).

The NOC topology is defined as an i x j x k 3D super-mesh [7] (Section 2.5) and the NOC latency is defined as the wire delay of a link of length max(width_core, height_core) (Section 3.6). NOC topology and latency are fed back into the performance simulator to get accurate inter-core communication simulations (Footnote 4).

Footnote 4: Floorplan and NOC design are required to define NOC parameters for performance simulation. McPAT is run once to generate area estimates before performance simulation, and then again to generate power estimates after performance simulation. The initial area estimates are enough to generate an estimate of NOC latency, assuming a perfectly square core floorplan with no white-space.
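The i x j x k selection rules of Section 3.9.4 can be written as a small search; the core dimensions in the example call are made-up numbers, and the tie-breaking order (fewest layers, then squarest layer) is my reading of the criteria above.

    def choose_tiling(n_cores, core_w_mm, core_h_mm, a_max_mm2=400.0):
        """Pick i, j, k with i*j*k = n_cores, per-layer area <= A_max,
        fewest layers, and layer aspect ratio closest to unity."""
        best = None
        for k in range(1, n_cores + 1):
            if n_cores % k:
                continue
            per_layer = n_cores // k
            for i in range(1, per_layer + 1):
                if per_layer % i:
                    continue
                j = per_layer // i
                if i * core_w_mm * j * core_h_mm > a_max_mm2:
                    continue
                ar_err = abs((i * core_w_mm) / (j * core_h_mm) - 1.0)
                if best is None or ar_err < best[0]:
                    best = (ar_err, (i, j, k))
            if best is not None:
                return best[1]       # k increases, so the first feasible k is minimal
        return None

    print(choose_tiling(32, core_w_mm=3.0, core_h_mm=4.0))   # -> (8, 4, 1) for these dimensions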
3.9.5 Example

Figures 3.6 and 3.7 illustrate an example floorplan result, with and without thermal awareness, and the resulting thermal and power maps (dimensions are shown in mm). This example is from a 32-core, 16-MC 3D CPU running ocean at 2.4 GHz with micro-fluidic cooling. We see that thermally unaware floorplanning results in less total chip area and a more square chip outline; however, this floorplan has significantly higher temperatures.

[Figure 3.6: Example thermally unaware floorplan with MF cooling (power density and temperature, bottom and top layers)]
[Figure 3.7: Example thermally aware floorplan with MF cooling (power density and temperature, bottom and top layers)]

Note that the fluid flow direction in these figures is from left to right and the pumping power is fixed. The thermally aware floorplan is able to improve chip temperature using a number of techniques. First, shifting the chip dimensions towards a taller and narrower chip outline allows more microchannels to be fabricated and reduces the length of each channel, which significantly increases the cooling capacity of the micro-fluidic heatsink by reducing the thermal wake effect [107]. Second, the function unit with the highest power density (ROUT) is surrounded by low power units or dead-space on all sides, allowing more lateral heat spreading and reducing hotspot temperatures. In the thermally unaware floorplan, the router in one core abuts the MC in the neighboring core, leading to hotspots. More importantly, the thermally aware floorplan splits cores across two layers, preventing vertical stacking of hotspots; in the fixed floorplan, routers are stacked vertically, leading to significant hotspot heating. Finally, compared to the thermally unaware floorplan, the thermally aware floorplan allocates more total power to the top layer and less to the bottom layer. This is due to the significantly larger thermal resistance between the ambient (at the top of the chip stack) and the bottom layer, as compared to the top layer (the bottom and sides of the chip stack are adiabatic).

3.10 Cooling Optimization

The final step in our analysis approach for design space exploration of 3D CPUs with micro-fluidic heatsinks is to consider optimized non-uniform heatsink designs. Due to the non-uniform nature of the power map generated after floorplan optimization, the optimal microchannel distribution in the micro-fluidic heatsink is also non-uniform when subjected to a constant pumping power. Simply placing microchannels uniformly at minimum pitch (the default heatsink design in this work) is inefficient, as cooling potential is distributed to hot-spots and cold-spots equally. In addition to the nonuniform power density profile on each layer, one must also consider the nonuniform thermal resistance between each layer and the ambient, due to inter-layer resistances. Thus microchannels are more valuable when placed between layers that are far from the top (ambient interface) of the chip, where thermal resistance is high.

Like floorplan optimization, heatsink optimization is performed for each architectural configuration, and is carried out using a simulated annealing algorithm with feedback from the thermal model. The chip-scale power map consists of a tiling of single-core power maps. We take advantage of this by optimizing the heatsink configuration for a single core stack and then tiling the optimized microchannel configuration for the final solution. A core stack is a single core that is tiled in the vertical direction as many times as it would be in the true chip-scale layout (i.e., k times). In other words, the microchannel placement on different layers of the stack can be different, but in the planar direction it is tiled. Tiling of microchannels in the vertical direction is inefficient because of the strong dependence of thermal resistance on layer depth.
As in floorplan optimization, thermal evaluation of heatsink design points is carried out on a 2x2xk tiling of cores, such that the thermal interface between adjacent cores is modeled accurately while simulation time is reduced.

3.10.1 Microchannel Placement Representation

Microchannels are assumed to be straight channels of constant width which extend along the entire length of the chip from inlet to outlet. Thus, channel placement can be represented as a two-dimensional placement problem, the two dimensions being vertical (i.e., in the direction of layer stacking) and horizontal (perpendicular to the direction of flow). We represent the placement of channels as a binary matrix B, which has k rows and W_chip/Δx columns, where W_chip is the width of the chip perpendicular to the direction of flow and Δx is the width of a grid in the thermal model (Section 3.8). In our thermal model it is assumed that Δx = W, where W is the width of a microchannel. If b_{y,x} = 1, then grid x on layer y contains a microchannel, and if b_{y,x} = 0, it does not. All channels must be separated by at least one non-channel grid (i.e., channel walls must have nonzero width). Thus if b_{y,x} = 1, then b_{y,x+1} = b_{y,x-1} = 0.

3.10.2 Simulated Annealing Approach

Simulated annealing is used to explore the solution space of matrix B. Two annealing operations can be applied to B during simulated annealing optimization: add a channel or remove a channel. The initial solution is uniform channels at minimum pitch. All entries in B which are candidates for channel insertion or removal are identified. If a channel is being added, a random candidate is chosen and the solution is updated. If a channel is being removed, a ranking is imposed on the existing channels using our microchannel cost model (Section 3.10.4), and a candidate is selected from the bottom q-th percentile. In these studies we set q = 25%. The objective function used to evaluate annealing moves is OBJ = T, where T is the maximum temperature from the thermal model (Section 3.8).
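Below is a sketch of the placement matrix B and one annealing move, following Sections 3.10.1-3.10.2. The wall-constraint check, the 50/50 choice between add and remove, and the channel_cost callback interface are my assumptions; the real flow ranks removals with the cost model of Section 3.10.4 and accepts moves based on OBJ = T.

    import random

    def candidate_adds(b):
        """Positions where a channel may be inserted: the grid and both of its
        horizontal neighbours must be empty (channel walls have nonzero width)."""
        adds = []
        for y, row in enumerate(b):
            for x, v in enumerate(row):
                left = row[x - 1] if x > 0 else 0
                right = row[x + 1] if x < len(row) - 1 else 0
                if v == 0 and left == 0 and right == 0:
                    adds.append((y, x))
        return adds

    def anneal_move(b, rng, channel_cost, q=0.25):
        """Apply one add/remove move to the binary placement matrix B in place."""
        if rng.random() < 0.5:
            adds = candidate_adds(b)
            if adds:
                y, x = rng.choice(adds)
                b[y][x] = 1
        else:
            chans = [(y, x) for y, row in enumerate(b) for x, v in enumerate(row) if v]
            if chans:
                chans.sort(key=lambda yx: channel_cost(*yx))            # rank by cost model
                y, x = rng.choice(chans[:max(1, int(len(chans) * q))])  # bottom q-th percentile
                b[y][x] = 0
        return b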
3.10.3 Example

Figures 3.8 through 3.10 exemplify how microchannel placement optimization can reduce on-chip temperatures for a given floorplan and a fixed pumping power. Figure 3.8 shows the power density and associated temperature maps of a 32-core 3D CPU using air cooling. Each core spans two layers and the tiling topology is 4x8x1. The dynamic power density is fixed regardless of cooling scheme, although the leakage power does change with temperature when the uniform and optimized MF heatsinks are applied. Figures 3.9 and 3.10 show the temperature maps and associated microchannel placement vectors of a uniform and an optimized MF heatsink respectively.

[Figure 3.8: Temperature and power density of air cooled floorplan]
[Figure 3.9: Temperature and channel distribution using uniform MF heatsink]
[Figure 3.10: Temperature and channel distribution using optimized MF heatsink]

We observe that the reduction in peak temperature is only marginal from air cooling to uniform MF cooling, whereas the reduction due to an optimized MF heatsink is substantial. The basic mechanism of improvement in this example is as follows: by removing microchannels on the top layer that run through areas of low power density, more cooling capacity can be delivered to the bottom layer, which has much higher thermal resistance and suffers from thermal coupling with the high-power top layer. Although the microchannel distribution on the bottom layer remains generally uniform, the top layer only has channels running under the thin strips of high-power-density components. Since far fewer channels are used in the optimized MF heatsink, the fluid velocity is increased, counteracting the thermal wake effect and greatly improving heatsink cooling capacity, while channels are still kept in place under local hotspots.

3.10.4 Microchannel Cost Model

In order to reduce the convergence time of our simulated annealing approach, we define a cost model for microchannels such that removing channels with lower cost is more likely to improve the objective function. The basic idea is to quantify the amount of power being sunk by each channel, and remove the channels that are sinking the least power. The formulation of our cost model is given below and illustrated in Figure 3.11.

[Figure 3.11: Microchannel cost model example]

1) Sum Power: Since B is a two-dimensional variable, we must create a corresponding two-dimensional representation of the three-dimensional power map. Since each channel sinks power from all sources along the direction of flow, it makes sense to sum the power map along the flow direction. However, one must take into account the decreasing cooling capacity of a microchannel along the direction of flow due to the increase in fluid temperature (i.e., the thermal wake effect [107]). The power generated near the outlet is therefore more critical in determining peak temperature than the power located near the inlet, because it is subject to less cooling. When summing the power map along the direction of flow, the power is scaled by some function σ which increases along the direction of flow. The scaled power matrix P is created such that p_{y,x} = Σ_z power_{y,x,z} σ(z), where power is the three-dimensional power map whose third dimension runs along the direction of flow. In our study we set σ(z) = 1 + 0.5(z - 1).

2) Enumerate Microchannels and Grids: We enumerate each microchannel in B and each power grid in P such that the i-th microchannel is represented by b_{y_i,x_i} and the j-th power grid has power p_{y_j,x_j}.

3) Evaluate Distance: Generate a distance matrix D such that d_{i,j} = |x_i - x_j| + λ|y_i - y_j| is the distance between the i-th microchannel and the j-th power grid. The coefficient λ is the relative weighting between vertical and horizontal distance, and can be adjusted to model the amount of thermal coupling between layers. In our study λ = 1.

4) Weight: Using the distance matrix D we create a weight matrix W which represents the relative thermal conductance from each power grid to each microchannel. We convert D to W by mapping each element with some function α which decreases with distance; thus w_{i,j} = α(d_{i,j}). In our study α is a Gaussian function centered at 0 with a standard deviation of 2.
After determining the values of W, the normalized matrix N is generated such that the sum of weights between each grid and all channels equals one: n_{i,j} = w_{i,j} / Σ_i w_{i,j}. Thus all grids have the same total influence on the outcome of the cost model, but the relative influence on each channel is determined by distance.

5) Scale: Finally, a scale matrix S is created representing the total power sunk by each channel from each grid. The values of this matrix depend on the position weights from the previous step and the total power in a grid. However, as stated earlier, the thermal resistance to ambient of the layers deep in the stack is larger than that of the layers near the top, making the power in these layers more critical to peak temperature. To model this, the power matrix P is scaled by some function β which is an increasing function of layer depth. Thus s_{i,j} = n_{i,j} p_{y_j,x_j} β(y_j). In our study we define β(y) = 1 + 0.5(y - 1).

The final channel cost vector c is generated by summing S across all grids: c_i = Σ_j s_{i,j}. The cost vector is used to determine the set of channels considered for removal during each iteration of the simulated annealing algorithm.
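Putting the five steps of Section 3.10.4 together, a direct (unoptimized) transcription of the cost vector looks like the following. The nested-list power map layout, zero-based grid indexing and the explicitly written Gaussian are implementation assumptions.

    from math import exp

    def channel_costs(power, channels, lam=1.0, gauss_sigma=2.0):
        """Cost c_i of each microchannel; power[y][x][z] with z along the flow."""
        sigma = lambda z: 1 + 0.5 * (z - 1)          # outlet-weighted flow scaling
        beta = lambda y: 1 + 0.5 * (y - 1)           # depth scaling
        alpha = lambda d: exp(-d * d / (2 * gauss_sigma**2))   # Gaussian weight

        # 1) Sum the power map along the flow direction, scaled by sigma.
        p = [[sum(cell[z] * sigma(z + 1) for z in range(len(cell))) for cell in row]
             for row in power]
        grids = [(y, x) for y in range(len(p)) for x in range(len(p[0]))]

        costs = []
        for (yi, xi) in channels:                                      # 2) enumerate channels
            total = 0.0
            for (yj, xj) in grids:                                     #    and grids
                d = abs(xi - xj) + lam * abs(yi - yj)                  # 3) distance
                w = alpha(d)                                           # 4) weight
                w_sum = sum(alpha(abs(xk - xj) + lam * abs(yk - yj)) for (yk, xk) in channels)
                n = w / w_sum if w_sum else 0.0                        #    normalize over channels
                total += n * p[yj][xj] * beta(yj + 1)                  # 5) scale by grid power/depth
            costs.append(total)
        return costs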
3.11 Simultaneous Optimization

One might assume that floorplan and heatsink optimization would need to be done simultaneously, or in a nested loop, to avoid convergence to a local minimum. Initially that approach was implemented, but upon comparing the nested optimization to the sequential method described above, we observed that sequential optimization produced results of very similar quality while significantly reducing the simulation runtime.

Chapter 4: Architectural Opportunities of Micro-Fluidically Cooled 3D CPUs

This chapter presents the results of two studies undertaken to quantify the potential architectural opportunities presented by 3D IC technology using a stacked memory-on-logic processor. In the first study (Section 4.1) we show that significant speedup can indeed be achieved, but, as expected, this speedup is significantly thermally limited by the trapped heat effect. However, we show that MF cooling can overcome the thermal issues and thus realize the true potential of the 3D CPU architectures under consideration. In the second study (Section 4.2) we explore a potential return to frequency scaling in light of the reduced memory wall inherent to stacked memory processors and the reduced leakage power and chip temperatures achieved with micro-fluidic cooling. We find that the energy efficiency scaling trend vs. frequency is actually reversed when MF cooling is applied. Finally, we summarize this chapter in Section 4.3.

4.1 2D vs. 3D CPUs and the Need for MF Cooling

Chapter 2 introduced a number of architectural opportunities brought on by 3D technology, as well as some of the associated challenges. Thermal management was identified as a primary limitation of 3D integration, and micro-fluidic (MF) cooling was introduced as a promising potential solution. In this study we begin with the simplest type of 3D CPU: a stacked DRAM memory integrated on top of a traditional 2D multi-core processor. We ask two fundamental questions: what are the potential performance improvements offered by this architecture, and what are the thermally feasible improvements? Regarding the second question, we investigate how the switch from air cooling to MF cooling affects thermal feasibility and pushes the 3D memory-on-logic architecture closer to realizing its true potential.

As discussed in Section 2.4, the primary performance benefit of memory-on-logic stacking comes from higher memory bandwidth [9, 27, 39]. In our study we increase the memory bus frequency to match the CPU core frequency and expand the bus bit width to match that of the L2 cache line (Table 3.2). Although these two extensions improve memory bandwidth significantly, they do not fully leverage the additional CPU-DRAM interconnect density offered by TSV technology. To explore architectural designs with even more bandwidth, we consider increasing the number of memory controllers (MCs), allowing parallel memory access and thus scaling memory bandwidth proportionally to the number of MCs.

Although additional MCs can also be added to traditional 2D CPUs with off-chip DRAM, they will not benefit from more than a few MCs due to off-chip bandwidth constraints imposed by IO pin count limitations [9, 108, 109]. On the other hand, memory-on-logic 3D CPUs achieve monotonic (albeit diminishing) speedup as more MCs are added, due to virtually unlimited CPU-DRAM integration density (Footnote 1). Memory latency vs. number of MCs is shown in Figure 4.1 for a traditional 2D off-chip DRAM configuration and a memory-on-logic 3D CPU. This data was generated for a 16-core CPU using the simulation infrastructure and DRAM models introduced in Chapter 3. As more MCs compete for a fixed number of IO pins in a traditional DRAM CPU, the transfer delay from our latency model (Section 3.3) begins to dominate, as it increases proportionally to the number of MCs (Footnote 2). This makes MC scaling beyond 8 inefficient, whereas DRAM latency with on-chip vertical integration shows significant gains all the way up to 32 MCs.

Footnote 1: Feasible TSV integration density is many orders of magnitude higher than the density required for any reasonable number of memory controllers.
Footnote 2: The DRAM bus width per MC is the total number of IO pins (64) divided by the total number of MCs.

[Figure 4.1: Average DRAM latency vs. number of memory controllers [8]]

In this study we sweep the number of MCs and the clock frequency of a traditional 2D CPU and a memory-on-logic 3D CPU and evaluate the performance, power and temperature. We observe thermal violations in the 3D CPU with air cooling, so we evaluate the potential improvements to thermally feasible performance offered by applying a MF heatsink. The architectural design space considered in this study is given in Table 4.1. In this study the floorplan topology was fixed and uniform microchannel placement was used; the effects of physical optimizations are introduced in Section 5.1.

Table 4.1: Study 1 architectural design space
  Cores               16
  Clock Rate          {2.4, 2.6, 3.0, 3.2, 3.4} GHz
  Memory Controllers  {1, 2, 4, 8, 16, 32}

We conclude that memory-on-logic architectures do bring significant potential performance improvements, but are thermally infeasible with traditional air cooling. In fact, 3D stacking actually reduces the feasible performance compared to traditional off-chip DRAM when air cooling is applied, because the trapped heat effect requires total chip power to be scaled down significantly. However, MF cooling is able to realize the potential benefits of 3D CPUs by removing thermal violations.
We also show that MF cooling significantly reduces leakage power, more than making up for the required MF pumping power and raising the question of how MF cooling affects energy efficiency scaling trends, which we investigate in Section 4.2.

4.1.1 Performance

Throughout this dissertation we measure performance by the average number of committed instructions per nanosecond (IPnS), which is equivalent to billions of instructions per second (BIPS). Figure 4.2 shows the performance of our target processor with a variable number of memory controllers and clock rates. On average, the peak performance of a 3D CPU is 1.62x the peak performance of a 2D CPU within the studied design space. Although 3D integration offers the potential for significant speedups, these improvements can only be realized if the heat generated as a result of the increased power flux and thermal resistance can be sufficiently removed from the chip. It is important to note that the performance improvements result both from reduced latency at a fixed number of MCs and from the ability to leverage more MCs and thus access multiple DRAM ranks in parallel.

4.1.2 Temperature

Figures 4.3 and 4.4 show the peak temperature of our target processor configurations. In this work we assume the thermal violation temperature is 85 °C, which is shown as a horizontal black line in each figure. The number annotated above each bar represents the maximum performance (across all MC configurations) that does not violate the thermal constraint for each frequency/benchmark pair.

[Figure 4.2: Performance vs. MCs and frequency: (a) 2D CPU, (b) 3D CPU]

In the 2D case, adding more memory controllers did not significantly increase the temperature of the chip (Figure 4.3), because the generated heat has a low thermal resistance path to the heatsink (Section 3.8). Thus no thermal violations occur, and the optimal number of MCs can be implemented without considering any new cooling methods, but the performance gains are limited. In the 3D case, when the chip is air cooled (Figure 4.4(a)), the peak temperature often surpasses the thermal constraint, and thus the peak performance cannot be achieved. The maximum achievable performance of an air-cooled 3D system is in most cases actually less than that of a 2D IC. This is because adding more MCs to a 3D IC increases the peak temperature drastically (which is not the case in 2D), meaning that in most cases the 2D IC can use more MCs than the air-cooled 3D IC, causing the 3D IC to deliver worse performance. We know from the performance plots (Figure 4.2) that 3D ICs are capable of achieving much greater performance, and this motivates the need for more aggressive cooling techniques in order to achieve the performance increases potentially offered by 3D integration.
When micro-fluidic cooling is applied (Figure 4.4(b)) the peak temperatures are all brought below the temperature threshold, and the large performance increases offered by 3D integration can be thermally realized. Thus, aggressive cooling has enabled more aggressive architectural configurations. On average, the MF cooled 3D CPU's maximum achievable performance is 2.4x greater than the maximum achievable performance of an air cooled 3D CPU and 1.6x greater than the maximum achievable performance of an air cooled 2D CPU.

Figure 4.3: Temperature vs. MCs and frequency of air cooled 2D CPU

Figure 4.4: Temperature vs. MCs and frequency; (a) air cooled 3D CPU, (b) MF cooled 3D CPU

Figure 4.5: Best achievable performance subject to thermal constraints

4.1.3 Thermally Feasible Performance

The maximum performance subject to thermal constraints (i.e., the annotations in Figures 4.3 and 4.4) is plotted in Figure 4.5. When air cooling is used, 3D and 2D CPUs alternately outperform each other depending on the workload. In general, 3D CPUs have better performance than 2D CPUs when the number of MCs is the same. However, for most benchmarks 2D CPUs can thermally accommodate more MCs, allowing them to outperform an air cooled 3D CPU. But for the low power benchmarks (e.g., lu, streamcluster and ocean) the 3D temperature is low enough even with air cooling to take advantage of the additional bandwidth offered by memory-on-logic stacking. When thermal concerns are alleviated with MF cooling, 3D CPUs always perform best.

It can be observed in Figure 4.5 that average performance improves very little with respect to frequency in an air cooled 3D CPU.
Due to thermal constraints, there must be a trade-off between frequency and the number of memory controllers to maintain a safe temperature. With MF cooling or a traditional 2D layout, enough temperature slack exists in the system that both frequency scaling and an increased number of memory controllers can be leveraged for higher performance.

4.1.4 Power

Dynamic power remains the same regardless of heatsink type. However, Figures 4.6 and 4.7 show that adding MF cooling actually decreases the total power dissipation dramatically. This is because the leakage power is strongly dependent on temperature, and the temperature reduction due to liquid cooling reduces the leakage power. On average, micro-fluidic cooling can reduce 3D IC leakage power by 20.9 W, which easily justifies the extra power used to pump the fluid through the microchannels (less than 1 W). Furthermore, it raises the question of how MF cooling affects energy efficiency scaling trends, which are examined in Section 4.2.

Figure 4.6: Power dissipation vs. MCs and frequency of air cooled 2D CPU

Figure 4.7: Power dissipation vs. MCs and frequency; (a) air cooled 3D CPU, (b) MF cooled 3D CPU

4.2 Frequency Scaling with Micro-Fluidics

Since the 1980s, Moore's Law performance scaling was traditionally achieved through constant increases to CPU frequency, made possible by similar reductions in capacitance and voltage through technology scaling. However, the increase in power, and therefore temperature, associated with frequency scaling became unsustainable in the mid 2000s [110]. One of the biggest problems was the exponential increase in leakage power as temperatures increased, causing energy efficiency to plummet past a few GHz [111]. Another big issue with frequency scaling was the ever increasing memory wall gap between processor and memory performance (Section 2.2) [110].

In Section 4.1 we observed a large reduction in leakage power and temperature due to the application of MF cooling. Additionally, we observed a significant performance improvement due to increased memory bandwidth when memory-on-logic stacking was applied. These two observations cause us to reexamine the feasibility and efficiency of further frequency scaling in 3D CPUs with MF cooling.

In this study we first argue that frequency scaling is a more versatile scaling trend than the core scaling that has come to replace it.
We sample the parallelism of a group of benchmarks and show that only those with very large degrees of parallelism will benefit from core scaling, whereas all workloads benefit from frequency scaling. However, with traditional air cooling, both core and frequency scaling are limited in 3D CPUs. Next we compare air cooled and MF cooled 3D CPUs and their associated scaling trends with respect to temperature, power and energy efficiency.

4.2.1 Design Space, Benchmarks and Metrics

The design space swept in this study includes the number of cores (i.e., core scaling) and the clock rate (i.e., frequency scaling). The specific values simulated are given in Table 4.2. Different workloads exhibit different performance/power/temperature trade-offs across these variables, and the highest performance thermally feasible design point is identified for each benchmark. In this study the floorplan topology was fixed and uniform microchannel placement was used. The effects of these physical optimizations are introduced in Section 5.1.

Table 4.2: Study 2: Architectural Design Space
  Cores               {16, 32, 64}
  Clock Rate          {2.4, 3.0, 3.6} GHz
  Memory Controllers  0.5 per Core

Each benchmark (except for ferret, which has a unique data pipeline) has some period of sequential execution that occurs on a single processing core, followed by a period of parallel execution distributed across all cores. The ratio of parallel execution time to total execution time³ is denoted α. According to Amdahl's law, the speedup offered by using n cores (compared to a single core) is given in Equation (4.1).

    Performance(n) / Performance(1) = n / (n − α(n − 1))    (4.1)

³ Benchmarks were terminated after 540M instructions if they had not already finished, to maintain reasonable simulation time.

In the architectures simulated here, adding more cores also changes the size and distribution of the L2 cache as well as increasing the average distance between routers in the NOC, causing performance to depend on other factors beyond Amdahl's law. Nevertheless, benchmarks with a large α value often achieve optimal performance with more cores, whereas benchmarks with a low α value often achieve optimal performance with a smaller number of cores. The α value and highest performing core count for each benchmark are tabulated in Table 4.3. In this work we measure performance by the average number of committed instructions per nanosecond (IPnS) and energy efficiency by the reciprocal of the energy delay product (EDP).

4.2.2 Core and Frequency Scaling

For each benchmark, we find the highest performing architectural configuration that does not violate the peak temperature constraint of 85 °C. The results of this experiment are shown in Table 4.3. We observe that with air cooling both the number of cores and the frequency are severely limited. With the application of MF cooling, every benchmark except radix is able to achieve its optimal number of cores. Moreover, only swaptions pursues core scaling over frequency scaling, and this is because swaptions is nearly 100% parallel.

Table 4.3: Maximum benchmark performance s.t. thermal constraint
  Benchmark        α (%)   Opt. cores | Air cooled: cores / GHz / IPnS | MF cooled: cores / GHz / IPnS | Inc. IPnS
  Swaptions        99.8    64         | 16 / 3.0 / 35.1                | 64 / 3.0 / 119.6              | 3.41x
  Radix            99.8    64         | 16 / 3.0 / 34.9                | 32 / 3.6 / 51.8               | 1.48x
  Barnes           98.8    64         | 16 / 3.0 / 27.4                | 64 / 3.6 / 70.0               | 2.56x
  FMM              98.7    32         | 16 / 3.0 / 24.5                | 32 / 3.6 / 42.6               | 1.74x
  Water-spatial    93.2    64         | 16 / 3.0 / 40.5                | 64 / 3.6 / 67.1               | 1.66x
  Water-nsquared   93.0    16         | 16 / 3.0 / 32.4                | 16 / 3.6 / 38.4               | 1.19x
  FFT              74.3    64         | 16 / 3.0 / 6.2                 | 64 / 3.6 / 7.6                | 1.23x
  Raytrace         71.9    16         | 16 / 3.0 / 1.9                 | 16 / 3.6 / 2.1                | 1.15x
  Fluidanimate     35.7    16         | 16 / 3.0 / 4.7                 | 16 / 3.6 / 5.5                | 1.18x
  Dedup            29.2    16         | 16 / 3.6 / 1.3                 | 16 / 3.6 / 1.3                | 1.00x
  Facesim          0.0     16         | 16 / 2.4 / 4.8                 | 16 / 3.6 / 7.0                | 1.48x
  Radiosity        0.0     16         | 16 / 3.0 / 2.5                 | 16 / 3.6 / 3.0                | 1.19x
  Ferret           n/a     32         | 16 / 3.0 / 4.6                 | 32 / 3.6 / 5.5                | 1.20x
  Average                                                                                              | 1.57x

The main conclusion from this data is that even when thermal constraints are mitigated (e.g., by applying MF cooling), the amount of potential improvement due to core scaling has an established upper limit inherent to the parallelism (α) in the workload. On the other hand, frequency scaling can continue to push performance for any arbitrary workload, until the thermal constraint is hit. With MF cooling and 3D memory-on-logic stacking we expect that frequency scaling once again becomes a viable strategy, at least in the short term.
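Equation (4.1) makes the core-scaling ceiling in Table 4.3 easy to reproduce. The minimal sketch below (illustrative only; it ignores the cache and NOC effects noted above) evaluates the Amdahl bound for a few of the measured α values.

def amdahl_speedup(n_cores, alpha):
    """Equation (4.1): speedup of n cores over one core for parallel fraction alpha."""
    return n_cores / (n_cores - alpha * (n_cores - 1))

# Parallel fractions taken from Table 4.3 (alpha is given there in percent).
benchmarks = {"swaptions": 0.998, "fft": 0.743, "fluidanimate": 0.357, "facesim": 0.0}

for name, alpha in benchmarks.items():
    bound = [amdahl_speedup(n, alpha) for n in (16, 32, 64)]
    print(f"{name:13s} 16c {bound[0]:5.2f}x  32c {bound[1]:5.2f}x  64c {bound[2]:5.2f}x")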
4.2.3 Scaling Trends

To further investigate the frequency scaling trends of 3D CPUs, we fixed the number of cores (32) and performed a detailed frequency sweep on a sequential benchmark (facesim). The sequential nature of the benchmark eliminates the possibility of improving its performance through core scaling, and leads us to view frequency scaling as the only avenue for benchmark speedup. We compare the frequency scaling trends of an air cooled vs. MF cooled 3D CPU.

Figure 4.8: 3D CPU (a) performance and (b) energy efficiency vs. frequency with air cooling and MF cooling

It is obvious that frequency scaling will improve performance roughly linearly with frequency (Figure 4.8(a)), but what is interesting is how power, temperature and energy efficiency scale using different types of heatsinks. Figure 4.8(b) shows that air cooled 3D CPUs become energy inefficient beyond 3-4 GHz, whereas MF cooled 3D CPUs continue to be energy efficient beyond 5 GHz. This is an interesting result because the traditional frequency scaling paradigm ended around 3 GHz, which is in good agreement with the simulation data. This implies the possibility of MF cooling providing a realignment back to frequency scaling, or the application of frequency and core scaling in tandem for future computer architectures.

Figure 4.9(a) shows the thermal scaling trends. We can see that air cooled 3D CPUs become thermally infeasible beyond 2 GHz, whereas MF cooling can push thermal feasibility out to nearly 5 GHz. One advantage of 3D integration is core scaling independent of technology scaling by applying logic-on-logic stacking. However, this will yield similar thermal scaling trends to frequency scaling due to increased power flux, and will likewise require aggressive active cooling solutions such as MF cooling.

Figure 4.9: 3D CPU (a) temperature and (b) power vs. frequency with air cooling and MF cooling

Finally, Figure 4.9(b) shows the power scaling trends. Two important observations can be made about air cooled 3D CPUs. First, they generally have large amounts of leakage, roughly 50% up to 4 GHz. Beyond this point the thermal runaway phenomenon [62] causes the leakage and temperature to quickly increase without bound in a positive feedback loop. Second, leakage power scales at the same rate as dynamic power, reducing energy efficiency as clock rates increase. MF cooling not only removes the thermal runaway issue (in the range of frequencies simulated), but also causes leakage power to scale slower than dynamic power, leading to more efficient systems and improving the effectiveness of dynamic power control schemes like clock gating [112].
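The thermal runaway behaviour described above is a fixed-point problem: leakage grows roughly exponentially with temperature, and temperature grows with total power through the heatsink's thermal resistance. The sketch below iterates that feedback loop for a hypothetical chip; every constant (thermal resistances, leakage coefficients) is an illustrative assumption rather than a value from our simulation infrastructure, but the qualitative behaviour mirrors Figure 4.9: the high-thermal-resistance (air cooled) case diverges once dynamic power is pushed up, while the low-resistance (MF cooled) case settles at a moderate temperature.

import math

T_AMB = 40.0        # ambient temperature in C (assumed)
LEAK_AT_AMB = 25.0  # leakage power at T_AMB in W (assumed)
LEAK_SCALE_C = 60.0 # e-folding temperature of the leakage model in C (assumed)

def leakage_w(temp_c):
    """Illustrative exponential leakage-vs-temperature model."""
    return LEAK_AT_AMB * math.exp((temp_c - T_AMB) / LEAK_SCALE_C)

def settle(p_dynamic_w, r_th_c_per_w, max_iter=200):
    """Iterate the temperature/leakage feedback; return (T, P_total) or None on runaway."""
    temp = T_AMB
    for _ in range(max_iter):
        total = p_dynamic_w + leakage_w(temp)
        new_temp = T_AMB + r_th_c_per_w * total
        if new_temp > 250.0:                      # diverging: thermal runaway
            return None
        if abs(new_temp - temp) < 1e-3:           # converged
            return new_temp, total
        temp = new_temp
    return temp, p_dynamic_w + leakage_w(temp)

# Dynamic power stands in for rising clock frequency; the R_th values are assumed.
for p_dyn in (60.0, 100.0, 140.0, 180.0):
    for label, r_th in (("air cooled", 0.40), ("MF cooled ", 0.10)):
        res = settle(p_dyn, r_th)
        msg = "thermal runaway" if res is None else f"T = {res[0]:6.1f} C, total P = {res[1]:6.1f} W"
        print(f"P_dyn = {p_dyn:5.1f} W  {label}: {msg}")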
4.3 Summary

In this chapter we have quantitatively investigated some of the architectural opportunities offered by memory-on-logic 3D CPUs with micro-fluidic cooling. We consider the memory bandwidth advantages of 3D stacked memory and identify the need for embedded active cooling to realize the theoretical gains of such a system. Furthermore, we consider the scaling trends of 3D CPUs with MF cooling and show that frequency scaling may once again emerge (in conjunction with core scaling) as a viable avenue for performance scaling of future CPUs cooled with micro-fluidics.

Section 4.1 made the case for memory-on-logic 3D CPUs by demonstrating their potential speedup over traditional 2D CPUs with off-chip DRAM, but showed that those improvements could only be thermally realized with embedded active cooling such as MF cooling, due to the high power flux of the core logic layer and the trapped heat effect of the stacked DRAM. Speedup was achieved by increasing the clock speed and bit width of the memory bus using high density TSV integration, and by increasing the number of dedicated memory controllers, allowing for parallel memory access.

Section 4.2 built on some of the findings from Section 4.1 and evaluated the frequency scaling trends of power, temperature and energy efficiency when using 3D CPUs with MF cooling. Two major factors in the switch to the multi-core paradigm were excessive power and heat, and the memory wall. We show that the power and heat scaling issue can be significantly curbed with embedded MF cooling, and that the memory wall can be overcome with high bandwidth on-chip DRAM integration. The scaling trends of temperature and leakage power are significantly linearized by the application of MF cooling, and moreover, the energy efficiency continues to rise in an MF cooled 3D CPU as frequency is increased up to 5 GHz, whereas the energy efficiency of an air cooled CPU begins to decrease past 3-4 GHz.

Chapter 6 is preceded by the co-design results of the next chapter.

Chapter 5: Architectural-Physical Co-Design of Micro-Fluidically Cooled 3D CPUs

In this chapter we present results from the application of our proposed co-design flow. Section 5.1 applies the proposed scheme across a 3D CPU design space with different physical optimizations, objective functions, and physical constraints. Section 5.2 investigates a fundamental trade-off between TSV density (i.e., inter-layer communication bandwidth) and the cooling capacity of a MF heatsink. Specifically we target a pin-fin heatsink. Compared to microchannel MF heatsinks, pin-fin MF heatsinks are known to have higher cooling capacity, but are more restrictive on TSV density and placement [113]. Section 5.3 concludes this chapter with a summary.

5.1 Thermal-Reliability Aware Architectural-Physical DSE

In this study we investigate the effects of the floorplan (Section 3.9) and cooling (Section 3.10) optimization schemes on the feasibility region of a 3D CPU design space.
In addition to the thermal constraints imposed in Chapter 4, we also incorporate the reliability model from Section 3.7 and impose a reliability constraint on the design space. We combine the design variable spaces considered in the two previous studies in Chapter 4. This results in a three-dimensional design space of cores, MCs and frequency, as enumerated in Table 5.1.

Table 5.1: Study 3: Architectural Design Space
  Cores               {16, 32, 64}
  Clock Rate          {2.4, 3.0, 3.6} GHz
  Memory Controllers  {0.125, 0.25, 0.5} per Core

Thus we perform 3D memory-on-logic processor DSE across a combined design space of architectural parameters, floorplan topology and MF heatsink design, subject to thermal and reliability constraints. The optimization metric is performance measured in instructions per nanosecond (IPnS, a.k.a. BIPS). We use a variable reliability threshold of 0.00 ≤ α ≤ 0.99 such that the probability the CPU fails before the target lifetime is less than or equal to 1 − α. For sensitivity analysis, we also investigate the effects of ignoring one or more of the floorplan objective terms and of sweeping the tightness of the reliability constraint.

Figure 5.1: 3D CPU design space performance

5.1.1 Feasibility Region

First we explore the feasibility region of the design space. An architecture is considered feasible if for all benchmarks the thermal and reliability constraints are met. Although the entire design space from Table 5.1 was considered in this evaluation, we found that no 64-core architectures could meet both thermal and reliability constraints, so the 64-core architectures were trimmed from the design space for this section¹. Figure 5.1 illustrates the normalized performance of the trimmed design space, evaluated over a set of parallel benchmarks from the SPLASH-2 [84] and PARSEC [85] benchmark suites. Performance values for each benchmark were normalized to the 16-core, 2 MC, 2.4 GHz architecture before averaging across all benchmarks.

¹ However, in Section 5.1.2 we consider the optimal architecture of each benchmark individually (as was done in Section 4.2), and the 64-core architectures are included in those results.

Figure 5.2: Thermal feasibility region (shown in white)

Figure 5.3: Reliability feasibility region (shown in white)

Figure 5.4: Thermal-reliability feasibility region (shown in white)

Figures 5.2 through 5.4 show the feasibility region of the design space. Feasible architectures are shown in white, infeasible architectures are shown in black, and the highest performing feasible architecture is marked with "OPT". The thermal (Figure 5.2) and reliability (Figure 5.3) feasibility regions are evaluated separately, and their intersection defines the true thermal-reliability feasibility region (Figure 5.4).
Thermal feasibility is defined as a maximum on-chip temperature less than T_violation = 85 °C. Reliability feasibility is defined as P_fail(t_target) ≤ 1 − α, where α = 99% is the reliability confidence and t_target = 3 years is the lifetime target.

Two floorplan objective functions are considered. The first includes only wirelength² and temperature (WL + T), whereas the second also includes reliability (WL + T + R). The results in this figure assume MF cooling with uniform microchannel placement.

² In this context wirelength consists of the combination of area A and total negative slack S from Equation (3.10).

Looking at the thermal feasibility region, we observe that the addition of reliability to the floorplan objective function causes the thermal feasibility region to contract, resulting in reduced optimal performance. However, the addition of reliability to the floorplan objective massively expands the reliability feasibility region and the true thermal-reliability feasibility region, which increases the optimal performance significantly.

This result exposes an interesting potential trade-off between temperature and reliability in 3D CPUs. Although increased temperature increases the probability of failure of a single TSV, it is quite possible that thermally optimized floorplans contain more 3D nets (i.e., more cuts in the inter-layer partition) in order to optimize the distribution of power. In some cases the increase in the number of TSVs will outweigh the reduction in temperature when considering the net effect on system reliability.

Overall, we conclude that even though one would assume optimization of thermal and reliability metrics to go hand in hand, this is in fact not the case. Optimization for temperature only is significantly suboptimal because it splits too many 3D nets in pursuit of fine-grained power density matching against the thermal resistance of each stack layer. Conversely, consideration of the reliability objective in optimization increases hot-spot temperature, and awareness of both metrics is necessary to maximize the intersection of the thermal and reliability feasibility regions.

Figure 5.5: Co-design results

5.1.2 Optimal Performance

The optimal feasible performance of the investigated architectural design space is plotted in Figure 5.5. This data is generated by finding the optimal feasible performance of each benchmark separately, and normalizing against the base case before averaging the results across all benchmarks. In this study the base case is as follows: air cooling, thermal-reliability unaware floorplanning (WL), and no reliability constraint (i.e., α = 0).

Three floorplan objectives are used to generate the data, each one adding an additional term to the objective function. The data is obtained using two different constraints: thermal (T Constraint) and thermal-reliability (TR Constraint). These two constraints are defined by setting α = 0 and α = 0.99 respectively. The unconstrained performance of the design space is notated as an upper bound. Likewise, four different cooling schemes are considered: high-pumping-power uniform MF cooling (High-P Fluid), low-pumping-power optimized MF cooling (Low-P Opt Fluid), low-pumping-power uniform MF cooling (Low-P Fluid) and traditional air cooling (Air). Low-pumping-power MF cooling uses 5x less pumping power, and optimized MF cooling uses the microchannel placement optimization technique described in Section 3.10.
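The selection of the "OPT" point in Figures 5.2 through 5.5 amounts to intersecting the two feasibility predicates defined above and keeping the best performing survivor. The short sketch below shows that bookkeeping; the per-design metric values are hypothetical placeholders, and only the 85 °C and α = 0.99 thresholds come from the text.

from dataclasses import dataclass

T_VIOLATION_C = 85.0      # thermal constraint (Section 5.1.1)
ALPHA = 0.99              # reliability confidence; P_fail(t_target) must not exceed 1 - alpha

@dataclass
class DesignPoint:
    cores: int
    mcs: int
    freq_ghz: float
    perf_ipns: float      # in the real study these are worst-case-over-benchmarks values;
    peak_temp_c: float    # the numbers below are placeholders for illustration
    p_fail: float         # probability of failure before the target lifetime

def thermally_feasible(d):
    return d.peak_temp_c < T_VIOLATION_C

def reliability_feasible(d):
    return d.p_fail <= 1.0 - ALPHA

design_space = [
    DesignPoint(16, 2, 2.4, 1.0, 62.0, 0.004),
    DesignPoint(16, 4, 3.0, 1.8, 78.0, 0.006),
    DesignPoint(32, 8, 3.0, 3.1, 83.0, 0.015),   # reliability-infeasible
    DesignPoint(32, 16, 3.6, 4.4, 96.0, 0.009),  # thermally infeasible
]

feasible = [d for d in design_space if thermally_feasible(d) and reliability_feasible(d)]
opt = max(feasible, key=lambda d: d.perf_ipns)
print(f"OPT: {opt.cores} cores, {opt.mcs} MCs, {opt.freq_ghz} GHz, {opt.perf_ipns} IPnS")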
Comparing the first (leftmost) two bars in the figure, we can see that without reliability constraints, thermally-aware floorplanning improves thermally feasible performance between 3% and 13% depending on the cooling method applied. Additionally, one can observe that none of the considered cooling techniques is able to thermally unlock the entire design space, and the improvement in performance due to increasing MF cooling power 5x is less than 2x. Finally, microchannel placement optimization can provide significant performance improvements while maintaining a constant pumping power, thus greatly increasing the power efficiency of the MF heatsink.

Comparing the middle two bars, we observe that the massive improvement to the thermal feasibility region provided by MF cooling becomes a moot point when reliability constraints are included. However, by comparing the last (rightmost) two bars we see that reliability-aware floorplanning can once again unlock the performance potential of MF cooling. Reliability feasibility does not significantly affect the potential performance of an air-cooled 3D CPU, since the architectural design points which would benefit from the expanded reliability feasibility region are still thermally infeasible. The conclusion here is that aggressive cooling is required to thermally unlock 3D CPU performance, but it must also be accompanied by reliability aware physical design to realize the potential gains brought by the new cooling technology.

5.1.3 Reliability Constraint Sensitivity

Finally, we repeat the above analysis for different values of α and compare the performance ratio between reliability aware (WL + T + R) and reliability unaware (WL + T) designs. The improvement in average feasible performance is shown in Figure 5.6.

Figure 5.6: Performance improvement due to reliability-aware FP

We observe that the performance improvement due to reliability awareness in floorplanning increases as the reliability constraint tightens, because reliability becomes a more significant factor in determining physical feasibility.

Moreover, we observe that the performance improvement due to reliability awareness is significantly less when air cooling is used, because many design points are thermally limited. Due to a very small thermal feasibility region, reliability aware design has little effect on the physical feasibility region, and thus offers only marginal improvement. On the other hand, when MF cooling is used the improvement due to reliability-aware floorplanning is quite large, since reliability is the dominating factor determining physical feasibility.

The conclusion is that the effectiveness of certain optimization schemes, such as reliability-aware floorplanning, will depend on other design choices, such as heatsink type, and on the design specifications, such as the reliability constraint. This further motivates the need for a holistic co-design paradigm.

5.2 Thermal-Bandwidth Trade-offs in MF Cooled 3D CPUs

In the previous studies we have investigated the trade-offs between performance, temperature and reliability across an architectural-physical design space. In those studies, constraints on TSV integration density did not come into play because the microchannel MF heatsink can accommodate sufficient integration density to support the architectures investigated in this dissertation³.
However, other types of MF heatsinks exist which offer better cooling at the expense of reduced TSV integration density [113, 115]. In this study we investigate one such heatsink design: the micro-fluidic pin-fin heatsink. In this section we present a study showing that a heatsink designed for maximum cooling will actually limit the architectural design space, due to inter-layer bandwidth constraints, more than a heatsink that provides worse cooling in order to accommodate higher TSV density.

³ However, the inter-layer integration density required for more fine-grained 3D circuits may see limitations due to micro-channel heatsinks. Moreover, TSV-microchannel conflicts impose constraints on detailed gate-level placement [30, 31, 114].

Micro-fluidic pin-fin heatsinks (Figure 5.7) pump fluid through cavities etched into the silicon substrate of each layer in a 3D chip stack. The fluid cavities are etched around cylindrical islands of silicon called pin-fins. Pin-fins provide a physical, electrical and thermal interconnection between adjacent layers in the chip stack, and provide a path for heat transfer from the silicon into the fluid. Unlike microchannel heatsinks, pin-fin cooling pumps all fluid through a single connected cavity, and has been shown to provide better cooling performance than a micro-channel heatsink when fluid velocity is high [113, 115].

Figure 5.7: Micro-fluidic pin-fin cooling of a single layer in a 3D-IC

Two of the most important geometric parameters that determine the cooling capacity of a micro-fluidic pin-fin heatsink are the pin diameter D and the pin pitch S [113, 116], which are illustrated in Figure 5.7. The pin pitch determines the number of pins per unit area, and the pin diameter determines the surface area of each pin. Increasing the pin diameter or decreasing the pitch increases the total surface area between the fluid and the silicon substrate, increasing heat conduction, but also increases the resistance to flow, causing fluid velocity to drop when a constant pressure drop is enforced between the fluid inlet and outlet. The micro-fluidic pin-fin heatsink parameters explored in this study are enumerated in Table 5.2.

Table 5.2: Micro-fluidic pin-fin heatsink dimensions
  Variable  Value                  Unit  Description
  S         {250, 300, ..., 600}   µm    Pin Pitch
  D         75                     µm    Pin Diameter
  H         100                    µm    Pin Height

Past work [113] has shown that micro-fluidic pin-fin heatsink parameters can be optimized to improve cooling capacity, but has not considered how such optimizations affect architectural design constraints such as vertical interconnect density. Furthermore, that work considered only one fixed micro-architecture, and did not consider how the optimal heatsink parameters change under different architectural design choices.

One drawback associated with micro-fluidic cooling in general is the resource conflict that emerges between TSVs and fluid cavities. Since TSVs cannot pass through the fluid cavities, the location and density of vertical interconnects is determined by the design of the cooling system, such as the pin-fin or microchannel diameter and pitch. In other words, TSVs cannot be placed through the fluid cavity. In a pin-fin MF heatsink, TSVs are generally more constrained because more of the chip area is dedicated to the fluid cavity [115]. In such a heatsink, TSVs can only pass through the pins themselves (Figure 5.7).
Past work [30, 31] has shown that this resource conflict can restrict the placement of TSVs, leading to increased wirelength and thus critical path delay, but has not considered how the resource conflict can affect micro-architectural design choices.

Our results show there exists a trade-off between the maximum TSV density and the cooling capacity of the micro-fluidic heatsink. Since different 3D CPU architectures require varying amounts of vertical interconnect density, the cooling solution for each architecture should be designed to maximize cooling while accommodating sufficient TSV bandwidth (BW). We show that naïve application of fixed micro-fluidic heatsink designs will severely limit the feasible design space for 3D CPUs and result in the selection of suboptimal designs.

5.2.1 Bandwidth Requirements

The bandwidth requirement of a 3D CPU architecture is defined as the maximum TSV density required by the architecture. In this study we simulate single-layer cores, so TSVs are only required for extra-core communication: 1) communication between memory controllers and DRAM, and 2) communication between routers. An extension of this study which is left to future work would be to include multi-layer cores and the TSV density requirements associated with these intra-core vertical nets.

5.2.2 Memory Controller TSV Density

The number of DRAM buses passing through layer i in a vertical column of memory controllers (MCs) is i: the number of MCs contained on all layers below and including layer i. Thus the logic layer with the highest MC TSV density is always the top layer, layer n. The minimum TSV density required for communication between the MCs and the DRAM, D_T^MC, is given in Equation (5.1), where W_bus is the DRAM bus width, A_TSV is the area of a single TSV and A_MC is the total area of a single memory controller. In this work W_bus is assumed to be 512 bits (64 bytes).

    D_T^MC = n · W_bus · A_TSV / A_MC    (5.1)

5.2.3 Router TSV Density

The number of TSVs between layers i and i+1 in a vertical column of routers, T_ROUT(i), was defined in Equation (2.1). Thus the minimum TSV density requirement for router communication, D_T^ROUT, is given in Equation (5.2), where A_ROUT is the total area of a single router.

    D_T^ROUT = max_{i ∈ {1, 2, ..., n−1}} T_ROUT(i) · A_TSV / A_ROUT    (5.2)

5.2.4 TSV Density Requirement

The overall TSV density requirement of a 3D CPU, D_T, is the larger of the two aforementioned density requirements, as expressed in Equation (5.3). In this study we assume a TSV pitch of 10 µm, making A_TSV = 100 µm². Other area values used in this study are A_MC = 8.660 mm² and A_ROUT = 0.924 mm², which are obtained from McPAT [2] (Section 3.4).

    D_T = max(D_T^MC, D_T^ROUT)    (5.3)

5.2.5 Bandwidth Capacity

The pin-fin structure affects not only cooling, but also the maximum bandwidth capacity of a micro-fluidic pin-fin heatsink. The bandwidth capacity is defined as the maximum TSV density supported by the heatsink. The maximum TSV density supported by a pin-fin heatsink with pin diameter D and pin pitch S is D_P, as defined in Equation (5.4). The first two terms in the equation represent the cross-sectional area of a pin divided by the total area between adjacent pins. η is the TSV yield, i.e., the fraction of the pin area that can contain TSVs. In this work we assume η = 0.8 due to the circular shape of the pin fins, which results in wasted area around the edge.

    D_P = (π/4) · (D²/S²) · η    (5.4)
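Equations (5.1) through (5.4) combine into a single bandwidth feasibility check. The sketch below evaluates it for a hypothetical stack; A_TSV, A_MC, A_ROUT and W_bus are the values stated above, while the layer count, the per-layer router TSV counts T_ROUT(i) (which Equation (2.1) would supply) and the swept pitches are assumptions for illustration.

import math

A_TSV_UM2 = 100.0        # 10 um TSV pitch -> 100 um^2 per TSV (Section 5.2.4)
A_MC_UM2 = 8.660e6       # memory controller area, 8.660 mm^2 from McPAT
A_ROUT_UM2 = 0.924e6     # router area, 0.924 mm^2 from McPAT
W_BUS_BITS = 512         # DRAM bus width (64 bytes)

def mc_tsv_density(n_layers):
    """Equation (5.1): TSV area fraction needed for MC-DRAM buses on the top logic layer."""
    return n_layers * W_BUS_BITS * A_TSV_UM2 / A_MC_UM2

def router_tsv_density(tsvs_between_layers):
    """Equation (5.2): worst-case TSV area fraction needed in a router column.
    tsvs_between_layers[i] plays the role of T_ROUT(i) from Equation (2.1)."""
    return max(t * A_TSV_UM2 / A_ROUT_UM2 for t in tsvs_between_layers)

def required_density(n_layers, tsvs_between_layers):
    """Equation (5.3): overall TSV density requirement D_T."""
    return max(mc_tsv_density(n_layers), router_tsv_density(tsvs_between_layers))

def pin_fin_capacity(pin_diameter_um, pin_pitch_um, yield_frac=0.8):
    """Equation (5.4): TSV area fraction D_P supported by a pin-fin heatsink."""
    return math.pi / 4.0 * (pin_diameter_um / pin_pitch_um) ** 2 * yield_frac

# Hypothetical 4-layer stack with assumed router TSV counts between adjacent layers.
d_req = required_density(n_layers=4, tsvs_between_layers=[64, 128, 192])
for pitch in (250, 400, 600):
    d_cap = pin_fin_capacity(pin_diameter_um=75.0, pin_pitch_um=pitch)
    ok = "bandwidth feasible" if d_cap >= d_req else "bandwidth infeasible"
    print(f"pitch {pitch:3d} um: D_P = {d_cap:.4f}, D_T = {d_req:.4f} -> {ok}")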
5.2.6 Pin-Fin Thermal Model

The thermal model introduced in Section 3.8 was for a microchannel MF heatsink. In this study we use a different thermal model to model the pin-fin MF heatsink. The model was developed by our collaborators at the Georgia Institute of Technology [113], with whom we performed this study. The pin-fin MF heatsink model is explained in the remainder of this section.

The 3D stack is discretized into multiple control volumes, each modeling the temperature around one pin. Figure 5.8 shows the energy flows in a single control volume. Energy balance analysis is conducted for each control volume to evaluate the thermal map of the system.

Figure 5.8: Control volume around one pin

Each control volume is assumed to have a uniform fluid temperature T_f and a uniform silicon temperature T_s. The energy equation for the solid components of a control volume is given in Equation (5.5), where q_gen is the energy generation rate obtained from the power map, q_cond is the heat conduction from neighboring control volumes and q_conv is the heat transferred by convection between the solid and the fluid.

    q_gen = q_cond + q_conv    (5.5)

The energy balance equation for the fluid is given in Equation (5.6), where ṁ is the mass flow rate, c_p is the specific heat capacity of the fluid, and T_f(i−1, j) is the fluid temperature of the upstream neighbor control volume.

    q_conv = ṁ · c_p · (T_f(i, j) − T_f(i−1, j))    (5.6)

A system of equations is obtained by applying energy balance analysis to each control volume, and the system is solved simultaneously. Heat convection terms are defined using the fluid heat transfer coefficient h_f, which is given in Equation (5.7), where Nu is the Nusselt number, which we estimate using the equations in [113], and k_f is the thermal conductivity of the fluid.

    h_f = Nu · k_f / D    (5.7)

In this study the fluid is assumed to be water. Table 5.3 gives a list of parameter values used in the thermal model. Some parameters are temperature dependent, so their default value (calculated at 25 °C) is given in the table, and temperature dependent scaling factors from [117] are applied within the model. Heat conduction from the chip stack into the environment is modeled as a heat transfer coefficient between the ambient temperature and the top and bottom of the chip stack.

Table 5.3: Micro-fluidic pin-fin thermal model parameters
  Variable   Value    Unit          Description
  T_amb      40       °C            Ambient temperature
  T_in       25       °C            Fluid inlet temperature
  h_bot      10       W m⁻² K⁻¹     Heat transfer coefficient at layer n
  h_top      562      W m⁻² K⁻¹     Heat transfer coefficient at layer 1
  k_Si       149      W m⁻¹ K⁻¹     Thermal conductivity of silicon
  k_ox       1.4      W m⁻¹ K⁻¹     Thermal conductivity of oxide
  ρ_f(25)    1000     kg m⁻³        Fluid density at 25 °C
  k_f(25)    0.5573   W m⁻¹ K⁻¹     Fluid thermal conductivity at 25 °C
  c_p(25)    4200     J kg⁻¹ K⁻¹    Fluid specific heat capacity at 25 °C
  µ_f(25)    1.53     mPa s         Fluid dynamic viscosity at 25 °C
  Δp         1500     Pa            Pressure drop from inlet to outlet
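To make the bookkeeping of Equations (5.5) through (5.7) concrete, the sketch below marches the fluid temperature down a single row of control volumes and back-solves the local silicon temperature from the convection balance. It deliberately neglects the inter-volume conduction term q_cond, so it is a simplification of the model solved by our collaborators' simulator; the Nusselt number, mass flow rate, wetted-area estimate and per-volume power values are assumed placeholders, while the pin geometry and fluid properties come from Tables 5.2 and 5.3.

# Simplified single-row version of the pin-fin control-volume model (Eqs. 5.5-5.7).
# Assumption: q_cond between neighbouring volumes is neglected, so q_gen = q_conv.

import math

T_IN_C = 25.0          # fluid inlet temperature (Table 5.3)
K_F = 0.5573           # fluid thermal conductivity, W/(m K) (Table 5.3)
C_P = 4200.0           # fluid specific heat, J/(kg K) (Table 5.3)
PIN_D_M = 75e-6        # pin diameter (Table 5.2)
PIN_H_M = 100e-6       # pin height (Table 5.2)
NUSSELT = 8.0          # placeholder Nu; the real model uses the correlations in [113]
M_DOT_KG_S = 2.0e-6    # mass flow rate through this row of volumes (assumed)

h_f = NUSSELT * K_F / PIN_D_M                  # Equation (5.7)
pin_area_m2 = math.pi * PIN_D_M * PIN_H_M      # wetted pin surface area (assumed geometry)

q_gen_w = [0.030, 0.045, 0.060, 0.045, 0.030]  # per-volume heat from the power map (assumed)

t_fluid = T_IN_C
for i, q in enumerate(q_gen_w):
    # Equation (5.6) with q_conv = q_gen: the fluid heats up as it passes each pin.
    t_fluid = t_fluid + q / (M_DOT_KG_S * C_P)
    # Newton-cooling balance T_s = T_f + q / (h_f * A_pin) gives the local silicon temperature.
    t_silicon = t_fluid + q / (h_f * pin_area_m2)
    print(f"volume {i}: T_f = {t_fluid:6.2f} C, T_s = {t_silicon:6.2f} C")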
5.2.7 Experimental Setup

In the following sections we discuss our experiment and results. First we discuss our methodology and characterize the design space (Section 5.2.8). Next we characterize the effect of the pin-fin pitch S on the thermal and bandwidth feasibility of the design space. Finally we introduce two naïve schemes for choosing a heatsink design and compare them to our proposed co-design methodology for choosing the heatsink design that optimally balances thermal and bandwidth (i.e., inter-tier communication density) design constraints. We compare the feasibility region and the maximum feasible performance and energy efficiency using the three heatsink design methodologies.

We exhaustively simulate all unique combinations of the architectural design variables in Table 5.4 using 12 parallel software workloads from the SPLASH-2 [84] and PARSEC [85] benchmark suites. For each architecture-benchmark pair we evaluate the performance (instructions per unit time) and power using the evaluation methodology from Chapter 3. For this study we use a fixed single-layer core floorplan topology.

Table 5.4: Study 4: Architectural Design Space
  Cores               {16, 32, 64}
  Clock Rate          {3.0, 3.6} GHz
  Memory Controllers  {0.125, 0.25, 0.5} per Core

For a given architecture-benchmark pair, the performance is normalized to the performance of the baseline architecture (64-core, 32 MC, 3.6 GHz). Normalized performance is averaged across all benchmarks to yield a single performance number for each CPU architecture. Similarly, the dynamic and leakage power of each component of a CPU design is averaged across all benchmarks, yielding a single power map for each architectural design point. This power map is fed into the pin-fin thermal simulator (Section 5.2.6) to generate a unique thermal map and leakage power estimate for each heatsink design enumerated in Table 5.2.

Figure 5.9: Normalized metrics of 3D CPU architectural design space

5.2.8 Architectural Parameter Sensitivity

The normalized performance, total power and energy efficiency of our CPU designs are shown in Figure 5.9⁴. As the number of cores increases, both performance and power increase drastically, due to the highly parallel nature of the simulated workloads. Likewise, as cores per MC decreases (i.e., the number of MCs increases for a fixed number of cores), power and performance increase due to higher memory bandwidth and parallel memory access, leading to higher core utilization. These trends are more or less the same for both frequencies, with the higher frequency offering higher performance at the expense of higher power. We calculate the energy efficiency of each design point as Performance² / Power, which is similar to the inverse of the energy-delay-product (EDP) metric.

⁴ Total power and energy efficiency depend on leakage and micro-fluidic pumping power, which is a function of heatsink design. However, the trends did not substantially change across heatsink designs, so only the data generated by our proposed co-design methodology is shown in the figure.
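As a small illustration of the metric bookkeeping just described, the snippet below normalizes per-benchmark performance to the baseline architecture, averages it, and forms the Performance²/Power efficiency metric; the numeric values are placeholders, and only the formulas follow the text.

# Metric aggregation used in Sections 5.2.7-5.2.8 (numbers below are placeholders).
raw_perf_ipns = {"fft": 6.8, "radix": 41.0, "barnes": 54.0}       # one candidate architecture
baseline_perf_ipns = {"fft": 8.0, "radix": 50.0, "barnes": 68.0}  # 64-core, 32 MC, 3.6 GHz baseline
avg_power_w = 145.0   # benchmark-averaged total power of the candidate (placeholder)

normalized = [raw_perf_ipns[b] / baseline_perf_ipns[b] for b in raw_perf_ipns]
norm_perf = sum(normalized) / len(normalized)      # single number per architecture
efficiency = norm_perf ** 2 / avg_power_w          # Performance^2 / Power, ~ inverse EDP

print(f"normalized performance = {norm_perf:.3f}, efficiency metric = {efficiency:.5f}")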
5.2.9 Heatsink Parameter Sensitivity

Each cooling design has a unique cooling capacity and maximum bandwidth capacity. The cooling capacity is modeled using the pin-fin thermal model (Section 5.2.6) and the maximum BW capacity is modeled in Equation (5.4). Likewise, each CPU architectural design has a unique bandwidth requirement as modeled in Equation (5.3). A heatsink-architecture pair is considered to be thermally feasible if the maximum temperature is less than T_violation = 85 °C. A heatsink-architecture pair is considered to be bandwidth feasible if the required TSV density is met by the heatsink (i.e., D_P ≥ D_T). Only heatsink-architecture pairs that meet both feasibility constraints are considered as feasible design choices.

Figure 5.10: Maximum feasible performance and energy efficiency vs. pin pitch

Figure 5.10 shows the maximum feasible performance and energy efficiency within the architectural design space as a function of the micro-fluidic heatsink pin pitch. We plot the maximum performance (energy efficiency) subject to the BW and thermal constraints separately, and then show the maximum performance (energy efficiency) subject to both constraints. We see that both metrics peak somewhere in between the maximum and minimum pin pitch, where the optimal balance is struck between the thermal and bandwidth feasibility regions.

In this study, the intersection of the thermal and bandwidth feasibility regions is largest between 400 and 500 µm, thus unlocking more high performance and energy efficient 3D CPU architectures. Note that when different architectural parameters and physical parameters such as the floorplan are considered, the optimal pin pitch value may change, but the fundamental trade-off between cooling and bandwidth as a function of pin pitch will remain and require co-design optimization.

5.2.10 Results

Finally, we analyze the architectural design space using three schemes for assigning a separate heatsink design to each architectural design point. The first two schemes are examples of naïve methods that might be used in the absence of a comprehensive co-design methodology. These involve simply designing the heatsink independently of the logic architecture. Thus they apply the same heatsink parameters across the design space. The third scheme is our proposed co-design method, which designs a unique heatsink for each CPU architecture in order to maximize feasible performance or energy efficiency. The considered schemes are as follows:

1. "Max Cooling": Choose a fixed heatsink design for all architectures that minimizes peak temperature.

2. "Max BW": Choose a fixed heatsink design for all architectures that maximizes bandwidth capacity (i.e., pin density).

3. "Co-design": Choose a separate heatsink design for each architecture that minimizes leakage power⁵ while maintaining thermal and BW feasibility.

⁵ We minimize leakage power to maximize energy efficiency, since dynamic power and performance are not affected by heatsink design.
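The "Co-design" rule in item 3 reduces to a small selection step per architecture: discard pitches that are thermally or bandwidth infeasible, then keep the one with the lowest leakage. The sketch below shows that step with hypothetical per-pitch results (in the study these come from the pin-fin thermal model and Equations (5.3) and (5.4)).

# One architecture's per-pitch evaluation results (hypothetical numbers).
# Each entry: pin pitch in um -> (peak temperature C, D_P capacity, leakage W).
pitch_results = {
    250: (88.0, 0.057, 36.0),
    300: (84.0, 0.039, 30.0),
    400: (80.0, 0.022, 26.0),
    500: (76.0, 0.014, 23.0),
    600: (73.0, 0.010, 21.0),
}
D_T_REQUIRED = 0.020      # this architecture's TSV density requirement (Equation (5.3))
T_VIOLATION_C = 85.0

def co_design_pitch(results, d_required):
    """Pick the feasible pitch with minimum leakage (the "Co-design" scheme)."""
    feasible = {p: vals for p, vals in results.items()
                if vals[0] < T_VIOLATION_C and vals[1] >= d_required}
    if not feasible:
        return None                       # this architecture is infeasible for every pitch
    return min(feasible, key=lambda p: feasible[p][2])

print("chosen pin pitch:", co_design_pitch(pitch_results, D_T_REQUIRED), "um")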
Figure 5.11: Thermal feasibility region (shown in white)

Figure 5.12: Bandwidth feasibility region (shown in white)

Figure 5.13: Thermal-bandwidth feasibility region (shown in white)

Figures 5.11 and 5.12 respectively show the thermal and bandwidth feasibility regions of the architectural design space using the three schemes discussed above. We can observe that "Max BW" makes the entire design space bandwidth feasible, but offers a very small thermal feasibility region. Alternatively, "Max Cooling" offers a large thermal feasibility region but a very restrictive bandwidth feasibility region. "Co-design" is able to match the thermal feasibility of "Max Cooling" while drastically increasing the bandwidth feasibility region, leading to the largest overall feasibility region among the three schemes. Thus the "Co-design" scheme unlocks more high performance and energy efficient designs than the two naïve schemes. The optimal feasible architectural design under each heatsink design scheme is designated as "OPT" in Figures 5.11 through 5.13. The optimal design is determined by cross-referencing the feasibility regions with the performance and energy efficiency results shown in Figure 5.9⁶.

⁶ In our study the same design is optimal in both performance and efficiency; however, it is certainly possible (even likely) that two different designs could have been optimal in the two different metrics if a different physical or architectural design space were considered.

Table 5.5: Normalized Co-design Results
  Metric                      Max Cooling   Max BW   Co-design
  Optimal Performance         0.70x         0.81x    1.00x
  Optimal Energy Efficiency   0.82x         0.94x    1.00x
  Optimal Number of Cores     16            32       32
  Optimal Cores per MC        2             4        2
  Optimal Frequency (GHz)     3.6           3.0      3.0
  Chosen Pin Pitch (µm)       600           250      500

A comparison of the maximum feasible performance and energy efficiency of the architectural design space using the three heatsink design schemes is shown in Table 5.5. Numbers in this table have been normalized to "Co-design". The results show that co-design of the 3D CPU architecture and the micro-fluidic pin-fin heatsink can achieve significant improvements by optimally balancing the trade-off between TSV density and cooling capacity. The optimal design points are enumerated in the table, and illustrated in Figures 5.11 through 5.13.

We observe that "Max Cooling" in fact achieves the worst performance and energy efficiency because the TSV density is so restricted as to not allow core stacking (i.e., the number of cores was restricted to only 16, which is the maximum that can be accommodated on one layer). Although the additional cooling did facilitate a higher frequency, it was not able to achieve good performance due to the limits on core scaling.

Alternatively, "Max BW" was unable to accommodate sufficient MCs due to thermal constraints. "Co-design" chooses a heatsink pin-fin pitch in between the pitches chosen by the naïve schemes, thus providing sufficient cooling to accommodate many MCs while maintaining sufficient bandwidth to accommodate core stacking.

5.3 Summary

In this chapter we introduce the physical optimization algorithms discussed in Chapter 3 into our evaluation of the 3D CPU architectural design space. Section 5.1 introduces reliability constraints on top of thermal constraints and studies their effect on the feasibility region of the CPU design space at hand. The impact of different floorplan objective functions is reported, and the conclusion is that all metrics of interest (in this case temperature and reliability) must be considered simultaneously during physical design to select the optimal feasible architectural design point. Furthermore, the microchannel heatsink optimization technique from Section 3.10 is evaluated and shown to offer significant cooling improvements for a fixed pumping power, while blindly increasing pumping power with a uniform MF heatsink is shown to be inefficient.
Section 5.2 examines the trade-off between TSV bandwidth and cooling capacity which is inherent to MF heatsinks, especially pin-fin MF heatsinks. The optimal heatsink design will be different for different architectural and physical CPU designs, with their unique cooling and TSV density requirements. We show that a simple fixed heatsink design focusing on maximizing either cooling or bandwidth will fail to realize the true potential of the design space at hand.

Chapter 6: Design Space Modeling for Physically Constrained 3D CPUs

Design space exploration (DSE) involves the evaluation of a multitude of design choices prior to detailed implementation. Such a technique is necessary to identify regions of interest in the design space and perform educated trade-off analysis of conflicting objectives. In its simplest form, DSE can be performed by exhaustively simulating the entire design space. However, as CPU designs become ever more complex in the pursuit of Moore's law performance scaling, the DSE problem has become increasingly intractable as the design space grows combinatorially in the number of design parameters. Exhaustive simulation across such large design spaces is inefficient and potentially infeasible or unaffordable in terms of runtime.

Past work has attempted to overcome the computational infeasibility of exhaustive simulation in two ways. One is to reduce simulation time by orders of magnitude using techniques such as host-compiled simulation [118] or statistical simulation [119]. Although these approaches can make exhaustive simulation possible, the accuracy of such fast simulation techniques is reduced, and the applicability of the techniques is limited in scope. Another approach to the DSE problem is to simulate only a small subset of the full design space and use modeling techniques to predict the properties of un-simulated designs. Modeling approaches [120-123] have shown promising results on large architectural design spaces.

Vertical integration of circuits (3D ICs) moves the architectural design problem into uncharted territory where traditional domain knowledge and designer intuition may no longer apply. Moreover, past work [12, 29] has shown that 3D CPU architectural design choices have a profound impact on physical properties such as power, area and temperature, and that significant portions of the 3D CPU design space can be infeasible due to physical constraint violations. 3D integration significantly complicates the DSE problem as follows:

• 3D integration brings many new architectural opportunities that significantly compound the intractability of exhaustive simulation.

• The effects of these new architectures on the design trade-off space are currently not well understood.

• 3D ICs are more thermally sensitive to architectural changes than equivalent 2D chips due to their physical structure [27, 29].

• 3D ICs can eliminate communication bottlenecks that are inherent in 2D ICs, making performance and power more sensitive to architectural changes [8].

• Bug fixes late in the design cycle due to poor architectural design choices can be more costly in 3D ICs because of the higher interconnectivity and density of circuit components and the resource conflicts between transistors and vertical vias [30, 31, 114].

Physically aware DSE is becoming more important, especially in the context of 3D ICs.
Past work [29, 103, 124] has examined the effect of physical constraints on a CPU design space, but has only done so with exhaustive simulation over a small design space. On the other hand, the literature on design space modeling [120-123] has only attempted to model optimization variables such as performance or energy efficiency, with no consideration of physical constraints.

In this chapter we introduce a modeling and simulation technique for 3D CPUs. The proposed technique models physical properties (e.g., power, area and temperature) and traditional optimization metrics (e.g., instructions per second or energy-delay-product). The technique uses these models to direct simulation effort towards user-defined regions of interest in the design space for the purpose of identifying interesting trends such as the Pareto optimal trade-off curve. Our models accurately predict the performance and temperature of a diverse 3D CPU design space and identify the optimal feasible design point (Pareto optimal design set) with 100% (98%) accuracy while simulating less than 2% (5%) of the design space.

This chapter is laid out as follows. Section 6.1 gives a detailed overview of related work and Section 6.2 enumerates the contributions this work makes to the research effort. Section 6.3 introduces our modeling and simulation approach for identifying the design space region of interest to the designer and accurately estimating optimization metrics and physical properties while only simulating a small subsection of the space. Section 6.4 explains the experimental setup of our studies, and Section 6.5 presents the results which demonstrate the effectiveness and accuracy of our DSE modeling and simulation technique using two case studies. Finally, Section 6.6 concludes the chapter with a summary.

6.1 Previous Work

As the CPU design space has become increasingly large, exhaustive simulation has become computationally infeasible. Methodologies to facilitate large scale DSE have taken two orthogonal approaches: drastically reduce simulation time, or produce models of un-simulated design points using simulation data from a small subset of the design space.

The works by Genbrugge and Eeckhout [119] and Perelman et al. [125] attempt to significantly reduce simulation time with statistical simulation, which entails constructing a short code sequence that is representative of a full workload. Other work by Gandhi et al. [118] uses host-compiled simulation, which natively executes workloads that have been annotated with performance and power data generated offline using system models. Both techniques massively reduce simulation time, but at the cost of reduced accuracy and limited applicability.

Design space modeling likewise trades off accuracy, this time for reduced simulation time, by omitting simulation of certain design points and instead estimating those points using modeling techniques. Historically, design space modeling techniques [120-123] have used uniform random sampling to build models of the entire design space. However, there is a missed opportunity here. A significant advantage of modeling approaches is the ability to control the accuracy of the model in different regions of the design space, which we refer to as directed simulation. This is important because it is often the case that the accuracy of the simulations is only important in a small subset of the design space, such as the Pareto front for the design objectives at hand, or the region of physically feasible design points.
Directed simulation can improve the efficiency of a design space modeling technique by achieving sufficient model accuracy in the region of interest while using significantly fewer simulations than random sampling.

Different modeling techniques have been proposed to accurately estimate the properties of a design space. Early work by Joseph et al. [123] used linear regression to model instructions per cycle (IPC) across a 23-variable CPU design space. However, only two factors of each variable were considered, and the accuracy of the generated models was not reported. Later that year two similar works by Lee and Brooks [122] and İpek et al. [121] applied spline regression and artificial neural network models to similar problems, yielding average errors less than 10% and maximum error around 50%. More recent work by Jia et al. [120] applied spline regression to GPUs. This technique reduced maximum error to around 15% and had average error in the single-digit range.

Past work has had significant limitations. Most work has attempted only to build models of the design space and not to apply those models in an efficient manner to solve design space exploration problems of interest to a designer. Moreover, no work until now has attempted to use modeling to estimate the physical feasibility region of the design space, or to provide a generic and systematic framework for solving a multitude of DSE problems involving discovery of a region of interest in the design space. Our proposed technique leverages the observation that it is inefficient to model the entire design space when only a small subset of the design space is physically feasible, or when many of the design points represent low quality configurations that should be trimmed from the design space.

Finally, past work has only been applied to traditional computer architectures where a large amount of domain knowledge and intuition exists. 3D CPUs are a new frontier of computer architecture research, and their design will rely much more heavily on statistical modeling than on designer intuition. Moreover, physical constraints, especially thermal, are well known to be one of the primary limitations to the potential performance and efficiency of new 3D CPU architectures [15, 27]. Proper consideration of physical feasibility constraints during DSE must be incorporated in order to properly design the 3D CPUs of the future.

6.2 Contributions

This work makes the following contributions:

• We propose a design space modeling and simulation technique that builds regression models to identify the region of the design space that is of interest to the designer and to predict optimization metrics and physical properties within that region while only simulating a small subset of the space.

• To the best of our knowledge our work is the first to apply design space modeling techniques to 3D CPUs. 3D CPU design is expected to rely more on design space modeling than traditional CPU architectures due to a lack of designer experience and intuition regarding this emerging technology and architectural paradigm.

• To the best of our knowledge our work is the first to apply design space modeling to physical properties such as temperature to predict the feasibility region of a design space. This is extremely important for designing 3D CPUs, which are known to be heavily thermally constrained [15, 29].
• Unlike past work, our proposed modeling and simulation methodology is extendable to any arbitrary design objective and associated metrics (e.g., power, performance, area, timing, temperature) and is able to maximize the efficiency of optimization through directed simulation.

[Figure 6.1: Modeling and simulation technique (flowchart: initial random sampling → build model, greedily adding first- and second-order terms → predict metrics → define region of interest from the discovery metric → select new simulations → evaluate stopping criteria).]

6.3 Modeling and Simulation Technique

In this section we introduce our modeling and simulation technique for 3D CPU DSE subject to physical constraints. We use the smoothing spline analysis of variance (SS-ANOVA) [126] modeling technique to build models for each design property of interest (e.g., performance, temperature and power) as a composition of cubic spline functions evaluated on combinations of design variables (i.e., model terms). First we give some background on SS-ANOVA modeling and then describe our technique for building models of the 3D CPU architectural design space with a limited number of simulations. Figure 6.1 illustrates the overall flow of our modeling and simulation technique, and details are given in the subsections below. The basic flow is an iterative back-and-forth between model building and choosing new simulation points based on the constructed model predictions.

6.3.1 SS-ANOVA Modeling

A spline is a piecewise polynomial function [126]. In this work we consider cubic splines, which are piecewise cubic functions. Splines are both differentiable and continuous at the piecewise boundaries, which are called knots [126]. The smoothing spline is a technique to smooth noisy data by fitting a spline function to the data. Analysis of variance (ANOVA) is a statistical technique for analyzing the underlying source of variations in a population [126]. Multi-factor ANOVA can be used to generate models of an observed data set as a function of some underlying properties of each observation. An observation $f$ can be modeled as a function of the variables $v = x_1, x_2, \ldots, x_n$ as shown in Equation (6.1) [126]. SS-ANOVA limits the functions $\{f_1, \ldots, f_n, f_{1,2}, \ldots, f_{1,2,\ldots,n}\}$ to be spline functions which operate on some subset of the variables in $v$. Each unique subset of input variables is called a term, and the order of a term is the number of members in the subset. $C$ is the trivial function on the 0th order term (i.e., a scalar constant).

$f(v) = C + \sum_{j=1}^{n} f_j(x_j) + \sum_{j=1}^{n} \sum_{k=j+1}^{n} f_{j,k}(x_j, x_k) + \cdots + f_{1,2,\ldots,n}(x_1, x_2, \ldots, x_n)$   (6.1)

In this work we use the gss [127] package for the statistical computing environment R [128] to generate a unique smoothing spline model for each design property of interest. To generate each model, gss requires a set of simulation data and a set of model terms. However, choosing the appropriate simulation points and model terms are nontrivial problems.
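The models in this dissertation are built with the gss package in R; purely as an illustration (not the authors' implementation), the following Python sketch mimics the truncated first- and second-order form of Equation (6.1) using cubic-spline basis expansions, with a small ridge penalty standing in for the smoothing penalty. The helper name term_features, the use of scikit-learn, and the synthetic data are assumptions made only for this example.

```python
# Illustrative sketch (not the authors' implementation) of the truncated
# functional-ANOVA structure of Equation (6.1):
#   f(v) ~ C + sum_j f_j(x_j) + sum_{j<k} f_{j,k}(x_j, x_k),
# with each f_* spanned by cubic-spline basis functions. A small ridge penalty
# stands in for the smoothing penalty; the data here are synthetic.
import numpy as np
from itertools import combinations
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))             # 3 design variables
y = np.sin(2.0 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=200)

def term_features(X, terms, spline):
    """Design matrix for a set of ANOVA terms: a 1st-order term (j,) contributes
    the spline features of x_j; a 2nd-order term (j, k) contributes pairwise
    products of the spline features of x_j and x_k (a tensor-product spline)."""
    basis = [spline.fit_transform(X[:, [j]]) for j in range(X.shape[1])]
    cols = []
    for t in terms:
        if len(t) == 1:
            cols.append(basis[t[0]])
        else:
            j, k = t
            cols.append(np.einsum('ni,nj->nij', basis[j], basis[k]).reshape(len(X), -1))
    return np.hstack(cols)

spline = SplineTransformer(degree=3, n_knots=5, include_bias=False)
terms = [(j,) for j in range(3)] + list(combinations(range(3), 2))  # 1st- and 2nd-order terms
F = term_features(X, terms, spline)
fit = Ridge(alpha=1e-3).fit(F, y)                     # intercept plays the role of C
print("training R^2:", round(fit.score(F, y), 3))
```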
The choice of model terms and simulation points strongly affects the quality of the model, and suboptimal choices have a high cost in terms of total simulation time and model complexity. Our iterative technique for model term and simulation point selection is explained in detail in the following subsections.

6.3.2 Choosing Model Terms

The maximum number of terms (i.e., unique subsets of all model variables) associated with $n$ variables is $2^n$. However, as a rule of thumb a model is unreliable when the number of terms is greater than $s/20$ [129], where $s$ is the number of simulated points. If too many model terms are used, the model can suffer from over-fitting, making it very accurate with respect to the observed data, but a poor predictor of the un-simulated data we wish to predict. Thus the number of model terms must be kept relatively small in order to maintain model accuracy when the number of simulations is small. The intended goal of the modeling and simulation approach is to build accurate models while requiring only a small number of simulations, so avoidance of the over-fitting problem is of critical importance.

The coefficient of determination ($R^2$) is a commonly used metric to evaluate how well a model fits the data [130]. However, $R^2$ monotonically increases as new terms are added to a model [120]. Thus optimization of $R^2$ itself would inevitably lead to inclusion of all model terms, unnecessarily complicating the model and potentially causing over-fitting. Adjusted $R^2$ ($\bar{R}^2$) [131] (Equation (6.2)) scales $R^2$ relative to the number of model terms, $m$, and the number of data points, $s$. Thus if an additional model term is added that only marginally improves $R^2$, $\bar{R}^2$ will decrease, indicating that the added term has reduced the quality of the model. Separate models (using separate sets of model terms) are built for each design property of interest, so a separate $\bar{R}^2$ value is calculated for each model.

$\bar{R}^2 = 1 - (1 - R^2)\,\dfrac{s - 1}{s - m - 1}$   (6.2)

We use a forward-selection $\bar{R}^2$-based technique to select the terms in the model. The model building technique is similar to the technique used in [120], and is shown in the bottom half of Figure 6.1. Starting with an empty model we consider each model consisting of one first order term. We evaluate the $\bar{R}^2$ metric for each model and accept the one with the largest value. We then consider adding each remaining first order term and accept the terms that increase the quality of the model by at least θ. Model terms are added in decreasing order of model improvement, and model improvement is reevaluated each time any term is added to the model. Every time a new first order term is added to the model, we consider all second order interaction terms created by combining the new first order term with any other first order terms already in the model. Amongst all new second order terms generated this way we add any that cause the model quality to improve by at least θ. Second order terms are added to the model in a nested loop in decreasing order of model improvement. The model is complete once all first order terms have been added to the model, or when adding any new first order term causes model quality to improve by less than θ. We limit our model to terms of order two and below, although the proposed model building approach could easily be extended to include terms of arbitrary order. A minimal sketch of this greedy selection procedure is shown below.
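The sketch below assumes a plain linear fit (the fit_r2 stand-in) in place of the actual SS-ANOVA model fit, and simplifies the nested second-order loop to a single pass; the function names and synthetic data are illustrative only.

```python
# Hedged sketch of adjusted-R^2 forward selection of model terms (Section 6.3.2).
# fit_r2() is a stand-in for fitting an SS-ANOVA model restricted to the given
# terms; any regression back-end could be substituted for it.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_r2(X, y, terms):
    """R^2 of a plain linear fit on the selected terms: a 1st-order term (j,)
    contributes column x_j, a 2nd-order term (j, k) the product x_j * x_k."""
    if not terms:
        return 0.0
    cols = [X[:, t[0]] if len(t) == 1 else X[:, t[0]] * X[:, t[1]] for t in terms]
    F = np.column_stack(cols)
    return LinearRegression().fit(F, y).score(F, y)

def adjusted_r2(r2, s, m):
    # Equation (6.2): scales R^2 by the number of terms m and data points s
    return 1.0 - (1.0 - r2) * (s - 1) / (s - m - 1)

def forward_select(X, y, theta=0.0):
    s, n = X.shape
    model, best = [], 0.0
    remaining = [(j,) for j in range(n)]
    while remaining:
        # try each unused 1st-order term and keep the best if it improves by > theta
        score, term = max((adjusted_r2(fit_r2(X, y, model + [t]), s, len(model) + 1), t)
                          for t in remaining)
        if score - best <= theta:
            break
        model.append(term); remaining.remove(term); best = score
        # consider 2nd-order interactions of the new term with 1st-order terms already present
        for other in [u for u in model if len(u) == 1 and u != term]:
            pair = tuple(sorted((term[0], other[0])))
            cand = adjusted_r2(fit_r2(X, y, model + [pair]), s, len(model) + 1)
            if cand - best > theta:
                model.append(pair); best = cand
    return model, best

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=120)
print(forward_select(X, y))
```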
High order interactions are seldom significant [126], so limiting the order of our model is expected to reduce the complexity of the model and the model building procedure without incurring significant losses in accuracy.

6.3.3 Adding Simulation Points

The designer defines a discovery metric, which determines the point(s) in the design space they are interested in accurately identifying. Some examples of potential discovery metrics are the optimal design point subject to a set of constraints (e.g., design space optimization), or the set of Pareto optimal designs (e.g., trade-off analysis). The optimality metric (e.g., performance or energy efficiency), constraints (e.g., temperature, power, area or timing) and Pareto metrics (e.g., the temperature-performance trade-off curve) are defined by the designer. The goal of our proposed modeling and simulation technique is to identify these points by iteratively predicting them and concentrating simulation effort around the predicted point(s) to improve the accuracy of the prediction.

Initial models are built using a random sampling of η simulation points from the design space. Using the model predictions¹, the predicted design point(s) of interest are identified. However, due to model error, the identified point(s) are not necessarily the true points of interest. Luckily, the true points of interest are likely to be close to the predicted points of interest. Thus a region of interest (ROI) is defined which contains the design points that are close to the predicted point(s) of interest, and additional simulation effort is concentrated towards this ROI to improve model fidelity in that region. The ROI is defined as the design points close to the predicted point(s) of interest; however, the concrete definition of closeness is necessarily a function of the discovery metric. Section 6.4 introduces the specific discovery metrics and associated ROI definitions used for the case studies presented in this chapter.

¹ Design points that have already been simulated use real simulation metrics rather than predictions from models, to improve the accuracy of the method.

Each iteration of the flow identifies χ new design points from the predicted ROI and queues them for simulation. Once the simulations are performed, the model is rebuilt and the process repeats. If the initial model mispredicts the ROI, additional simulation effort in the mispredicted region will reduce model residuals in that region and cause the newly predicted ROI to move away from its original mispredicted region towards the true ROI. Thus as the modeling and simulation flow iterates, predictions of the design point(s) of interest converge towards their true values. The process terminates when a defined stopping criterion has been met.

6.3.4 Stopping Criteria

Stopping criteria could involve reaching a maximum number of simulations, or a sustained convergence in predictions of the ROI and/or point(s) of interest across multiple iterations. Since we are considering different discovery metrics with different definitions of point(s) of interest and ROI, we simply set the stopping criterion to terminate when the total number of simulations reaches ζ. However, we investigate the trade-off between number of simulations and optimality of the selected design point in Section 6.5, and the point at which prediction convergence is achieved can be observed post hoc in the results.
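As a summary of the overall flow of Figure 6.1, the following hedged sketch iterates between model building and directed sampling until ζ simulations have been spent. Here simulate, build_model and roi_members are placeholders (assumptions) for the architectural/thermal co-simulation, the SS-ANOVA model builder (gss in R) and the ROI definitions of Section 6.4; the toy usage at the bottom is entirely synthetic.

```python
# Hedged sketch of the iterative modeling-and-simulation loop of Figure 6.1.
# simulate(), build_model() and roi_members() are placeholders (assumptions):
# in the dissertation they correspond to architectural/thermal co-simulation,
# SS-ANOVA model building (gss in R) and the ROI definitions of Section 6.4.
import random

def directed_dse(design_space, simulate, build_model, roi_members,
                 eta=40, chi=5, zeta=200, seed=0):
    rng = random.Random(seed)
    simulated = {}                                    # design point -> measured metrics

    # initial random sample of eta points
    for point in rng.sample(list(design_space), eta):
        simulated[point] = simulate(point)

    while len(simulated) < zeta:                      # stopping criterion: zeta simulations
        model = build_model(simulated)                # callable: point -> predicted metrics
        # already-simulated points keep their real metrics; others use predictions
        predicted = {p: (simulated[p] if p in simulated else model(p))
                     for p in design_space}
        roi = [p for p in roi_members(predicted) if p not in simulated]
        if not roi:                                   # ROI fully simulated: fall back to random
            roi = [p for p in design_space if p not in simulated]
        for point in rng.sample(roi, min(chi, len(roi))):
            simulated[point] = simulate(point)        # queue chi new simulations per iteration

    return simulated, build_model(simulated)

# toy usage with a synthetic 1-D "design space" and an oracle stand-in for the model
space = list(range(1000))
sim = lambda p: {"perf": -(p - 700) ** 2, "temp": 60 + 0.05 * p}
build = lambda data: (lambda p: sim(p))               # stand-in: a perfect predictor
roi_top = lambda pred: sorted(pred, key=lambda p: pred[p]["perf"], reverse=True)[:50]
points, _ = directed_dse(space, sim, build, roi_top)
print(len(points), "design points simulated out of", len(space))
```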
6.4 Experimental Setup

In this section we describe the experimental setup used to evaluate the effectiveness of the modeling and simulation technique introduced in Section 6.3. In the following subsections we introduce the 3D CPU design space, the discovery metrics and associated ROI definitions considered in our case studies, and the metrics we use to measure the success of our approach. Results are presented and discussed in Section 6.5.

6.4.1 Architectural Design Space

Our study searches the architectural design space in Table 6.1. Variables with values in brackets can take on any of the bracketed values, and the cross product of all variable values represents the complete design space. The architectural design space in Table 6.1 contains 4374 unique design points.

Table 6.1: Architectural design space (baseline architecture shown in bold).
Technology node: 32 nm
Number of cores (core): {8, 16, 32}
Memory controllers: core · {1/2, 1/4, 1/8}
Clock frequency: {2.0, 3.0} GHz
NOC width: 128 bits
L2 cache size (per core): {256, 512, 1024} kB
L2 cache associativity: {2, 8, 16}
L1 cache size (per core): {16, 32, 64} kB
L1 cache associativity: 1
Pipeline width: {2, 4, 6}
Branch predictor: Tournament
Local history table: 1024 8-bit entries
Global predictor: 4096 2-bit entries
BTB size: 32 kB
BTB associativity: 1
Reorder buffer length (ROB): {96, 128, 160}
Issue queue length: 0.4 · ROB
Load-store queue length: 0.5 · ROB
Fetch queue length: 64
Int architectural registers: 0.67 · ROB
FP architectural registers: 0.33 · ROB
RAT size: ROB 8-bit entries
DRAM size: 4 GB
Cache line size: 64 B
DRAM bus width: 64 B

6.4.2 Software Benchmarks

Each architectural design point is evaluated using a set of software workloads from the SPLASH-2 [84] and PARSEC [85] benchmark suites. The performance of each design point is defined as the average normalized performance across all benchmarks, and the maximum temperature for each design point is the maximum temperature amongst all benchmarks. The specific benchmark programs used for this study are given in Table 6.2. The inputs and parameters used for each benchmark are the default settings recommended in the Multi2Sim documentation [82].

Table 6.2: Simulated workloads
SPLASH-2: water-nsquared, fft, radix
PARSEC: blackscholes, fluidanimate, dedup, swaptions

6.4.3 Discovery Metrics

The goal of our DSE study is to identify the design point(s) of interest as defined by the discovery metric chosen by the designer. Two discovery metrics are considered as case studies in this chapter, but our proposed methodology is applicable to any arbitrary discovery metric. The discovery metrics considered here are:

• "Optimal": the design point with the highest normalized performance subject to the thermal constraint $temp < T_{constraint}$.

• "Pareto": the Pareto optimal set of design points in thermal-performance space.

Thus the modeled design properties are performance and temperature. Each discovery metric defines an accompanying ROI of radius $\phi = (\phi_{perf}, \phi_{temp})$. The ROIs for the "Optimal" and "Pareto" discovery metrics are given in Equations (6.3) and (6.4)² respectively, where $perf_i$ and $temp_i$ are the performance and temperature of design point $i$ and $\Omega$ is the design space.

² Pareto optimal points are the set of points such that no other point is better in all metrics of interest. Equation (6.4) presents a ϕ-relaxed definition of Pareto optimality that includes all points such that no other point is better by a degree of ϕ in all metrics of interest.
Design point $p$ is the predicted optimal feasible point for the discovery metric "Optimal". The defined ROI is the set of points within distance ϕ of the identified point(s) of interest, and setting $\phi = (0\%, 0\,^{\circ}\mathrm{C})$ causes the ROI to degenerate into a set containing only the identified point(s) themselves. The nominal thermal constraint is $T_{constraint} = 85\,^{\circ}\mathrm{C}$; however, the impact on our results due to reduced $T_{constraint}$ is studied in Section 6.5.

$ROI_{Optimal} = \left\{ i \in \Omega \;\middle|\; \left|\dfrac{perf_i - perf_p}{perf_p}\right| \le \phi_{perf} \;\wedge\; \left|temp_i - temp_p\right| \le \phi_{temp} \right\}$   (6.3)

$ROI_{Pareto} = \left\{ i \in \Omega \;\middle|\; \forall_{(j \ne i) \in \Omega}\;\; perf_j\,(1 - \phi_{perf}) \le perf_i \;\vee\; (temp_j + \phi_{temp}) \ge temp_i \right\}$   (6.4)

6.4.4 Modeling and Simulation Parameters

The modeling and simulation technique introduced in Section 6.3 can be parametrized to make trade-offs between simulation time and optimality of the selected design point. In this study we use the following parameters:

• We sample η = 40 simulation points at random from the design space to build the initial model. The parameter η should be large enough to generate an initial model with reasonable accuracy in order to yield a reasonable approximation of the ROI. However, a large value of η would degrade the efficiency of the method as it degenerates towards random sampling. Setting η = 40 was found to be the smallest number of simulations that would allow the gss package to generate models without causing software errors, and larger values degraded efficiency.

• The threshold for accepting new model terms is $\bar{R}^2_{new} - \bar{R}^2_{current} > \theta = 0$. By increasing θ, the model complexity could be reduced at the expense of model quality.

• We use an ROI radius of $\phi = (8\%, 4\,^{\circ}\mathrm{C})$ when the discovery metric is "Optimal" and $\phi = (5\%, 3\,^{\circ}\mathrm{C})$ when the discovery metric is "Pareto". Larger values of ϕ prevent convergence to local minima, but generally increase the number of simulations. The values chosen were determined experimentally to make good trade-offs between these two properties.

• We iteratively simulate chosen design points in increments of χ = 5. Small values of χ increase the number of iterations and thus the number of times model building must be performed. Moreover, the new model is unlikely to change much if χ is very small, since only one or two new simulations do not significantly change the input to the model builder. However, excessively large values of χ will spend too much simulation effort in the current estimate of the ROI when the prediction of the ROI may change substantially after the next iteration. The value χ = 5 was found experimentally to provide a good trade-off between these two concerns.

• We use a nominal stopping criterion of ζ = 200 simulations. The trade-off of optimality vs. number of simulations is investigated in Section 6.5. The value ζ = 200 represents nearly 5% of the total design space. Simulation of significantly more points would degrade the usefulness of the proposed method, whose intended goal is to only simulate a very small subset of the space. Moreover, we find that our proposed method achieves very accurate results with fewer than 200 simulations.

6.4.5 Evaluation Metrics

The goal of the experiment is to identify the design point(s) defined by the discovery metric while minimizing the total number of simulations performed. Thus the primary metrics used to evaluate the quality of our technique are the accuracy of the identification, the number of simulations performed, and the runtime overhead of the modeling technique.
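As an illustration of how the ROI definitions of Equations (6.3) and (6.4) can be evaluated in practice, the following minimal sketch implements the two membership tests over a dictionary of (performance, temperature) values. The data layout, function names and example numbers are assumptions for illustration, and Equation (6.3) is read with an absolute relative performance difference, as written above.

```python
# Hedged sketch of the ROI membership tests of Equations (6.3) and (6.4).
# `points` maps a design-point id to (normalized performance, max temperature);
# the values may come from simulation or from the model predictions.

def roi_optimal(points, p, phi_perf, phi_temp):
    """Eq. (6.3): points within phi of the predicted optimal feasible point p."""
    perf_p, temp_p = points[p]
    return {i for i, (perf_i, temp_i) in points.items()
            if abs(perf_i - perf_p) / perf_p <= phi_perf
            and abs(temp_i - temp_p) <= phi_temp}

def roi_pareto(points, phi_perf, phi_temp):
    """Eq. (6.4): phi-relaxed Pareto set -- i is kept unless some j beats it
    by more than phi in *both* performance and temperature."""
    roi = set()
    for i, (perf_i, temp_i) in points.items():
        dominated = any(perf_j * (1 - phi_perf) > perf_i and temp_j + phi_temp < temp_i
                        for j, (perf_j, temp_j) in points.items() if j != i)
        if not dominated:
            roi.add(i)
    return roi

# toy example (all numbers are illustrative only)
pts = {"a": (10.0, 80.0), "b": (9.6, 78.0), "c": (6.0, 95.0), "d": (11.0, 99.0)}
print(roi_optimal(pts, "a", phi_perf=0.08, phi_temp=4.0))   # ROI around predicted point "a"
print(roi_pareto(pts, phi_perf=0.05, phi_temp=3.0))
```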
The accuracy of identification is defined as the distance of the identified point(s) from the actual point(s) of interest (which were obtained by exhaustive simulation solely for the purpose of evaluation). When the discovery metric is "Optimal", the distance between the identified point and the true solution is quantified as optimality, which is the ratio $perf_p / perf_o$, where $p$ is the predicted optimal feasible point and $o$ is the true optimal feasible point (determined by exhaustive simulation).

When the discovery metric is "Pareto", the distance between the identified points and the true Pareto set is quantified as accuracy, which is the average Pareto optimality of the predicted Pareto set. The Pareto optimality of design point $k$ is determined by finding the smallest value of ϕ such that $k$ is included in the ROI. Specifically, the Pareto optimality of $k$ is $\alpha_k$, and the smallest value of ϕ that includes $k$ in the ROI is $\phi = (1 - \alpha_k)\,(100\%, 60\,^{\circ}\mathrm{C})$³.

In general the optimality/accuracy of the predicted point(s) will increase as more simulations are performed, eventually degenerating into exhaustive simulation. The net speedup of our technique consists of the reduction in the total number of simulations minus the runtime overhead of building the models. However, we will show in Section 6.5 that the modeling overhead is negligible compared to the reduction in necessary simulations due to application of our approach.

³ 60 °C was roughly the thermal range of the design space considered in this work, as shown in Figure 6.3.

6.4.6 Comparison to Other Techniques

The rudimentary technique to which our technique could be compared is exhaustive simulation. However, one can conceive of a less rigorous random sampling approach to DSE in which some portion of the solution space is sampled at random and the best design amongst the sampled designs is selected⁴. Additionally, we could consider a less sophisticated modeling-only version of our proposed technique that uses SS-ANOVA model building to predict the design point(s) of interest, but simply uses random sampling to provide data to the model builder. The modeling-only approach is representative of design space modeling techniques proposed in past work [120–123]. The advantage of a modeling-only technique is that it only requires models to be built once, but we will show that the time spent building models is insignificant compared to the savings in simulation time achieved by our proposed modeling and simulation technique.

⁴ Exhaustive simulation is simply a degenerate case of random sampling where the simulated portion of the solution space is the entire space.

In Section 6.5 we compare the trade-off curves of simulation count vs. quality for the three aforementioned techniques:

• Proposed: modeling and directed simulation

• Modeling-Only: modeling and random simulation (representative of past work [120–123])

• Random Sampling: no modeling and random simulation

Since all techniques involve randomized sampling to some degree (e.g., building the initial model in our proposed technique), experiments are replicated multiple times.

6.5 Results

In this section we describe the results of our experiments. First we provide some characterization of the design space explored in our study, and then we compare the quality of the different methodologies described in Section 6.4.6 for the "Optimal" and "Pareto" discovery metrics.
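Before turning to the results, the evaluation metrics just defined can be sketched concretely as follows; the names and example values are assumptions, and pareto_alpha simply scans for the smallest scaling λ of ϕ = λ · (100%, 60 °C) that places a point in the relaxed Pareto ROI, so that α_k = 1 − λ.

```python
# Hedged sketch of the evaluation metrics of Section 6.4.5 (illustrative names/values).
# Optimality is perf_p / perf_o; Pareto accuracy averages alpha_k = 1 - lambda_k,
# where lambda_k is the smallest scaling of phi = lambda * (100%, 60 C) that places
# point k inside the relaxed Pareto ROI of Equation (6.4).

def optimality(perf_predicted, perf_true_optimal):
    return perf_predicted / perf_true_optimal

def pareto_alpha(points, k, scale=(1.00, 60.0), steps=1000):
    """alpha_k for design point k, found by scanning lambda from 0 to 1."""
    perf_k, temp_k = points[k]
    for step in range(steps + 1):
        lam = step / steps
        phi_perf, phi_temp = lam * scale[0], lam * scale[1]
        dominated = any(perf_j * (1 - phi_perf) > perf_k and temp_j + phi_temp < temp_k
                        for j, (perf_j, temp_j) in points.items() if j != k)
        if not dominated:
            return 1.0 - lam
    return 0.0

pts = {"a": (10.0, 80.0), "b": (9.6, 78.0), "c": (6.0, 95.0), "d": (11.0, 99.0)}
predicted_pareto = ["a", "c"]                      # a hypothetical predicted Pareto set
print("optimality:", optimality(perf_predicted=9.6, perf_true_optimal=10.0))
print("pareto accuracy:",
      sum(pareto_alpha(pts, k) for k in predicted_pareto) / len(predicted_pareto))
```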
[Figure 6.2: Distribution of (a) performance and (b) temperature in the design space. Panel (a): percent of designs vs. normalized performance, for all designs and for the feasible designs at 85 °C and 65 °C. Panel (b): percent of designs vs. maximum temperature (°C).]

6.5.1 Design Space Characterization

We begin by examining the properties of the design space. Exhaustive simulation was performed for the purpose of evaluation, as the design points of interest must be identified before the quality of the considered techniques can be evaluated. Exhaustive simulation took weeks to perform using university servers, further motivating the strong need for techniques such as the one proposed in this chapter in order to reduce simulation time significantly below that of exhaustive design space simulation. We provide some statistics of the design space properties in order to give context for the results of this study.

[Figure 6.3: Temperature vs. performance of the entire design space (scatter plot of maximum temperature (°C) against normalized performance).]

Figure 6.2(a) shows the distribution of normalized performance across all architectural design points. We can see that the design space is biased heavily towards the low-performance region. Furthermore, thermal feasibility constraints bias the design space even further as the constraints tighten (i.e., as $T_{constraint}$ is reduced). This implies that random sampling is not a very good technique for discovering the "Optimal" design point, since the probability of randomly sampling a high-performance, thermally-feasible design point is low. The more biased the performance distribution is towards low-performance design points, the less effective random sampling will be for finding the "Optimal" design point, and the greater the need for directed simulation. Likewise, Figure 6.2(b) shows the distribution of temperature. From this figure we can see how different values of $T_{constraint}$ will affect the size of the feasibility region of the design space.

Figure 6.3 shows a scatter plot of the performance and temperature of each design point in the design space. We can see that identification of both the optimal feasible design point and the Pareto optimal design set without exhaustive simulation is non-trivial. The vast majority of design points in the design space are far from the point(s) of interest using either discovery metric. Moreover, the correlation between performance and temperature is weak, motivating the need for independent models of each design property.

6.5.2 "Optimal" Discovery

There exists a fundamental trade-off between the number of simulations and the quality of the identified solution. We compare the random sampling and modeling-only techniques to our proposed modeling and simulation technique and show that our technique is far better, both in terms of the quality of the trade-off and the reliability of the approach. First we evaluate the techniques using the "Optimal" discovery metric.

Figure 6.4(a) tracks the optimality of the evaluated techniques as they iteratively add additional simulation points. We observe that modeling alone is a large contributor to the optimality of the identified point. With only 1% of the solution space sampled (roughly 40 points), the two modeling techniques can already identify a solution within 90% of the optimal, 38% closer to the optimal point than the prediction made by random sampling.
However, the true power of the proposed technique becomes clear as the number of samples increases. The random sampling techniques, both with and without modeling, quickly improve the optimality of the predicted design as more simulation points are added, but then eventually flatten out as additional sampling is unable to significantly improve the quality of the prediction. This diminishing returns phenomenon is not observed in our proposed modeling and simulation technique. By using models to direct simulation effort on each iteration towards the ROI, the technique is able to make roughly linear improvements to prediction accuracy for each additional simulation. Our proposed technique is able to identify the optimal feasible design point while simulating less than 2% (roughly 80 points) of the entire design space.

[Figure 6.4: Optimality of identified design. Panel (a): optimality of solution vs. number of simulations for the Proposed, Modeling-Only and Random Sampling techniques. Panel (b): required simulations vs. optimality target (90% to 99.5%) for the same three techniques.]

Figure 6.4(b) re-examines the data from the perspective of the number of simulations required to reach an optimality target. The data plotted here are on log-log axes, meaning polynomial relationships will appear as straight lines whose slope is proportional to the polynomial degree. An interesting result is that even if only 90% accuracy is required, the application of model building still reduces the total simulation time by roughly 2x compared to random sampling (saving over 100 simulation-hours in our study). This gap increases superlinearly as the optimality target increases. Furthermore, as the optimality target tightens beyond 98%, the slope of the trendline for the modeling-only technique significantly increases as the technique begins to degenerate into random sampling. On the other hand, our proposed technique shows no such degeneration.

6.5.2.1 Robustness to Constraint Tightness

The previous results were evaluated at $T_{constraint} = 85\,^{\circ}\mathrm{C}$. However, as Figure 6.2(a) shows, reducing $T_{constraint}$ significantly reduces the size of the thermal feasibility region. It is expected that this will reduce the quality of the random sampling technique significantly, but it is unclear how shrinking the feasibility region will affect the techniques that use model building. The fundamental question here is how the size of the feasibility region affects the quality of the different techniques. Although that question is investigated in this study by simply tightening the thermal constraints, it is logically equivalent to considering a lower-performance heatsink, which would cause many design points to become thermally infeasible due to elevated temperatures. Moreover, heterogeneous integration in 3D ICs may introduce thermal constraints at substantially lower temperatures than those used for CMOS logic. Reducing $T_{constraint}$ was a simple way to consider the effect of design space constraints without requiring re-simulation of the entire solution space.

[Figure 6.5: Additional simulations required when $T_{constraint}$ is reduced from 85 °C to 65 °C (additional simulations vs. optimality target for the Proposed, Modeling-Only and Random Sampling techniques).]

Figure 6.5 plots the number of additional simulations required when $T_{constraint}$ changes from 85 °C to 65 °C.
We notice that the number of additional simulations required for our proposed method is less than 30 (under 1% of the entire design space) and, moreover, remains roughly constant as the optimality target is tightened. On the other hand, random sampling and modeling-only both require superlinearly increasing amounts of additional simulations in order to meet optimality targets. Although model building in and of itself does significantly reduce the amount of overhead compared to random sampling, the point at which additional simulation effort begins to show diminishing returns now occurs when the optimality target reaches roughly 95%, reducing the scalability of this approach in heavily constrained design spaces. The conclusion is that our proposed technique is nearly independent of the size of the design space feasibility region due to the application of directed sampling, whereas techniques that use random sampling become less effective as the feasibility region shrinks.

[Figure 6.6: Accuracy of identified Pareto set. Panel (a): accuracy of prediction vs. number of simulations for the Proposed, Modeling-Only and Random Sampling techniques. Panel (b): required simulations vs. accuracy target (90% to 99.5%) for the same three techniques.]

6.5.3 "Pareto" Discovery

Figures 6.6(a) and 6.6(b) show the accuracy of the considered methods when the "Pareto" discovery metric is applied. Although the general trends and relative ordering of the method results are similar to the "Optimal" case, there are some significant differences. The most obvious difference is that the quality of both model-based techniques is reduced. Identification of a set of Pareto points is a more challenging problem, and it makes sense that more simulation would be required to identify the true Pareto design set. However, the relative improvement of our proposed technique vs. the modeling-only technique is substantially increased, indicating the increased need for directed simulation for more complex design space modeling and exploration problems such as identification of the Pareto design set.

Another interesting difference is that modeling-only degenerates into random sampling much sooner than it did for the "Optimal" discovery metric. The conclusion here is that models built with random sampling can approximate a single design much better than the relative ordering of all design points. Directed simulation towards the ROI is of utmost importance for estimation of the Pareto design set, even for rather loose accuracy targets.

Finally, we observe that random sampling has roughly the same trade-off curve whether predicting a single optimal feasible point or the entire Pareto optimal set. However, the modeling-based approaches both perform significantly better for the "Optimal" discovery metric, which is the simpler problem⁵. This implies that random sampling (and by extension exhaustive sampling) is failing to take advantage of the significantly different degrees of problem complexity to efficiently find a solution. Our technique is able to take advantage of the reduced complexity across all accuracy targets, and a modeling-only approach is able to take the same advantage when the accuracy target is low.

⁵ In fact the "Optimal" discovery metric problem is a sub-problem of the "Pareto" discovery metric problem, but with significantly reduced complexity.
6.5.4 Overhead of the Modeling Approach

There is obviously some runtime overhead for building the model in the proposed modeling approaches. We observed that the time consumed building models in our proposed approach was less than the time consumed to simulate a single design point (< 0.025% of the design space). Figure 6.4(b) clearly shows that this overhead is negligible compared to the savings in the number of required simulations relative to random sampling.

6.6 Summary

In this chapter we propose a modeling and simulation technique to apply the co-simulation and co-optimization techniques explored in the previous chapters to a large design space where exhaustive simulation of the architectural design space is not computationally feasible. We use smoothing spline ANOVA to build models of the metrics of interest across the entire design space using simulation data from only a small subset of the space. We iteratively build models and use these models to choose new simulations that will improve the accuracy of the model in the region of interest to the designer, such as the optimal feasible design point or the Pareto optimal front. Our proposed methodology is applied to an eight-dimensional 3D CPU design space and tasked to discover the optimal feasible point and the Pareto optimal set of designs. Using less than 5% of the design space, we are able to identify both objectives with an accuracy of over 98%.

Chapter 7: Conclusions and Future Work

In Chapter 1 we introduce 3D integration as a promising new technology that can overcome some of the fundamental roadblocks to CPU performance scaling, such as interconnect power and delay dominance, the slowdown of economic incentives for technology scaling, and the fundamental physical limits of technology scaling due to quantum effects. We cite thermal and reliability concerns as first-tier limitations to 3D IC technology, and discuss the fundamental interconnectedness of many metrics of interest and physical constraints in modern ICs. This interconnectedness is only exacerbated by 3D stacking, and we introduce the co-design paradigm as a systematic methodology for addressing the simultaneous modeling and optimization of many design metrics and their interdependence on each other as well as on design variables.

In Chapter 2 we explain 3D integration technology and provide more detailed analysis of the potential opportunities of 3D CPUs, including massive memory bandwidth and highly connected on-chip inter-core communication networks. Such architectural advancements offer an opportunity to overcome the memory and communication walls. We detail the thermal and reliability concerns in 3D integration and introduce micro-fluidic cooling as a potential solution.

Chapter 3 introduces the co-simulation and co-optimization flow used to evaluate a given architectural-physical design space throughout the many experiments presented in this dissertation. The flow models performance, power, timing, reliability and temperature. This chapter also introduces the physical optimization loops evaluated in Chapter 5, which can be driven by objective functions composed of arbitrary combinations of simulated design metrics.

Chapter 4 presents the results of two studies that quantitatively show the potential performance opportunities of stacked memory-on-logic CPUs and the associated need for micro-fluidic cooling.
The first experiment finds that 3D stacking has the potential to improve performance significantly, but without proper cooling may actually reduce performance in order to meet thermal constraints. The second experiment explores the possibility of a return to a frequency scaling paradigm in parallel with the core-scaling scheme in place today. This is made possible by the combination of high bandwidth architectures and micro-fluidic cooling.

In Chapter 5 we apply the physical optimization algorithms introduced in Chapter 3 and demonstrate the need for and advantages of simultaneous simulation and optimization of a multitude of design metrics, and the impact of their interdependence. We also introduce a new trade-off unique to MF cooled 3D ICs, between inter-layer via density (i.e., inter-layer bandwidth) and cooling capacity.

Finally, Chapter 6 brings together the co-design simulation scheme and proposes a way to realistically apply it across a real-world design space where exhaustive simulation is not computationally feasible. We propose a modeling and simulation framework that is able to apply the co-design paradigm over a large design space while only simulating a small subset of design points. Our method can discover the user-defined architectural regions of interest with over 98% accuracy while only requiring simulation of 5% of the design space.

7.1 Future Work

This dissertation significantly advances the emerging co-design paradigm, and represents a prototype application of co-design in a holistic and comprehensive simulation and optimization framework. However, being an emerging design paradigm coupled with an exciting new technology, there are obviously many exciting avenues for future work in this field. Significant expansion of the scope of our work can be achieved by introducing models of heretofore un-modeled phenomena and improving (e.g., adding granularity and inter-metric coupling) the existing models. Furthermore, an open research question is how to efficiently model interaction relationships to best balance design time with quality. The extension of the co-design paradigm to low level detailed design will inevitably be addressed in future research; however, our work sets the groundwork with a comprehensive high-level abstract implementation. Finally, our work investigates the application of the co-design paradigm to design-time decision making, but it can equally be applied to run-time management, and the interaction and simultaneous application of these two domains will certainly be the ultimate goal of the research effort begun in this dissertation.

7.1.1 Expansion of Co-Design Scope

The work presented in this dissertation has covered significant ground towards an implementation of the co-design paradigm. However, it is by no means exhaustive. There are other significant interconnected design challenges and metrics that are not considered here, such as power delivery and signal integrity. In reality the co-design relation graph presented in Figure 1.2 is only a sub-graph of the true scope of the interconnected relationships involved in chip design. Due to the finite nature of compute resources and the need to find efficient trade-offs between design time and design quality, not every relationship can be considered in a real implementation of the co-design paradigm.
However, the decision of which relationships to model and which to ignore is domain specific, and as of yet there is no methodology in place to quantitatively decide how to construct the co-design simulation structure (i.e., how to choose the sub-graph of the true global relationship graph to include in a co-design implementation). Development of such a methodology would be a significant contribution for future work in this area, and would advance the work towards industrial-scale applicability for arbitrary design problems. In the following subsections we discuss two important design problems that are expected to limit the further advancement of 3D IC technology once the thermal and reliability concerns are overcome. Modeling and optimization of these design problems would be a logical next step in expanding the scope of the proposed co-design framework put forth in this dissertation.

[Figure 7.1: PDN model in a 3D IC (VDD supply from the PCB and package through µ-bumps, an on-chip mesh repeated across n tiers, and P/G TSVs).]

7.1.1.1 Power Delivery

In a 3D IC, power is delivered from the off-chip package through C4 bumps and then distributed vertically through power TSVs. Figure 7.1 illustrates a 3D PDN circuit model, which consists of three parts: the PCB, the package and the on-chip circuits. The on-chip circuit is modeled as a meshed RLC network capturing the voltage distribution in both vertical and planar directions.

The vertical structure of a 3D PDN brings several new challenges. First, as 3D integration enables stacking multiple functional layers vertically, power scales volumetrically with the product of footprint area and number of layers. However, the number of power delivery pins (i.e., the power delivery capacity) is a function of footprint area only. This imbalance between power supply and demand makes maintenance of high quality voltage rails a challenging problem. Second, the parasitics of power/ground TSVs affect the resonant frequency of each layer, thus influencing the power noise characteristics in 3D ICs; since the current draw in 3D ICs has significant spatial variation, the PDN noise also varies greatly across the chip. Third, the stacking structure of 3D ICs enables power noise from one layer to couple into neighboring layers. For example, when CPUs at different layers share the same PDN, one active CPU core can affect the voltage level of another core on a different layer. Fourth, in an air-cooled 3D IC, the heatsink and the power delivery pins are almost always on opposite ends of the chip stack. This means there is a trade-off in that the chip layer with the most cooling capacity (i.e., closest to the heatsink) will also be the layer with the worst power integrity, and vice versa. This necessitates aggressive management and design methodologies considering both power delivery and temperature.

7.1.1.2 Signal Integrity

Another design challenge in 3D ICs is to ensure signal voltage noise is maintained within design margins. Cross coupling between switched devices can cause increased leakage/short circuit currents and possibly result in digital glitches that affect circuit behavior or cause incorrect computations. In addition to the traditional sources of coupling noise (wires and transistors), TSVs provide a new coupling source in 3D ICs. TSVs have the potential to be more problematic than planar wires since they are much larger, and surrounded by a much thinner insulation layer [20, 21]. TSVs can easily couple into the conductive silicon substrate through the thin oxide liner around the TSV [23].
From there the voltage noise can couple into other TSVs or transistors through the conductive substrate.

[Figure 7.2: TSV-TSV coupling circuit model (two TSVs with oxide liners coupling through the substrate).]

Figure 7.2 shows a circuit model of coupling between two TSVs. TSV coupling is most strongly affected by the liner capacitance, which is independent of the distance between TSVs [23]. Thus, TSV coupling is not efficiently mitigated by increasing TSV pitch. Liu et al. [23] show that increasing TSV pitch from 1 µm to 20 µm (a 20x increase) only reduced TSV coupling from 255 mV to 225 mV (a 12% reduction).

We have done extensive work on modeling and reducing TSV-TSV coupling noise [20, 21, 132, 133], but this work is at this point outside the scope of this dissertation since it operates at the global placement layer of abstraction. However, by applying the co-design paradigm to more fine-grained detailed physical design (Section 7.1.2), our past work on TSV coupling could be easily integrated into the co-design paradigm.

7.1.2 Fine-Grained Design and Integration

The work in this dissertation has attacked the co-design problem at a high level of abstraction. The architectural design knobs considered were macro-architectural parameters and the physical design space consisted of high-level abstract floorplanning.
Such adaptive architectures will become neces- sary in the future due to the Dark Silicon effect [87]. Even micro-fluidic heatsinks can benefit from runtime control [79]. Although the placement and dimensions of fluid cavities are determined at design time, the fluid flow rate can be toggled, especially in conjunction with DVFS and task migration techniques, and micro-values can be designed to give runtime control of which cavities fluid is pumped through [135]. Runtime management is an orthogonal but not an independent means of chip co-design. The scope of runtime techniques available are inherently decided at de- sign time, and the existence of adaptive control can allow co-design methodologies to target average rather than worst case design, opening up significant average per- formance improvements while still guaranteeing worst case viability. 153 Bibliography [1] C. Serafy, Z. Yang, Y. Hu, A. Srivastava, and Y. Joshi. Thermo-electric co- design of 3d cpus and embedded micro-fluidic pin-fin heatsinks. Xysign hyst, ]EEE, PP(99):1–1, 2015. [2] Sheng Li, Jung Ho Ahn, Richard D Strong, Jay B Brockman, Dean M Tullsen, and Norman P Jouppi. Mcpat: an integrated power, area, and timing mod- eling framework for multicore and manycore architectures. In aiwrourwhitywA tury, FDDMB a]WfcAHFB HFnx Unnuul ]EEECUWa ]ntyrnutionul gymposium on, pages 469–480. IEEE, 2009. [3] Benjamin Sutherland. No moore? a golden rule of microchips appears to be coming to an end. hhy Ewonomist, 2013. [4] Toshihiko Osada and Milt Godwin. International technology roadmap for semiconductors. 1999. [5] Nir Magen, Avinoam Kolodny, Uri Weiser, and Nachum Shamir. Interconnect- power dissipation in a microprocessor. In drowyyxings of thy FDDH ]ntyrnutionul korkshop on gystym Lyvyl ]ntyrwonnywt dryxiwtion, SLIP ’04, pages 7–13, New York, NY, USA, 2004. ACM. [6] J Hennessy and D Patterson. Memory hierarchy design. Womputyr UrwhitywA turyN U euuntitutivy Upprouwh, pages 390–525, 2011. [7] B. Feero and P.P. Pande. Performance evaluation for three-dimensional networks-on-chip. In VLg], FDD7B ]gVLg] 'D7B ]EEE Womputyr gowiyty UnA nuul gymposium on, pages 305–310, March 2007. [8] C. Serafy, Bing Shi, A. Srivastava, and D. Yeung. High performance 3d stacked dram processor architectures with micro-fluidic cooling. In GX gystyms ]ntyA grution Wonfyrynwy (GX]W), FDEG ]EEE ]ntyrnutionul, pages 1–8, Oct 2013. 154 [9] Gabriel H. Loh. 3d-stacked memory architectures for multi-core processors. In drowyyxings of thy GIth Unnuul ]ntyrnutionul gymposium on Womputyr UrwhiA tywtury, ISCA ’08, pages 453–464, Washington, DC, USA, 2008. IEEE Com- puter Society. [10] G.L. Loi, B. Agrawal, N. Srivastava, Sheng-Chih Lin, T. Sherwood, and K. Banerjee. A thermally-aware performance analysis of vertically integrated (3-d) processor-memory hierarchy. In Xysign Uutomution Wonfyrynwy, FDDJ HGrx UWaC]EEE, pages 991–996, 2006. [11] S.H. Pugsley, J. Jestes, Huihui Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, and Feifei Li. Ndc: Analyzing the impact of 3d-stacked memory+logic devices on mapreduce workloads. In dyrformunwy Unulysis of gystyms unx goftwury (]gdUgg), FDEH ]EEE ]ntyrnutionul gymA posium on, pages 190–200, March 2014. [12] Caleb Serafy, Ankur Srivastava, and Donald Yeung. Unlocking the true po- tential of 3d cpus with micro-fluidic cooling. In drowyyxings of thy FDEH ]nA tyrnutionul gymposium on Low dowyr Elywtroniws unx Xysign, ISLPED ’14, pages 323–326, New York, NY, USA, 2014. ACM. 
[13] Feihui Li, Chrysostomos Nicopoulos, Thomas Richardson, Yuan Xie, Vijaykr- ishnan Narayanan, and Mahmut Kandemir. Design and management of 3d chip multiprocessors using network-in-memory. In drowyyxings of thy GGrx Unnuul ]ntyrnutionul gymposium on Womputyr Urwhitywtury, ISCA ’06, pages 130–141, Washington, DC, USA, 2006. IEEE Computer Society. [14] Jie Meng, K. Kawakami, and A.K. Coskun. Optimizing energy efficiency of 3-d multicore systems with stacked dram under power and thermal constraints. In Xysign Uutomution Wonfyrynwy (XUW), FDEF HMth UWaCEXUWC]EEE, pages 648–655, 2012. [15] Gabriel H. Loh, Yuan Xie, and Bryan Black. Processor design in 3d die- stacking technologies. aiwro, ]EEE, 27(3):31–48, May 2007. [16] Yue Zhang, A. Dembla, Y. Joshi, and M.S. Bakir. 3d stacked microfluidic cooling for high-performance 3d ics. In EWhW'EF, pages 1644–1650, May 2012. [17] Tiantao Lu and Ankur Srivastava. Detailed electrical and reliability study of tapered tsvs. In dhysiwul Xysign for GX ]ntygrutyx Wirwuits, pages 39–52. CRC Press, 2015. [18] Tiantao Lu, Zhiyuan Yang, and Ankur Srivastava. Electromigration-aware placement for 3d-ics. In drowyyxings of thy FDEJ intyrnutionul symposium on euulity Elywtroniw Xysign. ACM, 2016. 155 [19] Jiwoo Pak, Mohit Pathak, Sung Kyu Lim, and David Z Pan. Modeling of electromigration in through-silicon-via based 3d ic. In Elywtroniw Womponynts unx hywhnology Wonfyrynwy (EWhW), FDEE ]EEE JEst, pages 1420–1427. IEEE, 2011. [20] Caleb Serafy, Bing Shi, and Ankur Srivastava. A geometric approach to chip- scale TSV shield placement for the reduction of TSV coupling in 3d-ics. ]ntyA grution, thy VLg] Journul, (0):–, 2013. [21] C. Serafy and A. Srivastava. Tsv replacement and shield insertion for tsv- tsv coupling reduction in 3-d global placement. ]EEE hWUX, 34(4):554–562, April 2015. [22] J. Cho, E. Song, K. Yoon, J.S. Pak, W. Kim, J. J. Lee, H. Lee, et al. Modeling and analysis of through-silicon via (tsv) noise coupling and suppression using a guard ring. Womponynts, duwkuging unx aunufuwturing hywhnology, ]EEE hrunsB on. [23] Chang Liu, Taigon Song, Jonghyun Cho, Joohee Kim, Joungho Kim, and Sung Kyu Lim. Full-chip tsv-to-tsv coupling analysis and optimization in 3d ic. In drowyyxings of thy HLth Xysign Uutomution Wonfyrynwy, DAC ’11, pages 783–788, New York, NY, USA, 2011. ACM. [24] Taigon Song, Chang Liu, Yarui Peng, and Sung Kyu Lim. Full-chip multiple tsv-to-tsv coupling extraction and optimization in 3d ics. In drowyyxings of thy IDth Unnuul Xysign Uutomution Wonfyrynwy. ACM. [25] Jun So Pak, Joohee Kim, Jonghyun Cho, Kiyeong Kim, Taigon Song, Seungy- oung Ahn, Junho Lee, Hyungdong Lee, Kunwoo Park, and Joungho Kim. Pdn impedance modeling and analysis of 3d tsv ic by using proposed p/g tsv array model based on separated p/g tsv and chip-pdn models. Womponynts, duwkA uging unx aunufuwturing hywhnology, ]EEE hrunsuwtions on, 1(2):208–219, 2011. [26] Runjie Zhang, Kaushik Mazumdar, Brett H. Meyer, Ke Wang, Kevin Skadron, and Mircea Stan. A cross-layer design exploration of charge-recycled power- delivery in many-layer 3d-ic. In drowyyxings of thy IFbx Unnuul Xysign UuA tomution Wonfyrynwy, DAC ’15, pages 133:1–133:6, New York, NY, USA, 2015. ACM. [27] C. Serafy, A. Bar-Cohen, A. Srivastava, and D. Yeung. Unlocking the true potential of 3-d cpus with microfluidic cooling. In ]EEE hrunsuwtions on Vyry Lurgy gwuly ]ntygrution (VLg]) gystyms, volume 24, pages 1515–1523, April 2016. [28] C. Serafy, A. Srivastava, and D. Yeung. 
Continued frequency scaling in 3d ics through micro-fluidic cooling. In hhyrmul unx hhyrmomywhuniwul dhynomynu in Elywtroniw gystyms (]hhyrm), FDEH ]EEE ]ntyrsowiyty Wonfyrynwy on, pages 79–85, May 2014. 156 [29] Caleb Serafy, Ankur Srivastava, Avram Bar-Cohen, and Donald Yeung. De- sign space exploration of 3d cpus and micro-fluidic heatsinks with thermo- electrical-physical co-optimization. In drowyyxings of thy UgaE FDEI ]ntyrA nutionul hywhniwul Wonfyrynwy unx Efihivition on duwkuging unx ]ntygrution of Elywtroniw unx dhotoniw aiwrosystyms. ASME, 2015. [30] Zhiyuan Yang and Ankur Srivastava. Co-placement for pin-fin based micro- fluidically cooled 3d ics. In UgaE FDEI ]ntyrnutionul hywhniwul WonfyrA ynwy unx Efihivition on duwkuging unx ]ntygrution of Elywtroniw unx dhotoniw aiwrosystyms wollowutyx with thy UgaE FDEI EGth ]ntyrnutionul Wonfyrynwy on bunowhunnyls, aiwrowhunnyls, unx ainiwhunnyls, pages V001T09A036– V001T09A036. American Society of Mechanical Engineers, 2015. [31] Zhiyuan Yang and Ankur Srivastava. Physical co-design for micro-fluidically cooled 3d ics. In hhyrmul unx hhyrmomywhuniwul dhynomynu in Elywtroniw gystyms (]hhyrm), FDEJ ]EEE ]ntyrsowiyty Wonfyrynwy on. IEEE, 2016. [32] Avram Bar-Cohen, Ankur Srivastava, and Bing Shi. Thermo-electrical co- design of three-dimensional integrated circuits: challenges and opportunities. Wompututionul hhyrmul gwiynwysN Un ]ntyrnutionul Journul, 5(6), 2013. [33] Mark T Bohr et al. Interconnect scaling-the real limiter to high performance ulsi. In ]ntyrnutionul Elywtron Xyviwys ayyting, pages 241–244. INSTITUTE OF ELECTRICAL & ELECTRONIC ENGINEERS, INC (IEEE), 1995. [34] J.W. Joyner, P. Zarkesh-Ha, and J.D. Meindl. A stochastic global net-length distribution for a three-dimensional system-on-a-chip (3d-soc). In Ug]WCgcW Wonfyrynwy, FDDEB drowyyxingsB EHth Unnuul ]EEE ]ntyrnutionul, pages 147– 151, 2001. [35] Ralph HJM Otten and Robert K Brayton. Planning for performance. In XUW, DAC ’98, pages 122–127, New York, NY, USA, 1998. ACM, ACM. [36] Kuan H. Lu, Suk-Kyu Ryu, Qiu Zhao, Xuefeng Zhang, Jay Im, Rui Huang, and Paul S. Ho. Thermal stress induced delamination of through silicon vias in 3-d interconnects. In ElywtronB WomponB unx hywhB WonfB (EWhW), FDED drowB JDth, pages 40 –45, June 2010. [37] J Thomas Pawlowski. Hybrid memory cube (hmc). In Hot Whips, volume 23, 2011. [38] Jung-Sik Kim, Chi Sung Oh, Hocheol Lee, Donghyuk Lee, Hyong Ryol Hwang, Sooman Hwang, Byongwook Na, Joungwook Moon, Jin-Guk Kim, Hanna Park, Jang-Woo Ryu, Kiwon Park, Sang Kyu Kang, So-Young Kim, Hoy- oung Kim, Jong-Min Bang, Hyunyoon Cho, Minsoo Jang, Cheolmin Han, Jung-Bae Lee, Joo Sun Choi, and Young-Hyun Jun. A 1.2 v 12.8 gb/s 2 gb mobile wide-i/o dram with 4 × 128 i/os using tsv based stacking. golixAgtuty Wirwuits, ]EEE Journul of, 47(1):107–116, Jan 2012. 157 [39] Dae Hyun Kim, K. Athikulwongse, M. Healy, M. Hossain, Moongon Jung, I. Khorosh, G. Kumar, Young-Joon Lee, D. Lewis, Tzu-Wei Lin, Chang Liu, S. Panth, M. Pathak, Minzhen Ren, Guanhao Shen, Taigon Song, Dong Hyuk Woo, Xin Zhao, Joungho Kim, Ho Choi, G. Loh, Hsien-Hsin Lee, and Sung Kyu Lim. 3d-maps: 3d massively parallel processor with stacked mem- ory. In golixAgtuty Wirwuits Wonfyrynwy Xigyst of hywhniwul dupyrs (]ggWW), FDEF ]EEE ]ntyrnutionul, pages 188–190, Feb 2012. [40] Michael Gschwind. Blue gene/q: design for sustained multi-petaflop comput- ing. In drowyyxings of thy FJth UWa intyrnutionul wonfyrynwy on gupyrwomA puting, pages 245–246. ACM, 2012. [41] Y Eckert, Nuwan Jayasena, and G Loh. 
Thermal feasibility of die-stacked processing in memory. In drowyyxings of thy Fnx korkshop on byurAXutu drowyssing, 2014. [42] Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L Greathouse, Lifan Xu, and Michael Ignatowski. Top-pim: throughput-oriented programmable processing in memory. In drowyyxings of thy FGrx intyrnutionA ul symposium on HighApyrformunwy purullyl unx xistrivutyx womputing, pages 85–98. ACM, 2014. [43] V. F. Pavlidis and E. G. Friedman. 3-d topologies for networks-on-chip. ]EEE hrunsuwtions on Vyry Lurgy gwuly ]ntygrution (VLg]) gystyms, 15(10):1081– 1090, Oct 2007. [44] Bing Shi and Ankur Srivastava. Thermal stress aware 3d-ic statistical static timing analysis. In drowyyxings of thy FGrx UWa intyrnutionul wonfyrynwy on [ryut lukys symposium on VLg], GLSVLSI ’13, pages 281–286, New York, NY, USA, 2013. ACM. [45] JEDEC. Wide i/o 2 (wideio2) (jesd229-2). August 2014. [46] Joel Hruska. Beyond ddr4: The differences between wide i/o, hbm, and hybrid memory cube. Efitrymyhywh oonlinyq, 2015. [47] Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, and Yuan Xie. Hybrid cache architecture with disparate memory technologies. In droA wyyxings of thy GJth Unnuul ]ntyrnutionul gymposium on Womputyr UrwhitywA tury, ISCA ’09, pages 34–45, New York, NY, USA, 2009. ACM. [48] Chiachen Chou, Aamer Jaleel, and Moinuddin K. Qureshi. Cameo: A two- level memory organization with capacity of main memory and flexibility of hardware-managed cache. In drowyyxings of thy H7th Unnuul ]EEECUWa ]nA tyrnutionul gymposium on aiwrourwhitywtury, MICRO-47, pages 1–12, Wash- ington, DC, USA, 2014. IEEE Computer Society. 158 [49] Manjunath Shevgoor, Jung-Sik Kim, Niladrish Chatterjee, Rajeev Balasub- ramonian, Al Davis, and Aniruddha N Udipi. Quantifying the relationship between the power delivery network and architectural policies in a 3d-stacked memory device. In drowyyxings of thy HJth Unnuul ]EEECUWa ]ntyrnutionul gymposium on aiwrourwhitywtury, pages 198–209. ACM, 2013. [50] G.H. Loh. Extending the effectiveness of 3d-stacked dram caches with an adaptive multi-queue policy. In aiwrourwhitywtury, FDDMB a]WfcAHFB HFnx Unnuul ]EEECUWa ]ntyrnutionul gymposium on, pages 201–212, Dec 2009. [51] Xiaowei Jiang, N. Madan, Li Zhao, M. Upton, R. Iyer, S. Makineni, D. Newell, D. Solihin, and R. Balasubramonian. Chop: Adaptive filter-based dram caching for cmp server platforms. In High dyrformunwy Womputyr UrwhitywA tury (HdWU), FDED ]EEE EJth ]ntyrnutionul gymposium on, pages 1–12, Jan 2010. [52] Shekhar Borkar. Thousand core chips: a technology perspective. In drowyyxA ings of thy HHth unnuul Xysign Uutomution Wonfyrynwy, pages 746–749. ACM, 2007. [53] Keren Bergman, Gilbert Hendry, Paul Hargrove, John Shalf, Bruce Jacob, K. Scott Hemmert, Arun Rodrigues, and David Resnick. Let there be light!: The future of memory systems is photonics and 3d stacking. In drowyyxings of thy FDEE UWa g][dLUb korkshop on aymory gystyms dyrformunwy unx Worrywtnyss, MSPC ’11, pages 43–48, New York, NY, USA, 2011. ACM. [54] Syed Minhaj Hassan, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. Near data processing: Impact and optimization of 3d memory system architecture on the uncore. In FDEI ]ntyrnutionul gymposium on aymory gystyms (aymsys FDEI), October 2015. [55] Stephen Jarvis, Steven Wright, and Simon D Hammond. 
High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation: 4th International Workshop, PMBS 2013, Denver, CO, USA, November 18, 2013. Revised Selected Papers, volume 8551. Springer, 2014.
[56] M. Mirza-Aghatabar, S. Koohi, S. Hessabi, and M. Pedram. An empirical investigation of mesh and torus noc topologies under different routing algorithms and traffic models. In Digital System Design Architectures, Methods and Tools, 2007. DSD 2007. 10th Euromicro Conference on, pages 19–26, Aug 2007.
[57] I. Savidis and E.G. Friedman. Closed-form expressions of 3-d via resistance, inductance, and capacitance. Electron Devices, IEEE Transactions on, 56(9):1873–1881, 2009.
[58] A.W. Topol, D.C. La Tulipe, L. Shi, D.J. Frank, K. Bernstein, S.E. Steen, A. Kumar, G.U. Singco, A.M. Young, K.W. Guarini, and M. Ieong. Three-dimensional integrated circuits. IBM Journal of Research and Development, 50(4.5):491–506, July 2006.
[59] Bing Shi, Ankur Srivastava, and Peng Wang. Non-uniform micro-channel design for stacked 3d-ics. In Proceedings of the 48th Design Automation Conference, DAC '11, pages 658–663, New York, NY, USA, 2011. ACM.
[60] M.S. Bakir, C. King, D. Sekar, H. Thacker, B. Dang, Gang Huang, A. Naeemi, and J.D. Meindl. 3d heterogeneous integrated systems: Liquid cooling, power delivery, and implementation. In Custom Integrated Circuits Conference, 2008. CICC 2008. IEEE, pages 663–670, 2008.
[61] Mrinmoy Ghosh and Hsien-Hsin S. Lee. Smart refresh: An enhanced memory controller design for reducing energy in conventional and 3d die-stacked drams. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 134–145, Washington, DC, USA, 2007. IEEE Computer Society.
[62] Bing Shi and Ankur Srivastava. Dynamic thermal management considering accurate temperature-leakage interdependency. Cooling of Microelectronic and Nanoelectronic Equipment: Advances and Emerging Research, page 43, 2014.
[63] Tiantao Lu and Ankur Srivastava. Electrical-thermal-reliability co-design for tsv-based 3d-ics. In ASME 2015 International Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Microsystems collocated with the ASME 2015 13th International Conference on Nanochannels, Microchannels, and Minichannels, pages V001T09A037–V001T09A037. American Society of Mechanical Engineers, 2015.
[64] Jae-Seok Yang, Krit Athikulwongse, Young-Joon Lee, Sung Kyu Lim, and David Z. Pan. Tsv stress aware timing analysis with applications to 3d-ic layout optimization. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 803–806, New York, NY, USA, 2010. ACM.
[65] T. Frank, S. Moreau, C. Chappaz, L. Arnaud, P. Leduc, A. Thuaire, and L. Anghel. Electromigration behavior of 3d-ic tsv interconnects. In Electron. Compon. and Tech. Conf. (ECTC), 2012 IEEE 62nd, pages 326–330, May 29–June 1, 2012.
[66] YC Tan, Cher Ming Tan, XW Zhang, Tai Chong Chai, and DQ Yu. Electromigration performance of through silicon via (tsv) - a modeling approach. Microelectronics Reliability, 50(9):1336–1340, 2010.
[67] Zhaohui Chen, Zhicheng Lv, Xuefang Wang, Yong Liu, and Sheng Liu. Modeling of electromigration of the through silicon via interconnects. In Electronic Packaging Technology & High Density Packaging (ICEPT-HDP), 2010 11th International Conference on, pages 1221–1225. IEEE, 2010.
[68] Cathal Cassidy, Jochen Kraft, Sara Carniello, Frederic Roger, Hajdin Ceric, Anderson Pires Singulani, Erasmus Langer, and Franz Schrank. Through silicon via reliability. Device and Materials Reliability, IEEE Transactions on, 12(2):285–295, 2012.
[69] T Frank, Stéphane Moreau, C Chappaz, Patrick Leduc, L Arnaud, Aurélie Thuaire, E Chery, F Lorut, L Anghel, and G Poupon. Reliability of tsv interconnects: Electromigration, thermal cycling, and impact on above metal level dielectric. Microelectronics Reliability, 53(1):17–29, 2013.
[70] P Kumar, I Dutta, and MS Bakir. Interfacial effects during thermal cycling of cu-filled through-silicon vias (tsv). Journal of Electronic Materials, 41(2):322–335, 2012.
[71] Chukwudi Okoro, John W Lau, Fardad Golshany, Klaus Hummler, and Yaw S Obeng. A detailed failure analysis examination of the effect of thermal cycling on cu tsv reliability. Electron Devices, IEEE Transactions on, 61(1):15–22, 2014.
[72] Juergen Auersperg, Dietmar Vogel, Ellen Auerswald, Sven Rzepka, and Bernd Michel. Nonlinear copper behavior of tsv for 3d-ic-integration and cracking risks during beol-built-up. In Electronics Packaging Technology Conference (EPTC), 2011 IEEE 13th, pages 29–33. IEEE, 2011.
[73] David Z Pan, Sung Kyu Lim, Krit Athikulwongse, Moongon Jung, Joydeep Mitra, Jiwoo Pak, Mohit Pathak, and Jae-seok Yang. Design for manufacturability and reliability for tsv-based 3d ics. In Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, pages 750–755. IEEE, 2012.
[74] Zhen Zhang. Guideline to avoid cracking in 3d tsv design. In Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), 2010 12th IEEE Intersociety Conference on, pages 1–5. IEEE, 2010.
[75] Avram Bar-Cohen, Joseph J Maurer, and Jonathan G Felbinger. Darpa's intra/interchip enhanced cooling (icecool) program. In CS MANTECH Conference, May 13th–16th, 2013.
[76] W. Yun, Jongpil Jung, Kyungsu Kang, and Chong-Min Kyung. Temperature-aware energy minimization of 3d-stacked l2 dram cache through dvfs. In SoC Design Conference (ISOCC), 2012 International, pages 475–478, Nov 2012.
[77] Bing Shi, Caleb Serafy, and Ankur Srivastava. Co-optimization of tsv assignment and micro-channel placement for 3d-ics. In ACM Great Lakes Symposium on VLSI.
[78] A.K. Coskun, J.L. Ayala, D. Atienza, and T.S. Rosing. Modeling and dynamic management of 3d multicore systems with liquid cooling. In Very Large Scale Integration (VLSI-SoC), 2009 17th IFIP International Conference on, pages 35–40, 2009.
[79] M.M. Sabry, A.K. Coskun, D. Atienza, T.S. Rosing, and Thomas Brunschwiler. Energy-efficient multiobjective thermal control for liquid-cooled 3-d stacked architectures. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 30(12):1883–1896, 2011.
[80] Bing Shi, Caleb Serafy, and Ankur Srivastava. Co-optimization of tsv assignment and micro-channel placement for 3d-ics. In Proc. of the 23rd ACM Int. Conf. on Great Lakes Symp. on VLSI, GLSVLSI '13, pages 337–338, New York, NY, USA, 2013. ACM.
[81] Y. F. Tsai, F. Wang, Y. Xie, N. Vijaykrishnan, and M. J. Irwin. Design space exploration for 3-d cache. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(4):444–455, April 2008.
[82] Rafael Ubal, Julio Sahuquillo, Salvador Petit, and Pedro Lopez. Multi2sim: A simulation framework to evaluate multicore-multithreaded processors.
In Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on, pages 62–68, 2007.
[83] Premkishore Shivakumar and Norman P Jouppi. Cacti 3.0: An integrated cache timing, power, and area model. Technical Report 2001/2, Compaq Computer Corporation, 2001.
[84] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The splash-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, volume 23 of ISCA '95, pages 24–36, New York, NY, USA, 1995. ACM.
[85] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The parsec benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, pages 72–81, New York, NY, USA, 2008. ACM.
[86] Manu Awasthi, David W Nellans, Kshitij Sudan, Rajeev Balasubramonian, and Al Davis. Handling the problems and opportunities posed by multiple on-chip memory controllers. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 319–330. ACM, 2010.
[87] Hadi Esmaeilzadeh, Emily Blem, Renee St Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on, pages 365–376. IEEE, 2011.
[88] Wim Heirman, Souradip Sarkar, Trevor E. Carlson, Ibrahim Hur, and Lieven Eeckhout. Power-aware multi-core simulation for early design stage hardware/software co-optimization. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 3–12, New York, NY, USA, 2012. ACM.
[89] W. J. Song, S. Mukhopadhyay, and S. Yalamanchili. Managing performance-reliability tradeoffs in multicore processors. In 2015 IEEE International Reliability Physics Symposium, pages 3C.1.1–3C.1.7, April 2015.
[90] Michael Moeng and Rami Melhem. Applying statistical machine learning to multicore voltage & frequency scaling. In Proceedings of the 7th ACM International Conference on Computing Frontiers, CF '10, pages 277–286, New York, NY, USA, 2010. ACM.
[91] Xiangyu Dong, Yuan Xie, N. Muralimanohar, and N.P. Jouppi. Simple but effective heterogeneous main memory with on-chip memory controller support. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1–11, 2010.
[92] Ming-Yu Hsieh, Arun Rodrigues, Rolf Riesen, Kevin Thompson, and William Song. A framework for architecture-level power, area, and thermal simulation and its application to network-on-chip design exploration. SIGMETRICS Performance Evaluation Review, 38(4):63–68, March 2011.
[93] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Architecture support for disciplined approximate programming. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 301–312, New York, NY, USA, 2012. ACM.
[94] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. Gpuwattch: Enabling energy optimizations in gpgpus. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 487–498, New York, NY, USA, 2013. ACM.
[95] R. Sheikh, J. Tuck, and E. Rotenberg.
Control-flow decoupling: An approach for timely, non-speculative branching. IEEE Transactions on Computers, 64(8):2182–2203, Aug 2015.
[96] Y. Zhang, A. Dembla, and M. S. Bakir. Silicon micropin-fin heat sink with integrated tsvs for 3-d ics: Tradeoff analysis and experimental testing. IEEE Transactions on Components, Packaging and Manufacturing Technology, 3(11):1842–1850, Nov 2013.
[97] T. Frank, C. Chappaz, P. Leduc, L. Arnaud, F. Lorut, S. Moreau, A. Thuaire, R. El Farhane, and L. Anghel. Resistance increase due to electromigration induced depletion under tsv. In Reliability Physics Symposium (IRPS), 2011 IEEE International, pages 3F.4.1–3F.4.6, April 2011.
[98] Jason Cong and Guojie Luo. A 3D physical design flow based on Open Access. In International Conference on Communications, Circuits and Systems. IEEE, 2009.
[99] Tiantao Lu and Ankur Srivastava. Detailed electrical and reliability study of tapered tsvs. In 3D Systems Integration Conference (3DIC), 2013 IEEE International, pages 1–7. IEEE, 2013.
[100] J.R. Black. Mass transport of aluminum by momentum exchange with conducting electrons. In Reliability Physics Symposium, pages 1–6, 2005.
[101] J. Pak, S. K. Lim, and D. Z. Pan. Electromigration-aware routing for 3d ics with stress-aware em modeling. In 2012 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 325–332, Nov 2012.
[102] Wei Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M.R. Stan. Hotspot: a compact thermal modeling methodology for early-stage vlsi design. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 14(5):501–513, 2006.
[103] Caleb Serafy, Tiantao Lu, and Ankur Srivastava. Thermal-reliability physical co-optimization during architectural design space exploration of 3d-cpus. In GOMACTech, 2016.
[104] Jai-Ming Lin and Yao-Wen Chang. Tcg: a transitive closure graph-based representation for non-slicing floorplans. In Design Automation Conference, 2001. Proceedings, pages 764–769, 2001.
[105] Jason Cong, Jie Wei, and Yan Zhang. A thermal-driven floorplanning algorithm for 3d ics. In ICCAD '04, pages 306–313. IEEE, 2004.
[106] Jill HY Law, Evangeline FY Young, and Royce LS Ching. Block alignment in 3d floorplan using layered tcg. In GLSVLSI '06, pages 376–380. ACM, 2006.
[107] A. Ortega, S. Ramanathan, J. D. Chicci, and J. L. Prince. Thermal wake models for forced air cooling of electronic components. In Semiconductor Thermal Measurement and Management Symposium, 1993. SEMI-THERM IX., Ninth Annual IEEE, pages 63–74, Feb 1993.
[108] A. Kagi, J. R. Goodman, and D. Burger. Memory bandwidth limitations of future microprocessors. In Computer Architecture, 1996 23rd Annual International Symposium on, pages 78–78, May 1996.
[109] Jaehyuk Huh, D. Burger, and S. W. Keckler. Exploring the design space of future cmps. In Parallel Architectures and Compilation Techniques, 2001. Proceedings. 2001 International Conference on, pages 199–210, 2001.
[110] Rajkumar Buyya, Christian Vecchiola, and S Thamarai Selvi. Mastering cloud computing: foundations and applications programming. Newnes, 2013.
[111] Joel Hruska. The death of cpu scaling: From one core to many, and why we're still stuck. ExtremeTech [online], 2012.
[112] Tiantao Lu and Ankur Srivastava. Gated low-power clock tree synthesis for 3d-ics. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design, ISLPED '14, pages 319–322, New York, NY, USA, 2014. ACM.
[113] Zhimin Wan, He Xiao, Yogendra Joshi, and Sudhakar Yalamanchili.
Co-design of multicore architectures and microfluidic cooling for 3d stacked ics. Microelectronics Journal, 2014.
[114] Dae Hyun Kim, Krit Athikulwongse, and Sung Kyu Lim. A study of through-silicon-via impact on the 3d stacked ic layout. In Proceedings of the 2009 International Conference on Computer-Aided Design, ICCAD '09, pages 674–680, New York, NY, USA, 2009. ACM.
[115] B. A. Jasperson, Y. Jeon, K. T. Turner, F. E. Pfefferkorn, and W. Qu. Comparison of micro-pin-fin and microchannel heat sinks considering thermal-hydraulic performance and manufacturability. IEEE Transactions on Components and Packaging Technologies, 33(1):148–160, March 2010.
[116] Yoav Peles, Ali Koşar, Chandan Mishra, Chih-Jung Kuo, and Brandon Schneider. Forced convective heat transfer across a pin fin micro heat sink. International Journal of Heat and Mass Transfer, 48(17):3615–3627, 2005.
[117] Frank P Incropera. Fundamentals of heat and mass transfer. John Wiley & Sons, 2011.
[118] Darshan Gandhi, Andreas Gerstlauer, and Lidiya John. Fastspot: Host-compiled thermal estimation for early design space exploration. In Quality Electronic Design (ISQED), 2014 15th International Symposium on, pages 625–632. IEEE, 2014.
[119] Davy Genbrugge and Lieven Eeckhout. Chip multiprocessor design space exploration through statistical simulation. Computers, IEEE Transactions on, 58(12):1668–1681, 2009.
[120] Wenhao Jia, Kelly Shaw, Margaret Martonosi, et al. Stargazer: Automated regression-based gpu design space exploration. In Performance Analysis of Systems and Software (ISPASS), 2012 IEEE International Symposium on, pages 2–13. IEEE, 2012.
[121] Engin İpek, Sally A McKee, Rich Caruana, Bronis R de Supinski, and Martin Schulz. Efficiently exploring architectural design spaces via predictive modeling, volume 40. ACM, 2006.
[122] Benjamin C Lee and David M Brooks. Accurate and efficient regression modeling for microarchitectural performance and power prediction. In ACM SIGPLAN Notices, volume 41, pages 185–194. ACM, 2006.
[123] PJ Joseph, Kapil Vaswani, and Matthew J Thazhuthaveetil. Construction and use of linear regression models for processor performance analysis. In High-Performance Computer Architecture, 2006. The Twelfth International Symposium on, pages 99–108. IEEE, 2006.
[124] Yingmin Li, Benjamin Lee, David Brooks, Zhigang Hu, and Kevin Skadron. Cmp design space exploration subject to physical constraints. In High-Performance Computer Architecture, 2006. The Twelfth International Symposium on, pages 17–28. IEEE, 2006.
[125] Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder. Using simpoint for accurate and efficient simulation. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '03, pages 318–319, New York, NY, USA, 2003. ACM.
[126] Chong Gu. Smoothing spline ANOVA models, volume 297. Springer Science & Business Media, 2013.
[127] Chong Gu. Smoothing spline anova models: R package gss. Journal of Statistical Software, 58(5):1–25, 2014.
[128] Brian D Ripley. The r project in statistical computing. MSOR Connections, 1(1):23–25, 2001.
[129] Frank E Harrell. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer Science & Business Media, 2013.
[130] Michael H Kutner, Chris Nachtsheim, and John Neter. Applied linear regression models. McGraw-Hill/Irwin, 2004.
[131] Henry Theil. Economic forecasts and policy. 1958.
[132] Caleb Serafy, Bing Shi, and Ankur Srivastava. Geometric approach to chip-scale tsv shield placement for the reduction of tsv coupling in 3d-ics. In Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI, GLSVLSI '13, pages 275–280, New York, NY, USA, 2013. ACM.
[133] Caleb Serafy and Ankur Srivastava. Coupling-aware Force Driven Placement of TSVs and Shields in 3D-IC Layouts. In International Symposium on Physical Design. ACM, 2014.
[134] Moongon Jung, Taigon Song, Yang Wan, Yarui Peng, and Sung Kyu Lim. On enhancing power benefits in 3d ics: Block folding and bonding styles perspective. In Design Automation Conference (DAC), 2014 51st ACM/EDAC/IEEE, pages 1–6, June 2014.
[135] Terry J Dishongh, Jason T Cassezza, and Kevin S Rhodes. Microfluidic cooling of integrated circuits, January 26 2010. US Patent 7,652,372.