ABSTRACT 
 
Title of Dissertation: RESISTIVE RAM BASED MAIN-MEMORY 
SYSTEMS: UNDERSTANDING THE 
OPPORTUNITIES, LIMITATIONS, AND 
TRADEOFFS 
  
 Meenatchi Jagasivamani 
Doctor of Philosophy, 2020 
  
Dissertation directed by: Professor Bruce Jacob 
Department of Electrical & Computer 
Engineering 
 
As DRAM faces scaling issues as a high-density memory, emerging 
technologies are being explored as alternatives.  One promising candidate is 
Resistive Memories (ReRAM), which is scalable, vertically stackable, and because 
of the possibility of integration with standard logic process, can deliver higher 
density as a main-memory solution.  The key differentiator with this approach 
involves a ReRAM memory array that integrates directly with a logic processor 
underneath.   
 
 
In this research work, I explore ReRAM as a main-memory alternative at 
three levels of detail ? at the device level, the physical-design level, and finally at 
the architecture level.  I begin with an overview of ReRAM and compare with 
alternate technologies.  I look at the physical design of the solution and present the 
results of area studies on integrating a VSCALE processor at the 45nm technology 
node with a ReRAM bit-cell array.  The area study was performed based on 
parameters specified by my collaborators at Crossbar Inc.  The results showed that 
the optimum operating point is at 50% array efficiency with a VSCALE processor, 
and that this configuration incurs an area penalty of 18%.   
Two of the key challenges for ReRAM with respect to DRAM performance 
include the higher write latency requirement (typically on the order of 1us) and the 
lower write endurance (typically less than 10^8 cycles).  This compares with 
DRAM write-latency times of less than 30ns (depending on technology node and 
generation) and write endurance of more than 10^15 write cycles.   In this research 
work, I explore the possibility of utilizing the ReRAM cell in an intermediate state 
between non-volatile state and threshold state, where I intentionally tradeoff the 
write energy for a much lower data retention.  This allows the chip to more easily 
replace existing DRAM-like main memory applications, without requiring higher 
write programming current or accommodating for a longer write latency.  I 
performed this evaluation both at the device-level and at the architecture level.   
 
 
At the device-level, I used UMD?s Nano-fab lab to construct a Metal-Oxide 
based ReRAM bitcells on which I characterized the relationship between data-
retention and write current applied.  My fabricated ReRAM was composed of 
Titanium-Oxide and Aluminum Oxide.  I also confirmed the behavior of a mixed-
volatility state where a formed filament relaxes over time to move to a high-
resistance level.  Based on my experimental measurements, operating in the mixed 
volatile state would reduce write energy by 10 to 100x, and thereby improve the 
write endurance. 
Finally, at the architecture-level, I used the Structural Simulation Toolkit 
(SST) to characterize a ReRAM-based main-memory system and compare with a 
DRAM-based one using our research group?s DRAMSIM3 tool.  I also 
characterized the sensitivity of various architectural parameters (core-to-memory 
controller ratio, queue depth, NoC topology) on system performance on stream and 
gups-based graph benchmarks which indicated that the torus topology will provide 
reasonable performance.  Impact of the number of parallel processors indicated that 
at low processor counts, DRAM outperforms ReRAM due to its faster memory 
latency.  However, at high processor counts, ReRAM with its higher number of 
parallel connections is able to deliver higher system performance than DRAM.   
 
  
 
 
 
 
 
 
 
RESISTIVE RAM BASED MAIN-MEMORY SYSTEMS: UNDERSTANDING 
THE OPPORTUNITIES, LIMITATIONS, AND TRADEOFFS 
 
 
 
 
by 
Meenatchi Jagasivamani 
 
 
 
 
 
Dissertation submitted to the Faculty of the Graduate School of the  
University of Maryland, College Park, in partial fulfillment 
of the requirements for the degree of 
Doctor of Philosophy 
2020 
 
 
 
 
Advisory Committee: 
Professor Bruce Jacob, Chair/Advisor 
Professor Manoj Franklin 
Professor Robert Newcomb 
Professor Martin Peckerar 
Professor Donald Yeung  
Professor Lourdes Salamanca-Riba, Dean?s Representative 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
? Copyright by 
Meenatchi Jagasivamani 
2020 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
To my parents, Vadivel and Rani Jagasivamani; 
my husband, Guru Thuduppathy; 
and my children, Gugan and Kayal Thuduppathy. 
 
 
 
 
 
 
 
 
 
 
 
 
 
ii 
 
 
Acknowledgements 
 
I would like to thank my family for supporting me in this journey and 
encouraging me to pursue the PhD after working in the industry.  I am grateful to 
my mother for being my number one supporter, my father for instilling a deep sense 
of curiosity and appreciation for engineering, my husband for always believing in 
me to reach beyond my limits, and my children for filling me with joy each day.   
I owe a great deal to Professor Jacob for being a wonderful research advisor.  
He was patient and encouraging throughout my PhD research, while having a long-
term vision for the research work being explored.  His suggestions in providing 
specific research questions to explore, while allowing me the intellectual freedom 
to discuss and pursue research areas of interest was paramount to the depth and 
breadth of my research study.   
I would also like to specially thank Professor Yeung, for his valuable technical 
feedback and offering suggestions for improvement during the entire course of my 
PhD.   Thank you to all of my committee members who provided me with their time 
and energy to further consider certain aspects during my defense.   
Finally, I would like to thank my friends and colleagues at the University of 
Maryland, Shang, Brendan, Candace, Devesh, Luyi, and Daniel for their wish and 
support all these years.   
iii 
 
Table of Contents 
ABSTRACT .............................................................................................................................. I 
TABLE OF CONTENTS ...................................................................................................... IV 
1 INTRODUCTION ......................................................................................................... 1 
1.1 MOTIVATION AND PROBLEM DESCRIPTION ................................................................. 1 
1.2 PROPOSED APPROACH ................................................................................................. 4 
1.3 CONTRIBUTION AND SIGNIFICANCE ............................................................................. 6 
1.4 ORGANIZATION OF DISSERTATION .............................................................................. 8 
2 EMERGING MEMORY TECHNOLOGIES ............................................................11 
2.1 SRAM, DRAM (HMC, HBM, EDRAM), STT-MRAM, RERAM .............................11 
2.2 RERAM IMPLEMENTATION VARIATIONS ...................................................................15 
2.3 APPLICATIONS FOR RERAM TECHNOLOGY ................................................................21 
2.3.1 ReRAM with Support Logic Circuits ................................................................22 
2.3.2 ReRAM for Super Conducting applications .....................................................25 
3 RERAM BACKGROUND ...........................................................................................27 
3.1 RERAM AS DRAM ALTERNATIVE .............................................................................27 
3.2 OVERVIEW OF RESISTIVE MEMORY AND CELL OPERATION ........................................29 
3.3 RERAM READ AND WRITE PERFORMANCE TRADEOFFS ............................................31 
3.4 RERAM WRITE ENDURANCE CHALLENGE .................................................................33 
4 AREA EXPLORATION STUDIES ............................................................................35 
4.1 CROSSBAR RERAM INTEGRATION CONSTRAINTS ......................................................37 
4.2 CAD METHODOLOGY.................................................................................................45 
4.3 SINGLE RERAM CLUSTER INTEGRATION ...................................................................48 
4.4 MULTIPLE RERAM CLUSTER INTEGRATION ..............................................................55 
4.5 SRAM-RERAM INTEGRATIONS .................................................................................68 
4.6 MEMORY ARCHITECTURE CALCULATOR (MAC) .......................................................72 
4.7 ALTERNATIVE FLOORPLAN ARRANGEMENTS (L, CROSSBAR, FRACTAL DESIGN) ........74 
4.8 CONCLUSION ..............................................................................................................78 
5 RERAM DEVICE-LEVEL RESEARCH STUDY ....................................................80 
5.1 MOTIVATION ..............................................................................................................80 
5.2 FABRICATION APPROACH ...........................................................................................82 
5.3 MASK GENERATION ...................................................................................................87 
5.4 DEVICE FABRICATION ................................................................................................94 
5.5 RERAM RESISTIVE SWITCHING BEHAVIOR..............................................................100 
5.6 THRESHOLD BEHAVIOR AT LOW CURRENT COMPLIANCE LIMITS.............................104 
iv 
5.7 TIME DEPENDENT VOLATILITY BEHAVIOR ...............................................................106 
5.8 IMPACT ON WRITE ENERGY AND ENDURANCE .........................................................116 
5.9 POST-CHARACTERIZATION SEM ..............................................................................118 
5.10 CONCLUSION AND FUTURE WORK .......................................................................121 
6 ARCHITECTURE-LEVEL SIMULATIONS .........................................................124 
6.1 SST SIMULATOR.......................................................................................................124 
6.2 BASELINE ARCHITECTURE COMPARISON..................................................................126 
6.3 IMPACT OF MEMORY PARALLELISM FOR RERAM ....................................................133 
6.4 MOTIVATION FOR CENTRAL RERAM DESIGN ..........................................................137 
6.5 AREA FLOORPLAN CENTRAL RERAM DESIGN.........................................................140 
6.6 WRITE PERFORMANCE IMPACT OF RERAM .............................................................145 
6.7 IMPACT OF CORE COUNT ..........................................................................................155 
6.8 ENERGY COMPARISON ..............................................................................................161 
7 NOC TOPOLOGY IMPACT ....................................................................................165 
7.1 MOTIVATION ............................................................................................................165 
7.2 BACKGROUND ..........................................................................................................167 
7.2.1 ReRAM-based Main-Memory Architecture ....................................................172 
7.2.2 NoC Topologies of Interest ............................................................................175 
7.2.3 Simulation Methodology ................................................................................178 
7.3 EXPERIMENT RESULTS & ANALYSIS ........................................................................179 
7.3.1 NoC Topology Evaluation ..............................................................................180 
7.3.2 DRAM Memory Controller Optimization .......................................................189 
7.4 CONCLUSION ............................................................................................................193 
8 RERAM AS TRUSTED ON-CHIP MAIN MEMORY ...........................................194 
8.1 MOTIVATION ............................................................................................................194 
8.2 BACKGROUND ..........................................................................................................197 
8.2.1 Integrated Processor-ReRAM Architecture ...................................................198 
8.2.2 ReRAM Security Implications ........................................................................200 
8.3 PROPOSED APPROACH ..............................................................................................201 
8.4 ANALYSIS AND DISCUSSION .....................................................................................204 
9 CONCLUSION ...........................................................................................................206 
10 FUTURE WORK........................................................................................................210 
11 REFERENCES ...........................................................................................................213 
  
v 
Table of Figures 
Figure 1-1 Motivation: Memory Bandwidth Wall .............................................................................. 2 
Figure 1-2  System Connection in Proposed Approach (Side and Corner View) ............................ 6 
Figure 2-1  Conventional 6T SRAM cell ............................................................................................13 
Figure 2-2 DRAM Bit-cell ....................................................................................................................13 
Figure 2-3  STT-MRAM Bitcell Figure source: (MRAM-info, 2016) ..............................................14 
Figure 2-4 Cell-Size Comparison for different Memory Technologies ............................................18 
Figure 2-5  Read-Latency Comparison for different Memory Technologies ..................................20 
Figure 2-6  Write-Latency Comparison for different Memory Technologies .................................21 
Figure 2-7  Augmenting Logic to enable ReRAM adaption into key applications (a) output buffer 
to increase data bandwidth (b) pipelined floating point logic to enable computation ..........24 
Figure 2-8 ? RFSQ Circuit ..................................................................................................................26 
Figure 3-1 DRAM Bandwidth-Capacity Tradeoff .............................................................................28 
Figure 3-2 ReRAM Bitcell Details (a) ReRAM bitcell cross-section (b) Crossbar 1S1R array bias 
scheme, with selected cell circled ...............................................................................................30 
Figure 3-3 ReRAM Array Size vs Read Latencies ............................................................................32 
Figure 3-4  Write Endurance Ranges for DRAM vs ReRAM ..........................................................34 
Figure 4-1  Cross-Section ReRAM bitcell ..........................................................................................35 
Figure 4-2  ReRAM Physical Integration. ..........................................................................................36 
Figure 4-3  Crossbar ReRAM Bitcell (a) Orthogonal Bitcell Layout (b) ReRAM integration with 
CMOS Process Figure source (Crossbar Inc., 2018) ...............................................................38 
Figure 4-4  45nm ReRAM bitcell ........................................................................................................41 
Figure 4-5 Memory Organization .......................................................................................................43 
Figure 4-6  Via tap points from ReRAM metal layer to periphery circuits ....................................44 
Figure 4-7  Digital Implementation Tool Flow of an integrated ReRAM RISC-V Processor tile. 47 
Figure 4-8 Layout of VSCALE Processor Core .................................................................................50 
Figure 4-9  Blockage Region for ReRAM Peripheral Circuits .........................................................52 
Figure 4-10 Layout of an integrated ReRAM RISC-V Processor tile. .............................................53 
Figure 4-11 Embedding Multiple ReRAM Mat Clusters within a Larger Processor .....................56 
Figure 4-12 Scaled 256-bit VSCALE Processor Layout....................................................................57 
Figure 4-13  Inter-Mat ReRAM Array Spacing causing Inefficient Layout ...................................59 
Figure 4-14 Multiple ReRAM clusters integrated with a 256-bit RISC-V Processor .....................60 
Figure 4-15 Impact of Inter-MAT ReRAM cross spacing on Area ..................................................63 
Figure 4-16 Impact of Inter-MAT ReRAM cross spacing on Efficiency .........................................64 
Figure 4-17 ? Bitcell Relative Sizes at 45nm ......................................................................................69 
Figure 4-18 - OpenRAM 45nm (a) Generated Bitcell Array (b) SRAM..........................................70 
vi 
Figure 4-19  ReRAM Integrated with SRAM memory .....................................................................71 
Figure 4-20 MAC JavaScript Architectural Area Estimator ...........................................................73 
Figure 4-21 ReRAM with 3-core VSCALE processor .......................................................................76 
Figure 4-22 Independent Core with ReRAM block ...........................................................................77 
Figure 4-23 Alternative ReRAM-Processor integration floorplans showing (a) Fractal approach 
for Star topologies and (b) Mesh approach for mini-core parallel architectures .................78 
Figure 5-1 ReRAM Metal Stack ..........................................................................................................83 
Figure 5-2 UMD ReRAM Device Fabrication....................................................................................86 
Figure 5-3  Mask Configuration ..........................................................................................................88 
Figure 5-4 Mask Prototype Creation ..................................................................................................89 
Figure 5-5 Final Mask Configurations for Mask 1 (left) and Mask 2 (right) ..................................91 
Figure 5-6 Mask Fabrication clockwise from top: (a) ProtoTRAK SMX Milling Station (b) Sheet 
Mask being cut (c) Finished mask set (d) Finished Mask Set overlaid ..................................93 
Figure 5-7  Fabrication flow for Pt/Al2O3/TiO2/Ti/Pt ReRAM structures (a) Thermal SiO2 (b) 
Mask 1: PVD of bottom electrode (c) Mask 2: PVD of ReRAM stack and top electrode ....95 
Figure 5-8 ? (a) PVD chamber used for fabrication (b) Fabricated test wafer of discrete devices 
with probe measurements ..........................................................................................................96 
Figure 5-9 - Crucible materials into PVD Chamber .........................................................................97 
Figure 5-10 ? Platinum Deposition on First Mask .............................................................................97 
Figure 5-11 PVD Chamber and MicroProbe Station ........................................................................98 
Figure 5-12 - Die Photograph of Fabricated Devices (a) Probe Landed (b) Probe etch mark (c) 
Top-electrode/Metal Stack boundary .......................................................................................99 
Figure 5-13 SEM Cross-section photo with EDS spectra of the ReRAM stack ..............................99 
Figure 5-14 ? Oscilloscope Measurement on the Applied Program Pulse .....................................101 
Figure 5-15 - ReRAM Switching between LRS and HRS in bipolar program mode ...................102 
Figure 5-16 - ReRAM Resistive switching in Unipolar program mode .........................................104 
Figure 5-17 - ReRAM Threshold behavior at low current compliance (Ic) limits ........................105 
Figure 5-18 - Change in Resistance after 5 and 10 minutes delay as a function of the initial 
resistance. Log(Delta-Resistance) is calculated for the y-axis ..............................................108 
Figure 5-19 - Resistance change over time grouped by Cell sizes with trend observed across 
multiple devices. Diameters of Cell 2=5.94mm, cell 3=7.56mm, cell 6=14.04mm. Log(Delta- 
Resistance) is calculated for the y-axis....................................................................................110 
Figure 5-20 - Predicted vs Observed change in resistance for cellstates with Rinit below 10M?. 
Log(Delta-Resistance) is calculated for the y-axis. ................................................................111 
Figure 5-21 - Resistance change for different Program Current Compliance values ..................113 
Figure 5-22 - Resistance change as a function of Program Current Compliance. ........................115 
Figure 5-23 - Sliced Sample inside GAIA SEM Chamber ..............................................................119 
Figure 5-24 ? SEM Thickness Measurement ...................................................................................119 
Figure 6-1  SST Component-based Framework ..............................................................................125 
Figure 6-2 Architecture Comparison ................................................................................................127 
vii 
Figure 6-3 SST Simulation Result .....................................................................................................130 
Figure 6-4 Memory Latency Breakdown, Queue Depth=2 .............................................................132 
Figure 6-5 Impact of Queue Depth ...................................................................................................134 
Figure 6-6 Impact of Queue Depth and Multiple Mem-Controllers ..............................................136 
Figure 6-7 Hybrid ReRAM-DRAM System Floorplan ...................................................................138 
Figure 6-8 ReRAM Memory Controller Design ..............................................................................139 
Figure 6-9 ? Memory Footprint for Central ReRAM Design .........................................................141 
Figure 6-10 ? Bank Controller Area .................................................................................................142 
Figure 6-11 ? Placement of Control Logic, Buffers, and SRAM ....................................................143 
Figure 6-12 ? Interconnect Routing over Central ReRAM Floorplan ...........................................144 
Figure 6-13 - DRAM ReRAM Architecture Comparison ...............................................................146 
Figure 6-14 - SST STREAM Benchmark Comparison for 21 cores ..............................................150 
Figure 6-15 - SST GUPS Benchmark Comparison for 21 cores .....................................................152 
Figure 6-16 ? Impact of Increasing Core Count ..............................................................................154 
Figure 6-17 - Performance Comparison between DRAM and ReRAM system using STREAM 
and GUPS benchmarks (note: Log-Scale X & Y axis) ..........................................................157 
Figure 6-18 - Bandwidth Comparison between DRAM and ReRAM system using STREAM and 
GUPS benchmarks (note: Log-Scale X & Y axis) ..................................................................159 
Figure 6-19- Energy-Delay Plot of DRAM-DDR4 and ReRAM system using STREAM and 
GUPS benchmarks (note: Log-Scale X axis) ..........................................................................163 
Figure 7-1 - Comparison of (a) Conventional off-chip main-memory system with (b) Integrated 
CPU die with ReRAM layers on-chip .....................................................................................166 
Figure 7-2 Comparison of various NoC topologies ..........................................................................168 
Figure 7-3 - ReRAM Array Access ...................................................................................................173 
Figure 7-4 - Hybrid ReRAM-DRAM System ...................................................................................175 
Figure 7-5 - Overview Diagram of NoC Topologies Simulated ......................................................177 
Figure 7-6 - Torus Configuration for Central ReRAM Architecture ............................................179 
Figure 7-7 - NoC Topology Performance: Impact of Cores ............................................................183 
Figure 7-8 - NoC Topology Performance: Impact of Link Bandwidth ..........................................186 
Figure 7-9 - NoC Topology Tradeoff: Execution Time vs Aggregate Bandwidth for STREAM 
benchmark (Note: Log Scale X & Y axis) ..............................................................................188 
Figure 7-10 - DRAM Performance: Impact of Cores and Memory Controller ............................190 
Figure 7-11 - DRAM Speedup: Impact of Cores and Memory Controller (note: Log-Scale X axis)
 ....................................................................................................................................................192 
Figure 8-1 - Vulnerabilities in Main Memory ..................................................................................196 
Figure 8-2 - ReRAM Resistance Creation ........................................................................................198 
Figure 8-3 - Integrated ReRAM-Configuration...............................................................................199 
Figure 8-4 - ReRAM-based Main-Memory Solution .......................................................................201 
Figure 8-5 - ReRAM Three Modes of Operation .............................................................................203 
viii 
 
 
Table of Tables 
Table 2-1  Comparison of key parameters of Memory Technologies ..............................................12 
Table 2-2 Key performance metrics of various ReRAM implementations .....................................17 
Table 4-1 Crossbar 1S1R ReRAM Parameters .................................................................................39 
Table 4-2 Integration Results ..............................................................................................................54 
Table 4-3 Summary of Inter-Mat Spacing on Area and Efficiency .................................................61 
Table 4-4 Area, Power, and Performance comparison of Processors ..............................................73 
Table 5-1 Mask Feature Specifications ...............................................................................................92 
Table 5-2 ? SEM Analysis of Deposited Thickness ..........................................................................120 
Table 6-1 Summary of SST Architecture Details ............................................................................128 
Table 6-2 - Architectural Parameters ...............................................................................................148 
Table 6-3 Bandwidth Comparison ....................................................................................................155 
Table 7-1 Comparison of NoC Topologies .......................................................................................171 
Table 7-2 ? Network Sizing Parameters ...........................................................................................180 
Table 7-3 - Speedup for different NoC Topologies (baseline: 16 cores) .........................................185 
Table 9-1 - Summary of Key Contributions .....................................................................................207 
 
 
ix 
 
 
 
1 Introduction 
 
1.1 Motivation and Problem Description 
The memory bus is a major limiting factor to overall system performance.  
Current system performance is limited between the processing power?s data needs and 
the data rate received by the memory system, with CPU request rate typically 3-4x 
faster than the data rate received from the overall memory system.  System architects 
have come to accept the limitation due to the memory bandwidth wall and have focused 
on modifying memory access patterns and increasing parallelism in the computation 
layer in order to increase instruction throughput.   
There are several mitigation strategies that are currently employed to address 
this problem.  Hardware techniques include employing multiple levels of cache 
memory blocks.  This relies on memory access requests being either temporally or 
spatially related, allowing for access requests to be serviced using data present in the 
cache blocks.  Software techniques include prefetch to load specific data for an 
application or managing the access patterns by locating data in a predictable pattern in 
1 
the memory.  Finally, system-level techniques include introducing multiple memory 
controllers for bandwidth and incorporating high-bandwidth memories.  While all of 
these techniques mitigate some of the issues, as the computational system becomes 
increasingly parallel, the memory parallelism imposes an upper limit on the overall 
system performance.  Figure 1-1 illustrates a conventional system and depicts the 
problem.   
 
Figure 1-1 Motivation: Memory Bandwidth Wall 
This figure shows multiple CPU processors that are embedded within a single 
chip, to perform the computations.  These multiple CPUs generate several memory 
requests in parallel, often independent of each other.  Each CPU is connected to 
multiple levels of cache memory blocks, often with the final level (Last Level Cache) 
being a shared memory block.  The cache blocks attempt to service the data 
requirements of the CPU if the requested address is within the confines of the data 
2 
contained with the cache.  Any misses in the requested data would necessitate an access 
to an external main memory, typically DRAM, to fetch the data and fill the cache block.  
As can be seen in the figure, the external main memory is often located off-chip and 
are accessed through a few memory controller circuits embedded on chip.  The memory 
controller itself has the ability to queue pending incoming memory requests, while the 
external memory is servicing the requests.  
Figure 1-1 denotes six such CPU units, however modern systems could make 
use of close to 100 such CPU units.  As the number of independent CPU or processor 
units increase, so does the number of independent memory access requests and the 
likelihood for a bottleneck at the memory controller.  This causes memory requests 
from the CPU to be stalled while pending requests are serviced by the main memory.  
The result is that while DRAM device-level memory latency is on the order of 10s of 
nanoseconds, due to this bottleneck of memory requests, from the CPU?s point of view, 
the perceived memory access latency ends up being much higher, on the order of 100s 
of nanoseconds for large parallel systems.    
Thus, we can observe that the system is fundamentally limited by the number of 
wires that connect the processor and the memory chip.  This bandwidth wall stems from 
the limited number of memory access points that exist in current systems.  Due to the 
number of pins required to make a connection to an external DRAM subsystem (ex: 
DIMM), the DRAM memory controllers on-chip are often limited to six or eight per 
3 
chip.  Our proposed approach seeks to alleviate this bandwidth wall problem directly 
by utilizing a memory technology, ReRAM, that allows for higher numbers of access 
connections between the processor and the memory subsystems. 
 
1.2 Proposed Approach 
Emerging memory technologies are currently being explored by industry and 
academia to address both scalability concerns with conventional solutions and 
improved power-performance capabilities [1].  One promising memory technology is 
Resistive Memories which utilize creation of a high or low resistance state in a device 
to correspond to a digital value of 0 or 1.  The resistance states are modified by creating 
a conductive filament in a dielectric material.  The filaments could be either oxide-
based (OxRAM) or metal ion based (CBRAM) and are controlled by applying specific 
high voltage or current pulse(s) of a specific shape [2,3,4].  In comparison with DRAM, 
ReRAM promises nonvolatility combined with better scalability, CMOS back-end-of-
line (BEOL) compatibility, reasonable switching speeds for read, and higher density 
when stacked.  Integrated Logic and ReRAM Integrated Circuits open the doorway for 
enabling more intuitive implementation of addressing the memory bandwidth wall 
problem without requiring complete redesign of long-standing software to hardware 
design techniques. 
4 
Our proposed solution for addressing the memory bandwidth wall described 
earlier involves using ReRAM as a main-memory replacement for DRAM and 
integrating it to the CPU logic on the same chip.  This is different from 3D stacked-die 
types of approaches that make use physical integration of discrete dies, as shown in 
Figure 1-2. Our solution, which we call Monolithic Computer, involves the ReRAM 
cells residing in metal layers which are fabricated on the same die.  This ReRAM 
technology has been demonstrated and fabricated in products from Crossbar, Inc who 
our research group is in collaboration with for part of this research work, as well as 
others in industry, such as Intel, Micron, and Rambus. Additionally, this approach 
enables extremely high parallel connections to the CPU and directly addresses the 
Memory Bandwidth Wall problem.  
Current studies and research work focus on a specific material composition, 
with characterizations pertaining only to that area.  A broad understanding of the 
technology, implications on how one parameter affects another, and the various 
tradeoffs involved is missing.  Such an understanding allows wider adoption of this 
technology by computer architects to leverage the advantages into their design.  
 
5 
 
Figure 1-2  System Connection in Proposed Approach (Side and Corner View) 
 
1.3 Contribution and Significance 
In this dissertation, for background, I pull together research work done from 
different groups, both in industry and academia to extract the broad trends that emerge 
for this technology and draw together the various implementations of resistive memory 
to reveal design insights and architectural impacts.  This is a literature survey of 
existing research on all variations of resistive memory technology, known by different 
names, such as ReRAM, PCM (Phase Change Memory), memristor. 
6 
For the experimental component, I begin with my results of design experiments 
performed using a collaboration with Crossbar Inc.  Crossbar ReRAM utilizes a novel 
fabrication technology that provides integration capabilities with logic.  Exploration 
studies on a specific ReRAM instance from Crossbar have been performed to 
understand the impacts on area, power, and bandwidth of integrating with a RISC-V 
processor.  I have successfully established a methodology for physical floor-planning 
of a Resistive Memory layer on top of existing logic and present the area impact of a 
memory-processor architecture.   
I also directly seek to address the high write latency and low write-endurance 
problem associated with ReRAM by characterizing the impact of write energy on the 
data-retention of the cell.  My research thrust to support this goal involved fabricating 
ReRAM bitcells as test-cells using UMD?s Nano-fab lab.  I collected characterization 
data on these cells and characterized the relationship between data retention and write 
energy.   
My final research thrust involved architectural simulations to quantify the 
impact of ReRAM write latency on various parallel simulations and evaluate the impact 
of additional memory hierarchies and non-regular NoC Topologies.  To support this 
effort, I utilize Structural Simulation Toolkit (SST) to model and simulate different 
architectural configurations.  My simulations indicated that despite the longer access 
time latencies of the ReRAM array, due to the much higher number of connections to 
7 
the CPU logic, the ReRAM architecture is able to exceed the performance when 
compared to DRAM.  A high enough number of memory access requests were needed 
where this advantage comes into play, with the crossover point for my simulation being 
64 cores.  My first order NoC topology comparison showed that typically torus and fat-
tree configurations performed the best when compared with a mesh topology, with torus 
being 39% better and fat-tree being 70% better at the lower link bandwidths where the 
topology counts.  Due to its ease of implementation, torus might be preferable over the 
other topologies as the link bandwidth increases, or as the number of cores increases. 
 
1.4 Organization of Dissertation 
The dissertation will begin with an overview of emerging memory technologies 
and a comparison of them.  Here, I present the resistive memory cell operation and 
relationship between related memories such memristor and PCM.  I present the key 
tradeoff pertinent to this technology in terms of area, program bandwidth, read 
performance, power consumption, long-term data-retention and reliability effects, and 
multi-level cell implementations.  In the first part of my report, I present the detailed 
implementation of my area study, including the CAD flow to perform the study, and 
the results from my study.   In the second part of my report, I present the premise of 
leveraging the non-volatile/volatile switching behavior of the cell, the device 
fabrication and characterization work, and present some of the preliminary results from 
8 
my SST simulation.  To sum, the thesis spans a broad range of topics and research 
techniques from physical design to device and circuit level, and to the architectural 
level.   
High-level summary of the chapters are as follows: 
? Chapter 2:  Literature Survey and overview of Non-volatile memory 
technologies.  Additional focus is given for the different implementation 
of ReRAM and alternate application space for this technology. 
? Chapter 3:  My motivation for using ReRAM as a replacement for 
DRAM as the main-memory, a more in-depth overview of the cell 
operation, and some of the device level challenges as a main-memory. 
? Chapter 4:  Floorplanning study of the die area impact of ReRAM 
integration with CPUs using Cadence and Synopsys design tools to 
perform the synthesis and digital implementation. 
? Chapter 5:  Device based characterization of a test ReRAM cell 
investigating cell behavior with lower program current and its effect on 
the data retention of the resistance state.  
? Chapter 6:  Performance studies (C and C++ based performance 
modeling using SST) comparing conventional DRAM based memory 
systems against ReRAM based main-memory system.  Additional 
studies on the impact of parallelism are also presented.  Finally, I 
9 
calculate the area required for some of the sub-blocks in the central 
ReRAM IP that I propose and provide a floorplan for the design. 
? Chapter 7:  Expanded architectural simulation work looking at different 
NoC topology and system configurations and the impact on performance.  
I also present effect of the number of DRAM memory controllers on the 
system performance in support of a Hybrid ReRAM-DRAM solution. 
? Chapter 8:  I talk about utilizing ReRAM in the volatile state to limit the 
data persistence and its possible application as a trusted on-chip main 
memory to improve overall system security. 
? Chapter 9:  Conclusion of the dissertation and research work 
? Chapter 10: Bibliography of the Technical literature and references. 
? Appendix A:  Command File used in the Auto-Place-Route Physical 
Design study. 
? Appendix B: Javascript code to perform architectural sizing calculations 
to estimate number of processors, and ReRAM blocks within a given 
chip size. 
 
  
10 
 
 
 
2 Emerging Memory Technologies 
 
2.1 SRAM, DRAM (HMC, HBM, eDRAM), STT-MRAM, ReRAM 
In this section, I provide a high-level comparison of current state-of-art and 
emerging memory technologies? capabilities. I begin with a brief overview of each of 
the technologies.  Table 2-1 presents a summary of key parameters for the different 
memory technologies.  I go over each of the memory technologies in detail.   
SRAM:  Static Random-Access Memory (SRAM) consists of a six-transistor 
(6T) bitcell with a back-to-back inverter pair tied to pass-transistors that allow access 
to the cell, as shown in Figure 2-1.   The bitcells continuously maintain the data injected 
into the storage node.  Data is statically maintained as long as power is supplied to the 
circuit.  SRAM has one of the fastest access time at the expense of area overhead and 
typically serve as cache blocks on a chip. 
  
11 
 
 
SRAM STT-MRAM DRAM DRAM eDRAM ReRAM 
(HMC) (HBM) 
Read Latency 1-10 ns 1-10 ns ~30ns ~30ns 100ns 200-800 ns 
Write Latency  1-10 ns 10-50 ns ~30ns ~30ns 100ns 1-10 us 
High Write None Yes, None None None 6v 
Voltage dependent on 
Requirement retention 
(charge pump) requirement 
(3v to 6v) 
Write- 1e16 1e13 1e16 1e16 1e16 1e6 
Endurance 
Area 2 2 2 2 2 2 
200 F  32 F  8 F  6-8 F  35 F  1-4F  
(dependent 
on # of 
stacks) 
Process CMOS CMOS + CMOS CMOS CMOS CMOS + 
MTJ layer (+Cap) ReRAM 
Energy Moderate Moderate Moderate Low Low 20-30x 
Efficiency lower than 
flash 
Non-Volatile? No Yes, possible No, refresh No, refresh No, requires Yes 
every ~10- every ~10- refresh every 
100 ms 100 ms < 100us 
Table 2-1  Comparison of key parameters of Memory Technologies 
 
12 
 
Figure 2-1  Conventional 6T SRAM cell  
DRAM:  Dynamic Random Access Memory (DRAM) bitcell is comprised of a 
capacitor whose charge is altered to store a data value of 0 or 1, as shown in Figure 2-2.  
Because the charge on the capacitor dissipates over time, a periodic write is performed 
to refresh the data on all bit-cells.  An access transistor provides the mechanism to read 
and write the capacitor.  DRAM provides fast read and write access times, but since 
it?s typically located off-chip, it has limitations in achieving very high memory 
bandwidth and density at the same time.  Also, DRAM technology based on the current 
implementations are projected to run into scaling issues at advanced process nodes.   
 
Figure 2-2 DRAM Bit-cell 
13 
Typically, DRAM memory is implemented as a separate stand-alone discrete die.  
Embedded DRAM (eDRAM) versions of the bitcells allow for the DRAM memory to 
be integrated on the same die as CPU but requires more expensive processing and take 
up silicon area.  New DRAM architectures provide increased density by stacking 
several DRAM memory layers in a single chip.  The two most common ones are Hybrid 
Memory Cube (HMC) by Micron and High Bandwidth Memory (HBM).  HMC is 
developed by Micron to provide a discrete high-density DRAM memory chip 
consisting of 3D-integrated stacks of DRAM Memory dies.  HBM is an open-standard 
high-bandwidth DRAM memory stack that requires a silicon interposer to connect the 
DRAM to a CPU/GPU die.   
STT-MRAM: Spin-Transfer Torque (STT) Magnetic RAM (MRAM) bitcell is 
comprised of a magnetic tunnel junction (MTJ), where the direction of magnetic 
moments and the spin direction of the electrons determine the state of the bitcell, as 
shown in Figure 2-3.   
 
Figure 2-3  STT-MRAM Bitcell Figure source: (MRAM-info, 2016)  
14 
Because of its CMOS compatibility, this bitcell could be integrated into standard 
manufacturing process and could deliver high density with non-volatile data retention. 
 
2.2 ReRAM Implementation Variations 
There are several variations on the exact resistive creation mechanism based on 
the materials used [7,8,9].  The three major versions are:  
(1) CBRAM: Conductive Bridging RAM which relies on the creation of 
microscopic conductive filaments through metal-ion migration; 
(2) OxRAM: Creating Metal-oxide physical defects which results in 
conductive paths of varying resistances in a layer of oxide material by 
causing a valence change. 
(3) PCM: Phase Change Memory which changes the crystal structure of a 
chalcogenide glass from amorphous to crystalline, thus altering the 
resistance of the material.  
PCM is constructed using a heater material, such as tungsten (W), which has a 
high resistivity and emits heat to its surrounding.  The chalcogenide material is placed 
on top of the heater and a current is passed through structure to apply a high temperature 
(close to 600K) to melt the material.  By lowering the programming current slowly, we 
anneal the material to cool slowly and settle into a crystalline structure which has a 
lower resistivity.  Alternatively, by abruptly bringing down the program current, we 
15 
quench the material and the resulting structure is amorphous and highly resistive in 
nature.  Thus, the resistivity of the material is altered, and the data state is represented 
as the resistance value.  A select device is needed in conjunction with the PCM cell so 
that a single cell can be ?selected? among an array of cells.   
Up until recently, most PCM implementations used either a MOS transistor or a 
buried PNP-BJT as the selector.  This prevented PCM array itself from being stacked 
vertically.  Additionally, PCM cells were observed to have a drift phenomenon, where 
the natural state of the material eventually drifted towards a crystalline structure (low-
resistance), which is especially problematic for multi-level cell behavior.  Device 
engineering work, along with a new selector that can reside in the metal layers are being 
investigated to circumvent this problem.  In comparison to PCM, ReRAM have not 
been reported to be prone to data disturb from signal lines adjacent or underneath to the 
memory bitcell.  The work described in this thesis covers OxRAM and CBRAM 
implementations, both of which work on creating a conducting filament and altering 
the overall resistance of the material.   
To provide an overview of existing ReRAM implementations, I performed a 
survey of reported specifications of different Resistive Memory implementations based 
on published data.  Table 2-2 summarizes the performance metrics.  Several of the 
implementations are partnerships between design companies working closely with a 
semiconductor manufacturing fab to realize high-volume implementations of ReRAM 
cells.  
16 
Read Write 
Process Area Density Cell Size 
Organization Capacity Structure Latency Latency 
(nm) (mm2) 2(Gb/mm2) (F ) (uS) (uS) 
Intel/Micron PCM+OTS between 
16Gb 20 206.5 0.62 4.4 8 30 
3D Xpoint Metal 4 & Metal 5 
OxRAM Metal-Oxide 
SanDisk/Toshiba 
32Gb 24 based with Diode 130.7 1.958684 7 40 230 
ReRAM 
selector 
Micron/Sony CBRAM based CuTe 
16Gb 27 Alloy+Buried MOS 168 0.761905 6 2 10 
ReRAM selector 
Crossbar 
4Mb 40 1TnR test chip  5.6 0.5 10 
ReRAM stacked 
Crossbar 
16Mb 40 1T1R, 9 metal test chip   0.02 10 
ReRAM 1T1R 
Adesto 
512kb 40 CBRAM test chip  118 1.2 60 
EEPROM 
IBM/Macronix PCM based with MOS 
- 90 test chip - 20 0.0375 0.13125 
PCRAM selector 
Table 2-2 Key performance metrics of various ReRAM implementations 
Figure 2-4 below plots the bitcell size comparison of the different 
implementation against the process node.  SRAM and DRAM metrics are also provided 
for comparison.  The cell size is reported in feature-squared (F^2), which denotes the 
multiplication of the smallest feature size achievable in that particular process node.  
The value of F is a critical technology parameter defined as the minimum polygon that 
can be fabricated in that process node and is typically limited by the lithography of the 
process.  It is often used as the minimum achieved gate length of the transistors.  The 
figure highlights the bitcells based on ReRAM technology.   
17 
 
Figure 2-4 Cell-Size Comparison for different Memory Technologies 
From the plot, we can observe improvement in cell size as we scale to advanced 
process node, largely through innovations vertical stacking.  One exception to this is 
the Adesto EEPROM product which uses a PCM bitcell with a MOS selector and is not 
stackable.  This product targets low-power IoT applications and the high-area is 
sufficient for the low-volume product.  For the other implementation of ReRAM cells, 
I see the feature size to be lower than DRAM bitcell.  Note that the Crossbar ReRAM 
bitcell is based on a two-layer stack but is expected to be vertically scalable to up to 8 
stacks, which would further reduce the bitcell size.  STT-MRAM occupies higher area 
18 
compared to most ReRAM implementations, but it?s read and write latencies, which 
are on the order of SRAM latencies, are much lower than ReRAM.  This memory 
technology could be a competitive alternative for on-chip cache application to replace 
much the higher area cost of SRAM cells.   
Read latency comparison among ReRAM implementation is presented in Figure 
2-5.  ReRAM bitcells have higher read latencies when compared with the other 
technologies.  The exception here is the Crossbar 1T1R ReRAM, which reported a read 
latency of 20ns and was targeting a high-speed embedded memory application.  This 
product was implemented in 40nm 9-metal process and had a total capacity of 16Mb. 
This implementation used a transistor as the selector device, and therefore would not 
be stackable.  From the plot, we can observe a slight increase in read latency with 
advanced process nodes, however, as can be seen in summary Table 2-2, this is more 
due to the capacity of the memory rather than advances in technology.  As the 
technology matures, we can observe that ReRAM transitions from EEPROM type of 
memories that require lower capacity to Intel?s 3D Xpoint memory with higher memory 
needs.  The higher capacity is supported by larger arrays, which often requires higher 
latency times.  I discuss this phenomenon in more detail in the device tradeoff section 
3.3.  
19 
Read Latency (uS)
100
SanDisk/Toshiba 
OxRAM
10 Intel/Micron 3D 
Xpoint
Micron/Sony 
CBRAM
1 Adesto EEPROM
Crossbar ReRAM 
stacked
0.1
DRAM_offchip Crossbar ReRAM IBM/Macronix 
eDRAM 1T1R PCRAM0.01
IBM STT-MRAM SRAM
0.001
0 20 40 60 80 100
Process node (nm)
 
Figure 2-5  Read-Latency Comparison for different Memory Technologies 
Write latency comparison among ReRAM implementation is presented in 
Figure 2-6.  ReRAM bitcells, being non-volatile memory, have higher write energy 
requirements, which also translates into higher write latencies. The IBM/Macronix 
PCM reported a lower write latency of 131ns on a test-chip product.  Although not 
reported in this chapter, the overall write energy also tends to be higher and leads to 
lower write endurance when compared to volatile memory technologies.  In chapter 5, 
I discuss the device-level challenges in adopting ReRAM as a main-memory 
replacement for current computer architectures.  ReRAM is a Non-Volatile memory 
20 
Read Latency (uS)
with higher write energy and latency requirements than DRAM cells.   ReRAM targeted 
for main memory applications needs to be engineered to support shorter latencies and 
higher write endurances.   
Write Latency (uS)
1000
SanDisk/Toshiba 
OxRAM
100
Intel/Micron 3D Adesto EEPROM
Xpoint Crossbar ReRAM 1T1R
10
Micron/Sony Crossbar ReRAM stacked
CBRAM
1
0.1 IBM/Macronix 
DRAM_offchip
IBM STT-MRAM PCRAM
0.01 eDRAM
SRAM
0.001
0 20 40 60 80 100
Process node (nm)
  
Figure 2-6  Write-Latency Comparison for different Memory Technologies 
 
2.3 Applications for ReRAM technology 
The ease of integration and low-bandwidth characteristics of ReRAM readily 
lends itself to be used in applications that require high parallelism with fine access 
granularity.  Parallel multi-processor architectures meet this criterion and could be 
21 
Write Latency (uS)
implemented using the mesh architecture topology that I mentioned in the previous 
section.  Such parallel multi-processors are most suited for computation intensive 
programs that can be expressed as SIMD (Single Instruction Multiple Data), MODS 
(Monolithic Operations, Distributed Data), or DODS (Distributed Operations 
Distributed Storage).  These architectures consist of an array of tiles, with each 
consisting of a modest processing core, supporting ReRAM memory, and NOC switch 
to support inter-tile communication.  Each unit should be capable of functioning as 
autonomous processing unit, with individual processors having modest computation 
power.  Collectively the mesh architecture could provide higher power efficiency on 
computation intensive programs (SIMD, MODS, DODS). 
 
2.3.1 ReRAM with Support Logic Circuits 
Resistive Memories is an emerging technology that has huge promises in terms 
of scalability, integration with logic, and helping address the memory wall problem.  
However, it has limitations in stream bandwidth and write endurance making an 
augmenting memory component a suitable transition, rather than a replacement for 
currently existing SRAM or DRAM cache needs. One of the biggest advantages with 
ReRAM comes from the fact that ReRAM has the potential to support computation-in-
memory because you can fit in much more complex logic underneath the memory layer 
and it can still be at near sense-amp pitch.  In this section, I look at applications that 
22 
take advantage of the proximity to the memory that the higher BEOL allows in enabling 
the creation of more powerful ReRAM that allows us to build custom sense-amp 
pitched logic to support certain data-intensive applications.  Some potential circuits to 
integrate could be buffer circuits to increase bandwidth and embedded hardware 
accelerators to create processor-in-memory like features.  These accelerators would be 
supporting floating point operations.   
I present two augmenting logic to tailor ReRAM for an application that 
researchers are looking at.  To overcome the bandwidth limitation for streaming 
intensive applications, I can place register banks that shift in data from ReRAM and 
provide a single wide data output.  For example, suppose a ReRAM memory array has 
read bandwidth of 8 bits per read, which implies 8 Sense-Amplifier columns.  If my 
target bandwidth is one 128bits per read, and if the read time is dominated by partly by 
wordline selection time, then selecting one row and reading a 128-bit ?page? at a time 
to accumulate into ?shift register? would optimize some of the read-time overhead.  
This register bank buffer could be placed directly underneath the ReRAM memory 
resulting in no additional area requirements.  
Figure 2-7(a) illustrates this using a simple augmenting logic using an output 
buffer to increase the perceived data bandwidth.  This approach is similar to DRAM 
stream read access, where a single read outputs 128-bit granularity by switching the 
23 
column multiplexer and selecting subsequent columns in a single row.  Another simple 
augmenting logic to be considered includes floating point computation logic that 
supports several signal processing and matrix computation applications.  As shown in 
Figure 2-7(b), this would involve pipelined floating-point computation logic (Multiply-
Multiply-Add) folded directly underneath the ReRAM memory.  A read request from 
one or more arrays would feed a pipelined floating logic block to perform the 
computations as successive reads are performed in parallel.  This approach is similar 
to the FPGA pipelined approach used to embed accelerators into the FPGA fabric. 
  
Figure 2-7  Augmenting Logic to enable ReRAM adaption into key applications 
(a) output buffer to increase data bandwidth (b) pipelined floating point logic to 
enable computation 
Finally, non-volatile logics are a group of circuits that make use of the non-
volatility in ReRAM to preserve the state of logic when a chip goes to deep power-
down modes.  Because of the ease of integration that ReRAM allows, logic states of 
24 
key circuits could easily be preserved in ReRAM memory residing above.  Due to the 
intermittent power availability of IoT (Internet on Things) devices, non-volatile logics 
are being explored as an application for ReRAM technology.  
 
2.3.2 ReRAM for Super Conducting applications 
ReRAM could have an application for Super Conducting circuits due to their 
non-volatility.  Super conducting circuits make use of a RSFQ (Rapid single flux 
quantum) type of logic, as opposed to traditional CMOS logic.  This type of circuit 
relies heavily on capturing spikes that propagate through the system to achieve the 
different logic functions.  Figure 2-8 shows an example RFSQ circuit where signal from 
BLK1 is transmitted to BLK2 as a spike.  The input signal, i, appears before the clk 
signal spike in order to latch in the input signal.  A Joseph Junction device, depicted as 
X in the figure, is used to maintain the signal until it is consumed by the BLK2.  
However, these RFSQ gates consume the input token/charge, and a lot of care is taken 
to path-balance and synchronize the arrival of all inputs.  In order to combat this, 
conventional techniques [24], implement special Non-Destructive-Read-Out (NDO) 
circuits that implement persistence of the input signal.   
25 
 
Figure 2-8 ? RFSQ Circuit 
Although MRAM type of memory technologies are being studied as a possible 
application for this, ReRAM could be a better alternative due to its integration with 
traditional fabrication technologies.  However, it?s higher write energy and write 
latency make it a device challenge in adopting in this application.  One interesting study 
to explore is the possibility of making use of ReRAM, to implement the persistence 
instead.  By using a ReRAM with a very low data retention time, it could translate to a 
lower write-energy requirement, as long as the written spike need not be maintained for 
very long time (not non-volatile behavior).  In chapter 5, I explore using ReRAM in 
this mixed volatility state to tradeoff data retention with write energy and write 
endurance.  One challenge in this solution might be that the temperature range that 
these circuits would operate in might limit the material composition of ReRAM to be 
used.  Further exploration is needed to evaluate the feasibility of this solution.    
  
26 
 
 
 
3 ReRAM Background 
 
3.1 ReRAM as DRAM alternative  
There is currently one DRAM limitation that system architects have to work 
around.  This is the tradeoff between bandwidth and capacity, illustrated in Figure 3-1.  
The x-axis in the graph is the peak bandwidth rated for the device, measured in GB/sec.  
The y-axis is the typical total memory capacity available for that particular 
implementation.   
Conventional DDR4 implementations are capable of high storage capacity, 
close to 400GB.  Their bandwidths on stream triad benchmarks are reported as below 
100GB/sec.   The stacked DRAM implementation, on the other hand, has a high 
bandwidth close to 500GB/sec measured with the stream triad benchmark.  However, 
their capacity maybe quite low, on the order of 16GB in total.  While these bandwidths 
are for sequential dense access patterns, the effective bandwidth drops dramatically for 
sparse access patterns to below 100GB/s.   
27 
 
Figure 3-1 DRAM Bandwidth-Capacity Tradeoff 
While off-chip DRAM can provide significant capacity, the bandwidth is low.  
The option of increasing aggregate bandwidth using additional chips has a high-power 
penalty (~2-4W per DIMM).  Implementations such as Stacked DRAM provide high-
bandwidth, but with low-capacity.  DRAM bandwidth also is targeted for dense-type 
of memory access patterns and degrades severely in sparse type of access patterns.   
ReRAM, on the other hand, can provide a much higher bandwidth at higher 
density. ReRAM has a low access granularity of 8B and can sustain the bandwidth for 
both dense and sparse memory access patterns.   The graph shows the projected 
28 
ReRAM bandwidth of 320GB/sec at a capacity of 200GB, based on current capabilities. 
Additionally, I project that by stacking the ReRAM devices vertically in a 3D-IC could 
increase the capacity dramatically with a small impact of the overall bandwidth.   
In addition to the bandwidth-capacity tradeoff, DRAM is also reported as facing 
scaling issues and being vulnerable to failure at advanced technology nodes.  ReRAM 
on the other hand has been fabricated at 28nm technology and shows no issues of being 
scalable beyond 7nm.  This allows the ReRAM memory to scale with advancements in 
the processor and logic technology and could further improve the capacity and 
bandwidth of the memory.   
 
3.2 Overview of Resistive Memory and Cell operation 
ReRAM stands for resistive Random Access Memory, where the resistance of a 
material is varied by applying different voltage/current across the material, and the 
resistance is used to indicate a data value of 0 or 1.  In this section, I present a brief 
overview of ReRAM characteristics.  Resistive memories have two main components: 
the selector, and the resistive storage element.  The ReRAM bitcell?s basic storage 
mechanism of operation involves the use of dielectric materials which normally don?t 
conduct current.  A dielectric breakdown is induced by subjecting the material to a high 
enough current or voltage, which typically causes permanent damage to the device in 
29 
other dielectric devices, such as diodes and capacitors.  The ReRAM materials are 
engineered in such a way so that this dielectric breakdown does not cause permanent 
damage and is reversible.   
Figure 3-2(a) shows the cross section of the 1S1R (1 selector per 1 resistive 
element). During read, the voltages are expected to operate in the nominal range for the 
technology, while write voltages are expected to be pumped to a higher voltage level.  
This Crossbar ReRAM implementation does not utilize a separate access transistor for 
selection, but the selector device is integrated with the resistive element to form the 
switching medium (SM) layer for the bitcell as shown in Figure 3-2.   
 
Figure 3-2 ReRAM Bitcell Details (a) ReRAM bitcell cross-section (b) Crossbar 
1S1R array bias scheme, with selected cell circled  
The SM is sandwiched between the bottom electrode (BE) and the top electrode 
(TE). A voltage above a threshold (> VTH) is required to select the cell to perform a read 
or write operation. For the program operation, a much higher voltage (>VPRG) is applied 
to enable the formation or resetting of the conductive filaments.  Figure 3-2(b) shows 
30 
the bias scheme of the crossbar memory array for selection. All wordlines and bitlines 
are held at V/2, while the selected cell?s wordline and bitline are biased to have a 
difference of V across it. The selector device is engineered so that the ratio between the 
ON-resistance, defined as when the bitcell has a high voltage bias (V) across it, and the 
OFF-resistance (voltage bias of V/2 in this example), is very high.  This high selectivity 
ensures minimal sneak path current on unselected cells on the same bitline, which have 
a potential of V/2 across their cells. 
 
3.3 ReRAM Read and Write Performance Tradeoffs 
The read latency of a ReRAM array is dependent on the overall array size, as 
shown in the graph in Figure 3-3.  The x-axis in the graph is the sub-array size of the 
memory array, which is the product of the number of rows and columns with an array.  
The graph has two y-axis ? overall die area required to meet a certain memory storage 
capacity measured in sq mm and read latency delay measured in micro-seconds.   
For the purpose of area efficiency, it is desirable to have as high an array size as 
possible.  This is because having several smaller arrays would increase the overhead to 
the surrounding peripheral circuits, such as the row and column decoders.  Although 
the overall sizing of the individual drivers could be smaller for the smaller array, the 
overall area needed would be higher since there would more of the decoder logic.  
31 
Additionally, by separating the arrays into small sub-arrays, certain duplication of 
control and sensing circuits becomes necessary, adding to the overall overhead area.   
 
Figure 3-3 ReRAM Array Size vs Read Latencies 
The graph shows that read-latency delay (marked by the latency numbers), 
increases as the size of the sub-array increases.  A very small array of a single bit (1), 
can have an expected delay of 0.1uS, or 100ns, while a very large sub-array of 2000 
bitcells can have an expected delay of 2.2uS.  From the die-area point of view, the small 
sub-array of a single bit would incur a high die-area of 100mm2, while the large sub-
array of 2K would have a die-area of 3mm2.  Thus, there exists a strong tradeoff 
between array performance and the area.  This high dependency is due to the latency 
timings largely being dominated by the parasitic elements (Resistance, Capacitance) of 
32 
the wordline and the bitline.  A shorter array reduces the length of these lines, and 
therefore directly helps to reduce the latency of the memory. 
To consume shorter latency, the array size needs to be kept small, which results 
in lower area efficiency.  In order to match read latencies close to DRAM main-memory, 
this tradeoff between array-size and area efficiency could require a smaller-bank based 
architecture to increase the read bandwidth, at the expense of die-area.  On the write-
latency side, these are much longer when compared to DRAM write-latencies due to 
the non-volatile state change of the bitcell.  For write-latencies, a separate write-back 
cache could be used as a solution to buffer write operations for certain applications. 
 
3.4 ReRAM Write Endurance Challenge 
Conventional ReRAM bitcell write endurances are on the range of 105 to 108 
cycles, while typical DRAM write endurance is greater than 1015 cycles.  Figure 3-4 
compares the write endurance ranges of DRAM against ReRAM.    Write endurance 
reflects the durability of the bitcell for write operation and is measured in the number 
of write cycles.  ReRAM bitcell, as it is, is over 7 orders of magnitude lower than 
DRAM.The large difference in write endurance limits between ReRAM and DRAM is 
a critical device challenge for ReRAM.  If the write endurance limits are limited to 105 
33 
cycles (100,000 cycles), then typical applications that make use of main-memory would 
not be supportable using ReRAM as a direct replacement for DRAM. 
 
Figure 3-4  Write Endurance Ranges for DRAM vs ReRAM 
DRAM write energy on average is around 19pJ/bit, while ReRAM write energy 
is quoted as 65pJ/bit.  I expect that write endurance has a tradeoff with data Retention 
that could be leveraged for main-memory applications. Additionally, this tradeoff could 
have benefits in lowering the write energy requirements, which is the second device 
challenge I mentioned.   
  
34 
  
 
 
4 Area Exploration studies   
 
One of the most common versions of ReRAM memory involves a ?crossbar? 
structure of two orthogonal strips of wordlines and bitlines, the intersection of which 
produces both the resistive storage element and the selector device.  Figure 4-1 shows 
the Crossbar?s version of the ReRAM bitcell being comprised of the selector device 
and the memory cell, both of which are sandwiched between the orthogonal bitlines 
and wordline signal lines.  This pattern can be continued to provide vertical stack-
ability of the memory, thereby increasing the effective density. 
 
Figure 4-1  Cross-Section ReRAM bitcell 
35 
  These memory layers are fabricated on BEOL metal layers and can be stacked 
to provide increased memory capacity and density.  Additionally, these can be 
integrated onto traditional CMOS processes, allowing for logic or ASIC circuits to be 
placed in certain regions under the memory.  In comparison to separate vertical high-
density memories, this technology helps manufacturers circumvent some of the 
challenges with existing 3D ICs, including higher development costs, and reliability 
with the TSV fabrication.  An IC with integrated Logic and Memory layers (see Figure 
4-2) increases the function per unit volume/area while reducing power consumption 
significantly.  The crossbar version of ReRAM memories is stackable and allows for 
logic to be placed under the memory layer. 
 
Figure 4-2  ReRAM Physical Integration.  
As shown in Figure 4-2, while the actual memory cells are in a BEOL metal 
layers, the peripheral circuits ? such as, the word line decoder, column multiplexer, and 
36 
sense amplifier, would need to take up space in the substrate and lower metal layers 
and forming blockage regions for logic circuits.  However, this still leaves a majority 
of unused space under ReRAM Memory Stack (more than 70% for a two-layer stack).  
I propose using the unused space under ReRAM metal stack for CPU or other Logic 
elements. 
My aim for the physical design feasibility study was to explore a monolithic 
processor core that can be physically integrated with a ReRAM memory on the same 
chip.  In this section, I attempt to integrate a standard-cell based synthesized RISC 
processor circuit with a ReRAM crossbar memory circuit and analyze the area and 
routing congestion that results from such an integration.  I first present some of the 
ReRAM integration constraints and the CAD methodology used to study the area 
impact.  I consider three different ReRAM integration and summarize the measured 
results.  Two different integration ReRAM-Processor configurations are presented in 
this section with the area impact results obtained.  Finally, I consider the physical 
integration of a SRAM memory placed underneath the ReRAM array layout. 
 
4.1 Crossbar ReRAM Integration Constraints 
My initial study is based on the Crossbar implementation of ReRAM memory 
which is CMOS compatible and back end of line (BEOL) stackable.  CMOS 
37 
compatibility ensures that the exotic materials used for the ReRAM stack can be 
deposited on top of standard CMOS fabrication techniques.  One method of physically 
realizing this type of integrated circuit involves a two-step process, where the CMOS 
circuits are fabricated at a standard process foundry and then taken to a ReRAM 
fabrication facility for the specialized ReRAM layers to be deposited on top, in a split-
fabrication like approach.  Figure 4-3 presents the physical implementation of the 
ReRAM bitcell into a standard CMOS process. 
Crossbar ReRAM uses a Select device embedded with the Resistive cell (1S1R) 
and the cells lie at the cross point of orthogonal metal layers, as shown in Figure 4-3 
(a).  The Figure 4-3 (b) shows the split-fabrication like approach described earlier, 
where the specific ReRAM layers can be embedded on-top of, or even in the middle of 
standard fabrication processes. 
 
 
Figure 4-3  Crossbar ReRAM Bitcell (a) Orthogonal Bitcell Layout (b) ReRAM 
integration with CMOS Process Figure source (Crossbar Inc., 2018) 
38 
Table 4-1 summarizes the key performance metrics of Crossbar?s ReRAM array.  
I will briefly go over each of these characteristics and compare with DRAM 
performance where applicable.  The bitcell area is competitive with a DRAM bitcell 
and has the potential to achieve higher density with increased vertical scaling.  Also, as 
noted in the table, the bandwidth per array is 4-8 bits.  Therefore, to provide sufficient 
bandwidth to a single core, I envision several arrays that are distributed across the full-
chip and are accessed in ganged mode, in a Distributed Shared Memory-like 
architecture.   
Key Parameter Performance 
Area 24-16 F  
Bandwidth per array 4-8 bits 
Read Latency 200-700 ns 
Write Latency 1 us 
Cell Leakage 0.1 nA/cell 
Program Energy 10-100 pJ/cell 
Endurance 5 8> 10  ? 10  cycles 
Retention > 7-10 years 
Scaling Potential < 10 nm 
Ron/Roff ratio 100 
Selectivity (?I @V , V ) 6 10
R R/2 > 10  - 10  
Table 4-1 Crossbar 1S1R ReRAM Parameters 
39 
Some of the critical parameters that pose a device challenge for ReRAM 
replacing DRAM as a main-memory are the latency and write endurance limits.  Both 
read and write latency times are much higher than typical DRAM times, with write 
latency being especially much higher.  The Ron/Roff ratio in the table is a characteristic 
of the selector device engineered by Crossbar, Inc.  The crossbar ReRAM bitcell has a 
high Ron/Roff ratio over 100 to reduce sneak path currents from unselected cells and a 
low cell leakage current. 
The program energy per bit is also significantly higher than DRAM, and 
consequently, the write endurance for ReRAM is expected to be around 10^5 ? 10^8 
cycles, which is much lower than that of DRAM, which is quoted to be above 10^15 
write cycles.  This is a critical device challenge to be overcome in order to replace 
DRAM for typical applications.  The flash memory bitcell, however, has a much lower 
write endurance of 10,000 to 100,000 cycles.  This low write-endurance is is managed 
by wear-leveling techniques to minimize the number of write operations to any 
particular cell, along with flash memory?s application consisting largely of read 
operations.  ReRAM has high scaling potential, however, and is expected to be scalable 
below 10nm. 
Crossbar ReRAM technology integrates with standard logic processes, is 
stackable vertically for increased density, and has a 1-4F2 cell size, depending on the 
40 
number of stacks. Although not all ReRAM variations in development allow for this 
assumption, the general direction of ReRAM is moving towards increased density by 
utilizing vertical scaling and integration with logic-process compatibility.  This type of 
ReRAM is organized so that the bitcells are stacked on higher metal layers which are 
shown in Figure 4-2 as M11, M12, as an example.  The bitcell layout of the ReRAM 
could be simplified as a cross-section of adjacent metal lines, whose intersection 
determines the location of the resistive storage element.  Figure 4-4shows the bitcell 
layout in 45nm technology used for my area study.  The ReRAM bitcell dimension I 
am using for the array is 1.4(2*?)2 which is 106nm x 106nm at 45nm technology.   
 
Figure 4-4  45nm ReRAM bitcell 
The peripheral support circuitry for the ReRAM to perform the address decode, 
row and column selection, and sense amplifier read and verify circuits would be 
implemented in the substrate using standard CMOS layers, such as the diffusion, 
polysilicon, and some of the lower metal layers.  Embedding these peripheral circuits 
41 
into a processor circuit would have an area cost and is one of the focus of my area study.  
Processor circuits are implemented using an Auto-Place-and-Route (APR) tool.  This 
tool takes a high-level design description netlist, such as VHDL or Verilog synthesized 
netlist, and places the standard cells in order to meet timing and minimum area goals.  
For my area study, I assess the impact of embedding the ReRAM peripheral circuits 
into a processor logic.  In traditional digital implementation flow, I model these 
peripheral circuits with a blockage layer to indicate to the APR tool that standard cells 
may not be placed in this region.   
Figure 4-5 shows the memory array organization that can be formed to group 
together multiple arrays and provide sufficient data bandwidth.  A single ReRAM array, 
shown on the left in the figure, consists of bitcells arranged in several rows and columns.  
A single horizontal row, referred to as wordline, is selected during a read or write access 
by wordline (WL) decoders.  Multiple columns, also called bitlines, are sensed through 
a column multiplexer (MUX) which is often placed below the array.  A sense amplifier 
(SA) compares the current sensed on the selected bitlines against a reference current to 
decide on the data read out.  This is done for both read and write operations, as write-
operations often involve a verify step to ensure that the write pulse was able to 
successfully place the cell to the desired state.  As can be seen in the diagram, these 
peripheral circuits form a L-shape on the side and bottom of the array. 
42 
The physical layout of four single arrays is grouped into a mat, shown in Figure 
4-5 on the right.  The four arrays are rotated so that their peripheral circuits are placed 
next to each other.  This organization allows for sharing of control signals between the 
arrays during an access.  The peripheral circuits make use front-end-of-line (FEOL) 
layers, such as the ones needed to create the transistors (diffusion, polysilicon, contact), 
as well as the lower metal layers to connect the CMOS logic together.  The ReRAM 
array itself only uses BEOL layers, and the area underneath is available for the CPU 
logic, as I mentioned before.   
 
Figure 4-5 Memory Organization 
The ReRAM peripheral circuits are the blocked regions during the APR digital 
implementation and form a ?cross? shape of blocked region, and are indicated in Figure 
4-5.  Any CPU logic blocks need to either fit under one of the ReRAM arrays or need 
43 
to have a method for connecting between two ReRAM array locations.  While the 
blocked region specifies that no standard cells may be placed in that location, there can 
be limited restriction on the interconnect routing over these blocked regions.  The 
specific metal layers that are blocked have significant impact on the routability of the 
overall integrated design.  Completely blocking all metal routing over the blockage 
region necessitates any routing connections to go around the blocked regions results in 
significant additional routing area overhead with an integrated design. 
 
Figure 4-6  Via tap points from ReRAM metal layer to periphery circuits 
Figure 4-6 describes the routing approach I assumed for my area study.  The 
figure shows the close up of the physical interconnection between wordlines and the 
wordline decoder in the peripheral circuit region.  The horizontal bars on the figure are 
wordlines coming from the ReRAM array to connect to an individual wordline driver, 
44 
which is often the connected to a drain node on one or more transistors.  Therefore, this 
connection needs to be able to route the high-metal line of the wordline (for example, 
from metal-layer 11) to the diffusion node of a transistor.  This means that this 
connection has to go through multiple metal-via taps to descent to metal-1, and then 
connect to a diffusion contact.  Having the CPU logic interconnection lines through this 
region poses a potential conflict with this transition.  Therefore, I have identified a way 
in which an uninterrupted feed-through path could be allocated for the CPU logic 
interconnections.  This feed-through path allows for global signals to route between 
standard-cell logic groups of the CPU logic circuit.  The top-down view shows 
staggered via tap points that allows for a routing channel for signals to feedthrough 
across blocked region.  This approach is scalable as the number of ReRAM stack 
increases.  With higher stacking, there would be more via tap connections that would 
be needed.  The blocked region could expand to accommodate a larger staggered 
connection from the higher memory metal layers to the base transistors below.  
 
4.2 CAD Methodology 
In this section, I go over the tool flow methodology I followed to perform my 
area assessment.  Standard EDA tools are used for performing the area analysis of a co-
located ReRAM with a processor, as shown in  
45 
Figure 4-7.  The tool-flow starts with a design netlist to be synthesized.  In the 
figure, this is indicated as RISC-V processor netlist in Verilog (.v) format, since my 
study involved a RISC-V processor.  This behavioral Verilog netist is synthesized by 
Synopsys Design Compiler into physically realizable individual standard-cells selected 
from a design library.  The process design kit (PDK) I used for my study is based on 
45nm process node and makes use of design library from Nangate.  The synthesized 
netlist (_syn.v) output from the synthesis step is input to Cadence Encounter is used for 
the APR step of the flow to produce the final GDSII layout.  This is used in conjunction 
with specific limitations on the blockage to embed the ReRAM peripheral logic within 
the processor layout. 
I used the design collateral files from North Carolina State University?s (NCSU) 
45nm process design kit (PDK).  I also needed standard cell design libraries at this 
technology node.  I initially looked at using one provided by Oklahoma State 
University (OSU).  I chose the open-source Berkeley RISC-V VSCALE processor as 
the core for studying the processor-memory area impacts.  The synthesizable Verilog 
netlist of the core is called VSCALE and uses a 32-bit instruction set with a single-issue 
in-order 3-stage architecture.  The resulting layout was 59,672 sq um and operated at a 
maximum frequency of 150MHz.   
46 
 
Figure 4-7  Digital Implementation Tool Flow of an integrated ReRAM RISC-V 
Processor tile. 
The OSU library for the 45nm process only provided 32 standard-cells, which 
may not provide sufficient diversity for optimum choice of standard-cells in terms of 
area and performance.  This could cause the digital implementation to be overly 
pessimistic in terms of area and power, and not be representative of real PDKs available 
when manufacturing.  As a result, I explored utilizing an alternative open-source PDK 
from Nangate based on the same 45nm PDK but containing a larger number (134) of 
standard cells.  I repeated the Synthesis and APR step on the VSCALE processor to 
obtain an overall physical layout area of 30,373 sq um at 250MHz clock frequency, 
which was over 50% area reduction observed with this design kit.  I attribute this area 
reduction to be due to sufficient diversity in the standard-cells available, which enabled 
47 
the digital implementation to select an optimum standard-cell instance to minimize area 
and delay.   I used this Nangate PDK to perform relative area comparison studies.   
To mimic the integration constraints listed in the previous section, two types of 
blockage layers are indicated in the Cadence Encounter setting.  The first is for the 
placement blockage to prevent standard cells from being placed, and the second is 
routing blockage for the specific metal layers to limit routing.  Based on our discussion 
with Crossbar, prior ReRAM area measurements indicated that a 25% memory to 
periphery area ratio is a reasonable approximation for the two-layer memory stack.  I 
used this guideline for allocation of the blockage area.  For this second type of 
constraint, I mimic the restricted metal routing described in the previous section by 
blocking metal layers 1-8 and allowing for the APR tool to route through the blocked 
region using metal 9 and 10.  The ReRAM memory layers are assumed to be in metal 
layers 11 and 12 above the standard CMOS layers.  Rather than mimicking routing 
feedthrough channels, this allows for global interconnection signals that need to 
connect across the blocked region limited routing options.  A summary of the blockage 
settings and metal allocation for my design is also provided in Figure 4-7. 
4.3 Single ReRAM Cluster Integration   
My first objective was to integrate the VSCALE processor with a ReRAM 
memory to create a processor-memory tile that could be laid out in an array, based on 
48 
application needs.  To begin with, I measure the stand-alone area of the VSCALE 
processor core alone.  The synthesized netlist targeted an operating frequency of 
150MHz with a total of 59,672 standard cells at the 45nm process technology (nominal 
process, 1v, 27c).  I used Cadence Encounter to perform the APR and generate the 
layout for the core alone.  My approach measures minimum feasible area by iteratively 
reducing the floorplan dimension and checking for congestion, Design Rule Check 
(DRC), and connectivity violations.  If the floorplan area provided to perform the APR 
step is too small, then the tool will not be able to place all the standard-cells, make 
necessary connections, and meet the timing constraints imposed for the design.  DRC 
is a check that ensures that the physical layers are drawn to meet the lithography rules 
of the process.  Figure 4-8 shows the generated layout of the standalone core with 
power-rings around the core and a power-strap in the center.   
The VSCALE core with the 45nm PDK, the core area consumed 30,373 sq. um.  
The dimensions of the floorplan are 172um x 172um.  The generated layout includes 
the necessary standard-cells to implement the function described, as well as the 
interconnections in metal to make the connections.  This PDK allows for 10 metal 
layers and the standard-cells are covered almost entirely by the metal signal lines.  The 
APR tool typically uses even-odd metal routing, meaning that even metal layers are 
used for one direction, for example vertical, and odd metal-layers are used for 
49 
horizontal location.  This allows for efficient packing of a high number of metal 
interconnects.  The floorplan also includes the power-rings in metal 9 and metal-10, 
and a metal-strap in the center of the core to allow for sufficient power supply bias.  
 
Figure 4-8 Layout of VSCALE Processor Core 
Next, I talk about how embedding a ReRAM memory within the standalone core 
could be accomplished.  As I indicated in Figure 4-5, a single array will require an L-
shaped peripheral region surrounding it and is expected to have a relatively low 
bandwidth of 4-8 bits per array.  In order to deliver reasonable bandwidth, I expect 
these arrays to be grouped together, in a mat, to form banks of arrays to meet the data 
50 
bandwidth requirement in parallel.  Physically, I chose these to be placed back-to-back 
in order to form one contiguous blockage region for higher area utilization.   
I created a physical layout of the integrated ReRAM peripheral circuit with the 
VSCALE core using the above physical constraints as inputs to the Cadence Encounter 
tool.    For this experiment, I used 4 ReRAM arrays, each of size 75um x 75um, which 
corresponds to a memory capacity of about 0.5MB for a 2-layer ReRAM stack.  Note 
that crossbar has demonstrated feasibility of scaling to 8-layers for the ReRAM stack.  
Since the peripheral region takes 25% of the ReRAM area, this amounted to a total 
blocked region of 5600 sq. um for this configuration. The minimum feasible area was 
measured by iteratively creating a floorplan of smaller dimensions until the design is 
successfully placed and routed without any DRC or connectivity violations.    
Figure 4-9 shows the generated layout of the ReRAM peripheral circuits 
embedded into a single VSCALE core.  This layout only shows the standard-cell and 
blockage region information.  The center cross (in red) denotes the blockage region, 
we?ve described to the APR to keep out the standard cell placements.  The rows of 
standard cells (in blue) surround the blockage region complete.  As mentioned before, 
the blockage region is specified for four of the L-shaped peripheral circuits arranged in 
a Cross configuration for my physical design study.  This configuration has the 
51 
advantage of allowing for I/O connectivity between the ReRAM memory and the 
processor, as well as allowing for connection between the overall tile which would need 
to communicate with other blocks.    
 
Figure 4-9  Blockage Region for ReRAM Peripheral Circuits  
Figure 4-10 shows the final integrated ReRAM-Processor layout with all of the 
metal layers up to metal-8, excluding metal-9 and metal-10.  Each of the red-square 
represents a single ReRAM array.  The standard-cells and metal lines surround the L-
shaped peripheral region, which is on the corner of each of the ReRAM arrays.  The 
ReRAM arrays themselves will use higher metal layers, above metal-10 in this process 
node.  This generated layout required the minimum floorplan area to meet the design 
and performance constraints without violating the DRC and connectivity rules.  The 
52 
dimensions of the layout are 200um x 200um, with the individual ReRAM arrays being 
of size 75um x 75um. 
 
Figure 4-10 Layout of an integrated ReRAM RISC-V Processor tile. 
  Table 4-2 summarizes the measured area and the impact penalty of integrating 
a single ReRAM cluster with a RISC-V processor.  The total area of this integrated 
design was 40,026 sq. um.  Each ReRAM array?s area was 75um x 75um, with the total 
ReRAM dimension being 150um x 150um.  This allows for a total data storage for all 
four ReRAM arrays of 244kB, assuming a 5.6?2 ReRAM cell per layer.  For a 2-level 
53 
stack, this translates to 488kB of total storage.  The total blocked region blocked for 
the ReRAM?s peripheral circuit was 5600 sq. um, which is 25% of the total ReRAM 
area of 22,500 sq um, in line with the expected overhead for a 2-layer ReRAM stack.   
 
Table 4-2 Integration Results 
After accounting for the peripheral blockage area and the actual standard-cell 
logic area of the processor, the total integrated layout incurs an additional overhead of 
~11.3% in the 45nm process.  The area penalty from the integration is measured as the 
difference between the total area of the integrated design and the sum of the VSCALE 
processor area and the ReRAM blocked region.  This area penalty is mainly attributed 
to additional area needed for the routing of signals due to the blocked area in the center, 
around which there would be a higher incidence of routing congestion.  There is also 
minor contribution due to additional filler cells incurred due to the larger overall area 
54 
of the block.  Filler cells are needed periodically to provide tap connections to the n-
well and p-substrate from the power supply.  This ensures that the body node of the 
transistors is well-biased.  A larger area, therefore, requires more of these tap 
connections, increasing the overall area needed as well.  This overhead is the area 
penalty due to additional area required for routing and standard-cell placement 
inefficiencies caused by noncontiguous regions available for the processor.   
 
4.4 Multiple ReRAM Cluster Integration   
Due to the small number of bits that each array outputs, about 4-8 bits/array, I 
expect many ReRAM arrays are tiled across the chip to form mats.  These mats are 
accessed in a ganged mode to provide sufficient bandwidth.  To study the area impact 
of such an approach, I studied the impact of multiple ReRAM arrays integrated into a 
single core, as illustrated in Figure 4-11.   
55 
 
Figure 4-11 Embedding Multiple ReRAM Mat Clusters within a Larger 
Processor 
The previous area study used a single ReRAM array to fit within the VSCALE 
core.  VSCALE is a 32-bit integer core and is not representative of realistic cores which 
tend to be larger and more complex.  To correspondingly increase the core size, I scaled 
the VSCALE processor?s data path from 32-bit to 256-bit.  Figure 4-12 below shows 
the scaled 256-bit VSCALE processor without any embedded ReRAM.  The minimum 
generated layout had a floorplan dimension of 533um by 533um and an area of 284,077 
sq. um at the 45nm technology node using the FreePDK based Nangate standard cell 
56 
library.  This larger core allows us to integrate multiple ReRAM mats into the VSCALE 
design.   
 
Figure 4-12 Scaled 256-bit VSCALE Processor Layout 
Using the larger 256-bit VSCALE processor, I studied the impact of embedding 
four of the mat clusters within them in a 2x2 tile pattern.  For the ReRAM array size, I 
aimed for an array of 1000 x 1000 matrix, and therefore used an array size of 
109umx109um, making the mat size to be 218um x 218um.  This size allowed us to 
tile the 2x2 mat within the 256-bit VSCALE processor for the purpose of my study.  I 
iteratively varied the inter-tile cluster spacing to obtain the optimum spacing for 
minimum overall area for a range of tile spacings from 50um to 400um.  The minimum 
57 
area for each spacing parameter was found by iteratively reducing the floorplan 
dimension to check for feasibility.   
Figure 4-13.  below illustrates the floor planning result at the extremes of the 
inter-tile spacing when embedding ReRAM clusters within a larger circuit.  If there is 
not sufficient spacing between the ReRAM peripheral circuit?s blocked regions, then 
network congestion occurs when the processor blocks are being placed between them 
which cannot be resolved by the APR. On the other hand, if the spacing is too far apart, 
the entire generated layout can fit between the tiles resulting in large unused spaces.  
This can be seen Figure 4-13(a), which has an inter-crossbar spacing of 300um, and an 
overall chip dimension of 750um x 750um.  Figure 4-13(b) shows the minimum area 
configuration for an inter-crossbar distance of 50um and a specified floorplan 
dimension of 750um x 650um (width x height).   Note that it might be possible to 
optimize this layout manually and utilizing the areas in the corner to overcome this, 
however manual layout is beyond the scope of my initial area study.   
58 
 
(a) Spacing too large    (b) Mat spacing too close 
Figure 4-13  Inter-Mat ReRAM Array Spacing causing Inefficient Layout 
  Figure 4-14 shows one of the generated layouts with four clusters of ReRAM 
arrays tiled and embedded within a 256-bit scaled VSCALE version.  This layout shows 
the minimum area possible for an inter-mat spacing of 200um.  There are four ReRAM 
mats, with each mat consisting of 4 arrays themselves.  The total number of ReRAM 
arrays in this layout is 16, each of which follows the dimensions in the previous section.  
The design was obtained by iteratively reducing the overall floorplan size until the APR 
generated the layout successfully for this specific inter-tile spacing.  The APR tool itself 
attempts 10 iterations by default to optimize the signal routing to meet the timing spec 
59 
in the minimum possible area.  Once the connectivity is verified, the DRC checks are 
performed to ensure that none of the physical design rules are violated.   
 
Figure 4-14 Multiple ReRAM clusters integrated with a 256-bit RISC-V 
Processor 
The iterative process of finding the minimum area was repeated for a range of 
inter-mat spacing from 50um to 400um and the results are presented in Table 4-3.  The 
minimum width represents the width of the minimum design, considering the width of 
60 
the combined ReRAM MAT widths, and the width of the spacing.  The ReRAM mat 
dimension is 218um x 218um, as I mentioned earlier.  For example, at an inter-mat 
spacing of 25um, the minimum width would be 2*218um+25um=461um.  The 
minimum area therefore would be square of 461, or 212,521 sq um.  This type of 
floorplan would have no spacing on the outer edge of the array and therefore is not a 
feasible design.  The chip area denotes the actual minimum floorplan area to realize the 
processor-ReRAM integrated design.   
Inter tile 25 50 100 150 200 300 400 
spacing (um) 
Min width 461 486 536 586 636 736 836 
(um) 
min area  212521 236196 287296 343396 404496 541696 698896 
(sq um) 
chip area  562500 487500 390000 390000 422500 562500 722500 
(sq um) 
stdcells only  223365 221981 221697 222471 221997 221465 221632 
(sq um) 
stdcell area  516402 441178 343797 343797 376257 516256 677012 
(sq um) 
stdcell 39.71% 45.53% 56.85% 57.04% 52.54% 39.37% 30.68% 
efficiency 
array area 190096 190096 190096 190096 190096 190096 190096 
array 33.79% 38.99% 48.74% 48.74% 44.99% 33.79% 26.31% 
efficiency  
% penalty 0.71 0.49 0.19 0.19 0.29 0.71 1.20 
Table 4-3 Summary of Inter-Mat Spacing on Area and Efficiency 
The stdcells-only row lists the raw standard-cells area reported from the 
synthesis tool, while the stdcell-area row lists the measured std-cell area from the APR 
61 
tool, to include the additional area needed for routing.  As can be seen on the results, 
this is often double of the raw standard-cell area.  The stdcell efficiency reports the area 
occupied by the std-cell over the overall floorplan area and is intended to be a metric 
of how much usable space was devoted for the processor logic.  At the optimum spacing 
of 100um or 150um, I see the standard cell efficiency being close to 60%.   
The array area row denotes the total ReRAM array area that are used within the 
floorplan.  Note that most of this array area makes use of higher-metal lines that don?t 
coincide with the lower layers.  As a result, there can be a high amount of overlap 
between the stdcell and the ReRAM array area.  This is reflected in the results that 
show that at the optimum inter-mat spacing of 100um, the array efficiency is close to 
50%.  The final parameter, % penalty, denotes the additional area incurred from the 
integrated ReRAM-Processor system.  For the four MAT, the total area of the peripheral 
region incurred 44,172 sq um.  This is in-line with the 25% guideline that I followed 
for the array-to-peripheral area ratio.  The table results indicate that the optimum 
configuration has a penalty of 19%.  
The plots presented below in Figure 4-15 show the results of the optimal spacing 
and minimum area as a function of the inter-crossbar spacing.  The x-axis lists the inter-
tile spacing of the ReRAM mat blocks varying from 25um to 400um.  Note that this 
62 
spacing is uniformly applied between all of the MAT blocks.  The y-axis in the first 
figure shows the area in sq mm.   
 
Figure 4-15 Impact of Inter-MAT ReRAM cross spacing on Area  
The min-area, as described earlier, lists the theoretical limit on the minimum 
feasible floorplan, considering the spacing between the MAT blocks and the sizes of 
the MAT blocks themselves.  The chip area line denotes the minimum successfully 
generated layout by the APR tool, given the timing constraints.  I see that at large inter-
tile spacing for the MAT blocks, the realized design is close to the theoretical limit.  
This is because the spacing between the MAT blocks is so large, that the entire design 
63 
is able to fit within the inside of the MAT arrays, similar to the design indicated in 
Figure 3 12(a).  
Figure 4-16 shows the impact of inter-mat spacing of the ReRAM blocks on the 
array and standard cell efficiency.  The x-axis varies the inter-mat spacing and the y-
axis reports the efficiency and penalty numbers as a percentage. 
 
Figure 4-16 Impact of Inter-MAT ReRAM cross spacing on Efficiency  
  The standard-cell efficiency peaks at 57% at the optimum spacing of 150um.  
The ReRAM peripheral logic area occupies a total of 44,172 sq um, which accounts for 
11.3% of the total design area of 390,000 sq um.  The power rings surrounding the 
floorplan also consume some area, approximately 13%.  The remaining area numbers 
64 
could be accounted for filler cells, and unused standard-cells at the corner of the 
floorplans.  Since the area study is done in increments of 50um, a finer step might 
indicate a lower feasible design than identified.   
On the ReRAM array side, the efficiency peaks at 49%.  The limitation on the 
ReRAM side preventing the array from completely covering the provided area is the 
peripheral circuits that align with each row and column.  Completely covering the array 
would mean that these peripheral circuits extend all the way to the end of the floorplan, 
severely limiting the signal interconnections across the blocked regions.  While the 
inter-tile spacing dictates the spacing between the mat blocks, the overall floorplan 
dimension dictates the spacing from the boundary of the design to the edge of the 
peripheral region.   
The third curve in fig shows the area penalty of integrating the two design blocks.  
The area penalty is calculated by subtracting total design area from the individual 
processor and ReRAM peripheral block area.  This penalty accounts for the cost of 
disrupting the processor floorplan area with a ReRAM peripheral block, largely due to 
additional routing for signals between standard cell groups, with possible routing 
around blockage regions.   
To summarize, the results show that an optimum inter-ReRAM spacing exists 
to maximize area efficiency at close to 50%.  At 45nm, with my design configuration, 
65 
the optimum inter- spacing is 100um to 150um.  Larger spacing (> 200um) leads to 
inefficiency from unused synthesized areas (empty space) while smaller spacing 
(<100um) leads to inefficiency from routing congestion between standard cell groups.  
The optimum spacing produced a peak array efficiency of 50%, with around 20% area 
overhead penalty for this configuration.  For alternative configurations, the specific 
optimal point could be affected by the relative size of the processor and the blocked 
region due to the ReRAM array and would be worth investigating this relationship in a 
future study. 
The 256-bit VSCALE extrapolation only scales the data path portion of the 
processor and will not model impacts of the control path of a more complex, realistic 
processor.   However, I am only interested in the impact of routing congestion from a 
larger processor.  For this purpose, extrapolating the data path is likely to have a higher 
impact on the generated layout rather than from a more complex control path.  This is 
because I believe while complex control circuits might require more interconnects, 
these connections would be spatially local.  On the other hand, data-path connections 
typically tend to span over longer distances to connect between subblocks.  Therefore, 
I believe the area impact results would be a conservative indication of more realistic 
processor. 
66 
The total consumed area for the optimum layout was 0.4 sq. mm, with a total 
ReRAM data storage of 4MB for a 2-level stack, for all four clusters combined.  The 
implementation shows a 2x2 array of ReRAM crosses integrated with a 256-bit integer 
RISC-V processor.  Using a ReRAM array of size 109um x 109um, the total ReRAM 
data storage realized would be 4MB at 45nm process node. 
Extrapolating these results to an 8-layer stack would create a 16MB ReRAM 
memory integrated into the ReRAM-CPU tile with an area of 0.4 mm2. For a 400 mm2 
die size, the above ReRAM array could be tiled 1000 times across the chip, to produce 
a total storage capacity of 16GB ReRAM. Because an 8-layer stack would require 
additional peripheral circuits to decode the wordline per stack and/or higher current 
driving transistors, the number of cores will be scaled down.  At the 16nm process, 
assuming a 10x reduction in area, a 400 mm2 chip should be capable of delivering 
160GB ReRAM storage with logic underneath assuming a 50% area efficiency. 
Error! Reference source not found. contains the final Cadence Encounter 
command file used to specify the blockage settings and perform the APR to generate 
the layout. 
 
67 
4.5 SRAM-ReRAM Integrations 
One other configuration of interest is integrating an SRAM memory array 
underneath the ReRAM memory.   The motivation for this is an SRAM array that would 
function as a write-back cache to an ReRAM main-memory so that the impact of 
ReRAM write latency, which is on the order of 1us, could be minimized.  For my study, 
I have selected an open-source academic memory compiler, called OpenRAM [6], 
created by UC Santa Cruz and OSU.  This tool includes SRAM leaf cells for the 45nm 
process using the same FreePDK45 design kit used by my standard-cell logic.   
The SRAM bitcell used by the OpenRAM library at the 45nm node is shown in 
Figure 4-17 and compared with a 45nm ReRAM cell, which is close to 100 times 
smaller.  The left side of the figure shows the ReRAM bitcell layout modeled as a cross-
section of two metal layers, with a bitcell size of 5.6*Feature2.  At 45nm technology, 
this translates to 106nm x 106nm per bit.  The right side of the figure shows the bitcell 
leaf-cell from the OpenRAM library provided by UC Santa Cruz and Oklahoma State 
University at 45nm.  The bitcell dimensions are 0.707um x 1.344um and is composed 
of the conventional 6-transistor design.  Since the academic version of the SRAM 
bitcell can be 2.5x larger than commercial version, I can expect the difference between 
the ReRAM and SRAM bitcells to be closer to 35x larger.  Industry SRAM bitcells are 
optimized for the specific process they are to be fabricated in, and therefore have special 
68 
SRAM DRC rules that allow the pitch of the metal and base layers to be drawn closer 
than for regular logic, due to the regularity of the lithography pattern. 
 
Figure 4-17 ? Bitcell Relative Sizes at 45nm 
Figure 4-18 shows the generated memory array bitcells (a) and the complete 
generated memory (b) in the 45nm technology.  Figure 4-18 (a) shows the regular 
structure of two SRAM bitcell rows.  The generated memory contains 128 rows and 
256 columns and has a storage capacity of 4kB.  The total area for the SRAM memory 
is 194.1um by 207.86um.  
69 
(a) 
(b)    
Figure 4-18 - OpenRAM 45nm (a) Generated Bitcell Array (b) SRAM 
Figure 4-19 shows four SRAM arrays placed together with four ReRAM array 
on top.  The SRAM arrays are rotated to allow for the I/O ports of the SRAM to be 
accessed externally and not conflict with the central control region of the ReRAM array.   
70 
 
Figure 4-19  ReRAM Integrated with SRAM memory 
The total SRAM capacity in this instance is 16kB (4kB each SRAM) with a total 
layout area of 211,725 sq. mm.  I have drawn the SRAM arrays rotated to allow for 
their I/O ports on outside of tiles.  The four ReRAM arrays each are drawn as a 115um 
x 115um array, with a total data storage of 1.1 MB for a 2-level ReRAM stack, with 
potential to scale to multiple layers based on fabrication capability.   
Compared to the ReRAM-CPU layout, SRAM?s array region would largely be 
limited to the lower metal layers (below metal-4).  Therefore, ReRAM I/O connections 
can be made on the higher regions without difficulty.  Also, because the four SRAM 
71 
arrays are independent blocks, there is no need for the signal feedthroughs on the 
peripheral blockage regions, which makes this a more straightforward implementation.    
 
4.6 Memory Architecture Calculator (MAC) 
My next plan with regards to the physical design study was to use the area 
overhead numbers obtained to create a rough estimator on the die size, while being 
integrated with different processor types.  Since I are considering a tile-based 
architecture, I looked at existing commercial and academic processors that have 
multiple-cores that could be adopted in such a way.   
I considered four processor types for the study:   
1. Raven-3 RISC-V processor with 56kB L1 cache per core  
2. Fujitsu Sparc64 XII processor with 128kB L1 cache per core  
3. Intel Skylake-X processor with 64kB L1 cache per core  
4. Intel Xeon Phi (Knights Landing) with 32kB L1 cache per core  
My target process node for my in-house calculator was 16nm.  I extrapolated 
the area per core based on die-size measurements to estimate the per-core area for each 
of the processors at 16nm.  They are listed in Table 4-4.  Each of the different provide 
different functionality targeting their specific application, and consequently the area per 
72 
core varies based on the complexity.    This is reflected in the peak performance results 
listed for each processor. 
Processor Area per Avg Power Peak Power 
core per core Performance Efficiency 
[mm2] [W] [GFLOPS] [GFLOPS/W] 
RISC-V 0.55 0.17 6 34 
Sparc64 5.02 24.5 448 1.14 
Intel Skylake 16.9 9.17 1152 6.98 
Intel Xeon Phi 3.13 3.61 3456 13.29 
Table 4-4 Area, Power, and Performance comparison of Processors 
I created a web based Monolithic Architecture Calculator (MAC) using 
JavaScript to provide rough estimates on what can "fit" in each chip dimension.  Figure 
4-20 has a screen-capture of the MAC interface.   
 
Figure 4-20 MAC JavaScript Architectural Area Estimator 
73 
This can be used to assess architectural tradeoffs with various design options on 
a Monolithic Memory-Processors System and have the specific instance count of the 
different processor and memory controller.  Users can specify cache size and number 
of memory controllers on a 2D mesh NoC topology. The left-side of the frame is the 
user-input, and the right-side summarizes the resulting characteristic of the chip based 
on the parameters shown, when the user clicks on ?CALCULATE?.  User selects the 
type of main-memory (ReRAM or DRAM), the die-size, the processor type.  The user 
also can select the ratio of area allocated between core and cache.  The default value 
shown of 0.85 specifies 85% allocated for the processor area with 15% reserved for the 
SRAM cache area.  The user can also specify the number of memory controllers, which 
can be an iterative process based on the number of processors that can fit.  The example 
shown has a core processor to memory controller ratio of 1:1.   Error! Reference 
source not found. has the complete JavaScript source code for the MAC. 
 
4.7 Alternative Floorplan arrangements (L, Crossbar, Fractal design) 
In this chapter, I analyze 3D floor planning options on how to partition the 
different blocks and I/O placement to minimize routing congestion and performance. 
74 
The previous experiment showed that integrating with a cross like connection 
in the middle of a processor logic limits the overall array efficiency of the chip.  Here 
I am trading off the ability to connect to several discrete ReRAM memories locally to 
processor tiles to provide high bandwidth.  As an alternate, if memory capacity is of 
prime importance, there is a way to approach near 100% array efficiency by utilizing 
an L-shape for the overall memory. 
The floorplan shown in Figure 4-21 shows a 3-instance grouping of VSCALE 
processors (Single issue 3-stage in-order 32-bit integer RISC-V processor) underneath 
a 1MB 2-layer stack ReRAM memory in the 45nm process, as the previous section.  
The APR layout area APR area without ReRAM came out to be 304um x 304um = 
92,712 sq. um, while adding this L-shaped ReRAM floorplan increased the area to 
320um x 320um = 102,400 sq. um.  This shows a negligible area overhead penalty 
from incorporating ReRAM in this way: 102.4k/ (92.4k + 11k) = ~1, i.e., no increase 
in area.  I attribute this to the fact that since the available area for performing the APR 
is contiguous, no additional routing area needed. 
75 
 
Figure 4-21 ReRAM with 3-core VSCALE processor 
Depending on the array and processor size, each tile could be a self-contained 
core along with a memory, as shown in Figure 4-22.  A small region between the tiles 
could be used for inter-tile routing channels and for network-on-chip (NoC) signals.   
However, there is a limitation in the ReRAM array size being too large, as this increases 
the read and write latency of the memory.  
76 
 
Figure 4-22 Independent Core with ReRAM block 
Therefore, for designs that can tolerate a single interface point, it is possible to 
achieve a much higher array efficiency for the chip by placing the memory peripheral 
circuit alongside two edges of the chip.  This ensures that maximum contiguous area is 
available for the APR tool.  
With the motivation of having ReRAM integrate with a tiled processor, there 
are two floorplan options available based on the communication needs.  In the case of 
a star network topology, the fractal design shown in Figure 4-23(a) allows for every 
ReRAM + Processor tile to be connected through the central node to any other tile.  By 
not closing off the fourth tile, interconnection congestion would be prevented. This type 
77 
of topology would typically be used in a server-client type of system with a need of 
central network connection.  
 
Figure 4-23 Alternative ReRAM-Processor integration floorplans showing (a) 
Fractal approach for Star topologies and (b) Mesh approach for mini-core 
parallel architectures 
As an alternate, consider the massively parallel multi-processor approach where 
each individual tile consists of a modest processor coupled with local memory to 
provide higher power efficiency for certain tasks.  These typically adopt a mesh-
architecture topology where the interconnect communication is handled by a separate 
NoC (network-on-chip) control circuit.  Figure 4-23(b) shows a possible approach of 
how this type of chip could be implemented with the ReRAM tiles. 
 
4.8 Conclusion 
Three observations are of note with the results obtained so far.  First, I have 
shown that by making minor modifications to established standard tool flows, it is 
78 
feasible to create a hybrid chip utilizing ReRAM, logic, and embedded SRAM blocks.   
Second, ReRAM density with respect to SRAM is quite favorable, especially using the 
2-layer stack implementation.  In the case of the core, floorplan results indicate that I 
can integrate the peripheral logic with minimal area penalty, while gaining the ability 
to create an integrated processor-memory system.  Finally, I gave an overview of 
alternate floorplans arrangements that maybe suitable for specific applications that 
align with the memory access pattern.  
79 
 
 
 
5 ReRAM Device-Level Research Study 
 
Based on my previous study, we believe that emerging memory technologies 
such as ReRAM that can be integrated onto standard CMOS processes have a 
significant advantage in replacing conventional DRAM as main-memory systems.  
These memory systems provide highly parallel, low granularity memory systems that 
support graph algorithms that are critical for machine learning and data science 
applications.  In this section, I cover the research study that addresses the challenges at 
the device-level. 
 
5.1 Motivation 
ReRAM?s high write energy and write latency requirements, along with lower 
write endurance, are key device-level challenges to be overcome when compared with 
existing DRAM solutions.  The higher write energy requirement for ReRAM (when 
compared with DRAM) comes from the need to induce a physical change for storing 
Non-Volatile data.  The data-retention time typically targeted for Non-Volatile ReRAM 
80 
is typically around 10 years.  However, ReRAM for Main-Memory applications do not 
necessarily require non-volatility of data.  Current DRAM solutions store the memory 
for a few milliseconds before a refresh operation rewrites the data to preserve them 
indefinitely, as long as the power supply to the chip is supplied.  I propose that if I 
reduce the data-retention requirement from 10 years to a much shorter time scale (for 
example: 100 seconds), it can be possible to use a lower write energy during the 
program operation.  This lower data retention bitcell could be augmented with a 
periodic refresh so that the data would be rewritten.   
Several prior work on ReRAM for neuromorphic applications, have 
demonstrated the switching between volatile and non-volatile states of these materials 
to mimic Spike-Timing Dependent Plasticity (STDP) [17-21].  For example, previous 
work by Shi, etc. [21] using Hexagonal Boron Nitride (h-BN) stacks-based ReRAM 
has shown switching behavior between volatile and non-volatile states.  There was an 
observed ?Self-Recovered region? which was an intermediate region between High and 
Low electrical stress which induced a time delay before ?resetting? of carbon filaments 
once the stress was removed.  I am not aware of anyone who is intentionally using the 
plasticity of ReRAM as a temporary memory storage device in order to exploit it for 
Main-Memory or DRAM replacement uses.   
81 
Based on reported characterization data for neuromorphic applications, there 
exists an intermediate region between high and low electrical stress where the memory 
retains the data for a much shorter time, but also has a lower electrical stress 
requirement.  This translates to a lower electrical voltage or current applied to the cell, 
and/or for a shorter time.  My approach is to use these materials in an intermediate 
region between volatile and non-volatile state where the data is retained for a much 
shorter time than is typically expected for non-volatile memory.  In this intermediate 
region, based on the electrical stress applied, the conductive filaments remain for a 
shorter period, after which, the metal ions migrate back to the electrodes, relaxing the 
cell?s state.  Also, because this intermediate region requires less electrical stress than 
the non-volatile state, this translates to a lower write-latency, and/or lower program 
current/voltage to write to the cell.  Additionally, this would also alleviate the 
requirement for a higher-voltage supply and corresponding charge pump circuitry to be 
included on the chip.  This would make ReRAM-Processor integration more feasible 
for general applications, and not just read-heavy applications.   
 
5.2 Fabrication Approach 
  Figure 5-1 shows an example of a Resistive Memory stack with two layers of 
metal-oxide region for the resistive-switching.  This figure represents a cross-section 
of a ReRAM bitcell and exposes the material composition used to form the bitcell stack.  
82 
At the bottom of the stack, is a metal electrode formed with Platinum (Pt).  The resistive 
switching element is composed of two materials ? an Aluminum Oxide (Al2O3) and 
Titanium Oxide (TiO2).  Closing out at the top of the ReRAM stack are two metal 
electrodes ? a Titanium (Ti) layer, and a Platinum (Pt) top electrode layer.  This entire 
stack could be fabricated on top of substrate or on top of metal, depending on the 
process flow. 
 
Figure 5-1 ReRAM Metal Stack 
This ReRAM stack shown in the figure is one possible implementation of 
ReRAM that prior literature has shown to display the short-term plasticity.  Using such 
a device, one possible scenario is that the data could be loaded into ReRAM from 
storage and allow for the computations to take place on the data for a set duration.  After 
this set period, ReRAM data would be reloaded back from storage or refreshed from 
83 
ReRAM itself periodically, similar to what is done for DRAM memory.   Alternate 
operation modes could also be introduced that allow for varying levels of persistence 
of memory depending on the level of write energy applied. 
Oxide based ReRAM is attractive as the underlying metal insulator?metal 
structure is simple, compact and CMOS-compatible. Also, these materials have been 
observed to provide multi-level behavior and results in bipolar, asymmetric structure 
which follows the ionic migration model of the STDP behavior. Based on my literature 
survey, the following were identified as possible candidates for the ReRAM stack: 
1. HfOx-based RRAM:   TiN/HfO2/Ti/TiN,  TiN/HfO2/Mg/W 
2. Pt/Ta2O5-x/W 
3. Ta/TaOx/TiO2/Ti 
4. Ti/AlOx/ TiN 
5. Au/Ti/h-BN/Cu 
6. Pt/Al2O3/TiO2/Ti/Pt  
All of the above have been observed to provide multi-level behavior and results 
in bipolar, asymmetric structure which follows the ionic migration model of the STDP 
behavior.  Based on discussions with the UMD Nanofab lab, our universities? in-house 
fabrication and device testing facility, the final ReRAM stack combination 
Pt/Al2O3/TiO2/Ti/Pt was feasible option and I chose to fabricate this stack, as shown 
in Figure 5-1.   
84 
The process outline for fabricating the ReRAM device for the 
Pt/Al2O3/TiO2/Ti/Pt metal stack is as follows: 
1. Start with 4? silicon wafers covered by 200nm of Thermal SiO2 
2. Perform Standard Clean and Rinse 
3. Form Bottom Electrode:  Physical Vapor Deposition (PVD) of Platinum=Pt 
(60nm)  
4. PVD of 5-nm Al2O3  
5. PVD of 30-nm TiO2 
6. PVD of 15nm Ti 
7. Complete with Top Electrode: PVD of 60nm Pt 
 
I used the following shadow-mask configuration as my initial fabrication, which 
allows us to create the masks manually, without requiring an external mask supplier.  
Figure 5-2 shows the initial ReRAM devices that I planned to fabricate.  The devices 
will be on the range of 6mm for proof of concept.  The figure on the left shows the top-
down view of a 4? wafer with six devices of varying sizes, each having two probe points 
for the top and bottom electrode.  Note that these devices are the resistive switching 
element alone and does not include the select device needed in an array to control 
unselected cells? leakage current.  The figure on the right shows a 3D view of the 
ReRAM stack and the connection to the bottom electrode plate in Platinum.  The 
marked spots denote the location of the probe landings for my characterization 
measurements.  The exact dimensions of the devices are provided in the next section 
85 
which goes over the fabrication approach in detail.  Based on my understanding of the 
resistive filament creation, I believe that the filament width will be localized and limited 
based on the current and electrical stress applied.  Therefore, the shortest path through 
the oxide layers will limit the width of the filament.   
 
 
Figure 5-2 UMD ReRAM Device Fabrication 
The configuration in Figure 5-2 shows six discrete ReRAM devices that will be 
fabricated.  The top and bottom electrodes will be connected to test-probes to apply the 
stress and measure the resistance of the path.  My characterization plan is to study the 
relationship between the resistance state of the device, data-retention time, and the 
electrical stress applied.  The nature of the electrical stress is a combination of many 
parameters:    
86 
? Current Limitation (CL) ? The program or write pulse can be controlled to not 
exceed above a set current-limitation point.  This prevents the cell and write-
path circuits from being exposed to excessively high amounts of current and 
being damaged. 
? Pulse Height (voltage) ? The program operation involves applying a voltage at 
a certain amplitude. 
? Pulse Length (time) ? The duration or the width of the write pulse applied. 
? Pulse Period (time between pulses) ? Certain write operations involve applying 
multiple write pulses in succession to move the placement of the resistance state.  
This induces the filament to be formed gradually and helps in avoiding over-
setting the bitcell. 
 
5.3 Mask Generation 
For the test devices, I use two masks to create the pattern needed, as shown in 
Figure 5-3 below.  The figure shows the top-down view of the masks used for the 
fabrication.  The combined overlay of the two masks is shown in Figure 5-3 (a).  One 
rectangular mask is used for the bottom electrode (Pt), which is Mask-1 in  Figure 5-3 
(b).  Mask 2 in the figure is composed of a circular opening for the oxide layers and the 
87 
top electrodes (Al2O3/TiO2/Ti/Pt).  A small alignment mark is placed on the top-right 
corner to help with the positioning of the second mask. 
   
(a)  Overlay of Masks           (b) Two masks used for device fabrication 
Figure 5-3  Mask Configuration 
I ordered 4? (100mm) Si wafers with the <100> orientation from University 
Wafers to create my test ReRAM device structures on using Physical Vapor Deposition.  
I initially used a 3D-printer to print a polymer mask to check for alignment and confirm 
with UMD?s fab-lab staff, shown below on the left.   Ultimaker Cura software (v 3.6.0) 
was used to create the stereolithography file (.STL) CAD descriptions for the two 
masks.  The masks were printed on a Creality 3D CR-10s printer using PLA, with a 
mask thickness of 0.5mm.  Figure 5-4 shows the Cura generated mask file (a) and the 
prototype 3d-printed mask (b).   
For the final mask, I decided to create the devices with different sizes to study 
the impact, and also changed the alignment marks to a circle (from a cross) to make it 
easier to create the mask. Figure 5-4 shows the final mask configuration I used to create 
88 
the hard metal mask.  Due to the higher temperature to which the mask would be 
exposed during the PVD process and the low resolution of the features (order of mm), 
I decided to use an Aluminum Shadow Mask for the features.  I ordered 6061 
Aluminum sheets (0.063? thick) from McMaster Carr. 
 
(a) Cura 3D-Model of Mask Prototype 
  
(b) 3D printed PLA mask prototype  (c) Final Mask Configuration 
Figure 5-4 Mask Prototype Creation 
89 
University of Maryland has an iReap Machine Lab with a ProtoTRAK SMX 
milling station which I used to cut the features and create the mask. The dimensions 
input for the first and second mask are given below.  There are six devices of varying 
dimensions that were fabricated.  The milling tool has the option to cut geometric 
shapes with specified location and dimensions.   
For my mask generation, I used the circle and rectangle pattern to input the 
features to be drawn shown in Figure 5-5.   For both masks the lower-left (LL) and the 
upper-right (UR) alignment marks were drawn as circles, which is an easier geometry 
to draw. The Xo and Yo indicates the origin of the feature, which is the center for a 
circle and the lower-left and upper-right coordinates for a rectangle.  Mask 2 specifies 
the actual location and dimension of the ReRAM stack.  The bitcell diameters used are 
two devices of 5.94mm, three devices of 7.56mm, and one device of 14.04mm.  
90 
 
Figure 5-5 Final Mask Configurations for Mask 1 (left) and Mask 2 (right) 
Table 5-1 lists the dimensions and coordinates used to specify the location of 
the features that were input into the milling tool for the two masks.  With respect to the 
center point of the mask, the respective X and Y coordinates for the different features 
are explicitly specified.  The dimensions for the circle that would be encompass the 
ReRAM stack are also specified in the diameter parameter.  
91 
Xo Yo  Mask 1 Features 
LL alignment mark, 
-38 21  Circle diameter = 7.56 
UR alignment mark, 
26 -34  Circle diameter = 7.56 
-14 -28  LL coord Rectangle 1 
13 -42  UR  coord Rectangle 1 
Xo Yo Mask 2 Features 
-38 6  LL coord Rectangle 2 
 
LL alignment mark, 
-24 -21  UR  coord Rectangle 2 -38 21  Circle diameter = 7.56 
-13 5  LL coord Rectangle 3 UR alignment mark, 
1 -22  UR  coord Rectangle 3 26 -34  Circle diameter = 7.56 
13 -10  LL coord Rectangle 4 -7 -36  Circle 1, diameter = 7.56 
34 -21  UR  coord Rectangle 4 19 -15  Circle 2, diameter = 5.94 
13 6  LL coord Rectangle 5 -32 -1  Circle 3, diameter = 7.56 
34 -6  UR  coord Rectangle 5 -6 -2  Circle 4, diameter = 7.56 
-25 39  LL coord Rectangle 6 18 0  Circle 5, diameter = 5.94 
23 13  UR  coord Rectangle 6 -12 26  Circle 6, diameter = 14.04 
 
Table 5-1 Mask Feature Specifications  
The pictures below in Figure 5-6 (a) show the ProtoTrak SMX milling station 
which allows the features? coordinates and dimensions to be input.  Figure 5-6 (b) 
shows the features being cut into the aluminum sheet.  Figure 5-6 (c) shows the final 
two masks after the cut, with them overlaid on top of each other using the alignment 
marks in Figure 5-6 (d).  The edges of the masks were deburred to smooth them out.   
92 
  
  
Figure 5-6 Mask Fabrication clockwise from top: (a) ProtoTRAK SMX Milling 
Station (b) Sheet Mask being cut (c) Finished mask set (d) Finished Mask Set 
overlaid 
These two masks fabricated create the rectangular bottom electrode and the 
circular metal-oxide ReRAM stack along with the top-electrode.  My process used 
Platinum and Titanium for the metal electrodes and Aluminum-Oxide and Titanium-
Oxide for the metal-oxide stack.  I used UMD's Physical-Vapor-Deposition chamber to 
sputter the materials to the areas, which is a feasible approach due to the larger 
dimensions of these devices. 
93 
5.4 Device Fabrication 
 I worked with UMD?s nanofab lab to fabricate the Pt/Al2O3/TiO2/Ti/Pt ReRAM 
devices on the 4? (100mm) wafer using the aluminum masks I had previously milled.  
As mentioned before, my motivation is to study the use of ReRAM devices in an 
intermediate region between volatile and non-volatile state where the data is retained 
for a much shorter time than is typically expected for non-volatile memory.  In this 
intermediate region, based on the electrical stress applied, the conductive filaments 
remain for a shorter period, after which, the metal ions migrate back to the fill the 
oxygen vacancies in the filament, relaxing the cell?s state.  Also, because this 
intermediate region requires less electrical stress than the non-volatile state, this 
translates to a lower write-latency, and/or lower program current/voltage to write to the 
cell.   
The process outline for my device fabrication is shown in Figure 5-7.  Figure 
5-7(a) shows the initial SiO2 deposited onto the silicon substrate.  I used 4? (100mm) 
Si wafers covered by 200nm of Thermal SiO2.  Fabrication of the devices was 
performed using Physical Vapor Deposition (PVD).  First, the bottom electrode was 
formed of 60nm Platinum using the first mask (rectangular base), as shown in Figure 
5-7(b).  Then, the metal-oxide layers and the top-electrodes were deposited using the 
94 
second mask (circular).  The thickness used were 5nm Al2O3, 30nm TiO2, 15nm Ti, 
and 60nm Pt for the top electrode, as shown in Figure 5-7(c).   
 
Figure 5-7  Fabrication flow for Pt/Al2O3/TiO2/Ti/Pt ReRAM structures (a) 
Thermal SiO2 (b) Mask 1: PVD of bottom electrode (c) Mask 2: PVD of ReRAM 
stack and top electrode 
The material deposition was performed using the Denton Ebeam/thermal 
evaporator.  Figure 5-8 shows the PVD chamber setup used for the fabrication.  On the 
lower half of the chamber, shown in Figure 5-8(a), the E-Beam is generated and 
directed to the crucible.  The material to be sputtered is placed in the crucible and 
magnets on the side of the chamber are used to direct the E-Beam towards the crucible.  
The chamber is brought to a low-pressure environment to accelerate the conditions for 
evaporation.  A shutter resides over the crucible preventing any early evaporated 
material from reaching the wafer, which is mounted on the upper half of the chamber.  
A mirror mounted on the sidewall of the chamber allows for the material in the crucible 
to be observed through a window, to ensure that the material has evaporated.  Figure 
5-8(b) shows the upper half of the PVD chamber.  The wafer along with a hard-mask 
95 
is mounted onto the wafer-clamp facing the crucible (upside down).  A sensor to the 
side of the wafer clamp is used to measure the amount of deposited material, which is 
used to calculate the thickness of the material deposited.   
 
Figure 5-8 ? (a) PVD chamber used for fabrication (b) Fabricated test wafer of 
discrete devices with probe measurements 
For Platinum, the evaporation temperature is 1768 deg-C (3214 deg-F). My 
fabrication process began with loading the Denton E-Beam/Thermal evaporator with 
the materials in the crucibles (see Figure 5-9 below).    
96 
 
Figure 5-9 - Crucible materials into PVD Chamber 
 After cleaning the hard aluminum mask with Isopropyl Alcohol (IPA) to wipe 
down any debris, the mask was clamped onto a wafer and mounted to the chamber.  
Figure 5-10 shows the wafer clamped with the first mask and the platinum, which is 
the bottom electrode, being deposit onto the wafer.  The two-circular alignment 
markers can be seen in the corners of the wafer.  This is used as reference when 
clamping the second mask onto this wafer. 
  
Figure 5-10 ? Platinum Deposition on First Mask 
97 
The above deposition process was repeated with the second mask to deposit the 
remaining materials onto the wafer.  These include the metal oxide materials 
(Aluminum Oxide and Titanium Oxide), and the top metal electrodes (Titanium and 
Platinum).  Figure 8-11 shows the final fabricated wafer with three device diameters of 
5.94mm, 7.56mm, and 14.04mm.  Since the filament width will be largely localized to 
the stress location, the shortest path through the oxide layers will limit the width of the 
filament.  Thus, the exterior dimensions of the bottom electrode or the top electrode 
should be largely irrelevant.    
 
Figure 5-11 PVD Chamber and MicroProbe Station 
Figure 5-12 shows the probe landed on the wafer, the etch mark (b), and the 
boundary of the top-electrode and metal stack, at a magnification of 2.5x.   The figure 
98 
shows some of the surface deformities from the deposition, which is a limitation with 
the equipment and process used. 
 
(a)                  (b)                      (c)   
Figure 5-12 - Die Photograph of Fabricated Devices (a) Probe Landed (b) Probe 
etch mark (c) Top-electrode/Metal Stack boundary 
In order to analyze the composition of the ReRAM stack, I took a cross-section 
of the ReRAM bitcell.   The Scanning Electron Microscopy (SEM) cross-section photo 
and the EDS spectra of the stack are shown in Figure 5-13, confirming the presence of 
the various materials deposited.   
 
Figure 5-13 SEM Cross-section photo with EDS spectra of the ReRAM stack  
99 
SE is the Scanning electron image of the stack, where the stack can be seen as 
the lighter region.  Si displays the presence of the Silicon substrate and is largely located 
beneath the stack.  Pt displays the platinum element deposited in the stack layer.  The 
Aluminum did not show up local to the stack alone and could be an artifact of the tool.  
Oxygen was detected in the stack and should be present in the Aluminum-Oxide and 
Titanium Oxide layer.  Titanium is also shown to be present slightly higher than the 
rest of the layers, as part of the top electrode.  Though the resolution of the individual 
material position and thickness is quite low, I can use this EDS spectra to confirm the 
presence and rough location of the various materials. 
 
5.5 ReRAM Resistive Switching Behavior  
The first part of my characterization consists of confirm the resistive switching 
behavior.  In this experiment, my intent was to confirm that the resistant state itself 
could be altered between the low and high resistant states.  Characterization was 
performed on the fabricated devices at room temperature.  The Agilent 4155C 
parametric analyzer was used to drive the probe points and apply the program pulse.  
Figure 5-14 shows the oscilloscope measurement of the applied voltage pulse.  The 
voltage ramps from 0v to a peak of 5v, with a step duration of 92ms, which was the 
shortest pulse duration possible with the parametric analyzer. 
100 
 
Figure 5-14 ? Oscilloscope Measurement on the Applied Program Pulse 
Figure 5-15 shows the SET transition from a high-resistive-state (HRS) to a low-
resistive-state (LRS) and the RESET transition from HRS to LRS using bipolar 
program mode operation.  The program operation was performed by applying a voltage 
from 0 to 6v with a current compliance of 100uA for SET operation and 1mA for 
RESET operation.   The x-axis shows the voltage applied and the y-axis shows the 
current measured across the cell.    
At the positive voltages, as the voltage ramps from 0v up to 4v, the current 
measured is at 1uA, reflecting the state of the cell.  For this cell, this seems to reflect 
the open-circuit current of not a fully formed filament.  At around 4v, I see the cell 
transition to abruptly to a higher current state.  This could reflect the conductive 
filament being formed across the electrode material.  Beyond 4v, the cell retains its 
lower-resistance state.  A subsequent SET pulse, ramping again from 0 to 6v, confirms 
that the cell-state is retained and remains SET.   
101 
On the negative voltage side, as I ramp down the voltage from 0v to -6v, I see 
that the cell is able to maintain the low resistance state (and the filament) for most of 
the region.  At -5.8v, the cell abruptly switches to a low-current/high-resistance state 
again.  I can visualize that the negative bias repaired the oxygen vacancies created 
during the SET pulse, thus ?breaking? the conductive filament between the two 
electrodes.  A subsequent RESET pulse, ramping from 0v to -6v, confirms that the 
bitcell remains in a higher resistance state. 
 
Figure 5-15 - ReRAM Switching between LRS and HRS in bipolar program 
mode 
Published literature has shown two modes of write operations for ReRAM 
bitcells ? bipolar and unipolar modes.  Bipolar mode involves applying a positive 
voltage for SET-going operations, while using a negative voltage for RESET-going 
102 
operations.  Unipolar mode, on the other hand, uses positive voltage for both SET-going 
and RESET-going operations.  The difference between the two operations is determined 
by the maximum voltage applied (higher for SET-going) and the current-compliance 
limit applied (higher for RESET-going).  The results shown previously used the bipolar 
mode of operation.  Unipolar mode is based on thermal acceleration of redox transitions 
and is simpler to implement but can lower cycling endurance. Bipolar mode is based 
on ionic migration assisted by electric field, has higher endurance due to the defects 
being conserved and is therefore generally a more popular method of program [28].  
TiO2 has been observed to switch in both bipolar and unipolar methods of 
resistive switching. I confirmed resistive switching operation in unipolar mode as well, 
as shown in Figure 5-16. The x-axis shows the voltage applied, and the y-axis shows 
the measured current.   
For SET, the program voltage was ramped to 6v, with the current compliance 
set to 100uA, while for RESET, the program voltage was ramped to 3v, with a current 
compliance of 1mA.  The figure confirms both successful SET-going and RESET-going 
operations.  For SET-going cell starting at a high-resistance state (with low current 
measured), at 3.8v, there is an abrupt change in current measured reflecting a state 
transition to LRS state.  After the transition, the current measured is higher, reflecting 
103 
a low-resistance state implying the creation of a filament. For RESET-going cell, the 
current measured is linear with increased voltage, implying a constant resistance of 
18.5K-?.  At 2.5v, the RESET-going cell?s measured current abruptly drops implying 
a state transition to an open-circuit, HRS state. 
 
Figure 5-16 - ReRAM Resistive switching in Unipolar program mode 
 
5.6 Threshold Behavior at Low Current Compliance Limits  
My next set of measurements were intended to confirm the threshold behavior 
of ReRAMs.  At low-current compliance limits, a program pulse does not affect the 
state of the bitcell permanently.  In this mode, the bitcell acts as a passive device, 
104 
allowing current to pass when the bias is present, but not affecting the overall state of 
the bitcell.  Figure 5-17 shows the threshold behavior of the device on an LRS cell.    
 
Figure 5-17 - ReRAM Threshold behavior at low current compliance (Ic) limits 
I performed the measurement by ramping the voltage from 0 to 6v, and again 
from 0 to -6v for different current compliance levels.  Different current compliance 
limits (Ic) were applied, starting from a low current compliance of 1e-9 and increasing 
to 1e-4, in orders of magnitude.  The cell had an initial state of a low-resistance-state 
(conducting), and therefore as soon as the voltage rise, the current measured is clamped 
to the limit set by the compliance.  At each successively higher current compliance 
105 
level, the bitcell?s state was not altered from its set/LRS state to a reset state/HRS, even 
when the negative voltage bias was applied. 
 
5.7 Time Dependent Volatility Behavior 
I next characterized the data retention of the cell. Data retention is defined as 
the duration after program for which the programmed state is maintained in the bitcell. 
Previous work from literature survey [27] indicates that the initial conductance of the 
bitcell is the key dependent variable for predicting amount of state change.  The 
plasticity model proposed in [18] for example lists the following relationship between 
the change in conductance and the initial state. 
?? = ? ?  
Here, ?G represents the initial conductance and ?t represents the change in time.  This 
compact model implies that the change in conductivity with time is proportional with 
initial state of the bitcell.   Conductance of the bitcell is measured as the inverse of the 
resistance, 1/Rinit, where Rinit is the initial resistance. 
My characterization method is as follows.  I applied a program pulse in bipolar 
mode consisting of either positive or negative voltage bias, depending on whether cell 
was SET-going or RESET-going.  My current compliance varied from 1e-9 to 1e-3 A.  
106 
For my experiment, I measured the cell resistance immediately after program (Rinit) 
and after a wait-time of 5 minutes and 10 minutes to assess the change in resistance.   
I start from an LRS bitcell or a RESET bitcell, in which there is no conductive 
filament that has formed between the top and bottom electrode.  Applying the program 
pulse of different current compliance either successfully or unsuccessfully completes 
the formation of the filament.  As stronger, meaning one with higher current compliance, 
program pulse is applied, the probability of the filament formation is higher.  
Additionally, the thickness of the filament formed is also larger.  Conversely, weaker, 
or lower current compliance, may produce filaments that are thinner or not at all formed.  
There is some movement of the filament after the program pulse which contributes to 
the filament relaxing causing the breaking or thinning of the filament.  I expect that 
cells placed in intermediate states of resistance have a higher probability of this 
happening, causing movement towards a more RESET state.   
Figure 5-18 presents the summary of the data collected.  The x-axis is the initial 
resistance of the cell, measured immediately after the program pulse is applied, in log 
scale.  The y-axis is the change in resistance after 5 and 10 minutes had elapsed, also 
presented in log scale.  These results confirm the expected direction of resistance for 
the bitcells whose initial resistance, Rinit was below 10-M?.   For these cells, I see that 
for the most part, the change in resistance increased after 5 or 10 minutes, implying that 
107 
the bitcell became more reset, or moved towards a higher resistant state after the wait 
time.  For two bitcells in this region (below 10-M?), there was no change, and for one 
bitcell in this region (below 10-M?), the bitcell reduced in resistance slightly.  These 
were anomalous behavior, whose cause needs to be investigated further.  However, for 
all other bitcells, there was an increase in resistance, which is in line with the expected 
filament relaxation behavior.   
 
 
Figure 5-18 - Change in Resistance after 5 and 10 minutes delay as a function of 
the initial resistance. Log(Delta-Resistance) is calculated for the y-axis 
108 
For cells with Rinit above 10-M?, I observed a decrease in resistance after the 
10-min wait time, for these hard-Reset cells.  These were very high resistance state 
bitcells, whose resistance after the wait time shifted to a lower resistance state.  This 
was true for both the 5- and 10-minute wait time.  It is unclear the exact mechanism for 
this behavior. Other studies on single-crystal TiO2 ReRAM have indicated 
electrochemical resistive switching behavior after a post-annealing step [55], which 
could be a possible explanation.  Since the commonality amongst these cells is that 
they all are very high RESET state, I can theorize that this might be caused by a transfer 
of defects from the top electrode to the bottom electrode, causing the bottom electrode 
to be the defect reservoir.  This behavior was described in [28] as complementary 
switching (CS) during the absence of a current limitation with a positive voltage bias.  
After the wait time, the oxygen defects could have migrated from the bottom electrode 
back towards the top electrode causing the cell to move towards a lower-reset-state.  
I categorized the measurement taken by the cell size and plotted the result in 
Figure 5-19.  The bitcell diameters for Cell 2, 3, and 6 were 5.94mm, 7.56mm, and 
14.04mm, respectively.   The x-axis is the initial resistance of the bitcell in log-scale, 
and the y-axis is the change in resistance (Rdelay ? Rinit), where Rdelay is the 
measured resistance after the delay wait time, again plotted in log scale.  The data 
present here is for the combined 5- and 10-minute wait times.  The plot confirms that 
109 
the behavior observed is present on multiple devices of varying sizes and is not a 
function of the cell size.  I do notice that the initial resistance of the cell seems to have 
a slight relationship to the cell diameter, with larger cells having a lower initial 
resistance.  Although I did not focus the characterization on the impact of cell diameter, 
this would be a study for future work. 
 
Figure 5-19 - Resistance change over time grouped by Cell sizes with trend 
observed across multiple devices. Diameters of Cell 2=5.94mm, cell 3=7.56mm, 
cell 6=14.04mm. Log(Delta- Resistance) is calculated for the y-axis. 
Using the data on the bitcells below 10-M?, I fit the data to a linear equation of 
the log-log data points.  Equation modeling based on the observed data for cell 
resistances below 10-M? yields the following relationship.  
110 
???(??) =  1.64 ? log(? ) ? 3.37 
? .
?? = .  10
Here, Rinit is the initial resistance measured immediately after the program operation 
and ?R is the change in resistance after a wait-time.  
Figure 5-20 plots the fit of measured results against the predicted equation 
model for bitcells with their initial resistance below 10-M?s. The R-square of the fit is 
0.57.   
 
Figure 5-20 - Predicted vs Observed change in resistance for cellstates with Rinit 
below 10M?. Log(Delta-Resistance) is calculated for the y-axis. 
111 
For my final set of experiments, I measure the effect of current compliance 
applied to the data retention time as a function of time with measurements at 2 min, 4 
min, and 8 min. The cell was initially placed in an LRS state and increasing amounts 
of current compliance was applied. The program current compliance level affects the 
placement of the cell, with low current compliance levels not successfully moving the 
cell from an HRS to an LRS.  Figure 5-21 demonstrates the observed cell relaxation 
for the time range of around 10 minutes, collected at the following four specific points:  
Immediately after program, 2 min, 4 min and 8 min.  The x-axis tracks the time elapsed 
after the program pulse, measured in seconds, while the y-axis tracks the actual 
resistance measured.  A read voltage of 50mV with a current compliance set to 1uA 
was used for the measurement.  I see the change in resistance of four bitcells over the 
measured time period.  The dashed lines in the plot denote possible range for 
intermediately placed cell that have a high change in resistance over the time period.  
Bitcells placed below the bottom dashed line would be well-SET cells (LRS), while 
those placed above the top dashed line would be well-RESET cells (HRS). 
112 
 
Figure 5-21 - Resistance change for different Program Current Compliance 
values 
For this cell, I define a ?well-SET? cell as below 100k?, and a well RESET cell 
of above 100M?.  The results show that a well-SET cell, formed by applying a high 
amount of current compliance (Res_1e-3), is able to retain its SET value of 48k? 
through the measured time.  Similarly, a well RESET cell (Res_1e-9), with a resistance 
value of 5.7e9?, remains RESET which a final measured value of 4.25e10?.  The two 
intermediately placed cells (Res_1e-4 and Res_1e-5), show the resistance values 
increase with a much higher delta change in resistance.  The cell placed with a 1e-5A 
current compliance (Res_1e-5), changed from 3.11e6? to 8.29e8?.  These 
observations confirm the relaxation behavior of an intermediate cell with a filament 
113 
relaxing to break its conductive bond, resulting in a higher resistance state.  Note that 
this relaxed higher resistance state is still lower an order of magnitude lower than the 
well RESET state of 4.25e10 with the Res_1e-9 current compliance.    The results 
indicate that well-set cell with a high current compliance of 1mA retained the state for 
the full 8-min duration, while intermediate program Ic levels of 1e-4 and 1e-5 shifted 
the cell state to two orders of magnitude higher resistance.  
To study the effect of current compliance on the change in resistance, I plot the 
observed data in a different way.  Figure 5-22 shows the change in resistance as 
function of the current compliance for three measurement points ? immediately after 
program (Rinit), 2 minutes and 8 minutes after program (R2min and R8min, 
respectively).  The x-axis tracks the program current compliance used (in amperes), 
and the y-axis tracks the resistance measured on the bitcell.  The program pulse was 
applied at a specific current compliance level, and then the resulting resistance level 
was measured at the three delay points.  Note that I started with an HRS (RESET) cell 
prior to the measurement.  I observe that at 1e-9 and 1e-6, the cell remains in the HRS 
state.  At program current compliance of 1e-5 and 1e-4, there is a marked shift in the 
bitcell resistance, starting at a lower resistance level and gradually moving to a higher 
resistance level.  At Ic of 1e-5, the bitcell resistance started at 3.11M? and after 2 
minutes, shifted to 60.7M?, and after 8 minutes, measured to be 829M?.  Similarly, 
114 
for Ic of 1e-4, the bitcell resistance started at 0.7M? and after 2 minutes, shifted to 
14.3M?, and after 8 minutes, measured to be 103M?.  Finally, the bitcell that was 
programmed with an Ic of 1mA, remained as a SET cell, below 50k??.  These results 
confirm that intermediate current compliance limits show the greatest change in 
resistance. 
 
Figure 5-22 - Resistance change as a function of Program Current Compliance. 
One interesting observation to note is that even for the specific points where the 
resistance does not shift, the Rinit datapoints all measure to be slightly lower resistance.  
One possible cause of this could be due to the measurement following a program pulse 
with a high voltage bias (6v) possibly having an effect on the state of the cell.  This 
115 
could possibly be due to thermal effect from the high voltage applied during the 
program pulse and is a topic for future exploration. 
In this section, I presented data that shows that the cell?s data-retention time 
could be modified by reducing the current compliance applied during the program pulse.   
This intermediate current compliance acts as a digital volatile mode for the bitcell.  A 
system making use of the cell in the digital volatile mode must calibrate the read 
threshold currents for this lower range, to properly interpret the intermediate state as 
well.   
 
5.8 Impact on Write Energy and Endurance  
In this section, I estimate the benefit in write-endurance that can be gained from 
the lower write-current applied.  From write-energy point of view, I see that the 
intermediate mixed-volatility state requires 1-2 orders of magnitude less write-current. 
Instead of 1mA, applying 10uA might suffice to program the cell in the intermediate 
state.  The write energy is the product of the current, voltage, and the duration of the 
stress applied to the cell.  From a write-stress point of view, the cell is in this 
intermediate mode is now seeing 100x lower write energy per write operation.  The 
following equation from [33] relates the relationship between the energy applied per 
cycle to the maximum amount of energy tolerated by the cell. 
116 
? = ? ? ?  
Here, Emax is the maximum energy that the ReRAM bitcell?s dielectric material 
can sustain, E1cycle is the energy seen by the cell in one cycle, and Ncmax is the maximum 
number of write cycles that can be performed.  Ncmax is the measure of write-endurance 
for the cell.  As the equation points out, there is an inverse relationship between the 
energy applied per cycle to the overall number of write cycles tolerated by the cell.   
Since I apply 100x lower write energy per cycle, I can expect that the Ncmax 
would increase by 100x.  My original stated write-endurance for ReRAM was 10^5 to 
10^8 cycles.  This can therefore be expected to be increased to 10^7 to 10^10 cycles.  
Although this is still not near the write endurance tolerated by the DRAM cell, this 
amount of improvement allows for the cell to approach the write-endurance limits 
needed for main-memory applications.  Furthermore, by combining wear-leveling 
techniques used in flash memory chips, the effective write-endurance could be further 
increased.    
In terms of the total write energy, there is an increased amount of write cycles 
needed to perform the refresh in the cases where the data needs to be maintained for 
long periods of time.  For the calculation given above, there is a 100x reduction in the 
write energy applied per pulse.  However, after 100 refresh cycles, where the data is 
written back to the cell during the refresh cycle, we lose the benefit of the write energy 
and write endurance to the cell.  In this case, for those data where the data needs to be 
117 
persistent, we may selectively apply a high write energy to begin with to store the data 
in a non-volatile state.  In addition to the write-endurance impact, there is an impact to 
the overall system performance as well.  Since write energy is a function of the current 
amplitude and the duration of the program pulse, the lower write energy could also 
translate to a faster write cycle.  For certain applications, having the lower write cycle 
might be critical for overall system performance where the shorter write latencies could 
be more easily hidden and prevent stalls due to write operations. 
 
5.9 Post-Characterization SEM  
I did a final SEM photo of the characterized wafer to assess the thickness of the 
material deposited.  I used Tescan GAIA FIB/SEM machine from UMD?s AIM lab to 
perform this measurement.  I first performed a FIB (Focused Ion Beam) cut on the 
wafer to ensure a sharp cross-section edge to make the measurement.  Figure 5-23 
shows the wafer material inside the SEM chamber.  The wafer is sliced and mounted 
onto a vice inside the chamber.  The figure shows the wafer positioned directly under 
the microscope. 
118 
  
Figure 5-23 - Sliced Sample inside GAIA SEM Chamber 
 Figure 5-24 shows the inverted cross-section photo of the wafer.  Table 5-2 
summarizes the measured thickness of the materials against the target thickness.   
 
Figure 5-24 ? SEM Thickness Measurement  
119 
Layer # Material Target Thickness Thickness Measured 
(nm) (nm) 
1 Platinum (Pt) 60 73 
2 Aluminum Oxide (Al2O3) 5 
59 
3 Titanium Oxide (TiO2) 30 
4 Titanium (Ti) 15 14 
5 Platinum (Pt) 60 47 
Table 5-2 ? SEM Analysis of Deposited Thickness 
The measurement showed that the material deposited is on the order of the target 
thickness of the different materials I targeted.  The aluminum and titanium oxide 
material could not be differentiated in the SEM photo mode, but I estimate the sum 
thickness to be larger than the target thickness of 35nm.  Since this device was not a 
virgin material, the process of applying the electrical stress likely caused the material 
to diffuse into adjacent layers.  This can be seen at the bottom of the wafer, being 
diffused into the silicon.  Overall, the thickness of the material measured appears to be 
slightly larger than my intended thickness overall.  In the future, using smaller mask 
dimensions and alternate deposition process might more accurately control the 
deposition of the materials. 
120 
5.10 Conclusion and Future Work 
A ReRAM device composed of a TiO2/AlO2 metal-oxide stack was fabricated, 
and characterization results were analyzed.  Resistive switching and threshold 
behaviors were observed.  Additionally, time-dependent relaxation of the cell resistance 
was observed, causing those cells placed in intermediate cell states, by using a lowered 
program current compliance, to see the greatest shift. This is in-line with my target use 
of using lower write-energy to place the cell in a mixed-volatile state having a lower 
data retention time.   
As mentioned in the beginning of the chapter, the motivation for this research 
work is to verify that we are able to observe that ReRAM could be operated ina n 
intermediate state where the formed filament of oxygen vacancies in the metal oxide is 
able to repair itself after a period of time.  The experiment results are a proof-of-concept 
of the possibility of ReRAM as a digital volatile memory.  With regards to scaling, I 
expect the observed behavior to be retained, since scaling occurs in the dimension of 
the ReRAM metal planes and typically not as much in the thickness between layers.  
Since the filament is localized to the points of the stress, the movement of the oxygen 
vacancies should follow at a similar rate even at advanced nodes.  Volume data on this 
phenomenon would provide more data points which would lead to better averaging of 
the program current compliance and the expected rate of the relaxation. 
Several points of observation merit a closer look.  I have mentioned these in the 
experimental discussion, with regards to anomalous points of data observed for very 
121 
high resistance bitcells and the change in resistance after a delay.  These can be further 
mapped to a function of the current compliance applied and the sequence of preceding 
program pulses.  Due to the large dimensions of my bitcell, it is possible that multiple 
filaments have formed in parallel, that may be the cause of the cell behavior at the very 
high resistance states (above 10 M?).  For this reason, future work can try to make the 
dimensions smaller, towards the target dimensions seen in the intended application.   
Oxygen partial pressure has been known to have a strong impact on the 
movement and retention of oxygen vacancies in metal oxides [55, 56].  The effect of 
oxygen partial pressure in introducing contaminants to the material layers during the 
fabrication process needs to be studied more closely.  The PVD fabrication for my 
experiment was performed in a low-pressure chamber, however there could be oxygen 
contaminants between the layers.  Specifically, between the mask steps, where the top 
electrode of Titanium might have oxidized to form TiO2.  The SEM analysis did not 
have sufficient resolution to identify the regions clearly.  Future work can analyze the 
material fabricated with higher resolution.  In the operational mode we expect to use, 
the temperature ranges between -40C to 100C, and therefore we do not expect a high 
variation from the oxygen partial pressure on the relaxation of the oxygen vacancies.  
In space applications, the lower oxygen partial pressure might slow the movement and 
relaxation of the oxygen vacancies, thus increasing the data retention of the filaments 
created in the intermediate state.   
122 
Lastly, the biggest change is to gather volume data for the characterization 
results so that the noise could be further isolated, and the cell retention relationship 
could be more robustly developed for design of the memory system application.  This 
requires fabricating a full array, with more than 1000 bitcells so that statistical analysis 
could be performed to more completely characterize the bitcell behavior. 
 
  
123 
 
 
 
6 Architecture-Level Simulations  
 
In the next thrust of my research work, I looked into architecture-level simulations 
that would provide the impact of various design configurations in my ReRAM 
architecture.  I compare this to a conventional DRAM based architecture and vary key 
parameters to analyze the impact of them.  I provide a brief introduction into the 
simulation methodology I used and then provide the results of my baseline architecture 
comparison.  I next study the impact of a central ReRAM based design with varying 
number of cores on the performance and the energy consumption of the architecture. 
6.1 SST Simulator 
SST is a simulation tool developed by Sandia National Laboratories, which 
provides a flexible framework as a ?Parallel Discrete-Event Simulator? and allows for 
a multitude of custom simulators.  The tool has demonstrated scaling to over 512 
processors, and comes with many built-in simulation models for processors, memory, 
and network, including DRAMSIM.  The tool follows a modular OpenMPI interface 
based on linking together various components (see Figure 6-1 from the SST website).   
124 
The figure shows the operation of the simulation framework driven by an SST core 
engine that keeps track of the instantiated elements, components, and the links in the 
simulation.  Each component represents a physical structure in the architecture, such 
as a CPU, the network router, the memory, or the cache, for example.  Each component 
is connected to another component through a link with a latency property, which is used 
to track the timing of the simulation.  This framework allows for modular use of 
different elements that are developed outside of Sandia.  For example, to model the 
DRAM memory, I used DRAMSIM3 as the backend memory model.  
 
Reference:  http://sst-simulator.org 
Figure 6-1  SST Component-based Framework 
The SST framework is component based, cycle-accurate simulator for fast 
comparison of different architectures.  I have used SST to model the ReRAM-CPU 
architecture using the following external components (element libraries): 
? MemHierarchy  - Cache and Memory 
? DRAMSim     - DDR DRAM Memory 
125 
? Miranda         - Pattern-based CPU model 
? Merlin            - Network router model and NIC 
? Messier          - Model ReRAM with asymmetric read & write latencies 
 
6.2 Baseline Architecture Comparison 
I performed an initial simulation on the STREAM and GUPS benchmark on the 
architecture shown in Figure 6-2.  The DRAM architecture was roughly based on the 
Intel Knights Landing platform and a comparative architecture using ReRAM instead 
of DRAM was used.  The left side of the figure shows the baseline DRAM architecture.  
A mesh topology with 6 rows and 8 columns is used to support 36 CPU processor units, 
along with dedicated L1 and L2 cache blocks.  Additionally, there are six memory 
controllers that connect to 4GB DDR3 main memory blocks, to provide a total capacity 
of 24GB.   
On the right side of the figure, the ReRAM architecture that I used is presented.  
This version shows a tiled architecture, again with 36 CPU processor units with 
dedicated L1 and L2 cache blocks.  The main memory in this architecture, however, 
consists of 36 ReRAM blocks each of 0.9GB located adjacent to the CPU tiles, along 
with the memory controller.  This is in-line with the tiled ReRAM-CPU layout that I 
presented earlier.  This architecture also uses a mesh topology of 9 rows by 9 columns.  
126 
 
Figure 6-2 Architecture Comparison 
The key architecture parameters are provided in Table 6-1 for comparison.  I 
used our in-house DRAMSIM2 simulator to model a dual-channel DDR3 Micron 
device with a speed grade of 1333-J.  For the ReRAM memory, I used the Messier 
element in SST to model asymmetrical read and write latencies of 200ns, and 1us, 
respectively.  The peak memory bandwidth for DRAM is 10.4 GB/s per channel, for 
an aggregate bandwidth of 124.8GB/s.  A very high NoC link bandwidth of 96GB/s per 
link was simulated to allow the NoC latency not to be an issue for the comparison.  
127 
  
Table 6-1 Summary of SST Architecture Details 
The following lists the pseudo code for both benchmarks.   
STREAM Benchmark: a[i] = b[i] + k * c[i]; 
MemoryOpRequest* read_b = new MemoryOpRequest(start_b + (i * reqLength), reqLength, READ); 
MemoryOpRequest* read_c = new MemoryOpRequest(start_c + (i * reqLength), reqLength, READ); 
MemoryOpRequest* write_a = new MemoryOpRequest(start_a + (i * reqLength), reqLength, WRITE); 
write_a->addDependency(read_b->getRequestID()); 
write_a->addDependency(read_c->getRequestID()); 
 
GUPS Benchmark: a[b[i]]; 
MemoryOpRequest* readAddr = new MemoryOpRequest(addr, reqLength, READ); 
MemoryOpRequest* writeAddr = new MemoryOpRequest(addr, reqLength, WRITE); 
writeAddr->addDependency(readAddr->getRequestID());  
 
The STREAM benchmark consists of two read operations followed by a 
dependent write operation.  The GUPS benchmark has a read and a dependent write 
128 
operation, with the address being randomly generated.  The STREAM benchmark has 
dense memory access, meaning that the address locations in memory are accessed in 
sequential order and therefore I expected that DRAM?s higher-access granularity would 
be more favorable for this benchmark.  The GUPS benchmark has sparse memory 
access, for which I expect ReRAM?s low-access granularity to be more favorable.  
Figure 6-3 shows the SST simulation result of the comparison between DRAM 
and ReRAM based main memory architecture.  The y-axis in the plot reports the 
execution time of the simulation, where a lower number is better (faster).  The 
simulation was performed with a Miss Status Hold Register (MSHR) queue depth of 2, 
meaning that at any time, two outstanding requests could be stalled at the individual 
memory controller.  The plot shows the results for both Stream and GUPS benchmarks 
for DRAM, and two version of ReRAM ? one with 200ns write latency and a second 
with 1us write latency, both versions have a read latency of 200ns.  The access latency 
for DRAM is set by DRAMSIM as a function of the pending requests and stalls.   
129 
 
Figure 6-3 SST Simulation Result 
The comparison show that DRAM outperforms ReRAM for Stream applications 
regardless of the write latency times.  My simulation result shows that when the write-
latency is reduced, for the STREAM benchmark, there is no noticeable improvement 
with the improved ReRAM write-time, and DRAM performs more favorably as 
expected.  Because of the ratio of read to write operations is 2 to 1, the effect of a ?faster? 
write latency does not improve the overall execution time in this scenario.  For GUPS 
benchmark, however, ReRAM slightly outperforms DRAM in the shorter write latency 
configuration.  However, DRAM still outperforms ReRAM when the write latency is 
1us.  This is due to shorter request length of GUPS and the irregular access pattern not 
allowing for the write requests to be re-ordered and thus mitigated. 
130 
The memory latency breakdown of both simulations is presented in Figure 6-4.  
The latency is reported from both the memory controller point of view, and from the 
CPU overall point of view.  The memory controller latency is largely dominated by the 
memory latency itself, with some additional overhead depending on the number of 
stalls seen by the requests.  As seen in the figure, there is a huge discrepancy between 
the two.  For DRAM, the average memory controller latency for both benchmarks were 
32ns.  However, the average CPU latency for STREAM was 194ns, while for GUPS 
was considerably higher at 951ns.  This goes back to my original motivation of 
addressing the memory bandwidth wall problem resulting in these huge discrepancies.  
The problem is worse for GUPS due to its inherent finer granularity which prevents 
access overhead from being amortized over a larger amount of data.   
For the STREAM benchmark, the latency reported from ReRAM?s memory 
controller point of view was close and slightly higher than the average overall latency 
from CPU point of view.  The reason for the CPU latency being lower can be explained 
by a higher percentage of cache hit with the STREAM benchmark that allows for 67% 
of the accesses to be serviced by on-board caches.   Since cache access is much lower 
than the ReRAM main-memory access, the overall latency is slightly lower.  For the 
GUPS benchmark, however, the DRAM trend is also present with ReRAM ? the overall 
CPU latency is much higher than the memory latency itself.   Here, this implies again 
131 
that there is a higher number of stalls that cause the performance of the system to be 
limited, not by the memory latency but by queuing of the requests. 
 
Figure 6-4 Memory Latency Breakdown, Queue Depth=2 
With regards to the impact of the write-latency, STREAM benchmark reported 
very little change between 200ns write time and 1us write time, as I saw in the previous 
simulation study.  With GUPS however, I see that average access latency, which is a 
combination of the read and write times, is reduced with the ?faster? 200ns write 
latency time.   The ratio for write-to-read is also higher with GUPS with 1-to-1 vs 1-
to-2 with STREAM, causing the higher write latency to negatively impact the GUPS 
more.   
132 
The difference between the memory latency and the overall latency is an 
indication of the amount of stall occurring in the architecture.  Because the access 
pattern is random, the cache hit rate with the GUPS benchmark is close to 0%, 
combined with the queue depth of 2, this causes more of the memory requests to be 
stalled at the CPU with the GUPS benchmark.  The results imply that finer access 
granularity on ReRAM benefits GUPS benchmark with 200ns write latency.  For the 
1us write latency, the results show no ReRAM advantage for the GUPS benchmark.  In 
the STREAM case, DRAM performs better over ReRAM regardless of the write-
latency. 
 
6.3 Impact of Memory Parallelism for ReRAM 
Memory queue depth and the number of memory controllers are some of the 
key parameters that affect overall system performance.  To assess the impact of the 
queue depth on the performance, I increased the MSHR queue depth from 2 to 10 at 
the Memory Controller.  The results are presented in Figure 6-5.  The y-axis again is 
the execution time of the simulation with lower execution time meaning faster 
performance of the architecture.  Two additional sets of information are presented from 
the previous graph ? the results with queue depth of 10 (5x queue depth from previous) 
for ReRAM.   
133 
 
Figure 6-5 Impact of Queue Depth 
In comparison with the queue depth of 2, I see a significant improvement in the 
execution time in the case of STREAM benchmarks for both the slow and fast write 
latency times.  There was little to no improvement of the performance as the queue 
depth was increased for the GUPS benchmark.  This implies that increasing the queue 
allowed for more memory requests to arrive at the memory controller, and potentially 
be combined due to any locality of the memory requests.   The increased queue depth 
helped efficient scheduling of multiple requests that may be related in the STREAM 
case.  The STREAM memory mapping was assigned to be interleaved with an 8B offset, 
134 
which allowed for the parallelism of the architecture to help efficient servicing of the 
memory requests.  There was minimal change with the GUPS benchmark due to limited 
temporal locality, with the cache miss rate being close to 100%. 
For my next study, I observed the impact of the number of memory controllers 
on the performance.  To do this, I increased the number of memory controllers to be 
twice as the original architecture, again using mesh topology.  Figure 6-6 shows the 
impact on the performance.  The two additional sets of data are for the ReRAM 
simulations with the number of memory controllers being 72, while the previous 
simulation used 36 memory controllers.  The impact of the increased can be seen most 
drastically in the GUPS simulation for both the slow and fast write ReRAM memories.   
135 
 
Figure 6-6 Impact of Queue Depth and Multiple Mem-Controllers 
For the STREAM benchmark, queue depth helped improve the performance by 
efficient scheduling of multiple requests that may be related.  Additionally, since the 
memory address mapping was interleaved across banks with an 8-Byte offset, the 
increased queue depth allowed for parallel processing of memory requests.  The cache 
miss rate for the STREAM benchmark was noted to be 37%.  For the GUPS benchmark, 
there was minimal change to the performance improvement due to the increased queue 
depth.  This can be attributed to little spatial locality, with a near 100% cache miss rate.   
136 
Increasing the number of memory controllers improved both STREAM and 
GUPS, with a larger improvement for GUPS benchmark.  This implies that the higher 
queue depth within a memory controller is beneficial for STREAM benchmarks to 
allow for more efficient grouping of memory requests to take advantage of spatial 
locality.  For sparse memory access benchmarks, such as in the case of the GUPS, 
independent parallel memory controllers are needed to allow for parallel servicing of 
memory requests.  
 
6.4 Motivation for Central ReRAM Design 
At the architectural level, SST simulations were used to help answer the 
question of what performance benefits can be gained at the expense of non-volatility or 
data-retention.  I utilized SST to model non-symmetric heterogeneous NoC 
architectures to support the monolithic ReRAM-CPU architecture.   Based on my 
previous simulation results, I believe that a hybrid memory system utilizing both 
DRAM and ReRAM would be beneficial to deliver the advantages relevant for each 
technology based on the benchmark and application need.  Additionally, this approach 
allows different processor type to be integrated into the same chip, including GPUs 
and/or accelerators.  Figure 6-7 shows the floorplan of such a system.   
From my area studies, I know that interspersing ReRAM peripheral logic within 
a core incurs a significant area penalty.  Furthermore, since each core does not have a 
137 
dedicated ReRAM tile, and because graph algorithms do have irregular sparse access 
patterns, the memory architecture needs to support requests from any processor on the 
chip.  The centrally located ReRAM block in Figure 6-7 is designed to act as a single 
embedded memory IP with a separate internal NoC based on the torus topology.  In 
addition to the NoC router circuits, the area underneath the ReRAM array could be used 
to store cache memory that can act as the last-level-cache for the memory.  Four DRAM 
memory controllers are placed in the corner to allow access to an external DRAM 
memory off-chip.  This hybrid memory system would allow for ReRAM to function as 
the Main-Memory and rely on DRAM as either a Last-Level-Cache (LLC) or as a 
selective cache for write-intensive applications only. 
 
Figure 6-7 Hybrid ReRAM-DRAM System Floorplan 
138 
The ReRAM memory controller would coordinate access to n number of banks, 
where n needs to be selected to tradeoff between fine-grain granularity and reducing 
area overhead.  The bank controller will also be capable of supporting Streaming Mode 
to perform a burst-mode from adjacent 8 banks to match DRAM granularity and 
improve streaming bandwidth.   Figure 6-8 shows the design for the ReRAM memory 
controller, which coordinates multiple banks.  With each bank, a bank controller will 
contain an incoming request queue, a data-buffer to store the read and write data (64-
bits), and the circuit to initiate the Read/Write kickoff signal to all 16 arrays.    
 
Figure 6-8 ReRAM Memory Controller Design 
I performed architectural simulation using SST to model the system floorplan 
shown in Figure 6-7, and to select the optimal ratio and grouping. My performance 
results, presented in the next section, indicate the impact of the write-timing on this 
139 
architecture.  Additionally, based on the size of the network needed to facilitate this 
approach, I also looked into alternative NoC topologies that might better meet the 
throughput required.  The next chapter goes over the NoC topology study results.  Once 
an optimal configuration is selected, I could generate the overall ReRAM embedded 
block design & external interface block.  This block can be used to generate a floorplan 
layout and provide area estimates to identify placement of individual components to 
achieve such a system. 
 
6.5 Area Floorplan Central ReRAM Design 
I performed a next level estimate for the bank and memory controller circuits 
that would reside beneath the ReRAM.  I used the 45nm layout, shown in Figure 6-9, 
to estimate I obtained previously for my repeating block.  The total layout area shown 
in the figure is 625um x 625um = 400,000 sq um for the 4MB 2-level stack.  The 
vscale_core circuit had a standard-cell efficiency in this space of 60% for a total 
consumed area of 240,000 sq um. 
140 
 
Figure 6-9 ? Memory Footprint for Central ReRAM Design 
I propose placing the following circuits underneath the ReRAM block: Bank Controller, 
Memory Controller, NoC router, and SRAM cache.  To estimate the areas for the bank 
and memory controller, I synthesized a representative Verilog file to model the 
functions and used it to estimate the APR area.  I extrapolate this by using the area 
reported from the VSCALE_CORE layout study where the standard cell area was 
22,088 sq um, and the APR area was 30,373 sq um.   
The bank controller logic has three main functions, as shown in Figure 6-10:  an 
incoming request queue, a circuit to initiate the read and write kickoff signals to all 16 
arrays, and a data buffer to store the read and write data.  I model a 32B register file for 
the incoming request to support eight 32-bit command requests.  A 64B register file is 
141 
used to model the data-buffer to store pending read and write data.  The synthesized 
area for this circuit was 4,162 sq um, which translates to 5,723 sq um after the APR 
step.  As a square block, this circuit could be expected to take up 76um x 76um of area 
underneath the ReRAM array. 
 
Figure 6-10 ? Bank Controller Area 
The memory controller logic coordinates 8 different banks and has the following 
functions: 
? Address decode to select one of 8 banks, with additional control logic to select 
multiple in stream-mode 
? Incoming request queue of 32B to support eight requests of 32-bit commands 
? Control logic to reorder pending requests 
? Data buffer of 256B to store read and write data. 
The synthesized netlist for this logic reported a total area of 16,147 sq um, which 
I extrapolate to be 22,204 sq um after the APR step.  Since this logic block will be 
shared among 8 banks, this circuit could be expected to use a square footprint of 18um 
x 18um for each bank.   
142 
The Bank, Memory, and NoC controller will be placed in a central area of the 
ReRAM mat, which has an area of 343x343um.  The spacing between the arrays within 
a bank is 125um.  Figure 6-11 shows the relative sizes and placement of the blocks, 
with the bank controller (B) being 75x75um, and the memory controller (M) being 
18x18um, which is shared among 8 banks.  The remaining area in the block can be used 
for the NoC router.  
 
Figure 6-11 ? Placement of Control Logic, Buffers, and SRAM 
SRAM arrays will surround the central area with size of 125umx125um.  These 
SRAM arrays can be used as the last-level cache on the chip and can operate 
independently from the main-memory ReRAM control logic.  In one bank, I can fit 3 
143 
of these SRAM arrays.  Assuming 75% array efficiency and using the academic 
OpenRAM bitcell which has a size of 1.344um x .707um, the total SRAM storage per 
array is 4.1kB.  A commercial version of the SRAM bitcell could foreseeably be drawn 
2.5x smaller, and thus achieve 10kB of SRAM capacity per array.   
Finally, I consider the routing channel to connect the main-memory and the 
independently operating SRAM cache memories to the NoC router endpoints.  The 
NoC interconnects can be drawn in metal-7 and metal-8 which are available to be used 
in the regions between the ReRAM arrays and over the SRAM arrays.  This is 
illustrated in Figure 6-12 below.  The horizontal tracks are metal-7 and the vertical 
tracks are in metal-8 and these would provide a global interconnect channel to the 
ReRAM and SRAM arrays. 
 
Figure 6-12 ? Interconnect Routing over Central ReRAM Floorplan 
144 
For the NoC routing channel, I propose a total target of 32B interconnect width 
and allocate 16B for Main-Mem and 16B for the SRAM Cache in order to keep the two 
memory systems separate and to allow for different priorities and address schemes to 
be implemented between them.  The available spacing for this in the floorplan above is 
125um wide.  This requires a metal pitch of .5um in the metal-7 and metal-8 for this 
routing which should be achievable in this technology. 
 
6.6 Write Performance Impact of ReRAM 
For the next phase of my simulation efforts, I focused on the Central ReRAM 
architecture surrounded by several CPU modules.  As mentioned earlier, such a central 
architecture has several possible advantages over a tiled-CPU network in cases where 
the memory access pattern is not localized to the tiles immediately above it.  In my 
prior benchmark simulation results, I found this to be true.  Additionally, separating the 
CPU modules allows for a contiguous area for the design implementation and avoids 
having to incur the area penalty I had observed. 
I first performed a high-level comparison using the SST simulation framework 
to compare the DRAM and ReRAM based architectures shown in Figure 6-13.  The 
DRAM figure shows a central tiled CPU architecture with 6 memory controller (MC) 
access points, 3 on each side, to connect to external DDR4 devices.  The ReRAM 
145 
architecture shows centrally located ReRAM arrays, grouped by banks, which are 
accessed by surrounding CPU processors (labeled as C).    
 
Figure 6-13 - DRAM ReRAM Architecture Comparison 
For my simulation, I assumed the following system specifications.  If one bank 
needs to provide 64-bits request width, and assuming that ReRAM is capable of a per-
array bandwidth of 4 bits, then for a single bank, I would need to access 16 arrays in 
ganged mode.  In order to provide a sustainable BW of 16B/ns, assuming 200ns 
ReRAM latency, I would need to group 400 banks per core.  Therefore, a single core 
needs to coordinate with 400 banks for reasonable bandwidth performance.  Tying in 
my previous area calculations, a single bank of 4MB (assuming a 2-layer ReRAM 
stack) at the 16nm is estimated to take up 0.4mm2 in area.  A full-chip die area of 
686mm2 can fit 8575 banks, assuming 50% array efficiency.  This can support 
8575/400 = 21 cores for a VLIW type of architecture with 8B granularity.  This 
146 
translates to a full-chip memory capacity of 8575 banks * 4 MB/bank = 32 GB.  With 
an 8-level stack, this capacity scales to 128GB.  As the vertical stack increases the 
capacity scales, but would require higher ReRAM peripheral area usage, leaving less 
amount of unused space underneath the memory. For my simulation of the ReRAM?s 
mesh topology, I used a ratio of 8 banks per memory controller to provide a total 
number of 1000 memory controllers on chip.  The system ratio of Core to Memory-
Controllers to Banks to Array is 21:1000:400:16.   
The DRAM architecture was based on the Intel Knights Landing platform [23] 
and a comparative architecture using ReRAM instead of DRAM was used.  The 
characteristics are listed in Table 6-2.  The CPU model used 8 issues per core per cycle 
and the mesh NoC topology is used.  I used the hardware-verified DRAMSIM3 
simulator to model a dual-rank DDR4-2666 DRAM device operating at 2.66GHz and 
also a High-Bandwidth-Memory-2 (HBM2) version of DRAM main memory.  For 
ReRAM, I assumed a centrally located memory IP with 1000 access memory 
controllers with the support circuits and bank-select logic located underneath the 
memory, while the CPUs surround the array. 
147 
 
Table 6-2 - Architectural Parameters 
In order to understand the impact of the longer write latency of ReRAM, I 
compared DRAM with two versions of ReRAM: SlowWrite and FastWrite. For the 
ReRAM SlowWrite version, I used a write-latency of 1us, while for the FastWrite 
version, I used 200ns. The read latency was set to 200ns for both versions of the 
ReRAM.  Figure 6-14 summarizes the result of the architectural simulation for the 
STREAM benchmark. The y-axis in the top plot shows the overall execution time in 
ms, with DRAM over 2x faster than the ReRAM-slowWrite option. The ReRAM 
FastWrite with 200ns latency is slightly faster, but still performs worse than the DRAM 
configuration. 
 
148 
The memory latency comparison, however, shows a much higher delay 
difference between the two architectures.  The DRAM memory latency on average is 
32ns, which is over 8x faster than ReRAM_SlowWrite.  However, the CPU perceived 
latency is only 2x faster, despite this large difference.  The MSHR occupancy 
comparison shows the reason for the discrepancy, with DRAM having a much higher 
occupancy resulting in a greater number of bottlenecks at the memory controller and 
stalls from the CPU point of view. 
  
149 
 
 
 
Figure 6-14 - SST STREAM Benchmark Comparison for 21 cores 
150 
Figure 6-15 shows the results for the GUPS benchmark, which has a finer access 
granularity of 8B.  Again, the results show that overall DRAM much longer RunTime, 
by 1.5x when compared to the ReRAM_SlowWrite case.  At the memory level, DRAM 
latency is faster, but from CPU point of view, overall perceived latency is slower due 
to a bottleneck at memory controller, which is shown in the MSHR_occupancy 
comparison.   
For the STREAM benchmark, DRAM is faster overall by 2x, and by 1.5x for 
the GUPS benchmark.  Though there are more pending requests due to the limited 
number of memory controllers with the DRAM architecture, the higher latency with 
ReRAM results in an overall longer latency time.  This observed trend was consistent 
for both STREAM and GUPS benchmarks at the 21-core level.   
Next, I increased the number of cores to 68 cores, which is the number used in 
the Intel KNL chip.  I simulated the comparison with both 21 cores and 68 cores, and 
the results are shown in Figure 9-4. The results indicate that when the number of cores 
is low (21), DRAM-based architecture outperforms ReRAM, even for GUPS type of 
algorithms. Although there was a small performance improvement with the 
ReRAM_FastWrite version, this still was not enough to overcome DRAM architecture 
performance. 
151 
 
  
 
Figure 6-15 - SST GUPS Benchmark Comparison for 21 cores 
152 
 
However, when the number of cores was increased from 21 to 68 cores, I see 
that in both STREAM and GUPS based benchmarks, ReRAM is able to outperform 
DRAM-based architecture. This is due to the higher amount of memory access requests 
needed with the higher core count. This requirement is more easily met by a more 
parallel memory system such as the one architected with the ReRAM based main 
memory. I see this reflected in the bottom plot in the figure of the memory latency 
breakdown for the stream benchmark.  Comparing the impact of increasing core count 
on the CPU perceived latency, I see a sharp increase for DRAM, while minimal impact 
to the ReRAM scenarios.  This increase in latency can be attributed to a higher amount 
of bottleneck resulting in more stalls.  Therefore, as the core count is increased, there 
needs to be enough parallel request to fully exploit the high amount of parallelism 
afforded by ReRAM and overcome the higher latency with ReRAM.   
  
153 
 
Figure 6-16 ? Impact of Increasing Core Count  
154 
Table 6-3 compares the memory bandwidth processed in each of the simulated 
conditions. At lower core count, DRAM based architecture provides STREAM 
bandwidth of 76GB/s is nearly 40% higher than the one provided through ReRAM 
based architecture. At higher core count, ReRAM provides a higher bandwidth of 
138GB/s, while the DRAM-based architecture?s STREAM bandwidth is 30% lower at 
95GB/s. 
Bandwidth Cores DRAM ReRAM_SlowWrite ReRAM_FastWrite 
(GB/s) 
GUPS 21 1.36 0.89 1.07 
 
68 1.53 1.94 2.51 
STREAM 21 76 37.45 47.07 
 
68 95.6 136.63 138.6 
Table 6-3 Bandwidth Comparison 
 
6.7 Impact of Core Count 
Based on the previous section, I see that the advantages of the Monolithic 
ReRAM architecture?s parallelism can only be exploited when there are sufficient 
number of accesses, realized at higher core counts.  To study the impact of the core 
count, I performed a set of simulations varying the core count and compared the 
performance on a ReRAM architecture with a write-latency of 1us.  In addition to the 
155 
DDR4 version for the DRAM architecture, I also used an HBM2 version for the DRAM 
model.  Figure 6-17 shows the SST simulation result of the comparison as a function 
of the core count, from 20=1 up to 29=512.  The x-axis is the core-count and the y-axis 
is the execution time.   
If I look at the STREAM benchmark result for DRAM-DDR4, as the core count 
increases, the execution time reduces at a constant slope, implying an improvement in 
performance gained through the higher processing power.  However, this trend 
saturates at around 8 cores, beyond which the simulation time improves at a slower rate.  
A similar trend exists for the DRAM-HBM implementation as well, with the execution 
time being lower, but having a similar inflection point which is on the order of the 
number of memory controllers for the DRAM implementation.  For the ReRAM 
implementation, the execution time much higher due to the higher memory latency but 
falls at a similar rate as the DRAM implementations.  The difference, however, is that 
ReRAM?s inflection point is much higher, above 250-core count. 
 
  
156 
 
 
Figure 6-17 - Performance Comparison between DRAM and ReRAM system 
using STREAM and GUPS benchmarks (note: Log-Scale X & Y axis) 
157 
Both DRAM-DDR4 and DRAM-HBM outperform ReRAM in the low core 
count for both benchmarks.  HBM is able to perform slightly better than DRAM due to 
its higher bandwidth and proximity with the CPU.  However, starting at 64 cores, 
ReRAM begins to outperform both DRAM devices with a low execution time.  Figure 
6-18 summarizes the bandwidth comparison between ReRAM, DRAM-DDR4, and 
DRAM-HBM2 architectures.   In the STREAM bandwidth plot, the star represents the 
reported 90+ GB/s number from Intel Knights Landing, which corroborates with my 
simulation results.  
  
158 
 
 
Figure 6-18 - Bandwidth Comparison between DRAM and ReRAM system using 
STREAM and GUPS benchmarks (note: Log-Scale X & Y axis) 
159 
At the inflection point of 64 cores, ReRAM outperforms DRAM-DDR4 by 30% 
for the STREAM benchmark case and meets the performance of HBM2-based 
architecture.  The results indicate that at very low core counts (less than 64), DRAM 
outperforms ReRAM due to its much lower inherent access latency for both 
benchmarks. However, as core count increases, DRAM cannot keep up with the data 
bandwidth needs, while ReRAM's parallelism compensates for its higher memory 
latency.  This can be further illustrated when I analyze the read latency contribution 
from the different system components. At the memory level, DRAM latency is 8x faster 
than ReRAM. Yet, at the CPU level, the overall perceived latency for DRAM is only 
2x faster at a core count of 16. This manifests in the DRAM architecture as bottleneck 
of the memory requests at the miss status holding register (MSHR), the hardware 
structure for tracking outstanding misses. For ReRAM, due to the high amount of 
parallelism, the memory requests are processed without having to hold them. At the 
higher core counts, there is a sufficient amount of access requests to take advantage of 
the memory parallelism offered by the ReRAM architecture. 
The results confirm that with sufficient processing power, the highly parallel 
ReRAM with long latencies performs better than high-speed DRAM with limited 
memory controllers. The cross-over point when ReRAM outperforms is 85GB/s for 
DRAM-DDR4 and 135GB/s for DRAM-HBM2 device with the STREAM benchmark. 
160 
One interesting note is with the slight worsening of performance with HBM DRAM at 
very high core count of 512. Because of HBM's higher bandwidth interface, the low-
access granularity of GUPS suffers with HBM due to stalls from prior access requests.  
 
6.8 Energy Comparison 
I also calculated the total energy dissipated for the DRAM-DDR4 system for 
comparison against ReRAM.  For the CPU and network power dissipation, I 
extrapolated from Intel's Knights Landing power specification.  For DRAM, my 
simulations for the 16GB dual rank DDR4 2400MHz DIMM model reported average 
energy per bit dissipation of 19.5pJ/bit.  For ReRAM, I used energy numbers of 
64pJ/bit for write (reported in Crossbar?s whitepaper) and 0.5pJ/bit for read operations 
assuming a read current of 5uA, a 2V voltage bias, and a cell-sensing time of 50ns.  
Figure 6-19 shows the energy-delay plot for DRAM and ReRAM architectures for both 
benchmarks.  The x-axis in the plot is the total delay to complete the simulation, while 
the y-axis is the energy consumed in mJ, as a product of the power (voltage * current) 
and duration of the power consumption.  These points were obtained from the different 
core counts I used in the previous section.  The energy-delay cross product is a metric 
used to assess the impact that a reduction in delay would provide in terms of energy. 
161 
For the STREAM energy-delay plot, for both DRAM and ReRAM, as the core 
count is increased, resulting in lower delay, I see little impact on the energy consumed.  
This is because, at these points, increasing the core count reduces the overall duration 
of power consumption, which is taken up by the higher number of core power.  
However, I see an inflection point, after which there is very little reduction in delay by 
increasing core count, but there is a much higher energy penalty.  This is the knee of 
the curve observed, where throwing more processing power does little to provide 
improvement in performance.  This inflection point is at higher core count, similar to 
the performance plot I saw in the previous section.  A similar trend is seen for GUPS 
benchmark, with its inflection point being much higher for ReRAM, due to the 
advantage that ReRAM offers in terms of finer access granularity. 
  
162 
 
 
Figure 6-19- Energy-Delay Plot of DRAM-DDR4 and ReRAM system using 
STREAM and GUPS benchmarks (note: Log-Scale X axis) 
163 
The optimal operating point on the energy-delay tradeoff is circled on the figure, 
indicating that ReRAM performs at or better than DRAM at both benchmarks. I 
observe that, overall, ReRAM delivers lowest delay for both benchmarks, as seen in 
the performance comparison.  This is achieved at the higher-core counts, where 
ReRAM as a main memory is able to provide an energy efficient, especially for GUPS 
where high access granularity of DRAM incurs additional penalty.  This energy 
efficiency comes primarily as a result of faster execution time, which reduces the 
duration of CPU and NoC power dissipation. 
 
  
164 
 
 
 
7 NoC Topology Impact 
 
7.1 Motivation 
 ReRAM based main memory architectures offer advantages in terms of 
scalability, density, and fine-access granularity. These architectures are capable of 
delivering high connectivity and low access granularity. To truly exploit the parallelism 
offered by ReRAM architectures, a robust Network-On-Chip (NoC) topology and 
optimum scaling of core count is critical to ensure low packet latency while being able 
to offer the high throughput in communication. 
 In this chapter, I compare different NoC topologies for a ReRAM based main-
memory system and study the effect on speedup as the number of cores scales on-chip. 
Based on architectural simulation results from SST on streaming and GUPS 
benchmarks, I observe that fat-tree and torus topologies provide performance gains of 
78% and 39%, respectively. I also observed that optimal core and memory controller 
configuration have a bigger impact at moderate to high number of cores than the 
topology. Performance comparison of a ReRAM-based main-memory architecture 
165 
against a conventional DRAM-based architecture indicate a gain of 30% with 64 cores. 
Power, cost, and performance tradeoff analysis are also presented. 
 Figure 7-1 (a) shows a conventional CPU chip connected to on-chip DRAM 
High Bandwidth Memory (HBM) devices through a silicon interposer. Figure 7-1 (b), 
illustrates an integrated CPU processor with ReRAM main-memory on the same chip, 
enabling a high-number of connections between the two systems. In contrast, the 
conventional on-chip DRAM solutions are limited in the number of connections, 
through memory controller access points, to the on-chip HBM or DDR4 DRAM 
devices. 
 
Figure 7-1 - Comparison of (a) Conventional off-chip main-memory system with 
(b) Integrated CPU die with ReRAM layers on-chip 
 As mentioned before, to support reasonable sustained bandwidth requirements 
in a system, a high number of these ReRAM arrays need to be accessed in parallel. 
ReRAM being on-chip allows these connections to occur directly through metal-vias, 
rather than through an on-chip I/O port, an external interposer for HBM, or large TSVs 
in the case of 3D-ICs.   
166 
 The resulting system requires a robust Network-on-Chip (NoC) between the 
many core and 1000s of memory controller points on the chip. Conventional NoC is 
based on ring and mesh like topologies and are typically built for 100s of access points. 
With our highly parallel memory-CPU memory architecture, these topologies may not 
be able to support the higher network throughput needed, especially for future process 
nodes. In this paper, I compare performance and power metrics between a DRAM and 
ReRAM system, look at the effect of different network-on-chip topologies on the 
system performance, and investigate optimal memory controller configuration for a 
hybrid ReRAM-DRAM memory system. 
 The rest of the chapter is organized as follows. Section 2 provides a short 
background of the NoC topologies I investigated, and the simulation methodology I 
followed. Section 3 presents the results and discussion of the effect of the NoC topology, 
and the optimal DRAM Memory controller configuration.  Section 4 presents the 
conclusion. 
 
7.2 Background 
Based on previous work [11], a homogenous 2D-Mesh topology for the 
Network-On-Chip (NoC) is unlikely to keep up with the relatively high communication 
need of the ReRAM-architecture I envision.  In order to support low latency across the 
167 
chip, I performed a survey of NoC topologies that can support high network capacity 
for a large number of nodes on-chip.  In this section, I perform a brief summary of NoC 
topologies of interest, and the performance metrics used to measure them.  The 
topologies that have been physically fabricated by other research projects or in the 
industry are shown in Figure 7-2, implemented as NoC or as datacenter network 
topologies.    
 
Figure 7-2 Comparison of various NoC topologies 
Some of the metrics that are used to compare the network performance are node-
degree, diameter, and bisection width.  The node-degree of a topology denotes the 
number of ports connected to each node and reflects the input-output complexity of the 
network.  A high node-degree reduces the average path-delay but increases the 
complexity of the implementation.   The diameter is the worst-case path delay in the 
168 
network and reflects the maximum shortest path between any two nodes.  The bisection 
width is the minimum number of links that needs to be bisect, or cut, in order to divide 
the topology into two separate networks.  This parameter is used to indicate the 
parallelism of the network.  Ring and bus networks have a fixed bisection width. 
1. Bus: This topology consists of a common routing channel to which multiple 
devices connect to communicate with each other.  It allows for a simple 
implementation, and is the paradigm used in older system-on-chip type 
implementations.  However, the single common bus prevents simultaneous 
communications between devices and requires bus arbitration policies to 
allocate the resource between the devices.  Therefore, this type of topology is 
not scalable as the number of devices increase.    
2. Crossbar:  The crossbar topology allows for multiple parallel connection 
between different input and output permutations.  The result is a topology that 
is low latency with higher throughput than the bus topology.  The IBM Power5 
architecture uses a crossbar topology.  However, as the number of nodes 
increase the matrix expands to an additional row and a column, resulting in a 
high overhead.  Therefore, this topology cannot support a scalable architecture.    
3. Ring: The ring topology consists of a closed bus with the communication 
direction restricted to one direction.  Each node has two neighbors (degree=2) 
169 
and the ?first? and ?last? nodes are connected to each other.  The information 
packet travels along the ring from the source until the destination is reached.  
The communication scheme is simpler and requires lower area to implement.  
The bisection width is 2, and the diameter is n/2, where n is the number of nodes.  
Architectures that have used the ring topology include the IBM Cell and earlier 
Intel architectures, such as Knights Ferry. 
4. Mesh: The mesh topology consists of m rows by n columns to support m*n 
nodes.  At each intersection, a router directs the direction of the packet to take 
the shortest path to the destination.  Higher path diversity makes multiple 
simultaneous packet transmission possible.  The architecture is easy to layout.  
The bisection width is min(m, n), the diameter is (M+N-2), and the node degree 
is 5 for the central nodes, 4 for edge nodes and 3 for corner nodes.  The Tilera 
100-core CMP and the Intel Knights Landing uses this topology in their 
architecture.  
5. Torus: This topology is similar to Mesh with the end points in the network being 
connected to each other.  This has the added advantage of allowing for better 
fairness due to limiting maximum number of hops, while slightly increased 
complexity and area cost.  This leads to better path diversity than mesh and 
improves the diameter of the network.  The bisection width is 2*min(m, n), the 
170 
diameter is (m/2)+(n/2), and the node degree is 5 for all nodes.  Currently, 3-
dimension torus networks are used in some supercomputer networks, such as 
the CRAY XT3 and IBM BlueGene.   
6. Hoffman-Singleton: The Hoffman-Singleton is a high-radix symmetric graph.  
It limits the number of connections between any two nodes to two hops but is 
more complex to implement on a 2D die.  This topology is currently used in 
large scale datacenters, such as the high-radix CRAY XE6. 
Table 7-1 summarizes the relative advantages and disadvantages of the different NoC 
topologies. 
TOPOLOGY ADVANTAGE DISADVANTAGE 
BUS Simple implementation Simultaneous communication 
between devices not possible 
CROSSBAR Multiple parallel connection Not scalable as number of 
possible nodes increase 
RING Simpler communication scheme; Slower implementation as data 
lower area to implement packet travels through all nodes 
MESH Higher path diversity; Ease of Higher cost to implement than 
layout previous models 
TORUS Reduces worst-case path from Slightly more complexity in 
Mesh implementation; wiring than Mesh 
HOFFMAN- Two hops between any two Complex communication 
SINGLETON nodes scheme; higher cost for 
implementation  
Table 7-1 Comparison of NoC Topologies 
A NoC topology?s performance is highly dependent and specific to the 
application and hardware architecture.  I looked into modeling tools that allow us to 
171 
assess performance impact of various topologies such as mesh, torus, and non-
symmetric heterogenous NoC architectures which might be needed to support the 
Monolithic ReRAM CPU architecture.  To support this, I have used the SST (Structural 
Simulation Toolkit) to model the different heterogeneous NoC architectures. 
 
7.2.1 ReRAM-based Main-Memory Architecture 
For ReRAM memories, the read and write latencies are considerably higher than 
DRAM memory latencies.  For ReRAM memories, a trade-off exists between the 
number of bits accessed from a single word-line and the access latency. In order to limit 
the latency, per-array bandwidth are typically low as shown in Figure 7-3.  The figure 
shows a ReRAM array with wordline (WL) decoders selecting a single row in the array.  
A column multiplexer (MUX) circuit at the bottom of the array is used to select a few 
of the bitcells in the selected wordline to read from.  The bitline current from the 
selected columns is then used by a sense amplifier to differentiate between a logic high 
and logic low read value.  This operation is done for a few cells in an array.  In order 
to provide sufficient bandwidth, several of these mini array banks would need to be 
accessed in a parallel ganged mode.  As shown in the figure, multiple arrays are 
accessed in parallel together to provide n times the per array bandwidth. 
172 
 
Figure 7-3 - ReRAM Array Access 
 At the system level, the sustained bandwidth is calculated as a function of the 
per-array bandwidth, the arrays per bank, the number of banks, and the access latency.  
The following equation can be used to calculate the number of banks needed on a single 
chip in order to meet a desired sustained bandwidth target.  
(BitsPerArray ? ArrayPerBank ? Banks)
Bandwidth, BW =   
(AccessLatency)
 For example, in order to provide a sustained bandwidth (BW) of 16B/ns, 
assuming an access latency of 200ns, four bits per array, and 16 arrays per bank, 400 
banks would be needed. 
4 ? 16 ? 400 B
BW =  = 16  
200ns ns
 My sustained BW calculation indicates that each core needs to access 400 banks 
in parallel to deliver 16B/ns, to provide sufficient bandwidth for real-world applications. 
Note that I chose a per-bank access granularity of 8B, which is desirable for fine-access 
granularity applications, and therefore have 16 arrays per bank. ReRAM architectures 
173 
require high amount of parallel memory accesses. The high access time is amortized 
by accessing multiple arrays at once. Multiple processors would need to access these 
arrays in order to fully exploit the fine access granularity provided by ReRAM. This 
necessitates the need for a highly efficient network-on-chip connection topology to 
service the requests between the multi-processor system and the memory system. 
 A tiled floorplan, consisting of memory-processor tiles, incurs a relatively high 
area overhead of 20% caused by the embedding the ReRAM peripheral circuits with 
the CPU logic [25]. The inefficiency is caused by the CPU logic circuits not having 
contiguous space for the digital implementation. Additionally, the memory access 
patterns based on our simulation results also pointed to the fact that they are not 
dedicated to the memory immediately above it. Therefore, the overall architecture 
needs to support memory access patterns to any of the memory within the chip.  Rather 
than a tiled floorplan, I selected a centrally located ReRAM type of floorplan, as shown 
in Figure 7-4.  
174 
 
Figure 7-4 - Hybrid ReRAM-DRAM System  
This type of central-memory floorplan will allow for the ReRAM to be treated 
as an embedded unit with their own internal NoC network. Additionally, the CPU cores 
can be independent of the memory, and be implemented with a contiguous area 
floorplan. DRAM memory controllers could also be provided for cache, to be 
selectively used for write-intensive or sequential access type of algorithms. 
 
7.2.2 NoC Topologies of Interest 
 As mentioned before, a homogeneous 2D-Mesh topology for the Network-On-
Chip (NoC) is unlikely to keep up with the relatively high communication need of the 
ReRAM-architecture I envision. In order to support low latency across the chip, I 
selected a set of NoC topologies that can support high network capacity for a large 
number of nodes on-chip. 
175 
 Based on previous area studies, I calculated that it is feasible to have more than 
1000 individual memory banks, each needing to be accessed independently. This would 
entail over 1000 network endpoints, which maybe fairly large for a simple mesh or 
torus type of NoC topology. Therefore, I selected topologies, even ones not typical for 
NoCs, following trends in supercomputers, where 1000s of network endpoints is 
commonplace. I consider the following four topologies of interest to evaluate my 
performance comparison: mesh, torus, fattree, and dragonfly. Figure 7-5 illustrates the 
high-level connections of the different components for each topology.  In the figure, 
the rectangles denote the component to be connected, the small circle denotes the router 
endpoints, and the lines denote the links and connections between the endpoints.  These 
four topologies can be modeled in the architectural simulator that I had chosen, SST.  
The next subsection summarizes the simulation details.  
176 
 
Figure 7-5 - Overview Diagram of NoC Topologies Simulated 
1. 2D Mesh: Mesh is the simplest and most widely used NoC topologies 
due to its ease of physical layout. For my study, I limited the comparison to a 
two-dimensional mesh which consists of an array layout. Every network switch 
is connected to four neighboring switches and one component, which could 
either be a processor or a memory controller.  
2. 2D Torus: This topology is similar to mesh with the added connection 
between the endpoints. This has the advantage of allowing for better fairness 
177 
due to limiting maximum number of hops, while slightly increased complexity 
and area cost.  
3. Fat-Tree: Also known as flattened butterfly, the topology follows a 
hierarchical layout. The higher-level root nodes have more connections than the 
leaf nodes [38]. In my study, I use a 3-tier fat-tree Noc topology for my 
experiments.  
4. Dragonfly: The dragonfly network is a high-order radix topology that is 
also hierarchical in nature. Switches are clustered together in groups with high 
inter-group connections. Intra-group connections between other groups are 
formed to provide high connectivity. The number of connections between any 
two nodes is limited three hops (Local-Global-Local) but requires more 
complex physical implementation [39]. 
7.2.3 Simulation Methodology 
 My simulation environment consists of the SST (Structural Simulation Toolkit) 
to model and evaluate the different memory architectures and NoC topologies. As 
before, I made use of the following external element libraries to model the different 
system components. 
? Miranda - Pattern-based CPU model to model the individual processor 
on STREAM and GUPS benchmarks  
178 
? Merlin - Network-On-Chip router model to model the different 
topologies and specify the connection links between the different components. 
?  MemHierarchy - L1 and L2 cache model  
? DRAMSim3 - DRAM Memory model for DDR4 and HBM2 devices 
?  Messier - ReRAM Memory model with asymmetrical read and write 
latencies 
 
7.3 Experiment Results & Analysis 
In this section, I present the NoC topology comparison simulation results 
comparing the four topologies:  Mesh, Torus, Fattree, and Dragonfly.  Additionally, I 
also studied the impact of higher number of memory controllers to support a hybrid 
ReRAM-DRAM architecture.  Figure 7-6 shows the central ReRAM torus topology, 
with the access points to the CPU lying on the boundary in darker green color.  
 
Figure 7-6 - Torus Configuration for Central ReRAM Architecture  
179 
7.3.1 NoC Topology Evaluation 
 The SST architectural simulation result presented in the previous chapter used 
a Mesh topology for the comparison between DRAM and ReRAM. That work showed 
the configurations that are optimal to take advantage of ReRAM, which are sparse 
access patterns and higher core counts. Next, I studied the impact of various NoC 
topologies presented in Section 2 on the highly parallel ReRAM architecture described 
in the previous chapter. I simulated the different topologies using a link bandwidth of 
1, 2, 4, 8, and 16GB/sec for all topologies. The network parameters for the various 
topologies are provided in Table 7-2. The input and output buffer sizes were set to 2KB, 
with 2 virtual channels and a flit size of 16B. The mesh and torus are two-dimensional 
arrays of 34 rows and 35 columns to meet the maximum number of nodes for the 
simulation range.  The fat-tree network is a 3-level tree, with 2048 hosts. 
 
Table 7-2 ? Network Sizing Parameters 
180 
 At the lowest and middle level, there are 256 routers per level, with 8 upper and 
8 lower links each. At the top-level of the fat-tree, there are 64 routers with 32 links.  
The fattree topology was simulated using a deterministic routing algorithm that only 
relies on the source and destination address, rather than the current state of the network. 
The dragonfly network has a high radix of upto 30 links per router at the lowest level. 
Minimal routing algorithm was used on the dragon fly topology, which selects the route 
based on the shortest path to the destination. The table also summarizes the total links 
present in the network for the simulation ranges of my experiment. 
 Figure 7-7 shows the summary of the NoC comparison simulation results in 
terms of the raw execution time for a system with 16, 32, and 64 cores for STREAM 
and GUPS benchmarks.  The link bandwidth was kept at 8 GB/s for all simulations.  
The x-axis is the number of cores and the y-axis is the total execution time.  For the 
STREAM benchmark, I see that at the lowest core count simulated of 16, mesh has a 
much higher execution time than all of the other three topologies, with the fat-tree 
providing the best performance.  This trend is much more prevalent for the GUPS 
benchmark, where the MESH is close to 2x slower performance than the fattree 
topology.  Since GUPS requires higher number of separate connections to provide 
memory access to the CPU, it would have a higher load on the network.   
181 
 As the core count increases, I see the specific topology having less impact on 
the performance.  This could be due to the contention points being more spread apart, 
due to more originators of the memory requests.  While the memory access points are 
very high (1000), the number of cores is only 16, this causes the network paths near the 
CPUs to be more congested.  As I increase the number of cores, the contention is 
alleviated by spreading apart the location of the memory request origins.  
  
182 
 
 
 
Figure 7-7 - NoC Topology Performance: Impact of Cores 
183 
 The results show the impact of cores by keeping the link bandwidth constant at 
8GB/sec for all topologies. For both benchmarks, I observe a consistent trend in the 
performance as follows. Mesh topology delivered the highest execution time, followed 
by torus, then dragonfly, and finally fattree topology. As the number of cores increase, 
the performance improves, although saturates and the network topology has less impact 
on the overall performance. This is especially notable for the the GUPS benchmark at 
the 16-core configuration. Here, mesh had much higher execution time due to network 
contention owing its sparse finer access granularity. 
 I next computed the speedup of scaling the performance using the lowest core 
count, 16, as the baseline.  The speedup is used to assess how much faster parallelizing 
the system improves the system performance, and is calculated by 
ExecutionTime
 
ExecutionTime
 
Table 7-3 shows the effectiveness of scaling up for the different cores from a baseline 
of 16 cores. 
184 
 
Table 7-3 - Speedup for different NoC Topologies (baseline: 16 cores) 
 Figure 7-8 shows the impact of the link bandwidth on the performance, while 
keeping the number of cores constant at 32.    
185 
 
 
Figure 7-8 - NoC Topology Performance: Impact of Link Bandwidth 
186 
 While mesh generally performs worst, at very low link bandwidths I see that 
dragonfly has a degraded performance, worse than both mesh and torus topologies. I 
looked at two network statistics to explain this anomaly at the 1GB/s link bandwidth 
scenario. The send-packet-count metric for torus had a higher (5x) number of packets 
sent overall. This indicates a higher number of hops needed for torus than the higher-
order dragonfly, as expected. The output-port-stalls metric, however, showed that the 
dragonfly topology had a 6-orders of magnitude higher stall count than with dragonfly. 
This is likely due to Dragonfly using a greedy locally optimized routing algorithm that 
can cause local link saturation resulting in a bottleneck.  
 Similar results of poor performance of dragonfly over torus and fat-tree at low 
message sizes was reported in other works [42, 43]. Torus performed well overall for 
the benchmarks I simulated, at both reasonable link bandwidths of 4GB/s and above.  
Although, fat-tree topology had the lowest execution time, the marginal performance 
improvement seen over torus may not justify the added complexity at these ranges. 
Using Mesh topology as a baseline, at a link bandwidth of 2GB/s, fattree performed 
78% better, while dragonfly and torus performed 35% and 39% better, respectively, for 
the GUPS benchmark. 
187 
 The tradeoff between the complexity of the topology and the performance 
attained in terms of execution time is graphed in Figure 7-9 for the STREAM 
benchmark.  
 
Figure 7-9 - NoC Topology Tradeoff: Execution Time vs Aggregate Bandwidth 
for STREAM benchmark (Note: Log Scale X & Y axis) 
 The total concurrent aggregate bandwidth is computed by multiplying the per-
link bandwidth by the number of links summarized in Table 7-2. This is a rough proxy 
for the cost of the topology, in terms of both area, design complexity, and power 
dissipated by the NoC.  Here, Torus topology offers the best trade-off in terms of 
complexity and delay, while Fattree topology can deliver the lowest execution time, at 
188 
the cost of higher number of links. Mesh topology also delivered reasonable execution 
times with low cost.  Dragonfly has a poor tradeoff at the lower bandwidth ranges based 
on the specific configuration that I had chosen. Future work will focus on selecting the 
optimal configuration of hosts/router and routers per group to reduce link saturation 
and produce a more balanced dragonfly network. Varying the workload and introducing 
additional benchmarks suites would yield a more rigorous comparison of the different 
architectures. 
7.3.2 DRAM Memory Controller Optimization 
 In my final set of experiments, I performed sensitivity analysis on the number 
of memory controllers (MC) in a conventional DRAM system. The motivation for this 
final study was to optimize on the number of memory controllers in hybrid DRAM-
ReRAM System referred in Figure 7-4. For this study, I used a simple mesh topology 
and varied the number of memory controllers to 2, 4, 6, and 8 for the full range of cores 
(upto 512 cores).  
 Figure 7-10 shows the execution time comparison for STREAM and GUPS 
benchmark. Again, scaling the cores improves the performance to a point, after which 
the performance saturates. At lower core counts, a higher number of memory controller 
cannot be fully utilized. The biggest improvement can be seen from two memory 
controllers to four for a specific core count. 
189 
 
 
Figure 7-10 - DRAM Performance: Impact of Cores and Memory Controller 
190 
 Figure 7-11 shows the speedup comparison, using the slowest overall 
configuration of 1 core and 2 memory controllers as the baseline. The plot shows that 
for both benchmarks, increasing the memory controllers has a marginal benefit. A 
configuration of 64 cores and 4 memory controllers seems to be an optimal trade-off 
between performance and cost. For the hybrid DRAM-ReRAM main-memory solution, 
having four DRAM Memory controller access points offered the highest performance 
gain. 
  
191 
 
 
Figure 7-11 - DRAM Speedup: Impact of Cores and Memory Controller (note: 
Log-Scale X axis) 
192 
7.4 Conclusion 
 ReRAM as a main-memory delivers several advantages over conventional 
DRAM in terms of scaling, capacity, and performance for sparse-access patterns in 
support of parallel computations. Power-efficiency is also achieved due to the on-chip 
data access communication path. At higher core counts, ReRAM is able to surpass 
DRAM performance and results in lower energy cost. Torus Noc topology performed 
well in my simulation and might be preferred over fat-tree and dragonfly due to its 
simpler implementation and lower cost. 
  
193 
 
 
 
8 ReRAM as Trusted On-Chip Main Memory 
 
8.1 Motivation 
DRAM as a main-memory is one of the vulnerable points in a hardware system 
due to it being located off-chip. This opens the system up to snooping on the system 
bus, side-channel attacks on the memory data through mechanisms like row-hammering 
attack by malicious devices. Embedded DRAM variations, like eDRAM are limited in 
capacity and cannot accommodate space needed for real-word application workloads. 
Additionally, as DRAM faces scaling issues as a high-density memory, emerging 
memory technologies are being explored as alternatives. One promising alternative for 
this application is ReRAM, which is scalable, vertically stackable, and because of the 
possibility of integration with standard logic process, can deliver higher density as a 
main-memory solution. The key differentiator with this approach involves a ReRAM 
memory array that integrates directly with a logic processor underneath, eliminating 
the need to go off-chip. 
194 
ReRAM as an on-chip trusted main-memory which is impervious to side-
channel attacks, leaves the memory more protected and prevents snooping of the bus. 
Additionally, by controlling the write energy applied during a program, I can selectively 
reduce the data-retention time and prevent the cold-boot access, a concern with non-
volatile systems. Area studies and measurement results on a fabricated test structure 
demonstrating the cell relaxation is presented. Architectural performance comparison 
against a DRAM system shows a 30% improvement. 
Secure processor architecture requires addressing both processor and off-chip 
memory access vulnerabilities. In conventional system architectures, critical data in 
RAM is typically located off-chip in DRAM and could be comprised due to two major 
security vulnerabilities. The first is bus-snooping, 1 in Figure 8-1, on the connection 
between the processor chip and a Main-Memory system that is located off-chip. The 
second concern is DRAMs vulnerability to Row-Hammer Attacks, 2 in Figure 8-1, 
whereby accessing a bitcell repeatedly in succession, an adversary is able to introduce 
data disturbance on a bit in an adjacent column.  
195 
 
Figure 8-1 - Vulnerabilities in Main Memory 
In addition to these security vulnerabilities, DRAM as a high-density memory 
is reported as facing scaling issues and being vulnerable to failure at advanced 
technology nodes [16]. Being located off-chip, DRAM has to interface to the processor 
system through a limited set of memory controller access points. This is especially true 
for multi-processor systems, as shown in Figure 6-7, where the interface to the main-
memory system is through a limited set of memory controllers, often on the order of 4-
8 access points per chip. This limited set of connections leads to performance 
bottlenecks which result to high latencies at the system level, despite low memory 
latencies. On-chip DRAM options, such as embedded DRAM (eDRAM), are not viable 
options due to their larger bitcell size and limited capacity. The key advantage of 
ReRAM, from a system-vulnerability and performance point of view, is that they are 
196 
On-Chip, allowing for the processor to be directly connected to the memory. ReRAM 
functioning as an on-chip main-memory, enhances both the performance and security 
of the system. 
ReRAM, being a non-volatile memory, does have its challenges [11]. Being a 
non-volatile memory, it is especially vulnerable to cold-boot type of attacks, where data 
could be recovered from the hard-disk. At the device level, ReRAM?s read and write 
latencies are much longer than DRAM. The write endurance limits are also much lower 
than what DRAM is able to deliver, which poses an issue for typical applications to be 
supported.  My solution to this is that by controlling the write-energy applied which 
has an advantage for both performance and security of these ReRAM-based Main-
Memory architectures. 
 
8.2 Background 
I will go briefly into the bitcell operation mechanism.  A conductive filament is 
grown in the middle layer by applying electrical stress, which allows for the resistance 
of the device to be modified. Figure 8-2 shows a cross-section of a resistive memory 
and the resistance modifying behavior.  In ReRAM, when the conductive filament is 
created, the cell is in low-resistance-state (LRS). On the other hand, when the filament 
is broken by, for example, applying a high-voltage of the opposite polarity, the filament 
197 
is broken causing the bitcell to be open-circuit, in a high-resistance-state (HRS). The 
filament is created or broken in the middle dielectric layer(s) by applying a high enough 
current or voltage causing a dielectric breakdown. These materials are engineered so 
that the breakdown is not permanent and is reversible, upto a certain number of cycles. 
The write endurance specifies the number of these write cycles before the bitcell fails 
and can no longer transition. 
 
Figure 8-2 - ReRAM Resistance Creation 
 
8.2.1 Integrated Processor-ReRAM Architecture 
An integrated Processor-ReRAM architecture layout has the flexibility to be 
configured in many ways. The data access pattern between the processors and the 
memory systems for the application space would be a key determinant. Figure 8-3 
198 
shows two possible approaches. The tiled configuration consists of individual ReRAM 
arrays embedded into a larger processor. These processor-ReRAM tiles are ideal for 
highly local data accesses where each processor computes on workload in the main-
memory located over its tile. The central ReRAM configuration shows a high number 
of individual ReRAM arrays located centrally, surrounded by multiple processors. This 
configuration is desired for access patterns that are sparse and random, requiring any-
to-any connection between the processor and an individual array. In this study, my area 
analysis focused on the tiled approach. 
 
 Figure 8-3 - Integrated ReRAM-Configuration 
 
199 
8.2.2 ReRAM Security Implications 
From a system architecture point of view, using ReRAM as a main memory that 
is integrated directly onto a processor enables several security advantages. Figure 8-4 
illustrates an on-chip ReRAM based main memory solution, with the connections 
between the multiple processor subsystem and the ReRAM arrays handled by a 
Network-On-Chip interconnection. By being on-chip, ReRAM is impervious to side-
channel analysis. All of the communication channels between the processor and 
memory is through on-chip metal vias and thus not available for bus-snooping. 
Additionally, ReRAM is also not susceptible to the data-disturbance seen in DRAM 
through the Row-Hammer attack. Memory systems are susceptible to cold-boot type 
of physical attacks to recover data. In this form of attack, an adversary that has physical 
access to the hardware performs a memory dump of the RAM in order to obtain 
encryption keys or other sensitive data. Even without power, DRAM main memory 
remains stable for a short duration, before the charge on the bitcell is dissipated.  This 
is characterized as the time between refresh cycles, which is often set as 65ms as a 
conservative specification. In a cold-boot attack, this data recovery time is extended by 
the adversary lowering the temperature of the memory module. This slows the 
discharge on the DRAM bitcells capacitor, thereby retaining the data on the bitcell well 
past the refresh time needed.  
200 
 
Figure 8-4 - ReRAM-based Main-Memory Solution 
Since this attack exploits an intrinsic hardware vulnerability, it poses a serious 
threat even for trusted platforms [30]. Cold-boot attacks are even more problematic for 
nonvolatile memories. ReRAM as a non-volatile memory retains data without power, 
with a typical data-retention time of 10 years. This makes ReRAM-based main 
memories to be especially susceptible to these types of physical attacks to recover data 
from a system. 
 
8.3 Proposed Approach 
My solution for ReRAM?s cold-boot attack problem comes from the insight that 
ReRAM for main-memory applications do not necessarily require non-volatility of data. 
Currently, DRAM as a main memory is volatile and extends its data remanence with 
periodically through a refresh operation from storage. If I are able to selectively control 
201 
the data-retention time of the ReRAM by applying a lower write-energy, I would be 
able to reduce the impact of cold-boot type of attacks by preventing the data from being 
available for long times.  Figure 8-5 illustrates ReRAM behavior operating in three 
different modes based on the electrical stress applied.  The electrical stress is indicated 
in the figure?s y-axis by controlling the current compliance limit applied during the 
program operation.  What this would mean is a system where the main memory retains 
the data for a short time, on the order of a few milliseconds. I can mitigate the data-loss 
by apply a periodic refresh, which is a manageable solution similar to how DRAM deals 
with data discharge on its bitcell. Furthermore, studies on the impact of temperature on 
data retention indicate high stability for ReRAM bitcells [32]. This implies that 
ReRAM?s data retention time may be unaffected by external lowering of the 
temperature, bolstering it against cold-boot type of attacks. 
202 
 
Figure 8-5 - ReRAM Three Modes of Operation 
ReRAM?s data-retention time is a function of the energy applied to the cell as a 
function of the voltage and current applied during a Program or a write operation. If a 
lower write energy is applied, that would result in either a lower program voltage and/or 
lower write latency, both of which have positive performance implications. 
Additionally, a RESET cell (LRS) in the digital volatile mode must also be placed in 
this lower state in order to make it non differentiable to an unpowered read attack. 
Additionally, lower write energy would also result in improved write endurance 
for the cell, which is one of the device challenges with ReRAM technologies [11]. 
Finally, lower data retention time also helps prevent cold-boot ReRAM data from being 
accessible by a malicious adversary. 
203 
8.4 Analysis and Discussion 
In the Chapter 5, I demonstrated that the digital volatile ReRAM behavior of the 
data is automatically lost after a short duration. I did this by fabricating discrete 
ReRAM devices in order to observe the programmed cell relaxing from a set to a reset 
value. I used Physical Vapor Deposition (PVD) to create my ReRAM stack using 
Platinum and Titanium for the top and bottom electrodes, and Aluminum Oxide and 
Titanium Oxide for the dielectric layer. The selection of this particular ReRAM stack 
was based on prior work that had demonstrated short term-time-dependent plasticity 
(STDP) to mimic neuron behavior [22, 35]. 
The results presented in the previous chapter, demonstrates the observed cell 
relaxation after a duration of 10 minutes.  From a system security point of view, this 
relaxation behavior can be exploited to ensure that certain critical information could be 
programmed in an intermediate region so that the data is lost after a set duration. The 
effect of temperature on the data-retention, specifically whether cold temperature will 
extend the data retention time is something to be explored.  
As mentioned in section 2 of this chapter, previous studies showed little impact 
of temperature on the data stability of ReRAM devices [32]. The primary mechanism 
of resistance creation in TiO2 based resistive memories is through the creation of 
oxygen vacancies through redox reactions, rather than dominantly from thermally 
based mechanisms like with Phase Change Memories (PCM). However, since redox 
204 
reactions could be accelerated by heat, there might be a slow-down in the relaxation 
time in the case of a lower-temperature. This particular effect was not studied as part 
of this work and would be a good extension of this research for the future. 
ReRAM as a main-memory delivers several advantages over conventional 
DRAM in terms of scaling, capacity, and performance for sparse-access patterns in 
support of parallel computations. Power-efficiency is also achieved due to the on-chip 
data access communication path. In addition to these performance benefits, on-chip 
ReRAM main memory can be a trusted hardware resource. There is no off-chip system 
bus snooping and no vulnerability to row-hammer hardware attacks.   
In this paper, I presented the opportunity for ReRAM to be leveraged as mixed 
volatility main memory based on the electrical stress applied. The low data-retention 
time avoids Cold-Boot physical attack on the system by clearing the data over short 
time. There would also be a tradeoff of lower write energy leading to improved write 
endurance which is an effect to be studied in future work.  Additionally, the 
experimental study could be repeated at lower and higher temperature in order to 
evaluate the effect of temperature on the data retention and the behavior in the cold-
boot scenario.   
205 
 
 
 
9 Conclusion 
 
A monolithic processor that integrates ReRAM memory and processor requires 
optimum configuration of the Core, NoC Topology, and memory controller at the 
architecture level to fully exploit the advantages. The core/CPU needs to be able to 
issue multiple non-blocking memory requests per cycle. This can be achieved through 
superscalar or multi-threading processors with SIMD flexible scatter-gather memory 
requests [11].  The Network-On-Chip needs to support high-throughput, which can be 
realized by incorporating higher-dimensional alternative NoC topologies. The ratio of 
Memory Controllers to cores need to be optimized to balance the area incurred against 
the need for parallelism.  A summary of the key contribution is presented in the table 
below.  
In this work, I have demonstrated a method for evaluating integrated ReRAM-
Processor type of architectures using standard EDA tools. I also presented an overview 
of Crossbar ReRAM technology that has been demonstrated in fabricated silicon chips 
that allow this novel on-chip main memory architecture. My layout results indicate that 
206 
I can integrate a cluster of ReRAM mat arrays with a processor logic underneath and 
incur an area penalty of 18% and an overall area efficiency of 50%.  Based on the 
memory access patterns, however, I noted that a central ReRAM approach would allow 
for independent development of the ReRAM and Processor logic.  The area under the 
ReRAM array could be used to support SRAM cache array and the NoC interconnect 
logic. 
Knowledge Area Contribution 
Physical Design Digital Implementation of Integrated ReRAM-CPU solution 
 Area analysis showed a 20% penalty with 50% area 
efficiency 
 Floorplan of controller circuits underneath Central ReRAM 
block 
Device Level Fabrication of Resistive Test Structure by milling aluminum 
mask and performing PVD of test structure 
 Demonstrated digital volatile cell behavior and confirmed 
bipolar program switching operation 
 100x lower write energy per write possible in digital volatile 
state with similar lower write endurance  
Architecture Level ReRAM performs favorably with higher parallel requests 
 Queue depth impacted dense memory access pattern 
 Reducing ReRAM Write latency from 200ns to 1us 
improved the bandwidth by 25% for GUPS benchmark   
 Fat-tree and torus NoC topologies provide performance 
gains of 78% and 39%, at bandwidth constrained scenario 
 Torus NoC topology performed well across varying core 
count and link bandwidths 
Table 9-1 - Summary of Key Contributions   
The device-level research work I performed demonstrated that ReRAM could be 
used in a mixed-volatile state where a SET programmed cell could be placed to lose its? 
207 
value over time.  This has several advantages such as lower write energy, which 
translates to lower write current and/or lower write latency, and improved write 
endurance due to the lower write energy applied.  Additionally, controlling the volatility 
of the data in this manner, opens the memory technology to be used in many ways to 
selectively retain data for security or data persistence. 
Finally, architectural simulations comparing ReRAM and DRAM based 
architectures showed that ReRAM-based main memory architectures outperform at 
higher core counts, where their high amount of memory parallelism can be sufficiently 
utilized.  My simulations showed the cross-over point where ReRAM outperforms 
DRAM-DDR4 to be at 64 cores for the STREAM benchmark.  My NoC topology 
comparison indicated both Fat-Tree and Torus topologies to have good performance for 
my configuration, with torus being an optimal choice due to its simplicity of 
implementation.   
ReRAM as a main-memory delivers several advantages over conventional DRAM 
in terms of scaling, capacity, and performance for sparse-access patterns in support of 
parallel computations. Power-efficiency is also achieved due to the on-chip data access 
communication path. In addition to these performance benefits, on-chip ReRAM main 
memory can be a trusted hardware resource. There is no off-chip system bus snooping 
and no vulnerability to row-hammer hardware attacks. In this paper, I presented the 
208 
opportunity for ReRAM to be leveraged as mixed volatility main memory based on the 
electrical stress applied. The low data-retention time avoids Cold-Boot physical attack 
on the system by clearing the data over short time. There would also be a tradeoff of 
lower write energy leading to improved write endurance which is an effect to be studied 
in future work.  
209 
 
 
 
10 Future Work 
 
In this chapter, I summarize future work of the different research aspects that 
were investigated in terms of physical-design, device-level, and architecture-level. 
Physical Design:  With the increased complexity of integrating two distinct full-
chip like designs, floor-planning placement of the blocks, their orientation, and location 
of the I/O ports will be critical in minimizing routing congestion.  The next future work 
can explore floor-planning and digital implementation of the central ReRAM approach 
with multiple surrounding processor cores, NoC router circuitry, and any additional 
hardware accelerators for optimal performance of graphical processing applications.   
This will allow to quantitively evaluate placement options for the multi-core central 
ReRAM fabric to maximize I/O bandwidth to individual tiles  and the intra-tile 
communication network needed. Thermal dissipation of the underlying logic circuits 
through the ReRAM BEOL layers is a possible concern that needs to be looked at. 
The current area study was limited to a simple RISC-V processor in an academic 
45nm technology.  The next course of study can include more complex and divergent 
processors to stress the connectivity to the memory bank network.  Additionally, 
extending to a more advanced process nodes with a process design kit (PDK) from a 
210 
foundry such as TSMC, GlobalFoundries, SEMS, would make the diversity of 
standard-cell logic more accurate in the area estimations. 
Device-Level: Several points of observation merit a closer look.  Volume data 
on the observed volatile state is critical to provide more data points which would lead 
to better averaging of the program current compliance and the expected rate of the 
relaxation.  As mentioned in Ch. 5, volume data for the characterization results would 
be useful to isolate noise and model the cell retention relationship more robustly.  This 
requires fabricating a full array, with more than 1000 bitcells so that statistical analysis 
could be performed to more completely characterize the bitcell behavior.    
Further analysis on the behavior for very high resistance bitcells and the increase 
in resistance after a delay needs to be analyzed.  These can be further mapped to a 
model of the current compliance applied and the sequence of preceding program pulses.  
As for the dimensions of the cell, future work can try to make the dimensions smaller, 
towards the target dimensions seen in the intended application in order to minimize the 
creation of parallel filaments.  Future work on the effect of oxygen partial pressure can 
be analyzed to identify any impurities created during the fabrication process.  In regard 
to the trusted memory application, the effect of temperature on the data retention is 
critical in assessing the cold-boot attack approach discussed in chapter 8.   
Architecture-Level:  The impact of core count showed a inflection point for 
both ReRAM and DRAM based architecture where the system performance saturates 
211 
at a point near the number of memory controllers.  While this point was above the 
number of memory controllers for DRAM, it was well below that for ReRAM.  Future 
work can explore the reason for this difference.  One possible reason could be the 
differing memory models used.  ReRAM used a simple Messier memory model from 
SST to model the latencies as a constant value.  While for the DRAM, DRAMSIM was 
used to more accurately model the effect of reordering and stalls from pending requests.    
On the NoC topology studied, further work is needed on optimizing the design 
configurations of the different topologies.  For the dragonfly configuration, this would 
be the ratio of the number of groups, hosts, and routers.  For mesh and torus topologies, 
a more optimal approach would be to scale the router array for each core count.  Finally, 
an architectural simulation of the hybrid ReRAM-DRAM approach would be valuable 
to investigate a solution where the best points of each technology is exploited.    
  
212 
 
 
 
11 References 
 
1. Y. Chen, C. Petti, ?ReRAM technology evolution for storage class memory 
application,? 2016 46th European Solid-State Device Research Conference 
(ESSDERC), Lausanne, 2016, pp. 432-435. 
2. Sung Hyun Jo, Kuk-Hwan Kim, and Ii Lu, ?High-Density Crossbar Arrays 
Based on a Si Memristive System,? Nano Letters, 2009, Vol. 9 No (2), pp. 870-
874 
3. G. C. Adam, B. D. Hoskins, M. Prezioso, F. M. Bayat, B. Chakrabarti and D. 
B. Strukov, "Highly-uniform multi-layer ReRAM crossbar circuits," 2016 46th 
European Solid-State Device Research Conference (ESSDERC), Lausanne, 
2016, pp. 436-439. 
4. Sung Hyun Jo, T. Kumar, S. Narayanan, W. D. Lu and H. Nazarian, "3D-
stackable crossbar resistive memory based on Field Assisted Superlinear 
213 
Threshold (FAST) selector," 2014 IEEE International Electron Devices Meeting, 
San Francisco, CA, 2014, pp. 6.7.1-6.7.4. 
5. I. Bhati, M. T. Chang, Z. Chishti, S. L. Lu and B. Jacob, "DRAM Refresh 
Mechanisms, Penalties, and Trade-Offs," in IEEE Transactions on Computers, 
2016, vol. 65, no. 1, pp. 108-121.  
6. M. R. Guthaus, J. E. Stine, S. Ataei, B. Chen, B. Wu, M. Sarwar, "OpenRAM: 
An Open-Source Memory Compiler," Proceedings of the 35th International 
Conference on Computer-Aided Design (ICCAD), 2016. 
7. T. Y. Liu et al., "A 130.7mm2 2-layer 32Gb ReRAM memory device in 24nm 
technology," 2013 IEEE International Solid-State Circuits Conference Digest of 
Technical Papers, San Francisco, CA, 2013, pp. 210-211. 
8. R. Fackenthal et al., "19.7 A 16Gb ReRAM with 200MB/s write and 1GB/s read 
in 27nm technology," 2014 IEEE International Solid-State Circuits Conference 
Digest of Technical Papers (ISSCC), San Francisco, CA, 2014, pp. 338-339. 
9. A. Fumarola et al., "Accelerating machine learning with Non-Volatile Memory: 
Exploring device and circuit tradeoffs," 2016 IEEE International Conference on 
Rebooting Computing (ICRC), San Diego, CA, 2016, pp. 1-8. 
214 
10. MRAM-info. (2016, August). STT-MRAM: Introduction and market status. 
Retrieved from MRAM-info: https://www.mram-info.com/stt-mram 
11. Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Luyi Kang, Shang 
Li, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung.  
?Memory Systems Challenges in Realizing Monolithic Computers.?  In 
Proceedings of the 4th International Symposium on Memory Systems 
(MEMSYS-IV).  National Harbor, MD.  October 2018. 
12. Emerging Technology and Architecture for Big-data Analytics, by Anupam 
Chattopadhyay, Chip Hong Chang, Hao Yu (Chapter 4: Compute-in-Memory 
Architecture for Data-Intensive Kernels) 
13. Shrunk-2-D: A Physical Design Methodology to Build Commercial-Quality 
Monolithic 3-D ICs, by Shreepad Panth, Kambiz Samad, Yang Du, and Sung 
Kyu Lim 
14. Circuit design for beyond von Neumann applications using emerging memory: 
From nonvolatile logics to neuromorphic computing by Ii-Hao Chen;  Win-San 
Khwa ;  Jun-Yi Li ;  Ii-Yu Lin ;  Huan-Ting Lin ;  Yongpan Liu ;  Yu Wang ;  
Huaqiang Wu ;  Huazhong Yang ;  Meng-Fan Chang 
15. A 16Mb dual-mode ReRAM macro with sub-14ns computing-in-memory and 
memory functions enabled by self-write termination scheme by Ii-Hao Chen; 
215 
In-Jang Lin; Li-Ya Lai; Shuangchen Li; Chien-Hua Hsu; Huan-Ting Lin; Heng-
Yuan Lee; Jian-Ii Su; Yuan Xie; Shyh-Shyuan Sheu; Meng-Fan Chang 
16. ITRS Roadmap. International technology roadmap for semiconductors. 
Semiconductor Industry Association, 2017. 
17. Y. Li, P. Yuan, L. Fu, R. Li, X. Gao, C. Tao, " Coexistence of diode-like volatile 
and multilevel nonvolatile resistive switching in a ZrO 2 /TiO 2 stack structure 
", Nanotechnology, vol. 26, no. 39, pp. 391001, Sep. 2015. 
18. M. Prezioso, F. Merrikh, B. Hoskins, K. Likharev and D. Strukov, ?Self-
adaptive spike-time-dependent plasticity of metal-oxide memristors?, 2015, 
arXiv preprint arXiv:1505.05549 
19. C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, Y. 
Xie, "Overcoming the challenges of crossbar resistive memory architectures", 
High Performance Computer Architecture (HPCA) 2015 IEEE 21st 
International Symposium on. IEEE, pp. 476-488, 2015. 
20. H. Zhang, N. Xiao, F. Liu, Z. Chen, "Leader: Accelerating ReRAM-based main 
memory by leveraging access latency discrepancy in crossbar arrays", DATE, 
pp. 756-761, 2016.  
216 
21. Y. Shi, C. Pan, V. Chen, N. Raghavan, et. Al, ?Coexistence of volatile and 
nonvolatile resistive switching in 2D h-bn based electronic synapses,? IEDM 
pp.119-122, 2017. 
22. W Banerjee, Q Liu, H Lv, S Long, M Liu, ?Electronic imitation of behavioral 
and psychological synaptic activities using TiO x/Al 2 O 3-based memristor 
devices,? Nanoscale 9 (38), 14442-14450. 
23. J. Jeffers, J. Reinders, and A. Sodani, ?Knights landing overview, ?Intel Xeon 
Phi Processor High Performance Programming, pp. 15?24, 2016. 
24. Katam, N. K., Mukhanov, O. A., & Pedram, M. (2018). Superconducting 
Magnetic Field Programmable Gate Array. IEEE Transactions on Applied 
Superconductivity, 28(2), 1?12. doi: 10.1109/tasc.2018.2797262. 
25. M. Jagasivamani, C. Walden, D. Singh, L. Kang, S. Li, M. Asnaashari, S. 
Dubois, D. Yeung, B. Jacob.  ?Design for ReRAM-based Main-Memory 
Architectures."  In Proceedings of the International Symposium on Memory 
Systems, Washington D.C., October 2019.  
26. Lei Wang, CiHui Yang, Jing In, and Shan Gai, Emerging Nonvolatile Memories 
to Go Beyond Scaling Limits of Conventional CMOS Nanodevices, Journal of 
Nanomaterials, vol. 2014, Article ID 927696, 10 pages, 2014. 
217 
27. Ielmini, D. (2016). Resistive switching memories based on metal oxides: 
Mechanisms, reliability and scaling. Semiconductor Science and 
Technology,31(6), 063002. doi:10.1088/0268- 1242/31/6/063002 
28. ReRAM Memory ? Crossbar. (n.d.). Retrieved from https://crossbar-
inc.com/en/ 
29. Jakub Szefer, Principles of Secure Processor Architecture Design, in Synthesis 
Lectures on Computer Architecture, Morgan Claypool Publishers, October 2018. 
30. J. Alex Halderman, Seth D. Schoen, Nadia Heninger, William Clarkson, 
William Paul, Joseph A. Calandrino, Ariel J. Feldman, Jacob Appelbaum, and 
Edward W. Felten.2009. Lest I remember: cold-boot attacks on encryption keys. 
Commun. ACM 52, 5 (May 2009), 91-98. DOI: 
https://doi.org/10.1145/1506409.1506429. 
31. M. Jagasivamani, C. Walden, D. Singh, L. Kang, S. Li, M. Asnaashari, S. 
Dubois, B. Jacob, and D. Yeung.  ?Analyzing the Monolithic Integration of a 
ReRAM-based Main Memory into a CPU's Die.?, in IEEE Micro (Special Issue 
on Monolithic 3D Architectures), November/December 2019. 
218 
32. Ambrosi, E., Bricalli, A., Laudato, M., Ielmini, D. (2019). Impact of oxide and 
electrode materials on the switching characteristics of oxide ReRAM devices. 
Faraday Discussions, 213, 8798. doi: 10.1039/c8fd00106e 
33.  Nail, C., Molas, G., Blaise, P., Piccolboni, G., Sklenard, B., Cagli, C., ? 
Perniola, L. (2016). Understanding RRAM endurance, retention and window 
margin trade-off using experimental results and simulations. 2016 IEEE 
International Electron Devices Meeting (IEDM). doi: 
10.1109/iedm.2016.7838346  
34. Zhang, Y., Feng, D., Liu, J., Tong, W., Wu, B., Fang, C. (2017). A Novel 
ReRAM-based Main Memory Structure for Optimizing Access Latency and 
Reliability. Proceedings of the 54th Annual Design Automation Conference 
2017 on ? DAC 17. doi:10.1145/3061639.3062191. 
35. Prezioso, M., Bayat, F. M., Hoskins, B., Likharev, K., Strukov, D. (2016). Self-
Adaptive Spike-Time-Dependent Plasticity of Metal-Oxide Memristors. 
Scientific Reports,6(1). doi:10.1038/srep21331 
36. Ge, J., & Chaker, M. (2017). Oxygen Vacancies Control Transition of Resistive 
Switching Mode in Single-Crystal TiO2 Memory Device. ACS Applied 
Materials & Interfaces,9(19), 16327-16334. doi:10.1021/acsami.7b03527. 
219 
37. B. Akin, C. Chou, J. Park, C. J. Hughes, and R. Agarwal, ?Dynamic fine-
grained sparse memory accesses," in Proceedings of the International 
Symposium on Memory Systems, MEMSYS '18, (New York, NY, USA), pp. 
85-97, ACM, 2018. 
38.  N. Moussa, F. Nasri, and R. Tourki, ?Noc architecture comparison with 
network simulator ns2," International Journal of Engineering Trends and 
Technology, vol. 13, no. 7, pp. 340-346, 2014. 
39. J. Kim, W. J. Dally, S. Scott, and D. Abts, ?Technology-driven, highly-scalable 
dragonfly topology,"2008 International Symposium on Computer Architecture, 
2008. 
40. A. F. Rodrigues, R. C. Murphy, P. Kogge, and K. D. Underwood, ?Poster 
reception|the structural simulation toolkit," Proceedings of the 2006 ACM/IEEE 
conference on Supercomputing - SC 06, 2006. 
41. Crossbar Inc., ?Crossbar ReRAM Technology White Paper." 2017. 
42. N. Jain, A. Bhatele, S. White, T. Gamblin, and L. V. Kale, ?Evaluating hpc 
networks via simulation of parallel workloads," SC16: International Conference 
for High Performance Computing, Networking, Storage and Analysis, 2016. 
220 
43. A. Bhatele, N. Jain, Y. Livnat, V. Pascucci, and P.-T. Bremer, ?Analyzing 
network health and congestion in dragonfly-based supercomputers," 2016 IEEE 
International Parallel and Distributed Processing Symposium (IPDPS), 2016. 
44.  J. Seo and B. Kim, ?Read margin analysis in an reram crossbar array," 2016 
IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), 2016. 
45.  S. V and N. Chiplunkar, ?Design and implementation of mesh and torus for 
network on chip based system," 2015 International Conference on Trends in 
Automation, Communications and Computing Technology (I-TACT-15), 2015. 
46.  M. M. Kim, J. D. Davis, M. Oskin, and T. Austin, ?Polymorphic on-chip 
networks," 2008 International Symposium on Computer Architecture, 2008. 
47.  M. M. Kim, M. Mehrara, M. Oskin, and T. Austin, ?Architectural implications 
of brick and mortar silicon manufacturing," Proceedings of the 34th annual 
international symposium on Computer architecture ? ISCA 07, 2007. 
48.  X. Liu, S. Mohanraj, M. Pioro, and D. Medhi, ?Multipath routing from a trac 
engineering perspective: How beneficial is it?," pp. 143-154, 10 2014.  
49.  R. Marculescu, U. Y. Ogras, L.-S. Peh, N. E. Jerger, and Y. Hoskote, 
?Outstanding research problems in noc design: System, microarchitecture, and 
circuit perspectives," IEEE Transactions on Computer-Aided Design of 
Integrated Circuits and Systems, vol. 28, no. 1, pp. 3-21, 2009.  
221 
50.  K. S. Solnushkin, ?Automated design of two-layer fat-tree networks." 
http://arxiv.org/abs/1301.6179, January 2013. arXiv:1301.6179. 
51.  K. S. Solnushkin, ?Automated design of torus networks." 
http://arxiv.org/abs/1301.6180, January 2013.arXiv:1301.6180. 
52.  F. J. Andujar, S. Coll, M. Alonso, P. Lopez, and J.-M. Martinez, ?Powar," vol. 
15, pp. 1-22, 2019. 
53.  Z. Wang and S. Ma, Networks-on-chip: from implementations to programming 
paradigms. Morgan Kaufmann, 2015. 
54. Kannan, S., Karimi, N., Sinanoglu, O., Karri, R. (2015). Security 
Vulnerabilities of Emerging Nonvolatile Main Memories and Countermeasures. 
IEEE Transactions on Computer- Aided Design of Integrated Circuits and 
Systems,34(1), 2-15. doi:10.1109/tcad.2014.2369741  
55. Shewmon, P. (1989). Diffusion in solids. Warrendale, PA: Minerals, Metals & 
Materials Society. 
56. Bertaud, T., Sowinska, M., Walczyk, D., Walczyk, C., Kubotsch, S., Wenger, C., 
& Schroeder, T. (2012). Resistive switching of Ti/HfO2-based memory devices: 
impact of the atmosphere and the oxygen partial pressure. IOP Conference 
Series: Materials Science and Engineering, 41, 012018.  
222