ABSTRACT

Title of dissertation: TRACE OBLIVIOUS PROGRAM EXECUTION

Chang Liu, Doctor of Philosophy, 2016

Dissertation directed by: Professor Michael Hicks, Department of Computer Science, University of Maryland, and Professor Elaine Shi, Department of Computer Science, Cornell University

The big data era has dramatically transformed our lives; however, security incidents such as data breaches can put sensitive data (e.g., photos, identities, genomes) at risk. To protect users' data privacy, there is a growing interest in building secure cloud computing systems, which keep sensitive data inputs hidden, even from computation providers. Conceptually, secure cloud computing systems leverage cryptographic techniques (e.g., secure multiparty computation) and trusted hardware (e.g., secure processors) to instantiate a secure abstract machine consisting of a CPU and encrypted memory, so that an adversary cannot learn information through either the computation within the CPU or the data in the memory. Unfortunately, evidence has shown that side channels (e.g., memory accesses, timing, and termination) in such a secure abstract machine may potentially leak highly sensitive information, including the cryptographic keys that form the root of trust for these secure systems.

This thesis broadly expands the investigation of a research direction called trace oblivious computation, in which programming language techniques are employed to prevent side-channel information leakage. We demonstrate the feasibility of trace oblivious computation by formalizing and building several systems: GhostRider, a hardware-software co-design that provides a hardware-based trace oblivious computing solution; SCVM, an automatic RAM-model secure computation system; and ObliVM, a programming framework that helps programmers develop secure computation applications.
All of these systems enjoy formal security guarantees while demonstrating performance better than prior systems by one to several orders of magnitude.

TRACE OBLIVIOUS PROGRAM EXECUTION

by Chang Liu

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2016

Advisory Committee:
Professor Michael W. Hicks, Co-Chair/Co-Advisor
Professor Elaine Shi, Co-Chair/Co-Advisor
Professor Charalampos Papamanthou
Professor Zia Khan
Professor Lawrence C. Washington

© Copyright by Chang Liu, 2016

Dedication

To my loving mother.

Acknowledgments

I am very fortunate to have spent three fantastic years at UMD. Throughout my journey to learn how to do research, many people have helped me. This dissertation would not have been possible without them.

First, I want to thank my advisors, Elaine Shi and Michael Hicks, for their guidance into this amazing academic world. Through our collaboration, Elaine and Mike have set an example for me by both precept and practice. I not only benefited from insightful technical discussions with them, but also learnt a lot about other aspects of research: how to pick an important problem to work on, how to criticize a piece of work, especially my own, and so on. I have learnt from many people that a piece of research is the execution of great taste: I have learnt from Elaine a great taste in choosing research problems and pushing them closer to perfection, and from Mike a great taste in developing elegant theories to explain phenomena and in communicating ideas precisely and concisely. I have also learnt from Jon Froehlich, my proposal committee member, a great taste in giving presentations, and from Dawn Song, with whom I worked closely in my last year, a great taste in envisioning novel research directions. I am grateful to have worked with these fantastic researchers during my PhD.
My committee members – my advisors, Babis Papamanthou, Zia Khan, and Larry Washington – have been extremely supportive. My thanks also go to all other members of the MC2 faculty. I am also fortunate to be surrounded by a diverse and inspiring group of (former and current) students at MC2.

I am fortunate to have worked with my collaborators; many ideas in this dissertation could not have been executed so well without their help. They are, in no particular order, my advisors, Xiao Wang, Natik Kayak, Yan Huang, Martin Maas, Austin Harris, Mohit Tiwari, and Jonathan Katz.

Now, I would like to thank some dear friends. Xi Chen, Yuening Hu, Ke Zhai, He He, and I spent a wonderful summer at Microsoft Research in Redmond, the most enjoyable summer of my PhD life. I am fortunate to be friends with Yulu Wang and Shangfu Peng, who made my early graduate life happy and substantial. Qian Wu and Hang Hu are my old friends who have always supported me in both life and research. Xi Yi and Yu Zhang brought both happiness and insightful suggestions into my life; I am happy to have introduced the two of them to each other.

Life at UMD would have been much more difficult without members of the administrative and technical staff. Jennifer Story has always been patient and helpful beyond the call of duty. Joe Webster has made resources available just as we needed them. Jodie Gray and Sharron McElroy have minimized bureaucracy when it comes to tax and money. Thank you all.

Last, but in no way least, I thank my family, who have given me unconditional love throughout my life. My mother, to whom I owe much of my academic career, has supported me altruistically in pursuing my dream. Words are not enough to express my gratitude for them. Finally, I thank Danqi Hu, who has always been there for me as an inspiring partner. Thank you for steering me to a better self.

Table of Contents

List of Figures

1 Introduction
  1.1 Beyond Oblivious RAM
  1.2 A hardware-software co-design for ensuring memory trace obliviousness
  1.3 Automatic RAM-model secure computation
  1.4 A programming framework for secure computation
  1.5 Our Results and Contributions

2 Memory Trace Obliviousness: Basic Setting
  2.1 Threat Model
  2.2 Motivating Examples
  2.3 Approach Overview
  2.4 Memory trace obliviousness by typing
    2.4.1 Operational semantics
    2.4.2 Memory trace obliviousness
    2.4.3 Security typing
    2.4.4 Examples
  2.5 Compilation
    2.5.1 Type checking source programs
    2.5.2 Allocating variables to ORAM banks
    2.5.3 Inserting padding instructions
  2.6 Evaluation
    2.6.1 Simulation Results
  2.7 Concluding Remarks

3 GhostRider: A Compiler-Hardware Approach
  3.1 Introduction
    3.1.1 Our Results and Contributions
  3.2 Architecture and Approach
    3.2.1 Motivating example
    3.2.2 Threat model
    3.2.3 Architectural Overview
  3.3 Formalizing the target language
    3.3.1 Instruction set
    3.3.2 Example
  3.4 Security by typing
    3.4.1 Memory Trace Obliviousness
    3.4.2 Typing: Preliminaries
    3.4.3 Type rules
    3.4.4 Security theorem
  3.5 Compilation
    3.5.1 Source Language
    3.5.2 Memory bank allocation
    3.5.3 Basic compilation
    3.5.4 Padding and register allocation
  3.6 Hardware Implementation
  3.7 Empirical Evaluation
  3.8 Conclusion

4 RAM-model Secure Computation
  4.1 Technical Highlights
  4.2 Background: RAM-Model Secure Computation
  4.3 Technical Overview: Compiling for RAM-Model Secure Computation
    4.3.1 Instruction-Trace Obliviousness
    4.3.2 Memory-Trace Obliviousness
    4.3.3 Mixed-Mode Execution
    4.3.4 Example: Dijkstra's Algorithm
  4.4 SCVM Language
    4.4.1 Syntax
    4.4.2 Semantics
    4.4.3 Security
    4.4.4 Type System
    4.4.5 From SCVM Programs to Secure Protocols
  4.5 Compilation
  4.6 Evaluation
    4.6.1 Evaluation Methodology
    4.6.2 Comparison with Automated Circuits
      4.6.2.1 Repeated sublinear-time queries
      4.6.2.2 Faster one-time executions
    4.6.3 Comparison with RAM-SC Baselines
  4.7 Conclusions

5 ObliVM: A Programming Framework for Secure Computation
  5.1 ObliVM Overview and Contributions
    5.1.1 Applications and Evaluation
    5.1.2 Threat Model, Deployment, and Scope
  5.2 Programming Language and Compiler
    5.2.1 Language features for expressiveness and efficiency
    5.2.2 Language features for security
  5.3 User-Facing Oblivious Programming Abstractions
    5.3.1 MapReduce Programming Abstractions
    5.3.2 Programming Abstractions for Data Structures
    5.3.3 Loop Coalescing and New Oblivious Graph Algorithms
  5.4 Implementing Rich Circuit Libraries
    5.4.1 Case Study: Basic Arithmetic Operations
    5.4.2 Case Study: Circuit ORAM
  5.5 Evaluation
    5.5.1 Back End Implementation
    5.5.2 Metrics and Experiment Setup
    5.5.3 Comparison with Previous Automated Approaches
    5.5.4 ObliVM vs. Hand-Crafted Solutions
    5.5.5 End-to-End Application Performance
  5.6 Conclusion

6 Concluding Remarks and Future Directions
  6.1 Summary
  6.2 Future Directions
    6.2.1 Verifying Hardware ORAM Implementations
    6.2.2 Parallel Trace Oblivious Execution
    6.2.3 Differentially Private Oblivious Execution

A Proof of Theorem 1
  A.1 Trace equivalence and lemmas
  A.2 Lemmas on trace pattern equivalence
  A.3 Proof of memory trace obliviousness

B Proof of Theorem 2

C Proof of Theorem 3
  C.1 Proof of Theorem 4

D The hybrid protocol and the proof of Theorem 5

Bibliography

List of Figures

2.1 Language syntax of Lbasic
2.2 Auxiliary syntax and functions for semantics in Lbasic
2.3 Operational semantics of Lbasic
2.4 Trace equivalence in Lbasic
2.5 Auxiliary syntax and functions for typing in Lbasic
2.6 Typing for Lbasic
2.7 Trace pattern equivalence in Lbasic
2.8 Finding a short padding sequence using the greatest common subsequence algorithm. An example with two abstract traces Tt = [T1;T2;T3;T4;T5] and Tf = [T1;T3;T2;T4]. One greatest common subsequence, as shown, is [T1;T2;T4]. A shortest common super-sequence of the two traces is Ttf = [T1;T3;T2;T3;T4;T5].
2.9 Simulation Results for Strawman, Opt 1, and Opt 2.
3.1 Motivating source program of GhostRider.
3.2 GhostRider architecture.
3.3 Syntax for the LGhostRider language, comprising (1) ldb and stb instructions that move data blocks between the scratchpad and a specific ERAM or ORAM bank, and (2) scratchpad-to-register moves and standard RISC instructions.
3.4 LGhostRider code implementing (part of) Figure 3.1
3.5 Symbolic values, labels, auxiliary judgments and functions
3.6 Trace patterns and their equivalence in LGhostRider
3.7 Security Type System for LGhostRider (Part 1)
3.8 Security Type System for LGhostRider (Part 2)
3.9 Security Type System for LGhostRider (Part 3)
3.10 Simulator-based execution time results of GhostRider.
3.11 Legend for Figure 3.10
3.12 FPGA-based execution time results: slowdown of the Baseline and Final versions compared to the non-secure version of the program. Note that unlike Figure 3.10, Final uses only a single ORAM bank and conflates ERAM and DRAM (cf. Section 3.6).
4.1 Dijkstra's shortest distance algorithm in source (Part)
4.2
4.3 Formal results in SCVM.
4.4 Syntax of SCVM
4.5 Auxiliary syntax and functions for SCVM semantics
4.6 Operational semantics for expressions in SCVM
4.7 Operational semantics for statements in SCVM (Part 1)
4.8 Operational semantics for statements in SCVM (Part 2)
4.9 Type System for SCVM
4.10 SCVM vs. automated circuit-based approach (Binary Search)
4.11 SCVM vs. hand-constructed linear scan circuit (Binary Search)
4.12 Heap insertion in SCVM
4.13 Heap extraction in SCVM
4.14 KMP string matching for medium n (fixing m = 50) in SCVM
4.15 KMP string matching for large n (fixing m = 50) in SCVM
4.16 Dijkstra's shortest-path algorithm's performance in SCVM
4.17 Dijkstra's shortest-path algorithm's speedup in SCVM
4.18 Aggregation over sliding windows: performance in SCVM
4.19 Aggregation over sliding windows: speedup in SCVM
4.20 SCVM's savings from the memory-trace obliviousness optimization (inverse permutation). The non-linearity of the curve (around 60) is due to the increase in the ORAM recursion level at that point.
4.21 Savings from the memory-trace obliviousness optimization (Dijkstra)
5.1 Streaming MapReduce in ObliVM-Lang. See Section 5.3.1 for oblivious algorithms for the streaming MapReduce paradigm [36].
5.2 Oblivious stack by non-specialist programmers.
5.3 Code by expert programmers to help non-specialists implement oblivious stacks.
5.4 Loop coalescing. The outer loop will be executed at most n times in total, and the inner loop at most m times in total, over all iterations of the outer loop. A naive compiler would pad the outer and inner loops to n and m iterations respectively, incurring O(nm) cost. Our loop coalescing technique achieves O(n+m) cost instead.
5.5 Karatsuba multiplication in ObliVM-Lang.
5.6 Part of our Circuit ORAM implementation (Type Definition) in ObliVM-Lang.
5.7 Part of our Circuit ORAM implementation (ReadAndRemove) in ObliVM-Lang.
5.8 Sources of speedup in comparison with the state of the art in 2012 [44]: an in-depth look.
B.1 Well-formedness judgments for the proof of Memory-Trace Obliviousness of LGhostRider
B.2 Symbolic Execution in LGhostRider
C.1 Operational semantics for simA
C.2 Operational semantics for statements in simA (Part 1)
C.3 Operational semantics for statements in simA (Part 2)
D.1 Hybrid Protocol ΠG (Part I)
D.2 Hybrid Protocol ΠG (Part II)

Chapter 1: Introduction

Cloud computing allows users to outsource both data and computation to third-party cloud providers, and promises numerous benefits such as economies of scale, easy maintenance, and ubiquitous availability. These benefits, however, come at the cost of giving up physical control of one's computing infrastructure and private data. Privacy concerns have held back government agencies and businesses alike from outsourcing their computing infrastructure to the public cloud [18,94].
To protect users' data privacy, there is a growing trend to build secure cloud computing systems, which enable computation over two or more parties' sensitive data while revealing nothing more than the results to the participating parties, and nothing at all to any third-party providers. Conceptually, secure cloud computing systems leverage cryptographic techniques (e.g., secure multiparty computation) and trusted hardware (e.g., secure processors) to instantiate a "secure" abstract machine consisting of a CPU and encrypted memory, so that an adversary cannot learn information through either the computation within the CPU or the data in the memory. Unfortunately, evidence has shown that side channels (e.g., memory accesses, timing, and termination) in such a "secure" abstract machine may potentially leak highly sensitive information, including cryptographic keys that form the root of trust for the secure systems.

The thesis of this work is that programming language-based techniques—notably compilers and type systems—can be used to make programs secure, despite powerful adversaries with a fine-grained view of execution, by enforcing the property of trace obliviousness, which we define in this thesis.

1.1 Beyond Oblivious RAM

To cryptographically obfuscate memory access patterns, one can employ Oblivious RAM (ORAM) [33,35], a cryptographic construction that makes memory address traces computationally indistinguishable from a random address trace. Encouragingly, recent theoretical breakthroughs [80,84,90] have allowed ORAM memory controllers to be built [28,63] – these turn DRAM into oblivious, block-addressable memory banks. The simplest way to deploy ORAM is to implement a single, large ORAM bank that contains all the code and data (assuming that the client can use standard PKI to safely transmit the code and data to a remote secure processor).
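For intuition about what an ORAM guarantees, the following deliberately naive "linear-scan" sketch (our own illustration, not a construction from this dissertation; the class and method names are hypothetical) makes every logical access touch every physical block, so the address trace the server observes is the same fixed sequence no matter which index the client actually wanted:

```python
class LinearScanORAM:
    """Naive ORAM sketch: each logical access scans all physical blocks,
    so the observable address trace per access is the fixed sequence
    0..n-1, independent of the requested index."""

    def __init__(self, n_blocks):
        self.blocks = [0] * n_blocks
        self.trace = []  # physical addresses visible to the adversary

    def access(self, index, new_value=None):
        result = None
        for addr in range(len(self.blocks)):
            self.trace.append(addr)      # the server observes this address
            block = self.blocks[addr]    # read every block...
            if addr == index:
                result = block
                if new_value is not None:
                    block = new_value
            self.blocks[addr] = block    # ...and write every block back
        return result
```

Two clients accessing different secret indices produce identical traces, but each access costs O(n) work; real tree-based constructions reduce this to poly-logarithmic overhead, though some per-access penalty remains inherent.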
A major drawback of this baseline approach is efficiency: every single memory-block access incurs the ORAM penalty, which is roughly (poly-)logarithmic in the size of the ORAM [35,80]. In practice, this translates to almost 100× additional bandwidth, which, even with optimizations, incurs a ∼10× latency cost per block [28,63]. Another issue is that, absent any padding, the baseline approach reveals the total number of memory accesses made by a program, which can leak information about the secret inputs.

In practice, we observe that many programs are intrinsically oblivious, meaning that the execution traces observed by an adversary do not leak information. For example, consider a program that computes the sum of an array of integers, where the integers are to be hidden from the adversary. This program sequentially scans the entire array, and assuming the array is encrypted in memory, its memory access pattern does not leak any information about the secret integers stored in the array. In this case, ORAM is not necessary.

Enabling such optimizations is not trivial, because a program is not always oblivious. For example, a program may mistakenly allocate outside any ORAM bank an array whose access pattern does leak sensitive information. Our approach is to formalize the property of memory trace obliviousness (MTO) and to design novel type systems enforcing that a well-typed program enjoys memory trace obliviousness. A type system is a static analysis method ensuring that a well-typed program satisfies certain properties. Intuitively, the type systems we develop extend language-based information flow systems [79], which keep track of whether or not a variable used by a program contains sensitive information. Our extension further keeps track of whether the memory access patterns produced by a program's execution will leak information to the adversary.
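To make the contrast concrete, the sketch below (our own illustration; the function names are ours) logs the address trace of the array-summing program against that of a binary search, whose probe sequence depends on the secret query:

```python
def sum_trace(arr):
    """Sequential sum: the address trace is 0..n-1 for any array contents,
    so observing it reveals nothing about the secret values."""
    trace, total = [], 0
    for i in range(len(arr)):
        trace.append(i)          # address the adversary observes
        total += arr[i]
    return total, trace

def binary_search_trace(arr, key):
    """Binary search: the addresses probed depend on the secret key,
    so the trace leaks information unless the array lives in ORAM."""
    trace, lo, hi = [], 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        trace.append(mid)        # data-dependent probe
        if arr[mid] == key:
            break
        if arr[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return trace
```

Running `sum_trace` on any two arrays of the same length yields identical traces, while `binary_search_trace` yields different traces for different keys; only the latter needs ORAM protection.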
In the following, we discuss the two main routes to secure cloud computing: secure-processor-based solutions and secure-computation-based solutions.

1.2 A hardware-software co-design for ensuring memory trace obliviousness

To protect the confidentiality of sensitive data in the cloud, thwarting software attacks alone is necessary but not sufficient. An attacker with physical access to the computing platform (e.g., a malicious insider or intruder) can launch various physical attacks, such as tapping memory buses, plugging in malicious peripherals, or using cold-(re)boots [40,81]. Such physical attacks can uncover secrets even when the software stack is provably secure.

A secure processor employs memory encryption [32,56,85–87] to hide the contents of memory from direct inspection, but an adversary can still observe memory addresses transmitted over the memory bus. As argued above, however, the memory address trace is a side channel that can leak sensitive information. We thus exploit the above idea to enable memory obfuscation within the secure processor, and employ programming language techniques to optimize a program's execution.

Deploying this idea to build a memory trace oblivious system on realistic hardware architectures is non-trivial. First, as explained above, the entire memory needs to be split into ORAM banks, whose access addresses are obfuscated, versus encrypted RAM, whose access addresses are not. Second, cache behavior can break a program's memory trace oblivious execution, and it is hard to track statically; the compiler needs hardware support to deterministically control the cache. Third, the type system needs to deal with assembly code directly, since otherwise the MTO property may be broken during compilation. To tackle these problems, we present a hardware-software co-design called GhostRider.
GhostRider's hardware architecture supports both an encrypted RAM region and multiple ORAM banks, which programs can leverage to optimize performance. The GhostRider compiler translates a program written in a C-like high-level language into assembly code using GhostRider's instruction set architecture (ISA). GhostRider provides a type checker over the assembly code to enforce that a well-typed program enjoys MTO. We explain GhostRider in detail in Chapter 3.

1.3 Automatic RAM-model secure computation

An alternative route to secure cloud computing is through secure computation. Secure computation is a cryptographic technique that allows mutually distrusting parties to make collaborative use of their local data without harming the privacy of their individual inputs. Since Yao's seminal paper [96], research on secure two-party computation—especially in the semi-honest model we consider here—has flourished, resulting in ever more efficient protocols [11,38,52,97] as well as several practical implementations [16,43,45,46,54,65]. Since the first system for general-purpose secure two-party computation was built in 2004 [65], efficiency has improved substantially [11,46].

Almost all previous implementations of general-purpose secure computation assume the underlying computation is represented as a circuit. While theoretical developments using circuits are sensible (and common), compiling typical programs, which assume a von Neumann-style Random Access Machine (RAM) model, to efficient circuits can be challenging. One significant challenge is handling dynamic memory accesses to an array in which the memory location being read or written depends on secret inputs. A program-to-circuit compiler typically makes an entire copy of the array upon every dynamic memory access, resulting in a huge circuit when the data size is large.
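A hypothetical sketch of why this is expensive: with a secret index, a circuit cannot "jump" to one element, so a compiler emits a multiplexer chain that touches every element on every access. The function below (our own illustration; the names are ours) models that gate-level pattern in ordinary Python:

```python
def oblivious_read(arr, secret_index):
    """Circuit-style array read: every element is combined through a mux
    chain, so the 'circuit' has O(len(arr)) gates per dynamic access,
    versus O(1) work for the same read in the RAM model."""
    result = 0
    for j, value in enumerate(arr):
        selected = 1 if j == secret_index else 0          # comparator sub-circuit
        result = selected * value + (1 - selected) * result  # mux gate
    return result
```

Repeating this for T dynamic accesses over an array of size N gives the O(TN) circuit-size blowup discussed next, which is exactly what RAM-model secure computation with ORAM avoids.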
Generic approaches for translating RAM programs into circuits incur, in general, an O(TN) blowup in efficiency, where T is an upper bound on the program's running time and N is the memory size. To address these limitations, researchers have more recently considered secure computation that works directly in the RAM model [38,62]. The key insight is to rely on ORAM to enable dynamic memory access with poly-logarithmic cost, while preventing information leakage through memory-access patterns. Gordon et al. [38] observed a significant advantage of RAM-model secure computation (RAM-SC) in the setting of repeated sublinear-time queries (e.g., binary search) on a large database: by amortizing the setup cost over many queries, RAM-SC can achieve amortized cost asymptotically close to the running time of the underlying program in the insecure setting.

To make secure computation practical, it is ideal to have a complete system in which developers implement applications in a high-level (and thus RAM-model) language rather than implementing circuits directly, and the compiler translates the program into an efficient secure computation protocol while enforcing security.

In pursuing this goal, the MTO approach that we developed for GhostRider can be leveraged in this setting as well. In particular, we can use the same kind of MTO analysis to decide whether or not some data should be stored in an ORAM bank, and a type system to ensure that a program is MTO. Secure computation, however, imposes additional security restrictions. For example, secure computation requires the circuit to be evaluated by both parties, which means that both parties know the instruction being executed. Therefore, secure computation requires instruction trace obliviousness beyond memory trace obliviousness.
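For intuition, a secret-dependent branch can be made instruction-trace oblivious by executing both branch bodies unconditionally and selecting the result with a multiplexer, so the sequence of executed "instructions" is fixed. A toy sketch (our own, with made-up opcode names, not this dissertation's compilation scheme):

```python
def oblivious_abs(secret_x):
    """Compute |secret_x| with an input-independent instruction trace:
    both branch bodies run on every input; only the final mux consumes
    the secret condition."""
    trace = []                                   # opcodes both parties observe
    trace.append("neg"); then_val = -secret_x    # 'then' branch body
    trace.append("mov"); else_val = secret_x     # 'else' branch body
    trace.append("mux")                          # selection gate
    result = then_val if secret_x < 0 else else_val
    return result, trace
```

Every input yields the same three-opcode trace, so neither party learns the sign of the secret from watching which instructions run.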
Further, since ORAM protocols are themselves implemented as circuits in secure computation, programmers can make non-blackbox use of ORAM as well. For example, developers can directly use tree-based non-recursive ORAM [90], a less expensive ORAM building block, to achieve better performance. To address these issues, we developed the SCVM system, which includes an intermediate representation, a compiler, and a security type system, demonstrating how to achieve automatic, efficient RAM-model secure computation. We detail SCVM in Chapter 4.

1.4 A programming framework for secure computation

As a last contribution, we also deliver the ObliVM system as an extension on top of SCVM. It provides a programming framework with more expressive power and easier programmability, helping developers write better algorithms more easily. ObliVM focuses on helping developers build secure computation applications. The design goal is to allow both cryptographic experts to improve the efficiency of low-level cryptographic protocols, and application developers who may not be familiar with cryptography to implement efficient applications in a high-level language. To meet these goals, we designed a new high-level programming language, called ObliVM-Lang, as an extension to SCVM, and implemented a compiler. Using this language, we can implement programming abstractions that enable application developers to implement algorithms easily and efficiently. ObliVM also provides a back end, called ObliVM-SC, which allows cryptographic experts to implement different protocols to further accelerate the execution of the whole system. We explain ObliVM in Chapter 5.

1.5 Our Results and Contributions

In this thesis, we propose the theory of trace oblivious computation and bring it to practice.
We design and build GhostRider, a hardware/software platform for provably secure, memory-trace oblivious program execution, which compiles programs to a realistic architecture while formally ensuring MTO. We also design and build SCVM and ObliVM to enable developers to build efficient secure computation applications. In summary, our contributions are:

Trace obliviousness theory. We greatly extend the study of trace oblivious program execution by providing a theory that establishes when a program is trace oblivious. In particular, we demonstrate how to formalize a language so that the adversary-observable execution traces generated by a program can be modeled, and how type systems can be used to enforce the trace obliviousness of programs in these languages. On the one hand, these type systems extend the set of oblivious programs that can be verified automatically, to include more efficient implementations. On the other hand, they can be extended to analyze other side-channel leakages expressible as traces, and thus used to defend against attacks leveraging those channels.

GhostRider system. GhostRider is the first system to bring trace oblivious computation theory to practice. In building GhostRider, we build the first memory-trace obliviousness compiler that emits target code for a realistic ORAM-capable processor architecture. GhostRider's compiler optimizes the generated assembly code while ensuring the optimized code still satisfies MTO. To enable these optimizations, GhostRider builds on the Phantom processor architecture [63] but exposes new features and knobs to the software. Our empirical results on a real processor demonstrate the feasibility of our architecture and show that, compared to the baseline approach of placing everything in a single ORAM bank, our compile-time static analysis achieves up to nearly an order-of-magnitude speedup for many common programs.

SCVM system.
We build SCVM as the first system to enable automatic RAM-model secure computation. SCVM provides a complete system that takes a program written in a high-level language and compiles it to a protocol for secure two-party computation of that program. To achieve this goal, SCVM provides an intermediate representation, a type system ensuring that any well-typed program will generate a computation protocol secure in the semi-honest model, and a compiler to transform a program written in a high-level language into a secure two-party computation protocol, while integrating compile-time optimizations crucial for improving performance. Our evaluation shows a speedup of 1–2 orders of magnitude compared to standard circuit-based approaches for securely computing the same programs.

ObliVM system. Building on top of SCVM, we design and implement ObliVM, which focuses on richer expressive power and easy programmability while achieving state-of-the-art performance for secure computation. ObliVM provides an expressive programming language that allows both cryptographic experts and non-experts to customize front-end and back-end optimizations very easily. Using this programming language, we implement several programming abstractions in ObliVM to help developers design and implement new efficient oblivious algorithms. Experiments show that the automatically generated circuits incur only 0.5% to 2% overhead over manually optimized implementations.

Chapter 2: Memory Trace Obliviousness: Basic Setting

In this chapter, we discuss a simple setting for memory trace oblivious program execution that still captures the essential ideas. We consider a simple client-server scenario. A program is run on the client, but the data operated on by the program is stored on the server. The goal is to ensure that the server cannot learn any sensitive information from either the data stored on the server or its interaction with the client-side program.
We assume the server manages the data as a big memory, so that the only way a client program can interact with the server is through memory read/write APIs. In particular, a read call read(i) requests the data block indexed by i from the server, and a write call write(i, data) updates the data block indexed by i with data. The server can thus observe i and data during such interactions. This chapter explains how we can use ORAM to prevent the server from learning any information through these interactions, and how a compiler and a type system can help enforce this security property over a program automatically while preserving efficiency.

Before we go into the details, we want to emphasize that although this setting has interesting applications of its own, it can be extended to further applications such as secure processors, where the CPU and RAM correspond to the client and server respectively, and secure multi-party computation. We will discuss these extensions in Chapters 3, 4, and 5. In this chapter we explain the basic ideas and techniques used to achieve trace oblivious computation, which will be further extended in later chapters.

This chapter is based on a paper that I co-authored with Michael Hicks and Elaine Shi [58]. I developed the formalism and conducted the proof of the main memory trace obliviousness theorem with the help of Michael Hicks, and provided preliminary experimental results on performance.

2.1 Threat Model

In this setting, a server stores all the data, and a client runs programs interacting with the server to get data. We assume that the client has small local storage. The server manages the data as a random access memory (RAM), which supports read and write operations at arbitrary addresses. The adversarial model assumes that the server can observe all addresses that the client program is accessing, but not the client program's internal state.
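As a concrete illustration of this interface, here is a minimal Python sketch (all names are our own, and XOR with a fixed key stands in for real encryption; it is a placeholder, not a secure construction). It also previews the ORAM idea discussed in this chapter in its simplest "linear-scan" form, where every logical access touches every block, so the server-visible address sequence is independent of the secret index. Real ORAMs achieve polylogarithmic rather than linear overhead.

```python
KEY = 0x5A

def enc(value):
    # Toy stand-in for encryption: XOR every byte with a fixed key.
    return bytes(b ^ KEY for b in value)

dec = enc  # XOR is its own inverse

class Server:
    """The server exposes only block-level read/write, and logs exactly
    what it can observe: the index and the (encrypted) payload."""
    def __init__(self, num_blocks):
        self.blocks = [bytes(1)] * num_blocks
        self.observed = []              # the adversary's view

    def read(self, i):
        self.observed.append(("read", i))
        return self.blocks[i]

    def write(self, i, data):
        self.observed.append(("write", i, data))
        self.blocks[i] = data

def oblivious_read(server, secret_index):
    """Linear-scan 'ORAM': touch every block so the observed address
    sequence does not depend on which block was actually wanted."""
    wanted = None
    for i in range(len(server.blocks)):
        block = dec(server.read(i))
        if i == secret_index:
            wanted = block
        server.write(i, enc(block))     # re-encrypt and write back
    return wanted
```

Two runs with different secret indices yield identical address sequences on the server side, which is exactly the hiding property developed more efficiently in the rest of the chapter.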
We assume the data stored on the server is encrypted, so that the server cannot observe it directly. The client program decrypts the data once retrieved, and re-encrypts it before uploading it to the server (via write operations). We consider the honest-but-curious model, in which the server does not modify the stored data. Orthogonal techniques, such as Merkle hash trees, can be used to enforce data integrity.

Though they are a real threat, timing and other covert channels are not considered in this basic setting. Later, we will show how our GhostRider system (Chapter 3) and ObliVM (Chapter 5) can prevent leakage through these channels.

2.2 Motivating Examples

Let us consider the following program, where the client does not have enough space to store a big dataset, and so offloads it to the server instead.

1: int findmax(public int n, secret int* data) {
2:   secret int max = data[0];
3:   for (public int i = 1; i < n; ++i) {
4:     if (data[i] > max)
5:       max = data[i] } }

In this client program, data refers to the client's data stored on the server. The keyword secret denotes that this data is sensitive. A straightforward idea for ensuring that the server does not learn information through addresses is for the client program to run an ORAM protocol with the server and store all data in a giant ORAM. This approach, however, has two drawbacks. First, it may not be secure, because the total number of memory accesses may still leak information. In the above example, based on the total number of accesses, the server will learn how many times line 5 is executed, and can then infer some information, such as whether data is in ascending order. To mitigate this problem, the client program needs to perform dummy accesses to the server even when the condition at line 4 is not satisfied.
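To see the first drawback concretely, a toy count of findmax's server accesses (illustrative Python, counting one access per read of data[i]) shows that without the dummy access the total depends on the secret contents, while with it the total depends only on the array length.

```python
def access_count(data, with_dummy):
    """Count server accesses made by findmax over `data`:
    one to read data[0], one per guard evaluation of data[i],
    and one per (real or dummy) assignment in the loop body."""
    count = 1                      # line 2: max = data[0]
    mx = data[0]
    for i in range(1, len(data)):
        count += 1                 # line 4: read data[i] for the guard
        if data[i] > mx:
            mx = data[i]
            count += 1             # line 5: max = data[i]
        elif with_dummy:
            count += 1             # inserted: dummy_max = data[i]
    return count
```

On [1, 2, 3, 4] (ascending) versus [4, 4, 4, 4], the unpadded counts differ (7 vs 4), revealing ordering information; with dummy accesses both inputs cost 7 accesses.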
To this aim, the program needs to be rewritten so that in the false branch of line 4 a dummy statement dummy_max = data[i] is inserted, where dummy_max is an inserted dummy variable that has no side effects on other data such as max and data. Second, this is not efficient, as it incurs an unnecessary ORAM overhead. In particular, this program sequentially scans through the entire data array and keeps the maximal value in max. In this case, ORAM is not necessary, since the data access addresses, i.e., the values of i, are public information that the server can infer without knowing anything about the secret data.

2.3 Approach Overview

We now present several technical highlights.

Memory Trace Obliviousness. The adversary can observe the stream of accesses to the server, even if he cannot observe the content of those accesses, and such observations are sufficient to infer secret information. To eliminate this channel of information, we need a way to run the program so that the event stream does not depend on the secret data: no matter the values of the secrets, the observable events will be the same. Programs that exhibit this behavior enjoy a property we call memory trace obliviousness.

Padding. Toward ensuring memory trace obliviousness, the compiler can add padding instructions to either or both branches of if statements whose guards reference secret information (we refer to such guards as high guards). This idea is similar to inserting padding to ensure uniform timing [6, 10, 22, 42]. We need to insert dummy accesses into the two branches such that both branches have equivalent access patterns. We will detail our approach in Section 2.5.

ORAM for secret data. We store secret data in Oblivious RAM (ORAM), and extend our trusted computing base with a client-side ORAM library.
This library will encrypt/decrypt the secret data and maintain a mapping between addresses for variables used by the program and actual storage addresses for those variables on the server. For each read/write issued for a secret address, the ORAM library will issue a series of reads/writes to the server with actual addresses, which has the effect of hiding which of the accesses was to the real address. Moreover, with each access, the ORAM library will shuffle program/storage address mappings so that the physical location of any program variable is constantly in flux. Asymptotically, ORAM accesses incur a polylogarithmic blowup in the size of the ORAM [35]. Note that if we were concerned about integrity, we could compose the ORAM library with machinery for, say, authenticated accesses.

Multiple ORAM banks. ORAM can be an order of magnitude slower than regular DRAM [83]. Moreover, larger ORAM banks containing more variables incur higher overhead than smaller ORAM banks [35, 80]; as mentioned above, ORAM access cost is asymptotically related to the size of the ORAM. Thus we can reduce run-time overhead by allocating code/data in multiple, smaller ORAM banks rather than all of it in a single, large bank.

Arrays. Implicitly we have assumed that all of an array is allocated to the same ORAM bank, but this need not be the case. Indeed, for our example it is safe to simply encrypt the contents of data[i], because knowing which memory address we are accessing does not happen to reveal anything about the contents of data[], as we have explained before. If we allocate each array element in a separate ORAM bank, the running time of the program becomes roughly 2n accesses: each access to data[i] is in a bank of size 1. Each iteration will read data twice, and thus there are 2n accesses in total for n iterations.
In comparison, the naïve strategy of allocating all variables in a single ORAM bank would incur 2n · polylog(n + 2) memory accesses (for secret variables), since each access to an ORAM bank of size m requires O(polylog(m)) actual server accesses. This shows that we can achieve asymptotic gains in performance for some programs.

2.4 Memory trace obliviousness by typing

This section formalizes a type system for verifying that programs enjoy memory trace obliviousness. In the next section we describe a compiler to transform programs like the one in Section 2.2 so they can be verified by our type system.

We formalize our type system using a simple language Lbasic presented in Figure 2.1. A program is a statement S, which can be a sequence S;S, the no-op skip, an assignment to a variable or array, a conditional, or a loop. Expressions e consist of constant natural numbers, variable and array reads, and (compound) operations.

Variables    x, y, z ∈ Vars
Numbers      n ∈ Nat
ORAM banks   o ∈ ORAMbanks
Expressions  e ::= x | e op e | x[e] | n
Statements   S ::= skip | x := e | x[e] := e | if(e, S, S) | while(e, S) | S;S

Figure 2.1: Language syntax of Lbasic

Arrays   m ∈ Arrays = Nat ⇀ Nat
Labels   l ∈ SecLabels = {L} ∪ ORAMbanks
Memory   M ∈ Vars ⇀ (Arrays ∪ Nat) × SecLabels
Traces   t ::= read(x, n) | readarr(x, n, n) | write(x, n) | writearr(x, n, n) | o | t@t | ε

get(m, n) = m(n) if 0 ≤ n < |m|, and 0 otherwise
upd(m, n1, n2) = m[n1 ↦ n2] if 0 ≤ n1 < |m|, and m otherwise
evt(l, t) = l if l ∈ ORAMbanks, and t otherwise

Figure 2.2: Auxiliary syntax and functions for semantics in Lbasic

For simplicity, arrays may contain only integers (and not other arrays), and bulk assignments between arrays (i.e., x := y when y is an array) are not permitted.

2.4.1 Operational semantics

We define a big-step operational semantics for Lbasic in Figure 2.3, which refers to auxiliary functions and syntax defined in Figure 2.2.
Big-step semantics is simpler than the small-step alternative, and though it cannot be used to reason about non-terminating programs, our cloud computing scenario generally assumes that programs terminate.

Expression evaluation: 〈M, e〉 ⇓t n

(E-Const)  〈M, n〉 ⇓ε n

(E-Var)    M(x) = (n, l)    t = evt(l, read(x, n))
           ⟹ 〈M, x〉 ⇓t n

(E-Op)     〈M, e1〉 ⇓t1 n1    〈M, e2〉 ⇓t2 n2    n = n1 op n2
           ⟹ 〈M, e1 op e2〉 ⇓t1@t2 n

(E-Arr)    〈M, e〉 ⇓t n    M(x) = (m, l)    n′ = get(m, n)    t1 = evt(l, readarr(x, n, n′))
           ⟹ 〈M, x[e]〉 ⇓t@t1 n′

Statement evaluation: 〈M, S〉 ⇓t M′

(S-Skip)   〈M, skip〉 ⇓ε M

(S-Asn)    〈M, e〉 ⇓t n    M(x) = (n′, l)    t′ = evt(l, write(x, n))
           ⟹ 〈M, x := e〉 ⇓t@t′ M[x ↦ (n, l)]

(S-AAsn)   〈M, e1〉 ⇓t1 n1    〈M, e2〉 ⇓t2 n2    M(x) = (m, l)    m′ = upd(m, n1, n2)    t = evt(l, writearr(x, n1, n2))
           ⟹ 〈M, x[e1] := e2〉 ⇓t1@t2@t M[x ↦ (m′, l)]

(S-Cond)   〈M, e〉 ⇓t1 n    〈M, Si〉 ⇓t2 M′    (i = ite(n, 1, 2))
           ⟹ 〈M, if(e, S1, S2)〉 ⇓t1@t2 M′

(S-WhileT) 〈M, e〉 ⇓t n    n ≠ 0    〈M, (S; while(e, S))〉 ⇓t′ M′
           ⟹ 〈M, while(e, S)〉 ⇓t@t′ M′

(S-WhileF) 〈M, e〉 ⇓t 0
           ⟹ 〈M, while(e, S)〉 ⇓t M

(P-Stmts)  〈M, S1〉 ⇓t1 M′    〈M′, S2〉 ⇓t2 M″
           ⟹ 〈M, S1; S2〉 ⇓t1@t2 M″

Figure 2.3: Operational semantics of Lbasic

The main judgment of the figure, 〈M, S〉 ⇓t M′ (shown at the bottom), indicates that program S, when run under memory M, will terminate with new memory M′ and in the process produce a memory access trace t. We also define the judgment 〈M, e〉 ⇓t n for evaluating expressions.

We model server-side storage as a memory M, which is a partial function from variables to labeled values, where a value is either an array m or a number n, and a label is either L or an ORAM bank identifier o. Thus we can think of an ORAM bank o (managed by the client-side ORAM library) as containing all data for variables x such that M(x) = (·, o), whereas all data labeled L is stored on the server directly. We model an array m as a partial function from natural numbers to natural numbers. We write |m| to denote the length of the array; that is, if |m| = n then m(i) is defined for 0 ≤ i < n but nothing else.
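To make the semantics concrete, Figure 2.3 can be transcribed almost rule-for-rule into executable form. The sketch below is our own encoding (nested tuples for the AST, Python dicts for memories, Python lists for traces) and is illustrative, not part of the formal development.

```python
def evt(label, event):
    # Figure 2.2: an access to ORAM-labeled data is observed only as the bank name.
    return [label] if label != "L" else [event]

def eval_expr(M, e):
    """Return (value, trace) for expression e under memory M."""
    if isinstance(e, int):                        # E-Const
        return e, []
    if isinstance(e, str):                        # E-Var
        n, l = M[e]
        return n, evt(l, ("read", e, n))
    if e[0] == "op":                              # E-Op: e = ("op", f, e1, e2)
        _, f, e1, e2 = e
        n1, t1 = eval_expr(M, e1)
        n2, t2 = eval_expr(M, e2)
        return f(n1, n2), t1 + t2
    if e[0] == "arr":                             # E-Arr: oblivious, out-of-bounds reads 0
        _, x, e1 = e
        n, t = eval_expr(M, e1)
        m, l = M[x]
        n2 = m[n] if 0 <= n < len(m) else 0
        return n2, t + evt(l, ("readarr", x, n, n2))

def eval_stmt(M, s):
    """Return (memory, trace) for statement s under memory M."""
    if s[0] == "skip":                            # S-Skip
        return M, []
    if s[0] == "asn":                             # S-Asn
        _, x, e = s
        n, t = eval_expr(M, e)
        _, l = M[x]
        return {**M, x: (n, l)}, t + evt(l, ("write", x, n))
    if s[0] == "aasn":                            # S-AAsn: out-of-bounds write is a no-op
        _, x, e1, e2 = s
        n1, t1 = eval_expr(M, e1)
        n2, t2 = eval_expr(M, e2)
        m, l = M[x]
        m2 = [n2 if i == n1 else v for i, v in enumerate(m)]
        return {**M, x: (m2, l)}, t1 + t2 + evt(l, ("writearr", x, n1, n2))
    if s[0] == "if":                              # S-Cond (nonzero guard takes the true branch)
        _, e, s1, s2 = s
        n, t1 = eval_expr(M, e)
        M2, t2 = eval_stmt(M, s1 if n != 0 else s2)
        return M2, t1 + t2
    if s[0] == "while":                           # S-WhileT / S-WhileF, body then re-run loop
        _, e, body = s
        n, t = eval_expr(M, e)
        if n == 0:
            return M, t
        M2, t2 = eval_stmt(M, body)
        M3, t3 = eval_stmt(M2, s)
        return M3, t + t2 + t3
    if s[0] == "seq":                             # P-Stmts
        _, s1, s2 = s
        M2, t1 = eval_stmt(M, s1)
        M3, t2 = eval_stmt(M2, s2)
        return M3, t1 + t2
```

For instance, evaluating x := y with y in ORAM bank o1 yields the trace ["o1", ("write", "x", 5)]: the read of y is visible only as the bank name, while the public write exposes variable and value, mirroring rules (E-Var) and (S-Asn).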
To keep the formalism simple, we assume all of the data in an array is stored in the same place, i.e., all on the server directly or all in the same ORAM bank. We also assume data referred to by non-array variables is likewise stored in a single place. We will show how these assumptions can be relaxed when describing GhostRider (Chapter 3) and ObliVM (Chapter 5).

A memory access trace t is a finite sequence of events arising during program execution that are observable to the server. These events include read events read(x, n), which states that number n was read from variable x, and readarr(x, n1, n2), which states that number n2 was read from x[n1]. The corresponding events for writes to variables and arrays are similar, but refer to the number written rather than read. Event o indicates an access to ORAM: only the storage bank o is discernible, not the precise variable involved or even whether the access is a read or a write. (Each ORAM read/write in the program translates to several actual DRAM accesses, but we model them as a single abstract event.) Finally, t@t represents the concatenation of two traces and ε is the empty trace.

The rules in Figure 2.3 are largely straightforward. Rule (E-Var) defines variable reads by looking up the variable in memory, and then emitting an event consonant with the label on the variable's memory. This is done using the evt function defined in Figure 2.2: if the label is some ORAM bank o then event o will be emitted; otherwise event read(x, n) is emitted, since the access is to the server directly. The semantics treats array accesses as "oblivious" to avoid information leakage due to out-of-bounds indexes. In particular, rule (E-Arr) indexes the array using the auxiliary function get, also defined in Figure 2.2, which returns 0 if the index n is out of bounds.
Rule (S-AAsn) uses the upd function similarly: if the write is out of bounds, then the array is not affected. (The syntax m[n1 ↦ n2] defines a partial function m′ such that m′(n1) = n2 and m′(n) = m(n) when n ≠ n1; we use the same syntax for updating memories M.) We could have defined the semantics to throw an exception, or to result in a stuck execution, but this would add unnecessary complication. Supposing we had such exceptions, our semantics models wrapping array reads and writes with a try-catch block that ignores the exception, which is a common pattern, e.g., in Jif [19, 49], and has also been advocated by Deng and Smith [26].

The rule (S-Cond) for conditionals is the obvious one; we write ite(x, y, z) to denote y when x is nonzero, and z otherwise. Rule (S-WhileT) expands the loop by one unrolling when the guard is true and evaluates that to the final memory, and rule (S-WhileF) does nothing when the guard is false. Finally, rule (P-Stmts) handles sequences of statements.

2.4.2 Memory trace obliviousness

The security property of interest in our setting is called memory trace obliviousness. This property generalizes the standard (termination-sensitive) noninterference property to account for memory traces. Intuitively, a program satisfies memory trace obliviousness if it will always generate the same trace (and the same final memory) for the same adversary-visible memories, no matter the particular values stored in ORAM.

We formalize the property in three steps. First we define what it means for two memories to be low-equivalent, which holds when they agree on memory contents having label L.

Definition 1 (Low equivalence). Two memories M1 and M2 are low-equivalent, denoted M1 ∼L M2, if and only if ∀x, v. M1(x) = (v, L) ⇔ M2(x) = (v, L).

Next, we define the notion of the Γ-validity of a memory M. Here, Γ is the type environment that maps variables to security types τ, which are either Nat l or Array l (both are defined in Figure 2.5).
In essence, Γ indicates a mapping of variables to memory banks, and if memory M employs that mapping then it is valid with respect to Γ.

Definition 2 (Γ-validity). A memory M is valid under an environment Γ, or Γ-valid, if and only if, for all x:

Γ(x) = Nat l ⇔ ∃n ∈ Nat. M(x) = (n, l)
Γ(x) = Array l ⇔ ∃m ∈ Arrays. M(x) = (m, l)

Finally, we define memory trace obliviousness. Intuitively, a program enjoys this property if all runs of the program on low-equivalent, Γ-valid memories will always produce the same trace and low-equivalent final memories.

Definition 3 (Memory trace obliviousness). Given a security environment Γ, a program S satisfies Γ-memory trace obliviousness if for any two Γ-valid memories M1 ∼L M2, if 〈M1, S〉 ⇓t1 M′1 and 〈M2, S〉 ⇓t2 M′2, then t1 ≡ t2 and M′1 ∼L M′2.

In this definition, we write t1 ≡ t2 to denote that t1 and t2 are equivalent. Equivalence is defined formally in Figure 2.4. Intuitively, two traces are equivalent if they are syntactically equivalent or we can apply associativity to transform one into the other. Furthermore, ε plays the role of the identity element.

ε@t ≡ t        t@ε ≡ t        t1 = t2 ⟹ t1 ≡ t2
t1 ≡ t2 ⟹ t2 ≡ t1        (t1 ≡ t2 and t2 ≡ t3) ⟹ t1 ≡ t3
(t1 ≡ t′1 and t2 ≡ t′2) ⟹ t1@t2 ≡ t′1@t′2        (t1@t2)@t3 ≡ t1@(t2@t3)

Figure 2.4: Trace equivalence in Lbasic

2.4.3 Security typing

Figure 2.6 presents a type system that aims to ensure memory trace obliviousness. Auxiliary definitions used in the type rules are given in Figure 2.5.

Types         τ ::= Nat l | Array l
Environments  Γ ∈ Vars ⇀ Types
Trace Pats    T ::= Read x | Write x | Readarr x | Writearr x | Loop(T, T) | o | T@T | T + T | ε

l1 ⊔ l2 = l1 if l1 ≠ L, and l2 otherwise
l1 ⊑ l2 iff l1 = L, or l2 ≠ L and l2 ≠ n
select(T1, T2) = T1 if T1 ∼L T2, and T1 + T2 otherwise

Figure 2.5: Auxiliary syntax and functions for typing in Lbasic

This type system borrows ideas from standard security type systems that aim to enforce (traditional) noninterference.
For the purposes of typing, we define a lattice ordering ⊑ among security labels l as shown in Figure 2.5, which also shows the ⊔ (join) operation. Essentially, these definitions treat all ORAM bank labels o as equivalent for the purposes of typing (you can think of them as the "high" security label). In the definition of ⊑, we also consider the case where l2 could be a program location n, which is treated as equivalent to L (this case comes up when typing labeled statements).

The typing judgment for expressions is written Γ ⊢ e : τ; T, which states that in environment Γ, expression e has type τ and, when evaluated, will produce a trace described by the trace pattern T. The judgments for statements and programs are similar. Trace patterns describe families of run-time traces; we write t ∈ T to say that trace t matches the trace pattern T. Trace pattern elements are quite similar to their trace counterparts: ORAM accesses are the same, as are empty traces and trace concatenation. Trace pattern
Trace pattern 23 Γ ` e : τ ;T T-Var Γ(x) = Nat l T = evt(l,Read(x)) Γ ` x : Nat l;T T-Con Γ ` n : Nat L;  T-Op Γ ` e1 : Nat l1;T1 Γ ` e2 : Nat l2;T2 l = l1 unionsq l2 Γ ` e1 op e2 : Nat l;T1@T2 T-Arr Γ(x) = Array l Γ ` e : Nat l′;T l′ v l T ′ = evt(l,Readarr(x)) Γ ` x[e] : Nat l unionsq l′;T@T ′ Γ, l ` S;T T-Skip Γ, l0 ` skip;  T-Asn Γ ` e : Nat l;T Γ(x) = Nat l′ l0 unionsq l v l′ Γ, l0 ` x := e;T@evt(l′,Write(x)) T-AAsn Γ ` e1 : Nat l1;T1 Γ ` e2 : Nat l2;T2 Γ(x) = Array l l1 unionsq l2 unionsq l0 v l Γ, l0 ` x[e1] := e2;T1@T2@evt(l,Writearr(x)) T-Cond Γ ` e : Nat l;T Γ, l unionsq l0 ` Si;Ti (i = 1, 2) l unionsq l0 6= L⇒ T1 ∼L T2 T ′ = select(T1, T2) Γ, l0 ` if(e, S1, S2);T@T ′ T-While Γ ` e : Nat l;T1 Γ, l0 ` S;T2 l unionsq l0 v L Γ, l0 ` while(e, S); Loop(T1, T2) T-Seq Γ, l0 ` S1;T1 Γ, l0 ` S2;T2 Γ, l0 ` S1;S2;T1@T2 Figure 2.6: Typing for Lbasic 24 events for reads and writes to variables and arrays are more abstract, mentioning the variable being read, and not the particular value (or index, in the case of arrays); we have read(x, n) ∈ Read(x) for all n, for example. There is also the or -pattern T1 + T2 which matches traces t such that either t ∈ T1 or t ∈ T2. Finally, the trace pattern for loops, Loop(T1, T2), denotes the set of patterns T1 and T1@T2@T1 and T1@T2@T1@T2@T1 and so on, and thus matches any trace that matches one of them. Turning to the rules, we can see that each one is structurally similar to the corresponding semantics rule. Each rule likewise uses the evt function (Figure 2.2) to selectively generate an ORAM event o or a basic event, depending on the label of the variable being read/written. Rule (T-Var) thus generates a Read(x) pattern if x’s label is L, or generates the ORAM event l (where l 6= L implies l is some bank o). As expected, constants n are labeled L by (T-Con), and compound expressions are labeled with the join of the labels of the respective sub-expressions by (T-Op). 
Rule (T-Arr) is interesting in that we require l ⊑ l′, where l is the label of the index and l′ is the label of the array, but the label of the resulting expression is the join of the two. As such, we can have a public index into a secret array, but not the other way around. This is permitted because of our oblivious semantics: a public index reveals nothing about the length of the array when the returned result is secret, and no out-of-bounds exception is possible.

The judgment for statements Γ, l0 ⊢ S; T is similar to the judgment for expressions, but there is no final type, and it employs the standard program counter (PC) label l0 to prevent implicit flows. In particular, the (T-Asn) and (T-AAsn) rules both require that the label l of the expression on the right-hand side, joined with the program counter label l0, must be lower than or equal to the label l′ of the variable; with arrays, we must also join with the label l1 of the indexing expression. Rule (T-Cond) checks the statements Si under a program counter label that is at least as high as the label of the guard. As such, coupled with the constraints on assignments, any branch on a high-security expression will not leak information about that expression via an assignment to a low-security variable. In a similar way, rule (T-Lab) requires that the statement location p is lower than or equal to the program counter label, so that a public instruction fetch cannot be the source of an implicit flow.

Rule (T-Cond) also ensures that if the PC label or that of the guard expression is secret, then the actual run-time trace of the true branch (matched by the trace pattern T1) and the false branch (pattern T2) must be equal; if they were not, then the difference would reveal something about the respective guard. We ensure run-time traces will be equal by requiring that the trace patterns T1 and T2 are equivalent, as axiomatized in Figure 2.7.
T ∼L T        ε@T ∼L T        T@ε ∼L T        ε ∼L ε        (T1 ∼L T2 and T2 ∼L T3) ⟹ T1 ∼L T3

(T1 ∼L T′1 and T2 ∼L T′2) ⟹ T1@T2 ∼L T′1@T′2        (T1@T2)@T3 ∼L T1@(T2@T3)

o ∼L o        Read x ∼L Read x

Figure 2.7: Trace pattern equivalence in Lbasic

The first two rows prove that ε is the identity, that ∼L is a transitive relation, and that concatenation is associative. The third row unsurprisingly proves that ORAM events to the same bank (and fetches of the same location/bank) are equivalent. More interestingly, the third row also claims that public reads of the same variable are equivalent, whereas public writes are not. This makes sense: reads in both branches will always return the same run-time value the variable had prior to the conditional, while writes may store differing values. Notice that public reads of the same array are also not equivalent, since indices may leak information. Finally, rule (T-Cond) emits trace pattern T′, which according to the select function (Figure 2.5) will be T1 when the two branch patterns are equivalent. As such, conditionals in a high context will never produce or-pattern traces (which are not equivalent to any other trace pattern).

In rule (T-While), the constraint l ⊔ l0 ⊑ L mandates that loop guards be public (which is why we need not join l0 with l when checking the body S). This constraint ensures that the length of the trace, as related to the number of loop iterations, cannot reveal anything about secret data. Fortunately, this restriction is not problematic for many examples because secret arrays can be safely indexed by public values, and thus looping over arrays reveals no information about them.

Finally, we can prove that well-typed programs enjoy memory trace obliviousness.

Theorem 1. If Γ, l ⊢ S; T, then S satisfies memory trace obliviousness.

The full proof can be found in Appendix A.

2.4.4 Examples

Now we consider a few programs that do and do not type check in our system.
In the examples, public (low-security) variables begin with p, and secret (high-security) variables begin with s; we assume each secret variable is allocated in its own ORAM bank (and ignore statement labels).

There are some interesting differences between our type system and standard information-flow type systems. One is that we prohibit low reads under high guards that could differ in the two branches. For example, the program

if s > 0 then s := p1 else s := p2

is accepted in the standard type system but rejected in ours. This is because in our system we allow the adversary to observe public reads, and thus he can distinguish the two branches, whereas an adversary can only observe public writes in the standard noninterference proof. On the other hand, the program

if s > 0 then s := p+1 else s := p+2

would be accepted, because both branches will exhibit the same trace.

Another difference is that we do not allow high guards in loops, so a program like the following, which is acceptable in the standard type system, is rejected in ours:

s := slen; sum := 0;
while s ≥ 0 do
  sum := sum + sarr[s];
  s := s - 1;
done

The reason we reject this program is that the number of loop iterations, which in general cannot be determined at compile time, could reveal information about the secret at run time. In this example, the adversary will observe O(slen) memory events and thus can infer slen itself. Prior work on mitigating timing channels often makes the same restriction for the same reason [6, 10, 22, 42]. Similarly, we can mitigate the restrictiveness of our type system by padding out the number of iterations to a constant value. For example, we could transform the above program to be instead

p := N; sum := 0;
while p ≥ 0 do
  if p < slen then
    sum := sum + sarr[p];
  else
    sdummy := sdummy + sarr[p];
  p := p - 1;
done

Here, N is some constant and sdummy and sum are allocated in the same ORAM bank. The loop will always iterate N times but will compute the same sum assuming N ≥ slen.
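The accept/reject behavior in the examples above can be reproduced by a small executable sketch of the type system (our own encoding: ORAM banks are strings such as "o1", the public label is "L", trace-pattern events are tuples, and pattern equivalence follows Figure 2.7, where only ORAM events and public reads are self-equivalent).

```python
def join(l1, l2):            # the join operation of Figure 2.5
    return l1 if l1 != "L" else l2

def leq(l1, l2):             # the ordering of Figure 2.5 (location labels elided)
    return l1 == "L" or l2 != "L"

def pevt(l, ev):             # pattern analogue of evt
    return [l] if l != "L" else [ev]

def pat_equiv(T1, T2):       # T1 equivalent to T2, per Figure 2.7
    if len(T1) != len(T2):
        return False
    for a, b in zip(T1, T2):
        # only ORAM events (strings) and public reads are self-equivalent
        if a != b or not (isinstance(a, str) or a[0] == "Read"):
            return False
    return True

def type_expr(G, e):
    """Return (label, trace pattern) for expression e."""
    if isinstance(e, int):                       # T-Con
        return "L", []
    if isinstance(e, str):                       # T-Var
        _, l = G[e]
        return l, pevt(l, ("Read", e))
    if e[0] == "op":                             # T-Op
        l1, T1 = type_expr(G, e[1])
        l2, T2 = type_expr(G, e[2])
        return join(l1, l2), T1 + T2
    if e[0] == "arr":                            # T-Arr: public index of secret array is OK
        _, l = G[e[1]]
        l2, T = type_expr(G, e[2])
        if not leq(l2, l):
            raise TypeError("array index label too high")
        return join(l, l2), T + pevt(l, ("Readarr", e[1]))

def type_stmt(G, l0, s):
    """Return the trace pattern of statement s under PC label l0, or raise."""
    if s[0] == "skip":                           # T-Skip
        return []
    if s[0] == "asn":                            # T-Asn
        l, T = type_expr(G, s[2])
        _, lx = G[s[1]]
        if not leq(join(l0, l), lx):
            raise TypeError("illegal flow into " + s[1])
        return T + pevt(lx, ("Write", s[1]))
    if s[0] == "if":                             # T-Cond
        l, T = type_expr(G, s[1])
        lb = join(l, l0)
        T1 = type_stmt(G, lb, s[2])
        T2 = type_stmt(G, lb, s[3])
        if lb != "L" and not pat_equiv(T1, T2):
            raise TypeError("branches of a high conditional differ in trace")
        return T + (T1 if pat_equiv(T1, T2) else [("Or", tuple(T1), tuple(T2))])
    if s[0] == "while":                          # T-While: loop guards must be public
        l, T1 = type_expr(G, s[1])
        if join(l, l0) != "L":
            raise TypeError("loop guard must be public")
        T2 = type_stmt(G, l0, s[2])
        return [("Loop", tuple(T1), tuple(T2))]
    if s[0] == "seq":                            # T-Seq
        return type_stmt(G, l0, s[1]) + type_stmt(G, l0, s[2])
```

With s in bank o1 and p, p1, p2 public, the sketch accepts if s > 0 then s := p+1 else s := p+2 (both branches yield the pattern Read p followed by an o1 write) and rejects if s > 0 then s := p1 else s := p2, matching the discussion above.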
We also do not allow loops with low guards to appear in a conditional with a high guard. As above, we may be able to transform a program to make it acceptable. For example, for some S, the program

if s > 0 then while (p > 0) do S; done

could be transformed to be

while (p > 0) do if s > 0 then S; done

(assuming s is not written in S). This ensures once again that we do not leak information about the loop guard.

2.5 Compilation

Rather than requiring programmers to write memory-trace oblivious programs directly, we would prefer that programmers could write arbitrary programs and rely on a compiler to transform those programs to be memory trace oblivious. While more fully realizing this goal remains future work, we have developed a compiler algorithm that automates some of the necessary tasks. In particular, given a program P in which the inputs and outputs are labeled as secret or public, our compiler will (a) infer the least labels (secret or public) for the remaining, unannotated variables; (b) allocate all secret variables to distinct ORAM banks; (c) insert padding instructions in conditionals to ensure their traces are equivalent; and finally, (d) allocate instructions appearing in high conditionals to ORAM banks. These steps are sufficient to transform the max program in Section 2.2 into its memory-trace oblivious counterpart. We can also transform other interesting algorithms, such as k-means, Dijkstra's shortest paths, and matrix multiplication, as we discuss in the next section. We now sketch the different steps of our algorithm.

2.5.1 Type checking source programs

The first step is to perform label inference on the source program to make sure that we can compile it. This is the standard, constraint-based approach to local type inference as implemented in languages like Jif [49] and FlowCaml [74].
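The label-inference step bottoms out in finding a least solution to flow constraints. A minimal sketch over the two-point lattice with L below H (the constraint encoding and solver below are our own; real implementations use the standard techniques cited in the text): each constraint says that the join of the labels on the left must flow into the label or variable on the right, and we raise variables from L to H until a fixpoint is reached.

```python
def solve(constraints, variables):
    """Find the least solution of flow constraints over the lattice L < H.
    Each constraint is (lhs, rhs): lhs is a tuple of labels ("L"/"H") and/or
    variable names; rhs is a label or variable name. Raises TypeError if a
    constraint forces H to flow into the fixed label L."""
    sol = {v: "L" for v in variables}       # start everything public

    def val(x):
        return sol.get(x, x)                # variables map via sol; literals pass through

    changed = True
    while changed:
        changed = False
        for lhs, rhs in constraints:
            lub = "H" if any(val(x) == "H" for x in lhs) else "L"
            if lub == "H" and val(rhs) == "L":
                if rhs not in sol:          # rhs is the literal label L
                    raise TypeError("unsatisfiable: H must flow into L")
                sol[rhs] = "H"              # raise the variable and iterate
                changed = True
    return sol
```

On the findmax constraints discussed in this section (with constraint variables ai for i and am for max), the solver yields ai = L and am = H: i stays public, max becomes secret.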
We introduce fresh constraint variables for the labels of unannotated program variables, and then generate constraints based on the structure of the program text. This is done by applying a variant of the type rules in Figure 2.6, with three differences. First, we treat labels l as being either L, representing public variables; H, representing secret variables (we can think of this as the only available ORAM bank); or α, representing constraint variables. Second, premises like l1 ⊑ l2 and l0 ⊔ l1 ⊑ l2 that appear in the rules are interpreted as generating constraints that are to be solved later. Third, all parts having to do with trace patterns T are ignored. Most importantly, we ignore the requirement that T1 ∼L T2 for conditionals.

Given a set of constraints generated by an application of these rules, we attempt to find the least solution for the variables α that appear in these constraints, using standard techniques [30]. If we can find a solution, the compilation may continue. If we cannot find a solution, then we have no easy way to make the program memory-trace oblivious, and so the program is rejected.

As an example, consider the findmax program in Section 2.2, but assume that variables i and max are not annotated, i.e., they are missing the secret and public qualifiers. When type inference begins, we assign i the constraint variable αi and max the constraint variable αm. In applying the variant type rules (with the PC label l0 set to L) to this program, we will generate the following constraints:

(αi ⊔ L) ⊔ L ⊑ L        line 3
αi ⊑ H                  line 4, for data[i] in guard
l0 = αi ⊔ H ⊔ αm ⊔ L    PC label for checking if branch
αi ⊑ H                  line 5, for data[i] in assignment
l0 ⊔ (H ⊔ αi) ⊑ αm      line 5, assignment
L ⊔ (αi ⊔ L) ⊑ αi       line 3

(For simplicity we have elided the constraints on location labels that arise due to
We can see that the only possible solution to these constraints is for αi to be L and αm to be H, i.e., the former is public and the latter is secret. Assuming that the programmer minimally labels the source program, only indicating those data that must be secret and leaving all other variables unlabeled, then the main restriction on source programs is the restriction on the use of loops: all loop guards must be public, and no loop may appear in a conditional whose guard is high. As mentioned in the previous section, the programmer may transform such programs into equivalent ones, e.g., by using a constant loop bound, or by hoisting loops out of conditionals. We leave the automation of such transformations to future work. 2.5.2 Allocating variables to ORAM banks Given all variables that were identified as secret in the previous stage, we need to allocate them to one or more ORAM banks. At one extreme, we could put all secret variables in a single ORAM bank. The drawback is that each access to a secret variable could cause significant overhead, since ORAM accesses are polylogarithmic in the size of the ORAM [35] (on top of the encryption/decryption cost). At the other extreme, we could put every secret variable in a separate ORAM bank. This lowers overhead by making each access cheaper but will force the next stage to insert more padding instructions, adding more accesses overall. Finally, we could attempt to choose some middle ground between these extreme methods: put some variables 32 in one ORAM bank, and some variables in others. Ultimately, there is no analytic method for resolving this tradeoff, as the “break even” point for choosing padding over increased bank size, or vice versa, depends on the implementation. A profile-guided approach to optimizing might be the best approach. With our limited experience so far we observe that storing each secret variable in a separate ORAM bank generally achieves very good performance. 
This is because when conditional branches have few instructions, the additional padding adds only a small amount of overhead compared to the asymptotic slowdown of increased bank size. Therefore we adopt this method in our experiments. Nevertheless, more work is needed to find the best tradeoff in a practical setting.

We also need to assign secret statements (i.e., those statements whose location label must be H) to ORAM banks. At this stage, we assign all statements under a given conditional to the same ORAM bank, but we make a more fine-grained allocation after the next stage, discussed below.

2.5.3 Inserting padding instructions

The next step is to insert padding instructions into conditionals, to ensure the final premise of (T-Cond) is satisfied, so that both branches will generate the same traces. To do this, we can apply algorithms that solve the shortest common supersequence problem [31] when applied to two traces (a.k.a. the 2-scs problem). That is, given the two trace patterns Tt and Tf for the true and false branches of an if (following ORAM bank assignment), let Ttf denote the 2-scs of Tt and Tf. The differences between Ttf and the original traces signal where, and what, padding instructions must be inserted. The standard algorithm builds on the dynamic programming solution to the greatest common subsequence (gcs) problem, which runs in time O(nm) where n and m are the respective lengths of the two traces [23]. Using this algorithm to find the gcs reveals which characters must be inserted into the original strings, as illustrated in Figure 2.8.

Figure 2.8: Finding a short padding sequence using the greatest common subsequence algorithm. An example with two abstract traces Tt = [T1;T2;T3;T4;T5] and Tf = [T1;T3;T2;T4]. One greatest common subsequence, as shown, is [T1;T2;T4]. A shortest common supersequence of the two traces is Ttf = [T1;T3;T2;T3;T4;T5].
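The 2-scs computation can be sketched as follows (an assumed reconstruction of the standard algorithm, not the thesis compiler's code): fill in the longest-common-subsequence dynamic program, then walk the table backwards, emitting shared events once and all other events of both traces.

```python
# Shortest common supersequence of two trace-pattern strings, via the
# O(nm) longest-common-subsequence dynamic program described above.
def scs(t_true, t_false):
    n, m = len(t_true), len(t_false)
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if t_true[i - 1] == t_false[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    # Backtrack, keeping common events once and all other events.
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if t_true[i - 1] == t_false[j - 1]:
            out.append(t_true[i - 1]); i -= 1; j -= 1
        elif lcs[i - 1][j] >= lcs[i][j - 1]:
            out.append(t_true[i - 1]); i -= 1
        else:
            out.append(t_false[j - 1]); j -= 1
    out.extend(reversed(t_true[:i]))   # at most one of these
    out.extend(reversed(t_false[:j]))  # remainders is non-empty
    return out[::-1]

# The Figure 2.8 example:
Tt = ["T1", "T2", "T3", "T4", "T5"]
Tf = ["T1", "T3", "T2", "T4"]
print(scs(Tt, Tf))  # -> ['T1', 'T3', 'T2', 'T3', 'T4', 'T5']
```

Both branch traces embed as subsequences of the length-6 result, matching the Ttf shown in Figure 2.8.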
When running 2-scs on traces, we view Tt and Tf as strings of characters which are themselves trace patterns due to single statements. Each statement-level pattern will always consist of zero or more of the following events: Read, or oi for ORAM bank i.² For example, suppose we have the program skip; x[y] := z where, after ORAM bank assignment, the type of y is Nat o1, the type of z is Nat o1, and the type of x is Array o2. This program generates trace pattern o1@o1@o2. For the purposes of running 2-scs, this trace consists of three characters, o1, o1, and o2, corresponding to the statement x[y] := z.

Once we have computed the 2-scs and identified the padding characters needed for each trace, we must generate "dummy" statements, to be inserted in the program, that generate the same events. This is straightforward. In essence, we can allocate a "dummy" variable do for each ORAM bank o in the program, and then read, write, and compute on that variable as needed to generate the correct event. Suppose we had the program if(e, skip, x[y] := z), and thus Tt = ε and Tf = o1@o1@o2. Computing the 2-scs, we find that Tt can be pre-padded with o1@o1@o2 while Tf need not be padded. We can readily generate statements that correspond to both. For the second, we do not need to pad anything. For the first, we can produce do2 := do1 + do1. When we must produce an event corresponding to a public read, or a read from an array, we can essentially just insert a read from that variable directly.

Note that this approach will generate more padding instructions than strictly needed. In the above example, the final program will be

  if(e, (do2 := do1 + do1 ; skip), (x[y] := z))

Peephole optimizations can be used to eliminate some superfluous instructions.

²Because of the restrictions imposed by the type system, Tt and Tf will never contain loop patterns, (public) read-array or write patterns, or or-patterns.
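Given the 2-scs, the padding each branch needs can be read off by matching the branch trace greedily inside the supersequence; the unmatched events are exactly those to realize with dummy-variable accesses. The helper below is our own sketch, not taken from the thesis.

```python
# Which events a branch must pad with, given the shortest common
# supersequence of both branch traces.  Since the branch trace is a
# subsequence of the 2-scs, greedy leftmost matching always succeeds.
def padding_events(scs_trace, branch_trace):
    pad, i = [], 0
    for ev in scs_trace:
        if i < len(branch_trace) and branch_trace[i] == ev:
            i += 1          # event already emitted by an original statement
        else:
            pad.append(ev)  # realize with an access to dummy variable d_ev
    assert i == len(branch_trace)  # the branch trace embeds in the 2-scs
    return pad

# The if(e, skip, x[y] := z) example: the empty true branch must emit
# the whole o1@o1@o2 trace (e.g., via d_o2 := d_o1 + d_o1).
print(padding_events(["o1", "o1", "o2"], []))                  # -> ['o1', 'o1', 'o2']
print(padding_events(["o1", "o1", "o2"], ["o1", "o1", "o2"]))  # -> []
```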
However, a better approach is to use a finer-grained alphabet, which in practice is available when using three-address code, i.e., the intermediate representation of an actual compiler. We will demonstrate this in GhostRider.

Once padding has been inserted, both branches have the same number of statements, and thus we can allocate each pair of statements in its own ORAM bank. Assuming we did not drop the skip statements in the program above, we could allocate them both in ORAM bank o3 and allocate the two assignments in ORAM bank o4, rather than allocating all instructions in ORAM bank o as is the case now.

Table 2.1: Programs and parameters used in our simulation.

  No.  Description
  1    Dijkstra (n = 100 nodes)
  2    K-means (n = 100 data points, k = 2, I = 1 iteration)
  3    Matrix multiplication (n × n matrix where n = 40)
  4    Matrix multiplication (n × n matrix where n = 25)
  5    Find max (n = 100 elements in the array)
  6    Find max (n = 10000 elements in the array)

2.6 Evaluation

To demonstrate the efficiency gains achieved by our compiler in comparison with the straightforward approach of placing all secret variables in the same ORAM bank, we choose four example programs: Dijkstra single-source shortest paths, K-means, matrix multiplication (naïve O(n³) implementation), and find-max. We compare three different strategies:

Strawman: Place all secret variables in the same ORAM bank.
Opt 1: Store each variable in a separate ORAM bank, but store whole arrays in the same bank.
Opt 2: Store each variable and each member of an array in a separate ORAM bank.

In all three cases, we insert the necessary padding to ensure obliviousness.

Figure 2.9: Simulation Results for Strawman, Opt 1, and Opt 2.

2.6.1 Simulation Results

We performed simulations to measure the performance of the example programs when compiled by our compiler. Table 2.1 shows the parameters we chose for our experiments.
We built a simulator in C++ that can measure the number of memory accesses of transformed programs. Implementing a full-fledged compiler that integrates with our ORAM-capable hardware, concurrently being built [?], is left as future work.

Simulation results are given in Figure 2.9 for the six setups described in Table 2.1. The ORAM scheme we used in the simulation is due to Shi et al. [80]. The figure shows that Opt 1 is 1.3 to 5 times faster than the strawman scheme, and Opt 2 is 1 to 3 orders of magnitude faster than the strawman, for the chosen programs and parameters.

2.7 Concluding Remarks

In this chapter, we have briefly explained how to achieve memory trace obliviousness (MTO) in a basic client-server setting. We have demonstrated how MTO is defined under the Lbasic syntax and semantics, and how a type system can enforce MTO by keeping track of traces. Throughout the rest of this thesis, we will exploit the same idea in more settings, where hardware (in GhostRider) or algorithmic (in ObliVM) constraints will require reasoning about more leakage. We will also empirically demonstrate the advantages of the MTO approach in these settings.

Chapter 3: GhostRider: A Compiler-Hardware Approach

3.1 Introduction

In Chapter 2, we explained the basic idea of how a type system can enforce memory trace obliviousness of a program. In this chapter, we consider a realistic setting in which we protect programs against physical attackers who control everything except the processor. As explained above, data encryption is necessary but not sufficient. Our approach is to build a processor with an ORAM controller [28, 63] so that it can obfuscate the memory accesses to the ORAM. As explained in Chapter 2, this is not the most efficient approach, since when a program's memory access pattern does not depend on secret data, encryption is sufficient and ORAM is not necessary.
To enable such an optimization, we enhance the existing ORAM processor design to allow splitting memory into regions, consisting of one or more ORAM banks, encrypted memory, and normal DRAM. On top of such a design, building a compiler from a high-level source language into a low-level assembly language is non-trivial, even given our results from Chapter 2. In particular, the cache behavior and the handling of assembly code make designing an MTO type system much more challenging.

In this chapter, we present GhostRider, a hardware-compiler co-design approach that achieves memory trace obliviousness and efficiency at the same time. This chapter is based on a paper that I co-authored with Austin Harris, Michael Hicks, Martin Maas, Mohit Tiwari, and Elaine Shi [57]. I developed the formalism and the proof with the help of Michael Hicks. I also developed the compiler, which implements both the optimization and the type checker, and emits code that is runnable on an FPGA implementation developed by my co-authors Austin Harris, Martin Maas, and Mohit Tiwari. I conducted experiments to show the compiler's effectiveness with the help of Austin Harris, Martin Maas, and Mohit Tiwari.

3.1.1 Our Results and Contributions

In this chapter, we make the first endeavor to bring the theory of MTO to practice. We design and build GhostRider, a hardware/software platform for provably secure, memory-trace oblivious program execution. Compiling to a realistic architecture while formally ensuring MTO raises interesting challenges in the compiler and type system design, and ultimately requires a cooperative re-design of the underlying processor architecture. Our contributions are:

New compiler and type system. We build the first memory-trace oblivious compiler that emits target code for a realistic ORAM-capable processor architecture.
The compiler must explicitly handle low-level resource allocation based on underlying hardware constraints, and while doing so is standard in non-oblivious compilers, doing so while respecting the MTO property is non-trivial. Standard resource allocation mechanisms fail to preserve the MTO property. For example, register allocation spills registers to the stack, thereby introducing memory events. Furthermore, caching serves memory requests from an on-chip cache, which suppresses memory events. If these actions are correlated with secret data, they can leak information. We introduce new techniques for resolving such challenges. In lieu of implicit caches we employ an explicit, on-chip scratchpad. Our compiler implements caching in software when its use does not compromise MTO.

To formally ensure the MTO property, we define a new type system for a RISC-style low-level assembly language. We show that any well-typed program in this assembly language will respect memory-trace obliviousness during execution. When starting from source programs that satisfy a standard information flow type system [26], our compiler generates type-correct, and therefore safe, target code. Specifically, we implement a type checker that can verify the type-correctness of the target code.

Processor architecture for MTO program execution. To enable an automated approach for efficient memory-trace oblivious program execution, we need new hardware features that are not readily available in existing ORAM-capable processor architectures [27, 29, 63]. GhostRider builds on the Phantom processor architecture [63] but exposes new features and knobs to the software. In addition to supporting a scratchpad, as mentioned above, the GhostRider architecture complements Phantom's ORAM support with encrypted RAM (ERAM), which is not oblivious and therefore more efficiently supports variables whose access patterns are not sensitive.
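The cache-channel problem mentioned above can be seen in a toy model, which is our own illustration rather than GhostRider code: with an implicit cache, whether an access emits a memory event depends on what was touched before, so a secret-chosen address changes the observable trace.

```python
# Toy model of an implicit cache: hits are served on-chip and are
# therefore invisible, so the externally observable trace depends on
# which addresses the program touched earlier.
def observable_trace(accesses):
    cached, trace = set(), []
    for addr in accesses:
        if addr not in cached:   # miss: a memory event the adversary sees
            trace.append(addr)
            cached.add(addr)
    return trace                 # hits emit nothing

# A secret bit chooses the second access:
print(observable_trace(["A", "A"]))  # secret = 0 -> ['A']
print(observable_trace(["A", "B"]))  # secret = 1 -> ['A', 'B']
```

The two traces differ, leaking the secret bit. An explicit scratchpad instead lets the compiler decide statically when an access may be served on-chip, keeping the emitted trace independent of secrets.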
Section 3.6 describes additional hardware-level contributions. We prototyped the GhostRider processor on a Convey HC2 platform [21] with programmable FPGA support. The GhostRider processor supports the RISC-V instruction set [93].

Implementation and Empirical Results. Our empirical results are obtained through a combination of software emulation and experiments on an FPGA prototype. Our FPGA prototype supports one ERAM bank, one code ORAM bank, and one data ORAM bank. The real processor experiments demonstrate the feasibility of our architecture, while the software simulator allows us to test a range of configurations not limited by the constraints of the current hardware. In particular, the software simulator models multiple ORAM banks at a higher clock rate. Our experimental results show that compared to the baseline approach of placing everything in a single ORAM bank, our compile-time static analysis achieves up to nearly an order-of-magnitude speedup for many common programs.

3.2 Architecture and Approach

This section motivates our approach and presents an overview of GhostRider's hardware/software co-design.

3.2.1 Motivating example

We wish to support a scenario in which a client asks an untrusted cloud provider to run a computation on the client's private data. For example, suppose the client wants the provider to run the program shown in Figure 3.1, which is a simple histogram program written in a C-like source language. As input, the program takes an integer array a, and as output it modifies integer array parameter c. We assume both arrays have size 100,000.

  void histogram(secret int a[],   // ERAM
                 secret int c[]) { // ORAM (output)
    public int i;
    secret int t, v;
    for(i=0;i<100000;i++) // 100000 <= len(c)
      c[i]=0;
    i=0;
    for(i=0;i<100000;i++) { // 100000 <= len(a)
      v=a[i];
      if(v>0) t=v%1000;
      else t=(0-v)%1000;
      c[t]=c[t]+1;
    }
  }

Figure 3.1: Motivating source program of GhostRider.
The function's code is straightforward, computing the histogram of the absolute values, modulo 1000, of the integers appearing in the input array. The client's security goal is data confidentiality: the cloud provider runs the program on input array a, producing output array c, but nevertheless learns nothing about the contents of either a or c. We express this goal by labeling both arrays with the qualifier secret (data labeled public is non-sensitive).

3.2.2 Threat model

The adversary has physical access to the machine(s) being used to run client computations. As in prior work that advocates the minimization of the hardware trusted computing base (TCB) [85–87], we assume that trust ends at the boundary of the secure processor. Off-chip components are considered insecure, including memory, system buses, and peripherals. For example, we assume the adversary can observe the contents of memory, and can observe communications on the bus between the processor and the memory. By contrast, we assume that on-chip components are secure. Specifically, the adversary cannot observe the contents of the cache, the register file, or any on-chip communications. Finally, we assume the adversary can make fine-grained timing measurements, and therefore can learn, for example, the gap between observed events. Analogous side channels such as power consumption are outside the scope of this chapter.

Figure 3.2: GhostRider architecture.

3.2.3 Architectural Overview

As mentioned in the introduction, one way to defend against such an adversary is to place all data in a single (large) ORAM; e.g., for the program in Figure 3.1 we place the arrays a and c in ORAM.
Unfortunately this baseline approach is not only expensive, but also leaks information through the total number of ORAM accesses (if the access trace is not padded to a value that is independent of secret data). We now provide an architectural overview of GhostRider (Figure 3.2) and contrast it with this baseline.

Joint ORAM-ERAM memory system. In the GhostRider architecture, main memory is split into three types—normal (unencrypted) memory (RAM), encrypted memory (ERAM), and oblivious RAM (ORAM)—with one or more (logical) banks of each type comprising the system's physical memory. The differentiation of memory into banks allows a compiler to place only arrays with sensitive access patterns inside the more expensive ORAM banks, while keeping the remaining data in the significantly faster RAM or ERAM banks. For example, notice that in the program in Figure 3.1 the array a is always accessed sequentially, while access patterns to the array c can depend on secret array contents. Therefore, our GhostRider compiler can place the array a inside an ERAM bank, and place the array c inside an ORAM bank. The program accesses different memory banks at the level of blocks, using instructions that specify the bank and a block-offset within the bank (after moving data to on-chip memory as described below). Our hardware prototype fixes block sizes to be 4KB for both ERAM and ORAM banks (this is not an inherent limitation of the hardware design).

Software-directed scratchpad. As mentioned earlier, cache hit and miss behavior can lead to differences in the observable memory traces. To prevent such cache-channel information leakage, the GhostRider architecture turns off implicit caching, and instead offers software-directed scratchpads for both instructions and data. These scratchpads are mapped into the program's address space so that the compiler can generate code to access them explicitly, and thereby avoid information leaks.
For example, the indices of array a in Figure 3.1 are deterministic; they do not depend on any secret input. As such, it is safe to use the scratchpad to cache array a's accesses. The compiler generates code to check whether the relevant block is in the scratchpad, and if not, loads the block from memory. On the other hand, all accesses to array c depend on the secret input a, so a memory request will always be issued, independent of whether the requested block is in the scratchpad or not.

Deterministic Processor Pipeline. To avoid timing-channel leakage, our pipelined processor ensures that instruction timings are deterministic. We do not use dynamic branch prediction; we fix variable-duration instructions, such as division, to take their worst-case execution time; and we disable concurrent execution of other instructions.

Initialization. We design the oblivious processor and memory banks as a co-processor that runs the application natively (i.e., without an OS) and is connected to a networked host computer that can be accessed remotely by a user. We assume that the secure co-processor has non-volatile memory for storing a long-term public key (certified using PKI), such that the client can securely ship its encrypted code and data to the remote host, and initialize execution on the secure co-processor. Implementing the secure attestation is standard [4], and we leave it to future work.

3.3 Formalizing the target language

This section presents a small formalization of GhostRider's instruction set, which we call LGhostRider. The next section presents a type system for this language that guarantees security, and the following section describes our compiler from a

  m, n ∈ Z    o1, ..., on ∈ ORAMbanks    k ∈ Block IDs    r ∈ Registers
  l ∈ Labels = {D, E} ∪ ORAMbanks

  ι ::= ldb k ← l[r]        load block to scratchpad
      | stb k               store block to memory
      | r ← idb k           retrieve the block ID
      | ldw r1 ← k[r2]      load a scratchpad val. to reg.
      | stw r1 → k[r2]      store a reg. val.
                            to scratchpad
      | r1 ← r2 aop r3      compute an operation
      | r1 ← n              assign a constant to a register
      | jmp n               (relative) jump
      | br r1 rop r2 ↪→ n   compare and branch
      | nop                 empty operation

  I ::= ι | I; ι            instruction sequence

Figure 3.3: Syntax for the LGhostRider language, comprising (1) ldb and stb instructions that move data blocks between the scratchpad and a specific ERAM or ORAM bank, and (2) scratchpad-to-register moves and standard RISC instructions.

C-like source language to well-typed LGhostRider programs.

3.3.1 Instruction set

The core instructions of LGhostRider are in the style of RISC-V [93], our prototype's instruction set, and are formalized in Figure 3.3. We define labels l that distinguish the three kinds of main memory: D for normal (D)RAM, E for ERAM, and oi for ORAM. For the last, the i identifies a particular ORAM bank. We can view each label as defining a distinct address space.

The instruction ldb k ← l[r] loads a block from memory into the scratchpad.¹ Here, l is the address space, r is a register containing the address of the block to load from within that address space, and k is the scratchpad block identifier. Our formalism refers to scratchpad blocks by their identifier, treating them similarly to registers. Our architecture remembers the address space, and the block address within that address space, that the scratchpad block was loaded from, so that writebacks, via the stb k instruction, will go to the original location.

¹In our hardware prototype the scratchpad is mapped into addressable memory, so this instruction and its counterpart, stb, are implemented as data transfers. In addition, the compiler implements idb. We model them in LGhostRider explicitly for simplicity; see Section 3.6 for implementation details.
We enforce this one-to-one mapping to avoid information leaks via write-back from the scratchpad (e.g., where a scratchpad block is written to, in memory, could reveal information about a secret, or the effect of a write could do so, if blocks are aliased).

To access values from the scratchpad, we have scratchpad-load and scratchpad-store instructions. The scratchpad-load instruction loads a word from a scratchpad block, having the form ldw r1 ← k[r2]. Assuming register r2 contains n, this instruction loads the n-th word in block k into register r1 (notice that we use word-oriented addressing, not byte-oriented). The scratchpad-store instruction is similar, but goes in the reverse direction. The instruction r ← idb k retrieves the block offset of a scratchpad block k.

We have two kinds of assignment instructions, one in the form of r1 ← r2 aop r3, and the other in the form of r ← n. In LGhostRider we only model integer arithmetic operations, such as addition, subtraction, multiplication, division, and modulus.

Jumps and branches use relative addressing. The jump instruction jmp n bumps the program counter by n instructions (where n can be negative). Branches, having the form br r1 rop r2 ↪→ n, compare the contents of r1 and r2 using rop, and bump the pc by n if the comparison result is true. An instruction sequence I is defined to be a sequence of instructions concatenated using a logical operation ;.

  ; v=a[i]
   1  t1 ← ri div sizeblk
   2  t2 ← ri mod sizeblk
   3  ldb k1 ← E[t1]
   4  ldw rv ← k1[t2]
  ; if(v>0) t= ...
   5  br rv ≤ 0 ↪→ 3
   6  rt ← rv % 1000
   7  jmp 3
  ; else t= ...
   8  t1 ← 0 − rv
   9  rt ← t1 % 1000
  ; c[t]=c[t]+1
  10  t1 ← rt >> 9
  11  t2 ← rt & 511
  12  ldb k2 ← O[t1]
  13  ldw t3 ← k2[t2]
  14  t4 ← t3 + 1
  15  stw t4 → k2[t2]
  16  stb k2

Figure 3.4: LGhostRider code implementing (part of) Figure 3.1.

We overload ; to operate over two instruction sequences such that I; (I′; ι) ≜ (I; I′); ι and I1; I2; I3 ≜ (I1; I2); I3.
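Lines 1–2 and 10–11 of Figure 3.4 compute a block address and a word offset in two different ways. Assuming sizeblk = 512 words per block (an assumption consistent with the shift by 9 and the mask by 511), the shift/mask forms agree with the div/mod forms, which the quick check below confirms:

```python
# Quick check of the address arithmetic in Figure 3.4, assuming
# sizeblk = 512 words per block (consistent with lines 10-11).
SIZE_BLK = 512

for t in [0, 1, 511, 512, 999, 99999]:
    assert t >> 9 == t // SIZE_BLK   # block address within the bank
    assert t & 511 == t % SIZE_BLK   # word offset within the block
print("shift/mask match div/mod for sizeblk = 512")
```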
Note that our formalism does not model the instruction scratchpad; essentially it assumes that all code is loaded on-chip prior to the start of its execution. Section 3.5 discusses how the instruction scratchpad is used in practice.

3.3.2 Example

Figure 3.4 shows LGhostRider code that corresponds to the body of the second for loop in the source program from Figure 3.1. We write rX for a register corresponding to variable X in the source program (for simplicity) and write ti for i ∈ {1, 2, ...} for temporary registers. In the explanation we refer to the names of variables in the source program when describing what the target program is computing.

The first four lines load the i-th element of array a into v. Line 1 computes the address of the block in memory that contains the i-th element of array a, and line 2 computes the offset of the element within that block. Here sizeblk is the size of each block, which is an architecture constant. Line 3 then loads the block from ERAM, and line 4 loads the appropriate value from the loaded block into v.

The next five lines implement the conditional. Line 5 jumps three instructions forward if v is not greater than 0; otherwise it falls through to line 6, which computes t. Line 7 then jumps past the else branch, which begins on line 8, which negates v to make it positive before computing t.

The final seven lines increment c[t]. Lines 10–13 are analogous to lines 1–4; they compute the address of the t-th element of array c and load it into temporary t3. Notice that this time the block is loaded from ORAM, not ERAM. Line 14 increments the temporary; line 15 stores it back to the block in the scratchpad; and line 16 stores the entire block back to ORAM.

3.4 Security by typing

This section presents a type system for LGhostRider that guarantees programs obey the strong memory trace obliviousness (MTO) security property.
3.4.1 Memory Trace Obliviousness

Memory trace obliviousness is a noninterference property that also considers the address trace, rather than just the initial and final contents of memory [79]. MTO's definition relies on the notion of low equivalence, which relates memories whose RAM contents are identical. We formally define this notion below, using the following formal notation:

  M ∈ Addresses → Blocks
  a ∈ Addresses = Labels × Nat
  b ∈ Blocks = Nat → Z

We model a memory M as a map from addresses to blocks, where an address is a pair consisting of a label l (corresponding to an ORAM, ERAM, or RAM bank, as per Figure 3.3) and an address n in that bank. A block is modeled as a map from an address n to an (integer) value. Here is the definition of memory low-equivalence:

Definition 4 (Memory low equivalence). Two memories M1, M2 are low equivalent, written M1 ∼L M2, if and only if for all n such that 0 ≤ n < size(D) we have M1(D, n) = M2(D, n).

The definition states that memories M1 and M2 are low equivalent when the RAM bank's values in the two memories are the same; all of the other values may differ.

Intuitively, memory trace obliviousness says two things about two low-equivalent memories. First, if the program terminates under one memory, then it terminates under the other. Second, if the program terminates and produces a trace t under one memory, then it does so under the other memory as well, while also finishing with low-equivalent memories. To state this intuition precisely, we need a formal definition of an LGhostRider execution, which we give as an operational semantics. The semantics is largely standard, and can be found in the Appendix.
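Definition 4 and the intuition above can be made concrete with a toy sketch, using our own encoding of memories and traces rather than the formal semantics: memories are maps from (bank, address) pairs to values, and only the D bank must agree.

```python
# Toy encoding of Definition 4: two memories are low equivalent when
# their RAM (D) banks agree; ERAM/ORAM contents may differ freely.
def low_equivalent(M1, M2, size_D):
    return all(M1[("D", n)] == M2[("D", n)] for n in range(size_D))

M1 = {("D", 0): 7, ("E", 0): 1}
M2 = {("D", 0): 7, ("E", 0): 9}   # differs only in encrypted memory
M3 = {("D", 0): 8, ("E", 0): 1}   # differs in RAM: observable
print(low_equivalent(M1, M2, 1))  # -> True
print(low_equivalent(M1, M3, 1))  # -> False
```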
The key judgment has the form I ⊢ (R, S, M, pc) −→t (R′, S′, M′, pc′), which states that program I, with a register file R, a (data) scratchpad S, a memory M, and a program counter pc, executes some number of steps, producing memory trace t and resulting in a possibly modified register file R′, scratchpad S′, memory M′, and program counter pc′.

Definition 5 (Memory trace obliviousness). A program I is memory trace oblivious if and only if for all memories M1 ∼L M2, if I ⊢ (R0, S0, M1, 0) −→t1 (R′1, S′1, M′1, pc1), and I ⊢ (R0, S0, M2, 0) −→t2 (R′2, S′2, M′2, pc2), and |t1| = |t2|, then t1 ≡ t2 and M′1 ∼L M′2.

Here R0 is a mapping that maps every register to 0, and S0 maps every address to an all-zero block. Traces t consist of reads/writes to RAM (both address and value) and ERAM (just the address), accesses to ORAM (just the bank), and instruction fetches. For the last we only model that a fetch happened, not what instruction it is, as we assume code will be stored in a scratchpad on chip. We write t1 ≡ t2 to say that traces t1 and t2 are indistinguishable to the attacker; i.e., they consist of the same events in the same order. Our formalism models every instruction as taking unit time to execute; thus the trace event also models the time taken to execute the instruction. On the real GhostRider architecture, each instruction takes deterministic but non-uniform time; as this difference is conceptually easy to handle (by accounting for instruction execution times in the compiler), we do not model it formally, for simplicity (see Section 3.5).

  Sym. vals.   sv ∈ SymVals = n | ? | sv1 aop sv2 | M_l[k, sv]
  Sym. store   Sym ∈ Registers ∪ Block IDs → SymVals
  Sec. labels  ℓ ∈ SecLabels = L | H
  Label map    Υ ∈ (Registers → SecLabels) ∪ (Block IDs → Labels)

  Equivalence:  sv1 ≡ sv2   if ⊢safe sv1, ⊢safe sv2, and sv1 = sv2

  Auxiliary functions:
    select(l, a, b, c) = a if l = D;  b if l = E;  c otherwise
    slab(l) = select(l, L, H, H)
    ite(x, a, b) = a if x is true;  b otherwise
  ⊢safe sv (safe symbolic values):
    ⊢safe n
    ⊢safe M_l[k, sv]      if l = D and ⊢safe sv
    ⊢safe sv1 aop sv2     if ⊢safe sv1 and ⊢safe sv2

  ⊢const sv (constant symbolic values):
    ⊢const n
    ⊢const ?
    ⊢const sv1 aop sv2    if ⊢const sv1 and ⊢const sv2

  ⊢const Sym              if ⊢const Sym(r) for all r, and ⊢const Sym(k) for all k

Figure 3.5: Symbolic values, labels, auxiliary judgments and functions.

3.4.2 Typing: Preliminaries

Now we give a type system for LGhostRider programs and prove that type-correct programs are MTO.

Symbolic values. To ensure that the execution of a program cannot leak information via its address trace, we statically approximate what events a program can produce. An important determination made by the type system is when a secret variable can be stored in ERAM—because the address trace will leak no information about it—and when it must be accessed in ORAM. As an example, suppose we had the source program

  if(s) then x[i]=1 else x[i]=2;

If x is secret but stored in RAM, then the value in x[i] after running this program will leak the contents of secret variable s. We could store x in ORAM to avoid this problem, but this is unnecessary: both branches will modify the same element of x, so encrypting the content of x is enough to prevent the address trace from leaking information about s. The type system can identify this situation by symbolically tracking the contents of the registers, blocks, etc. To do this, the type rules maintain a symbolic store Sym, which is a map from register and block IDs to symbolic values. Figure 3.5 defines symbolic values sv, which consist of constants n, (symbolic) arithmetic expressions, values loaded from memory M_l[k, sv], and unknowns ?.
Most interesting are memory values, which represent the address of a loaded value: l indicates the memory bank it was loaded from, sv corresponds to the offset (i.e., the block number) within that bank, and k is the scratchpad block into which the memory block is loaded.² The type rules also make use of a label map Υ mapping registers to security labels and block IDs to (memory) labels; the latter tracks the memory bank from which a scratchpad block was loaded.

The figure defines several judgments; the form of each judgment is boxed. The first defines when two symbolic values can be deemed equivalent, written sv1 ≡ sv2: they must be syntactically identical and safe static approximations. The latter is defined by the judgment ⊢safe sv, which accepts constants, memory accesses to RAM involving safe indexes, and arithmetic expressions involving safe values. Judgment ⊢const sv says that symbolic value sv is not a memory value. That is, sv is either a constant, a ?, or a binary expression not involving memory values. Further, for a symbolic store Sym, if all the symbolic values it maps to are accepted by ⊢const, then we have ⊢const Sym. The latter judgment is needed when checking conditionals.

Finally, we give three auxiliary functions used in the type system. Based on whether l is D, E, or an ORAM bank, function select(l, a, b, c) returns a, b, or c, respectively. Function slab(·) maps a normal label l to a security label ℓ, which is either L or H. The label H classifies encrypted memory—any ORAM bank and ERAM—while label L classifies RAM. These two labels form the two-point lattice with L ⊏ H. Note that L is equivalent to the public label used in Figure 3.1, and H is equivalent to secret.

²In actual traces t, the block number k is not visible; we track it symbolically to model the scratchpad's contents, in particular to ensure that the same memory block is not loaded into two different scratchpad blocks.
Finally, function ite(x, a, b) returns a if x is true, and returns b if x is false.

Trace patterns Figure 3.6 defines trace patterns T, which are largely similar to those for Lbasic that approximate traces t:

Trace Pats. T ::= read(l, k, sv) | write(l, k, sv) | F | o | T1@T2 | T1 + T2 | loop(T1, T2)

[Figure 3.6 also defines trace-pattern equivalence T1 ≡ T2 by inference rules: read(l, k, sv1) ≡ read(l, k, sv2) and write(l, k, sv1) ≡ write(l, k, sv2) when sv1 ≡ sv2; F ≡ F; o ≡ o; equivalence is symmetric and congruent for @; and @ is associative: T1@(T2@T3) ≡ (T1@T2)@T3.]

Figure 3.6: Trace patterns and their equivalence in LGhostRider

The first line in the definition of T defines single events. The first two indicate reads and writes to RAM or ERAM; they reference the memory bank, block identifier in the scratchpad, and a symbolic value corresponding to the block address (not the actual value) read or written. Pattern F corresponds to a non-memory-accessing instruction. The next pattern indicates a read or write from ORAM bank o: this bank is the trace event itself because the adversary cannot determine whether an access is a read or a write, or which block within the ORAM is accessed. Trace pattern T1@T2 is the pattern resulting from the concatenation of patterns T1 and T2. Pattern T1 + T2 represents either T1 or T2, and is used to type conditionals. Finally, pattern loop(T1, T2) represents zero or more loop iterations where the guard's trace is T1 and the body's trace is T2. Trace pattern equivalence T1 ≡ T2 is defined in Figure 3.6. In this definition, reads are equivalent to other reads accessing exactly the same location; the same goes for writes. Two ORAM accesses to the same ORAM bank are treated as equivalent. Sum patterns specify possibly different trace patterns, and loop patterns do not specify the number of iterations; as such we cannot determine their equivalence statically. The concatenation operator @ is associative with respect to equivalence.
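The decidable fragment of this equivalence (no sums or loops) can be sketched as follows; flattening the concatenation operator encodes its associativity, and the safety side-condition on symbolic values is elided (hypothetical tuple encoding):

```python
def trace_equiv(T1, T2):
    # Sketch of T1 ≡ T2 for the sum/loop-free fragment: reads and writes
    # must match on bank, scratchpad block, and symbolic address; ORAM
    # events 'o' and on-chip events 'F' match themselves; '@' is handled
    # by flattening, which encodes its associativity.
    def flatten(T):
        if T[0] == 'cat':
            return flatten(T[1]) + flatten(T[2])
        return [T]
    return flatten(T1) == flatten(T2)

r = ('read', 'E', 3, ('const', 5))
assert trace_equiv(('cat', r, ('F',)), ('cat', r, ('F',)))
# Associativity: T1@(T2@T3) ≡ (T1@T2)@T3
assert trace_equiv(('cat', ('cat', r, ('F',)), ('o',)),
                   ('cat', r, ('cat', ('F',), ('o',))))
```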
[Figure 3.7 gives the instruction typing rules T-LOAD, T-STORE, T-LOADW, T-STOREW, T-IDB, T-BOP, T-ASSIGN, T-NOP, and T-SEQ; each has the form ℓ ⊢ ι : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T.]

Figure 3.7: Security Type System for LGhostRider (Part 1)

3.4.3 Type rules

Figures 3.7, 3.8, and 3.9 define the security type system for LGhostRider. The figures are divided into three parts.

Instructions Figure 3.7 presents judgment ℓ ⊢ ι : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T, which is used to type instructions ι. Here, ℓ is the security context, used in the standard way to prevent implicit flows. The rules are flow-sensitive: the judgment says that instruction ι has type 〈Υ, Sym〉 → 〈Υ′, Sym′〉 and generates trace pattern T. Informally, executing ι changes a state of security type 〈Υ, Sym〉 into one of type 〈Υ′, Sym′〉. Rule T-LOAD types load instructions. The first premise ensures that the contents of register r, the indexing register, are not leaked by the operation. In particular, the loaded memory bank l must either be ORAM, or else the register r may only contain public data (from RAM).
In the latter case, there is no issue with leaking r, and in the former case r will not be leaked indirectly by the address of the loaded memory since it is stored in ORAM. The final two premises determine the final trace pattern. When the memory bank l is D or E, then the trace pattern indicates a read event from the appropriate block and address. When reading from an ORAM bank the event is just that bank itself. The other premises in the rule update Υ to map the loaded block k to the label of the memory bank, and update Sym to track the address of the block. We defer discussion of rule T-STORE for the moment, and look at the next three rules, T-LOADW, T-STOREW, and T-IDB, which are used to load and store values related to blocks in the scratchpad. The first two rules resemble standard information flow rules. The second premise of T-LOADW is similar to the first premise of T-LOAD in preventing an indirect leak of index register r2, which would occur if the label of r2 was H but the label of k was L. Likewise, the premise of T-STOREW prevents leaking the contents of r1 and r2 into the stored block, and also prevents an implicit flow from ℓ (the security context). As such, these two rules ensure that a block k with label ℓ never contains information from memory labeled ℓ′ such that ℓ′ ⊐ ℓ. The remaining premises of rule T-LOADW flow-sensitively track the label and symbolic value of the loaded register. In particular, they set the label of r1 to be that of the block loaded, and the symbolic value of r1 to be the address of the loaded value in memory. T-STOREW changes neither Υ nor Sym: even though the content of the scratchpad has changed, its memory label and its address in memory have not. Both rules emit trace pattern F as the operations are purely on-chip. We emit this event to account for the time taken to execute an instruction; assuming uniform times for instructions and memory accesses, MTO executions will also be free of timing channels.
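The label premise of T-STOREW amounts to a comparison in the two-point lattice. A minimal sketch, conflating memory labels with their slab(·) images so that only L and H appear (the function names here are illustrative, not the formalism's):

```python
def join(a, b):
    # ⊔ on the two-point lattice L ⊏ H
    return 'H' if 'H' in (a, b) else 'L'

def leq(a, b):
    # ⊑ on the two-point lattice
    return a == 'L' or b == 'H'

def check_storew(pc, lab_r1, lab_r2, lab_k):
    # T-STOREW premise: ℓ ⊔ Υ(r1) ⊔ Υ(r2) ⊑ slab(Υ(k)), i.e. neither the
    # stored value, the index, nor the security context may exceed the
    # label of the target block.
    return leq(join(pc, join(lab_r1, lab_r2)), lab_k)

assert check_storew('L', 'L', 'L', 'L')       # public store into public block
assert not check_storew('L', 'H', 'L', 'L')   # secret value into public block
assert check_storew('H', 'H', 'H', 'H')       # anything may flow into H
```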
Returning to rule T-STORE, we can see that the store takes place unconditionally—no constraints on the labels of the memory or block must be satisfied. This is because the other type rules ensure that a block k never contains information higher than its security label ℓ, and thus the block can be written straight to memory having the same security label. That said, information could be leaked through the memory trace, so the emitted trace pattern will differ depending on the label of the block: if the label is D or E then the trace pattern will be a write event, and otherwise it will be the appropriate ORAM event. Leaks via the memory trace are then prevented by T-IF and T-LOOP, discussed shortly.

[Figure 3.8 gives the branching rules T-IF and T-LOOP; their premises are discussed in detail below.]

Figure 3.8: Security Type System for LGhostRider (Part 2)

Rule T-IDB is similar to rule T-LOADW. For the third premise, if l is either D or E, the block k has a public address, and thus the value assigned to register r is public; otherwise, when l is an ORAM bank, the register r is secret. Rule T-BOP types binary operations, updating the security label of the target register to be the join of the labels of the source registers. Rule T-ASSIGN gives the target register label L as constants are not secret. Rule T-NOP is always safe and has no effect on the symbolic store or label environment. All of these operations occur on-chip, and so have pattern F.
Finally, rule T-SEQ types instruction sequences by composing the symbolic maps, label environments, and traces in the obvious way.

Branching Figure 3.8 presents rules T-IF and T-LOOP for structured control flow. Rule T-IF deals with instruction sequences of the form I = ι1; It; ι2; If, where ι1 is a branch instruction deciding which branch to take, ι2 is a jump instruction jumping over the false branch, and It and If are the true and false branches respectively; the relative offsets n1 and n2 are based on the lengths of these code sequences. We require both branches to have the same type, i.e. 〈Υ, Sym〉 → 〈Υ′, Sym′〉, as for the sequence I itself. When the security context is high, i.e. ℓ = H, or when the if-condition is private, i.e. Υ(r1) ⊔ Υ(r2) = H, then ℓ′ will be H and we impose three restrictions. First, both of the blocks It and If must have equivalent trace patterns. (The trace of the true branch is T1@F where T1 covers It and F covers the jump instruction ι2.) Second, if the security context is public, i.e. ℓ = L, then we require ⊢const Sym, which enforces that Sym does not map any register to a memory value. The reason is that in a public context, two equivalent symbolic memory values may refer to two different concrete values, since the memory region D can be modified. Third, for any register r, its value after taking either branch must be the same, or the register r must have a high security label (i.e. Υ′(r) = H). So if Υ′(r) = L, the type system enforces that its symbolic values on the two paths are equivalent, i.e. Sym′(r) ≡ Sym′(r), which amounts to requiring ⊢safe Sym′(r). The final premise of rule T-IF states that the sequence's trace pattern T is either F@T1@F when both branches' patterns must be equal, or else is an or-pattern involving the trace T2 from the else branch. Rule T-LOOP imposes structural requirements on I similar to T-IF.
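The three ℓ′ = H restrictions can be collected into one small checker. This is an illustrative sketch over a hypothetical encoding of labels, traces, and symbolic stores, not the actual type checker; the ⊢safe side-condition on agreeing registers is simplified to plain equality:

```python
def join(a, b):
    return 'H' if 'H' in (a, b) else 'L'

def check_secret_if(pc, cond_lab, T1, T2, lab_out, sym_t, sym_f, const_store):
    # Sketch of the ℓ' = H premises of T-IF.
    if join(pc, cond_lab) != 'H':
        return True                        # public guard: no extra checks
    if T1 + [('F',)] != T2:                # (1) traces must match: T1@F ≡ T2
        return False
    if pc == 'L' and not const_store:      # (2) ⊢const Sym when ℓ = L
        return False
    # (3) registers left public must hold equivalent symbolic values
    return all(sym_t[r] == sym_f[r]
               for r, lab in lab_out.items() if lab == 'L')

o = [('o',)]   # one ORAM access in each branch
assert check_secret_if('L', 'H', o, o + [('F',)],
                       {'r1': 'H'}, {'r1': 1}, {'r1': 2}, True)
assert not check_secret_if('L', 'H', o, o,   # false branch lacks the F event
                           {'r1': 'H'}, {'r1': 1}, {'r1': 2}, True)
```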
The premise ℓ ⊔ Υ(r1) ⊔ Υ(r2) ⊑ L implies two restrictions. On the one hand, ℓ ⊑ L prevents any loop from appearing in a secret if-statement, because otherwise the number of loop iterations may leak information about which branch is taken. On the other hand, Υ(r1) ⊔ Υ(r2) ⊑ L implies that the loop condition must be public, as otherwise, similarly, the number of iterations would leak secret information about r1 and/or r2.

Subtyping Finally, Figure 3.9 presents subtyping rules.

[Figure 3.9 gives rule T-SUB together with the orderings S-LABEL on label maps and S-SYM on symbolic stores.]

Figure 3.9: Security Type System for LGhostRider (Part 3)

Rule T-SUB supports subtyping on the symbolic store and the label map. For the first, a symbolic store Sym can be approximated by a store Sym′ that either agrees on the symbolic values mapped by Sym or maps them to ?. For the second, a register's security label can be approximated by one higher in the lattice; block labels may not change. Subtyping is important for typing join points after branches or loops. For example, if a conditional assigned a register r the value 1 in the true branch but assigned r the value 2 in the false branch, we would use subtyping to map r to ? to ensure that the symbolic store at the end of both branches agrees, as required by T-IF.

3.4.4 Security theorem

All well-typed programs are memory-trace oblivious:

Theorem 2. Given I, Υ, and Sym, if there exist some Υ′, Sym′, and T such that L ⊢ I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T, where ∀r. Sym(r) = ? and Υ(r) = L and ∀k. Sym(k) = ? and Υ(k) = D, then program I is memory-trace oblivious.

The proof can be found in Appendix B.

3.5 Compilation

We have developed a compiler from an imperative, C-like source language, which we call LS, to LGhostRider.
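The use of S-SYM at join points can be sketched as a widening step that maps disagreeing entries to ? so that a single store subsumes both branches (a hypothetical encoding, for intuition only):

```python
def widen(sym1, sym2):
    # Where the two branches' symbolic stores disagree, weaken the entry
    # to '?' (the S-SYM ordering); where they agree, keep the shared value.
    return {k: (sym1[k] if sym1.get(k) == sym2.get(k) else ('?',))
            for k in set(sym1) | set(sym2)}

# The example from the text: r1 is assigned 1 in the true branch and 2 in
# the false branch, while r2 gets the same value on both paths.
true_store  = {'r1': ('const', 1), 'r2': ('const', 7)}
false_store = {'r1': ('const', 2), 'r2': ('const', 7)}
assert widen(true_store, false_store) == {'r1': ('?',), 'r2': ('const', 7)}
```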
Our compiler is implemented in about 7600 lines of Java, with roughly 400 LoC dedicated to the parser, 700 LoC to the type checker, 3500 LoC to the compiler/optimizer, 950 LoC to the code generator, and the remainder to utility functions. This section informally describes our compilation approach.

3.5.1 Source Language

Syntax An LS program is a collection of (possibly mutually recursive) functions and a collection of (possibly mutually recursive) type definitions. A type definition is simply a mapping of a type name to a type, where types are either natural numbers, arrays, or pointers to records (i.e., C-style structs). Each type is annotated with a security label, which is either secret or public, indicating whether the data may be visible to (or inferrable by) the adversary. A function consists of a sequence of statements s which are either no-ops, variable assignments, array assignments, conditionals, while loops, or returns. As usual, conditional branches and loop bodies may consist of one or more statements. Expressions e appearing in statements (e.g., in assignments) consist of variables x, arithmetic ops e1 aop e2, array reads e[e], and numeric constants n. Variables may hold any data other than functions (i.e., there are no function pointers). Guards in conditionals and while loops consist of predicates involving relational operators.

Typing LS programs are type checked before they are compiled. We do this using an information flow-style type system (cf. the survey of Sabelfeld and Myers [79]). As is standard, the type system prevents explicit flows and implicit flows. In particular, it disallows assignments like p = s where p is a public variable and s is a secret variable, and disallows conditionals like if (s == 0) then p = 0 else p = 1, which leaks information about s since after the conditional the adversary knows p == 0 implies s == 0.
It also disallows array writes like p[s] = 5 since the adversary can learn the value of s by seeing which element of the public array has changed. Note that accessing s[p] is safe because, despite knowing the index, an adversary cannot learn the value being accessed. To prevent the length of a memory trace from revealing information, we require that loop guard expressions only involve public values (which is a standard restriction [79]). One can work around this restriction by "padding out" loop iterations, e.g., by converting a loop like

while (slen > 0) { sarr[slen--]++; }

to be

plen = N; while (plen > 0) { if (plen <= slen) sarr[plen]++; plen--; }

where N is a large, fixed constant; note that plen is decremented unconditionally, so the loop always performs exactly N iterations regardless of the secret slen. For similar reasons we also require that whether a function is called or returned from, and which function it is, may not depend on secret information (e.g., the call or return may not occur in a conditional whose guard involves secret information).

Compilation overview After source-language type checking, compilation proceeds in four stages—memory layout, translation, padding, and register allocation—after which the result is type checked using the LGhostRider type system, to confirm that it is memory-trace oblivious.

3.5.2 Memory bank allocation

The first stage of compilation allocates global variables to memory banks. Public variables are always stored in RAM, while secret variables will be allocated either to ERAM or ORAM. Two blocks in the scratchpad are reserved for secret and public variables, respectively, that will fit entirely within the block; these are essentially those that contain numbers, (pointers to) records, and small arrays. Such variables will be loaded into the scratchpad at the start of executing a program, and written back to memory at the end. The remaining scratchpad blocks are used for handling (large) arrays; the compiler will always use the same block for the same array.
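The padding idea can be checked concretely: as long as the decrement happens unconditionally, the loop runs a fixed N iterations and produces the same effect on sarr as the original secret-guarded loop (a minimal Python simulation; variable names follow the example, and this assumes the decrement is placed outside the secret conditional):

```python
def padded_loop(sarr, slen, N):
    # Padded version: always runs exactly N iterations. The secret-guarded
    # body only takes effect when plen <= slen; plen is decremented
    # unconditionally so the iteration count is public.
    iters = 0
    plen = N
    while plen > 0:
        if plen <= slen:      # secret conditional: no effect on trace length
            sarr[plen] += 1
        plen -= 1
        iters += 1
    return iters

a = [0] * 9
b = [0] * 9
assert padded_loop(a, 3, 8) == padded_loop(b, 7, 8) == 8  # length is always N
assert a[1:4] == [1, 1, 1]   # same effect as  while (slen > 0) sarr[slen--]++
```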
Public arrays are allocated in RAM; secret arrays that are always indexed by public values are allocated in ERAM, and the remaining secret arrays in ORAM. The compiler initially assigns a distinct logical ORAM bank for each secret array, and allocates logical banks up to the hardware limit. (Checking the compiled program against the LGhostRider type system is essentially a kind of translation validation [73], which removes the compiler from the trusted computing base. We believe that well-typed LS programs yield well-typed LGhostRider programs, but leave a proof as future work.)

3.5.3 Basic compilation

The next stage is basic compilation (translation). Expressions are compiled by loading relevant variables/data into registers, performing the computation, and then storing back the result. Statements are compiled idiomatically to match the structure expected by the type rules in Figure 3.7, Figure 3.8, and Figure 3.9 (with some work deferred to the padding stage). Perhaps the most interesting part is handling variable accesses. Variables permanently resident in the scratchpad are loaded at the start of the program, and stored back at the end. Each read/write results in a ldw, to load a variable into a temporary register, and a stw to store back the result. Accesses to data (i.e., arrays) not permanently stored in the scratchpad will also require a ldb to load the relevant block into the scratchpad first and likewise a stb to store it back. A standard hardware cache, rather than a scratchpad, could eliminate repeated loads and stores of blocks from memory but could violate MTO. This is because a non-present block will induce memory traffic while a present block will not, and the presence/absence of traffic could be correlated with secret information. To avoid this, we have the compiler emit instructions that perform caching explicitly, using the scratchpad, with caching only enabled when in a public context, i.e., in a portion of code whose control flow does not depend on secret data.
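The bank-allocation policy can be summarized in a few lines (a sketch over a hypothetical encoding of variable properties; the real compiler works over typed LS declarations):

```python
def allocate(is_secret, is_array, has_secret_index):
    # Sketch of the memory-bank allocation policy:
    #   public data        -> RAM
    #   secret, but only ever indexed publicly -> ERAM (encryption suffices)
    #   secret with secret-dependent indices   -> ORAM (hide the addresses)
    if not is_secret:
        return 'RAM'
    if is_array and has_secret_index:
        return 'ORAM'
    return 'ERAM'

assert allocate(False, True, False) == 'RAM'
assert allocate(True, True, False) == 'ERAM'
assert allocate(True, True, True) == 'ORAM'
```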
To support software-based caching, the compiler statically maps memory-resident data to particular scratchpad blocks, always loading the same data to the same block. Prior to loading a block, and when safe, the compiler uses the idb instruction to check whether the relevant scratchpad block already contains the memory block we want, and loads directly from it if so. Supporting functions requires handling calling contexts and local variables. We do this with two stacks, one in RAM and one in ERAM. Function calls are only permitted in a public context, which means that normal stack allocation and deallocation reveal no information, so no ORAM stack is needed. When a function is called, the current scratchpad variable blocks are pushed on the relevant stacks. At the start of a function, we load the blocks that hold the local variables. Local variables implementing ORAM arrays are stored by reference, with the variable pointing to the actual array stored in ORAM. This array is deallocated when its variable is popped from the stack, when the function returns (which, like calls, is allowed only in a public context). The compiler is also responsible for emitting instructions that load code into the instruction scratchpad, as implicit instruction fetches could reveal information. (To bootstrap, the first code block is loaded automatically.) At the moment, our compiler emits code that loads the entire program into the scratchpad at the start; we leave to future work support for on-the-fly instruction scratchpad use.

3.5.4 Padding and register allocation

Both branches of a secret conditional must produce the same trace. We ensure they do so by inserting extra instructions in one or both branches according to the solution to the shortest common supersequence problem [31]. When matching the two branches, we must account for the memory trace and instruction execution times. Only ldb and stb emit memory events; we discuss these shortly.
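The length of the shortest common supersequence can be computed with the standard LCS-based dynamic program (a generic sketch of the textbook algorithm, not the compiler's actual Java implementation):

```python
def scs_len(a, b):
    # Length of the shortest common supersequence of instruction
    # sequences a and b: |SCS| = |a| + |b| - |LCS|, with the LCS length
    # computed by the usual dynamic program.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return m + n - dp[m][n]

# Both branches are padded up to the supersequence, so after padding they
# execute identical-looking instruction sequences:
assert scs_len(['ldb', 'ldw', 'stw'], ['ldw', 'stb']) == 4
```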
While our formalism assumes each instruction takes unit time, the reality is different (cf. Table 3.2): times are deterministic, but non-uniform. For single-cycle operations (e.g., 64b ALU ops), we pad with nops. For two-cycle ldw and stw instructions, we pad with two nops. For multiply and divide instructions, which take 70 cycles each, we could pad with 70 nops, but this results in a large space overhead. As such, we pad both with the instruction r0 ← r0 ∗ r0, where r0 is always 0. For conditionals, we pad the not-taken branch with two nops, to account for the hardware-induced delay on the taken branch. Padding for stb and ldb requires instructions that generate matching trace events. An access to ORAM is the simplest to pad, since the adversary cannot distinguish a read from a write. We can load any block (e.g., the first block of the ORAM) into a dedicated "dummy" scratchpad block, i.e., a block used for loading and saving dummy memory blocks only. For RAM and ERAM, the address being accessed is visible, so we need to make sure that the equivalent padding accesses the same address. To do this, the compiler inserts further instructions to compute the address. These instructions can be derived from the symbolic value: (1) if the symbolic value is a constant, then insert an assign instruction; (2) if the symbolic value is a binary operation over two symbolic values, then insert instructions to compute the two symbolic values respectively, and then another instruction to compute the binary operation; and (3) if the symbolic value is a memory value, then insert instructions to compute the offset first, and then insert a ldw instruction. With instructions inserted to compute the address, we must emit either a load or a store depending on the instruction we are trying to match. For RAM, this instruction will always be a load because we perform padding in the H context, and the type system prevents writing to RAM.
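This recursive address-recomputation strategy can be sketched directly from the three cases (the tuple encoding of symbolic values and the emitted instruction names are illustrative, not the compiler's actual IR):

```python
def emit_address_code(sv, emit):
    # Recursively emit instructions that recompute the address described
    # by symbolic value sv, following cases (1)-(3) above.
    tag = sv[0]
    if tag == 'const':
        emit(('assign', sv[1]))           # (1) constant: a single assign
    elif tag == 'aop':
        emit_address_code(sv[2], emit)    # (2) binary op: compute both
        emit_address_code(sv[3], emit)    #     operands, then combine
        emit(('aop', sv[1]))
    elif tag == 'mem':
        emit_address_code(sv[3], emit)    # (3) memory value: compute the
        emit(('ldw', sv[2]))              #     offset, then load the word

out = []
emit_address_code(('aop', '+', ('const', 8),
                   ('mem', 'E', 2, ('const', 0))), out.append)
assert out == [('assign', 8), ('assign', 0), ('ldw', 2), ('aop', '+')]
```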
To mimic the read(l, k, sv) trace pattern, we first compute sv and then insert a ldb k ← l[r] instruction where r stores the value of sv. Handling ERAM writes is challenging because we want the write to be a no-op but not appear to be so. To do this, we require the compiler to always follow an ERAM ldb with a stb back to the same address. In doing so, the compiler also prevents the padded instruction from overwriting a dirty scratchpad block. At the conclusion of the padding stage we perform standard register allocation to fill in actual registers for the temporaries we have used to this point.

3.6 Hardware Implementation

We implement our deterministic processor by modifying Rocket, a single-issue, in-order, 6-stage pipelined CPU developed at UC Berkeley [77]. Rocket implements the RISC-V instruction set [93] and is comparable to an ARM Cortex A5 CPU. We modified the baseline processor to remove branch prediction logic (so that conditional branches are always predicted not-taken) and to make each instruction execute in a fixed number of cycles. We describe the remaining changes below.

Instruction-set Extension. We customize RISC-V to add a single data transfer instruction that implements ldb and stb from the formalism. We do this using a Data Transfer accelerator (Figure 3.2) that attaches to the processor's accelerator interface [88]. We also interface the Data Transfer accelerator with the x86-Linux host through Rocket's control register file so that the host can load an elf-formatted binary into GhostRider's memory and reset its processor. Once this is done, the host performs processor control register writes to initiate transfers from the co-processor memory to the code ORAM for the code and data sections of the binary. The first code block of a program is loaded into the instruction scratchpad to begin execution; if subsequent instruction blocks are needed they must be loaded explicitly.

Scratchpads.
GhostRider has two scratchpads, one for code and one for data, each of which can hold eight 4KB blocks. The instruction scratchpad is implemented similarly to an 8-way set-associative cache, where each way contains one block. The accelerator transfers one block at a time to a specified way in the instruction scratchpad. Once a block has been written, the valid and tag bits for that block are updated. The architecture does not implement the idb instruction from the formalism; instead, the compiler uses the first 8 bytes of every block to remember its address.

ORAM controller. We implement ORAM by building on the Phantom ORAM controller [63] and implement an ORAM tree 13 levels deep (i.e., 2^12 leaf buckets), with 4 blocks per bucket and an effective capacity of 64MB. ORAM controllers include an on-chip stash to temporarily buffer ORAM blocks before they are written out to memory. We set this stash to be 128 blocks. The Phantom design (and likewise, Ascend's [27–29]) treats the stash as a cache for ORAM lookups, which is safe when handling timing channels by controlling the memory access rate. GhostRider mitigates timing channels by having the compiler enforce MTO while assuming that events take the same time. As such, we modify Phantom's design to generate an access to a random leaf in case the requested block is found in the stash, to ensure uniform access times.

FPGA Implementation. GhostRider is implemented on one of Convey HC-2ex's [21] four Xilinx Virtex-6 LX760 FPGAs. We measure hardware design size in terms of FPGA slices for logic and Block RAMs for on-chip memory. A slice comprises four 6-input, 2-output lookup tables (implementing configurable logic) and eight flip-flops (as storage elements) in addition to multiplexers, while each BRAM on Virtex-6 is either an 18Kb or 36Kb SRAM with up to two configurable read-write ports.
The GhostRider prototype uses 47,357 such slices (39% of total) to implement both the CPU and the ORAM controller, and requires 685 of 1440 18Kb BRAMs (47.5%). Table 3.1 shows how these resources are broken up between the Rocket CPU and the ORAM controller, with the remaining resources being used by Convey HC-2ex's boilerplate logic to interface with the x86 core and DRAM. Note that this breakdown is a synthesis estimate before place and route. Our prototype currently supports one data ORAM bank, one code ORAM bank, and one ERAM bank. We do not implement encryption (it is a small, fixed cost and uninteresting in terms of performance trends), and do not have separate DRAM; all public data is stored in ERAM when running on the hardware. The Convey machine requires the hardware design to be run at 150 MHz, while our ORAM controller prototype currently synthesizes to a maximum operating frequency of 140 MHz. Pending further optimization to meet 150 MHz timing, we run both the CPU and the ORAM controller in a 75 MHz clock domain, and use asynchronous FIFOs to connect the ORAM controller to the DDR DRAM controllers.

GhostRider simulator timing model. In addition to demonstrating feasibility with our hardware prototype, we study the effect of GhostRider's compiler on alternate, more efficient ORAM configurations, e.g., Phantom at 150 MHz [63] with two ORAM banks and a distinct (non-encrypting) DRAM bank. Hence we generate a timing model for both the modified processor and ORAM banks based on Phantom's hardware implementation [63], and incorporate the timing model into an ISA-level emulator for the RISC-V architecture; the model is shown in Table 3.2.

         Slices          BRAMs
Rocket   9287 (8.8%)     36 (10.5%)
ORAM     12845 (12.2%)   211 (61.5%)

Table 3.1: FPGA synthesis results on Convey HC-2ex.

3.7 Empirical Evaluation

Programs. Table 3.3 lists all the programs we use in our evaluation.
These programs range from standard algorithms to data structures and include predictable, partially predictable, and predominantly irregular (data-driven) memory access patterns.

Feature                        Latency (# cycles)
64b ALU                        1
Jump taken/not taken           3/1
64b Multiply/Divide            70/70
Load/Store from Scratchpad     2
DRAM (4kB access)              634
Encrypted RAM (4kB access)     662
ORAM 13 levels (4kB block)     4262

Table 3.2: Timing model for GhostRider simulator.

Figure 3.10: Simulator-based execution time results of GhostRider.

Name       Brief Description                                         Input Size (KB)
sum        Summing up all positive elements in an array              10^3
findmax    Find the max element in an array                          10^3
heappush   Insert an element into a min-heap                         10^3
perm       Computing a permutation, executing a[b[i]] = i for all i  10^3
histogram  Compute the number of occurrences of each last digit      10^3
dijkstra   Single-source shortest path                               10^3
search     Binary search algorithm                                   1.7 × 10^4
heappop    Pop the minimal element from a min-heap                   1.7 × 10^4

Table 3.3: Benchmark programs for GhostRider organized into programs with predictable, partially predictable, and data-dependent memory access patterns (in order from top).

Non-secure   Non-secure program: all variables in ERAM, no padding, and uses scratchpad.
Baseline     Secure baseline: all secret variables in a single ORAM, no scratchpad.
Split ORAM   Variables can be split across multiple ORAM banks, or placed in ERAM. Performs padding. No scratchpad.
Final        Scratchpad on top of Split ORAM.

Figure 3.11: Legends of Figure 3.10.

Execution time results. We present measurements both for the simulator and for the actual FPGA hardware, starting with the former because the simulator allows us to evaluate the benefits from splitting memory into ERAM and ORAM banks vs. additionally using a scratchpad. We also discuss the execution time results by categorizing them based on the regularity in the programs' access patterns.

Simulator-based results.
Figure 3.10 depicts the slowdown of various configurations relative to a non-secure configuration that simply stores data in ERAM and employs the scratchpad. The legends are explained in Figure 3.11. Our non-secure baseline uses a scratchpad instead of a hardware cache in order to isolate the cost of MTO/ORAM. The secure Baseline configuration places all secret variables in a single ORAM, while Split ORAM employs the GhostRider optimization of using ERAM and multiple ORAM banks, and Final further adds the (secure) use of a scratchpad. Three out of eight programs—sum, findmax, and heappush—have a predictable access pattern, and the secure program generated by GhostRider relies mainly on ERAM. Hence, each MTO program (Final) exhibits between almost no slowdown and a 3.08× slowdown in comparison with its non-secure counterpart (Non-secure), and is correspondingly faster than Baseline by 5.85× to 9.03×. For perm, histogram, and dijkstra, which have partially predictable and partially sensitive memory access patterns, our compiler attempts to place sensitive arrays inside both ERAM and ORAM and also favors splitting into several smaller ORAM banks without breaking MTO. As shown in Figure 3.10, for such programs, Final can achieve a 1.30× to 1.85× speedup over Baseline (with 7.56× to 10.68× slowdown compared to Non-secure, respectively). For search and heappop, which have predominantly sensitive memory access patterns, the speedup of Final over Baseline is not as significant, i.e. 1.07× and 1.12× respectively, and is due mostly to the usage of two ORAMs to store arrays instead of a single ORAM.
Examining the impact of the use of the scratchpad in the results, we can see that for the first six programs, Final reduces execution time compared to Split ORAM by a factor of 1.05× up to 2.23×. For search and heappop, the scratchpad provides no benefit because for these programs all data is allocated in ORAM, as array indices are secret (so the access pattern is sensitive), and our type system disallows caching of ORAM blocks. The reason is that the presence of the data in the cache could reveal something about the secret indices. A more sophisticated type system, or a relaxation of MTO, could do better; we plan to explore such improvements in future work.

Figure 3.12: FPGA-based execution time results: slowdown of Baseline and Final versions compared to the non-secure version of the program. Note that unlike Figure 3.10, Final uses only a single ORAM bank and conflates ERAM and DRAM (cf. Section 3.6).

FPGA-based results. For the FPGA we run the same set of programs as in Table 3.3, but restrict the input size to be around 100 KB, due to limitations of our prototype. Speedups of Final over the secure Baseline follow a trend similar to the simulator, as shown in Figure 3.12. Regular programs have speedups in the range of 4.33× (for heappush) to 8.94× (for findmax). Partially regular programs like perm and histogram get a speedup of 1.46× and 1.3× respectively. Finally, irregular programs such as search and heappop see very little improvement (1.08× and 1.02× respectively). Differences between the simulator and hardware numbers can be attributed to multiple factors. First, the simulator imperfectly models the Convey memory system's latency, always assuming the worst case, and thus slowdowns compared to the non-secure baseline are often worse on the simulator (cf. heappop and heappush).
Second, the timing of certain hardware operations is different on the prototype and the simulator (where we consider the latter to be aspirational, per the end of Section 3.6). In particular, per Table 3.2, the simulator models access latency for ORAM as 4262 cycles and ERAM as 662 cycles, accounting for both reading data blocks from DRAM and moving the chosen 4KB block into the scratchpad BRAMs on the FPGA. On the hardware, ORAM and ERAM latencies are 5991 and 1312 cycles, respectively, measured using performance counters in the hardware design. The higher ERAM and ORAM access times reduce the slowdown on the simulator by amplifying the benefit of the scratchpad, which is used by the non-secure baseline, but not by the secure baseline (cf. findmax and sum). Third, the benefit of using the scratchpad can differ depending on the input size. This effect is particularly pronounced for dijkstra, where the ratio of secure to non-secure baseline execution is smaller for the hardware than for the simulator. The reason is that the hardware experiment uses a smaller input that fills only about 1/5 of a scratchpad block. Hence, in the non-secure baseline, the block is reloaded after relatively fewer accesses, resulting in a relatively greater number of block loads and thus bringing the performance of the non-secure program closer to that of the secure baseline. Finally, note that the simulator's use of multiple ORAM banks, and DRAM with different timings, is a source of differences, but this effect is dwarfed by the other effects.

3.8 Conclusion

We have presented the first complete memory trace oblivious system—GhostRider—comprising a novel compiler, type system, and hardware architecture. The compiled programs not only provably satisfy memory trace obliviousness, but also exhibit up to nearly order-of-magnitude performance gains in comparison with placing all variables in a single ORAM bank.
By enabling compiler analyses to target a joint ERAM-ORAM memory system, and by employing a compiler-controlled scratchpad, this work opens up several performance optimization opportunities in tuning bank configurations (size and access granularity) and, on a broader level, in co-designing data structures and algorithms for a heterogeneous yet oblivious memory hierarchy.

Chapter 4: RAM-model Secure Computation

Secure computation is a cryptographic technique allowing mutually distrusting parties to make collaborative use of their local data without harming the privacy of their individual inputs. Since the first system for general-purpose secure two-party computation was built in 2004 [65], efficiency has improved substantially [11,46]. Almost all previous implementations of general-purpose secure computation assume the underlying computation is represented as a circuit. While theoretical developments using circuits are sensible (and common), typical programs are mostly expressed in the von Neumann-style Random Access Machine (RAM) model. Compiling a RAM-model program into an efficient circuit-based representation can be challenging, especially when handling dynamic memory accesses to an array in which the memory location being read/written depends on secret inputs. Existing program-to-circuit compilers typically make an entire copy of the array upon every dynamic memory access, thus resulting in a huge circuit when the data size is large. To address this limitation, recent research [38] shows that ORAM can be used in secure computation to compile a dynamic memory access into a circuit of poly-logarithmic size while preventing information leakage through memory-access patterns. We refer to such an approach as a RAM-model secure computation (RAM-SC) approach, and Gordon et al.
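To see why dynamic accesses are so costly in the circuit model, note that a circuit has no data-dependent addressing: reading or writing a[i] for a secret index i must touch every slot, e.g., as a chain of multiplexers. The following Python sketch is our own illustration (the arithmetic selection stands in for Boolean mux gates; it is not how any particular compiler emits circuits):

```python
def circuit_read(array, secret_index):
    """Read array[secret_index] by scanning every slot: O(n) mux gates."""
    result = 0
    for j, elem in enumerate(array):
        take = 1 if j == secret_index else 0        # equality sub-circuit
        result = take * elem + (1 - take) * result  # mux sub-circuit
    return result

def circuit_write(array, secret_index, value):
    """Write array[secret_index] by rewriting the whole array: O(n) muxes."""
    return [value if j == secret_index else elem
            for j, elem in enumerate(array)]
```

Every dynamic access thus costs work linear in the array size, which is the overhead that ORAM-based RAM-SC reduces to poly-logarithmic.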
[38] observed that RAM-SC exhibits a significant advantage in the setting of repeated sublinear-time queries (e.g., binary search) on a large database, where an initial setup cost can be amortized over subsequent repeated queries.

Our Contributions. We continue work on secure computation in the RAM model, with the goal of providing a complete system that takes a program written in a high-level language and compiles it to a protocol for secure two-party computation of that program.1 In particular, we

• Define an intermediate representation (which we call SCVM) suitable for efficient two-party RAM-model secure computation;

• Develop a type system ensuring that any well-typed program will generate a RAM-SC protocol secure in the semi-honest model, if all subroutines are implemented with a protocol secure in the semi-honest model;

• Build an automated compiler that transforms programs written in a high-level language into a secure two-party computation protocol, and integrate compile-time optimizations crucial for improving performance.

We use our compiler to compile several programs including Dijkstra's shortest-path algorithm, KMP string matching, binary search, and more. For moderate data sizes (up to the order of a million elements), our evaluation shows a speedup of 1–2 orders of magnitude as compared to standard circuit-based approaches for securely computing these programs. We expect the speedup to be even greater for larger data sizes. SCVM is our first attempt to demonstrate the feasibility of optimizing secure computation in the RAM model using ORAMs. In the next chapter, we extend SCVM to build ObliVM, which focuses on richer expressiveness and easier programmability while achieving state-of-the-art performance for secure computation.

1Note that Gordon et al. [38] do not provide such a compiler; they only implement RAM-model secure computation for the particular case of binary search.
This chapter is based on a paper that I co-authored with Michael Hicks, Yan Huang, Jonathan Katz, and Elaine Shi [59]. I developed the formalism and the proof with the help of Michael Hicks, Elaine Shi, and Jonathan Katz. I developed the compiler, which implements the optimizations and the type checker, and emits code that is runnable over a secure computation backend. I conducted experiments to show the compiler's effectiveness with the help of Yan Huang.

4.1 Technical Highlights

As explained in Sections 4.2 and 4.3, the standard implementation of RAM-SC entails placing all data and instructions inside a single Oblivious RAM. The secure evaluation of one instruction then requires i) fetching the instruction and data from ORAM; and ii) securely executing the instruction using a universal next-instruction circuit (similar to a machine's ALU). This approach is costly since each step must be done using a secure-computation sub-protocol.

An efficient representation for RAM-SC. Our type system and SCVM intermediate representation are capable of expressing RAM-SC tasks more efficiently by avoiding expensive next-instruction circuits and minimizing ORAM operations when there is no risk to security. These language-level capabilities allow our compiler to apply compile-time optimizations that would otherwise not be possible. Thus, we not only obtain better efficiency than circuit-based approaches, but we also achieve order-of-magnitude performance improvements in comparison with straightforward implementations of RAM-SC (see Section 4.2).

Program-trace simulatability. A well-typed program in our language is guaranteed to be both instruction-trace oblivious and memory-trace oblivious. Instruction-trace obliviousness ensures that the values of the program counter during execution of the protocol do not leak information about secret inputs other than what is revealed by the output of the program.
As such, the parties can avoid securely evaluating a universal next-instruction circuit, and can instead simply evaluate a circuit corresponding to the current instruction. Memory-trace obliviousness ensures that memory accesses observed by one party during the protocol's execution similarly do not leak information about secret inputs other than what is revealed by the output. In particular, if access to some array does not depend on secret information (e.g., it is part of a linear scan of the array), then the array need not be placed into ORAM. We formally define the security property ensured by our type system as program-trace simulatability. We define a mechanism for compiling programs to protocols that rely on certain ideal functionalities. We prove that if every such ideal functionality is instantiated with a semi-honest secure protocol computing that functionality, then any well-typed program compiles to a semi-honest secure protocol computing that program.

Additional language features. SCVM supports several other useful features. First, it permits reactive computations by allowing output not only at the end of the program's execution, but also while it is in progress. Our notion of program-trace simulatability also fits this reactive model of computation. SCVM also integrates state-of-the-art optimization techniques that have been suggested previously in the literature. For example, we support public, local, and secure modes of computation, a technique recently explored (in the circuit model) by Kerschbaum [52] and Rastogi et al. [76]. Our compiler can identify and encode portions of the computation that can be safely performed in the clear or locally by one of the parties, without incurring the cost of a secure-computation sub-protocol. Our SCVM intermediate representation generalizes circuit-model approaches. For programs that do not rely on ORAM, our compiler effectively generates an efficient circuit-model secure-computation protocol.
This chapter focuses on the design of the intermediate representation language and type system for RAM-model secure computation, as well as the compile-time optimization techniques we apply. Our work is complementary to several independent, ongoing efforts focused on improving the cryptographic back end.

4.2 Background: RAM-Model Secure Computation

In this section, we review some background on RAM-model secure computation. Our treatment is adapted from that of Gordon et al. [38], with notation adjusted for our purposes.

A key underlying building block of RAM-model secure computation is Oblivious RAM (ORAM). ORAM is a cryptographic primitive that hides memory-access patterns by randomly reshuffling data in memory. With ORAM, each memory read or write operation incurs poly log n actual memory accesses.

Existing RAM-model secure computation, which we refer to as straightforward RAM-SC, employs the following scheme. The entire memory, denoted mem, containing both program instructions and data, is placed in ORAM, and the ORAM is secret-shared between the two participating parties as discussed above, e.g., using a simple XOR-based secret-sharing scheme. With ORAM, a memory access thus requires each party to access the elements of their respective arrays at pseudorandom locations (the addresses are dictated by the ORAM algorithm), and the value stored at each position is then obtained by XORing the values read by the two parties. Alternatively, the server can hold an encryption of the ORAM array, and the client holds the key. The latter was done by Gordon et al. to ensure that one party holds only O(1) state. All CPU state is also secret-shared between the two parties. Straightforward RAM-SC proceeds as follows. Each step of the computation must be done using some secure computation subprotocol.
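The XOR-based sharing mentioned above can be sketched in a few lines of Python. The function names here are ours, purely for illustration; in the actual protocol every word of mem, and the CPU state, would be held in this form:

```python
import secrets

def share(value: int, nbits: int = 32):
    """Split a value into two XOR shares; each share alone is uniform."""
    r = secrets.randbits(nbits)
    return r, value ^ r          # Alice's share, Bob's share

def reconstruct(share_a: int, share_b: int) -> int:
    """XOR the two shares to recover the value."""
    return share_a ^ share_b
```

Since r is uniformly random, either share on its own reveals nothing about the value; only the XOR of both shares does.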
In particular, SC-U is a secure computation protocol that securely evaluates the universal next-instruction circuit, and SC-ORAM is a secure computation protocol that securely evaluates the ORAM algorithm. For ORAM.Read, each party supplies a secret share of raddr, and during the course of the protocol, the ORAM.Read protocol will emit obfuscated addresses for each party to read from. At the end of the protocol, each party obtains a share of the fetched data. For ORAM.Write, each party supplies a secret share of waddr and wdata, and during the course of the protocol, the ORAM.Write protocol will emit obfuscated addresses for each party to write to, and secret shares of the values to write to those addresses.

Table 4.1: Two main scenarios and advantages of RAM-model secure computation

Scenario 1: Repeated sublinear queries over a large dataset (e.g., binary search, range query, shortest-path query). Potential benefits: amortize preprocessing cost over multiple queries; achieve sublinear amortized cost per query.

Scenario 2: One-time computation over a large dataset. Potential benefit: avoid paying O(n) cost per dynamic memory access.

Deployment scenarios and threat model for RAM-model secure computation. SCVM presently supports a two-party semi-honest protocol. We consider the following primary deployment scenarios:

1. Two parties, Alice and Bob, each holding their own private data, engage in a two-party protocol. For example, Goldman Sachs and Bridgewater would like to perform joint computation over their private market research data to learn market trends.

2. One or more users break their private data (e.g., genomic data) into secret shares, and split the shares among two non-colluding cloud providers. The shares at each cloud provider are completely random and reveal no information. To perform computation over the secret-shared data, the two cloud providers engage in a secure 2-party computation protocol.

3.
Similar to the above, but the two servers are within the same cloud or under the same administration. This can serve to mitigate Advanced Persistent Threats or insider threats, since compromise of a single machine will no longer lead to a breach of private data. Similar architectures have been explored in commercial products such as RSA's distributed credential protection [3].

In the first scenario, Alice and Bob should not learn anything about each other's data besides the outcome of the computation. In the second and third scenarios, the two servers should learn nothing about the users' data other than the outcome of the computation; note that the outcome of the computation can also be easily hidden simply by XORing the outcome with a secret random mask (like a one-time pad). We assume that the program text (i.e., code) is public.

With respect to the types of applications, while Gordon et al. describe RAM-model secure computation mainly for the amortized setting, where repeated computations are carried out starting from a single initial dataset, we note that RAM-model secure computation can also be meaningful for one-time computation on large datasets, since a straightforward RAM-to-circuit compiler would incur linear (in the size of the dataset) overhead for every dynamic memory access whose address depends on sensitive inputs. Table 4.1 summarizes the two main scenarios for RAM-model secure computation, and the potential advantages of using the RAM model in these cases.

4.3 Technical Overview: Compiling for RAM-Model Secure Computation

This section describes our approach to optimizing RAM-model secure computation. Our key idea is to use static analysis during compilation to minimize the use of heavyweight cryptographic primitives such as garbled circuits and ORAM.
4.3.1 Instruction-Trace Obliviousness

The standard RAM-model secure computation protocol described in Section 4.2 is relatively inefficient because it requires a secure-computation sub-protocol to compute the universal next-instruction circuit U. This circuit has large size, since it must interpret every possible instruction. In our solution, we avoid relying on a universal next-instruction circuit, and instead arrange things so that we can securely evaluate instruction-specific circuits.

Note that it is not secure, in general, to reveal what instruction is being carried out at each step in the execution of some program. As a simple example, consider a branch over a secret value s:

if (s) x[i] := a+b; else x[i] := a-b

Depending on the value of s, a different instruction (i.e., add or subtract) will be executed. To mitigate such an implicit information leak, our compiler transforms a program into an instruction-trace oblivious counterpart, i.e., a program whose program-counter value (which determines which instruction will be executed next) does not depend on secret information. The key idea is to use a mux operation to rewrite a secret if-statement. For example, the above code can be refactored to the following:

t1 := s;
t2 := a+b;
t3 := a-b;
t4 := mux(t1, t2, t3);
x[i] := t4

At every point during the above computation, the instruction being executed is pre-determined, and so does not leak information about sensitive data. Instruction-trace obliviousness is similar to program-counter security proposed by Molnar et al. [67] (for a different application).

4.3.2 Memory-Trace Obliviousness

Using ORAM for memory accesses is also a heavyweight operation in RAM-model secure computation. The standard approach is to place all memory in a single ORAM, thus incurring O(poly log n) cost per data operation, where n is a bound on the size of the memory.
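Returning to the mux rewrite of Section 4.3.1, its effect can be made concrete with a short Python sketch. The selection is written out as arithmetic so that no branch on the secret occurs; the function names are ours, not SCVM syntax:

```python
def mux(cond: int, if_true: int, if_false: int) -> int:
    """Select if_true when cond is 1, else if_false, without branching."""
    return cond * if_true + (1 - cond) * if_false

def branch_free_update(s: int, a: int, b: int, x: list, i: int) -> None:
    """Straight-line version of: if (s) x[i] := a+b; else x[i] := a-b."""
    t1 = s
    t2 = a + b          # both arms are always computed, so the
    t3 = a - b          # instruction trace is fixed in advance
    x[i] = mux(t1, t2, t3)
```

Whatever the value of s, the same five instructions execute in the same order; only the (secret-shared) data differs.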
We have demonstrated in the context of securing remote execution against physical attacks (Chapters 2 and 3) that not all access patterns of a program are sensitive. For example, a findmax program that sequentially scans through an array to find the maximum element has predictable access patterns that do not depend on sensitive inputs. We propose to apply a similar idea in the context of RAM-model secure computation. Our compiler performs static analysis to detect safe memory accesses that do not depend on secret inputs. In this way, we can avoid using ORAM when the access pattern is independent of sensitive inputs. It is also possible to store various subsets of memory (e.g., different arrays) in different ORAMs, when information about which portion of memory (e.g., which array) is being accessed does not depend on sensitive information.

4.3.3 Mixed-Mode Execution

We also use static analysis to partition a program into code blocks, and then for each code block use either a public, local, or secure mode of execution (described next). Computation in public or local modes avoids heavyweight secure computation. In the intermediate language, each statement is labeled with its mode of execution.

Public mode. Statements computing on publicly-known variables, or on variables that have been declassified in the middle of program execution, can be performed by both parties independently, without having to resort to a secure-computation protocol. Such statements are labeled P. For example, the loop iterators (in lines 1, 3, 10) in Dijkstra's algorithm (see Figure 4.2) do not depend on secret data, and so each party can independently compute them.

Local mode.
For statements computing over Alice's variables, public variables, or previously declassified variables, Alice can perform the computation independently.

1 for (i = 0; i < n; ++i) {
2   int bestj = -1; bestdis = -1;
3   for (int j = 0; j < ...

...

struct ...<T> {
  secure int10[public 1000] keys;
  T[public 1000] values;
};

In the above, the type int10 means that its value is a 10-bit signed integer. Each element in the array values has a generic type T, similar to C++ templates. ObliVM-Lang assumes data of type T to be secret-shared. In the future, we will improve the compiler to support public generic types.

Generic constants. Besides generic types, ObliVM-Lang also supports generic constants to further improve reusability. Let us consider the following tree example:

struct TreeNode@m<T> {
  public int@m key;
  T value;
  public int@m left, right;
};
struct Tree@m<T> {
  TreeNode@m<T>[public (1<<m)] nodes;
  public int@m root;
};

1  T Tree@m<T>.search(public int@m key) {
2    public int@m now = this.root, tk;
3    T ret;
4    while (now != -1) {
5      tk = this.nodes[now].key;
6      if (tk == key)
7        ret = this.nodes[now].value;
8      if (tk <= key)
9        now = this.nodes[now].right;
10     else
11       now = this.nodes[now].left;
12   }
13   return ret;
14 }

This function is a method of a Tree object; it takes a key as input and returns a value of type T. The function body defines three local variables: now and tk of type public int@m, and ret of type T. The definition of a local variable (e.g., now) can be accompanied by an optional initialization expression (e.g., this.root). When a variable (e.g., ret or tk) is not initialized explicitly, it is initialized to a default value depending on its type. The rest of the function is standard, C-like code, except that ObliVM-Lang requires exactly one return statement at the bottom of a function whose return type is not void. We highlight that ObliVM-Lang allows arbitrary looping on a public guard (e.g., line 4) without loop unrolling, which cannot be compiled by previous loop-elimination-based work [13,43,44,55,65,98].

Function types.
Programmers can define a variable to have function type, similar to function pointers in C. To avoid the complexity of handling arbitrary higher-order functions, the input and return types of a function type must not themselves be function types. Further, generic types cannot be instantiated with function types.

Native primitives. ObliVM-Lang supports native types and native functions. For example, ObliVM-Lang's default back end implementation is ObliVM-SC, which is implemented in Java. Suppose an alternative BigInteger implementation in ObliVM-SC (e.g., using additively homomorphic encryption) is available in a Java class called BigInteger. Programmers can define

typedef BigInt@m = native BigInteger;

Suppose that this class supports four operations: add, multiply, fromInt, and toInt, where the first two are arithmetic operations and the last two are used to convert between garbled-circuit-based integers and HE-based integers. We can expose these to the source language by declaring:

BigInt@m BigInt@m.add(BigInt@m x, BigInt@m y) = native BigInteger.add;
BigInt@m BigInt@m.multiply(BigInt@m x, BigInt@m y) = native BigInteger.multiply;
BigInt@m BigInt@m.fromInt(int@m y) = native BigInteger.fromInt;
int@m BigInt@m.toInt(BigInt@m y) = native BigInteger.toInt;

5.2.2 Language features for security

The key requirement of ObliVM-Lang is that a program's execution traces will not leak information. These execution traces include a memory trace, an instruction trace, a function stack trace, and a declassification trace. The trace definitions are similar to those of SCVM in Chapter 4, and we develop a security type system for ObliVM-Lang, similar to the one for SCVM, to enforce trace obliviousness. In addition, ObliVM-Lang provides an extended type system that imposes further constraints on how random numbers and functions should be used to achieve security.
This type system extension does not enforce formal security, but it provides useful hints to capture subtle bugs, e.g., when implementing oblivious data structures.

Random numbers and implicit declassifications. Many oblivious programs such as ORAM and oblivious data structures crucially rely on randomness. In particular, their obliviousness guarantee has the following nature: the joint distribution of memory traces is identical regardless of secret inputs (these algorithms typically have a cryptographically negligible probability of correctness failure). ObliVM-Lang supports reasoning about such "distributional" trace-obliviousness by providing random types associated with an affine type system. For instance, rnd32 is the type of a 32-bit random integer. A random number will always be secret-shared between the two parties.

To generate a random number, there is a built-in function RND with the following signature:

rnd@m RND(public int32 m)

This function takes a public 32-bit integer m as input, and returns m random bits. Note that rnd@m is a dependent type, whose type depends on values, i.e., m. To avoid the complexity of handling general dependent types, the ObliVM-Lang compiler restricts the usage of dependent types to only this built-in function, and handles it specially.

In our ObliVM framework, outputs of a computation can be explicitly declassified with special syntax. Random numbers also allow implicit declassification, by assigning them to public variables. Here "implicitness" means that the declassification happens not because it is a specified outcome of the computation. For security, we must ensure that each random number is implicitly declassified at most once, for the following reason. When implicitly declassifying a random number, both parties observe the random number as part of the trace. Now consider the following example, where s is a secret variable.
1 rnd32 r1 = RND(32), r2 = RND(32);
2 public int32 z;
3 if (s) z = r1; // implicit declass
4 else z = r2; // implicit declass
. . .
XX public int32 y = r2; // NOT OK

In this program, random variables r1 and r2 are initialized in Line 1; these variables are assigned a fresh, random value upon initialization. Up to Line 4, random variables r1 and r2 are each declassified no more than once. Line XX, however, could potentially cause r2 to be declassified more than once. Line XX clearly is not secure, since in this case the observable public variables y and z could be correlated, depending on which secret branch was taken earlier. Therefore, we use an affine type system to ensure that each random variable is implicitly declassified at most once. This way, each time a random variable is implicitly declassified, it introduces an independently uniform variable to the observable trace. In our security proof, a simulator can simply sample this random number to simulate the trace. It turns out that the above example reflects the essence of what is needed to implement oblivious RAM and oblivious data structures in our source language. We refer the reader to Sections 5.3 and 5.4.2 for details.

Function calls and phantom functions. A straightforward idea to prevent stack behavior from leaking information is to enforce function calls in a public context. The requirement is then that each function's body must satisfy memory- and instruction-trace obliviousness. Further, by defining native functions, ObliVM-Lang implicitly assumes that their implementations satisfy memory- and instruction-trace obliviousness. Beyond this basic idea, ObliVM-Lang goes a step further by enabling function calls within a secret if-statement, introducing the notion of a phantom function. The idea is that each function can be executed in dual modes: a real mode and a phantom mode. In the real mode, all statements are executed normally, with real computation and real side effects.
In the phantom mode, the function execution merely simulates the memory traces of the real mode; no side effects take place; and the phantom function call returns a secret-shared default value of the specified return type. This is similar to padding ideas used in several previous works [6,78]. We illustrate the use of phantom functions with the following prefixSum example. The function prefixSum(n) accesses a global integer array a, and computes the prefix sum of the first n+1 elements in a. After accessing each element (Line 3), the element in array a is set to 0 (Line 4).

1 phantom secure int32 prefixSum
2     (public int32 n) {
3   secure int32 ret = a[n];
4   a[n] = 0;
5   if (n != 0) ret = ret + prefixSum(n-1);
6   return ret;
7 }

The keyword phantom indicates that the function prefixSum is a phantom function. Consider the following code calling the phantom function:

if (s) then x = prefixSum(n);

To ensure security, prefixSum will always be called, no matter whether s is true or false. When s is false, however, it must be guaranteed that (1) elements in array a will not be assigned 0; and (2) the function generates traces with the same probability as when s is true. To this end, the compiler generates target code with the following signature:

prefixSum(idx, indicator)

where indicator denotes whether the function is called in the real or phantom mode. To achieve the first goal, the global variable is modified only if indicator is false. The compiler compiles the code in line 4 into the following pseudo-code:

a[idx] = mux(0, a[idx], indicator);

It is easy to see that all instructions will be executed, and thus the generated traces are identical regardless of the value of indicator. Note that such a function is not implementable in any prior loop-unrolling based compiler, since n is provided only at runtime.
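The real/phantom duality can be sketched in Python: every instruction runs in both modes, but writes and the return value are muxed on the mode indicator, so the sequence of memory accesses is identical either way. The names below (phantom_prefix_sum, the boolean real, the explicit trace list) are our own illustration, not code emitted by the ObliVM compiler:

```python
def mux(c, x, y):
    """Return x when c is truthy, else y, without branching on c."""
    c = int(bool(c))
    return c * x + (1 - c) * y

def phantom_prefix_sum(a, n, real, trace):
    """Sum a[0..n], zeroing each visited slot only in real mode.

    The recorded access trace is the same in both modes."""
    trace.append(n)                 # the memory access occurs in both modes
    ret = a[n]
    a[n] = mux(real, 0, a[n])       # side effect suppressed in phantom mode
    if n != 0:
        ret = ret + phantom_prefix_sum(a, n - 1, real, trace)
    return mux(real, ret, 0)        # phantom call returns a default value
```

Running it in both modes on the same input yields identical traces, while only the real mode mutates the array and returns the true sum.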
It is worth noting that phantom functions relax the restriction posed by previous memory-trace-oblivious type systems [58], which do not allow looping in a secure context (i.e., within a secret conditional). The main difficulty in previous systems was to quantify the number of loop iterations in the two branches of an if-statement, and to enforce the two numbers to be the same. Phantom functions remove the need for this analysis by executing both branches, with one branch really executed and the other executed phantomly. As long as an adversary is unable to distinguish a real execution from a phantom one, the secret guard of the if-statement will not be leaked, even when loops are present (i.e., in a phantom function).

5.3 User-Facing Oblivious Programming Abstractions

Programming abstractions such as MapReduce and GraphLab have been popularized in the parallel computing domain. In particular, programs written for a traditional sequential programming paradigm are difficult to parallelize automatically by an optimizing compiler. These new paradigms are not only easy for users to understand and program with, but also provide insight into the structure of the problem, and facilitate parallelization in an automated manner.

In this section, we take a similar approach towards oblivious programming. The idea is to develop oblivious programming abstractions that can be easily understood and consumed by non-specialist programmers, and that our compiler can compile into efficient oblivious algorithms. In comparison, if these programs were written in a traditional imperative-style programming language like C, compile-time optimizations would be much more limited.

5.3.1 MapReduce Programming Abstractions

An interesting observation is that "parallelism facilitates obliviousness" [24,36].
If a program (or part of a program) can be efficiently expressed in parallel programming paradigms such as MapReduce and GraphLab [2,64] (with a few additional constraints), then there is an efficient oblivious algorithm to compute this task. We stress that in this chapter, we consider MapReduce merely as a programming abstraction that facilitates obliviousness; in reality we compile MapReduce programs to sequential implementations that run on a single thread. Parallelizing the algorithms is outside the scope of this work.

Background: Oblivious algorithms for streaming MapReduce. A streaming MapReduce program consists of two basic operations, map and reduce.

• The map operation takes an array denoted {αi}i∈[n], where each αi ∈ D for some domain D, and a function mapper : D → K × V. Now map applies (ki, vi) := mapper(αi) to each αi, and outputs an array of key-value pairs {(ki, vi)}i∈[n].

• The reduce operation takes an array of key-value pairs denoted {(ki, vi)}i∈[n] and a function reducer : K × V² → V. For every unique key value k in this array, let (k, vi1), (k, vi2), ..., (k, vim) denote all occurrences with the key k. Now the reduce operation applies the following operation in a streaming fashion:

Rk := reducer(k, ... reducer(k, reducer(k, vi1, vi2), vi3), ..., vim)

The result of the reduce operation is an array consisting of a pair (k, Rk) for every unique key value k in the input array.

Goodrich and Mitzenmacher [36] observe that any program written in a streaming MapReduce abstraction can be converted to an efficient oblivious algorithm, and they leverage this observation to aid the construction of an ORAM scheme.
Figure 5.1: Streaming MapReduce in ObliVM-Lang. See Section 5.3.1 for oblivious algorithms for the streaming MapReduce paradigm [36].

• The map operation is inherently oblivious, and can be done by making a linear scan over the input array.

• The reduce operation can be made oblivious through an oblivious sorting (denoted o-sort) primitive.

– First, o-sort the input array in ascending order of the key, such that all pairs with the same key are grouped together.

– Next, in a single linear scan, apply the reducer function: i) if this is the last key-value pair for some key k, write down the result of the aggregation (k, Rk); ii) else, write down a dummy entry ⊥.

– Finally, o-sort all the resulting entries to move the ⊥ entries to the end.

Providing the streaming MapReduce abstraction in ObliVM. It is easy to implement the streaming MapReduce abstraction as a library in our source language ObliVM-Lang. The ObliVM-Lang implementation of the streaming MapReduce paradigm is provided in Figure 5.1. MapReduce has two generic constants, m and n, representing the sizes of the input and the output respectively. It also has three generic types: I, the inputs' type; K, the output keys' type; and V, the output values' type. All three types are assumed to be secret. The abstraction takes five inputs: data, the input data; map, the mapper; reduce, the reducer; initialVal, the initial value for the reducer; and cmp, which compares two keys of type K. Lines 6-10 of Figure 5.1 are the mapper phase of the algorithm; line 11 then uses the function sort to sort the intermediate results based on their keys.
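The map/o-sort/scan/o-sort structure just described can be sketched in Python. A regular (stable) sort stands in for the oblivious sorting network, and None models the dummy entry ⊥; all names are ours, and a real implementation would keep the scan and the dummy handling data-independent rather than using Python conditionals:

```python
def oblivious_mapreduce(data, mapper, reducer, initial_val):
    """Streaming MapReduce: map, sort by key, linear scan emitting dummies,
    then sort the dummies to the end (both sorts stand in for o-sort)."""
    pairs = [mapper(x) for x in data]      # map phase: one linear scan
    pairs.sort(key=lambda kv: kv[0])       # group equal keys together
    out, acc = [], initial_val
    for i, (k, v) in enumerate(pairs):
        acc = reducer(k, acc, v)
        last_of_key = i + 1 == len(pairs) or pairs[i + 1][0] != k
        out.append((k, acc) if last_of_key else None)  # None = dummy
        if last_of_key:
            acc = initial_val
    out.sort(key=lambda e: e is None)      # move dummies to the end
    return out
```

For instance, a histogram of [2, 1, 2, 2] with mapper x -> (x, 1) and reducer (k, v1, v2) -> v1 + v2 yields [(1, 1), (2, 3), None, None].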
After line 11, the intermediate results with the same key are grouped together, and lines 12-29 produce the output of the reduce phase along with some dummy outputs. Finally, lines 30-35 use oblivious sort again to eliminate those dummy outputs, and eventually line 36 returns the final results. Notice that these functions involve three arrays: data, d2, and res. The program declares all of them to have only public access patterns, because they are accessed either by a sequential scan or by an oblivious sorting. In this case, the compiler will not place these arrays into ORAM banks.

Using MapReduce. Figure 5.1 needs to be written by an expert developer only once. From then on, an end user can make use of this programming abstraction. We further illustrate how to use the above MapReduce program to implement a histogram. In SCVM (Chapter 4), a histogram program is as below.

for (public int i = 0; i < n; i = i + 1)
    hist[x[i]] = hist[x[i]] + 1;

Using the MapReduce abstraction, the user only needs to supply the comparator, the mapper, and the reducer:

int2 cmp(int32 x, int32 y) { int2 r = 0; if (x < y) r = -1; if (x > y) r = 1; return r; }
Pair<int32, int32> mapper(int32 x) { return Pair<int32, int32>(x, 1); }
int32 reducer(int32 k, int32 v1, int32 v2) { return v1 + v2; }

The top-level program can launch the computation using

c = MapReduce@m@n(a, mapper, reducer, 0, cmp);

5.3.2 Programming Abstractions for Data Structures

We now explain how to provide programming abstractions for a class of pointer-based oblivious data structures described by Wang et al. [92]. Figure 5.2 gives an example, where an expert programmer provides library support (Figure 5.3) for implementing a class of pointer-based data structures, such that a non-specialist programmer can implement data structures that will be compiled to efficient oblivious algorithms outperforming generic ORAM. We stress that while we give a stack example for simplicity, this paradigm is also applicable to other pointer-based data structures, such as AVL trees, heaps, and queues.
150 1 struct StackNode@m { 2 Pointer@m next; 3 T data; 4 }; 5 struct Stack@m { 6 Pointer@m top; 7 SecStore@m store; 8 }; 9 phantom void Stack@m.push(T data) { 10 StackNode@m node = StackNode@m ( top, data); 11 this.top = this.store.allocate(); 12 store.add(top.(index, pos), node); 13 } 14 phantom T Stack@m.pop() { 15 StackNode@m res = store .readAndRemove(top.(index, pos)); 16 top = res.next; 17 return res.data; 18 } Figure 5.2: Oblivious stack by non-specialist programmers. Implementing oblivious data structure abstractions in ObliVM. We assume that the reader is familiar with the oblivious data structure algorithmic techniques described by Wang et al. [92]. To support efficient data structure implementations, an expert programmer implements two important objects (see Figure 5.3): • A Pointer object stores two important pieces of information: an index vari- able that stores the logical identifier of the memory block pointed to (each memory block has a globally unique index); and a pos variable that stores the random leaf label in the ORAM tree of the memory block. • A SecStore object essentially implements an ORAM, and provides the follow- ing member functions to an end-user: The SecStore.remove function essen- 151 1 rnd@m RND(public int32 m) = native lib.rand; 2 struct Pointer@m { 3 int32 index; 4 rnd@m pos; 5 }; 6 struct SecStore@m { 7 CircuitORAM@m oram; 8 int32 cnt; 9 }; 10 phantom void SecStore@m.add(int32 index, int@m pos, T data) { 11 oram.add(index, pos, data); 12 } 13 phantom T SecStore@m .readAndRemove(int32 index, rnd@m pos) { 14 return oram.readAndRemove(index, pos); 15 } 16 phantom Pointer@m SecStore@m.allocate() { 17 cnt = cnt + 1; 18 return Pointer@m(cnt, RND(m)); 19 } Figure 5.3: Code by expert programmers to help non-specialists implement oblivious stack. tially is a syntactic sugar for the ORAM’s readAndRemove interface [80, 90], and the SecStore.add function is a syntactic sugar for the ORAM’s Add inter- face [80,90]. 
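The division of labor between Pointer, SecStore, and the stack can be mimicked in plain Python, with a dictionary standing in for the ORAM bank. No obliviousness is modeled here; the point is the (index, pos) plumbing, and popping the entry on a read mirrors the affine rule that a leaf label is revealed, and hence consumed, at most once. All names below are our own, not ObliVM's:

```python
import random

class SecStore:
    """Dictionary stand-in for an ORAM bank addressed by (index, pos)."""
    def __init__(self, depth):
        self.depth, self.cnt, self.blocks = depth, 0, {}

    def allocate(self):
        # a globally unique index plus a freshly chosen random leaf label,
        # as SecStore.allocate does with the counter cnt and RND(m)
        self.cnt += 1
        return (self.cnt, random.randrange(1 << self.depth))

    def add(self, ptr, block):
        self.blocks[ptr] = block

    def read_and_remove(self, ptr):
        # pop() consumes the entry: an (index, pos) pair is used at most once
        return self.blocks.pop(ptr)

class Stack:
    def __init__(self, depth):
        self.store, self.top = SecStore(depth), None

    def push(self, data):
        node = (self.top, data)           # StackNode: (next pointer, payload)
        self.top = self.store.allocate()
        self.store.add(self.top, node)

    def pop(self):
        nxt, data = self.store.read_and_remove(self.top)
        self.top = nxt
        return data
```

A short usage run: pushing 'a' then 'b' and popping twice returns 'b' then 'a', leaving top as None.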
Finally, the SecStore.allocate function returns a new Pointer object to the caller. This new Pointer object is assigned a globally unique logical identifier (using a counter cnt that is incremented each time), and a freshly chosen random leaf label RND(m).

Stack implementation by a non-specialist programmer. Given the abstractions provided by the expert programmer, a non-specialist programmer can now implement a class of data structures such as stack, queue, heap, AVL tree, etc. Figure 5.2 gives a stack example.

Role of the affine type system. We use Figure 5.3 as an example to illustrate how our rnd types with their affine type system can ensure security. As mentioned earlier, rnd types have an affine type system. This means that each rnd value can be declassified (i.e., made public) at most once. In Figure 5.3, the oram.readAndRemove call will declassify its argument rnd@m pos inside the implementation of the function body. From an algorithmic perspective, this is because the leaf label pos will be revealed during the readAndRemove operation, incurring a memory trace in which the value rnd@m pos is observable by the adversary.

5.3.3 Loop Coalescing and New Oblivious Graph Algorithms

We introduce a new programming abstraction called loop coalescing, and show how this abstraction allowed us to design novel oblivious graph algorithms, such as Dijkstra's shortest path and minimum spanning tree for sparse graphs. Loop coalescing is non-trivial to embed as a library in ObliVM-Lang. We therefore support this programming abstraction by introducing special syntax and modifications to our compiler. Specifically, we introduce a new syntax called bounded-for loop, as shown in Figure 5.4. For succinctness, in this section, we will present pseudo-code.
In the program in Figure 5.4, the bwhile(n) and bwhile(m) syntax at Lines 1 and 3 indicate that the outer loop will be executed for a total of n times, whereas the inner loop will be executed for a total of m times – over all iterations of the outer loop. 153 1 bwhile(n) (; u{ 4 int@n id, pos; 5 T data; 6 }; 7 struct CircuitOram@n{ 8 dummy Block@n[public 1<[public STASHSIZE] stash; 10 }; Figure 5.6: Part of our Circuit ORAM implementation (Type Definition) in ObliVM-Lang. 6 to 9 of Figure 5.5, the syntax num$i~j$ means extracting the part of integer num from i-th bit to j-th bit. 5.4.2 Case Study: Circuit ORAM In Figure 5.7, we show part of the Circuit ORAM implementation using ObliVM-Lang. Line 3 to line 6 is the definition of a ORAM block containing two metadata fields of an index of type int, and a position label of type rnd, along with a data field of type . Circuit ORAM (line 7-10) is organized to contain an array of buckets (i.e. arrays of ORAM blocks), and a stash (i.e. an array of blocks). The dummy keyword in front of Block@n indicates the value of this type can be null. In many cases, (e.g. Circuit ORAM implementation), using dummy keyword leads to a more efficient circuit generation. Line 11-30 demonstrates how readAndRemove can be implemented. Taking the input of an secret integer index id, and a random position label pos, the label 160 11 phantom T CircuitOram@n .readAndRemove(int@n id, rnd@n pos) { 12 public int32 pubPos = pos; 13 public int32 i = (1 << n) + pubPos; 14 T res; 15 for (public int32 k = n; k>=0; k=k-1) { 16 for (public int32 j=0;j |t| t if i = 1 ∧ t = read(x, n) | write(x, n) | readarr(x, n, n′) | writearr(x, n, n′) t1[i] if t = t1@t2 ∨ 1 ≤ i ≤ |t1| t2[i− |t1]] if t = t1@t2 ∨ |t1| < i ≤ |t| It is easy to see that if ∀i.t1[i] = t2[i] implies |t1| = |t2| by the following lemma. Lemma 2. t[i] 6=  for all i such that 1 ≤ i ≤ |t|, and  otherwise. Proof. The second part of the conclusion is trivial since it directly follows the defi- nition. 
We prove the first part by induction on |t|. If |t |= 0, then the conclusion is trivial. If |t| = 1, and 1 ≤ i ≤ |t|, then i must be 1. Therefore, t[i] is one of read(x, n, write(x, n), readarr(x, n, n′), and writearr(x, n, n′), and therefore t[i] 6= . If |t| > 1, then t must be a concatenation of two subsequences, i.e. t1@t2. If 1 ≤ i ≤ |t1|, then t[i] = t1[i], and by induction, we know that t[i] 6= . Otherwise, if |t1| < i ≤ |t|, then 0 < i − |t1| ≤ |t| − |t1| = |t2|. For natural number n, n > 0 implies n ≥ 1. Therefore 1 ≤ i − |t1| ≤ |t2|, and by induction, we have t[i] = t2[i− |t1]] 6= . Before we go to the next lemma, we shall define the canonical representation 180 of a trace. First, we define the number of blocks in a trace t, denoted by #(t), as #(t) =  #(t1) + #(t2) if t = t1@t2 1 otherwise Then we define an order t between two traces t1 and t2 as follows: t1 t t2 if and only if either of the following two conditions hold true: (i) #(t1) < #(t2), or (ii) #(t1) = #(t2) ≥ 2, t1 = t′1@t′′1, t2 = t′2@t′′2, and either of the following three sub-conditions holds true: (ii.a) #(t′1) > #(t ′ 2); (ii.b) #(t ′ 1) = #(t ′ 2) and t ′ 1 t t′2; or (ii.c) t′1 = t ′ 2 and t ′′ 1  t′′2. It is easy to see that t is complete. Definition 10 (canonical representation). The canonical representation of a trace t is the minimal element in the set {t′ : t ≡ t′} under order t. Lemma 3. The canonical representation of t is (i)  is |t |= 0; or (ii) can(t) = (...((t1@t2)@t3)...@tn), where n = |t| > 0, and ti = t[i]. Proof. On the one hand, it is easy to see that can(t) belongs to the set {t′ : t ≡ t′}. In fact, we can prove by induction on #(t). If #(t) = 1, then either |t |= 1, or |t |= 0. For the former case, t is one of read(x, n, write(x, n), fetch(p), readarr(x, n, n′), and writearr(x, n, n′), and thus t = t[1] = can(t). For the later case, t = . Now suppose #(t) > 1, and thus t = t′@t′′. Suppose |t′ |= l1 and |t′′ |= l2. If l2 = 0, by induction, t ′′ = , and thus t ≡ t′. 
Furthermore, we have |t| = |t′|, and ∀i.t[i] = t′[i] by definition. Therefore t ≡ t′ ≡ can(t′) = can(t). Similarly, we can prove the conclusion is true when l1 = 0. Now suppose l1 > 0 and l2 > 0, then can(t′) = (...((t1@t2)@t3)...tl1), and can(t ′′) = (...((tl1+1@tl1+2)@tl1+3)...@tl1+l2). 181 Then t ≡ can(t′)@can(t′′) ≡ can(t). On the other hand, we shall show that can(t) is the minimal one in {t′ : t ≡ t′}. To show this point, we only need to show that for all t′ ≡ can(t), we have can(t) t t′. We prove by induction on n. If n = 1, the conclusion is obvious. Suppose n > 1 and the conclusion holds true for all n′ < n. It is easy to see that #(t′) > 1, therefore we suppose t′ = tl@tr. Then we prove that there exists k such that tl ≡ (...(t1@t2)...@tk) and tr ≡ (...(tk+1@tk+2)...@tn). We prove by induction on n and how many steps of equivalent-transitive rule, i.e., t1 ≡ t2∧t2 ≡ t3 ⇒ t1∧t3, should be applied to derive can(t) ≡ t′. If we should apply 0 step, then we know one of the following situations holds: (i) t′ = t′′@tn where t′′ ≡ (...(t1@t2)...@tn−1); (ii) t′ = (...(t1@t2)...tn−2)@(tn−1@tn); (iii) t′ = t@; or (iv) t′ = @t. In any case, our conclusion holds true. Now suppose we need to apply n > 0 steps to derive t′, where in the n−1 step, we derive that can(t) ≡ t′′ and we can derive t′′ ≡ t′ without applying the equivalent-transitive rule. Therefore by induction, we know that t′′ = t′′l @t ′′ r , and there is k such that t ′′ l ≡ (...(t1@t2)...@tk) and t′′r ≡ (...(tk+1@tk+2)...@tn). Since we can derive t′′ ≡ t′ without applying equivalent- transitive rule, we know that one of the following situations holds: 1. t′′l ≡ tl and t′′r ≡ tr; 2. t′ = @t′′; 3. t′ = t′′@; 4. t′′ = t′@; 182 5. t′′ = @t′; 6. @t′ = t′′@; 7. t′@ = @t′′; 8. t′′l = (tl1@tl2), tl ≡ tl1, and tr ≡ tl2@t′′r (in this case, we have (tl1@tl2)@t′′r ≡ tl1@(tl2@t ′′ r)); and 9. t′′r = (tr1@tr2), tl ≡ t′′l @tr1, and tr ≡ tr2 (in this case, we have t′′l @(tr1@tr2) ≡ (t′′l @tr1)@tr2). 
For the first 7 cases, the conclusion is trivial. For Case 8, by induction, we know there are some k′ such that tl1 ≡ (...(t1@t2)...@tk′) and tl2 ≡ (...(tk′+1@tk′+2)...@tk). Therefore tl ≡ (...(t1@t2)...@tk′), and tr ≡ (...(tk′+1@tk′+2)...@tk)@(...(tk+1@tk+2)...@tn) while the later is equivalent to (...(tk′+1@tk′+2)...@tn). Similarly, we can prove under case 9, the conclusion is also true. Next, we prove that can(t) t tl@tr. To show this point, by induction, we know (...(t1@t2)...@tk)  tl and (...(tk+1@tk+2)...@tn)  tr. If either #(tl) > k or #(tr) > n − k, we have #(t′) > n and thus can(t)  t′. Suppose #(tl) = k and #(tr) = n− k. If k < n− 1, then by the definition of , we have can(t) t t′. Next suppose k = n − 1, then by induction, we know (...(t1@t2)...@tn−1) t tl, and thus can(t) t tl@tr = t′. Next is the most important lemma about trace-equivalence. Lemma 4. t1 ≡ t2, if and only if ∀i.t1[i] = t2[i]. 183 Proof. “⇒” Suppose ∀i.t1[i] = t2[i]. Then by Lemma 3, we know can(t1) = can(t2), and thus t1 ≡ can(t1) ≡ can(t2) ≡ t2. “⇐” Suppose t1 ≡ t2, then by Lemma 3, we have can(t1) ≡ can(t2). Due to both can(t1) and can(t2) have the same form, we know they are identical. Therefore, we can conclude that ∀i.t1[i] = t2[i]. A.2 Lemmas on trace pattern equivalence Trace pattern equivalence has similar properties as trace equivalence. If fact, we define the length of a trace pattern T , denoted as |T |, to be |T | =  1 if T = Read(x) | Fetch(p) 0 if T =  |T1|+ |T2| if T = T1@T2 Similar to trace, we define the i-th element in a trace pattern T , denoted T [i], as follows: T [i] =   if i ≤ 0 ∨ i > |T | T if i = 1 ∧ T = Read(x) T1[i] if T = T1@T2 ∨ 1 ≤ i ≤ |T1| T2[i− |T1]] if T = T1@T2 ∨ |T1| < i ≤ |T | Using exactly the same technique, we can prove the following lemma: Lemma 5. T1 ∼L T2, if and only if ∀i.T1[i] = T2[i]. To avoid verbosity, we do not provide the full proof here. 
It is quite similar to 184 the proof of Lemma 4 A.3 Proof of memory trace obliviousness To prove Theorem 1, memory trace obliviousness by typing, we shall first prove the following lemma: Lemma 6. If Γ ` e : Nat L;T , then for any two Γ-valid low-equivalent memories M1, M2, if 〈M1, e〉 ⇓t1 n1, 〈M2, e〉 ⇓t2 n2, then t1 = t2 and n1 = n2 Proof. We use structural induction on expression e to prove this lemma. If e is in form of x, then Γ(x) = Nat L, and thus M1(x) = M2(x) = n according to the definition of low-equivalence and Γ-validity. Therefore t1 = read(x, n) = t2, and n1 = n2 = n. If e is in form of e1 op e2, then Γ ` e1 : Nat L and Γ ` e2 : Nat L. Suppose 〈Mi, ej〉 ⇓tij n′ij, for i = 1, 2, j = 1, 2. Then t1j = t2j and n1j = n2j for j = 1, 2. Therefore t1 = t11@t12 = t21@t22 = t2, and n1 = n11 op n12 = n21 op n22 = n2. Next, we consider the expression in form of x[e]. We know that Γ(x) = Array L, which implies Γ ` e : Nat L. Suppose 〈Mi, e〉 ⇓t′i n′i, then by induction t′1 = t ′ 2 and n ′ 1 = n ′ 2. Furthermore, since M1 ∼L M2, we have ∀i ∈ Nat.M1(x)(i) = M2(x)(i). Therefore t1 = t ′ 1@readarr(x, n ′ 1,M1(x)(n ′ 1)) = t ′ 2@readarr(x, n ′ 2,M2(x)(n ′ 2)) = t2, and n1 = M1(x)(n ′ 1) = M2(x)(n ′ 2) = n2. Finally, the conclusion is trivial for constant expression. For convenience, we define lab : Type→ SecLabels as: 185 lab(τ) =  l if τ = Int l l if τ = Array l Similar to Lemma 6, we can prove the following lemma: Lemma 7. If Γ ` e : Nat l;T and l ∈ ORAMBanks, then for any two Γ-valid low-equivalent memories M1, M2, if 〈M1, e〉 ⇓t1 n1, 〈M2, e〉 ⇓t2 n2, then t1 = t2 Proof. If l = L, then the conclusion is obvious by Lemma ??. We only consider l ∈ ORAMBanks. We use structural induction to prove this lemma. If e is in form of x, then according to the definition of Γ-validity and evt(·), we have t1 = lab(Γ(x)) = t2. If e is in form of e1 op e2, then Γ ` e1 : Nat l1 and Γ ` e2 : Nat l2. Suppose 〈Mi, ej〉 ⇓tij n′ij, for i = 1, 2, j = 1, 2. 
Then t1j = t2j, for j = 1, 2 by induction. Therefore t1 = t11@t12 = t21@t22 = t2. Finally, we consider the expression in form of x[e]. We know that Γ ` e : Nat l′. Suppose 〈Mi, e〉 ⇓t′i n′i. If l′ = L, then t′1 = t′2 by Lemma ??. Otherwise, l ∈ ORAMBanks, and by induction assumption, we have t′1 = t′2. Since l ∈ ORAMBanks, we know l = lab(Γ(x)), and thus t1 = t ′ 1@l = t ′ 2@l = t2. Now we shall study the property of trace pattern equivalence. First of all, we have the following lemma: Lemma 8. Suppose s and S are a statement and a labeled statement respectively. If Γ, l0 ` S;T , l0 ∈ ORAMBanks and 〈M,S〉 ⇓t M ′, then M ∼L M ′. 186 Proof. We prove by induction on the statement S. Notice that the statement is impossible to be while statement. The conclusion is trivial for the statement skip. If s is x := e, then l0 ⊆ lab(Γ(x)), and thus lab(Γ(x)) ∈ ORAMBanks. Therefore M ′ = M [x 7→ (n, l)] for some natural number n and some security la- bel l, which implies M ′ ∼L M . Similarly, if s is x[e1] := e2, then lab(Γ(x)) ∈ ORAMBanks. Furthermore, 〈M,x[e1] := e2〉 ⇓t M [x 7→ (m, l)] for some mapping m, and some security label l ∈ ORAMBanks. Therefore M ′ = M [x 7→ (m, l)], which implies for x such that M(x) = (n,L), we know that M ′(x) = (n,L). There- fore M ′ ∼L M . Next, let us consider statement if(e, S1, S2). Then we know either of the two conditions holds true: (1) 〈M,S1〉 → M ′, and (2) 〈M,S2〉 → M ′. Since Γ, l0 ` if(e, S1, S2);T , we have Γ, l ′ ` S1;T1, and Γ, l′ ` S2;T2, where l0 v l′. Therefore we know for either condition, we have M ∼L M ′. Finally, for sequence of two statements S1;S2, suppose 〈M,S1〉 ⇓t M1, and 〈M1, S2〉 ⇓t′ M ′. Then M ∼L M1 ∼L M ′. According to definition of the trace pattern equivalence, it is obvious to see that, if T ∼L T ′, then T is a sequence, whose element each is in the form of Fetch(p), Read(x), , and o. 
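As a sanity check on the flattening view used in Lemmas 2-4 (a trace is equivalent to the left-nested concatenation of its elements), the equivalence and canonical-form machinery can be modeled in a few lines of Python. The encoding is our own: an event is a string, () is the empty trace, and ('@', t1, t2) is concatenation.

```python
def flatten(t):
    # list the events of a trace in order, mirroring the indexing t[i]
    if t == ():                                  # the empty trace contributes nothing
        return []
    if isinstance(t, tuple) and t and t[0] == '@':
        return flatten(t[1]) + flatten(t[2])     # concatenation: left then right
    return [t]                                   # a single event

def equiv(t1, t2):
    # Lemma 4: t1 and t2 are equivalent iff they flatten to the same sequence
    return flatten(t1) == flatten(t2)

def canonical(t):
    # canonical form: left-nested concatenation (...((e1 @ e2) @ e3)... @ en)
    events = flatten(t)
    if not events:
        return ()
    c = events[0]
    for e in events[1:]:
        c = ('@', c, e)
    return c
```

Two differently associated traces with the same events flatten identically and share one canonical form.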
We shall define a trace t belongs to a trace pattern T , under a memory M , denoted by t ∈ T [M ] as follows: 187  ∈ [M ] o ∈ o[M ] M(x) = (n,L), n ∈ Nat read(x, n) ∈ Read(x)[M ] t1 ∈ T1[M ] t2 ∈ T2[M ] t1@t2 ∈ T1@T2[M ] t ∈ T [M ] T ∼L T ′ t ∈ T ′[M ] Now, we prove the most important lemma for t ∈ T [M ]: Lemma 9. t ∈ T [M ] if and only if |t| = |T | and ∀i.t[i] ∈ (T [i])[M ]. Proof. “⇒” Suppose |t| = |T | and ∀i.t[i] ∈ (T [i])[M ]. We prove by induction on #(t). If #(t) = 1, then the conclusion is trivial. Assume the conclusion holds for all #(t′) < n, now suppose #(t) = n > 1. Then we know t = t1@t2. If t1 = , then we know |t2| = |t| = |T | and ∀i.t2[i] = t[i] ∈ (T [i])[M ], by induction, we know t2 ∈ T [M ]. Furthermore, we have t1 =  ∈ [M ], therefore t1@t2 ∈ @T [M ]. Since @T ∼L T , we have t = t1@t2 ∈ T [M ]. A similar argument shows that if t2 = , then we also have t ∈ T [M ]. Now let us consider when |t1 |= 0. By induction, we have t1 ∈ [M ] and t2 ∈ T [M ], and then again, we have t ∈ T [M ]. Similarly, if |t2 |= 0, we also have t ∈ T [M ]. Now assume |t1| > 0 and |t2| > 0, and suppose T1 = (...(T1@T2)...@T|t1|) and T2 = (...(T|t1|+1@T|t1|+2)...@T|T |). Then by induction, we know that t1 ∈ T1[M ] and t2 ∈ T2[M ], and thus t1@t2 ∈ T1@T2[M ]. According to Lemma 5, we have T1@T2 ∼L T , and thus t = t1@t2 ∈ T [M ]. “⇐” We prove by induction on how many steps to derive t ∈ T [M ]. Suppose we need only 1 step, then one of the following four conditions is true: (i) t =  = T ; 188 (ii) t = o = T ; (iii) t = read(x, n), T = Read(x) and M(x) = n. In either case, the conclusion is trivial. Then suppose we need n step, and the last step is derived from t = t1@t2, T = T1@T2, and t1 ∈ T2[M ] and t2 ∈ T2[M ]. Then by induction we have |t1| = |T1|, |t2| = |T2|, ∀i.t1[i] ∈ (T1[i])[M ], and ∀i.t2[i] ∈ (T2[i])[M ]. For i < 1 or i > |T |, then t[i] =  = T [i], and thus t[i] ∈ (T [i])[M ]. 
If 1 ≤ i ≤ |T1|, then t[i] = t1[i] and T [i] = T1[i], and by induction, we have t[i] ∈ (T [i])[M ]; if |T1| < i ≤ |T |, then t[i] = t2[i− |t1]] and T [i] = T2[i− |T1]], and by induction, we have t[i] ∈ (T [i])[M ]. Finally, suppose we need n step, and the last step is derived from t ∈ T ′[M ] and T ′ ∼L T . Then according to Lemma 5, we know that ∀i.T ′[i] = T [i], which also implies that |T ′| = |T |. By induction, we have |t| = |T ′| and ∀i.t[i] ∈ (T ′[i])[M ], and therefore, we have ∀i.t[i] ∈ (T [i])[M ] and |t| = |T |. We have the following corollaries. Corollary 1. If M1 ∼L M2, and t ∈ T [M1], then t ∈ T [M2]. Proof. By Lemma 9, we only need to show that ∀i.t[i] ∈ (T [i])[M2]. Let us prove by structural induction on how t ∈ T [M ] is derived. If t =  = T , or t = o = T , or t = t1@t2 and T = T1@T2, then the conclusion is trivial. The only condition we need to prove is when t = read(x, n), and T = Read(x). If so, since t ∈ T [M1], therefore M1(x) = (n,L). Since M1 ∼L M2, we know that M2(x) = (n,L). Therefore, we have t = read(x, n) ∈ Read(x)[M2] = T [M2]. According to the definition of T [i], we know it is in one of the following three forms: , o, or Read. If T [i] = , then we know i < 1 or i > |T | = |t1|. Therefore 189 t[i] = , and thus t[i] ∈ (T [i])[M2]. If T [i] = o, then we know t[i] = o. In both situations, we have t[i] ∈ (T [i])[M2]. Finally, if T [i] = Read(x), then we know t[i] = read(x, n) where n = M1[x]. Since M1 ∼L M2, we have M2[x] = n, and thus t[i] ∈ (T [i])[M2]. Corollary 2. If t1 ∈ T [M ] and t2 ∈ T [M ], then t1 ≡ t2. Proof. Assume t1 ∈ T [M ], and t2 ∈ T [M ], according to Lemma 9, we have |t1| = |T | = |t2|, ∀i.t1[i] ∈ (T [i])[M ], and ∀i.t2[i] ∈ (T [i])[M ]. According to the definition of T [i], we know it is in one of the following three forms: , o, or Read. If T [i] = , then we know i < 1 or i > |T | = |t1| = |t2|. Therefore t1[i] = t2[i] = . If T [i] = o, then we know t1[i] = t2[i] = o. 
Finally, if T [i] = Read(x), then we know t1[i] = read(x, n1), n1 = M [x], t2[i] = read(x, n2), and n2 = M [x]. Therefore n1 = n2, and thus t1[i] = t2[i]. Therefore ∀i.t1[i] = t2[i], and according to Lemma 4, we have t1 ≡ t2. Then we have the following lemmas: Lemma 10. Suppose Γ ` e : τ ;T , T ∼L T ′ for some T ′, and memory M is Γ-valid. If 〈M, e〉 ⇓t n, then t ∈ T [M ]. Proof. We prove by structural induction on e. If e is n, then T =  = t. If e is x, then T = evt(lab(Γ(x)),Read(x)). If lab(Γ(x)) = l ∈ ORAMBanks, then t = l ∈ l[M ]. If lab(Γ(x)) = L, then T = Read(x), and t = read(x, n), where M(x) = (n,L). According to the definition, we know t ∈ T [M ]. If e is e1 op e2, then suppose 〈M, ei〉 ⇓ti ni and Γ ` ei : li;Ti for i = 1, 2. 190 Then according to the induction assumption, we have ti ∈ Ti[M ] for i = 1, 2. Since t = t1@t2, and T = T1@T2, we know t ∈ T [M ]. Next we consider x[e′]. Suppose Γ ` e′ : Nat l′;T ′, and 〈M, e′〉 ⇓t′ n′, then T = T ′@evt(lab(Γ(x)),Readarr(x)), and t = t′@evt(lab(Γ(x)), readarr(x, n′, n′′)) for some n′′. Moreover, we have t′ ∈ T ′[M ] by induction. Since T ∼L T ′, we know lab(Γ(x)) ∈ ORAMBanks. Therefore t = t′@lab(Γ(x)) ∈ T ′@lab(Γ(x))[M ] = T [M ]. Lemma 11. Assume Γ, l0 ` S;T , T ∼L T ′ for some T ′, and l0 ∈ ORAMBanks, and M is a Γ-valid memory. If 〈M,S〉 ⇓t M ′, then t ∈ T [M ]. Proof. We prove by structural induction on the statement S. Since l0 6= L, therefore we know S cannot be a while statement. If S is skip, then T =  = t. Let us consider when S is x := e. Then 〈M, e〉 ⇓t′ n′, and Γ ` e : τ ;T ′, and T = T ′@evt(lab(Γ(x)),Write(x)). Since T ∼L T , T does not contain Write(x), and thus lab(Γ(x)) ∈ ORAMBanks. Therefore t = t′@lab(Γ(x)) ∈ T ′@lab(Γ(x))[M ] by Lemma 10. Next, suppose S is x[e1] = e2. Suppose 〈M, ei〉 ⇓ti ni, and Γ ` ei : τ ;Ti for i = 1, 2 by induction. Then ti ∈ Ti[M ] for i = 1, 2. Similar to the discussion for x := e, we know lab(Γ(x)) ∈ ORAMBanks, and thus t = t1@t2@lab(Γ(x)) ∈ T1@T2@lab(Γ(x))[M ]. 
Next, let us consider (if)(e, S1, S2). Then Γ, l0 ` Si;Ti for i = 1, 2, and T1 ∼L T2. As well Γ ` e : τ ;Te, 〈M, e〉 ⇓te ne, idx = ite(ne, 1, 2), and 〈M,Sidx〉 ⇓tidx M ′. Then T = Te@T1, and te ∈ Te[M ]. If idx = 1, then 〈M,S1〉t1M ′, and thus 191 t1 ∈ T1[M ]. Therefore t = te@t1 ∈ Te@T1[M ] = T [M ]. Similarly, if idx = 2, then 〈M,S2〉t2M ′, and thus t2 ∈ T2[M ]. Therefore t2 ∈ T1[M ]. As a conclusion t = te@t2 ∈ Te@T1[M ] = T [M ]. Finally, suppose S is S1;S2. Then we know Γ, l0 ` Si;Ti for i = 1, 2, 〈M,S1〉 ⇓t1 M ′, and 〈M ′, S2〉 ⇓t2 M ′′. Since l0 ∈ ORAMBanks, we know M ∼L M ′ ∼L M ′′. By induction assumption, we know t1 ∈ T1[M ], and t2 ∈ T2[M ′]. Since M ∼L M ′, according to Corollary 1, we know t2 ∈ T2[M ]. Therefore t = t1@t2 ∈ T1@T2[M ] = T [M ]. Lemma 12. Suppose Γ, l0 ` Si;Ti, for i = 1, 2, where l0 ∈ ORAMBanks, and T1 ∼L T2. Given two Γ-valid low-equivalent memories M1, M2, if 〈M1, S1〉 ⇓t1 M ′1, and 〈M2, S2〉 ⇓t2 M ′2, then M ′1 ∼L M ′2, and t1 ≡ t2. Proof. According to Lemma 11, we know that ti ∈ Ti[Mi] for i = 1, 2. According to Lemma 8, we know that M ′1 ∼L M1 and M ′2 ∼L M2. Since M1 ∼L M2, we know that M ′1 ∼L M1 ∼L M2 ∼L M ′2. Because t1 ∈ T1[M1], and M1 ∼L M2, therefore t1 ∈ T1[M2]. Furthermore, since T1 ∼L T2, we have t1 ∈ T2[M2]. Finally, since t2 ∈ T2[M2], and according to Corollary 2 we have t1 ≡ t2. Now we are ready to prove Theorem 1. Proof of Theorem 1. We extend this conclusion by considering both normal state- ment and labeled statement, and shall prove by induction on the statement s. For notational convention, we suppose 〈M1, s〉 ⇓t1 M ′1, and 〈M2, s〉 ⇓t2 M ′2, and thus Γ, l0 ` S;T with M1 ∼L M2 and both are Γ-valid. Our goal is prove t1 ≡ t2, and M1 ∼L M2. 192 If s is skip, it is obvious. Suppose s is x := e, then Γ ` e : Nat l;T . Suppose 〈Mi, e〉 ⇓t′i ni, for i = 1, 2. According to Lemma 6 and Lemma 7, we know t′1 = t ′ 2. 
If lab(Γ(x)) ∈ ORAMBanks, then M ′1 = M1[x 7→ (n1, l1)] ∼L M1 ∼L M2 ∼L M2[x 7→ (n2, l2)] = M ′2, and t1 = t ′ 1@lab(Γ(x)) = t ′ 2@lab(Γ(x)) = t2, which implies t1 ≡ t2. If Γ(x) = Nat L, then we know Γ ` e : Nat L;T , and according to Lemma 6, we have n1 = n2. Then we also have M ′ 1 = M1[x→ n1] ∼L M2[x→ n2] = M ′2, and t1 = t ′ 1@read(x, n1) ≡ t′2@read(x, n2) = t2. Next, suppose s is x[e1] := e2. Suppose 〈Mi, ej〉 ⇓tij nij for i = 1, 2, j = 1, 2. If lab(Γ(x)) = L, then we know Γ ` ej : Nat L, for j = 1, 2. Then by Lemma 6, we have t1j = t2j, which implies t1j ≡ t2j, n1j = n2j, and according to the definition of Γ-validity and low-equivalence, ∀i.M1(x)(i) = M2(x)(i). Therefore t1 = t11@t21@writearr(x, n11, n12) ≡ t21@t22@writearr(x, n21, n22) = t2, and M ′1 = M1[x→M1(x)[n11 → n12]] ∼L M2[x→M2(x)[n21 → n22]] = M ′2. Otherwise, if Γ(x) ∈ ORAMBanks, suppose Γ ` ei : Nat li;Ti, for i = 1, 2. Then we know l0 unionsq l1 unionsq l2 v lab(Γ(x)). Therefore, by Lemma 7, based on the same reasoning as above for Nat l case, we have t1 = t11@t21@Γ(x) ≡ t21@t22@Γ(x) = t2. Furthermore, M ′1 = M1[x → m1] ∼L M1 ∼L M2 ∼L M2[x → m2] = M ′2 for some two mappings, m1 and m2. Then suppose the statement is if(e, S1, S2). There are two situations. If Γ ` e : Nat le;Te, where le unionsq l0 ∈ ORAMBanks, then according to Lemma 12, we know M ′1 ∼L M ′2, and t1 ≡ t2. Otherwise, we have le = L and l0 = L. Suppose 〈Mi, e〉 ⇓t′i ni, for i = 1, 2, then according to Lemma 6, we know t′1 = t′2, which 193 implies t′1 ≡ t′2, and n1 = n2. If ite(n1, 1, 2) = 1, then we know 〈M1, S1〉 ⇓t′′1 M ′1 and 〈M2, S1〉 ⇓t′′2 M ′2. Therefore t1 = t′1@t′′1 ≡ t′1@t′′2 = t2, and M ′1 ∼L M ′2 by induction. We can show the conclusion for ite(n1, 1, 2) = 2 similarly. Next, let us consider the statement while(e, S). We know Γ ` e : Nat L;T , therefore there exists a constant n, and a trace t, such that 〈Mi, e〉 ⇓t, n for both i = 1, 2, by Lemma 6. 
We prove by induction on how many steps applying the S-WhileT rule and S-WhileF rule (WHILE rules for short) to derive 〈while(e, S),M1〉 ⇓t1〉. If we only apply one time, then we must apply S-WhileF rule, and thus n = 0. Then we have t1 = t = t2, and M ′ 1 = M1 ∼L M2 = M ′2. Suppose the conclusion is true when we need to apply n − 1 steps of WHILE rules, now let us consider when we need to apply n > 0 steps. Then we know n 6= 0. Suppose 〈Mi, S〉 ⇓ti1 Mi1, and 〈Mi1,while(e, S)〉 ⇓ti2 M ′i , for i = 1, 2. Then we know that we need to apply n− 1 steps of WHILE rules to derive 〈M11,while(e, S)〉 ⇓t12 M ′1. By induction, we have t11 = t21, t12 = t22, and M11 ∼L M21. Therefore M ′1 ∼L M ′2, and t1 = t11@t12 = t21@t22 = t2. Finally, let us consider S1;S2. Suppose Γ, l0 ` S1;T1, Γ, l0 ` S2;T2, 〈Mi, S1〉 ⇓ti1 Mi1, and 〈Mi1, S2〉 ⇓t21 M ′i . Then by induction assumption, we have t11 = t21, t12 = t22, M11 ∼L M12, and thus M ′1 ∼L M ′2. Therefore t1 = t11@t12 = t21@t22 = t2. 194 Appendix B: Proof of Theorem 2 Our proof proceeds in two steps. First, we prove a terminating version (The- orem 6) of Theorem 2. By having the assumption that the program will finally terminate, i.e. its execution does not end in an infinite loop, we will show how our type system enforces the MTO property. Then by using Theorem 6, we will show the MTO property holds for non-terminating programs, (i.e. Theorem 7), which implies Theorem 2 as an obvious corollary. We start with the terminating case. We first prove some useful lemmas for terminating programs. Lemma 13. For all I, `,Υ, Sym, Υ′, Sym′, T , such that ` ` I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉;T if there is some i such that I(i) = jmp n, then 0 ≤ i+ n ≤ |I|. Proof. We prove by induction on ` ` I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉;T It is clearly that rule T-LOAD, T-STORE, T-LOADW, T-STOREW, T-IDB, T- 195 BOP, T-ASSIGN, T-NOP cannot derive I. 
If this judgement is derived using rule T-SEQ, then we know I = I1; I2, and we have ` ` I1 : 〈Υ, Sym〉 → 〈Υ′′, Sym′′〉;T1 ` ` I2 : 〈Υ′′, Sym′′〉 → 〈Υ′, Sym′〉;T2 Then either i < |I1|, or |I1| ≤ i < |I1| + |I2| = |I|. For the first case, we have I1(i) = jmp n, and thus by induction, we have 0 ≤ i + n ≤ |I1| < |I|. For the second case, we have I2(i − |I1|) = jmp n, and thus by induction assumption, we have 0 ≤ i− |I1| ≤ |I2|. Therefore, we have 0 < |I1| ≤ i ≤ |I1|+ |I2| = |I|. If this judgement is derived by rule T-IF, then we know I = ι1; It; ι2; If , and ι1 = br r1 rop r2 ↪→ n1 ι2 = jmp n2 n1 − 2 = |It| n2 = |If |+ 1 Then, there are three possible scenarios: 1. 1 ≤ i ≤ 1+|It|. In this case, we know It(i−1) = jmp n, and 0 ≤ i−1+n ≤ |It|. Therefore 0 < 1 ≤ i+ n ≤ |It|+ 1 < |I|; 2. i = 1 + |It|. In this case, I(i) = ι2. Therefore, we have 0 < i + n2 = 196 2 + |It|+ |If | = |I|; 3. 2 + |It| ≤ i < 2 + |It| + |If | = |I|. In this case, we can prove the conclusion similarly to Case 1. If this judgement is derived by rule T-WHILE, then we can prove the conclu- sion similarly to the T-IF case. Finally, if the judgement is derive by rule T-SUB, then the result follows by induction. Lemma 14. For all I, `,Υ, Sym, Υ′, Sym′, T , such that ` ` I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉;T if there is some i such that I(i) = br r1 rop r2 ↪→ n, then 0 ≤ i+ n ≤ |I|. Proof (sketch). The proof is similar to the proof for Lemma 13. Lemma 15. Given I = I1; I2; I3, Ri, Si,Mi, pci for i = 1, ..., k + 1, and ti for i = 1, ..., k, |I1| ≤ pci < |I1|+ |I2| ∀i ∈ {1, ..., k} |I1| ≤ pck+1 ≤ |I1|+ |I2| and |ti| = 1 ∀i ∈ {1, ..., k}. Then I ` (Ri, Si,Mi, pci)→ti (Ri+1, Si+1,Mi+1, pci+1) 197 holds true for i = 1, ..., k, if and only if I2 ` (Ri, Si,Mi, pci − |I1|)→ti (Ri+1, Si+1,Mi+1, pci+1 − |I1|) holds true for i = 1, ..., k. Proof. We only prove the only-if-direction, and the if-direction is similar. We prove by induction on k. We first prove for k = 1. Consider each possibility for I(pc1). Case ldb k ← l[r]. 
By rule LOAD, we know n = |R(r) mod size(l) | b = M(l, n) S2 = S1[k 7→ (b, (l, n))] t1 = select(l, read(n, b), eread(n), l) R2 = R1 M2 = M1 pc2 = pc1 + 1 Clearly, we have pc2 − |I1| = pc1 − |I1| + 1, and I2(pc1 − |I1|) = ldb k ← l[r], and thus I2 ` (R1, S1,M1, pc1 − |I1|)→t1 (R2, S2,M2, pc2 − |I1|) Cases stb k, k ← idb r, ldw r1 ← k[r2], stw r1 → k[r2], r1 ← r2 op r3, r1 ← n, or nop are similar. 198 Case jmp n′. We have R2 = R1 S2 = S1 M2 = M1 pc2 = pc1 + n ′ Since pc2 − |I1 |= pc1 − |I1| + n′, then the conclusion is obvious. We can prove similarly for I(pc1) = br r1 rop r2 ↪→ n′. Now, we assume the conclusion holds true for k ≤ k′. For k = k′+1, We know I2 ` (Ri, Si,Mi, pci − |I1|)→ti (Ri+1, Si+1,Mi+1, pci+1 − |I1|) holds true for i = 1, ..., k′ by the induction assumption. Further, since the assump- tion says the conclusion holds for k = 1, thus I2 ` (Rk′ , Sk′ ,Mk′ , pck′ − |I1|)→tk′ (Rk′+1, Sk′+1,Mk′+1, pck′+1 − |I1|) Therefore, the conclusion holds true. Using Lemma 13, 14, 15, we can prove the following important lemma. Lemma 16. Given I = Ia; I ′; Ib, where any of Ia and Ib can be empty, `, `′, Υ1,Υ ′ 1,Υ2,Υ ′ 2, Sym1, Sym ′ 1, Sym2, Sym ′ 2, T, T ′, such that ` ` I : 〈Υ1, Sym1〉 → 〈Υ2, Sym2〉;T 199 and `′ ` I ′ : 〈Υ′1, Sym′1〉 → 〈Υ′2, Sym′2〉;T ′ If for pc = |Ia| and pc′, where pc′ < |Ia| or pc′ ≥ |Ia|+ |I ′|, such that I ` (R, S,M, pc)→t (R′, S ′,M ′, pc′), then there exists R′′, S ′′,M ′′, pc′′, t′, t′, such that t = t′@t′′ (where t′′ can be ), and I ` (R, S,M, pc)→t′ (R′′, S ′′,M ′′, pc′′) I ` (R′′, S ′′,M ′′, pc′′)→t′′ (R′, S ′,M ′, pc′) pc′′ = |Ia|+ |I ′| t ≡ t′@t′′ |t′| > 0 |t′′| ≥ 0 We provide an intuitive explanation of this lemma as follows. It says suppose a type-checked program I contains a type-checked segment I ′, and if the the program runs from the start of the segment I ′ (i.e. pc = |Ia|), and ends outside the segment (i.e. pc′ < |Ia| or pc′ ≥ |Ia|+ |I ′|), then it must stop at the end of the segment (i.e. pc′′ = |Ia|+ |I ′|) first. 
Notice that the correctness of this lemma does not depend on whether Ia or Ib type-checks. Further, none, either, or both of Ia and Ib may be empty; in the last case, the conclusion is trivial. Now we prove this lemma.

Proof. Suppose |t| = k + 1, and

I ⊢ (Ri, Si, Mi, pci) →ti (Ri+1, Si+1, Mi+1, pci+1)

for i = 0, ..., k, where |ti| = 1 for all i ∈ {0, ..., k}, and

R0 = R    S0 = S    M0 = M    pc0 = pc
Rk+1 = R′    Sk+1 = S′    Mk+1 = M′    pck+1 = pc′

Suppose i⋆ is the smallest index such that pci⋆ < |Ia| or pci⋆ ≥ |Ia| + |I′|. Then, by Lemma 15, we know

I′ ⊢ (Ri, Si, Mi, p̂ci) →ti (Ri+1, Si+1, Mi+1, p̂ci+1)

for i = 0, ..., i⋆ − 2, where p̂ci = pci − |Ia|. Clearly, 0 ≤ p̂ci⋆−1 < |I′| by the definition of i⋆. Now, we consider the instruction I′(p̂ci⋆−1). If it is neither jmp n′ nor br r1 rop r2 ↪→ n′, then we know pci⋆ = pci⋆−1 + 1 > |Ia|, and therefore

pci⋆ ≥ |Ia| + |I′|

Further, since pci⋆−1 − |Ia| = p̂ci⋆−1 < |I′|, we have

pci⋆ ≤ |Ia| + |I′|

Therefore pci⋆ = |Ia| + |I′|, and thus R′′ = Ri⋆, S′′ = Si⋆, M′′ = Mi⋆, pc′′ = pci⋆, t′ = t0@...@ti⋆−1, and t′′ = ti⋆@...@tk satisfy the property.

If I′(p̂ci⋆−1) = jmp n′, then by Lemma 13, we have

0 ≤ p̂ci⋆−1 + n′ ≤ |I′|

Therefore, we have

pci⋆ = pci⋆−1 + n′ = p̂ci⋆−1 + |Ia| + n′ ≤ |Ia| + |I′|

and

pci⋆ = pci⋆−1 + n′ = p̂ci⋆−1 + |Ia| + n′ ≥ |Ia|

Therefore, by the same argument as above, we know pci⋆ = |Ia| + |I′|, and the conclusion holds. Finally, if I′(p̂ci⋆−1) = br r1 rop r2 ↪→ n′, we can prove the conclusion similarly using Lemma 14.

Now we can state and prove Theorem 6. Some judgments mentioned in the theorem are defined in Figure B.1.

Theorem 6. Given a program I in LT, such that ℓ ⊢ I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T,

mem((b, (l, n))) = l    idx((b, (l, n))) = n    block((b, (l, n))) = b

∀k ∈ BlockIDs.
Υ(k) = mem(S1(k)) = mem(S2(k))    Υ(k) = D ⇒ S1(k) = S2(k)    Υ(k) = E ⇒ idx(S1(k)) = idx(S2(k))
⟹  Υ ⊢ S1 ∼ S2

∀r ∈ Registers. Υ(r) = L ⇒ R1(r) = R2(r)  ⟹  Υ ⊢ R1 ∼ R2

Judgment (S, sv) ⇓ v:

(S, n) ⇓ n
mem(S(k)) = l = D    (S, sv) ⇓ v2    v = block(S(k))(v2)  ⟹  (S, Ml[k, sv]) ⇓ v
(S, sv1) ⇓ v1    (S, sv2) ⇓ v2    v = v1 aop v2  ⟹  (S, sv1 aop sv2) ⇓ v

Figure B.1: Well-formedness judgments for the proof of Memory-Trace Obliviousness of LGhostRider

two memories M1, M2, two register mappings R1, R2, and two scratchpad mappings S1, S2, if the following assumptions are satisfied:

1. M1 ∼L M2;
2. Υ ⊢ R1 ∼ R2;
3. Υ ⊢ S1 ∼ S2;
4. ∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym(r)) ∧ ⊢ok Sym(r) ⇒ (Si, Sym(r)) ⇓ Ri(r);
5. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym(k)) ∧ ⊢ok Sym(k) ⇒ ∃n. (Si, Sym(k)) ⇓ n ∧ |n mod size(Υ(k))| = idx(Si(k));

and

I ⊢ (R1, S1, M1, 0) →t1 (R′1, S′1, M′1, pc′)
I ⊢ (R2, S2, M2, 0) →t2 (R′2, S′2, M′2, pc′′)

where pc′ = pc′′ = |I|, then we have the following conclusions:

1. M′1 ∼L M′2;
2. Υ′ ⊢ R′1 ∼ R′2;
3. Υ′ ⊢ S′1 ∼ S′2;
4. ∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym(r)) ∧ ⊢safe Sym′(r) ⇒ (S′i, Sym′(r)) ⇓ R′i(r);
5. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym(k)) ∧ ⊢safe Sym′(k) ⇒ ∃n. (S′i, Sym′(k)) ⇓ n ∧ |n mod size(Υ′(k))| = idx(S′i(k));
6. t1 ≡ t2.

We first provide an intuition of this theorem: running a type-checked program from two low-equivalent initial configurations preserves low-equivalence of the final configurations and, crucially, produces equivalent traces (conclusion 6), which is the memory-trace obliviousness property.

Proof of Theorem 6. We prove Theorem 6 by induction on the length of the derivation of

ℓ ⊢ I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T

Case T-SEQ. Then I = I1; I2, and by inversion, we have

ℓ ⊢ I1 : 〈Υ, Sym〉 → 〈Υ1, Sym1〉; T1
ℓ ⊢ I2 : 〈Υ1, Sym1〉 → 〈Υ′, Sym′〉; T2

By Lemma 15 and Lemma 16, we know

I1 ⊢ (R1, S1, M1, 0) →t1¹ (R′′1, S′′1, M′′1, |I1|)
I2 ⊢ (R′′1, S′′1, M′′1, 0) →t1² (R′1, S′1, M′1, |I2|)

Similarly, we have

I1 ⊢ (R2, S2, M2, 0) →t2¹ (R′′2, S′′2, M′′2, |I1|)
I2 ⊢ (R′′2, S′′2, M′′2, 0) →t2² (R′2, S′2, M′2, |I2|)

By the induction assumption, we have

1. M′′1 ∼L M′′2;
2. Υ1 ⊢ R′′1 ∼ R′′2;
3. Υ1 ⊢ S′′1 ∼ S′′2;
4.
∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym1(r)) ∧ ⊢safe Sym1(r) ⇒ (S′′i, Sym1(r)) ⇓ R′′i(r);
5. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym1(k)) ∧ ⊢safe Sym1(k) ⇒ ∃n. (S′′i, Sym1(k)) ⇓ n ∧ |n mod size(Υ1(k))| = idx(S′′i(k));
6. t1¹ ≡ t2¹.

By the induction assumption again, we have conclusions 1–5, and t1² ≡ t2². Therefore, we know t1 = t1¹@t1² ≡ t2¹@t2² = t2, which is conclusion 6.

Case T-LOAD. Then I = ldb k ← l[r], and pc′ = pc′′ = 1. Further, by inversion, we know, for i = 1, 2,

ni = |Ri(r) mod size(l)|    bi = Mi(l, ni)    S′i = Si[k ↦ (bi, (l, ni))]
ti = select(l, read(ni, bi), eread(ni), l)

We prove that conclusions 1–6 hold. Conclusion 1 holds trivially, since the memories are not changed, i.e. M′1 = M1 ∼L M2 = M′2; conclusions 2 and 4 hold because the registers are not changed.

We prove conclusion 3 as follows. First, we know Υ′(k) = l = mem(S′1(k)) = mem(S′2(k)). If Υ′(k) = D or Υ′(k) = E, then we know l ∉ ORAMBanks, and thus Υ(r) = L, which implies that idx(S′1(k)) = n1 = n2 = idx(S′2(k)). Further, if Υ′(k) = l = D, then we know b1 = M1(D, n1) = M2(D, n2) = b2 (due to M1 ∼L M2), which implies S′1(k) = S′2(k). Therefore, we know Υ′ ⊢ S′1 ∼ S′2.

For conclusion 5, since for all k′ ≠ k we have Sym′(k′) = Sym(k′) and S′i(k′) = Si(k′) (i = 1, 2), the conclusion holds for those blocks. For k itself, if ⊢safe Sym′(k) does not hold, then the conclusion is trivial. Now assume ⊢safe Sym′(k), which holds if and only if ⊢safe Sym(r). By assumption 4, we know (Si, Sym(r)) ⇓ Ri(r). Since |Ri(r) mod size(Υ′(k))| = |Ri(r) mod size(l)| = ni, conclusion 5 holds.

For conclusion 6, if l = D, then following the discussion for conclusion 3, we know n1 = n2 and b1 = b2, and thus t1 = read(n1, b1) = read(n2, b2) = t2. If l = E, then similarly we know n1 = n2, and thus t1 = eread(n1) = eread(n2) = t2. If l ∈ ORAMBanks, then we know t1 = l = t2. Thus conclusion 6 holds.

Case T-STORE. If I = stb k, and I is typed using rule T-STORE, then we have pc′ = pc′′ = 1.
Further, we know, for i = 1, 2,

(bi, ai) = Si(k)    ai = (li, ni)    M′i = Mi[ai ↦ bi]
ti = select(li, write(ni, bi), ewrite(ni), li)

Conclusions 2–5 are trivial, since the registers and scratchpads are not changed. We first prove conclusion 6. By assumption 3, we know l1 = l2. If l1 = D, then we know t1 = write(n1, b1) and t2 = write(n2, b2). By assumption 4, we know n1 = n2 and b1 = b2, therefore t1 = t2. If l1 = E, then by assumption 5, we know n1 = n2; therefore t1 = ewrite(n1) = ewrite(n2) = t2. Finally, if l1 ∈ ORAMBanks, then t1 = l1 = l2 = t2.

Now we prove conclusion 1. The only difference between M′i and Mi is the value at ai = (li, ni). To show that M′1 ∼L M′2, we only need to show that if l1 = l2 = D, then n1 = n2 and b1 = b2; this is implied by assumption 4. Therefore, conclusion 1 holds.

Case T-LOADW. If I = ldw rx ← k[ry], and I is typed using rule T-LOADW, then we have pc′ = pc′′ = 1. Further, we know, for i = 1, 2,

(bi, ai) = Si(k)    ni = |Ri(ry) mod size(bi)|    R′i = Ri[rx ↦ bi(ni)]

Since the scratchpads and memories are not changed, conclusions 1, 3, and 5 trivially hold. Further, since t1 = f = t2, conclusion 6 is also true. We only need to prove conclusions 2 and 4.

For conclusion 2, if Υ′(rx) = H, the conclusion is trivial. If Υ′(rx) = L, then we know l = D, and by rule T-LOADW, we know Υ(ry) = L, which implies, by assumption 2, n1 = n2. Further, since l = D, by assumption 3, we know b1 = b2. Therefore R′1(rx) = b1(n1) = b2(n2) = R′2(rx).

For conclusion 4, all we need to show is that, for i = 1, 2, either (ℓ = H ∨ ⊢const Ml[k, Sym(ry)]) ∧ ⊢safe Ml[k, Sym(ry)] does not hold, or (Si, Ml[k, Sym(ry) mod size(bi)]) ⇓ bi(ni) holds. First of all, ⊢const Ml[k, Sym(ry)] never holds. Without loss of generality, we suppose ⊢safe Ml[k, Sym(ry)] and ℓ = H; then we know l = D and ⊢safe Sym(ry). Therefore we have (Si, Sym(ry) mod size(bi)) ⇓ Ri(ry) mod size(bi) = ni. Further, by conclusion 3, we know Υ′(k) = mem(Si(k)) = D.
Combining this with bi = block(Si(k)), we have

(Si, Ml[k, Sym(ry) mod size(bi)]) ⇓ bi(ni)

Case T-IDB. If I = k ← idb r, and I is typed using rule T-IDB, then we have pc′ = pc′′ = 1. Further, we know, for i = 1, 2,

(bi, (li, ni)) = Si(k)    R′i = Ri[r ↦ ni]

Conclusions 1, 3, 5, and 6 hold trivially. Conclusions 2 and 4 are implied by assumptions 3 and 5, respectively.

Case T-STOREW. If I = stw rx → k[ry], and I is typed using rule T-STOREW, then we have pc′ = pc′′ = 1. Further, we know, for i = 1, 2,

(bi, ai) = Si(k)    ni = |Ri(ry) mod size(bi)|    S′i = Si[k ↦ (bi[ni ↦ Ri(rx)], ai)]

Since the registers and memories are not changed, conclusions 1, 2, and 4 hold. Further, since t1 = f = t2, we know conclusion 6 is true. We now prove conclusions 3 and 5. Clearly, we only need to consider Υ′(k) and Sym′(k). Suppose ai = (li, idxi); then we know Υ′(k) = Υ(k) = l1 = l2. Further, if Υ′(k) = Υ(k) = D, then we know Υ(rx) = Υ(ry) = L, which, by assumption 2, implies R1(rx) = R2(rx) and R1(ry) = R2(ry). Therefore n1 = n2. Since Υ(k) = D, by assumption 3, we have S1(k) = S2(k), and thus b1 = b2. Therefore we have S′1(k) = (b1[n1 ↦ R1(rx)], a1) = (b2[n2 ↦ R2(rx)], a2) = S′2(k). Finally, if Υ′(k) = Υ(k) = E, then by assumption 5, we know idx1 = idx(S1(k)) = idx(S2(k)) = idx2, and thus idx(S′1(k)) = idx1 = idx2 = idx(S′2(k)). Therefore, conclusion 3 is true.

For conclusion 5, first suppose ℓ = H and ⊢safe Sym′(k). Since Sym′(k) = Sym(k), we know ⊢safe Sym(k), and thus by assumption 5, we know that for some n1, n2, (Si, Sym′(k)) ⇓ ni and |ni mod size(Υ(k))| = idxi. Further, since ℓ = H, we know slab(Υ(k)) = H, and thus mem(Si(k)) = li ≠ D for i = 1, 2.

Now we prove that if (S1, sv) ⇓ v, then (S′1, sv) ⇓ v, by induction on sv. If sv = Ml[k′, sv1], then we know mem(S1(k′)) = l = D and (S1, sv1) ⇓ v2. Since mem(S1(k)) ≠ D, we know k ≠ k′. Therefore S′1(k′) = S1(k′), and thus by the induction assumption, we have (S′1, sv) ⇓ v. Next, if sv = n, then the conclusion holds trivially.
If sv = sv1 aop sv2, then we know (S1, sv1) ⇓ v1, (S1, sv2) ⇓ v2, and v = v1 aop v2; therefore (S′1, sv) ⇓ v by induction. Similarly, we can prove that if (S2, sv) ⇓ v, then (S′2, sv) ⇓ v. Therefore, we know (S′i, Sym′(k)) ⇓ ni and |ni mod size(Υ′(k))| = idxi = idx(S′i(k)), which means conclusion 5 holds. Similarly, if ⊢const Sym′(k) and ⊢safe Sym′(k), we can also prove conclusion 5 easily.

Case T-BOP. If I = r1 ← r2 aop r3, and I is typed using rule T-BOP, then we have pc′ = pc′′ = 1. Further, we know, for i = 1, 2,

ni = Ri(r2) aop Ri(r3)    R′i = Ri[r1 ↦ ni]

Conclusions 1, 3, 5, and 6 are trivial. For conclusion 2, we only need to prove that if Υ′(r1) = Υ(r2) ⊔ Υ(r3) = L, then R′1(r1) = n1 = n2 = R′2(r1). To see this, since Υ(r2) ⊔ Υ(r3) = L, we know Υ(r2) = Υ(r3) = L. Therefore, by assumption 2, we know R1(r2) = R2(r2) and R1(r3) = R2(r3), and thus

n1 = R1(r2) aop R1(r3) = R2(r2) aop R2(r3) = n2

For conclusion 4, we only need to consider Sym′(r1) = Sym(r2) aop Sym(r3). If ⊢safe Sym′(r1), then we know ⊢safe Sym(r2) and ⊢safe Sym(r3). Further, we know ℓ = H, or both ⊢const Sym(r2) and ⊢const Sym(r3). Therefore, by assumption 4, we know

(Si, Sym(r2)) ⇓ Ri(r2)    (Si, Sym(r3)) ⇓ Ri(r3)    for i = 1, 2

Therefore, we have

(Si, Sym′(r1)) ⇓ Ri(r2) aop Ri(r3) = ni = R′i(r1)    i = 1, 2

and conclusion 4 holds.

Case T-ASSIGN. If I = r ← n, and I is typed using rule T-ASSIGN, then we have pc′ = pc′′ = 1. Further, we know R′i = Ri[r ↦ n] for i = 1, 2. As in the T-BOP case, conclusions 1, 3, 5, and 6 are trivial. For conclusion 2, we know R′1(r) = R′2(r) = n. For conclusion 4, we have (Si, n) ⇓ n for i = 1, 2.

Case T-NOP. If I = nop, and I is typed using T-NOP, then the case is trivial, because Mi = M′i, Ri = R′i, and Si = S′i for i = 1, 2, and Υ = Υ′ and Sym = Sym′.

Case T-SUB. If I is typed using T-SUB, then we know

ℓ ⊢ ι : 〈Υ, Sym〉 → 〈Υ′′, Sym′′〉; T

where Υ′′ ≼ Υ′ and Sym′′ ≼ Sym′.
Without loss of generality, we assume ℓ ⊢ ι : 〈Υ, Sym〉 → 〈Υ′′, Sym′′〉; T is not itself derived by rule T-SUB (since ≼ is transitive). By induction, we have the following results:

1. M′1 ∼L M′2;
2. Υ′′ ⊢ R′1 ∼ R′2;
3. Υ′′ ⊢ S′1 ∼ S′2;
4. ∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′′(r)) ∧ ⊢safe Sym′′(r) ⇒ (S′i, Sym′′(r)) ⇓ R′i(r);
5. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′′(k)) ∧ ⊢safe Sym′′(k) ⇒ ∃n. (S′i, Sym′′(k)) ⇓ n ∧ |n mod size(Υ′′(k))| = idx(S′i(k));
6. t1 ≡ t2.

Therefore, conclusions 1 and 6 follow from results 1 and 6. For conclusion 2, if Υ′(r) = L, then since Υ′′(r) ⊑ Υ′(r), we know Υ′′(r) = L, and thus R′1(r) = R′2(r). Conclusion 3 naturally holds, since Υ′′(k) = Υ′(k). For conclusion 4, since Sym′′ ≼ Sym′, we know Sym′(r) = ? or Sym′(r) = Sym′′(r). In the former case, ⊢safe Sym′(r) does not hold, and thus conclusion 4 is vacuously true. In the latter case, conclusion 4 follows directly from result 4. Similarly, we can prove conclusion 5 as well.

Case T-LOOP. If I is typed using rule T-LOOP, then we know I = Ic; ι1; Ib; ι2, and

ℓ ⊢ Ic : 〈Υ, Sym〉 → 〈Υ′, Sym′〉
ℓ ⊢ Ib : 〈Υ′, Sym′〉 → 〈Υ, Sym〉

We prove the conclusion by induction on the number of times that instruction ι1 is executed. First, assume ι1 is executed once. By Lemma 16, we know that

Ic ⊢ (R1, S1, M1, 0) →t1¹ (R′′1, S′′1, M′′1, |Ic|)
I ⊢ (R′′1, S′′1, M′′1, |Ic|) →t1² (R′1, S′1, M′1, pc′)
Ic ⊢ (R2, S2, M2, 0) →t2¹ (R′′2, S′′2, M′′2, |Ic|)
I ⊢ (R′′2, S′′2, M′′2, |Ic|) →t2² (R′2, S′2, M′2, pc′′)
t1 = t1¹@t1²    t2 = t2¹@t2²

By the induction assumption, we have

1. M′′1 ∼L M′′2;
2. Υ′ ⊢ R′′1 ∼ R′′2;
3. Υ′ ⊢ S′′1 ∼ S′′2;
4. ∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′(r)) ∧ ⊢safe Sym′(r) ⇒ (S′′i, Sym′(r)) ⇓ R′′i(r);
5. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′(k)) ∧ ⊢safe Sym′(k) ⇒ ∃n. (S′′i, Sym′(k)) ⇓ n ∧ |n mod size(Υ′(k))| = idx(S′′i(k));
6. t1¹ ≡ t2¹.

Further, we know I(|Ic|) = ι1 = br r1 rop r2 ↪→ n1, where Υ′(r1) ⊔ Υ′(r2) ⊑ L, which implies Υ′(r1) = Υ′(r2) = L.
Therefore, we know R′′1(r1) = R′′2(r1) and R′′1(r2) = R′′2(r2), which implies that if

I ⊢ (R′′i, S′′i, M′′i, |Ic|) →f (R⋆i, S⋆i, M⋆i, pc⋆i)

then pc⋆1 = pc⋆2, which is either |Ic| + 1 or |Ic| + n1 = |I|. If the first case held, then we could show

Ic ⊢ (R1, S1, M1, 0) →t1¹ (R′′1, S′′1, M′′1, |Ic|)
I ⊢ (R′′1, S′′1, M′′1, |Ic|) →f (R′′1, S′′1, M′′1, |Ic| + 1)    (the branch instruction ι1 is not taken)
I ⊢ (R′′1, S′′1, M′′1, |Ic| + 1) →t1³ (R′′′1, S′′′1, M′′′1, |I| − 1)
I ⊢ (R′′′1, S′′′1, M′′′1, |I| − 1) →f (R′′′1, S′′′1, M′′′1, 0)
I ⊢ (R′′′1, S′′′1, M′′′1, 0) →t1⁵ (R′1, S′1, M′1, |I|)

Then, by the same analysis, we know that during I ⊢ (R′′′1, S′′′1, M′′′1, 0) →t1⁵ (R′1, S′1, M′1, |I|), the instruction ι1 is executed at least once, and thus it is executed at least twice in total, which contradicts our assumption that ι1 is executed only once. Therefore, we know pc⋆1 = pc⋆2 = |I|. Therefore, we know t1 = t1¹@f ≡ t2¹@f = t2, and R′i = R′′i, S′i = S′′i, M′i = M′′i (i = 1, 2). In this case, conclusions 1–6 all hold.

Next, we assume the conclusions hold whenever ι1 is executed fewer than u times (for some u > 1), and we consider the case when ι1 is executed u times. By the same analysis as above, we know

Ic ⊢ (R1, S1, M1, 0) →t1¹ (R′′1, S′′1, M′′1, |Ic|)
I ⊢ (R′′1, S′′1, M′′1, |Ic|) →f (R′′1, S′′1, M′′1, |Ic| + 1)    (the branch instruction ι1 is not taken)
I ⊢ (R′′1, S′′1, M′′1, |Ic| + 1) →t1² (R′′′1, S′′′1, M′′′1, |I| − 1)
I ⊢ (R′′′1, S′′′1, M′′′1, |I| − 1) →f (R′′′1, S′′′1, M′′′1, 0)
I ⊢ (R′′′1, S′′′1, M′′′1, 0) →t1³ (R′1, S′1, M′1, |I|)
Ic ⊢ (R2, S2, M2, 0) →t2¹ (R′′2, S′′2, M′′2, |Ic|)
I ⊢ (R′′2, S′′2, M′′2, |Ic|) →f (R′′2, S′′2, M′′2, |Ic| + 1)    (the branch instruction ι1 is not taken)
I ⊢ (R′′2, S′′2, M′′2, |Ic| + 1) →t2² (R′′′2, S′′′2, M′′′2, |I| − 1)
I ⊢ (R′′′2, S′′′2, M′′′2, |I| − 1) →f (R′′′2, S′′′2, M′′′2, 0)
I ⊢ (R′′′2, S′′′2, M′′′2, 0) →t2³ (R′2, S′2, M′2, |I|)
t1 = t1¹@f@t1²@f@t1³    t2 = t2¹@f@t2²@f@t2³

By the induction assumption, we know

1. M′′1 ∼L M′′2, M′′′1 ∼L M′′′2, and M′1 ∼L M′2;
2. Υ ⊢ R′′1 ∼ R′′2, Υ ⊢ R′′′1 ∼ R′′′2, and Υ ⊢ R′1 ∼ R′2;
3. Υ ⊢ S′′1 ∼ S′′2, Υ ⊢ S′′′1 ∼ S′′′2, and Υ ⊢ S′1 ∼ S′2;
4. ∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′′(r)) ∧ ⊢safe Sym′′(r) ⇒ (S′′i, Sym′′(r)) ⇓ R′′i(r);
5. ∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′′′(r)) ∧ ⊢safe Sym′′′(r) ⇒ (S′′′i, Sym′′′(r)) ⇓ R′′′i(r);
6. ∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′(r)) ∧ ⊢safe Sym′(r) ⇒ (S′i, Sym′(r)) ⇓ R′i(r);
7. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′′(k)) ∧ ⊢safe Sym′′(k) ⇒ ∃n. (S′′i, Sym′′(k)) ⇓ n ∧ |n mod size(Υ′′(k))| = idx(S′′i(k));
8. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′′′(k)) ∧ ⊢safe Sym′′′(k) ⇒ ∃n. (S′′′i, Sym′′′(k)) ⇓ n ∧ |n mod size(Υ′′′(k))| = idx(S′′′i(k));
9. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′(k)) ∧ ⊢safe Sym′(k) ⇒ ∃n. (S′i, Sym′(k)) ⇓ n ∧ |n mod size(Υ′(k))| = idx(S′i(k));
10. t1¹ ≡ t2¹, t1² ≡ t2², and t1³ ≡ t2³.

Then conclusions 1–5 follow from the above results, and for conclusion 6, we have t1 = t1¹@f@t1²@f@t1³ ≡ t2¹@f@t2²@f@t2³ = t2.

Case T-IF. Thus I = ι1; It; ι2; If, where ι1 = br r1 rop r2 ↪→ n1. Consider ℓ′. If ℓ′ = L, then we know Υ(r1) = Υ(r2) = L. In this case, by assumption 2, we know R1(r1) = R2(r1) and R1(r2) = R2(r2). Therefore the program counter goes to the same value in both executions. Formally, we have

I ⊢ 〈Ri, Si, Mi, 0〉 →f 〈Ri, Si, Mi, pci〉    i = 1, 2

where pc1 = pc2, which is either 1 or n1 = |It| + 2.
If pc1 = 1, then by Lemma 16, we know

I ⊢ 〈Ri, Si, Mi, 1〉 →t′i 〈R′′i, S′′i, M′′i, |It| + 1〉    i = 1, 2

Further, since I(|It| + 1) = ι2 = jmp n2, we know

I ⊢ 〈R′′i, S′′i, M′′i, |It| + 1〉 →f 〈R′′i, S′′i, M′′i, |I|〉

for i = 1, 2, which implies R′′i = R′i, S′′i = S′i, M′′i = M′i, and ti = f@t′i@f. It is easy to prove, by the induction assumption, that conclusions 1–5 hold, and for conclusion 6, we have

t1 = f@t′1@f ≡ f@t′2@f = t2

Similarly, if pc1 = pc2 = n1, we can also prove the conclusions. Intuitively, this means that if ℓ′ = L, then both executions take the same branch of this if-block, and we can prove the theorem.

Next, we consider when ℓ′ = H. If the branch instruction goes to the same pc in both executions, then based on the above discussion, the conclusions hold. Without loss of generality, we consider when the branch instruction goes to different pcs. We first study the relationship between a trace and a trace pattern, and prove conclusion 6, i.e. t1 ≡ t2. To this end, we first define a new judgment 〈S, M, T〉 →t 〈S′, M′〉 in Figure B.2:

T = read(l, k, sv)    l = D    (S, sv) ⇓ n    b = M(l, n)    S′ = S[k ↦ (b, (l, n))]    t = read(n, b)
⟹  〈S, M, T〉 →t 〈S′, M〉

T = read(l, k, sv)    l = E    (S, sv) ⇓ n    t = eread(n)    S′ = S[k ↦ (?, (l, n))]
⟹  〈S, M, T〉 →t 〈S′, M〉

T = write(l, k, sv)    l = D    S(k) = (b, (l, n))    t = write(n, b)    M′ = M[(l, n) ↦ b]
⟹  〈S, M, T〉 →t 〈S, M′〉

T = write(l, k, sv)    l = E    S(k) = (?, (l, n))    t = ewrite(n)    M′ = M[(l, n) ↦ ?]
⟹  〈S, M, T〉 →t 〈S, M′〉

T = o    t = o  ⟹  〈S, M, T〉 →t 〈S, M〉

T = F    t = f  ⟹  〈S, M, T〉 →t 〈S, M〉

〈S, M, T1〉 →t1 〈S′, M′〉    〈S′, M′, T2〉 →t2 〈S′′, M′′〉  ⟹  〈S, M, T1@T2〉 →t1@t2 〈S′′, M′′〉

t1 ≡ t2    〈S, M, T〉 →t1 〈S′, M′〉  ⟹  〈S, M, T〉 →t2 〈S′, M′〉

Figure B.2: Symbolic execution in LGhostRider

Here we use a special value ? to denote a block whose content we do not care about. By doing so, we can make this definition using only the DRAM part of a memory M.
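To make the replay judgment of Figure B.2 concrete, the following Python sketch replays a trace pattern over a scratchpad and a memory, emitting the resulting trace. It uses a hypothetical encoding: @-concatenation is flattened into a list of events, symbolic values are assumed to be already evaluated to indices, and `None` plays the role of the unknown block ?.

```python
# Sketch of the replay judgment <S, M, T> ->t <S', M'> of Figure B.2
# (hypothetical encoding): S maps block ids to (block, (region, index)),
# M maps (region, index) to blocks, and T is a flat list of events.
def replay(S, M, T):
    """Replay a trace pattern over the DRAM projection, emitting trace t."""
    t = []
    for ev in T:
        kind = ev[0]
        if kind == "read":                 # T = read(l, k, sv)
            _, l, k, n = ev                # sv assumed already evaluated to n
            if l == "D":
                b = M[(l, n)]
                S[k] = (b, (l, n))
                t.append(("read", n, b))
            else:                          # l = E: block content unknown (?)
                S[k] = (None, (l, n))
                t.append(("eread", n))
        elif kind == "write":              # T = write(l, k, sv)
            _, l, k = ev
            b, (l2, n) = S[k]
            M[(l2, n)] = b                 # for E this writes the ? block
            t.append(("write", n, b) if l == "D" else ("ewrite", n))
        else:                              # o (ORAM bank event) or F
            t.append(ev)
    return t
```

Because the replay is deterministic in the public inputs, two runs over equal DRAM projections yield equal traces, which is the content of Lemma 17 below.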
It is easy to see that evaluating a symbolic value requires only the scratchpad and the memory corresponding to DRAM. We define the DRAM projection of a scratchpad and of a memory as follows:

SD(k) = (b, (D, n))    if S(k) = (b, (D, n))
SD(k) = (?, (E, n))    if S(k) = (b, (E, n))
SD(k) undefined        otherwise

The DRAM projection of a scratchpad thus contains not only the DRAM blocks but also the index information for ERAM blocks; all of this information is assumed to be public.

MD(l, n) = M(l, n)    if l = D
MD(l, n) undefined     otherwise

It is easy to see that if 〈S, M, T〉 →t 〈S′, M′〉, then we also have

〈SD, MD, T〉 →t 〈S′D, M′D〉

It is also worth mentioning that two memories are low-equivalent, M1 ∼L M2, if and only if M1D = M2D. Based on these rules, we can prove the following lemmas.

Lemma 17. Given S, M, S′, M′, S′′, M′′, T1, T2, if T1 ≡ T2, and

〈S, M, T1〉 →t1 〈S′, M′〉
〈S, M, T2〉 →t2 〈S′′, M′′〉

then we have t1 ≡ t2, S′ = S′′, and M′ = M′′.

Proof. We consider how T1 ≡ T2 is derived.

Case 1. T1 = read(l, k, sv1) and T2 = read(l, k, sv2). Then we know ⊢safe sv1 and sv1 = sv2. Therefore, if (S, sv1) ⇓ n1 and (S, sv2) ⇓ n2, then n1 = n2. If l = D, then we know b = M(l, n1) = M(l, n2), and thus t1 = read(n1, b) = read(n2, b) = t2; further S′ = S[k ↦ (b, (l, n1))] = S[k ↦ (b, (l, n2))] = S′′, and M′ = M = M′′. If l = E, then t1 = eread(n1) = eread(n2) = t2, S′ = S[k ↦ (?, (l, n1))] = S[k ↦ (?, (l, n2))] = S′′, and M′ = M′′ trivially. It is impossible for l to be an ORAM bank.

Case 2. T1 = write(l, k, sv1) and T2 = write(l, k, sv2). If l = D or l = E, then S(k) = (b, (l, n)) (where b = ? if l = E), and t1 = t2 (both are write(n, b) if l = D, and both are ewrite(n) if l = E). Further M′ = M[(l, n) ↦ b] = M′′, and S′ = S = S′′. It is impossible for l to be an ORAM bank.

Case 3. T1 = T2 = o or T1 = T2 = F; the conclusion is trivial.

Case 5. T1 = Tx@(Ty@Tz) and T2 = (Tx@Ty)@Tz.
In this case, we know

〈S, M, Tx〉 →tx 〈S2, M2〉
〈S2, M2, Ty@Tz〉 →tyz 〈S′, M′〉

Further, we have

〈S2, M2, Ty〉 →ty 〈S3, M3〉
〈S3, M3, Tz〉 →tz 〈S′, M′〉

Therefore we have t1 = tx@(ty@tz). Further, we know

〈S, M, Tx〉 →t′x 〈S⋆, M⋆〉
〈S⋆, M⋆, Ty〉 →t′y 〈S⋆⋆, M⋆⋆〉
〈S⋆⋆, M⋆⋆, Tz〉 →t′z 〈S′′, M′′〉

By the induction assumption, we know S⋆ = S2, M⋆ = M2, and tx ≡ t′x. By the induction assumption again, we have S⋆⋆ = S3, M⋆⋆ = M3, and ty ≡ t′y. Finally, applying the induction assumption once more, we have S′ = S′′, M′ = M′′, and tz ≡ t′z. Therefore we know

t1 = tx@(ty@tz) ≡ (tx@ty)@tz ≡ (t′x@t′y)@t′z ≡ t2

Case 6. T1 = Tx@Ty and T2 = T′x@T′y. Then we know

〈S, M, Tx〉 →tx 〈S1, M1〉
〈S1, M1, Ty〉 →ty 〈S′, M′〉

and

〈S, M, T′x〉 →t′x 〈S2, M2〉
〈S2, M2, T′y〉 →t′y 〈S′′, M′′〉

Since Tx ≡ T′x, we know by the induction assumption that tx ≡ t′x, S1 = S2, and M1 = M2. Applying the induction assumption again, we have ty ≡ t′y, S′ = S′′, and M′ = M′′.

Lemma 18. Given a program I, register mappings R, R′, scratchpads S, S′, and memories M, M′, if

ℓ ⊢ I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T

where ℓ = H and T ≡ T′ (for some T′), R, R′, S, S′, M, M′ satisfy assumptions 1–5 of Theorem 6, and

I ⊢ 〈R, S, M, 0〉 →t 〈R′, S′, M′, |I|〉

then we have

〈SD, MD, T〉 →t 〈S′D, M′D〉

Proof. We prove by induction on how I is typed.

Case T-SEQ. We know I = I1; I2, and

H ⊢ I1 : 〈Υ, Sym〉 → 〈Υ′′, Sym′′〉; T1
H ⊢ I2 : 〈Υ′′, Sym′′〉 → 〈Υ′, Sym′〉; T2

where T = T1@T2. Based on the discussion above, we know

I1 ⊢ 〈R, S, M, 0〉 →t1 〈R′′, S′′, M′′, |I1|〉
I2 ⊢ 〈R′′, S′′, M′′, 0〉 →t2 〈R′, S′, M′, |I2|〉

where t = t1@t2. Therefore, by the induction assumption, we know

〈SD, MD, T1〉 →t1 〈S′′D, M′′D〉
〈S′′D, M′′D, T2〉 →t2 〈S′D, M′D〉

Therefore, we have

〈SD, MD, T1@T2〉 →t1@t2 〈S′D, M′D〉

Case T-LOAD. I = ldb k ← l[r]. If l = D, then T = read(l, k, sv), where sv = Sym(r). Suppose b = M(l, n), where n = R(r). By assumption 2, we know (S, sv) ⇓ n, and therefore (SD, sv) ⇓ n holds as well.
Further, S′ = S[k ↦ (b, (l, n))], and thus S′D = SD[k ↦ (b, (l, n))]. We also know M′D = MD and t = read(n, b). In conclusion, we have

〈SD, MD, T〉 →t 〈S′D, M′D〉

If l = E, then we know T = read(l, k, sv) as well, where sv = Sym(r). By assumption 2, we have (SD, sv) ⇓ n, where n = R(r). Combining this with t = eread(n), M′D = MD, and S′D = SD[k ↦ (?, (l, n))], we get our conclusion.

If l ∈ ORAMBanks, then we know T = l and t = l. Combining this with M′D = MD and S′D = SD, the conclusion is trivial.

Case T-STORE. I = stb k. Suppose S(k) = (b, (l, n)). If l = D or l = E, then SD(k) = (b, (l, n)) (where b = ? if l = E), T = write(l, k, sv), t = write(n, b) if l = D and t = ewrite(n) if l = E, and M′D = MD[(l, n) ↦ b]. Therefore, we get our conclusion. If l ∈ ORAMBanks, then similarly to the T-LOAD case, we get our conclusion.

Case T-STOREW. I = stw r1 → k[r2]. Since ℓ = H, we know slab(k) = H, which implies that block k resides in ERAM or in an ORAM bank. Therefore S′D = SD and M′D = MD. Further, since T = F and t = f, we know

〈SD, MD, T〉 →t 〈S′D, M′D〉

Cases T-LOADW, T-IDB, T-BOP, T-ASSIGN, and T-NOP. In all these rules, T = F and t = f. Further, it is easy to see that S′ = S and M′ = M in all these rules. Therefore, the conclusion 〈SD, MD, T〉 →t 〈S′D, M′D〉 holds.

Case T-SUB. The conclusion is trivial by the induction assumption.

Case T-LOOP. This case is impossible, since T-LOOP requires ℓ = L.

Case T-IF. I = ι1; It; ι2; If, where ι1 = br r1 rop r2 ↪→ n1 and ι2 = jmp n2. Depending on the values of R(r1) and R(r2), the execution jumps to one of the two branches. If the true branch is taken, then we know t = f@t1@f, where

It ⊢ 〈R, S, M, 0〉 →t1 〈R′, S′, M′, |It|〉

Since It is typable, we know, by the induction assumption,

〈SD, MD, Tt〉 →t1 〈S′D, M′D〉

Since

〈SD, MD, F〉 →f 〈SD, MD〉
〈S′D, M′D, F〉 →f 〈S′D, M′D〉

and T = F@Tt@F, we can derive our conclusion.
If the false branch is taken, then we know t = f@t2, where

If ⊢ (R, S, M, 0) →t2 (R′, S′, M′, |If|)

Therefore, we know

〈SD, MD, Tf〉 →t2 〈S′D, M′D〉

Further, since Tt@F ≡ Tf, by Lemma 17, we have t1@f ≡ t2, which implies

f@t1@f ≡ f@t2

Therefore, we have

〈SD, MD, T〉 →f@t2 〈S′D, M′D〉

Now we get back to proving Theorem 6 for the T-IF rule. Recall that I = ι1; It; ι2; If. Without loss of generality, we suppose

I ⊢ (R1, S1, M1, 0) →f (R1, S1, M1, 1)
It ⊢ (R1, S1, M1, 0) →tt (R′1, S′1, M′1, |It|)
I ⊢ (R′1, S′1, M′1, |It| + 1) →f (R′1, S′1, M′1, |I|)

and

I ⊢ (R2, S2, M2, 0) →f (R2, S2, M2, |It| + 2)
If ⊢ (R2, S2, M2, 0) →tf (R′2, S′2, M′2, |If|)

By Lemma 18, we know

〈S1D, M1D, Tt〉 →tt 〈S′1D, M′1D〉
〈S2D, M2D, Tf〉 →tf 〈S′2D, M′2D〉

Since 〈S′1D, M′1D, F〉 →f 〈S′1D, M′1D〉, we have

〈S1D, M1D, Tt@F〉 →tt@f 〈S′1D, M′1D〉

Further, by assumptions 1 and 3, we know S1D = S2D and M1D = M2D. By Lemma 17, we know

tt@f ≡ tf    S′1D = S′2D    M′1D = M′2D

Therefore, we have t1 = f@tt@f ≡ f@tf = t2, which is conclusion 6. Further, the last two assertions show that conclusions 1 and 3 are true. So we only need to prove conclusions 2, 4, and 5.

The first step is to prove conclusion 4. If ⊢safe Sym′(r) does not hold, then conclusion 4 is vacuously true. Otherwise, since ℓ′ = H, we know either ℓ = H or ⊢const Sym(r). In either case, by assumption 4, we know (Si, Sym(r)) ⇓ Ri(r) for i = 1, 2. Then, by induction, we conclude that conclusion 4 holds: if (ℓ′ = H ∨ ⊢const Sym′(r)) ∧ ⊢safe Sym′(r), then (S′i, Sym′(r)) ⇓ R′i(r) (i = 1, 2). Since (ℓ = H ∨ ⊢const Sym′(r)) ∧ ⊢safe Sym′(r) implies (ℓ′ = H ∨ ⊢const Sym′(r)) ∧ ⊢safe Sym′(r), we know conclusion 4 is true. We can prove conclusion 5 similarly.

The next step is to prove conclusion 2. Suppose Υ′(r) = L; then by T-IF (since ℓ′ = H) we know ⊢safe Sym′(r) and ⊢const Sym′(r). Therefore, by conclusion 4, we know (S′1, Sym′(r)) ⇓ R′1(r) and (S′2, Sym′(r)) ⇓ R′2(r). However, since S′1D = S′2D, we know R′1(r) = R′2(r). This means conclusion 4 implies conclusion 2.

Theorem 7.
Given a program I in LT, such that ℓ ⊢ I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T, two memories M1, M2, two register mappings R1, R2, and two scratchpad mappings S1, S2, if the following assumptions are satisfied:

1. M1 ∼L M2;
2. Υ ⊢ R1 ∼ R2;
3. Υ ⊢ S1 ∼ S2;
4. ∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym(r)) ∧ ⊢ok Sym(r) ⇒ (Si, Sym(r)) ⇓ Ri(r);
5. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym(k)) ∧ ⊢ok Sym(k) ⇒ ∃n. (Si, Sym(k)) ⇓ n ∧ |n mod size(Υ(k))| = idx(Si(k));

and

I ⊢ (R1, S1, M1, 0) →t1 (R′1, S′1, M′1, pc′)
I ⊢ (R2, S2, M2, 0) →t2 (R′2, S′2, M′2, pc′′)

with |t1| = |t2|, then we have the following conclusions:

1. M′1 ∼L M′2;
2. Υ′ ⊢ R′1 ∼ R′2;
3. Υ′ ⊢ S′1 ∼ S′2;
4. ∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym(r)) ∧ ⊢safe Sym′(r) ⇒ (S′i, Sym′(r)) ⇓ R′i(r);
5. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym(k)) ∧ ⊢safe Sym′(k) ⇒ ∃n. (S′i, Sym′(k)) ⇓ n ∧ |n mod size(Υ′(k))| = idx(S′i(k));
6. t1 ≡ t2.

Proof. We now prove this non-terminating case, where the two executions need not end at pc = |I|. Again, we suppose

ℓ ⊢ I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T

and

I ⊢ 〈R1, S1, M1, 0〉 →t1 〈R′1, S′1, M′1, pc1〉
I ⊢ 〈R2, S2, M2, 0〉 →t2 〈R′2, S′2, M′2, pc2〉

where |t1| = |t2|. We prove by induction on the derivation of ℓ ⊢ I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T. If it is derived by applying one of the rules T-LOAD, T-STORE, T-LOADW, T-STOREW, T-BOP, T-ASSIGN, or T-NOP, then we know |t1| = |t2| = 1, and the conclusion follows directly from Theorem 6. If it is derived by applying rule T-SUB, then the conclusion is trivial. So we only need to consider the rules T-IF, T-LOOP, and T-SEQ.

Case T-SEQ. Suppose I = I1; I2. We prove by contradiction: without loss of generality, we suppose |t1| = |t2| is minimal such that t1 ≢ t2 or M′1 ≁L M′2.
There are two sub-cases. Either

I1 ⊢ (R1, S1, M1, 0) →t′1 (R′′1, S′′1, M′′1, |I1|)
I2 ⊢ (R′′1, S′′1, M′′1, 0) →t′′1 (R′1, S′1, M′1, pc1 − |I1|)

where pc1 > |I1| and t1 = t′1@t′′1, or

I1 ⊢ (R1, S1, M1, 0) →t1 (R′′1, S′′1, M′′1, pc1)

In the first case, since |t1| is minimal, we know

I1 ⊢ (R2, S2, M2, 0) →t′2 (R′′2, S′′2, M′′2, |I1|)
I2 ⊢ (R′′2, S′′2, M′′2, 0) →t′′2 (R′2, S′2, M′2, pc2 − |I1|)

where pc2 > |I1|. Then, by the induction assumption, we can prove that t′1 ≡ t′2 and t′′1 ≡ t′′2, and thus t1 ≡ t2. In the second case, by the induction assumption, we directly prove that t1 ≡ t2.

Case T-LOOP. Suppose I = Ic; ι1; Ib; ι2. Since Ic is typable, by Lemma 16, we know that one of the following two cases happens:

Ic ⊢ (R1, S1, M1, 0) →t1¹ (R′′1, S′′1, M′′1, |Ic|)
I ⊢ (R′′1, S′′1, M′′1, |Ic|) →t1² (R′1, S′1, M′1, pc′)

or

Ic ⊢ (R1, S1, M1, 0) →t1 (R′1, S′1, M′1, pc′)

where pc′ ≤ |Ic|. In the latter case, the conclusion follows by the induction assumption. In the former case, we similarly have

Ic ⊢ (R2, S2, M2, 0) →t2¹ (R′′2, S′′2, M′′2, |Ic|)
I ⊢ (R′′2, S′′2, M′′2, |Ic|) →t2² (R′2, S′2, M′2, pc′′)

By Theorem 6, we have

1. M′′1 ∼L M′′2;
2. Υ′ ⊢ R′′1 ∼ R′′2;
3. Υ′ ⊢ S′′1 ∼ S′′2;
4. ∀r ∈ Registers, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′(r)) ∧ ⊢safe Sym′(r) ⇒ (S′′i, Sym′(r)) ⇓ R′′i(r);
5. ∀k ∈ BlockIDs, i ∈ {1, 2}. (ℓ = H ∨ ⊢const Sym′(k)) ∧ ⊢safe Sym′(k) ⇒ ∃n. (S′′i, Sym′(k)) ⇓ n ∧ |n mod size(Υ′(k))| = idx(S′′i(k));
6. t1¹ ≡ t2¹.

Further, we know I(|Ic|) = ι1 = br r1 rop r2 ↪→ n1, where Υ′(r1) ⊔ Υ′(r2) ⊑ L, which implies Υ′(r1) = Υ′(r2) = L. Therefore, we know R′′1(r1) = R′′2(r1) and R′′1(r2) = R′′2(r2), which implies that if

I ⊢ (R′′i, S′′i, M′′i, |Ic|) →f (R⋆i, S⋆i, M⋆i, pc⋆i)

then pc⋆1 = pc⋆2, which is either |Ic| + 1 or |Ic| + n1 = |I|. In the latter case, we know t1 = t1¹@f ≡ t2¹@f = t2, and R′i = R′′i, S′i = S′′i, M′i = M′′i (i = 1, 2).
In this case, conclusions 1–6 all hold. In the former case, there are still two sub-cases:

(1)

Ib ⊢ (R′′1, S′′1, M′′1, 0) →t1² (R³1, S³1, M³1, pc1)
Ib ⊢ (R′′2, S′′2, M′′2, 0) →t2² (R³2, S³2, M³2, pc2)

Then by induction, we know t1² ≡ t2², and therefore t1 ≡ t2, and conclusions 1–5 are all true.

(2)

Ib ⊢ (R′′1, S′′1, M′′1, 0) →t1² (R³1, S³1, M³1, |Ib|)
Ib ⊢ (R′′2, S′′2, M′′2, 0) →t2² (R³2, S³2, M³2, |Ib|)
I ⊢ (R³1, S³1, M³1, 0) →t′1 (R′1, S′1, M′1, pc1)
I ⊢ (R³2, S³2, M³2, 0) →t′2 (R′2, S′2, M′2, pc2)

where ti = ti¹@f@ti²@f@t′i (i = 1, 2). Similarly to Theorem 6, we can show that conclusions 1–6 hold in this case.

Case T-IF. Suppose I = ι1; It; ι2; If, where ι1 = br r1 rop r2 ↪→ n1. We need the following lemma.

Lemma 19. If H ⊢ I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T, and

I ⊢ (R, S, M, pc) →t (R′, S′, M′, pc′)

then pc′ > pc.

Proof (sketch). A while statement cannot be typed in a high security context, and while statements are the only construct that jumps or branches backwards; hence, in a high security context, the program counter only increases.

If Υ(r1) = Υ(r2) = L, then we know

I ⊢ (Ri, Si, Mi, 0) →f (Ri, Si, Mi, pc)
I ⊢ (Ri, Si, Mi, pc) →t′i (R′i, S′i, M′i, pci)

where ti = f@t′i for i = 1, 2. In this case, we can prove the conclusions by the induction assumption.

If ℓ ⊔ Υ(r1) ⊔ Υ(r2) = H, then by Lemma 19, we know

I ⊢ (Ri, Si, Mi, 0) →t′′i (R′′i, S′′i, M′′i, |I|)

for i = 1, 2, where t1 and t2 are prefixes of t′′1 and t′′2, respectively. By Theorem 6, we know t′′1 ≡ t′′2. Therefore, we know t1 ≡ t2. Conclusions 1–5 can be proven similarly to Theorem 6.

Proof of Theorem 2. Suppose I is a program, and Υ and Sym satisfy: (1) ∀r. Sym(r) = ? ∧ Υ(r) = L; and (2) ∀k. Sym(k) = ? ∧ Υ(k) = D. Then there exist Υ′ and Sym′ such that

ℓ ⊢ I : 〈Υ, Sym〉 → 〈Υ′, Sym′〉; T

where ℓ = L, and

I ⊢ (R0, S0, M1, 0) →t1 (R1, S1, M′1, pc1)
I ⊢ (R0, S0, M2, 0) →t2 (R2, S2, M′2, pc2)

where M1 ∼L M2, ∀r. R0(r) = 0, ∀k. S0(k) = (b0, (D, 0)), and |t1| = |t2|. Clearly, we have

1. M1 ∼L M2;
2. Υ ⊢ R0 ∼ R0;
3.
Υ ⊢ S0 ∼ S0;
4. ∀r ∈ Registers, k ∈ Blocks. ⊬safe Sym(r) ∧ ⊬safe Sym(k).

Therefore, by applying Theorem 7, we have t1 ≡ t2 and M1′ ∼L M2′. ∎

Appendix C: Proof of Theorem 3

We begin by discussing how to construct simA; the simulator simB is constructed similarly. Since Alice has no view of Bob's local data, nor of the data secret-shared between the two of them, we introduce a special value • denoting values not observable to Alice. Operations over • are defined as follows:

• op v = •    v op • = •    •(v) = •    m(•) = •

We define the following auxiliary functions accordingly:

(selectA(l, t, t′), selectB(l, t, t′)) := select(l, t, t′)
readA(l, v) := v if l ⊑ A, and • otherwise
val(v, l) := v    val(m, l) := m
lab(v, l) := l    lab(m, l) := l

We then define Alice's snapshot of a memory M, denoted M↓A:

Definition 11. Given a memory M, Alice's snapshot of M, denoted M↓A, is the memory such that M↓A(x) = M(x) if M(x) = (v, l) with l ⊑ A, and M↓A(x) = • otherwise.

We further define Alice-similarity of two memories:

Definition 12. Two memories M1 and M2 are Alice-similar, denoted M1 ∼A M2, if and only if M1↓A = M2↓A.

Figure C.1 defines how simA evaluates an expression. The judgment l ⊢A ⟨M, e⟩ ⇓t v says that, given a memory M, the simulator simA evaluates expression e to value v, producing memory trace t. Figure C.2 and Figure C.3 define how simA simulates the instruction and memory traces up to the next declassification. The judgment ⟨M, S⟩ →A^(i,t) ⟨M′, S′⟩ says that, given a statement S and a memory M, simA evaluates S over M and reduces to statement S′ and memory M′, emitting Alice's instruction trace i and memory trace t. The judgment ⟨M, S⟩ →A^*(i,t) ⟨M′, S′⟩ is similar, but additionally requires that the last statement evaluated be a declassification statement.
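The •-propagation rules and the snapshot definitions above can be rendered as a small, purely illustrative Python sketch; the label set {P, A} standing for "labels l with l ⊑ A", and all names here, are assumptions of the sketch rather than part of the formalism:

```python
# Illustrative model of the simulator's value domain for Alice.
# BULLET plays the role of the unobservable value "•"; every operation
# touching it yields it again, mirroring  • op v = v op • = •.
BULLET = object()

# Labels assumed to satisfy l ⊑ A, i.e. observable by Alice.
ALICE_LABELS = frozenset({"P", "A"})

def bullet_op(op, v1, v2):
    """Binary operation lifted to the •-extended value domain."""
    if v1 is BULLET or v2 is BULLET:
        return BULLET
    return op(v1, v2)

def read_A(l, v):
    """read_A(l, v): Alice observes v only if the label l flows to A."""
    return v if l in ALICE_LABELS else BULLET

def snapshot_A(M):
    """M ↓ A: keep (v, l) entries with l ⊑ A; blank out the rest with •."""
    return {x: (v, l) if l in ALICE_LABELS else (BULLET, l)
            for x, (v, l) in M.items()}

def alice_similar(M1, M2):
    """M1 ∼A M2 holds exactly when the two snapshots coincide."""
    return snapshot_A(M1) == snapshot_A(M2)
```

For instance, two memories that differ only in a B-labeled entry are Alice-similar, since both entries collapse to • in the snapshots.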
We emphasize that our rules enforce that the memory over which the program is evaluated must be Γ-compatible.

Figure C.1 consists of the following rules (l ⊢A ⟨M, e⟩ ⇓t v):

Sim-E-Const: l ⊢A ⟨M, n⟩ ⇓ε n.

Sim-E-Var: if lab(M(x)) = l′, v = readA(l, val(M(x))), and t = selectA(l, read(x, v), x), then l ⊢A ⟨M, x⟩ ⇓t v.

Sim-E-Op: if lab(M(xi)) = li and l ⊢A ⟨M, xi⟩ ⇓ti vi for i = 1, 2, t = t1@t2, and v = v1 op v2, then l ⊢A ⟨M, x1 op x2⟩ ⇓t v.

Sim-E-Array: if lab(M(y)) = l′, l ⊢A ⟨M, y⟩ ⇓t1 v, lab(M(x)) = l″, v1 = val(M(x)), t2 = selectA(l, readarr(x, v1, v), x), t = t1@t2, and v2 = readA(l, val(M(x))(v)), then l ⊢A ⟨M, x[y]⟩ ⇓t v2.

Sim-E-Mux: if lab(M(xi)) = Nat li and l ⊢A ⟨M, xi⟩ ⇓ti vi for i = 1, 2, 3, where v = v2 when v1 = 0, v = v3 when v1 ≠ 0, and v = • when v1 = •, and t = t1@t2@t3, then l ⊢A ⟨M, mux(x1, x2, x3)⟩ ⇓t v.

Figure C.1: Operational semantics for simA.

Figure C.2 consists of the following rules (⟨M, S⟩ →A^*(i,t) ⟨M′, S′⟩):

Sim-Declass: if t = y and i = declass(x, y), then ⟨M, O : x := declassl(y)⟩ →A^*(i,t) ⟨M, O : skip⟩.

Sim-Seq: if ⟨M, S1⟩ →A^*(i,t) ⟨M′, S1′⟩, then ⟨M, S1; S2⟩ →A^*(i,t) ⟨M′, S1′; S2⟩.

Sim-Concat: if ⟨M, S⟩ →A^(i,t) ⟨M′, S′⟩ and ⟨M′, S′⟩ →A^*(i′,t′) ⟨M″, S″⟩, then ⟨M, S⟩ →A^*(i@i′, t@t′) ⟨M″, S″⟩.

Figure C.2: Operational semantics for statements in simA (part 1).

The simulator simA(M, S, D1, ..., Dn) runs as follows. Initially, set M1 = M and S1 = S. For each i = 1, ..., n, simA evaluates ⟨Mi, Si⟩ →A^*(ii,ti) ⟨Mi′, Si′⟩ and sets Si+1 = Si′; if Di = ε, it sets Mi+1 = Mi′, and otherwise, with Di = (x, v), it sets Mi+1 = Mi′[x ↦ v]. Finally, simA evaluates ⟨Mn+1, Sn+1⟩ →A^*(i,t) ⟨M′, S′⟩ and returns (i, t).

The following lemma shows that the semantics of simA generates the same memory trace as the semantics of SCVM.

Lemma 20. If l ⊢ ⟨M, e⟩ ⇓(ta,tb) v, Γ ⊢ e : Nat l′, l ⊢A ⟨M′, e⟩ ⇓t v′, M ∼A M′, l′ ⊑ l, and M and M′ are Γ-compatible, then ta ≡ t; moreover, if l ⊑ A then v = v′, and otherwise v′ = •.

Proof. By structural induction on e. If e = x, then Γ(x) = Nat l′. If l ⊑ A, then v′ = val(M′(x)) = val(M(x)) = v, so v = v′; further, t = read(x, v′) = read(x, v) = ta. If l = B, then v′ = • and t = ε = ta.
If l = O, then v′ = • and t = x = ta. If e = n, then t = ε = ta, v′ = n = v, and l = P ⊑ A. If e = x1 op x2, then we know l ⊢A ⟨M′, xi⟩ ⇓ti vi′ and ⟨M, xi⟩ ⇓(ta^i,tb^i) vi for i = 1, 2. By the induction hypothesis, ti ≡ ta^i, and thus t = t1@t2 ≡ ta^1@ta^2 = ta. For the value, suppose Γ(xi) = Nat li for i = 1, 2. If l ⊑ A, then li ⊑ A holds, and by the induction hypothesis vi = vi′ for i = 1, 2; thus v = v1 op v2 = v1′ op v2′ = v′. Otherwise, at least one of v1′ and v2′ is •, and thus v′ = •.

If e = x[y], we first reason about the value. If l ⊑ A, then suppose Γ(y) = Nat l″; by Γ ⊢ x[y] : Nat l′ we have l″ ⊑ l′ ⊑ l ⊑ A. Then v1′ = val(M′(y)) = val(M(y)) = v1. Further, (m′, l) = M′(x) = M(x) = (m, l), and thus v′ = get(m′, v1′) = get(m, v1) = v. If l ⋢ A, then v′ = •. We then reason about the trace. If l ⊑ A, then t = read(y, v1)@readarr(x, v1, v) ≡ read(y, v1′)@readarr(x, v1′, v′) = ta. If l = B, then t ≡ ε ≡ ta. If l = O, then t ≡ y@x ≡ ta. For e = mux(x1, x2, x3), the conclusion follows from an argument very similar to that for x1 op x2. ∎

The following lemma further claims that if an expression has type B, then simulating it generates no instruction or memory trace observable to Alice.

Lemma 21. If Γ ⊢ e : Nat l′, B ⊢A ⟨M, e⟩ ⇓t v, and M is Γ-compatible, then t ≡ ε.

Proof. By structural induction on e. If e = x, then t = ε by rule Sim-E-Var. If e = x1 op x2, suppose Γ ⊢ xi : Nat li for i = 1, 2; then li ⊑ B, so ti ≡ ε and thus t ≡ ε. If e = x[y], the conclusion follows from the fact that B ⊢A ⟨M, y⟩ ⇓ε v and selectA(B, readarr(x, v1, v), x) = ε. If e = mux(x1, x2, x3), then, as for binary operations, t ≡ ε. ∎

Lemma 22. If B ⊢ S and ⟨M, S⟩ →^(ia,ta,ib,tb) ⟨M′, S′⟩, where M is Γ-compatible, then ia ≡ ε, ta ≡ ε, and M ∼A M′.

Proof. By induction on the structure of S. If S = l : skip, then the conclusion is trivial.
If S = l : x := e, then we know l = B and Γ(x) = Nat B. Then ia = ε and M′ ∼A M follow trivially, and by Lemma 21 we have ta ≡ ε.

If S = O : x := oram(y) or S = O : x := declassl(y), then pc is required to be P, so the conclusion holds vacuously.

If S = l : y[x1] := x2, then l = B, and thus ia = ε. Moreover, ta = ta^1@ta^2@ta′, where B ⊢ ⟨M, xi⟩ ⇓(ta^i,tb^i) vi for i = 1, 2, and ta′ = selectA(B, writearr(y, v1, v2), y) = ε. By Lemma 21, ta^1 ≡ ta^2 ≡ ε. In conclusion, we have ta = ta^1@ta^2@ta′ ≡ ε. Finally, for the memory, M′ = M[y ↦ m′] ∼A M.

If S = l : if(x) then S1 else S2, then l = B and Γ, B ⊢ Si for i = 1, 2. Suppose M(x) = (v, B); then ⟨M, S⟩ →^(ε,ε,ib′,tb′) ⟨M, Sc⟩, where c = 1 if v = 1 and c = 2 if v ≠ 1. There are two cases: (1) M′ = M and S′ = Sc, in which case the conclusion is trivial; (2) ⟨M, Sc⟩ →^(ia″,ta″,ib″,tb″) ⟨M′, S′⟩, in which case, by the induction hypothesis, M′ ∼A M, ia″ ≡ ε, and ta″ ≡ ε, so ia = ε@ia″ ≡ ε and ta = ε@ta″ ≡ ε.

For S = l : while(x) do S′, the conclusion can be proven similarly to the if-case. Finally, for S = S1; S2, we know either (1) ⟨S1, M⟩ →^(ia,ta,ib,tb) ⟨S1′, M′⟩, or (2) ⟨S1, M⟩ →^(ia′,ta′,ib′,tb′) ⟨l : skip, M″⟩ and ⟨S2, M″⟩ →^(ia″,ta″,ib″,tb″) ⟨S′, M′⟩, where ia = ia′@ia″ and ta = ta′@ta″. In both cases, the conclusions follow easily. ∎

The following is the main lemma, saying that, when evaluating over Alice-similar memories, simA and SCVM generate the same instruction and memory traces and produce Alice-similar memories.

Lemma 23. If ⟨M, S⟩ →^(ia,ta,ib,tb) ⟨M1, S′⟩ : D, where Γ, P ⊢ S and M ∼A M′, and ⟨M′, S⟩ →A^(i,t) ⟨M1′, S″⟩ (for D = ε) or ⟨M′, S⟩ →A^*(i,t) ⟨M1′, S″⟩ (for D ≠ ε), then S′ = S″, ia ≡ i, and ta ≡ t. If D = ε, then M1 ∼A M1′; otherwise, with D = (x, v), M1 = M1′[x ↦ v].

Proof.
The conclusion S′ = S″ follows by examining the correspondence between each of the E- and S- rules and the Sim- rules. Therefore, we only prove (1) M1 ∼A M1′, (2) ia ≡ i, and (3) ta ≡ t. We proceed by induction on the number of steps L taken before generating the declassification event D.

If L = 0, then we know S = O : x := declassl(y); S2 (or O : x := declassl(y)). Since we assume Γ, P ⊢ S, by typing rule T-Declass we have l ≠ O and Γ(x) = Nat l. First, if l ⊑ A, then D[A] = (x, v), and thus M1′[x ↦ v] ∼A M[x ↦ v] = M1; further, ia = declass(x, y) = i and ta = y = t. Second, if l = B, then M1′ = M′ ∼A M = M1, ia = declass(x, y) = i, and ta = y = t.

We next consider L > 0, so S = S1; S2. Since (Sa; Sb); Sc is equivalent to Sa; (Sb; Sc), in the sense that if ⟨M, (Sa; Sb); Sc⟩ →^(ia,ta,ib,tb) ⟨M′, S′⟩ : D then ⟨M, Sa; (Sb; Sc)⟩ →^(ia′,ta′,ib′,tb′) ⟨M′, S′⟩ : D with ia ≡ ia′, ib ≡ ib′, ta ≡ ta′, and tb ≡ tb′, we may assume S1 is not itself a sequence, so S1 = l : s1. After taking one step, we only need to prove claims (1)–(3); the conclusion then follows by the induction hypothesis. In the following, we consider how this step is executed.

Case l : skip. The conclusion is trivial: ia = ε = i, ta = ε = t, and M1′ = M′ ∼A M = M1.

Case l : x := e. Here ia = l : x := e = i. We then show t ≡ ta. If l ⊑ A, then t ≡ ta follows directly from Lemma 20. If l = B, then by Lemma 21 we have t ≡ ε ≡ ta. If l = O, we consider e case by case: if e = y, then t = y@x = ta; if e = y[z], then t = z@y@x = ta; if e = n, then t = x = ta; if e = y op z, then t = y@z@x = ta; finally, if e = mux(x1, x2, x3), then t = x1@x2@x3@x = ta.

Finally, we prove the memory equivalence. If l ⊑ A, then by Lemma 20, e evaluates to the same value v in the semantics and in the simulator; therefore M1′ = M′[x ↦ v] ∼A M[x ↦ v] = M1. If B ⊑ l, then M1′ = M′ ∼A M ∼A M[x ↦ v] = M1.
Therefore, the conclusion is true.

Case O : x := oram(y). It is easy to see that M1′ = M′ ∼A M ∼A M[x ↦ m] = M1, and i = O : init(x, y) = ia. Suppose Γ(y) = Nat l; then we know l ≠ O. If l ⊑ A, then t = y@x = ta; otherwise l = B, and we know t = x = ta.

Case l : y[x1] := x2. By typing rule T-ArrAss, we know Γ(y) = Array l, Γ(x1) = Nat l1, and Γ(x2) = Nat l2, where l1 ⊑ A and l2 ⊑ A. If l ⊑ A, then ta = read(x1, v1)@read(x2, v2)@writearr(y, v1, v2) = t and ia = l : y[x1] := x2 = i. For the memory, M1′ = M′[y ↦ set(m, v1, v2)] ∼A M[y ↦ set(m, v1, v2)] = M1, where (m, l) = M′(y) = M(y), (v1, l1) = M(x1), and (v2, l2) = M(x2). If l = B, then M1′ = M′ ∼A M ∼A M[y ↦ m′] = M1, i = ε = ia, and t = ε = ta.

Case l : if(x) then S1 else S2. If l = B, then by Lemma 22, M1′ = M′ ∼A M ∼A M1, t ≡ ta, and i ≡ ia. If l ⊑ A, then i = l : if(x) = ia and t = read(x, v) = ta; further, M1′ = M′ ∼A M = M1. Therefore, the conclusion is also true.

Case l : while(x) do S′. The proof is very similar to the if-statement. ∎

Lemma 23 immediately shows that simA simulates the correct traces. Therefore Theorem 3 holds true. Q.E.D.

C.1 Proof of Theorem 4

Theorem 4 is a corollary of the following theorem:

Theorem 8. If Γ, pc ⊢ S, then either S is l : skip, or for any Γ-compatible memory M there exist ia, ta, ib, tb, M′, S′, and D such that ⟨M, S⟩ →^(ia,ta,ib,tb) ⟨M′, S′⟩ : D, M′ is Γ-compatible, and Γ, pc ⊢ S′.

Proof. We prove by induction on the structure of S. If S = l : skip, then the conclusion is trivial. If S = l : x := e, then Γ(x) = Nat l, Γ ⊢ e : Nat l′, and pc ⊔ l′ ⊑ l. We discuss the form of e. If e = x′, then we know Γ(x′) = Nat l′. Since M is Γ-compatible, we know M(x′) = (v, l′), where v ∈ Nat.
Therefore, ⟨M, x′⟩ ⇓(ta,tb) v, where (ta, tb) = select(l, read(x′, v), x′), and thus ⟨M, S⟩ →^(ia,ta′,ib,tb′) ⟨M′, l : skip⟩ : ε, where (ia, ib) = inst(l, x := e), ta′ = ta@ta″, tb′ = tb@tb″, and (ta″, tb″) = select(l, write(x, v), x). Further, M′ = M[x ↦ (v, l)], so M′ is also Γ-compatible, and the conclusion holds true. Similarly, we can prove that if Γ ⊢ e : τ is derived using T-Const, T-Op, T-Array, or T-Mux, then the conclusion is also true.

If S = O : x := declassl(y), then Γ(y) = Nat O, Γ(x) = Nat l with l ≠ O, and pc = P. Since M is Γ-compatible, we know M(y) = (v, O) and M′ = M[x ↦ (v, l)], so M′ is Γ-compatible. Further, ta = y = tb, ia = ib = O : declass(x, y), D = select(l, (x, v), ε), and ⟨M, S⟩ →^(ia,ta,ib,tb) ⟨M′, O : skip⟩, and we know that Γ, P ⊢ O : skip. Therefore the conclusion is true. Similarly, we can prove the conclusion is true for S = O : x := oram(y).

For S = l : y[x1] := x2, we have Γ(y) = Array l, Γ(x1) = Nat l1, Γ(x2) = Nat l2, and pc ⊔ l1 ⊔ l2 ⊑ l. Since M is Γ-compatible, we know M(y) = (m, l), M(x1) = (v1, l1), and M(x2) = (v2, l2); therefore M′ = M[y ↦ (set(m, v1, v2), l)] is also Γ-compatible. Further, (ta′, tb′) = select(l, writearr(y, v1, v2), y), ta = ta^1@ta^2@ta′, tb = tb^1@tb^2@tb′, and (ia, ib) = inst(l, y[x1] := x2). Therefore ⟨M, S⟩ →^(ia,ta,ib,tb) ⟨M′, l : skip⟩, where we can prove Γ, pc ⊢ l : skip easily. Therefore, the conclusion is true.

For S = l : if(x) then S1 else S2, we have Γ(x) = Nat l, so M(x) = (v, l). If v = 1, then ⟨M, S⟩ →^(ia,ta,ib,tb) ⟨M, S1⟩, where (ia, ib) = inst(l, if(x)) and (ta, tb) = select(l, read(x, v), x). Further, we know Γ, l ⊢ S1; since pc ⊑ l, it is easy to prove by induction that Γ, pc ⊢ S1 holds as well. Therefore, the conclusion is true. On the other hand, if v ≠ 1, then ⟨M, S⟩ →^(ia,ta,ib,tb) ⟨M, S2⟩, and we can prove the conclusion likewise.
The proof for S = l : while(x) do S′ is similar to the branching statement, using rules S-While-True and S-While-False. For S = S1; S2, we know Γ ⊢ S1, and the conclusion follows directly from the induction hypothesis by applying rules S-Seq and S-Skip. ∎

Appendix D: The hybrid protocol and the proof of Theorem 5

In this section, we present the hybrid protocol and show that it emulates the ideal-world functionality F. To start, we present the smaller ideal functionalities in G used by the hybrid-world protocol.

1. Fop^(l1,l2) are the ideal functionalities for a binary operation op. They are parameterized by two type labels l1 and l2 from {P, A, B, O}, indicating which party provides the data to the functionality. Suppose the operation is x op y; then l1 and l2 correspond to x and y respectively. If l1 is P, then both Alice and Bob hand in the value of x, and the functionality verifies that the two values are equal. If l1 is A (resp. B), then Alice (resp. Bob) hands in the value of x. If l1 is O, then Alice and Bob each hand in their secret share of x. The value of l2 has the same meaning, but for the data source of y. These ideal functionalities output secret shares of the result to Alice and Bob respectively. For example, Fop^(P,A) accepts inputs x, y from Alice and x from Bob, and returns [v]a to Alice and [v]b to Bob. We denote this as ([v]a, [v]b) = Fop^(P,A)(x@y, x).

2. Fmux^(l1,l2,l3) are the ideal functionalities for the multiplex operation. The three parameters l1, l2, and l3 have the same meaning as above, but correspond to the three inputs of the multiplex operation. These functionalities also return secret shares to Alice and Bob.

3. Foram^x, for each array x, is an interactive oblivious RAM functionality. It supports three operations:
• init_l, which initializes the ORAM with a given array; l is drawn from {P, A, B}. If l is P or A, then Alice hands in her array; if l is B, then Bob hands in his array.
• read, which reads the content at a given index. The index is provided as a pair of secret shares from Alice and Bob. The output is also a pair of secret shares, returned to Alice and Bob respectively.
• write, which writes a value at a given index. It takes four inputs: the secret shares of the index and the secret shares of the value, from Alice and Bob respectively.

4. Fdeclass^l is the declassification functionality, which takes secret shares from Alice and Bob as input and returns the revealed value to the party corresponding to l.

The protocol ΠG is then presented in Figure D.1 and Figure D.2. During the protocol's execution, Alice and Bob consume their instruction and memory traces. Since the memory traces contain all the information about the public memory and their local memories, Alice and Bob store locally only their secret shares [M]A and [M]B and the instruction and memory traces.

Figure D.1 presents the rules for local execution. Since all local and public data to be used in secure computation are contained in the memory traces, Alice and Bob do not maintain their local and public data separately. The rules are of the form (i, t) → (ε, ε), meaning that the instruction trace i and memory trace t are consumed. Each rule consumes one local instruction, i.e., an instruction whose security label l ≠ O, together with its corresponding memory trace. It is not hard to verify the following proposition:

Proposition 1. Assume ⟨M, S⟩ →^(ia,ta,ib,tb) ⟨M′, S′⟩ : ε, and let s be a statement in the set {x := e, x[x] := x, if(x), while(x)}. If ia = l : s, where l ≠ O, then (ia, ta) → (ε, ε); if ib = l : s, where l ≠ O, then (ib, tb) → (ε, ε).

Note that the local execution rules only handle the execution of a single instruction; sequences of instructions are handled by rules H-LocalA, H-LocalB, and H-Seq, explained later.

Figure D.2 presents two parts.
The first part consists of rules of the form ⟨[M]A, ta, [M]B, tb, e⟩ ⇓ ([v]a, [v]b), which securely evaluate an expression e. Here [M]A and [M]B are the mappings from variables to their secret shares, and ta and tb are the memory traces from Alice and Bob respectively. All the information needed to evaluate e is contained in [M]A, [M]B, ta, and tb. The result is of the form ([v]a, [v]b), where [v]a and [v]b are the secret shares of the result for Alice and Bob respectively. The rules require that ta and tb be the memory traces generated by the ideal functionality F when evaluating e.

Rule SE-Const deals with a constant expression n: ([v]a, [v]b) is obtained by secret-sharing n, implemented as F+^(A,B)(n, 0).

Rule SE-Var secret-shares the value of a variable expression x. The value of x can be extracted from [M]A, [M]B, ta, or tb according to Γ(x). If Γ(x) is P or A, then ta = read(x, v), and [v]a and [v]b can be computed using F+^(A,B)(v, 0). If Γ(x) is B, the computation is similar, but Bob hands in his value v. If Γ(x) is O, then [M]A(x) and [M]B(x) are returned directly.

Rule SE-Op handles a binary operation x op y, which is computed directly by the binary-operation functionality Fop^(Γ(x),Γ(y)). The inputs to the functionality are [M]A⟨ta⟩ and [M]B⟨tb⟩, defined as follows. Suppose [M] is a mapping from variables to secret shares and t is a trace. Then [M]⟨t⟩ is defined inductively by

[M]⟨read(x, v)⟩ = v
[M]⟨x⟩ = [M](x)
[M]⟨t1@t2⟩ = [M]⟨t1⟩@[M]⟨t2⟩

Notice that [M]⟨t⟩ is defined only over read(x, v), x, and their concatenations; this is because the notion is used for binary operations and multiplexing, where array read and write events do not occur. Rule SE-Mux for the multiplex operation is similar.

For an array expression y[x], there are two rules, SE-ArrVar and SE-L-ArrVar. If Γ(y) = O, then evaluating y[x] is an ORAM read operation.
Rule SE-ArrVar calls the ORAM functionality Foram^y to get the secret shares ([v]a, [v]b). Otherwise, y[x] can be computed locally, and rule SE-L-ArrVar handles this case.

The second part of the rules (Figure D.2) constitutes the hybrid protocol. These rules are of the form ⟨[M]A, ia, ta⟩, ⟨[M]B, ib, tb⟩ ⇝ ⟨[M′]A, ia′, ta′⟩, ⟨[M′]B, ib′, tb′⟩ : D, meaning that Alice and Bob, keeping their shares of the secret variables ([M]A and [M]B respectively), execute over their simulated traces (ia and ta for Alice, ib and tb for Bob), evaluate to new shares [M′]A and [M′]B and new traces ia′, ta′, ib′, and tb′, and generate a declassification D, which is either ε or (da, db).

Rule H-Assign deals with the instruction O : x := e. The traces must be of the form ta = ta′@x and tb = tb′@x, where (ta′, tb′) are the memory traces for Alice and Bob to evaluate e. This rule first evaluates the expression e to get [v]a and [v]b, and then updates the mapping for x in [M]A and [M]B accordingly.

Rule H-ORAM handles the ORAM-initialization instruction O : init(x, y). Either Alice's or Bob's memory trace must be readarr(y, 0, m(0))@...@readarr(y, l, m(l))@x, where l = |m| − 1. From this trace, one party is able to reconstruct the memory m, which is then fed into the ORAM functionality Foram^x to initialize it.

Rule H-ArrAss handles the instruction O : y[x1] := x2. First, the secret shares obtained by evaluating xi are [v]a^i and [v]b^i for i = 1, 2 respectively; they are then fed into the ORAM functionality Foram^y to perform a write operation.

Rule H-Cond-While handles O : if(x) and O : while(x), which only consume the corresponding memory trace x and do not modify [M]A or [M]B.

Rule H-Declass handles the instruction O : declass(x, y), which is the only instruction generating a non-empty declassification. According to rule S-Declass, both memory traces are y.
It calls the declassification functionality Fdeclass^Γ(x)([M]A(y), [M]B(y)) to release the value v to the party corresponding to Γ(x).

The rules discussed above handle a single instruction each; a proposition similar to Proposition 1 holds for the hybrid rules. We start by introducing the notion of consistency of a pair of secret-share mappings with a memory:

Definition 13. Given a type environment Γ, we say a pair of secret-share mappings [M]A and [M]B is consistent with a Γ-compatible memory M if and only if, for all x such that Γ(x) = O, M(x) = Fdeclass^P([M]A(x), [M]B(x)).

Now we are ready to present the following proposition.

Proposition 2. Assume ⟨M, P⟩ →^(ia,ta,ib,tb) ⟨M′, P′⟩ : ε. We use s to denote an element of the set {x := e, x[x] := x, if(x), while(x), init(x, y), declass(x, y)}. If ia = ib = O : s, and [M]A and [M]B are consistent with M, then ⟨[M]A, ia, ta⟩, ⟨[M]B, ib, tb⟩ ⇝ ⟨[M′]A, ia′, ta′⟩, ⟨[M′]B, ib′, tb′⟩ : D, and [M′]A and [M′]B are consistent with M′.

The remaining four rules deal with multiple instructions. H-Seq and H-Concat are analogous to S-Seq and S-Concat; H-LocalA and H-LocalB are used to execute local and public instructions.

We first show that the hybrid protocol ΠG generates the same declassification events; this is easily proved by induction, leveraging Proposition 2. We then show that the hybrid protocol ΠG securely emulates the ideal-world functionality F (Theorem 5). Suppose Alice is the semi-honest adversary; Bob's case is symmetric. The adversary against ΠG learns ia, ta, a sequence of secret-share mappings [M]A, [M′]A, ..., and the declassification events DA^1, DA^2, .... In the ideal world, the adversary learns all the declassification events DA^1, ..., and from them it can simulate ia and ta. Further, the secret-share mappings [M]A, [M′]A, ... are indistinguishable from random bits.
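To make this last claim concrete, here is a hypothetical Python sketch of additive secret sharing in the spirit of functionalities such as F+ and Fdeclass: each individual share is uniformly random modulo 2^32 and hence carries no information on its own, while the pair of shares reconstructs the value. The modulus and all function names are assumptions of this sketch, not the protocol's actual encoding.

```python
# Additive secret sharing over Z_{2^32} -- an illustrative stand-in for
# the secret shares [v]_a, [v]_b used throughout the hybrid protocol.
import secrets

MOD = 2 ** 32

def share(v):
    """Split v into shares ([v]_a, [v]_b) with [v]_a + [v]_b = v (mod 2^32).
    Alice's share is uniformly random, so alone it reveals nothing about v."""
    va = secrets.randbelow(MOD)
    vb = (v - va) % MOD
    return va, vb

def declass(va, vb):
    """Model of F_declass: reconstruct the value from both shares."""
    return (va + vb) % MOD

def f_add(xa, xb, ya, yb):
    """Model of F_+ on shared inputs: each party adds its own shares,
    yielding fresh shares of x + y without revealing x or y."""
    return (xa + ya) % MOD, (xb + yb) % MOD
```

For example, sharing 10 and 32 and adding the shares component-wise yields a pair of shares that declassifies to 42, even though neither party's view alone determines either input.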
Therefore, the adversary in real world can securely simulates the hybrid world’s adversary. 252 〈M,S〉 (i,t)−−→A 〈M ′, S′〉 Sim-Skip 〈M, l : skip;S〉 (,)−−→A 〈M,S〉 Sim-ORAM lab(M(x)) = O lab(M(y)) = l t = select(l, y, y)@x i = l : init(x, y) 〈M, O : x := oram(y)〉 (i,t)−−→A 〈M, O : skip〉 Sim-ArrAss l `A 〈M,xj〉 ⇓tj vj j = 1, 2 lab(M(y)) = l l v A⇒M ′ = M [y 7→ set(val(M(y)), v1, v2)] B v l⇒M ′ = M t = t1@t2@selectA(l,writearr(y, v1, v2), y) i = selectA(l, l : y[x1] := x2, l : y[x1] := x2) 〈M, l : y[x1] := x2〉 (i,t)−−→A 〈M ′, l : skip〉 Sim-Assign lab(M(x)) = l l `A 〈M, e〉 ⇓t′ v l v A⇒M ′ = M [x 7→ (v, l′)] B v l⇒M ′ = M t = t′@selectA(l,write(x, v), x) i = selectA(l, l : x := e, l : x := e) 〈M, l : x := e〉 (i,t)−−→A 〈M ′, l : skip〉 Sim-Cond l `A 〈M,x〉 ⇓t′ v v = 1⇒ c = 1 v 6= 1⇒ c = 2 i = selectA(l, l : if(x), l : if(x)) t = selectA(l, t ′, t′) l v A⇒ S′ = Sc l = B⇒ S′ = P : skip 〈M, l : if(x)then S1else S2〉 (i,t)−−→A 〈M,S′〉 Sim-While-True lab(M(x)) = l l v A l `A 〈M,x〉 ⇓t v v 6= 0 S = l : while(x)do S′ 〈M,S〉 (l:while(x),t)−−−−−−−−→A 〈M,S′;S〉 Sim-While-False lab(M(x)) = l l v A l `A 〈M,x〉 ⇓t v v = 0 S = l : while(x)do S′ 〈M,S〉 (l:while(x),t)−−−−−−−−→A 〈M, P : skip〉 Sim-While-Ignore S = B : while(x)do S′ 〈M,S〉 (,)−−→A 〈M, l : skip〉 Figure C.3: Operational semantics for statements in simA (part 2) 253 〈[M ]A, ta, [M ]B, tb, e〉 ⇓ ([v]a, [v]b) SE-Const ([v]a, [v]b) = F (A,B)+ (n, 0) 〈[M ]A, , [M ]B, , n〉 ⇓ ([v]a, [v]b) SE-Op ([v]a, [v]b) = F (Γ(x),Γ(y))op ([M ]A〈ta〉, [M ]B〈tb〉) 〈[M ]A, ta, [M ]B, tb, x op y〉 ⇓ ([v]a, [v]b) SE-Var (ta, tb) = select(Γ(x), read(x, v), x) Γ(x) v A⇒ ([v]a, [v]b) = F (A,B)+ (v, 0) Γ(x) = B⇒ ([v]a, [v]b) = F (A,B)+ (0, v) Γ(x) = O⇒ ([v]a, [v]b) = ([M ]A(x), [M ]B(x)) 〈[M ]A, ta, [M ]B, tb, x〉 ⇓ ([v]a, [v]b) SE-Mux ([v]a, [v]b) = F (Γ(x),Γ(y),Γ(z))mux ([M ]A〈ta〉, [M ]B〈tb〉) 〈[M ]A, ta, [M ]B, tb,mux(x, y, z)〉 ⇓ ([v]a, [v]b) SE-ArrVar ta = t ′ a@y tb = t ′ b@y〈[M ]A, t′a, [M ]B, t′b, x〉 ⇓ ([v′]a, [v′]b) ([v]a, [v]b) = Fyoram(read, [v′]a, [v′]b) 〈[M 
]A, ta, [M ]B, tb, y[x]〉 ⇓ ([v]a, [v]b) SE-L-ArrVar ta = t ′ a@t ′′ a tb = t ′ b@t ′′ b Γ(y) 6= O (t′′a, t′′b ) = select(Γ(y), readarr(y, v1, v), y) Γ(y) v A⇒ ([v]a, [v]b) = F (A,B)+ (v, 0) Γ(y) = B⇒ ([v]a, [v]b) = F (A,B)+ (0, v) 〈[M ]A, ta, [M ]B, tb, y[x]〉 ⇓ ([v]a, [v]b) Figure D.1: Hybrid Protocol piG (Part I) 254 〈[M ]A, i, ta〉, 〈[M ]B, i, tb〉 〈[M ′]A, , 〉, 〈[M ]′B, , 〉 : D H-ORAM i = O : init(x, y) ta = t ′ a@x tb = t ′ b@x (t′a, t ′ b) = select(Γ(y), arr(y,m)) Fxoram(initΓ(y),m) 〈[M ]A, i, ta〉, 〈[M ]B, i, tb〉 〈[M ]A, , 〉, 〈[M ]B, , 〉 :  H-Assign i = O : x := e 〈[M ]A, t′a, [M ]B, t′b, e〉 ⇓ ([v]a, [v]b) ta = t ′ a@x tb = t ′ b@x [M ′]A = [M ]A[x 7→ [v]a] [M ′]B = [M ]A[x 7→ [v]b] 〈[M ]A, i, ta〉, 〈[M ]B, i, tb〉 〈[M ′]A, , 〉, 〈[M ′]B, , 〉 :  H-ArrAss i = O : y[x1] := x2 ta = t ′ a@y tb = t ′ b ta = t1a@t2a@y tb = t1b@t2b@y (tia, tib) = select(Γ(xi), read(xi, vi), xi) 〈[M ]A, t′ia, [M ]B, t′ib, xi〉 ⇓ ([v]ia, [v]ib) i = 1, 2 Fyoram(write, [v]1a, [v]1b, [v]2a, [v]2b) 〈[M ]A, i, ta〉, 〈[M ]B, i, tb〉 〈[M ]A, , 〉, 〈[M ]B, , 〉 :  H-Cond-While i = O : if(x) or i = O : while(x) ta = tb = x 〈[M ]A, i, ta〉, 〈[M ]B, i, tb〉 〈[M ]A, , 〉, 〈[M ]B, , 〉 :  H-Declass i = O : declass(x, y) ta = tb = y v = FΓ(x)declass([M ]A(y), [M ]B(y)) D = select(Γ(x), (x, v), (x, v)) 〈[M ]A, i, ta〉, 〈[M ]B, i, tb〉 〈[M ]A, , 〉, 〈[M ]B, , 〉 : D H-Seq 〈[M ]A, i′a, t′a〉, 〈[M ]B, i′b, t′b〉 〈[M ′]A, , 〉, 〈[M ′], , 〉 : D ia = i ′ a@i ′′ a ib = i ′ b@i ′′ b ta = t ′ a@t ′′ a tb = t ′ b@t ′′ b 〈[M ]A, ia, ta〉, 〈[M ]B, ib, tb〉 〈[M ′]A, i′′a, t′′a〉, 〈[M ′]B, i′′b , t′′b 〉 : D H-Concat 〈[M ]A, ia, ta〉, 〈[M ]B, ib, tb〉 〈[M ′]A, i′a, t′a〉, 〈[M ′]B, i′b, t′b〉 :  〈[M ′]A, i′a, t′a〉, 〈[M ′]B, i′b, t′b〉 〈[M ′′]A, i′′a, t′′a〉, 〈[M ′′]B, i′′b , t′′b 〉 : D 〈[M ]A, ia, ta〉, 〈[M ]B, ib, tb〉 〈[M ′′]A, i′′a, t′′a〉, 〈[M ′′]B, i′′b , t′′b 〉 : D H-LocalA (i, t)→ ia = i@i′a ta = t@t′a 〈[M ]A, ia, ta〉, 〈[M ]B, ib, tb〉 〈[M ]A, i′a, t′a〉, 〈[M ]B, ib, tb〉 :  H-LocalB (i, t)→ ib = i@i′b tb = t@t′b 〈[M ]A, ia, ta〉, 〈[M ]B, ib, 
tb〉 〈[M ]A, ia, ta〉, 〈[M ]B, i′b, t′b〉 :  Figure D.2: Hybrid Protocol ΠG(PartII) 255 Bibliography [1] http://humangenomeprivacy.org/2015. [2] Graphlab. http://graphlab.org. [3] Rsa distributed credential protection. http://www.emc.com/security/ rsa-distributed-credential-protection.htm. [4] Trusted computing group. http://www.trustedcomputinggroup.org/. [5] Private communication, 2014. [6] Johan Agat. Transforming out timing leaks. In POPL, 2000. [7] B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equality of variables in programs. In In POPL, 1988. [8] Benny Applebaum. Garbling xor gates “for free” in the standard model. In TCC, 2013. 256 [9] Aslan Askarov, Sebastian Hunt, Andrei Sabelfeld, and David Sands. Termination-insensitive noninterference leaks more than just a bit. In ES- ORICS, 2008. [10] Gilles Barthe, Tamara Rezk, and Martijn Warnier. Preventing timing leaks through transactional branching instructions. Electron. Notes Theor. Comput. Sci., 153(2):33–55, May 2006. [11] Mihir Bellare, Viet Tung Hoang, Sriram Keelveedhi, and Phillip Rogaway. Ef- ficient Garbling from a Fixed-Key Blockcipher. In S & P, 2013. [12] Marina Blanton, Aaron Steele, and Mehrdad Alisagari. Data-oblivious graph algorithms for secure computation and outsourcing. In ASIA CCS, 2013. [13] Dan Bogdanov, Sven Laur, and Jan Willemson. Sharemind: A Framework for Fast Privacy-Preserving Computations. In ESORICS. 2008. [14] Justin Brickell and Vitaly Shmatikov. Privacy-preserving graph algorithms in the semi-honest model. In ASIACRYPT, 2005. [15] Ran Canetti. Security and composition of multiparty cryptographic protocols. Journal of Cryptology, 2000. [16] Henry Carter, Benjamin Mood, Patrick Traynor, and Kevin Butler. Secure outsourced garbled circuit evaluation for mobile devices. In Usenix Security Symposium, 2013. 257 [17] Seung Geol Choi, Jonathan Katz, Ranjit Kumaresan, and Hong-Sheng Zhou. On the security of the “free-xor” technique. In TCC, 2012. 
[18] Richard Chow, Philippe Golle, Markus Jakobsson, Elaine Shi, Jessica Staddon, Ryusuke Masuoka, and Jesus Molina. Controlling data in the cloud: out- sourcing computation without outsourcing control. In ACM Cloud Computing Security Workshop (CCSW), pages 85–90, 2009. [19] Michael R. Clarkson, Stephen Chong, and Andrew C. Myers. Civitas: Toward a secure voting system. In IEEE Symposium on Security and Privacy (Oakland), 2008. [20] COMPCERT: Compilers you can formally trust. http://compcert.inria. fr/. [21] Convey Computer. The convey HC2 architectural overview. http: //www.conveycomputer.com/files/4113/5394/7\097/Convey_HC-2_ Architectual_Overview.pdf. [22] Bart Coppens, Ingrid Verbauwhede, Koen De Bosschere, and Bjorn De Sut- ter. Practical mitigations for timing-based side-channel attacks on modern x86 processors. In Proceedings of the 2009 30th IEEE Symposium on Security and Privacy, SP ’09, pages 45–60, Washington, DC, USA, 2009. IEEE Computer Society. [23] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction To Algorithms, Third Edition. MIT Press, 2009. 258 [24] Dana Dachman-Soled, Chang Liu, Charalampos Papamanthou, Elaine Shi, and Uzi Vishkin. Oblivious network RAM. Cryptology ePrint Archive, Report 2015/073, 2015. http://eprint.iacr.org/. [25] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. [26] Zhenyue Deng and Geoffrey Smith. Lenient array operations for practical secure information flow. In Proceedings of the 17th Computer Security Foundations Workshop, pages 115–124. IEEE, June 2004. [27] Christopher W. Fletcher, Marten van Dijk, and Srinivas Devadas. A secure processor architecture for encrypted computation on untrusted programs. In STC, 2012. [28] Christopher W. Fletcher, Ling Ren, Albert Kwon, Marten van Dijk, Emil Ste- fanov, and Srinivas Devadas. RAW Path ORAM: A low-latency, low-area hard- ware ORAM controller with integrity verification. 
IACR Cryptology ePrint Archive, 2014:431, 2014. [29] Christopher W. Fletcher, Ling Ren, Xiangyao Yu, Marten van Dijk, Omer Khan, and Srinivas Devadas. Suppressing the oblivious RAM timing channel while making information leakage and program efficiency trade-offs. In HPCA, pages 213–224, 2014. 259 [30] Jeffrey S. Foster, Robert Johnson, John Kodumal, and Alex Aiken. Flow- Insensitive Type Qualifiers. ACM Transactions of Programming Languages and Systems (TOPLAS), 28(6):1035–1087, November 2006. [31] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, 1979. [32] Tanguy Gilmont, Jean didier Legat, and Jean jacques Quisquater. Enhancing security in the memory management unit. In EUROMICRO, 1999. [33] O. Goldreich. Towards a theory of software protection and simulation by obliv- ious RAMs. In STOC, 1987. [34] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game. In STOC, 1987. [35] Oded Goldreich and Rafail Ostrovsky. Software protection and simulation on oblivious RAMs. J. ACM, 1996. [36] Michael T. Goodrich and Michael Mitzenmacher. Privacy-preserving access of outsourced data via oblivious RAM simulation. In ICALP, 2011. [37] Michael T. Goodrich and Joseph A. Simons. Data-Oblivious Graph Algorithms in Outsourced External Memory. CoRR, abs/1409.0597, 2014. [38] S. Dov Gordon, Jonathan Katz, Vladimir Kolesnikov, Fernando Krell, Tal Malkin, Mariana Raykova, and Yevgeniy Vahlis. Secure two-party computation in sublinear (amortized) time. In CCS, 2012. 260 [39] S. Dov Gordon, Allen McIntosh, Jonathan Katz, Elaine Shi, and Xiao Shaun Wang. Secure computation of MIPS machine code. Manuscript, 2015. [40] J. Alex Halderman, Seth D. Schoen, Nadia Heninger, William Clarkson, William Paul, Joseph A. Calandrino, Ariel J. Feldman, Jacob Appelbaum, and Edward W. Felten. Lest we remember: cold-boot attacks on encryption keys. Commun. ACM, 52(5):91–98, 2009. [41] T. Hastie, R. Tibshirani, and J.H. Friedman. 
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2001.
[42] Daniel Hedin and David Sands. Timing aware information flow security for a JavaCard-like bytecode. Electron. Notes Theor. Comput. Sci., 141(1):163–182, December 2005.
[43] Wilko Henecka, Stefan Kögl, Ahmad-Reza Sadeghi, Thomas Schneider, and Immo Wehrenberg. TASTY: Tool for automating secure two-party computations. In CCS, 2010.
[44] Andreas Holzer, Martin Franz, Stefan Katzenbeisser, and Helmut Veith. Secure Two-party Computations in ANSI C. In CCS, 2012.
[45] Andreas Holzer, Martin Franz, Stefan Katzenbeisser, and Helmut Veith. Secure two-party computations in ANSI C. In CCS, 2012.
[46] Yan Huang, David Evans, Jonathan Katz, and Lior Malka. Faster secure two-party computation using garbled circuits. In USENIX Security Symposium, 2011.
[47] Yuval Ishai, Joe Kilian, Kobbi Nissim, and Erez Petrank. Extending oblivious transfers efficiently. In CRYPTO, 2003.
[48] Yuval Ishai, Joe Kilian, Kobbi Nissim, and Erez Petrank. Extending Oblivious Transfers Efficiently. In CRYPTO, 2003.
[49] Jif: Java + information flow. http://www.cs.cornell.edu/jif/.
[50] Anatolii Alexeevitch Karatsuba. The Complexity of Computations, 1995.
[51] Marcel Keller and Peter Scholl. Efficient, oblivious data structures for MPC. In Asiacrypt, 2014.
[52] Florian Kerschbaum. Automatically optimizing secure computation. In CCS, 2011.
[53] Vladimir Kolesnikov and Thomas Schneider. Improved Garbled Circuit: Free XOR Gates and Applications. In ICALP, 2008.
[54] Ben Kreuter, Benjamin Mood, Abhi Shelat, and Kevin Butler. PCF: A portable circuit format for scalable two-party secure computation. In USENIX Security, 2013.
[55] Benjamin Kreuter, Abhi Shelat, and Chih-Hao Shen. Billion-gate secure computation with malicious adversaries. In USENIX Security, 2012.
[56] David Lie, John Mitchell, Chandramohan A. Thekkath, and Mark Horowitz. Specifying and verifying hardware for tamper-resistant software.
In Proceedings of the 2003 IEEE Symposium on Security and Privacy, SP '03, Washington, DC, USA, 2003. IEEE Computer Society.
[57] Chang Liu, Austin Harris, Martin Maas, Michael Hicks, Mohit Tiwari, and Elaine Shi. GhostRider: A hardware-software system for memory trace oblivious computation. In ASPLOS, 2015.
[58] Chang Liu, Michael Hicks, and Elaine Shi. Memory trace oblivious program execution. In CSF, pages 51–65, 2013.
[59] Chang Liu, Yan Huang, Elaine Shi, Jonathan Katz, and Michael Hicks. Automating Efficient RAM-model Secure Computation. In S & P, May 2014.
[60] Chang Liu, Xiao Shaun Wang, Kartik Nayak, Yan Huang, and Elaine Shi. ObliVM: A programming framework for secure computation. In S & P, 2015.
[61] Chang Liu, Xiao Shaun Wang, Kartik Nayak, Yan Huang, and Elaine Shi. ObliVM: A programming framework for secure computation, 2015. http://www.cs.umd.edu/~elaine/docs/oblivmtr.pdf.
[62] Steve Lu and Rafail Ostrovsky. How to garble RAM programs. In EUROCRYPT, 2013.
[63] Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari, Elaine Shi, Krste Asanović, John Kubiatowicz, and Dawn Song. Phantom: Practical oblivious computation in a secure processor. In ACM Conference on Computer and Communications Security (CCS), 2013.
[64] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010.
[65] Dahlia Malkhi, Noam Nisan, Benny Pinkas, and Yaron Sella. Fairplay: a secure two-party computation system. In USENIX Security, 2004.
[66] John C. Mitchell and Joe Zimmerman. Data-Oblivious Data Structures. In STACS, pages 554–565, 2014.
[67] David Molnar, Matt Piotrowski, David Schultz, and David Wagner. The program counter security model: Automatic detection and removal of control-flow side channel attacks. In ICISC, 2005.
[68] Moni Naor and Benny Pinkas. Efficient oblivious transfer protocols. In SODA, 2001.
[69] Moni Naor, Benny Pinkas, and Reuban Sumner. Privacy preserving auctions and mechanism design. In EC, 1999.
[70] Kartik Nayak, Xiao Shaun Wang, Stratis Ioannidis, Udi Weinsberg, Nina Taft, and Elaine Shi. GraphSC: Parallel Secure Computation Made Easy. In IEEE S & P, 2015.
[71] Valeria Nikolaenko, Stratis Ioannidis, Udi Weinsberg, Marc Joye, Nina Taft, and Dan Boneh. Privacy-preserving matrix factorization. In CCS, 2013.
[72] Valeria Nikolaenko, Udi Weinsberg, Stratis Ioannidis, Marc Joye, Dan Boneh, and Nina Taft. Privacy-preserving ridge regression on hundreds of millions of records. In S & P, 2013.
[73] Amir Pnueli, Michael Siegel, and Eli Singerman. Translation Validation. In TACAS, 1998.
[74] François Pottier and Vincent Simonet. Information flow inference for ML. ACM Trans. Program. Lang. Syst., 25(1):117–158, January 2003.
[75] Aseem Rastogi, Matthew A. Hammer, and Michael Hicks. Wysteria: A Programming Language for Generic, Mixed-Mode Multiparty Computations. In S & P, 2014.
[76] Aseem Rastogi, Piotr Mardziel, Matthew Hammer, and Michael Hicks. Knowledge inference for optimizing secure multi-party computation. In PLAS, 2013.
[77] riscv.org. Launching the Open-Source Rocket Chip Generator, October 2014. https://blog.riscv.org/2014/10/launching-the-open-source-rocket-chip-generator/.
[78] Alejandro Russo, John Hughes, David A. Naumann, and Andrei Sabelfeld. Closing internal timing channels by transformation. In ASIAN, 2006.
[79] Andrei Sabelfeld and Andrew C. Myers. Language-based information-flow security. IEEE Journal on Selected Areas in Communications, 21(1):5–19, January 2003.
[80] Elaine Shi, T.-H. Hubert Chan, Emil Stefanov, and Mingfei Li. Oblivious RAM with O((log N)^3) worst-case cost. In ASIACRYPT, 2011.
[81] Sergei Skorobogatov. Low temperature data remanence in static RAM. Technical Report UCAM-CL-TR-536, University of Cambridge, Computer Laboratory, June 2002.
[82] Ebrahim M. Songhori, Siam U.
Hussain, Ahmad-Reza Sadeghi, Thomas Schneider, and Farinaz Koushanfar. TinyGarble: Highly Compressed and Scalable Sequential Garbled Circuits. In IEEE S & P, 2015.
[83] Emil Stefanov, Elaine Shi, and Dawn Song. Towards practical oblivious RAM. In Network and Distributed System Security Symposium (NDSS), 2012.
[84] Emil Stefanov, Marten van Dijk, Elaine Shi, Christopher Fletcher, Ling Ren, Xiangyao Yu, and Srinivas Devadas. Path ORAM: an extremely simple oblivious RAM protocol. In CCS, 2013.
[85] G. Edward Suh, Dwaine Clarke, Blaise Gassend, Marten van Dijk, and Srinivas Devadas. AEGIS: architecture for tamper-evident and tamper-resistant processing. In International Conference on Supercomputing, ICS '03, pages 160–171, 2003.
[86] David Lie Chandramohan Thekkath, Mark Mitchell, Patrick Lincoln, Dan Boneh, John Mitchell, and Mark Horowitz. Architectural support for copy and tamper resistant software. SIGOPS Oper. Syst. Rev., 34(5):168–177, November 2000.
[87] Amit Vasudevan, Jonathan McCune, James Newsome, Adrian Perrig, and Leendert van Doorn. CARMA: A hardware tamper-resistant isolated execution environment on commodity x86 platforms. In Proceedings of the ACM Symposium on Information, Computer and Communications Security (ASIACCS), May 2012.
[88] Huy Vo, Yunsup Lee, Andrew Waterman, and Krste Asanović. A Case for OS-Friendly Hardware Accelerators. In WIVOSCA, 2013.
[89] Abraham Waksman. A permutation network. J. ACM, 15, 1968.
[90] Xiao Shaun Wang, T-H. Hubert Chan, and Elaine Shi. Circuit ORAM: On Tightness of the Goldreich-Ostrovsky Lower Bound.
[91] Xiao Shaun Wang, Yan Huang, T-H. Hubert Chan, Abhi Shelat, and Elaine Shi. SCORAM: Oblivious RAM for Secure Computation. In CCS, 2014.
[92] Xiao Shaun Wang, Kartik Nayak, Chang Liu, T-H. Hubert Chan, Elaine Shi, Emil Stefanov, and Yan Huang. Oblivious Data Structures. In CCS, 2014.
[93] Andrew Waterman, Yunsup Lee, David A. Patterson, and Krste Asanović.
The RISC-V Instruction Set Manual, Volume I: Base User-Level ISA. Technical Report UCB/EECS-2011-62, EECS Department, University of California, Berkeley, May 2011.
[94] Lance Whitney. Microsoft Urges Laws to Boost Trust in the Cloud. http://news.cnet.com/8301-1009_3-10437844-83.html.
[95] Andrew Chi-Chih Yao. Protocols for secure computations (extended abstract). In FOCS, 1982.
[96] Andrew Chi-Chih Yao. How to generate and exchange secrets. In FOCS, 1986.
[97] Samee Zahur and David Evans. Circuit Structures for Improving Efficiency of Security and Privacy Tools. In S & P, 2013.
[98] Yihua Zhang, Aaron Steele, and Marina Blanton. PICCO: a general-purpose compiler for private distributed computation. In CCS, 2013.