ABSTRACT

Title of Dissertation: ANALYZING THE COMBINED EFFECTS OF MEASUREMENT ERROR AND PERTURBATION ERROR ON PERFORMANCE MEASUREMENT

Geoffrey M. Stoker, Doctor of Philosophy, 2014

Directed By: Professor Jeffrey K. Hollingsworth, Department of Computer Science

Dynamic performance analysis of executing programs commonly relies on statistical profiling techniques to provide performance measurement results. When a program execution is sampled we learn something about the examined program, but we also change, to some extent, the program's interaction with the underlying system and thus its behavior. The amount we learn diminishes (statistically) with each sample taken, while the change we effect with the intrusive sampling risks growing larger. Effectively sampling programs is challenging largely because of the opposing effects of the decreasing sampling error and increasing perturbation error. Achieving the highest overall level of confidence in measurement results requires striking an appropriate balance between the tensions inherent in these two types of errors.

Despite the popularity of statistical profiling, published material typically explains only in general qualitative terms the motivation for the sampling rates used. Given the importance of sampling, we argue in favor of the general principle of deliberate sample size selection and have developed and tested a technique for doing so. We present our idea of sample rate selection based on abstract and mathematical performance measurement models we developed that incorporate the effect of sampling on both measurement accuracy and perturbation. Our mathematical model predicts the sample size at which the combination of the residual measurement error and the accumulating perturbation error is minimized. Our evaluation of the model with simulation, calibration programs, and selected programs from the SPEC CPU 2006 and SPEC OMP 2001 benchmark suites indicates that this idea has promise. Our results show that the predicted sample size is generally close to the best sampling rate and effectively avoids bad choices. Most importantly, adaptive sample rate selection is shown to perform better than a single selected rate in most cases.

ANALYZING THE COMBINED EFFECTS OF MEASUREMENT ERROR AND PERTURBATION ERROR ON PERFORMANCE MEASUREMENT

By Geoffrey M. Stoker

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2014

Advisory Committee:
Professor Jeffrey K. Hollingsworth, Chair/Advisor
Professor M. Cole Miller, Dean's Representative
Professor Ashok K. Agrawala
Professor Peter J. Keleher
Professor Alan L. Sussman

© Copyright by Geoffrey M. Stoker, 2014

Dedication

For Akemi.

Acknowledgements

My PhD adventure has largely mirrored Winston Churchill's description of his book-writing experience – from toy, through amusement, mistress, master, tyrant, and finally, to slain monster. I will be forever grateful to those who accompanied me for any length of time on this journey. The good that has come out of this process is due in the largest part to them.

To my advisor, Professor Jeffrey K. Hollingsworth, thank you for the years of patiently provided, wise advice and thoughtful guidance, without which this adventure would not have been possible. To the members of my dissertation committee, thank you for your willingness to help develop me as a researcher and improve my scholarship.
To my fellow research group members, AT, Mike, Nick, Ray, and Tugrul, thank you so much for the encouragement to persevere that you provided through personal example and regular interaction. To my leaders at EECS, Colonel Ed Sobiesk and Colonel Gene Ressler, thank you for pushing me, encouraging me, and taking a chance on me. Teaching Cadets has been the highlight of my career thus far. To the many, many members of the graduate community at UMD and the faculty at USMA who made the journey worthwhile, thanks for being such wonderful friends and colleagues. Thanks (in no particular order), Paul, Ed, Bill, Paulo, Rob, Todd, Devon, Tim, Mike, Mike, Kyle, Glenn, Mike, Ray, John, Patrick, Aaron, and many, many others I'm sorry to have to leave out.

To Kai-Lynn, Mirana, Geoffrey, Julia, and Sarah, thank you for providing me a large part of what makes life worth living – yourselves. Finally, to my wife Akemi, you are truly a gift from the Lord. Without you by my side, this adventure would count for nothing. Thank you for enduring the thousands of absent weekend and late-night hours along with my emotional ups and downs over the past six years. I love you and eagerly look forward to whatever life next holds for us.

Table of Contents

Dedication ..... ii
Acknowledgements ..... iii
Table of Contents ..... v
List of Tables ..... vii
List of Figures ..... viii
Chapter 1: Introduction ..... 1
1.1 Motivation ..... 3
1.2 Contributions ..... 6
1.3 Outline ..... 7
Chapter 2: Background and Related Work ..... 9
2.1 Handling Perturbation ..... 9
2.1.1 Perturbation Minimization ..... 9
2.1.2 Perturbation Compensation ..... 10
2.2 Statistical Science ..... 12
2.2.1 Approximating Hypergeometric with Normal ..... 12
2.2.2 Use in Performance Analysis ..... 15
2.3 Tools ..... 17
2.3.1 Progtime ..... 17
2.3.2 Prof ..... 17
2.3.3 Gprof ..... 18
2.3.4 ATOM ..... 19
2.3.5 XProfiler ..... 19
2.3.6 HPCToolkit ..... 19
2.3.7 VTune ..... 20
2.3.8 STAT ..... 20
2.3.9 DCPI ..... 20
2.3.10 SimPoint ..... 21
Chapter 3: Sampling Empirical Study ..... 23
3.1 Sampling Simulation ..... 23
3.1.1 Fixed Sample Size ..... 24
3.1.2 Calculated Sample Size ..... 28
3.2 Sampling Calibration Program ..... 34
3.2.1 Fixed Sample Size ..... 35
3.2.2 Calculated Sample Size ..... 38
Chapter 4: Model ..... 42
4.1 Abstract Model ..... 42
4.2 Analytical Model ..... 44
4.2.1 Background ..... 44
4.2.2 Applying the Intuition ..... 46
4.2.3 Analytic Function ..... 48
4.3 Simulation ..... 52
Chapter 5: Sequential Execution ..... 56
5.1 Calibration Program ..... 56
5.1.1 Experiment Design and Environment ..... 56
5.1.2 Program ..... 58
5.1.3 Measurement Tool ..... 59
5.1.4 Result Comparison ..... 60
5.2 SPEC CPU Programs ..... 61
5.2.1 Experiment Design and Environment ..... 61
5.2.2 Programs ..... 64
5.2.3 Measurement Tool ..... 64
5.2.4 Execution Details ..... 66
5.2.5 Results ..... 66
5.2.6 Analysis of Analytic Equation Results ..... 76
Chapter 6: Parallel Execution ..... 78
6.1 Preliminary Notes on Experiments ..... 78
6.1.1 Variance of Execution ..... 78
6.1.2 Software Timers ..... 81
6.2 SPEC OMP 2001 Programs ..... 83
6.2.1 Experiment Design and Environment ..... 83
6.2.3 Programs ..... 84
6.2.4 Execution Details ..... 84
6.2.5 Results ..... 85
Chapter 7: Future Work ..... 95
7.1 Process Refinement ..... 95
7.2 Analytic Function Refinement ..... 95
7.3 Other Application Domains ..... 96
7.4 Beyond Functions ..... 96
7.5 Computer Universe ..... 97
Chapter 8: Conclusions ..... 99
Appendix A ..... 103
Appendix B ..... 105
Appendix C ..... 115
Glossary ..... 161
Bibliography ..... 163

List of Tables

Table 3.1: Simulation experiment execution percentages for foo and bar. ..... 24
Table 3.2: Execution percentage pairings ..... 29
Table 3.3: Execution percentage pairings ..... 38
Table 3.4: Comparison of number of trials where foo < bar (per 100 runs) .....
40 Table 4.1: Analytic function notation ........................................................................ 46 Table 5.1: Calculation of largest function per benchmark. ........................................ 64 Table 5.2: Predicted vs. actual outcomes. .................................................................. 76 Table 6.1: Sampling intervals and expected approximate sample counts for targeted experiments with equake using 4 threads on 4 cores. ..................................... 89 Table 6.2: Sampling intervals and corresponding number of runs (out of 20) with calculated proportion (percent execution) outside the 95% CI for the "true" value of smvp.omp_fn.5 from equake, 4 threads on 4 cores. ......................... 90 Table 6.3: Chi-square test results for determining "truth" for the percent execution of equake functions. ............................................................................................ 92 Table 6.4: Parallel predicted vs. best outcome. ......................................................... 93 Table A.1: Analytic function notation ..................................................................... 103 viii List of Figures Figure 1.1: Theoretical outcome of two performance measurement tools .............................. 5 Figure 1.2: Example experiment results where measurements were perturbed by sampling .. 6 Figure 3.1: Simulation results of foo=.49; bar=.48; 40,000 samples ..................................... 25 Figure 3.2: Simulation results of foo=.40; bar=.39; 40,000 samples ..................................... 25 Figure 3.3: Simulation results of foo=.30; bar=.29; 40,000 samples ..................................... 25 Figure 3.4: Simulation results of foo=.20; bar=.19; 40,000 samples ..................................... 26 Figure 3.5: Simulation results of foo=.10; bar=.09; 40,000 samples ..................................... 26 Figure 3.6: Simulation results of foo=.02; bar=.01; 40,000 samples ..................................... 26 Figure 3.7: Comparing simulation, hypergeometric, and normal when foo=.49; bar=.48 .... 27 Figure 3.8: Comparing simulation, hypergeometric, and normal when foo=.02; bar=.01 .... 28 Figure 3.9: Simulation results of foo=.49; bar=.48; 38,375 samples ..................................... 30 Figure 3.10: Simulation results of foo=.40; bar=.39; 36,718 samples ................................... 30 Figure 3.11: Simulation results of foo=.30; bar=.29; 31,956 samples ................................... 30 Figure 3.12: Simulation results of foo=.20; bar=.19; 24,115 samples ................................... 31 Figure 3.13: Simulation results of foo=.10; bar=.09; 13,201 samples ................................... 31 Figure 3.14: Simulation results of foo=.02; bar=.01; 2,204 samples ..................................... 31 Figure 3.15: Comparing simulation, hypergeometric, and normal when foo=.49; bar=.48 .. 33 Figure 3.16: Comparing simulation, hypergeometric, and normal when foo=.02; bar=.01 .. 33 Figure 3.17: Calibration results of foo=.49; bar=.48; ≈37,500 samples ................................ 35 Figure 3.18: Calibration results of foo=.40; bar=.39; ≈37,500 samples ................................ 36 Figure 3.19: Calibration results of foo=.30; bar=.29; ≈37,500 samples ................................ 36 Figure 3.20: Calibration results of foo=.20; bar=.19; ≈37,500 samples ................................ 
37 Figure 3.21: Calibration results of foo=.10; bar=.09; ≈37,500 samples ................................ 37 Figure 3.22: Calibration results of foo=.02; bar=.01; ≈37,500 samples ................................ 38 Figure 3.23: Calibration results of foo=.30; bar=.29; ≈33,400 samples ................................ 39 Figure 3.24: Calibration results of foo=.20; bar=.19; ≈25,00 samples .................................. 39 Figure 3.25: Calibration results of foo=.10; bar=.09; ≈13,00 samples .................................. 40 Figure 3.26: Calibration results of foo=.02; bar=.01; ≈2,200 samples .................................. 40 Figure 4.1: Abstract model of the effect on measurement error and perturbation error as an increasing number of samples is taken during a program's execution. ....................... 44 Figure 4.2: This graph presents the predicted results for measurements of the execution of foo within a program that executes for a total of 300 seconds and in which foo accounts for 20% of the execution time. Of the measurements taken at each sample level, 95% of them are expected to fall within the upper and lower curves. .............. 49 Figure 4.3: This graph presents the simulation results for measurements of the execution of foo within a simulated program that executes (when unperturbed) for a total of 300 seconds and in which foo accounts for 20% of the execution time. 1,000 simulations were run per each sample size with the middle 95% of measurements shown as the solid part of the bar. The whiskers above and below each bar indicate the remaining 5% (2.5% above and 2.5% below). ............................................................................. 54 Figure 5.1: This graph presents the results for measurements of the execution of foo within our calibration program that executes (when unperturbed) for a total of approximately 300 seconds and in which foo accounts for approximately 20% of the execution time. ix 100 experiments were run per each sample size with the middle 95% of measurements shown as the solid part of the bar. The whiskers above and below each bar indicate the remaining 5% (2.5% above and below). ........................................... 57 Figure 5.2: Sample rate limitations ........................................................................................ 58 Figure 5.3: This graph presents a consolidated view of the results depicted in Figs 4.2, 4.3, and 5.1. Each three bar group pictured per sample count are, left to right, model prediction, simulation result, and experiment result. .................................................. 60 Figure 5.4: This graph presents the distribution of the calculated value of p for cMessageHeap::shiftup(int) arranged by number of samples taken. .......................... 62 Figure 5.5: This graph depicts the distribution of the calculated value of p for cMessageHeap::shiftup(int) ordered from smallest to largest. ................................... 63 Figure 5.8: Plot of proportion calculation results for cMessageHeap::shiftup(int), the function taking the most execution time for omnetpp, across the 28x10 execution runs. ............................................................................................................................. 67 Figure 5.9: Plot of proportion calculation results for cGate::deliver(cMessage*,double), the function taking the 2nd most execution time for omnetpp, across the 28x10 execution runs. 
............................................................................................................................. 68 Figure 5.10: Plot of proportion calculation results for cSimulation::selectNextModule(), the function taking the 3rd most execution time for omnetpp, across the 28x10 execution runs. ............................................................................................................................. 68 Figure 5.11: Plot of proportion calculation results for cModule::findGate(char const*,int) const, the function taking the 4th most execution time for omnetpp, across the 28x10 execution runs. ............................................................................................................ 69 Figure 5.12: Plot of execution time calculation results for cMessageHeap::shiftup(int), the function taking the most execution time for omnetpp, across the 28x10 execution runs. ............................................................................................................................. 69 Figure 5.13: Results for the 28x10 runs compared to "truth" for cMessageHeap::shiftup(int). The solid blue line tracks the MAPE for the set of 10 runs at each sample rate and the red dashed line indicates the approximate expected MAPE. ...................................... 71 Figure 5.14: MAPE results for each of the top four functions in omnetpp. .......................... 71 Figure 5.14: cMAPE results for the top four functions in omnetpp (blue solid line) and the calculated approximate expected cMAPE (dotted red line). ...................................... 72 Figure 5.15: cMAPE results for the top four functions in bzip2 (blue solid line) and the calculated approximate expected cMAPE (dotted red line). ...................................... 73 Figure 5.16: cMAPE results for the top four functions in mcf (blue solid line) and the calculated approximate expected cMAPE (dotted red line). ...................................... 74 Figure 5.17: cMAPE results for the top four functions in milc (blue solid line) and the calculated approximate expected cMAPE (dotted red line). ...................................... 75 Figure 5.18: cMAPE results for the top four functions in sjeng (blue solid line) and the calculated approximate expected cMAPE (dotted red line). ...................................... 75 Figure 6.1: Distribution of run times, 20 each, sorted shortest to longest within each sampling interval, for the SPEC OMP 2001 benchmark equake. ............................... 80 Figure 6.2: Distribution of run times, 10 each, sorted shortest to longest within each sampling interval, for the SPEC CPU 2006 benchmarks sjeng, milc, mcf, omnetpp, and bzip. ...................................................................................................................... 81 x Figure 6.3: Distribution of the percent execution taken by smvp.omp_fn.5 from 100 runs of equake with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion. ................................................................................................. 86 Figure 6.4: Distribution of the percent execution taken by 4 different functions from 100 runs of equake with 4 threads on 4 cores, 20 each at 5 different sampling intervals. 
Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion............................................................................................. 87 Figure 6.5: Results of the percent execution of smvp.omp_fn.5 for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval. The 95% confidence intervals of the "true" value are depicted as red dashed lines. ................. 89 Figure 6.6: Results of the percent execution of smvp.omp_fn.5 for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval. The 95% confidence intervals of the "true" value are depicted as red dashed lines. ................. 90 Figure B.1: Plot of proportion calculation results for BZ2_blockSort), the function taking the most execution time for bzip2, across the 28x10 execution runs. ............................ 105 Figure B.2: Plot of proportion calculation results for mainGtU, the function taking the 2nd most execution time for bzip2, across the 28x10 execution runs. ............................ 105 Figure B.3: Plot of proportion calculation results for BZ2_decompress, the function taking the 3rd most execution time for bzip2, across the 28x10 execution runs. ................ 106 Figure B.4: Plot of proportion calculation results for BZ2_compressBlock, the function taking the 4th most execution time for bzip2, across the 28x10 execution runs. ..... 106 Figure B.5: The aggregated MAPE results for the top four functions in bzip2 (blue solid line) and the calculated approximate expected MAPE (dotted red line). ................. 107 Figure B.6: Plot of proportion calculation results for primal_bea_mpp, the function taking the most execution time for mcf, across the 28x10 execution runs. ......................... 107 Figure B.7: Plot of proportion calculation results for refresh_potential, the function taking the 2nd most execution time for mcf, across the 28x10 execution runs. .................. 108 Figure B.8: Plot of proportion calculation results for replace_weaker_arc, the function taking the 3rd most execution time for mcf, across the 28x10 execution runs. ........ 108 Figure B.9: Plot of proportion calculation results for price_out_impl, the function taking the 4th most execution time for mcf, across the 28x10 execution runs. ......................... 109 Figure B.10: The aggregated MAPE results for the top four functions in mcf (blue solid line) and the calculated approximate expected MAPE (dotted red line)........................... 109 Figure B.11: Plot of proportion calculation results for mult_su3_na, the function taking the most execution time for milc, across the 28x10 execution runs. .............................. 109 Figure B.12: Plot of proportion calculation results for mult_su3_nn, the function taking the 2nd most execution time for milc, across the 28x10 execution runs. ....................... 110 Figure B.13: Plot of proportion calculation results for mult_su3_mat_vec, the function taking the 3rd most execution time for milc, across the 28x10 execution runs. .................. 110 Figure B.14: Plot of proportion calculation results for mult_adj_su3_mat_vec, the function taking the 4th most execution time for milc, across the 28x10 execution runs. ....... 111 Figure b.15: The aggregated MAPE results for the top four functions in milc (blue solid line) and the calculated approximate expected MAPE (dotted red line)........................... 
111 Figure B.16: Plot of proportion calculation results for std_eval, the function taking the most execution time for sjeng, across the 28x10 execution runs. ..................................... 111 xi Figure B.17: Plot of proportion calculation results for setup_attackers, the function taking the 2nd most execution time for sjeng, across the 28x10 execution runs. ................ 112 Figure B.18: Plot of proportion calculation results for gen, the function taking the 3rd most execution time for sjeng, across the 28x10 execution runs. ..................................... 112 Figure B.19: Plot of proportion calculation results for remove_one, the function taking the 4th most execution time for sjeng, across the 28x10 execution runs........................ 113 Figure B.20: The aggregated MAPE results for the top four functions in sjeng (blue solid line) and the calculated approximate expected MAPE (dotted red line). ................. 113 Figure C.1: Distribution of run times, 20 each, sorted shortest to longest within each sampling interval, for the SPEC OMP 2001 benchmark applu. ............................... 115 Figure C.2: Distribution of run times, 20 each, sorted shortest to longest within each sampling interval, for the SPEC OMP 2001 benchmark fma3d. .............................. 116 Figure C.3: Distribution of run times, 20 each, sorted shortest to longest within each sampling interval, for the SPEC OMP 2001 benchmark swim. ............................... 116 Figure C.4: Distribution of run times, 20 each, sorted shortest to longest within each sampling interval, for the SPEC OMP 2001 benchmark wupwise. .......................... 117 Figure C.5: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval. ........................................................................................ 117 Figure C.6: Results of the percent execution of main.omp_fn.10, the function taking the 3rd most execution time for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval. ........................................................................................ 118 Figure C.7: Results of the percent execution of omp_get_num_procs for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval. .......... 118 Figure C.8: Distribution of the percent execution taken by 4 different functions from 100 runs of equake with 8 threads on 8 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 119 Figure C.9: Results of the percent execution of smvp.omp_fn.5, the function taking the most execution time for the 120 runs conducted with equake, 8 threads on 8 cores, 20 runs per sampling interval................................................................................................. 120 Figure C.10: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with equake, 8 threads on 8 cores, 20 runs per sampling interval. ........................................................................ 
120 Figure C.11: Results of the percent execution of main.omp_fn.10, the function taking the 3rd most execution time for the 120 runs conducted with equake, 8 threads on 8 cores, 20 runs per sampling interval. ........................................................................................ 121 Figure C.12: Results of the percent execution of omp_get_num_procs, for the 120 runs conducted with equake, 8 threads on 8 cores, 20 runs per sampling interval. .......... 121 Figure C.13: Distribution of the percent execution taken by 4 different functions from 100 runs of equake with 12 threads on 12 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 122 Figure C.14: Results of the percent execution of smvp.omp_fn.5, the function taking the most execution time for the 120 runs conducted with equake, 12 threads on 12 cores, 20 runs per sampling interval. ................................................................................... 123 xii Figure C.15: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with equake, 12 threads on 12 cores, 20 runs per sampling interval. ........................................................................ 123 Figure C.16: Results of the percent execution of main.omp_fn.10, the function taking the 3rd most execution time for the 120 runs conducted with equake, 12 threads on 12 cores, 20 runs per sampling interval. ................................................................................... 124 Figure C.17: Results of the percent execution of omp_get_num_procs, for the 120 runs conducted with equake, 12 threads on 12 cores, 20 runs per sampling interval. ...... 124 Figure C.18: Distribution of the percent execution taken by 4 different functions from 100 runs of applu with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 125 Figure C.19: Results of the percent execution of ssor_.omp_fn.2, the function taking the most execution time for the 120 runs conducted with applu, 4 threads on 4 cores, 20 runs per sampling interval. ........................................................................................ 126 Figure C.20: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with applu, 4 threads on 4 cores, 20 runs per sampling interval. ................................................................................... 126 Figure C.21: Results of the percent execution of buts_, the function taking the 3rd most execution time for the 120 runs conducted with applu, 4 threads on 4 cores, 20 runs per sampling interval................................................................................................. 127 Figure C.22: Results of the percent execution of blts_, the function taking the 4th most execution time for the 120 runs conducted with applu, 4 threads on 4 cores, 20 runs per sampling interval................................................................................................. 
127 Figure C.23: Distribution of the percent execution taken by 4 different functions from 100 runs of applu with 8 threads on 8 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 128 Figure C.24: Results of the percent execution of ssor_.omp_fn.2, the function taking the most execution time for the 120 runs conducted with applu, 8 threads on 8 cores, 20 runs per sampling interval. ........................................................................................ 129 Figure C.25: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with applu, 8 threads on 8 cores, 20 runs per sampling interval. ................................................................................... 129 Figure C.26: Results of the percent execution of buts_, the function taking the 3rd most execution time for the 120 runs conducted with applu, 8 threads on 8 cores, 20 runs per sampling interval................................................................................................. 130 Figure C.27: Results of the percent execution of blts_, the function taking the 4th most execution time for the 120 runs conducted with applu, 8 threads on 8 cores, 20 runs per sampling interval................................................................................................. 130 Figure C.28: Distribution of the percent execution taken by 4 different functions from 100 runs of applu with 12 threads on 12 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 131 xiii Figure C.29: Results of the percent execution of ssor_.omp_fn.2, the function taking the most execution time for the 120 runs conducted with applu, 12 threads on 12 cores, 20 runs per sampling interval. ................................................................................... 132 Figure C.30: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with applu, 12 threads on 12 cores, 20 runs per sampling interval. ........................................................................ 132 Figure C.31: Results of the percent execution of buts_, the function taking the 3rd most execution time for the 120 runs conducted with applu, 12 threads on 12 cores, 20 runs per sampling interval................................................................................................. 133 Figure C.32: Results of the percent execution of blts_, the function taking the 4th most execution time for the 120 runs conducted with applu, 12 threads on 12 cores, 20 runs per sampling interval................................................................................................. 133 Figure C.33: Distribution of the percent execution taken by 4 different functions from 100 runs of fma3d with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 
134 Figure C.34: Results of the percent execution of platq_internal_forces_.omp_fn.0, the function taking the most execution time for the 120 runs conducted with fma3d, 4 threads on 4 cores, 20 runs per sampling interval. .................................................... 135 Figure C.35: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with fma3d, 4 threads on 4 cores, 20 runs per sampling interval. ................................................................................... 135 Figure C.36: Results of the percent execution of platq_stress_integration_, the function taking the 3rd most execution time for the 120 runs conducted with fma3d, 4 threads on 4 cores, 20 runs per sampling interval. ................................................................ 136 Figure C.37: Results of the percent execution of material_41_integration_, the function taking the 4th most execution time for the 120 runs conducted with fma3d, 4 threads on 4 cores, 20 runs per sampling interval. ................................................................ 136 Figure C.38: Distribution of the percent execution taken by 4 different functions from 100 runs of fma3d with 8 threads on 8 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 137 Figure C.39: Results of the percent execution of platq_internal_forces_.omp_fn.0, the function taking the most execution time for the 120 runs conducted with fma3d, 8 threads on 8 cores, 20 runs per sampling interval. .................................................... 138 Figure C.40: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with fma3d, 8 threads on 8 cores, 20 runs per sampling interval. ................................................................................... 138 Figure C.41: Results of the percent execution of platq_stress_integration_, the function taking the 3rd most execution time for the 120 runs conducted with fma3d, 8 threads on 8 cores, 20 runs per sampling interval. ................................................................ 139 Figure C.42: Results of the percent execution of material_41_integration_, the function taking the 4th most execution time for the 120 runs conducted with fma3d, 8 threads on 8 cores, 20 runs per sampling interval. ................................................................ 139 Figure C.43: Distribution of the percent execution taken by 4 different functions from 100 runs of fma3d with 12 threads on 12 cores, 20 each at 5 different sampling intervals. xiv Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 140 Figure C.44: Results of the percent execution of platq_internal_forces_.omp_fn.0, the function taking the most execution time for the 120 runs conducted with fma3d, 12 threads on 12 cores, 20 runs per sampling interval. .................................................. 141 Figure C.45: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with fma3d, 12 threads on 12 cores, 20 runs per sampling interval. 
........................................................................ 141 Figure C.46: Results of the percent execution of platq_stress_integration_, the function taking the 3rd most execution time for the 120 runs conducted with fma3d, 12 threads on 12 cores, 20 runs per sampling interval. .............................................................. 142 Figure C.47: Results of the percent execution of material_41_integration_, the function taking the 4th most execution time for the 120 runs conducted with fma3d, 12 threads on 12 cores, 20 runs per sampling interval. .............................................................. 142 Figure C.48: Distribution of the percent execution taken by 4 different functions from 100 runs of swim with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 143 Figure C.49: Results of the percent execution of GOMP_taskwait, the function taking the most execution time for the 120 runs conducted with swim, 4 threads on 4 cores, 20 runs per sampling interval. ........................................................................................ 144 Figure C.50: Results of the percent execution of calc2_.omp_fn.2, the function taking the 2nd most execution time for the 120 runs conducted with swim, 4 threads on 4 cores, 20 runs per sampling interval. ................................................................................... 144 Figure C.51: Results of the percent execution of calc1_.omp_fn.3, the function taking the 3rd most execution time for the 120 runs conducted with swim, 4 threads on 4 cores, 20 runs per sampling interval. ................................................................................... 145 Figure C.52: Results of the percent execution of calc3_.omp_fn.0, the function taking the 4th most execution time for the 120 runs conducted with swim, 4 threads on 4 cores, 20 runs per sampling interval. ........................................................................................ 145 Figure C.53: Distribution of the percent execution taken by 4 different functions from 100 runs of swim with 8 threads on 8 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 146 Figure C.54: Results of the percent execution of GOMP_taskwait, the function taking the most execution time for the 120 runs conducted with swim, 8 threads on 8 cores, 20 runs per sampling interval. ........................................................................................ 147 Figure C.55: Results of the percent execution of calc2_.omp_fn.2, the function taking the 2nd most execution time for the 120 runs conducted with swim, 8 threads on 8 cores, 20 runs per sampling interval. ................................................................................... 147 Figure C.56: Results of the percent execution of calc1_.omp_fn.3, the function taking the 3rd most execution time for the 120 runs conducted with swim, 8 threads on 8 cores, 20 runs per sampling interval. ................................................................................... 
148 Figure C.57: Results of the percent execution of calc3_.omp_fn.0, the function taking the 4th most execution time for the 120 runs conducted with swim, 8 threads on 8 cores, 20 runs per sampling interval. ........................................................................................ 148 xv Figure C.58: Distribution of the percent execution taken by 4 different functions from 100 runs of swim with 12 threads on 12 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 149 Figure C.59: Results of the percent execution of GOMP_taskwait, the function taking the most execution time for the 120 runs conducted with swim, 12 threads on 12 cores, 20 runs per sampling interval. ................................................................................... 150 Figure C.60: Results of the percent execution of calc2_.omp_fn.2, the function taking the 2nd most execution time for the 120 runs conducted with swim, 12 threads on 12 cores, 20 runs per sampling interval. ........................................................................ 150 Figure C.61: Results of the percent execution of calc1_.omp_fn.3, the function taking the 3rd most execution time for the 120 runs conducted with swim, 12 threads on 12 cores, 20 runs per sampling interval. ........................................................................ 151 Figure C.62: Results of the percent execution of calc3_.omp_fn.0, the function taking the 4th most execution time for the 120 runs conducted with swim, 12 threads on 12 cores, 20 runs per sampling interval. ................................................................................... 151 Figure C.63: Distribution of the percent execution taken by 4 different functions from 100 runs of wupwise with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion........................................................................................... 152 Figure C.64: Results of the percent execution of zgemm_, the function taking the most execution time for the 120 runs conducted with wupwise, 4 threads on 4 cores, 20 runs per sampling interval. ........................................................................................ 153 Figure C.65: Results of the percent execution of muldoe_.omp_fn.0, the function taking the 2nd most execution time for the 120 runs conducted with wupwise, 4 threads on 4 cores, 20 runs per sampling interval. ........................................................................ 153 Figure C.66: Results of the percent execution of muldeo_.omp_fn.0, the function taking the 3rd most execution time for the 120 runs conducted with wupwise, 4 threads on 4 cores, 20 runs per sampling interval. ........................................................................ 154 Figure C.67: Results of the percent execution of GOMP_taskwait, the function taking the 4th most execution time for the 120 runs conducted with wupwise, 4 threads on 4 cores, 20 runs per sampling interval. ................................................................................... 154 Figure C.68: Distribution of the percent execution taken by 4 different functions from 100 runs of wupwise with 8 threads on 8 cores, 20 each at 5 different sampling intervals. 
Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion. ..... 155
Figure C.69: Results of the percent execution of zgemm_, the function taking the most execution time for the 120 runs conducted with wupwise, 8 threads on 8 cores, 20 runs per sampling interval. ..... 156
Figure C.70: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with wupwise, 8 threads on 8 cores, 20 runs per sampling interval. ..... 156
Figure C.71: Results of the percent execution of muldoe_.omp_fn.0, the function taking the 3rd most execution time for the 120 runs conducted with wupwise, 8 threads on 8 cores, 20 runs per sampling interval. ..... 157
Figure C.72: Results of the percent execution of muldeo_.omp_fn.0, the function taking the 4th most execution time for the 120 runs conducted with wupwise, 8 threads on 8 cores, 20 runs per sampling interval. ..... 157
Figure C.73: Distribution of the percent execution taken by 4 different functions from 100 runs of wupwise with 12 threads on 12 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion. ..... 158
Figure C.74: Results of the percent execution of zgemm_, the function taking the most execution time for the 120 runs conducted with wupwise, 12 threads on 12 cores, 20 runs per sampling interval. ..... 159
Figure C.75: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with wupwise, 12 threads on 12 cores, 20 runs per sampling interval. ..... 159
Figure C.76: Results of the percent execution of muldoe_.omp_fn.0, the function taking the 3rd most execution time for the 120 runs conducted with wupwise, 12 threads on 12 cores, 20 runs per sampling interval. ..... 160
Figure C.77: Results of the percent execution of muldeo_.omp_fn.0, the function taking the 4th most execution time for the 120 runs conducted with wupwise, 12 threads on 12 cores, 20 runs per sampling interval. ..... 160

Chapter 1: Introduction

Understanding exactly what a piece of software is doing during execution, and how well it is performing, has been of interest since programs were first written and run on the earliest computers over half a century ago. As software and the machines on which it runs have become more complex, interest in evaluating performance has only intensified. To aid in understanding software performance, two general methods of dynamic analysis, termed profiling [36], have been developed.

The first is counting events of interest, such as the entry and/or exit points of a particular function, and is commonly called measurement-based profiling or simply measured profiling [43]. It is typically accomplished by adding code, before or during runtime, that executes in the normal control flow of the program just before, just after, or on both sides of the code of interest being profiled. This research is not primarily concerned with measured profiling.

The second category of dynamic software performance analysis methods is program status sampling [36], also called sample-based profiling or statistical profiling [43]. This technique often uses some type of operating system or underlying hardware functionality to interrupt the program being profiled, note some aspect of its current execution, and then allow the program to resume running. For instance, we may halt a program, record the current program counter (PC) or the name of the function at the top of the call stack, and then resume execution of the program. From the collected samples, experimenters produce a measured value (a statistic) from which inferences about the executing program's population parameters can be made. The research in this dissertation is framed entirely within the context of this kind of dynamic software performance analysis.
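To make the sampling mechanism concrete, the sketch below shows one minimal way a statistical profiler can be structured on a POSIX system: a profiling timer (setitimer with ITIMER_PROF) delivers SIGPROF periodically, and the signal handler records the interrupted program counter into a table. This is an illustrative sketch only, not the implementation of any tool discussed in this dissertation; the 10 ms period, the table size, the placeholder workload, and the x86-64 register access are choices made solely for the example.

#define _GNU_SOURCE            /* for REG_RIP on Linux/glibc; PC access is platform-specific */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 100000

static void *samples[MAX_SAMPLES];    /* sampled program-counter values */
static volatile size_t nsamples = 0;  /* number of samples collected    */

/* SIGPROF handler: record the interrupted PC and return immediately.
 * Everything done here adds overhead to the program being measured and
 * perturbs it; that perturbation is exactly the effect studied in this work. */
static void on_sigprof(int sig, siginfo_t *info, void *uc)
{
    (void)sig; (void)info;
    if (nsamples < MAX_SAMPLES) {
        ucontext_t *ctx = (ucontext_t *)uc;
#if defined(__x86_64__)
        samples[nsamples++] = (void *)ctx->uc_mcontext.gregs[REG_RIP];
#else
        (void)ctx;                    /* extracting the PC differs by platform */
        samples[nsamples++] = NULL;
#endif
    }
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* Sample every 10 ms of consumed CPU time; choosing this period well
     * is the central question examined in this dissertation.            */
    struct itimerval it = { .it_interval = {0, 10000}, .it_value = {0, 10000} };
    setitimer(ITIMER_PROF, &it, NULL);

    /* Placeholder workload standing in for the program being profiled. */
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; i++)
        x += i * 1e-9;

    printf("collected %zu PC samples (x=%f)\n", (size_t)nsamples, x);
    return 0;
}

Each delivered signal costs a kernel entry and exit, handler execution, and cache and pipeline disturbance, so halving the 10 ms period roughly doubles the number of samples collected and, with it, the overhead added to the run.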
Effectively sampling executing programs for the purpose of dynamic performance analysis via statistical profiling is a challenging problem. This is due in large part to the opposing effects of measurement error and perturbation error [17, 41]. Very frequent sampling of a program execution can provide statistically precise measured values, but it also risks perturbing the measured program so that it behaves very unlike the original object of the analysis (the un-sampled program) and produces inaccurate measured values. Sparse sampling, on the other hand, perturbs the executing program to a lesser degree, but also provides less statistically precise measurement results that may be less accurate and therefore less useful than required for meaningful analysis. Achieving the highest overall level of confidence in measurement results requires understanding the effects of, and striking a balance between, the tensions inherent in these two types of errors.

We cannot determine (predict) application performance from first principles, so we use performance analysis to try to understand program behavior [55]. Choosing how and when to capture performance data is a key part of the performance analysis process, and while the process is part science and part engineering, there is still a portion that is largely an art [12, 34]. When creating an execution profile via sampling, how is a sampling period chosen? What is an appropriate sampling period? It is well understood that increased sampling decreases measurement error and makes estimates of population parameters more precise and accurate. It is generally accepted that the act of sampling alters program execution, thereby perturbing performance. So, the general qualitative guidance is to sample enough to get precise measurements, but not so much that perturbation causes the measurements to be inaccurate and reflect something unlike the object of interest (the un-sampled program). How do we balance measurement error and perturbation error? Can we model the interaction of measurement and perturbation error? Is there an optimal sampling period? If so, can we find it? Can we predict it?
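To see why such an optimal sampling level might exist at all, consider a deliberately oversimplified error model. This is purely illustrative and is not the analytic model developed in Chapter 4; the functional forms and the constants a and c below are assumptions made only for the sake of the example. Suppose the statistical measurement error shrinks like $a/\sqrt{n}$ as the number of samples $n$ grows, while the accumulated perturbation error grows roughly linearly as $cn$. The combined error is then

\[
E(n) \;=\; \underbrace{\frac{a}{\sqrt{n}}}_{\text{measurement error}} \;+\; \underbrace{c\,n}_{\text{perturbation error}},
\qquad
\frac{dE}{dn} \;=\; -\frac{a}{2\,n^{3/2}} + c \;=\; 0
\;\;\Longrightarrow\;\;
n^{*} \;=\; \left(\frac{a}{2c}\right)^{2/3}.
\]

Under these assumptions the combined error falls until $n^{*}$ and rises afterward, which is exactly the "sweet spot" behavior hypothesized in what follows; the real model must, of course, justify the particular forms taken by the two error terms.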
We hypothesize that along the continuum of possible performance measurement results, with varying sample-dependent combinations of measurement error and perturbation error, there exists a "sweet spot" where the combined effects of these two error types are minimized and the performance measurement is the smallest distance from the true value being sought, and thus the most accurate. The primary aims of the research reported here are the investigation of the existence of the hypothesized "sweet spot," the development of a general model for the analysis of the interaction between measurement error and perturbation error during statistical profiling, and the pursuit of a method for identifying a sample size (and determining an appropriate sampling period) that minimizes the combined effects of those errors on a particular performance analysis experiment.

1.1 Motivation

Our research is motivated by several ideas. First and most simply, as Linus Pauling and others have said, "Science is the search for truth." Researchers generally seek the most feasibly correct results from a given experiment out of a desire to get a measurement as close to truth as possible. We believe that statistics calculated from performance measurements taken with a sample size that minimizes the combined effects of measurement and perturbation error will be more accurate than those obtained with any other sample size.

Second, in many cases there are practical limits to the benefit that additional sampling can provide. As a somewhat naive but straightforward example, consider a situation where you are sampling to calculate the amount of time function foo takes to execute in a particular program. If foo is responsible for 25% of the execution time and each sample adds 20 microseconds to the overall run time, the improvement in the confidence interval around the run time value calculated for foo will be less than the cost of an additional sample after about 55,000 samples (a worked version of this calculation follows this paragraph). Another aspect of practical limits comes into play when a very large number of samples is collected. With computers, it is quite easy to collect huge numbers of samples and possible to generate confidence intervals so narrow that very slight differences in performance measurements could lead investigators to conclude a statistically significant difference exists between two experiments, when the difference is actually related to some non-deterministic event. As well, it has been found that simply increasing the sample size does not guarantee more accurate performance measurement results [62].
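The 55,000-sample figure above can be checked with a back-of-the-envelope calculation. The passage does not state the total run time, so assume for illustration a program that runs for roughly $T = 300$ seconds unperturbed (the length of the calibration program used later in this dissertation), with $p = 0.25$ the proportion of time spent in foo and 20 microseconds of added run time per sample. The width of the 95% confidence interval around the time attributed to foo after $n$ samples is approximately

\[
W(n) \;=\; 2\,z_{0.975}\,T\,\sqrt{\frac{p(1-p)}{n}}
      \;=\; 2(1.96)(300)\sqrt{\frac{(0.25)(0.75)}{n}} \ \text{seconds},
\]

so the narrowing gained by one additional sample is roughly

\[
W(n) - W(n+1) \;\approx\; -\frac{dW}{dn}
  \;=\; z_{0.975}\,T\,\sqrt{p(1-p)}\;n^{-3/2}
  \;\approx\; 255\,n^{-3/2} \ \text{seconds}.
\]

Setting this equal to the 20 microsecond cost of the extra sample gives $n^{3/2} \approx 255/(2\times 10^{-5}) \approx 1.3\times 10^{7}$, i.e. $n \approx 5.5\times 10^{4}$ samples, after which each additional sample costs more run time than it buys in interval width. The exact crossover depends on the assumed run time and on whether the full width or the half-width of the interval is used, so this should be read as an order-of-magnitude check rather than a derivation of the figure in the text.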
If, however, the sample point used by SBT more effectively minimizes the overall effect of measurement and perturbation error, SBT could produce more accurate results despite inflicting more perturbation per sample than BT. Fig. 1.1 is a theoretical example of this idea based in part on previously published results [65]. Figure 1.1: Theoretical outcome of two performance measurement tools A fourth motivation stems from speculation that perturbation can be severe enough in some cases that results will lead to incorrect inference. This result has been published before [62] and presented in the conditions of one of our early experiments. As can be seen in Fig. 1.2, as sampling increased, the measurements of two particular functions, though they became more tightly grouped, were perturbed to the point that they changed positions relative to the order of most computationally expensive. 6 Figure 1.2: Example experiment results where measurements were perturbed by sampling The results in Fig. 1.2 show the proportion of execution time calculated for two particular functions, "scatter" and "khplq," from 11 different executions at each of 7 different sampling intervals. When the executions are lightly sampled, the calculated values clearly indicate that scatter takes a greater proportion of execution time than khplq; however, when heavily sampled, the results get reversed and it appears that khplq takes the greater proportion of execution time. Despite the tighter grouping of the calculated values (higher precision), the values, especially for scatter, appear to be much less accurate with higher sampling. This is potentially a very insidious result as efforts to optimize a program based on incorrect ordering of functions could result in wasted programmer effort. 1.2 Contributions With our research, we make the following contributions: Abstract Model. We introduce an abstract model that depicts the interaction of measurement error and perturbation error and shows how their interplay affects the accuracy of measurement results generated by sample-based profiling. 7 Mathematical Model. We develop a simplified mathematical model of the affects of decreasing measurement error and increasing perturbation error on measurement results and use it to predict the measurement point – the "sweet spot" – at which overall error should be at a minimum. Experimental Results. We conduct empirical studies and present results of a series of simulations, experiments with a calibration program, experiments with a subset of sequential programs from the SPEC CPU 2006 benchmark suite [29, 58], and experiments with a subset of shared memory parallel programs from the SPEC OMP 2001 benchmark suite [6, 59]. Quantitative Guidance. Departing from traditional qualitative guidance encouraging the use of "reasonable" or "appropriate" sampling intervals, we provide the first quantifiable guidance for choosing a sample period for experimenters conducting sample-based performance analysis. 1.3 Outline The remainder of this thesis is divided into seven chapters. Chapter 2 discusses background and related work in the field. We discuss existing techniques for dealing with perturbation during performance analysis, outline some of the key aspects of the statistical foundations that underpin our research, and provide a survey of various profiling tools' process of sample period selection. 
Chapter 3 presents the results of an empirical sampling study on both a series of simulations of program execution and a series of calibration programs with known population parameters. 8 Our abstract model and analytical model are introduced and described in detail in Chapter 4. We test the applicability of these models with simulation and derive a formula to use for finding an experiment sampling "sweet spot." In Chapter 5 we present the results of intensive sampling experiments with a calibration program and selected benchmarks from the SPEC CPU 2006 benchmark suite. We compare the empirical results of our study against the prediction of our derived formula from Chapter 4. Results of intensive sampling experiments using selected benchmarks from the SPEC OMP 2001 benchmark suite are described in Chapter 6. As in Chapter 5, we compare the empirical experimental results against the predictive power of our analytic function. Future work suggested and motivated by our research is outlined in Chapter 7 and finally, Chapter 8 presents conclusions. 9 Chapter 2: Background and Related Work Previous research related to our work can be broadly categorized in three ways: techniques for dealing with perturbation during performance analysis, statistical foundations of experiment design, and existing tools’ handling of sample period selection. 2.1 Handling Perturbation Many researchers have wrestled with the problem created when the pursuit of accurate and detailed performance measurements results in application perturbation effects [39, 40, 41]. We use the term perturbation throughout this dissertation, but note that the phenomena has also in the past been variously referred to as interference [12,16, 23], degradation [14, 26], artifact [52], side effects [63], probe effect [17], intrusion [4, 30], and invasion [48]. It is generally agreed that at this point in time perturbation caused by software performance tools cannot be eliminated, so the two general approaches to handling perturbation revolve around the strategies of compensation and minimization. 2.1.1 Perturbation Minimization Dynamic statistical projection pursuit [68] is a performance analysis technique developed out of a motivation to minimize the perturbation effects of instrumentation in addition to a desire to decrease the amount of performance data generated and reduce the number of performance metrics managed by a performance analysis system. The primary idea is to regularly identify performance metrics of special interest and focus only on collecting data for them. 10 Dynamic instrumentation, an idea pioneered with the Paradyn parallel performance measurement tool [46], is another performance analysis technique designed to limit the effects of perturbation. Paradyn is intended primarily to be used with very long-running applications (hours or days) on large parallel machines. By delaying inserting instrumentation code until the moment it is needed and then removing it when it is no longer needed, Paradyn significantly reduces measurement overhead and minimizes the global perturbation effects of the instrumentation. Though their techniques applied primarily to the realm of measured profiling, Kumar, et al, [38] reduced perturbation effects by optimizing the instrumentation used. They focused on reducing the number of instrumentation points, the number of times each point executed its instrumentation code, and by transforming, when possible, the instrumentation code into a more efficient form. 
2.1.2 Perturbation Compensation Using a timed Petri-net model for intrusively monitored software, Andersland, et al [2], and Gannon, et al [18, 19], recover true program traces from corrupted event traces post-mortem. This is one of the techniques developed to handle perturbation effects by compensating for them in order to recover the original execution times. They view profiled applications as discrete event dynamic systems that can adequately be modeled by timed Petri nets. Malony, et al, developed perturbation models for sequential [42] and parallel [43] executing programs that capture and then remove the additional execution time attributable to perturbation effects from instrumentation. Using these models they can determine an approximate actual code execution time from a run conducted with 11 performance measurement. The process involves determining the measurement overhead and then removing it from the results of the profiling runs. Sarukkai and Malony [57] considered methods for removing perturbation effects during performance analysis of highly parallel Single-Program, Multiple Data (SPMD) programs. In general terms, the process involved analyzing a trace file with a time ordered set of events, eliminating or reducing the perturbation effects, and generating a new trace of events with a time ordering that more closely reflects the actual execution. Najafzadeh and Chaiken [49] developed a flow-graph based perturbation model designed to compensate for or minimize perturbation effects. They then used the model [50] to estimate performance from the instrumented execution output they collected during performance analysis. Lehr [40] considered the problem of perturbations caused by software monitors in parallel programs and, coming to the conclusion that the effects could not be eliminated, focused instead on detecting, measuring, and compensating for them. With 99% confidence, he claimed that with the best case results the difference between compensated estimates of the average mean and the real average mean could be calculated within 1.2 ± 0.9%. His is the rare case of research in this area that tried to quantify the topic. Investigators continue to seek better ways to mitigate the effects of perturbation that occur during performance analysis. Beyond minimizing perturbation via thoughtful programming and clever instrumentation, we focus on minimizing perturbation by identifying when it affects measurement to a greater 12 degree than measurement is reducing statistical error. Compared to compensation, minimization seems like a simpler and safer idea. Because sampling has decreasing utility value, each sample has less value, in terms of reducing measurement error, than the sample before it. Compensating for perturbation seems to require a much greater understanding of the effects of perturbation since each sample reduces measurement error to a lesser degree, thus setting a higher requirement to get compensation right. So, though much effort and significant strides have been made in the area of perturbation compensation, we believe, as has been observed before [22], that "the best solution remains minimizing perturbation." 2.2 Statistical Science 2.2.1 Approximating Hypergeometric with Normal To determine how best to characterize the distribution of results from program profiling experiments, we start with the observation that any given program execution can be considered a discrete population of elements from which samples are taken without replacement. 
Whether we consider the program's elements to be delineated by a chip cycle, a machine instruction, a software clock tick, or something else, they can be viewed as a large bin of items from which we extract samples that are not replaced. From this collection of samples, we will find that some occurred within the execution context of a given function foo, while the remainder occurred outside the execution context of foo. For the purposes of this section, samples taken of foo are a "success." It follows from our understanding of a profiling experiment that it is hypergeometric and that the number of successes we find for each experiment is a hypergeometric random variable. Thus, we would use the hypergeometric distribution to describe the expected outcome or probability of the results for our experiment. The hypergeometric distribution describes how many successes you could expect after n samples are taken, without replacement, from a population of size N. Success is defined as picking one of the M total items of interest from among the N total items in the population. The probability that exactly m successes are picked can be calculated as follows:

P(X = m) = \binom{M}{m}\binom{N-M}{n-m} \Big/ \binom{N}{n}    (2.1)

For small populations where N and M are known, the hypergeometric distribution is appropriate and useful to use. However, because a program execution represents a very large population of events, and since N and M are generally unknown (likely unknowable for a program execution), the hypergeometric distribution is rather complicated and labor intensive to use at best and impossible to use at worst. For these reasons the hypergeometric distribution is often approximated by distributions that are simpler to use. A binomial approximation, with p = M/N, is appropriate when the population is very large and the sample taken is relatively small.

P(X = m) = \binom{n}{m}\, p^{m} (1-p)^{n-m}    (2.2)

The difference between a binomial experiment (a sequence of Bernoulli trials) and a hypergeometric experiment is whether sampling is done with or without replacement. Since samples are replaced with a binomial experiment, the probability of success remains the same for each trial. The hypergeometric distribution calculation takes into account the fact that sampling is done without replacement, thus the probability of a particular sample being a success is dependent on the previous samples. However, with a very large population and a sample size that is small relative to the population size, removing the sampled items has very little effect on the population. So, the probability of choosing a particular item a second time (had we sampled with replacement) would be negligibly small, thus making the binomial and hypergeometric distributions extremely similar under these conditions [51]. Deciding when the sample size is small enough has to do with being satisfied with the calculated error. As a sample size increases, the error calculated in the standard way becomes less correct and should be augmented with the finite population correction (fpc) factor:

\sqrt{\dfrac{N-n}{N-1}}    (2.3)

The recommendations for a small enough sample size relative to the population size seem to land on the two suggestions N > 10n (sample size is 10% or less of the population) and N > 20n (sample size is 5% or less of the population). We did come across the suggestion that an even larger sampling fraction is sufficient [16], but the prevalence of the other two standards caused us to not seriously consider it. There is a preference among the many sources we consulted that covered this topic for N > 10n, of which Brunk, et al, may have been the earliest [10].
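To see how small the correction typically is once N > 10n holds, consider illustrative figures comparable to the simulations of Chapter 3: a population of N = 1,000,000 possible samples from which n = 40,000 are drawn, a 4% sampling fraction. Then

\sqrt{\dfrac{N-n}{N-1}} = \sqrt{\dfrac{960{,}000}{999{,}999}} \approx 0.98,

so the uncorrected standard error overstates the true error by only about 2%.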
Though we expect the sample size of a program execution to be much smaller than either of these two standards, we will justify our approximating the hypergeometric on the more common of the two standards, N > 10n. A further approximation using the Normal (Gaussian) distribution is appropriate when some additional conditions hold. These conditions have to do with the observation that any binomial distribution where the probability of success is .5 (p = .5) is symmetrical, and thus even for small n, a binomial distribution with p close to .5 is appropriately approximated by the normal. The question then becomes how many more samples are required for binomial distributions to approximate the normal when p is closer to the edges, 0 or 1. One of the first efforts to quantify this resulted in an early rule of thumb [15]. Over time, the rule has been modified somewhat. Some indicate that both np and n(1−p) should be sufficiently large [27, 61]. Others, including Lilja [39], recommend a stricter requirement. Once again, we expect to encounter no problem meeting any of these guidelines, but we adopt the strictest of them as the standard on which we base our appropriate use of the Normal distribution. 2.2.2 Use in Performance Analysis Many investigators advocate for additional statistical rigor and better experimental design in computer system evaluation across different areas [8, 20, 21, 34, 35, 39, 47, 53, 69]. We surveyed much research related to this area, but will limit comments to a representative few. The concern of Georges, et al, flowed from the area of Java performance analysis and the disparate uses of statistical rigor among the various performance evaluation methodologies. Yi, et al, observed that statistically rigorous simulation methodologies are typically not used during computer architecture design. Jain and Lilja have provided books dedicated to advancing the process of computer performance analysis with specific emphasis on providing statistical tools and concepts that will permit more thorough experiments and analyses. Patil, co-authoring with Lilja, provides the most recent comprehensive discussion of the application of statistical theories to computer performance measurement. The primary points of all the proponents are to provide the highest level of confidence in experimental results and to ensure that the best possible conclusions are made from the data generated by computer system experiments. Statistical science provides a standard error calculation for determining the confidence interval of a proportion measurement at a particular confidence level for a Normally distributed (Gaussian) random variable. This standard error formula can be algebraically manipulated to permit calculation of the number of samples required to produce a proportion measurement within a desired confidence interval and confidence level. Jain, equation (2.4), and Lilja, equation (2.5), each provide an exposition on the way this formula is arranged and used when calculating proportions. In each case the result, n, is the sample size required to calculate a confidence interval of ±r (the error) for the statistical measure at the confidence level established by z (e.g., z = 1.96 for the 95% confidence level).

n = \dfrac{z^{2}\, p\,(1-p)}{r^{2}}    (2.4)

n = \dfrac{z^{2}\, p\,(1-p)}{(r\,p)^{2}}    (2.5)

Note that Lilja includes p in the denominator, which calculates the confidence interval with respect to the size of p rather than with respect to the experiment overall and computes a much larger sample requirement.
This means that given a proportion, , and desired error range, , we could expect a confidence interval of (.195, .205) and result of with equation (2.4) and expect a confidence interval of (.199, .201) and result of with equation (2.5) at the 95% 17 confidence level, . Though they differ somewhat in form, they otherwise encompass the same principle of targeting a confidence interval at a given confidence level in order to determine how many samples to take. It is through examination of these techniques that we received the motivation for our investigation into other methods for deliberate sample calculation in other contexts. 2.3 Tools Many sample-based profiling tools have been created to assist with analyzing the performance of executing programs. They all tend to work in a similar manner. The execution of a program is interrupted, some aspect of the state of the program is sampled and noted, often the program counter (PC) or call-stack, and then execution is allowed to continue. 2.3.1 Progtime The program status sampling (statistical profiling) tool used by Knuth in the 1971 empirical study of FORTRAN programs was an extension of a routine called PROGTIME originally developed by T.Y. Johnson and R.H. Johnson for use on the IBM System/360 [36]. PROGTIME would, while running, spawn the program of interest as a subtask, then sample that program's status word at regular intervals. At runtime completion PROGTIME would produce a histogram of the execution frequency for all program instructions. 2.3.2 Prof One of the oldest and most common statistical profilers is the Unix tool prof [66]. By compiling a program with a specific flag, typically −p, a monitor function is 18 added to the executable code of that program. When the executable is run, profile data is generated and then output to a file (mon.out by default) when the program terminates normally. The profile data consists of the number of times each function in the program is entered and a statistically-based break-down of the processing time used by the program. The statistical profile of execution time is captured by interrupting the program and sampling the PC at regular intervals governed by the software clock, commonly every 10 milliseconds. Executing prof results in the creation of a profiling report that displays the data in tabular form. 2.3.3 Gprof The profiling tool gprof [24, 25] was created as an extension to prof. In addition to counting each time a function is entered during execution, it included functionality to capture the return address of the call, the name of the caller function, and how many times that particular caller-callee arc was traversed. With this additional information a call graph of the profiled program emerged which permitted execution times attributed to function calls to be further distributed among whatever subsequent function calls might have been invoked. Though only one-level incoming call graph arcs were recorded and not complete call graphs, it allowed for a good approximation of the distribution of execution time with greater context. The profile report was extended to display the time spent executing a function as well as the time spent executing other functions on its behalf. As well, each function provides statistics about the functions that called it. The other functionality remains similar to prof including the PC sampling rate of 100 samples per second. 
19 2.3.4 ATOM Analysis Tools with OM or ATOM [60] introduced the concept of an instrumentation framework that could be used to build customized program analysis tools. The ATOM framework, in part, enables a program to analyze itself and has been used to create all kinds of performance analysis tools, including profilers. 2.3.5 XProfiler XProfiler [32] is part of the International Business Machine (IBM) High Performance Computing Toolkit (HPCT) capable of profiling both serial and parallel applications on the IBM Unix variant Advanced Interactive eXecutive (AIX). Given a program compiled with an IBM XL compiler and the proper flags, information similar to that provided by the gprof tool process is dumped to a profile file which XProfiler reads and presents via a graphical interface rather than in a text file. The sampling rate is also 100 samples per second. 2.3.6 HPCToolkit HPCToolkit [1, 56] is a collection of performance analysis tools designed to support dynamic performance analysis. The specific component that collects profile data via sampling is hpcrun. In addition to time intervals, hpcrun will sample based on events tracked by hardware counters and accessed via the Performance Application Programming Interface (PAPI) [9], like a specified number of cycles or specified number of L2 data cache misses. The time interval between samples can be set by the user and is recommended to be something that will generate between hundreds and thousands of samples per second. They note that Linux ITIMER 20 interrupts cannot occur with greater frequency than a jiffy which is likely 1, 4, or 10 milliseconds making thousands of samples per second unattainable. Note that jiffy in this Linux context is the smallest unit of time of the software clock. It is determined by the value of the kernel constant HZ which can be set to 100, 150, or 1000. 2.3.7 VTune The VTune Performance Analyzer [33] is a commercial performance analysis tool developed by Intel which also includes time-based profiling as well as event- based profiling. It allows the user to specify the timing interval, but recommends 1000 samples per second and will default to that number via automatic calculations if not specifically overridden by the user. 2.3.8 STAT A tool specifically designed to collect, analyze, and visualize the stack trace profiles of very large parallel and distributed applications is the Stack Trace Analysis Tool (STAT) [5]. The tool conceptually contains three parts, the front-end, tool daemons, and stack trace analysis routines. The front-end establishes the tool’s components; the daemons collect, process, and transmit the stack samples; and the analysis routines analyze the data. Sampling rates are established by specifying a sample count and an interval to wait between samples. 2.3.9 DCPI The Hewlett-Packard (HP) Digital Continuous Profiling Infrastructure (DCPI) [3, 31] appears to no longer be supported as it was specifically designed to interface with the now defunct Digital Equipment Corporation (DEC) Alpha microprocessor. 21 It’s included here because it is one of the best examples of the class of statistical profilers that is continuously running and capable of providing data not only at the application level, but also for the entire software system. As DCPI was integrated into the operating system and made use of hardware timing, high-rate sampling was possible. Samples were collected at a fixed rate of 5200 per second. 
2.3.10 SimPoint Casas, et al, [13], have done work that is in a substantially different area and travels a much different solution path, but in spirit is more closely aligned with ours than the others. They investigate the manner in which portions of instruction streams are selected as appropriate benchmarks for evaluating the performance of existing computer architectures and assisting with the design of future architectures. They observe that in previous efforts to reduce the number of instructions executed during evaluation of an architecture, and thereby trim the time spent, data generated, and memory required during analysis, investigators choose a sample length in an arbitrary manner and then leave it fixed. Casas, et al, run an entire benchmark application, take periodic hardware counter samples, conduct spectral analysis on the samples to determine the existence of periodic phases, and then use this result to select the portion of the instruction stream that best represents the overall instruction stream of the benchmark application. In this way, they are able to determine the optimal sampling length of the whole instruction stream and provide subsets that improve the accuracy of the results of architecture analysis. They take an existing tool called SimPoint [28, 67], which uses a fixed and arbitrary length for the sampling interval, and modify its use with their technique to demonstrate that more deliberate and 22 careful selection of the instruction stream sample provides better overall analysis results. 23 Chapter 3: Sampling Empirical Study In order to gain a better understanding of the behavior of programs when being sampled we conducted an empirical study. We started with a series of simulations to both validate what statistical analysis tells us we can expect and also to determine how closely the normal approximation fits the expected hypergeometric distribution that results from sampling a program execution. We then conducted a series of experiments with a calibration program designed to mirror as closely as possible the context of the simulations. 3.1 Sampling Simulation For the set of simulation experiments we constructed a C program that manipulated a 1,000,000 member integer array used to represent the execution flow of a program that runs for 300 seconds with 1,000,000 function calls. With each experiment, we focused on two functions, foo and bar, that had execution percentages within 1% of each other. We assigned different integer values to the array to represent foo, bar, and all other functions in exactly the quantities we wanted for a particular experiment. The array was then shuffled with the Fisher-Yates shuffle algorithm making use of the C standard library function rand. And, we took the first x members of the shuffled array as our simple random sample of size x. For each foo and bar execution percentage of interest we ran 10,000 trials and created histograms of the results. So, for example, in our first experiment we setup foo to have an execution percentage of 49% and assigned the value 1 to 490,000 integers in the array. Bar had an execution percentage of 48%, so we assigned the value 2 to 24 480,000 integers. The remaining 30,000 integers were assigned the value 0. After the shuffle, we selected the first 40,000 integers in the array as our simple random sample. 
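A minimal sketch of this array-based setup in C follows. It is illustrative only; the seed value and the reporting details are assumptions for the example and are not necessarily those used in our experiments.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    static int pop[N];   /* 1 = foo, 2 = bar, 0 = everything else */

    /* Fisher-Yates shuffle using the C standard library rand(). */
    static void shuffle(int *a, int n)
    {
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        }
    }

    int main(void)
    {
        /* First experiment: foo = 49%, bar = 48%, other = 3%. */
        for (int i = 0; i < N; i++)
            pop[i] = (i < 490000) ? 1 : (i < 970000) ? 2 : 0;

        srand(12345);                     /* hypothetical seed */
        shuffle(pop, N);

        /* The first x shuffled elements form a simple random sample. */
        int x = 40000, foo_hits = 0, bar_hits = 0;
        for (int i = 0; i < x; i++) {
            if (pop[i] == 1) foo_hits++;
            if (pop[i] == 2) bar_hits++;
        }
        printf("foo: %.4f  bar: %.4f\n",
               (double)foo_hits / x, (double)bar_hits / x);
        return 0;
    }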
We used this array technique in our simulation rather than directly employing the rand() function to generate samples as we might in other statistical analysis simulations in order to guarantee that the execution percentages were precisely what we wanted and to provide a more intuitive equivalence between the statistical simulation and an actual statistical analysis of an executing program. 3.1.1 Fixed Sample Size For the first set of simulation experiments we took 40,000 samples of each of the 10,000 trials done with 6 different sets of values for foo and bar. The pairs of execution percentages we used are in the table below. Execution percentage pairings for six different experiments foo .49 .40 .30 .20 .10 .02 bar .48 .39 .29 .19 .09 .01 Table 3.1: Simulation experiment execution percentages for foo and bar. We created histograms for each of the six different pairings and recorded the number of times that the statistics generated from the 40,000 samples resulted in foo’s calculated proportion – the inferred execution percentage – incorrectly indicating it was less than bar’s. These histograms follow below. Note that when reading the graphs, the bar above a given number on the x axis indicates the number of trials that had calculated statistics less than or equal to that number. For example, as shown in Fig. 3.1, the bar above .48 indicates that a little more than 3,500 of the 10,000 trials run resulted in a calculated statistic for bar’s execution percentage (let’s call it bar) to be .4775 < bar <= .48. 25 Figure 3.1: Simulation results of foo=.49; bar=.48; 40,000 samples Figure 3.2: Simulation results of foo=.40; bar=.39; 40,000 samples Figure 3.3: Simulation results of foo=.30; bar=.29; 40,000 samples 26 Figure 3.4: Simulation results of foo=.20; bar=.19; 40,000 samples Figure 3.5: Simulation results of foo=.10; bar=.09; 40,000 samples Figure 3.6: Simulation results of foo=.02; bar=.01; 40,000 samples 27 The first observation of note from this set of experiments is that the quality of the results differs markedly based on the percentage execution time of the functions being analyzed. In the first simulation with foo = .49 and bar = .48, the distribution of the statistics calculated from the samples is noticeably flatter and includes 199 trials where the execution percentage statistics calculated for foo and bar incorrectly indicate that the actual execution percentage of foo is less than that of bar. In the sixth simulation context where foo = .02 and bar = .01, the distribution is quite peaked and there are no trials incorrectly indicating that the actual percentage execution percentage of foo is less than that of bar. The second observation of note is that the distribution does seem to generally conform to that of a normal/Gaussian distribution and reinforces our understanding that a normal approximation to our hypergeometric distribution would be appropriate for our use. To further validate this idea, we compared the first and sixth simulation results to calculated predictions of both hypergeometric and normal distributions with the same parameters. These results are depicted in the following two graphs. Figure 3.7: Comparing simulation, hypergeometric, and normal when foo=.49; bar=.48 28 Figure 3.8: Comparing simulation, hypergeometric, and normal when foo=.02; bar=.01 The graphs appear to confirm that the normal approximation of our hypergeometric distribution generating experiments is reasonable. 
The approximation is slightly better when the execution proportion of interest is closer to .50, but is not that much worse with lower execution percentages. 3.1.2 Calculated Sample Size For the second set of simulation experiments, rather than take the same fixed number of samples like the 40,000 we took for all previous simulations, we calculated how many samples were required for each of the 6 different sets of values we were using for foo and bar to achieve a similar statistical error result. Proceeding from our understanding that the normal approximation is a reasonable and appropriate approximation, we know that a confidence interval for a normally distributed random variable is calculated as follows:

\hat{p} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}    (3.1)

In Eqn. 3.1, \hat{p} is the execution proportion, z is the standard score (number of standard deviations) required for whatever confidence level is used, and n is the number of samples taken. We wanted to determine the number of samples required to have high confidence that our experiments would result in a proper ordering of the execution percentage for foo and bar. So, for a given confidence level (we chose 95%) we wanted to know how many samples were needed for the lower end of foo's confidence interval to be equal to the upper end of bar's confidence interval. Letting p represent the actual proportion, or percentage execution, for foo and q represent the same for bar, we start from the following equation and then algebraically solve for n, the number of samples.

p - z\sqrt{\dfrac{p(1-p)}{n}} = q + z\sqrt{\dfrac{q(1-q)}{n}}

n = \dfrac{z^{2}\left(\sqrt{p(1-p)} + \sqrt{q(1-q)}\right)^{2}}{(p-q)^{2}}    (3.2)

Using this equation to determine our target sample size for each of the 6 different simulation contexts, we again ran 10,000 trials with foo and bar set to 6 different sets of values, taking 6 different numbers of samples per the table below. The graphs that follow provide histograms of the results.

Execution percentage pairings for six different experiments
foo        .49      .40      .30      .20      .10      .02
bar        .48      .39      .29      .19      .09      .01
# samples  38,375   36,718   31,956   24,115   13,201   2,204
Table 3.2: Execution percentage pairings

Figure 3.9: Simulation results of foo=.49; bar=.48; 38,375 samples Figure 3.10: Simulation results of foo=.40; bar=.39; 36,718 samples Figure 3.11: Simulation results of foo=.30; bar=.29; 31,956 samples Figure 3.12: Simulation results of foo=.20; bar=.19; 24,115 samples Figure 3.13: Simulation results of foo=.10; bar=.09; 13,201 samples Figure 3.14: Simulation results of foo=.02; bar=.01; 2,204 samples From this second set of simulation experiments we first note that once again, the quality of the results differs based on the percentage execution time of the functions being analyzed. With foo = .49 and bar = .48 there are 225 trials in which the statistics calculated for foo and bar incorrectly indicate that the actual execution percentage of foo is less than that of bar. Though the shape of the distributions changes little among the six sets of simulations, as the values of foo and bar are changed there are fewer statistics calculated that incorrectly order foo's and bar's relative execution percentage. By the sixth simulation context in this set, where foo = .02 and bar = .01, there are only 21 trials that produce statistics indicating an incorrect relative ordering of foo and bar. This result is rather surprising. In addition to taking over 90% fewer samples during the sixth simulation context, compared with the first, there were also 90% fewer trials where the statistics incorrectly indicated the percentage of execution for foo < bar.
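As an added check on equation (3.2) (the arithmetic here is our own illustration), evaluating it for the sixth pairing at the 95% confidence level gives

n = \dfrac{1.96^{2}\left(\sqrt{.02 \times .98} + \sqrt{.01 \times .99}\right)^{2}}{(.02-.01)^{2}} = \dfrac{3.8416\,(.1400 + .0995)^{2}}{.0001} \approx 2{,}204,

which matches the table entry; evaluating it for the first pairing similarly reproduces the roughly 38,000-sample requirement shown.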
The second thing to note from this second set of simulations is that the distributions for the first five experiments conform quite well with that of a normal distribution as well as with each other, while the sixth doesn’t quite provide that same visual affirmation. We provide the following two graphs to once again compare the first and sixth simulation contexts to the expected distribution calculated for both a hypergeometric and a normal distribution. 33 Figure 3.15: Comparing simulation, hypergeometric, and normal when foo=.49; bar=.48 Figure 3.16: Comparing simulation, hypergeometric, and normal when foo=.02; bar=.01 The first graph once again appears to confirm that the normal approximation of our hypergeometric distribution generating experiments is reasonable, while the second graph clearly depicts a difference. The first interval less than (to the left) the actual execution percentage of both foo and bar is not proportionate to the first interval greater than the actual execution percentages. The difference is more marked with bar than with foo. Beyond the first intervals, the distributions do appear equivalent. It appears that the normal approximation of the hypergeometric distribution begins to break down when the execution percentage of a function of interest is small (near 0%). [Note: we expect we would find the same distribution divergence with large execution percentages (near 100%) due to the symmetric nature 34 of the distribution.] Fortunately, the divergence is such that results are statistically more accurate, so the approximation suits our purposes. 3.2 Sampling Calibration Program We next conducted a series of experiments meant to closely mirror the conditions of the simulations. We created a calibration program that would execute 1,000,000 function calls in about 300 seconds that could be configured to distribute the function calls in such a way that we could control the percent of execution time for which each function was responsible. We used the same 1,000,000 integer array, shuffled in the same manner, to call functions in the same order as was done in the simulations. The functions called were composed of identical loops executing simple mathematical operations designed to take 300 microseconds each to execute and thereby provide a program that runs for a total of 300 seconds. This calibration program was designed with minimal memory requirements and a single function call depth, to help minimize the overhead and perturbation effects of sampling. Due to the imprecise nature of this technique the program ends up running approximately 300 seconds per execution iteration. We used the hpcrun component of HPCToolkit [56] to sample our executing calibration program. Because we used timer intervals and the Linux ITIMER_PROF, interrupts cannot occur with greater frequency than a jiffy (1 millisecond in our case), the number of samples we could take was not as flexible and precise as with our simulations. Note that jiffy in this Linux context is the smallest unit of time of the software clock. It is determined by the value of the kernel constant HZ which can be set to 100, 150, or 1000. Because of the far greater total execution time required to 35 run experiments with the calibration program, as compared to the simulations, we conducted 100 trials per experiment configuration. 3.2.1 Fixed Sample Size For the length of our program we had the option of taking about 42,800 samples (7 millisecond intervals) or about 37,500 samples (8 millisecond intervals). 
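These two options follow directly from the millisecond timer resolution: sampling a 300 second run every 7 ms yields 300 s / 0.007 s ≈ 42,857 samples, while sampling every 8 ms yields 300 s / 0.008 s = 37,500 samples; a period of 7.5 ms, which would give exactly 40,000 samples, is not expressible in whole milliseconds.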
We chose 37,500 because it was slightly closer to the 40,000 we used in our simulation. Figure 3.17: Calibration results of foo=.49; bar=.48; ≈37,500 samples 36 Figure 3.18: Calibration results of foo=.40; bar=.39; ≈37,500 samples Figure 3.19: Calibration results of foo=.30; bar=.29; ≈37,500 samples 37 Figure 3.20: Calibration results of foo=.20; bar=.19; ≈37,500 samples Figure 3.21: Calibration results of foo=.10; bar=.09; ≈37,500 samples 38 Figure 3.22: Calibration results of foo=.02; bar=.01; ≈37,500 samples 3.2.2 Calculated Sample Size Similar to our investigation progression with the simulations, we repeated the previous experiments using different sample rates. Once again, the millisecond granularity restriction with the ITIMER_PROF affected the experiments, so experiments with the first two sets of execution percentages were not repeated. Execution percentage pairings for six different experiments Foo .49 .40 .30 .20 .10 .02 Bar .48 .39 .29 .19 .09 .01 # samples ≈ 37,500 37,500 33,400 25,000 13,000 2,200 Table 3.3: Execution percentage pairings 39 Figure 3.23: Calibration results of foo=.30; bar=.29; ≈33,400 samples Figure 3.24: Calibration results of foo=.20; bar=.19; ≈25,00 samples 40 Figure 3.25: Calibration results of foo=.10; bar=.09; ≈13,00 samples Figure 3.26: Calibration results of foo=.02; bar=.01; ≈2,200 samples Comparison of # trials foo < bar (per 100) Simulation Calibration Foo Bar 40,000 Various 37,500 Various .49 .48 1.99 2.25 3 3 .40 .39 1.07 1.4 3 3 .30 .29 .42 .84 1 2 .20 .19 .03 .62 0 1 .10 .09 0 .34 0 0 .02 .01 0 .21 0 0 Table 3.4: Comparison of number of trials where foo < bar (per 100 runs) 41 Important observations can be made from the experiments conducted in this empirical study. First, our results suggest that far less sampling than is typically collected during dynamic performance analysis is needed to achieve meaningful statistical results for execution profiles, at least when the primary objective is ensuring a proper ordering of functions based on percent of execution time or simply identifying where the majority of execution time is taking place. Second, calculated results for smaller p are likely more accurate than larger p for a given sampling experiment, if perturbation effects have not disproportionately affected their execution. 42 Chapter 4: Model 4.1 Abstract Model Taking more and more samples from a population produces a measured value, or statistic, that gets closer and closer to the true population parameter of interest. Unless we take all possible samples (a census) the statistic will have a confidence interval associated with it within which the actual population parameter lies with some degree of likelihood. Taking more samples narrows the confidence interval in which it is very likely the true population parameter is included, making the measured value more precise and more accurate with each additional sample. The main problem with sampling a running program is that the process of taking samples perturbs the program execution in a manner that very likely is not fully understood, thus resulting in a statistic that no longer strictly describes only the original, un-sampled running program. 
Dilation of execution time is typically the easiest perturbation effect to observe, but changes to the spatial and temporal access patterns of cache and memory locations, event and execution pipeline reordering, additional context switching and interrupts, and register interlock stalls [42, 50] are all likely to occur and are much more difficult to confirm. As well, distortion of the accuracy of the measurements designed to provide insight into program executions is a problem of perturbation [40]. This problem is compounded by the diminishing return aspect of sampling on measurement error reduction. Though a statistic becomes more precise (and hopefully more accurate) with each additional sample, the improvement in precision provided with each additional sample is less and less. Consider, for example, the following: once we have taken x samples and have a confidence interval of ±r, how many samples, y, would we need to take in total to reduce the confidence interval by half? The answer is relatively straightforward:

r = z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{x}}    (4.1)

\dfrac{r}{2} = z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{y}}    (4.2)

\sqrt{y} = 2\sqrt{x}    (4.3)

y = 4x    (4.4)

To make this more concrete, assume z = 1.96, \hat{p} = .5, and we take 100 samples to produce a statistic with a confidence interval of ±.098. To cut that confidence interval in half (±.049) we need to take 300 additional samples, or 400 total samples (y = 4x). At this point, reducing the remaining confidence interval to 0 cannot be accomplished as easily as taking another 300 samples; it would take a complete census. Thus, the diminishing return aspect of sampling on measurement error reduction is a significant consideration and cannot be ignored. We can abstractly describe the competing effects of decreasing measurement error and increasing perturbation error on our performance measurement statistic with the diagram in Fig. 4.1. The intuition we gain from this abstract model is that it seems likely there exists a sample size that provides the best overall balance between reducing measurement error and limiting perturbation error effects. If we can determine this point we can calculate the appropriate sampling period to use to achieve a performance measurement statistic with the overall least residual error. Figure 4.1: Abstract model of the effect on measurement error and perturbation error as an increasing number of samples is taken during a program's execution. 4.2 Analytical Model 4.2.1 Background We begin constructing our analytical model from our abstract model by considering a likely performance measurement context. There exists a program in which there is a function foo. Though it's not critical to the discussion just yet, we will consider foo to be the function which takes the greatest proportion of the given program's execution time. We would like to know: How much execution time is spent in function foo? We imagine a simple statistical profiling system that halts the execution of our program and notes which function is currently in execution. A total of n samples are taken, m of the samples are found to be within function foo, and T is the measured execution time of the entire program.
We can calculate the measured proportion of execution time for function foo as:

\hat{p} = \dfrac{m}{n}    (4.5)

We can then answer our performance analysis question, where tfoo(n) is the measured execution time for foo after n samples are taken, as follows:

t_{foo}(n) = \hat{p}\,T = \dfrac{m}{n}\,T    (4.6)

We know that the statistic calculated with m and n is likely not equivalent to the actual population parameter calculated with M and N, where N is the total number of all possible samples and M is the total number of samples that would be taken during the execution of foo. So we could generate a confidence interval around the statistic within which we expect to find the true population parameter with some established confidence level. For this calculation we use the estimated standard deviation or standard error formula,

\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}    (4.7)

to create a confidence interval of:

t_{foo}(n) \pm T\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}    (4.8)

Into this confidence interval we would expect the actual execution time of foo to be included with roughly 68% confidence. Because we only use one standard error we would be limited to the 68% confidence level, but we can introduce the well-known z term in front of the standard error to calculate confidence intervals at a greater range of confidence levels:

t_{foo}(n) \pm z\,T\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}    (4.9)

At this point we expand our term T, the measured total execution time of the program which contains foo, to explicitly show that it includes both the true (un-sampled) execution time, T, and the additional execution time attributable to the sampling, which will be the number of samples, n, multiplied by the overhead, o, or execution time required per sample. Thus, equation (4.9) becomes:

\hat{p}\,(T + n\,o) \pm z\,(T + n\,o)\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}    (4.10)

4.2.2 Applying the Intuition Given these intermediate steps and the intuition from our abstract model, we can now create an analytic function to reflect the changes to a measured statistic with respect to measurement error and perturbation error that occur with each additional sample. The notation for our analytic function is in Table 4.1.

tfoo(n)   Execution time calculated for foo based on n samples
n         Number of samples
o         Overhead time cost per sample
\hat{p}   Measured proportion of total program execution time spent in foo
T         Un-sampled (true) total program execution time
z         Standard score (z-value) for confidence level used
Table 4.1: Analytic function notation

t_{foo}(n) = \hat{p}\,n\,o + \left[\hat{p} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\right]T    (4.11)

The analytic function combines the two curves from the abstract model. Increasing perturbation error is captured in the first part of the equation by multiplying together the number of samples taken, the overhead time required for each sample, and the proportion of samples that interrupted foo. Decreasing measurement error is represented by the common formula for a confidence interval of a proportion multiplied by the total program execution time. Recall that this analytic function is created to answer the specific question, how much execution time is spent in function foo? In this context, foo is the function taking the largest proportion of execution time in the program being analyzed. We choose to frame the question in this particular manner for two primary reasons. First, we wish to explicitly focus on the dilated execution time that is an indication of the perturbation caused by sampling. Most dynamic performance analysis typically results in the generation of a list of functions described chiefly by proportion of attributed execution time. Because the primary focus is on proportion of execution time, it is probable that perturbation effects are masked, as there is usually no accompanying comparison to actual (or expected actual) execution time.
The second reason we narrowly frame the question is that the reliability of statistical results is greatly affected by the size of the proportion of the measured function (see Chapter 3). The same number of samples provides a different level of precision for each different-sized proportion. The larger a function's proportion of execution time, the greater the width of the confidence interval, thus the less precise the statistic for a given confidence level. Observe that the product \hat{p}(1-\hat{p}) ranges from 0 (when \hat{p} = 1 or \hat{p} = 0) to .25 (when \hat{p} = .5), but also note that this fact doesn't invalidate our previous sentence because, although a sampled statistical proportion of .5 will have a wider confidence interval than one of .6, no program can have function execution proportions add up to a sum greater than 1. For example, if a function foo has proportion .6, then function bar must have a proportion of the execution time that is strictly ≤ .4. By focusing on the function taking the largest proportion of execution time we guarantee the results for all other functions will never be less statistically precise, and most often they will be more statistically precise. For example, given a program with functions foo, bar, and baz calculated after 1,000 samples to take respectively .70, .20, and .10 of total execution time, they would have corresponding standard error values of .0145, .0126, and .0095. So, the confidence interval around foo, (.6855, .7145), would be wider than those around bar, (.1874, .2126), and baz, (.0905, .1095). We make the following set of assumptions for our model:
• Because the full spectrum of perturbation effects is difficult to capture, we make the simplifying assumption that time (the additional time the analyzed program spends in execution beyond the unperturbed execution time) is a representative surrogate for all perturbation effects.
• The Hypergeometric distribution may be appropriately approximated by a Normal (Gaussian) distribution where the population size N and the number of successes M are at least an order of magnitude larger than the number of samples n. See discussion in section 2.2.1.
• Systematic sampling provides results similar to random sampling and occurs asynchronously with respect to any periodic events in the analyzed program.
4.2.3 Analytic Function Using our current version of the analytic function (4.11), we created the graph in Fig. 4.2 that visually presents the predicted changing behavior of the performance measurement of a hypothetical function foo that is 20% of an executing program of 300 seconds where each sample incurs 250 microseconds of overhead. Figure 4.2: This graph presents the predicted results for measurements of the execution of foo within a program that executes for a total of 300 seconds and in which foo accounts for 20% of the execution time. Of the measurements taken at each sample level, 95% of them are expected to fall within the upper and lower curves. The top curve depicts the higher end of the confidence interval, where the standard error is added to \hat{p}:

t_{foo}(n) = \hat{p}\,n\,o + \left[\hat{p} + z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\right]T    (4.12)

The bottom curve depicts the lower end of the confidence interval, which subtracts the standard error:

t_{foo}(n) = \hat{p}\,n\,o + \left[\hat{p} - z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\right]T    (4.13)

The slightly thicker straight line at 60 seconds represents the true execution time for foo. The curves are calculated with a 95% confidence level (z = 1.96), which means we would expect 95% of our measurements of function foo for any given sample size to fall within the two curves.
For example, if we ran this hypothetical 300 second-long program 100 times and took 2,400 samples each time, we would expect to find that about 95 of the measurements of foo fell between 55 and 65 seconds while 5 measurements fell outside those bounds. Displaying the data generated by our analytic function in this way makes it easier to see the change in the statistical precision of the confidence interval as more samples are taken. It is also easier to see there is an intuitive benefit to using the top curve of the model over the bottom curve to identify the sample size at which the combined effect of both measurement error and perturbation error is minimized. Notice there exists a minimum along the top curve that indicates at what point the measurement is no longer making downward progress towards the true execution time for foo, as the decreasing measurement error begins to be dominated by the increasing perturbation effect of the sampling overhead and the curve adopts a positive slope. If we can calculate the minimum of this upper curve, we can then determine the sample size (and corresponding sampling period) needed to reach this point that strikes the best balance between the two types of error. So, we simplify our analytic function to only add the standard error because it most clearly displays the point at which the reduction in measurement error begins to be less than the accumulating perturbation error. Our final version of the analytic function is:

t_{foo}(n) = \hat{p}\,n\,o + \left[\hat{p} + z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\right]T    (4.14)

Finding the minimum along this curve is a matter of taking the first derivative of our analytic function with respect to the number of samples, n, setting it to 0, and solving. The steps we used to find equation (4.15), the first derivative of our analytic function, and equation (4.16), the result of solving for n after setting the derivative to 0, are contained in Appendix A.

\dfrac{d\,t_{foo}(n)}{d\,n} = \hat{p}\,o - \dfrac{z\,T\sqrt{\hat{p}(1-\hat{p})}}{2\,n^{3/2}}    (4.15)

n = \left(\dfrac{z\,T\sqrt{\hat{p}(1-\hat{p})}}{2\,\hat{p}\,o}\right)^{2/3}    (4.16)

Using (4.16) and the parameter values mentioned earlier (T = 300, z = 1.96, \hat{p} = .2, o = .00025) we get a result of approximately 17,700 samples. Note that this is the minimum point along the top curve in Fig. 4.2. Beyond this point in the graph the confidence interval continues to narrow as more samples are taken, but the confidence limits are converging on a number getting further away from what would be the actual unperturbed value of foo's execution time. Beyond 28,000 samples the lower curve no longer even encompasses the true unperturbed execution time. 4.3 Simulation In order to validate our ideas and corresponding analytical model we next constructed a program to simulate the process of sampling the execution of a program which spends a specific fraction of time in a single function. The program was designed in such a manner to allow us to corroborate the statistical results predicted by our analytical model with the range of actual results from the simulation. The program is written in C. We simulate a time series of sampled values as an integer array of size 1,000,000. This array represents the execution of a program running for 300 seconds where each integer represents a notional 300 microseconds of execution time. We assigned 200,000, exactly 20%, of the integers the value 1 to represent the function foo and assigned the other 800,000 integers the value -1, representing all other non-foo functions. Once generated, the entire array was shuffled with the Fisher-Yates shuffle algorithm making use of the C standard library function rand.
The result was the 1,000,000 integer array with the 200,000 values of 1, representing foo, spread randomly (not uniformly) throughout. We used this array technique in our simulation rather than directly employing the rand function to generate samples, as we might in other statistical analysis simulations, in order to guarantee that the execution percentage was precisely 20% and to provide a more intuitive equivalence between the statistical simulation and an actual statistical analysis of an executing program. With each integer representing a notional 300 microseconds of execution time, this design is relatively coarse-grained compared to the number of execution steps that could be sampled in an actual 300 second-long program. However, we felt a total population size of 1,000,000 sufficient for simulation purposes in light of the fact that a more fine-grained representation of execution time with a larger array would not greatly impact the statistical calculations and would therefore provide little gain. This is because the standard error term we're using (see Eqn. 3.1) is really a simplification based on the central limit theorem assumption that all samples are randomly chosen with replacement from an infinitely large population. Our simulation takes samples without replacement from a finite population, as do all sample-based dynamic performance analyses of which we're aware. A more fully correct formula for error would include the finite population correction factor [7], making the population proportion standard error from which confidence errors are calculated as follows:

\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\;\sqrt{\dfrac{N-n}{N-1}}

We chose to make use of the simplified version because for very large populations N and sample sizes n of 5% or less of the total population, the finite population correction factor typically evaluates to something very close to 1 and has minimal impact on the calculated error measurement. Additionally, determining N for a program execution is likely impossible. Once the array was prepared we ran 1,000 iterations of simulated 300 second executions for each of the sampling sizes from 900 to 44,400 at increments of 1,500. So, we ran 1,000 different iterations which took 900 random samples, then 1,000 iterations of 2,400 random samples, etc., all the way to 1,000 iterations of 44,400 random samples. The sampling scheme was random, not systematic as is typically done in dynamic performance analysis, and was meant to approximate systematic sampling rates of 3 per second through 148 per second at increments of 5 per second. To factor in the perturbation error each sample was assessed an overhead of 250 microseconds. The results of our simulation are graphed in Fig. 4.3. Figure 4.3: This graph presents the simulation results for measurements of the execution of foo within a simulated program that executes (when unperturbed) for a total of 300 seconds and in which foo accounts for 20% of the execution time. 1,000 simulations were run per sample size with the middle 95% of measurements shown as the solid part of the bar. The whiskers above and below each bar indicate the remaining 5% (2.5% above and 2.5% below). The results of our simulation compare very favorably with the predicted results of our analytic function from Fig. 4.2. As predicted, 95% of the simulation results generated with 2,400 samples fell between 55 and 65 seconds. As well, the lower curve of our prediction graph crossed over the 60 second line after the 28,000 sample mark, which matches what we see with the simulation results.
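For concreteness, the predicted minimum itself is straightforward to compute. The short C program below evaluates equation (4.16), as reconstructed above, with the Chapter 4 parameters; the function and variable names are ours and are illustrative only.

    #include <math.h>
    #include <stdio.h>

    /* Equation (4.16): the sample count at which the top curve of the model
       (residual measurement error plus accumulated perturbation) is minimal. */
    static double sweet_spot(double T, double z, double p, double o)
    {
        return pow(z * T * sqrt(p * (1.0 - p)) / (2.0 * p * o), 2.0 / 3.0);
    }

    int main(void)
    {
        double T = 300.0;      /* unperturbed run time, seconds       */
        double z = 1.96;       /* 95% confidence level                */
        double p = 0.2;        /* foo's proportion of execution time  */
        double o = 0.00025;    /* overhead per sample, seconds        */

        double n = sweet_spot(T, z, p, o);
        printf("predicted sweet spot: %.0f samples (period %.1f ms, ~%.0f samples/s)\n",
               n, 1000.0 * T / n, n / T);
        return 0;
    }

With these inputs it reports roughly 17,700 samples, i.e., a sampling period of about 17 ms, or about 59 samples per second, which corresponds to the minimum visible on the top curve of Fig. 4.2.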
So, the simulation results indicate the model has validity.

Chapter 5: Sequential Execution

Given the intuition of our abstract model, the results from our analytic model, and the outcome of our simulation, we next progress to examining executing programs. For this part of our research, we conduct experiments with a simple calibration program with characteristics very similar to those used to evaluate the analytic model and conduct the simulation, then do extensive experiments with several programs from the SPEC CPU 2006 benchmark suite [29, 58]. Our model of the sampling-induced effects of perturbation on the execution of these sequential programs holds up reasonably well. Though unable to predict the best sampling period for every experiment, our model sometimes picks the best period; when it does not pick the best period, it chooses something close to the best period; and, importantly, it avoids bad choices. Compared to the 28 sampling intervals used during our experiments, our technique guides us to better results than all but two intervals, and our results are equally good compared to those two intervals.

5.1 Calibration Program

5.1.1 Experiment Design and Environment

To further validate our idea we designed an experiment where a program running for approximately 300 seconds, with a function foo accounting for 20% of the execution time, was periodically interrupted and the function name at the top of the call stack recorded. We ran the experiment 100 times for each of 23 different specific systematic sampling rates, from 3 samples per second up to 166 samples per second. The results of those experiments are presented in Fig. 5.1.

Figure 5.1: This graph presents the results for measurements of the execution of foo within our calibration program that executes (when unperturbed) for a total of approximately 300 seconds and in which foo accounts for approximately 20% of the execution time. 100 experiments were run per sample size with the middle 95% of measurements shown as the solid part of the bar. The whiskers above and below each bar indicate the remaining 5% (2.5% above and below).

The sampling rates used for the experiments were chosen to match the resolution available with ITIMER_PROF, the software interval timer we used. We used rates of 3, 8, 13, 18, 23, 28, 33, 38, 43, 47, 52, 58, 62, 66, 71, 76, 83, 90, 100, 111, 125, 142, and 166 samples/second. Because the resolution of the timer used to generate interrupts is approximately 1 millisecond, the number of sampling rates possible is greatly reduced. Fig. 5.2 delineates the relationship between the target and actual sampling rates from 1 per second through 500 per second. Instead of the seemingly possible 500 sampling rates, only 61 unique sampling rates are actually achievable when using the straightforward approach of dividing 1 second by the desired sampling rate and using the resultant fractional part of a second as the timer interval. The millisecond resolution of ITIMER_PROF creates a rounding effect that gets more pronounced the greater the number of samples we try to take.

Figure 5.2: Sample rate limitations

Development of our sampling tool and execution of all experiments were done on a machine code-named Mashie. It is a quad dual-core Intel Xeon 2.33GHz with 4GB of main memory. It runs the Linux operating system, version 2.6.9-89.0.25-ELsmp, Red Hat 3.4.6-11. The measured program was written in C and compiled with gcc 4.4.3 using compiler flags -Wall -g -fPIC -rdynamic.
The first part of the measurement tool was written in C++ and compiled with g++ using compiler flags -Wall -g -fPIC and library flags -ldyninstAPI -lsymtabAPI -lcommon -liberty -lelf -ldwarf. The library part of the measurement tool was written in C and compiled with gcc using flags -shared -lm -lrt.

5.1.2 Program

The program we analyzed during our experiments was written to mimic as closely as possible the behavior of the program used during simulation. We used the same 1,000,000 integer array, shuffled it in the same manner, and seeded rand() with the same value. The array was then used to call functions in the same order as was done in the simulation. We created identical loops executing simple mathematical operations in each function in order to ensure each took as close as possible to 300 microseconds to execute and thereby provide a program that runs very nearly 300 seconds. Due to the imprecise nature of this technique the program ends up running approximately 300 seconds.

5.1.3 Measurement Tool

Our measurement tool works in the following way. We invoke our tool at the command line with three arguments: the length in seconds the analyzed program executes when unperturbed, the sampling rate desired, and the executable name of the program to be analyzed (plus whatever command line arguments are required by that program). The tool forks a process for the program to be measured and inserts our tool library into that address space. It then calls an initialization function from our tool library that establishes some data structures to collect performance analysis data, installs a signal handler, saves the start time, and sets the interval timer. The program to measure is then started and it executes in the normal way. Whenever the interval timer expires, a signal is raised that is dealt with by the installed signal handler. Within the signal handler the program's stack is accessed and the currently executing function name recorded. Upon program termination the contents of the performance analysis data structures are printed to a file and the tool then exits.

The measurement tool we constructed consists of two main parts. Both are compiled against Dyninst [11], a run-time code-patching library. The first part executes in relation to the analyzed program much like a parent process. It is responsible for setting up the analyzed program's runtime environment, adding the tool library, calling the library initialization function, and then starting the program with its required arguments. The tool front-end waits until the measured program completes execution, and then calls the final library clean-up functions and exits. The second part of our tool is the library that is added to the process space of the measured program and encapsulates the functionality for computing the sample period, handling signals, storing the performance data, and writing it to an output file.

5.1.4 Result Comparison

Our results from this experiment with the calibration program are presented in Fig. 5.3 alongside the results from our simulation and our model prediction from Chapter 4. The upper and lower curves of the confidence interval at the 95% confidence level from Fig. 4.2 are converted to the format of the results from Fig. 4.3 and Fig. 5.1 in order to make comparison easier.

Figure 5.3: This graph presents a consolidated view of the results depicted in Figs. 4.2, 4.3, and 5.1. Each three-bar group pictured per sample count shows, left to right, the model prediction, simulation result, and experiment result.
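To make the mechanism described in Section 5.1.3 concrete, the following is a simplified, self-contained C sketch of interval-timer-driven sampling. It is not an excerpt of our Dyninst-based tool; the handler here only counts samples, the sampling rate is hard-coded in main, and the busy loop merely stands in for a measured program:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    static volatile sig_atomic_t total_hits;   /* number of samples taken */

    /* SIGPROF handler: the real tool inspects the interrupted call stack here
     * and records the currently executing function; this sketch only counts. */
    static void on_sigprof(int sig)
    {
        (void)sig;
        total_hits++;
    }

    /* Install the handler and arm ITIMER_PROF to fire 'rate' times per second. */
    static int start_sampling(long rate)
    {
        struct sigaction sa;
        struct itimerval it;
        long usec = 1000000L / rate;   /* interval, truncated to whole microseconds */

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigprof;
        sa.sa_flags = SA_RESTART;
        if (sigaction(SIGPROF, &sa, NULL) != 0)
            return -1;

        it.it_interval.tv_sec  = usec / 1000000L;
        it.it_interval.tv_usec = usec % 1000000L;
        it.it_value = it.it_interval;  /* first expiration after one full interval */
        return setitimer(ITIMER_PROF, &it, NULL);
    }

    int main(void)
    {
        if (start_sampling(100) != 0) {    /* 100 samples per second */
            perror("start_sampling");
            return 1;
        }
        /* Burn CPU so ITIMER_PROF, which counts process time, advances. */
        volatile double x = 0.0;
        for (long i = 0; i < 200000000L; i++)
            x += i * 1e-9;
        printf("samples taken: %ld\n", (long)total_hits);
        return 0;
    }

In our actual tool this handler logic lives in the library loaded into the measured process, and the sampling interval is derived from the rate given on the command line as described above.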
The results are promising in that they show an outcome for the simulations and experiments that is generally consistent with that predicted by our analytic model. There is more variability in the results of the experiments than the simulations, which is expected partly due to the nature of running on a real system, but may also be partly attributed to the order of magnitude difference in the number of runs conducted: 100 per experimental sampling rate vs. 1,000 per simulated sampling rate.

5.2 SPEC CPU Programs

To further test our idea, we conducted extensive ITIMER-based sampling experiments with five SPEC CPU 2006 benchmark programs: bzip2, mcf, milc, omnetpp, and sjeng. We will use omnetpp for the initial explanation of our experiments and for more in-depth analyses of the results because it is closest in runtime to our simulation and calibration programs.

5.2.1 Experiment Design and Environment

Running experiments with the SPEC CPU 2006 benchmark programs presented a new challenge to the method of analysis we had been using. When evaluating our analytic function and running simulations, we knew truth for foo. With the calibration program, we could carefully and directly affect truth for foo such that we had high confidence in the value we used. However, determining truth for the various functions in each SPEC CPU program was part of the very problem we were investigating. So, we decided on the following weak measurement method for determining the values for the execution proportion of a given program's functions: execute a program multiple times and sample with a low perturbative sampling rate, then treat the median result for function execution time as "truth."

Determining the number of times to execute a program in order to gather the samples necessary to calculate "truth" was somewhat arbitrary. For example, with omnetpp we determined that the function that took the most execution time, cMessageHeap::shiftup(int), accounted for roughly 15% of it. So, using p=.15, z=1.96 (for a 95% confidence level), and r=.0025, we used Jain's equation (2.4) to determine 78,369 samples were required. We settled on a sample rate of two per second to provide balance between minimizing perturbation and keeping the overall experiment runtime to something not unduly burdensome. We then executed omnetpp 115 times, accumulating a total of 79,021 samples (an average of 687 samples per run). The measured values of cMessageHeap::shiftup(int) across all runs were distributed as depicted in Figs. 5.4 and 5.5. Fig. 5.4 shows the distribution of the value of p for cMessageHeap::shiftup(int) by the number of samples collected on a per run basis. Fig. 5.5 shows the distribution of the value of p when sorted from the smallest to the largest calculated value.

Figure 5.4: This graph presents the distribution of the calculated value of p for cMessageHeap::shiftup(int) arranged by number of samples taken.

Figure 5.5: This graph depicts the distribution of the calculated value of p for cMessageHeap::shiftup(int) ordered from smallest to largest.

After inspecting the distributions of these values we reasoned that using the median value as truth, rather than the mean, was appropriate because the larger estimated values in the distribution likely reflected more perturbation error caused by transient system behavior and would therefore unnecessarily skew the mean. We thus determined the "true" value of each function in each of the SPEC CPU programs we used in a similar manner.
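For reference, the sample-size calculation above is consistent with the standard formula for estimating a proportion p to within an absolute error r at standard score z; we assume here that Jain's equation (2.4) takes this standard form:

    n = \frac{z^2\, p(1-p)}{r^2} = \frac{1.96^2 \times 0.15 \times 0.85}{0.0025^2} \approx 78{,}369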
The following table indicates, for each of the SPEC CPU 2006 benchmarks analyzed: how many times we executed each while taking two samples/second; how many total samples were collected, aggregated across all runs; the average number of samples collected per run; the function (foo) that accounted for the greatest percentage of execution time; and the percent of execution time (p) attributed to foo.

SPEC      # Runs   # Samples   Avg samples/run   foo                          p
bzip2     352      61,858      176               BZ2_blockSort                38.5%
mcf       99       77,596      784               primal_bea_mpp               43.6%
milc      95       87,537      921               mult_su3_na                  18.0%
omnetpp   115      79,021      687               cMessageHeap::shiftup(int)   15.8%
sjeng     76       79,236      1,043             std_eval                     15.3%

Table 5.1: Calculation of largest function per benchmark.

All experiments were done on a machine named Niblick. It is a quad dual-core Intel Xeon 2.93GHz with 4GB of main memory. It runs the Linux operating system, version 2.6.9-89.0.25-ELsmp, Red Hat 3.4.6-11.

5.2.2 Programs

We chose this particular subset of SPEC programs for several reasons. First, we avoided using the benchmarks that were aggregations of shorter duration program runs scripted to run consecutively. Our concern with them was the potential introduction of additional overhead and the possibility of further complicating the profile results with an additional level of sample combining. We did make use of bzip2, which is provided by SPEC as an aggregation, but only ran it with the single input that caused it to execute the longest. Second, we focused on programs that had execution times nearest to our simulation and calibration run times of 300 seconds. Third, we had a slight personal preference for benchmarks written in C and C++ as compared to those written in Fortran.

5.2.3 Measurement Tool

Rather than use the simple measurement tool we created, we used Rice University's HPCToolkit [56], an integrated suite of tools used for measurement and analysis of program performance. This switch was motivated by the maturity of the Rice tool suite and its acceptance in the systems research community, as well as its additional functionality, like tracking calling context. The workflow is depicted in Fig. 5.6, which is copied from their site. We very slightly modified the hpcrun shell script utility to output the run time of each program. We created a small dynamically-loaded shared library so that, at exit (via atexit()), the combined processor time charged for the execution of a program's instructions (tms_utime) and the processor time charged for execution by the system on behalf of the program (tms_stime) were output to a file. Otherwise we used the HPCToolkit components as provided.

Figure 5.6: HPCToolkit component work flow.

We used the hpcstruct and hpcprof components of HPCToolkit as designed, to associate calling context measurements with source code structure and to overlay call path profiles, respectively. However, because we intended to analyze the measurement results in ways other than those hpcviewer or hpctraceviewer are designed for, we inserted our own data processing steps into the workflow. We wrote a parser to handle the XML file that results from running hpcprof, and then two programs to take the combined output from the parser program and output per function data for either execution percent or execution time.

5.2.4 Execution Details

When invoking hpcrun we made use of the wall clock timer functionality; e.g., hpcrun -e WALLCLOCK@500000 omnetpp omnettpp.ini runs the SPEC CPU program omnetpp with input parameter omnettpp.ini and samples omnetpp every 500,000 microseconds.
In addition to collecting tens of thousands of samples at a sampling period of 500 milliseconds across tens of runs to establish "truth", we also conducted sampling runs with each SPEC program 10 times at each of 28 different sampling intervals (200, 100, 67, 50, 33, 29, 25, 22, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, and 1 ms) for a total of 280 runs per SPEC benchmark. The 280 runs were scripted to execute consecutively, iterating over the 28 sampling intervals 10 times, in order to create some separation in time among each of the 10 runs of any particular sampling interval, thus minimizing the impact of any temporary spike in system noise. At the end of the 28x10 experiment runs we parsed the output files, aggregated and sorted the results, analyzed them in a spreadsheet, and presented them in a graph.

5.2.5 Results

Though we make use of program and function time in our analytic equation and in our previous experiments, we will present the results in this section using percent of execution (p) because it makes the comparison to "truth" in the analysis section more meaningful. Sampling a running program allows us to calculate a statistic for the proportion (or percent) of execution time during which a given function was executing. If we convert that to time, we are simply multiplying the calculated proportion by the complete execution time of the sampled program. As well, picking a best sampling rate when considering more than one function creates a situation where using time can allow an unbounded time-based outlier to disproportionately affect the performance evaluation. Using proportion provides a degree of bounding that limits the effect of outliers.

Omnetpp

The next four figures display the results of the 28x10 experiments for each of the top four functions of the omnetpp benchmark. The center gridline in each chart is the proportion "truth" for that function as determined during our weak measurement runs. The red dashed lines indicate the 95% confidence interval with respect to decreasing measurement error only; so, they are computed solely based on the proportion "truth" and the number of samples taken. The red and blue markers are intended to make it easier to differentiate among runs of adjacent sample intervals.

Figure 5.8: Plot of proportion calculation results for cMessageHeap::shiftup(int), the function taking the most execution time for omnetpp, across the 28x10 execution runs.

Figure 5.9: Plot of proportion calculation results for cGate::deliver(cMessage*,double), the function taking the 2nd most execution time for omnetpp, across the 28x10 execution runs.

Figure 5.10: Plot of proportion calculation results for cSimulation::selectNextModule(), the function taking the 3rd most execution time for omnetpp, across the 28x10 execution runs.

Figure 5.11: Plot of proportion calculation results for cModule::findGate(char const*,int) const, the function taking the 4th most execution time for omnetpp, across the 28x10 execution runs.

To provide a visual comparison between the execution proportion results and the execution time results, and to illustrate the reason for looking at proportion as opposed to time when evaluating the results for these experiments, we've included in Fig. 5.12 one graph of our results with execution time on the y axis. The graph is of the function cMessageHeap::shiftup(int), the omnetpp function that takes the most execution time.
Figure 5.12: Plot of execution time calculation results for cMessageHeap::shiftup(int), the function taking the most execution time for omnetpp, across the 28x10 execution runs.

To evaluate the accuracy of the results of the 10 runs at each of the 28 different sampling rates, we decided to calculate the mean absolute percentage error (MAPE) for each of the top four functions as compared to "truth," combine the four values (equation 5.1), and compare them for each of the 28 sampling rates.

cMAPE (5.1)

Whichever sampling rate produced the smallest combined MAPE (cMAPE) value was considered to be the best. The decision to use the top four functions, as opposed to some other number of functions or all of the functions, was based on an inspection of the data generated from all the experiments run on the SPEC programs. A seemingly small difference from "truth" of a function that accounts for a small percentage of execution time overall can have a disproportionate impact on the aggregate MAPE value, so including all functions seemed unwise. It also seemed important to look at more than only the top function, so we decided, based on intuition and experience, upon what seemed a reasonable compromise. We noticed that for the five SPEC programs on which we conducted extensive experiments and detailed analysis, almost all of the top four functions of each program took up a proportion of execution time above 5%; an arbitrary threshold, but one that seemed likely to be of interest.

We determined the APE for each of the 10 runs at the 28 different sampling rates for the top four functions of each SPEC program. Figure 5.13 graphically presents the APE results for each of the 28x10 runs, the calculated MAPE (blue line), and the approximate expected MAPE (red dashed line) calculated statistically, for the top function of omnetpp, cMessageHeap::shiftup(int). We calculated the approximate expected MAPE by taking a 95% confidence interval and dividing by 2. We consider it approximate because with a bell-shaped normal distribution, more measurements should be closer to "truth" than are far away, so dividing by 2 puts the red dashed line a little further from "truth" than it should be; however, we present it here only as a guide, not for measurement purposes, as the approximate nature of the calculation has no impact on our results.

Figure 5.13: Results for the 28x10 runs compared to "truth" for cMessageHeap::shiftup(int). The solid blue line tracks the MAPE for the set of 10 runs at each sample rate and the red dashed line indicates the approximate expected MAPE.

We then combined the MAPE values for the top four functions to determine the best overall sampling rate. Fig. 5.14 presents the MAPE for each of the top four functions in omnetpp while Fig. 5.15 presents the aggregated MAPE results from the same.

Figure 5.14: MAPE results for each of the top four functions in omnetpp.

Figure 5.15: cMAPE results for the top four functions in omnetpp (blue solid line) and the calculated approximate expected cMAPE (dotted red line).

The sampling total that resulted in the overall best result for the omnetpp benchmark was the 48,829 samples taken using a 5 ms interval. Our technique (eqn. 4.16) predicted that the best results would occur with 91,830 samples. When used with the runtime of omnetpp (345 sec) to calculate a sample rate, it predicts that approximately 3.75 ms (345 s divided by 91,830 samples) would be the best sampling interval. This calculation brings to light another challenge when using software timer based sampling.
In addition to the sample rate limitation illustrated in Fig. 5.2, which makes using a sample interval of 3.75 ms infeasible, we see here that the effective sampling outcome of the 5 ms sampling interval generated a total number of samples that would have been expected with the larger interval of 7 ms. It varies somewhat from one program to another, but it is generally true that as the sampling interval gets smaller, the expected number of total samples generated is further from the actual number of total samples generated. For example, with a sampling interval of 100 ms we would expect to generate 3,450 samples, but on average generated 3,352, a difference of 98 or about 2.8%; at 10 ms the expected count is 34,500, while the actual was 28,665 on average, for a difference of 5,835 or 16.9%; and at 1 ms the expected count is 345,000, while the actual was 115,623 on average, for a difference of 229,377 or 66.5%. Thus, we must compare total samples as opposed to the calculated sample interval.

The following four graphs are of the aggregated MAPE results (cMAPE) for the other SPEC benchmarks we used. The graphs of the results of the 28x10 runs for the top four functions in each of these other benchmarks are included in Appendix B.

Figure 5.16: cMAPE results for the top four functions in bzip2 (blue solid line) and the calculated approximate expected cMAPE (dotted red line).

The sampling total that resulted in the overall best result for the bzip2 benchmark was the 21,837 samples taken using a 2 ms interval. Our technique predicted that the best results would occur with 21,418 samples.

Figure 5.17: cMAPE results for the top four functions in mcf (blue solid line) and the calculated approximate expected cMAPE (dotted red line).

The sampling total that resulted in the overall best result for the mcf benchmark was the 49,106 samples taken using a 6 ms interval. Our technique predicted that the best results would occur with 38,472 samples.

Figure 5.18: cMAPE results for the top four functions in milc (blue solid line) and the calculated approximate expected cMAPE (dotted red line).

The sampling total that resulted in the overall best result for the milc benchmark was the 36,622 samples taken using a 10 ms interval. Our technique predicted that the best results would occur with 60,555 samples.

Figure 5.19: cMAPE results for the top four functions in sjeng (blue solid line) and the calculated approximate expected cMAPE (dotted red line).

The sampling total that resulted in the overall best result for the sjeng benchmark was the 175,387 samples taken using a 1 ms interval. Our technique predicted that the best results would occur with 153,811 samples; this is nearly midway between the number of samples possible with a 2 ms sample rate and a 1 ms sample rate, but is slightly closer to the 1 ms rate.

5.2.6 Analysis of Analytic Equation Results

Our analytic equation guided us to the sampling interval with the best outcome twice out of the five experiments conducted. Table 5.2 presents the overall results. The calculated cMAPE value is shown to provide a sense of how our predicted best outcome compares to the measured best outcome as well as to the measured worst outcome. This aggregation of the mean absolute percentage error (MAPE) for each of the top four functions as compared to "truth" is calculated per equation 5.1.
            Predicted Best Outcome                Measured Best Outcome                Measured Worst Outcome
            Sample    Nearest     cMAPE           Sample    Interval    cMAPE          Sample    Interval   cMAPE
            Total     Interval                    Total                                Total
Omnetpp     91,830    2 ms        9.78%           48,829    5 ms        8.75%          1,665     200 ms     38.77%
Bzip2       21,418    2 ms        3.38%           21,837    2 ms        3.38%          431       200 ms     28.96%
Mcf         38,472    8 ms        8.04%           49,106    6 ms        6.40%          1,954     200 ms     11.29%
Milc        60,555    6 ms        7.02%           36,622    10 ms       6.05%          2,164     200 ms     18.78%
Sjeng       153,811   1 ms        5.37%           175,387   1 ms        5.37%          2,580     200 ms     19.00%

Table 5.2: Predicted vs. actual outcomes.

On the surface, this is a modest outcome; however, when compared to an experimental setup that chooses a single sampling interval to use across all five experiments (as is commonly done), our analytic equation does better than all tested intervals except two. And against those two intervals, 5 ms and 2 ms, our equation is effectively tied: 2-1-2 with 5 ms and 1-3-1 with 2 ms, meaning that twice 5 ms is a better choice than what our equation recommends, once they tie, and twice our equation's recommendation is better. As well, our equation did a good job of avoiding bad choices.

Chapter 6: Parallel Execution

In the next part of our investigation, we examine our idea for predicting the "sweet spot" sample count needed to balance the effects of decreasing measurement error and increasing perturbation error in the context of evaluating shared-memory parallel program executions. We set out to verify the applicability of our technique through extensive experiments with several programs from the SPEC OMP 2001 benchmark suite [59, 6], and we report that our model does reasonably well. Our model once again sometimes exactly predicts the best sampling rate, is generally within close proximity to the best sampling rate otherwise, and avoids bad choices.

6.1 Preliminary Notes on Experiments

Running experiments with the SPEC OMP 2001 benchmark programs presented two new challenges over those noted earlier in Chapter 5. The first challenge was how much more the parallel program executions varied among runs compared to the variance of the sequential executions. The second challenge encountered was the odd behavior of the software timer (ITIMER_PROF) used with the profiling tool, HPCToolkit, for performance analysis of shared-memory multiprocessing programs.

6.1.1 Variance of Execution

Program execution times have been studied and results have been reported that indicate non-trivial variations in program behavior are relatively common [64, 37, 54]. Somewhat surprisingly, it has recently been shown that in addition to execution variance rooted in the inherent non-determinism of computer systems, measurement bias can also be a more common factor than anticipated [47]. In fact, Mytkowicz et al. claim that measurement bias can manifest when relatively insignificant aspects of a computer system are changed. Their results indicate the bias can lead to performance analysis that over-states an effect or even leads to incorrect conclusions, and that the bias was evident across all architectures, both compilers, and most programs they tested. Further empirical study has shown that parallel execution times are even more variable than sequential execution times [44]. Mazouz et al. specifically use experiments with and comparisons among the SPEC CPU 2006 benchmark suite and the SPEC OMP 2001 benchmark suite to demonstrate their findings.
Compared to the overall run time variance seen in our experiments with the SPEC CPU 2006 benchmarks used, we also saw a marked increase in the variance of the run times of the parallel benchmark executions. Across all SPEC CPU 2006 benchmarks, the average standard deviation was 1.7% of execution time, compared to 10.5% for the SPEC OMP 2001 benchmarks. As an example, Fig. 6.1 shows the distribution of run times from some of our experiments on the SPEC OMP 2001 benchmark equake. It is easy to see the large degree of variance possible among each set of 20 runs and the greater variance exhibited by 12-core runs compared with 8-core and 4-core runs, as well as that exhibited by 8-core runs compared with 4-core runs. In comparison, Fig. 6.2 shows a similar sampling of runs of our experiments with the SPEC CPU 2006 benchmarks we used. Though variance is present, it is much less of a factor than with equake. Note that the scale of the y axis of Fig. 6.2 is set to match Fig. 6.1 so the comparison is more meaningful. For both figures, the legend is at the top of the graph with differently shaped/colored markers differentiating among the different numbers of cores (Fig. 6.1) or the different benchmarks (Fig. 6.2). The alternating yellow and white background is intended simply as an aid to make it easier to see which data points are associated with which sampling intervals.

Figure 6.1: Distribution of run times, 20 each, sorted shortest to longest within each sampling interval, for the SPEC OMP 2001 benchmark equake.

Similar charts showing the variance of run times for the other SPEC OMP 2001 benchmarks we used are contained in Appendix C. The benchmark swim exhibited the most variance and is the only chart using a different y axis scale in order to ensure all runs can be displayed. In light of the expected and observed increase in variance of run times for the parallel executions, we altered our standard experimental setup to include 20 iterations of each program execution for any given sample rate explored, rather than the 10 iterations we used with the sequential program executions.

Figure 6.2: Distribution of run times, 10 each, sorted shortest to longest within each sampling interval, for the SPEC CPU 2006 benchmarks sjeng, milc, mcf, omnetpp, and bzip2.

6.1.2 Software Timers

For experiments with sequential executions in Chapter 5, we made use of the HPCToolkit hpcrun call to WALLCLOCK, which specifies ITIMER_PROF, indicating that the Linux profiling timer should be used for timing the sampling interval. We switched to using the HPCToolkit hpcrun call to REALTIME, specifying the CLOCK_REALTIME Linux timer, for our experiments evaluating shared-memory parallel program executions in light of very inconsistent results we were getting with ITIMER_PROF. A problem with the ITIMER_PROF interval timer seems to be that it is not thread-specific and profiles only the main thread when profiling a multithreaded program, thereby producing incorrect and highly variable results.

Software sample-based profiling uses software clocks or timers to set the sample rate. Each process in Linux is provided with three interval timers: ITIMER_REAL, ITIMER_VIRTUAL, and ITIMER_PROF. ITIMER_REAL measures wall-clock time, decrements at all times regardless of whether the profiled program is executing, and delivers SIGALRM upon expiration. ITIMER_VIRTUAL measures process time, decrements only when the profiled process is executing, and delivers SIGVTALRM upon expiration.
ITIMER_PROF is intended for use with profilers; it decrements both when the profiled process is executing and when the operating system is executing specifically on behalf of the profiled process, and it delivers SIGPROF upon expiration. ITIMER_PROF is the preferred of the three available process-specific software interval timers because it should provide the most accurate accounting of the profiled process.

We experienced difficulty achieving consistent results during our initial profiling experiments with the SPEC OMP 2001 benchmarks and were at first very puzzled by the results. Upon investigation into the cause of the odd results, we discovered that the HPCToolkit authors had also identified this clock problem and in their latest user manual [45] recommend using the hpcrun call to REALTIME, which is based on the CLOCK_REALTIME Linux timer that counts wall clock time. They also recommend trying the hpcrun call to CPUTIME, which is based on the CLOCK_THREAD_CPUTIME_ID Linux timer, but we found some benchmarks tended to hang when using this timer. In light of the inconsistent results we were getting with the ITIMER_PROF interval timer and our subsequent discovery of the updated advice in the HPCToolkit user manual, we switched to using REALTIME in all of our SPEC OMP 2001 benchmark experiments. All results presented in this chapter use calls to REALTIME.

6.2 SPEC OMP 2001 Programs

We chose to evaluate five of the SPEC OMP 2001 benchmarks in our experiments: applu, equake, fma3d, swim, and wupwise. These particular five benchmarks were chosen for a combination of reasons: we experienced no difficulty or errors when building them, and they had run times that were of reasonable length, and thus manageable, for the number of experiments we wanted to conduct. We chose to exclude apsi, mgrid, galgel, and gafort because of compiling difficulties and some early error-terminating run results; art because it contained a single function that dominated overall execution time with >95%; and ammp due to its significantly longer runtime than all other benchmarks.

6.2.1 Experiment Design and Environment

Execution of all experiments was done on a machine code-named Pygmy. During experiment runs the machine was dedicated to a single user. Pygmy is a 12-core Intel Xeon running at 2.53 GHz. It runs Ubuntu version 12.04.4 LTS with the Linux kernel 3.2. The programs we evaluated were each executed using 4, 8, and 12 cores of Pygmy. Since we compiled the benchmarks with GNU's gcc, we had the option of specifying at runtime the environment variable GOMP_CPU_AFFINITY in order to bind the OpenMP threads to specific processing units. For every run using 4 cores we specified GOMP_CPU_AFFINITY=8-11; using 8 cores we specified GOMP_CPU_AFFINITY=4-11; and using 12 cores we specified GOMP_CPU_AFFINITY=0-11.

6.2.3 Programs

Of the SPEC OMP 2001 benchmarks evaluated, equake is written in C and the others are written in Fortran. We used the 'M' input set for all the benchmarks, which is designed to be used on shared-memory systems with between 4 and 32 processors (as opposed to the 'L' input set, which can be used with up to 512 processors) [58]. We used the reference input data set (as opposed to the test or train data sets) for all experiments. The benchmarks were compiled with gcc 4.5.2 within the build system provided by SPEC OMP, though all benchmark runs were conducted outside of the SPEC scripting system.
6.2.4 Execution Details

For each experiment run, we used our analytic function (4.16) to predict where we expected to find the sampling "sweet spot" given the experiment parameters. Then, because of the relatively long experiment runtimes involved with executing each benchmark multiple times at each sampling interval for each of the three thread-core settings of interest (4, 8, and 12), we decided to bracket the results with sampling intervals differing by an order of magnitude, compare them to our prediction in a qualitative manner, and then allow those results to guide more focused investigation. For each benchmark at each of the three designated thread-core counts, we started by conducting 20 runs at each of 5 different sampling intervals: 10,000 ms (10 sec), 1000 ms (1 sec), 100 ms, 10 ms, and 1 ms. Once that was done, we conducted comparative qualitative analysis between graphs we created of function proportions and our sampling equation prediction. Based on the outcome from that analysis, we determined which broad interval was of the most interest, conducted 20 more runs at each of 6 new sampling intervals, and then carried out more in-depth quantitative analysis between the proportion results at our predicted value and the results at the new sampling intervals.

6.2.5 Results

In this section we present some results from experiments with the SPEC OMP 2001 benchmark equake executing with 4 threads on 4 cores to demonstrate how we arrived at our findings. When we use our analytic function (4.16) for predicting the number of samples we should take during an execution, using preliminary data for smvp.omp_fn.5 and equake run with 4 threads on 4 cores, we get a predicted sample count of 78,854. Taking samples at the 5 order-of-magnitude differing intervals resulted in median sample counts of 116; 1,171; 11,737; 117,992; and 1,179,613 respectively across the 5 sets of 20 runs. In Fig. 6.3 are the results from these bracketing runs of the calculated percentage of execution for the function taking the most execution time, smvp.omp_fn.5. The data within each set of 20 runs are sorted by run length, shortest to longest, left to right. The results in Fig. 6.4 are the calculated percentage execution values for the top four functions in equake presented the same way.

Figure 6.3: Distribution of the percent execution taken by smvp.omp_fn.5 from 100 runs of equake with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.

With qualitative analysis of the two charts, we note that the 10 sec sample interval results in calculated proportions that are quite variable and have 95% confidence intervals that are far wider than desired. As can be seen in Fig. 6.4, for almost all runs of GOMP_taskwait the CIs overlap with those of both smvp.omp_fn.5 and main.omp_fn.10, so statistically the correct order of most computationally expensive to least cannot be determined. As confidence intervals indicate precision, rather than accuracy, the 1 ms sample interval provides an outcome that is not particularly useful. About half of the runs have a very tight grouping among the calculated proportions, but the 95% confidence intervals are so narrow that the slightest variation indicates there are statistically significant differences among many of the runs.
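For reference, the confidence-interval whiskers on these calculated proportions follow the standard large-sample interval for a proportion (this assumes the same simplified standard error, equation (3.3), used elsewhere in this dissertation):

    \hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \qquad z = 1.96 \text{ at the 95% confidence level,}

so the half-width narrows only as 1/\sqrt{n}; very large sample counts therefore produce the extremely tight intervals seen at the 1 ms interval, even when run-to-run variation remains large.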
Somewhere between the 100 ms sampling interval and the 10 ms sampling interval seems to provide the best balance between narrowing the confidence interval and getting consistent results. Our prediction falls between the median counts for the 100 ms sampling interval (11,737 samples) and the 10 ms sampling interval (117,992 samples), so we examined this range more closely.

Figure 6.4: Distribution of the percent execution taken by 4 different functions from 100 runs of equake with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.

Note that our use of equation (4.16) treats the wall-clock time of a parallel shared-memory program in the same manner as that of a sequential program. We make no modification to the equation based on the number of threads or number of cores being used. At first glance, this may seem odd since the total execution time (the aggregate of core execution times across all cores) seems analogous to the wall-clock time for a sequential program. However, when we explored that technique we found that sticking with the wall-clock time provides better results. Using the longer total execution time resulted in taking more samples, which unsurprisingly turned out to be more disruptive to parallel programs and provided poorer results.

To evaluate the effectiveness of our prediction, we used a technique similar to the broad bracketing used earlier. We calculated six sampling intervals at which we could collect sampling counts that would more narrowly bracket the desired sampling count of 78,854. In addition to running an experiment set at the predicted "sweet spot," we used a sample count interval of 20,000 (for predicted sample counts >= 100,000 we used an interval of 30,000) to calculate the timing intervals required to collect sample counts at 3 steps below the predicted count and 2 above. This technique does not exhaustively test all sampling intervals around our prediction, but it offers the benefit of a check on the qualitative analysis of the results from the large bracket runs while also providing an effective trade-off between result granularity and the resource burden of intensive experimentation. The six sampling counts and the respective sampling intervals used for the smaller bracket experiments with equake, 4 threads on 4 cores, are in Table 6.1.

Interval (ms)            63       30       20       15       12       10
Expected sample count    18,000   38,000   58,000   78,000   98,000   118,000

Table 6.1: Sampling intervals and expected approximate sample counts for targeted experiments with equake using 4 threads on 4 cores.

The next two charts present the results of the calculated percentage execution of smvp.omp_fn.5 for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval. Fig. 6.5 displays the experiment runs grouped by sampling interval with 95% confidence intervals for the expected "true" value depicted as the red dashed lines.

Figure 6.5: Results of the percent execution of smvp.omp_fn.5 for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval. The 95% confidence intervals of the "true" value are depicted as red dashed lines.
Fig. 6.6 displays the same experiment runs using sample count as the x-axis. In this format, runs that encountered some kind of non-deterministic system event that caused deviation from "normal" execution stand out as having run noticeably longer and consequently collected many more samples than peer runs within a particular sample interval cohort. In both graphs, red and blue markers are simply intended to more easily distinguish the runs of adjacent sampling intervals.

Figure 6.6: Results of the percent execution of smvp.omp_fn.5 for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval. The 95% confidence intervals of the "true" value are depicted as red dashed lines.

Additional charts showing the experimental results for the other three functions we looked at for equake, 4 threads on 4 cores, in addition to all charts for the other benchmarks we studied, are in Appendix C.

Statistical science indicates that with 95% confidence intervals around what we believe to be "truth" for a given experiment, we could expect 19 of 20 runs to fall within the interval. Considering the number of runs that fall outside the 95% CIs provides us a granular indication of the effect of sampling perturbation on measurement results. As can be seen in both figures above, more than 1 experiment per group of 20 appears to fall outside the interval, with the incidence of variation increasing as more samples are taken. The table below provides the count of runs per sampling interval that are outside the confidence interval.

Interval (ms)          63   30   20   15   12   10
# runs outside CI      4    3    3    3    6    11

Table 6.2: Sampling intervals and corresponding number of runs (out of 20) with calculated proportion (percent execution) outside the 95% CI for the "true" value of smvp.omp_fn.5 from equake, 4 threads on 4 cores.

It is important to note at this point that determining the "truth" to compare with the calculated proportions for each of our experiments, already a difficult process in Chapter 5, is more difficult in light of the greater variability of parallel program execution. Given this greater difficulty, we tried to ensure greater rigor in this process by applying a chi-square test of homogeneity to the calculated proportions from the many different runs. We used the chi-square test of homogeneity to determine to what degree profiles from the various executions of the same experimental context (e.g., equake, 4 threads on 4 cores) differed. We used the results from the chi-square test to determine which runs from our initial bracketing runs could be combined to create a "true" proportion value with high confidence. We then did the same for the subsequent executions done with the more targeted brackets. After examination of the experimental results, we proceeded in our chi-square test by first calculating the chi-square test-statistic with all data. If the result was statistically significant (at a significance level of .05), rejecting the hypothesis of homogeneity, we sorted all experiments by total runtime within their sample-interval cohorts and then re-calculated the chi-square test-statistic using the 10 shortest running experiments per cohort. Our rationale is that the longer-running experiments are more likely perturbed by a rare but higher-impact event than the shorter-running experiments. If the hypothesis of homogeneity was still rejected with the 10 shortest runs, we then re-calculated with the 5 shortest running experiments per cohort.
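For reference, the chi-square test of homogeneity we applied uses the standard test statistic over a contingency table of sample counts:

    \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}},

with (r-1)(c-1) degrees of freedom for an r-by-c table. The degrees of freedom reported in Table 6.3 appear consistent with tabulating, for each run, the samples attributed to the function of interest against all remaining samples.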
The results of all the chi-square tests for equake, 4 threads on 4 cores, are in Table 6.3 below.

Function             Brackets   χ²      DF   P-value   p       95% CI
smvp.omp_fn.5        Large      11.04   49   1.0000    .3179   (.3176, .3181)
                     Small      36.1    59   .9919     .3147   (.3143, .3151)
GOMP_taskwait        Large      4.1     99   1.0000    .1953   (.1951, .1954)
                     Small      1.3     59   1.0000    .1957   (.1953, .1961)
main.omp_fn.10       Large      16.4    49   1.0000    .1324   (.1322, .1326)
                     Small      26.9    59   .9999     .1309   (.1306, .1312)
omp_get_num_procs    Large      5.7     39   1.0000    .0210   (.0208, .0213)
                     Small      19.5    29   .9065     .0228   (.0226, .0230)

Table 6.3: Chi-square test results for determining "truth" for the percent execution of equake functions.

Except for GOMP_taskwait, the "truth" results for percent of execution for the aggregation of the initial bracketing runs do not match "truth" for the second set of more tightly bracketed runs, despite both sets having "passed" chi-square tests for homogeneity among their own bracketed runs. For example, smvp.omp_fn.5 has a "true" value of p=.3179 with 95% confidence interval (.3176, .3181) after the initial large bracketing runs using order-of-magnitude different sampling intervals, and a "true" value of p=.3147 with 95% CI (.3143, .3151) after the small bracketing runs around our predicted best sampling count. These results, not entirely unexpected, illustrate the problem of large numbers of samples leading to small CIs and further back up the parallel run variation problem noted earlier in this chapter and reported by others [47, 44]. We needed to choose one of the values as "truth" to compare performance analysis results quantitatively, and we decided upon using the "truth" derived with the set of runs to which we were comparing. Our rationale is rooted in the observation that both "true" values for each function differ only after the third significant digit, so the difference is relatively small; it is also based on the points made earlier about inherent execution variability and on both data sets passing the chi-square test for homogeneity.

Given our choice of "truth" we evaluated the second set of smaller bracketing runs for equake, 4 threads on 4 cores, once again making use of equation 5.1 for cMAPE, and determined that the best sampling interval was, in fact, the one our model predicted. We then did the same analysis for all experiments. The results for the analysis of all parallel shared-memory experiments are in Table 6.4 below. The calculated cMAPE value is shown to provide a sense of how our predicted best outcome compares to the measured best outcome as well as to the measured worst outcome.
                    Predicted Best Outcome             Measured Best Outcome             Measured Worst Outcome
                    Sample     Nearest     cMAPE       Sample     Interval   cMAPE       Sample    Interval   cMAPE
                    Total      Interval                Total                             Total
Equake
  4-core            78,854     15 ms       17.87%      80,120     15 ms      17.87%      116       10 s       99.37%
  8-core            53,450     24 ms       17.07%      13,603     100 ms     7.03%       129       10 s       171.75%
  12-core           48,114     31 ms       56.00%      36,922     45 ms      27.27%      156       10 s       188.53%
Applu
  4-core            174,420    14 ms       7.84%       230,588    12 ms      7.05%       240       10 s       85.43%
  8-core            126,297    21 ms       9.67%       207,060    17 ms      7.56%       256       10 s       96.41%
  12-core           81,319     37 ms       2.42%       82,764     37 ms      2.42%       288       10 s       152.30%
Fma3d
  4-core            185,236    15 ms       3.33%       237,635    13 ms      1.46%       276       10 s       27.52%
  8-core            138,683    22 ms       2.64%       149,203    28 ms      1.61%       304       10 s       35.82%
  12-core           96,618     36 ms       1.91%       159,281    23 ms      1.20%       348       10 s       46.63%
Swim
  4-core            131,209    17 ms       144.54%     102,292    23 ms      80.74%      220       10 s       214.14%
  8-core            76,634     31 ms       82.24%      72,584     52 ms      80.05%      232       10 s       249.07%
  12-core           60,505     45 ms       118.92%     145,893    27 ms      65.70%      2,724     1 s        182.09%
Wupwise
  4-core            188,336    18 ms       2.71%       159,761    21 ms      2.58%       336       10 s       18.08%
  8-core            159,459    23 ms       5.48%       170,625    23 ms      5.48%       368       10 s       37.70%
  12-core           99,049     39 ms       2.69%       192,689    20 ms      0.73%       384       10 s       52.46%

Table 6.4: Parallel predicted vs. best outcome.

The results in Table 6.4 show that our model once again does moderately well. In addition to avoiding bad choices for the sampling interval, our technique led to the selection of 3 optimal predictions and 10 total predictions within 30% of the best sample count of the targeted bracket. With all n-core experiments of applu, fma3d, and wupwise, our predicted best outcomes compare well with all measured best outcomes. For equake the comparison is somewhat mixed, while the results for swim, the most memory-intensive of the OMP benchmarks, are not nearly as good. Both the measured best outcome and our predicted best outcome results for swim are noticeably worse than the results for the other experiments. This is caused by significant cache thrashing of the large arrays swim uses during its shallow water modeling, which is further exacerbated by the sampling process.

When compared to an experimental setup that chooses a single sampling interval to use across all 15 experiments (as is commonly done), our analytic equation does better than all tested intervals except one, and it does at least as well as that interval. Thus, using adaptive instrumentation based on our analytic function provides more accurate results than picking a single sampling interval.

Our analytic equation predicts the optimal sample count in our shared-memory parallel execution experiments with the same efficacy as in our sequential execution experiments. We demonstrated that the number of samples required is proportional to the time of execution regardless of the number of cores used. Predicting the best sample count with wall clock time turns out to provide several benefits. It mitigates the increased perturbation potential observed with larger numbers of cores when taking very large numbers of samples. And it moderates the decreasing utility of each successive sample as the sample size gets very large, helping avoid the misleadingly small confidence intervals that very large sample sizes imply.

Chapter 7: Future Work

In this chapter, we describe future work motivated by the research results presented in this dissertation.

7.1 Process Refinement

Our current technique for predicting the optimal sample interval for minimizing overall error relies on initial experiments to determine a starting value for p and a value for o (overhead).
There are many ways that we could make the start-up process less cumbersome and costly. Statically determining these values, or values "close enough," would be ideal, but even selectively starting and prematurely ending runs could save time and resources with programs of moderate length and would likely be necessary for programs long enough that experimenters do not have the luxury of running them multiple times.

7.2 Analytic Function Refinement

An attractive aspect of the statistical science technique for computing confidence intervals, given that certain conditions hold, is the reliance on the number of samples collected regardless of many underlying aspects of the population under investigation. Investigating the feasibility of evolving our analytic function's measure of perturbation into something more like the statistical science community's function for decreasing measurement error, based on the number of samples taken, would greatly improve its usefulness. Even an intermediate step, such as developing different functions for programs with generally equivalent levels of perturbation sensitivity, would be a big improvement and something we intend to delve into.

7.3 Other Application Domains

We studied portions of the class of programs related to sequential execution and shared-memory parallel executions for 12 cores and below. There are more classes of programs where we believe our technique, or something very similar, can be used to improve performance measurement. Exploring programs running with much greater parallelism, especially programs where some cores are used exclusively for program execution while others are used strictly for measurement and management tasks, is one such area of future research. Given the continued reemergence of virtualized machines based on their practical benefits, exploring the experiment space of profiling applications designed to run on virtual machines is another area we would like to investigate.

7.4 Beyond Functions

The effects of perturbation we examined were primarily on the calculated proportion or percent execution of functions, a very common and useful, but singular and high-level, point of analysis. We want to conduct further studies below the function level (e.g., basic block, source line, instruction), as well as from the data-centric perspective, to determine the impacts of perturbation error and how our model might be modified to capture it.

7.5 Computer Universe

In the course of our investigation, it was not uncommon to brush up against references to the Heisenberg Uncertainty Principle, with hints at the idea that computer systems can be viewed in a manner similar to a mini universe where there are limits to the accuracy with which we can know or predict anything. We would like to identify and model other aspects of these systems for inclusion with our model to gain a better understanding of the limits of accuracy. Some examples include the clock resolution error; the duration or reach of a sample's perturbation effect; and the residual "noise" in a system (e.g., interrupts, system calls, context switches).

Chapter 8: Conclusions

With this research, we introduced and explored four ideas related to dynamic performance analysis. We started with the creation of a simple abstract model designed to demonstrate the general effects of measurement error and perturbation error on a program's execution during the process of statistical profiling.
From this abstract model, we developed a mathematical model, constructed with some simplifying assumptions, to describe program execution performance; the derivative of this model can be used to predict the sample count needed to find the "sweet spot" where the combined effects of residual measurement error and expanding perturbation error are lowest. In conjunction with the development of our mathematical model, we conducted an extensive set of experiments to validate the idea that a sampling "sweet spot" existed and to confirm the notion that a model might be used to find or predict the "sweet spot." This confirmation is an important (and, we believe, first) step that undergirds the final larger idea that we should begin to move beyond qualitative guidance to experimenters by providing some quantitative guidance similar to what is available from the field of statistical science.

The starting point of our research, the abstract model, is a simple but useful general model of the interaction of measurement error and perturbation error. It provides a valuable and constructive way of thinking about the interaction of the two error types and their combined effect on measurement results across the spectrum of potential sample counts. It also serves as a reliable indicator of how the interaction of the errors affects overall experimental measurement accuracy of software statistical profiling. It helps challenge the thinking, often implicit in statistical performance profiling, that collecting a large number of samples is a good and useful thing to do, often motivated by the simple fact that it is easy to do so.

Our mathematical model proved to be reasonably accurate at predicting the "sweet spot" sample count, which allowed us to often calculate a sample interval at or near the one we determined to be optimal, while regularly avoiding those that were truly bad. This model is based on some simplifying assumptions, focusing only on cases where performance measurements dilate in response to perturbation; however, across all experimental contexts, our predictions performed better than the choice of using a single sampling interval for all experiments. We demonstrated the high degree of likelihood for the common existence of a measurement accuracy "sweet spot" through experimental results from simulations and from experiments with our calibration program, with sequential execution on a subset of SPEC CPU 2006 benchmarks, and with shared-memory parallel execution on a subset of SPEC OMP 2001 benchmarks. Further, we showed our analytic model to be sufficiently accurate in predicting the "sweet spot" and calculating the corresponding sample interval that it could stand against any singularly favored and unchanging sample interval.

The systems performance analysis research literature is replete with qualitative suggestions and guidance like that provided in a 1969 paper by W.R. Deniston: "A prime consideration in developing a software measurement technique to obtain internal data is that a suitable compromise between resolution and system degradation must be achieved." [14] The main thrust of our research is in the direction of providing experimenters with some useful quantitative guidance related to this problem of balancing measurement and perturbation errors.
To that end, we believe we have provided an important step intended to help experimental procedures move beyond the well-founded suggestions to use "reasonable" or "appropriate" sampling intervals, towards solid quantitative guidance that helps experimenters calculate sampling intervals that result in execution profiles with greater accuracy.

Appendix A

In this appendix we provide the sequence of steps by which we progress from our mathematical analytic function to our formula for calculating the ideal number of samples for a given performance analysis run. The following notation is used:

tfoo(n)   Execution time calculated for foo based on n samples
n         Number of samples
o         Overhead time cost per sample
          Measured proportion of total program execution time spent in foo
T         Un-sampled (true) total program execution time
z         Standard score (z-value) for confidence level used

Table A.1: Analytic function notation

Our analytic function is as follows:

(A.1)

To begin, we take the first derivative of our analytic function with respect to n as follows:

(A.2)
(A.3)
(A.4)
(A.5)
(A.6)
(A.7)
(A.8)

We can now take the simplified first derivative, set it equal to 0, and solve to find the minimum point of our top curve as expressed by our analytic function.

(A.9)
(A.10)
(A.11)

We use this final equation (which is also equation 4.10 from Section 4.2.2) to calculate the sample size, and then determine the sampling rate that provides the best balance between achieving measurement precision and limiting the effects of perturbation error, in the following manner: solve for n using known or estimated values for the terms T, z, the measured proportion, and o, then divide n (the total number of samples) by the number of minutes the program will run.

Appendix B

Additional detailed results relevant to the analysis of the SPEC CPU 2006 sequential program executions we conducted are contained in this appendix.

Bzip2

Following are the graphs of the bzip2 experiments.

Figure B.1: Plot of proportion calculation results for BZ2_blockSort, the function taking the most execution time for bzip2, across the 28x10 execution runs.
Figure B.2: Plot of proportion calculation results for mainGtU, the function taking the 2nd most execution time for bzip2, across the 28x10 execution runs.
Figure B.3: Plot of proportion calculation results for BZ2_decompress, the function taking the 3rd most execution time for bzip2, across the 28x10 execution runs.
Figure B.4: Plot of proportion calculation results for BZ2_compressBlock, the function taking the 4th most execution time for bzip2, across the 28x10 execution runs.
Figure B.5: The aggregated MAPE results for the top four functions in bzip2 (blue solid line) and the calculated approximate expected MAPE (dotted red line).

The sampling interval that resulted in the overall best result for the bzip2 benchmark was 2 ms.

Mcf

Following are the graphs of the mcf experiments.

Figure B.6: Plot of proportion calculation results for primal_bea_mpp, the function taking the most execution time for mcf, across the 28x10 execution runs.
Figure B.7: Plot of proportion calculation results for refresh_potential, the function taking the 2nd most execution time for mcf, across the 28x10 execution runs.
Figure B.8: Plot of proportion calculation results for replace_weaker_arc, the function taking the 3rd most execution time for mcf, across the 28x10 execution runs.
Figure B.9: Plot of proportion calculation results for price_out_impl, the function taking the 4th most execution time for mcf, across the 28x10 execution runs.
Figure B.10: The aggregated MAPE results for the top four functions in mcf (blue solid line) and the calculated approximate expected MAPE (dotted red line).

The sampling interval that resulted in the overall best result for the mcf benchmark was 2 ms.

Milc

Following are the graphs of the milc experiments.

Figure B.11: Plot of proportion calculation results for mult_su3_na, the function taking the most execution time for milc, across the 28x10 execution runs.
Figure B.12: Plot of proportion calculation results for mult_su3_nn, the function taking the 2nd most execution time for milc, across the 28x10 execution runs.
Figure B.13: Plot of proportion calculation results for mult_su3_mat_vec, the function taking the 3rd most execution time for milc, across the 28x10 execution runs.
Figure B.14: Plot of proportion calculation results for mult_adj_su3_mat_vec, the function taking the 4th most execution time for milc, across the 28x10 execution runs.
Figure B.15: The aggregated MAPE results for the top four functions in milc (blue solid line) and the calculated approximate expected MAPE (dotted red line).

The sampling interval that resulted in the overall best result for the milc benchmark was 9 ms.

Sjeng

Following are the graphs of the sjeng experiments.

Figure B.16: Plot of proportion calculation results for std_eval, the function taking the most execution time for sjeng, across the 28x10 execution runs.
Figure B.17: Plot of proportion calculation results for setup_attackers, the function taking the 2nd most execution time for sjeng, across the 28x10 execution runs.
Figure B.18: Plot of proportion calculation results for gen, the function taking the 3rd most execution time for sjeng, across the 28x10 execution runs.
Figure B.19: Plot of proportion calculation results for remove_one, the function taking the 4th most execution time for sjeng, across the 28x10 execution runs.
Figure B.20: The aggregated MAPE results for the top four functions in sjeng (blue solid line) and the calculated approximate expected MAPE (dotted red line).

Appendix C

Additional detailed results relevant to the analysis of the SPEC OMP 2001 shared-memory parallel program executions we conducted are contained in this appendix. The following four graphs show the distribution of run times observed across all experiments. Legends for these four graphs are across the top, indicating the name of the SPEC OMP 2001 benchmark and which graph marker corresponds with which core setting.

Figure C.1: Distribution of run times, 20 each, sorted shortest to longest within each sampling interval, for the SPEC OMP 2001 benchmark applu.
Figure C.2: Distribution of run times, 20 each, sorted shortest to longest within each sampling interval, for the SPEC OMP 2001 benchmark fma3d.
Figure C.3: Distribution of run times, 20 each, sorted shortest to longest within each sampling interval, for the SPEC OMP 2001 benchmark swim.
Figure C.4: Distribution of run times, 20 each, sorted shortest to longest within each sampling interval, for the SPEC OMP 2001 benchmark wupwise.

The graphs from this point to the end of the appendix present either the results of individual functions for a specific benchmark or a composite result of the top 4 functions of a benchmark with 95% confidence interval whiskers.
These first 3 graphs complement those presented in Chapter 6 for equake, 4 threads on 4 cores.

Figure C.5: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.6: Results of the percent execution of main.omp_fn.10, the function taking the 3rd most execution time for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.7: Results of the percent execution of omp_get_num_procs for the 120 runs conducted with equake, 4 threads on 4 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with equake, 8 threads on 8 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results with 95% CIs for the small interval bracketing runs.

Figure C.8: Distribution of the percent execution taken by 4 different functions from 100 runs of equake with 8 threads on 8 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.9: Results of the percent execution of smvp.omp_fn.5, the function taking the most execution time for the 120 runs conducted with equake, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.10: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with equake, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.11: Results of the percent execution of main.omp_fn.10, the function taking the 3rd most execution time for the 120 runs conducted with equake, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.12: Results of the percent execution of omp_get_num_procs, for the 120 runs conducted with equake, 8 threads on 8 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with equake, 12 threads on 12 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.13: Distribution of the percent execution taken by 4 different functions from 100 runs of equake with 12 threads on 12 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.14: Results of the percent execution of smvp.omp_fn.5, the function taking the most execution time for the 120 runs conducted with equake, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.15: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with equake, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.16: Results of the percent execution of main.omp_fn.10, the function taking the 3rd most execution time for the 120 runs conducted with equake, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.17: Results of the percent execution of omp_get_num_procs, for the 120 runs conducted with equake, 12 threads on 12 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with applu, 4 threads on 4 cores. First is the composite graph done with the large interval bracketing runs.
Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.18: Distribution of the percent execution taken by 4 different functions from 100 runs of applu with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.19: Results of the percent execution of ssor_.omp_fn.2, the function taking the most execution time for the 120 runs conducted with applu, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.20: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with applu, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.21: Results of the percent execution of buts_, the function taking the 3rd most execution time for the 120 runs conducted with applu, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.22: Results of the percent execution of blts_, the function taking the 4th most execution time for the 120 runs conducted with applu, 4 threads on 4 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with applu, 8 threads on 8 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.23: Distribution of the percent execution taken by 4 different functions from 100 runs of applu with 8 threads on 8 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.24: Results of the percent execution of ssor_.omp_fn.2, the function taking the most execution time for the 120 runs conducted with applu, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.25: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with applu, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.26: Results of the percent execution of buts_, the function taking the 3rd most execution time for the 120 runs conducted with applu, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.27: Results of the percent execution of blts_, the function taking the 4th most execution time for the 120 runs conducted with applu, 8 threads on 8 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with applu, 12 threads on 12 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.28: Distribution of the percent execution taken by 4 different functions from 100 runs of applu with 12 threads on 12 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.29: Results of the percent execution of ssor_.omp_fn.2, the function taking the most execution time for the 120 runs conducted with applu, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.30: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with applu, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.31: Results of the percent execution of buts_, the function taking the 3rd most execution time for the 120 runs conducted with applu, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.32: Results of the percent execution of blts_, the function taking the 4th most execution time for the 120 runs conducted with applu, 12 threads on 12 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with fma3d, 4 threads on 4 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.33: Distribution of the percent execution taken by 4 different functions from 100 runs of fma3d with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.34: Results of the percent execution of platq_internal_forces_.omp_fn.0, the function taking the most execution time for the 120 runs conducted with fma3d, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.35: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with fma3d, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.36: Results of the percent execution of platq_stress_integration_, the function taking the 3rd most execution time for the 120 runs conducted with fma3d, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.37: Results of the percent execution of material_41_integration_, the function taking the 4th most execution time for the 120 runs conducted with fma3d, 4 threads on 4 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with fma3d, 8 threads on 8 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.38: Distribution of the percent execution taken by 4 different functions from 100 runs of fma3d with 8 threads on 8 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.39: Results of the percent execution of platq_internal_forces_.omp_fn.0, the function taking the most execution time for the 120 runs conducted with fma3d, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.40: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with fma3d, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.41: Results of the percent execution of platq_stress_integration_, the function taking the 3rd most execution time for the 120 runs conducted with fma3d, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.42: Results of the percent execution of material_41_integration_, the function taking the 4th most execution time for the 120 runs conducted with fma3d, 8 threads on 8 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with fma3d, 12 threads on 12 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.
Figure C.43: Distribution of the percent execution taken by 4 different functions from 100 runs of fma3d with 12 threads on 12 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.44: Results of the percent execution of platq_internal_forces_.omp_fn.0, the function taking the most execution time for the 120 runs conducted with fma3d, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.45: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with fma3d, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.46: Results of the percent execution of platq_stress_integration_, the function taking the 3rd most execution time for the 120 runs conducted with fma3d, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.47: Results of the percent execution of material_41_integration_, the function taking the 4th most execution time for the 120 runs conducted with fma3d, 12 threads on 12 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with swim, 4 threads on 4 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.48: Distribution of the percent execution taken by 4 different functions from 100 runs of swim with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.49: Results of the percent execution of GOMP_taskwait, the function taking the most execution time for the 120 runs conducted with swim, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.50: Results of the percent execution of calc2_.omp_fn.2, the function taking the 2nd most execution time for the 120 runs conducted with swim, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.51: Results of the percent execution of calc1_.omp_fn.3, the function taking the 3rd most execution time for the 120 runs conducted with swim, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.52: Results of the percent execution of calc3_.omp_fn.0, the function taking the 4th most execution time for the 120 runs conducted with swim, 4 threads on 4 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with swim, 8 threads on 8 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.53: Distribution of the percent execution taken by 4 different functions from 100 runs of swim with 8 threads on 8 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.54: Results of the percent execution of GOMP_taskwait, the function taking the most execution time for the 120 runs conducted with swim, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.55: Results of the percent execution of calc2_.omp_fn.2, the function taking the 2nd most execution time for the 120 runs conducted with swim, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.56: Results of the percent execution of calc1_.omp_fn.3, the function taking the 3rd most execution time for the 120 runs conducted with swim, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.57: Results of the percent execution of calc3_.omp_fn.0, the function taking the 4th most execution time for the 120 runs conducted with swim, 8 threads on 8 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with swim, 12 threads on 12 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.58: Distribution of the percent execution taken by 4 different functions from 100 runs of swim with 12 threads on 12 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.59: Results of the percent execution of GOMP_taskwait, the function taking the most execution time for the 120 runs conducted with swim, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.60: Results of the percent execution of calc2_.omp_fn.2, the function taking the 2nd most execution time for the 120 runs conducted with swim, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.61: Results of the percent execution of calc1_.omp_fn.3, the function taking the 3rd most execution time for the 120 runs conducted with swim, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.62: Results of the percent execution of calc3_.omp_fn.0, the function taking the 4th most execution time for the 120 runs conducted with swim, 12 threads on 12 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with wupwise, 4 threads on 4 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.63: Distribution of the percent execution taken by 4 different functions from 100 runs of wupwise with 4 threads on 4 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.64: Results of the percent execution of zgemm_, the function taking the most execution time for the 120 runs conducted with wupwise, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.65: Results of the percent execution of muldoe_.omp_fn.0, the function taking the 2nd most execution time for the 120 runs conducted with wupwise, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.66: Results of the percent execution of muldeo_.omp_fn.0, the function taking the 3rd most execution time for the 120 runs conducted with wupwise, 4 threads on 4 cores, 20 runs per sampling interval.
Figure C.67: Results of the percent execution of GOMP_taskwait, the function taking the 4th most execution time for the 120 runs conducted with wupwise, 4 threads on 4 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with wupwise, 8 threads on 8 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.
Figure C.68: Distribution of the percent execution taken by 4 different functions from 100 runs of wupwise with 8 threads on 8 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.69: Results of the percent execution of zgemm_, the function taking the most execution time for the 120 runs conducted with wupwise, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.70: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with wupwise, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.71: Results of the percent execution of muldoe_.omp_fn.0, the function taking the 3rd most execution time for the 120 runs conducted with wupwise, 8 threads on 8 cores, 20 runs per sampling interval.
Figure C.72: Results of the percent execution of muldeo_.omp_fn.0, the function taking the 4th most execution time for the 120 runs conducted with wupwise, 8 threads on 8 cores, 20 runs per sampling interval.

The next 5 graphs present results of experiments done with wupwise, 12 threads on 12 cores. First is the composite graph done with the large interval bracketing runs. Following that are 4 graphs showing individual function results for the small interval bracketing runs.

Figure C.73: Distribution of the percent execution taken by 4 different functions from 100 runs of wupwise with 12 threads on 12 cores, 20 each at 5 different sampling intervals. Each data point includes whiskers indicating the expected 95% confidence interval of the calculated proportion.
Figure C.74: Results of the percent execution of zgemm_, the function taking the most execution time for the 120 runs conducted with wupwise, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.75: Results of the percent execution of GOMP_taskwait, the function taking the 2nd most execution time for the 120 runs conducted with wupwise, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.76: Results of the percent execution of muldoe_.omp_fn.0, the function taking the 3rd most execution time for the 120 runs conducted with wupwise, 12 threads on 12 cores, 20 runs per sampling interval.
Figure C.77: Results of the percent execution of muldeo_.omp_fn.0, the function taking the 4th most execution time for the 120 runs conducted with wupwise, 12 threads on 12 cores, 20 runs per sampling interval.

Glossary

Accuracy. The absolute difference between a measured value and the corresponding reference value, often the true value [39].
Estimation error (also sampling error). The range of uncertainty between a statistic calculated from a sample and a parameter calculated from an entire population.
Measurement-based profiling (also measured profiling). Instrumentation of a program in order to observe (and record) specific events of interest (e.g., entry and exit of a function) [43].
Perturbation (also interference [12, 23, 16], degradation [14, 26], artifact [52], side effects [63], probe effect [17], intrusion [4, 30], and invasion [48]). The changes in a system's behavior caused by measuring some aspect of its performance [39].
Precision. The amount of scatter in a set of measurements. Corresponds to the repeatability of the measurements [39].
Profile. The collection of frequency counts of program parts (functions, basic blocks, lines of code).
Sample-based profiling (also program status sampling [36] or statistical profiling). Periodic interruption of a program, often coordinated via a timer, in order to observe (and record) the current state of an executing program.

Bibliography

[1] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. Hpctoolkit: tools for performance analysis of optimized parallel programs http://hpctoolkit.org. Concurrency and Computation: Practice and Experience, 22(6):685–701, 2010.
[2] M. Andersland and T. Casavant. Recovering uncorrupted event traces from corrupted event traces in parallel/distributed computing systems. In Proceedings of the 1991 International Conference on Parallel Processing, pages II.108–II.116, 1991.
[3] J. M. Anderson, L. M. Berc, J. Dean, S. Ghemawat, M. R. Henzinger, S.-T. A. Leung, R. L. Sites, M. T. Vandevoorde, C. A. Waldspurger, and W. E. Weihl. Continuous profiling: where have all the cycles gone? ACM Trans. Comput. Syst., 15(4):357–390, 1997.
[4] Z. Aral and I. Gertner. Non-intrusive and interactive profiling in parasight. SIGPLAN Not. 23, 9, 21-30, January 1988.
[5] D. C. Arnold, D. H. Ahn, B. R. de Supinski, G. L. Lee, B. P. Miller, and M. Schulz. Stack trace analysis for large scale debugging. In International Parallel and Distributed Processing Symposium, 2007. IPDPS 2007, IEEE International, pp. 1-10, March 2007.
[6] V. Aslot and R. Eigenmann. Performance characteristics of the SPEC OMP2001 Benchmarks. SIGARCH Comput. Archit. News, December 2001.
[7] M. L. Berenson, D.M. Levine, and T. C. Krehbiel. Basic Business Statistics: Concepts and Applications. Prentice Hall, 10th edition, April 2005.
[8] P. Berube and J.N. Amaral. Combined profiling: A methodology to capture varied program behavior across multiple inputs. Performance Analysis of Systems and Software (ISPASS), 2012 IEEE, pp. 210-220, April 2012.
[9] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. Proceedings of the 2000 ACM/IEEE conference on Supercomputing, pg. 42, 2000.
[10] H.D. Brunk, J.E. Holstein, and F. Williams. A comparison of binomial approximations to the hypergeometric distribution. The American statistician, vol. 22, no. 1, 24-26, February 1968.
[11] B. Buck and J.K. Hollingsworth. An api for runtime code patching. International Journal of High Performance Computing Applications, 14(4):317–329, 2000.
[12] P. Calingaert. System performance evaluation: survey and appraisal. Comm. ACM Vol 10, No. 1, pp 12-18, January 1967.
[13] M. Casas, H. Servat, R.M. Badia, and J. Labarta. Analyzing the temporal behavior of applications using spectral analysis. Technical Report UPC-DAC-RR-CAP-2009-9, Barcelona Supercomputing Center, February 2009.
[14] W.R. Deniston. SIPE: a TSS/360 software measurement technique. Proc. ACM 24th National Conf., 229–245, 1969.
[15] W. Feller. On the normal approximation to the binomial distribution. The annals of mathematical statistics, vol. 16, no. 4, 319-329, December 1945.
[16] D. Ferrari, G. Serazzi, and A. Zeigner. Measurement and tuning of computer systems. Prentice-Hall, 1983.
[17] J. Gait. A probe effect in concurrent programs. Software – Practice and Experience, 16(3):225-233, March 1986.
[18] J. Gannon, K. Williams, M. Andersland, and T. Casavant. Trace recovery: a distributed computing application for perturbation tracking. Proceedings of the 33rd IEEE Conference on decision and control, Vol. 3, pp. 2621–2626, December 1994.
[19] J. Gannon, K. Williams, M. Andersland, J. Lumpp, Jr., and T. Casavant. Using perturbation tracking to compensate for intrusion in message-passing systems. In Proceedings of the 14th International Conference on Distributed Computing Systems, IEEE, pp. 414–421, June 1994.
[20] A. Georges. Three pitfalls in java performance evaluation. PhD thesis, Ghent University, Ghent, Belgium, 2008.
[21] A. Georges, D. Buytaert, and L. Eeckhout. Statistically rigorous java performance evaluation. SIGPLAN Notices, 42(10):57–76, 2007.
[22] A.J. Goldberg and J.L. Hennessy. Performance debugging shared memory multiprocessor programs with mtool. Proceedings of the 1991 ACM/IEEE conference on Supercomputing, pp. 481–490, 1991.
[23] C.C. Gotlieb. Performance Measurement. In Software engineering, an advanced course, reprint of the first edition, pp. 464-491, Springer-Verlag, London, UK, 1975.
[24] S.L. Graham, P.B. Kessler, and M.K. Mckusick. Gprof: A call graph execution profiler. SIGPLAN Symposium on Compiler Construction, 17(6):120–126, 1982.
[25] S.L. Graham, P.B. Kessler, and M.K. Mckusick. An execution profiler for modular programs. Software - Practice and Experience, 13(8):671–685, 1983.
[26] U. Grenander and R. Tsao. Quantitative Methods for Evaluating Computer System Performance: A Review and Proposals. In Statistical Computer Performance Evaluation, W. Freiberger, Ed., Academic Press, N.Y., pp. 3-24, 1972.
[27] G.J. Hahn and W.O. Meeker. Statistical intervals: a guide for practitioners. Wiley-Interscience, 1991.
[28] G. Hamerly, E. Perelman, J. Lau, and B. Calder. Simpoint 3.0: Faster and more flexible program analysis. In Journal of Instruction Level Parallelism 7, September 2005.
[29] J.L. Henning. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News, September 2006.
[30] J.K. Hollingsworth, J.E. Lumpp Jr., B.P. Miller. Techniques for performance measurement of parallel programs. Parallel computers: theory and practice, IEEE, pp. 225-240, 1995.
[31] HP. Digital continuous profiling infrastructure (dcpi). http://h30097.www3.hp.com/dcpi/index.html, August 2010.
[32] IBM. Xprofiler. http://domino.research.ibm.com/comm/research_projects.nsf/pages/hpct.xprofiler.html, August 2010.
[33] Intel. Vtune performance analyzer. http://software.intel.com/en-us/intel-vtune/, August 2010.
[34] R. Jain. The Art of Computer Systems Performance Analysis. John Wiley & Sons, Inc., 1991.
[35] T. Kalibera and R. Jones. Quantifying performance changes with effect size confidence intervals. Technical report 4-12, University of Kent, June 2012.
[36] D.E. Knuth. An empirical study of fortran programs. Software - Practice and Experience, 1, 1971.
[37] W.T.C. Kramer and C. Ryan. Performance variability of highly parallel architectures. Lecture notes in computer science, v. 2659, p. 560-569, 2003.
[38] N. Kumar, B.R. Childers, and M.L. Soffa. Low overhead program monitoring and profiling. In PASTE '05: Proceedings of the 6th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, pages 28–34, 2005.
[39] D.J. Lilja. Measuring computer performance: a practitioner's guide. Cambridge University Press, 2000.
[40] T.F. Lehr. Compensating for perturbation by software performance monitors in asynchronous computations. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1990.
[41] A.D. Malony. Performance observability. PhD thesis, University of Illinois at Urbana-Champaign, Champaign, IL, USA, 1990.
[42] A.D. Malony, D. A. Reed, and H. A. G. Wijshoff. Performance measurement intrusion and perturbation analysis. IEEE Transactions on parallel and distributed systems, 3(4):433-450, July 1992.
[43] A.D. Malony, S. Shende, A. Morris, and F. Wolf. Compensation of measurement overhead in parallel performance profiling. International journal of high performance computer applications, 21(2):174-194, 2007.
[44] A. Mazouz, S. Touati, and D. Barthou. Study of variations of native program execution times on multi-core architectures. Proceedings of the 2010 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), IEEE, pp. 919-924, 2010.
[45] J. Mellor-Crummey, L. Adhianto, M. Fagan, M. Krentel, and N. Tallent. HPCToolkit User's Manual, http://hpctoolkit.org/manual/HPCToolkit-users-manual.pdf, January 2013.
[46] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam, and T. Newhall. The paradyn parallel performance measurement tool. IEEE Computer, 28(11):37–46, 1995.
[47] T. Mytkowicz, A. Diwan, M. Hauswirth, and P.F. Sweeney. Producing wrong data without doing anything obviously wrong! Proceedings of the 14th international conference on architectural support for programming languages and operating systems (ASPLOS XIV), ACM, pp. 265-276, 2009.
[48] O. Naim and A. J. G. Hey. Invasiveness of performance instrumentation measurements on multiprocessors. IFIP transactions A – computer science and technology Vol. 44, 1994.
[49] H. Najafzadeh and S. Chaiken. Validated observation and reporting of microscopic performance using pentium ii counter facilities. Proceedings of the 4th international workshop on software and performance (WOSP), pp. 161–165, 2004.
[50] H. Najafzadeh and S. Chaiken. Towards a framework for source code instrumentation measurement validation. Proceedings of the 5th international workshop on software and performance (WOSP '05), ACM, pp. 123–130, 2005.
[51] W. L. Nicholson. On the normal approximation to the hypergeometric distribution. The annals of mathematical statistics, Vol. 27, No. 2, pp. 471-483, June 1956.
[52] G. J. Nutt. Tutorial: computer system monitors, IEEE Computer, Vol.8, No.11, pp.51-61, November 1975.
[53] S. Patil and D. Lilja. Statistical methods for computer performance evaluation. Wires computational statistics, Vol. 4, Issue 1, 98-106, 2012.
[54] F. Petrini, D.K. Kerbyson, and S. Pakin. The case of the missing supercomputer performance: achieving optimal performance on the 8,192 processors of ASCI Q. Proceedings of the 2003 ACM/IEEE conference on Supercomputing, Vol. 55, pp 15-21, November 2003.
[55] D.A. Reed, R.A. Aydt, L. DeRose, C.L. Mendes, R.L. Ribler, E. Shaffer, H. Simitci, J.S. Vetter, D.R. Wells, S. Whitmore, and Y. Zhang. Performance analysis of parallel systems: approaches and open problems. Proceedings of the joint symposium on parallel processing (JSPP), pp. 239-256, 1998.
[56] Rice University. Hpctoolkit. http://hpctoolkit.org/index.html, August 2010.
[57] S. R. Sarukkai and A. D. Malony. Perturbation analysis of high level instrumentation for spmd programs. SIGPLAN Not., 28(7):44–53, 1993.
[58] Standard Performance Evaluation Corporation. SPEC CPU2006 Benchmarks. http://www.spec.org/cpu2006/
[59] Standard Performance Evaluation Corporation. SPEC OMP2001 Benchmarks. http://www.spec.org/omp2001/
[60] A. Srivastava and A. Eustace. ATOM: A system for building customized program analysis tools. Proceedings of the SIGPLAN '94 conference on programming language design and implementation (PLDI), pp. 196-205, June 1994.
[61] K. Stoodley. Applied and computational statistics. Halsted Press, 1984.
[62] K.V. Subramaniam and M.J. Thazhuthaveetil. Effectiveness of sampling based software profilers. Proceedings of software testing, reliability and quality assurance, December 1994.
[63] L. Svobodova. Performance monitoring in computer systems: a structured approach. ACM Operating system review, Vol. 15, No. 3, July 1981.
[64] V. Tabatabaee, T. Tiwari, and J.K. Hollingsworth. Parallel parameter tuning for applications with performance variability. Proceedings of the 2005 ACM/IEEE conference on Supercomputing, 2005.
[65] N.R. Tallent, J. Mellor-Crummey, and M.W. Fagan. Binary analysis for measurement and attribution of program performance. In proceedings of the 2009 ACM SIGPLAN conference on programming language design and implementation (PLDI), pp. 441-452, 2009.
[66] K. Thompson and D.M. Ritchie. Unix Programmer's Manual. Bell Telephone Laboratories, Murray Hill, NJ, June 1974.
[67] UCSD. Simpoint. http://cseweb.ucsd.edu/~calder/simpoint/index.htm, August 2010.
[68] J.S. Vetter and D. Reed. Managing performance analysis with dynamic statistical projection pursuit. Proceedings of the 1999 ACM/IEEE conference on Supercomputing: High performance networking and computing, 1999.
[69] J.J. Yi, D.J. Lilja, and D.M. Hawkins. Improving computer architecture simulation methodology by adding statistical rigor. IEEE Transactions on Computers, Vol. 54, pp. 1360–1373, November 2005.