ABSTRACT Title of dissertation: ADVANCING THE MULTI-SOLVER PARADIGM FOR OVERSET CFD TOWARD HETEROGENEOUS ARCHITECTURES Dylan Jude Doctor of Philosophy, 2019 Dissertation directed by: Professor James Baeder Department of Aerospace Engineering A multi-solver, overset, computational fluid dynamics framework is developed for efficient, large-scale simulation of rotorcraft problems. Two primary features distinguish the developed framework from the current state of the art. First, the framework is designed for heterogeneous compute architectures, making use of both traditional codes run on the Central Processing Unit (CPU) as well as codes run on the Graphics Processing Unit (GPU). Second, a framework-level implementation of the Generalized Minimal Residual linear solver is used to consider all meshes from all solvers in a single linear system. The developed GPU flow solver and framework are validated against conventional implementations, achieving a 5.35? speedup for a single GPU compared to 24 CPU cores. Similarly, the overset linear solver is compared to traditional techniques, demonstrating the same convergence order can be achieved using as few as half the number of iterations. Applications of the developed methods are organized into two chapters. First, the heterogeneous, overset framework is applied to a notional helicopter configuration based on the ROBIN wind tunnel experiments. A tail rotor and hub are added to create a challenging case representative of a realistic, full-rotorcraft simulation. Interactional aerodynamics between the different components are reviewed in detail. The second application chapter focuses on performance of the overset linear solver for unsteady applications. The GPU solver is used along with an unstructured code to simulate laminar flow over a sphere as well as laminar coaxial rotors designed for a Mars helicopter. In all results, the overset linear solver out-performs the traditional, de-coupled approach. Conclusions drawn from both the full-rotorcraft and overset linear solver simulations can have a significant impact on improving modeling of complex rotorcraft aerodynamics. ADVANCING THE MULTI-SOLVER PARADIGM FOR OVERSET CFD TOWARD HETEROGENEOUS ARCHITECTURES by Dylan Philip Nordblom Jude Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2019 Advisory Committee: Professor James Baeder, Chair/Advisor Professor Inderjit Chopra Professor Anubhav Datta Professor Stuart Laurence Professor Kayo Ide ? Copyright by Dylan Philip Nordblom Jude 2019 Dedication To Rachael ii Acknowledgments My PhD research from the past five years has not been a sole endeavor and could not have been accomplished without the help and influence of many individuals. First, I would like to thank my PhD advisor, Dr. James Baeder, whose dedication and propensity for novel computational fluid dynamics have guided my research. I am also grateful to my other committee members, Dr. Inderjit Chopra, Dr. Anubhav Datta, Dr. Stuart Laurence, and Dr. Kayo Ide for their expertise and recommendations for contextualizing the applications of my research. A significant portion of my dissertation research is a direct result of an internship conducted at the Army Aviation Development Directorate at Ames Research Center. Jay Sitaraman, Vinod Lakshminarayan, Andy Wissink, and Roger Strawn have all been tremendous mentors and have motivated expanding the scope of my research to new areas. I have greatly enjoyed working with them and look forward to continuing to work with them in the future. Many major breakthroughs during my research trace back to help received from two friends and post doctoral researchers in our department: Ananth Sridharan and Bharath Govindarajan. Ananth?s recommendations, often through dogged exchanges, always serve as a grounding for realistic, rotorcraft-related applications. Bharath?s assistance and support in almost all facets of my research and time at the University of Maryland cannot be understated. He has always been a first-stop for most challenges encountered and has continually steered my research toward higher standards. I will greatly miss our morning coffee trips and necessary football and frisbee breaks. Both work and extracurricular activities have been greatly enriched by many friends and colleagues over the last five years. YongSu and BumSeok from Dr. Baeder?s research group have greatly helped in research collaborations and preparation for deadlines. Dan and Tyler always provide a balance between bringing in home-baked iii desserts and organizing intramural sports. The friendship of Dan, Tyler and Bharath have ensured many much-needed breaks playing soccer and football and weekends at Deep Creek. I am fortunate to have completed over two decades of schooling and university education, all with the continuous support of my family. For the biggest and the smallest moments in life, my mom has always been there to give love and advice that I can count on and trust. For as long as I can remember she has been the one to set me on a path towards my greatest potential. I am forever grateful for her kindness and selflessness. To Laura, my sister, I have immense admiration for her dedication and strength in pursuing her goals. I am so thankful to have had her living nearby for a large part of my time at Maryland. My dad?s fascination with science instilled a view of education and learning as a life-long activity. He was a primary driver for my decision to pursue a PhD. His curiosity about the world around him and the conversations that we would have about my interests and research are something that I miss deeply since his passing. I only wish I could share this dissertation with him. Lastly, I owe a tremendous amount of my success during my PhD to the endless love and support of my wife, Rachael. Despite my schedule of long days and stressful deadlines, she was always there to remind me of the big picture of life and plan getaways to Shenandoah or the Poconos. Her enduring optimism and encouragement has been a primary factor in many of my accomplishments over our last ten years together. Her confidence and zest for life inspires me to be the best that I can be, both academically and personally. iv Table of Contents Dedication ii Acknowledgements iii Table of Contents v List of Tables viii List of Figures ix List of Abbreviations xiii Nomenclature xiv 1 Introduction 1 1.1 Challenges of Rotorcraft Aerodynamic Simulations . . . . . . . . . . 1 1.1.1 Physics of Rotorcraft Aerodynamics . . . . . . . . . . . . . . . 2 1.1.2 Numerical Challenges of Rotorcraft CFD . . . . . . . . . . . . 6 1.1.3 CFD Time and Cost Challenges . . . . . . . . . . . . . . . . . 8 1.2 Literature Survey of Previous Work . . . . . . . . . . . . . . . . . . . 9 1.2.1 Computational Aerodynamics . . . . . . . . . . . . . . . . . . 10 1.2.2 Computational Architectures . . . . . . . . . . . . . . . . . . 17 1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . 24 2 Methodology 26 2.1 Fluid Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.1.1 Navier?Stokes Equations . . . . . . . . . . . . . . . . . . . . . 26 2.1.2 Non-dimensionalization of the Navier?Stokes Equations . . . . 29 2.1.3 Reynolds-Averaged Navier?Stokes . . . . . . . . . . . . . . . . 30 2.1.4 Spalart?Allmaras Turbulence Model . . . . . . . . . . . . . . . 31 2.2 GPU-Accelerated Flow Solver . . . . . . . . . . . . . . . . . . . . . . 32 2.2.1 Hybrid Finite-Difference, Finite-Volume . . . . . . . . . . . . 33 2.2.2 Inviscid Fluxes . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.2.3 Viscous Fluxes . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.2.4 DADI Preconditioner . . . . . . . . . . . . . . . . . . . . . . . 42 2.2.5 Red-Black Gauss?Seidel Preconditioner . . . . . . . . . . . . . 43 v 2.2.6 GMRES Linear Solver . . . . . . . . . . . . . . . . . . . . . . 46 2.2.7 Boundary Conditions . . . . . . . . . . . . . . . . . . . . . . . 46 2.3 Algorithm Implementation . . . . . . . . . . . . . . . . . . . . . . . . 49 2.3.1 Numerical Implementation on the GPU . . . . . . . . . . . . . 49 2.3.2 Framework Wrapping . . . . . . . . . . . . . . . . . . . . . . . 54 2.4 Other CFD Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 2.4.1 OverTURNS . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.4.2 HamStr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.4.3 mStrand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 2.4.4 Overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.5 Grid Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.5.1 Transformation Matrices . . . . . . . . . . . . . . . . . . . . . 59 2.5.2 Elastic Deflection . . . . . . . . . . . . . . . . . . . . . . . . . 61 2.5.3 Grid Velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 2.5.4 Framework Wrapping . . . . . . . . . . . . . . . . . . . . . . . 64 2.6 Domain Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 2.6.1 Connectivity and Mesh Grouping . . . . . . . . . . . . . . . . 65 2.6.2 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 2.6.3 Framework Wrapping . . . . . . . . . . . . . . . . . . . . . . . 69 2.7 Framework Description . . . . . . . . . . . . . . . . . . . . . . . . . . 69 2.7.1 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . 70 2.8 GMRES and Overset GMRES . . . . . . . . . . . . . . . . . . . . . . 72 3 Verification and Validation 81 3.1 Onera-M6 Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 3.2 Performance, Scalability, and Timing . . . . . . . . . . . . . . . . . . 87 3.2.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.2.2 Scalability and Timing . . . . . . . . . . . . . . . . . . . . . . 90 3.2.3 Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 3.3 Elastic UH-60 Rotor . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3.4 Implicit Algorithms Validation . . . . . . . . . . . . . . . . . . . . . . 102 3.4.1 DADI and Red-Black Gauss?Seidel Convergence . . . . . . . . 102 3.4.2 GMRES Convergence . . . . . . . . . . . . . . . . . . . . . . . 105 3.5 Overset-GMRES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 3.5.1 Poisson Equation . . . . . . . . . . . . . . . . . . . . . . . . . 108 3.5.2 Single-solver, Euler Equations . . . . . . . . . . . . . . . . . . 112 3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 4 Notional Helicopter Configuration 116 4.1 Rotor + Fuselage Simulation . . . . . . . . . . . . . . . . . . . . . . . 117 4.1.1 Case Description . . . . . . . . . . . . . . . . . . . . . . . . . 117 4.1.2 Aerodynamic Results . . . . . . . . . . . . . . . . . . . . . . . 121 4.2 Rotor + Fuselage + Hub + Tail Simulation . . . . . . . . . . . . . . 126 4.2.1 Case Description . . . . . . . . . . . . . . . . . . . . . . . . . 126 4.2.2 Full Configuration in Forward Flight . . . . . . . . . . . . . . 128 vi 4.2.3 Full Configuration in Cross-Wind . . . . . . . . . . . . . . . . 133 4.2.4 Performance Results . . . . . . . . . . . . . . . . . . . . . . . 135 4.3 Main and Tail Rotor Interactions . . . . . . . . . . . . . . . . . . . . 136 4.3.1 Isolated Tail Rotor: Forward Flight . . . . . . . . . . . . . . . 138 4.3.2 Main and Tail Rotor: Forward Flight . . . . . . . . . . . . . . 139 4.3.3 Main and Tail Rotor: Cross-wind Flight . . . . . . . . . . . . 142 4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 5 Multi-Solver Overset-GMRES 148 5.1 Laminar Flow Over a Sphere . . . . . . . . . . . . . . . . . . . . . . . 149 5.2 Laminar Coaxial Rotors . . . . . . . . . . . . . . . . . . . . . . . . . 155 5.3 O-GMRES Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 168 6 Conclusions 171 6.1 Algorithm and Framework Development . . . . . . . . . . . . . . . . 172 6.2 Framework Applications . . . . . . . . . . . . . . . . . . . . . . . . . 173 6.2.1 Verification and Validation . . . . . . . . . . . . . . . . . . . . 173 6.2.2 Notional Helicopter Configuration . . . . . . . . . . . . . . . . 175 6.2.3 Overset GMRES . . . . . . . . . . . . . . . . . . . . . . . . . 177 6.3 Contributions and Future Work . . . . . . . . . . . . . . . . . . . . . 178 A Flow Visualization with Vorticity and Q-Criterion 181 Bibliography 182 vii List of Tables 1.1 Top Supercomputers by Performance [1]. RMAX refers to maximum achieved LINPACK performance in tera-floating point operations per second. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.1 Python-visible functions in Garfield. . . . . . . . . . . . . . . . . . 55 2.2 Python-visible functions in the motion module. . . . . . . . . . . . . 64 2.3 Python-visible functions for Tioga. . . . . . . . . . . . . . . . . . . . 69 3.1 Baseline CPU case for timing and speedup comparison. . . . . . . . . 91 3.2 Timing data for Garfield on different clusters and GPU architectures. Speedup is shown against Overflow on a 24-core CPU. . . . . . . . 92 3.3 Cost estimates for 100 iterations of Garfield and Overflow based on Google Cloud Platform prices. *Cost estimates for the the 32GB V100 cards are not available on Google and are therefore estimated at the 16GB rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 3.4 Wind tunnel test 5240 parameters. . . . . . . . . . . . . . . . . . . . 99 4.1 Main rotor properties and flight condition. . . . . . . . . . . . . . . . 118 4.2 Tail rotor properties and flight condition. . . . . . . . . . . . . . . . . 126 4.3 Solver Timing and Performance. . . . . . . . . . . . . . . . . . . . . . 135 4.4 Properties of the tail rotor for main and tail rotor interactions analysis.137 4.5 Average tail rotor thrust coefficient for three cases. . . . . . . . . . . 144 viii List of Figures 1.1 Illustration of the wake of a single-main-rotor helicopter. The tip vortices, shown as dotted lines, may interact with the fuselage and tail rotor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Illustration of aerodynamic variation across a rotor blade in hover. . . 3 1.3 Illustration of aerodynamic variation across a rotor blade in forward flight. A reverse-flow region develops on the retreating side. . . . . . . 4 1.4 Rotor wake for a single-main-rotor in cross-wind flight. Main rotor vortices convect into the path of the tail rotor, possibly resulting in a dangerous flight condition. . . . . . . . . . . . . . . . . . . . . . . . . 6 1.5 Examples of a structured, curvilinear mesh and an unstructured mesh for two airfoils. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.6 Hybrid Navier?Stokes + free-wake model of a rotor in hover. The CFD domain is near the blade and Lagrangian free-wake models the wake below the rotor. . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.7 Overset CFD mesh system: unstructured and structured meshes used together. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.8 Example distribution of tasks suitable for a CPU-only CFD framework. Each solver uses all available MPI tasks. . . . . . . . . . . . . . . . . 15 1.9 Domain partitioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 1.10 45 years of micro-processor data with the transistor counts of three NVIDIA GPUs highlighted. Original data up to the year 2010 collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New data collected for 2010-2015 by K. Rupp [2]. . . 20 2.1 Cell-centered, curvilinear mesh for cell (j, k) with computational coor- dinate directions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.2 MUSCL 3-point stencil for left and right primitive variables qL and qR. 39 2.3 WENO 5-point stencil for left and right primitive variables qL and qR. 40 2.4 Viscous flux stencil for finite-difference relations. . . . . . . . . . . . . 42 2.5 Illustration of red-black coloring. . . . . . . . . . . . . . . . . . . . . 45 2.6 Full domain consisting of interior cells, ghost, cells, and boundaries. . 46 2.7 Partitioned domain on two MPI tasks. . . . . . . . . . . . . . . . . . 47 ix 2.8 O-Mesh curvilinear coordinates result in a periodic boundary in the wrap-around direction. . . . . . . . . . . . . . . . . . . . . . . . . . . 47 2.9 Wake boundary condition at root and tip of blade or wing meshes. . . 48 2.10 Inviscid wall boundary condition: mirrored velocities. . . . . . . . . . 49 2.11 Viscous wall boundary condition: negative velocities. . . . . . . . . . 49 2.12 Flowchart showing Garfield flow solver routines. Different levels of parallelism (point, variable, and line) are represented using different numbers of arrows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 2.13 One CUDA Kernel used per (j, k, l) cell (?Point-Parallel?). . . . . . . 51 2.14 In J-direction, one CUDA Kernel used per cell eigenvalue in a k ? l plane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 2.15 Set of (4? 4) kernel threads using (6? 6) shared memory cache for efficient application of a stencil. . . . . . . . . . . . . . . . . . . . . . 54 2.16 General procedure in 2D for obtaining Hamiltonian paths used as local coordinate directions. . . . . . . . . . . . . . . . . . . . . . . . . 57 2.17 Control points along blade where deflections are defined. Distances to CFD coordinates may be discontinuous. . . . . . . . . . . . . . . . . 61 2.18 Use of projected control points results in continuous distances to CFD coordinates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 2.19 Grouping of stationary meshes and reconnecting separately so that only the ?reconnecting? group reconnects every timestep. . . . . . . . 66 2.20 Illustration of the interpolation step for two overset grids with fringe points as hollow symbols, field points as solid symbols. . . . . . . . . 67 2.21 Flowchart of routines showing location of data: on the CPU or on the GPU. Performing interpolation on the GPU drastically reduces data transfer to the CPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 2.22 Modules for CFD framework. . . . . . . . . . . . . . . . . . . . . . . 70 2.23 Cluster node topology with 2 GPUs and 16 CPU cores per node. . . . 71 2.24 Example distribution of tasks suitable for a CPU-only CFD framework. Each solver uses all available MPI tasks. . . . . . . . . . . . . . . . . 71 2.25 Example distribution of tasks suitable for a CPU-GPU framework. Different tasks assigned different solvers. . . . . . . . . . . . . . . . . 72 2.26 Example distribution of resources for a 3-Solver framework where Solver A is GPU-accelerated. . . . . . . . . . . . . . . . . . . . . . . . 72 2.27 1D heat problem between left and right walls at temperatures TL and TR respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.1 Overset mesh system and pressure coefficient solution contours. . . . 83 3.2 Pressure coefficient solution at six wing sections comparing experi- mental, Garfield, and Overflow results. . . . . . . . . . . . . . . 85 3.3 Convergence of Overflow and Garfield. . . . . . . . . . . . . . . 86 3.4 Overset run with the use of GPU memory in Tioga reduces CPU/GPU data transfers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 3.5 Speedup and strong scalability of Kepler (cases 1-5), Pascal (cases 7-10) and Volta (cases 11-13,16) GPUs. . . . . . . . . . . . . . . . . . 93 x 3.6 Strong scaling efficiency of Kepler (cases 1-5), Pascal (cases 7-10) and Volta (cases 11-13,16) GPUs. . . . . . . . . . . . . . . . . . . . . . . . 94 3.7 Cost savings of Kepler (cases 1-5), Pascal (cases 7-10) and Volta (cases 11-13,16) GPUs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3.8 Superimposed deflections at three different azimuths: ? = 0, 90, 180. . 99 3.9 50% and 80% normal force, chord force, and moment coefficients (mean-removed). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 3.10 Short, inviscid NACA0012 wing used as a representative mesh for convergence analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.11 Per-Iteration comparison of convergence with DADI and two Gauss? Seidel methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 3.12 Wall-clock time comparison of convergence with DADI and two Gauss? Seidel methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 3.13 Pressure solution over a laminar cylinder with M? = 0.3 and Re = 100.105 3.14 Per-Iteration comparison of convergence with DADI and GMRES methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 3.15 Wall-clock time comparison of convergence with DADI and GMRES methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 3.16 Solution, grid system, and iBlanks for the overset Poisson equation. . 109 3.17 Convergence results using different linear solver methods: GMRES, O-GMRES, and Gauss?Seidel . . . . . . . . . . . . . . . . . . . . . . 111 3.18 Solution and grid system for time-accurate wedge . . . . . . . . . . . 112 3.19 Sub-iteration and linear solver convergence for 2D wedge case. . . . . 113 4.1 Experimental Setup [3]. . . . . . . . . . . . . . . . . . . . . . . . . . . 118 4.2 Fuselage and blade surfaces for CFD domain. . . . . . . . . . . . . . 119 4.3 Mesh system of the ROBIN simulation in forward flight. . . . . . . . 120 4.4 Iso-surface of vorticity magnitude (0.1). . . . . . . . . . . . . . . . . . 122 4.5 Three Center Locations. . . . . . . . . . . . . . . . . . . . . . . . . . 123 4.6 Three Lateral Locations. . . . . . . . . . . . . . . . . . . . . . . . . . 124 4.7 Mesh System for full configuration. . . . . . . . . . . . . . . . . . . . 127 4.8 Iso-surfaces of vorticity: blade 1 near ? = 0 passes through wake of hub.129 4.9 Comparison of Blade 1 Loads with and without hub and tail rotor. . 129 4.10 Full Solution showing Q-Criterion (Q = 0.0008) colored by Vorticity. 130 4.11 Slices behind rotor showing x-Mach number. . . . . . . . . . . . . . . 131 4.12 Tail-rotor sub-iterative convergence (volume-scaled L2 norm of Density residual). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 4.13 Top-view of the cross-wind Q-criterion solution (Q = 0.0008) colored by vorticity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 4.14 Iteration timeline varying sub-iterations in different solvers. . . . . . . 136 4.15 Tail rotor thrust coefficient. . . . . . . . . . . . . . . . . . . . . . . . 139 4.16 Tail rotor mean-removed thrust coefficient frequencies. . . . . . . . . 139 4.17 Main and tail rotor Q-criterion (Q = 0.0008) solution in forward flight 140 4.18 Tail rotor thrust coefficient . . . . . . . . . . . . . . . . . . . . . . . . 140 4.19 Tail rotor thrust coefficient frequencies . . . . . . . . . . . . . . . . . 141 xi 4.20 Main and tail rotor Q-criterion (Q = 0.0008) solution in cross-wind flight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 4.21 Tail rotor thrust coefficient . . . . . . . . . . . . . . . . . . . . . . . . 144 4.22 Tail rotor thrust coefficient frequencies . . . . . . . . . . . . . . . . . 144 5.1 Iso-surface of Q-criterion (0.0001) with contours of Pressure Coefficient at time t?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.2 Small timestep (CFL=1.71) non-linear and linear convergence. . . . . 152 5.3 Moderate timestep (CFL=3.42) non-linear and linear convergence. . . 154 5.4 Blade cross-section showing cambered, 2% thick flat plate. . . . . . . 155 5.5 Overset grid system for the laminar coaxial rotors. . . . . . . . . . . . 156 5.6 Iso-surface of Q-criterion(0.003) with contours of vorticity. . . . . . . 157 5.7 Grid and slice location at five azimuths. . . . . . . . . . . . . . . . . 158 5.8 Iteration convergence after t?: ?? = 1?. Dual-time CFL is ? for GMRES and O-GMRES methods and 100 for the Iterative method. . 159 5.9 Vorticity solution near the tip of the blades and convergence for ? = 1?.163 5.10 Vorticity solution near the tip of the blades and convergence for ? = 3?.164 5.11 Vorticity solution near the tip of the blades and convergence for ? = 5?.165 5.12 Vorticity solution near the tip of the blades and convergence for ? = 7?.166 5.13 Vorticity solution near the tip of the blades and convergence for ? = 9?.167 xii List of Abbreviations BC Boundary Condition CFD Computational Fluid Dynamics CFL Courant?Friedrichs?Lewy (number) CPU Central Processing Unit CSD Computational Structural Dynamics CUDA Compute Unified Device Architecture DADI Diagonalized Alternating Direction Implicit (preconditioner) FLOPS Floating-point Operations Per Second Garfield GPU-Accelerated Rotor Flow Field (solver) GMRES Generalized Minimal Residual GPU Graphics Processing Unit GS Gauss?Seidel (preconditioner) HamStr Hamlitonial Strand (solver) OverTURNS Overset Transonic Unsteady Rotor Navier?Stokes (solver) O-GMRES Overset Generalized Minimal Residual LHS Left-hand-side MARCC Maryland Advanced Research Computing Center MPI Message Passing Interface MUSCL Monotone Upstream-centered Scheme for Conservation Laws PDE Partial Differential Equation RANS Reynolds-Averaged Navier?Stokes RHS Right-hand-side ROBIN ROtor Body INteraction SMR Single Main Rotor SA Spalart?Allmaras Tioga Topology Independent Overset Grid Assembler WENO Weighted Essentiall Non-Oscillatory xiii Nomenclature A Rotor Area a Speed of sound c Blade or wing chord CT Rotor thrust coefficient = T/(?AV 2 tip) Cl Sectional lift coefficient CT/? Blade loading CP Rotor power coefficient = P/(?AV 3 tip) Cp Pressure coefficient Cp,air Specific heat at constant pressure e Energy (internal + kinetic) kair Thermal conductivity for air Mtip Rotor tip Mach number = Vtip/a? Nb Number of rotor blades P Rotor power Pr Prandtl number = ?Cp,air/kair p Static pressure Qc Q-criterion Q Vector of conserved variables Fi, Gi, Hi Inviscid Flux Vectors in x, y, z-directions Fv, Gv, Hv Viscous Flux Vectors in x, y, z-directions Re Reynolds Number R Rotor Radius S Fluid Strain Tensor T Rotor Thrust ?t Physical timestep size ?? Dual- or Pseudo-timestep size ? Viscous stress tensor u, v, w x, y, z-components of velocity V Total Velocity Vtip Rotor Tip Speed V? Total Free-stream Velocity y+ Non-dimensional turbulent wall spacing ?tw Linear Twist ?0 Blade Collective Pitch Angle ?1c Blade Longitudinal Cyclic Angle ?1s Blade Lateral Cyclic Angle ? Ratio of specific heats ? Rotor angular velocity xiv ? Fluid Rotation Tensor ? Fluid Vorticity ? Rotor azimuth ? Density ? Rotor Solidity = Nbc/(?R) ?, ?, ? Computational coordinate directions All other symbols are defined locally during use. xv Chapter 1: Introduction 1.1 Challenges of Rotorcraft Aerodynamic Simulations Rotorcraft are widely used for their versatility in different flight conditions. The ability to take off and land vertically makes the helicopter a vital aircraft for search and rescue, military, and more recently urban air mobility applications. The aerodynamics of rotorcraft involve interactions between many different components; as rotorcraft designs become more compact and sophisticated, the aerodynamic interactions between components plays a larger role on the performance and safety of the aircraft. The aerodynamics of rotorcraft flight are very different from those of fixed-wing flight, and a key difference is the influence of the rotor wake. Fig. 1.1 shows an illustration of a single-main-rotor helicopter with the tip vortices, indicating the boundary of the rotor wake, shown as dotted lines. In most flight conditions the wake from each blade passes directly in the path of the next passing blade. The wake of the main rotor interacts with the fuselage and in some flight conditions, the tail rotor as well. Not shown in the simple illustration are other components which also contribute to the complex interactions, including the tail empennage and rotor hub. For rotorcraft with multiple rotors, the potential for aerodynamic interaction is increased. A history of momentum theory analysis and studies of two and three-dimensional aerodynamic models have allowed innovative and well-trusted helicopter designs used 1 Figure 1.1: Illustration of the wake of a single-main-rotor helicopter. The tip vortices, shown as dotted lines, may interact with the fuselage and tail rotor. for decades. Limitations of the available design tools, however, also result in flight conditions where the dynamics are not well understood. Maneuvers resulting in unstable or dangerous conditions are identified as best as possible and then avoided by pilots. A new generation of aircraft promises a wealth of novel, compact designs with challenging aerodynamic interactions still not understood or considered in the design process. The increasing power of computational tools for large scale aerodynamic analysis presents a promising aid for analysis of complex designs. 1.1.1 Physics of Rotorcraft Aerodynamics Rotorcraft aerodynamics are characterized by a large envelope of flight condi- tions very different from those seen by fixed-wing aircraft. The two most elementary flight conditions for a helicopter, hover and forward-flight, present many aerodynamic challenges. The primary lifting forces from rotorcraft come from rotors, which include multiple blades. Unsteady rotation, wake interactions, and aeroelastic effects are all important factors which cannot be ignored for rotorcraft analysis. From the perspective of using Computational Fluid Dynamics (CFD) for the simulation of 2 rotorcraft, understanding the key aerodynamic challenges and how they translate into numerical or modeling challenges is crucial. ?r= Vr ? r Figure 1.2: Illustration of aerodynamic variation across a rotor blade in hover. In all rotor flight conditions, the inner and outer portions of the blade operate in very different flow regimes. Due to the rotational velocity of the blade, the outer portion of the blade experiences a much higher flow velocity compared to the blade root. In hover, the effective velocity at a radial station is approximately proportional to the radius r from the center of rotation. Fig. 1.2 shows a schematic of the effective velocity along the blade. The Reynolds number, a ratio of inertial to viscous forces, is defined as ?VrL Re = (1.1) ? where ? is the density, Vr is the effective fluid velocity at some point along the blade, L is a reference length, and ? is dynamic viscosity. The Reynolds number is highest on the outer portion of the blade where the effective velocity is highest. The Mach number, a ratio of the local fluid speed to the speed of sound, is defined as Vr M = (1.2) a where a is the local speed of sound. Mach number is also higher on the outer portion of the blade. With a high enough tip speed, part of a helicopter blade may even 3 operate in transonic conditions resulting in a shock. CFD solvers for rotorcraft aerodynamics must be able to handle a large range of Reynolds number and Mach numbers to accurately resolve the flow physics. V? (?) sin V? ?r + Reverse Flow =Vr Region ? r ? Figure 1.3: Illustration of aerodynamic variation across a rotor blade in forward flight. A reverse-flow region develops on the retreating side. In forward flight, there is an azimuthal variation in the flow condition experience by the blade. Fig. 1.3 shows how the addition of a forward flight speed adds a variation based on azimuthal angle ?. On the retreating blade side, the interior portion of the blade may operate in a reverse flow region, where the fluid travels ?backwards? over the airfoil from the trailing edge to the leading edge. The reverse flow region often results in flow separation and unpredictable loads. A key difference between fixed and rotary wing aerodynamics is that rotors operate in the influence of their own wake. A fixed wing aircraft trails vortices behind the wing but a rotating blade trails a vortex directly into the path of the next following blade. In the context of rotor simulations, CFD must therefore be able to accurately model the rotor wake, in addition to capturing the near-blade aerodynamics. A rotor wake typically extends far below the rotor in hover, indicating the need for a large simulation domain and proper treatment of boundaries. Considering realistic flight conditions for a rotor further complicates the aerody- namics by adding a temporal variation to the flow field. In steady flight the dynamics 4 may be periodic but for maneuvers this is also not necessarily the case. Flight controls for conventional rotorcraft are prescribed using cyclic and collective pitch for the main rotor blades, meaning the pitch of the blade varies with blade azimuth. Either a periodic or entirely unsteady flow condition means that for CFD purposes, an accurate time-marching method is required to analyze temporal variation in the aerodynamics. Most rotorcraft use multiple rotors to balance the aircraft torque and prevent uncontrollable spin. In a conventional single-main-rotor configuration, for example, the tail rotor provides a thrust to counter the torque generated to power the main rotor. In coaxial, tandem, and other multi-rotor designs, the rotational directions of rotors are reversed to cancel torques. A primary challenge from an aerodynamic perspective comes from the interaction between rotors. Fig. 1.4 shows a simulation result of a helicopter flow-field in ?cross-wind? flight, where the vortex trailed from the main rotor enters the tail rotor. Such a flight condition might occur for forward flight in the presence of strong cross-winds. The vortices shed by the retreating side of the rotor combine into a larger vortex, which convects downstream. In cross-wind flight with the tail rotor downstream from the main rotor, the vortices from the main rotor enter the tail rotor plane and can cause unwanted vibrational or acoustic results. In his book on Military Helicopter Design Technology, Prouty [4] discusses the design of the tail rotor and suggests the direction of rotation such that the top of the tail rotor travels aft of the aircraft for ?reasons that are not well understood.? Through wind tunnel and flight-testing, engineers have rules-of-thumb for such design choices which are known to reduce unpredictable loads, but without much aerodynamic justification. Besides main-tail rotor interactions, the placement of horizontal stabilizers is also a major challenge because of interactions with the main rotor wake. Such design considerations for single-main-rotor aircraft become only 5 Tail Rotor Vortices Enter Tail Rotor Main Rotor Vortices Figure 1.4: Rotor wake for a single-main-rotor in cross-wind flight. Main rotor vortices convect into the path of the tail rotor, possibly resulting in a dangerous flight condition. more challenging for configurations with multiple, compact rotors interacting with large lifting surfaces. A CFD simulation for a full, complex rotorcraft configuration should therefore be capable of representing all aerodynamic components of interest. Rotorcraft dynamics is a multi-disciplinary problem with coupling between structural dynamics and aerodynamics. Aeroelastic effects of the rotor blades cannot be ignored. Elastic flap, lag, and torsion all play a vital role in the control system design for the aircraft. Correct blade deformations rely upon correct aerodynamic loads; conversely the aerodynamics require accurate blade deflection. 1.1.2 Numerical Challenges of Rotorcraft CFD The physical challenges associated with moving, deforming, multi-body and wake aerodynamics translate into numerical issues when attempting to simulate a rotorcraft configuration with CFD. The first step of any simulation process is the identification relevant aerodynamic areas and generating a mesh (or meshes) for 6 the simulation. The CFD mesh defines the points where a the flow solution can be analyzed. The meshes will undoubtedly include all relevant aerodynamic bodies but also any area where the capture of fluid structures might influence important engineering quantities. For rotorcraft applications a grid system must be generated for bodies like the rotorcraft blades, as well as areas in the ?background? where the rotor wake is convected. Mesh generation for a background grid is generally straightforward since it can be accomplished with a Cartesian mesh system. The goal of a background grid is generally to resolve wake structures and this is very efficiently done with equispaced Cartesian grids. The primary challenge in the generation of background grid is in knowing how fine the grid should be. In some areas perhaps a very small cell size might be needed to capture fluid structures with low dissipation whereas far away from the area of interest larger cells may suffice. The desire for different cell sizes in a domain may indicate the need for grid stretching or perhaps multiple, combined grids. For a body of simple geometry, a grid may be generated which is curved, but maps to a structured, Cartesian domain. One example might be a cylindrical mesh, where one coordinate maps to the radial direction and another to the angular direction. Airfoils in 2D or blades and wings in 3D are typically of simple geometry and can be meshed using the structured, curvilinear approach. An example of a 2D structured, curvilinear airfoil mesh is shown in Fig. 1.5(a). One coordinate is in the airfoil-wrap direction (with a periodic boundary), and the second coordinate is in the airfoil-normal direction. The flow solution on the structured meshes can then be solved with compatible CFD software. For a body of complex geometry, a mesh that maps to a Cartesian domain may not be possible, in which case an unstructured mesh might be generated instead. An unstructured mesh in 2D might contain a mixture of triangular and quadrilateral 7 elements and allow refinement in areas of aerodynamic interest. An example of a 2D unstructured mesh around an airfoil with an odd shape is shown in Fig. 1.5(b). Such a geometry would be difficult to mesh with a structured approach. While an unstructured mesh allows for significant meshing flexibility, depending on the CFD solver the benefits typically come at a cost of added computational time and/or reduced accuracy. (a) Structured, Curvilinear (b) Unstructured Figure 1.5: Examples of a structured, curvilinear mesh and an unstructured mesh for two airfoils. The pros and cons of three general mesh categories present a core challenge in the use of CFD. Structured or Cartesian meshes are ideal for the background area far away from complex surfaces; unstructured grids are convenient to generate near complex surfaces, but inefficient elsewhere. Section 1.2 reviews some of the creative solutions developed to solve these issues. 1.1.3 CFD Time and Cost Challenges Once the domain and flow condition is specified, a CFD solver has to run on a computer, or cluster of computers, to approximate the flow solution. CFD codes for large applications are typically made to run in parallel across multiple processors on a supercomputer. A supercomputer is a cluster of compute resources, for example cores of a Central Processing Unit (CPU). Based on the domain size, 8 availability of resources, and urgency of the simulation, a CFD run for a complex rotorcraft application could take on the order of days or weeks. For example, the simulation of a 9-second maneuver for the UH-60 Blackhawk helicopter performed by the Army Aviation Development Directorate took 5 days on 3600 CPU cores of a supercomputer [5]. A simulation taking days or weeks may not be feasible for some engineering analysis. For example in the process of rotorcraft design, tens or hundreds of potential designs might want to be tested but the simulation time of even one day is prohibitively high. The alternative of using more computational resources, however, means that the simulation could be significantly more expensive. Avoiding high computation time for simulations is most easily done by reducing the numerical size or complexity of the problem and making assumptions about the physics for part of the domain. For example, the blade aerodynamics of a full, single-main-rotor configuration in hover might be simplified to the analysis of the isolated main rotor. The fuselage, tail rotor, and other components can either be ignored or included in the form of a lower-order model. In the case where more complicated physics are of interest, however, for example with rotor-rotor interactions or rotor-fuselage interaction, fewer assumptions can be made and a full, high-fidelity simulation cannot be avoided. Fortunately for engineers, advancements in computer technology has enabled larger simulations to run faster and on fewer, more affordable resources. The benefit comes, however, at the cost of having to optimize or re-write code to take advantage of new compute architectures. 1.2 Literature Survey of Previous Work Incomplete understanding of complex rotorcraft flight conditions motivates the need for improved high-fidelity, large-scale aerodynamic analysis tools. The 9 focus of this work the development of a computational framework for large-scale, efficient CFD. As such, a literature survey first presents an overview of CFD and other computational tools for interactional aerodynamic analysis. The survey then briefly reviews popular computer architectures and how the evolution of computers has facilitated physics-based simulations. 1.2.1 Computational Aerodynamics The first implementations of CFD for transonic potential flow theory date back to the early 1970s with the work of Magnus and Yoshihara [6] and shortly after by Murman and Cole [7]. Using a mapped Cartesian grid, the transonic small disturbance equations correctly predicted airfoil characteristics for flow-fields with moderate shocks. The transonic potential theory, however, lacked accurate representation of vorticity and prediction of unique shock solutions. By 1981, Jameson et. al. [8] presented an efficient and accurate implementation of the 2D Euler equations for airfoil analysis. The 2D Euler equations applied to a transonic cylinder and NACA0012 were solved using a finite-volume scheme on curvilinear meshes. A scalar dissipation scheme was used with coefficients defined based on local pressure variation. By 1980 implicit formulations of the viscous Navier?Stokes equations were also developed, for example by Pulliam and Steger [9], however the lack of computational power for realistic cases limited the viscous terms to thin-layer approximations. Significant effort was placed in formulating computationally efficient methods for the 3D compressible Navier?Stokes equations. Notably a diagonal alternating-direction implicit form of the governing equations was proposed by Pulliam and Chausee in 1981 [10] and an efficient L-U factorized form by Obayashi and Kuwahara in 1984 [11]. The development of turbulence models closely followed the development of 10 Navier?Stokes codes through the 1970s and 1980s. The works of Cebeci and Smith [12], Baldwin and Lomax [13], and Johnson and King [14] present algebraic turbulence models, which depend on the existence of a well-defined boundary layer. Later, one-equation (e.g. Spalart?Allmaras [15]) and two-equation (e.g. k ?  [16], k ? ? SST [17]) transport models better captured separated flow physics and became widely used for rotorcraft and fixed-wing applications. The use of a turbulence model adds further computational complexity to the flow solver, since it adds another partial differential equation (PDE) to the system of equations solved at each timestep. As mentioned in the previous section, one of the largest challenges for rotor- craft CFD is modeling the wake of a rotor. The wake from each blade must be adequately captured to accurately predict the rotor inflow and in turn, near-blade aerodynamics. The finite-volume and finite-difference methods commonly used in CFD have numerical dissipation, which results in accelerated, non-physical vortex diffusion. Using smaller cell sizes in the wake region helps reduce dissipation but at the cost of greatly increasing the degrees of freedom and cost of computation. One method of reducing wake dissipation is with the use of a Lagrangian free- wake or other lower-order wake model. A free-wake model uses vorticity transport properties in a Lagrangian frame to represent the location and strength of discrete vortex filaments. With the assumption that the wake region far from the blade is governed by inviscid flow, the wake model represents the vortices trailed from blades without numerical dissipation. Rotorcraft applications of free-vortex methods are presented in detail in the work of Bagai and Leishman [18]. Free-wake models have been used with 2D CFD tables for interactional aerodynamic analysis. Yin and Ahmed [19] used free-vortex methods to analyze the acoustics of main-rotor and tail-rotor interactions. For improved, 3D representation of blade physics, the free-wake solution is typically coupled to an Euler or Navier?Stokes CFD solution using the Biot?Savart law at either the blade surface [20] or at a boundary surface 11 CFD Domain Lagrangian Free-Wake Figure 1.6: Hybrid Navier?Stokes + free-wake model of a rotor in hover. The CFD domain is near the blade and Lagrangian free-wake models the wake below the rotor. near the blade [21]. An example of the latter implementation is shown in Fig. 1.6 for a hovering rotor. For the shown case, the Navier?Stokes CFD domain resolves the flow solution to a distance of ? 1.5 chords away and the free-wake markers model the influence of the other rotor blade and wake below the rotor. While the Navier?Stokes to free-wake coupling procedure may work for hover or periodic simulations, application of the Biot-Savart and Bernoulli equations for the influence of the vortices on the Navier?Stokes domain do not extend easily to complex, unsteady flight conditions. When directly modeling the wake with Navier?Stokes CFD cannot be avoided, high-order numerical schemes can be used to reduce dissipation. High-order schemes can be achieved with relative ease on Cartesian grids by using a larger computational stencil. Furthermore with the use of Cartesian mapping, body-fitted structured grids can also make use of high order stencils. High-order central stencils may be used in areas of low gradients but especially near shocks, methods like the 3rd order 12 Monotone Upstream-centered Scheme for Conservation Laws (MUSCL) [22] or 5th order Weighted Essentially Non Oscillatory (WENO) [23] schemes use limiters for added stability. Central, MUSCL, and WENO schemes all require a stencil in a coordinate direction, which for structured or Cartesian meshes is clearly defined. In unstructured meshes it is more difficult to apply line-based stencils without the presence of clear coordinate directions. Most unstructured, finite-volume formulations are therefore limited to second-order accuracy. The numerical methods mentioned thus far have been primarily finite-difference or finite-volume formulations. Historically, the development of CFD has favored finite-volume-type schemes because they inherently follow conservation laws. Finite- element methods, however, have also gained popularity especially in recent years. The finite-element method was successfully used by Brooks and Hughes [24] in the early 1980s using a Streamwise-Upwind Petrov-Galerkin approach. Lately Discontinuous Galerkin (DG) [25] and Hybridizable DG (HDG) [26] have shown tremendous promise in allowing high-order methods on unstructured grids. Finite-element codes can attain high-orders of accuracy by adding quadrature points in cells but the added degrees of freedom increase the numerical cost of the simulation. Furthermore for implicit time-marching schemes, having multiple solution points and shape functions per cell results in a very complex implicit system that is not easily solved using approximate factorization or Gauss?Seidel-type methods. For rotorcraft flows, and other simulations involving motion, the problem emerges of how to mesh a domain when different grid-types are desired in different areas. Near rotor blades or other objects of simple geometry, it may be possible to use a curvilinear, Cartesian-mapped grid. For more complex geometries it may only be feasible to use unstructured meshes near the body. In the background, however, it is beneficial from an accuracy and computational cost perspective to use a simple Cartesian grid. Many researchers have had success using periodicity to simplify the 13 Curvilinear Meshes Unstructured Meshes Cartesian Meshes Figure 1.7: Overset CFD mesh system: unstructured and structured meshes used together. problem in hover. Notable numerical studies for a rotor in hover include those from Caradonna and Isom [27] with potential flow equations, Roberts and Murman [28] and Chen and McCroskey [29] for Euler equations, and Srinivasan et. al. [30] for the Navier?Stokes equations. Modern approaches for rotorcraft applications typically use overset (chimera) methods. Benek et. al. [31] first introduced the overset method in 1983 and referred to the method as ?chimera? because it allowed the use of different mesh topologies together to form a single domain. At the boundary of each individual mesh, interpo- lation is used to stitch together a continuous solution. Duque and Srinivasan [32] applied the overset method to rotor hover applications in 1992, followed by Ahmed and Strawn in 1999 [33], and by Pomin and Wagner [34] for a rotor in forward flight. An example of an overset mesh system for a helicopter using different types of grids is shown in Fig. 1.7. Cartesian grids are used for the background region of the aircraft to resolve the wake. Blades of the main and tail rotor are of simple geometry and are meshed using a curvi-linear approach. The fuselage and hub, however, would be very difficult to mesh using structured grids and therefore use an unstructured approach. 14 Overset methods have been successfully used in a number of CFD solvers and frameworks. For the purposes of this work, a ?solver? refers to stand-alone CFD software whereas a ?framework? refers to a collection of CFD solvers and other numerical tools. The NASA Overflow [35] and FUN3d [36] codes are examples of two CFD solvers that can be used independently for single-grid or overset problems. The HPCMP CREATE AVTM Helios framework [37, 38], from the Department of Defense CREATE-AV program, interfaces with Overflow, FUN3D, and other CFD codes. In addition to CFD codes, frameworks can also use separate modules for grid-motion, domain connectivity, and fluid-structure interaction. The multi-solver, multi-mesh paradigm has been successfully applied to many complex aerodynamic applications including fixed-wing aircraft [39], rotary-wing aircraft [37, 38], and wind-turbine farms [40]. Task 0 Task 1 Task 2 Task 3 Python Python Python Python Solver A Solver A Solver A Solver A Solver B Solver B Solver B Solver B Solver C Solver C Solver C Solver C Connectivity Connectivity Connectivity Connectivity Figure 1.8: Example distribution of tasks suitable for a CPU-only CFD framework. Each solver uses all available MPI tasks. Parallelization for the Helios framework is accomplished with domain decom- position and communication using the Message Passing Interface (MPI). For a framework with multiple CFD solvers, however, there are different ways of distribut- ing resources. MPI uses ?tasks? for parallelization and a task is typically run on a CPU core, but different CPU cores may have access to different hardware. The Helios framework was designed for CPU-based codes, and distributes all work over 15 all tasks. If there are three CFD solvers used, Solver A, Solver B, and Solver C, and a total of 12 MPI tasks, each solver uses all 12 tasks for domain decomposition. An flow chart of this approach in Fig. 1.8 shows the progression of routines on each task for a single simulation iteration. As long as each solver performs load balancing across tasks, the overall framework should be load balanced across all CPU cores. For highly unsteady flow, fluid structures must be accurately captured and convected between overset meshes. Treatment of fringe points at mesh boundaries can have a large effect on convergence for complex problems. The traditional approach, for example used in Overflow and Helios, is to ?freeze? the interpolated fringe points during sub-iterations. Subsequently, the overset boundary conditions in linear solver iterations are treated effectively as Dirichlet. The traditional methods are equivalent to a Schwarz alternating procedure [41], commonly used in numerical treatment of domain decomposition boundaries [42, 43]. The Schwarz method is convenient because the linear solver iterations typically reside within a CFD code that does not have access to overset connectivity routines. Some single-solver implementations have been reported in literature, where the interpolation constraints on the fringe points were included as part of an implicit linear system leading to improved convergence. In particular, Galbraith et. al. [44] used implicit overset boundaries in a discontinuous Galerkin code, showing the implicit treatment of interpolated points results in improved accuracy for a vortex convection case. Brazell et. al. [45] also used implicit interpolated terms in their implementation of a discontinuous Galerkin code to improve accuracy and convergence. The Overture overset code, from the work of Henshaw [46], uses overset interpolation equations as constraints for a multi-grid method applied to elliptic equations. The works of both Galbraith and Brazell used a flexible Generalized Minimal Residual Method (GMRES) for the linear solver of the Newton or quasi-Newton linearization of the Navier?Stokes equations. GMRES is just one of many methods 16 that can be used to solve a linear system. Traditional methods in commonly CFD use Gauss?Seidel sweeps or approximate factorization methods to iteratively approximate the inversion of the linear system. Although GMRES has generally been popular in unstructured solvers like FUN3D (NASA) [47] or mStrand (CREATE AVTM) [48], the linear solver can also help with the convergence of structured grid based codes using large, high-order stencils in areas of large gradients. The GMRES method by itself can take as many iterations as there are unknowns to achieve convergence, which is clearly intractable. Therefore, a suitable preconditioning scheme is always necessary within GMRES to achieve improved convergence. In this context, it is worth noting that both Galbraith and Brazell used the exact same preconditioners for all of their overset meshes. In a multi-solver context, this is not often feasible since the preconditioners that are implemented within each code depends on the specific needs of that solver. 1.2.2 Computational Architectures Advancement of CFD algorithms and methods since the second half of the 20th century were made possible with the evolution of the computer. The work of Magnus and Yoshihara [6] in 1970 was performed on a CDC6400 mainframe computer; a computational domain containing 3700 points took 3.5 hours to simulate. Shortly after Murman and Cole improved the efficiency of the algorithm, achieving computation times of 30 minutes for a 3034-point mesh on an IBM 360/44 mainframe computer. The same pattern of algorithm development and algorithm optimization for computational hardware continues today. One of the most notable advancements in scientific computing technology was the use of vectorizable or parallel operations. A primary conclusion from the work of Jameson et. al. in 1981 [8] was not only the ability to solve the Euler equations, but the ability to do so using vector operations on a Cray 1 computer. 17 Cray Fortran vectorized sections of the code as part of the compilation process and as a result, 500 cycles of a 4096-point domain could be run in 24 seconds. Besides compiler vectorization, OpenMP directives in codes were introduced to take advantage of multiple cores on a given Central Processing Unit (CPU) to make loops over a domain run in parallel. The Message Passing Interface [49] allows clusters of CPUs to communicate efficiently and has been one of the most popular methods of parallelization in scientific computing codes. The availability of multi-core compute architectures has motivated the study of parallelism in popular numerical routines. In the work of Decker et. al. [50] the ARC3D code was made parallel using MPI and routines classified into three different categories: parallel-by-point, parallel-by-line, and parallel-by-plane. Each category describes the granularity with which an algorithm can be parallelized. Although the terminology was applied to compute architectures of the 1990s, the same terminology applies to the modern, massively parallel architectures available today. Task 1 Task 0 Task N Poltergeist Figure 1.9: Domain partitioning. Perhaps the most straight-forward implementation of parallelism is the with the use of domain decomposition. For a domain containing N points, the problem 18 divided into N/2 points on two processors should run in half the time. Using MPI and a multi-core CPU or a cluster of CPUs, CFD domains can be sub-divided and distributed. Large clusters of computers, called supercomputers, were developed for large-scale simulations. Fig. 1.9 shows a cross-section of a coarse Hart II fuselage mesh, which can be divided into N partitions and run on the same number of MPI Tasks. Partitioning can be performed for both structured or unstructured grids, although the presence of coordinate directions in the case of structured grids simplifies the process significantly. The historical trend of CPU performance has generally followed Moore?s law, which states that the number of transistors in an integrated circuit doubles every two years. The exponential growth of CPU computational power continued into the early 2000s, after which chip makers made up for stagnating processor speeds with increased core counts [2, 51]. Fig. 1.10 shows a history of micro processor data with trends for the number of transistors, single-thread performance, frequency, power, and logical core count [2]. Up until 2005, the majority of processors used a single logical core for computing. The slowing of single-threaded performance corresponds directly to the frequency stagnation around 3GHz and the need for additional cores to maintain the trend of increasing overall performance. The transistor count for processors has continued to rise as companies like Intel, NVIDIA, and AMD are able etch smaller and smaller transistors in silicon. The Graphics Processing Unit (GPU) is one example of a many-core architecture that became widely used for video game applications. The GPU, however, is also capable of general purpose computing, as demonstrated in 2001 by Larsen and McAllister [52] for a matrix-matrix multiplication operation. NVIDIA, one of the largest manufacturers of GPUs, released the Compute Unified Device Architecture (CUDA), a C-like interface for general purpose GPU computing. Since then, many research groups have used GPUs for accelerated scientific computing. Directive-based 19 NVIDIA V100 GPU Transistors 107 NVIDIA P100 GPU (?103) NVIDIA K80 GPU 106 105 Single-Thread Perf (SpecINT?103) 104 Frequency 103 (MHz) 2 Power (Watts)10 # Cores 101 100 1970 1980 1990 2000 2010 2020 2030 Year Figure 1.10: 45 years of micro-processor data with the transistor counts of three NVIDIA GPUs highlighted. Original data up to the year 2010 collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New data collected for 2010-2015 by K. Rupp [2]. approaches have also been introduced such as OpenACC, which eliminate the need to write low-level code. Since the GPU is a processor just like the CPU, the performance can be compared to CPU performance. Three GPUs are highlighted in Fig. 1.10: the K80, P100, and V100 GPUs from NVIDIA. All three GPUs are at the top of the spectrum for transistor counts compared to CPUs and the trend shows continued increase. The evolution of processors to include more transistors has resulted in a gradual shift in the hardware used by the world?s most powerful supercomputers. Table 1.1 shows the 5 most powerful supercomputers in the world, based on data collected by TOP500 [1]. Sunway TaihuLight has by far the most number of CPU cores, at over 10 million. Both Summit and Sierra clusters, however, out-perform Sunway TaihuLight because of the presence of GPUs. Both Summit and Sierra use 6 and 4 NVIDIA V100 GPUs per node respectively. 20 # Name Location CPU Cores RMAX GPUs (TFlop/s) 1 Summit USA 2,397,824 143,500.0 yes 2 Sierra USA 1,572,480 94,640.0 yes 3 Sunway TaihuLight China 10,646,600 93,014.6 no 4 Tianhe-2A China 4,981,760 61,444.5 no 5 Piz Daint Switzerland 387,872 21,230.0 yes Table 1.1: Top Supercomputers by Performance [1]. RMAX refers to maximum achieved LINPACK performance in tera-floating point operations per second. Performance for GPU computing can be reported as a metric in absolute units, for example Floating Point Operations per second (FLOPS), or in relative units as a speedup compared to baseline hardware. Typically a speedup is defined compared to a CPU, where the metric is the time on the CPU divided by time on the GPU. To further complicate matters, speedups can be reported verses a single CPU core or the full multi-core CPU and performance varies significantly based on the generation of hardware used. As an order of magnitude comparison, a modern, high-end workstation CPU may have two dozen cores and performance measured in hundreds of Giga-FLOPS. A NVIDIA Tesla (compute-line) GPU has thousands of ?CUDA cores? and performance measured in Tera-FLOPS. Preliminary work from various groups using GPUs for CFD with NVIDIA Fermi (2010) and Kepler (2013) architectures show mixed performance results. Pickering et. al. [53] used OpenACC to accelerate an incompressible, explicit-timestepping, Navier?Stokes solver obtaining up to a 2.5? speedup for a Kepler-generation GPU over a 16-core Xeon processor. Soni et. al. [54] developed a full GPU-accelerated code in CUDA for an overset, compressible Navier?Stokes solver on structured and unstructured meshes. Single-precision and explicit time marching were used, resulting in up to a 27? speedup with the Fermi architecture over a single CPU core. The work of Jespersen [55] used CUDA to accelerate certain routines of the NASA Overflow CFD code, obtaining a 2.5? 3? speedup with the Fermi GPU architecture over a single CPU core. 21 With Pascal (2016) and Volta (2018) GPU architectures, larger, prominent CFD codes have started to shift towards using CUDA for acceleration. The NASA FUN3D code, an unstructured, implicit, compressible Navier?Stokes solver, was recently accelerated using CUDA and benchmarked on Pascal and Volta generation GPUs [56]. Compared to a dual 14-core Xeon node, the Pascal P100 and Volta V100 GPUs show a 3? and 6? speedup respectively. In a per-node comparison on the Summit supercomputer (Oak Ridge National Laboratory), 6 V100 GPUs showed a 23? 27? speedup over the dual-socket IBM Power9 CPUs. The FUN3D team was able to attain such high speedups by optimizing all algorithms and routines with CUDA. 1.3 Objectives The main objectives of this dissertation are organized such that the resulting procedures and knowledge gained from this work may contribute to improved analysis of interactional aerodynamics. ? Develop a GPU-accelerated, implicit, time-accurate Navier?Stokes solver for overset applications. An existing GPU-accelerated solver for single, stationary grids is heavily modified for compatibility with an overset CFD framework and rotorcraft applications. The implicit, finite-volume methodology is implemented and analyzed using different spatial schemes, preconditioners and GMRES as a linear solver. ? Design and implement a framework for the overset multi-solver paradigm amenable to heterogeneous architectures like the GPU. A Python framework to organize function calls and data sharing is implemented and all CFD, grid motion, and domain connectivity modules are wrapped for access from Python. The framework distributes compute resources based on CFD code type (CPU 22 or GPU) and availability of resources on a cluster. ? Identify and derive a robust, flexible algorithm for the overset multi-solver paradigm that considers fringe points implicitly. Improve upon the traditional overset time-stepping approach by using a global GMRES solver applied to all grids. Implicit handling of interpolation regions is done with the goal of improving convergence for highly unsteady cases. The methodology is a framework-level implementation which can be easily adopted by other overset frameworks. ? Verify and validate the individual components of the overset framework. The CFD flow solver, grid motion module, and grid-connectivity codes are indepen- dently tested for verification of behavior and/or validation against experimental results. The overset GMRES algorithm is also validated for simple cases like the Poisson equation and inviscid flow. ? Apply the overset framework to a full, notional rotorcraft configuration. A notional configuration consisting of a ROBIN fuselage, hub, main rotor and tail rotor is run with the heterogeneous CFD framework. Aerodynamics and forces are analyzed for main-tail rotor interactions if forward and cross-wind flight. ? Apply the implicit overset algorithm to a coaxial rotor configuration. The heterogeneous framework along with the developed overset GMRES algorithm is applied to a laminar sphere and coaxial rotors for a Mars helicopter. Both low- Reynolds number case have significant unsteady flow structures passing between meshes, causing convergence challenges with traditional overset methods. 23 1.4 Organization of the Dissertation The current chapter has reviewed challenges and existing approaches for analysis of complex rotorcraft applications using CFD. The problems and existing approaches were discussed from three perspectives: first from a physics standpoint, second from a numerical standpoint, and finally from a cost and performance standpoint. The remainder of the dissertation is organized into chapters for the methodology, validation, key results, and conclusions, in that order. The key results are divided into two chapters to distinguish between different applications. The methodology chapter reviews the development or usage of each component in the overset framework as well as the framework itself. A detailed description of the governing fluid equations and numerical discretization is followed by implementation details for the GPU architecture. The mesh-motion and grid-connectivity modules are presented in context of heterogeneous, overset CFD. The framework itself is then described, including a framework-level implementation of a GMRES linear solver for convergence acceleration. Finally, a load-balancing strategy for CPU and GPU resources is described. The verification and validation chapter presents the results of simple cases to validate sub-sets of the whole framework. Specifically, the developed GPU flow solver is analyzed for convergence and accuracy using single meshes, without any oversetting or grid-motion. The connectivity routines and GPU flow solver are validated together for a steady, transonic wing case. The grid-motion module is validated against a trusted NASA CFD solver for a UH-60 rotor flight condition. The developed overset GMRES algorithm is verified for a Poisson algorithm and validation is also performed for an unsteady, inviscid wedge case. Performance analysis of a steady, overset case run on different GPUs is compared to a similar NASA code to obtain speedup approximations and scalability metrics. 24 The first set of results pertains to a full configuration. The case from the ROBIN wind tunnel experiments, which analyze rotor-fuselage interaction, is augmented with a notional hub and tail rotor. The full configuration is analyzed in a forward-flight condition where the wake of the main rotor enters the tail rotor. Frequency analysis of the tail rotor loads with and without interactional aerodynamics are compared. Analysis of the full configuration and main-tail interactions is also performed for a cross-wind case. The second set of results is concerned with the convergence of simulations. The heterogeneous framework is applied to a low-Reynolds number sphere and a set of laminar coaxial rotors designed for flight on Mars. Both cases can be challenging to converge using overset meshes and the effect of the overset GMRES algorithm on convergence is analyzed in detail. The final chapter discusses conclusions from each of the outlined thesis ob- jectives. Specifically the cumulative findings from the validation cases and results sections are reviewed in the context of advancing the state of the art. Contributions of the present work are described along with recommendations for future research directions. 25 Chapter 2: Methodology This chapter describes the mathematical and numerical methodology for the presented dissertation research. The governing equations and numerical implemen- tation for the CFD solvers, grid motion, and domain connectivity are presented in detail. Design choices for the organization of the heterogeneous framework are dis- cussed in context of performance and load-balancing between parallel tasks. Finally, a framework-level implementation of an overset GMRES algorithm is derived for an implicit global system. 2.1 Fluid Dynamics 2.1.1 Navier?Stokes Equations The Navier?Stokes equations are a set of partial differential equations that gov- ern the physical dynamics of a fluid. The Navier?Stokes equations are a mathematical description of three conservation relations: 1. Conservation of mass 2. Conservation of momentum (in x, y, and z directions) 3. Conservation of energy In strong conservation form, the Navier?Stokes equations are written as [9]: ?Q ?Fi ?Gi ?Hi ?Fv ?Gv ?Hv + + + = + + + S (2.1) ?t ?x ?y ?z ?x ?y ?z 26 where Q is the vector of conserved variables ?? ???????? ???? ?? ?????? ?u ? ? ? ? Q = ??? ?v??? ???? ?w ? ???? (2.2) ??????e with ?, u, v, w and e denoting the local flow density, three components of velocity and total energy, respectively. The inviscid fluxes and Fi,Gi, and Hi are ????? ? ? ???? ?u ??????? ??????? ?v ????? ?? ?? ???? ?u2 + p ? ? ?? ? ? ? ??? ?? ?? ?? ?w ?????? ?vu ???? ???? ?wu ???? Fi = ????? ?uv???? ?? ?? 2 ?? ?uw ??????? , Gi = ?? ?v + p ?? , Hi =? ? ?wv ? ? ? ? (2.3) ??? ???? ?? ?? ??vw ??????? ??????? ? ?w2 + p ?????? u(e+ p) v(e+ p) w(e+ p) ? where p is the local pressure. The viscous fluxes Fv,Gv, and Hv are ????? ????? 0 ? ?? ? ?? ? ???? ?xx ?? Fv = ??????? ? ?yx ??????? (2.4) ???? ?zx ???u?xx + v?xy + w? ?xz ? qx 27 ? ? ??????? ?? 0 ???? ?? ??? ? ?xy ?? Gv = ?? ? ??? ?yy (2.5)?????? ???? ? ?zy ????? u?yx + v?yy + w? ? yz ? qy ????? ?????? 0 ??? ?? ? ???? ?xz ??? Hv = ?????? ???? ?yz ???? (2.6) ? ? ??????zzu? ?zx + v?zy + w?zz ? qz where ? is the stress tensor and q is the rate of thermal conduction. The source term S is zero for all applications considered by this work. The pressure is obtained from the perfect gas law [ } 1 p = (? ? 1) e? ?(u2 + v2 + w2) (2.7) 2 where ? is the ratio of the heat capacity at constant pressure (Cp) to the heat capacity at constant volume (Cv). For air ? is assumed constant at 1.4. The rates of thermal conduction qx, qy and qz are obtained from Fourrier?s law ? ?Tqi = k (2.8) ?xi where xi is the direction, T is the temperature, and k is the thermal conductivity. The temperature is defined as a function of pressure and density using the perfect gas law p T = (2.9) ?R 28 where R is the specific gas constant for air. The stress tensor ? is represented as [( )] [ ] ?ui ?uj ?uk ?ij = ? + + ? ?ij (2.10) ?xj ?xi ?xk where ? and ? = ?2?/3 are the first and second coefficients of viscosity using Stokes? hypothesis. ?ij is the Kronecker delta function. The fluid is assumed to be Newtonian, meaning the strain rate varies linearly with the applied shear. An approximation to the viscosity coefficient is obtained as a function of temperature using Sutherland?s formula T 3/2 ? = C1 (2.11) T + C2 where C1 and C2 are constants for air. 2.1.2 Non-dimensionalization of the Navier?Stokes Equations Non-dimensional quantities such as Reynolds number and Mach number are generally used to describe a flow condition in a generic way that allows comparison without the need for similar unit scales. In CFD, it also normalizes quantities to near unity, ensuring accurate floating point representations. Non-dimensionalization of the Navier?Stokes equations is performed for all quantities as follows: ? ta? ? ? ? (x, y, z) (u, v, w)t = , (x , y , z ) = , (u?, v?, w?) = L L a? (2.12) ?? ? = , T ? T p e ? = , p? = , e? = , ?? = ?? T? ?a2 2? ?a? ?? where ( )? denotes a free-stream quantity and L is a reference length, typically that of an airfoil chord. The non-dimensional parameters used to define the flow condition 29 are therefore: ?V?L Reynolds number: Re? = (2.13) ?? V? Mach number: M? = (2.14) a? ??Cp,air Prandtl number: Pr? = (2.15) kair where Cp is the specific heat at constant pressure. A constant Prandtl number of 0.72 is used for all cases considered in this work. The free-stream velocity is defined as the total velocity from the three components: ? V? = u2 + v2 + w2? ? ? (2.16) When non-dimensional ( )? quantities are substituted into the governing equa- tion, Eq. (2.1), the formulation is identical to the original formulation but with a few non-dimensional parameters appearing. For the remainder of this work the ( )? notation will be dropped and all flow variables are non-dimensional unless otherwise specified. In non-dimensional form the formula for the stress tensor ? becomes: [( )] [ ] ?M? ?ui ?uj ?uk ?ij = + + ? ?ij (2.17) Re? ?xj ?xi ?xk and the non-dimensional thermal heat conduction term becomes: ?M? ?T qi = (2.18) Re?Pr(? ? 1) ?xi 2.1.3 Reynolds-Averaged Navier?Stokes The Navier?Stokes equations govern the flow of both laminar and turbulent regimes. With high Reynolds numbers, however, the onset of turbulence presents a challenge since all turbulent length scales must be accurately resolved. The use of 30 Reynolds-averaging significantly reduces the computational cost of simulations by allowing reasonable cell sizes. The Reynolds-averaged formulation is accomplished using a statistical averaging of the governing equations. Each variable ? is represented as a sum of averaged and fluctuating components ? = ?+ ?? (2.19) where ?? is the fluctuation and ? is the average defined by ? 1 1 t0+?t ? = ? ? ?(t)dt (2.20) ??t t0 where ? = ? for a mass-weighted average. The Reynolds-Averaged Navier?Stokes (RANS) equations are obtained from substituting each flow variable into the original governing equation, Eq. (2.1), using the fluctuation assumption from Eq. (2.19). Using mathematical identities for averaged and fluctuating quantities, many terms cancel and the only resulting difference in the equations is the presence of the Reynolds stress tensor and heat-flux contribution: (?ij)turb = ??u?iu?j, (qi)turb = ??u?e?i (2.21) using the Boussinesq eddy viscosity hypothesis, the effective stress tensor is [ ( ) ] ?ui ?uj 2 ?uk (?ij)total = (?lam + ?turb) + ? ?ij (2.22) ?xj ?xi 3 ?xk and ?turb is found using a turbulence model. 2.1.4 Spalart?Allmaras Turbulence Model The Spalart?Allmaras turbulence model [15] is used to approximate the value of ?turb using transport properties. The model uses emperical constants based on 31 experimental results for turbulence over a flat plate. The laminar kinematic viscosity, ? = ?/?, is used along with the model variable ?? [ ]2 D?? ?? 1 [ ] = cb1S??? ? cw1fw + ? ? ((? + ??)???) + c 2b2(???) (2.23) Dt d? ? where the turbulent eddy kinematic viscosity is ?3 ?? ?t = ??fv1, fv1 = , ? = (2.24) ?3 + c3v1 ? and other variables are [ ]1/6 ?? ? ? 1 + c 6 S? = ? + , f = 1 , f = g w3v2 w ?2d?2 1 + ?fv1 g6 + c6w3 (2.25) ?? g = r + cw2(r 6 ? r), r = S??2d?2 and constants are defined as cb1 =0.135 cw2=0.3 cb2 =0.622 c?1 =7.1 ? =2/3 ? =0.41 cb1 1 + cb2 cw3=2 cw1= + ?2 ? 2.2 GPU-Accelerated Flow Solver The GPU-Accelerated Rotor flow FIELD (Garfield) code solves a discretiza- tion of the RANS equations with SA turbulence model on structured, curvilinear 32 grids. Garfield uses a mix of finite-volume and finite-difference methods, with all fluid quantities stored at the center of grid cells. The use of structured, curvilinear grids limits applications to relatively simple geometries because the grid coordinates must map to an equi-spaced computational domain. Background mesh generation, with or without stretching is trivial. Body-fitted meshes for applications considered in this work use elliptic grid generation for cross sections, extrusion or interpolation between cross section, and smoothing in the span-wise direction. Once a structured grid is generated, a temporal and spatial discretization of the governing equations has to be applied. 2.2.1 Hybrid Finite-Difference, Finite-Volume A finite-difference discretization of the governing Navier?Stokes equation,Eq. (2.1), is accomplished by performing a coordinate transformation from the physical domain, (x, y, z), to the computational domain, (?, ?, ?). The 3? 3 Jacobian matrix (J) and determinant (J) are defined as ?(x, y, z) ( ) J = , J = det J (2.26) ?(?, ?, ?) Through the chain rule, the governing equations re-written in computational space are ?Q? ?F?i ?G?i ?H?i ?F?v ?G?v ?H?v + + + = + + + S (2.27) ?t ?? ?? ?? ?? ?? ?? where the conservative vector is scaled by the Jacobian 33 ? ??????? ? ? ? ???? ?? ???? ?u ?? 1 ? ? ? Q? = J ? ?????? ?v ? ????? ?w ? ? (2.28)????? e ? and the inviscid fluxes are transformed to ????? ????? ?V ? ? ? ? ??? ? ??? ??? ??????? ? 0 ?? ?? ?? ? ?????? ?uV? + ?xp ?x? ?xx + ?y?xy + ?z?xz ?? 1 F?i = ??? ?vV + ? pJ ???? ? y ?????? 1 ? ???? ?wV + ? p ????? , F?v = ????? ?x?yx + ?y?yy + ?z?J ? yz? ?? ??? (2.29) ? z ?? ?x?zx + ?y?zy + ?z? ?zz ????? V?(e+ p) + V ? ?,tp ? ? + ? ? + ? ? ? x x y y z z where ?T ?x = u?xx + v?xy + w?xz + k ?x ?T ?y = u?yx + v?yy + w?yz + k ?y ?T ?z = u?zx + v?zy + w?zz + k . ?z The formulations for H?i,v and G?i,v are the same as for F?i,v but with ? replaced by ? and ? respectively. The contra-variant velocity in coordinate direction ? is defined as ?x ?y ?z V? = ?xu+ ?yv + ?zw ? V?,t, V?,t = ?x + ?y + ?z (2.30) ?t ?t ?t which includes influence of grid motion. Again relations for the two other coordinate directions are found by replacing ? with ? or ?. Note that grid motion appears in the convective fluxes but does not affect the viscous fluxes. 34 In computational space, coordinates (?, ?, ?) are indexed using (j, k, l) for three dimensions. Garfield stores grid coordinates at grid nodes but the conservative solution vector, Q, at the cell center. The entries of the Jacobian tensor, referred to as metrics, can be interpreted geometrically as face normal vectors. For example the face area at location (j, k) in direction ? is ? ? 1 ?? ?x??? ? ? S?,jk = ? J y ??? (2.31) ?z Furthermore the Jacobian J is equivalent to the inverse cell volume on orthogonal grids. The geometric analogy is especially relevant because fluxes are evaluated at faces between cells and not at cell centers. As a result, the finite-difference implementation is equivalent to a finite-volume implementation. Fig. 2.1 shows cell (j, k) for a 2D grid. Cartesian coordinate directions x and y do not line up with grid coordinate directions ? and ?. The fluxes are evaluated at cell faces, which in a finite-volume approach is where the face geometry and vector is well defined. (j, k + 1) 2 y ? (j ? 1 , k) 2 (j, k) x ? (j + 1 , k)2 (j, k ? 1) 2 Figure 2.1: Cell-centered, curvilinear mesh for cell (j, k) with computational coordi- nate directions. 35 The partial derivatives in space from Eq. (2.27) are discretized using a second order stencil. For example in the ? direction, ?F? = F?j+1/2,k ? F?j?1/2,k (2.32) ?? where F? is either the inviscid or viscous flux vector. All discretizations discussed thus far have been in space. Discretization of Eq. (2.27) in time can be accomplished using either an implicit or explicit method. The Euler explicit method is numerically simple but imposes a large restriction on the timestep for the simulation. For rotorcraft flows, where fast-rotating blades result in challenging physical timestep sizes, the Euler implicit method is a commonly used approach. In discussing temporal discretizations, it is convenient to group all spatial terms in a single, residual term. The spatial residual, R, is defined from Eq. (2.27) such that ?Q? + R = 0 (2.33) ?t meaning R contains all spatial operators. Next a second order backwards Euler implicit time marching scheme is imposed n+1 ? n n?13Q? 4Q? + Q? + Rn+1 = 0 (2.34) 2?t The term Rn+1 is non-linear and can be linearized using the first few terms of a Taylor series. A dual-time marching term, denoted with index p, is also required to get rid of linearization error. The dual-time marching term corresponds to a subiteration, or non-linear iteration, and uses symbol ? . The solution at subiteration index p+ 1 serves as an approximation to the solution at physical temporal index n+ 1. The resulting discretization is 36 p+1 ? p p+1 n n?1Q? Q? 3Q? ? 4Q? + Q? ?Rp + + ?Q? = ?Rp (2.35) ?? 2?t ?Q? where p+1 ?Q? = Q? ? pQ? (2.36) which can be further modified into the linear system [ ] p n n?1 I I ?Rp 3Q? ? 4Q? + Q? + + ?Q? = ?Rp ? . (2.37) ?? 2/3?t ?Q? 2?t where I is the identity matrix. The temporal terms with t and ? on the left-hand-side can be lumped into a single time term t? [ I ?Rp ] p n n?1 + ?Q? = ?Rp ? 3Q? ? 4Q? + Q? . (2.38) ?t? ?Q? 2?t where ?? ?t? = (2.39) 1 + 3?? 2?t which can be scaled by the Jacobian (inverse volume) and re-written as [ I ?Rp ] p ? ? ? p ? 3Q ? 4Q n + Qn?1 + J ?Q = J R . (2.40) ?t? ?Q 2?t The volume-scaled formulation from Eq. (2.40) is preferred over Eq. (2.38) for two reasons. First, it replaces computational vector Q? with the more familiar, original conservative vector Q. Second, the right-hand-side of Eq. (2.40) is the residual for unsteady simulations used to measure convergence. Without volume scaling the convergence trend is biased by the influence of large cells, which typically converge well because they are far from surfaces. 37 The right-hand-side (RHS) of Eq. (2.40) is the explicit side, because all data is readily available. The left-hand-side (LHS) of Eq. (2.40) is the implicit side, which forms a matrix that needs to be inverted to find the solution, ?Q. The entire RHS, including the temporal terms, is referred to as the residual for time-accurate computation. The L2-norm of the residual is a measure of subiteration convergence, since as the solution at p approaches p + 1 = n + 1, Eq. (2.34) is recovered. The inversion of the LHS matrix is rarely exactly computed and inverted, since this process is typically very computationally expensive. Instead there are three general procedures that are used in Garfield to obtain the solution ?Q: 1. Approximate Factorization preconditioner: Diagonalized Alternating Direction Implicit (DADI), applied iteratively. 2. Red-Black Gauss?Seidel preconditioner: Point Red-Black, Line Red-Black, or mixed, applied iteratively. 3. Generalized Minimal Residual Method (GMRES): A minimization-problem approach to solving linear systems. 2.2.2 Inviscid Fluxes The inviscid flux contribution to the residual from the hybrid finite-volume, finite-difference approach use the formulation ?F? = F?j+1/2,k ? F?j?1/2,k (2.41) ?? for the ?-direction. The flux quantity can be either interpreted in computational space using the second-order finite-difference relation, or in geometric space where the j + 1/2 index corresponds to a ? face of the cell with normal S?. Computing the flux at the face is performed in two steps. The fluxes are functions of the conservative or primitive variables. The first step is to reconstruct 38 the value of the flow variables at the face in a way that allows for discontinuities (for example in the case of shocks). After reconstruction, values on the left and right sides of the interface are combined into a single flux using the Roe Riemann solver [57]. Reconstruction in Garfield is performed for using primitive variables ?,u,v,w, and p. Two different reconstruction schemes are available for use. The Monotone Upstream-centered Scheme for Conservation Laws (MUSCL) [22] uses a 3-cell stencil to for a 3rd order accurate reconstruction. The MUSCL reconstruction stencil for the right and left values of face j ? 1/2 is shown in Fig. 2.2. qL qR j ? 2 j ? 1 j j + 1 Figure 2.2: MUSCL 3-point stencil for left and right primitive variables qL and qR. A mathematical form of the scheme is [ ] L 1 ( ) 1 ( )qj?1/2 = q + ?j [ q3 ( j+1 ? qj + q ? q (2.42)) 6 ( j j?11 1 )] qRj+1/2 = q + ?j qj+1 ? qj + qj ? qj?1 (2.43)3 6 where the Koren?s limiter [22] ? is given by 3?qj?qj +  ?j = (2.44) 2(?qj ??q )2j + 3?qj?q2j +  and ? and ? are forward and backward differences respectively. For any given face, the stencil requires two points on both sides of that face. 39 qL qR j ? 3 j ? 2 j ? 1 j j + 1 Figure 2.3: WENO 5-point stencil for left and right primitive variables qL and qR. The 5th order Weighted Essentially Non Oscillatory (WENO) scheme [23,58] uses a larger stencil, as shown in Fig. 2.3. For clarity, the figure shown only the stencil for the left state, however just as with the MUSCL scheme, the right state stencil is a mirror image of the left stencil. The scheme is implemented with two different options for weights: either from Jiang and Shu [23] or from Borges et. al. [58]. Once left and right primitive variables are found for a face, the Roe-flux Riemann solver [57] is used to construct a flux with appropriate dissipation to ensure numerical stability. The Roe-flux calculations can be summarized as follows F?(qL) + F?(qR) F?j?1/2 = ? ||A (qL, qRRoe )||(qR ? qL) (2.45) 2 where ||ARoe(qL, qR)|| is the flux Jacobian matrix, ?F?/?Q? constructed using so called Roe-averaged values, and the absolute value of all eigenvalues. A detailed description of the methodology is given in the work of Blazek [59]. The implementation in Garfield also uses Harten?s entropy correction [60] to correct behavior near sonic points: ?? ?? ? |?c| if |?c| > |?c| = (2.46)?2c+2 if |? | ?  2 c 40 where ?c is an eigenvalue of the flux Jacobian. 2.2.3 Viscous Fluxes The viscous fluxes are also found at the faces of each cell and require both the flow quantities and gradients at the center of the face. The quantities are found using a simple average in the face direction. For the x-direction velocity, for example: uj + uj+1 uj+1/2 = . (2.47) 2 Finding the gradient of quantities is more challenging. Gradients are required for the flow velocities, u, v, w, and the temperature T . Using the grid metrics, finite differences in the computational domain can be translated into gradients in the physical domain. Using the x-direction velocity as an example: ?u ?u ?? ?u ?? ?u ?? = + + ?x ?? ?x ?? ?x ?? ?x ?u ?u ?u = ?x + ?x + ?x (2.48) ?? ?? ?? where stencils for ?u , ?u , ?u need to be obtained. Using the ?-direction face as an ?? ?? ?? example, the stencil is simply ) ?u = uj+1 ? uj (2.49) ?? j+1/2 the other directions require averaging components from neighboring cells. Fig. 2.4 shows the cells used in the computational stencil for cross terms. 41 Flux Face Solution Locations ? ? ? Figure 2.4: Viscous flux stencil for finite-difference relations. 2.2.4 DADI Preconditioner The Diagonalized Alternating Direction Implicit (DADI) preconditioner [10] uses an approximate factorization technique to approximate the solution of the linear system from Eq. (2.40). Consider only the inviscid terms from the left-hand-side of the system: ?Rp ? ?F?i ? ?G?i ? ?H?i = + + ?Q ?Q ?? ?Q ?? ?Q ?? ? ?F?i ? ?G?i ? ?H?i = + + ?? ?Q ?? ?Q ?? ?Q ? ? ? = A + B + C ?? ?? ?? = ??A +??B +??C (2.50) where A, B and C are the flux Jacobians in the j-,k-,and l-directions respectively. The full left hand side of the linear system can be written as [ ] I + ??A + ??B + ??C ?Q (2.51) ?t? After multiplying through by the timestep, Eq. (2.51) can be approximately 42 factored into [ ] [ ] [ ] I + ?t???A I + ?t???B I + ?t???C ?Q (2.52) Each flux Jacobian can be factored into eigenvectors and eigenvalues. Assuming the eigenvectors do not vary significantly in space, the eigenvectors can be factored out of the bracketed quantities entirely: [ ] ? [ ] [ ]T 1? I + ?t????? T? T? I + ?t????? T?1? T? I + ?t?? ? T?1? ? ? ?Q (2.53) Because analytical forms of the eigenvectors are known, they can easily be inverted. The bracketed quantities in Eq. (2.53) are scalar, and form a tri-diagonal system if the finite difference operator ? uses an upwind scheme. Each set of tri-diagonal systems can be inverted analytically using the Thomas or Sherman?Morrison (if periodic) algorithms. The DADI method as described does not include any influence of viscous terms. Influence of viscosity can be added as an eigenvalue contribution to the system. The viscous eigenvalues are ( ) ?L ?T ? 2 2 2 x + ?y + ?z ?v,? = ? + (2.54) PrL PrT J for the ? direction where ()L and ()T denote laminar and turbulent quantities respectively. Eigenvalues in the other two coordinate directions are found by replacing ? with ? or ?. 2.2.5 Red-Black Gauss?Seidel Preconditioner Gauss?Seidel-type preconditioners take a different approach to approximating the inversion of the left-hand-side. The same first step is taken as with DADI 43 [ ] I + ??A + ??B + ??C ?Q = RHS (2.55) ?t? but the flux Jacobians are kept intact as 5 ? 5 block matrices. Furthermore the flux Jacobians are the full Jacobians of the Roe-fluxes but using only a first-order reconstruction. The matrix on the left-hand-side can be broken down into lower, diagonal, and upper quantities: [I + ?t?(LA + DA + UA+ LB + DB + UB+ LC + DC + UC)]?Q = RHS (2.56) where A, B and C subscripts denote the different coordinate directions and L, D and U denote the lower, diagonal, and upper blocks of the matrix. A coloring method is chosen to paint cells and each color is solved independently of the other colors. A red-black coloring method uses two colors and a point-red-black results in the coloring shown in Fig. 2.5(a) (blue is used in the place of black to distinguish from the document text color). Point-red-black methods are popular for unstructured CFD solvers since they do not rely upon line structures in the grid [47]. Since the blue cells only depend on red cells, an approximation to ?Q can be used to move all of the red values to the right-hand side of the equation: [I + ?t?(DA + DB + DC)]?Q = RHS ??t(LA + UA + LB + UB + LC + UC)?Q (2.57) Subsequently the left-hand-side is easily invertible and a new approximation to 44 ?Q is obtained. Alternating back and forth between colors results in a converged approximation to ?Q. (a) Point Coloring (b) Line Coloring Figure 2.5: Illustration of red-black coloring. Line red-black coloring is shown in Fig. 2.5(b) where each wall-normal line is painted the same color. The resulting equations now include a block tri-diagonal system in the direction of the lines: [I + ?t(LA + DA + UA + DB + DC)]?Q = RHS ??t(LB + UB + LC + UC)?Q (2.58) Solving a block tri-diagonal system can be done with a block-implementation of the Thomas algorithm. While the tri-diagonal solve is more computationally expensive than the simple diagonal inversion from the point method, the line-implicit nature of the algorithm results in improved stability. Garfield has the option of using the point Gauss?Seidel algorithm, line algorithm, or a mixture of both. In the case of a mixed approach for body meshes, the wall-normal direction (the stiffest numerically) uses a line method and the two wall-parallel coordinates use the point method. The Red-Black Gauss?Seidel method described has not included viscous effects. 45 A similar approach to the one used for the DADI algorithm is applied to the Red- Black algorithm. Because the Red-Black method does not decompose matrices into eigenvectors, however, an approximation is made by adding the discretized eigenvalues to the diagonals of the block matrices. 2.2.6 GMRES Linear Solver The Generalized Minimal Residual (GMRES) linear solver implemented in Garfield is identical to the framework-wide implementation described in Sec. 2.8. 2.2.7 Boundary Conditions As a structured, cell-centered code, Garfield stores the grid coordinates at nodes and solution at cell centers. Ghost cells are extended from the physical domain to prescribe boundary conditions. Ghost cells at physical boundaries are prescribed such that when they are used in a spatial stencil, the resulting solution at the boundary face matches the desired physical boundary. Fig. 2.6 illustrates interior points as filled circles, ghost points as hollow circles, and the location of boundaries. Two ghost points are used in this example, as required by the MUSCL scheme. Physical Boundaries Interior Cells Ghost Cells Figure 2.6: Full domain consisting of interior cells, ghost, cells, and boundaries. Domain decomposition allows a single domain to be distributed onto different, concurrent MPI tasks. The full domain from Fig. 2.6 can be divided onto two MPI tasks as shown in Fig. 2.7. Additional MPI boundaries are introduced and ghost 46 points get their values from corresponding field points in the neighboring meshes. MPI Boundaries Physical Boundaries Figure 2.7: Partitioned domain on two MPI tasks. Besides an MPI-boundary, there are other geometric boundary conditions where ghost points are assigned values from elsewhere in the mesh. One is a periodic boundary, which appears most commonly in the wrap-around direction of an airfoil mesh. A periodic boundary for a 2D cross-section of an O-mesh is shown in Fig. 2.8. Ghost points are added across the periodic boundary and exchanged each sub- iteration. Periodic boundaries also results in the use of the Sherman?Morrison algorithm in the DADI routine to solver a periodic tri-diagonal system of equations. k Periodic Boundary j Figure 2.8: O-Mesh curvilinear coordinates result in a periodic boundary in the wrap-around direction. A wake boundary can appear in 2D or 3D C-type meshes as well as at the 47 root and tip of blade meshes. An example of coarse blade mesh is shown in Fig. 2.9. As the span-wise (l-direction) sections of the mesh approach the tip of the blade, the sections start to fan towards the end of the blade and eventually collapse to a plane. This behavior occurs on both the root and tip of the blade. Similar to the treatment of periodic boundaries, the existence of physical domain cells across the boundary means that ghost points can be given real coordinates and values from interior points in the mesh. j k l lmax lmax?1 lmax?2 lmax?3 Figure 2.9: Wake boundary condition at root and tip of blade or wing meshes. Physical boundaries such as walls are still treated with ghost points however the ghost points do not have meaningful coordinates. For an inviscid wall, the velocities of ghost points are mirrored across the wall boundary, as shown in Fig. 2.10. For a viscous wall, the velocities in ghost points are reversed to enforce a no-slip condition, as shown in Fig. 2.11. Values of density and energy are prescribed such that the wall boundary is adiabatic. 48 Field Points Ghost Points Figure 2.10: Inviscid wall boundary condition: mirrored velocities. Field Points Ghost Points Figure 2.11: Viscous wall boundary condition: negative velocities. 2.3 Algorithm Implementation 2.3.1 Numerical Implementation on the GPU The flow solver and other computationally expensive routines in Garfield are all implemented on the GPU using data stored in GPU memory. Data transfer between the CPU and GPU is generally limited by a PCIe bus, where transfer rates are 32GB/s. In contrast, memory bandwidth on the GPU is an order of magnitude higher, up to 900 GB/s for the Tesla V100 device. A core design strategy for a good GPU code is therefore to keep all data on the GPU and reduce data transfers to the CPU as much as possible. Each routine in Garfield is parallelized using CUDA. Once the grids and domain are initialized, function calls launch CUDA kernels, which are handled in parallel by the GPUs many ?stream multiprocessors.? A flowchart of the main flow 49 50 CPU GPU =?Variable?Parallel GPU???GPU WENO Roe?Flux =?Point?ParallelTransfer MPI?BC 5 =?Line?Parallel Viscous?Flux Start Loop?Over? Done? Reconstruction Iteration Boundaries? Order DADI SA?Turbulence?Model Physical BC 3 Line?Red?Black Interior?BC MUSCL ApproximateImplicit?Method Linear? Point?Red?Black Solver? GMRES Figure 2.12: Flowchart showing Garfield flow solver routines. Different levels of parallelism (point, variable, and line) are represented using different numbers of arrows. solver routines in Garfield is shown in Fig. 2.12. For flux and boundary routines, a kernel is launched for each cell. For the j direction flux, an illustration of kernel (j, k, l) is shown in Fig. 2.13, where the specified kernel is responsible for the flux on the highlighted face. The gray cells show the locations of possibly concurrent kernels. Using one kernel per cell is labeled as ?point parallel? in Fig. 2.12. The reconstruction routines (WENO or MUSCL), from Sec. 2.2.2, are slightly different in that a kernel is launched per direction, per face, per variable, since the mathematics in preparation for the left and right states of the Riemann solver are the same for all five flow variables. The additional parallelism is labeled as ?variable parallel? in Fig. 2.12. Kernel(j,k,l) l k j Figure 2.13: One CUDA Kernel used per (j, k, l) cell (?Point-Parallel?). On the left-hand-side of the discretized linear system, Eq. (2.27), the implicit routines of the DADI or line Gauss?Seidel algorithms are not fully parallelizeable. For DADI, inversion of eigenvectors and preparation of eigenvalues for the tri-diagonal system is parallel, and therefore one kernel is launched per cell. The inversion of the tri-diagonal system, however, is done with the Thomas or Sherman?Morrison algorithm and is inherently serial. To maximize parallelism on the GPU, a kernel is launched for each of the five eigenvalues in each cell of a k-l plane for the j-direction tri-diagonal system. The (? thi, k, l) kernel, shown in Fig. 2.14 then loops over a j line of the domain. The same procedure is subsequently used for k and l directions. In the flowchart in Fig. 2.12, the kernels for routines involving tri-diagonal systems 51 are labeled as ?line-parallel.? For the line Gauss?Seidel method, the full block- tri-diagonal system is solved with one thread per line, since there are no factored eigenvalues. Kernel(?i,k,l) l k j Figure 2.14: In J-direction, one CUDA Kernel used per cell eigenvalue in a k ? l plane. A kernel is a function that runs on the GPU in parallel in a multi-dimensional grid of threads and blocks. For example, 16 kernels could be launched with the use of 2 blocks and 8 threads-per-block. In general, the threads and blocks are three- dimensional vectors, which conveniently map to the structured 3D computational CFD domain. The optimal number of threads-per-block is hardware specific, though in Garfield the number 128 has been found to work well with modern GPUs. The 128 threads can be organized, for example, using 3D indexing as 16 in the j-direction, 8 in the k-direction, and 2 in the l-direction. Then the number blocks required to run kernels over the full set of jtot? ktot? ltot points is found such that each point can be assigned 1 thread. For example, calling the viscous flux routine could be done as shown in Listing 2.1. Listing 2.1: Use of 3D blocks and threads mapping the 3D computational domain. 1 dim3 blocks , thr (16,8 ,2); 2 blocks.x = (jtot -1)/ thr.x+1; 3 blocks.y = (ktot -1)/ thr.y+1; 52 4 blocks.z = (ltot -1)/ thr.z+1; 5 viscous_flux <<>>(...); Inside the viscous flux kernel, the function can determine which index that kernel is responsible for based on thread and block index variables, as shown in Listing 2.2. Listing 2.2: Each kernel determines which point it is in charge of using block and thread indices. 1 __global__ void viscous_flux (...){ 2 int j = blockDim.x * blockIdx.x + threadIdx.x; 3 int k = blockDim.y * blockIdx.y + threadIdx.y; 4 int l = blockDim.z * blockIdx.z + threadIdx.z; 5 // ... 6 } Similar to the CPU, the GPU has a memory/cache hierarchy. There is global memory, which is the most commonly used form on the GPU, although it is relatively slow to access. In the kernels there is a register cache, which is not large but has the fastest read/write time. In between is shared memory, which operates as a manually managed cache shared by all threads on a single block. For stencil-type operations where neighboring values are required, using the shared memory cache significantly improves performance by reducing global memory operations. As an example, Fig. 2.15 shows a 2D version of the viscous flux stencil from Fig. 2.4. The flux face of interest, colored in red, requires neighboring cell info which may not exist within the same thread block. The (4? 4) threads perform (6? 6) loads into shared memory; once the shared memory is loaded it can be used by all threads repeatedly and very efficiently. 53 Ghost Cached Memory (0, 3) (1, 3) (2, 3) (3, 3) Interior Cached Memory Thread Index (0, 2) (1, 2) (2, 2) (3, 2) (0, 1) (1, 1) (2, 1) (3, 1) Flux Face (j ? 12 , k) (0, 0) (1, 0) (2, 0) (3, 0) Stencil includes interior and ghost memory Figure 2.15: Set of (4 ? 4) kernel threads using (6 ? 6) shared memory cache for efficient application of a stencil. 2.3.2 Framework Wrapping Framework details are presented in Sec. 2.7; for Garfield to work in the Python framework, Garfield needs to have functions accessible to Python. As primarily a C++ code, Garfield uses the native Python-C API to create a Python- readable binary. This approach requires more verbosity in programming compared to alternatives like SWIG or BoostPython but the primary advantage is full compatibility on all compute clusters. Core functions accessible from the Python framework with a brief description are summarized in Table 2.1. The last four, mvp, precondition, vdp, and getrhs, are for use with GMRES, described in Sec. 2.8. A more detailed explanation and definition of iBlanking is included in Sec. 2.6. Routines which either return vectors or are passed vectors, do so using pointers. Vectors of CPU data are wrapped using the popular NumPy numerical python library. Vectors of GPU data can also be passed with the use of Python Capsules. Capsules are meant specifically for data passed between two C-like codes. So although the 54 data cannot be accessed directly from Python, when passed to another module like for grid motion or connectivity, those modules can access the pointers inside. Function Name Description startStep Called at the start of a physical timestep to save previous solutions (for finite-differences in time). updateIblank Indicate to Garfield the map of overset interpolation nodes has been updated. updateGrid Indicate to Garfield the grid nodes have moved, indi- cating the need to calculate grid velocity. getUnstructData Obtain a Python dictionary storing pointers to the grid, solution, and other pertinent information required for overset connectivity. writeSolution Write a solution file (CGNS binary format). computeForces Compute and return integrated forces on walls in a spec- ified direction. spanLoads Compute and return a span-wise distribution of loads, typically for use in a structural dynamics code. runSubSteps Take a number of non-linear sub-iterations (usually 1). mvp Perform a Matrix-Vector-Product (MVP) operation on the provided vector and the left-hand-side matrix of the linear system, Eq. (2.40). precondition Perform a precondition operation on the provided vector, either with DADI or the specified flavor of Red-Black Gauss?Seidel methods. vdp Perform a vector dot product on two vectors. getrhs Return the right-hand-side of Eq. (2.40). Table 2.1: Python-visible functions in Garfield. 2.4 Other CFD Solvers A multi-solver CFD framework is not complete without at least a second solver to run in conjunction with Garfield. Three other codes are used in this framework, and each is a RANS solvers with the same governing equations presented for Garfield. This section will briefly review each solver, without going into implementation details. 55 2.4.1 OverTURNS The Overset Transonic Unsteady Rotor Navier?Stokes (OverTURNS) solver has been used for the past few decades for rotorcraft simulation. From the 1990s [30, 30] an early version of TURNS (without oversetting) was used for a free-wake model coupled to an Euler/Navier?Stokes code. More recently the solver has been used extensively for the overset analysis of a hovering S-76 rotor [61] and micro air vehicle applications [62]. OverTURNS is a structured, finite-difference code with a very similar discretiza- tion to that used in Garfield, described in Sec. 2.2. The main difference is that OverTURNS runs on the CPU and stores the solution at the grid nodes, not at the cell centers. OverTURNS still uses a hybrid finite-difference, finite-volume approach; the control volume is constructed around each grid node. OverTURNS has all overset functionality built into the FORTRAN code, however it has also been wrapped in Python (using F2PY) for use in an overset framework. The same Python functions from Table 2.1 are accessible for OverTURNS. 2.4.2 HamStr The Hamiltonian Strand (HamStr) solver is another in-house development effort at the University of Maryland [63]. The finite-volume code for unstructured, hexahedral meshes uses Hamiltonian paths through each cell to make use of line- based reconstruction and preconditioning techniques. Triangular surface meshes are transformed into all-quad meshes through sub-division, as shown in Fig. 2.16. The opposite faces of each quad are connected using a line which can be used for reconstruction (ie. WENO or MUSCL), and line-based preconditioners (ie. Line Gauss?Seidel or DADI). The surface-normal direction is generated using a strand- approach, where the mesh is marched outward from the surface of a body. The 56 strand-direction also forms a line used for spatial and implicit schemes. ? ? Triangle Obtain local Sub-division coordinate-directions Figure 2.16: General procedure in 2D for obtaining Hamiltonian paths used as local coordinate directions. HamStr is run entirely on the CPU and is wrapped for use with Python using SWIG. All the functions from Table 2.1 are made accessible to Python for HamStr. 2.4.3 mStrand mStrand is a multi-strand solver [48] present in the HPCMP CREATETM-AV Helios framework, developed at the Army Aviation Development Directorate. Access to mStrand was solely through a pre-compiled library on Department of Defense Supercomputing Resource Centers. Because Helios is a Python framework, routines in mStrand are accessible from Python. All of the functionality listed in Table 2.1 is available in mStrand and an additional Python layer was added to properly map function names and parameters. As a strand solver, mStrand uses unstructured grids in the wall-surface direction which are marched outward in the strand direction. A unique feature of mStrand is the ability to assign multiple nodes at the same coincident point on sharp convex corners. The use of multiple strands emanating from the same node helps with grid resolution and cell quality for complex geometries. 57 2.4.4 Overflow NASA?s Overflow solver [35] is used in this work as a trusted and well optimized CPU code for comparison against Garfield. Both Overflow and Garfield are structured, implicit, compressible, Reynolds-Averaged Navier?Stokes solvers and modifications to Overflow have been made to ensure consistent implicit algorithms are used between both codes. Engineering results from Overflow serve as a trusted baseline for accuracy comparisons. Timing results from Overflow serve as a baseline for estimating performance speedup provided by the GPU. The primary algorithmic differences between Garfield and Overflow is the location of the solution in the grid. Garfield stores the solution at cell centers and uses a ghost-point strategy to apply boundary conditions. Overflow stores solutions at grid nodes and can directly apply a restriction to points on a physical boundary. The main implications of cell vs node-centered differences apply to the wake boundary at the tip of a blade or wing mesh. At a wake boundary, the structured mesh folds on itself, as previously shown in Fig. 2.9. With a cell-centered approach, Garfield prescribes ghost cells across the fold and while cells might be skewed, the full stencil is maintained. With a vertex-centered approach, Overflow drops to a lower order model and performs averaging of points above and below the boundary. All explicit routines for flux computation of the mean-flow equations are equivalent between Overflow and Garfield. The turbulence model convection, dissipation, and source terms are also equivalent between solvers with the exception with the calculation of vorticity. Overflow uses a finite-difference method and grid metrics to compute vorticity whereas Garfield uses a linear-least-squares algorithm. The finite-difference approach is very efficient and works well for good quality meshes however the linear-least-squares method is found to improve accuracy 58 in regions of highly skewed cells. Overflow version 2.2l is used, which does not have exactly the same DADI algorithm as implemented in Garfield. Instead, Overflow has two methods which can be easily combined to get the DADI method. The first is the Diagonally Dominant Diagonalized Alternating Direction Implicit (D3ADI) [64], which results in an upwinded, scalar tri-diagonal in the three coordinate directions. The second is a three-factor Diagonalized ADI method but with central dissipation, resulting in a penta-diagonal system. The upwinded, tri-diagonal routines from D3ADI are used but without the diagonal dominance along with the eigenvector inversion routines from the penta-diagonal diagonalized ADI scheme. 2.5 Grid Motion Grid motion is implemented as an independent module for an overset CFD framework. The principles of grid motion simplify to transformation matrix multipli- cations, where the transform corresponds to translation or rotation. For applications with rigid grid motion, all points in a mesh undergo the same transform. With elastic deformation, the deflection at each point in a mesh may be different. In representing the elastic deformation of rotorcraft blades, it is important to accurately represent the motion of the blade surface without causing errors in the mesh. This section describes the mathematical implementation of an algebraic grid motion and elastic deformation module. 2.5.1 Transformation Matrices Given a point in space with coordinate (x0, y0, z0), the point can be translated by (?x,?y,?z) with the transformation: 59 ? ? ???0 0 0 ?x???0 0 0 ?y? ?? T = ? ???? (2.59)?0 0 0 ?z? 0 0 0 1 Such that the new point has coordinates ?? ?? x0? ??? ?? y ?0[x, y, z, 1] = T?? z0 ?? ?? (2.60)? 1 Similarly, rotation matrices can be defined which describe rotation of a point around the x-axis by ?x, y-axis by ?y, and z-axis by ?z: ?? ???1 0 0 0? ?0 cos(?x) ?sin(?x) 0 Rx = ???? ? ??? 0 sin(? ) cos(? ) 0?x x ?? ? 0 0 0? ? ? 1 ? ?? cos(?y) 0 sin(?y) 0??? ???cos(?z) ?sin(?z) 0 0? ??? 0 1 0 0??? ??? ? sin(?z) cos(?z) 0 0?? Ry = ? ? , Rz = ? ??? (2.61)??sin(?y) 0 cos(?y) 0? ? 0 0 1 0? 0 0 0 1 0 0 0 1 Combining translation and rotation operations using multiplication, the transfor- mation matrix for rotation by ?x, ?y, ?z about point (x0, y0, z0) and translated by (?x,?y,?z) is 60 M = T ?1?T0RzRyRxT0 (2.62) where T0 is translation by (x0, y0, z0) and T? is translation by (?x,?y,?z). The sequence, from right to left, is to translate such that the origin is at the point of interest, rotate about x, y, and z, translate back, and perform the final ? translation. The multiplication of these matrices can be done analytically. 2.5.2 Elastic Deflection Elastic deflection information typically comes from a Computational Structural Dynamics (CSD) code. For rotorcraft blade deflection from 1D beam analysis, it is common to obtain deflections about points defined in a line along the blade. An example for a swept-tip rotor blade is shown in Fig. 2.17. The line where discrete deflections are defined may shift based on the blade geometry. The general methodology is to find the nearest point on the deflection line to each CFD mesh point and deform the CFD point by applying the interpolated deflection. If the line of deflection points has any curves or kinks, however, there may be a discontinuity in the deflections applied to certain locations of the CFD mesh. In Fig. 2.17, two neighboring CFD points may get very different interpolated deflection values. Discrete location of defined deflections y x CFD Mesh Surface CFD Mesh Field Figure 2.17: Control points along blade where deflections are defined. Distances to CFD coordinates may be discontinuous. To avoid deflection discontinuities in the CFD domain, the deflection points are 61 projected onto a straight line, using the distance from the first deflection point as the parameterizing variable. The modified formulation is no longer exact on the blade surface, however assuming any blade sweep occurs near the blade tip, angles and corresponding error are both small. Once the closest point on the projected line is determined, linear interpolation between control points is used to find the deflection. Deflections are still performed about the original defined deflections points, not the projected approximation. An important detail in the implementation of CFD motion is that no assumptions are made about the topology of the CFD mesh. The procedure is the same for curvilinear, Cartesian, or unstructured meshes. Discrete location of Projected location of defined deflections defined deflections y x CFD Mesh Surface CFD Mesh Field Figure 2.18: Use of projected control points results in continuous distances to CFD coordinates. Six quantities are provided at each discrete deflection point: ?x, ?y, ?z, ?x, ?y, ?z. The deflections are performed about the coordinate of the specified deflection point in the undeformed frame. At each step, the blade must therefore be transformed back to its original position (along the x-axis) before deflections can be applied. The procedure for deforming a point in a blade mesh and rotating it about the z-axis is: Mn = RzDn (2.63) 62 ?? ? ? ??? xn ??? ? xn?1???? yn ??? ? ? ??? ?? y1 n? ?1 ?= MnMn?1 ??? ??? (2.64)? zn ? ? zn?1 ? 1 1 where Mn is the forward transformation from the undeformed frame (with the blade along the x-axis) consisting of a deflection D and rotation Rz. D changes for every grid point and therefore M is also different at each point. The inverse transform M?1n?1 un-does the movement from the previous step. Rotation is assumed about the z-axis however rotation about an arbitrary axis is possible with the use of a frame transform. The frame transform is another matrix defining another rotation and translation, allowing rotors to be place arbitrarily in space. With the frame transform, the effective transformation matrix M is Mn = FrRzDn (2.65) where Fr is the rotor frame transformation. 2.5.3 Grid Velocity The grid motion module defines the location of grids at discrete times and this information must be translated into grid velocity within each CFD solver. For rigid grid motion, the shapes and volumes of each cell remains constant and therefore finite-difference relations can be used to obtain a grid velocity: ?x xn ? xn?1 = (2.66) ?t ?t where n is the current timestep and n ? 1 is the previous timestep. A second order backwards differencing procedure can also be used. When cells are skewed 63 or volumes change during elastic grid motion, however, simple finite-differences do not conserve mass, momentum, and energy in the governing equations. Instead, a Geometric Conservation Law (GCL) interpretation of grid motion should be used. For finite-volume methods, the GCL implementation is done by finding the velocity of the cell faces, instead of the cell centers. Since the face velocities appear in the discretization of fluxes, any changes in cell volume from grid motion are accounted for as a flux. A common method of finding the face velocity is to find the volume swept by the face over time ?t, and to divide that volume by the face area multiplied by ?t. Different treatments of CGL conditions are described in the work of Blazek [59]. 2.5.4 Framework Wrapping The motion module is written in C++ and wrapped for use with Python using the native Python-C API. This is the same approach taken for Garfield, presented in Sec. 2.3.2. Core functions accessible from the Python framework with a brief description are summarized in Table2.2. Function Name Description register data Pass pointers of grid coordinates from CFD. step Step the grid to the next location in space. write forces Write sectional blade forces for use with CFD-CSD cou- pling applications. Table 2.2: Python-visible functions in the motion module. 2.6 Domain Connectivity The open source Topology Independent Overset Grid Assembler (Tioga) is used to connect and interpolate between overset domains. Tioga has been used extensively by other research groups [45,65] for both finite-volume and high-order finite-element cases. For all overset cases presented in this work Tioga performs two tasks: 64 1. Connect grids ? Classify points as hole,fringe, or field. ? For fringe points, find surrounding donor points in a neighboring mesh and compute interpolation weights. 2. Exchange Data ? Given a vector, interpolate the data at each fringe location and send the values to the corresponding mesh. Classification of points, often called ?hole-cutting?, is done using an integer iBlank array. A hole point is denoted as a point in one mesh that resides within a solid-wall boundary of another mesh, tagged using iBlank=0. A fringe point is denoted as a point that receives an interpolated value in another mesh, tagged using iBlank=-1. A fringe point is either near the boundary of a mesh or in a region where another mesh has finer resolution. A field point is a regular point where the solution is solved on that mesh, tagged using iBlank=1. 2.6.1 Connectivity and Mesh Grouping Tioga is topology agnostic and works with structured or unstructured grids. Given a coordinate that is a desired interpolation location, the search for donors is performed by sorting the grid coordinates in an Alternating Digital Tree (ADT). Once the ADT is created, the search time is fast. Creating the search tree is expensive, however, and must be done every time a grid moves. Grids that are not moving should not be reconnected every timestep, since this would be a waste of computation. Tioga does not distinguish between moving or non-moving grids but the distinction can be made at the framework level by grouping the grids. Fig. 2.19 shows an example with two rotating blades and two background 65 Stationary Full Domain Reconnecting Figure 2.19: Grouping of stationary meshes and reconnecting separately so that only the ?reconnecting? group reconnects every timestep. grids. The two blades and finest nested mesh need to be reconnected every timestep, and are categorized as ?reconnecting?. The two background meshes only need to be connected once at the beginning of the simulation. The finer nested mesh in green appears in both groups and will be assigned two iBlanking values. The two maps can be combined into a single iBlanking map with the simple operator shown in Eq. (2.67). iB = min(iBreconnecting, iBstationary) (2.67) Garfield is a cell-centered code however Tioga requires grid coordinates and solutions be stored at the same location. To remedy this Garfield constructs a dual-mesh of grid nodes aligned with the cell centers of the original mesh where solutions are stored. For structured grids the dual-mesh is trivial to construct and only has to be done once even for unsteady cases. 66 2.6.2 Interpolation Once the iBlank map is set and all interpolation donors have been assigned, data can be passed for interpolation and exchanged between meshes. Tioga uses second order transfinite linear interpolation. All interpolated values are then efficiently packaged and sent with MPI to their respective fringe cells. Fig. 2.20 illustrates in 2D the interpolation procedure between a Cartesian background mesh (orange) and a fringe point in a cylindrical mesh. Solid symbols represent field points (iBlank=1 ) and hollow symbols represent fringe points (iBlank=-1 ). Interpolate Fringe Figure 2.20: Illustration of the interpolation step for two overset grids with fringe points as hollow symbols, field points as solid symbols. Tioga is a CPU-based code and for many of the connectivity routines, altering the code to use GPU memory would require significant effort. The interpolation routines, however, are more straightforward. Furthermore while reconnecting grids occurs a maximum of once per timestep, exchanging data between grids can occur dozens of times per timestep for time-accurate cases. A flowchart of a baseline implementation with all routines on the CPU is shown in Fig. 2.21(a). Because Garfield does not have information about which cells are required for interpolation, the entire domain must be transferred back to the CPU. This is very costly, since CPU-GPU data transfer is limited by a relatively low- 67 CPU GPU Python Start Data to GPU Iteration Flow Solver Overset Exchange Overset Data to CPU Interpolation Tioga Garfield (a) Baseline Tioga CPU GPU Python Garfield Start Flow Solver Iteration Data to GPU Overset Interpolation Overset Exchange Data to CPU Tioga (b) Tioga w/ Interpolation on GPU Figure 2.21: Flowchart of routines showing location of data: on the CPU or on the GPU. Performing interpolation on the GPU drastically reduces data transfer to the CPU. bandwidth hardware bus. Furthermore, interpolation is a highly parallel algorithm but is performed on the CPU, a serial device. Fig. 2.21(b) shows the updated flowchart for a framework where Tioga handles GPU data from Garfield and performs the interpolation on the GPU. In this case Garfield passes only pointers of GPU memory to Tioga, avoiding transfers 68 back to the CPU. After interpolation, only a small fraction of the initial data set is transferred back to the CPU and exchanged between MPI tasks. Tioga does not make use of CUDA-aware MPI (for GPU-GPU transfer) for two reasons. First, since Tioga is written as a CPU code, the MPI packing and transfer routines are not written for the GPU and would take significant effort to write in CUDA. Second, not all clusters used for this work support GPU-GPU communication and the data therefore still travels through CPU memory buffers. 2.6.3 Framework Wrapping Tioga is primarily written in C++ and wrapped for use with Python using the native Python-C API. This is the same approach taken for Garfield, presented in Sec. 2.3.2. Core functions accessible from the Python framework with a brief description are summarized in Table 2.3. Function Name Description register data Pass pointers of grid data from CFD. connect Regenerate iBlank map and search for donors for each fringe point. update Given a vector of data, interpolate the quantities, send to the relevant mesh, and update the relevant fringes. Table 2.3: Python-visible functions for Tioga. 2.7 Framework Description The CFD solvers, grid motion module, and domain connectivity all fit together in a single framework, as illustrated in Fig. 2.22. Python is a convenient scripting language used to tie together each of the framework components. Python is very well supported across the majority of clusters as a flexible and user-friendly programming interface. 69 Setup CFD Solver A Overset Python Connectivity CFD Solver B Grid CFD Solver C Motion Figure 2.22: Modules for CFD framework. Data can be shared between modules using Python dictionaries. The gridData dictionary from a CFD solver, for example, includes all information about the grid coordinates and topology. A few of the lines are shown in Listing 2.3. Note that for CPU data, Python has convenient access to data for debugging. Listing 2.3: Example dictionary of grid data shared between modules of the frame- work. 1 gridData = { 2 ?grid -coordinates ? = array ([0.1, 2.3, ..., 3.4]) , 3 ?hexaConn ? = array ([0 ,1 ,... ,123]) , 4 ?iblanking ? = array ([1 ,1 ,... ,1]) 5 # ... 6 } 2.7.1 Load Balancing One of the main tasks performed by the framework is to distribute cluster resources between solvers. A feasible cluster layout with 2 GPUs and 16 CPU cores 70 per node is shown in Fig. 2.23. The framework is started with an allotted number of MPI tasks. If all solvers are CPU-based, the simplest solution is to run each solver on all tasks. This approach, previously presented in Sec. 1.2.1 describing the procedure from the Helios framework, is shown in Fig. 2.24. Node 0 Node 1 GPU 0 GPU 1 GPU 0 GPU 1 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 8 9 10 11 12 13 14 15 CPU cores Figure 2.23: Cluster node topology with 2 GPUs and 16 CPU cores per node. Task 0 Task 1 Task 2 Task 3 Python Python Python Python Solver A Solver A Solver A Solver A Solver B Solver B Solver B Solver B Solver C Solver C Solver C Solver C Connectivity Connectivity Connectivity Connectivity Figure 2.24: Example distribution of tasks suitable for a CPU-only CFD framework. Each solver uses all available MPI tasks. When CPU and GPU solvers are used together, it is better to separate solvers by task based on the available hardware resources. Fig. 2.25 shows an example using three solvers distributed between four MPI tasks. Solver A is split between two tasks, whereas Solver B and C each have 1 task. Typically dozens or hundreds of tasks may be launched depending on the size of the simulation, meaning no single solver would typically be assigned only a single core. 71 Task 0 Task 1 Task 2 Task 3 Python Python Python Python Solver A Solver A Solver B Solver C Connectivity Connectivity Connectivity Connectivity Figure 2.25: Example distribution of tasks suitable for a CPU-GPU framework. Different tasks assigned different solvers. The example node topology from Fig. 2.23 can be used to illustrate a scenario with three solvers, where Solver A is GPU-accelerated and the other two use only CPU cores. A GPU code, like Garfield, is assigned 6 cores: 4 on the first node sharing 2 GPUs and 2 on the second node sharing 2 GPUs. If the other two solvers are CPU-based, the rest of the cores are distributed sequentially. The distribution of cores for GPU solvers is based on the number of GPUs. In the example, there are 2 GPUs per node so the framework loops over all nodes, assigning 2 tasks per node until the total number of tasks is reached. This is why Node 0 has 4 GPU tasks running and Node 1 has 2. Node 0 Node 1 GPU 0 GPU 1 GPU 0 GPU 1 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 8 9 10 11 12 13 14 15 Solver A Solver B Solver C Figure 2.26: Example distribution of resources for a 3-Solver framework where Solver A is GPU-accelerated. 2.8 GMRES and Overset GMRES Once a set of PDEs are discretized and linearized, as in Eq. (2.40), the resulting system of linear algebraic equations can be represented as a general linear system 72 of form Ax = b. The Generalized Minimal Residual (GMRES) algorithm is one of many different methods that can be used to solve such a system. Compared to more traditional, iterative methods like Gauss?Seidel or approximate factorization, GMRES is a minimization algorithm on the quantity ||b? Ax||. This minimization is accomplished with the use of m Krylov-subspace vectors Vm, where each vector is the size of the solution vector x. A more detailed overview of GMRES is presented in detail in the work of Yousef Saad [66]. For brevity this section will focus on applications of GMRES to CFD. GMRES with m vectors applied to the Navier?Stokes equations to approximate the change in the solution ?Q will require m ? sizeof(?Q) in memory. High memory usage is one of the disadvantages of GMRES compared to Gauss?Seidel or approximate factorization methods. However, in the era of exa-scale computing, domains are split onto enough processors that memory often no longer becomes a primary bottleneck. Algorithm 1: Right-Preconditioned GMRES Algorithm. 1: Compute ro = b? Axo, ? := ||ro||2, and v1 := ro/? 2: for j = 1, 2, ...,m do ?? 3: Compute rj := precondition(vj) ??????? . Preconditioner4: Compute wj := Arj ? . Matrix-Vector Product 5: for i = 1, ..., j do ???? 6: hij := (wj, vi) ? 7: w := w ? h v ????? . Vector Dot Productj j ij i 8: end for ? 9: h ?j+1,j = ||wj||2 ??????? Linear Solver (Krylov) Iterations 10: if hj+1,j == 0 then 11: m := j ???? 12: goto 18 ???? 13: end if ?? 14: vj+1 = wj/hj+1,j 15: end for 16: Define the (m + 1) ? (m) Hessenberg matrix H? = {hij}1?i?m+1,1?j?m and Vm = [v1...vm] 17: Compute ym, the minimum of ||?e1 ? H?my||2 and rm = Vmym 18: Compute xm = xo+ precondition(rm) 73 The GMRES algorithm with right-preconditioning is shown in Alg. 1. Precon- ditioning is the process of altering the system of equations to a different system that has the same solution but is less stiff and hence easier to solve. Preconditioning is required to obtain good performance from the GMRES algorithm and generally right preconditioning is preferred over left because it preserves the magnitude of the resid- ual within linear iterations. In Alg. 1, semantically r is the result of preconditioning the vector vj. Mathematically the preconditioner is often expressed as a matrix M, which approximates the matrix A but is easily invertible such that for the system Mr = vj, r = M ?1vj. As recommended by Saad [66], the Hessenberg matrix from Alg. 1 is made upper-triangular with Givens rotations in each Krylov iteration, resulting in an easily solvable minimization problem and approximation of the residual in each linear solver iteration. For time-accurate simulations the residual can therefore be tracked for each temporal step, non-linear sub-iteration, and linear solver iteration (Krylov iteration). The three most expensive steps from Alg. 1 are highlighted: the preconditioner, matrix-vector product, and vector dot product. This section uses a 1-dimensional heat-equation problem to describe these steps for a simple overset system. Consider the heat (Poisson) equation between two walls where the left wall is kept at tem- perature TL and the right wall is kept at temperature TR with a thermal diffusivity constant of 1. A discretization on two overset meshes using a total of 8 points with ?x = 1 is shown in Fig. 2.27. 0 1 2 3 TL TR 4 5 6 7 Figure 2.27: 1D heat problem between left and right walls at temperatures TL and TR respectively. 74 The discrete linear system for the full overset system is shown in Eq. (2.68) with each element colored by the grid color, red or blue. Points 0 and 7 are boundary points fixed at the exact temperature solution. Points 1,2,5 and 6 are interior points and therefore have the familiar central discretization stencil [1,?2, 1]. Point 3 receives an interpolated quantity from points 5 and 6; point 4 receives an interpolated quantity from points 1 and 2. Note the interpolation coefficients (0.5 for simplicity) are on the left-hand side of the linear system and therefore treated implicitly. ?? ?? ? ???1? ???? x0?? ?? ? TL? ? ??1 ?2 1 ?? ?? ?x1?? ?? 0 ?? ?? 1 ?2 1?? ? ????????? ? ? ? x ? ?2?? ?? 0 ??? ?? 1 ?0.5 ?0.5 ???? ??? ?? ????? x3??? ? ? ? ?0.5 ?0.5 1 ??x ? ? ???? 0 ?? = ? ?? (2.68)4 ??? 1 ?2 1 ??????x ??? ? ? ??? ?? 0 ???? 5 0? ?? 1 ?2 1????x6?? ?? ??0 ?? 1 x7 TR The red and blue meshes are coupled by the interpolation weights. The linear systems can be de-coupled with a Schwartz iteration procedure such that the solution at time n is dependent on an approximate solution at time n? 1. The de-coupled system, shown in Eq. (2.69) is the traditional, Schwarz-alternating, approach that most overset solvers take because it means each grid is treated separately. 75 ????1? ? ?? ?? ? ? ? ?xn0? ? TL ? ?? ? ? ? 1 ?2 1 ?????xn? ? ?1? ? 0? ??? 1 ?2 1 ???? ???? ? ????? xn2???? ???? ?0 ??? 1 ?xn3 ???? ???????? ? ??? ???? ? ? ?0.5(x n?1 5 + x n?1 6 )? 1 xn? = ? (2.69)? ??0.5(xn? ?1 + xn?1 ?4 1 2 )?? ???? 1 ?2 1?? ? ???????xn? ? ?5??? ??? 0 ??? 1 ?2 1????xn6?? ?? 0 ?? 1 xn7 TR A GMRES algorithm which treats all interpolated boundaries implicitly cannot rely on the decoupled system from Eq. (2.69) and should instead solve the coupled system Eq. (2.68). For convenient tagging of field and fringe points an iBlanking procedure is used to set iB = 1 at field points and iB = ?1 at interpolated fringe points. A simple Gauss?Seidel preconditioner can then be applied in two loops: one over the field points and another over the fringe points. Pseudo-code for a Gauss?Seidel preconditioner applied to the example 8-point problem is shown in Alg. 2. Algorithm 2: 1D Example: Gauss?Seidel Preconditioning r = M?1v. 1: function Precondition(v) 2: Let r[0, . . . , 7] = 0 be a new array 3: for s = 1, . . . , Nsweep do 4: for i = 1, 2, 5, 6 do . All field points 5: ri := (vi ? ri?1 ? ri+1)/(?2) 6: end for 7: end for 8: r3 = 0.5r5 + 0.5r6 . Iterpolation at fringe points 9: r4 = 0.5r1 + 0.5r2 10: return r 11: end function 76 Algorithm 3: 1D Example: Matrix-Vector Product w = Ar. 1: function MVP(r) 2: Let w[0, . . . , 7] = 0 be a new array 3: for i = 1, 2, 5, 6 do . All field points 4: wi := ri?1 ? 2ri + ri+1 5: end for 6: return w 7: end function Note the preconditioning of the interior points ignores the fringe points and the preconditioning of fringe points is merely an interpolation of the other preconditioned points. From Alg.1, the preconditioning step is followed by a matrix-vector product. By preconditioning the fringe points as the interpolant of preconditioned field points, the matrix-vector product at the fringe points is guaranteed to be zero as long as no interpolated point is used as a donor in the interpolation of another point. Pseudo-code for a matrix-vector product routine for the 8-point 1D problem is shown if Alg. 3. While the matrix-vector product uses the interpolated values at fringe points, the matrix-vector product at fringe points themselves are zero and need not be filled. The last expensive step of the GMRES algorithm is the vector dot product. Since the procedure thus far has guaranteed that values in blanked indices are zero, the vector dot product of the full vector wj from line 4 of Alg.1 is the sum of the vector dot products of each individual mesh. As long as each solver ignores blanked values from vectors in the precondition and matrix-vector product steps, the only influence of coupling between meshes comes from interpolation after preconditioning and conducting a global sum for the vector dot product. The O-GMRES algorithm including communication between meshes is shown in Alg.4. The highlighted lines show changes from the original algorithm. For the Euler or Navier?Stokes equations the residual, i.e. the RHS of Eq. (2.38) is proportional to the volume of the cell and therefore the Krylov vectors vj would 77 also be proportional to volume and cannot be directly compared between meshes. The volume-scaled system of equations Eq 2.40 should instead be used to guarantee that Krylov vectors are compatable between meshes and the converged solution has smooth contours across overset boundaries. The O-GMRES algorithm is written in Python as a partitioned algorithm that does not build the full global solution vector on any processor. Nor does O-GMRES require additional memory over the traditional GMRES algorithm. The only requirement is that each solver expose routines for preconditioning and matrix-vector product. Python handles pointers to large buffers of memory but all numerical operations are handled with function calls to fast, compiled languages. Pseudo-code with the traditional de-coupled GMRES algorithm is shown in Lst. 2.4. The improved implementation is shown in Lst. 2.5, where now the GMRES algorithm is written as part of the framework and the connectivity information is built in to the Krylov iterations. 78 Algorithm 4: Right-Preconditioned O-GMRES Algorithm. 1: Compute ro = b? Axo, ? := ||ro||2, and v1 := ro/? 2: for j = 1, 2, ...,m do 3: Compute rj := precondition(vj) . Preconditioner 4: overset interpolate(rj) 5: Compute wj := Arj . Matrix-Vector Product 6: for i = 1, ..., j do 7: hij := (wj, vi) . Vector Dot Product 8: overset sum(hij) 9: wj := wj ? hijvi 10: end for 11: hj+1,j = ||wj||2 12: if hj+1,j == 0 then 13: m := j 14: goto 18 15: end if 16: vj+1 = wj/hj+1,j 17: end for 18: Define the (m+ 1)? (m) Hessenberg matrix H? = {hij}1?i?m+1,1?j?m 19: Compute ym, the minimum of ||?e1 ? H?my||2 and rm = Vmym 20: Compute xm = xo+ precondition(rm) 21: overset interpolate(xm) Listing 2.4: Traditional Framework. 1 for t in timesteps: 2 for s in sub_iterations: 3 connectivity.exchange(q) 4 # CFD solver GMRES (frozen fringe points) 5 CFDSolver.do_GMRES () 6 # ... 79 Listing 2.5: O-GMRES Framework. 1 for t in timesteps: 2 for s in sub_iterations: 3 # ... 4 for j in krylov_iterations: 5 # ... 6 connectivity.exchange(r_j) 7 # ... In all codes where GMRES is used, the matrix-vector product is computed using the approximation [ ] ?R ? R(Q + X)?R(Q)X (2.70) ?Q  where  is a small scalar chosen based on the magnitude of Q. Eq. (2.70) uses the property of Fre?chet derivatives and has been shown to work well for CFD applications [47,67]. Using the finite-difference approximation introduces error in the calculation of the matrix-vector product however this error is small compared to the error of the first-order linearization between Eq. (2.34) and Eq. (2.40). 80 Chapter 3: Verification and Validation This chapter presents verification and validation of four primary components of the full overset framework: Garfield flow solver, overset connectivity, elastic grid-motion, and the Overset-GMRES algorithm. Garfield in particular, as a novel, GPU-accelerated code, is analyzed for accuracy, algorithm convergence, and performance. The components are analyzed either individually or in sub-sets of the full framework using simple cases. The first simple application considers an Onera M6 wing used to validate the accuracy of Garfield against results from experiment and the NASA Overflow CFD solver. The Onera M6 case uses two overset grids and also serves as a sim- ple application to verify proper treatment of overset boundaries with the Tioga connectivity code. The Onera M6 case is then subsequently used in a scalability study to analyze timing, performance, and cost. Again NASA?s Overflow solver is used as a baseline for timing comparisons to estimate speedups and cost-savings for parallel GPU and CPU codes. The third section verifies the proper treatment of elastic grid-motion by looking at a UH-60 rotor in high-speed forward flight. Grid motion must be properly handled in two components of the framework. First, the grid-motion module must accurately and efficiently deform each blade. Second, Garfield must correctly apply the grid-velocities in a conservative way to the governing, discretized equations. NASA?s Overflow code is used for comparison of sectional airloads given the same set of 81 elastic deflections. The fourth chapter section focuses on single-grid convergence for Garfield with the newly implemented implicit algorithms. Convergence with both point- and line-variations of the Red-Black Gauss?Seidel method are compared to the original DADI method for an inviscid wing. The GMRES method with DADI as preconditioner in Garfield is compared to DADI alone for a laminar cylinder to verify improved convergence characteristics. The final verification and validation section continues to analyze convergence but from the framework level for overset cases. The developed Overset-GMRES (partitioned) algorithm implementation is verified against a single-processor imple- mentation for the 2D Poisson equations. Single-solver inviscid flow over a wedge is also used for convergence analysis with and without the implicit treatment of interpolation boundaries. 3.1 Onera-M6 Case The Onera-M6 wing experiments of Schmitt et. al. from 1979 [68] provide a popular validation data set for CFD. The swept aircraft wing was mounted in a wind tunnel and run at transonic conditions with a Mach number of M? = 0.8395 and angle of attack ? = 3.06?. The Reynolds number of the case is Re = 11.72 ? 106, based on the free-stream velocity and mean chord. The solution is characterized by the ?-shaped shock appearing on the upper surface of the wing. For the purposes of this study, the wing is mirrored in the span-wise direction to double the problem size and avoid modeling the wind tunnel wall. The problem size is intentionally doubled to create a large domain suitable for a scalability study, presented in Sec. 3.2. The wing geometry, provided in the experimental report, is meshed using an O-O topology. A simple O-O mesh is shown in Fig. 2.9 from a previous chapter. The j coordinate is the airfoil wrap-around direction resulting in a 82 Figure 3.1: Overset mesh system and pressure coefficient solution contours. periodic boundary at the start and end. The k-coordinate is in the airfoil normal direction, resulting in a no-slip boundary at one end and an overset boundary far from the wing. The l-boundaries are ?wake? boundaries where the mesh collapses to a plane. The non-dimensional viscous spacing at the wall corresponds to y+ = 1.38 and the near-body mesh extends to a distance 2c from the wing, where c is the mean chord of the wing. The number of points in the wrap, normal, and span-wise directions is 291, 110, and 179 respectively for a total of 5.73 million points. The off-body domain is a stretched Cartesian mesh containing 289 points in the stream-wise direction and 241 in the other two directions for a total of 16.8 million points. The cell spacing varies from 0.08c near the wing to 1c in the far-field, where c is the mean chord of the wing. A hyperbolic tangent function is used to stretch the grid in all three directions. A view of the mesh system is shown in Fig. 3.1 with the pressure solution on the wing. The resulting background mesh has width and height of 50c? 50c and a streamwise-length of 60c. The four span-wise-locations shown in Fig. 3.1 are selected locations where pressure-tap experimental data is available for comparison. Validation is also per- 83 formed against the NASA Overflow solver using the same meshes. Overflow, presented in more detail in Sec. 2.4.4, is also a structured, curvilinear RANS solver but for the CPU. The source has been modified to ensure the same explicit and implicit algorithms are used between Garfield and Overflow. The pressure coefficient at six locations from experiment, Garfield, and Overflow are shown in Fig. 3.2. The span percentage is defined such that 0% is at the center of the CFD wing mesh (near the wind-tunnel wall) and 100% is at the wing tip. Garfield (solid) and Overflow (dotted) lines of pressure coefficient are almost indistinguishable from each other. The most notable difference is at the 99% span location near x/c = 0.95 and Overflow over-predicts (higher negative CP ) the pressure compared to both Garfield and experiment. The low-pressure bump observed in the experimental results at the tip of the blade near the trailing edge of the blade is from the blade tip vortex. Garfield likely predicts a better solution compared to Overflow in this region because at the wake boundary near the blade tip Overflow reduces to a lower order spatial scheme whereas Garfield maintains the full reconstruction stencil. The only large discrepancy between the CFD and experimental results away from the wing tip occurs at the 80% span location. There is a double shock at the 80% span-wise location but the shocks are much closer together compared to the 40% or 60% locations. Garfield and Overflow do a reasonable job at predicting the location of the second shock but both codes over-predict the location of the first shock by about 5% chord. The discrepancy could be a result of a mesh resolution or turbulence model limitations. 84 ?1 0 44% Span ?1 0 65% Span. . ?0.5 ?0.5 0.0 0.0 GARFIELD GARFIELD 0.5 O 0.5VERFLOW OVERFLOW Experiment Experiment 1.0 1.0 0.0 0.5 1.0 0.0 0.5 1.0 x/c x/c (a) 44% Span (b) 65% Span ?1 0 80% Span. ?1 90% Span.0 ?0.5 ?0.5 0.0 0.0 GARFIELD GARFIELD 0.5 O 0.5VERFLOW OVERFLOW Experiment Experiment 1.0 1.0 0.0 0.5 1.0 0.0 0.5 1.0 x/c x/c (c) 80% Span (d) 90% Span ?1 0 95% Span. ?1 0 99% Span. ?0.5 ?0.5 0.0 0.0 GARFIELD GARFIELD 0.5 O 0.5VERFLOW OVERFLOW Experiment Experiment 1.0 1.0 0.0 0.5 1.0 0.0 0.5 1.0 x/c x/c (e) 95% Span (f) 99% Span Figure 3.2: Pressure coefficient solution at six wing sections comparing experimental, Garfield, and Overflow results. 85 Cp Cp Cp Cp Cp Cp 10?6 GARFIELD OVERFLOW 10?8 10?10 10?12 10?14 0 5000 10000 15000 20000 Iteration Figure 3.3: Convergence of Overflow and Garfield. Convergence per-iteration for Garfield and Overflow is compared in Fig. 3.3. Convergence over time is also a useful metric for comparison; timing results are presented for the same wing case in Sec. 3.2.2. For the first 7,000 iterations the convergence trends of both codes are very similar, as is expected for codes using the same spatial and temporal schemes. After approximately 7,000 iterations, however, Garfield appears to converge at a faster rate than Overflow. Using 10?12 as convergence criteria, Overflow requires 18,647 iterations compared to 13,777 with Garfield, indicating roughly a 25% difference. The convergence difference may be attributed to the difference in boundary condition for the wake cut at the wing tips. The use of ghost points in Garfield means that data is included as part of the implicit scheme ?across? the wake, whereas in Overflow the wake boundaries are held fixed during the implicit DADI algorithm. 86 Density Residual L2 Norm 3.2 Performance, Scalability, and Timing The performance of an overset framework refers to how efficiently components of the framework operate for a representative problem. Scalability refers to how efficiently the domain be divided to run on additional computing hardware. For CPU codes, a scalability study involves distribution of the domain onto different numbers of CPU cores. For a GPU code, different numbers of GPUs are used instead. Finally timing analysis gives an indication of speedup against a baseline implementation. Three different clusters are used for performance, scaling, and timing studies: the Maryland Advanced Research Computing Center (MARCC) cluster, the Mustang cluster from the Department of Defense Supercomputing Resource Center, and the West-1B region of Google Cloud Platform. NASA?s Overflow code, described in Sec. 2.4.4, is the baseline for timing comparison. Overflow case is run on 24 Xeon cores, one CPU socket of a Mustang node, and therefore the definition of a speedup is T24 CPU Cores Speedup = (3.1) TGPU where T is the time taken for a fixed number of iterations. 3.2.1 Performance The convergence and pressure results for Garfield from the Onera results presented in Sec. 3.1 were run on the Mustang cluster using four GPUs. The Mustang cluster has one P100 GPU per node, therefore requiring the use of four nodes. Each MPI task from Garfield runs on one CPU core and a single GPU can be shared by multiple MPI tasks. The wing mesh has 5.73 million points and the background has 16.8 million points, indicating the resource distribution should roughly be a ratio of 1:3. For a scenario with 16 tasks over 4 Mustang nodes, this section analyzes the 87 load balancing between tasks and identify proper design choices for GPU solvers in an overset framework. The amount of memory available on GPUs is typically significantly less than is available on the CPU. A primary motivation for the use of domain decomposition is to avoid hitting the GPU memory limit. Garfield attempts to evenly divide domains across unique GPUs. With 16 tasks and a ratio of 1:3, the blade mesh is assigned 4 MPI tasks and each task resides on a core of a different node to use the GPU on that node. The background mesh uses the remaining 12 tasks with 3 per node, each sharing the GPU on that node. Timing for the flow solver is shown with blue lines in Fig. 3.4. Tasks 0, 4, 8, and 12, which contain the wing mesh, run slightly faster than the 12 other tasks running the background mesh. This is a result of the distribution of points not being exactly the same between tasks. The wing mesh is divided only in the span-wise direction whereas the background Cartesian mesh is divided in all three coordinate directions. So although the 5.73/4 = 1.43 million points for the wing mesh is larger than 16.8/12 = 1.4 million in background partitions, the background meshes have added ghost cells in all directions to overlap with neighboring partitions and the added cells contribute to the flow solver calculation time. Preliminary overset results with Garfield focuses only on optimizing the flow solver without consideration of data transfer for overset routines. Results for a preliminary, na??ve run are shown in Fig. 3.4(a). Time is shown as a bar in the x-direction for each of the 16 MPI tasks. After flow solver calculations, Garfield transfers the full domain of GPU data back to CPU memory buffers shared with Tioga. After the overset routines are finished, Garfield then transfers the full domain back to the GPU. Because Garfield has no knowledge of donor or interpolation data, the full domain is copied back and forth. The transfer time, shown as green stars in Fig. 3.4(a), takes on average 75.6% the time of the flow solver. The transfer time is also hard to mask with asynchronous memory copying 88 Flow Solve Transfer Overset 0 2 4 6 8 10 12 14 Wing Mesh: Tasks 0,4,8,12 MPI Barrier 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 Time / Iteration (s) (a) Without GPU memory in Tioga Flow Solve Transfer Overset 0 2 4 6 8 10 12 14 Wing Mesh: Tasks 0,4,8,12 MPI Barrier 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 Time / Iteration (s) (b) With GPU memory in Tioga Figure 3.4: Overset run with the use of GPU memory in Tioga reduces CPU/GPU data transfers. 89 MPI Task MPI Task because different asynchronous streams would have to be managed through Python in Garfield and Tioga. The time required for overset connectivity, shown in orange circles, is 3.7% of the average flow solver time. An MPI barrier is intentionally placed before overset communication in order to show a clear time-line and cost comparison. The time required for interpolation and exchange of data is significantly smaller than the time taken by the flow solver and transfer routines. With routines added in Tioga to handle interpolation on the GPU, Garfield only needs to transfer data to a buffer of GPU memory shared with Tioga. Data transfer speeds between the CPU and GPU is limited by the bandwidth of the PCIe bus, typically 32 GB/s. The bandwidth for GPU-GPU memory transfer is much higher: 240 GB/s for the K80 GPU, 720 GB/s for the P100 or 900 GB/s for the V100. Fig. 3.4(b) shows updated results with the use of GPU memory in Tioga. The time taken for the flow solver did not change from the case shown in Fig. 3.4(a) but the transfer time dropped from 75.6% of the flow solver time to 5.5%. The time taken for interpolation and data exchange also reduced from 3.7% to 1.7% because the interpolation is now performed on the GPU. Following the interpolation step data is still transferred back to the CPU over a PCIe bus but the amount of data transferred is significantly reduced. 3.2.2 Scalability and Timing A baseline for timing is established using Overflow on a single socket of the Mustang cluster (24 cores of an Intel Xeon Platinum 8168 at 2.7GHz). As shown in the previous section, different components of the overall simulation can account for significant fractions of the total simulation time. Garfield and Overflow are similar in implementation of flow solver routines, but the overset routines from Overflow are not the same as the routines from Tioga. For a fair comparison, therefore, only the flow solver time is used to compare Overflow and Garfield. 90 Because the flow solver time can vary between MPI tasks the most practical metric for wall-clock time comparison is the total time taken by the slowest processor, averaged over multiple runs. Results for the baseline Overflow case are shown in Table 3.1. Time CPU #MPI Tasks (s/100 iter) Xeon 8168 (24-core) 24 215.50 Table 3.1: Baseline CPU case for timing and speedup comparison. Garfield is run on three different compute clusters: Mustang, MARCC, and the West-1B region of the Google Cloud Platform. The Mustang cluster has 24 GPU nodes with one Pascal P100 GPU per node and 16GB per GPU. MARCC has 72 nodes each with two Kepler K80 GPU cards. Since the K80 is a dual GPU, MARCC effectively has 4 discrete Kepler GPUs per node with 12GB each. Some nodes on MARCC contain two P100 GPUs, however the research facility limits usage to one node. MARCC also has a single trial node containing 4 Volta V100 GPUs, each with 32GB of memory. To expand scalability analysis of V100 GPUs, a virtual machine on the Google Cloud Platform is used containing 8 V100 GPUs, each with 16GB of memory. In each run the time taken by the flow solver in Garfield is recorded for 100 iterations. Just as with Overflow, the total time taken by the slowest processor, averaged over multiple runs is used. A summary of all timing results is shown in Table 3.2. The data is organized by cluster and GPU generation. Cases 1-5 show a scalability study from 2-32 K80 GPUs. The scalability study cannot be run on one GPU because the 12GB memory of a single K80 GPU is insufficient to fit the entire 22.5 million point domain. The last column showing speedup indicates that two K80 GPUs performs 41% faster than Overflow. Using 32 GPUs across 8 nodes, Garfield runs 17.68? faster than Overflow. The speedup and strong scalability for Garfield on the Kepler architectures is shown as the green line with triangle symbols in Fig. 3.5. The slope 91 92 Time Speedup Case Cluster GPU Type Memory #GPU #MPI Tasks (s/100 iter) Over 24 Cores 1 2 8 153.33 1.41 2 4 32 81.88 2.63 3 MARCC K80 12GB 8 32 41.96 5.14 4 16 32 22.56 9.55 5 32 32 12.19 17.68 6 MARCC P100 16GB 2 16 43.03 5.01 7 2 16 44.28 4.87 8 4 16 23.03 9.36 Mustang P100 16GB 9 8 16 13.71 15.72 10 16 16 7.38 29.22 11 1 8 40.29 5.35 12 MARCC V100 32GB 2 8 20.81 10.35 13 4 8 13.04 16.53 14 2 8 21.10 10.21 15 Google V100 16GB 4 8 13.16 16.38 16 8 8 7.55 28.54 Table 3.2: Timing data for Garfield on different clusters and GPU architectures. Speedup is shown against Overflow on a 24-core CPU. of the ideal speedup is derived from the slope from the origin to the first data point with 2 K80 GPUs. Efficiency of the strong scalability is defined in percent as the actual speedup of the code divided by the ideal speedup. Efficiency for K80 study is shown as the green line in Fig. 3.6. Garfield scales well to 32 GPUs with 78% efficiency. Mustang is used for a scalability study of Pascal-generation GPUs. Case 6 with two P100 GPUs is run on MARCC for comparison with a two P100 case from Mustang, case 7. Although the CPUs are different between MARCC and Mustang, because Garfield runs entirely on the GPU and both GPUs are the same the variation between case 6 and 7 is less than 3%. Cases 7-10 scale from 2-16 P100 GPUs. 30 V100 25 20 P100 15 alide K80 10 5 0 12 4 8 16 32 Number of GPUs Figure 3.5: Speedup and strong scalability of Kepler (cases 1-5), Pascal (cases 7-10) and Volta (cases 11-13,16) GPUs. 93 Speedup over baseline ideal ideal Similar to the K80 GPU, the 16GB memory of a single P100 GPU is insufficient to fit the entire flow domain. On two P100 GPUs Garfield runs 4.87? faster than Overflow. The speedup and strong scalability for Garfield on the Pascal architecture is shown as the orange line with square symbols in Fig. 3.5. Garfield scales from 2-16 GPUs with 75% efficiency, shown in Fig. 3.6. The V100 GPUs on MARCC have 32GB of memory, which is large enough to fit the entire flow domain of 22.5 million points. Since only 4 GPUs are available, however, the scalability study for the Volta architecture is done with both MARCC and Google platforms. A single V100 GPU attains a 5.35? speedup compared to Overflow. The dual V100 cases on MARCC and Google differ in speed by only 1.4% and the 4-V100 case differ by less than 1%. Up to 8 GPUs Garfield scales to 28.54? as fast as Overflow. Cases 11-13 on MARCC are combined with case 16 on Google shown in Fig. 3.5 as a blue line with circular symbols. Previously mentioned Kepler and Pascal cases were run across multiple nodes however on both 100 ideal 80 P100 K80 60 V100 40 20 0 12 4 8 16 32 Number of GPUs Figure 3.6: Strong scaling efficiency of Kepler (cases 1-5), Pascal (cases 7-10) and Volta (cases 11-13,16) GPUs. 94 Parallel efficiency (%) MARCC and Google, the V100s for the scalability study reside on one node. The efficiency of the scalability is shown in Fig. 3.6. Interestingly Fig. 3.5 suggests a 1:2:8 ratio of performance for V100 to P100 to K80 GPUs. 1 V100, 2 P100s and 8 K80s perform almost equivalently and the pattern is approximately maintained for 2:4:16 and 4:8:32. Looking at a comparison of strong scaling efficiency, Fig. 3.6 shows that the V100 scales less efficiently than the K80 or P100 devices. This result is partially misleading for two reasons. First, since there is no single-GPU data for the K80 or P100 runs, the 2-GPU cases start at 100% efficiency. Second, for a parallel device like the GPU strong scalability should not stay close to 100% as it should for a serial code. As the domain gets smaller, the many streaming multi-processors on the GPU handle the domain more effectively in parallel. As a result with smaller partitions and more GPUs there would be no gain in parallel efficiency. The result that the V100 scales worse than the P100 or K80 is an indication of the higher CUDA core count of the V100. 3.2.3 Cost Use of the Google Compute Platform gives the flexibility to test with different types of GPUs on the same virtual node but it also gives a cost breakdown for resources. Prices for resources, which are also a rough measure of power consumption, are listed in the Google Cloud Platform documentation. As of March 2019, on- demand costs for K80, P100, V100 GPUs and virtual CPU cores are: CK80 =$0.45/hr CP100 = $1.46/hr CV100 =$2.48/hr Ccore = $0.06322/hr 95 Such that the total cost Ctotal for NGPU P100 GPUs across Ntask tasks over time t hours is Ctotal = t ? (NGPU ?CP100 +Ntask ?Ccore) (3.2) The costs for ?pre-defined? CPU and memory configurations are $0.031611 per virtual CPU core per hour, defined by Google to include hyper-threading cores. Since Overflow is run on a 24-core machine without hyper-threading to ensure best parallel performance, the effective cost for this application is Ccore = $0.06322 per physical core. Table 3.3 shows the same 16 Garfield cases from Table 3.2 along with estimates for the Overflow case for a direct cost comparison. The cost savings results for the Kepler series (cases 1-5), Pascal series (cases 7-10) and Volta series (cases 11-13,16) are shown in Fig. 3.7. The cost savings percentages are defined relative to the total estimated cost of Overflow on the CPU: (Ctotal,CPU ?Ctotal,GPU)/Ctotal,CPU (3.3) For the cases with K80 GPUs, the minimum cost is with 32 GPUs at 5.56?for the 100 iterations. All K80 cases are run with 32 tasks, meaning the cost of the CPU cores is more than the cost of 1 K80 GPU. For P100 and V100 GPUs the number of GPUs has a larger impact on the total cost. For the Pascal GPUs, the optimal cost point is 4 GPUs. For Volta the optimal number is 2. In all cases the use of GPUs saves on cost compared to using Overflow on only CPU cores. Maximum savings are up to 65% with V100 GPUs. Comparing generations of GPUs, using Pascal reduces cost compared to the Kepler generation. Volta saves significant cost over both the Pascal and Kepler generations. Cost estimates for the 32GB V100 are based on prices for the 16GB V100s available on the Google platform. As a dollar value, the cost of the larger memory model V100 is higher than the 16GB version, 96 though as an indication of power consumption there is ideally little difference. Time Cost Savings Case GPUs (s/100it) (?) (%) 1 2? K80 153.33 5.988 34.1 2 4? K80 81.88 8.696 4.3 3 8? K80 41.96 6.555 27.8 4 16? K80 22.56 5.779 36.4 5 32? K80 12.19 5.560 38.8 6 2? P100 43.03 4.699 48.3 7 2? P100 44.28 4.836 46.8 8 4? P100 23.03 4.383 51.7 9 8? P100 13.71 4.834 46.8 10 16? P100 7.38 4.994 45.0 11 1? V100 40.29 3.341? 63.2? 12 2? V100 20.81 3.160? 65.2? 13 4? V100 13.04 3.777? 58.4? 14 2? V100 21.10 3.203 64.7 15 4? V100 13.16 3.811 58.0 16 8? V100 7.55 4.268 53.0 Overflow 24-Cores 215.50 9.083 0.00 Table 3.3: Cost estimates for 100 iterations of Garfield and Overflow based on Google Cloud Platform prices. *Cost estimates for the the 32GB V100 cards are not available on Google and are therefore estimated at the 16GB rate. 97 60 V100 P100 40 K80 20 12 4 8 16 32 Number of GPUs Figure 3.7: Cost savings of Kepler (cases 1-5), Pascal (cases 7-10) and Volta (cases 11-13,16) GPUs. 3.3 Elastic UH-60 Rotor The UH-60 Blackhawk rotor and fuselage underwent wind tunnel and flight tests which have become a popular and challenging data set for CFD validation [69]. One of the main challenges is the aeroelastic coupling that occurs between the blade structure and aerodynamics. Without the correct deflections, the aerodynamics loads from CFD are incorrect; without the correct airloads, the structural model cannot predict the correct deflections. The research of Potsdam et. al. [70] as well as Datta et. al. [71] presented some of the earliest, successful results of coupling CFD to Comprehensive Structural Dynamics (CSD).The result from CFD/CSD coupling is a converged set of structural deflections and aerodynamics loads. One such CFD/CSD converged set of deflections is used for validation of elastic deflections in the presented framework. Given the elastic deflections and resulting aerodynamic forces from the NASA Overflow code, the same deflections are provided to Garfield and airloads 98 Cost Savings (%) are compared between solvers. Wind tunnel test 5240 is a high advance-ratio test described in Table 3.4. The high advance ratio means a significant portion of the blade on the retreating side of the rotor is in reverse flow. In combination with a high tip Mach number, the advancing side has a shock near the swept blade tip. Description Symbol Value Advance Ratio ? = Vtip/V? 0.37 Blade Loading CT/? 0.09 Blade Tip Mach Mtip 0.64 Table 3.4: Wind tunnel test 5240 parameters. Garfield uses a different overset grid system from Overflow. Both codes use similar O-O-type structured blade meshes but Overflow uses automatically generated off-body grids while Garfield uses manually generated off-body stretched and Cartesian grids. Three superimposed deflections of the UH-60 blade in Garfield are shown in Fig. 3.8 for ? = 0, 90, 180 degrees. The blade bends and twists elastically based on sectional deflections. Figure 3.8: Superimposed deflections at three different azimuths: ? = 0, 90, 180. Airload comparisons between results from Overflow and Garfield are shown in Fig. 3.9 for 3 quantities at two blade span-wise locations: 50% and 80%. The normal, chord-wise, and pitching moment coefficients are further multiplied by a Mach number squared, resulting in C M2N , CCM 2 , and CMM 2. The mean has been removed from all airloads in Fig. 3.9 to obfuscate results for publication while still allowing for comparison between codes. 99 At both sections, all airloads agree very well between Overflow and Garfield. The largest discrepancies appear in areas of challenging physics. On the advancing side, near ? = 90?, a shock appears at the 80% span, which causes especially high frequency content in the pitching moment. Garfield predicts a slightly higher variation in pitching moment from the shock compared to Overflow. At the 50% span, the largest variation appears between ? = 300? and ? = 0?. This azimuth region is where the reverse-flow separation structures convect aft of the aircraft and affect the passing blade. While both Garfield and Overflow use the same deflection files, the use of different grids with different resolutions will inevitably result in slight variations in the predicted airloads. Overall there is very good agreement between the two codes and the agreement would not have been obtained without correct treatment of grid motion in the developed overset framework. 100 (a) 50% Span: Normal Force (b) 80% Span: Normal Force (c) 50% Span: Chord Force (d) 80% Span: Chord Force (e) 50% Span: Pitching Moment (f) 80% Span: Pitching Moment Figure 3.9: 50% and 80% normal force, chord force, and moment coefficients (mean- removed). 101 3.4 Implicit Algorithms Validation The methodology chapter, Ch. 2, presents three classes of algorithms for solving the linear system resulting from the implicit discretization of the Navier?Stokes equations. DADI and the Gauss?Seidel algorithms are approximate preconditioners to the linear system. GMRES is a linear-solver approach, which is more involved computationally but also more stable. This section briefly reviews the convergence of the three approaches applied to simple problems to obtain convergence trends. 3.4.1 DADI and Red-Black Gauss?Seidel Convergence An inviscid NACA0012 wing with aspect-ratio 1.0 is used to analyze convergence between DADI and Red-Black Gauss?Seidel methods. Both the Point and Line variations of the Red-Black Gauss?Seidel method are compared to DADI. The low- aspect ratio wing is challenging to converge especially in regions of highly skewed cells near each end of the wing. The selected case is inviscid for simplicity. A solution of the pressure coefficient on the wing and on the center-plane normal to the wing is shown in Fig. 3.10. Streamlines near one end of the blade show the 3-dimensional nature of the flow field. Figure 3.10: Short, inviscid NACA0012 wing used as a representative mesh for convergence analysis. 102 The goal in converging a steady simulation is to ensure that the numerical residual drops to zero. Fig. 3.11 shows the convergence of the L2-norm of the residual plotted against iteration number. Each method, DADI, Line Red-Black Gauss?Seidel, and Point Red-Black Gauss?Seidel, is run at the maximum stable local timestep, indicated by the CFL number (a higher CFL indicates a higher timestep). For this case the Line Gauss?Seidel method is run with 9 sweeps, 3? for each of the three directions. The Point Gauss?Seidel method is run with 12 sweeps. DADI requires the most number of iterations to converge with a CFL of 8, and almost converges 4 orders in 5000 iterations. The point Gauss?Seidel (?Point GS?) method performs significantly better, converging 12 orders in 5000 iterations with a CFL of 20. The line Gauss?Seidel (?Line GS?) performs the best, converging 13 orders by 3400 iterations with a CFL of 30. The stalled convergence at 10?13 is as a result of numerical precision limits for floating point operations. The ability to increase the 10?1 DADI CFL=8 10?3 Point GS 10?5 CFL=20 10?7 Line GS 10?9 CFL=30 10?11 10?13 0 1000 2000 3000 4000 5000 Iteration Figure 3.11: Per-Iteration comparison of convergence with DADI and two Gauss? Seidel methods. 103 Residual L2-Norm timestep (or CFL) for Gauss?Seidel methods translates to improved stability when the same methods are applied to unsteady simulations. A per-iteration plot does not indicate the practical time costs of a simulation since the iteration time between two methods is not the same. The Gauss?Seidel methods have very different costs, since the Line method requires a block tri-diagonal solve and the Point method does not. A convergence plot over wall-clock time is shown in Fig. 3.12. The fastest converging method is the Point method. The line Gauss?Seidel method still converges slightly faster than DADI. 10?1 10?3 DADI CFL=8 10?5 Line GS Point GS 10?7 CFL=30 CFL=20 10?9 10?11 10?13 0 1 2 3 4 5 6 7 8 9 Minutes (Wall-Clock) Figure 3.12: Wall-clock time comparison of convergence with DADI and two Gauss? Seidel methods. Although the Point Gauss?Seidel method is the fastest, the ability to increase the CFL to 30 with the line Gauss?Seidel method is important for time-accurate cases. For a rotor case, for example, the physical timestep is fixed and the effective CFL for the outer portion of the blade is significantly higher than for the inner portion of the blade. Methods with a timestep limitation may be required to take 104 Residual L2-Norm significantly more sub-iterations with smaller pseudo timesteps to avoid diverging. 3.4.2 GMRES Convergence GMRES is a common linear solver, present in both Garfield and at the framework-level. GMRES can be used with different preconditioners, however in this section only DADI will be used as a baseline for analyzing the benefits of using GMRES. A laminar cylinder is chosen for convergence analysis because the simple case has a steady solution for low Reynolds numbers. The considered case has M? = 0.3 freestream flow over a circular cylinder with Reynolds number Re = 100. The laminar solution is steady, resulting in vortices on the back of the cylinder. A pressure solution with streamlines is shown in Fig. 3.13. The low Reynolds number makes this case challenging to converge with only the DADI preconditioner. As mentioned in Sec. 2.2.4, DADI uses viscous eigenvalues to approximate the influence of viscous fluxes in the implicit terms of the linear system. For a low Reynolds number case where viscosity has a large influence on the flow physics, the DADI algorithm stability is found to be limited to a timestep corresponding to a CFL of 5. A per-iteration comparison of convergence between DADI and GMRES is shown in Fig. 3.14. GMRES is used with 5 Krylov vectors and iterations, meaning within the GMRES procedure there is a loop performing 5 residual calculations and Figure 3.13: Pressure solution over a laminar cylinder with M? = 0.3 and Re = 100. 105 100 10?2 10?4 10?6 ? DADI10 8 CFL=5 10?10 GMRES CFL=50 10?12 0 5000 10000 15000 20000 Iteration Figure 3.14: Per-Iteration comparison of convergence with DADI and GMRES methods. 5 preconditions. Each of these steps is very computationally expensive and as such, it is expected that GMRES performs so much better than the preconditioner DADI alone per iteration. A wall-clock time comparison is a much more fair way to compare the routines. The convergence over time is shown in Fig. 3.15, with the two convergence lines much closer together compared to the per-iteration plot. Even with GMRES performing significantly more work than DADI per iteration, over time GMRES converges much faster than DADI. A large factor in the improved convergence rate with GMRES is added stability for the implicit scheme and the ability to increase the timestep size. Whereas DADI alone is limited to a CFL of 5 for this case, GMRES is able to run ten times the timestep size with a CFL of 50. Similar to the results from the DADI and Gauss?Seidel analysis, a major benefit of added stability is for time-accurate applications where all parts of the flow domain must be converged well to ensure a physical solution. GMRES here is used 106 Residual L2 Norm 100 10?2 10?4 10?6 10?8 DADI CFL=5 10?10 GMRES 10?12 CFL=50 0 2 4 6 Minutes (Wall-Clock) Figure 3.15: Wall-clock time comparison of convergence with DADI and GMRES methods. with DADI as preconditioner however it can also use the Red-Black Gauss?Seidel methods. GMRES is unconditionally stable without a preconditioner, however it will only converge well for most engineering applications with the help of a good preconditioner. 3.5 Overset-GMRES The developed Overset-GMRES (O-GMRES) algorithm is analyzed for two simple cases. The first is for the 2D Poisson equations and the second is for a single-solver, overset, inviscid simulation of a triangular wedge. The goal of these two simple cases is to establish expected behavior and convergence improvements with the O-GMRES method before moving to more complex simulations. 107 Residual L2 Norm 3.5.1 Poisson Equation The first case considers the 2D Poisson equation solved on two overset grids, each with 48 ? 48 points. The governing equation is shown in Eq. (3.4) and the exact solution is shown in Eq. (3.5). The mesh system and solution contours are shown in Fig. 3.16(a). Both the coarse and fine meshes are square and centered at (x, y) = (0.5, 0.5). The coarse and fine meshes have widths 1.0 and 0.5 respectively. The boundary of the outer mesh is fixed at the exact and the boundary of the interior mesh is interpolated from the outer mesh. The iBlank map is shown for half of each mesh in Fig. 3.16(b), where the ?blanked? nodes from each mesh are light gray and the field points are blue. The iBlank map indicates the outer points of the interior mesh are interpolated from the coarse mesh and a center square from the coarse mesh is interpolated from the fine mesh. ?2q ?2q + = S(x, y) 0 ? x ? 1, 0 ? y ? 1 (3.4) ?x2 ?y2 q = sin(2?x)cos(2?y), S(x, y) = ?8?2sin(2?x)cos(2?y) (3.5) The 2D Poisson case is small enough to solve with a single processor and a version of the Poisson solver has been implemented for comparison which builds a contiguous array for all grids. The preconditioner and matrix-vector product for the single-array code is very similar to the 1D example from Alg. 2 and Alg. 3, from Ch. 2. The interpolation weights are extracted from Tioga to ensure correct preconditioning and matrix-vector product near fringe points. O-GMRES, which is intended for distributed memory, is mathematically equivalent to the single-array implementation. Using m = 50 Krylov vectors both codes converge identically within 15 iterations (each iteration in this case is a restart of the GMRES procedure), as shown in Fig. 3.17(a). Differences in the residual below 1? 10?12 are from machine 108 (a) Solution and Grid (b) Grid iBlanks Figure 3.16: Solution, grid system, and iBlanks for the overset Poisson equation. 109 precision limitations. Fig. 3.17(b),(c) and (d) compare the convergence between the traditional GMRES, O-GMRES, and Gauss?Seidel methods. Linear solver iterations refers to the number of Krylov iterations in the case of GMRES and O-GMRES and simply the number of sweeps for the Gauss?Seidel method. Fig. 3.17 plots the residual history for all iterations and linear-solver iterations. The GMRES and Gauss?Seidel methods, representative of the current approach taken in many CFD frameworks, handle fringe points explicitly since exchange of data occurs outside of the linear solver iterations. The traditional GMRES converges with a ?saw-tooth? pattern. During the linear solver iterations GMRES converges well but since each mesh is solved independently, the linear system is updated every iteration causing an increase in the residual. O-GMRES considers the full overset system including the interpolated points and the convergence per iteration follows the convergence of the Krylov iterations. 110 111 (a) O-GMRES and Single-Array GMRES. (b) 20 Linear Solver Iterations (c) 50 Linear Solver Iterations (d) 100 Linear Solver Iterations Figure 3.17: Convergence results using different linear solver methods: GMRES, O-GMRES, and Gauss?Seidel 3.5.2 Single-solver, Euler Equations The first case considered for O-GMRES analysis with Garfield is inviscid flow around a 2D triangular wedge. The solution, shown in Fig. 3.18(a), shows a wedge in Mach 0.2 free-stream with vortices shedding behind the body. The unsteady problem has been studied by other research groups [72,73] for analysis of temporal and spatial accuracy. For this case, first-order implicit-Euler time marching is used with a 5th order WENO [23] spatial reconstruction scheme. The grids coarsen in the wake of the wedge and as a result the shed vortices dissipate quickly. The grid system is therefore not ideal for vortex preservation but is sufficient for convergence analysis. The triangular wedge has sides of length 1m with inviscid grid spacing near the wall of 0.005m. The resolutions of the background grids are 0.08m and 0.16m for the fine and coarse grids respectively, shown in Fig. 3.18. The physical timestep corresponds to a CFL of 3 in the finest Cartesian nested mesh with respect to freestream conditions. Figure 3.18: Solution and grid system for time-accurate wedge The solution was allowed to advance to a time t? when vortices shed into the background meshes. The simulation is re-started with three methods: ?Iterative?, ?GMRES?, and ?O-GMRES?. The ?Iterative? method refers to the preconditioner alone applied to approximate the inversion of the linear system. A mixed Point-Line 112 Red-Black Gauss?Seidel method with 3 sweeps was used for the preconditioner of the wedge mesh and DADI was used for the two background Cartesian meshes. The iterative method used 10 times the number of sub-iterations for a fair-cost comparison with 10 Krylov iterations used in both GMRES methods. (a) Time-accurate Convergence (b) Iteration 2 Non-linear and Linear Convergence Figure 3.19: Sub-iteration and linear solver convergence for 2D wedge case. 113 Convergence comparison for three linear solvers is shown in Fig. 3.19(a) for three timesteps of the restarted simulation. The iterative methods converges the worst, achieving roughly 5 orders drop in 10 ? 10 sub-iterations. The de-coupled GMRES method does slightly better, converging roughly 6 orders in 10 sub-iterations. O-GMRES converges over 10 orders, performing much better than any other method. A zoomed in view of iteration 2 is shown in Fig. 3.19(b) for the two GMRES methods. Within each sub-iteration the linear solver is converging well for both GMRES and O-GMRES. With GMRES, however, the convergence is not consistent with the sub-iteration convergence. Though the linear solver is converging over two orders, the resulting non-linear convergence is less than one order per sub-iteration. With O-GMRES, on the other hand, the overall non-linear convergence closely follows the linear solver convergence. 3.6 Conclusions The verification and validation of the present overset framework, including the GPU accelerated flow solver, is organized into five sections. The first section used the Onera-M6 wing case to demonstrate simple, overset mesh capabilities with Tioga. Results are validated against experimental data and simulation results from NASA?s Overflow code run with the same mesh. The pressure solution from Garfield matches very well with both experimental values and those from Overflow. The Onera-M6 case is subsequently also used for performance, scalability, timing, and cost analysis of the GPU accelerated solver. Performance is analyzed, highlighting the need to minimize CPU-GPU data transfers. A strong scalability study on different GPU generations verifies Garfield scales well on three different compute clusters. The timing study concludes Garfield on the newest GPUs runs 5.35? faster than Overflow running on 24 CPU cores. Notably, a per-node comparison adds a multiplier on this speedup, since it is common to have 4-8 GPUs 114 available per node. Finally, a cost comparison is made using estimates from the Google Cloud Platform. Not only does using GPUs result in faster simulation times, it results in lower costs. Using cost as a metric of power consumption, using Garfield also saves energy compared to Overflow. The motion module of the framework, as well as grid-motion terms within Garfield, are validated against Overflow for a UH-60 Blackhawk rotor in high- speed forward flight. Using the same set of elastic deflections, Garfield results match very well with results from Overflow. Because the rotor blades undergo large bending and torsional deformation, the airloads from CFD would not be correct if the deflections were not properly handled. The newly implemented implicit numerical schemes within Garfield are validated by comparing convergence to the baseline DADI method. Both line and point variations of the Red-Black Gauss?Seidel method perform better than DADI for a representative inviscid 3D wing. Convergence results with GMRES are also compared to DADI for a laminar cylinder case. GMRES converges significantly faster, both per iteration and per wall-clock time. A major benefit in using GMRES and Gauss?Seidel methods is the ability to converge challenging time-accurate cases with a larger timestep than would be possible with DADI. The framework-level Overset-GMRES (O-GMRES) algorithm is tested on a 2D Poisson equation to verify the partitioned algorithm is equivalent to a single-matrix system. O-GMRES is further validated against the de-coupled GMRES algorithm and simple iterative approaches by showing drastically accelerated convergence. Convergence improvements are seen for the overset Poisson solver, as well as for overset, inviscid simulation of flow over a wedge. 115 Chapter 4: Notional Helicopter Configuration The goal of this chapter is to demonstrate the ability of the developed framework to simulate large, complex simulations for the purpose of interactional aerodynamic analysis. Three sections are presented, which build upon a notional helicopter configuration based on the ROtor Body INteraction (ROBIN) experiments by Mineck and Gorton [3]. The first section describes the experiment, presents the CFD grid system, and compares recorded fuselage pressure for a case involving a main rotor and fuselage. Predictions from the CFD framework are compared to experimental results as well as CFD results from another research group. The second section extends the baseline configuration to analyze the interac- tional aerodynamics with the inclusion of a tail rotor and hub. The full configuration is run and main rotor thrust is compared to the CFD result without the tail or hub. As the most complex simulation run for this dissertation, the full configuration is also used to draw conclusions about convergence challenges and computational load-balancing solutions for the heterogeneous framework. The complete vehicle configuration is also simulated with a cross-wind to demonstrate the framework capabilities. The third section analyzes rotor interactions using different combinations of isolated main and tail rotors. Harmonics of the tail rotor thrust are compared between cases for insight into vibration resulting from the interactional aerodynamics. The main and tail rotor are run for two flight conditions: forward-flight without a cross-wind and forward-flight with a cross-wind. 116 The selected configurations for simulation are notional in that then involve general design choices commonly applied to real configurations. The notional config- urations, however, preserve key computational challenges. Some of the simulation challenges considered include: 1. Ensuring adequate mesh refinement to accurately capture and convect vorticity without excessive dissipation. 2. Converging the residual of the governing equations sufficiently for a time- accurate solution. 3. Load-balancing the many meshes across computational resources (CPUs and GPUs). When analyzing the flow-field of the large configuration, iso-surfaces of vorticity or Q-criterion help visualize the locations of interesting flow features. A mathematical and qualitative description of Q-criterion and vorticity is included in Appx. A. 4.1 Rotor + Fuselage Simulation 4.1.1 Case Description Experiments with the ROBIN fuselage and rotor were performed at the NASA Langley 14? 22 foot subsonic wind tunnel. The ROBIN fuselage geometry is defined in the work of Mineck and Gorton [3] and shown in Fig. 4.1. The fuselage includes a pylon to represent the fairing around the engines and transmission. The length of the body is l = 78.7 inches and the rotor is placed over the fuselage at a longitudinal location of x/l = 0.696 from the nose and a lateral location of y/l = 0.051 from the centerline. The lateral offset indicates the rotor is not centered over the fuselage. The fuselage also has a sideslip-angle of approximately 1.2?. Figs. 4.1(a) and (b) show two views of the experimental setup in the wind tunnel. The four-bladed rotor 117 and hub is supported from above whereas the fuselage is supported from below. There is a gap between the rotor and fuselage. Description Value Radius R 33.88 in Chord c 2.61 in Number of Blades Nb 4 Linear Twist ?tw ?8? Collective ?0 6.7 ? Longitudinal Cyclic ? ?1c 2.2 Lateral Cyclic ?1s ?2.0? Tip Mach Mtip 0.53 Reynolds Number Re 8.2? 105 Blade loading CT/? 0.064 Advance Ratio ? 0.15 Shaft-tilt ? 3?s Free-stream Mach Number M? 0.08 Table 4.1: Main rotor properties and flight condition. The rotor blades have a rectangular planform with properties summarized in Table 4.1. The airfoil cross-section for the entire blade is a NACA0012. The Reynolds number is based on the tip speed and chord length. Previous numerical work [74?76] found significant discrepancies in collective pitch between numerical and experimental values, whereas cyclic pitch angles were (a) Isometric View (b) Side View Figure 4.1: Experimental Setup [3]. 118 similar. Instead of matching the collective pitch from experiment, the collective pitch is adjusted to match the blade loading between simulation and experiment. The selected collective pitch of ?0 = 6.7 ? results in a time-averaged blade loading of CT/? = 0.063. Cyclic pitch angles are selected based on trim results from Park et. al. [76]. Three different CFD solvers are used for the ROBIN simulation: HamStr, TURNS, and Garfield. Fig. 4.2 shows the surface meshes for the main rotor, run with TURNS, and the fuselage, run with HamStr. The blade mesh is a body conforming O-O type mesh consisting of 207?180?80 points in the chord-wise, span, and normal directions, respectively. The non-dimensional wall-spacing (y+) is 0.2, and the total number of points for four blade meshes is 11.92 million (2,980,800?4). The fuselage mesh is generated from an all-quadrilateral surface mesh consisting of 122,772 elements. The fuselage grid contains 50 strands layers to create the volume mesh with an initial wall spacing of y+ = 0.4. The wall normal direction stretching ratio is 1.25. The total number of hexahedra in the volume is 6,015,828. All background grids for the rotor-fuselage case are run with Garfield. The Structured blade mesh (TURNS) Cartesian wake and background meshes Unstructured (Garfield) fuselage mesh (HamStr) Figure 4.2: Fuselage and blade surfaces for CFD domain. 119 Near-wake Mesh Far-wake Mesh (a) Near and Far-wake meshes. Near-wake Mesh Far-wake Mesh Background Mesh (b) Total mesh system including background mesh. Figure 4.3: Mesh system of the ROBIN simulation in forward flight. 120 full overset mesh system is shown in Fig. 4.3(a) for the region near the fuselage and (b) for the full background mesh. The near-wake, far-wake, and background regions are all meshed using stretched Cartesian grids. The near-wake mesh has cell spacing of 0.09c to accurately capture the blade tip vortex and interactions with the fuselage. The far-wake mesh has spacing of 0.2c near the fuselage and 0.8c far away. Finally the background mesh stretches from 0.8c near the fuselage to 3.5c in the far-field. The near-wake, far-wake, and background meshes have 17.97 million, 5.92 million, and 5.46 million nodes respectively. 4.1.2 Aerodynamic Results Fig. 4.4 shows iso-surfaces of vorticity with contours of z-velocity to visualize the flow field. Large vortices trail behind the advancing and retreating blades as a result of blade-vortex and vortex-vortex interactions. The vorticity dissipates behind the aircraft in the region where the CFD grid changes from fine to coarse. The vortex dissipation, while not physical, is acceptable for simulation because the influence of a vortex on any surface of the grid diminishes hyperbolically with distance. The vortices far away from the aircraft therefore have negligible influence on the rotor inflow or fuselage surface and the influence gets only smaller as the vorticity convects backwards. Three points are selected along the longitudinal centerline of the aircraft for comparison with experiment and simulation results from O?Brien [74]. The pressure coefficient CP ? 100 is plotted against the azimuthal position of one blade. Results from the present work are shown in red. Experimental results from Mineck and simulation results from O?Brien are shown in black and blue respectively. O?Brien used the NASA FUN3D solver for all simulations. Point D9 is near the nose of the aircraft as shown in Fig. 4.5(b). The aver- age pressure coefficient is approximately the same between all three results. The 121 Figure 4.4: Iso-surface of vorticity magnitude (0.1). experimental results do not show consistent amplitudes between the four pressure peaks from the four blades. CFD results do predict consistent peaks for the 4/rev signal. The peak of the first pressure fluctuation matches well with the present results however the other three peaks appear over-predicted by CFD. Overall the present results match very closely with the results from O?Brien. Point D22 is on the pylon, just aft of the hub as shown in Fig. 4.5(d). The presented pressure results again match very well with simulation results from O?Brien. Both CFD results, however, predict a lower amplitude of pressure fluctuation. There appears to also be a slight phase difference between the pressure fluctuations from CFD and experiment. Because point D22 is on the pylon below the root section of the rotor blades, the pressure in this location may be mis-represented in CFD from the lack of a rotor hub and support structure. Point D14 is on the start of the tail boom of the aircraft as shown in Fig. 4.5(f). The amplitude of the pressure fluctuations matches well between experiment and CFD. Experimental results appear to have some asymmetry, since not all four of the 122 (a) Point D9 Pressure (b) Point D9 Location (c) Point D22 Pressure (d) Point D22 Location (e) Point D14 Pressure (f) Point D14 Location Figure 4.5: Three Center Locations. 123 (a) Point D1 Pressure (b) Point D1 Location (c) Point D19 Pressure (d) Point D19 Location (e) Point D23 Pressure (f) Point D23 Location Figure 4.6: Three Lateral Locations. 124 peaks are the same. The peaks from the present result match better with experiments than the peaks from O?Brien however the valleys from O?Brien match better than those from the present result. For all points on the lateral cross-section of the fuselage the pressure results from the presented overset framework are only compared to experimental results. Point D1 is on the under-side of the fuselage as shown on the lateral cross-section in Fig. 4.6(b). Neither experiment nor CFD results show meaningful pressure distribution at this point, even with the ?100 multiplier on the pressure coefficient. At an advance ratio of 0.15, the influence from the main rotor is convected backwards fast enough that the under-side of the fuselage experiences very little fluctuations and the pressure is essentially steady. Point D19 is on the corner of the pylon and main fuselage, as shown on the lateral cross-section in Fig. 4.6(d). The corner point is a very challenging area to mesh properly even with an unstructured mesh. The CFD results show both a phase and magnitude offset from the experiment. The CFD results, however, are smooth compared to the experimental results which show some high-frequency content. The CFD simulation for this data point may benefit from an improved mesh as well as a surface roughness model to capture laminar-turbulent transition and possibly separation. Point D23 is on the port side of the pylon as shown on the lateral cross-section in Fig. 4.6(f). The present CFD results capture the top peaks of the pressure coefficient however the valleys and overall amplitude of the signal differ from experiment. There is less phase offset that is seen from point D19. The experimental results show a much sharper upper peak to the pressure signal whereas the simulation results are more sinusoidal in shape. Overall the presented CFD results are able to capture the pressure signature on the fuselage especially in the longitudinal direction. The discrepancies especially for 125 points near the pylon could be a result of the missing hub and support structure in the CFD simulation. The presence of a hub and support structure may add additional blockage, preventing the blade root vortices from simply convecting downstream. 4.2 Rotor + Fuselage + Hub + Tail Simulation 4.2.1 Case Description To demonstrate the full capability of the presented overset framework, a tail rotor and hub are added to the previous ROBIN fuselage and rotor test case. The four main-rotor blades are identical to the previous case. The fuselage mesh is also identical but shifted to be centered under the rotor. A two-bladed tail rotor is simulated with properties described in Table 4.2. Subscripts are provided to distinguish between main and tail values. The Reynolds number is based on the chord and tip Mach number and the x-, y-, z-locations of the tail center of rotation is measured from the center of rotation of the main rotor. The coordinate system for the configuration is shown in Fig. 4.7. The tail rotor spins counter-clockwise about the y-axis-direction for a ?tractor? design. Description Value Radius Ratio Rtail/Rmain 0.2 Blade Aspect Ratio 8.0 Linear Twist ?tw ?6.67? Collective ? ?0 11.67 RPM Ratio ?tail/?main 4.5 Reynolds Number Re 2.52? 105 Number of Blades Nb 2 x-location x/Rmain 1.502 y-location y/Rmain 0.1 z-location z/Rmain -0.2119 Azimuthal Step ?? ?tail 0.9 Table 4.2: Tail rotor properties and flight condition. The flight condition and main rotor controls are the same as the previous 126 case from Sec. 4.1. A variation of the forward-flight case is also run to simulate a cross-wind from the port-side of the aircraft. The 33? cross-wind case was selected as a challenging flight condition from a pilot?s perspective. Figure 4.7: Mesh System for full configuration. A cylindrical, non-rotating hub is added to the mesh system for a first pass at hub-wake analysis. The lower and upper parts of the hub have radii 0.0427R and 0.0825R respectively. The height of the lower and upper sections are 0.146R and 0.06R respectively. The hub and fuselage bodies do not overlap to avoid complex connectivity. The hub mesh is unstructured and is run with the HamStr flow solver. Two additional nested meshes are added to the mesh system used from the rotor-fuselage-only case from Sec. 4.1. The first is a fine, nested mesh with spacing 0.05c centered about the tail rotor used to improve wake capturing of the tail rotor. Behind the fuselage, a nested mesh with spacing 0.09c extends the existing nested mesh, also for improved wake capture. Because both the existing nested mesh and added wake mesh have the same spacing of 0.09c and are coincident at overlapping points, there is no interpolation error between meshes. Both the wake and tail-rotor nested mesh additions are run with the Garfield flow solver. 127 4.2.2 Full Configuration in Forward Flight Fig. 4.8 shows a top-view of the full configuration solution with iso-surfaces of vorticity colored by z-velocity. Shed structures from the wake of the hub are shown to interact with the blade near the ? = 0? azimuth. Figure 4.9 shows a comparison of the airloads for a single blade between the full configuration (solid) and the previous case results without the tail or hub (dotted). While only a small spike in torque is shown, a large and significant difference in thrust is shown. Furthermore the change in thrust is not always the same, as shown by the over-prediction at ? = 0 and under-prediction one revolution later at ? = 360. The slight phase-shift observed near azimuth ? = 180 is a result of the y-direction shift of the rotor. In the rotor-fuselage case the rotor was not centered over the fuselage to match the experiment; in the full configuration, the rotor and hub are centered over the fuselage. A solution of the full configuration after 4 revolutions is shown in Fig. 4.10 with iso-surfaces of Q-criterion (Q = 0.0008) colored by vorticity. Q-criterion, a relation involving vorticity and strain, is a useful metric for highlighting structures in a flowfield. Flow structures generated from the hub are convected slightly starboard of the helicopter in the direction of the rotating blades. The main rotor wake is well captured by the nested mesh surrounding the rotor and fuselage. Vortices from the main rotor are well defined until they break up passing through the tail rotor. Fig. 4.11 shows the wake profile behind the helicopter. Contours of x-Mach number show the accelerated flow behind the rotor passing directly through the tail rotor. Although the configuration is not trimmed, the average value of stream-wise Mach number behind the fuselage is greater than the free-stream value of 0.08, indicating the rotor is generating forward thrust. The slice x = 1.502R, through the tail rotor hub, shows the main rotor wake passing almost directly through the center of the tail rotor plane. 128 Figure 4.8: Iso-surfaces of vorticity: blade 1 near ? = 0 passes through wake of hub. Figure 4.9: Comparison of Blade 1 Loads with and without hub and tail rotor. Tip vortices from each blade roll up into large, trailed vortices on the advancing and retreating sides behind the aircraft. The tail rotor also shows strong trailed 129 130 Figure 4.10: Full Solution showing Q-Criterion (Q = 0.0008) colored by Vorticity. 131 Figure 4.11: Slices behind rotor showing x-Mach number. vortices, although they are harder to identify because they are in the wake of the main rotor and hub. Just like with the rotor-fuselage-only case from the previous section, Newton- subiterations are used to converge the non-linear residual derivative in time. Between each exchange of overset data between meshes TURNS and HamStr run one subit- eration and Garfield does three. Garfield requires the extra subiterations to improve the convergence of the tail rotor, which rotates 0.9? (compared to 0.2? for the main rotor) per timestep. Convergence is most challenging in the blade meshes where skewed cells and large gradients require many subiterations to converge whereas the background meshes need fewer. Even with a total of 27 subiterations (9? 3), Fig. 4.12 shows less than 1 order of convergence is achieved for the tail rotor. The norm of the volume-scaled residual (from Eq. (2.40)) is used to represent convergence in all cells; plotting the norm of the un-scaled residual bias convergence to that of large cells. After the first 10 subiterations, the convergence of the tail grids appear to stall. In contrast, all nested and background Cartesian meshes achieve more than 2.5 orders of convergence. Garfield uses a Diagonalized ADI scheme to approximate the inversion of the implicit terms from Eq. (2.53); a stronger inversion scheme such as a line Gauss?Seidel method or the addition of a linear solver like GMRES would also help alleviate the convergence issues in the tail rotor meshes. Reducing Figure 4.12: Tail-rotor sub-iterative convergence (volume-scaled L2 norm of Density residual). 132 the timestep of the simulation to 0.25? for the tail rotor would help, however this translates to a 0.056? timestep of the main rotor, 6480 steps per revolution, and a drastic increase in time and resources required for the simulation. 4.2.3 Full Configuration in Cross-Wind A cross-wind from the port-side of the aircraft can result in the large, trailed vorticity from the main rotor blades to enter the tail rotor. Fig. 4.13 shows iso-surfaces of Q-criterion to visualize the location of vortices. The vortex-vortex interactions that trailed behind the aircraft in forward flight now enter the tail rotor plane. The rotation direction of the vortex is opposite to the rotation direction of the tail rotor; this is typically by design on a helicopter to avoid canceling the rotational effect of the tail rotor. Between Fig. 4.10 and Fig. 4.13 both visualizations use the same iso-surface of Q-criterion yet the cross-wind case shows significantly more chaotic fluid structures. The fuselage is a streamlined body in forward flight; with the cross-wind acting as a large sideslip angle, however, the fuselage operates more as a blunt body and sheds large structures. The tail rotor operates entirely in a ?descent? condition; the additional free-stream velocity component introduced by operating the aircraft in a sideslip condition manifests as upwash through the tail rotor disc. The tail rotor blades also shed large, complex fluid structures indicating that the blades are operating at very large angles of attack and have effectively stalled. The main rotor uses the same blade deflections in both the forward flight case and the cross-wind case. While the forward flight controls are based on trim results from literature, in a cross-wind the main rotor is not trimmed, meaning the flight condition is not necessarily representative of steady, level, sideslip flight; the same goes for tail rotor. Through proper coupling to a rotorcraft comprehensive analysis procedure the flight condition could be studied in detail and simulated using more 133 134 Tail Rotor in ?descent? Vortex Enters Tail Rotor Main Rotor Vortex Figure 4.13: Top-view of the cross-wind Q-criterion solution (Q = 0.0008) colored by vorticity. realistic pilot controls; this is one of the key recommendations for future work. 4.2.4 Performance Results The wall-clock timeline for the simulation is a function of the time taken by connectivity, flow solvers, and data exchange routines. Fig. 4.14(a) shows the timeline for one timestep with each solver taking the same number of sub-iterations. Table 4.3 shows additional metrics for performance analysis. Although Garfield handles by far the most number of points, using 6 GPUs the time per iteration is less for Garfield than for TURNS or HamStr. The overset framework has to wait for all flow solvers to complete before exchanging data between meshes and therefore the GPUs are sitting idle during this time. Table 4.3 also shows performance per iteration, per million points. For the fixed set of resources, this metric is one measure of speedup for one code over another for the same number of grid points. For example in this case, Garfield on 6 GPUs is running at a 2.04/0.156 ? 13? speedup over TURNS on 40 CPU cores. The relative speedup multipliers between codes can be used to assign points and resources (CPUs or GPUs) between solvers. A mis-match between times taken for each solver can also be adjusted by adding a multiplier on the sub-iterations. Garfield takes 6.58 seconds using the same number of sub-iterations as HamStr or TURNS; with 3? the number of sub- iterations Garfield should still finish before the other flow solvers. Fig. 4.14(b) Garfield HamStr TURNS Number of grid points 42.3 million 6.9 million 11.9 million Resources 6 GPUs 54 CPU Cores 40 CPU Cores Time per iteration (s) 6.58 28.15 24.27 Time per iteration per 0.156 4.07 2.04 million points (s) Table 4.3: Solver Timing and Performance. 135 Garfield TIOGA Connect: 23.17s HamStr TURNS 6.58s 28.15s 24.27s Exchange: 0.39s (a) Equal Sub-iterations for all flow solvers. Garfield TIOGA Connect: 23.18s HamStr TURNS 6.26s 6.26s 28.15s 24.38s 6.26s Exchange: 0.39s (b) 3? Sub-iterations for Garfield. Figure 4.14: Iteration timeline varying sub-iterations in different solvers. shows the timing result with Garfield taking 3? the number of sub-iterations. The effective time per set of sub-iterations decreased from 6.58 to 6.28 seconds because the 3? multiplier is applied within Garfield without the overhead from a Python framework call. The extra sub-iterations with Garfield can be used to help with the convergence in challenging areas. 4.3 Main and Tail Rotor Interactions The full notional configuration in both forward and cross-wind flight from Sec. 4.2 results in a complex flow-field with interactions between the main rotor, 136 Time Time tail rotor, hub, and fuselage. Interactions between individual components can be better understood by isolating the components and considering sub-sets of the full configuration. This sections considers interactions only between the main and tail rotor. The fuselage and hub are negated and all grids are run using Garfield for faster simulation times. Description Value Radius Ratio Rtail/Rmain 0.2 Blade Aspect Ratio 8.0 Linear Twist ? ?tw ?6.67 Collective ?0 11.67 ? RPM Ratio ?tail/?main 4.0 Reynolds Number Re 2.52? 105 Number of Blades Nb 2 x-location x/Rmain 1.502 y-location y/Rmain 0.1 z-location z/Rmain -0.2119 Azimuthal Step ?? 1.0?tail Table 4.4: Properties of the tail rotor for main and tail rotor interactions analysis. The geometries of the main and tail rotors are the same as from Sec. 4.2 however the tail rotor rotation has been slightly slowed to ?tail = 4? ?main instead of 4.5?. The corresponding azimuthal steps of the main and tail rotor simplify to ?? ? ?main = 0.25 and ??tail = 1.0 . The updated tail rotor properties are summarized in Table 4.4. Three cases are considered for comparison in this section: ? Isolated tail rotor in forward flight. ? Main and tail rotor in forward flight. ? Main and tail rotor in cross-wind flight. The thrust coefficient of the tail rotor is used as the primary metric for com- parison between cases. The magnitude of tail rotor thrust coefficients are compared between cases for analysis of anti-torque efficiency. Furthermore, the frequency 137 domain of the thrust is analyzed for insight into possible vibration challenges that arise from the interactional aerodynamics. 4.3.1 Isolated Tail Rotor: Forward Flight A tail rotor run without the main rotor in effect becomes a fixed-pitch, low- aspect-ratio rotor problem. The thrust of the tail rotor is plotted against the revolution of the main rotor in Fig. 4.15. The two-bladed tail results in a 2/rev signal with respect to the revolution of the tail rotor. When plotted against a revolution of the main rotor, the 4? multiplier results in a clear 8/rev signal. The average thrust coefficient of the tail rotor is CT = 0.00996. The 8/rev dominant, expected frequency is further confirmed by analyzing the frequency spectrum of the thrust coefficient. Fig. 4.16 shows the result of a Fourier transform applied to tail rotor thrust in units of 1/rev of the main rotor. The 8/rev is by far the most dominant frequency. Higher harmonics appear primarily at 16/rev and 24/rev. The tail rotor is a fixed-pitch rotor with linear twist and a collective of ?75 = 11.67 ?. The high collective results in a high angle of attack on the interior portion of the blade. The tail rotor also operates at a high advance ratio. The tip Mach number of the tail rotor is Mtip = 0.43, meaning with a free-stream Macn number of M? = 0.08, the advance ratio is 0.186. Separation on the blade as a result of low local Reynolds numbers and high angles of attack are a likely cause of the higher harmonics. In reality, blade motions of the tail rotor also play an important role in the force response however the simulation approximates the tail rotor as rigid. 138 Tail Only 0.012 0.010 0.008 0.0 0.2 0.4 0.6 0.8 1.0 Revolution (Main Rotor) Figure 4.15: Tail rotor thrust coefficient. Tail Only 0.0020 0.0015 0.0010 0.0005 0.0000 0 4 8 12 16 20 24 Freq (1/rev of Main Rotor) Figure 4.16: Tail rotor mean-removed thrust coefficient frequencies. 4.3.2 Main and Tail Rotor: Forward Flight In the presence of the main rotor, the thrust of the tail rotor can be compared to the isolated tail result. Fig. 4.17 shows the Q-criterion solution colored by vorticity to visualize the flow-field. Vortices from the main rotor can be seen entering the tail rotor at a frequency of 4/rev. Without the presence of a fuselage or hub, the root vortices of the blade combine in to complex structures however they convect starboard of the aircraft and do not appear to interact with the tail rotor. Fig. 4.18 shows both the thrust coefficients of the isolated tail and main+tail simulations. Again there appears to be a dominant 8/rev. There is a slight phase 139 CT CT Main rotor vortices trailed at 4/rev Main rotor wake enters tail rotor. Root vortex-vortex interactions Figure 4.17: Main and tail rotor Q-criterion (Q = 0.0008) solution in forward flight shift in the location of the thrust maxima and minima and the main+tail result displays a dual-peak. The mean thrust of CT = 0.01063 does not change significantly from the value for the isolated tail (CT = 0.00996). The consistent mean-thrust indicates that analyzing the tail rotor in isolation is sufficient for the purposes of determining the anti-torque of the main rotor. Tail Only Main + Tail: Forward 0.014 0.012 0.010 0.008 0.0 0.2 0.4 0.6 0.8 1.0 Revolution (Main Rotor) Figure 4.18: Tail rotor thrust coefficient 140 CT Tail Only Main + Tail: Forward 0.0020 0.0015 0.0010 0.0005 0.0000 0 4 8 12 16 20 24 Freq (1/rev of Main Rotor) Figure 4.19: Tail rotor thrust coefficient frequencies The frequency spectrum of the tail rotor thrust is shown in Fig. 4.19 for both the isolated tail and main+tail cases. The main+tail case shows a slightly lower 8/rev frequency however this is still the most prevalent frequency. Interestingly there is a very strong 16/rev component that appears only in the main+tail result. The 16/rev is therefore a result of aerodynamic interactions between the main and tail rotor. The 16/rev can be justified by considering the thrust of a single blade: ? ? R T 2blade = veff ? Cldr2 0 (4.1) ? v2eff where veff is the effective velocity experienced by the blade. Cl is the lift coefficient at each blade section, presumed to be well behaved for thin airfoils. For a simple isolated rotor in free-stream forward flight, the veff has a 1/revtail = 4/revmain component. Hypothesizing that the main rotor produces an inflow with a 4/revmain component, the effective velocity seen by a tail rotor blade results in 8/rev and 16/rev components: 141 CT veff ? sin(4?) ? sin(?tail) = sin2(4?) (4.2) v2eff ? sin4(4?) ? ?1cos(8?) + ?2cos(16?) The 8/rev for each blade results in an overall 16/rev for the tail rotor. The decom- position from the sin4() term is performed using trigonometric identities. The simulations results also predict a 4/rev signal for the tail rotor as well as higher harmonics at 12/rev, 20/rev, and 24/rev. Depending on the type of engine and frequency of operation, these higher harmonics may excite engine support and airframe modes. The higher harmonics may also result in intrusive vibrations for the pilot or passengers. 4.3.3 Main and Tail Rotor: Cross-wind Flight The final flight condition for comparison is with the free-stream at a 33? angle from the port side of the aircraft, creating an effective cross-wind. The cross-wind case is used to represent a challenging flight condition from the pilot?s perspective because the rolled up tip vortices from the retreating main-rotor blades are convected directly into the tail rotor. The vorticity entering the tail rotor can cause unpredictable loads and vibrations resulting in challenging pilot scenarios. Fig. 4.20 shows the Q-criterion solution colored by vorticity to visualize the flow-field. In hover or straight forward flight, the typical operation of the tail rotor results in inflow from the starboard side of the aircraft to the port side. As was seen from the full configuration solution in cross-wind from Sec. 4.2.3, the inflow through the tail rotor passes in the opposite direction, from port to starboard. The tail rotor is therefore in a ?descent? condition. The tail rotor controls, however, remain at the same collective angle ?75 = 11.67 ? meaning the entire blade is operating at very high angles of attack. The complex vorticity downwind of the tail rotor indicates the flow 142 Main rotor vortex- vortex interactions. Combined tip vortices Root vortex-vortex enter tail rotor. interactions Figure 4.20: Main and tail rotor Q-criterion (Q = 0.0008) solution in cross-wind flight. is largely separated on the blades. The thrust coefficient of the tail rotor for the cross-wind case is compared to the two forward flight cases (isolated tail, main+tail) in Fig. 4.21. The average thrust stays approximately the same for both forward flight cases. With a cross-wind, however, the average thrust increases significantly. The average thrust for all three compared cases is shown in Table 4.5. The high thrust of the tail rotor in cross-wind is likely a result of the drag component of force and not lift. For the high angles of attack experienced by the tail rotor blade, the blades operate almost as flat plates perpendicular to the flow, with the drag direction aligned with the direction of the desired thrust. So even though the tail rotor blades may be stalled, the rotor still produces much larger counter-torque to the main rotor and may result in corrective yawing motion of the helicopter. As the helicopter yaws left, the tail rotor may enter its own wake resulting in additional complex aerodynamic interactions. 143 Case Average CT Isolated Tail, Forward Flight 0.00996 Main+Tail, Forward Flight 0.01063 Main+Tail, Cross-wind 0.01545 Table 4.5: Average tail rotor thrust coefficient for three cases. Tail Only Main + Tail: Forward Main + Tail: Cross-wind 0.020 0.015 0.010 0.0 0.2 0.4 0.6 0.8 1.0 Revolution (Main Rotor) Figure 4.21: Tail rotor thrust coefficient Tail Only Main + Tail: Forward Main + Tail: Cross-wind 0.002 0.001 0.000 0 4 8 12 16 20 24 Freq (1/rev of Main Rotor) Figure 4.22: Tail rotor thrust coefficient frequencies The thrust coefficient from Fig. 4.21 also shows more frequency content, as further confirmed by the frequency spectrum plot in Fig. 4.22. The 8/rev is still the dominant frequency, consistent with both forward flight cases. The 16/rev is also prevalent for the cross-wind case, though it is not as strong as the forward flight main+tail case. Notably there is frequency content spread over a much larger range 144 C CT T of high frequencies. With the exception of the 16/rev, all frequencies above 12/rev are higher with a cross-wind. The 24/rev vibratory load amplitude for the cross-wind case is stronger than that of the 16/rev. Depending on the type of engine used in the aircraft, the high frequencies could be in the range of engine resonance for turbine engines. 4.4 Conclusions In this chapter the developed, heterogeneous CFD framework has been applied to a notional helicopter configuration representative of a complex, large-scale rotor- craft application. The cases from this chapter could not be run on CPUs and GPUs together prior to the development of the presented CFD framework. Single-solver or CPU-only frameworks from state-of-the-art CFD software like Helios [38] could simulate the same problems, however at a higher cost of supercomputer cluster usage and without novel compute architectures. The first case simulated the ROBIN wind-tunnel experiments performed at NASA Langley. The experiments recorded pressure variation at points on the fuselage with a four-bladed main rotor in forward flight. The CFD framework was able to successfully integrate three separate CFD solvers: HamStr, TURNS, and Garfield for the simulation. Overall the pressure results match well between experiment and simulation. Discrepancies between the results especially on the lateral cross-section are likely a result two different sources of error. First, the CFD does not model the support structure or hub for the rotor, nor the support structure of the fuselage. The presence of the rotor support and hub likely cause additional blockage, which would affect the pressure fluctuation amplitude on the fuselage. Second, the CFD does not match the rotor control angles used in experiment. Other research groups had similar issues matching the rotor thrust for the control angles used in the experiment. Instead, the collective pitch of the rotor is adjusted to match the desired thrust 145 condition. The cyclic pitch values are matched from values in literature and also differ from the experiment. The mis-match could be an offset in the measured blade pitch in the experiment due to elastic deflection or merely experimental error. Another possibility for the discrepancy is correct modeling of any interference or blockage due to the wind tunnel geometry. The wind tunnel is not modeled in CFD. The second case builds upon the rotor and fuselage case with the addition of a hub and tail rotor. The hub is cylindrical and did not rotate for preliminary analysis. The tail rotor was designed with a symetric airfoil and linear twist; the tail rotor collective was picked based on approximations for main-rotor counter-torque. The multi-rotor system demonstrates the ability of the framework to handle multiple rotors with different axes of rotation. The full configuration, run with HamStr, TURNS, and Garfield, was analyzed for performance by comparing the timeline of events occurring in parallel. Garfield, operating on GPUs, runs faster than either of the two CPU codes despite having significantly more grid points. By assigning a multiplier on the number of sub-iterations for Garfield, the framework benefits from additional convergence without any added wall-clock time. The final section analyzed three variations of the main and tail rotor without the hub or fuselage with the goal of demonstrating rotor-rotor interactional aerodynamic analysis. The tail rotor with a main rotor in forward flight shows a distinct 16/rev component of thrust that is solely from the presence of the main rotor. The interaction can be justified in an analytical manner with a lower-order blade-element model however without the full CFD the interaction is not numerically precise. The addition of a cross-wind from the port side of the aircraft is intended as a challenging case from the pilot?s perspective. With a 33? crosswind the trailed vortices from the main rotor approach the tail rotor and the tail rotor enters a ?descent? flight condition. The thrust on the tail rotor increases significantly as a result of the blades operating at very high angles of attack. The tail rotor blades are fully stalled, resulting in 146 potential control reversal; decreasing the tail rotor pitch may result in increased thrust. The cross-wind case produced additional high-frequency content to the tail rotor thrust. The high-frequency content presents potential issues for vibrations and engine resonance. Interactions between the main and tail rotors are analyzed by looking at the forces on the tail rotor but the selected flight controls are based on approximations. A more formal analysis of a real configuration would require coupling to a comprehensive rotorcraft analysis tool to trim the aircraft. The numerical convergence is a primary challenge in rotorcraft simulations because the physical timestep may not be optimal in all parts of the domain. For reasonable simulation times the main rotor may run with a 0.25? timestep, however this results in a much larger azimuthal timestep for the tail rotor. The large timestep can cause the tail blades grids to not converge very well, thus bringing into question the validity of the skin friction and forces reported from that region. The challenges faced in overset convergence, however, are characteristics of most overset frameworks which use multiple solvers for complex problems. Wake complexity in the region behind the hub and tail rotor indicates finer mesh resolution could be used to better resolve interactions. Reducing the cell size to 0.05c would improve the interpolation at the boundaries of the blade mesh and also reduce dissipation. Use of the 5th order WENO scheme with Jiang and Shu weights could also be changed to a simple central discretization since there are no discontinuities (like shocks) in the wake region. Although the presented variations of the ROBIN fuselage and rotor are notional configurations, the simulations challenges and conclusions apply generally to future analysis performed for a real case. The ability to simulate complex interactional problems and extract meaningful engineering conclusions lays the ground-work for CFD as a more efficient and approachable tool for rotorcraft design and analysis. 147 Chapter 5: Multi-Solver Overset-GMRES This chapter presents two applications of the O-GMRES method with the goal of accelerating convergence for unsteady, multi-solver simulations. Analysis of results is performed from the perspective of the numerical convergence, which is governed by the implicit algorithm used to invert the left-hand-side of the governing discretized Navier?Stokes equations Eq. (2.40). Three different formulations for the implicit operator are compared: 1. Preconditioner alone applied to individual meshes. This approach is referred to as ?iterative? or by the name of the preconditioner (ie. DADI, Gauss?Seidel). 2. GMRES applied to individual meshes with ?frozen? interpolation boundaries. The method is referred to as ?GMRES.? 3. GMRES applied to the full overset mesh system with implicit treatment of interpolation boundaries. This method is referred to as ?O-GMRES.? Two different solvers are used for convergence analysis: Garfield and mStrand. Garfield is used for the background region and mStrand is used for the near-body region. All cases are unsteady, meaning sub-iteration convergence is the primary metric for convergence analysis. In each physical timestep, sub-iterations are used to reduce the linearization error. Within each sub-iteration, the chosen implicit method may use multiple linear iterations in solving the linear system. Pseudo-code for the different time-marching iterations is shown in Alg. 5 148 Algorithm 5: Time Marching Loops 1: for t = 1, 2, ..., T do . Physical Time Iterations 2: for ? = 1, 2, ..., P do ? . Pseudo Time (Non-linear) Iterations 3: for k = 1, 2, ...,m do?? . Linear Solver IterationsLinear Solver4: ... (ie. GMRES, Gauss?Seidel) 5: end for 6: end for 7: end for For every timestep t, the residual should drop with each sub-iteration ? . Similarly within each sub-iteration ? , the residual should drop within each linear iteration k. In both the iterative and traditional GMRES formulations (1 and 2 above), the linear solver loop on line 3 of Alg. 5 is typically hidden inside of the flow solver. In the O-GMRES formulation, the linear solver iterations are performed at the framework level and therefore calls to Tioga for interpolation between domains can be built into the linear solver. All cases considered in this chapter are assumed laminar and run without a turbulence model. O-GMRES applied to a multi-solver system at the framework level is a novel approach and therefore considering only the laminar mean-flow equations is the first step towards demonstrating usefulness of the method. The first section in this chapter presents a shedding, laminar sphere at low Reynolds number. The second presents a laminar rotor designed for use on a Mars helicopter. Both cases involve unsteady fluid structures passing between overset boundaries which can be challenging to sufficiently converge for large timesteps. The final section summarizes findings. 5.1 Laminar Flow Over a Sphere Time-accurate, laminar flow over a sphere is considered with one body mesh and three nested background meshes. The sphere mesh is a ?strand-type? [77] mesh 149 run with mStrand and all background meshes are run with Garfield. The Reynolds number is set to Re = 800 in order to ensure shedding occurs, convecting unsteady flow structures between meshes. The mesh system is shown in Fig. 5.1. The entire domain is initialized with free-stream values and run to time t? when the vortices have started shedding behind the sphere. The flow solvers are re-started from time t? with three time-stepping methods: Iterative, GMRES, and O-GMRES. Garfield uses DADI as a preconditioner whereas mStrand uses 5 sweeps of the multi-colored line Gauss?Seidel algorithm. For a sphere of unit diameter D = 1m, background grid spacings of 0.07m, 0.14m and 0.28m are used for the three levels of nested Cartesian meshes. Cell volumes in the finest mesh match the approximate volume of the outer-most cells of the 404,600 point near-body mesh. The sphere mesh uses wall spacing of 1? 10?3m (y+ ? 0.1). The freestream Mach number is 0.2, based on 340m/s speed of sound. Small and moderate timestep sizes are used for convergence analysis. The small and moderate timesteps correspond to CFLs of 1.71 and 3.42 in the finest nested Cartesian mesh. In mStrand, Newton timestepping is used with an infinite dual timestep ?? = ?. In Garfield, Quasi-Newton timestepping is used with a local dual-timestep CFL of 200. Garfield uses a 5th order WENO [23] spatial reconstruction scheme for all background meshes. The solution at time t? is shown in Fig. 5.1. Contours of pressure coefficient are shown on the y = 0 slice of the sphere and iso-surfaces of Q-criterion show the vorticity pattern shed behind the sphere. Vortices can be seen convecting through the boundaries of each grid. For a fair-cost comparison between the Iterative and GMRES methods, the Iterative method uses 20 times the number of sub-iterations to match the 20 Krylov iterations performed with GMRES and O-GMRES. In Figs. 5.2 and 5.3 every 20th sub-iteration is marked for comparison with GMRES and O-GMRES. In the low 150 Figure 5.1: Iso-surface of Q-criterion (0.0001) with contours of Pressure Coefficient at time t?. CFL case, shown in Fig. 5.2, the convergence slopes for the iterative and traditional GMRES methods are approximately the same. The slope for the O-GMRES method is much higher, converging to 10?10 within 5 sub-iterations before stalling because of the errors from finite difference approximations of the matrix-vector product. The linear solver convergence from Fig. 5.2(b) shows that the GMRES linear solver is converging very well but this does not translate to good convergence of the full overset system. As a result the GMRES case shows a ?saw-tooth? pattern with fast convergence in the linear solver but slow convergence overall. The residual reported from the linear solver is based on the Hessenberg least-squares procedure described previously. As a result of the ?frozen? fringe points the residual does not reflect changes in interpolated quantities. In the moderate CFL case, shown in Fig. 5.3(a), the convergence of the Iterative method reduces drastically. The GMRES and O-GMRES methods converge at similar rates however the inclusion of the implicit overset terms results in faster initial convergence for O-GMRES. Compared to the low CFL case there is less ?saw-toothed? behavior for the GMRES method, indicating the convergence of the linear residual is not straying far from the convergence of the actual full overset system. The similar rates of convergence between GMRES and O-GMRES indicates 151 (a) Time-accurate Convergence (b) Iteration 2 Non-linear and Linear Convergence Figure 5.2: Small timestep (CFL=1.71) non-linear and linear convergence. 152 the convergence rate for this case is limited more by internal mesh flow features and less from overset boundaries. A few patterns emerge from the low and moderate CFL cases. First, the Iterative method appears to do very well initially. This is slightly misleading since the Iterative method is taking 20 times the number of sub-iterations as the GMRES methods. As a result, the Iterative method has a better approximation of the backwards differencing source term from Eq. (2.40). Second, the linear-solver convergence for O-GMRES follows the overall sub-iterative convergence for the global system except for the first sub-iteration. In the first sub-iteration, the linearized system is a poor approximation to the non-linear equations. 153 (a) Time-accurate Convergence (b) Iteration 2 Non-linear and Linear Convergence Figure 5.3: Moderate timestep (CFL=3.42) non-linear and linear convergence. 154 5.2 Laminar Coaxial Rotors The second case for O-GMRES analysis considers a set of laminar, coaxial rotors for a more realistic and complex problem. In most full-size rotorcraft applications where Reynolds numbers are in the millions, purely laminar flow is a poor assumption of the rotor physics. At the micro unmanned-aerial-vehicle (UAV) scale or in different climates, the Reynolds number may be significantly reduced and laminar flow is a useful assumption. Recent research at the University of Maryland on low-Reynolds number rotor blades and vehicle design for a Mars helicopter present a unique and interesting case for CFD analysis. Winslow et. al. researched low-Reynolds number blade performance resulting in a thin, cambered flat-plate design [78]. Escobar et. al. [79] combined the work of Winslow with design choices from the University of Maryland Helicopter Design Team from 2000 [80] to design a rotor for the Martian atmosphere, where Reynolds number is low and Mach number is high. The Mars rotor from Escobar et. al. is run with mStrand for the four blades of the coaxial system and Garfield for all background meshes. For the selected case, the untwisted blades have an aspect-ratio of 4.61, root cutout of 13%, tip Mach Figure 5.4: Blade cross-section showing cambered, 2% thick flat plate. 155 number of Mtip = 0.21, Reynolds number of Re = 10, 000, and collective of ? = 10 ?. The cambered flat plate airfoil, shown in Fig. 5.4, is 2% thick with slight camber and sharp leading and trailing edge corners. The spacing between rotors is set to 10%R. Each blade has 2.85? 106 points with wall spacing of 5.5? 10?6 chords. The wall cell size corresponds to a very small non-dimensional wall spacing of y+ ? 0.003. The ratio of specific heats, ?, is kept at 1.4 for the value on Earth. The solution was advanced to time t? with the Iterative method for 3 revolutions with azimuthal timestep ?? = 0.5?. The solution after 3 revolutions at time t? is shown in Fig. 5.6 and is the point from which convergence studies are restarted. The grid system is shown in Fig. 5.5. Four blades meshes are used, two per rotor, each with 2.8 million points. Both clock-wise and counter-clockwise blades are identical mirror images of each other. The three nested background meshes from fine to coarse contain 1.7, 8.9, and 4.6 million points respectively. With the exception of 2.6 R 68 R 0.87 R 7.6 R 8.7 R 87 R Figure 5.5: Overset grid system for the laminar coaxial rotors. 156 the finest level, which has uniform spacing of 7% chord, the other background meshes use a hyperbolic tangent function to smoothly stretch the grid in each direction. In mStrand, the multi-colored Gauss?Seidel method is used for the blade meshes. In the background Garfield used DADI for all meshes instead of the red-black method. Results titled as ?Iterative? for this case corresponds to DADI applied in the off-body with multi-colored Gauss?Seidel in the near-body and without the use of any linear solver. At restart point t? the rotors are aligned along the X-axix. Convergence is compared over 10 timesteps of ?? = 1?, after which the upper and lower rotors have rotated +10? and ?10? respectively about the +Z axis. The rotor positions at five azimuths of blade passage are shown in Fig. 5.7, with the upper rotor colored orange Figure 5.6: Iso-surface of Q-criterion(0.003) with contours of vorticity. 157 (a) 1? Azimuth (b) 3? Azimuth (c) 5? Azimuth (d) 7? Azimuth (e) 9? Azimuth Figure 5.7: Grid and slice location at five azimuths. 158 and lower rotor colored blue. The overset grid is shown on an X-slice of the domain near the blade tip (x/R = 0.97). As the blades pass, the fringe- and field-maps of the overset blade grids changes significantly. Furthermore interpolation at these azimuths involve three grids: the upper rotor blade, lower rotor blade, and nested background mesh. Initially at azimuths ? = 1? and 3? (Fig. 5.7(a) and (b), the lower rotor especially is expected to receive significant influence from the upper blade mesh. By ? = 7? and definitely ? = 9?, the convection of fluid into the lower rotor mesh is back to merely the nested mesh. Considering the direction of fluid convection through meshes (considering grid motion), is an important factor for determining the relevancy of iterpolation regions to the overall solution and convergence. The sub-iteration convergence for the ten timesteps during blade passage is shown in Fig. 5.8. The solution is restarted from ? = 0? therefore the first iteration corresponds to solving for ? = 1? for the upper (counter-clockwise) rotor and ? = ?1? for the lower (clockwise) rotor. All methods use 15 sub-iterations with 20 linear solver iterations. For the GMRES methods this corresponds to 20 Krylov iterations Iterative GMRES O-GMRES 101 100 10?1 10?2 10?3 10?4 10?5 10?6 10?7 1? 2? 3? 4? 5? 6? 7? 8? 9? 10? Azimuth ? Figure 5.8: Iteration convergence after t?: ?? = 1?. Dual-time CFL is ? for GMRES and O-GMRES methods and 100 for the Iterative method. 159 Residual L2 Norm and for the Iterative algorithm this is approximated simply as a 20? multiplier on the number of sub-iterations. Both GMRES and O-GMRES were able to run with an infinite dual-time CFL? = ?, meaning the local timestep in each cell is the physical timestep. The Iterative method did not converge well with the large timestep and instead was run with dual-time CFL? = 100. Even with the reduced CFL, the Iterative method stalls, indicating further relaxation of the dual-timestep may be required. O-GMRES in general converges better than GMRES especially in the initial azimuths. The solution and convergence at the five azimuths from Fig. 5.7 can be analyzed in more detail by looking at the vorticity solution near the blade tips and linear solver convergence for GMRES and O-GMRES methods. The vorticity solution is used to visualize important fluid structures convecting across overset boundaries. The vorticity solution at ? = 1? is shown in Fig. 5.9(a). The lower rotor, seen traveling to the left, is receiving significant flow information directly from the mesh of the upper rotor. The convergence, shown in Fig. 5.9(b), shows the sub-iteration convergence as solid symbols and linear-solver convergence as smaller, hollow symbols. In later sub- iterations O-GMRES converges at approximately the same rate as GMRES however in early sub-iterations, O-GMRES maintains a significantly steeper convergence rate. As a result, O-GMRES requires approximately half the number of sub-iterations to achieve the convergence drop seen in GMRES over 15 sub-iterations. After 3 sub-iterations the linear solver trend from O-GMRES matches very well with the sub-iteration convergence. This indicates after 3 sub-iterations that the linear system is a good approximation to the non-linear problem. With the de-coupled GMRES algorithm, however, a saw-tooth pattern emerges between the sub-iteration and linear iteration trends. Linear convergence for GMRES may be very good but this does not translate to faster non-linear convergence. At ? = 3?, shown in Fig. 5.10, the same overall observations are amplified 160 from ? = 1?. Vorticity from the upper rotor is convecting into the interpolation region between the upper rotor, lower rotor, and background mesh. As a result the convergence rate with GMRES reduces earlier by sub-iteration 2, whereas the convergence with O-GMRES remains relatively unaffected. By ? = 5?, shown in Fig. 5.11, and ? = 7?, shown in Fig. 5.12 the vortex from the upper rotor has entered the grid of the lower rotor. The convergence rate with GMRES still reduces drastically after the first sub-iteration. The convergence rate with O-GMRES starts to show two clear kinks in the convergence trend. The first kink occurs after the first sub-iteration and the second occurs after the fourth or fifth sub-iteration. Before sub-iteration 6, the linear solver convergence for O-GMRES does not match well with the sub-iteration convergence. The mis-match indicates convergence is largely affected by the temporal linearization terms. After sub-iteration 5, for ? = 5?, or 6, for ? = 7?, the O-GMRES convergence rate decreases after the second kink but the linear solver convergence matches the non-linear convergence. The kinks in convergence are therefore primarily due to linearization error. The fact that O-GMRES converges faster for the same number of sub-iterations indicates a significant portion of the linearization error is a function of the interpolated region. This conclusion matches the qualitative observation that between ? = 5? ? 7?, the vorticity solution crosses the overset boundaries of both the upper and lower rotors. Finally by ? = 9? each rotor blade is back to receiving a relatively clear flow solution from the nested mesh. The vortex from the upper blade still passes through the mesh of the lower rotor blade however the convection direction in each mesh indicates interpolation should play less of a role. The convergence result still shows a kink in the O-GMRES solution where the non-linear error is being converged however the kink occurs earlier than for ? = 5? ? 7?. This indicates non-linear solution error involves less influence from interpolation regions. O-GMRES and GMRES appear to converge toward the a similar convergence level and rate by sub-iteration 14. As 161 interpolation plays less of a role in the numerical error, it makes sense that both O-GMRES and GMRES would perform similarly. The linear solver convergence for GMRES, however, still demonstrates the saw-tooth pattern whereas GMRES converges much more consistently between linear and non-linear sub-iterations. For all timesteps, the benefit of O-GMRES occurs primarily in the first few sub-iterations especially at azimuths where interpolation of fluid structures occurs on the convection-inflow-side of the grid. Travel of information for convection- dominated flows is governed by the wave speeds. In a numerical stencil where a cell is downwind of an interpolated region with high gradients, O-GMRES greatly out-performs GMRES. Once the blades pass, large gradients may still be interpolated at the boundary however the boundary is now downwind of most interior cells. In a numerical stencil where field cells are upwind of an interpolation region, there is less of a benefit for O-GMRES compared to the traditional GMRES. After the first few sub-iterations, the non-linear error of the system is reduced enough that the linear convergence for O-GMRES is representative of the non-linear convergence. This is not the case for the traditional GMRES convergence, which instead displays a saw-toothed pattern. The O-GMRES result presents an interesting prospect for further convergence accelleration. With the linear system adequately representative of the non-linear system, increasing the number of Krylov iterations with each sub-iteration would take advantage of the quadratic convergence rate of the GMRES algorithm while still properly handling interpolation boundaries. 162 (a) Vorticity at 1? 101 10?1 10?3 GMRES 10?5 10?7 O-GMRES 10?9 0 2 4 6 8 10 12 14 Sub-iterations (b) Convergence at 1? Figure 5.9: Vorticity solution near the tip of the blades and convergence for ? = 1?. 163 Residual L2 Norm (a) Vorticity at 3? 101 10?1 10?3 GMRES 10?5 10?7 O-GMRES 10?9 0 2 4 6 8 10 12 14 Sub-iterations (b) Convergence at 3? Figure 5.10: Vorticity solution near the tip of the blades and convergence for ? = 3?. 164 Residual L2 Norm (a) Vorticity at 5? 101 10?1 10?3 GMRES 10?5 10?7 O-GMRES 10?9 0 2 4 6 8 10 12 14 Sub-iterations (b) Convergence at 5? Figure 5.11: Vorticity solution near the tip of the blades and convergence for ? = 5?. 165 Residual L2 Norm (a) Vorticity at 7? 101 10?1 10?3 GMRES 10?5 10?7 O-GMRES 10?9 0 2 4 6 8 10 12 14 Sub-iterations (b) Convergence at 7? Figure 5.12: Vorticity solution near the tip of the blades and convergence for ? = 7?. 166 Residual L2 Norm (a) Vorticity at 9? 101 10?1 10?3 GMRES 10?5 10?7 O-GMRES 10?9 0 2 4 6 8 10 12 14 Sub-iterations (b) Convergence at 9? Figure 5.13: Vorticity solution near the tip of the blades and convergence for ? = 9?. 167 Residual L2 Norm 5.3 O-GMRES Conclusions Two cases are presented for analysis of the developed O-GMRES algorithm. The framework-level algorithm incorporates interpolation between meshes as part of the implicit system for overset problems with the goal of accelerating convergence. Two different solvers are used, mStrand and Garfield, and three different implicit methods are compared. The first method uses the preconditioner available from each solver to approximate the inversion of the linear system. The second method uses a de-coupled GMRES algorithm to solve the linear system in each mesh independently. The final method uses the developed O-GMRES algorithm. In all compared cases, both sub-iterations and linear solver iterations are used for convergence analysis. The first considered case is a sphere in freestream at Re = 800, which results in unsteady vortices shed behind the sphere. mStrand is used for the sphere mesh and Garfield is used for all background meshes. Convergence with all three implicit methods are compared for two different timestep sizes. For a fair comparison, GMRES and O-GMRES are run with 10 sub-iterations of 20 Krylov (linear) iterations and the iterative scheme is run with 20?10 sub-iterations. For the small timestep, the simple iterative procedure with just the preconditioner out-performs the traditional GMRES method. The multiplier on sub-iterations for the iterative scheme results in better approximation to the time-accurate terms in initial sub-iterations. This explains the initial rapid convergence of the Iterative method. O-GMRES out-performs both GMRES and iterative methods. The O-GMRES method stalls after dropping 10 orders as a result of finite-difference limitations for floating point arithmetic. In the case with the moderate timestep, the iterative method struggles to converge and the traditional GMRES clearly showed an advantage. O-GMRES and GMRES converged with the same slope, however O-GMRES initially converged significantly faster. 168 A detailed view of the sub-iteration and linear iteration convergence is compared for both small and moderate timesteps. In both cases, GMRES shows discrepancy between convergence of the linear solver and convergence of the overall non-linear equations. This trend is a result of the fixed values at interpolation boundaries. Effectively the solution that the linear system is converging toward is inconsistent with the actual overall overset system. When the overset points are included as part of the implicit system, O-GMRES shows consistent drop in the residual between both sub-iterations and linear iterations. The second case considers a set of coaxial rotors designed for use on Mars. The cambered, flat-plate airfoils are meant for low-Reynolds number flight conditions. A 10? window during blade passage is studied in detail. Most notably from the general convergence trend, the Iterative method convergence stalls even with a relaxation applied to the timestep. GMRES and O-GMRES convergence do not stall. Similar to the results from the sphere run with moderate timestep, O-GMRES and GMRES eventually converge at similar rates. The primary benefit from O-GMRES is from faster initial convergence. Analysis of sub-iterations and linear iterations at azimuths 1?, 3?, 5?, 7?, and 9? reveals patterns from the physics that affect the convergence trend. When regions of large-gradient interpolation appear upstream of blade meshes, O-GMRES greatly out-performs GMRES. Once the blades pass and interactional aerodynamics play less of a roll, O-GMRES and GMRES converge more similarly. In both the sphere case and coaxial rotor case, O-GMRES converges more consistently than GMRES. As the interpolation regions changed for the case of passing blades, O-GMRES maintains consistent convergence over the 10? passage window. Convergence with the traditional GMRES method, however, is adversely affected by the presence of overset interpolation. In an interactional aerodynamics case where the overset connectivity may be constantly changing from the presence of 169 multiple rotors, O-GMRES offers a more reliable approach to obtain good numerical convergence. Comparing patterns of sub-iteration and linear-solver convergence leads to additional convergence acceleration possibilities. All cases considered in this chapter use constant numbers of sub-iterations and linear-solver iterations. A realistic application might instead vary the number based on convergence or sub-iteration step. For example, initial sub-iterations may benefit from the use of fewer linear solver iterations in order to quickly reduce linearization error. Later sub-iterations may benefit from more linear iterations to take advantage of the quadratic convergence rate of the GMRES algorithm. Timing of O-GMRES and GMRES has not been considered as a metric for comparison for two reasons. First, the difference in algorithms simplifies to additional data-exchange routines called during the linear iteration process. The performance of the data-exchange routines is both case and hardware dependent. Second, a real engineering application of the framework would apply practical limits to linear solver steps to avoid unnecessary computation in initial sub-iterations. In addition to aforementioned use of variable linear solver steps, future develop- ment and investigations of O-GMRES can include using the overset algorithm only on sub-sets of meshes. Similar to how Sec. 2.6 describes separating stationary and reconnecting meshes to avoid unnecessary computation, O-GMRES could be applied only to moving, reconnecting meshes where the change in interpolation regions could have a significant and adverse affect on convergence. For the coaxial rotor case, for example, all background meshes could be solved using the traditional approach while the blade meshes use O-GMRES to ensure consistent convergence. 170 Chapter 6: Conclusions The demands of commercial urban air mobility and next-generation military aircraft have resulted in the evolution of the helicopter to include complex configura- tions with many interacting rotors and lifting surfaces. Improved understanding of interactional aerodynamics between components is crucial to the safety and viability of future vertical take-off and landing vehicles. Complex configurations may require higher-fidelity computational aerodynamic tools to be used for analysis to ensure accurate representation of the problem. Just as the definition of rotorcraft has evolved, so has the definition of the modern supercomputer. Improvements in computational fluid dynamics tools since the late 20th century has largely followed advancements in computer science and engineering to develop faster, cheaper computers. From single-core to multi-core to distributed memory computers, each new hardware generation resulted in new software written to achieve faster simulation times. The modern trend of mas- sively parallel compute architectures like the GPU present a promising new era of supercomputing. From the perspective of aerodynamic analysis, full-scale rotorcraft CFD is not new and can be accomplished using a plethora of industry, government, or academic codes on traditional supercomputers with large simulation times. Similarly from the computer science perspective, using GPUs to accelerate numerical methods is also well established. The intersection of these two perspectives, however, presents promising conjecture and is the root motivation for this dissertation research. 171 Research from this dissertation has been presented with the goal of achieving six primary objectives: 1. Develop a GPU-accelerated, time-accurate Navier?Stokes solver for overset applications. 2. Design and implement a framework for the overset multi-solver paradigm amenable to heterogeneous architectures like the GPU 3. Identify and derive a robust, flexible algorithm for the overset multi-solver paradigm that considers fringe points implicitly. 4. Verify and validate the individual components of an overset framework. 5. Apply the overset framework to a full, notional rotorcraft configuration. 6. Apply the implicit overset algorithm to a coaxial rotor configuration. The first three objectives are development-based and require advancing algorithms or technology beyond the state-of-the-art whereas the last three are more application focused. This chapter reviews a summary of the presented work to achieve each objective. Contributions to the state-of-the-art are highlighted. Finally, future research directions are recommended based on findings from the current work. 6.1 Algorithm and Framework Development Addressing the first objective, the GPU-accelerated CFD solver Garfield has been developed to approximate the Reynolds-Averaged Navier?Stokes equations on structured, curvi-linear meshes. The finite-volume code has been heavily adapted to operate on multiple, moving and deforming grids run in parallel across multiple GPUs. Each algorithm within the flow solver has been designed to run efficiently in 172 parallel on the GPU and all data remains on the GPU to avoid costly transfers back to CPU memory. The preconditioner of the implicit linear system in particular poses algorithmic challenges because many common approaches use serial algorithms designed for a CPU core, a serial architecture. In Garfield, three variations of the Red- Black Gauss?Seidel method, have been designed specifically for efficiency on GPU architectures. To combine Garfield with other, CPU-based codes, a light-weight Python framework has been designed which interfaces many different numerical codes. In addition CPU-based CFD solvers, the framework utilizes a grid-motion module for rigid or elastic deflection of any grid topology (structured or unstructured). For overset communication, the open-source overset grid assembler Tioga is used. Furthermore Tioga has been adapted to efficiently handle either CPU or GPU data to drastically reduce data-transfer and simulation times. The designed framework satisfies the requirements of the second objective. The third objective of improving the convergence for challenging overset cases is accomplished with the development of the Overset GMRES method. As a distributed formulation of the GMRES algorithm, O-GMRES is implemented at the framework level to consider interpolated boundaries as part of the implicit system of equations. In contrast, traditional Schwartz-alternating methods consider each grid separately. 6.2 Framework Applications 6.2.1 Verification and Validation Subsets of the full framework have been validated using simpler, well-understood aerodynamic cases. For accuracy validation of Garfield, the GPU code is compared to NASA?s Overflow code for the same overset mesh system of an Onera M6 wing. 173 Both codes result in almost identical pressure distribution over the wing, indicating the flow solver and overset interpolation routines behave as expected. The Onera M6 case is also used for a performance comparison of CPU and GPU codes. The NASA Overflow code solves almost identical equations as Garfield however timing results show Garfield runs 5.35? faster on a single GPU compared to a single CPU. Extrapolating speedup to modern clusters containing 4-8 GPUs per node, Garfield easily demonstrates an order of magnitude in performance gain over the highly-optimized CPU code. Furthermore, the scalability of Garfield tested on different GPU architectures on up to 32 GPUs demonstrates the code scales well across multiple processors. NASA?s Overflow code has also been used to verify proper treatment of elastic deflection within the developed mesh motion module. A UH-60 rotor in high- speed forward flight is used for comparison. Without the proper elastic deflection treatment in CFD, the predicted airloads will not be correct. Assuming proper treatment within NASA?s Overflow solver, if the motion module is provided the same deflections and results in the same airloads, the deflections and grid-motion in Garfield must also be correctly handled. Airload results between Overflow and Garfield are very similar and minor differences appear only in regions of challenging flow physics, such as the shock on the advancing side of the rotor. The minor discrepancies can be attributed to different grids used in each solver. The developed Red-Black Gauss?Seidel preconditioners as well as the implicit GMRES linear solver have been verified within Garfield using simple aerodynamic problems. For rotor-blade-like meshes, the Gauss?Seidel method out-performs the baseline DADI implementation by converging in less wall-clock time and with a larger allowable timestep. For a laminar cylinder case, the use of GMRES also results in faster convergence and a larger maximum allowable timestep. Finally, benefits of the Overset GMRES algorithm has been analyzed for two 174 simple problems: the Poisson heat-equation and unsteady, inviscid flow over a wedge. As a linear partial differential equation, the Poisson equation solved with O-GMRES can be accomplished by increasing the number of linear (Krylov) iterations. As the number of linear iterations is increased, O-GMRES converges in fewer total iterations. With the traditional GMRES, however, linear iterations do not include updated interpolated quantities resulting in a saw-tooth convergence pattern and significantly more iterations. The saw-tooth pattern also appears in the time-accurate convergence of the inviscid wedge case. At each timestep, O-GMRES shows consistent convergence between linear iterations and the non-linear Newton iterations. 6.2.2 Notional Helicopter Configuration The ROBIN fuselage and rotor have been used as the starting point for applying the full, overset, heterogeneous framework to a realistic configuration. Experimental wind-tunnel results from rotor-fuselage interaction cases have been simulated in the present framework using three different in-house CFD solvers: HamStr, TURNS, and Garfield. Previous comparisons of the experiment and CFD have found discrepancies between the required rotor controls for trimmed flight. Because the trim analysis is beyond the scope of this work, the cyclic control angles from literature have been used and the collective estimated to match the main rotor lift. Overall, the pressure results in the longitudinal direction match well between experiment and CFD. On the lateral cross section the pressure history matches less well, likely as a result of ignoring the hub and support structure in CFD meshes. Most importantly the ability to model the interactional aerodynamics and extract meaningful data for the rotor-fuselage case is demonstrated. The most complex case run for this work extends the rotor-fuselage case to include a hub and tail rotor. The large simulation is representative of a full configu- ration and highlights many challenges, including convergence and load balancing. 175 The tail rotor rotates at a higher RPM than the main rotor, resulting in a larger azimuthal timestep. The larger step causes the residual of the tail rotor to stall in Garfield, indicating the need for improved implicit numerical methods. Although Garfield is assigned significantly more grid points compared to the CPU-based solvers, Garfield runs over 3? faster than the other solvers. The mis-match in performance can be offset by adding a 3? multiplier to the number of sub-iterations performed in Garfield. The additional sub-iterations are free from a wall-clock perspective and contribute to improved convergence in Garfield grids. Between the initial rotor-fuselage simulation and the rotor-fuselage-hub-tail simulation, the recorded thrust on the main rotor experiences a significant blip at azimuth ? = 0?. The change in thrust at this azimuth is a result of the hub, which sheds structures and causes separation on the rotor blade passing in its wake. While the presence of the hub affects the rotor blade thrust, the torque is hardly affected. Aerodynamic interactions between the main and tail rotors have been analyzed by running the two rotors without the fuselage or hub. Three different cases are used to draw conclusions about the tail rotor thrust (and corresponding main-rotor anti-torque). First, the isolated tail rotor shows the expected, dominant 8/rev signal in forward flight. In the presence of the main rotor, however, a strong 16/rev component of thrust appears. This higher frequency can be justified assuming a 4/rev variation in inflow to the tail rotor but without the full CFD result the analysis is not trivial. Finally a cross-wind added to the configuration shows the tail in a ?descent? flight condition results in massively separated flow on the tail rotor blades. Despite increased counter-torque to the main rotor and resulting aircraft yaw, the separation on the blades indicates the pilot may not have the control authority to correct the tail thrust without significant and possibly dangerous decrease of the tail rotor blade pitch. 176 6.2.3 Overset GMRES The first case used to analyze performance of the O-GMRES algorithm ap- plied to a multi-solver problem is unsteady, laminar flow over a sphere. Three background meshes are used with Garfield and the near-body sphere mesh is solved using mStrand. The convergence using two different timesteps both show the O-GMRES method converges better than the traditional GMRES method or simple pre-conditioning approach. Just as with the wedge case used for single-solver validation, the traditional GMRES results in a saw-tooth pattern for sub-iteration convergence. O-GMRES, on the other hand, converges consistently between linear and non-linear (Newton) iterations. A laminar rotor designed for use on Mars is used as the final case for O-GMRES analysis. Again Garfield is used for all background grids and mStrand is used for the four coaxial, counter-rotating blades. Focusing on the azimuths of blade passage, convergence in a 10? window is analyzed between the iterative preconditioner, traditional GMRES, and O-GMRES. A large azimuthal timestep of ? = 1? is used to further challenge convergence. Interestingly, the convergence with only the iterative preconditioner stalls. The traditional GMRES method converges, however the total convergence varies with each iteration as the interpolation between grids changes. Furthermore the linear solver iterations from GMRES indicate the same saw-tooth pattern observed from previous results also applies to this case. The O-GMRES method, on the other hand, converges consistently throughout the entire 10? passage window. Linear solver convergence indicates after the first four sub-iterations that the non-linear residual is sufficiently converged and the convergence is primarily governed by the linear convergence. Convergence analysis for O-GMRES has been conducted between all grids for a fixed number of linear and non-linear iterations. The convergence results can be 177 used to draw conclusions about more efficient application of the framework. For example, a realistic application may only require O-GMRES to be applied to sub-sets of grids in areas of large gradients. The number of linear iterations in initial Newton steps can also be reduced, since convergence appears dominated by the non-linear, time-differencing terms. 6.3 Contributions and Future Work Results from the heterogeneous framework design presented in this dissertation are being considered for implementation in the Helios (CREATE AVTM) software suite. Modifications to the open-source connectivity code for Python wrapping and GPU memory interpolation are available to the public through GitHub. The modifications are minor, however result in significant improvements in speedup when CPU and GPU codes are used together. Performance results from this dissertation provide a guideline for the advancement of industry and government research codes to take advantage of new, available supercomputer hardware. The Garfield source code is another key contribution, since it demonstrates an order of magnitude speedup over comparable CPU codes like NASA?s Over- flow. Most importantly the speedup and scalability of Garfield is obtained for realistic use cases on currently available clusters. Algorithm implementation details, for example of the Gauss?Seidel-type methods, suggests techniques for improved convergence accelerated on parallel architectures like the GPU. The developed Overset GMRES framework is also being considered for inclusion in the Helios (CREATE AVTM) framework. An extension of the algorithm is under consideration for intra-mesh interpolation, which occurs in strand meshes near locations of concave geometry. Improving the convergence of challenging, overset applications is key to attaining physical, trustworthy results, especially as aircraft configurations become more complex. 178 Future recommended work for the Garfield solver should consider improving the treatment of viscous terms for the Red-Black Gauss?Seidel method. Currently, viscous eigenvalue approximations are used which perform well for high Reynolds number simulations but poorly for cases like the laminar Mars rotor. The GMRES routines within Garfield should also be extended to consider the turbulence model equations. Currently the turbulence model equation is de-coupled from the mean-flow equations, which severely limits the simulation timestep size. Since the turbulence model equation is a conservation equation, it can be added to the other five equations to create a single residual vector minimized during the GMRES procedure. Future work for the O-GMRES framework should include convergence analysis using variable timestep sizes (dual-time CFL) and variable numbers of linear solver iterations. In initial Newton iterations, for example, only 5 Krylov iterations may be required whereas in final Newton iterations 40-50 Krylov iterations may be used. The additional Krylov iterations should improve convergence as a result of the quadratic convergence rate of the GMRES algorithm. Capability enhancements to the developed framework should consider a more generic treatment of grid motion to allow nested, elastic frames of motion. Currently, the blades of a rotor system are allowed to deform given an deflection file but the rotor location is fixed. For application involving a propeller on a flexing wing, the rotor frame should be allowed to deform with the wing. The developed O-GMRES method demonstrates improved convergence com- pared to traditional methods, however additional research is required to analyze the benefits of improved convergence on engineering quantities. One possible application area is the investigation of hub drag by looking at a breakdown of different com- ponents. Only recently have state-of-the-art CFD frameworks started looking at complex hub geometries with the goal of understanding specific sources of hub drag. Since the hub consists of highly complex connections to the fuselage and blades, the 179 hub region in CFD can result in a significant amount of interpolated fringe areas. Without proper convergence, the pressure and skin-friction drag may not be properly captured by the numerics. Rotor hub drag is especially important at high advance ratios and the CFD for such a case would be complemented well with additional wind or water tunnel experiments. 180 Appendix A: Flow Visualization with Vorticity and Q-Criterion When analyzing the solution of large simulations, iso-surfaces of Vorticity or Q-criterion help visualize the locations of interesting flow features. Vorticity is the magnitude of the curl of the velocity, defined as ?? ? = ?? ? V ? (A.1) and Q-criterion is a function of the rotation and strain tensors 1 ( ) Q = ???2c ? ?S?2 (A.2) 2 where the rotation and strain tensors respectively are ( ) ( ) 1 ?ui ? ?uj 1 ?ui ?uj?i,j = Si,j = + . (A.3) 2 ?xj ?xi 2 ?xj ?xi 181 Bibliography [1] E. Strohmaier, J. Dongarra, H. Simon, and M. Meuer, ?Top500 list - november 2018.? [2] K. Rupp, M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten, ?42 years of microprocessor trend data,? 2017. [3] R. E. Mineck and S. A. Gorton, ?Steady and periodic pressure measurements on a generic helicopter fuselage model in the presence of a rotor,? Tech. Rep. NASA TM-2000-210286, NASA, June, 2010. [4] R. W. Prouty, Military helicopter design technology. Malabar, Florida: Krieger Publishing Company, 1998. [5] B. Roget, J. Sitaraman, and A. M. Wissink, ?Maneuvering rotorcraft simulations using CREATE-AV(TM) Helios,? in 54th AIAA Aerospace Sciences Meeting, American Institute of Aeronautics and Astronautics, jan 2016. [6] R. Magnus and H. Yoshihara, ?Inviscid transonic flow over airfoils,? AIAA Journal, vol. 8, pp. 2157?2162, dec 1970. [7] J. D. Cole and E. M. Murman, ?Calculation of plane steady transonic flows,? AIAA Journal, vol. 9, pp. 114?121, jan 1971. [8] A. Jameson, W. Schmidt, and E. Turkel, ?Numerical solution of the euler equations by finite volume methods using runge kutta time stepping schemes,? in 14th Fluid and Plasma Dynamics Conference, American Institute of Aeronautics and Astronautics, jun 1981. [9] T. H. Pulliam and J. L. Steger, ?Implicit finite-difference simulations of three- dimensional compressible flow,? AIAA Journal, vol. 18, pp. 159?167, Feb 1980. [10] T. Pulliam and D. Chaussee, ?A diagonal form of an implicit approximate- factorization algorithm,? Journal of Computational Physics, vol. 39, no. 2, pp. 347 ? 363, 1981. 182 [11] S. Obayashi and K. Kuwahara, ?LU factorization of an implicit scheme for the compressible navier-stokes equations,? in 17th Fluid Dynamics, Plasma Dynamics, and Lasers Conference, American Institute of Aeronautics and Astronautics, jun 1984. [12] T. Cebeci and A. M. O. Smith, ?A finite-difference method for calculating com- pressible laminar and turbulent boundary layers,? Journal of Basic Engineering, vol. 92, no. 3, p. 523, 1970. [13] B. Baldwin and H. Lomax, ?Thin-layer approximation and algebraic model for separated turbulentflows,? in 16th Aerospace Sciences Meeting, American Institute of Aeronautics and Astronautics, jan 1978. [14] D. A. Johnson and L. S. King, ?A mathematically simple turbulence closure model for attached and separated turbulent boundary layers,? AIAA Journal, vol. 23, pp. 1684?1692, nov 1985. [15] Spalart, P. R., and Allmaras, S. R., ?A One-Equation Turbulence Model for Aerodyanmics Flows,? in AIAA-Paper-1992-0439, 30th AIAA Sciences Meeting and Exhibit, January 6?9, 1992. [16] B. Launder and D. Spalding, ?The numerical computation of turbulent flows,? Computer Methods in Applied Mechanics and Engineering, vol. 3, no. 2, pp. 269 ? 289, 1974. [17] F. R. Menter, ?Two-equation eddy-viscosity turbulence models for engineering applications,? AIAA Journal, vol. 32, pp. 1598?1605, aug 1994. [18] A. Bagai and J. G. Leishman, ?Rotor free-wake modeling using a pseudoimplicit relaxation algorithm,? Journal of Aircraft, vol. 32, pp. 1276?1285, nov 1995. [19] J. P. Yin and S. R. Ahmed, ?Helicopter main-rotor/tail-rotor interaction,? Journal of the American Helicopter Society, vol. 45, pp. 293?302, oct 2000. [20] F. X. Caradonna, C. Tung, and A. Desopper, ?Finite difference modeling of rotor flows including wake effects,? Journal of the American Helicopter Society, vol. 29, pp. 26?33, apr 1984. [21] D. Jude and J. D. Baeder, ?Extending a three-dimensional GPU RANS solver for unsteady grid motion and free-wake coupling,? in 54th AIAA Aerospace Sciences Meeting, American Institute of Aeronautics and Astronautics, jan 2016. [22] B. Koren, A robust upwind discretization method for advection, diffusion and source terms, pp. 117?138. Notes on Numerical Fluid Mechanics, Vieweg, 1993. [23] G.-S. Jiang and C.-W. Shu, ?Efficient implementation of weighted eno schemes,? Journal of Computational Physics, vol. 126, no. 1, pp. 202 ? 228, 1996. 183 [24] A. N. Brooks and T. J. Hughes, ?Streamline upwind/petrov-galerkin formula- tions for convection dominated flows with particular emphasis on the incom- pressible navier-stokes equations,? Computer Methods in Applied Mechanics and Engineering, vol. 32, pp. 199?259, sep 1982. [25] B. Cockburn and C.-W. Shu, ?Runge?kutta discontinuous galerkin methods for convection-dominated problems,? Journal of Scientific Computing, vol. 16, pp. 173?261, Sep 2001. [26] N. Nguyen, J. Peraire, and B. Cockburn, ?An implicit high-order hybridiz- able discontinuous galerkin method for linear convection?diffusion equations,? Journal of Computational Physics, vol. 228, pp. 3232?3254, may 2009. [27] F. X. Caradonna and M. P. Isom, ?Subsonic and transonic potential flow over helicopter rotor blades,? AIAA Journal, vol. 10, pp. 1606?1612, dec 1972. [28] T. Roberts and E. Murman, ?Solution method for a hovering helicopter rotor using the euler equations,? in 23rd Aerospace Sciences Meeting, American Institute of Aeronautics and Astronautics, jan 1985. [29] C. Chen and W. Mccroskey, ?Numerical simulation of helicopter multi-bladed ro- tor flow,? in 26th Aerospace Sciences Meeting, American Institute of Aeronautics and Astronautics, jan 1988. [30] G. R. Srinivasan, J. D. Baeder, S. Obayashi, and W. J. Mccroskey, ?Flowfield of a lifting rotor in hover - a navier-stokes simulation,? AIAA Journal, vol. 30, pp. 2371?2378, oct 1992. [31] J. Benek, J. Steger, and F. C. Dougherty, ?A flexible grid embedding tech- nique with application to the euler equations,? Fluid Dynamics and Co-located Conferences, American Institute of Aeronautics and Astronautics, Jul 1983. [32] E. Duque and G. R. Srinivasan, ?Numerical simulation of a hovering rotor using embedded grids,? AHS Forum, American Helicopter Society, June 1992. [33] J. U. Ahmad and R. C. Strawn, ?Hovering rotor and wake calculations with an overset-grid navier-stokes solver,? vol. 2 of AHS Forum, pp. 1949?1959, American Helicopter Society, 01 1999. [34] H. Pomin and S. Wagner, ?Navier-stokes analysis of helicopter rotor aerody- namics in hover and forward flight,? Journal of Aircraft, vol. 39, pp. 813?821, sep 2002. [35] D. Jespersen, T. Pulliam, and P. Buning, ?Recent enhancements to overflow,? Aerospace Sciences Meetings, American Institute of Aeronautics and Astronau- tics, Jan 1997. 184 [36] R. Biedron and J. Thomas, ?Recent enhancements to the fun3d flow solver for moving-mesh applications,? Aerospace Sciences Meetings, American Institute of Aeronautics and Astronautics, Jan 2009. [37] V. Sankaran, A. Wissink, A. Datta, J. Sitaraman, M. Potsdam, B. Jayaraman, A. Katz, S. Kamkar, B. Roget, D. Mavriplis, H. Saberi, W.-B. Chen, W. Johnson, and R. Strawn, ?Overview of the Helios version 2.0 computational platform for rotorcraft simulations,? Aerospace Sciences Meetings, American Institute of Aeronautics and Astronautics, Jan. 2011. [38] A. Wissink, B. Jayaraman, A. Datta, J. Sitaraman, M. Potsdam, S. Kamkar, D. Mavriplis, Z. Yang, R. Jain, J. Lim, and R. Strawn, ?Capability enhancements in version 3 of the Helios high-fidelity rotorcraft simulation code,? Aerospace Sciences Meetings, American Institute of Aeronautics and Astronautics, Jan. 2012. [39] M. Ghoreyshi, M. E. Young, A. J. Lofthouse, A. Jira?sek, and R. M. Cummings, ?Numerical Simulation and Reduced-Order Aerodynamic Modeling of a Lambda Wing Configuration,? Journal of Aircraft, vol. 55, pp. 549?570, June 2016. [40] H. Gopalan, C. Gundling, K. Brown, B. Roget, J. Sitaraman, J. D. Mirocha, and W. O. Miller, ?A coupled mesoscale?microscale framework for wind resource estimation and farm aerodynamics,? Journal of Wind Engineering and Industrial Aerodynamics, vol. 132, pp. 13 ? 26, 2014. [41] H. A. Schwarz, ?II. Ueber einen Grenzu?bergang durch alternirendes Verfahren..? Wolf J. XV. 272-286., 1870. [42] A. Mota, I. Tezaur, and C. Alleman, ?The schwarz alternating method in solid mechanics,? Computer Methods in Applied Mechanics and Engineering, vol. 319, pp. 19?51, June 2017. [43] B. Smith, P. Bjorstad, and W. Gropp, Domain decomposition: parallel multi- level methods for elliptic partial differential equations. Cambridge: Cambridge University Press, 2004. [44] M. Galbraith, R. Knapke, P. Orkwis, and J. Benek, ?A discontinuous galerkin chimera scheme with implicit artificial boundaries,? Aerospace Sciences Meetings, American Institute of Aeronautics and Astronautics, Jan 2013. [45] M. J. Brazell, J. Sitaraman, and D. J. Mavriplis, ?An overset mesh approach for 3d mixed element high-order discretizations,? Journal of Computational Physics, vol. 322, pp. 33?51, Oct. 2016. [46] W. D. Henshaw, ?On multigrid for overlapping grids,? SIAM Journal on Scien- tific Computing, vol. 26, pp. 1547?1572, Jan. 2005. 185 [47] W. Anderson, R. D. Rausch, and D. L. Bonhaus, ?Implicit/multigrid algo- rithms for incompressible turbulent flows on unstructured grids,? Journal of Computational Physics, vol. 128, no. 2, pp. 391 ? 408, 1996. [48] V. K. Lakshminarayan, J. Sitaraman, B. Roget, and A. M. Wissink, ?Develop- ment and validation of a multi-strand solver for complex aerodynamic flows,? Computers & Fluids, vol. 147, pp. 41?62, Apr. 2017. [49] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, ?A high-performance, portable implementation of the mpi message passing interface standard,? Parallel Com- puting, vol. 22, no. 6, pp. 789 ? 828, 1996. [50] N. H. Decker, V. K. Naik, and M. Nicoules, ?Parallelization of a class of implicit finite difference schemes in computational fluid dynamics,? International Journal of High Speed Computing, vol. 05, pp. 1?50, mar 1993. [51] R. S. Williams, ?What?s next? [the end of moore?s law],? Computing in Science & Engineering, vol. 19, pp. 7?13, mar 2017. [52] E. S. Larsen and D. McAllister, ?Fast matrix multiplies using graphics hardware,? in Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, SC ?01, (New York, NY, USA), pp. 55?55, ACM, 2001. [53] B. P. Pickering, C. W. Jackson, T. R. Scogland, W.-C. Feng, and C. J. Roy, ?Directive-based gpu programming for computational fluid dynamics,? Computers & Fluids, vol. 114, pp. 242 ? 253, 2015. [54] K. Soni, D. D. Chandar, and J. Sitaraman, ?Development of an overset grid computational fluid dynamics solver on graphical processing units,? Computers & Fluids, vol. 58, pp. 1 ? 14, 2012. [55] D. C. Jespersen, ?Acceleration of a cfd code with a gpu,? Scientific Programming, vol. 18, pp. 193?201, 01 2010. [56] E. Nielsen and A. Walden, ?Preparing the FUN3D CFD solver for the exascale era,? in Supercomputing Conference, The International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov 2018. [57] P. Roe, ?Approximate riemann solvers, parameter vectors, and difference schemes,? Journal of Computational Physics, vol. 43, no. 2, pp. 357 ? 372, 1981. [58] R. Borges, M. Carmona, B. Costa, and W. S. Don, ?An improved weighted essentially non-oscillatory scheme for hyperbolic conservation laws,? Journal of Computational Physics, vol. 227, no. 6, pp. 3191 ? 3211, 2008. [59] J. Blazek, Computational Fluid Dynamics : Principles and Applications. Burling- ton: Elsevier, 2006. 186 [60] A. Harten and J. M. Hyman, ?Self adjusting grid methods for one-dimensional hyperbolic conservation laws,? Journal of Computational Physics, vol. 50, no. 2, pp. 235 ? 269, 1983. [61] J. D. Baeder, S. Medida, and T. S. Kalra, OVERTURNS Simulation of S- 76 Rotor in Hover. AIAA SciTech, American Institute of Aeronautics and Astronautics, Jan 2014. [62] Lakshminarayan, V. K., and Baeder, J. D., ?Computational Investigation of Micro Hovering Rotor Aerodynamics,? Journal of the American Helicopter Society, vol. 55, no. 1, pp. 14?29, 2010. [63] Y. S. Jung, B. Govindarajan, and J. Baeder, ?Turbulent and unsteady flows on unstructured line-based hamiltonian paths and strands grids,? AIAA Journal, vol. 55, pp. 1986?2001, jun 2017. [64] G. Klopfer, C. Hung, R. V. der Wijngaart, and J. Onufer, ?A diagonalized diag- onal dominant alternating direction implicit (D3ADI) scheme and subiteration correction,? in 29th AIAA, Fluid Dynamics Conference, American Institute of Aeronautics and Astronautics, jun 1998. [65] A. C. Kirby, M. J. Brazell, Z. Yang, R. Roy, B. R. Ahrabi, M. K. Stoellinger, J. Sitaraman, and D. J. Mavriplis, ?Wind farm simulations using an overset hp-adaptive approach with blade-resolved turbine models,? The International Journal of High Performance Computing Applications, p. 109434201983296, mar 2019. [66] Y. Saad, Iterative Methods for Sparse Linear Systems. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2nd ed., 2003. [67] M. Blanco and D. W. Zingg, ?Fast Newton-Krylov Method for Unstructured Grids,? AIAA Journal, vol. 36, pp. 607?612, April 1998. [68] V. Schmitt and F. Charpin, ?Pressure distributions on the onera-m6-wing at transonic mach numbers,? Experimental Data Base for Computer Program Assessment, vol. 138, 05 1979. [69] Norman, T. R., Shinoda, P., Peterson, R. L., and Datta, A., ?Full-Scale Wind Tunnel Test of the UH-60A Airloads Rotor,? in 67th Annual Forum Proceedings of the American Helicopter Society, May 3?5, 2011. [70] M. Potsdam, H. Yeo, and W. Johnson, ?Rotor airloads prediction using loose aerodynamic/structural coupling,? Journal of Aircraft, vol. 43, pp. 732?742, may 2006. [71] A. Datta, J. Sitaraman, I. Chopra, and J. D. Baeder, ?CFD/CSD prediction of rotor vibratory loads in high-speed flight,? Journal of Aircraft, vol. 43, pp. 1698?1709, nov 2006. 187 [72] L. Wang and D. Mavriplis, ?Implicit solution of the unsteady euler equations for high-order accurate discontinuous galerkin discretizations,? Aerospace Sciences Meetings, American Institute of Aeronautics and Astronautics, Jan 2006. [73] Jung, Y. S., Govindarajan, B., and Baeder, J., ?Unstructured/Structured Overset Methods for Flow Solver Using Hamiltonian Paths and Strand Grids,? AIAA-Paper-2016-1056, 54th AIAA Aerospace Sciences Meeting, January 4?8, 2016. [74] D. M. O?Brien, ?Analysis of computational modeling techniques for complete rotorcraft configurations,? phd. dissertation, School of Aerospace Engineering, Georgia Institute of Technology, 2006. [75] B.-S. Lee, M.-S. Jung, O.-J. Kwon, and H.-J. Kang, ?Numerical simulation of rotor-fuselage aerodynamic interaction using an unstructured overset mesh technique,? International Journal of Aeronautical and Space Sciences, vol. 11, pp. 1?9, Mar. 2010. [76] ?Simulation of unsteady rotor-fuselage aerodynamic interaction using unstruc- tured adaptive meshes,? Journal of the Korean Society for Aeronautical & Space Sciences, vol. 33, pp. 11?21, Feb. 2005. [77] R. Meakin, A. Wissink, W. Chan, and S. Pandya, ?On strand grids for complex flows,? in 18th AIAA Computational Fluid Dynamics Conference, p. 3834, 2007. [78] J. Winslow, H. Otsuka, B. Govindarajan, and I. Chopra, ?Basic understanding of airfoil characteristics at low reynolds numbers (104?105),? Journal of Aircraft, vol. 55, pp. 1?12, 12 2017. [79] D. Escobar, I. Chopra, and A. Datta, ?Aeromechanical loads on a mars coaxial rotor,? AHS Forum, American Helicopter Society, May 2018. [80] A. Datta, J. Bao, O. Gamard, D. Griffiths, L. Liu, G. Pugliese, B. Roget, and J. Sitaraman, ?Design of a martian autonomous rotary-wing vehicle,? Journal of Aircraft, vol. 40, pp. 461?472, 05 2003. 188