ABSTRACT Title of dissertation: RAPID PROTOTYPING OF HIGH PERFORMANCE SIGNAL PROCESSING APPLICATIONS Nimish Sane, Doctor of Philosophy, 2011 Dissertation directed by: Shuvra S. Bhattacharyya (Chair/Advisor) Professor Department of Electrical and Computer Engineering, and Institute for Advanced Computer Studies Andrew Harris (Co-Advisor) Professor Department of Astronomy Advances in embedded systems for digital signal processing (DSP) are enabling many scientific projects and commercial applications. At the same time, these applica- tions are key to driving advances in many important kinds of computing platforms. In this region of high performance DSP, rapid prototyping is critical for faster time-to-market (e.g., in the wireless communications industry) or time-to-science (e.g., in radio astron- omy). DSP system architectures have evolved from being based on application specific integrated circuits (ASICs) to incorporate reconfigurable off-the-shelf field programmable gate arrays (FPGAs), the latest multiprocessors such as graphics processing units (GPUs), or heterogeneous combinations of such devices. We, thus, have a vast design space to ex- plore based on performance trade-offs, and expanded by the multitude of possibilities for target platforms. In order to allow systematic design space exploration, and develop scalable and portable prototypes, model based design tools are increasingly used in design and imple- mentation of embedded systems. These tools allow scalable high-level representations, model based semantics for analysis and optimization, and portable implementations that can be verified at higher levels of abstractions and targeted toward multiple platforms for implementation. The designer can experiment using such tools at an early stage in the de- sign cycle, and employ the latest hardware at later stages. In this thesis, we have focused on dataflow-based approaches for rapid DSP system prototyping. This thesis contributes to various aspects of dataflow-based design flows and tools as follows: 1. We have introduced the concept of topological patterns, which exploits commonly found repetitive patterns in DSP algorithms to allow scalable, concise, and parame- terizable representations of large scale dataflow graphs in high-level languages. We have shown how an underlying design tool can systematically exploit a high-level application specification consisting of topological patterns in various aspects of the design flow. 2. We have formulated the core functional dataflow (CFDF) model of computation, which can be used to model a wide variety of deterministic dynamic dataflow be- haviors. We have also presented key features of the CFDF model and tools based on these features. These tools provide support for heterogeneous dataflow behav- iors, an intuitive and common framework for functional specification, support for functional simulation, portability from several existing dataflow models to CFDF, integrated emphasis on minimally-restricted specification of actor functionality, and support for efficient static, quasi-static, and dynamic scheduling techniques. 3. We have developed a generalized scheduling technique for CFDF graphs based on decomposition of a CFDF graph into static graphs that interact at run-time. Further- more, we have refined this generalized scheduling technique using a new notion of ?mode grouping,? which better exposes the underlying static behavior. We have also developed a scheduling technique for a class of dynamic applications that gen- erates parameterized looped schedules (PLSs), which can handle dynamic dataflow behavior without major limitations on compile-time predictability. 4. We have demonstrated the use of dataflow-based approaches for design and imple- mentation of radio astronomy DSP systems using an application example of a tun- able digital downconverter (TDD) for spectrometers. Design and implementation of this module has been an integral part of this thesis work. This thesis demonstrates a design flow that consists of a high-level software prototype, analysis, and simula- tion using the dataflow interchange format (DIF) tool, and integration of this design with the existing tool flow for the target implementation on an FPGA platform, called interconnect break-out board (IBOB). We have also explored the trade-off between low hardware cost for fixed configurations of digital downconverters and flexibility offered by TDD designs. 5. This thesis has contributed significantly to the development and release of the latest version of a graph package oriented toward models of computation (MoCGraph). Our enhancements to this package include support for tree data structures, and gen- eralized schedule trees (GSTs), which provide a useful data structure for a wide variety of schedule representations. Our extensions to the MoCGraph package pro- vided key support for the CFDF model, and functional simulation capabilities in the DIF package. RAPID PROTOTYPING OF HIGH PERFORMANCE SIGNAL PROCESSING APPLICATIONS by Nimish Sane Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctory of Philosophy 2011 Advisory Committee: Professor Shuvra S. Bhattacharyya, Chair/Advisor Professor Andrew Harris, Co-Advisor Professor Steven Tretter Professor Gang Qu Professor Amitabh Varshney, Dean?s representative c? Copyright by Nimish Sane 2011 Dedication To my parents and grandparents. ii Acknowledgments I express my sincere gratitude toward my advisor Prof. Shuvra S. Bhattacharyya for his academic, research, and professional guidance throughout my PhD research work. He allowed me to pursue my own research interests with great freedom, and enthusiastically supported all of my efforts. This resulted in our research collaboration with the National Radio Astronomy Observatory (NRAO). His expertise in the field of embedded signal processing and model based design tools helped me smoothly get into and learn about this exciting research field. His advice and insights throughout my PhD greatly enriched my research, and helped steer it to stay on track. I have learned a great deal from his research methods and attention to detail. I am grateful to him for providing the necessary motivation when needed. His patience and understanding ensured that my experience was enjoyable. I thank him for giving me excellent research, teaching, and mentoring opportunities. I will always cherish my experience working with him. I am also grateful to my co-advisor Prof. Andrew Harris for his advice on the as- pects of my research related to astronomy as well as academic and professional guidance. His insights into scientific, technical, and logistical aspects have been crucial in shaping my research goals and professional career. He introduced me to the field of radio as- tronomy instrumentation, and encouraged me to pursue my PhD research in that field. His support during my internship at the Owens Valley Radio Observatory (OVRO) and collaboration with the NRAO was very crucial. I will be always grateful to him for that. I also thank other members of my dissertation committee Prof. Steven Tretter, Prof. Gang Qu, and Prof. Amitabh Varshney for their advice and co-operation. Their iii constructive comments and feedback were valuable. The research presented in this thesis was sponsored in part by the National Radio Astronomy Observatory, Agilent Technologies, Inc., Laboratory of Telecommunication Sciences, National Science Foundation, US Air Force Research Laboratory, and Austrian Marshall Plan Foundation. I acknowledge them with thanks for their support. I am extremely thankful to Dr. Richard Prestage of the NRAO, Green Bank, WV, for his initial support and continued guidance in establishing a collaboration between the NRAO and Maryland Digital Signal Processing ? Computer Aided Design (DSPCAD) research group. I am also grateful to Mr. John Ford of the NRAO, Green Bank, WV, for his patient supervision and guidance. He encouraged me to pursue various research direc- tions and provided valuable technical insights. He was very supportive, understanding, and co-operative in dealing with the logistics of the collaboration, which made it a smooth one. I also thank Shilpa Bollineni, Srikanth Bussa, Randy McCullough, Scott Ransom, and Jason Ray of the NRAO for their support. I would like to thank Dr. Chia-Jui Hsu, and Jose` Pino (both of Agilent Technologies, Inc.) for their supervision and collaboration. I acknowledge Dr. Gunasekaran Seetharaman of the US Air Force Research Labo- ratory with thanks for his collaboration. It has been a memorable experience working at the Maryland DSPCAD research group. I am thankful to Dr. William Plishker for his support and guidance throughout my PhD research. I have learned a lot from him while working with him on various collaborative efforts. Some of these efforts have contributed to the research presented in this thesis. I sincerely thank him for the same. I also thank him and Dr. Chung-Ching iv Shen for a number of very interesting conversations that we have had on technical and research issues as well as professional career. I also thank Mary Kiemb, Kapil Anand, Dr. Hojin Kee (now at National Instruments, Inc.), and Shenpei Wu from the Maryland DSPCAD group for their contributions. All the current and past DSPCAD members and friends ? Dr. William Plishker, Dr. Chung-Ching Shen, Dr. Sankalita Saha, Dr. Hojin Kee, Dr. Ruirui Gu, Mary Kiemb, Kapil Anand, Sebastian Puthenpurayil, George Zaki, Hsiang-Huang Wu, Soujanya Kedi- laya, Inkeun Cho, Scott Kim, Kishan Sudusinghe, and Shenpei Wu ? deserve a special mention. They have been of great help throughout my research work in the DSPCAD group. I specially thank to Will, Chung-Ching, Sankalita, and Hojin for their valuable guidance during my initial days as a PhD student. I also thank Will, George, Hsiang- Huang, Shenpei, and Kishan for their help toward the end. I also had an opportunity to mentor some of the high school and undergraduate student members ? Saara Khan, Kelly Davis, Chima Ebinama, Giancarlo Bautista, Rainier Gomez, and Cordell Reid ? in the DSPCAD group. Mentoring them has been a learning experience in itself, and I appreciate their efforts and response. During first two years of my graduate studies at the University of Maryland, College Park, I worked in the departments of Kinesiology, Facilities Management, and Astronomy (at the OVRO). These positions provided not only the much needed financial assistance but also a wealth of experience, which I have cherished throughout my graduate studies. In particular, I would like to acknowledge and thank my supervisors Prof. John Jeka (of the Department of Kinesiology, University of Maryland), and Dr. James Lamb (of the OVRO). Dr. James Lamb facilitated my interactions with many others in the field v of radio astronomy. These interactions, along with his insights and encouragement, have been instrumental in shaping my research interests in the field of radio astronomy signal processing. I will be always grateful to him for that. I would like to thank all the staff of the ECE department, especially that of Graduate Studies, Business, and Payroll offices, for their timely help and useful advice. Though it would be too formal to thank my parents, grandparents, cousins, and every one in the Sane, Bhat, and Dadhich families, I, still, must do it for every one of them has been a source of inspiration and great strength. I thank all of them for always being there. I would like to express my sincere gratitude toward families of Mr. Arun and Late Mrs. Usha Karve, Dr. Ajit and Dr. Asha Kembhavi, Dr. Naresh and Mrs. Sadhana Dadhich, Mr. Mohammad and Mrs. Fatima Khadas, Mr. Jawahar and Mrs. Pushpa Nagori, and Mr. Arun Sawant without whose support and financial assistance the journey to the United States would not have been possible. It is rather unfortunate and sad that my grandmother, Mrs. Usha Karve, is not with us when I complete my doctoral degree. I must also remember and thank all my close friends who have been supportive throughout my graduate studies and are as much a part of life as my family. I am grateful to Mr. Neel and Mrs. Gandhali Phadake, and Mr. Suresh and Dr. Kunda Joshi for their support. I would like to thank Dr. Aniruddha Kembhavi, Avanti Shetye, Dr. Ayan Ghosh, and Amit Apte, who helped me immensely during my initial days in College Park, and continued to guide me throughout my graduate studies. I should thank my room-mates during my stay in College Park ? Kunal, Dhaval, vi Nirankush, Mehul, Sarang, Prashant, Ashwin, Aalap, Omkar, Jagannath, Shreyas, Saurabh, Kaustubh, Harshavardhan, Sumit, and Kamal ? for being such a wonderful family away from home, and of course, bearing with me. I also thank all my friends from College Park, who made my experience at the University of Maryland so enjoyable. I am sure there are many others who have been of great help and support at various stages of my PhD. It may not be possible to list all of them here. I take this opportunity to thank each and every one of them. vii Table of Contents List of Figures xi List of Tables xiv 1 Introduction 1 1.1 High Performance Signal Processing Applications . . . . . . . . . . . . . 1 1.2 Design Tools for Rapid Prototyping . . . . . . . . . . . . . . . . . . . . 4 1.3 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3.1 Topological Patterns for Specification and Analysis of Dataflow Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.3.2 Formulation of the Core Functional Dataflow Model . . . . . . . 11 1.3.3 Efficient Scheduling Techniques for Core Functional Dataflow Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.3.4 Dataflow-based Rapid Prototyping for Radio Astronomy Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.3.5 Development and Release of the latest version of the MoCGraph Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4 Organization of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2 Background 18 2.1 Dataflow Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.1.1 Synchronous Dataflow . . . . . . . . . . . . . . . . . . . . . . . 20 2.1.2 Cyclo-static Dataflow . . . . . . . . . . . . . . . . . . . . . . . . 21 2.1.3 Parameterized Dataflow . . . . . . . . . . . . . . . . . . . . . . 22 2.2 The Dataflow Interchange Format . . . . . . . . . . . . . . . . . . . . . 27 2.3 Generalized Schedule Trees . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.4 Graph Package Oriented toward Models of Computation . . . . . . . . . 30 2.4.1 MoCGraph Software Architecture and Features . . . . . . . . . . 31 2.4.2 mocgsched Plug-in . . . . . . . . . . . . . . . . . . . . . . . . 33 2.5 DICE: The DSPCAD Integrative Command-Line Environment . . . . . . 33 2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3 Topological Patterns for Specification and Analysis of Dataflow Graphs 36 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3 Topological Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3.1 Topological Patterns in Signal Processing . . . . . . . . . . . . . 42 3.3.2 Parameter Propagation . . . . . . . . . . . . . . . . . . . . . . . 44 3.4 Topological Patterns in DIF . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.5 Applications of Topological Patterns . . . . . . . . . . . . . . . . . . . . 47 3.5.1 Graph Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.5.2 Representing HSDF Graphs . . . . . . . . . . . . . . . . . . . . 49 3.5.3 Extracting Implementation-Specific Features . . . . . . . . . . . 52 viii 3.5.4 Exploring Implementation Trade-offs . . . . . . . . . . . . . . . 54 3.5.5 Representing Schedules . . . . . . . . . . . . . . . . . . . . . . 56 3.5.6 Experimenting with Pattern-Specific Schedules . . . . . . . . . . 60 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4 Prototyping Heterogeneous Dataflow Applications using Core Functional Dataflow 67 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.2 Formulation of Core Functional Dataflow . . . . . . . . . . . . . . . . . 69 4.3 Modeling Adaptive Modulation Scheme using Core Functional Dataflow . 71 4.4 Translation to Core Functional Dataflow . . . . . . . . . . . . . . . . . . 74 4.4.1 Static Dataflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.4.2 Boolean Dataflow . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4.3 Representing PSDF and PCSDF Actors using CFDF . . . . . . . 77 4.5 Functional Simulations in DIF using Core Functional Dataflow . . . . . . 79 4.5.1 Extensions to DIF Software Architecture . . . . . . . . . . . . . 80 4.5.2 Canonical Scheduler . . . . . . . . . . . . . . . . . . . . . . . . 81 4.6 Design Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.6.1 Programmable Polynomial Evaluation Accelerator . . . . . . . . 82 4.6.2 Design with Multiple Polynomial Evaluation Accelerators . . . . 83 4.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5 Efficient Scheduling Techniques for Core Functional Dataflow Graphs 88 5.1 Scheduling using Dynamic Dataflow Graph Decomposition . . . . . . . . 88 5.1.1 Dynamic Dataflow Graph Decomposition Algorithm . . . . . . . 89 5.1.2 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.2 Scheduling using Mode Grouping . . . . . . . . . . . . . . . . . . . . . 95 5.2.1 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.3 Parameterized Looped Schedules . . . . . . . . . . . . . . . . . . . . . . 97 5.3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.3.2 Constructing Parameterized Looped Schedules . . . . . . . . . . 100 5.3.3 Bounded Memory Execution . . . . . . . . . . . . . . . . . . . . 102 5.3.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6 Dataflow-based Rapid Prototyping for Radio Astronomy Signal Processing 110 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.3 Tunable Digital Downconverter . . . . . . . . . . . . . . . . . . . . . . . 114 6.4 Dataflow-based Design and Implementation . . . . . . . . . . . . . . . . 117 6.4.1 Modeling and Prototyping using DIF . . . . . . . . . . . . . . . 118 6.4.2 Integration with the CASPER Tool Flow . . . . . . . . . . . . . . 125 6.4.3 Platform-specific Analysis using DIF . . . . . . . . . . . . . . . 128 6.4.4 Software Interface for the Tunable Digital Downconveter . . . . . 129 6.5 Exploring Implementation Trade-off with TDD and FDD Blocks . . . . . 130 ix 6.5.1 TDD and FDD for Multistage Downconversion . . . . . . . . . . 131 6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 7 Summary and Conclusions 134 8 Future Work 139 Bibliography 142 x List of Figures 1.1 Design flow using a model based approach. . . . . . . . . . . . . . . . . 6 1.2 DIF based design flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1 An SDF graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2 An application graph with a simple decimator actor M using the (a) CSDF, and (b) SDF models. Actor M is a decimator with a decimation factor of 4. 22 2.3 Modeling a parameterized decimation filter (DF) application using PCSDF: (a) Application graph ? CN denotes a vector of FIR filter coefficients, and D denotes a decimation factor, and (b) PCSDF representation. . . . . 25 2.4 The DIF language. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.5 Generalized schedule tree. . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.6 A hierarchy of classes in MoCGraph that implements various types of graph data structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.1 Configuring an array of nodes with a PPP. . . . . . . . . . . . . . . . . . 45 3.2 Overlapping topological patterns. . . . . . . . . . . . . . . . . . . . . . . 46 3.3 A sample rate converter. . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4 Dataflow graph for a 4-point fast Fourier transform and the topology block in its DIF specification. . . . . . . . . . . . . . . . . . . . . . . . . 52 3.5 JPEG encoder and the topology block in its DIF specification. . . . . . 53 3.6 Dataflow graphs for (a) the generic class of applications under considera- tion, and (b) a simplified adaptive modulation scheme. . . . . . . . . . . 57 xi 3.7 A PLS for the application in Fig. 3.6(b), and the topology block in a corresponding DIF representation. Table 3.2 provides parameters associ- ated with each node in the DIF specification. . . . . . . . . . . . . . . . 58 3.8 (a) An SDF graph with a butterfly pattern. (b)-(c) two possible GST structures using schedules that are based on acyclic pairwise clustering (iteratively clustering two actors at a time). . . . . . . . . . . . . . . . . . 61 3.9 (a)-(b) SDF graphs with butterfly patterns. (c)-(d) GSTs for min- imizing buffer memory requirements of the SDF graphs in (a) and (b), respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.1 Mode transition behavior of the mapper actor. . . . . . . . . . . . . . . 73 4.2 Mode transition behavior of the decimator actor. . . . . . . . . . . . . . . 76 4.3 Application of BDF using a Switch actor. . . . . . . . . . . . . . . . . 77 4.4 Mode transition behavior of the Switch actor. . . . . . . . . . . . . . . 78 4.5 CFDF mode transition behavior of a PCSDF actor Decimatorwith pos- sible decimation factors 3 and 4. . . . . . . . . . . . . . . . . . . . . . . 79 4.6 A pictorial representation of the dual PEA application. . . . . . . . . . . 85 4.7 Single appearance schedule for the dual PEA system. . . . . . . . . . . . 86 4.8 Multiple appearance schedule for the dual PEA system. . . . . . . . . . . 86 5.1 Algorithm for dynamic dataflow graph decomposition. . . . . . . . . . . 90 5.2 Application decomposition example. . . . . . . . . . . . . . . . . . . . . 92 5.3 Dual sampling rate conversion. . . . . . . . . . . . . . . . . . . . . . . . 93 5.4 The APGAN schedule of the sample rate conversion application. . . . . . 94 xii 5.5 Valid PLS for the application in Fig. 3.6(b). . . . . . . . . . . . . . . . . 101 6.1 Block diagram of a tunable digital downconverter. . . . . . . . . . . . . . 116 6.2 Proposed dataflow-based approach. . . . . . . . . . . . . . . . . . . . . . 117 6.3 Dataflow behavior of a Decimator actor with 4 inputs and outputs for a decimation factor of 6 using (a) SDF, and (b) CSDF models. . . . . . . 119 6.4 TDD application graph generated using DIF. . . . . . . . . . . . . . . . . 121 6.5 Partial DIF specification ? topology block ? for the TDD application graph using topological patterns. . . . . . . . . . . . . . . . . . . . . . . 122 6.6 Partial DIF specification ? topology block ? for the TDD application graph using topological patterns. . . . . . . . . . . . . . . . . . . . . . . 123 6.7 PLSs for the TDD application configured for a decimation factor of 11, and decimator actor employing the (a) PSDF and (b) PCSDF models of computation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.8 TDD System overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 6.9 Two-stage digital downconversion. . . . . . . . . . . . . . . . . . . . . . 132 xiii List of Tables 2.1 Interactions among PDF subsystem components, dataflow inputs and out- puts, and parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.1 Performance and resource utilization trade-offs for FPGA implementa- tion of a JPEG encoder. . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.2 Actors and loop counts associated with nodes in the PLS graph represen- tation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.3 Average simulation times for different sink control conditions (numbers of tokens consumed by the sink) for the PLS in Fig. 3.7 using (1) GST traversal, and (2) a hand-tuned pattern-specific schedule. . . . . . . . . . 60 3.4 Buffer memory requirements for single appearance schedules generated from the SDF graph shown in Fig. 3.9(a). . . . . . . . . . . . . . . . . . 65 3.5 Buffer memory requirements for single appearance schedules generated from the SDF graph shown in Fig. 3.9(b). . . . . . . . . . . . . . . . . . 65 4.1 Valid modes for the mapper actor along with their corresponding pro- duction and consumption rates. . . . . . . . . . . . . . . . . . . . . . . . 73 4.2 Valid modes in a CFDF representation of the decimator actor, M, in Fig. 2.2(a) along with their corresponding production and consumption values. . . . 75 4.3 The behavior of the Switch actor modes in terms of tokens produced and consumed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.4 The behavior of the PEA modes. . . . . . . . . . . . . . . . . . . . . . . 82 xiv 4.5 The behavior of the CSDF implementation of the restricted PEA used in the dual PEA application. . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.6 Simulation times and maximum buffer sizes for mixed-model applications. 87 5.1 Simulation times and maximum buffer sizes for mixed-model applica- tions using dynamic dataflow graph decomposition based schedules. . . . 95 5.2 Total buffer size requirements with and without mode grouping. . . . . . 97 5.3 Average simulation time for different sink control conditions (numbers of tokens consumed by the sink actor) using a canonical schedule (CS), DDFS, and PLS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.4 Total buffer requirements for different sink control conditions (numbers of tokens consumed by the sink) for a canonical schedule (CS), DDFS, and PLS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.1 Total buffer requirements from a DIF prototype for different decimation factors using parameterized looped schedules. . . . . . . . . . . . . . . . 125 6.2 Implementation summary for TDD designs. . . . . . . . . . . . . . . . . 127 6.3 Implementation summary for FDD designs. . . . . . . . . . . . . . . . . 131 6.4 Implementation summary for designs employing two-stage downconver- sion using cascaded FDF or TDF blocks. . . . . . . . . . . . . . . . . . . 133 xv Chapter 1 Introduction 1.1 High Performance Signal Processing Applications The field of signal processing has expanded to cover a wide range of application domains, such as image and video processing, acoustic and speech processing, wireless communication, software-defined radio, astronomical signal processing, biomedical sig- nal processing, and medical imaging, to name a few, and it will continue to expand in the future. Each of these application domains deals with different types of signals that can be characterized by, for example, the signal sources, methods and speed of signal acquisition, signal strength in terms of signal to noise ratio, and frequency content of the signals. A signal processing algorithm processes these signals to derive useful in- formation. Apart from varying application specifications, there are certain performance metrics, such as the need for real-time signal processing, speed of real-time or offline pro- cessing, resource utilization, power consumption, and cost, that provide complex design spaces for exploration, and help characterize and optimize the overall quality of a design. These requirements may be domain-specific. For example, study of certain astronomical objects or observations at certain wavelengths would determine specifications for a ra- dio telescope and its signal processing backend; a wireless communication standard may determine signal processing requirements, while its deployment in a consumer product may constrain specifications such as cost and time-to-market; specifications of a video 1 surveillance system may be determined by its expected target-tracking capability; and so on. A designer of any such signal processing system explores the design space to deter- mine if and how the required system can be efficiently implemented using the present state of the art hardware platforms and design tools. Signal processing hardware manufacturers often classify their hardware that can be used to achieve maximum performance with re- spect to certain performance metrics as ?high performance? signal processors. Given the evolving nature of hardware platforms, architectures, and their processing capabilities, the term ?high performance? represents the extreme end of performance achievable or desirable under a given cost constraint using the state of the art at a given point in time. A signal processing application, which has specifications requiring the corresponding signal processing system to deliver maximal performance, possibly under pre-specified resource constraints, can be classified as a high performance signal processing application. This is in contrast, for example, to a commercial signal processing system that must be designed to minimize cost subject to a given performance constraint. Radio astronomy digital signal processing (DSP) is an example of high performance signal processing. Scientific objectives drive radio astronomy telescopes toward high- speed samplers and corresponding DSP backends that need to process large volumes of data at high data rates. The single dish Green Bank Telescope (GBT) [68], for example, is used for pulsar searches and high-precision timing studies, which drive bandwidth re- quirements to extremes along with the computing needs to implement techniques such as coherent dedispersion. The new GBT spectrometer being conceived will have a channel bandwidth of 3 GHz (compared to the current 800 MHz), but must also support zooming 2 into tunable narrow bands. Many of the recent radio telescopes that have been constructed and the ones that have been conceived for construction over the next decade clearly show a trend toward synthesis array design [4]. These include telescope arrays such as the Square Kilometre Array (SKA) [23]; its precursors, the Karoo Array Telescope (MeerKAT) [40], and Australian SKA Pathfinder (ASKAP) [22] being built in South Africa, and Australia, respectively; and the Long Wavelength Array (LWA) in New Mexico [27]. In the case of synthesis array telescopes, the amount and speed of data to be processed increase due to the bandwidth required per antenna as well as the large numbers of antennas that are involved. The SKA will employ a large number of antennas of smaller diameters to be able to conduct ultrasensitive surveys over large area of the sky, and its precursor arrays being developed already pose challenging DSP applications in terms of processing enor- mous amounts of high-speed data [4]. The signal processing requirements for radio arrays (for example, the complexity of a correlator) scale quadratically or with higher order in the number of elements in the array. The performance metrics for these DSP backends require handling of enormous amounts of data in real time at very high rates (on the order of petabit per second). At the same time, the power and cost need to be minimized by orders of magnitudes compared to the present estimates to provide realistic solutions. Some important commercial applications that involve high performance DSP in- clude dynamic communication systems applications related to modern wireless technolo- gies, such as the worldwide interoperability for microwave access (WiMAX) [3] and 3rd generation partnership project ? long term evolution (3GPP?LTE) [1]. The challenges posed by these applications are manifold including greater data rates, requirements for supporting dynamic configurability, need for faster simulations, and deployment in hand- 3 held devices. Some other domains of high performance signal processing include soft- ware defined radio, multimedia processing, and medical imaging. Applications such as these, especially those related to DSP for radio astronomy and wireless communication systems, have driven the work presented in this thesis. 1.2 Design Tools for Rapid Prototyping The interaction between the evolving nature of high performance hardware plat- forms and signal processing applications will be more clear if one observes the trends in a particular application domain. For example, in the field of radio astronomy signal processing, the conventional approach has been to develop optimized custom hardware using application specific integrated circuits (ASICs) (e.g., see [20] for an ASIC solu- tion to SKA DSP). Such designs, however, are not scalable, reconfigurable, or portable. Moreover, the design, development, and deployment process tends to be much longer and more costly than that for some of the reconfigurable hardware available. To account for these factors, DSP solutions that use field programmable gate array (FPGA) based reconfigurable hardware and modular software libraries have been developed (e.g., see the approach used by the Collaboration for Astronomy Signal Processing and Electronics Research (CASPER) group [55]). Recent years have seen the emergence of a large va- riety of computing platforms having general purpose or specialized muticore processors, such as graphics processing units (GPUs), the Cell, and a variety of processors by ARM, Tilera, and Intel [14]. Such platforms are gaining popularity within the radio astronomy domain (e.g., see [31, 81]). Moreover, it is quite possible that a DSP solution that uses a 4 combination of more than one hardware architecture for performing different signal pro- cessing tasks may perform better under a given set of constraints than a solution that uses only one kind of platform. We, thus, have a complex design space to explore based on performance trade-offs (e.g., throughput, latency, power, cost, etc.), and expanded by the multitude possibilities of target platforms. A key to efficient implementation is the ability of design processes and tools to allow the designer to explore the design space effectively at a high-level, and make informed design choices at an early stage rather than trying to alter the design in major ways after having a platform-specific implementation. The need for rapid pro- totyping is of particular importance for high performance signal processing applications because it allows designers to focus on functionality and functional validation in early stages of design, and decide on target platforms in later stages. A tool for designing a high performance signal processing system should, therefore, allow scalable high-level representations, semantics for analysis and resource estimation, functional verification, and portable implementations that can be reconfigured and re- targeted toward the latest hardware technologies. This is possible using a model based design approach as shown in Fig. 1.1 that is founded in a particular model of computation (MoC). Model based design approaches for design and implementation of embedded sys- tems for DSP applications continues to be an active and expanding research area both in the industry as well as academia due to ever expanding application domains and markets for such systems (e.g., see [8]). Model based design methods are extensively used in this field to make the design process streamlined, productive, robust, and efficient. As shown in Fig. 1.1, the application specification is a high-level specification used 5 Models of Computation Application Specification DSP Algorithms High-level Languages Functional VerificationModule Library System-level Verification Software or Hardware SynthesisModule Library Deployment of Platform-Level Implementation Figure 1.1: Design flow using a model based approach. to specify only the design details that are necessary for establishing functional correct- ness. The high-level language used for such a specification provides syntactic features and has semantic foundations that establish the underlying choice of the MoC. A MoC provides semantics (meaning) for the interaction between functional modules in a system, and strongly influences all the significant design processes ? specification, simulation, formal verification, implementation/optimization, and interfaces to environmental or ex- ternal systems. A MoC essentially tries to capture the designer?s intuition and effectively translates it into a model (e.g., the model that underlies ?C? and other imperative lan- guages, finite state machines, Kahn process networks, dataflow, synchronous/reactive, discrete-event, etc.) Model based design is important area for innovation in domain- specific technology and design research. To facilitate analysis of and mapping from a model-based specification, the specifi- cation is typically converted into an intermediate representation that can be used by design 6 Figure 1.2: DIF based design flow. tools during subsequent design stages. For example, in dataflow-based design, a DSP ap- plication flowgraph specification is translated into a directed graph with nodes represent- ing functional modules and edges representing first-in-first-out communication buffers between pairs of functional modules. The subsequent stages utilize this representation for further analysis and optimization. It should be noted that module libraries that contain functional code for the modules also adhere to the prespecified interface and semantics of the selected MoC. This allows easier transcoding among various platform-specific lan- guages, while the glue code for these modules remains platform-agnostic. This allows developing platform-independent functional prototypes that can be used for functional and system level verification as well as portability across different kinds of hardware technologies. The work presented in this thesis primarily focuses on the dataflow MoC, which is 7 extensively used for developing embedded systems for DSP and communication appli- cations, and electronic design automation. Dataflow-oriented DSP design tools typically allow high-level application specification, software simulation, and possibly synthesis for hardware or software implementation. Chapter 2 provides relevant background on dataflow modeling. There are various existing design tools with their semantic founda- tions in dataflow modeling such as Ptolemy [59], PeaCE [44], SysteMoc [32], StreamIt [79], CAL [26], Compaan/Laura [77], and DIF [36]. In this work, the dataflow interchange for- mat (DIF) tool has been primarily used for demonstration along with other tools such as the Advanced Design System (ADS) from the Agilent Technologies, Inc. [60] and Xilinx System Generator (XSG) [85] (See Chapter 2 for a brief description of the DIF language and package). Fig. 1.2 provides a pictorial representation of the general DIF based design flow. This thesis significantly contributes to the components of the design flow illustrated in Fig. 1.2 that have been highlighted with a gray background. It can be seen from this figure (and also comparing with the generic model based design flow in Fig. 1.1) that this thesis addresses most of the aspects of a complete model based design flow. 1.3 Contributions of the Thesis As mentioned earlier, this thesis deals with various aspects of a dataflow-oriented, model based design approach. This section lists the important contributions of this thesis. The research presented in this thesis, though demonstrated using specific design tools, is not restricted to those tools, and hence, can be applied to a variety of other dataflow- based design tools. Also, the applications presented, though driving this research, are 8 demonstrative, and the prototyping methods can be extended or adapted to other relevant DSP applications. 1.3.1 Topological Patterns for Specification and Analysis of Dataflow Graphs Tools for designing signal processing systems with their semantic foundation in dataflow modeling often use high-level graphical user interfaces (GUIs) or text based languages that allow specifying applications as directed graphs. Such graphical repre- sentations serve as initial reference points for further analysis and optimizations that lead to platform-specific implementations. For large-scale applications, the underlying graphs often consist of smaller substructures that repeat multiple times. To enable more concise representation and direct analysis of such substructures in the context of high-level DSP specification languages and design tools, we have introduced and developed the model- ing concept of topological patterns [71]. Topological patterns can be used to identify and concisely iterate across arbitrary structures in a dataflow application graph. We have shown how the types of flowgraph substructures that are pervasive in the DSP application domain can be effectively represented in terms of topological patterns, and thereby used to generate compact, scalable application representations. Some earlier research efforts have employed useful techniques for deriving and ex- ploiting various types of specialized dataflow substructures within their respective com- pilers (e.g., use of highly expressive constructs from procedural languages, such as re- currences, iteration, and conditionals, in dataflow-oriented languages [45], and various 9 textual languages for DSP system design, such as SILAGE [82], StreamIt [79], and CAL [26]). They, however, lack a general method for explicit and scalable representation of such substructures by the programmer. Such a programming interface for topologi- cal patterns is essential to capture the broad range of relevant patterns in ways that are scalable, and flexibly extensible to accommodate new types of patterns as they emerge from new applications and modeling techniques. Our concept of topological patterns is designed precisely to bridge this gap. In other prior work, higher-order functions have been shown to permit elegant construction of structured subsystems in dataflow represen- tations [48]. Higher-order functions are functions that take functions as inputs or produce functions as outputs. Topological patterns provide a related but technically different ap- proach since topological patterns operate on generic directed graph vertices. Furthermore, our development of topological patterns is tightly integrated with textual graph represen- tation and arrays of graph vertices and edges, which are useful for providing scalable representations and managing large-scale designs. We have shown how an underlying design tool can exploit a high-level applica- tion specification consisting of topological patterns in various aspects of the design flow. In particular, we have demonstrated the efficacy of topological patterns in dataflow graph analysis, concise and scalable representation of homogeneous synchronous dataflow (HSDF) graphs, and exploring implementation-specific trade-offs. We have applied the concept of topological patterns to represent schedules for application graphs. Such representations are useful, for example, when porting schedules generated using one design tool to other platform-specific tools or design languages. We have demonstrated the utility of experi- mentation with pattern-specific scheduling transformations, and how topological patterns 10 facilitate such experimentation. 1.3.2 Formulation of the Core Functional Dataflow Model For a number of years, dataflow models have proven invaluable for application ar- eas such as DSP. Their graph-based formalisms allow designers to describe applications in a natural yet semantically rigorous way. Such a semantic foundation has permitted the development of a variety of analysis tools. As a result, dataflow languages are in- creasingly popular. Their diversity, portability, and intuitive appeal have extended them to many application areas with a variety of targets (e.g., see [73], [34], and [61]). As system complexity and the diversity of components in digital signal processing platforms increases, designers are expressing more types of behavior in dataflow languages to retain these implementation benefits. This has resulted in evolution of various dataflow models with varying degrees of expressive power. On one extreme lies the synchronous dataflow (SDF) model, which is the most restrictive form of dataflow [46] that is in widespread use in the DSP design community. On the other hand, Turing complete models such as boolean dataflow (BDF) [16] can express deterministic dynamic dataflow behaviors (i.e., dynamic behaviors in which a given set of input streams always produces a unique set of output streams). There exist numerous dataflow models with intermediate levels of expressive power, such as the cyclo-static dataflow (CSDF) model [13], which allows statically specified periodic dataflow behavior, and a meta-modeling technique called pa- rameterized dataflow (PDF), which allows limited forms of data-dependent dynamic be- havior [6]. Each of these models also offer differing capabilities to analyze and estimate 11 resources at compile-time. The designer must explore this design space to find the model that can best capture the application behavior and allow sufficient analysis. While the semantic range of dataflow has expanded to cover quasi-static and dy- namic interactions, it is often challenging to map such interactions reliably and effi- ciently into implementations. The recently developed core functional dataflow (CFDF) model can be used to model a wide variety of deterministic dynamic dataflow behav- iors [64], and thereby helps to unify the processes of analysis and scheduling of quasi- static and dynamic dataflow interactions. CFDF supports flexible and efficient proto- typing of dataflow-based application representations and permits natural description of both dynamic and static dataflow actors. I have significantly contributed to the concep- tion and development of the CFDF MoC. This thesis presents its mathematical formu- lation and modeling features. This model provides an interface for actor specification that is intuitive for a DSP system designer and allows specification of heterogeneous dataflow applications. We have applied this model to develop various application proto- types that demonstrate its effectiveness for rapid DSP system prototyping, as presented in [64] and [66]. The CFDF model, and tools based on it, has the following unique set of features: 1) support for heterogeneous dataflow behaviors, 2) intuitive and common frameworks for functional specification, 3) support for functional simulation, 4) portability from many well-known dataflow models to CFDF, 5) integrated emphasis on minimally-restricted specification of actor functionality, and 6) support for efficient static, quasi-static, and dynamic scheduling techniques. These features distinguish CFDF from a related model called stream based functions (SBF) [41], and other frameworks such as Ptolemy II, which 12 offers diverse MoCs [25], SystemC [32], actor-oriented languages like CAL [26], ?S- functions? in Simulink [51], and GUI based tools like LabVIEW [39]. 1.3.3 Efficient Scheduling Techniques for Core Functional Dataflow Graphs The problem of scheduling dynamic dataflow applications has been studied earlier, and important results have been established regarding bounded memory execution and compile-time scheduling (e.g., see [16, 53]). Most of these approaches employ schedul- ing schemes that suffer from significant run-time overhead, difficulties in code gener- ation, and lack of compile-time predictability (e.g., for validating real-time signal pro- cessing performance). We have developed generalized scheduling techniques for CFDF graphs based on decomposition of a CFDF graph into static graphs that interact at run- time [62]. Furthermore, we have conceived the mode grouping based scheduling tech- nique that achieves more efficient simulations compared to previous approaches for dy- namic dataflow applications [63]. The scheduling techniques for generalized CFDF graphs in [62] and [63], how- ever, do not in general guarantee bounded memory execution for the entire input applica- tion. For dynamic communication applications that exhibit a particular kind of dataflow graph structure, I have developed a technique to construct parameterized looped sched- ules (PLSs). Well-constructed PLSs allow expressing dynamic dataflow behavior and enabling faster simulations without significantly compromising compile-time predictabil- ity [70]. This class of quasi-static schedules allows for flexible, compact specification of nested loop structures, where loop iteration counts can be either constant values or 13 symbolic expressions in terms of dynamic parameters in the underlying dataflow graph. While it may be possible to use a meta-modeling technique called PDF (e.g., by integrat- ing it with SDF to give a model called parameterized SDF (PSDF)) to express limited forms of dynamic behavior and construct PLSs [6], such an approach imposes significant restrictions on how applications are modeled (e.g., in terms of hierarchies of cooperating init, subinit, and body graphs [6]), and in general, major changes in the user interface are required to provide direct support for PDF in a design tool. In contrast, a CFDF based approach provides PLS-based bounded memory scheduling while operating within a se- mantic framework that can be integrated more directly into existing design tools compared to the more hierarchical semantic structure of PDF representations. This thesis presents analysis and scheduling techniques for CFDF graphs based on those in [62], [63], and [70], which I have contributed to. 1.3.4 Dataflow-based Rapid Prototyping for Radio Astronomy Signal Pro- cessing Application of dataflow modeling to the field of radio astronomy DSP is a signif- icant contribution of this thesis. The model based approach for designing large scale signal processing systems with focus on radio telescopes has been previously studied (e.g., see [2, 49]). Several frameworks have been proposed for model based high-level abstractions of architectures along with performance/cost estimation to guide the de- signer throughout the development cycle (e.g., see [2]). However, the focus of these approaches has been on architecture exploration. There have also been attempts to derive 14 implementation-level specifications starting from system-level specifications by segregat- ing signal processing and control flow into an application specification and architecture specification, respectively (e.g., see [49]). However, the choice of models of computation has been made primarily from control flow considerations rather than dataflow consider- ations. These approaches, though relevant, do not specifically address the issue of high- level application specification for platform-independent prototyping and use of models of computation for abstraction of heterogeneous or hybrid dataflow behaviors. This issue is critical to efficient prototyping of high performance signal processing applications, which are typically dataflow dominated, and include increasing levels of dynamic dataflow be- havior (e.g., see [8]). I demonstrate the use of dataflow-based approaches for design and development of radio astronomy DSP systems using an application example of a tunable digital downcon- verter (TDD) for spectrometers at the National Radio Astronomy Observatory (NRAO), Green Bank. Design and development of this module has been an integral part of this the- sis work. This thesis demonstrates a design flow that consists of (i) application specifica- tion and modeling using parameterized SDF and CSDF models, (ii) software prototyping, analysis, and simulation using the DIF tool, and (iii) integration of this benchmark design with the current CASPER tool flow for the target implementation on an FPGA platform, called the interconnect break-out board (IBOB) [55]. My experiments show how formal understanding of the dataflow behavior from the software prototype allows more efficient prototyping along with estimating and accounting for some of the key resource require- ments (e.g., throughput, hardware duplication, pipelining, buffer memory requirements) at much earlier stages in the design cycle, unlike conventional design approaches. 15 1.3.5 Development and Release of the latest version of the MoCGraph Package The DIF package makes use of and extends a graph package oriented toward MoC (MoCGraph). MoCGraph is a Java-based package of generic graph data structures and algorithms with emphasis on supporting graph-theoretic analysis for MoCs. MoCGraph has evolved from the graph package in Ptolemy II [25]. This thesis has contributed significantly to the development and release of the latest version of the MoCGraph package. My contributions to MoCGraph include support for tree data structures, and generalized schedule trees (GSTs). A GST is a data structure used for representing schedules of dataflow graphs [42]. This extension to the MoCGraph package has provided important support for the CFDF model, and functional simulation in the DIF package. Chapter 2 provides a detailed description of the MoCGraph and DIF packages, and the GST data structure. 1.4 Organization of Thesis The remainder of this thesis is organized as follows. Chapter 2 provides background relevant to the research presented in this thesis. It specifically deals with foundations of dataflow modeling, and features of the MoCGraph and DIF packages as well as the GST data structure. Chapter 3 describes the concept of using topological patterns for speci- fication of dataflow graphs and its application to analysis of dataflow graphs. Chapter 4 discusses formulation of CFDF model, its connections to some of the existing dataflow models, and its use in modeling dynamic dataflow applications. Chapter 5 describes var- 16 ious scheduling techniques for efficient simulation of CFDF graphs. Chapter 6 demon- strates the application of dataflow modeling techniques to radio astronomy DSP systems using the TDD application. A summary of findings, and conclusions from this work are presented in Chapter 7, while Chapter 8 lists useful directions for future research. 17 Chapter 2 Background 2.1 Dataflow Modeling Dataflow modeling involves representing an application using a directed graph G(V,E), where V is a set of vertices (nodes) and E is a set of edges. Each vertex u ? V in a dataflow graph is called an actor, and represents a specific computational block, while each directed edge (u, v) ? E represents a first-in-first-out (FIFO) buffer that provides a communication link between the source actor u and the sink actor v. A dataflow graph edge e can also have a non-negative integer delay, del(e), associated with it, which represents the number of initial data values (tokens) present in the associated buffer. Dataflow graphs operate based on data-driven execution, where an actor can be executed (fired) whenever it has sufficient amounts of data (numbers of ?samples? or data ?tokens?) available on all of its inputs. In the context of a dataflow graph, a source actor is an actor in the topology that has no input edges (for example, actor W in Fig. 2.1), and sink actor is an actor in the topology that has no output edges (for example, actors Y and Z in Fig. 2.1). During each firing, an actor consumes a certain number of tokens from each in- put and produces a certain number of tokens on each output. When these numbers are constant (over all firings), we refer to the actor as an SDF actor [46]. For an SDF actor, the numbers of tokens consumed and produced in each actor execution are referred to as 18 the consumption rate and production rate of the associated input and output, respectively. If the source and sink actors of a dataflow graph edge are SDF actors, then the edge is referred to as an SDF edge, and if a dataflow graph consists of only SDF actors, and SDF edges, the graph is referred to as an SDF graph. For a dataflow graph edge e, src(e) and snk(e), denote its source and sink actors, and if e is an SDF edge, then prd(e) denotes the production rate of the output port of src(e) that is connected to e, and similarly, cns(e) denotes the consumption rate of the input port of snk(e) that is connected to e. A static schedule for a dataflow graph G is a sequence of actors in G that represents the order in which actors are fired during an execution of G. Usually, production and consumption information ? in particular, the number of tokens produced and consumed (production/consumption volume) ? by individual fir- ings is characterized in terms of individual input and output ports so that each port of an actor can in general have a different production or consumption volume characterization. Such characterizations can involve constant values as in SDF [46] (as described above); periodic patterns of constant values, as in CSDF [13]; or more complex forms that are data-dependent (e.g., see [16, 6, 64]). A meta-modeling technique called PDF allows limited forms of dynamic behavior [6] in terms of run-time changes to dataflow graph pa- rameters. The BDF [16] and CFDF [64] models are highly expressive (Turing complete) dynamic dataflow models. The following sections elaborate more on SDF, CSDF, and PDF models. 19 2.1.1 Synchronous Dataflow An SDF graph is characterized by its compile-time predictability through the stati- cally known consumption and production rates, as defined above. Fig. 2.1 shows a simple SDF graph having actors W, X, Y, and Z. Each edge is annotated with the number of to- kens produced on it by the source actor and that consumed from it by the sink actor during every invocation of the source and sink actors, respectively. For example, actor X can be fired when there are at least two tokens on its input. Whenever actor X is fired, it con- sumes two tokens from its input buffer, and produces three tokens onto the output buffer connected to Y and two tokens onto the output buffer connected to Z. In case of SDF graphs, it is possible to construct a periodic schedule that repeats itself during the application execution. For an SDF graph G(V,E), each actor u ? V fires exactly q(u) times in a periodic schedule, where q(u) is its repetition count obtained by solving the balance equation q(src(e))? prd(e) = q(snk(e))? cns(e) (2.1) for each edge e ? E. For example, repetition counts for actors W, X, Y, Z in the SDF graph shown in Fig. 2.1 are 2, 1, 3, and 1, respectively. One of the ways to execute this SDF graph is to fire the actors in the order WWXYYYZ. This sequence represents one of the valid schedules for this SDF graph and can also be represented as an equivalent looped schedule (2 W)X(3 Y)Z. More information on SDF graphs, conditions for having a valid schedule for an SDF graph, and various techniques of scheduling SDF graphs can be found in [11]. 20 W X1D 2 1 Y 13 Z2 2 Figure 2.1: An SDF graph. 2.1.2 Cyclo-static Dataflow Many signal processing applications involve behaviors in which production and consumption rates may change during run-time. In some cases, these changes may, however, be known at compile-time. For example, consider the CSDF graph shown in Fig. 2.2(a), which has a decimator actor M in it. This actor consumes one token from its input on each invocation, but produces a token onto its output only on every fourth invo- cation. This behavior has been depicted using the varying production volumes denoted by [1 0 0 0]. The numbers of tokens produced by the decimator M follow this cyclic pattern with a period of 4. This sequence of varying production volumes, though not leading to constant output rates like an SDF actor, is still completely deterministic and known at the compile-time. This kind of dataflow behavior, where actors exhibit token production and consumption volumes (in terms of tokens per firing on specific actor ports) that are either constant or expressible as cyclic sequences of constant volumes, is referred to as CSDF. Thus, CSDF can be viewed as a generalization of SDF in which token production and consumption volumes may be different across different firings of an actor, but follow cyclic patterns that are completely specified at the compile-time. We refer readers to [13] for more details on the CSDF model. As shown in Fig. 2.2(a) and Fig. 2.2(b), it may be possible to transform a CSDF actor into an SDF actor. 21 P M[1 1 1 1] 1 R 1 [1 0 0 0] (a) P M4 1 R1 1 (b) Figure 2.2: An application graph with a simple decimator actor M using the (a) CSDF, and (b) SDF models. Actor M is a decimator with a decimation factor of 4. In general, when feedback loops are present in a dataflow graph, such a transformation may introduce deadlock, and therefore should be attempted with caution. Such a transfor- mation, when admissible (not leading to deadlock), generally has trade-offs in terms of relevant metrics including latency, throughput, and code size. More detailed comparisons between the SDF and CSDF models of computation are presented in [54, 10]. 2.1.3 Parameterized Dataflow Though CSDF provides enhanced expressive power compared to SDF, it is still unable to specify patterns in token consumption and production volumes that are not fully known at compile time. A meta-modeling technique called PDF has been proposed to represent certain kinds of dataflow application dynamics [6]. This model can be used with any arbitrary dataflow graph format that has a well-defined notion of a schedule iteration. For example, the PDF meta-model, when combined with an underlying SDF model, results in the PSDF model. A PSDF graph behaves like an SDF graph during one schedule iteration, but can assume different configurations across different schedule iterations. The PDF meta-model supports semantic and syntactic hierarchy. Syntactic hierar- 22 chy is used, as in other forms of dataflow, to decompose complex designs in terms of smaller components. On the other hand, semantic hierarchy in PDF is used to apply spe- cific features in the meta-model that are associated with dynamic parameter reconfigura- tion. A hierarchical actor that encapsulates such semantic hierarchy in PDF encapsulates a PDF subsystem. A PDF subsystem in turn has three underlying graphs called the init, subinit, and body graphs, which interact with each other in structured ways. Intuitively, the init and subinit graphs can capture data-dependent, dynamic behavior at certain points during the execution of the graph and configure the body graph to adapt in useful ways to such dynamics. Similarly, the init graph can be used to dynamically configure parameters in the subinit graph, which, in general, executes more frequently relative to the init graph. Intuitively, the init graph is designed to capture parameter configuration that is driven by higher, system-level processing, while the subinit graph is designed to capture the param- eter changes occurring across different iterations of the corresponding body graph. A PDF actor may have a set of parameters associated with it. The PDF model al- lows the behavior of a subsystem to be controlled by allowing such parameters to change during run-time. These parameters can control functional behavior as well as dataflow behavior (the rates at which actors produce and consume data to and from their output and input ports). In the context of PDF, these parameters can be classified as dataflow or non-dataflow parameters depending on whether or not they control dataflow behavior in the corresponding actors. In general, the functionality of a PDF actor depends on both dataflow and non-dataflow parameters. Depending upon the visibility of the parameters outside a PDF subsystem (possibly, to the enclosing PDF graph or subsystem), it is possi- ble to classify them as external subsystem parameters or internal subsystem parameters. 23 Table 2.1: Interactions among PDF subsystem components, dataflow inputs and outputs, and parameters. PDF Subsystem Feature Component Graphs body init subinit Input port connected to a dataflow edge Yes No Yes Output port connected to a dataflow edge Yes No No Sets external dataflow subsystem parameters No Yes No Sets external non-dataflow subsystem parameters No Yes No Sets internal subsystem parameters No Yes Yes While external subsystem parameters are visible to the enclosing PDF graph, internal subsystem parameters (for example, state information of a PDF actor) are not. These parameters, collectively, are referred to as immediate parameters. Details of interactions among init, subinit and body graphs and various types of parameters are described in [7]. Table 2.1 highlights some of the most important aspects of this interaction. Intuitively, the init and subinit graphs can capture data-dependent, dynamic be- havior at certain points during the execution of the graph and configure the body graph to adapt in useful ways to such dynamics. Similarly, the init graph can be used to dy- namically configure parameters in the subinit graph, which, in general, executes more frequently relative to the init graph. Intuitively, the init graph is designed to capture pa- rameter configuration that is driven by higher, system-level processing, while the subinit graph is designed to capture parameter changes occurring across different iterations of the corresponding body graph. For details related to parameter configuration, and semantics 24 Data FIR Filter (CN)1 1 Decimator[1 1 ... 1]1 x D 1 Output 1 [1 0 0 ... 0]1 x D (a) (b) Figure 2.3: Modeling a parameterized decimation filter (DF) application us- ing PCSDF: (a) Application graph ? CN denotes a vector of FIR filter coef- ficients, and D denotes a decimation factor, and (b) PCSDF representation. of invocations of PDF graphs, we direct the reader to [6], and [7]. To further illustrate the PDF modeling technique, we consider the application ex- ample shown in Fig. 2.3(a). This example involves a finite impulse response (FIR) filter with filter taps or coefficients given by CN = [c0, c1, . . . , cN?1] followed by a decimator with a tunable decimation factor of D. The values of D and CN are set either through a higher level system or user interface. We skip the details of this mechanism for the sake of simplicity and conciseness. Such behavior can be modeled using PDF with an 25 underlying CSDF model. Such a modeling approach is referred to as the parameterized cyclo-static dataflow (PCSDF) model [69]. Fig. 2.3(b) shows one of the possible PCSDF graphs corresponding to the application shown in Fig. 2.3(a). The subsystem DF is a PCSDF subsystem with its component graphs as shown in the figure. It can be seen here that the control actor in the DF.init graph of DF subsystem sets the required external and internal parameters, D, and CN , respectively. This actor models the required parameter control through either a higher level system or some form of user interface. In this particular case, the DF.subinit graph is empty (in general, the init, subinit and body graph do not all have to be used for a given subsystem). The PCSDF model allows CSDF actors for which the cyclic patterns of token pro- duction and consumption volumes can be parameterized in terms of their periods, the actual numbers of tokens consumed or produced in the cyclo-static sequences, or both. Such a model is of particular interest for modeling multirate DSP systems that exhibit parameterizable sample rate conversions. PCSDF allows desginers to systematically ex- plore design spaces across static, quasi-static, and dynamic implementation techniques. Here, by quasi-static implementation techniques, we mean techniques where relatively large portions of the associated software or hardware structures are fixed at compile-time with minor adjustments allowed at run-time (e.g., in response to changes in input data or operating conditions). A variety of quasi-static dataflow techniques are discussed, for example, in [8]. 26 2.2 The Dataflow Interchange Format To describe dataflow applications for a wide range of DSP applications, application developers can use the DIF language, which is a standard language founded in dataflow semantics and tailored for DSP system design [36]. DIF provides an integrated set of syn- tactic and semantic features that can fully capture essential modeling information of DSP applications without over-specification. From a dataflow point of view, DIF is designed to describe mixed-grain graph topologies and hierarchies as well as to specify dataflow- related and actor-specific information. The dataflow semantic specification is based on dataflow modeling theory and independent of any design tool. Fig. 2.4 illustrates some of the available constructs in the DIF language along with the syntax used for application specification. More details on the DIF language can be found in [35]. The topology block of the specification specifies the graph topology, which includes all of the nodes and edges in the graph. DIF supports built-in at- tributes such as interface, refinement, parameter, and actor, which identify specifications related to graph interfaces, hierarchical subsystems, dataflow parameters, and actor configurations, respectively. DIF also allows user-defined attributes, which have a similar syntax as built-in attributes except that they need to be declared with the attribute keyword. The DIF package (TDP) (see Fig. 1.2) facilitates use of the DIF language. Along with the ability to transform DIF descriptions into manipulable internal representation, it contains graph utilities, optimization engines, verification techniques, a comprehensive functional simulation framework, and a software synthesis framework for generating C 27 [dataflowModel] graphID { basedon { graphID; } [topology] { nodes = nodeID, ...; edges = edgeID(srcNodeID, snkNodeID), ...; } [builtInAttribute] { elementID = value; elementID = id; elementID = id1, id2, ...; } [attribute] userDefinedAttribute { elementID = value; elementID = id; elementID = id1, id2, ...; } } Figure 2.4: The DIF language. code [36, 64]. These facilities make DIF an effective environment for modeling dataflow applications, providing interoperability with other design environments, and developing and experimenting with new tools and dataflow techniques. Beyond these features, DIF is also suitable as a design environment for implementing dataflow-based application repre- sentations. Describing an application graph is done by listing nodes and edges, and then annotating dataflow specific information. 28 A B51 C 2 3 (a) 1 2 3 5 B A C (b) 1 2 3 5 B A C (c) Figure 2.5: (a) An SDF graph for a sample rate converter; (b) one of the schedules for it represented as a GST; and (c) GST representing guarded and unguarded execution of actors ? a GST node with two concentric ellipses or circles denotes guarded execution of the corresponding actor. 2.3 Generalized Schedule Trees The GSTs provide a dataflow model independent representation of schedules, which can then be utilized as an input to the subsequent stages of the toolflow such as simula- tion and code synthesis [42]. GSTs are ordered trees with leaf nodes pointing to the actors of an associated dataflow graph. An internal node of a GST denotes a loop count (an iteration construct to be applied when executing the schedule). In this thesis, I denote the loop count and actor associated with a node u in a GST by count(u) and actor(u), respectively. The GST representation allows exploiting topological information and al- gorithms for ordered trees in order to access and manipulate schedule elements. The execution of a schedule involves traversing the GST in a depth-first manner, and during this traversal, the sub-schedule rooted at any internal node is executed as many times as 29 specified by the loop count of that node. Fig. 2.5(b) shows a GST for a valid schedule for the SDF graph shown in Fig. 2.5(a). This particular GST represents a firing sequence (2 (5 A)B)(3 C), where (n X) implies n successive invocations of a schedule el- ement (possibly an actor) X. For sake of completeness, it must be noted that GSTs can be used to represent guarded and unguarded execution of actors as shown in Fig. 2.5(c). A GST node with two concentric ellipses or circles denotes guarded execution of the corresponding actor. A schedule represented by the GST in Fig. 2.5(c) involves guarded execution of actor A, and unguarded execution of actors B and C. The term ?guarded execution? refers to invoking an actor following some actor-specific run-time checks, and only if certain conditions (e.g., with respect to availability of the required number of data tokens on all the actor inputs) are satisfied. We will revisit this concept and elaborate more on it in Chapter 4 in the context of CFDF model. 2.4 Graph Package Oriented toward Models of Computation MoCGraph is a Java-based package of generic graph data structures and algorithms with emphasis on supporting graph-theoretic analysis for MoCs. MoCGraph has been de- veloped by the Maryland DSPCAD Research Group (?DSPCAD Group?), which focuses on computer-aided design (CAD) techniques for DSP systems. The DIF package builds on the MoCGraph package to provide representations and analysis techniques that are specialized for dataflow graphs, and provide foundations for model-based design flows targeted to embedded DSP systems [36, 64]. 30 It must be noted, however, that the features provided in MoCGraph are generally not specific to DSP or CAD-for-DSP applications, and can be used for many other kinds of graph-theoretic specification and analysis features. In fact, MoCGraph has evolved from the graph package in Ptolemy II [25], which allows experimentation with a wide variety of MoCs. This thesis has contributed significantly to the development and release of the latest version of the MoCGraph package. This version extends the previous released version of MoCGraph to include support for tree data structures, and GSTs, in particular. These de- velopments have provided a key foundation for supporting capabilities in the DIF package that are associated with CFDF modeling, and functional simulation. By linking to fea- tures of the dot package [29], this version also allows visualization of graphs that are constructed using the graph representations provided in MoCGraph. 2.4.1 MoCGraph Software Architecture and Features The MoCGraph package provides for the following important features and facilities through its software architecture: 1. MoCGraph allows creating and manipulating generic graph data structures along with special types of graphs, such as directed graphs, directed acyclic graphs (DAGs), trees, and rooted trees among many others [18]. Fig. 2.6 shows a partial hierarchy of classes in MoCGraph. Creating a graph typically involves instantiating the ap- propriate graph elements ? nodes and edges ? along with specifying the graph topology. 31 Graph DirectedGraph UndirectedGraph DirectedAcyclicGraph Forest Tree RootedTree OrderedTree ScheduleTree (mocgsched) Figure 2.6: A hierarchy of classes in MoCGraph that implements various types of graph data structures. 2. MoCGraph allows assigning application-specific information associated with a graph element ? a graph node or edge ? through a construct known as a ?weight,? which has a generic data type. Such weights can be utilized to extend the MoCGraph package for use in specific applications or domains. For example, the DIF package makes use of this facility to assign dataflow-specific information to the nodes and edges in a dataflow graph. 3. MocGraph provides implementations of algorithms for graph analysis ? for exam- ple, algorithms to determine existence of cycles in a graph, and compute all-pairs shortest path results, to name a few. The latest version of MoCGraph extends its earlier released version by providing the following new graph types ? tree, rooted tree, ordered tree, and k-ary tree [18]. 32 An important plug-in to MoCGraph is a package called mocgsched, which stands for MoCGraph scheduling support. Section 2.4.2 describes this plug-in. 2.4.2 mocgsched Plug-in This plug-in augments the MoCGraph package by providing data structures used to represent schedules for dataflow graphs. The scheduling strategies, which are graph transformations, can make use of these data structures to represent the schedules derived from the given application graphs. It must, however, be noted that scheduling transfor- mations that are specific to dataflow are not part of the MoCGraph package ? instead, packages that are intended for features specific to dataflow can build on features of MoC- Graph and mocgsched to provide such dataflow-specific transformations. This is the approach taken in DIF for many of its scheduling features. The latest version of mocgsched includes the representation that supports looped schedules (as in the previous version of MoCGraph [35]), along with a data structure for storing and manipulating GSTs. GSTs make use of various types of tree data structures, which have been added recently as mentioned in Section 2.4.1. The support for GSTs in mocgsched has provided an important foundation for CFDF-based modeling and functional simulation features in the DIF package [64]. 2.5 DICE: The DSPCAD Integrative Command-Line Environment The DSPCAD Group has developed the DSPCAD integrative command line envi- ronment (DICE), which is a package of utilities that facilitates efficient management of 33 software projects [9]. The objective of DICE is to provide a flexible, lightweight envi- ronment for the design, implementation, testing, and integration of engineering software, with a specific orientation towards projects that employ heterogeneous programming lan- guages and cross-platform design methods. The DICE package has been used extensively as an environment for software development for many of the projects involved in this thesis. DICE is implemented as a collection of utilities that are in the form of Bash scripts, C programs, and Python scripts. Therefore, facilities for interpreting/compiling these languages must be available to use all of the capabilities in DICE. DICE is developed with significant attention to cross-platform operation. Platforms on which DICE is used actively include Linux, MacOS, Solaris, and Windows (equipped with Cygwin). Apart from DICE utilities that facilitate working in a command line environment, such as those for directory navigation, another important aspect of the DICE is to provide a lightweight and flexible unit testing environment. This environment is lightweight in that it requires minimal learning of new syntax or specialized languages, and flexible in that it can be used to test source code in many languages, including C, Java, Verilog, and VHDL [9]. In DICE, the test suite for a project consists of an individual test subdirectory (ITS) for each of the unit tests in the suite. An ITS in general contains files that provide docu- mentation related to the test, a script to perform any compilation steps necessary to build the test, a script to execute the test, and files that contain the expected output and errors resulting from correct execution of the test. The output and errors resulting from the ac- tual execution of the test are also stored in the same ITS, and compared with the expected 34 behavior of the test. More information regarding ITS structure, and DICE utilities for unit testing can be found in [9]. Although testing is an established concept in the software engineering field, inte- grating testing rigorously with software development instead of testing being applied as an afterthought to the coding process is a relatively new paradigm. DICE facilitates the application of this paradigm. A companion package of DICE, called dicelang, provides a collection of language- specific plug-ins that extend the features of DICE, and provide new features to facilitate efficient software project design, implementation, and testing for selected programming languages [9]. 2.6 Summary In this chapter, we have provided background information on dataflow modeling, the dataflow interchange format, generalized schedule trees, the MoCGraph package, and the DICE package. This background and the corresponding tools are fundamental to the work developed and presented in the remainder of this thesis. 35 Chapter 3 Topological Patterns for Specification and Analysis of Dataflow Graphs Tools for designing signal processing systems with their semantic foundation in dataflow modeling often use high-level GUIs or text based languages that allow speci- fying applications as directed graphs. Such graphical representations serve as an initial reference point for further analysis and optimizations that lead to platform-specific imple- mentations. For large-scale applications, the underlying graphs often consist of smaller substructures that repeat multiple times. To enable more concise representation and direct analysis of such substructures in the context of high-level DSP specification languages and design tools, we have developed the modeling concept of topological patterns, and proposed ways for supporting this concept in a high-level language. This chapter shows how the DIF language can be augmented with constructs for supporting topological pat- terns, and topological patterns can be effective in various aspects of embedded signal processing design flows using specific application examples. 3.1 Introduction As mentioned in Section 1.2, DSP-oriented dataflow design tools typically allow high-level application specification, software simulation, and possibly synthesis for hard- ware or software implementation. These tools employ high-level description languages for application specification. These languages, which may be either GUI or text based, 36 provide syntactic and semantic constructs for specifying graphical representations of DSP applications. Such graphical representations are then parsed and converted into interme- diate representations suitable for further processing. In this work, we address the problem of representing large-scale and scalable dataflow graphs that have complex topologies. Such graphs comprise of various kinds of functional substructures that are parameterizable and can be represented in terms of concise, scalable specifications. For example, the dataflow graph of an N-point fast Fourier transform (FFT) algo- rithm consists of a combination of scaled versions of a well-known pattern called the butterfly diagram [52], and a systolic array is a mesh of computing elements having a specific dataflow structure that can solve problems such as QR-decomposition based recursive least square adaptive filtering, and minimum variance distortionless response beamforming [43]. We identify such common structures in dataflow graphs as topologi- cal patterns, and treat this kind of pattern as a first class citizen in the modeling process. Furthermore, we demonstrate and experiment with the use of topological patterns in the DIF, a textual design language and associated software package for specification, analysis, and synthesis based on DSP-oriented dataflow models of computation [36], [64]. Topological patterns not only permit scalable specifications of dataflow substruc- tures but also expose the underlying graph structure explicitly to the corresponding design tool. This allows design tools to exploit any analysis or optimization advantages offered by the substructures without having to ?discover? those structures through additional lev- els of pre-processing analysis. Some of the key components of the design flow that can potentially benefit from explicitly exposed patterns include various kinds of scheduling 37 transformations, and techniques for buffer memory optimization. Furthermore, by mak- ing it easier and more efficient to apply substructure-specific analysis techniques, pro- gramming support for topological patterns encourages the development of such analysis techniques, and provides a natural interface for reusing them across different applications and tools. The concept of topological patterns is elaborated in Section 3.3. In Section 3.4, we describe how we extend the DIF language to integrate topological patterns as a first class modeling construct. In Section 3.5, we show how topological patterns can be used by dataflow based design tools for dataflow graph analysis and transformations. We show how topological patterns can be used for graph analysis; representing equivalent HSDF graphs of application graphs modeled using SDF and CSDF models; extracting implementation-specific features; exploring trade-offs for an FPGA implementation of a JPEG image compression application; representing schedules; and experimenting with pattern-specific schedules. 3.2 Related Work Block diagrams are a natural and convenient way of describing DSP algorithms, and hence, DSP systems designers find it intuitive to have a high-level application speci- fication that captures such a description. GUI based dataflow languages try to capture this intuition using visually appealing representations, while text based languages provide syntax that looks similar to common procedural languages, such as C, but with semantic constructs that model the dataflow structure of DSP block diagrams. To effectively han- 38 dle the increasing complexity of signal processing system design, these languages must provide frameworks for modular and scalable representations with sufficient expressive power. Earlier research efforts have focused on supporting commonly used and highly ex- pressive constructs from procedural languages, such as recurrences, iteration, and condi- tionals, in dataflow-oriented languages [45]. Subsequent work includes evolution of var- ious textual languages for DSP system design, such as SILAGE [82], StreamIt [79], and CAL [26]. The StreamIt language provides high-level, architecture-independent abstrac- tions for streaming applications geared toward large-scale program development. The CAL language is an actor-oriented language, which has been applied actively for field programmable gate array (FPGA) implementation and reconfigurable video coding ap- plications. The SILAGE language has been developed with an emphasis on support for high-level synthesis and multidimensional signal processing. While these previous efforts have employed useful techniques for deriving and ex- ploiting various types of specialized dataflow substructures within their respective com- pilers, they lack a general method for explicit and scalable representation of such sub- structures by the programmer. Such a programming interface for topological patterns is essential to capture the broad range of relevant patterns in ways that are scalable, and flex- ibly extensible to accommodate new types of patterns as they emerge from new applica- tions and modeling techniques. Our concept of topological patterns is designed precisely to bridge this gap. In other prior work, higher-order functions have been shown to permit elegant con- struction of structured subsystems in dataflow representations [48]. Higher-order func- 39 tions are functions that take functions as inputs or produce functions as outputs. Topolog- ical patterns provide a related but technically different approach since topological patterns operate on generic directed graph vertices (e.g., nodes in DIF), where the actual binding to actor functionality and associated actor parameter values is specified separately, possi- bly through additional parameter propagation patterns (PPPs). Thus, unlike higher-order functions that take functions as arguments, topological patterns take only generic graph vertices (or arrays of such vertices) as arguments. Furthermore, our development of topo- logical patterns is tightly integrated with textual graph representation and arrays of graph vertices and edges, which are useful for providing scalable representations and managing large-scale designs. Perhaps the most closely related prior work is that on support for arrays of vertices and edges in the DIF language with array construction syntax and semantics similar to those in the C language [19]. These constructs provide a useful shorthand notation for specifying related groups of graph elements (nodes or edges) as arrays in which individual elements can be easily indexed. A typical elementID in the DIF specification (see Fig. 2.4) when referred to as baseName[N], generates an array of N elements. For example, tap[N] in DIF specifies an array tap of N nodes. The ith node, where i = 0, 1, . . . , N?1, can be accessed using its index as tap[i]. However, in this first-version array support within DIF, there is no mechanism for instantiating (declaring) collections of related edges automatically as structured mappings among corresponding subsets of nodes. It is also not possible to configure parameters across arrays of actors as functions of the array indices. These two features ? scalable, programmatic instantiation of graphical substructures, and association of parameter values ? are provided by our development of 40 topological patterns. This development is orthogonal to the existing support for syntactic and semantic hierarchy in the DIF language, which allows constructing hierarchical dataflow graphs. The focus here is to allow the designer to specify already identified topological patterns in the design and expose such patterns to the enclosing design tool or design process, which is generally not achieved through conventional methods for using hierarchical dataflow graphs. This chapter presents formulation of the concept of topological patterns and its ap- plication to dataflow modeling. To prototype this concept in DIF, we build upon the first- version framework of arrays in DIF, and introduce new modeling and language constructs that are dedicated to topological patterns. We also demonstrate the use of topological patterns to derive efficient implementations. A preliminary version of this work was presented in [71], while the extended work was presented in [72]. The work presented here, and in [72] goes beyond the develop- ments of [71] by significantly extending the development of applications of topological patterns. Specifically, we explore the utility of topological patterns in analyzing dataflow graphs and extracting implementation-specific features. We also use topological patterns to represent schedules obtained after applying scheduling transformations to dataflow graphs, and derive more efficient implementations from such representations. Addition- ally, we show how specific topological patterns can be exploited to construct structured schedules, and how designers can experiment with corresponding scheduling trade-offs. 41 3.3 Topological Patterns We have developed the concept of topological patterns for concise specification of functional structures at the dataflow graph (inter-actor) level. Topological patterns provide a scalable approach to specifying regular functional structures in a manner that is analogous in some ways to the use of design patterns in object oriented software [28], but with additional properties associated with being formally integrated with the framework of dataflow. This integration allows not only for specification of functional patterns but also for their analysis and optimization as part of the larger framework of dataflow. Topological patterns build on the concepts of graph element arrays, which allow indexed families of graph elements to be declared and treated as single units for pur- poses of graph construction and analysis. As with arrays in conventional programming languages, graph element arrays can be single- or multi-dimensional. Additionally, they can be parameterized in terms of dataflow graph attributes so that their sizes and other characteristics can be conveniently adapted. 3.3.1 Topological Patterns in Signal Processing We motivate the utility of incorporating topological patterns into dataflow frame- works for DSP system design by illustrating the pervasive nature of these patterns in the domain of DSP. We have already discussed a few such patterns in Section 3.1 ? in partic- ular, the butterfly and mesh patterns, which have applications in FFTs and systolic arrays, respectively. Additionally, the chain pattern is one of the most commonly found topological patterns. This pattern finds applications in modeling multi-stage sample rate 42 converters, delay lines in FIR filters, or configurations of pipeline stages. A chain of de- lay blocks, a chain of adders, and an array of filter taps collectively specify a complete FIR filter when connected together. A natural extension of this pattern is a 2-dimensional mesh structure. Such a structure is of particular use to model DSP architectures in which data flows across a network of processing elements connected to form a 2-D grid such as a systolic array, as discussed earlier in Section 3.1 [43]. A ring pattern represents a cycle in a graph as may be introduced by a phase- locked loop [47] or more generally, a feedback loop in the system. The FFT block is one of the most abundantly found blocks in DSP systems. An N-point FFT computation involves FFT computation stages of smaller dimensions that can be implemented as scaled versions of the 2-point FFT. These FFT stages resemble a butterfly-like pattern [52]. Such patterns can also be found in other applications, such as sorting networks [18]. Entropy encoding algorithms such as Huffman coding make use of the binary tree structure, a commonly found data structure in many computer algorithms [38]. A pattern in which edges connect a source node to multiple sink nodes can be termed as a broadcast pattern. This pattern finds use in applications that have computation blocks in multiple stages with blocks in one stage connected to those in the subsequent stage. Such patterns are observed in multi-layer neural networks used for pattern classification [24] and trellis coding algorithms used in digital communication [47]. It is also common to find its dual, the merge pattern, which connects multiple source nodes to a single sink node. Applications may also have parallel connections between corresponding nodes in adjacent stages. We identify this pattern as a parallel pattern in which edges form a one-to- one correspondence between nodes in two different sets. We also identify a pattern called 43 multiedge that creates multiple edges between a given pair of nodes. 3.3.2 Parameter Propagation An important feature to support in conjunction with topological patterns is a mech- anism for structured parameter propagation, whereby any parameters associated with the vertices in a topological pattern can be set as a function of the vertex indices (i.e., indices associated with the underlying vertex ordering that is input to the pattern instance). For example, Fig. 3.1 shows an array of 5 actors identified as A 0 , A 1 , . . . , A 4 , where each actor has a parameter angle associated with it that is an affine function of its index in the array. Such a parameter assignment can be implemented in a scalable, reusable, and explicitly-recognizable form as a designated PPP ? in particular, a PPP for affine mappings of parameter values across ordered vertices. Such an affine PPP can find use in specifying elements of a steering vector corresponding to each sensor in a sensor array while estimating the direction of arrival of the received signal [33]. One of our important motivations for using topological patterns is to provide for compact, scalable representations for large dataflow graphs. It is common for such large graphs to have actors with the same functionality that scale in number with the size of the application graph. These actors may have functional parameters (for example, the parameter angle associated with the actors in Fig. 3.1) that determine some of their func- tional aspects and also distinguish them from the other actors of the same (parameterized) functionality. It may be inconvenient to specify such parameters individually for all of the actors with growing size of an application graph (in fact, such individualized parameter 44 A_0_ angle = 5 A_1_ angle = 15 A_2_ angle = 25 A_3_ angle = 35 A_4_ angle = 45 Figure 3.1: Configuring an array of nodes with a PPP. specification violates the compactness objective of topological patterns). PPPs can help here by providing a compact representation format that can be used to set parameters as- sociated with actors in the large graphs that are represented by the associated topological patterns. In terms of implementation in the DIF language, just as component attributes and topological patterns can be either user-defined or built-in, similarly commonly-used PPPs can be absorbed into the language as built-in PPPs, while users have the flexibility to incorporate specialized PPPs by linking their interpretation (propagation functionality) to segments of customized Java code. 3.4 Topological Patterns in DIF We extend the DIF language by supporting topological patterns as first class citizens in the modeling framework. These patterns can be defined as built-in patterns, which are recognized and processed through corresponding keywords in the language. To enable more flexible application of patterns, we also support declaring arbitrary (user-defined) patterns, whose associated graph construction functionality can be carried out through procedural language code (Java or C in the case of DIF) that is linked with the graph specification. 45 A_0_ A_1_ A_2_ A_3_ A_6_ A_4_ A_5_ (a) topology { nodes = A[7]; edges = e0(A[0], A[1]), e1(A[3], A[6]), ring_0[5] -> ring(A[1:1:5]), ring_1[3] -> ring(A[1], A[3], A[2]); } (b) Figure 3.2: Overlapping patterns: (a) a graph topology having two ring patterns that have three nodes common to them, and (b) a corresponding DIF representation. We have added, as built-in topological pattern specifiers, new keywords in DIF corresponding to topological patterns that are relatively common in signal processing sys- tems. These keywords, such as ring, parallel, merge, butterfly, broadcast, and chain, allow specifying patterns explicitly as part of the topology block in a DIF specification. When declaring an instance of such a pattern, the designer must provide a sequence of vertices and an optional set of parameter values. The pattern construct, when parsed, generates the required edges, inserting the new edges into the graph that is being constructed. The pattern construct also configures the underlying nodes using the parameter propagation mechanism explained in Section 3.3.2. A typical way to specify a sequence of nodes is through the use of DIF notation 46 for representing nodes in an array. For example, for an array of 7 nodes, specified as A[7], we can specify that 5 of its elements form a ring structure using the construct ring(A[1:1:5]) in the topology block of the DIF code as shown in Fig. 3.2. The argument A[1:1:5] to the construct ring, specifies an array of nodes starting from A[1], ending at A[5], and having an array index increment of 1. In general, the syntax baseName[i:j:k] denotes an array of elements in an array baseName starting from the index i, ending with the index k, and having an array index increment of j. Note that, outside of the pattern instantiation construct, the nodes in the array A can be accessed by their indices to create edges that are not part of the ring pattern. Thus, one can flexibly embed patterns within arbitrary structures including structures that contain other patterns. It is also possible to generate multiple patterns that have one or more nodes common to them, as shown in Fig. 3.2. It is, thus, possible for the designer to effectively identify one or more types of overlapping topological patterns in the application graph. 3.5 Applications of Topological Patterns As described earlier, we envision topological patterns to offer a wide range of ad- vantages at various stages of the design flow from modeling to platform-specific imple- mentation. In Sections 3.3 and 3.4, we have identified topological patterns in various DSP system specifications. In the following subsections, we examine other aspects of the design flow where topological patterns can be effectively used. 47 3.5.1 Graph Analysis The explicit specification of known graphical structures as topological patterns can significantly facilitate various types of dataflow graph analysis algorithms. For example, one of the first and most important steps in many dataflow graph scheduling strategies is to analyze the input graph to identify strongly connected components (SCCs). An SCC is a maximal subgraph in which every pair of distinct nodes is connected through a cyclic path. It is often useful to cluster SCCs ? for example, SCCs can be clustered to improve scheduling of SDF graphs (e.g., see [37]). Such clustering of SCCs is typically performed in order to obtain a top-level DAG. For a directed graph G = (V,E), SCCs can be identified using an algorithm with a time complexity of the order of ?(|V |+ |E|) (see [18] for more details on the definition of the ?-notation as well as algorithm to find SCCs in a directed graph). Consider an application graph that contains multiple feedback paths that can be modeled and specified using the ring pattern. A ring represents a cycle in the graph and hence, a subset of vertices that form an SCC. Such a cycle, when directly specified as a ring can be readily reduced into a single clustered actor. A ring with n nodes in it, when clustered into a single node, effectively reduces the number of nodes in the graph G by n?1. Suppose that a graph G has many ring patterns that have been identified in the graph specification. Then by identifying these rings in constant time, which an analysis tool can do easily from explicit topological pattern specifications, the number of nodes and edges in the graph can be reduced significantly. This can lead to more efficient SCC computation, especially for large graphs. 48 3.5.2 Representing HSDF Graphs Many techniques devised for generating multiprocessor schedules from SDF graphs require that the given dataflow graph be transformed into an equivalent HSDF graph (e.g., see [76]). An HSDF graph is an SDF graph in which every actor consumes (produces) a single token from (on) each input (output) port. Techniques for converting an SDF graph into equivalent (for scheduling purposes) HSDF graphs have been developed in [46]. Such techniques are useful, because equivalent HSDF graphs can expose parallelism much more effectively compared to their more compact SDF counterparts. Unfortunately, equivalent HSDF representations can scale very inefficiently ? the size of an equivalent HSDF graph is in general not polynomially bounded in the size of the corresponding SDF graph [58]. Representing such HSDF graphs becomes a cumber- some exercise, as such representations require large amounts of storage to maintain and large amounts of computation time to process them. For a large HSDF representation, it is difficult for a design tool to traverse the HSDF representation and make effective use of it within a reasonable amount of time. Topological patterns can help in this sit- uation by providing concise representations to expose repetitive structures within HSDF representations, thereby improving the efficiency of HSDF-based schedulers. For example, Fig. 3.3(a) shows an SDF graph that models a simple sample rate converter, and its equivalent HSDF graph (below). Here, actor B is a decimator with a decimation factor of 3. Fig. 3.3(c) shows a DIF specification of this HSDF graph using topological patterns. Fig. 3.3(b) and (d) show an equivalent CSDF graph model with its HSDF graph and associated topological-pattern-based DIF specification. In the CSDF 49 A B31 C 2 1 A_0_ A_1_ B_0_ A_2_ A_3_ A_4_ B_1_ A_5_ C (a) SDF graph to HSDF graph A B [1 1 1] 1 C 2 [1 0 0] A_0_ A_1_ B_0_ A_2_ B_1_ A_3_ B_2_ A_4_ B_3_ A_5_ B_4_ B_5_ C (b) CSDF graph to HSDF graph topology { nodes = A[6], B[2], C; edges = e0[3] -> merge(A[0:2], B[0]), e1[3] -> merge(A[3:5], B[1]), e2[2] -> merge(B[0:1], C); } (c) DIF topology block for HSDF graph in (a) topology { nodes = A[6], B[6], C; edges = e_par[6] -> parallel(A[0:5], B[0:5]), e_mrg[2] -> merge(B[0:3:3], C); } (d) DIF topology block for HSDF graph in (b) Figure 3.3: A sample rate converter. 50 representation, actor B provides a decimation by a factor of 3. Actor B consumes input tokens on every firing while producing an output token only on every third firing, starting with the first firing. As this example illustrates, topological patterns can provide a concise and scalable representation of equivalent HSDF graph representations for SDF and CSDF graphs. It should, however, be noted that a graph representation using topological patterns is in general not unique. Depending on the set of available topological patterns, it may be possible to have multiple functionally-equivalent representations of a given dataflow graph using topological patterns. In the case of Figure 3.3(a), for example, it may be possible to use a tree pattern if the associated design tool supports it. Structured representations of HSDF graphs can also enable efficient tuning of HSDF graph representations in terms of application parameters. For example, for the dataflow graph in Figure 3.3(b), it can be observed that if the decimation factor of actor B is changed, then the DIF representation for the HSDF graph can be updated by simply changing the numeric arguments to the topological patterns used in its representation. In general, for a decimation factor of D, the production rate of actor B in Fig. 3.3(b) is [1 0 0 ? ? ? 0]1?D and the equivalent HSDF graph for this CSDF graph has the following specification, where D is a suitably-declared parameter. topology { nodes = A[2D], B[2D], C; edges = e_par[2D] -> parallel(A[0:2D-1], B[0:2D-1]), e_mrg[2] -> merge(B[0:D:D], C); } Thus, topological patterns provide streamlined representations that are concise, tun- able, and scalable, and are particularly useful for complex graph structures, such as those 51 A_0_ B_0_ B_1_ C_0_ A_1_ C_1_A_2_ B_2_ B_3_ C_2_ A_3_ C_3_ D_0_ D_2_ D_1_ D_3_ topology { nodes = A[4], B[4], C[4], D[4]; edges = fft2_0[4] -> butterfly(A[0:1], B[0:1]), fft2_1[4] -> butterfly(A[2:3], B[2:3]), fft4[8] -> butterfly(C[0:3], D[0:3]), e_par[4] -> parallel(B[0:3], C[0:3]); } Figure 3.4: Dataflow graph for a 4-point fast Fourier transform and the topology block in its DIF specification. found in equivalent HSDF graphs arising from multirate SDF, and CSDF models. 3.5.3 Extracting Implementation-Specific Features Fig. 3.4 shows an HSDF graph that models a 4-point FFT application [52], and the topology block in its DIF specification. Note the underlying topological pat- terns ? butterfly and parallel ? in the graph. It should also be noted that butterfly(C[0:3], D[0:3]) is a scaled version of a butterfly pattern with just 4 nodes, and is equivalent to two butterfly patterns formed by the node subsets {C 0 , C 2 , D 0, D 2 } and {C 1 , C 3 , D 1 , D 3 }. Apart from scalability, there is another useful feature in this HSDF graph repre- sentation. In particular, the bi-partite nature of both the patterns ? butterfly and parallel? allows us to generate a pipelined implementation of this application. Here, 52 RGB RGB2YCbCr Downsample (Cb) (D_0_) Downsample (Cr) (D_1_) Block Y (B_2_) Block Cb (B_0_) Block Cr (B_1_) DCT Cb (T_0_) DCT Cr (T_1_) DCT Y (T_2_) Quantize Cb (Q_0_) Quantize Cr (Q_1_) Quantize Y (Q_2_) ZigZag Cb (Z_0_) ZigZag Cr (Z_1_) ZigZag Y (Z_2_) FW Cb (F_0_) FW Cr (F_1_) FW Y (F_2_) topology { nodes = RGB, RGB2YCbCr, D[2], B[3], T[3], Q[3], Z[3], F[3]; edges = e0(RGB, RGB2YCbCr), e1[3] -> broadcast(RGB2YCbCr, D[0:1:1], B[2]), Cb[5] -> chain(D[0], B[0], T[0], Q[0], Z[0], F[0]), Cr[5] -> chain(D[1], B[1], T[1], Q[1], Z[1], F[1]), Y[4] -> chain(B[2], T[2], Q[2], Z[2], F[2]); } Figure 3.5: JPEG encoder and the topology block in its DIF specification. segments A, B, C, and D, consisting of nodes A[0:3], B[0:3], C[0:3], and D[0:3], respectively, may be considered as pipeline stages of the FFT implementation. This in- herent pipelined nature of the FFT application can be identified easily using the bi-partite nature of the underlying topological patterns. Of course, for FFTs, many efficient imple- mentations have been developed in the literature, and the use of topological patterns does not add any obvious value to the large library of existing FFT implementation techniques. However, this example succinctly illustrates the general potential of topological patterns for exposing useful implementation options more clearly and efficiently to designers and to analysis modules within design tools. 53 3.5.4 Exploring Implementation Trade-offs Fig. 3.5 shows a JPEG encoder along with the topology block in its DIF spec- ification [83]. It effectively employs the broadcast and chain patterns in its rep- resentation. The JPEG compression algorithm downsamples both the chroma (Cb and Cr) components before processing them. Except for this, all three components (both chroma and luma Y ) are processed through functionally similar chains of blocks. The input pixels are grouped into blocks that are then transformed using the discrete cosine transform (DCT), quantized, and scanned in a zigzag order. In particular, the chroma components may be processed using shared functional modules that are clearly exposed by the topology block. Without the use of topological patterns, this observation may not be clear to a designer until the entire graph is carefully traced. For a design tool, this observation may go entirely unexploited because such high-level structure can be difficult to extract automatically from unstructured specifications. The problem of identifying such graph structure is related to the graph isomor- phism problem, which is the problem of detecting whether two graphs (or two subgraphs from the same or different graphs) can have their vertices and edges placed in one-to-one correspondence with one another in a manner that maintains edge-vertex connectivity re- lationships. There are no known polynomial time algorithms for the graph isomorphism problem (e.g., see [30]). For the JPEG encoder example, we can exploit the potential for resource sharing ? which is exposed explicitly at a high-level through the use of topological patterns ? to develop a streamlined FPGA implementation. Awareness of the high-level topological 54 Table 3.1: Performance and resource utilization trade-offs for FPGA implementation of a JPEG encoder. JPEG Throughput FPGA Resource Utilization Encoder (samples Slices (out 18kB 18x18 /cycle) of 13696) BRAM MULT Non-shared 0.159 8070 (58%) 41 30 Shared 0.159 6088 (44%) 37 22 pattern in this application allows for systematic trade-off analysis between two design options ? one with shared resources for chroma component processing, and another without shared resources. An analysis of the high-level dataflow specification suggests that downsampling of chroma components would ensure that the chain processing Y component is the bottle- neck and hence, the throughput should remain unaffected even when the Cb and Cr com- ponents are processed using shared functional modules. Precise modeling of the shared- resource implementation of the JPEG encoder requires that the SDF design in Fig. 3.5 be transformed to expose more detail. For example, the design can be converted into an equivalent CSDF design in which buffers between functional modules are duplicated and alternate buffers are used in successive schedule iterations. For more background on this form of CSDF-based structural modeling, we refer the reader to [10]. From inspection of the CSDF intermediate model, it can be reasoned that the buffer requirement would remain unchanged across both designs (shared- versus separate-resource). However, we expect that the shared-resource version of the JPEG encoder would result in a net reduction in BRAM (block random access memory) utilization. This analysis can be confirmed from the resource utilization and throughput for 55 shared- and separate-resource JPEG encoder implementations on the Xilinx Virtex-II Pro FPGA, as shown in Table 3.1 [71]. The base clock rate for our experiments is 40MHz. Even though actor-level resource sharing is often avoided in FPGA implementation due to the relatively high costs of multiplexing and routing resources (e.g., see [78]), resource sharing for a subgraph in a dataflow representation can result in conservation of FPGA resources that overrides the multiplexing overhead. The shared-resource JPEG encoder uses less BRAM than the separate-resource version, which can be attributed to the shared DCT block. Also, the shared-resource version uses fewer (18 ? 18) multiplier units by employing shared downsampling, DCT, and quantization modules. As expected ? from the aforementioned bottleneck analysis ? both versions of the JPEG encoder achieve the same throughput. In particular, the Y component remains as the system bottleneck even when the Cb and Cr components are processed using shared FPGA resources. Our experiments thus demonstrate concretely how topological patterns can provide a formal path from scalable application analysis to the systematic exploration of implementation trade-offs in the design and implementation of signal processing sys- tems on a relevant target platform. 3.5.5 Representing Schedules The utility of topological patterns is not limited to representation of application graphs alone. Their utility can be extended to create concise and parameterizable repre- sentations of structures typical to schedules for certain application graphs. This can be of particular importance in functionally simulating application graphs, and porting schedules 56 Figure 3.6: Dataflow graphs for (a) the generic class of applications under consideration, and (b) a simplified adaptive modulation scheme. across design tools or languages. We elaborate on this using the following example. We consider a class of applications typically found in the domain of wireless com- munications, and signal processing systems that exhibit dataflow graph structures similar to the one shown in Fig. 3.6(a). A typical example of this type is that of the adaptive modulation scheme (AMS) shown in Fig. 3.6(b). The AMS is a dynamic communication application, which is an important part of modern wireless standards such as the world- wide interoperability for microwave access (WiMAX) [3] and 3rd generation partnership project ? long term evolution (3GPP?LTE) [1] standards. For details of AMS, we refer readers to [70]. There exist other applications that exhibit the general dataflow structure illustrated in Fig. 3.6(a), such as prediction error filters [33] and systems for frequency domain block adaptive filtering [75]. Such dataflow graphs can be efficiently simulated by constructing parameterized looped schedules (PLSs) as described in [70] and [42]. We will revisit the AMS application, and show how it can be modeled using CFDF model in Section 4.3. We will further elaborate on constructing a PLS for the AMS application in 57 1 Mapper mi Mapper Channel 2 ni Bits Demapper Sink topology { nodes = Root, N[6], B, D, Snk; edges = e0[6] -> broadcast(Root, N[0:5]), e1(N[1], B), e2(N[4], D), e3(N[5], Snk); } Figure 3.7: A PLS for the application in Fig. 3.6(b), and the topology block in a corresponding DIF representation. Table 3.2 provides parameters associated with each node in the DIF specification. Section 5.3. Fig. 3.7 shows a PLS for the AMS application. A PLS of this type is of particular importance since it can capture the dynamic dataflow behavior inherent in the applica- tion without compromising compile-time analysis. It is possible to perform useful anal- ysis (e.g., estimation of upper bounds on total buffer memory requirements) for PLSs at compile-time. In Fig. 3.6(a), the consumption rate ci and production rate pi can vary over finite ranges of positive integer values with known upper bounds cmax and pmax, respectively. The subscript i in the symbols pi and ci represents the dependence of this production and consumption rate pair on the actor execution index i ? thus, pi represents the number of tokens produced onto e4 in the ith execution (firing) of D2, and ci represents the number of tokens consumed from e1 during the ith firing of D1. In Fig. 3.7, the loop counts mi and ni are computed dynamically. 58 Table 3.2: Actors and loop counts associated with nodes in the PLS graph representation. Here, NULL indicates an internal node in the GST that does not have any actor associated with it. Node Actor Loop Count Root NULL 1 N[0] Mapper 1 N[1] NULL mi N[2] Mapper 1 N[3] Channel 1 N[4] NULL 2 N[5] NULL ni B Bits 1 D Demapper 1 Snk Sink 1 In the context of this AMS example, topological patterns help not only in specifi- cation of the application dataflow graph using the ring pattern, which can be used to identify the pair of dynamic actors easily, but also representation of generated PLSs using broadcast patterns with hierarchical nodes for SDF-schedules, as shown in Fig. 3.7. For such a well-structured schedule representation, it is possible to hand-tune an imple- mentation and use that representation explicitly for applications having similar dataflow behavior instead of traversing the GST using a generic process to derive a software or hardware implementation. In this case, topological patterns provide a framework by which hand-tuned schedules can be formally specified and reused across different ap- plications or target platforms. Table 3.3 shows a comparison between simulation times using GST traversal and hand-tuned pattern-specific implementation for the PLS in Fig. 3.7. These simulation experiments ? the results of which are presented in Table 3.3 ? differ from related ex- periments that we have reported on previously (e.g, in [72]) in that we have eliminated 59 Table 3.3: Average simulation times for different sink control conditions (numbers of tokens consumed by the sink) for the PLS in Fig. 3.7 using (1) GST traversal, and (2) a hand-tuned pattern-specific schedule. Sink control condition Average simulation time (ms) Improvement (Number of tokens) (1) (2) (%) 10000 73 32 56.16 20000 90 47 47.78 50000 148 62 58.11 100000 248 93 62.50 some of the common overheads by suppressing printing of routine debug and status infor- mation. This allows us to determine the extent of effect of these two simulation strategies on simulation speed, and compare them more precisely. It can be seen that the hand-tuned software implementation results in faster simulations by a factor of up to 62%. Further- more, through its formulation in the framework of topological patterns, the hand-tuned implementation can be analyzed, maintained, ported, and reused effectively across differ- ent design contexts. 3.5.6 Experimenting with Pattern-Specific Schedules When specifying signal processing systems, an important motivation for using topo- logical patterns is to facilitate application of pattern-specific transformations, such as pattern-specific scheduling transformations. In such a context, it can be useful for a design tool to provide features that allow the designer to experiment with various ?scheduling patterns? at a high level of abstraction. Since topological patterns provide well-defined, scalable topological information, one can generate a structured schedule from a given pattern. We demonstrate this application of topological patterns through an example of a 60 u0 w0 w1u1 (a) c0 c1 c2 c3 c4 l3 l0 c5 c6 l1 l2 (b) c0 c1 c2 l0 c3 c4 c5 c6 l3 l1 l2 (c) Figure 3.8: (a) An SDF graph with a butterfly pattern. (b)-(c) two pos- sible GST structures using schedules that are based on acyclic pairwise clus- tering (iteratively clustering two actors at a time). commonly used butterfly pattern. Consider an SDF graph having a butterfly pattern, as shown in Fig. 3.8(a). One commonly used scheduling transformation involves applying clustering transformations on one pair of connected actors at a time such that no cycle is introduced in the resultant graph, and then generating a hierarchical schedule for the given application graph by it- eratively applying such acyclic pairwise clustering (APC) [11]. In case of SDF graphs, a group of actors can be SDF-clustered if its component actors can be scheduled together (i.e., the group can be scheduled as a single unit in the overall schedule for the graph) without introducing deadlock [11]. It can be observed that more than one schedule can be 61 generated using APC depending on the pair of actors clustered at every stage of schedul- ing. In case of SDF graphs, the total buffer memory requirements depend upon the choice of a schedule, and in general, a schedule that has minimum total buffer memory require- ments is desirable in many applications. A scheduling technique based on APC called acyclic pairwise grouping of adjacent nodes (APGAN) has been described in [11] that chooses a pair of actors to be clustered at every stage of scheduling using a metric based on repetition counts of the actors in the graph. This heuristic is widely used and attempts to minimize the total buffer memory requirements. We refer readers to [11] for more information on SDF-clustering, and SDF scheduling heuristics that are based on APC including APGAN. A useful class of SDF schedules is that of single appearance looped schedules, as described in Section 2.1. Let G(V,E) denote the graph in Fig. 3.8(a), where V = {u0, u1, wo, w1}, and E = {(u0, w0), (u0, w1), (u1, w0), (u1, w1)}, (3.1) and suppose that we apply APC to the graph. Based on the steps involved in APC, there are only two possible GST structures for this example. These two structures are shown in Fig. 3.8(b) and (c). Here, each ci, i = 0, 1, ? ? ? , 6, denotes a loop count, while each li, i = 0, 1, ? ? ? , 3, denotes the actor pointed to by a leaf node in the GST. The existence of exactly two unique GST structures for this example can be verified from the following observations regarding the operation of APC (see [11] for further details on the operation of APC for SDF graphs). 1. Let U = {u0, u1}, and W = {w0, w1}. Then we can describe the graph G(V,E) as 62 V = U ?W, and E = U ?W. (3.2) 2. Let e ? E denote the group of actors clustered during the first clustering step. Then, l1 ? U , and l2 ? W . This follows from the bipartite nature of the butterfly pattern. 3. Following the first APC step, operation of APC ensures that l0 ? (U \ {l1}), and l3 ? (W \ {l2}). This is because clustering actors a and b such that a ? U and b ? W at this stage would amount to adding a cycle into the clustered graph, which is not permitted by APC. 4. Loop counts ci, i = 0, 1, ? ? ? , 6, can be accordingly determined using the SDF repetitions vector (the vector of minimal repetition counts in a periodic schedule) for the application graph. Given that each of the 4 pairs of actors can be grouped in the first-step, which, in turn, results in possibly two different schedules upon further grouping, we observe that there are at most 8 different single appearance looped schedules generated using this approach. Such different schedules can in general have different buffer memory requirements [11]. Thus, it can be useful for a designer to experiment with alternative schedules, estimate the buffer memory requirements for these schedules, and identify the schedule that best matches the application requirements and resource constraints. For the butterfly pattern shown in Fig. 3.9(a), Table 3.4 shows 9 different schedules, including a flat schedule for comparison. It can be seen that each of these 63 A C23 D 53 B 4 8 5 4 (a) A C12 D 56 B 2 5 2 3 (b) 1 5 12 4 3 D A 1 2 B C (c) 1 5 2 A 1 5 2 3 C B D (d) Figure 3.9: (a)-(b) SDF graphs with butterfly patterns. (c)-(d) GSTs for minimizing buffer memory requirements of the SDF graphs in (a) and (b), respectively. schedules has different buffer memory requirements. In a given design context, a de- signer may want to experiment with all schedules that fit the available resources in the target platform. The optimal schedule from the viewpoint of total buffer memory cost (schedule (1)) has a total buffer memory cost of 140 memory units, and is generated using the APGAN strategy. However, APGAN is in general a heuristic and is therefore not always guaranteed to derive an optimal solution. For example, consider the butterfly pattern shown in Fig. 3.9(b). Table 3.5 shows 6 different schedules for this graph, including, again, a flat 64 Table 3.4: Buffer memory requirements for single appearance schedules generated from the SDF graph shown in Fig. 3.9(a). Schedule Single Appearance Schedule Total buffer requirement (number of tokens) Flat (20 A)(15 B)(30 C)(12 D) 300 1 (5 (4 A)(3 B(2 C)))(12 D) 140 2 (20 A)(3 (5 B(2 C))(4 D)) 148 3 (5 (3 B)(2 (2 A)(3 C)))(12 D) 150 4 (15 B)(2 (5 (2 A)(3 C))(6 D)) 216 5 (15 B)(4 (5 A)(3 D))(30 C) 255 6 (15 B)(2 (2 (5 A)(3 D))(15 C)) 225 7 (20 A)(3 (5 B)(4 D))(30 C) 260 8 (20 A) (3 (5 B)(4 D)(10 C)) 180 Table 3.5: Buffer memory requirements for single appearance schedules generated from the SDF graph shown in Fig. 3.9(b). Schedule Single Appearance Schedule Total buffer requirement (number of tokens) Flat (5 A)(4 B)(10 C)(6 D) 72 1 (4 B)(5 A(2 C))(6 D) 64 2 (5 A)(2 (2 B)(5 C)(3 D)) 56 3 (5 A)(2 (2 B)(5 C))(6 D) 62 4 (5 A)(2 (2 B)(3 D))(10 C) 66 5 (5 A)(2 (2 B)(3 D)(5 C)) 56 schedule, and 5 different looped schedules. Here, schedule (1) is the one generated by ap- plying the APGAN strategy, and it can be seen that schedules (2), (3), and (5) outperform this schedule in terms of total buffer memory requirements. This example demonstrates the utility of experimenting with alternative schedules even if established heuristics, such as APGAN, are available. Topological patterns facil- itate such experimentation through their capabilities for schedule representation. In par- ticular, topological patterns allow designers to construct structured patterns of schedules, 65 which can then be examined separately to determine which one is most suitable in a given design context. Furthermore, topological pattern representations can be used to maintain libraries of subsystem-specific schedules, which can then be drawn upon efficiently when constructing larger applications that employ those subsystems. 3.6 Summary We have introduced the concept of topological patterns, which can be used to iden- tify and concisely iterate across arbitrary structures in a dataflow application graph. We have shown how the types of flowgraph substructures that are pervasive in the DSP appli- cation domain can be effectively represented in terms of topological patterns, and thereby used to generate compact, scalable application representations. We have also shown how an underlying design tool can exploit a high-level ap- plication specification consisting of topological patterns in various aspects of the de- sign flow. In particular, we have demonstrated the efficacy of topological patterns in dataflow graph analysis, concise and scalable representation of HSDF graphs, and ex- ploring implementation-specific trade-offs. We have also shown the use of topological patterns in graph analysis and extracting implementation-specific features. We have ap- plied the concept of topological patterns to represent schedules for application graphs. Such representations are useful, for example, when porting schedules generated using one design tool to other platform-specific tools or design languages. We have demonstrated the utility of experimentation with pattern-specific scheduling transformations, and how topological patterns facilitate such experimentation. 66 Chapter 4 Prototyping Heterogeneous Dataflow Applications using Core Functional Dataflow We have provided a brief summary of dataflow models relevant to the work pre- sented in this thesis in Chapter 2. These dataflow models with varying degrees of ex- pressive power can model dataflow behaviors that range from being completely static to highly dynamic, where production and consumption volumes can vary on a per-invocation basis. In Section 1.3.2, we introduced the CFDF model, which we have formulated and developed as part of this thesis. The CFDF model can be used to model a wide variety of deterministic dynamic dataflow behaviors [64]. At the same time, it supports flexible and efficient prototyping of dataflow-based application representations and permits natural description of both dynamic and static dataflow actors. In this chapter, we present the se- mantics of CFDF. We demonstrate how various heterogeneous dataflow applications can be modeled using CFDF. We show how various existing dataflow models can be repre- sented using CFDF semantics. We also present application of CFDF to rapid prototyping of heterogeneous dataflow applications. 4.1 Related Work A number of development environments utilize dataflow models to aid in the cap- ture and optimization of functional application descriptions. Ptolemy II encompasses a 67 diversity of dataflow-oriented and other kinds of MoCs [25]. To describe an application subsystem, developers employ a director that controls the communication and execution schedule of an associated application graph. If an application developer is able to write the functionality of an actor in a prescribed manner, it will be polymorphic with respect to other MoCs. To describe an application with multiple MoCs, developers can insert a ?composite actor? that represents a subgraph operating with a different MoC (and there- fore its own director). In such hierarchical representations, directors manage the actors only at their associated levels, and directors of composite actors only invoke their actors when higher level directors execute the composite actors. This paradigm works well for developers who know a priori the modeling techniques with which they plan to represent their applications. One of the other techniques employs SystemC to capture actors as composed of in- put ports, output ports, functionality, and an execution finite state machine (FSM), which determines the communication behavior of the actor [32]. There exist languages, such as CAL [26], that specifically target actor descriptions. For complete functionality in Simulink [51], actors are described in the form of ?S-functions.? By describing them in a specific format, actors can be used in continuous, discrete-time, and hybrid systems. Lab- VIEW [39] even gives designers a way of programmatically describing graphical blocks for dataflow systems. Semantically, perhaps the most related work is that of a MoC called the SBF [41]. In SBF, an actor is represented by a set of functions, a controller, state, and transition function. Each function is sequentially enabled by the controller, and uses on each invo- cation a blocking read for each input to consume a single token. Once a function is done 68 executing, the transition function defines the next function in the set to be enabled. CFDF semantics, and features in the DIF tool based on those, differ from these re- lated efforts in dataflow based design in their integrated emphasis on minimally-restricted specification of actor functionality, and support for efficient static, quasi-static, and dy- namic scheduling techniques. Each may be critical to prototyping overall dataflow graph functionality. Compared to models such as SBF, CFDF allows a designer to describe actor functionality in an arbitrary set of fixed modes, instead of parceling out actor behavior as side-effect free functions, a controller, and a transition function. CFDF is also more gen- eral than SBF as it permits multi-token reads and can enable actors based on application state. As designers experiment with different dataflow representations with different lev- els of actor dynamics, they need corresponding capabilities to experiment with compatible scheduling techniques. This is a key motivation for the integrated actor- and scheduler- level prototyping considerations in CFDF and its support in DIF. The material presented in this chapter is based on the work in [64], [66], and [65]. 4.2 Formulation of Core Functional Dataflow CFDF semantics can be viewed as a ?deterministic dataflow subset? of enable- invoke dataflow (EIDF) semantics, which require that actor specification be divided into separate enable and invoke functions [64] (described below). A CFDF actor a also has a set of valid modes Ma in which it can execute. When the actor a executes in a mode m ? Ma, it consumes (produces) a fixed number of tokens from its inputs (onto its outputs), but the number of tokens consumed and produced by an actor can vary across different modes 69 in Ma. The separation of enable and invoke capabilities helps in prototyping efficient scheduling techniques. The enable function is designed to be used as a ?hook? for dynamic or quasi-static scheduling techniques to rapidly query actors at run-time, and check whether or not they are executable. The enable function only checks for the availability of sufficient input data to allow an actor to fire in its current mode, and does not consume any tokens from the actor inputs. The current mode of an actor is always unique in CFDF, so this check of ?data sufficiency? is unambiguous. Given an actor a ? V in a dataflow graph G(V,E), the enabling function for a is defined as: ?a : (Ta ?Ma)? B, (4.1) where Ta = ?|in(a)| is a tuple of the number of tokens on each of the input edges to actor a (here, |in(a)| is the number of input edges to actor a); Ma is the set of modes associated with actor a; and B = {true, false} is true when an actor a ? V has an appropriate number of tokens for mode m ? Ma available on each input edge, and false otherwise. An actor can be executed in a given mode at a given point in time if and only if the enabling function is true-valued. The invoke function, on the other hand, consumes as many tokens from the inputs as specified by its mode of execution, and correspondingly produces the specified numbers of tokens onto the actor outputs. The invoke function can generally change the mode of the actor by returning a valid mode of execution in which the actor should be fired during its next invocation. Thus, actors proceed deterministically to a unique ?next mode? of 70 execution whenever they are enabled. The invoking function for an actor a is defined as: ?a : (Ia ?Ma)? (Oa ?Ma), (4.2) where Ia = X1 ?X2 ? ? ? ? ?X|in(a)| is the set of all possible inputs to a, where Xi is the set of possible tokens on the edge on input port i of actor a. After a executes, it produces outputs Oa = Y1 ? Y2 ? ? ? ? ? Y|out(a)|, where Yi is the set of possible tokens on the edge connected to port i of actor a, and |out(a)| is the number of output ports. If no mode is returned (i.e., an empty mode set is returned), the actor is forever disabled. We further illustrate these CFDF semantics by applying those to model the dynamic dataflow behavior of actors in applications such as the AMS in Section 4.3. For use of EIDF in modeling applications that cannot be modeled using CFDF, such as the Gustav function [5], we refer readers to the discussion in [65]. 4.3 Modeling Adaptive Modulation Scheme using Core Functional Dataflow As mentioned in Section 3.5.5, the AMS is an example of a useful, restricted class of dataflow-based applications or subsystem modules, the graphical representation of which can be reduced to the form shown in Fig. 3.6(a). In Fig. 3.6(a), S1, S2, and S3 denote regions consisting of SDF actors that can be SDF-clustered, while actors D1 and D2 have dynamic behavior. A group of actors can be said to be SDF-clustered if its component actors can be scheduled together (i.e., the group can be scheduled as a single unit in the overall schedule for the graph) without introducing deadlock [11]. The dataflow edges e1, e2, . . . , e5 denote FIFO buffers. The D on edge e5 denotes the delay, del(e5), associ- ated with it. In this targeted class of applications, it is assumed that the production and 71 consumption rates k, x, y, z, q are positive integer constants, while the consumption rate ci and production rate pi can vary over finite ranges of positive integer values with known upper bounds cmax and pmax, respectively. The subscript i in the symbols pi and ci repre- sents the dependence of this production and consumption rate pair on the actor execution index i ? thus, pi represents the number of tokens produced onto e4 in the ith execution (firing) of D2, and ci represents the number of tokens consumed from e1 during the ith firing of D1. Such a class of applications is of particular importance since useful compile- time analysis can be performed while handling the dynamic behavior as explained later in Section 5.3. Fig. 3.6(b) shows a simplified representation of the AMS with a source of input bits, dynamic mapper, channel, dynamic demapper, feedback path (e5), and out- put sink actor. The mapper maps the bit(s) from the input bitstream to a symbol for transmission over the channel, while the demapper outputs one or more bits for each of the symbols received from the channel. The number of bits per symbol depends upon the modulation and demodulation schemes (e.g., QPSK or 64QAM) used by the mapper and demapper, respectively. The mapper receives feedback from the demapper in- dicating the result of channel estimation and accordingly selects one of the modulation schemes to be employed, which in turn determines the number of bits per symbol. Hence, the number of tokens consumed (produced) by the mapper (demapper) from the buffer e1 (e4) in general can vary from one invocation to the next. We use CFDF to model dataflow behavior of the actors in the AMS application, and in particular, the dynamic mapper and demapper actors. The remaining actors are SDF actors and can be modeled easily as CFDF actors with just one valid mode 72 Table 4.1: Valid modes for the mapper actor along with their corresponding production and consumption rates. Consumption rate Production rate Mode e1 e5 e2 control 0 1 0 QPSK 2 0 1 16QAM 4 0 1 64QAM 6 0 1 Control QPSK 16QAM 64QAM Figure 4.1: Mode transition behavior of the mapper actor. each. Table 4.1 shows the possible modes for a generic dynamic mapper actor with their respective production and consumptions rates. It has a mode corresponding to each of the possible modulation schemes being employed (here QPSK, 16QAM, and 64QAM), and an additional mode called control. In the control mode, the mapper actor reads a channel quality indicator token from the feedback edge e5 (see Fig. 3.6). This information is then used to determine the modulation scheme to be employed, and the invoke function returns (as the next mode value) the mode that corresponds to this scheme. The demapper actor can be modeled in a similar manner. Fig. 4.1 shows the mode transition behavior for the mapper actor. 73 4.4 Translation to Core Functional Dataflow Many of the commonly used dataflow models can be directly translated to CFDF in an efficient and intuitive manner. In this section we show such constructions that demon- strate the expressibility of CFDF, and how the existing designs can be readily represented using CFDF semantics. 4.4.1 Static Dataflow SDF, CSDF, and other static dataflow-actor behaviors can be translated into finite sequences of CFDF modes for equivalent operation. Consider, for example, CSDF, in which the production and consumption behavior of each actor a is divided into a finite sequence of periodic phases P = (1, 2, ..., na). Each phase has a particular production and consumption behavior. The pattern of production and consumption across phases can be captured by a function ?a whose domain is Pa. Given a phase i ? Pa, ?a(i) = (Gi, Hi), where Gi and Hi are vectors indexed by the input and output ports of a, respectively, that give the numbers of tokens produced and consumed on these edges for each port during the ith phase in the execution of actor a. To construct a CFDF actor from such a model, a mode is created for each phase, and we denote the set of all modes created in this way by Ma. Given a mode m ? Ma corresponding to phase p ? Pa, the enable method for this mode checks the input edges of the actor for sufficient numbers of tokens based on what the phase requires in terms of the associated CSDF semantics. Thus, for each input port z of a, mode m checks for the availability of at least Gp(z) tokens on that port, where ?(p) = (Gp, Hp). For 74 Table 4.2: Valid modes in a CFDF representation of the decimator actor, M, in Fig. 2.2(a) along with their corresponding production and consumption values. Consumption rate Production rate Mode in out mode 0 1 1 mode 1 1 0 mode 2 1 0 mode 3 1 0 the complementary invoke method, the consumption of input ports is fixed to Gp, the production of output ports is fixed to Hp. The next mode returned by the invoke method must be the mode corresponding to the next phase in the CSDF phase sequence. For example, consider the decimator actor, M, in Fig. 2.2(a), which has a decimation factor of 4. Its dataflow behavior can be modeled using the CSDF model, as explained in Section 2.1.2. We can construct an equivalent CFDF representation of this actor using the process explained above. Table 4.2 shows modes of the decimator actor with their corresponding consumption and production values. In this table, in and out refer to the input and output ports, respectively, of the decimation actor. Fig. 4.2 shows the mode transition behavior of the decimator actor. Since any SDF actor can be viewed as a single-phase CSDF actor, the CFDF con- struction process for SDF is a specialization of the CSDF-to-CFDF construction process described above in which there is only one mode created. 75 mode_0 mode_1 mode_2 mode_3 Figure 4.2: Mode transition behavior of the decimator actor. See Table 4.2 for the dataflow behavior of the actor in each mode. 4.4.2 Boolean Dataflow BDF adds dynamic behavior to dataflow [16]. The two fundamental elements of BDF are Switch and Select. Switch routes a token from its input to one of two outputs based on the Boolean value of a token on its control input. The concept of a con- trol input is also utilized for Select, in which the value of the control token determines which input port will have a token read and forwarded to its one output. Consider an application shown in Fig. 4.3 that uses a Switch actor. To construct a CFDF actor that implements BDF semantics, we create a mode that is dedicated to read- ing that input value, which we call the Control mode. The result of this examination sends the actor into either a True mode or a False mode that corresponds to that con- trol input. In the case of Switch, this implies three modes with behavior described in Table 4.3. The mode transition behavior of the Switch actor is shown in Fig. 4.4. For a strict construction of BDF, only the Switch and Select actors are needed for imple- 76 Control Switch control_input Data data_input Truetrue_output False false_output Figure 4.3: Application of BDF using a Switch actor. Table 4.3: The behavior of the Switch actor modes in terms of tokens produced and consumed. mode consumption rate production rate control input data input true output false output Control 1 0 0 0 True 0 1 1 0 False 0 1 0 1 mentation, but CFDF does permit more flexibility, allowing designers to specify arbitrary behavior of True and False modes as long as each mode has a fixed production and consumption behavior. 4.4.3 Representing PSDF and PCSDF Actors using CFDF It is also possible to construct an equivalent CFDF representation of a PSDF or PCSDF actor in ways similar to those described in Sections 4.4.1 and 4.4.2. For construct- ing a CFDF representation, a PSDF or PCSDF actor can be considered as a combination of static and dynamic dataflow behaviors. A PSDF or PCSDF actor mode corresponding to parameter configuration can be viewed as a control-oriented mode, while other modes model different static behaviors for specific settings of actor parameters. We illustrate this using an example of a TDD. 77 Control True False Figure 4.4: Mode transition behavior of the Switch actor. See Table 4.3 for dataflow behavior of the actor in each mode. Consider the application graph shown in Fig. 2.3(a) along with its PCSDF descrip- tion in Fig. 2.3(b). In particular, consider the Decimator actor. Suppose the decimation factor of this actor can be set to either 3 or 4 depending upon a parameter set from its con- trol port. Fig. 4.5 shows the mode transition behavior of the Decimator actor in one of its possible CFDF representations. Here, modes mode 4 0, mode 4 1, mode 4 2, and mode 4 3 together represent a behavior similar to a CSDF Decimator actor with a decimation factor of 4, while mode 3 0, mode 3 1, and mode 3 2 together represent a behavior similar to a CSDF Decimator actor with a decimation factor of 3. The control mode models the parameter configuration. Note that actor does not consume or produce any dataflow tokens in this mode. Also, note the transitions out of modes mode 4 3 and mode 3 2. It is possible that parameters associated with the Decimator are config- ured after every CSDF cycle (for example, if configured by a subinit graph) or multiple CSDF cycles (for example, if configured by an init graph). The scheduler can exploit this behavior depending upon the application under consideration to construct efficient PLSs. 78 control (0, 0) mode_4_0 (1, 1) mode_3_0 (1, 1) mode_4_1 (1, 0) mode_4_2 (1, 0) mode_4_3 (1, 0) mode_3_1 (1, 0) mode_3_2 (1, 0) Figure 4.5: CFDF mode transition behavior of a PCSDF actor Decimator with possible decimation factors 3 and 4. The annotation (i, j) for a CFDF mode indicates the corresponding numbers of tokens consumed from the in- put (i), and produced onto the output (j) by the actor. 4.5 Functional Simulations in DIF using Core Functional Dataflow The CFDF semantics allow the DIF package to support functional simulations. The segregation of enable and invoke functions allows use of a guarded execution of an actor. The term guarded execution refers to a scheme of firing an actor in which the actor is fired (i.e. its invoke function is called) only if its enable function asserts the availability of sufficient data to fire the actor in its current mode by returning true. This feature allows using a canonical schedule (described below in Section 4.5.2) in which every actor is fired once in every schedule iteration. Such simulations can also be used for analyzing and estimating the buffer requirements in terms of number of tokens accumulated in those buffers. This provides an estimate of total memory requirements as well as individual buffers when porting an application to the target platform. It is also possible to use more 79 sophisticated scheduling techniques such as those we will describe in Chapter 5. 4.5.1 Extensions to DIF Software Architecture The DIF package has been restructured to support functional simulations. We have introduced a library of actors, which adhere to CFDF semantics, described in Java pro- gramming language for use in the applications. Actors are objects derived from a base class that provides each actor with mode and edge interfaces along with base methods for the enabling and invoking functions, called the enable method and the invoke method re- spectively. Modes can be created either by a user through an API or automatically, based on other information about the application (e.g., the sequence of phases in a CSDF rep- resentation). While a designer will redefine an actor?s class methods to define the proper functionality, the enable method is always restricted to only checking the number of to- kens on each input (as per the enabling function definition). The invoke method may read values from inputs, but it must consume them as tokens. In other words, when a mode is invoked on an actor, the actor consumes a fixed number of tokens that is associated with that mode, and no more values are read. In either case, we expect designers to effectively construct a case statement of all of the possible modes for a given actor, and fill in the functionality of each mode in a case. A scheduler uses GST representation to represent the generated schedule, which is then used to simulate the application graph. 80 4.5.2 Canonical Scheduler We can always construct a canonical schedule for an application graph. This is the most trivial schedule that can be constructed from the application graph. The canonical schedule is a single appearance schedule (a schedule in which actors of the application graph appear once) which includes all actors in some order. In terms of the GST repre- sentation, a canonical schedule has a root node specifying the loop count of 1 with its child nodes forming leaves of the schedule tree. Each leaf node points to a unique actor in the application graph. The ordering of leaf nodes determines the order in which actors of the application graph are traversed. When the simulator traverses GST, each actor in the graph is fired, if it is enabled. 4.6 Design Examples Polynomial evaluation is a commonly used primitive in various domains of signal processing, such as wireless communications and cryptography. Polynomial functions may change whenever senders transmit data to receivers. The kernel is the evaluation of a polynomial Fi(x) = ?nik=0 ck ? xk, where c1, c2, . . . , cn are coefficients, x is the polynomial argument, and ni is the degree of the polynomial. Since the coefficients may change at runtime, a programmable polynomial evaluation accelerator (PEA) is useful for accelerating the computation of multiple Fi?s. 81 Table 4.4: The behavior of the PEA modes. mode consumes produces Control Data Result Status Normal 1 0 0 0 Reset 0 0 0 0 Store Poly 0 1 0 1 Evaluate Poly 0 1 1 1 Evaluate Block 0 1 1 1 4.6.1 Programmable Polynomial Evaluation Accelerator Since the degree and coefficients of a polynomial may change at run-time (e.g., for different communications standards or different subsystem functions), a programmable PEA is useful for accelerating the computation of multiple Fi ?s in a flexible way. To this end, we design a PEA with the following instructions: reset, store polynomial (STP), evaluate polynomial, and evaluate block (EVB). Evaluate polynomial is for a single eval- uation, and EVB is for a bulk evaluation of the same polynomial. Since data consumption and production behavior for the PEA depends on the spe- cific instruction, a PEA actor cannot follow the semantics of conventional dataflow mod- els, such as SDF. However, if we define multiple modes of operation, we can capture the required dynamic behavior as a collection of CFDF modes. Following this principle, we have implemented the PEA as a single CFDF actor. In our functional description of the actor, we defined different modes according to the four PEA instructions. These modes are summarized in Table 4.4. The normal mode (like the ?decode? stage in a typical processor) reads an instruc- tion and determines the next operating mode of the data path. Of particular note here 82 is the behavior of STP in which the number of coefficients read varies. Each individual mode is restricted to one particular consumption rate, so when the STP mode is invoked, it reads a single coefficient, stores it, and updates an internal counter. If the counter is less than the total number of coefficients to be stored, invoke returns STP as the next mode, so it will continue reading until done. Note that persistent internal variables (?actor state variables?), such as a counter, can be represented in dataflow as self-loop edges (edges whose source and sink actors are identical), and thus, the use of internal variables does not violate the pure dataflow semantics of the enclosing DIF environment. We find that functional simulations using the high-level DIF prototype of a PEA application based on CFDF model are faster by a factor of 4.9 compared to those using an implementation in lower level language, such as Verilog [64]. 4.6.2 Design with Multiple Polynomial Evaluation Accelerators To illustrate the problem of heterogeneous complexity, we suppose that a DSP ap- plication designer might use two PEA actors customized for different length polynomi- als. For this application, we restrict the PEA?s functionality to be a CSDF actor with two phases: reading the polynomial coefficients and then processing a block of x?s to be eval- uated, as shown in Table 4.5. The overall PEA system is shown in Fig. 4.6. Two PEA actors are in the same application and we made them selectable by bracketing them with a Switch and a Select block. To manage these two PEA actors properly, this design requires control to select the PEA1 or PEA2 branch. In this system, the CSDF PEA ac- tors consume a different number of polynomial coefficient tokens, so the control tokens 83 Table 4.5: The behavior of the CSDF implementation of the restricted PEA used in the dual PEA application. Actor mode consumes produces Data Result PEA1 Store Poly 4 0 Evaluate Block 15 15 PEA2 Store Poly 7 0 Evaluate Block 15 15 driving the Switch and Select on the data path must be able to create batches of 19 and 22 tokens, respectively for each path. If the designer is restricted to only Switch and Select for BDF functionality, the balloon with CONTROLLER shows how this can be done. This design can certainly be captured with model oriented approach, pulling the proper actors into super-nodes with different models. But like many designs, this ap- plication has a natural functional hierarchy in it with the refinement of CONTROLLER, and with PEA1 and PEA2. We believe that competing design concerns of functional and model hierarchy will ultimately be distracting for a designer. With this work, we focus designers on efficient application representation and not model related issues. Immediate simulation of the dual PEA application is possible to verify correctness by using the canonical schedule. We simulated the application with a random control source and a stream of integer data. A nontrivial schedule tree can significantly improve upon the canonical performance. Given that the probability of a given PEA branch being selected is uniform, we can derive a single appearance schedule shown in Figure 4.7, where each leaf node is annotated with an actor and each interior node is annotated with 84 glyph1 glyph1 glyph6 glyph6 glyph6 glyph6 glyph6 glyph6 glyph12glyph6glyph11glyph4glyph7glyph7glyph13 glyph6 glyph6 glyph6 glyph12glyph6glyph4glyph5glyph13 glyph12glyph5glyph4glyph6glyph13 glyph2glyph8 glyph4glyph6glyph9glyph3 glyph2glyph10glyph4 glyph6glyph9 glyph3 glyph6glyph9 glyph6glyph12glyph6glyph4glyph5glyph13 glyph12glyph5glyph4glyph6glyph13 glyph6 glyph6 glyph2glyph5 glyph4glyph6glyph9glyph3 glyph2glyph5glyph4 glyph6glyph9 glyph3 glyph6 glyph6 glyph6glyph6 glyph6 glyph7glyph7 glyph7glyph7 glyph6 glyph6 glyph6 glyph6 glyph6 glyph6 glyph6 glyph6 glyph6 glyph6 glyph6 glyph6 glyph6 Figure 4.6: A pictorial representation of the dual PEA application. a loop count. Figure 4.8 shows a manually designed multiple appearance schedule (a schedule in which actors may appear more than once) that attempts to process polynomial coefficients first, before queuing up data to be evaluated, to reduce buffering. Note that the SRC and CONTROL actor are unguarded as they require no input tokens to successfully fire. 4.6.3 Results We also implemented an polyphase uniform discrete Fourier transform (DFT) filter bank and a sample rate conversion application. We constructed the decimated uniform DFT filter bank using a mixed-model consisting of CSDF and SDF actors [54]. The sam- 85 1 2 41 602 CONTROL COPY VECTORIZE 2 SRC 2 2 RESULT CONTROLLER SWITCH PEA1 PEA2 SELECT Figure 4.7: Single appearance schedule for the dual PEA system. 1 2 30 30111 1 CONTROL COPY VECTORIZE 2 SRC 2 2 RESULT CONTROLLER SRC 2 SWITCH PEA1 PEA2 SWITCH PEA1 PEA2 SELECT Figure 4.8: Multiple appearance schedule for the dual PEA system. ple rate conversion application is based on concepts found in [21] and [36]. Results for these different implementations with different schedules are summarized in Table 4.6. We simulated 10000 evaluations running on a 1.7GHz Pentium with 1GB of physical mem- ory. We measured the time it took to complete enough iterations to complete all of the evaluations and maximum total queue size. The manually designed schedules performed notably better than the canonical schedule. Such insight can be invaluable when consid- ering the final implementation of the controller logic. 4.7 Summary We have formulated the CFDF model, which can be used to model a wide vari- ety of deterministic dynamic dataflow behaviors, and used to capture various well known 86 Table 4.6: Simulation times and maximum buffer sizes for mixed-model applications. Simulation Maximum observed Application Schedule Time (s) buffer size (tokens) Dual PEA - BDF Strict Canonical 6.88 2,327,733 Single appearance 1.72 1,729 Multiple appearance 1.59 1,722 Dual PEA - CFDF Canonical 3.57 1,018,047 Single appearance 0.95 1,791 Multiple appearance 0.99 1,800 DFT Filter Canonical 0.91 17 Single appearance 1.02 24 Sample Rate Converter Canonical 9.15 9,394 Single appearance 1.43 2,408 forms of dataflow in a single, unified formulation. We have also presented the features of CFDF model and tools based on it, such as support for heterogeneous dataflow behav- iors, intuitive and common framework for functional specification, support for functional simulations that allows designers to model and verify interactions between those models, portability from most of the existing dataflow models to CFDF, and integrated emphasis on minimally-restricted specification of actor functionality. With this CFDF modeling ap- proach integrated into DIF, we have demonstrated its use with various applications. Such an approach has allowed us to functionally simulate the design immediately, and then focus on experimenting on schedules and dataflow styles to improve performance. 87 Chapter 5 Efficient Scheduling Techniques for Core Functional Dataflow Graphs In Chapter 4, we described various features of the CFDF model of computation, and applied this model to specify and prototype different heterogeneous dataflow ap- plications. For functional simulations of CFDF application graphs, we used the CFDF canonical scheduler, as explained in Section 4.5.2. It is, however, possible to use more efficient scheduling techniques that are applicable to general or specialized classes of CFDF graphs. In this chapter, we focus on scheduling techniques for CFDF graphs, and present three different scheduling techniques that employ, respectively, (1) decomposition of dynamic dataflow graphs, (2) mode grouping, and (3) parameterized looped schedules. 5.1 Scheduling using Dynamic Dataflow Graph Decomposition We proposed this technique as a generalized scheduling strategy in [62]. It is based on decomposing a dynamic dataflow graph into a set of static interacting graphs. It makes use of the fact that every CFDF mode has a fixed production and consumption behavior. To construct a static graph based on these modes, it finds the combination of modes in which one mode from each actor in the subgraph is producing or consuming on an edge that has a consuming or producing mode at the other end of the edge. Since every actor can potentially provide many modes, there are an exponential number of combinations to be considered. To avoid exploring this entire space, a reachability analysis is performed 88 to consider only those modes that are connected to each other. For this, an extension of depth first search (DFS) graph traversal with the concept of mode traversal to arrive at the set of static subgraphs is employed as shown in Fig. 5.1. We have mentioned the relevant research efforts related to modeling and simulat- ing heterogeneous dataflow applications, and highlighted novelty of CFDF approach in Section 4.1. In the context of efficient scheduling and simulations of applications, our generalized scheduling framework differs from these related efforts in dataflow-based de- sign in that our framework uses top-down analysis of (explicitly-specified) application structure combined with integration of static dataflow sub-behaviors (actor modes) across groups of dataflow actors. This approach to analysis and integration systematically ex- tends the reach of static scheduling techniques so that they can be used across significant portions of dynamic dataflow designs. The approach is driven by the modeling semantics of CFDF, which provides the explicit decomposition of actors into static dataflow sub- behaviors, and efficiently exposes to the scheduler the design spaces associated with sep- arating sub-behaviors of individual actors, and grouping subsets of sub-behaviors across different actors. 5.1.1 Dynamic Dataflow Graph Decomposition Algorithm The key addition to the traditional DFS is that the next nodes to be added to the working stack S are found by following a mode from the current node. Another stack of nodes T keeps track of what order the nodes have been visited, so that the graph visited state may be unwound. When a static subgraph has been completed or an invalid 89 Function DecomposeCFDFGraph Data: CFDF Graph G Result: Returns set of static graphs Graphs Gs? {} ; foreach source mode in G do /* we use stacks for both the DFS and ensuring all modes are visited */ Stack S? {} ; Stack T? {} ; SDFGraph sdfG? empty graph; T.push(node that contains the source mode) ; while T has elements do S.push(T.pop()) ; while S has elements do Actor A? S.pop() ; if A not visited then mark A as visited ; foreach mode M in A do if M not visited matches the connecting edge then S.push(actors on inputs and outputs of M) ; sdfG.add(A) ; sdfG.annInEdges(M.cons) ; sdfG.annOutEdges(M.prod) ; end mark M as visited ; end T.push(A) end end /* when the stack is empty, one static graph is complete */ if sdfG is a valid graph then if Gs.doesNotContain(sdfG) then Gs.add(sdfG) ; end end /* in every case, unwind graph */ while T has elements do if T.peek().allModesVisited() then Actor B? T.pop() ; B.resetNodeVisitedFlag() ; B.resetAllModeVisitedFlags() ; else T.peek().resetNodeVisitedFlag() ; break ; end end end end return Gs ; Figure 5.1: Algorithm for dynamic dataflow graph decomposition. 90 graph has been found in the course of DFS, nodes are popped off of T until a node is found that has another mode to be considered (i.e. the potential of another unique static subgraph). Each of the popped nodes have their mode and node visited flags cleared, thus unwinding the graph state by making them available for the mode at the top of T . Therefore multiple graphs maybe constructed from the same source mode. Currently, we only consider directed acyclic graphs, so DFS is started at the source modes in the application (i.e., those that do not need input tokens to execute). Note that mode transition edges are not considered as edges to be traversed in DFS, effectively separating the graph at mode boundaries. For example, consider the decomposition of the Switch application in Fig. 5.2(a) as shown in Fig. 5.2(b). Two source modes were found in A and B. The DFS from the mode of A ended immediately in the control mode of Switch, but the DFS from B found two matching modes in Switch (namely true and false). After a full run of DFS from B, the graph visited state unwinds back to Switch and DFS restarts again from the other Switch mode. Thus, the single dynamic BDF application graph has been transformed into three static subgraphs. Note that for a complete iteration of the original application to finish, more than one of the subgraphs must be run to completion. Indeed, because mode transitions may be arbitrary, we have no a priori way in general of exactly balancing the execution of these three graphs, and we must rely on the dynamic GSTs for proper simulation. All graphs in the set of graphs that are created by this algorithm must be subgraphs of the original graph. Edges of this subgraph are annotated with the corresponding pro- duction and consumption numbers described by the modes used in a given run of DFS. 91 (a) (b) Figure 5.2: Application decomposition example. Since the decomposition algorithm is based on DFS, the complexity of this algorithm is founded on it as well, but mode combinations make it exponential in the number of modes. Fortunately, in practice, this approach is efficient, since modes tend to be con- nected together in a structured way. 5.1.2 Simulation Results To demonstrate this approach, we chose representative mixed-model applications to experiment with: a CSDF data distribution of audio streams to be sample-rate-converted, a polyphase decimated DFT filter bank, and an application with multiple polynomial eval- uation accelerators. Figure 5.3 shows a pictorial representation of the sample rate conversion application based on concepts found in [21] and [36]. Two audio channels are to be converted on two different subsystems. The input streams are interleaved, such as how multiple audio 92 Figure 5.3: Dual sampling rate conversion. channels might come over a single digital input. With a fixed interleaving, the CSDF DISTRIBUTOR actor distributes them to the appropriate multirate data path. In this case, a series of FIR filters is dedicated to sample rate conversion. As in the case of Section 4.6.3, we also implemented an polyphase uniform DFT fil- ter bank. We constructed a decimated uniform DFT filter bank using a mixed-model con- sisting of CSDF and SDF actors [54]. In addition, to show the dynamic capability of our approach, we used an application with PEAs, which utilizes both CSDF, SDF, and BDF elements. Polynomial functions may change when senders transmit data to receivers, so the application employs Switch and Select to dynamically change between the two data paths. 93 1 2 1 1 IN 1 DISTRIBUTOR 4 1 cd2dat1_A 5 1 cd2dat1_B 8 7 cd2dat1_C 49 2 cd2dat1_D 3 1 cd2dat1_E 1 cd2dat1_F 1 MERGE 1 OUTPUT 1 IN 1 DISTRIBUTOR 3 1 cd2dat2_A 1 cd2dat2_B 49 2 cd2dat2_C 8 7 cd2dat2_D 5 1 cd2dat2_E 4 1 cd2dat2_F 1 MERGE 1 OUTPUT Figure 5.4: The APGAN schedule of the sample rate conversion application. We applied our generalized scheduling approach to each of these applications and compared it to a CFDF canonical schedule. We compared this to the static subgraphs generated by our approach, which were scheduled with both a flat scheduler based on the repetition vectors of the SDF clusters and an APGAN-based scheduler [11]. The resulting GSTs were combined into a single GST by profiling the number successful executions, to balance the execution rates. As an example, Fig. 5.4 shows the APGAN-generated schedule derived from our design flow on the sample rate conversion application. Two unique schedule trees resulted from the two subgraphs from the original application, and (based on the distributor ele- ment) one was executed twice as often as the other. Thus, these two trees were merged with iteration counts that balanced their execution (2 and 1, respectively). Results for these different styles of implementation with different schedules are summarized in Table 5.1. We simulated thousands of tokens for each application on a 1.7GHz Pentium with 1GB of memory. The results show the utility of being able to apply the generalized scheduling approach presented in this work. 94 Table 5.1: Simulation times and maximum buffer sizes for mixed-model applications using dynamic dataflow graph decomposition based schedules. Average Maximum Observed Application Schedule Simulation buffer Strategy Time (ms) size (tokens) Canonical 9,148 9,394 Sample Rate Converter Flat 1,425 2,408 APGAN 1,462 2,278 Polyphase DFT Filter Bank Canonical 910 17 Flat 1,017 24 APGAN 1,117 24 Multi-PEA Canonical 2,163 11,198 Flat 586 57 APGAN 548 57 5.2 Scheduling using Mode Grouping While static subgraphs can be successfully found by a generalized scheduling ap- proach to dynamic applications in [62], some static behaviors are not considered. For example, in the decomposition of the switch application from Fig. 5.2 the True and False modes act predictably, always returning to Control, which is the mode that transitioned to them in the first place. The repeatable nature of these branches are the kind of static behavior that is exploitable. To this end, the actor description is augmented with the concept of mode grouping, in which application writers can refine their original application by grouping modes together [63]. For an actor a with modes Ma, we define a mode grouping, Da ? Ma, as a set of modes with a static relationship. The static mode behavior exposed in this work is a cyclic mode transitions in which all modes in the grouping return exactly one mode as the next mode, except for one mode, called the en- trance mode. The entrance mode may have multiple transitions out, as it marks the single 95 point of dynamic behavior in the grouping, but after it is fired, the modes that follow it do so in a static sequence. The mode grouping can be considered by the scheduler as a single mode that has production and consumption behavior equal to the sum of the individual modes in it. The resulting schedule then includes a repeated firing the size of the mode grouping. In the switch example, two mode groupings are Da = {{Control,True}, {Control,False}}, each with Control as the entrance mode. This exposes that Control always precedes a True or a False, allowing a larger schedule tree to be formed. For this small example performance benefits are slight, but for more complex applications, the assertion that a set of modes execute in a static sequence can lead to notably smaller buffer requirements. 5.2.1 Simulation Results To evaluate the benefits of mode grouping, a set of both static and dynamic applica- tions with actors that had mode groupings to exploit were considered. These application include B-spline interpolator, a CSDF data distribution of audio streams to be sample- rate-converted, a polyphase decimated DFT filter bank, and multiple PEAs. Apart from the applications that we have used for previously described experiments in this thesis, B-spline interpolator is a new one. We refer readers to [63] for detailed information re- garding this application and how it can be modeled using CFDF model. For each applica- tion, the generalized scheduling strategy employing decomposition of dynamic dataflow graphs (using APGAN as the static scheduler) with and without mode groupings was em- 96 Table 5.2: Total buffer size requirements with and without mode grouping. Without With Percentage Application Groups Groups Improvement BSpline Interpolation 479 304 37% Sample Rate Converter 2,278 2,278 0% Polyphase DFT Filter Bank 24 24 0% Multi-PEA 3,802 2,976 22% ployed. The resulting schedule trees were balanced and ordered based on the known input conditions, be it static patterns or probability distributions (see [63] for more details). As seen from the simulation results in Table 5.2, the two purely static applica- tions showed no benefit of using mode grouping. While mode groups were identified in CSDF actors, the original generalized scheduler performed equally well with and with- out groups. Once any dynamic behavior was inserted (i.e. the B-spline Controller and the PEA dynamic Switch and Select pair), mode grouping showed a significant improvement finding more (and larger) static schedule trees, which provided a direct sav- ings in buffering by more optimal actor firings. Generalized scheduling with and without groups for each of these examples took less than 5 seconds on a modern CPU [63]. 5.3 Parameterized Looped Schedules The latest communication technologies invariably consist of modules with dynamic behavior. There exists a number of design tools for communication system design with their foundation in dataflow modeling semantics. These tools must not only support the functional specification of dynamic communication modules and subsystems but also pro- vide accurate estimation of resource requirements for efficient simulation and implemen- 97 tation. We explore this trade-off ? between flexible specification of dynamic behavior and accurate estimation of resource requirements ? using a representative application employing an AMS. We propose an approach for precise modeling of such applications based on a recently-introduced form of dynamic dataflow called core functional dataflow. From our proposed modeling approach, we show how parameterized looped schedules can be generated and analyzed to simulate applications with low run-time overhead as well as guaranteed bounded memory execution. We have presented some of this work in [70] using the Advanced Design System from Agilent Technologies, Inc., which is a commercial tool for design and simulation of communication systems, for demonstration. In this thesis, We use DIF for demonstrating this technique. There is generally a trade-off between the expressive power of the dataflow model being used and the compile-time (i.e., prior to execution or simulation) predictability that is available when analyzing specifications in that model. Although it is desirable to have as much expressive power as possible to best capture the dynamic nature of modern DSP and communication applications, this can lead to significant reductions in the ability to predict hardware and software resource requirements when targeting simulation or effi- cient implementation. Many of these applications are ?mostly-static? hybrids in that they involve static dataflow components along with a relatively small proportion of dynamic components. We show an approach to modeling and scheduling of such hybrid commu- nication system applications using CFDF. The scheduling technique presented in this section is applicable to a class of ap- plications that we have already explained in Sections 3.5.5 and 4.3. In these sections, we have also introduced the AMS, which we use as a case study, and modeled it using 98 CFDF model. In this work, we show how efficient PLSs can be derived from the CFDF representations. for this restricted class of the CFDF applications. This restricted form is defined in a way that introduces a new trade-off point between expressive power and analysis potential that is useful for modeling of modern communication systems. 5.3.1 Related Work We have explained in previous chapters use of dataflow models like BDF to model dynamic dataflow behavior. The problem of scheduling such dynamic dataflow appli- cations has also been studied, and important results have been established regarding bounded memory execution and compile-time scheduling (e.g., see [16, 53]). Most of these approaches employ scheduling schemes that suffer from significant run-time over- head, difficulties in code generation, and lack of compile-time predictability (e.g., for val- idating real-time signal processing performance). The scheduling techniques described in earlier sections ? using dynamic dataflow graph decomposition and mode grouping ? do not in general guarantee bounded memory execution for the entire input application. A meta-modeling technique such as PDF [6] supports limited forms of dynamic behavior and has more compile-time predictability than more general kinds of dynamic dataflow models such as BDF. A useful feature of PSDF, for example, is its capability of efficient quasi-static scheduling in terms of PLSs [6]. This class of schedules allows for flexible, compact specification of nested loop structures, where loop iteration counts can be either constant values or symbolic expressions in terms of dynamic parameters in the underlying dataflow graph. While PDF is useful for many kinds of signal processing 99 applications, it imposes significant restrictions on how applications are modeled (e.g., in terms of hierarchies of cooperating init, subinit, and body graphs [6]), and in general, major changes in the user interface are required to provide direct support for PDF in a design tool. In contrast to the approaches for scheduling BDF or PDF graphs, the approach that we present here provides PLS-based bounded memory scheduling while operating within a semantic framework that can be integrated more directly into existing design tools compared to the more hierarchical semantic structure of PDF representations. 5.3.2 Constructing Parameterized Looped Schedules This technique, developed in [70], is applicable to a class of applications shown in Fig. 3.6(a). It seeks to generate efficient PLSs to reduce the run-time overhead associ- ated with dynamic scheduling. Such quasi-static schedules are also useful from a code generation perspective as the only dynamic components of such schedules are the loop iteration counts. Our approach finds static regions in the application graph that can be clustered and completely scheduled at compile-time. It then proceeds to identify the dy- namic components along with the corresponding static components, which must execute varying numbers of times in relation to the dynamic components. We then merge the appropriately-iterated static and dynamic components into a single PLS. The following sequence of steps outlines our algorithm for PLS construction (see Fig. 3.6(a)): 1. Identify SDF components in the dataflow graph and cluster them individually to 100 1 Mapper mi Mapper Channel 2 ni Bits Demapper Sink Figure 5.5: Valid PLS for the application in Fig. 3.6(b). obtain SDF-clustered regions S1, S2, and S3. This step can be performed efficiently since the specification of a CFDF actor mode includes the associated production or consumption rate for each actor port. 2. Use established SDF scheduling techniques [11, 37] for scheduling the SDF regions identified in step 1, assuming that a valid consistent schedule exists for each of the SDF subgraphs [46]. 3. Identify the pair D1 and D2 of actors with dynamic behavior, and determine which of the SDF sub-schedule loop (iteration) counts are dependent on D1 and D2. 4. Combine static sub-schedules into a PLS in which parameterized loop count ex- pressions are set up at compile time, and symbolic parameters in these expressions are varied at run-time. Fig. 5.5 illustrates a PLS for the dataflow graph in Fig. 3.6(b) that is derived using our approach to PLS construction. As can be seen from the GST in Fig. 5.5, CFDF actors use guarded execution, while other SDF actors are fired using unguarded execution in which the actor is fired 101 without checking if it is enabled (enabling is guaranteed through a carefully constructed PLS). The values of mi and ni are determined dynamically by the simulator based on the current modes of the mapper and demapper actors, respectively. Since the mode of an actor is visible to the simulator (through a flexible mode-querying mechanism in our implementation of CFDF), it can be used to set loop counts based on dynamically- changing execution state of the actor. For the class of applications targeted in this section of the thesis (see Sections 3.5.5 and 4.3) and PLSs generated using the algorithm described in this section, the number of tokens accumulated on the edge e1 (e4) after the ith iteration is related to mi and ci (ni and pi). These expressions can then be used to prove that the numbers of tokens accumulated on edges e1 and e4 are bounded by k+cmax?1 and q+pmax?1, respectively (see Section 5.3.3). Together with bounds that are derived based on the static dataflow properties of the other edges, this leads to a bound on total buffer memory requirement that can be computed at compile-time. Such bounds provide for more efficient execution or simulation (since dynamic memory allocation is not required) as well as enhanced predictability and reliability. 5.3.3 Bounded Memory Execution Since the CFDF model is Turing complete, the problem of determining whether a CFDF graph can be scheduled within bounded memory in finite time is undecidable [16]. However, for a class of applications, the graphical representations of which can be re- duced to the topology shown in Fig. 3.6(a), we can guarantee a bounded memory execu- 102 tion, if one exists. Consider the dataflow graph shown in Fig. 3.6(a). We assume that there exist valid, consistent schedules for the SDF clusters in this dataflow graph [46, 11]. Note that such schedules are periodic schedules that execute with bounded memory. From the graph topology, it is clear that we have a simple cycle, where the feedback edge and all other edges in the cycle are single-rate. Hence, the only buffer edges where we can have un- bounded accumulation of data tokens are the edges that connect SDF cluster S1 with the actor D1 and the actor D2 with SDF cluster S3. With these assumptions and observations, it suffices to show that these two buffers can be bounded in order to establish bounded memory execution for the application. Consider the edge e1. Let ti denote the number of tokens accumulated on an edge e1 after the ith iteration of the entire graph schedule, for i = 0, 1, . . . with t0 = 0. Let mi and ci denote the value of the parameterized loop counts and the number of tokens consumed from edge e1 during the ith iteration, respectively, for i = 1, 2, . . . . From the dataflow graph topology shown in Fig. 3.6(a) and (b), and the schedule shown in Fig. 5.5, we have ti = k ?mi + ti?1 ? ci, (5.1) where mi = d(ci ? ti?1)/ke . (5.2) Substituting (5.2) in (5.1) and using the following relation [18] d(a/b)e ? (a + b? 1)/b, ? integersa, b > 0, 103 it can be shown that ti ? k ? 1. (5.3) Now the maximum number of tokens that is ever queued on the buffer e1 is bounded above by the sum of the maximum number of tokens that remain after a schedule iteration and the maximum number of tokens that can be consumed from the buffer during an iteration. Thus, the number of tokens that are accumulated in the buffer for edge e1 is bounded above by (k + cmax ? 1). Similarly, we can show that the buffer for edge e4 has an upper bound on the number of tokens accumulated in it that is given by (q+pmax?1). We thus have upper bounds on the numbers of tokens accumulated in dataflow buffers for which dataflow behavior varies during run-time. These bounds, together with bounds that are derived based on the static dataflow properties of the other edges, provide a bound on the total buffer memory requirement. Moreover, this aggregate bound can be computed at compile-time, which has advantages as mentioned in Section 5.3.2. 5.3.4 Simulation Results We have implemented the approaches to CFDF modeling and PLS construction us- ing the Advanced Design System (ADS) tool from Agilent Technologies, Inc. [60]. We have employed the CFDF model for dynamic actors along with the existing SDF based actors in Agilent ADS. Using such a design approach, we implemented the AMS appli- cation shown in Fig. 3.6(b). Results for simulations of PLS-based execution of the AMS application, as implemented in Agilent ADS, are presented in [70]. In this thesis, we use DIF to prototype the AMS application using the CFDF model 104 and generate a PLS using the algorithm described in [70] as well as Section 5.3.2. In [70], we compared the performance of a PLS with that from a canonical schedule. Extend- ing [70] further, we also compare the performance of a PLS with that of the dynamic dataflow schedule (DDFS) employed in Ptolemy II [16, 15]. For this, we have imple- mented the algorithm used in Ptolemy II to generate a DDFS in DIF. A DDFS is widely used to simulate dynamic dataflow applications. In a DDFS, at any given time during a simulation, the scheduler determines whether each of the actors is enabled or deferrable. A deferrable actor is the one that is enabled, but does not require to be fired in order for any of its downstream actors to be enabled. Correspondingly, we may also have actors that are enabled but not deferrable (EBND) ? that is, actors that are enabled, and must be fired in order to have sufficient number of tokens produced at their outputs for one or more of their downstream actors to be enabled. Among the enabled actors, a DDFS first attempts to fire an actor only from the set of actors that are EBND. If there is no such actor, it proceeds to fire one of the enabled actors. After every invocation of an enabled actor, the DDFS checks if the invocation has changed the status of any of the other actors in the graph in terms of being enabled or deferrable. The execution proceeds until the control condition (a pre-specified condition for terminating graph execution) is reached or a deadlock condition is reached. A DDFS is designed to fire actors only when re- quired for enabling downstream actors. By doing so, it aims to minimize the total buffer requirements for the application graph. The results of our experiments for different sink control conditions (the total number of tokens that must be consumed by the sink actor during the simulation) are shown in Table 5.3 and Table 5.4. 105 Table 5.3: Average simulation time for different sink control conditions (numbers of to- kens consumed by the sink actor) using a canonical schedule (CS), DDFS, and PLS. Sink control Average simulation time (ms) Reduction (%) compared to condition CS DDFS PLS CS DDFS 10000 78 186 73 6.41 60.75 20000 111 301 90 18.92 70.10 50000 222 665 148 33.33 77.74 100000 401 1313 248 38.15 81.11 Table 5.4: Total buffer requirements for different sink control conditions (numbers of tokens consumed by the sink) for a canonical schedule (CS), DDFS, and PLS. Sink control Total buffer requirement (number of tokens) condition CS CS CS DDFS PLS PLS Minimum Average Maximum Experiments Theory 10000 25 28 37 19 19 19 20000 27 31 37 19 19 19 50000 29 35 45 19 19 19 100000 31 34 37 19 19 19 As evident from the results, the PLS method exhibits significant reductions in run- time overhead over a canonical schedule and DDFS, which leads to improvements in average simulation time ? up to 38% over a canonical schedule and 81% over a DDFS in our DIF implementation. Speed-up in simulation over a DDFS can be attributed to elimination of run-time overheads corresponding to determining the status of each actor (enabled or deferrable) in the graph. For a canonical schedule, total buffer requirements vary from one simulation of an application to another owing to dynamically changing dataflow behavior. We have reported minimum, maximum, and average buffer require- ments from our experiments using a canonical schedule. It should be noted that the total buffer requirements for a canonical schedule not only vary significantly across different 106 simulations but also are much higher compared to that of a PLS or DDFS. These experi- ments not only confirm the theoretical buffer bounds for PLSs estimated using the results mentioned above but also demonstrate significant reductions in the total buffer memory requirements over the canonical schedule, especially for larger values of sink control con- ditions (i.e., longer simulations). Since a DDFS employs a strategy in which actors are fired only when needed, it tries to optimize the total buffer requirements. For the AMS application in our experimental study, our PLS can achieve this optimized buffer require- ment, but with much higher simulation speeds. In summary, from our study, trade-offs among the scheduling techniques that we have examined in this section can be summarized by listing the advantages of each of the techniques as follows: ? CFDF canonical scheduler offers simplicity of implementation (e.g., for fast, early stage prototyping) and generality (arbitrary topologies can be handled). ? DDFS offers buffer size minimization, and generality. ? PLS offers buffer size minimization with compile-time analysis, and fast simulation performance. Intuitively, while the canonical schedule and DDFS offer generality for the associ- ated dynamic dataflow modeling techniques, PLSs typically require significant amounts of static or quasi-static structure to be useful ? however, when such structure can be found, their benefits can be significant, as shown in our case study using the AMS ap- plication. Furthermore, it is conceivable that the performance benefits presented here for 107 PLSs can be extended beyond simulation to implementation/synthesis scenarios; this is a useful direction for further investigation. 5.4 Summary We have presented a generalized scheduling strategy for scheduling dynamic dataflow applications that leverages CFDF semantics, which structures dynamic actors as a set of modes with fixed behavior. We presented an algorithm that decomposes dynamic dataflow graphs into a set of dynamically interacting static dataflow graphs. We demonstrated this on mixed-model applications with existing schedulers, which gave a positive indication of the utility of this approach for software implementations of such dynamic dataflow ap- plications. An immediate direction of future work is to improve the sophistication of the simulator. With a more intelligent way of dynamically switching between the resulting static schedule trees, we should achieve better run times and smaller maximum buffer sizes. A limitation of our approach, compared to related techniques, is that special at- tention is required by the designer to explicitly specify the dataflow properties associated with individual modes, and attention is also needed during testing to validate that the declared and observed behaviors match. An interesting direction for future work is the integration of our proposed scheduling methods with more formal reasoning about actor sub-behaviors, such as those being developed in conjunction with languages and models such as CAL and SysteMoc. We have presented a generalized scheduling approach with mode grouping that ex- poses more static behavior of a dynamic application graph. By identifying static groups 108 of ?modes? inside actors, we expose more of the static nature of the application, allowing traditional scheduling techniques to improve on memory requirements by up to 37%. De- veloping dynamic schedule tree selector so that a simulator or a final implementation may strategically switch between the known static behaviors at run-time is a useful direction for future work. Our PLS approach identifies the underlying static components in the application, systematically integrates the well-established compile-time scheduling techniques for SDF graphs with more flexible CFDF semantics, and uses combined CFDF/SDF analysis to generate PLSs that have significantly reduced run-time overhead, guaranteed memory bounds, and reduced memory requirements. Our approach, therefore, provides robust simulation of dynamic communication applications without major limitations on compile- time predictability and efficient scheduling. 109 Chapter 6 Dataflow-based Rapid Prototyping for Radio Astronomy Signal Processing There is a growing trend toward using high-level tools for design and implementa- tion of radio astronomy DSP systems. Such tools, for example, those from the CASPER group, are usually platform-specific, and lack high-level, platform-independent, portable, scalable application specifications. This limits the designer?s ability to experiment with designs at a high-level of abstraction and early in the development cycle. We address some of these issues using a model based design approach employing dataflow modeling, which is extensively used in design of embedded DSP systems. We use an application employing a TDD to allow narrow band modes in spectrometers as a driving and demon- strative application. Our design is targeted toward an FPGA platform, called the IBOB, that is available from the CASPER group. By a TDD, we imply a hardware digital down- converter design that can be reconfigured without the need for regenerating the hardware code. Such a design is currently not supported in the CASPER DSP library. The work presented in this chapter focuses on two aspects. Firstly, we introduce and demonstrate a dataflow-based design approach using the DIF tool for high-level application specifica- tion, and we integrate this approach with the CASPER tool flow. Secondly, we explore the trade-off between the flexibility of TDD designs and the low hardware cost of fixed- configuration digital downconverter (FDD) designs that use the available CASPER DSP 110 library. We further explore this trade-off in the context of a two-stage downconversion scheme employing a combination of TDD or FDD designs. 6.1 Introduction Key challenges in designing DSP systems employed in the field of radio astronomy arise from the need to process very large amounts of data at very high rates arriving from one or more telescopes. It is also desirable to have scalable and reconfigurable designs for shorter development cycles and faster deployment. Moreover, these designs should be portable to different platforms to keep up with advances in new hardware technologies. However, conventional design methodologies for signal processing systems in the field of radio astronomy focus on custom designs that are platform-specific. Such designs, by virtue of being platform-specific, are highly specialized, and thus difficult to retarget. The design approaches also lack high-level platform-independent application specifica- tions that can be experimented with, and later ported to and optimized for various target platforms. This limits the scalability, reconfigurability, portability, and evolvability across varying requirements and platforms of such DSP systems. A model based approach for design and implementation of a DSP system can ef- fectively exploit the semantics of underlying models of computation for precise estima- tion and optimization of system performance and resource requirements (e.g., see [8]). Though approaches for scalable and reconfigurable design based on modular FPGA hard- ware and software libraries have been developed (e.g., see [55, 56]), they neither allow for high-level abstraction nor provide linkage to formal models of computation. 111 We propose an approach using DSP-oriented dataflow models of computation to address some of these issues [46]. Dataflow modeling is extensively used in developing embedded systems for signal processing and communication applications, and electronic design automation [8]. Our design methodology involves specifying the application in DIF [36] using an appropriate dataflow model. This application specification is trans- formed into an intermediate, graphical representation, which can be further processed using graph transformations. The DIF tool allows designers to verify the functional cor- rectness of the application, estimate resource requirements, and experiment with various dataflow graph transformations, which help to analyze or optimize the design in terms of specific objectives. The DIF-based dataflow specification is then used as a reference while developing a platform-specific implementation. We show how formal understanding of the dataflow behavior from the software prototype allows more efficient prototyping and experimentation at a much earlier stage in the design cycle compared to conventional design approaches. As mentioned earlier, we demonstrate our approach using the design of a TDD that allows fine-grain spectroscopy on narrow-band signals. The TDD, which was origi- nally designed for the Green Bank Ultimate Pulsar Processing Instrument (GUPPI at the NRAO, Green Bank, finds its use in the spectrometers currently under development for the GBT and 20m telescope at the NRAO, Green Bank. One of the motivations has been to have a TDD design, where by a TDD, we mean a digital downconversion system that supports changes to the targeted downconversion ratio without requiring regeneration of the corresponding hardware code. Development of such a TDD is a significant contribu- tion of this work. We compare our TDD with the FDD designs that use the current DSP 112 library from the CASPER group. Through this kind of comparison, we explore trade-offs between the flexibility offered by TDD designs and their hardware cost. 6.2 Related Work There exist high-end reusable, modular, scalable, and reconfigurable FPGA plat- forms such as the Berkeley Emulation Engine 2 (BEE2) and IBOB, which have been in- troduced specifically for DSP systems [17]. The BEE2 uses SDF as a unified computation model for both the microprocessor and the reconfigurable fabric. It uses a high-level block diagram design environment based on The Mathworks? Simulink and the XSG. This de- sign environment, however, does not expose the underlying dataflow model. In particular, the designer has little or no scope to make use of the underlying dataflow model for ex- perimentation. Also, the SDF model used for programming the BEE2 is a static dataflow model in that all the dataflow information is available at compile-time (i.e., before exe- cuting or running the application). Though this feature provides maximal compile-time predictability, it has limited expressive power. It does not allow for data-dependent, dy- namic behavior, which is exhibited by many modern DSP applications, such as the TDD application introduced in Section 6.3. Other forms of dataflow models that can capture more application dynamics with acceptable levels of compile-time predictability may bet- ter exploit the features offered by platforms such as the BEE2. Model based approaches for designing large scale signal processing systems with a focus on radio telescopes has been previously studied (e.g., see [2, 50, 49]). Several frameworks have been proposed for model based, high-level abstractions of architec- 113 tures along with performance/cost estimation methods to guide the designer throughout the development cycle (see [2]). However, the focus of these approaches has been on architecture exploration. There have also been attempts to derive implementation-level specifications starting from system-level specifications by segregating signal processing and control flow into an application specification and architecture specification, respec- tively (see [50, 49]). However, the choice of models of computation has been made primarily from control flow considerations rather than dataflow considerations. These ap- proaches, though relevant, do not specifically address the issue of high-level application specification for platform-independent prototyping and use of models of computation for abstraction of heterogeneous or hybrid dataflow behaviors. This issue is critical to effi- cient prototyping of high performance signal processing applications, which are typically dataflow dominated, and include increasing levels of dynamic dataflow behavior (e.g., see [8]). We address this issue using the CFDF model with underlying PSDF or PCSDF be- havior and using it for system prototyping. We then show how platform-independent spec- ifications based on this modeling technique can be used to efficiently develop platform- specific implementations. 6.3 Tunable Digital Downconverter In DSP literature, the terms downsampling, decimation, and downconversion are often used interchangeably. In this chapter, a decimator refers to a block that simply deci- mates, downsamples, or downconverts the input signal without any other processing (e.g., 114 see Fig. 2.2(a) and (b)). The ratio of the sampling rate at the input of a decimator to that at its output is referred to as its decimation factor. A decimator is generally preceded by an anti-aliasing filter [80]. In this chapter, we refer to such a combined structure, con- sisting of a filter and decimator, as a decimation filter (e.g., see Fig. 2.3(a) and (b)). In a polyphase implementation of a decimation filter, such as the one we use in our imple- mentation, this structure is implemented as a single computing block [80]. We refer to the system or application that employs a decimator or decimation filter, possibly with other blocks such as mixers and filters, as a digital downconverter, and in particular, a FDD or TDD (e.g., see Fig. 6.1). The decimation factor of a decimation filter, TDD, or FDD refers to that of the decimator in it. Fig. 6.1 shows a block diagram of a TDD application. It shows an 8-bit analog-to- digital converter (ADC) that receives a baseband input IF signal of bandwidth 800MHz and samples it at the sampling rate of 1.6 giga-samples/second (GS/s). The internal design of the ADC block is such that 8 samples, where each sample is an 8-bit fixed point number, are output on the eight output lines at the same clock pulse. This results in 200mega- samples/second (MS/s) on each of the outputs of the ADC block. Correspondingly, all the downstream blocks also have 8 input and output ports. We, thus, have 8 connections between any two blocks shown in Fig. 6.1 that are directly connected. We, however, have not shown this detail (all 8 connections) for the sake of clarity and simplicity. The TDD subsystem, identified by the dotted box in Fig. 6.1, has to downsample this signal so that the resultant signal at the output of the TDD will have a tunable band of user-specified bandwidth (Bw) and center frequency (Cf ). The output of the TDD is fed to the downstream DSP blocks over the 10x auxiliary user interface (XAUI) ports. The 115 Figure 6.1: Block diagram of a tunable digital downconverter. XAUI ports stream data over CX4 connectors of the IBOB, and have a maximum data transfer capability of 10 giga-bits per second (Gbps). During narrow-band observation modes, the TDD allows narrow-band signals to be sampled at their corresponding Nyquist rates. When presented to the same number of channels in the downstream DSP system as that for the wide-band IF input, this allows fine-grain spectroscopy. Our TDD design supports integer decimation factors between 5 and 12. The valid values of Cf corresponding to the selected Bw can vary so as to span the entire 800MHz IF input. As shown in Fig. 6.1, the TDD consists of a tunable FIR filter. If the desired output is a baseband signal, then the FIR filter simply acts as a rectangular window with each of its taps set to 1. Also, in this case, the fork and select blocks are configured to route the output of the FIR filter directly to the tunable decimation filter (TDF), bypassing the mixer. If the desired output is not a baseband signal, the FIR filter acts as a bandpass filter (BPF). The cut-off frequencies for this BPF are set using the specified parameter configuration (Bw and Cf ). In this case, the output of the BPF is fed to a real mixer, 116 Figure 6.2: Proposed dataflow-based approach. which translates it into a baseband signal. The local oscillator, with a frequency fLO, is implemented as a numerically controlled oscillator (NCO). The frequency, fLO, is depen- dent on the value of Cf and Bw. The output of the mixer is then fed to the TDF, which downsamples its input depending upon the specified Bw or decimation factor. 6.4 Dataflow-based Design and Implementation We propose an approach for design and implementation of a TDD based on the dataflow formalisms discussed in Section 2.1 along with relevant capabilities of the DIF tool described in Section 2.2. Fig. 6.2 gives an overview of our dataflow based approach, which we now describe. 117 6.4.1 Modeling and Prototyping using DIF We start with an application specification that describes the DSP algorithm under consideration (for example, here, TDD) along with proper input and output interfaces. The application is specified using the DIF language. This DIF specification consists of topological information about the dataflow graph ? interconnections between the actors along with input and output interfaces. The DIF specification is a platform-independent, high-level application specification. The specification can be used, for example, to simu- late the application, given the library of actors from which the specification is constructed. Depending upon the application under consideration, the designer can select from among a variety of dataflow models of computation in DIF to effectively capture rele- vant aspects of the application dynamics. It should be noted that the designer does not always need to a specify the model in advance. The CFDF model can be used to de- scribe individual modules (actors) in the application, and the DIF package can analyze the CFDF representation (CFDF modes, to be specific) of the actors, as specified by the designer through the actor code, and annotate the actors with additional dataflow infor- mation using various techniques for identifying specialized forms of dataflow behavior (e.g., see [67]). This step requires the functionality of individual actors to be specified in CFDF semantics. The designer can use the existing blocks from the Java actor library in DIF or develop his or her own library of CFDF actors. In terms of tunability, the key components of the TDD as seen from Fig. 6.1 are the tunable FIR filter, and decimation filter blocks. The TDF block is of particular interest, considering that it is the only multirate block in the system. Its behavior resembles that of 118 ADC Decimator 6 1 6 1 6 1 6 1 Output 1 1 1 1 1 1 1 1 (a) ADC Decimator [1 1 1 1 1 1] 1 [1 1 1 1 1 1] 1 [1 1 1 1 1 1] 1 [1 1 1 1 1 1] 1 Output 1 [0 0 0 0 1 0] 1 [0 0 0 1 0 0] 1 [0 1 0 0 0 0] 1 [1 0 0 0 0 0] (b) Figure 6.3: Dataflow behavior of a Decimator actor with 4 inputs and outputs for a decimation factor of 6 using (a) SDF, and (b) CSDF models. the one described in Section 2.1.3. In view of this, we have identified PSDF and PCSDF as candidate dataflow models for efficient implementation of the targeted TDD system. For this system, we have to take into account the multiple inputs and outputs to actors, as mentioned in Section 6.3. To illustrate details of the dataflow behavior of a decimator actor based on such specifications, we have shown one such Decimator actor with 4 inputs and outputs, and having a decimation factor of 6 in Fig. 6.3(a) and Fig. 6.3(b). For the sake of simplicity and clarity, we have excluded the other single rate blocks from the application graphs in this figure. In our implementation, we extend this behavior for an actor with 8 inputs and outputs. We have created a DIF prototype using PSDF and PCSDF as underlying models for equivalent CFDF representation of actor blocks. We have also developed a Java library of actors in DIF adhering to CFDF semantics for all of the blocks. We then used DIF for software prototyping, analysis, and functional simulation. The DIF package uses the DIF specification to generate an intermediate graph represen- 119 tation, which can then be used as an input for further graph transformations including a scheduling transformation, which determines the schedule for an application. Here, by a schedule, we mean the assignment of actors to processing resources, and the execution ordering of actors that share the same resource. The functional simulation capabilities provided in DIF can be used to analyze and estimate buffer requirements in terms of the numbers of tokens accumulated on the buffers that correspond to dataflow graph edges. This provides an estimate of total memory requirements as well as specifications for in- dividual buffers when porting the application to the targeted implementation platform. Fig. 6.4 shows the TDD application graph generated using DIF. This is based on the TDD block diagram shown in Fig. 6.1 with addition of some actors that handle parameter configuration for the actors. We discard one of the two sets of outputs (more specifically, sine output) of the localOsc actor as we have employed a real mixer in our design. The complexity of the graph, which is increased due to multiple parallel edges between two actors, can easily be captured through a DIF specification that makes use of topological patterns. We have shown two possible specifications of the graph topology in DIF using topological patterns in Fig. 6.5 and Fig. 6.6. Using the TDD specification and employing the notion of PLSs described in Sec- tion 5.3, we construct PLSs for the TDD application. Fig. 6.7(a) shows a PLS for a TDD application, where the decimator actor has the underlying SDF model, while Fig. 6.7(b) shows one in which the decimator actor employs the CSDF model. As an- notated in these GSTs, loop counts p0, p1, and p2 are parameterizable. The loop count p0 is set to a user-specified number of iterations, while the loop counts p1 and p2 are tuned based upon the decimation factor as well as the underlying dataflow model for the 120 source copy bpf Merge multiplier decimator sink control fork_0_ fork_1_ fork_2_ localOsc dump Figure 6.4: TDD application graph generated using DIF. decimator. Fig. 6.7(a) and (b), in particular, show values of the parameterizable loop counts set for a decimator with a decimation factor of 11. This PLS can be viewed as providing CFDF-based execution for the given PDF-based actor specification model. The effect of various decimation factors on the total buffer requirements (in number of tokens) is shown in Table 6.1. Table 6.1 shows the total buffer requirements using PLSs for various configurations of decimation factors. Note that for a given configuration (setting of graph parameters), a 121 topology { nodes = source, copy, bpf, Merge, decimator, sink, control, fork[3], multiplier, localOsc, dump; edges = soCp[8] -> multiedge(source, copy), cpMrg[8] -> multiedge(copy, Merge), cpBpf[8] -> multiedge(copy, bpf), bpfMul[8] -> multiedge(bpf, multiplier), mulMrg[8] -> multiedge(multiplier, Merge), mrgDec[8] -> multiedge(Merge, decimator), decSnk[8] -> multiedge(decimator, sink), loMul[8] -> multiedge(localOsc, multiplier), loDump[8] -> multiedge(localOsc, dump), conFrk, f0f1, f1f2 -> chain(control, fork[0:2]), f0Bpf(fork[0], bpf), f1Lo(fork[1], localOsc), f2Mrg(fork[2], Merge), f2Dec(fork[2], decimator); } Figure 6.5: Partial DIF specification ? topology block ? for the TDD application graph using topological patterns. PSDF or PCSDF graph behaves like an SDF or CSDF graph, respectively. It can be seen that for the SDF model, the total buffer requirements vary with the decimation factor, and this is due to input buffers to the TDD block that need to accumulate varying numbers of tokens. Thus, employing the PSDF model will require tuning buffer sizes for different decimation factors if one wants to provide for optimized buffer sizes in terms of graph parameters. We have used the CASPER tool flow for developing our platform-specific imple- mentation as explained later in Section 6.4.2. This implementation is targeted to an FPGA. Our objective here is to support tuning the decimation factor without regener- ating hardware code. A dataflow buffer can be emulated using dual-port random access 122 topology { nodes = source, copy, bpf, Merge, decimator, sink, control, fork[3], multiplier, localOsc, dump; edges = c0[6] -> chain(source, copy, bpf, multiplier, Merge, decimator, sink), c1[6] -> chain(source, copy, bpf, multiplier, Merge, decimator, sink), c2[6] -> chain(source, copy, bpf, multiplier, Merge, decimator, sink), c3[6] -> chain(source, copy, bpf, multiplier, Merge, decimator, sink), c4[6] -> chain(source, copy, bpf, multiplier, Merge, decimator, sink), c5[6] -> chain(source, copy, bpf, multiplier, Merge, decimator, sink), c6[6] -> chain(source, copy, bpf, multiplier, Merge, decimator, sink), c7[6] -> chain(source, copy, bpf, multiplier, Merge, decimator, sink), cpMrg[8] -> multiedge(copy, Merge), loMul[8] -> multiedge(localOsc, multiplier), loDump[8] -> multiedge(localOsc, dump), conFrk, f0f1, f1f2 -> chain(control, fork[0:2]), f0Bpf, f1Lo, f2Mrg, f2Dec -> parallel(fork[0:2], fork[2], bpf, localOsc, Merge, decimator); } Figure 6.6: Partial DIF specification ? topology block ? for the TDD application graph using topological patterns. memory (RAM) blocks in the targeted FPGA device, but tuning the sizes of such blocks is not possible during run-time. The ADC output is of a streaming nature (data is pro- duced or consumed at every clock cycle without any synchronization signal), as is the DSP subsystem downstream of the TDD. In order to achieve the throughput constraint imposed by the maximum data rate of the ADC output stream, SDF buffers need to be pipelined, which is not efficient using RAM blocks. Thus, we use the CSDF model, which does not require tuning of dataflow 123 1 control fork_0_ fork_1_ bpf fork_2_ localOsc Merge decimator 1 p0 = 10 p2 = 11 source copy bpf localOsc multiplier dump Merge p1 = 1 1 sink decimator (a) 1 control fork_0_ fork_1_ bpf fork_2_ localOsc Merge decimator 1 p0 = 10 p2 = 1 source copy bpf localOsc multiplier dump Merge p1 = 11 1 sink decimator (b) Figure 6.7: PLSs for the TDD application configured for a decimation fac- tor of 11, and decimator actor employing the (a) PSDF and (b) PCSDF models of computation. 124 Table 6.1: Total buffer requirements from a DIF prototype for different decimation factors using parameterized looped schedules. Decimation Factor 5 6 7 8 9 10 11 12 Total buffer requirements SDF 132 140 148 156 164 172 180 188 (Number of tokens) CSDF 100 100 100 100 100 100 100 100 buffer sizes to achieve the maximum throughput constraint, as observed from our DIF- based prototype. A synchronization or enable signal derived from the TDD is used as a clock to drive the downstream DSP system. This signal is a decimated version of the clock signal. We use our DIF prototype as a reference while integrating the design with the cur- rent CASPER tool flow for the target implementation on the IBOB. Section 6.4.2 further elaborates on this approach along with implementation results. 6.4.2 Integration with the CASPER Tool Flow The CASPER tool flow is based on the BEE XPS tool flow [56]. This tool flow requires that an application be specified as a Simulink model using XSG [56]. Since we do not have an automated tool for transforming a DIF representation into an equiv- alent Simulink model, porting the DIF specification to Simulink/XSG requires manual transcoding of the DIF specification. This also requires implementing parameterizable actor blocks that are currently not supported in the XSG, CASPER, or BEE XPS libraries. Each actor gets transformed into an equivalent functional XSG block. For each of the Simulink actor blocks, we provide a pre-synthesis parameterization that allows chang- ing block parameters before hardware synthesis (see [57] for more details on Simulink 125 scripting). In order to implement our objective of tunability ? post-synthesis parame- terization ? we use the software register mechanism in the BEE XPS library to specify parameters that change during run-time (that is, after hardware code is generated, and depending upon user requirements.) Software registers can be accessed and set during run-time from the TinyShell inter- face available for IBOB. This allows tuning TDD parameters without re-synthesizing the hardware each time the parameters change from the previous setting. Each block has an enable input signal. Through systematic transformations, an application graph in DIF can be converted into an equivalent Simulink/XSG model. We have developed an interface software package using C programs, and Bash and Python scripts to compute software register values for the required TDD configuration, and set these values on the IBOB over a telnet connection, which is used for remote access to the hardware platform at NRAO. On the targeted FPGA device, we have employed dual-port RAM blocks that are loaded with pre-computed sinusoidal signal values of the required precision. Each of these dual-port RAM blocks is used to simultaneously read sine and cosine values from both of its ports. The oscillator frequency is set using a software register, and depends upon the desired output signal band. The FIR filter associated with the TDF block can have up to 16 taps. Currently, the generic FIR filter without any decimation (used, for example, as a BPF in the design) can have up to 8 taps. These, again, are set using software registers. We have employed two filter banks in our design of a TDF that operate in tandem to allow maximum throughput. Hence, our TDF block has 32 multiplication operations. As mentioned earlier, our TDF design employs a polyphase implementation as described in [80]. 126 Table 6.2: Implementation summary for TDD designs. Parameter Design 1 Design 2 Design 3 Design 4 Mixer No Yes No Yes Input bandwidth (MHz) 800 800 800 800 Decimation factor D 5 ? D ? 12 5 ? D ? 12 5 ? D ? 12 5 ? D ? 12 Latency (ns) 65 150 85 190 FPGA slices 12234 (52%) 13315 (56%) 12322 (52%) 14232 (60%) (Out of 23616) 4 input LUTs 14139 (29%) 16123 (34%) 12123 (25%) 15035 (31%) (Out of 47232) Block RAMs 41 (17%) 48 (20%) 41 (17%) 48 (20%) (Out of 232) 18? 18 Multipliers ? ? 32 (13%) 95 (40%) (Out of 232) Table 6.4.2 shows results for the TDD implementation on the IBOB using the Xilinx EDK 7.1.2. We have used this hardware platform and tool for all of the experiments reported on in the remainder of this chapter. Design 1 shows some of the device utilization parameters for a TDD that supports only baseband modes. This design does not include the tunable FIR filter, NCO, and mixer blocks shown in Fig. 6.1. Design 2 is based on the block diagram of a TDD shown in Fig. 6.1. As evaluation metrics for hardware cost, we have used the utilization of FPGA slices, 4-input look-up tables (LUTs), and block RAM units, and the number of embedded multipliers. Note that neither of these two designs use any of the available embedded multipliers for multiplication. Designs 3 and 4 are modified versions of designs 1 and 2, respectively, in that they employ embedded 18? 18 multipliers. It can be seen that using embedded multipliers does not provide significant improvements in hardware cost. We observe that use of embedded multipliers, in fact, needs to be accompanied by addition of extra latency in the design to achieve timing 127 closure. We have been able to achieve maximal throughput using an implementation based on the PCSDF model as explained in Section 6.4.1. 6.4.3 Platform-specific Analysis using DIF It is common to go back and forth between a high-level prototype and a correspond- ing platform-specific implementation while designing an embedded DSP system. Such alternation in design phases is common, for example, when one is developing a platform- specific library or tool flow. In support of such a design methodology, it is desirable for a high level design tool to support platform-specific analysis. This can be achieved by annotating the high-level application specification with platform-specific implementa- tion parameters, which are derived through device data sheets, experimentation or some combination of both. DIF supports specifying user-defined actor parameters. We use this feature in DIF to annotate actors with two relevant implementation parameters ? the latency constraint, and number of embedded multipliers. This allows estimating results based on the DIF prototype itself instead of determining them from the constructed design, which is gen- erally time consuming. We have verified the accuracy of metrics estimated by our DIF model compared with actual hardware synthesis results, as shown in Table 6.4.2. Developers of tool flows and DSP libraries can profile their library blocks to deter- mine a wide variety of platform-specific implementation parameters. DIF can use such information to estimate implementation parameters at a high level of abstraction, and ear- lier in the design cycle to help efficiently prune segments of the design space. Support 128 User interface Controller TDD interface (Bash, C, Python) Hardware (e.g. iBOB) telnet Figure 6.8: TDD System overview. for estimation of various platform-specific resources for different platforms is beyond the scope of this thesis. It is, however, an important direction toward developing alternative model based design flows and open access tool flows for astronomical DSP solutions. 6.4.4 Software Interface for the Tunable Digital Downconveter As mentioned earlier, parameterization associated with the TDD blocks is handled through the use of software registers. We have developed a TDD interface that computes various programmable TDD parameters, and sets the corresponding software registers by communicating with the IBOB board using the telnet utility depending upon the narrow- band mode chosen by the user. Fig. 6.8 shows an overview of the system organization. A user can specify a valid narrow-band mode through a user interface. This information is then provided to the un- derlying TDD interface by a controller. The TDD interface uses this information (band- width, and center frequency of the narrow-band), along with the decimation factor (either 129 derived from the input bandwidth or set explicitly), and configures the corresponding TDD block. It then communicates this TDD configuration with the hardware. The TDD interface is implemented as a collection of utilities in the form of C programs, Bash scripts, and Python scripts. One important aspect in our development of this TDD interface has been our em- phasis on unit testing. We have used the unit testing features in the DICE framework (see Section 2.5) to develop an extensive unit test suite, which can be applied for rigorous val- idation of system functionality, and can be retargeted efficiently across different levels of abstraction (e.g., simulation versus implementation) and different design languages (e.g., C, Verilog and VHDL). 6.5 Exploring Implementation Trade-off with TDD and FDD Blocks One of the motivations for the work presented in this chapter has been to develop library blocks needed for a TDD using Xilinx LogicCore and CASPER library blocks. The current CASPER DSP library provides a decimator that supports decimation factors that are powers of 2. The decimation factor as well as the filter coefficients of the FIR filter are not tunable after the hardware code is generated. Our design provides flexibil- ity with not only the decimation factor but also the filter coefficients through the use of software registers, as explained earlier. The FDD designs, though not tunable, have lower hardware cost in terms of device utilization. Table 6.5 provides a summary of some of the hardware utilization parameters for the FDD designs. These designs have also been im- plemented on a CASPER IBOB. Note that the decimation factor of 10 has been achieved 130 Table 6.3: Implementation summary for FDD designs. Parameter Design 1 Design 2 Design 3 Design 4 Mixer No No Yes Yes Input bandwidth (MHz) 800 800 800 800 Decimation factor 8 10 8 10 Bw (MHz) 100 80 100 80 Cf (MHz) 50 40 400 400 Latency (ns) 35 440 50 455 FPGA slices 4175 (17%) 6142 (26%) 5690 (24%) 6439 (27%) (Out of 23616) 4 input LUTs 5153 (10%) 5216 (11%) 5984 (12%) 6003 (12%) (Out of 47232) Block RAMs 41 (17%) 41 (17%) 49 (21%) 49 (21%) (Out of 232) 18? 18 Multipliers 8 (3%) 8 (3%) 32 (13%) 32 (13%) (Out of 232) by first interpolating the input by a factor of 80, and then decimating it by a factor of 8. Comparison between the results in this table and those in Table 6.4.2 clearly highlights the trade-off between design flexibility and hardware cost. Using the model-based approach presented in Section 6.4, the designer can effectively explore this trade-off based on the given design requirements. 6.5.1 TDD and FDD for Multistage Downconversion Though our TDD design supports limited decimation factors (integer factors be- tween 5 and 12), its usage is not limited to these factors. It can be readily scaled and applied to achieve other decimation factors by cascading multiple TDF blocks. Fig. 6.9 shows some of the possible input/output sampling rate relations that can be achieved by such use of cascaded TDF blocks. Design 1 in Table 6.5.1 employs cascaded TDF 131 Input Sample Rate 1600 MS/s Decimation by 2 Decimation by 11 Output Sample Rate 72.72 MS/s Input Sample Rate 1600 MS/s Decimation by 8 Decimation by 5 Output Sample Rate 40 MS/s Input Sample Rate 1600 MS/s Decimation by 10 Decimation by 5 Output Sample Rate 32 MS/s Figure 6.9: Two-stage digital downconversion. blocks, while design 2 in Table 6.5.1 employs cascaded fixed-configuration decimation filter (FDF) blocks. Both of these designs have been developed to demonstrate multistage downconversion for a baseband signal and neither of them employs a mixer. It is possible to extend these designs to include a mixer to allow all possible narrow band outputs and not just the baseband output. For all of the designs in this table that use one or more TDF blocks, the TDF block employs dedicated embedded multipliers. In this light, we further explore the trade-off between the low hardware cost of FDD designs and flexibility offered by TDD designs by examining a design consisting of an FDF block followed by a TDF block (designs 3 and 4 in Table 6.5.1). These designs provide limited tunable decimation factors compared to design 1, but also have lower hardware cost in terms of device utilization. 6.6 Summary We have proposed a dataflow-based approach for prototyping radio astronomy DSP systems. We have used a dataflow-based high-level application model that provides a 132 Table 6.4: Implementation summary for designs employing two-stage downconversion using cascaded FDF or TDF blocks. Bw, if tunable, can be tuned to frequencies consistent with decimation factors supported by the TDD block. Parameter Design 1 Design 2 Design 3 Design 4 Mixer No No No No Input bandwidth (MHz) 800 800 800 800 No. of FDD blocks 0 2 1 1 No. of TDD blocks 2 0 1 1 FDD Decimation factor(s) ? 8, 10 8 10 Bw (MHz) Tunable 10 Tunable Tunable (? 800) (? 100) (? 80) Latency (ns) 170 475 120 505 FPGA slices 17141 (72%) 5765 (24%) 11073 (46%) 12641 (53%) (Out of 23616) 4 input LUTs 19718 (41%) 5506 (11%) 12245 (25%) 12310 (26%) (Out of 47232) Block RAMs 41 (17%) 41 (17%) 41 (17%) 41 (17%) (Out of 232) 18? 18 Multipliers 64 (27%) 16 (6%) 40 (17%) 40 (17%) (Out of 232) platform-independent specification, and assistance in functional verification and impor- tant resource estimation tasks. This can prove effective in reducing the development cycle and faster deployment of DSP systems across various target platforms. We have employed this approach to methodically develop a TDD based DSP backend design. Our TDD im- plementation is targeted to the CASPER FPGA board, called IBOB, and supports tuning narrow band modes without the need for regenerating hardware code. We have also ex- plored the trade-off between the low hardware cost for FDD designs and the flexibility offered by TDD designs. This trade-off has also been highlighted in the context of de- signs employing a two-stage downconversion scheme. A designer can explore this design space to best meet the application requirements. 133 Chapter 7 Summary and Conclusions In this thesis, we have addressed various aspects of design flows employed by model based design tools for embedded systems in the context of rapid prototyping of high performance signal processing applications. We summarize these contributions along with our conclusions as follows: 1. We have introduced the concept of topological patterns, which can be used in dataflow modeling languages to identify and concisely iterate across arbitrary struc- tures in a dataflow application graph. We have shown how the types of flowgraph substructures that are pervasive in the digital signal processing (DSP) application domain can be effectively represented in terms of topological patterns, and thereby used to generate compact, scalable application representations. We have also shown how an underlying design tool can exploit a high-level application specification consisting of topological patterns in various aspects of the design flow. In particular, we have demonstrated the efficacy of topological patterns in dataflow graph anal- ysis, concise and scalable representation of homogeneous synchronous dataflow (HSDF) graphs, and exploring implementation-specific trade-offs. We have also demonstrated the use of topological patterns in graph analysis and extraction of implementation-specific features. We have applied the concept of topological pat- terns to represent schedules for application graphs. Such representations are use- 134 ful, for example, when porting schedules generated using one design tool to other platform-specific tools or design languages. We have demonstrated the utility of experimentation with pattern-specific scheduling transformations, and how topo- logical patterns facilitate such experimentation. 2. We have formulated the core functional dataflow (CFDF) model of computation, which can be used to model a wide variety of deterministic dynamic dataflow be- haviors, and used to capture various well known forms of dataflow in a single, unified formulation. We have also presented features of the CFDF model and tools based on it, such as support for 1) heterogeneous dataflow behaviors, 2) intuitive functional specification, 3) functional simulation that allows designers to model and verify interactions among various forms of dataflow, 4) portability from ex- isting dataflow models, 5) minimally-restricted specification of actor functionality, and 6) efficient static, quasi-static, and dynamic scheduling techniques. With the CFDF modeling approach integrated into dataflow interchange format (DIF), we have demonstrated the use of CFDF concretely on various applications. Such an approach has allowed us to functionally simulate designs from early stages of de- sign, and then focus on experimenting with schedules and dataflow transformations to improve performance. 3. We have presented a new scheduling technique for dynamic dataflow applications. This technique leverages the CFDF model, and operates by decomposing dynamic dataflow graphs into sets of dynamically interacting static dataflow graphs. We have demonstrated this scheduling technique on mixed-model applications with existing 135 schedulers, which has given a positive indication of the utility of the approach for software implementations of such dynamic dataflow applications. By identifying static groups of ?modes? inside actors, we have exposed more of the static nature of applications, allowing traditional scheduling techniques to improve on memory requirements by up to 37%. We have further used CFDF semantics to model a class of signal flow topologies that is important for modern communication systems. Our approach identifies the underlying static components in the application, systematically integrates well- established compile-time scheduling techniques for synchronous dataflow (SDF) graphs with more flexible CFDF semantics, and uses combined CFDF/SDF anal- ysis to generate parameterized looped schedules (PLSs) that have significantly re- duced run-time overhead, guaranteed memory bounds, and reduced memory re- quirements. Our approach therefore provides robust simulation of dynamic com- munication applications without major limitations on compile-time predictability and efficient scheduling. 4. We have demonstrated the use of a dataflow-based approach for prototyping radio astronomy DSP systems. We have used a dataflow-based high-level application specification format that provides a platform-independent specification, and assis- tance in functional verification and useful kinds of resource estimation. Such an approach is useful in improving designer productivity and facilitating faster de- ployment of DSP systems across various target platforms. We have employed this approach to methodically develop a tunable digital downconverter (TDD) based 136 DSP backend design. Our TDD implementation, which is targeted toward an FPGA board, called the interconnect break-out board (IBOB), from the collaboration for astronomical signal processing and electronics research (CASPER), supports tun- ing narrow band modes without the need for regenerating hardware code. We have also explored the trade-off between low hardware cost for fixed-configuration digi- tal downconverter (FDD) designs and the flexibility offered by TDD designs. This trade-off has also been highlighted in the context of designs employing multistage downconversion schemes. Using our approach for trade-off exploration between hardware cost and flexibility, a designer can efficiently explore the associated de- sign space to help optimize an implementation in terms of the given application requirements. 5. This thesis has contributed significantly to the development and release of the latest version of a graph package, called MoCGraph, that is oriented toward providing fundamental graphical data structures and implementations of graph algorithms to support analysis and manipulation of models of computation. Our contributions to this graph package include support for tree data structures, and generalized sched- ule trees (GSTs), in particular. Our extensions to the MoCGraph package have sup- ported important features for the CFDF model, and for new functional simulation capabilities in the DIF package. The work presented in this thesis, though demonstrated using specific design tools, is not restricted to those tools, and hence, can be applied to a wide variety of dataflow- based environments. Also, the applications presented, though instrumental in driving this 137 research, are demonstrative, and the prototyping methods can be extended readily to other relevant DSP applications. 138 Chapter 8 Future Work We believe that pattern-specific scheduling techniques can provide improved schedul- ing capabilities for dataflow-based design environments that employ topological patterns. In our work on topological patterns in this thesis, we have emphasized the same by pro- viding a facility that allows a user to experiment with various schedules in a systematic manner. The development of pattern-specific scheduling heuristics and support of those in design tools has been, however, beyond the scope of this thesis. Such exploration along with automating the application and integration of topological patterns are useful directions for further investigation. For scheduling general core functional dataflow (CFDF) graphs, it would be useful to look into an approach that involves identifying static and dynamic components of het- erogeneous dataflow graphs, generating two-actor schedules for clusters of two CFDF ac- tors in the graphs, and merging those into nested schedules that are optimized for selected performance cost metrics based on probabilistic analysis of CFDF graphs. The motivation here is to build on the pairwise grouping methodology for dataflow graph clustering [12], which is a useful and flexible scheduling framework for synchronous dataflow graphs, and develop efficient simulation and synthesis capabilities for dynamic dataflow graphs. In this regard, it would be also useful to explore general CFDF topologies as targets for parameterized looped schedule (PLS) construction, and apply our methods to optimized 139 hardware and software synthesis. Expanding on our work to integrate tunable digital downconverter (TDD) design with ongoing development of spectrometer designs at the National Radio Astronomy Ob- servatory (NRAO) on the latest hardware from the Collaboration for Astronomical Signal Processing and Electronics Research (CASPER) group is a natural extension of the work presented in this thesis. There is growing interest in the radio astronomy community to have open-access and portable astronomical signal processing solutions. Currently, this is constrained by proprietary commercial tools targeted for specific platforms. We have also relied on these tools, mainly for hardware synthesis and code generation, in our work. In this context, it is of interest to have high-level application description languages with semantic foundations in models of computation and the corresponding design tools for efficient specification, simulation, functional verification, and synthesis. Develop- ing model based platform-specific libraries, and devising techniques for automatic code generation from high-level representations, such as those in dataflow interchange format (DIF), specifically for the radio astronomy domain is an important direction for future research. There have been some other directions that we have explored while working on the core aspects of this thesis. We have contributed to the development of a dataflow- based design tool, called targeted DIF (TDIF), which extends the capabilities of DIF with dynamic dataflow software synthesis, cross-platform actor design support, and dataflow- integrated features for instrumenting and tuning implementations [74]. The dataflow- based approach used in TDIF is unique in that it leverages the power of dynamic dataflow models, and provides integration of automatic code generation for programming inter- 140 faces and low level customizations for implementations targeted to heterogeneous plat- forms. Also, a new model based schedule representation called the dataflow schedule graph (DSG) representation has been recently introduced [84]. Integrating prototyping features and techniques presented in this thesis with these new tools and models is also an important direction for future work. 141 Bibliography [1] 3GPP TS 36.211 V8.7.0 (2009-05): Physical channels and modulation, 2009. [2] S. Alliot and E. Deprettere. Architecture exploration of a large scale system. In Pro- ceedings of the IEEE International Workshop on Rapid System Prototyping, pages 217?224, Geneva, Switzerland, June 2004. [3] J. G. Andrews, A. Ghosh, and R. Muhamed. Fundamentals of WiMAX: understand- ing broadband wireless networking. Prentice Hall, 2007. [4] J.W.M. Baars, L.R. D?Addario, and A.R. Thompson. Radio astronomy in the early twenty-first century. Proceedings of the IEEE, 97(8):1377 ?1381, August 2009. [5] G. Berry. Bottom-up computation of recursive programs. ITA, 10(1):47?82, 1976. [6] B. Bhattacharya and S. S. Bhattacharyya. Parameterized dataflow modeling of DSP systems. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 1948?1951, Istanbul, Turkey, June 2000. [7] B. Bhattacharya and S. S. Bhattacharyya. Parameterized dataflow modeling for DSP systems. IEEE Transactions on Signal Processing, 49(10):2408?2421, October 2001. [8] S. S. Bhattacharyya, E. Deprettere, R. Leupers, and J. Takala, editors. Handbook of Signal Processing Systems. Springer, 2010. [9] S. S. Bhattacharyya, S. Kedilaya, W. Plishker, N. Sane, C. Shen, and G. Zaki. The DSPCAD integrative command line environment: Introduction to DICE version 1. Technical Report UMIACS-TR-2009-13, Institute for Advanced Computer Studies, University of Maryland at College Park, August 2009. [10] S. S. Bhattacharyya, R. Leupers, and P. Marwedel. Software synthesis and code generation for DSP. IEEE Transactions on Circuits and Systems ? II: Analog and Digital Signal Processing, 47(9):849?875, September 2000. [11] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, 1996. [12] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. APGAN and RPMC: Comple- mentary heuristics for translating DSP block diagrams into efficient software im- plementations. Journal of Design Automation for Embedded Systems, 2(1):33?60, January 1997. [13] G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete. Cyclo-static dataflow. IEEE Transactions on Signal Processing, 44(2):397?408, February 1996. 142 [14] G. Blake, R.G. Dreslinski, and T. Mudge. A survey of multicore processors. IEEE Signal Processing Magazine, 26(6):26 ?37, November 2009. [15] C. Brooks, E. A. Lee, X. Liu, S. Neuendorffer, Y. Zhao, and H. Zheng. Hetero- geneous concurrent modeling and design in Java (Volume 3: Ptolemy II domains). Technical Report UCB/EECS-2008-37, EECS Department, University of Califor- nia, Berkeley, April 2008. [16] J. T. Buck. Scheduling dynamic dataflow graphs with bounded memory using the token flow model. PhD thesis, EECS Department, University of California, Berkeley, 1993. [17] C. Chang, J. Wawrzynek, and R. W. Brodersen. BEE2: a high-end reconfigurable computing system. Design & Test of Computers, IEEE, 22(2):114?125, March-April 2005. [18] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algo- rithms. MIT Press and McGraw-Hill, second edition, 2001. [19] I. Corretjer, C. Hsu, and S. S. Bhattacharyya. Configuration and representation of large-scale dataflow graphs using the dataflow interchange format. In Proceedings of the IEEE Workshop on Signal Processing Systems, pages 10?15, Banff, Canada, October 2006. [20] L. R. D?Addario and C. Timoc. Digital signal processing for the SKA: A strawman design. Technical Report SKA Memo No. 25, Square Kilometre Array, August 2002. [21] J. Dalcolmo, R. Lauwereins, and M. Ade. Code generation of data dominated DSP applications for FPGA targets. In Proceedings of the International Workshop on Rapid System Prototyping, pages 162?167, June 1998. [22] D.R. DeBoer, R.G. Gough, J.D. Bunton, T.J. Cornwell, R.J. Beresford, S. John- ston, I.J. Feain, A.E. Schinckel, C.A. Jackson, M.J. Kesteven, A. Chippendale, G.A. Hampson, J.D. O?Sullivan, S.G. Hay, C.E. Jacka, T.W. Sweetnam, M.C. Storey, L. Ball, and B.J. Boyle. Australian SKA pathfinder: A high-dynamic range wide- field of view survey telescope. Proceedings of the IEEE, 97(8):1507 ?1521, 2009. [23] P. E. Dewdney, P.J. Hall, R. T. Schilizzi, and T.J.L.W. Lazio. The square kilometre array. Proceedings of the IEEE, 97(8):1482 ?1496, August 2009. [24] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, Inc., second edition, 2000. [25] J. Eker, J. Janneck, E. A. Lee, J. Liu, X. Liu, J. Ludvig, S. Neuendorffer, S.R. Sachs, and Y. Xiong. Taming heterogeneity - the Ptolemy approach. Proceedings of the IEEE, Special Issue on Modeling and Design of Embedded Software, 91(1):127? 144, January 2003. 143 [26] J. Eker and J. W. Janneck. CAL language report, language version 1.0 ? document edition 1. Technical Report UCB/ERL M03/48, Electronics Research Laboratory, University of California at Berkeley, December 2003. [27] S.W. Ellingson, T.E. Clarke, A. Cohen, J. Craig, N.E. Kassim, Y. Pihlstrom, L.J. Rickard, and G.B. Taylor. The Long Wavelength Array. Proceedings of the IEEE, 97(8):1421 ?1430, August 2009. [28] E. Gamma, R. Helm, R. Johnson, and J. Vissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995. [29] E. R. Gansner, E. Koutsofios, S. C. North, and K.-P. Vo. A technique for drawing di- rected graphs. IEEE Transactions on Software Engineering, 19(3):214 ?230, March 1993. [30] M. R. Garey and D. S. Johnson. Computers and intractability: A guide to the theory of NP-Completeness. W. H. Freeman and Company, 1979. [31] C. Harris, K. Haines, and L. Staveley-Smith. GPU accelerated radio astronomy signal convolution. Experimental Astronomy, 22:129?141, 2008. 10.1007/s10686- 008-9114-9. [32] C. Haubelt, J. Falk, J. Keinert, T. Schlichter, M. Streubhr, A. Deyhle, A. Hadert, and J. Teich. A SystemC-based design methodology for digital signal processing sys- tems. EURASIP Journal on Embedded Systems, 2007:Article ID 47580, 22 pages, 2007. [33] S. Haykin. Adaptive filter theory. Prentice-Hall, Inc., 1996. [34] Y. Hemaraj, M. Sen, R. Shekhar, and S. S. Bhattacharyya. Model-based mapping of image registration applications onto configurable hardware. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, pages 1453?1457, Pacific Grove, California, October 2006. Invited paper. [35] C Hsu, I. Corretjer, M. Ko., W. Plishker, and S. S. Bhattacharyya. Dataflow inter- change format: Language reference for DIF language version 1.0, user?s guide for DIF package version 1.0. Technical Report UMIACS-TR-2007-32, Institute for Ad- vanced Computer Studies, University of Maryland at College Park, June 2007. Also Computer Science Technical Report CS-TR-4871. [36] C. Hsu, M. Ko, and S. S. Bhattacharyya. Software synthesis from the dataflow interchange format. In Proceedings of the International Workshop on Software and Compilers for Embedded Systems, pages 37?49, Dallas, Texas, September 2005. [37] C. Hsu, S. Ramasubbu, M. Ko, J. L. Pino, and S. S. Bhattacharyya. Efficient sim- ulation of critical synchronous dataflow graphs. In Proceedings of the Design Au- tomation Conference, pages 893?898, San Francisco, California, July 2006. 144 [38] D. A. Huffman. A method for the construction of minimum-redundancy codes. In Proceedings of the IRE, pages 1098?1101, September 1952. [39] G. Johnson. LabVIEW Graphical Programming: Practical Applications in Instru- mentation and Control. McGraw-Hill School Education Group, 1997. [40] J.L. Jonas. MeerKAT ? the South African array with composite dishes and wide- band single pixel feeds. Proceedings of the IEEE, 97(8):1522 ?1530, August 2009. [41] B. Kienhuis and E. F. Deprettere. Modeling stream-based applications using the SBF model of computation. In Proceedings of the IEEE Workshop on Signal Processing Systems, pages 385?394, September 2001. [42] M. Ko, C. Zissulescu, S. Puthenpurayil, S. S. Bhattacharyya, B. Kienhuis, and E. Deprettere. Parameterized looped schedules for compact representation of execu- tion sequences in DSP hardware and software implementation. IEEE Transactions on Signal Processing, 55(6):3126?3138, June 2007. [43] S. Y. Kung. VLSI Array processors. Prentice Hall, 1988. [44] S. Kwon, H. Jung, and S. Ha. H.264 decoder algorithm specification and simulation in Simulink and PeaCE. In Proceedings of the International SoC Design Confer- ence, pages 9?12, October 2004. [45] E. A. Lee. Recurrences, iteration, and conditionals in statically scheduled block diagram languages. In Proceedings of the International Workshop on VLSI Signal Processing, 1988. [46] E. A. Lee and D. G. Messerschmitt. Static scheduling of synchronous dataflow pro- grams for digital signal processing. IEEE Transactions on Computers, C-36(1):24? 35, Jan 1987. [47] E. A. Lee and D. G. Messerschmitt. Digital Communication. Kluwer Academic Publishers, 1988. [48] E. A. Lee and T. M. Parks. Dataflow process networks. Proceedings of the IEEE, pages 773?799, May 1995. [49] J. Lemaitre. Model-based specification and design of large-scale embedded signal processing systems. PhD thesis, Leiden University, The Netherlands, 2008. [50] J. Lemaitre and E. Deprettere. FPGA implementation of a prototype hierarchical control network for Large-Scale signal processing applications. In Proceedings of the International Euro-Par Conference, Lecture Notes in Computer Science 4128, pages 1192?1203. Springer, Dresden, Germany, August 2006. [51] The MathWorks Inc. Using Simulink, Version 3, January 1999. [52] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing. Prentice- Hall, Inc., 1989. 145 [53] T. M. Parks. Bounded Scheduling of Process Networks. PhD thesis, EECS Depart- ment, University of California, Berkeley, 1995. [54] T. M. Parks, J. L. Pino, and E. A. Lee. A comparison of synchronous and cyclo-static dataflow. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, volume 1, pages 204?210 vol.1, Pacific Grove, California, October 1995. [55] A. Parsons, D. Backer, C. Chang, D. Chapman, H. Chen, P. Droz, C. de Jesus, D. MacMahon, A. Siemion, D. Werthimer, and M. Wright. A new approach to radio astronomy signal processing. In Proceedings of the General Assembly of the International Union of Radio Science, October 2005. [56] A. Parsons, D. Backer, Chen Chang, D. Chapman, H. Chen, P. Crescini, C. de Je- sus, C. Dick, P. Droz, D. MacMahon, K. Meder, J. Mock, V. Nagpal, B. Nikolic, A. Parsa, B. Richards, A. Siemion, J. Wawrzynek, D. Werthimer, and M. Wright. PetaOp/Second FPGA signal processing for SETI and radio astronomy. In Proceed- ings of the IEEE Asilomar Conference on Signals, Systems, and Computers, pages 2031 ?2035, Pacific Grove, California, November 2006. Invited paper. [57] A. Parsons, D. Chapman, and H. Chen. Xilinx system generator for DSP in the CASPER group. Technical Report CASPER Memo 11, Center for Astronomy Sig- nal Processing and Electronic Research, University of California, Berkeley, January 2007. [58] J. L. Pino, S. S. Bhattacharyya, and E. A. Lee. A hierarchical multiprocessor scheduling system for DSP applications. In Proceedings of the IEEE Asilomar Con- ference on Signals, Systems, and Computers, pages 122?126 vol.1, Pacific Grove, California, November 1995. [59] J. L. Pino, S. Ha, E. A. Lee, and J. T. Buck. Software synthesis for DSP using Ptolemy. Journal of VLSI Signal Processing, 9(1), January 1995. [60] J. L. Pino and K. Kalbasi. Cosimulating synchronous DSP applications with analog RF circuits. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, November 1998. [61] W. Plishker. Automated Mapping of Domain Specific Languages to Application Specific Multiprocessors. PhD thesis, University of California, Berkeley, January 2006. [62] W. Plishker, N. Sane, and S. S. Bhattacharyya. A generalized scheduling approach for dynamic dataflow applications. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pages 111?116, Nice, France, April 2009. [63] W. Plishker, N. Sane, and S. S. Bhattacharyya. Mode grouping for more effective generalized scheduling of dynamic dataflow applications. In Proceedings of the Design Automation Conference, pages 923?926, San Francisco, July 2009. 146 [64] W. Plishker, N. Sane, M. Kiemb, K. Anand, and S. S. Bhattacharyya. Functional DIF for rapid prototyping. In Proceedings of the International Symposium on Rapid System Prototyping, pages 17?23, Monterey, California, June 2008. [65] W. Plishker, N. Sane, M. Kiemb, and S. S. Bhattacharyya. Heterogeneous design in functional DIF. Transactions on High-Performance Embedded Architectures and Compilers. Online version available from http://www.hipeac.net/node/3030, print version to appear. [66] W. Plishker, N. Sane, M. Kiemb, and S. S. Bhattacharyya. Heterogeneous design in functional DIF. In Proceedings of the International Workshop on Systems, Architec- tures, Modeling, and Simulation, pages 157?166, Samos, Greece, July 2008. [67] W. Plishker, C. Shen, S. S. Bhattacharyya, G. Zaki, S. Kedilaya, N. Sane, K. Sudus- inghe, T. Gregerson, J. Liu, and M. Schulte. Model-based DSP implementation on FPGAs. In Proceedings of the International Symposium on Rapid System Proto- typing, Fairfax, Virginia, June 2010. Invited paper, DOI 10.1109/RSP 2010.SS4, 7 pages. [68] R.M. Prestage, K.T. Constantikes, T.R. Hunter, L.J. King, R.J. Lacasse, F.J. Lock- man, and R.D. Norrod. The Green Bank telescope. Proceedings of the IEEE, 97(8):1382 ?1390, August 2009. [69] S. Saha, S. Puthenpurayil, and S. S. Bhattacharyya. Dataflow transformations in high-level DSP system design. In Proceedings of the International Symposium on System-on-Chip, pages 131?136, Tampere, Finland, November 2006. Invited paper. [70] N. Sane, C.-J. Hsu, J. L. Pino, and S. S. Bhattacharyya. Simulating dynamic com- munication systems using the core functional dataflow model. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 1538?1541, Dallas, Texas, March 2010. [71] N. Sane, H. Kee, G. Seetharaman, and S. S. Bhattacharyya. Scalable representation of dataflow graph structures using topological patterns. In Proceedings of the IEEE Workshop on Signal Processing Systems, pages 13?18, San Francisco Bay Area, USA, October 2010. [72] N. Sane, H. Kee, G. Seetharaman, and S. S. Bhattacharyya. Topological patterns for scalable representation and analysis of dataflow graphs. Journal of Signal Process- ing Systems - Special Issue on SIPS 2010, December 2011. To appear. [73] C. Shen, W. Plishker, S. S. Bhattacharyya, and N. Goldsman. An energy-driven design methodology for distributing DSP applications across wireless sensor net- works. In Proceedings of the IEEE Real-Time Systems Symposium, pages 214?223, Tucson, Arizona, December 2007. 147 [74] C. Shen, H. Wu, N. Sane, W. Plishker, and S. S. Bhattacharyya. A design tool for efficient mapping of multimedia applications onto heterogeneous platforms. In Pro- ceedings of the IEEE International Conference on Multimedia and Expo, Barcelona, Spain, July 2011. [75] J. J. Shynk. Frequency-domain and multirate adaptive filtering. IEEE Signal Pro- cessing Magazine, 9(1):14?37, January 1992. [76] S. Sriram and S. S. Bhattacharyya. Embedded Multiprocessors: Scheduling and Synchronization. CRC Press, second edition, 2009. [77] T. Stefanov, C. Zissulescu, A. Turjan, B. Kienhuis, and E. Deprettere. System design using Kahn process networks: the Compaan/Laura approach. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, volume 1, pages 340 ? 345 Vol.1, February 2004. [78] W. Sun, M. J. Wirthlin, and S. Neuendorffer. FPGA pipeline synthesis design exploration using module selection and resource sharing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2):254?265, 2007. [79] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for stream- ing applications. In International Conference on Compiler Construction, Grenoble, France, 2002. [80] P. P. Vaidyanathan. Multirate digital filters, filter banks, polyphase networks, and applications: a tutorial. Proceedings of the IEEE, 78(1):56 ?93, January 1990. [81] R.V. van Nieuwpoort and J.W. Romein. Building correlators with many-core hard- ware. IEEE Signal Processing Magazine, 27(2):108 ?117, March 2010. [82] I. M. Verbauwhede, C. J. Scheers, and J. M. Rabaey. Specification and support for multidimensional DSP in the SILAGE language. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages II/473 ?II/476, April 1994. [83] G. K. Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii ?xxxiv, February 1992. [84] H. Wu, C. Shen, N. Sane, W. Plishker, and S. S. Bhattacharyya. A model-based schedule representation for heterogeneous mapping of dataflow graphs. In Pro- ceedings of the International Heterogeneity in Computing Workshop, pages 66?77, Anchorage, Alaska, May 2011. [85] Xilinx Inc. System Generator for DSP - User Guide, Release 10.1.1, April 2008. 148