ABSTRACT

Title of Dissertation: DEVELOPING MACHINE LEARNING TECHNIQUES FOR NETWORK CONNECTIVITY INFERENCE FROM TIME-SERIES DATA

Amitava Banerjee, Doctor of Philosophy, 2022

Dissertation Directed by: Professor Edward Ott, Department of Physics, Department of Electrical and Computer Engineering, Institute for Research in Electronics and Applied Physics

Inference of the connectivity structure of a network from the observed dynamics of the states of its nodes is a key issue in science, with wide-ranging applications such as determination of the synapses in nervous systems, mapping of interactions between genes and proteins in biochemical networks, distinguishing ecological relationships between different species in their habitats, etc. In this thesis, we show that certain machine learning models, trained for the forecasting of experimental and synthetic time-series data from complex systems, can automatically learn the causal networks underlying such complex systems. Based on this observation, we develop new machine learning techniques for inference of the causal interaction network connectivity structures underlying large, networked, noisy, complex dynamical systems, solely from the time-series of their nodal states. In particular, our approach is to first train a type of machine learning architecture, known as the ‘reservoir computer’, to mimic the measured dynamics of an unknown network. We then use the trained reservoir computer as an in silico computational model of the unknown network to estimate how small changes in nodal states propagate in time across that network. Since small perturbations of network nodal states are expected to spread along the links of the network, the estimated propagation of nodal state perturbations reveals the connections of the unknown network. Our technique is noninvasive, but is motivated by the widely used invasive network inference method whereby the temporal propagation of active perturbations applied to the network nodes is observed and employed to infer the network links (e.g., tracing the effects of knocking down multiple genes, one at a time, can be used to infer gene regulatory networks). We discuss how we can further apply this methodology to infer causal network structures underlying different time-series datasets and compare the inferred network with the ground truth whenever available. We demonstrate three practical applications of this network inference procedure: (1) inference of network link strengths from time-series data of coupled, noisy Lorenz oscillators, (2) inference of time-delayed feedback couplings in opto-electronic oscillator circuit networks designed in the laboratory, and (3) inference of the synaptic network from publicly available calcium fluorescence time-series data of C. elegans neurons. In all examples, we also explain how experimental factors like noise level, sampling time, and measurement duration systematically affect causal inference from experimental data. The results show that synchronization and strong correlation among the dynamics of different nodal states are, in general, detrimental for causal network inference. Features that break synchrony among the nodal states, e.g., coupling strength, network topology, dynamical noise, and heterogeneity of the parameters of individual nodes, help network inference.
In fact, we show in this thesis that, for parameter regimes where the network nodal states are not synchronized, we can often achieve perfect causal network inference from simulated and experimental time-series data, using machine learning techniques, in a wide variety of physical systems. In cases where effects like observational noise, large sampling time, or small sampling duration hinder such perfect network inference, we show that it is possible to utilize specially-designed surrogate time-series data for assigning statistical confidence to individual inferred network links. Given the general applicability of our machine learning methodology in time-series prediction and network inference, we anticipate that such techniques can be used for better model-building, forecasting, and control of complex systems in nature and in the lab.

DEVELOPING MACHINE LEARNING TECHNIQUES FOR NETWORK CONNECTIVITY INFERENCE FROM TIME-SERIES DATA

by Amitava Banerjee

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2022

Advisory Committee:
Professor Edward Ott, Chair/Advisor
Professor Rajarshi Roy
Professor Michelle Girvan
Professor Yanne K. Chembo
Professor Pratyush Tiwary

© Copyright by Amitava Banerjee 2022

Preface

Causality in complex systems

This thesis is about inference of cause and effect relations from data. In complex systems where there are many interacting agents, such causal interactions form networks that ultimately dictate how the system as a whole behaves over time. Examples of such systems include neurons in our brain affecting the firing patterns of one another, predators and prey in an ecosystem causing growth and decline of each other’s populations, genes in our cells turning each other on and off, and so on. A general theme of the study of networked dynamical systems has been identifying how these individual causal interactions between parts of such complex systems result in the collective dynamics of these systems as a whole (e.g., how neurons in the visual cortex fire together when we see an image). An artist’s impression of how the functioning of a multi-component system could depend on the cause and effect interactions among its parts is depicted in the drawing of an example of a ‘Rube Goldberg machine’ in Fig. 1. Since the functioning of a complex system depends crucially on the causal interactions among its parts, it is important to identify such interactions to construct mathematical models of the system, or for the purpose of predicting or controlling such systems. Regarding this, a natural question to ask is: given only the observed behavior of different parts of a complex system over time, is it possible to construct the network of cause and effect interactions that connects them? For example, looking at how neurons in the visual cortex fire over time, can we reconstruct how they are coupled via synaptic interactions? This brings us to the question of causal inference: connecting a set of observed events or dynamical agents with causal interactions.

Figure 1: An example of a Rube Goldberg machine to light up the electric bulb shown on the right with the boxing glove and spring shown on the left, from the Scout Life website (https://scoutlife.org/hobbies-projects/projects/159359/how-to-make-a-rube-goldberg-machine/).
In some cases, such causal interactions among different agents can be simple, propagating linearly from one agent to another without any feedback, with each agent causally affected by only one other agent and causally affecting only one (different) agent, like falling dominoes. An example is given by the machine in Fig. 1. There, cause and effect interactions flow along the following sequence, from left to right, as the extension of the spring-mounted boxing glove on the left eventually lights up the bulb on the right: boxing glove and spring, bowling ball, bowling pin, left string, birdcage, small falling ball, dominoes, axe, right string, hammer, weighing scale, model of human hand, light bulb. But, in more general cases, if there are many agents in a complex system, the cause and effect interactions among them constitute a directed network. An example is shown in Fig. 2, where the causal predator-prey interactions form a network among the Chesapeake Bay waterbirds. In such cases, identifying causal interactions amounts to inferring the directed network of cause and effect relationships among the agents of a complex system. Moreover, as we shall see later in this thesis, unlike the example network shown in Fig. 2, there could be bi-directional causal effects between different agents, adding a further layer of complication to causal inference. In this thesis, we shall devise and test methods to infer these causal networks in various complex systems.

Figure 2: An example of a directed network of predator-prey interactions, from Wikipedia (https://en.wikipedia.org/wiki/Food_chain).

The ladder of causation

Since inference of causality in complex systems is of utmost importance in understanding their inner workings, it is imperative to think about the different levels at which the task of causal inference can be achieved. A formalization of this is found in ‘The Book of Why’ by Judea Pearl and Dana Mackenzie [1], where the authors classify causal inference into three levels - association (‘seeing’), intervention (‘doing’), and counterfactual (‘imagining’) - forming the ‘ladder of causation’ shown in Fig. 3, in which each level builds on the ones below it. To consider this in more detail, let us use the example of baking butter cookies. Suppose we wish to understand whether adding butter causes the cookie to have a certain texture. How do we go about it?

Figure 3: The ladder of causation showing three levels of abilities of a causal learner, for the example of baking a butter cookie.

To begin with, we need data. Suppose we have baked many butter cookies with different proportions of the baking ingredients and recorded the outcomes each time. The data relates the amount of butter in the cookies to different textures of the cookies, but this relationship is merely an association, or ‘correlation’, in the sense of statistics. In the language of probability, it means being able to calculate only the probability that a cookie has a certain texture, given a certain amount of butter in it (Fig. 3). This does not tell us for sure if the amount of butter is causing an attribute of the texture of the cookie. So, this is the most basic information that we can extract from the situation, if we only have the data available.
At this stage, which is the first rung of the ladder of causation, we can predict the texture of the cookie from the amount of butter in it - but we do not know for sure whether it was the amount of butter that caused this texture. This is because we cannot answer questions like ‘what would the texture be like if the amount of butter was halved?’ simply by looking at the data, since the amounts of other ingredients also vary in the data. So, the change in texture when the amount of butter is changed may not actually be due to the butter but, say, to the differing amounts of flour and sugar. Relationships based on simple associations can thus be misleading. How can we know for sure that the butter is causing a particular aspect of the texture?

The way to deduce whether it was really the amount of butter that caused a particular texture of the cookie is to do skillfully designed controlled experiments - baking many cookies with different amounts of butter, with the amounts of all other ingredients and the baking conditions held fixed. This implies directly interacting with the system, using tools to actively intervene in it. Such controlled experiments enable us to calculate the probability that a cookie has a certain texture if the butter in it is changed by a certain amount (Fig. 3). This is the second rung of the ladder of causation, and it involves doing experiments, like the ones that generated the data of rung one, in a controlled way. As such, it assumes that we have access to the system we wish to model.

The final rung of the ladder is imagining what would have happened if we changed the conditions in a certain baking experiment, but without doing any more experiments. So, suppose we have done some experiments with different amounts of butter as in rung two, and maybe others involving changing other ingredients as well. Then, we ask ‘what would have happened if we changed ingredient X in experiment number Y?’ Unlike in rung two, we cannot simply do the experiment again. So this requires an additional level of sophistication - the intuition of an experienced baker.

What does this imply for causal inference in complex systems? If we observe different parts of a complex system over a period of time, we can record the changes of the measured states of these different parts (e.g., membrane potentials or calcium fluorescence from individual neurons in the brain). The first thing to do with this data is to find correlations between states of different parts of the complex system. While we just argued that such correlations are the most primitive forms of causal modeling and may not necessarily imply causal interactions, we do obtain a wealth of information relating different parts of a complex system, which is useful particularly if we cannot make interventions in the system and hence cannot go up the ladder of causation easily. For example, correlations between activities of different parts of the human brain can be used to calculate functional connectivity, a measure that yields useful information, e.g., in studies on schizophrenia [2] and learning [3]. However, simple associations between the composition of a complex system and its characteristics, without a mechanistic understanding of the underlying processes, have been shown to be dangerous in science. An example is linking a person’s genes to their behavioral traits.
‘Some companies see a market in reading DNA like a fortune-teller reads tea leaves’, remarks a Nature commentary [4], warning us against the consequences of misinterpreting these associations as causal effects.

Indeed, examples of misuse of associations abound in modern machine learning. It is challenging for artificial intelligence (AI) to work by identifying causal relations in the data, going beyond the probabilistic associations that most AI systems are trained on [5]. As a consequence, mainstream AI still struggles to transfer learned skills to new environments, requires data-heavy training, and often produces opaque machines reproducing and reinforcing non-causal associations and biases in training data. ‘Widespread, mysterious, and destructive’ [6] is how Cathy O’Neil labels the algorithms affecting our lives, for example, by performing facial recognition in surveillance and by calculating everything from credit scores and college rankings to the lengths of prison sentences. Opaque AI algorithms learning by identifying non-causal associations have been shown to, e.g., misclassify the gender of a face depending on race [7], produce language models showing stereotypical associations and negative sentiments towards specific groups [8], misclassify objects to be identified for steering angle prediction in self-driving cars [9], and produce false-positive melanoma recognition depending on spurious surgical skin markings in medical images [10]. Thus, a general goal in machine learning is to produce AI algorithms that learn not by identifying statistical associations, but by identifying causal relationships in the training data. In order to do so, we need a way to lift AI algorithms one rung up the ladder of causation. This thesis is about machine learning architectures that are shown to be able to learn by identifying causal interactions in complex systems.

Taking machines up the ladder of causation

What does it require for a causal learner to go one level up the ladder of causation? It is the ability to do specifically designed controlled experiments. In the case of genetics experiments, the approach of ‘perturbation biology’ [11] involves knocking down genes and observing the downstream effects of such knockdowns on other genes to infer causal gene interaction networks. However, while this example allows experimental access to the system under study, this might not be possible in more general cases. For example, in the case of ecological predator-prey relationships, it is not desirable, in general, to have any human interventions in the population dynamics of different species. Considering such cases, trained machine learning algorithms should be able to accurately predict the effects of interventions, if they are to be placed on the second rung of the ladder of causation. Is there a way that this can be achieved?

In the case of complex dynamical systems with many components connected by a network of causal interactions, we show in this thesis that it is possible to have machine learning architectures, trained only on the time-series data from the components of a complex system, that not only fit the training data (i.e., learn to model associations), but also predict the causal interactions among the components. In particular, our trained machines will be shown to accurately answer questions like “would component X of a complex system change at time t′ as a result of a change in component Y at time t, with t′ > t?”
In many cases, a correct answer to these questions would mean inference of the causal network underlying the complex system. In our examples in the following chapters, we shall test the goodness of causal network inference by comparing the inferred causal network to the ground truth causal network, with the latter being known in all of our examples. The results will show that there is hope of lifting today’s mostly associational AI algorithms one step up the ladder of causation, where they would be able to predict the effects of interventions and different controlled experiments on the systems they are trained on.

Dedication

This thesis is dedicated to my partner, Ms. Promita Ghosh, doctoral candidate at the Rosalind and Morris Goodman Cancer Institute, McGill University, for her constant support, advice, and inspiration.

Acknowledgments

I shall start with the Land Acknowledgement Statement. The following is copied from the UMD Office of Diversity and Inclusion website (https://diversity.umd.edu/resources/land-acknowledgement).

Every community owes its existence and strength to the generations before them, around the world, who contributed their hopes, dreams, and energy into making the history that led to this moment. Truth and acknowledgement are critical in building mutual respect and connections across all barriers of heritage and difference. So, we acknowledge the truth that is often buried: We are on the ancestral lands of the Piscataway People, who are the ancestral stewards of this sacred land. It is their historical responsibility to advocate for the four-legged, the winged, those that crawl and those that swim. They remind us that clean air and pristine waterways are essential to all life. This Land Acknowledgement is a vocal reminder for each of us as two-leggeds to ensure our physical environment is in better condition than what we inherited, for the health and prosperity of future generations.

রবীন্দ্রনাথ ঠাকুর

Where my God dwells
do you only see the precious jewels?
For I see the dirt weep there in abundance.
Where my Gurus take the seat
how few good ones they care to greet?
For the naive get a home there,
I am a follower hence.

Rabindranath Tagore (English translation mine)

The graduate school life at UMD Physics taught me that gratefulness is something to be practiced mindfully. Looking back four years into the past, I realize how difficult it is for me to acknowledge everyone who helped me survive, thrive, and learn during my grad school. So, I am beginning by acknowledging my own limitations and apologizing for my ungratefulness to anyone not mentioned here. I plan to host an updated version of this section on my personal website, https://banerjeeamitava.github.io/, under the section called ‘Writings’.

These years have not been easy at all. The only easy part was probably running the computer codes to produce the results that make up the bulk of this thesis. I was never able to go back to my country, India, during my entire grad school, even when my friends died before the COVID vaccine could be distributed. As if the pandemic was not enough, within four years of setting foot in this country, I saw political violence, murderous police officers, free and cheap racialized hatred, discrimination, anti-LGBTQ legislation, malfunctioning courts, mass shootings, and the decay and deliberate dismantling of human rights and dignity.
If that was not enough, we had to see a full-scale war happening in some of my friends’ homeland, and countless other deadly conflicts killing people all over the world. Amongst all of this, I have been able to survive, thanks to some of the most wonderful people I have been privileged to have met, who made me feel loved and blessed.

Since a lot of time during my PhD was spent at home due to the pandemic, I spent a considerable amount of time with my housemates, and they are partly the reason I was able to get through. So I am thankful to my current and former housemates: Abu Saleh Musa Patoary, for cooking N number of dishes with (/for) me and traveling, drinking, and watching Netflix with me, among many other things, including getting the defense champagne; Gautam Nambiar, for his always-positive presence, numerous Malayali treats, getting the defense cake, and for the drive to the COVID vaccine center; Elizabeth Joy Paul, Arthur Lin, and Wonseok Hwang for numerous drives to the grocery stores, helping with moving, and for teaching me about etiquette; and Wrick Sengupta, Sagar Airen, Amit Vikram Anand, and Aritra Das for great conversations, among many other memories that are making me smile now. Speaking of my friends, I must also thank Sourav Das and Sohitri Ghosh for numerous Bengali treats and all the great conversations; Subhayan Sahu and Anshuman Swain for being my constant and robust support; and Naren Manjunath, Prakhar Gupta, and Xiaozhen Fu for numerous dinner invites to their place.

In the summer of 2019 I went for a hike to Old Rag Mountain in the Shenandoah National Park. This difficult hike was the first hike ever in my life, and I realized that it symbolized my PhD journey. I am thankful to Joseph Hart, who literally lifted me up and carried my burden during the Old Rag hike and accompanied me to make sure I did not get stuck. I am also thankful to have been in the presence of Ian Abel, William Dorland and Sarah Penniston-Dorland, Adelle Wright, Arinjoy De, and Rupak Mukherjee on the hikes.

I am grateful to have served in and been served by several DEI and planning committees, and I got to know and learn from some of the best human beings that life has given me: Allison Carter, Mika Chmielewski, Kathleen Hamilton-Campos, Natalia Pankratova, and others from the Women in Physics; Daniel Serrano, Thomas Murphy, Kevin Daniels, Roxanne Defendini, Marc Swisdak, Heidi Myers, Taylor Prendargast, Sai Kanth Dacha, Carlos Rios Ocampo, and Meredith Pettit from the IREAP-ROLE; Milena Crnogorčević, Charlotte Ward, Andrew Guo, Katya Leidig, Arjun Savel, Stuart Vogel, and others from the GRAD-MAP; Madison Anderson, Ananya Sitaram, Landry Horimbere, Edgar Perez, Donna Hammer, and others from the Physics Climate Committee; Jennifer Enriquez, Justine Suegay, Shabab Ahmed Mirza, and others from MICA; Luanjiao Aggie Hu, Fatima Ali, and others from the GSL; Brian Medina from the BISS; Amanda Vu, Jackie Liu, Alythia Vo and many others from the APIDA Heritage Month Planning Committee; and many more. These people have been the best teachers in my grad school. They have convinced me that I am worthy of love and care and also capable of receiving them. I am also thankful to be served by the Physics GRAD-COMM and Mental Health Task Force, the Office of Diversity and Inclusion, and the LGBTQ+ Equity Centre.
I am grateful to Physics, IREAP, and IPST staff and coordinators Josiland Chambers, Jessica Crosby, Pauline Rirksopa, Paulina Alejandro, Judy Gorski, Dorothea Brosius, Catha Stewart, and many others. They have been lifesavers and made me feel at home here. Also thanks to Leslie Miller from the CMNS Office, who made me feel special by writing a feature story on my research. I am also indebted to the cleaning staff, the nurses who gave me COVID vaccines, and the numerous Uber drivers who have delivered food to me and taken me to places. I am particularly grateful to the Uber driver Michael, who was exceptionally kind to drive me through the Six Flags vaccination site when I had no idea beforehand that the site was drive-thru-only.

I am thankful to the people with whom I travelled to the International Physics of Living Systems conference in Montpellier this summer. These include Vishavdeep Vashisht, Spandan Pathak, Nathan Zimmerberg, Elissa Moller, Frank Fazekas, and others and, for leading and planning the trip, Arpita Upadhyaya.

I am grateful to my mentees, Neil Shah, Waley Wang, and Paulson Obiniyi Jr., who have shown me that I am capable of giving useful research and life advice. I am also thankful to all the amazing GRAD-MAP mentees whom I was lucky to have served, including Autumn Jackson Bartholomew, who was very kind to come to my thesis defense and help me with the arrangements.

I am thankful to my research collaborators: Jaideep Pathak, Sarthak Chandra, Juan Restrepo, William Fagan, Alireza Seif, Mohammad Hafezi, Alexey Gorshkov, Ravi Chepuri, and others, for everything I learned from them. I am thankful to my dissertation committee members Michelle Girvan, Pratyush Tiwary, and Yanne Chembo for the great feedback and encouragement. I am grateful to all the people who came to my defense talk, congratulated me on my graduation, and celebrated with me.

I am grateful to my partner Promita Ghosh for being a constant, loving, and supporting figure who was kind enough to be a safe haven whenever my mental health faltered. She proves that brilliant scientists can also be good human beings.

I am extremely lucky to have Rajarshi Roy as my major collaborator and advisor. They say you need a lot of mentors for different things in grad school: one mentor to guide your research, one to help you network, one to help you look for jobs, one to go walk with, one who helps you when you get into bad troubles, one to vent out to during your sad days, one to shed tears to, one who prays for you silently, and so on. I had Raj.

Finally, I am thankful to my doctoral thesis advisor Edward Ott for being so kind as to accept me as his student and providing me with his experienced guidance throughout my PhD journey. Needless to say, the thesis would not have been done had he not abundantly supported me and my work, in every way possible, through everything that happened.

Table of Contents

Preface
Dedication
Acknowledgements
Table of Contents
List of Figures
List of Abbreviations and Symbols

Chapter 1: Introduction
1.1 A network science view of complex dynamical systems
1.2 Network inference techniques and applications
1.2.1 Model-free network inference techniques
1.2.2 Model-based network inference techniques
1.3 Network inference with reservoir computing
1.3.1 Properties of computational models for network inference
1.3.2 Basic structure of a reservoir computer
1.3.3 Reservoir computer models of time-series data and network inference
1.4 Organization of this thesis
1.4.1 RC inference of short-term causal dependence
1.4.2 RC inference of time-delayed causal dependence
1.4.3 Effects of sampling conditions on RC causal inference and statistical analysis

Chapter 2: Inference of short term causal interactions: applications in coupled Lorenz oscillators
2.1 Overview
2.2 Introduction
2.3 Short Term Causal Dependence (STCD)
2.4 Using Reservoir Computer to Determine Short Term Causal Dependence
2.5 Tests of Machine Learning Inference of STCD
2.6 Discussion
2.7 Acknowledgements

Chapter 3: Inference of time-delayed causal interactions: applications in optoelectronic oscillator networks
3.1 Overview
3.2 Introduction
3.3 Reservoir Computing Methodology for Network Inference
3.3.1 The General Delay-coupled Network
3.3.2 Time Series Prediction with a Reservoir Computer
3.3.3 Our Network Inference Procedure
3.4 Opto-electronic Oscillator Networks
3.4.1 Description of the experiment
3.4.2 Mathematical Model and Numerical Simulations of the Opto-Electronic Network
3.5 Results of Link Inference Tests
3.5.1 Performance on Simulated Data - Homogeneous Delays
3.5.2 Performance on Experimental Data
3.5.3 Performance on Simulated Data - Heterogeneous Delays
3.6 Discussion
3.7 Acknowledgements
3.8 Appendices
3.8.1 Determination of the time-delay from cross-correlation
3.8.2 Derivation of the discrete-time equation for simulating the opto-electronic system

Chapter 4: Network inference application for C. elegans neuronal calcium fluorescence dynamics
4.1 Overview
4.2 Introduction
4.3 Physical systems and datasets used
4.3.1 Calcium imaging time-series data from C. elegans
4.3.2 Coupled Lorenz oscillator network simulations
4.4 Link-score distributions for the coupled Lorenz Oscillator Network
4.5 Statistical analysis of candidate link scores for link inference
4.5.1 Brief review of surrogate data generation techniques for network link inference
4.5.2 Method of surrogate time-series generation and statistical significance analysis of inferred connections
4.6 Tests on Example Systems
4.6.1 Application to Lorenz network model
4.6.2 Applications to C. elegans data
4.6.3 Transfer entropy and Granger causality
4.6.4 Another Surrogate Data Generation Technique
4.7 Discussion
4.8 Methods
4.8.1 Reservoir-computer-based connection inference technique
4.8.2 Transfer Entropy
4.8.3 Granger Causality
4.9 Acknowledgements

Chapter 5: Summary

Bibliography

Curriculum vitae

List of Figures

1 Schematic of a Rube Goldberg machine
2 The food web of waterbirds from Chesapeake Bay
3 The ladder of causation
1.1 Schematics of a reservoir computer
1.2 Properties of time-series prediction by reservoir computers
2.1 Schematic of the reservoir computing architecture
2.2 Performance of link inference for noiseless Lorenz oscillator networks
2.3 Effect of dynamical and observational noise on network link inference
2.4 Network link strength inference for arbitrary networks
3.1 Schematics of a reservoir computer for time-delayed prediction
3.2 Illustration of opto-electronic oscillator and coupling scheme
3.3 Examples of experimental and simulated time series from two opto-electronic oscillator networks
3.4 List of possible connected directed 4-node networks
3.5 Synchronization error for simulated time series of 4-node networks
3.6 Network inference performance with varying noise strength
3.7 Network inference performance with varying coupling strength
3.8 Distribution of individual link scores
3.9 Performance of network inference method on experimental data
3.10 Inference of networks with heterogeneous link delays
3.11 Time delay determination by cross-correlation of measured time series
4.1 C. elegans neural connectome inference from calcium fluorescence time-series
4.2 Separable and non-separable network link inference score distributions
4.3 Dependence of network link inference score distributions on sampling limitations
4.4 Score distributions from different link inference methods
4.5 Score distribution and surrogate data for AAFT surrogate
4.6 Schematic of the reservoir computer

List of Abbreviations and Symbols

AAFT Amplitude-Adjusted Fourier Transform
AI Artificial Intelligence
CDSD Causality-Destroyed Surrogate Data
ML Machine Learning
OEO Opto-Electronic Oscillator
RC Reservoir Computer
STCD Short Term Causal Dependence
Win Input-to-Reservoir Coupling Matrix
Wout Reservoir-to-Output Coupling Matrix

Chapter 1: Introduction

In this thesis, we infer causal interactions in complex dynamical systems from machine learning models of their time-series data. This chapter serves as background material for the rest of the thesis. It introduces both the research questions that we shall address in the rest of this thesis and the main framework within which we shall address them. We start by explaining with examples why an understanding of the network of causal interactions underlying a complex system is essential for the modeling and prediction of complex systems. Then we discuss several traditional and contemporary methods developed for this purpose. Finally, we introduce a new machine-learning-based technique for inference of the causal interaction network structure underlying complex dynamical systems, which we shall employ in the next chapters.

1.1 A network science view of complex dynamical systems

The science of agents, objects, or systems connected by graphs - network science - has flourished in the last two decades and provided us with a wealth of useful analytical and computational techniques [12, 13] to describe and study complex systems in nature and society. Examples of networked systems range from neuronal networks [14] in the brains of humans and other organisms, and gene regulation and protein interaction networks inside cells [15], to the internet and telecommunication systems. Network science has provided us with a ‘systems’ viewpoint of natural and social processes that has been instrumental in their understanding and control. A particularly important subfield of network science deals with the collective properties of complex dynamical systems connected via networks. Applications of such studies include epidemic spreading [16], synchronization of dynamics [17], and growth of political polarization [18] on complex networks, to mention a few. As such, networked complex systems are platforms for studying a wide variety of dynamical phenomena arising from the interactions between network topology and the time evolution of agents, acting as the network nodes, connected by the network links.

1.2 Network inference techniques and applications

While studying the collective properties and dynamics of networked agents is interesting in its own right, the inverse problem, i.e., inference of network links from the observed dynamics of the network nodal states, is also an important and challenging problem in itself.
Applications of network inference techniques include determination of the synaptic connectivity in nervous systems [2, 19, 20], mapping of interactions between genes [21] and proteins in biochemical networks [22], distinguishing relationships between species in ecological networks [23, 24], and understanding the causal dependencies between elements of the global climate [25]. In all such cases, a basic network structure among the agents of these complex systems (e.g., individual neurons, genes, proteins, and species) can be found by measuring correlations among the time-dependent states of the agents. Examples include functional connectivity in neuroscience, which measures the correlation of neural activity among different parts of the brain [2, 14], gene co-expression networks that link genes which are statistically expressed together [26], co-occurrence/co-exclusion networks of microbes [27] that map which microbes are statistically present or absent together in the human microbiome, etc. Such networks based on correlation, co-occurrence, or associations among different agents have been useful in, for example, relating neural activity to neurodegenerative diseases [28, 29] and other conditions [2, 14] and relating changes in gut microbial co-occurrence patterns to inflammatory bowel disease [30].

However, despite the usefulness of correlation-based network models of complex systems, in the absence of causal models they are generally unable to predict how the complex systems would respond to external perturbations. For example, it is impossible to predict how knocking down a specific gene would affect the expression of other genes, or how stimulating a specific neuron would affect the firing pattern of other neurons, if the causal, directed interaction networks among the genes and neurons are not known. So, therapeutic interventions in these systems will not be possible without a thorough causal understanding of these systems. Thus, a broad goal of this thesis is to extract the causal network connecting different components of a complex system, solely from the time evolution of such components.

Given the usefulness of causal network inference, a number of methods have been proposed for this purpose. We shall now briefly review some prominent network inference methods. It is helpful to classify them into two types: (1) model-free, i.e., network inference techniques that do not assume any particular mathematical model of the observed dynamics, and (2) model-based, i.e., network inference techniques which assume the dynamics of the system to be governed by a particular class of mathematical models (e.g., a set of linear coupled ordinary differential equations) with parameters which are fitted using the available time-series data. We now describe some examples of the two types of network inference methods. For connectivity inference from neuronal recordings, a review of different network inference methods and their comparisons can be found in Ref. [19].

1.2.1 Model-free network inference techniques

1.2.1.1 Transfer entropy

Transfer entropy is an information-theoretic causal network inference technique which measures the rate of transfer of information from one time-series to another [31]. For a pair of agents (i, j) with dynamical states in a complex system, transfer entropy from i to j is defined by the conditional mutual information between the future states of j and the past states of i, conditioned on the past states of j.
It represents the amount by which the uncertainty in the future dynamics of j is reduced by knowing the past dynamics of i, conditioned on the past dynamics of j. Transfer entropy is usually calculated pairwise, and can be used to construct a weighted, directed network connecting the agents of a complex system based on the directional information flow between them. Several generalizations and variations of transfer entropy also exist [19], to account for experimental limitations in the acquisition of time-series data.

1.2.1.2 Supervised learning of causal interactions

Model-free techniques of estimating causal interaction networks are rare, because the effects of causal interactions between dynamical agents can potentially affect their dynamics in multiple, unknown ways. In cases where there are available examples of dynamics with known ground truth causal connectivity, one can use such examples to train a machine learning algorithm to detect the presence of causal interactions from dynamics [19, 32]. However, such techniques require data-heavy training and the availability of ground truth causal interaction networks, which are often unknown. Furthermore, in many cases, such models lack interpretability if it is unclear what features of a complex system’s dynamics are being used to predict the causal interactions underlying the dynamics.

1.2.2 Model-based network inference techniques

In model-based techniques the causal connectivity is estimated by mathematically modeling the processes that generate the sampled time-series data, and fitting the model parameters to the observed data. Some examples of such models are the following.

1.2.2.1 Granger causality

For a pair of agents (i, j) with dynamical states in a complex system, we say that j does not Granger cause i if and only if i, conditional on its own past dynamics, is independent of the past dynamics of j [33–36]. Conversely, if the past dynamics of j provides additional information about the future dynamics of i, then j Granger causes i. The typical way to test this type of dependency between two time-series involves fitting a vector autoregressive model to the time-series of i, and measuring whether inclusion of the time-series of j in that model makes the fitting error significantly lower (see the sketch below). In such generative time-series models applied to test the predictive powers of different time-series, one can simultaneously consider more than two time-series at once, and can simultaneously include multiple time-delays in the generative model. However, these models are linear and stationary in nature, whereas the original dynamics might be nonlinear and non-stationary [37, 38]. Furthermore, the presence of unmeasured external inputs to the dynamics, noise, and a finite sampling rate are also known to affect Granger causality [37, 38].

1.2.2.2 Bayesian network model

Bayesian networks model the conditional probability distributions of time-series from different agents with a directed, acyclic graph [19, 24, 39], with parameters corresponding to, e.g., causal connectivity strengths. These parameters of the model are optimized to fit the observed data, assuming a particular form of the underlying probability distribution (e.g., Gaussian) of the time-series variables from different agents [39].
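To make the vector-autoregressive test of Sec. 1.2.2.1 concrete, the following is a minimal pairwise sketch in Python (numpy). It is illustrative only: the function name, the fixed lag order p, and the log variance-ratio score are choices made here for brevity, and a real analysis would add lag-order selection and a formal significance test (e.g., an F-test, as implemented in standard statistics packages). The estimators actually used for comparison in this thesis are described in Secs. 4.8.2 and 4.8.3.

```python
import numpy as np

def granger_score(x, y, p=5):
    """Score for 'y Granger-causes x' using p past lags.

    Fits two least-squares autoregressions for x(t): a restricted one on the
    past of x alone, and a full one on the pasts of both x and y.  Returns the
    log ratio of the residual variances; a larger score means the past of y
    lowers the fitting error for x, i.e., evidence that y Granger-causes x.
    """
    T = len(x)
    target = x[p:]                     # x(t) for t = p, ..., T-1

    def lagged(z):
        # Columns are z(t-1), z(t-2), ..., z(t-p), aligned with target.
        return np.column_stack([z[p - k - 1:T - k - 1] for k in range(p)])

    def residuals(design):
        coeffs = np.linalg.lstsq(design, target, rcond=None)[0]
        return target - design @ coeffs

    ones = np.ones((T - p, 1))
    restricted = np.hstack([ones, lagged(x)])            # past of x only
    full = np.hstack([ones, lagged(x), lagged(y)])       # pasts of x and y
    return np.log(np.var(residuals(restricted)) / np.var(residuals(full)))
```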
1.3 Network inference with reservoir computing

1.3.1 Properties of computational models for network inference

In the preceding section, we saw some examples of popular techniques for causal connectivity inference from the time-series of complex systems. While model-free methods like transfer entropy can capture causal connectivity by measuring information flow between different dynamical agents of a complex system, they do not provide any generative model for the sampled time-series data. In this thesis, we wish to extract causal interaction network connections from a model of the observed time-series that can accurately fit and generalize the measured data. Thus, we aim for a mathematical or computational model of the observed time-series from the complex system that has the following three properties.

(i) Accurate fit to the sampled data: The model is sufficiently complex, potentially containing a large number of adjustable parameters, so that it can fit the sampled time-series data accurately.

(ii) Prediction or generalization of the sampled data: The model is able to predict the time-series of the complex system for some time duration into the future, or at least be able to generate time-series data which is statistically similar (e.g., has a similar power spectrum) to the original sampled time-series data.

(iii) Causal inference extraction from observed data: The model should be able to predict the effect of small changes to the states of individual agents on the states of other agents at a later time in the complex system. That is, it should have the capability to extract the causal connectivity structure among the dynamical states of different agents, using only the observed time-series data that it has been fitted on.

In order to faithfully infer causal network connectivity from a model of time-series data, all three of the above properties are essential. Since causal interactions can affect the dynamics of individual agents in many ways, we wish to make sure that our model of the dynamics is sufficiently complex to capture all of them, so we wish our model to fit accurately to the observed time-series data. However, goodness of this fit alone does not guarantee that the model is indeed capturing the ground truth dynamical processes of the actual system, as the model could overfit to the available data. In order to avoid this overfitting, we ensure that the model is indeed able to autonomously predict the future time evolution of the actual system, or generate time-series samples which are statistically similar to the observed time-series of the complex system. Furthermore, to ensure that the model is able to fit and generalize the observed time-series not by identifying simple statistical associations between the time-series of different agents, we further require that the model must be able to predict the effect of interventions in the time-series of the different agents. In this thesis, we shall show that it is possible to obtain a model that satisfies the above three properties, using a specialized machine learning architecture called the ‘reservoir computer’ (RC). In the next section, we describe the structure of an RC in more detail.

1.3.2 Basic structure of a reservoir computer

In its most basic form, the reservoir computer is a machine learning architecture that takes in multiple time-series at its input and produces multiple time-series at its output (Fig. 1.1).
We denote the time-series that has been sampled from the complex system by the time-dependent vector u(t), of dimension Du, with its components denoting the time-series corresponding to the states of individual agents of the system (note that, if the states of the agents are themselves vectors, then multiple components of the time-series vector u(t) may be sampled from the same agent). If the sampling time is δt, we have this time-series vector measured at times t = 0, δt, 2δt, ..., T δt. Based on the properties of causal models from the previous section, we wish to obtain a computational model of the observed time-series yielding a corresponding time-series ũ(t) of the dynamics that (i) fits the observed time-series data u(t) (which we shall call the ‘training data’), (ii) can generate time-series data ũ(t) with statistical properties similar to those of u(t), and (iii) can predict the input data for a time duration (say, τ) into the future, i.e., given u(t), the model can predict ũ(t + τ) at its output. Fig. 1.1 shows how this is done with an RC.

Figure 1.1: Schematics of a reservoir computer predicting a five-variable time-series u(t) a time τ ahead into the future.

As depicted in Fig. 1.1, the RC is a network consisting of a large number of nodes. The number of nodes, Dr, is chosen such that Dr >> Du. Each of the nodes has a time-dependent scalar state, all of which are collected in a time-dependent column vector R(t) of length Dr. These reservoir nodes receive measured inputs of the system state vector u(t). At every time point t, this system state vector u(t) is fed into the reservoir via a Dr × Du input-to-reservoir coupling matrix Win (Fig. 1.1). Furthermore, the reservoir nodes are coupled among themselves with a Dr × Dr adjacency matrix W. The time evolution of the reservoir node states R over a small time step ∆t is given by the equation

R(t + ∆t) = σ(Win u(t), W R(t)),   (1.1)

which maps the reservoir state at time t to the reservoir state at the next time step, t + ∆t, where σ is a general nonlinear ‘activation’ function of the input and reservoir states, acting componentwise on its vector argument (and whose output must have the same dimension, Dr, as the reservoir state R). For simplicity, in this instance we can assume that δt = ∆t, even though in more general cases it suffices that the reservoir time-evolution iteration time step ∆t is an integer multiple of the sampling time step δt of the system. Based on the available time-series from the system u(t) for T time steps, Eq. 1.1 can be iterated to generate a corresponding time-series for the reservoir state, R(t). Since the goal is to predict the future values of the sampled time-series u(t + τ) from their current observed values u(t), we may utilize the reservoir state vectors R(t) for this purpose. In our case, this is done by using a suitable linear combination of the reservoir node states with a Du × Dr reservoir-to-output coupling matrix Wout (Fig. 1.1) according to the equation

ũ(t) = Wout R(t),   (1.2)

where the tilde in ũ indicates that the vector is a prediction from the RC, as opposed to being sampled from the actual system. Given the form of Eq. 1.2, the matrix Wout can be found by a linear regression between the vectors u(t) and R(t).
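As an illustration, the following is a minimal numpy sketch of Eqs. 1.1 and 1.2 and of this regression, assuming the common concrete choice σ(Win u, W R) = tanh(Win u + W R). The random matrix statistics, the sparsity and spectral radius of W, and the ridge (Tikhonov) regularization parameter are illustrative placeholders, not the exact settings used in later chapters.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
D_u, D_r = 5, 1000                          # input dimension D_u << D_r

# Fixed random matrices, chosen once and never trained (cf. Sec. 1.3.2).
W_in = rng.uniform(-0.5, 0.5, size=(D_r, D_u))
W = rng.uniform(-1.0, 1.0, size=(D_r, D_r)) * (rng.random((D_r, D_r)) < 0.02)
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius

def step(R, u):
    """One reservoir update, Eq. 1.1, with sigma = tanh acting componentwise."""
    return np.tanh(W_in @ u + W @ R)

def train_W_out(U, beta=1e-6):
    """Ridge regression for Eq. 1.2, set up for one-step-ahead prediction.

    U has shape (T, D_u), its rows being the sampled vectors u(0), u(dt), ....
    Returns W_out such that W_out @ R(t) approximates u(t + dt), plus the
    final reservoir state (reused below for closed-loop prediction).
    """
    T = U.shape[0]
    R = np.zeros(D_r)
    states = np.empty((T - 1, D_r))
    for t in range(T - 1):
        R = step(R, U[t])              # reservoir state after seeing u(t)
        states[t] = R
    targets = U[1:]                    # the values to be predicted
    # Tikhonov-regularized least squares; in practice an initial transient
    # of the reservoir states is also discarded before this step.
    W_out = np.linalg.solve(states.T @ states + beta * np.eye(D_r),
                            states.T @ targets).T
    return W_out, R
```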
This linear regression constitutes the training process of the RC, and the matrix Wout is the only element of the RC with adjustable parameters; the other matrices, Win and W, are chosen randomly before the iteration of Eq. 1.1 begins and are kept fixed throughout. Since the training consists of a linear regression, RCs can be trained much faster than conventional deep learning neural networks, which contain many adjustable parameters that have to be trained by gradient descent. A comparison of the training times of RCs and other machine learning architectures is given in Ref. [40].

While the subsequent chapters in this thesis will contain more details about the form of the nonlinear function σ, as well as the structure of the fixed matrices Win and W, we note here the generality of RCs. Indeed, Eq. 1.1 shows that any dynamical system with a sufficiently large number of ‘nodes’ or computing units, having some nonlinearity, and a mechanism for receiving inputs and providing readable outputs can serve as a potential RC. Indeed, as reviews of RC realizations with physical systems demonstrate [41], RCs can be experimentally realized with many different systems, irrespective of the microscopic details of their dynamics. Such physical implementations of the reservoir computer [41–43] include photonic [44–46], electronic and opto-electronic [47, 48], neuronal [49], molecular [50], chemical [51], and spin [52] systems, or even a tank of water [53]. This flexibility of physical realization makes RCs a prime candidate for ‘neuromorphic computing’ [43], an unconventional computing paradigm that seeks inspiration from neural systems to create novel computing architectures.

1.3.3 Reservoir computer models of time-series data and network inference

RCs have been used for a variety of applications in time-series analysis, including forecasting of spatially extended [54] and networked [55] chaotic systems, data-based reconstruction of long-term statistical properties of the dynamics like the Lyapunov spectrum [56, 57], separation of mixed signals from two sources [53, 58], and fast classification of spoken words [44], to mention a few. Among these applications, the first one is the most relevant for our purpose of network inference. In terms of properties (i) and (ii) from Sec. 1.3.1, RCs of moderate size (Dr ≈ 1000) are known to fit training data of dimension Du ≈ 10 accurately. Furthermore, as Fig. 1.2(a) shows, it is possible to use trained RCs for autonomous long-term time-series forecasting by running them in a closed-loop configuration, where the output time-series from the RC, ũ(t), at a given time step t is fed back as the input to the RC, thus eliminating the need for any inputs from the original time-series data. As shown in the example of Fig. 1.2(b), such predictions remain accurate for a long time, even for chaotic systems. It is to be noted here that, for chaotic systems, time-series starting from two different initial conditions separated by a small amount δu are expected to separate by an amount e^(Λmax t) δu after a time t, where Λmax > 0 is the maximum Lyapunov exponent of the system. Forecasting of such chaotic trajectories several Lyapunov times into the future thus shows that the closed-loop RC can serve as an accurate model of the observed dynamics. Fig. 1.2(b) further shows that, even when the accuracy of the long-term RC prediction breaks down (say, after a time t such that Λmax t > 6), the RC-predicted time-series qualitatively looks similar to the original time-series.
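In code, this closed-loop configuration of Fig. 1.2(a) amounts to only a few extra lines. The following sketch continues the one above, reusing step and the outputs of train_W_out (same illustrative assumptions, with τ equal to one sampling step):

```python
import numpy as np

def predict_autonomously(W_out, R, n_steps):
    """Run the trained RC in the closed loop of Fig. 1.2(a).

    Starting from the reservoir state R reached at the end of training, the
    RC output is fed back as its own input, so the machine generates a
    continuation of the training time-series with no further data.
    """
    predictions = []
    for _ in range(n_steps):
        u_tilde = W_out @ R        # read out the prediction, Eq. 1.2
        predictions.append(u_tilde)
        R = step(R, u_tilde)       # feed it back in as the input, Eq. 1.1
    return np.array(predictions)
```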
This qualitative similarity is confirmed by comparing long-term statistical properties of the dynamics, like the power spectrum (Fig. 1.2(c)) and the Lyapunov spectrum (the set of Lyapunov exponents of a chaotic system, ordered according to their numerical values, shown in Fig. 1.2(d)), for the original time-series and the closed-loop RC-predicted time-series. Figs. 1.2(c,d) show that the long-term statistical properties of the RC prediction indeed match the corresponding properties of the original time-series data, thus fulfilling the ‘time-series prediction and generalization’ condition (ii) of Sec. 1.3.1.

Figure 1.2: Properties of time-series prediction by reservoir computers: (a) a schematic of the closed-loop configuration of a trained RC used for autonomous time-series prediction, (b) original and RC-predicted time-series from a large, spatiotemporally chaotic system (Kuramoto-Sivashinsky model) and the prediction error over time, with time measured in terms of the maximum Lyapunov exponent Λmax, (c) comparison of the power spectrum, and (d) the Lyapunov spectrum of the original time-series data and the RC prediction for the Kuramoto-Sivashinsky model. (a) and (b) are adapted from Ref. [54]; (c) and (d) are adapted from Ref. [56].

Since the RC has been demonstrated in the literature to fulfill conditions (i) and (ii) of Sec. 1.3.1, we wish to know if it can capture the causal interactions between the dynamics of different agents in the complex system (condition (iii) of Sec. 1.3.1). In our formalism, this is represented by causal interaction among the components of the vector u(t). Mathematically, we suppose that, if a component i of the vector u(t) is causally affecting another component j of u(t), then the derivative ∂uj(t + τ)/∂ui(t) would be non-zero for some τ > 0. For example, if the system does not have any time-delay, then over a very short period of time τ = δt, the derivative would acquire a non-zero value if the dynamics of component i is causally affecting the dynamics of component j. For time-delayed systems, we can set τ to be the time-delay of the causal interactions to have a similar effect. Note that our notion of causality, defined via the partial derivative of a component of the state vector u(t) with respect to another component at a past time, is close to the concept of causality in terms of interventions. That is, for example, if we changed a component ui at time t by applying an external perturbation, that would cause a corresponding change to a component uj a time τ later if and only if the component j is causally connected to the component i over this time. As such, our task of causal inference in complex systems amounts to estimating the derivatives ∂uj(t + τ)/∂ui(t) from the sampled observations of the time-series of u(t).

Partial derivatives like ∂uj(t + τ)/∂ui(t) cannot be calculated from the time-series of u(t) in a straightforward way. This is because, when we test causal dependence between a pair of variables, we wish to have the other variables unchanged, so that their effects do not interfere with those of the pair of variables being tested. In the partial derivative ∂uj(t + τ)/∂ui(t), all components of the vector u(t) other than the i-th and j-th components are assumed to be held fixed. In the original time-series, this rarely happens, since at every sampling time point there are generally changes to all components of the state vector u(t) together.
Thus, the original time series of u(t) itself cannot be used to calculate the partial derivatives ∂uj(t+τ)/∂ui(t) indicating causal interactions. However, if we have a model for the observed dynamics of the state u(t), of the form u(t+τ) = M(u(t)), then we can use that model to calculate the partial derivatives, ∂uj(t+τ)/∂ui(t) = ∂Mj/∂ui, and test the causal dependence among the components of u(t). Since RCs are able to give us an excellent model of an observed time-series, one that not only fits the sampled data accurately but can also both predict and generalize it, we use the RC model of an observed time-series to calculate the partial derivatives. The model is obtained by combining Eqs. 1.1 and 1.2 into a single equation, assuming a perfect fit to the training data by the RC (i.e., u(t) ≈ ũ(t)),

u(t+τ) = Wout σ(Win u(t) + W r(t)), (1.3)

which we use as the model of the observed time-series to calculate the partial derivatives ∂uj(t+τ)/∂ui(t) = ∂Mj/∂ui. If the RC model of the time-series, given by Eq. 1.3, indeed captures the true causal dependencies among the components of the time-series, the resulting derivatives should reveal them. Throughout the rest of this thesis, we shall use equations like Eq. 1.3 to test whether RC models are able to actually predict the true causal dependencies among the components of a complex system, which will be known in all cases that we shall consider. We shall also investigate how properties of the time-series, like the amount of correlation among their components, and sampling limitations, like finite sampling time and observational and dynamical noise, systematically affect the causal learning performance of the RC.

1.4 Organization of this thesis

The next three chapters of the thesis describe three applications of causal inference by the RC.

1.4.1 RC inference of short-term causal dependence

In chapter 2, based on Ref. [59], we test whether RCs can capture short-term causal dependencies in the data, i.e., predict how one component of a time-series affects another over a very short time. To test this, we use simulated data from complex networks of coupled chaotic Lorenz oscillators. Varying the size of the network, as well as the observational and dynamical noise strengths in the time-series data, we show that the RC is able to correctly infer the causal interaction networks if the degree of correlation among the time-series from different Lorenz oscillators and the observational noise strength are sufficiently small. Interestingly, in contrast to what we have generally described so far in this chapter, the presence of dynamical noise is shown to aid RCs in identifying causal interactions. We hypothesize that the reason for this is twofold: (i) dynamical noise breaks down correlations and synchrony among the time-series from different Lorenz oscillators, and (ii) over a short time period, dynamical noise at one Lorenz oscillator only propagates to those particular oscillators which are causally connected to that oscillator. As such, dynamical noise in the network generates perturbations of the Lorenz oscillator nodal states, which propagate in time only to causally connected nodal states, revealing the causal connectivity structure in the process. Even for relatively small dynamical noise strengths, when these changes are infinitesimal, the RC is shown to be able to successfully use their information to correctly predict short-term causal connections among the networked Lorenz oscillators.
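Before moving on, the following is a minimal sketch of how derivatives like ∂uj(t+τ)/∂ui(t) can be read off a trained model of the form of Eq. 1.3. It reuses the hypothetical W_in, W, W_out, u_train, R, and T of the earlier sketch and, as a simplification, holds the reservoir state r(t) fixed while differentiating with respect to u(t); the full procedure, which eliminates r(t) altogether, is described in chapter 2.

def rc_jacobian(u, r):
    # d u(t+tau) / d u(t) for the model of Eq. 1.3 at the operating point
    # (u, r): the chain rule gives W_out diag(1 - tanh^2(a)) W_in,
    # with a = W_in u + W r (r held fixed, a simplifying assumption).
    a = W_in @ u + W @ r
    return (W_out * (1.0 - np.tanh(a) ** 2)) @ W_in   # shape (D_u, D_u)

# Time-averaged |Jacobian| over the recorded orbit; a large (j, i) entry
# suggests that component i causally drives component j.
J_avg = np.mean(
    [np.abs(rc_jacobian(u_train[n + 1], R[n])) for n in range(0, T - 2, 50)],
    axis=0,
)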
1.4.2 RC inference of time-delayed causal dependence

In contrast to the short-term causal inference of chapter 2, chapter 3, based on Ref. [60], discusses inference of time-delayed causal interactions by the RC. In this chapter, we use experimental and simulated voltage data from networks of delayed-feedback-coupled nonlinear optoelectronic oscillators (OEOs), an excellent testbed for studying the nonlinear dynamics of delay-coupled networks. The results are similar to those of chapter 2 in that the RC is shown to be able to capture the causal interaction network correctly, provided the amount of synchronization among the dynamical variables is small, something that can be achieved by tuning the coupling strengths, increasing the dynamical noise strength, having different time-delays along different feedback network links, or rewiring the OEO networks. Moreover, the RCs identifying causal interactions are here shown to perform equally well on experimental data. This observation inspires us to apply the RC causal inference technique to more real-world time-series datasets from less well-controlled experiments. This is the subject of chapter 4.

1.4.3 Effects of sampling conditions on RC causal inference and statistical analysis

In chapter 4, we apply the RC technique for short-term causal dependence, but to publicly available experimental data on calcium fluorescence dynamics from the neural system of live C. elegans. Such data suffer from low temporal sampling resolution, a high noise level, a large amount of correlation among different variables, and only a partial measurement of the full biophysical neural state of the organism. However, the C. elegans neuronal connectome is completely mapped and thus serves as the ground truth for benchmarking the network inference performance of the RC. To investigate the effect of data limitations on network inference in this system, we also use time-series data from a synthetic model of coupled Lorenz oscillators on the same network and systematically change the different sampling parameters. In all such cases, when the data limitations prevent complete identification of the causal network, we apply a previously-developed statistical technique with surrogate time-series data to assign a statistical confidence to each inferred causal link. The RC network inference technique, together with this statistical methodology, is shown to yield useful information about the C. elegans neural connectome when compared with the ground truth connectome. Finally, the next section summarizes how the RC can be used for inference of causal connections in complex systems.

Chapter 2: Inference of short term causal interactions: applications in coupled Lorenz oscillators

This chapter is based on the publication "Using machine learning to assess short term causal dependence and infer network links" by Amitava Banerjee, Jaideep Pathak, Rajarshi Roy, Juan G. Restrepo, and Edward Ott, published as Chaos 29, 121104 (2019). The code to generate some of the results can be found in the GitHub repository banerjeeamitava/Reservoir-Computer-Network-Inference.

2.1 Overview

This chapter demonstrates the first concrete application of our network inference procedure, using simulated time-series data from networks of coupled chaotic Lorenz oscillators. The problem of determining causal dependencies among various agents in an unknown time-evolving system, from its time series observations, has wide-ranging and diverse applications.
These include inferring neuronal connections from spiking data, deducing causal dependencies between genes from expression data, discovering long spatial range influences in climate variations, etc. Previous work has often addressed such problems by considering correlations, prediction impact, or information transfer metrics. In this chapter we propose a new method for causal inference in complex systems that leverages the potential ability of a specific type of machine learning, called 'reservoir computing' (RC), to perform predictive and generative tasks involving complex time-series. Using RCs, we introduce and test a general computational technique for the inference of short term causal dependence among the measured time-dependent state variables of an unknown dynamical system. The basic method involves training the RC with the observed time-series data of the state variables of the unknown dynamical system to construct a computer model of the unknown dynamics, and, subsequently, estimating the causal dependencies of different state variables on each other in that computer model. We test our method on model complex systems consisting of networks of many interconnected chaotic Lorenz oscillators. Comparing the causal networks inferred by the RC to the ground truth, we find that dynamical noise can greatly enhance the effectiveness of our technique, while observational noise and the presence of synchronization degrade it. We believe that the competition between these two opposing types of noise, as well as the degree of synchrony in the system, will be the key factor determining the success of causal inference in many of the most important application situations. These tests show that machine learning offers a unique and potentially highly effective approach to the general problem of causal inference.

2.2 Introduction

The core goal of science is often described as generalization from observations to understanding [61], commonly embodied in predictive theories. Related to this is the desire to use measured data to infer necessary properties and structure of any description consistent with a given class of observations. On the other hand, it has recently emerged that machine learning (ML) is capable of effectively performing a wide range of interpretive and predictive tasks on data [62]. Thus it is natural to ask whether machine learning might be useful for the common scientific goal of discovering structural properties of a system from data generated by that system. In this chapter we consider an important, widely applicable class of such tasks. Specifically, we consider the use of machine learning to address two goals. Goal (i): Determine whether or not a state variable of a time evolving system causally influences another state variable. Goal (ii): Determine the 'strength' of such causal influences. In the terminology of ML, Goal (i) is referred to as "classification ML," and Goal (ii) is referred to as "regression ML." These goals have previously been of great interest in many applications (e.g., economics [33], neuroscience [63], genomics [21], climate [64], etc.). Many past approaches have, for example, been based upon the concepts of prediction impact [33, 63], correlation [65–68], information transfer [31, 69], and direct physical perturbations [11, 70].
Other previous works have investigated the inference of network links from time series of node states by assuming some prior knowledge of the form of the network system and using that knowledge in a fitting procedure to determine links [66, 71–74]. In addition, some recent papers address network link inference from data via techniques based on delay coordinate embedding [72], random forest methods [75], network embedding algorithms [76], and feature ranking [77]. In this chapter, we introduce a technique that makes use of an ML training process in performing predictive and interpretive tasks and attempts to use it to extract information about causal dependencies. In particular, we use a type of machine learning called reservoir computing, an efficient method of time series analysis which has previously been successfully used for different tasks, e.g., prediction of chaotic dynamics [54, 78, 79] and speech recognition [44, 47], to mention a few. In our case, a "reservoir" dynamical system is trained such that it becomes synchronized to a training time series data set from the unknown system of interest. The trained reservoir system is then able to provide an estimate of the response to perturbations in different parts of the original system, thus yielding information about causal dependencies in the actual system. We will show that this ML-based technique offers a unique and potentially highly effective approach to determining causal dependencies. Furthermore, the presence of dynamical noise (either naturally present or intentionally injected) can very greatly improve the ability to infer causality [71, 72], while, in contrast, observational noise degrades inference. The rest of this chapter is organized as follows: Sec. 2.3 mathematically defines our notion of short term causal dependence (STCD) in the context of dynamical systems and argues how it captures the effects of small external interventions into the system, which are often employed in experiments to identify causal connections. Section 2.4 introduces the method of inferring STCD using reservoir computer models of the dynamics. The next section, Sec. 2.5, illustrates the performance of the RC-based STCD inference technique on a system of coupled Lorenz oscillators, and identifies conditions which are helpful and detrimental for STCD inference.

2.3 Short Term Causal Dependence (STCD)

We begin by considering the very general case of an evolving, deterministic, dynamical system whose state at time t is represented by the M-dimensional vector z(t) = [z1(t), z2(t), . . . , zM(t)]^T, where z(t) evolves via a system of M differential equations, dz(t)/dt = F(z(t)), and has reached a statistically steady dynamical state (perhaps chaotic). In this context, we frame the issue of causality as follows: Will a perturbation at time t applied to a component zi of the state vector z(t) (i.e., zi(t) → zi(t) + δzi(t)) lead to a subsequent change at a slightly later time, t+τ, of another scalar component zj (i.e., zj(t+τ) → zj(t+τ) + δzj(t+τ)); and how can we quantify the strength of this dependence? This formulation might suggest comparison of the evolutions of z(t) that result from two identical systems, one with, and the other without, application of the perturbation. However, we will be interested in the typical situation in which such a comparison is not possible, and one can only passively observe (measure) the state z(t) of the (single) system of interest.
Aside from knowing that the dynamics of interest evolves according to a system of the form dz/dt = F(z), we assume little or no additional knowledge of the system, and that the available information is a limited-duration past time series of the state evolution z(t). Nevertheless, we still desire to deduce causal dependencies, where the meaning of causal is in terms of responses to perturbations as defined above. Since, as we will see, accomplishment of this task is, in principle, not always possible, our approach will be to first propose a heuristic solution, and then numerically test its validity. The main message of this chapter is that our proposed procedure can be extremely effective for a very large class of important problems. We will also delineate situations where our procedure is expected to fail. We emphasize that, as our method is conceptually based on consideration of responses to perturbations, in our opinion, it provides a more direct test of what is commonly of interest when determining causality than do tests based on prediction impact, correlation, or entropy metrics. Furthermore, although the setting motivating our procedure is that of deterministic systems, dz/dt = F(z), we will also investigate the performance of our procedure in the presence of both dynamical noise (i.e., noise added to the state evolution equation, dz/dt = F(z)) and observational noise (i.e., noise added to the observations of z(t) used as training data for the machine learning). Both types of noise are, in practice, invariably present. An important result from our study is that the presence of dynamical noise can very greatly enhance the accuracy and applicability of our method (a similar point has been made in Refs. [71] and [72]), while observational noise degrades the ability to infer causal dependence.

To more precisely define causal dependence, we consider the effect of a perturbation of one variable on the other variables as follows. Taking the jth component of dz/dt = F(z), we have dzj(t)/dt = Fj(z1(t), z2(t), . . . , zM(t)), for j = 1, 2, . . . , M. Perturbing zi(t) by δzi(t), we obtain, for small τ, that the component of the orbit perturbation of zj at time (t+τ) due to δzi is

δzj(t+τ) = τ [∂Fj(z)/∂zi]|_{z=z(t)} δzi(t) + O(τ²).

We define the Short Term Causal Dependence (STCD) metric, fji, of zj on zi by

fji = ⟨G(∂Fj(z)/∂zi)⟩, (2.1)

where ⟨(...)⟩ denotes a long time average of the quantity (...) over an orbit, and the function G is to be chosen in a situation-dependent manner. For example, later in this chapter, we consider examples addressing Goal (i) (where we want to distinguish whether or not ∂Fj(z)/∂zi is always zero), for which we use G(q) = |q|, while, when we consider an example addressing Goal (ii) and are concerned with the time-averaged signed value of the interaction strength, we use G(q) = q. In either case, we view fji as quantifying the causal dependence of zj on zi, and the key goal of this chapter will be to obtain and test a machine learning procedure for estimating fji from observations of the state evolution z(t). For future reference, we will henceforth denote our machine learning estimate of fji by f̂ji.
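To make the definition of Eq. (2.1) concrete, the following is a minimal sketch computing fji directly from a known vector field F, here a single classical Lorenz system (an illustrative choice; in our actual tests F is unknown to the inference procedure, and fji computed this way serves only as the ground truth against which f̂ji is compared).

# Toy illustration of the STCD metric of Eq. (2.1) for a known F
# (a single classical Lorenz system); crude Euler integration, and,
# for simplicity, the initial transient is not discarded.
import numpy as np

def F(s):
    x, y, z = s
    return np.array([10.0 * (y - x),
                     28.0 * x - y - x * z,
                     -(8.0 / 3.0) * z + x * y])

def jacobian(s):                        # exact dF_j/dz_i at the state s
    x, y, z = s
    return np.array([[-10.0, 10.0, 0.0],
                     [28.0 - z, -1.0, -x],
                     [y, x, -8.0 / 3.0]])

dt, s = 0.001, np.array([1.0, 1.0, 1.0])
n_steps = 200000
f = np.zeros((3, 3))
for _ in range(n_steps):                # long time average along the orbit
    f += np.abs(jacobian(s)) / n_steps  # G(q) = |q|, as for Goal (i)
    s = s + dt * F(s)
# f[0, 2] = 0 reflects that z does not directly drive x; all other
# entries are positive, matching the structure of F.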
In the case of our Goal (i) experiments, where G(q) = |q|, we note that fji defined by Eq. (2.1) is an average of a non-negative quantity, and thus fji ≥ 0, as will be our estimate, f̂ji ≥ 0. Furthermore, for that case we will define STCD of zj on zi by the condition fji > 0, and, when using our machine learning estimate f̂ji, we shall judge STCD to likely apply when f̂ji > ϵ, where we call ϵ > 0 the discrimination threshold. In the ideal case (f̂ji = fji), the discrimination threshold ϵ can be set to zero, but, in practice, due to error in our estimate, we take ϵ to be a suitably chosen positive number. We note that, in the ideal case, ϵ = 0 can be regarded as a test for whether or not Fj(z) is independent of zi.

As a demonstration of a situation for which the determination of STCD from observations of the motion of z(t) on its attractor is not possible, we note the case where the attractor is a fixed point (a zero-dimensional attractor). Here, the measured available information is the M numbers that are the coordinates of the fixed point, and this information is clearly insufficient for determining STCD. As another problematic example, we note that in certain cases one is interested in a dynamical system that is a connected network of identical dynamical subsystems, and that such a network system can exhibit exact synchronization of its component subsystems [80] (including cases where the subsystem orbits are chaotic). In the case where such a synchronized state is stable, observations of the individual subsystems are indistinguishable, and it is then impossible, in principle, to infer causal relationships between state variables belonging to different subsystems. More generally, in addition to the above fixed point and synchronization examples, we note that the dimension of the tangent space at a given point z* on the attractor is, at most, the smallest embedding dimension of the part of the attractor in a small neighborhood of z*. Thus the full M×M Jacobian of F(z) at z* cannot be precisely determined from data on the attractor when the local attractor embedding dimension at z* is less than M, which is commonly the case. These examples thus motivate the conjecture that, to efficiently and accurately infer STCD, the orbital complexity of the dynamics should be large enough to encode the information that we seek. Note that these considerations of cases where inference of STCD is problematic do not apply to situations with dynamical noise, e.g., dz/dt = F(z) + (noise), as the addition of noise may be roughly thought of as introducing an infinite amount of orbital complexity. Alternatively, the addition of noise increases the embedding dimension of the data to that of the full state space, i.e., M.

2.4 Using a Reservoir Computer to Determine Short Term Causal Dependence

We base our considerations on a type of machine learning called reservoir computing, originally put forward in Refs. [81] and [82] (for a review, see Ref. [83]). We assume that we can sample the time-series data z(t) from our system at regular time intervals of length τ, so that we have a discrete set of observations {z(0), z(τ), z(2τ), ...}. To begin, we first describe a reservoir-computer-based machine learning procedure in which the reservoir computer is trained to give an output ẑ(t+τ) in response to an M-dimensional input z(t), as illustrated in Fig. 2.1. For our numerical tests we consider a specific reservoir computer implementation (Fig. 2.1) in which the reservoir consists of a network of R ≫ M nodes whose scalar states, r1(nτ), r2(nτ), . . . , rR(nτ), are the components of the R-dimensional vector r(nτ).
The nodes interact dynamically with each other through an R×R network adjacency matrix A, and their evolution is also influenced by the coupling of the M-dimensional input z(nτ) to the individual nodes of the reservoir network by the R×M input coupling matrix Win, according to the neural-network-type evolution equation (e.g., Refs. [54, 78, 79, 83–85])

r((n+1)τ) = tanh(A r(nτ) + Win z(nτ)), (2.2)

where tanh(v) for a vector v = (v1, v2, v3, . . .)^T is defined as (tanh v1, tanh v2, tanh v3, . . .)^T.

Figure 2.1: Schematic of the reservoir computing architecture used in this work. The input-to-reservoir coupling matrix Win couples the input time series for the vector z to the reservoir state vector r. The reservoir-to-output coupling matrix Wout generates the output vector ẑ from the reservoir. ẑ is found to be a good estimate of z after training.

For proper operation of the reservoir computer, it is important that Eq. (2.2) satisfy the 'echo state property' [78, 81, 83] (in nonlinear dynamics this condition is also known as 'generalized synchronization' [86–88]): given two different initial reservoir states, r*1 and r*2, for the same input time series of z, the difference between the two corresponding reservoir states converges to zero as they evolve in time (that is, |r1(t) − r2(t)| → 0 as t → ∞, implying that, after an initial transient period, r(t) essentially depends only on the past history of z, i.e., z(t′) for t′ ≤ t, and not on the initial condition for r). Using measured input training data over a training interval of length Tτ, which begins after the initial transient period mentioned above, we use Eq. (2.2) to generate r(τ), r(2τ), . . . , r(Tτ). We also record and store these determined values r(nτ), along with the corresponding inputs z(nτ) that created them. The matrices A and Win are regarded as fixed and are typically chosen randomly. In contrast, the M×R output coupling matrix Wout, shown in Fig. 2.1, is regarded as an adjustable linear mapping from the reservoir states r to an M-dimensional output vector ẑ,

ẑ((n+1)τ) = Wout r((n+1)τ). (2.3)

'Training' of the machine learning reservoir computer then consists of choosing the RM adjustable matrix elements ('weights') of Wout so as to make ẑ(nτ) a very good approximation to z(nτ) over the time duration (τ, 2τ, . . . , Tτ) of the training data. This is done by minimization with respect to Wout of the quantity

{Σ_{n=1}^{T} ‖z(nτ) − Wout r(nτ)‖²} + β‖Wout‖².

Here β‖Wout‖², with β small, is a 'ridge' regularization term [89] added to prevent overfitting, and (r(nτ), z(nτ)) are the previously recorded and stored training data. In general, R ≫ M is required in order to obtain a good fit of ẑ to z. For illustrative purposes we now consider the ideal case where ẑ = z (i.e., the training perfectly achieves its goal). For the purpose of estimating STCD, we now wish to eliminate the quantity r from the basic reservoir computer system (Eqs. (2.2) and (2.3)) to obtain an evolution equation solely for the state variable z. To do this, we would like to solve Eq. (2.3) for r in terms of z. However, since R, the dimension of r, is much larger than M, the dimension of z, there are typically an infinite number of solutions of Eq. (2.3) for r. To proceed, we hypothesize that it may be useful to eliminate r by choosing it to be the solution of Eq. (2.3) with the smallest L2 norm.
This condition defines the so-called Moore-Penrose inverse [90] of Wout, which we denote Ŵ⁻¹out; i.e., the minimum-L2-norm solution for r is written r = Ŵ⁻¹out z. We emphasize that Ŵ⁻¹out z is not necessarily expected to give the correct r obtained by solving the system of Eqs. (2.2) and (2.3). However, from the numerical results to follow, our choice will be supported by the fact that it often yields very useful estimates of fji. Now applying Wout to both sides of Eq. (2.2), and employing r = Ŵ⁻¹out z to eliminate r(nτ) from the argument of the tanh function in Eq. (2.2), we obtain a surrogate time-τ map for the evolution of z, z((n+1)τ) = H[z(nτ)], where

H(z) = Wout tanh[(A Ŵ⁻¹out + Win) z].

Here we note that we do not claim that this map in itself can be used for time-series prediction in place of Eqs. (2.2) and (2.3), which were commonly used in previous works (e.g., Refs. [54, 78, 79, 84, 85]). Rather, we use it as a symbolic representation of the result obtained after eliminating the reservoir state vector r from Eqs. (2.2) and (2.3). In particular, the prediction recipe using Eqs. (2.2) and (2.3) is always unique and well-defined, in contrast to the above map, where the choice of inverse for Wout is clearly non-unique. So, we use this map only for causality estimation purposes, as described below. Differentiating H(z) with respect to zi, we have

∂Fj(z)/∂zi = τ⁻¹ [∂Hj(z)/∂zi − δij], (2.4)

where δij is the Kronecker delta, and we propose to use Eqs. (2.1) and (2.4) to determine STCD.

In our numerical experiments, the number of training time steps is T = 6×10^4 for Figs. 2.2 and 2.3, and T = 2×10^4 for Fig. 2.4. In each case, the actual training data is obtained after discarding a transient part of 2×10^4 time steps, and the reservoir system sampling time is τ = 0.02. The elements of the input matrix Win are randomly chosen in the interval [−0.1, 0.1]. The reservoir is a sparse random network of R = 5000 nodes for Figs. 2.2 and 2.3, and of R = 1000 nodes for Fig. 2.4. In each case the average number of incoming links per node is 3. Each nonzero element of the reservoir adjacency matrix A is randomly chosen from the interval [−a, a], and a > 0 is then adjusted so that the maximum magnitude eigenvalue of A is 0.9. The regularization parameter is β = 10^−4. These parameters are adapted from Ref. [79]. The average indicated in Eq. (2.1) is taken over 1000 time steps. The chosen time step τ is sufficiently small compared to the timescale over which z(t) evolves that the discrete time series z(nτ) is a good representation of the continuous variation of z(t). Although we use a specific reservoir computing implementation, we expect that, with suitable modifications, our approach can be adapted to 'deep' types of machine learning [62], as well as to other implementations of reservoir computing [44, 47, 48, 91] (notably implementations involving photonics [44], electronics [48], and field-programmable gate arrays (FPGAs) [47]).
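The following is a minimal sketch of this estimation step. The names are illustrative: A, W_in, and W_out are assumed to be the trained reservoir matrices of Eqs. (2.2) and (2.3), z_train the stored array of training inputs, and the sampling choices are assumptions rather than the exact settings above; this is not the code used for the results below (see the repository referenced in the chapter header for that).

# Sketch of the STCD estimation of Eqs. (2.1) and (2.4): build the
# surrogate map H(z) = W_out tanh[(A W_out^+ + W_in) z] using the
# Moore-Penrose inverse W_out^+, then time-average |dF_j/dz_i|.
import numpy as np

W_out_pinv = np.linalg.pinv(W_out)      # Moore-Penrose inverse of W_out
B = A @ W_out_pinv + W_in               # so that H(z) = W_out tanh(B z)
tau, M = 0.02, W_out.shape[0]

def H_jacobian(z):
    s = 1.0 - np.tanh(B @ z) ** 2       # tanh' at the operating point
    return (W_out * s) @ B              # entries dH_j/dz_i, shape (M, M)

f_hat = np.zeros((M, M))
samples = z_train[::60][:1000]          # ~1000 states sampled along the orbit
for z in samples:                       # Eq. (2.4), with G(q) = |q| (Goal (i))
    f_hat += np.abs((H_jacobian(z) - np.eye(M)) / tau) / len(samples)
# Links are then assigned where f_hat[j, i] exceeds the threshold epsilon.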
2.5 Tests of Machine Learning Inference of STCD

In order to evaluate the effectiveness of our proposed method, we introduce mathematical model test systems that we use as proxies for the unknown system of interest whose state variables we wish to determine STCD for. We next use the test systems to generate simulated training data from which we determine STCD by our ML technique. We then assess the performance of the technique by the correctness of its results, determined from the known properties of the test systems.

We first consider examples addressing our Goal (i) (G(q) = |q| in Eq. (2.1)), and for our simulation test systems, we consider the case of a network of N nodes and L links, where each node is a classical Lorenz system [92] with heterogeneity from node to node, additive dynamical noise, and internode coupling,

dxk/dt = −10[xk − yk + c Σ_{l=1}^{N} a^(x,y)_kl (yl − yk)] + σDyn nkx(t), (2.5)
dyk/dt = 28(1 + hk) xk − yk − xk zk + σDyn nky(t), (2.6)
dzk/dt = −(8/3) zk + xk yk + σDyn nkz(t). (2.7)

The state space dimension of this system is M = 3N. The coupling of the N nodes is taken to be only from the y variable of one node to the x variable of another node, with coupling constant c, and a^(x,y)_kl is either 1 or 0 depending on whether or not there is a link from l to k. The adjacency matrix a^(x,y)_kl of our Lorenz network (not to be confused with the adjacency matrix A of the reservoir) is constructed by placing directed links between L distinct randomly chosen node pairs. For each node k, hk is randomly chosen in the interval [−h, +h], and we call h the heterogeneity parameter. Independent white noise terms of equal variance σ²Dyn are added to the right-hand sides of the equations for dx/dt, dy/dt, and dz/dt, where, for example, ⟨nkx(t) nk′x(t′)⟩ = 2δkk′ δ(t − t′). For σDyn = c = h = 0, each node obeys the classical chaotic Lorenz equation with the parameter values originally studied by Lorenz [92]. Furthermore, denoting the right-hand side of Eq. (2.5) by Fxk, we have |∂Fxk/∂yl| = 10c or 0, depending on whether there is, or is not, a link from yl to xk. Since in this case the derivative ∂Fxk/∂yl is time-independent, ⟨|∂Fxk/∂yl|⟩ is also either 10c or 0, and, adopting the notation f^(x,y)_kl = ⟨|∂Fxk/∂yl|⟩, we denote its machine learning estimate, obtained by our previously described procedure, by f̂^(x,y)_kl. For a reasonably large network, the number N² − N of ordered pairs (k, l) of distinct nodes is large, and we consequently have many values of f̂^(x,y)_kl. Bayesian techniques (see Ref. [93] and references therein) can be applied to such data to obtain an estimate L̂ for the total number of links L, and one can then set the value of ϵ so that there are L̂ values of f̂^(x,y)_kl that are greater than ϵ. Less formally, we find that making a histogram of the values of f̂^(x,y)_kl often reveals a peak at zero and another peak at a higher positive value, with a large gap or discernible minimum in between. One can then estimate ϵ by a value in the gap or by the location of the minimum between the peaks, respectively. For simplicity, in our illustrative numerical simulations to follow, we assume that L is known (approximately equivalent to the case where L is unknown but a very good estimate L̂ has been obtained).
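As an illustration, a minimal sketch of simulating this test system, Eqs. (2.5)-(2.7), follows (Euler-Maruyama integration; the graph construction, step size, and parameter values here are illustrative assumptions, not the exact settings used for the figures below). Its output array z_train is of the form assumed by the estimation sketch above.

# Sketch: N coupled, noisy, heterogeneous Lorenz nodes, Eqs. (2.5)-(2.7),
# on a random directed graph (illustrative parameters).
import numpy as np

rng = np.random.default_rng(1)
N, L, c, h, sigma_dyn = 20, 50, 0.3, 0.06, 0.0
dt, n_steps = 0.002, 60000               # sampling every 10 steps -> tau = 0.02

# Directed adjacency: a[k, l] = 1 if there is a link from node l to node k
offdiag = [(k, l) for k in range(N) for l in range(N) if k != l]
a = np.zeros((N, N))
for idx in rng.choice(len(offdiag), size=L, replace=False):
    a[offdiag[idx]] = 1.0

hk = rng.uniform(-h, h, N)               # node heterogeneity h_k
x, y, z = rng.normal(0.0, 1.0, (3, N))

traj = []
for n in range(n_steps):                 # Euler-Maruyama step of Eqs. (2.5)-(2.7)
    dx = -10.0 * (x - y + c * (a @ y - a.sum(axis=1) * y))
    dy = 28.0 * (1.0 + hk) * x - y - x * z
    dz = -(8.0 / 3.0) * z + x * y
    noise = sigma_dyn * np.sqrt(2.0 * dt) * rng.normal(size=(3, N))
    x = x + dt * dx + noise[0]
    y = y + dt * dy + noise[1]
    z = z + dt * dz + noise[2]
    if n % 10 == 0:
        traj.append(np.concatenate([x, y, z]))
z_train = np.array(traj)                 # M = 3N dimensional training series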
Example 1: A heterogeneous noiseless case. We consider the parameter set c = 0.3, h = 0.06, σDyn = σObs = 0, N = 20, and we vary the number of links L. Figures 2.2(a) (for L = 50) and 2.2(b) (for L = 100) each show an array of 20×20 boxes, where each box represents an ordered node pair (k, l) of the 20-node network, and the boxes have been colored (see Table 2.1) according to whether our procedure predicts a link from l to k ("positive") or not ("negative"), and whether the prediction is correct ("true") or wrong ("false").

Table 2.1: Color-coding scheme for Figs. 2.2 and 2.3.
TP (True Positive): black square
TN (True Negative): white square
FP (False Positive): blue square
FN (False Negative): red square

We see that for a typical case with L = 50 (Fig. 2.2(a)) all the boxes have been correctly labeled, corresponding to all boxes being either black or white. In contrast to this perfect result at L = 50, at L = 100 (Fig. 2.2(b)) the method fails badly, and the fraction of correct inferences is small. In fact, we find excellent performance for L ≤ 50, but, as L increases past 50, the performance of our method degrades markedly. This is shown in Fig. 2.2(c), where we plot the number of false positives (FP) normalized to the expected value of FP that would result if L links were randomly assigned to the N² − N = 380 node pairs (k, l). (We denote this normalization ⟨FP⟩R; it is given by ⟨FP⟩R = L(380 − L)/380.) Note that, with this normalization, for the different heterogeneities plotted in Fig. 2.2(c), the curves are similar, and they all begin increasing at around L = 60, with FP/⟨FP⟩R becoming nearly 1 (i.e., inference no better than random) past L ≈ 100. In our earlier discussion we conjectured that, for inference of STCD to be possible, the orbital complexity should not be too small. To test this conjecture, we have calculated the information dimension DINFO of the network system attractor corresponding to the parameters c = 0.3, h = 0, σDyn = 0, N = 20, as a function of L. We do this by calculating the Lyapunov exponents of the system of Eqs. (2.5)-(2.7), and then applying the Kaplan-Yorke formula for DINFO in terms of the calculated Lyapunov exponents [94, 95]. The result is shown in Fig. 2.2(d), where we see that DINFO decreases with increasing L. Regarding DINFO as a measure of the orbital complexity, this is consistent with our expectation that the ability to infer STCD will be lost if the orbital complexity of the dynamics is too small. As we next show, the above negative result for L increasing past about 60 does not apply when even small dynamical noise is present.

Example 2: The effects of dynamical and observational noise. We first consider the effect of dynamical noise of variance σ²Dyn for the parameters h = 0 (homogeneous), c = 0.3, N = 20, L = 200. Results (similar in style to Figs. 2.2(a) and 2.2(b)) are shown in Figs. 2.3(a), 2.3(b), and 2.3(c). For extremely low dynamical noise variance, σ²Dyn = 10^−9 (Fig. 2.3(a)), the result is essentially the same as for zero noise, and about one quarter of the boxes are classified as each of TP, TN, FP, and FN (since there are 200 links and 400 boxes, this is no better than random assignment). As the noise variance is increased to σ²Dyn = 10^−7.5 (Fig. 2.3(b)), the results become better, with a fraction 0.75 of the boxes either TP or TN (as opposed to 0.52 for Fig. 2.3(a)). Upon further increase of the dynamical noise variance to the still small value of σ²Dyn = 10^−6 (Fig. 2.3(c)), the results typically become perfect or nearly perfect. Furthermore, excellent results, similar to those for σ²Dyn = 10^−6, continue to apply for larger σ²Dyn. This is shown by the red curve in Fig. 2.3(f), which shows FP/⟨FP⟩R versus σ²Dyn (N = 20, L = 200).

Figure 2.2: Results of Example 1 (noiseless case). Panels (a) and (b) show the results of link inference for two noiseless cases, with L = 50 links and L = 100 links, respectively (panel titles L = 50, h = 0.06 and L = 100, h = 0.06; node indices k and l on the axes). The inference is perfect in (a), but is very bad in (b). (c) FP/⟨FP⟩R versus L for h = 0, 0.06, 0.15, averaged over 100 random realizations of the system and the reservoir adjacency matrix. (d) The orbital complexity, as measured by the attractor information dimension DINFO, decreases with increasing L. At each value of L, we compute DINFO for 10 random realizations of a network with L links and h = 0. The Kaplan-Yorke dimension is then averaged over all network realizations, and the resulting plot is further smoothed by applying a moving average filter.
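For reference, a minimal sketch of the Kaplan-Yorke estimate of DINFO from an ordered Lyapunov spectrum, as used for Fig. 2.2(d), follows (the example exponent values quoted for the single classical Lorenz attractor are approximate).

# Kaplan-Yorke (information) dimension from a Lyapunov spectrum:
# D = K + (lambda_1 + ... + lambda_K) / |lambda_{K+1}|, where K is the
# largest index for which the partial sum of exponents is non-negative.
import numpy as np

def kaplan_yorke(lambdas):
    lams = np.sort(np.asarray(lambdas, dtype=float))[::-1]  # decreasing order
    csum = np.cumsum(lams)
    nonneg = np.where(csum >= 0)[0]
    if len(nonneg) == 0:                 # even the largest exponent is negative
        return 0.0
    K = nonneg[-1]
    if K == len(lams) - 1:               # partial sums never go negative
        return float(len(lams))
    return (K + 1) + csum[K] / abs(lams[K + 1])

# Example: for the classical Lorenz attractor, exponents of roughly
# (0.906, 0, -14.57) give D_INFO ~ 2.06.
print(kaplan_yorke([0.906, 0.0, -14.57]))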
Importantly, we also note that our normalization of FP by ⟨FP⟩R makes the red curve essentially L-independent over the range we have tested, 50 ≤ L ≤ 200. Our interpretation of this dynamical-noise-mediated strong enhancement of our ability to correctly infer links is that the dynamical noise allows the orbit to explore the state space dynamics off the orbit's attractor, and that the machine learning is able to make good use of the information it thus gains. We now turn to the effect of observational noise, by replacing the machine learning time series training data formerly used, [xk(nτ), yk(nτ), zk(nτ)], by [xk(nτ) + σObs n̂kx(nτ), yk(nτ) + σObs n̂ky(nτ), zk(nτ) + σObs n̂kz(nτ)], where the parameter σ²Obs is the observational noise variance and the n̂kx, n̂ky, n̂kz are independent Gaussian random variables with, e.g., ⟨n̂kx(nτ) n̂k′x(n′τ)⟩ = 2δkk′ δnn′. The blue curve in Fig. 2.3(f) shows the effect of adding observational noise of variance σ²Obs on top of dynamical noise for the situation σ²Dyn = 10^−5 of Fig. 2.3(c). We see from Figs. 2.3(d)-(f) that, when σ²Obs is below about 10^−5, it is too small to have much effect, but, as σ²Obs is increased above 10^−5, the observational noise has an increasingly deleterious effect on link inference. This negative effect of observational noise is to be expected, since inference of characteristics of the unknown system is necessarily based on the part of the signal that is influenced by the dynamics of the unknown system, which the observational noise tends to obscure.

Figure 2.3: The effect of noise on STCD inference. Panels (a), (b), and (c) show that the effect of increasing the dynamical noise variance σ²Dyn is to greatly enhance the effectiveness of link identification, even at the rather low noise level of σ²Dyn = 10^−6. In contrast, as shown in panels (d), (e), and (f), starting with the situation of (c) and increasing the observational noise variance σ²Obs degrades link identification. L = 200 and h = 0 for all the subfigures here.

Example 3: Inferring Continuous Valued Dependence Strengths. We now wish to address Goal (ii) (for which we take G(q) = q in Eq. (2.1)), and we accordingly consider the case where f^(x,y)_kl for each (k, l) takes on a value in a continuous range (rather than the case of Examples 1 and 2, where f^(x,y)_kl is either 10c or zero for all (k, l)). For this purpose we replace Eq. (2.5) by

dxk/dt = −10(xk − yk) + Σ_l f^(x,y)_kl yl, (2.8)

and consider Eqs. (2.6), (2.7), and (2.8) as our new test system, with h = 0.9, σ²Dyn = σ²Obs = 0, and N = 100 nodes (corresponding to 100×100 = 10^4 possible connection strength values). We choose the connection strength values as follows. A photographic portrait of Edward N. Lorenz is divided into 100×100 = 10^4 boxes and, by using a shading scale from dark (coded as +10) to light (coded as −5), Fig. 2.4(a) is obtained, with the shading scale given to the right of Fig. 2.4(b). Setting f^(x,y)_kl equal to the color scale value of box (k, l), we next numerically solve Eqs. (2.6), (2.7), and (2.8).
We then use this orbit as the training data input to our ML determination of the causal dependence strengths, f̂^(x,y)_kl, and, employing the same shading scale, use the thus-determined values of f̂^(x,y)_kl to reconstruct the original portrait, as shown in Fig. 2.4(b). We see that, although the reproduction is not exact, the overall picture is still clearly recognizable, indicating the effectiveness of the method for Goal (ii). For a more quantitative comparison of the actual and estimated Jacobian elements, we calculate the normalized Frobenius norm of their difference matrix f^(x,y) − f̂^(x,y). We first apply upper and lower cut-offs, equal to 10 and −5.5 respectively, to f̂^(x,y), in order to eliminate some extreme values. Then we calculate the ratio

δ = ‖f^(x,y) − f̂^(x,y)‖_F / ⟨‖f^(x,y) − f̃^(x,y)‖_F⟩, (2.9)

where ‖M‖_F = √(Trace(M†M)) = √(Σ_{i,j} |Mij|²) is the Frobenius norm of the matrix M. Here f̃^(x,y) denotes a matrix constructed by randomly permuting the elements of the matrix f^(x,y), and the angled brackets denote an average over such random permutations. This ratio thus compares the total error in the inferred Jacobian with the variation in the original matrix elements of f^(x,y). For a perfect estimation of f^(x,y), we would have δ = 0; in contrast, δ = 1 means that the prediction error is equal to the average error obtained when the elements of f^(x,y) are randomized. For the example shown in Fig. 2.4, we find that δ is approximately equal to 0.37.

Figure 2.4: Results of Example 3. Panel (a) shows a 100×100 pixelated, shade-coded portrait of Edward N. Lorenz; (b) reconstruction of (a) by our ML link inference technique. Note that, in (b), we plot all values greater than or equal to 10 as black and all values less than or equal to −5.5 as white.

2.6 Discussion

In this chapter, we have formulated and tested a new, highl