ABSTRACT Title of Dissertation: CYCLING AROUND THE CLOCK: MODELING BIKE SHARE TRIPS AS HIGH- FREQUENCY SPATIAL INTERACTIONS Zheng Liu, Doctor of Philosophy, 2023 Dissertation directed by: Assistant Professor Taylor Oshan, Department of Geographical Sciences Spatial interactions provide insights into urban mobility that reflects urban livability. A range of traditional and modern urban mobility models have been developed to analyze and model spatial interaction. The study of bike-sharing systems has emerged as a new area of research, offering expanded opportunities to understand the dynamics of spatial interaction processes. This dissertation proposes new methods and frameworks to model and understand the high-frequency changes in the spatial interaction of a bike share system. Three challenges related to the spatial and temporal dynamics of spatial interaction within a bike share system are discussed via three studies: 1) Predicting spatial interaction demand at new stations as part of system infrastructure expansion; 2) Understanding the dynamics of determinants in the context of the COVID-19 pandemic; and 3) Detecting events that lead to changes in the spatial interaction process of bike share trips from a model-based proxy. The first study proposes a hybrid strategy to predict 'cold start' trips by comparing flow interpolation and spatial interaction methods. The study reveals 'cold start' stations with different classifications based on their locations have different best model choices as a hybrid strategy for the research question. The second study demonstrates a disaggregated comparative framework to capture the dynamics of determinants in bike share trip generation before, during, and after the COVID-19 lockdown and to identify long-term bike share usage behavioral changes. The third study investigates an event detection approach combining martingale test and spatial interaction model with specification evaluation from simulated data and explorative examination from bike share datasets in New York City, Washington, DC, and San Francisco. Results from the study recognize events from exogenous factors that induced changes in spatial interactions which are critical for model evaluation and improvement toward more flexible models to high-frequency changes. The dissertation elaborated and expanded the spatial interaction model to more effectively meet the research demands for the novel transportation mode of bike-share cycling in the context of a high-frequency urban environment. Taken as a whole, this dissertation contributes to the field of transportation geography and geographic information science and contributes methods toward the creation of improved transport systems for more livable cities. CYCLING AROUND THE CLOCK: MODELING BIKE SHARE TRIPS AS HIGH-FREQUENCY SPATIAL INTERACTIONS by Zheng Liu Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2023 Advisory Committee: Dr. Taylor Oshan, Chair Dr. Kathleen Stewart Dr. Yiqun Xie Dr. Giovanni Baiocchi Dr. Vanessa Frias-Martinez © Copyright by Zheng Liu 2023 ii Dedication To my parents, Honglei Zheng (郑红雷) and Guangdou Liu (刘光斗), for their unconditional support and love through the journey. iii Acknowledgements First and foremost, I would like to express my profound gratitude to my doctoral advisor, Dr. Taylor Oshan. Without his unwavering support and guidance, I would not have been able to successfully navigate the challenges of the PhD program and reach the finish line. Dr. Oshan provided me with invaluable feedback on my research helping me to refine my ideas and develop a clear focus for my dissertation. Beyond his professional expertise, Dr. Oshan was a constant source of encouragement and support throughout my time in the program, always taking the time to offer a listening ear or a word of wisdom when needed. In addition to my gratitude towards Dr. Oshan, I would like to express my sincere appreciation to my dissertation committee members, Dr. Kathleen Stewart, Dr. Yiqun Xie, Dr. Giovanni Baiocchi, and Dr. Vanessa Frias-Martinez. Their constructive comments and feedback on my dissertation have been invaluable in helping me to improve my work. Moreover, I am deeply grateful to each of them for the advice, encouragement, and opportunities they have provided throughout my PhD journey. Their mentorship and support have been instrumental in shaping my academic growth and development, and I am honored to have had them as my committee members. My thanks go to all friends in the Center for GIS, Zhiyue, Hai, Jeff, Peiqi, Xin, Yuehui, Weiye, Riccardo, Federico, and Xiaoguo for the wonderful work and life experiences. My thanks also go to all the colleagues in the Oshan group, Mengyu, Victor, and Ushashi, for discussing my research and providing me with substantial help in the development of my dissertation. iv I would like to thank to Guimin, Yunting, Yimeng, Junchuan, Lei, Xiuxia, Xiaodong, Yu, Ruomin, Fengyu, Yuqian, Jeremiah, Tianyu, as well as faculty and cohorts from the MSGIS program who offered me unforgettable memories and made my PhD life not alone. Special thanks to all my friends from the Lakeside North community, Yao, Jiaming, Aolin, Yanjia, Cheng, and Yuhan, who left the most pieces of memories ever since my life in the US. Beyond the fun times, their support has been invaluable in helping me navigate the many challenges that life has thrown my way. A special thanks to my girlfriend, Diyang Cui. From the very beginning of our story, she has been there for me, providing the emotional support and motivation I needed to push through even the most difficult moments. I am eternally grateful to her for all that she has done for me, and my gratitude to her has no expiration date. Finally, I am indebted to my parents for offering solid backing for me. Their constant love and encouragement have been a source of strength for me during the six years that I lived on the other side of the world pursuing my degree. v Table of Contents Dedication ....................................................................................................................................... ii Acknowledgements ........................................................................................................................ iii Table of Contents ............................................................................................................................ v List of Tables ................................................................................................................................ vii List of Figures .............................................................................................................................. viii List of Abbreviations ...................................................................................................................... x Chapter 1: Introduction ................................................................................................................... 1 1.1 Background and motivation .................................................................................................. 1 1.2 Research objectives ............................................................................................................. 10 1.3 Dissertation outline ............................................................................................................. 13 Chapter 2: Comparing spatial interaction models and flow interpolation techniques for predicting 'cold start' bike share trip demand ................................................................................................. 15 2.1 Abstract ............................................................................................................................... 15 2.2 Introduction ......................................................................................................................... 15 2.3 Background ......................................................................................................................... 18 2.4 Methodology ....................................................................................................................... 23 2.4.1 Problem statement ........................................................................................................ 23 2.4.2 Station classification .................................................................................................... 24 2.4.3 Modeling strategies ...................................................................................................... 27 2.5 Data ..................................................................................................................................... 34 2.5.1 Study area..................................................................................................................... 34 2.5.2 Study design ................................................................................................................. 35 2.6 Results ................................................................................................................................. 36 2.6.1 Gravity-type spatial interaction model calibration ...................................................... 36 2.6.2 Prediction results .......................................................................................................... 39 2.6.3 Spatial trends ................................................................................................................ 40 2.7 Discussion ........................................................................................................................... 42 2.8 Conclusion .......................................................................................................................... 44 Chapter 3: Modeling the temporal dynamics of bike share trip determinants and the impact of the COVID-19 pandemic .................................................................................................................... 46 3.1 Abstract ............................................................................................................................... 46 3.2 Introduction ......................................................................................................................... 47 3.3 Data and methods ................................................................................................................ 49 3.3.1 Independent variables .................................................................................................. 49 3.3.2 COVID-19 pandemic semantics .................................................................................. 51 3.3.3 Spatial interaction model ............................................................................................. 52 3.3.4 Variable selection......................................................................................................... 52 3.4 Results ................................................................................................................................. 53 3.4.1 Bike trip overview........................................................................................................ 53 3.4.2 Variable selection......................................................................................................... 54 3.4.3 Yearly trends ................................................................................................................ 58 3.4.4 Weekly trends .............................................................................................................. 60 3.5 Discussion ........................................................................................................................... 68 3.6 Conclusion .......................................................................................................................... 73 vi Chapter 4: Event detection in bike share trip behaviors with spatial interaction models and martingales .................................................................................................................................... 75 4.1 Abstract ............................................................................................................................... 75 4.2 Introduction ......................................................................................................................... 75 4.3 Methods............................................................................................................................... 79 4.3.1 Exchangeability and martingales ................................................................................. 80 4.3.2 Martingale test ............................................................................................................. 81 4.3.3 Adaptations .................................................................................................................. 83 4.3.4 Simulation design......................................................................................................... 84 4.3.5 Empirical design .......................................................................................................... 86 4.4 Results ................................................................................................................................. 89 4.4.1 Simulation .................................................................................................................... 89 4.4.2 Empirical work in NYC ............................................................................................... 93 4.4.3 Comparison with additional US cities - Washington, DC and San Francisco ............. 98 4.5 Discussion ......................................................................................................................... 107 4.6 Conclusion ........................................................................................................................ 112 Chapter 5: Conclusion................................................................................................................. 114 5.1 Review of dissertation....................................................................................................... 114 5.2 Significant contributions ................................................................................................... 117 5.3 Future work ....................................................................................................................... 119 5.4 Concluding remarks .......................................................................................................... 121 Appendices .................................................................................................................................. 122 Appendix A. Supplementary materials for Chapter 3 ............................................................. 122 Appendix B. Supplementary materials for Chapter 4 ............................................................. 122 Bibliography ............................................................................................................................... 123 vii List of Tables Table 1. Bike share total trip count, mean trip duration, and mean trips per day in the NYC system from 2019 to 2021. .......................................................................................................................... 5 Table 2. Numbers of stations added over time by month and year. .............................................. 25 Table 3. Attributes used in the spatial interaction models in Chapter 2. ...................................... 29 Table 4. A summary of the prediction results based on Pearson’s R for each method using data for each classification category and using all data. Standard deviations are in parentheses after the mean values. .................................................................................................................................. 39 Table 5. Variables for model selection in Chapter 3. ................................................................... 51 Table 6. Variance inflation factor values with different independent variables for the spatial interaction models. ........................................................................................................................ 58 Table 7. Yearly spatial interaction model coefficients and 95% confidence intervals. ................ 59 Table 8. Simulation results for each setting, and mean delay time for detecting the true change event. ............................................................................................................................................. 90 Table 9. Highlighted markers in the sequence of martingale values from the NYC experiment and associated events, with occurrence time, cause of the event, and the delay compared to the occurrence time (if applicable). .................................................................................................... 95 Table 10. Highlighted markers in the sequence of martingale values from the Washington, DC experiment and associated events, with occurrence time, cause of the event, and the delay compared to the occurrence time (if applicable)......................................................................... 102 Table 11. Highlighted markers in the sequence of martingale values from the San Francisco experiment and associated events, with occurrence time, cause of the event, and the delay compared to the occurrence time (if applicable)......................................................................... 106 Table B.1. Results of variable selection for three different cities using the method in section 3.3.4. ..................................................................................................................................................... 122 viii List of Figures Figure 1. Citi Bike trip arrival changes from 2019 to 2020 and 2020 to 2021 in NYC, with each hexagon categorized primarily as decreasing trips, increasing trips, and new stations installed. .. 3 Figure 2. Conceptual framework of the dissertation. .................................................................... 11 Figure 3. Illustration of areal natural neighbor interpolation and ordinary kriging interpolation with toy data. ......................................................................................................................................... 32 Figure 4. Spatial distribution of bike stations color-coordinated by the three derived classes for newly added stations and the original stations. ............................................................................. 35 Figure 5. Top: Feature weight from the SI model results averaged over the time periods. The X- axis represents the absolute value of the parameter estimates and the color represents the sign of the average magnitude. The dark green strip shows the average range. Bottom: Time series of top 11 largest feature weights. The X axis is the number of weeks elapsed since 01/01/2015. ......... 37 Figure 6. Spatial distribution of the best method for flow prediction at each 'cold start' station.. 41 Figure 7. Weekly trip numbers in the NYC bike share system from 2019 to 2021. ..................... 53 Figure 8. Regularization paths for the LASSO results. A more important variable ends in y=0 with a larger lambda. ............................................................................................................................. 55 Figure 9. Pairwise scatter plot for selected variables in the spatial interaction design matrix for 2019 model.................................................................................................................................... 57 Figure 10. Top: COVID cases by week. Bottom: Destination capacity coefficient estimates and 95% confidence interval by week. ................................................................................................ 61 Figure 11. Top: COVID cases by week. Bottom: Distance decay coefficient estimates and 95% confidence interval by week. ........................................................................................................ 63 Figure 12. Top: COVID cases by week. Bottom: Destination recreation opportunities coefficient estimates and 95% confidence interval by week .......................................................................... 65 Figure 13. Top: COVID cases by week: Bottom. Destination employment coefficient estimates and 95% confidence interval by week. ......................................................................................... 67 Figure 14. Demonstration of trend breaks from April to June 2020 in model performance (Pearson’s R) from spatial interaction models trained with different periods of data. ................. 78 Figure 15. Algorithm of event detection based on martingale test. .............................................. 83 Figure 16. A demonstration of random distribution of 25 simulated sites. .................................. 86 Figure 17. Relationship between OD pairs with non-zero flows and proportion (p) of time periods, e.g., weekdays, in the NYC system. ............................................................................................. 86 Figure 18. Left: Station distribution in NYC Citi Bike. Right: Illustrations of consistent OD flows in NYC Citi Bike with 85% time periods having positive flows. ................................................. 88 Figure 19. Left: Summary of 50 runs for four simulations. Right: A set of martingale test values by time in one sample experiment of “cost” change, with a red line marking the time of change. ....................................................................................................................................................... 90 Figure 20. Delay times and recall rates from an experiment on the relationship of the delay time and parameter change. .................................................................................................................. 91 Figure 21. Recall and false positive rate from experiments with magnitude of event using the cost factor. ............................................................................................................................................ 92 Figure 22. Martingale values and events of interest (marker A-F) from weekday samples in the NYC system. ................................................................................................................................. 94 Figure 23. Martingale test value for weekday samples and distance decay from daily spatial interaction models. Red lines mark the events from A to F except for C in Table 9.................... 96 ix Figure 24. Martingale values (log-scaled) (Top) compared with trip numbers (Middle) and change scores from the changefinder algorithm (Bottom) using the weekday samples. .......................... 98 Figure 25. Bottom: Bike station distribution in Washington, DC Capital Bikeshare. Top: Illustrations of consistent OD flows in Washington, DC Capital Bikeshare with 70% time periods having positive flows .................................................................................................................. 101 Figure 26. Martingale values and events of interest (marker A-D) from weekday samples in the Washington, DC system.............................................................................................................. 102 Figure 27. Bike station distribution in San Francisco Baywheels (Bottom-Left) and illustrations of consistent OD flows in the areas of downtown (Top-Left), East Bay (Top-Right), and San Jose (Bottom-Right) with 60% time periods having positive flows. .................................................. 104 Figure 28. Martingale values and events of interest (marker A-G) from weekday samples in the San Francisco system. Top: values calculated from mixture martingales. Bottom: values calculated from power martingales (ϵ=0.92)................................................................................................ 105 Figure A.1. Cumulative daily number of individuals fully vaccinated against COVID-19 in New York City in 2021 ....................................................................................................................... 122 x List of Abbreviations COVID-19: Coronavirus disease caused by the SARS-CoV-2 virus DC: Washington, DC GBFS: General Bikeshare Feed Specification GPS: the Global Positioning System ICT: Information and Communications Technology IOT: Internet of Things LASSO: Least Absolute Shrinkage and Selection Operator LSTM: Long Short-Term Memory NYC: New York City OD: Origin-destination OK: Ordinary Kriging POI: Point of interest RK: Regression Kriging RNN: Recurrent Neural Networks SI: Spatial interaction US: the United States VIF: Variance Inflation Factor 1 Chapter 1: Introduction 1.1 Background and motivation The increasing trend of urbanization in the past century has resulted in over half of the world's population now living in cities (United Nations, Department of Economic and Social Affairs, Population Division, 2019). This rapid urbanization has created a pressing need to gain a deeper understanding of urban infrastructure, population distribution, and job opportunities, which are crucial indicators of urban livability (Bassolas et al., 2019). A key aspect of this livability is the rising human mobility within urban areas, which serves as an essential observation for urban planners. Thus, it has become increasingly important to investigate and anticipate human mobility both within and between cities. Spatial interactions or origin-destination flows, which refer to the movement of people and goods between locations, are vital indicators of urban socio-economic activities and provide insights into the interdependence between different places. Studying spatial interactions of human mobility can gain valuable insights into the factors that influence urban livability and create more sustainable and equitable cities for all. The study of spatial interaction and mobility has long been a central focus of transportation and geographical science, as understanding and predicting the flows of people, goods, and information across space is vital for numerous fields, including urban planning, transportation, and regional development. A range of methods have been developed to analyze and model spatial interaction, including both traditional approaches and new techniques that leverage recent advances in computing and artificial intelligence. Traditional methods such as spatial interaction models (Fotheringham & O’Kelly, 1989; Oshan, 2020; Wilson, 1971) are widely used to understand the 2 spatial interaction generation process in telecommunication, commodity goods, and migration in the past decades. These models use statistical techniques to estimate the flows of people or goods between different locations, based on factors such as distance, accessibility, and infrastructure. In recent years, modern urban mobility methods such as deep learning have gained attention in the study of spatial interaction, due to the development of more powerful computing resources and the increased use of artificial intelligence. Among the different targets of urban mobility models, there are (1) trip-based origin-destination (OD) demand (Z. Cheng et al., 2021; Ke et al., 2019; L. Liu et al., 2019; Y. Wang et al., 2019; Xiong et al., 2020); (2) location-based demand (Ai et al., 2019; P.-C. Chen et al., 2020; Geng et al., 2019; S. Guo et al., 2019; Li & Axhausen, 2020; Lin et al., 2019; Z. Pan et al., 2020; Ren, Chen, et al., 2019; Sun et al., 2020; Y. Yang et al., 2020); and (3) volumetric demand (L. Cai et al., 2020; Cui et al., 2019; Do et al., 2019; Ma et al., 2019; Qu et al., 2019; Ren, Cheng, et al., 2019; Shleifer et al., 2019; B. Yang et al., 2019; Y. Zhang et al., 2020; Zhang & Xin, 2020). Trip-based OD demand, also referred to spatial interaction in geography and other disciplines, has focused on enhancing predictions with the recent advances in statistical learning and deep learning at various temporal scales utilizing big geospatial data. The availability of geo-located traces from Information and Communications Technologies (ICTs) has revolutionized the study of urban mobility, providing much greater spatiotemporal resolution than was previously possible. However, the use of individual-level traces from mobile phone records can be limited by concerns around privacy and consistency of data availability. As a result, anonymous flow records from transportation systems have become a popular choice for mobility modeling studies. Historically, public transportation systems such as buses, rail, and taxicabs have formed the backbone of cities, by facilitating interaction and providing an affordable and reliable alternative to private transportation. More recently, bike share systems have emerged as a new 3 form of transportation that can be more easily changed or expanded due to their lightweight and affordable infrastructure. Docked bike share systems, in particular, have become popular in many cities, offering daily trips with detailed time information that create new opportunities to study the dynamics of spatial interaction processes. Many articles have been dedicated to build the relationship between bike-share trips and its determinants. Factors related to socio-demographics, public transit, system infrastructure, street features, land use or places of interests, and weather are frequently included in the regression model with bike-share trip demand (Hu et al., 2022; A. Li et al., 2020; Maas et al., 2021; Noland et al., 2016; Ross-Perez et al., 2022; Schimohr & Scheiner, 2021; X. Wang et al., 2021; Younes et al., 2020; Zacharias & Meng, 2021). Figure 1. Citi Bike trip arrival changes from 2019 to 2020 and 2020 to 2021 in NYC, with each hexagon categorized primarily as decreasing trips, increasing trips, and new stations installed. This dissertation aims to address unsolved challenges that arise for this new form of transportation by proposing new methods and frameworks to model and understand the high-frequency changes 4 in the spatial interaction from a bike share system. Taking the New York City (NYC) system demonstrated in Figure 1 above as an example, there are several types of changes from 2019 to 2021. The blue hexagons contained new stations installed within the year, the red hexagons experienced a decrease in trip arrivals compared to the previous year, and the green hexagons marked an increase in the total trip arrivals. This exploratory data analysis from the NYC system has shown at least two challenges to predict the associated spatial interaction flows. The first challenge is the changing layout of the stations, which does not allow the direct application of models like Recurrent Neural Networks (RNN) or Long Short-Term Memory (LSTM) networks that take historical flows as input. A new approach is needed to deal with trips at those new stations. The second challenge stems from the changes associated with the COVID-19 pandemic as the total trips in 2020 dropped slightly compared to 2019 but total trips in 2021 exceeded those from 2019. In Table 1 below, changes in the total trips and trip duration across three years suggest the spatial interaction model calibrated before the pandemic may no longer be valid for the post-pandemic era (Werner et al., 2022). Such changes also imply that the health benefits of bike share are recognized by people and lead to a much stronger recovery after the COVID-19 pandemic. This short-term and long-term motivation for adopting the bike share brings a huge potential for the expansion of bike share systems in urban areas in the near future. Therefore, there is a critical need for developing new tools to understand the dynamics of spatial interactions from this new form of transportation. To achieve this goal, three studies are conducted that build on the theoretical basis of spatial interaction models. Specifically, the unconstrained gravity model (Wilson, 1971) is used as the core spatial interaction model across the three studies. Methodological frameworks that build from and incorporate spatial interaction models are introduced in each of the three main 5 chapters to accomplish the research objectives whose motivation and background are summarized below. Table 1. Bike share total trip count, mean trip duration, and mean trips per day in the NYC system from 2019 to 2021. Year 2019 2020 2021 Total trips 20,551,517 19,506,857 27,551,166 Mean trip duration 16:18 21:50 16:30 Mean trips per day 56,306 53,297 75,483 The first study in the dissertation is motivated by the aforementioned challenges from new stations as the system evolves. Among different transportation modes, bike share can be analyzed with models proposed for other transportation modes but it has unique features that require special treatments and make recent cutting-edge trip prediction methods not applicable: individual bike- sharing programs are often piloted for a limited number of zones within an urban area that are known to have relatively high overall activity and transport demand and then systems are expanded and updated according to the evolution of demand over time and across space. Previous flows across stations are essential input in almost all cutting-edge spatial interaction prediction methods. However, the dynamic development of bike-sharing systems brings in a large amount of new bike 6 stations into the system without any historical flows. This feature prevents a straightforward deployment of the majority of cutting-edge models for flow prediction. A new methodology is hence needed to avoid the issue of missing previous flows when using cutting-edge prediction methods. Therefore, study 1 investigates a methodology for predicting OD flows of bike share systems given that the system evolves over time and demand is constantly changing. More specifically, this work focuses on predicting demand for a new station where there may be no knowledge of previous flows, also known as a ‘cold start’ station, which occurs when the system is progressively updated in order to maintain efficiency and grow the ridership of bike share systems. Flow prediction using spatial interaction models and flow interpolation methods are two directions that can be used to compute flows at a new station. On one hand, spatial interaction models are calibrated from the system-wide flow generation processes, so predictions are made based on the general relationships between the independent variables and all spatial interactions between all origins and destinations. On the other hand, interpolation methods may incorporate data borrowed from nearby locations and incorporate local information that is not captured in a spatial interaction model. However, flow interpolation techniques are under-discussed with little literature on either their implementation or application. For example, only an areal natural neighbor method has been introduced for flow interpolation so far (Jang & Yao, 2011; Šimbera & Aasa, 2019). In study 1, a new Kriging-based interpolation method is proposed by adopting Ordinary Kriging and Regression Kriging in the flow domain. Along with the areal-based kriging method proposed in previous work, these interpolation methods are compared to spatial interaction models. Of note is that Regression Kriging also incorporates a spatial interaction model as the regression step before "kriging" the residual of the regression output. This array of potential methods is tested within the NYC system, 7 which is the longest and largest bike share system in the US, to get an initial result for the effectiveness of the newly proposed flow interpolation method against each alternative. As a result, a hybrid strategy is suggested that uses a different method for different station classifications based on the location of a new station in comparison to existing stations. Human mobility in major US cities was heavily impacted during the COVID-19 pandemic and the consequential lockdown period, resulting in significant changes in general mobility metrics (Kang et al., 2020; Y. Pan et al., 2020; Sevtsuk et al., 2021), metro ridership (H. Wang & Noland, 2021), bike share systems (Padmanabhan et al., 2021), and micro-transit (Y. Zhou et al., 2021). As will be seen in this research, changes in the coefficients in the spatial interaction models indicate that the COVID-19 pandemic affects the trip generation processes of cycling. Recent studies on bike share usage also revealed changes in the bike share trip duration during the COVID-19 pandemic. For example, in NYC there was a nearly 40% increase in the average trip duration in March 2020 compared to the month before COVID (Teixeira & Lopes, 2020). Similar increases were also observed in other US cities like Boston and Chicago (Padmanabhan et al., 2021) as well as in some European cities such as London, UK (Heydari et al., 2021), Zurich, Switzerland (Li et al., 2021) and Kosice, Slovakia (Kubaľák et al., 2021). However, beyond changes in ridership volume, there is a lack of evidence regarding the processes driving changes in human behavior, such as the effects of distance and attraction factors at different locations. The motivation of the second study is therefore to use spatial interaction models calibrated over different periods of time to investigate how behavior related to bike share spatial interaction processes evolve before, during, and after the COVID-19 lockdown. As a result, the second study presents a disaggregated comparative modeling framework using two different time scales, yearly and weekly, to capture the dynamics of the determinants in the bike 8 share usage. A similar framework was proposed to split taxi trips by monthly and hourly subsets (Oshan, 2020), but it has never been applied to a bike-sharing dataset. NYC Citi Bike is selected again as it has the largest usage of bike share systems in the US and is the most frequently discussed in the related literature (Chai et al., 2018; P.-C. Chen et al., 2020; X. Chen & Jiang, 2022; Faghih- Imani, Anowar, et al., 2017; Kong et al., 2020; Liu et al., 2016; Teixeira & Lopes, 2020; Wang & Noland, 2021). In order to get a parsimonious interpretation of determinants from the model calibration procedure, model selection is first performed to obtain a subset of the most important variables. Model selection with the LASSO (Least Absolute Shrinkage and Selection Operator) regression assigns penalties to the covariates by shrinking the associated coefficients until they converge to zero. The path of coefficients toward zero reflects the relative strength of each variable in explaining the observations of the dataset. Yearly modeling and weekly modeling of spatial interaction from the bike share provide a mechanism to observe the dynamics of determinants as a time series. An intellectual significance of the study is that further contextualization of the dynamics was carried out by relating trends from the coefficient time series to features of the COVID-19 pandemic timeline to help interpret the results. The third study in the dissertation builds on the findings of study 2, which showed that there were behavioral changes in bike share usage during the COVID-19 pandemic, particularly during the lockdown. Behavioral changes associated with trip volume have already been discussed by (X. Chen & Jiang, 2022). This study aims to go beyond this by investigating what other events can cause changes in the spatial interaction processes generating bike share trips and how these events can be detected. Detecting such events that may alter the spatial interaction process is important for the application of study 1 as it highlights the importance of the temporal dependency towards the prediction and understanding of spatial interaction. Event detection is another area where the 9 temporal subset comparison framework proposed in study 2 can be adopted for events that are less obvious than the COVID-19 pandemic and may have a smaller impact. Unlike the ubiquitous impacts of COVID-19, other events, such as weather, that are critical to the decision to cycle (Bean et al., 2021), can significantly influence bike-sharing over much shorter time horizons. However, these events may be more difficult to isolate, and their ephemeral nature means that the behavioral changes may occur more quickly without lasting very long. As a result, incorporating a finer time scale may enhance event detection. The problem of event or anomaly detection has a large literature due to the availability of time series data in a wide range of domains, including security and surveillance (Karbalaie et al., 2022), traffic and crowd monitoring (Djenouri et al., 2019), business intelligence (Ranco et al., 2015), social media analysis (T. Cheng & Wicks, 2014; Vioulès et al., 2018), environmental studies (Meyer et al., 2019), medical diagnosis (Ukil et al., 2016), and manufacturing (Pittino et al., 2020). Various methods of event detection have been implemented for analyzing data streams from transportation systems, such as likelihood ratio tests (Pang et al., 2011) and k-nearest neighbor (Dang et al., 2015). However, many of these methods focus on processing events from individual locations (i.e., origins or destinations), while origin-destination events related to spatial interaction have received less attention. As a result, a novel method is explored to detect events in spatial interaction processes by adapting the martingale framework for data streams (Cherubin et al., 2018; Ho, 2005; Ho & Wechsler, 2010) in combination with spatial interaction models. The martingale framework has several advantages, including compatibility with regression-based data-generating processes, potential for “on-line” streaming deployment, and ease of tuning. These benefits increase the significance of introducing such a framework into spatial interaction models that are typically time agnostic. To evaluate this new method, simulated spatial interaction flows are first 10 used to validate the ability of the method to detect known parameter changes that represent data generating processes (i.e., conditional associations). Then, the framework is tested on empirical data from three US cities, including NYC, which has one of the highest bike share usage rates in the country, as well as Washington, DC and San Francisco as two other major urban areas that are not as densely populated. The results demonstrate the framework's ability to extract different types of events from the data. 1.2 Research objectives Overall, this dissertation aims to develop a comprehensive framework for understanding and predicting dynamically evolving spatial interaction demand in a bike share system, as shown in Figure 2. By doing so it addresses several research questions that were not adequately explored in previous studies. The framework proposed in this study addresses three critical challenges related to spatiotemporal dynamics in a bike share system. The first challenge is related to the expansion of system infrastructure causing limitations at newly added stations for models like LSTM that require historical flows. The second challenge comes from a gap in the knowledge about the changes in the processes generating bike trips. Although bike share trips after the lockdown due to the initial outbreak of the COVID-19 pandemic recovered to the pre-pandemic level quickly, the mean trip duration significantly increased in 2020 compared to 2019 suggesting behavioral changes from the public in making the decision to cycle during and after the lockdown. Such behavioral changes associated with COVID-19 are not fully addressed in the previous studies by analyzing changes in trip numbers. The last challenge comes from anomalies or events that cause smaller impacts on the trip generation process and are too subtle to be explained by weekly spatial 11 interaction models. To address this research gap, the third study employs an event detection framework that combines the martingale test and spatial interaction model to detect the events associated with bike share behavior (i.e., changes in coefficients over time). Study 2 directly uses coefficients from spatial interaction models while the regression kriging method in study 1 and the martingale test in study 3 both use the residual of the spatial interaction model as the input, making the spatial interaction model a common thread that connects the three studies and updates this classic spatial analysis framework to better accommodate spatiotemporal data in the era of big data. Figure 2. Conceptual framework of the dissertation. The objectives of each research topic and associated research questions are as follows: Research objective of study 1. Infrastructure expansion in a bike share system will present a challenge for demand prediction at the station level as most prevailing modeling methods (i.e., LSTM models) rely on historical flows at each station. The absence of historical data for newly 12 added stations makes such models unavailable and thus requires a novel solution for predicting ‘cold start’ trips at the new stations, either from direct data borrowing interpolation or from indirect model-based interpolation. The first study compares methods and proposes a hybrid strategy taking advantage of both aspects by answering the following research questions. Research question 1a. How do interpolation methods, including areal natural neighbor, Ordinary Kriging, and Regression Kriging, compare to spatial interaction models for predicting the 'cold start' bike trips? Research question 1b. What is the best strategy for 'cold start' bike trip prediction at different locations? Research objective of study 2. Bike share usage during the COVID-19 pandemic changed on a much larger scale than the pre-pandemic routine, implying changes in the driving factors of bike trips. Dynamics of determinants of bike share trips during routine activities as well as during and after the COVID-19 lockdown were analyzed through a disaggregate comparative framework and by addressing the related questions below. Research question 2a. From a long-term perspective, what are the changes in the generating factors of bike trips before COVID-19 in 2019 and across different phases of COVID-19 in 2020 and 2021? Research question 2b. From a short-term view, how did the weekly determinants of bike trips change with the evolution of COVID-19? Research objective of study 3. Short-term changes in the spatial interaction that are not explained using the approach from study 2 are considered potential events associated with trend breaks in 13 the spatial interaction processes over a long time period. The third study explores a method to detect the events that can lead to behavioral changes in the spatial interaction processes related to bike share trips. The proposed method is developed using a martingale framework, which is evaluated to answer the research questions below. Research question 3a. To what extent can events be detected using model-based proxies for changes in human behavior? Research question 3b. How to effectively combine the spatial interaction model and the martingale framework to detect events? Research question 3c. What types of events can be reliably detected within different US cities from a multiyear observation from 2018 to 2021? 1.3 Dissertation outline The dissertation is organized into five chapters. The overall goal of the dissertation is to develop methods to help understand and predict high-frequency spatial interactions produced by bike share systems. Chapter 1 describes the essential background and motivations for three research objectives. Chapter 2 presents a hybrid approach to predict ‘cold start’ bike trips at newly added stations in a docked bike share system. Chapter 3 introduces a disaggregated comparative framework for capturing and understanding changes over time in the determinants of spatial interactions associated with the COVID-19 pandemic. Chapter 4 continues the focus on changes in spatial interaction processes by adapting and evaluating an event detection framework based on the martingale test. The utility of the event detection method is tested using simulated data and by 14 applying it to three different US cities. Chapter 5 concludes the dissertation by reviewing the key findings from the three studies, summarizing significant contributions from this research, and discussing the limitations and potential areas of future work. 15 Chapter 2: Comparing spatial interaction models and flow interpolation techniques for predicting 'cold start' bike share trip demand 2.1 Abstract Bike-sharing systems are expanding rapidly in metropolitan areas all over the world and individual systems are updated frequently over space and time to dynamically meet demand. Usage trends are important for understanding bike demand, but an overlooked issue is that of ‘cold starts’ or the prediction of demand at a new station with no previous usage history. This chapter explores a methodology for predicting bike trips from and to a 'cold start' station in the NYC Citi Bike system. Specifically, gravity-type spatial interaction models and spatial interpolation models, including natural neighbor interpolation and kriging, are employed. The overall results from experiments of a real-world bike-sharing system in NYC indicate that the regression kriging model outperforms the other models by taking advantage of the robustness and interpretability of gravity-type spatial interaction regression models and the capability of ordinary kriging to capture spatial dependence. 2.2 Introduction The past century has witnessed a significant increase in urbanization with more than half of the world's population currently living in urban areas (United Nations, Department of Economic and Social Affairs, Population Division, 2019). As a result, it has become increasingly important to understand and anticipate human mobility within and across cities. In particular, public transportation systems, historically composed of buses, trains, and taxicabs, form the backbone of cities, facilitating interaction and providing an affordable and reliable alternative to carry out routine activities compared to private transportation. The development and optimization of public 16 transportation infrastructure have therefore remained an important aspect of urban planning and community development. A more recent trend is the emergence of bike-sharing as an alternative means of transportation and the establishment of bike-sharing systems in cities around the world. Compared to other modes of mobility, cycling provides health and environmental benefits in addition to offering a more efficient means of navigating the urban environment (Oja et al., 2011; Otero et al., 2018; M. Wang & Zhou, 2017; Y. Zhang & Mi, 2018). For example, the New York City (NYC) Mobility Report indicates that trips made using the NYC Citi bike share system are over a minute faster than taxi trips across all distance categories within the Midtown area of Manhattan, and cost less than 25% for taxi trips for all trip length categories except those less than half a mile (NYC Department of Transportation, 2019). Such advantages are further highlighted during rush hour (Faghih-Imani, Hampshire, et al., 2017). This has led to the proliferation of bike-sharing systems, with nearly 2000 bike-sharing systems now in operation around the world1. Individual bike-sharing programs are often piloted for a limited number of zones within an urban area that are known to have relatively high overall activity and transport demand. Systems are then expanded and updated according to the evolution of demand over time and across space. This dynamic development of bike-sharing systems requires a strong understanding of mobility behavior and flexible methods for predicting spatial-temporal demand. In particular, travel demand 1 According to bikesharingworldmap.com and including both docked and dockless systems (last accessed on 07/30/2021). 17 models may focus on predicting overall system demand, the total demand at each location or station within a system, or demand between individual locations. It is this latter scenario that is the primary focus of this research as it typically receives less attention (e.g., Li & Shuai, 2020; Y. Zhou et al., 2019). This is perhaps because it is more challenging to model individual origin- destination (OD) flows due to a lack of detailed data and because they usually contain more noise and higher levels of variability compared to the overall system usage or the total inflows or outflows for a set of locations. However, models of OD flows, often referred to as spatial interaction models, are becoming increasingly possible as detailed data are generated by GPS- enabled sensors and the internet of things (Shaw & Sui, 2018). For example, Calafiore et al. (2021) leveraged user OD flows based on social media check-ins to study the characteristics of cities’ neighborhoods. Furthermore, recent machine learning methods for incorporating various dimensions of spatial and temporal dependence hold promise for building models with enhanced predictive capabilities (Chu et al., 2020; Ke et al., 2019; L. Liu et al., 2019; Y. Wang et al., 2019). Consequently, this work investigates a methodology for predicting OD flows of bike share systems given that the system evolves over time and demand is constantly changing. More specifically, this work focuses on predicting demand for a new station where there may be no knowledge of previous flows, also known as a ‘cold start’, which is necessary when the goal is to progressively update transport infrastructure in order to maintain efficiency and grow the ridership of bike share systems. Previous work has not examined in detail the task of predicting spatial interaction demand for new stations, likely because more traditional public transportation infrastructure evolves much more slowly compared to the relatively inexpensive and flexible bike share infrastructure. That is, much 18 of the related state-of-the-art research assumes that the infrastructure is persistent across time and there is prior knowledge of station activity, which is not always the case for bike share systems. Therefore, section 2.3 of this chapter provides some additional background and formalizes the challenges that need to be overcome. Then, section 2.4 presents a novel methodological approach for predicting the spatial interaction demand associated with new bike share stations, and section 2.5 briefs the data used in the case study and experiment settings. Section 2.6 describes the experimental results used to benchmark the proposed approaches. Overall, the results indicate that regression kriging models are slightly more performant than the other techniques considered. Finally, section 2.7 concludes with a discussion of the contributions of this research, the limitations of the proposed approach, and suggestions for future work in this area. 2.3 Background The widespread adoption of information and communication technology has produced massive amounts of transportation and mobility data, sparking a wave of research into how to best model human movement. Meanwhile, the popularity of bike-sharing systems has inspired much recent research. Si et al. (2019) provide a review of bike-sharing papers published from 2010 to 2018 and group them based on the following categories: demand factors, rider behavior, system optimization, and impact on other modes. The factors associated with variation in bike-sharing ridership demand can largely be summarized as being related to the weather, built environment, sociodemographic, or temporal dimensions (Barbour et al., 2019; El-Assi et al., 2017; Eren & Uz, 2020; Guo et al., 2017; H. Yang et al., 2020). Knowledge of the distribution of trips is useful for developing strategies to optimize systems through bike rebalancing or station repositioning, allowing the system to satisfy higher demand (P.-C. Chen et al., 2020; Faghih-Imani, Hampshire, et al., 2017; 19 Pan et al., 2019; R. Zhu et al., 2020). The introduction of a bike share system may impact other transport modes, leading researchers to compare bike share usage to taxi and bus ridership (Campbell & Brakewood, 2017; X. Zhou, Wang, & Li, 2019). Another trend focuses on bike share trip prediction using travel demand models and OD flow models, the latter of which is the focus of this study and will be used to consider a range of factors influencing bike share trip demand to predict OD flows. Trip demand at 'cold start' stations, however, catches less attention and is only explicitly discussed in Noland et al. (2016) and Y. Zhang et al. (2017). And both papers used linear regression models on station-level demand to examine the generalization of the models. Research by X. Wang et al. (2021) discussed the use of a regression model for making predictions at new stations but only performed out-of-sample predictions by randomly selecting stations as a test set, which is not an exhaustive validation for trip demand at potential new stations. Furthermore, there is a lack of effort dedicated to OD flow predictions for new stations. Spatial interaction (SI) describes aggregate movements of individuals, commodities, capital, or information over geographic space, resulting from some requisite decision-making process (Farmer & Oshan, 2017). A typical quantitative representation of SI is the origin-destination (OD) matrix: 20 Equation 2.1 Where 𝑇 is the OD flows matrix from a set of origins (𝑂1, 𝑂2, . . . , 𝑂𝑛) to a set of destinations (𝐷1, 𝐷2, . . . , 𝐷𝑚), and 𝑇𝑖𝑗 is the magnitude of flows from origin 𝑖 to destination 𝑗. Inspired by Newton's law of gravity, early models of spatial interaction theorized OD flows to be proportional to the product of potential at origins and destinations and inversely proportional to the squared distance between them, yielding the following formulation, Equation 2.2 where 𝑇𝑖𝑗 still denotes the flows between origin 𝑖 and destination 𝑗, and represent the potential at 𝑖 and 𝑗 , 𝑑𝑖𝑗 describes the distance between 𝑖 and 𝑗 , and 𝑘 is a scaling factor that ensures the number of flows predicted by the model matches the number of observed flows. In this context, the potential is often defined as the location population or the number of job opportunities (Gao et al., 2013; Krings et al., 2009; Lenormand et al., 2016), but can be expanded to consider a series of origin and destination attributes that may contribute towards the flow generation, each with their own parameter (Fotheringham & O’Kelly, 1989; Oshan, 2020). According to Kar et al. (2021), two groups of variables can be identified besides population or jobs: socio-economic attributes and built environment attributes. The former group usually explains trip generation, such as labor force 21 (Pourebrahim et al., 2018; Signorino et al., 2011); GDP (L. Zhang et al., 2019); human activity density (Marrocu & Paci, 2013). The latter group includes common destination determinants of travel demand, such as amenities (e.g., schools, hospitals, markets) (Botella et al., 2021; Kar et al., 2021), tourism attractions (e.g., hotel rooms) (Khadaroo & Seetanah, 2008), and land use types (X. Liu et al., 2016). Recent work seeks to exploit temporal dependence within SI flows to increase predictive performance. The simplest instance involves taking the average of historical observations, which can yield accurate predictions for future flows when there is limited variation in the historical observations. Typical methods using temporal dependence include historical averaging; autoregressive integrated moving average (ARIMA); Deep neural network structures including recurrent neural networks (RNN) and long short-term memory (LSTM) architectures (Cheng, Trepanier, & Sun, 2021; Chu et al., 2020; Ke et al., 2019). The common factor amongst all these methods is that they require sufficient previous OD flows and become invalid to predict future flows when there are no historical observations to draw upon. Meanwhile, bike share systems are also spatially dynamic in that stations are frequently updated or perhaps more importantly that new stations are added to extend the system, though this is often overlooked in related work. One exception is Lu et al. (2018) who explored the extension of system infrastructure by deploying agent-based methods to simulate the result of adding new stations towards the usage of different transportation modes. In contrast, most previous research focuses on predicting bike-sharing trip flows at existing locations (i.e., in-sample spatial prediction), often using historical data as an important input feature. This is problematic when the focus is instead 22 on predicting flows associated with a new station that is being added to the system (i.e., out-of- sample spatial prediction) because there is no historical flow data that can be used to learn temporal dependencies. A similar issue often occurs in recommender systems where new users or new items will have no historical record to be used for preference analysis (Su & Khoshgoftaar, 2009; Volkovs et al., 2017) and is often called the ‘cold start’ issue. This term is also used in the literature associated with detecting stops and trips from GPS trajectories (Schuessler & Axhausen, 2009; Stopher et al., 2005). Overcoming this limitation is the primary contribution of this research in order to develop a robust methodology for predicting historical OD flows at a new bike share station. It is reasonable to leverage spatial dependence to predict values at unobserved locations based on values from nearby observed locations. The First Law of Geography states that things that are closer together are typically more similar (Tobler, 1970) and is the basis for popular spatial statistics, such as Moran’s I measure of spatial autocorrelation and spatial interpolation methods, including natural neighbor interpolation, inverse distance weighting, Kriging, and more (Mitas & Mitasova, 1999). In particular, kriging has become a core spatial interpolation tool and is now used in many topic areas such as air quality analysis (Bayraktar & Turalioglu, 2005), natural resource analysis (Emery, 2005), water studies (Zimmerman et al., 1998) and traffic (Eom et al., 2006; X. Wang & Kockelman, 2009). Recently, spatial interpolation tasks have been adapted into a lattice or graph structure and integrated into generative adversarial networks (D. Zhu et al., 2020) or graph neural networks (Wu et al., 2020), but these extensions do not yet apply to the case of OD flows. For the interpolation of OD flows, Jang and Yao (2011) proposed an areal weighting method, which was further employed by Šimbera and Aasa (2019). As far as the authors are aware there 23 are no studies extending interpolation methods to SI flows. However, the spatial dependence between SI flows has previously been leveraged for community detection (Gao et al., 2013; Yao et al., 2018) and land use identification (X. Liu et al., 2016), indicating that kriging, which also relies on the presence of spatial dependence, may be promising for flow interpolation. Therefore, this research explores flow-based kriging and compares it to natural neighbor interpolation and gravity-type spatial interaction models. 2.4 Methodology 2.4.1 Problem statement For a station-based bike share system, 𝑆𝑛, there are 𝑛 docking stations serving as both origins 𝑆𝑖 and destinations 𝑆𝑗 and information is available for each trip in the system regarding its origin station, destination station, start time, and end time. Trips are also sorted into discrete temporal subsets, 𝑡, based on their starting time (e.g., hour, day, week, etc.). Therefore, each trip in the system can be denoted using a 3-tuple (𝑡, 𝑠𝑖 , 𝑠𝑗)and the corresponding OD flow matrix 𝑇 is comprised of entries denoting aggregated trips between stations at the time 𝑡 (i.e., 𝑇𝑡,𝑖,𝑗 ). The diagonal elements of 𝑇 (i.e., 𝑖 = 𝑗) are filtered out and set to zero to remove their undue influence on any subsequent modeling procedures. A 'cold start' station refers to a scenario where there is no information available about previous flows for a newly added station 𝑆𝑥 and the goal is to predict future outflows 𝑇𝑡+1,𝑥,𝑗 and/or future inflows 𝑇𝑡+1,𝑖,𝑥 . The primary issue that arises is that there are no previous flows to use in any of the methods that leverage historical data. A methodology is proposed below to overcome this limitation by classifying stations and using a combination of regression modeling and interpolation techniques depending on the station class. 24 2.4.2 Station classification Predicting the flows associated with the addition of new stations is the primary task of this research. However, depending on the location of the new station and its proximity to currently existing stations, different techniques and information are available to carry out the predictions. At least three different scenarios can be distinguished. The first is referred to here as interpolation, which entails borrowing information directly from the existing stations. Interpolation typically only applies in the situation where the newly added station is sufficiently proximal to existing stations (i.e., within current system coverage). The second scenario is referred to here as margin interpolation and is the case where a new station is added on the margin of the current system. As such, there are some nearby stations with previous flow information that can be borrowed. The final scenario is referred to here as extrapolation and is concerned with the addition of stations that are essentially outside the coverage of the current system. Predictions for this scenario are hypothesized to be only possible through the extrapolation of modeled relationships, since there are no existing nearby stations to borrow information from. A method for identifying empirical instances of system expansion and classifying new candidate stations across the three scenarios is proposed below and then subsequently deployed and evaluated. Newly added stations are identified by examining the time series of trip data for the bike share system. For each station present in the system before July 2020, an array of station inflows (or station outflows) for each temporal subset (weekly in this case study) is used to track its operation status. Intuitively, the first non-zero entry of the array, 𝑡𝑖𝑛 > 0, is recorded as a preliminary inferred date that a station was initially put into operation. However, these preliminary dates are 25 then refined by manually checking for consistent station service as in practice there sometimes are test trips prior to the official launch of a new station. In addition, the system record indicates that all new stations added after the initial system rollout in 2013 did not occur until 2015. Therefore, the original stations existing before 2015 (Table 2) are not included in the set of classified stations nor are they used for prediction validation. Table 2. Numbers of stations added over time by month and year. Month and Year 2013 2015 2016 2017 2018 2019 2020 January - - 2 2 9 6 8 February - - 2 8 2 3 5 March - - 2 4 6 6 1 April - - 2 4 3 17 2 May - - 3 5 7 11 36 June 336 - 5 4 5 5 35 July 1 1 8 4 4 3 - August - 85 98 4 5 4 - September - 39 43 64 1 15 - October - 16 2 74 3 38 - November - 5 2 6 3 34 - December - 2 2 4 4 15 - Next, these new stations, 𝑆𝑥, are classified between interpolation, margin, and extrapolation. This is done based on their relationship to the existing system coverage, which is represented here by computing the convex hull of all the stations in operation at the time when 𝑆𝑥 is introduced. The convex hull of all the station points provides a simplified representation of the system service area 26 in continuous Euclidean space, though the urban environment does not exist separately from the natural environment (i.e., rivers). To ensure these restricted areas are not included in the representation of system coverage, a few extra steps were taken to prepare a modified convex hull for the bike share system. First, stations on Governor’s Island and a single isolated station in the southern part of Brooklyn were removed from the analysis because these stations are essentially disconnected from the larger system. Then, multiple convex hulls were created separately for stations in Manhattan and stations in Brooklyn/Queens, ensuring that neither of the convex hulls cross into major waterways. For every temporal subset 𝑡, only the system coverage associated with stations running at time 𝑡 − 3 is used to classify the new station 𝑆𝑥 initialized at time 𝑡 where each temporal subset pertains to a week within the period from January 2015 through July 2020. This time lag of up to three weeks ensures that stations rolled out consecutively over a relatively short time period are not considered as previously existing to each other and is necessary because there is usually a short period of time before a new station becomes fully integrated within the system. If 𝑆𝑥 falls outside this current system coverage, it will be regarded as an extrapolation case. In contrast, if 𝑆𝑥 is within this current system coverage, it will be regarded as an interpolation case. It is then necessary to further distinguish the partial interpolation cases, which correspond to stations added to the margin of the current system coverage. The three cases are formally classified by the percentage of overlap between their Voronoi tessellation cell and the current system coverage. Stations overlapping < 5% are classified as an extrapolation station while those overlapping > 95% are classified as an interpolation station. The remaining stations are classified as margin stations. 27 Applying the above classification to all newly added stations identified in the previous step. Using these three different categories, it is possible to apply and evaluate unique mechanisms to borrow flow information based on data and/or modeled relationships for each category. The hypothesis employed here is that different mechanisms will be more efficient for each category because different types and levels of information are available. In particular, it is anticipated that better predictive performance will be attained for stations classified as extrapolation cases using a SI modeling approach because borrowing information for data from relatively far stations could introduce a disproportionately large amount of misinformation. In contrast, the interpolation and margin cases are anticipated to achieve higher predictive performance using methods that incorporate data-borrowing through interpolation techniques. That is, flows for stations classified as interpolation and margin are expected to be best predicted using interpolation methods whereas those that are not must rely on extrapolation through modeled relationships (i.e., SI model). However, it is less clear whether or not the same method will be the most effective for both interpolation and margin stations, since different levels of information are available. To investigate these issues, several interpolation methods are explored in the following sections and then subsequently compared to each other and to predictions from SI models. 2.4.3 Modeling strategies Gravity-type spatial interaction model The unconstrained gravity-type spatial interaction model (Gravity SI) described in section 2.3 is perhaps the most widely used model for diverse types of aggregate transport flows (for several recent examples see Kar et al., 2021; Lenormand et al., 2016; Oshan, 2020; T. Zhou et al., 2020). It is calibrated here using a log-linear Poisson regression with a power distance-decay function and a set of origin/destination attributes using the spint module of the Python Spatial Analysis 28 Library (PySAL) (Oshan, 2016). Following Eren and Uz (2020), these attributes include points of interest (POI) for urban amenities and services, transportation infrastructure, and socio- demographic factors. For transportation infrastructure, accessibility to other bike stations is computed based on a distance-weighted sum (i.e., spatial lag) of nearby bike station capacity2. POI data was obtained from SafeGraph (2020) and contains a hierarchical schema. Eleven categories were formed to aggregate POIs from this schema and include care, education, finance, food, housing, recreation, religion, shopping, travel, professional, and other services. Further descriptions of data sources and variables are summarized in Table 3. The selection process of distance parameters for station accessibility and POIs is based on trial and error, however, sensitivity results show there are no significant differences in gravity SI results when using other distance bands (e.g., 500 meters or 1000 meters) to compute these variables. 2 Since capacity values from the most recent station information are representative of all previous system states, capacity values were based on system snapshots of the station information from Sep 2018, May 2019, and Feb 2020. Of the 1103 stations that existed from 2015-2020 this provided valid capacity information for all but 135 stations, which either were not running during the snapshots or had zero capacity values. These 135 stations were not included in the computation of the accessibility metric used here. 29 Table 3. Attributes used in the spatial interaction models in Chapter 2. Attributes Source Features description Population Cenpy Census.gov Population (2017 ACS5) of each census tract Employment Longitudinal Employer- Household Dynamics (LEHD)3 Workplace Area Characteristics by census block Station accessibility Station Info4 Weighted sum of nearby station capacity with inverse distance weights Access to subway NYC Open Data5 Euclidean distance to nearest subway station. Points of Interest SafeGraph6 Place count weighted by spatial lag within 1500 meters of station Natural neighbor interpolation Spatial interpolation methods predict unknown values for an unsampled point in space based on its relationship to a set of sampled points. Given values of a random field 𝑍(⋅) measured at locations 𝑠1, . . . , 𝑠𝑛 yielding 𝑍(𝑠𝑖) for any station 𝑠𝑖, the objective is to estimate the value 𝑍(𝑠𝑥) for one or more unmeasured locations 𝑠𝑥. One well-established spatial interpolation method is the natural neighbor technique that finds the closest subset of input samples to a query point and weights them proportionally based on areal overlap to estimate a value (Sibson, 1981). Based on 3 https://lehd.ces.census.gov/data/ 4 https://gbfs.citibikenyc.com/gbfs/en/station_information.json 5 https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49 6 https://www.safegraph.com/covid-19-data-consortium 30 the assumption that nearby stations typically experience similar levels of demand, it is possible to borrow data for a new station from neighboring stations using this method. The natural neighbor interpolation method for points/polygons was extended to spatial interaction flows by applying a modified areal-based weighting scheme (Jang & Yao, 2011). Equations 3.1-3.2 present the natural neighbor formula for calculating the flow 𝑇𝑥𝑗 between a new station 𝑆𝑥 and a destination station 𝑆𝑗. First, Voronoi tessellations are computed for 𝑺𝒏 with and without 𝑆𝑥, denoted as 𝑉𝑛+𝑥 and 𝑉𝑛, respectively, with each individual Voronoi polygon approximating the service area for each station. In equation 3.1, 𝑚 is the number of stations overlapping with 𝑆𝑥 , 𝑝𝑘 is the proportion of the overlapping area between Voronoi shape of 𝑆𝑥 in 𝑉𝑛+𝑥 and 𝑆𝑘 in 𝑉𝑛 in relation to the total area of the Voronoi shape of 𝑆𝑥 in 𝑉𝑛+𝑥 (Equation 3). The resulting [𝑇𝑥1, 𝑇𝑥2, . . . , 𝑇𝑥𝑛]⊤ are the ‘borrowed’ outflows of 𝑆𝑥. Thus, natural neighbor interpolation provides an areal-based method that can be used on flow data and will be one of the interpolation methods applied here. 𝑇𝑥𝑗 = ∑ [𝑝𝑘 × 𝑇𝑘𝑗]𝑚 𝑘=1 Equation 2.3.1 𝑝𝑘 = 𝐴𝑉𝑛+𝑥∩𝑉𝑛 𝑘𝑥 𝐴𝑉𝑛+𝑥𝑥 Equation 2.3.2 Kriging While natural neighbor interpolation derives weights based only on location, kriging techniques derive weights based on both the location 𝑆𝑥 and value of each sample point 𝑍(𝑠𝑖) with 𝑖 ∈ (1,2, . . . , 𝑛). It is an optimal linear estimator of the form 𝑍(𝑠𝑥) = ∑ 𝛼𝑖𝑍(𝑠𝑖) 𝑛 𝑖=1 Equation 2.4 where the weights 𝛼𝑖 are chosen to make the estimator unbiased and of minimal prediction error. Kriging is a two-step process. First, the spatial covariance structure of the sampled points is 31 determined by fitting a variogram (a spherical model is used here) that captures the variability between all data points as a function of distance. Then, weights derived from this covariance structure are used to interpolate values for unsampled observations across the spatial field. Ordinary kriging assumes the random field 𝑍(⋅)is intrinsically stationary; that is for any location 𝑠: 𝐸[𝑍(𝑠)] = 𝜇 Equation 2.5.1 𝑉𝑎𝑟[𝑍(𝑠) − 𝑍(𝑠 + ℎ)] = 2𝛾(|ℎ|) Equation 2.5.2 where 𝛾(|ℎ|) is the semi-variogram and a function of distance ℎ that separates two locations. Further mathematical details are available in (Cressie, 1993). Ordinary kriging typically uses a k-dimension (usuallyk <= 3) point as an input location and a 1- dimension target value 𝑍. However, a spatial interaction flow between stations is a spatial process involving two points in a 2-dimension Euclidean space (k = 4). To accommodate this higher dimensionality, a flow-based extension inspired by (Y. Zhou et al., 2019) is proposed that aggregates all the outflows to the m possible destination stations or inflows from m stations to the single origin station 𝑠𝑖 as 𝑍(𝑠𝑖) and the definition of 𝑇𝑖𝑗 remains the same as in Equation 2.1. The coordinates of the origin station remain in 2-D space, but the target value 𝑍𝑖 is an m-length vector representing the flows to each destination. According to (X. Liu et al., 2016), this station-centric view, which aggregates the OD flows into vectors describing flows coming in and going out from one single station, is also referred to as the OD flow signature of a station. In this station-centric view, the adoption of the vectorized 𝑍 is needed for handling higher-dimensional 𝑍 within kriging 32 techniques. The inflow predictions 𝑍𝑖𝑛and outflow predictions 𝑍𝑜𝑢𝑡 are carried out separately and then concatenated as the overall OD predictions for a target station. Alternatively, regression kriging leverages information from exogenous variables in addition to the spatial dependence in a sample by first fitting a regression model and then kriging the residuals of the regression model. Therefore, a Gravity SI model is first fit and then a flow-based ordinary kriging (OK) model as described above is used to interpolate the residuals of the gravity-type SI model. Specifically, when using the station-centric view in the regression kriging model (RK), the kriging step uses 2-dim input with k-dim targets, while the regression step uses location data in OD flow form with 4-dim features. Figure 3. Illustration of areal natural neighbor interpolation and ordinary kriging interpolation with toy data. 33 To better highlight the difference between the kriging interpolation and natural neighbor interpolation, an illustration with pseudo data is attached to demonstrate the two interpolation processes step by step in Figure 3. Both methods take identical data structure as input data, e.g., existing locations 𝑆𝑛, a 'cold start' location 𝑆𝑥, and flow matrix among existing stations 𝑇. The bottom-left section shows the steps of areal-based natural neighbor interpolation. First, the service area of each station defined by the Voronoi shapes is calculated before and after a 'cold start' station is added. Then the overlapped area over the 'cold start' station is calculated at each existing station, consisting of the portion (𝑝) of new stations which is the final proportional weight to calculate the interpolated outflows and inflows at the new station shown in the last row. In contrast to the natural neighbor that exploits locations only, ordinary kriging interpolation leverages both locations and flow similarity. Under the assumption of ordinary kriging, variations between stations at a fixed range of distance is only decided by the distance ℎ. Thus, distances between each pair of origin and destination are grouped into bins (Right-bottom section, Figure 3 left) and the variance in each group is calculated and plotted (Right-bottom section, Figure 3 right). It is proved that the optimization equation can be solved using the semivariogram only. Details can be found in geostatistical textbooks such as Le & Zidek (2006). So, it is essential to interpolate the flow variance between new stations and any existing stations by fitting a covariance function (black solid line in Right-bottom section, Figure 3 right). Finally, the optimized alpha that minimizes the prediction error is calculated and used for the flow interpolation results just as the natural neighbor interpolation method. 34 2.5 Data 2.5.1 Study area The Citi Bike system in New York City started operation in 2013 and now becomes the biggest bike share system in the US. Among the 809 stations added to the system since 2015, 55 of them had too few trips to produce reliable output in any one of the methods, specifically producing a null or negative correlation index value. Thus, the following results come from the remaining sample of stations ( 𝑛 = 754 ) with 313 interpolation stations, 123 margin stations, and 318 extrapolation stations using the (0.05,0.95) classification threshold. The spatial distribution of the stations is mapped in Figure 4 along with the original (pre-2015) set of stations. All the trips are aggregated into unique spatial and time tuples by their original stations, destination stations, and the week of the trips starting from Jan 1st, 2015. In the regression-based models, the following expression is used: “flows ~ cost + origin attributes (15) + destination attributes (15)” Origin and destination attributes including 15 variables each listed in Table 3 where POI includes 11 categories, e.g., care, education, finance, food, housing, recreation, religious, shopping, travel, professional, and other services. In the interpolation-based models, OD-matrices are formatted as the demonstration in Figure 3. 35 Figure 4. Spatial distribution of bike stations color-coordinated by the three derived classes for newly added stations and the original stations. 2.5.2 Study design Each new station is associated with a unique time frame for training and testing. So iterating station-wise, SI models were trained for each added station 𝑆𝑥, with a training set including all the OD flows available in the time period before 𝑆𝑥 is added, and with prediction test set from the time period after 𝑆𝑥 is rolled out. Training data consisted of flows one week before a new station was added at time 𝑡, (i.e., 𝑡 − 1) while the evaluation data was set to flows from one week after the roll out of a new station (i.e., 𝑡 + 2). This one-unit period provides a chance for demand to stabilize after a new station is added to the system for a complete week. Bike trips with duration more than 36 3 hours are excluded as abnormal trips. To evaluate the performance of each prediction method for all 𝑆𝑥, Pearson’s R correlation coefficient is employed7. 2.6 Results 2.6.1 Gravity-type spatial interaction model calibration Before producing the prediction results, it is necessary to calibrate the gravity-type SI models. An advantage of the SI model over the interpolation methods is the ability to elucidate how independent variables affect bike trip flows based on the parameter coefficients. To highlight the weight of each independent variable, the top of Figure 5 shows the average parameter magnitude and 95% confidence intervals based on the models for each station. The bottom of Figure 5 captures the top 11 most important features over time. 7 Zero flows were not included when calculating the performance metrics. 37 Figure 5. Top: Feature weight from the SI model results averaged over the time periods. The X- axis represents the absolute value of the parameter estimates and the color represents the sign of the average magnitude. The dark green strip shows the average range. Bottom: Time series of top 11 largest feature weights. The X axis is the number of weeks elapsed since 01/01/2015. 38 Generally, parameter estimates associated with a variable have the same direction, either negative or positive, and similar magnitudes for the origin stations and the destination stations. The cost factor, which is the average trip completion time of the OD pair and is typically called the distance- decay effect, is the most important factor, attesting to the First Law of Geography. The next most important factors included recreation and other services POIs. The former POI group suggested the purpose of the bike share trips for leisure activities. The latter group, other services, includes many parking facilities. It could be interpreted as a potential common commuting pattern: people park their cars and then take bike trips. Surprisingly, population was not a top factor, suggesting the residential population around stations is not necessarily a strong indicator of the size of the potential bike share user base. In terms of the parameter estimate stability over time, the top 11 important factors typically have consistently positive or negative signs. Some general trends can also be observed over time. First, the distance-decay factor associated with trip duration became more negative (i.e., stronger) over time as the system expanded. There is also a periodic fluctuation that captures the seasonal trend where distance-decay is less negative in the winter (the spikes around 50th, 110th, 210th weeks). Second, the impact of the COVID-19 lockdown was a decrease in the magnitude for the parameter estimates of most factors, likely because the regular trips of the public were essentially halted during this time and the alternative behavior was either less associated with these factors or generally more random. Besides, the gravity SI model also showed robustness in the parameter importance when changing the distance band from 1500m to 1000m to compute spatially lagged explanatory variables (i.e., POIs and accessibility), providing that the direction and magnitude of coefficients remain the same over the two distances. 39 2.6.2 Prediction results Table 4. A summary of the prediction results based on Pearson’s R for each method using data for each classification category and using all data. Standard deviations are in parentheses after the mean values. Pearson’s R (Standard deviation) Interpolation Margin Extrapolation All Natural neighbor 0.53 (0.22) 0.47 (0.22) - - Ordinary Kriging 0.52 (0.20) 0.45 (0.22) 0.41 (0.20) 0.46 (0.21) Gravity SI (Poisson) 0.44 (0.19) 0.41 (0.24) 0.40 (0.24) 0.42 (0.22) Gravity SI (NegBin) 0.42 (0.18) 0.37 (0.22) 0.39 (0.22) 0.40 (0.20) Regression Kriging (Poisson) 0.54 (0.21) 0.49 (0.24) 0.41 (0.22) 0.47 (0.23) The prediction results based on the correlation index Pearson’s R between predicted and actual trip counts were recorded in Table 4 for the five methods (natural neighbor, ordinary kriging, regression kriging, gravity-type SI model, and negative binomial model) using data for each classification category (interpolation, margin, extrapolation), as well as the entire dataset. Overall, the results demonstrate that regression kriging is probably the most well-rounded model to capture the correlation in all three station types. The results also provide evidence in support of the primary hypothesis that different mechanisms in spatial interaction and spatial interpolation will be different depending on the location and the regression kriging can outperform the single methods of ordinary kriging or SI model. Meanwhile, the standard deviations around 0.2 indicate the 40 correlation varies much across stations. It is hard to say regression kriging can always outperform other methods. 2.6.3 Spatial trends The relative comparison of the candidate methods within one location is more meaningful than the quantitative comparison of metrics across stations. Figure 6 reveals the spatial distribution of 'cold start' stations rendered with the best method in the individual station. Warm colors are associated with interpolation methods (Red: Natural Neighbor; Orange: Ordinary Kriging) and cold colors stand for regression methods (Cyan: Gravity SI; Blue: Negative Binomial). Green dots represent the compound model: regression kriging method. Overall, there is a substantial portion (518 out of 754) of interpolation methods topped as the best methods over the two regression-only methods, which supports flow interpolation as flow prediction tools. Then focusing on the lower Manhattan Island where stations are mostly added as interpolation types. Interpolation methods are commonly rated as the best methods there which comply with the intuition that nearby flow patterns facilitate interpolation methods. Two regression methods as best methods could hardly be spotted in lower Manhattan, but located in North Manhattan, Queens, and Brooklyn (north, east, and south of the system, respectively). 41 Figure 6. Spatial distribution of the best method for flow prediction at each 'cold start' station. On the edge of the bike share system, where added stations are usually of the extrapolation type, a mixture of best methods can be seen. Even though extrapolation stations are out of the coverage of the existing system with limited nearby spatial dependence, regression models are not necessarily superior to interpolation methods in all cases, which is unexpected. It implies the necessity of considering the spatial dependence or spatial structure of the bike share system at all times regardless of when locations are added. Another explanation could be that fewer trips at extrapolation stations introduce more noise than signal, having a negative impact on all methods. When comparing two gravity models, the Negative Binomial (NegBin) model seems to do better 42 in the outskirts of the system where there are probably more stations with a few numbers of rides and therefore more overdispersion. 2.7 Discussion Spatial interaction demand for bike-sharing trips has caught the attention of many researchers in recent years as new bike share systems are installed at an astonishing speed around the world. Yet the prediction of demand at 'cold start' stations is an overlooked issue. Previous studies work around the issue either by filtering out the new stations or using a different spatial aggregation, such as grids. This chapter leveraged spatial interaction models and flow interpolation techniques to propose an approach to compare different models through a case study using the NYC Citi Bike system. Specifically, the proposed approach attempts to predict OD flows of newly added stations after the system infrastructure changes. After comparing all candidate methods on each identified 'cold start' station, results show that the gravity SI model comes with the advantage of the interpretability of the parameter estimates. For example, the interpretation of gravity SI models shows that the leading trip generation factor is the number of recreation sites nearby, which provides useful additional information to advise the planning of bike share systems. Alternatively, interpolation methods, including natural neighbor and ordinary kriging, show better predictability for stations classified as interpolation but are less performant for stations classified as extrapolation. Regression kriging combines both gravity SI and ordinary kriging and yields the best performance. Finally, spatial trends reveal some heterogeneity in the prediction performance of different methods. Specifically, the downtown Manhattan region shows higher predictability for interpolation methods, but stations added in 43 Brooklyn are less performant for regression methods only. Interpolation methods unexpectedly outperform regression models in many extrapolation cases, implying the presence of long-range spatial dependence and suggesting the use of spatial interaction models that incorporate spatial dependence (Oshan, 2021) in future work. One significance of this work is that it begins to fill the gap of the missing historical data that is needed for many popular methods. For example, the methodology could be used to generate pseudo historical flows so that methods dependent upon them can be used to make predictions over time even for 'cold start' stations. Therefore, the proposed approach facilitates a more robust demand modeling framework. However, trip prediction for stations classified as extrapolation is still limited using the proposed approach because neither the gravity-type spatial interaction model nor the interpolation techniques work as well as they do for the interpolation stations. For future work, the idea of geostatistical transfer learning may help overcome the issue (Hoffimann et al., 2021). Meanwhile, the evolving techniques of graph convolutional neural networks (such as Pareja et al., 2020) may be adapted as an alternative of flow interpolation method to be compared with the proposed Kriging methods in future work. Another limitation that needs to be more fully explored is the weak predictability of 2020 stations, which is possibly due to the low trip counts or drastic behavioral changes during the pandemic, or both. When there are low trip counts, the methods used here tend to underestimate the few stations with higher activity and more effort is needed to develop methods that can handle this scenario. 44 The methods included here only provide an initial investigation of the 'cold start' issue. Future work may unfold differences in the performance of methods at station level to show the magnitude of the differences and further investigate differences in spatial dependence at a finer spatial scale. Future studies could also further investigate and leverage the temporal dependencies of the OD flows. In this work, a weekly time interval was used, though other intervals could also be explored, as well as the temporal windows used to characterize demand before and after stations are added to the system. Future work could therefore explore different time parameters or even combine them to increase predictive accuracy. These are just a few suggestions that could be on the line of inquiry initiated here to grapple with the issues caused by 'cold start' stations in dynamic transportation systems. 2.8 Conclusion This chapter focuses on the overlooked 'cold start' issue as historical flows are available at newly added stations. Instead, flow interpolation methods based on Kriging are proposed in the flow domain. Specifically, Ordinary Kriging and Regression Kriging with the adaptations in the flow domain are investigated and compared with existing flow interpolation of areal natural neighbor, as well as spatial interaction models that calibrate relationships between independent variables and spatial interaction flows. Based on experiments conducted on a bike-sharing system in NYC, the regression kriging model exhibits superior performance compared to other models regardless of the locations of 'cold start' stations by leveraging the strengths of both gravity-type spatial interaction regression models, which are robust and interpretable, and ordinary kriging, which is capable of capturing spatial dependence. The distribution of best methods at each station also 45 suggests a hybrid strategy to use interpolation-based methods at stations added within the system coverage and regression-based methods at stations added at the new areas of the system. 46 Chapter 3: Modeling the t