ABSTRACT The Title of Document: ARTERIAL PROBABILISTIC TRAFFIC MODELING AND REAL-TIME TRAVEL TIME PREDICTION WITH VEHICLE PROBE DATA USING MACHINE LEARNING Bahar Zarin, Ph.D., 2018 Directed By: Prof. Ali Haghani, Department of Civil and Environmental Engineering This study proposes a probabilistic modeling framework for the estimation and prediction of link-based arterial travel time distribution using GPS data. The spatiotemporal correlations of the network are modeled using a directional acyclic graphical model, and several external variables in the prediction model are included to yield a better prediction in a variety of situations. This study also aims to investigate the effects of each factor on the travel time and the uncertainty associated with it. In the proposed model, factors such as weather conditions, seasons, time of day, and day of the week are added as external variables in the graphical model. After determining the structure of the model, Streaming Variational Bayes (SVB) is used for training and parameter inference; this offers a valuable option when constant streaming data is utilized. SVB adaptively changes its parameters gradually with a lower computational cost, which makes the process less time-consuming and more efficient. The analysis shows that incorporating external variables can improve the model performance. The data used in this study is INRIX vehicle trajectory raw data from four months – February, June, July, and October of 2015 – which makes it possible to take into account the effects of seasons and weather conditions on travel time and its uncertainty. One of the products of this study is a framework for vehicle trajectory data cleaning process including trip identification, removing outliers, and cleaning the trips data. Once the data are cleaned and ready to use, they should be mapped to the roads. The Hidden Markov Model (HMM) map matching algorithm is used to map the GPS latitude/longitude data to the Open Street Map (OSM) base map and find the traversed links between each pair of GPS points of vehicle trajectories. Finally, a novel procedure to compare any travel time prediction model with any available commercial routing API is proposed and tested to compare the proposed model with Google API. ARTERIAL PROBABILISTIC TRAFFIC MODELING AND REAL-TIME TRAVEL TIME PREDICTION WITH VEHICLE PROBE DATA USING MACHINE LEARNING By Bahar Zarin Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2018 Advisory Committee: Professor Ali Haghani, Chair [Professor Larry S. Davis] [Professor Paul M. Schonfeld] [Dr. Mark Austin] [Dr. Cinzia Cirillo] © Copyright by Bahar Zarin 2018 Preface This dissertation is composed as a requirement for the degree of Doctorate in Philosophy in Civil and Environmental Engineering. ii Dedication To my parents, sister, and brothers for their endless love, support, and encouragement. iii Acknowledgements First and foremost I would like to express my deepest gratitude to my advisor, Dr. Ali Haghani who has been a tremendous mentor and supported me both academically and personally during my Ph.D. study. I am extremely grateful for all his contributions of time, idea, and endless support throughout my Ph.D. research and dissertation. It has been the greatest honor to learn from him and be his student. Very special thanks to my committee members, Dr. Paul M. Schonfeld, Dr. Larry S. Davis, Dr. Cinzia Cirillo, and Dr. Mark Austin for their time, support, and their brilliant comments and suggestions to improve my research. It is an absolute honor to have them as my committee members. I also would like to thank the Center for Transportation Technology (CATT) for giving me the opportunity to work on interesting projects. I learned a lot during my two years of research there. Additionally, I like to thank the Maryland State Highway for providing the data used in this dissertation. I am grateful for friendship and encouragement of my graduate research group for the tremendous learning experience I gained by being in the group and learning from their research. Special thanks to Mahyar for his support and insight during my dissertation. I would also like to thank all my friends: Kiana, Ali, Elham, Mahshid, Ali, Elham, Samira, Babak, Soroush, Marjan, Mostafa, Sepehr, Ladan, Niloofar, Pouya, Aida, Azadeh, Marjan and list goes on, for making my life beautiful and being like a family to me. Finally, I would like to dedicate my dissertation to my parents, my brothers and my sister. Words cannot express how grateful I am for all of the sacrifices they have made iv for me to get here and all the support and endless love they gave me in my entire life, thank you! v Table of Contents Preface ...................................................................................................................... ii Dedication ................................................................................................................ iii Acknowledgements .................................................................................................. iv Table of Contents ..................................................................................................... vi List of Tables ........................................................................................................... ix List of Figures ........................................................................................................... x Chapter 1: Introduction ............................................................................................. 1 1.1 Problem Statement .......................................................................................... 3 1.2 Research Objectives ........................................................................................ 5 1.3 Research Contributions .................................................................................... 8 1.4 Research Outline ............................................................................................. 9 Chapter 2: Literature Review................................................................................... 11 2.1 Parametric Models ......................................................................................... 12 2.1.1 Naïve Models ......................................................................................... 12 2.1.2 Autoregressive Models ........................................................................... 12 2.1.3 State Space Model .................................................................................. 13 2.1.4 Hidden Markov Model (HMMs) ............................................................. 14 2.1.5 Bayesian Networks (BNs) ....................................................................... 14 2.1.6 A Feedforward Neural Network .............................................................. 15 2.2 Non-Parametric Models ................................................................................. 16 2.2.1 K-nearest neighbors ................................................................................ 16 2.3 Other Methods ............................................................................................... 16 2.4 Arterial Travel Time ...................................................................................... 17 2.5 Intersection Delay.......................................................................................... 21 2.6 Probe Data ..................................................................................................... 23 2.7 Summaries and Limitations of Previous Studies............................................. 23 2.7.1 Limitations of Previous Studies............................................................... 25 Chapter 3: Model Preparations ................................................................................ 27 3.1 Data Cleaning ................................................................................................ 28 3.2 Probe Filtering ............................................................................................... 29 3.2.1 Trip Splitter ............................................................................................ 32 3.2.1.1 Time Difference Splitter ...................................................................... 32 3.2.1.2 Way Point Sequence Splitter ................................................................ 33 3.2.2 Cleaning/Filtering the Trips .................................................................... 33 3.2.2.1 Same Time Filter ................................................................................. 33 3.2.2.2 Same Location Filter ............................................................................ 34 3.2.2.3 Speed Boundary Filter ......................................................................... 34 3.2.2.4 Idle Filter ............................................................................................. 34 3.2.2.5 Location Displacement Filter ............................................................... 35 3.2.2.6 Trip Length Filter ................................................................................ 35 3.2.2.7 Large Error Size Filter ......................................................................... 36 3.2.2.8 Shaking Probe Filter ............................................................................ 36 3.2.3 An Examples .......................................................................................... 36 vi 3.3 Map Matching ............................................................................................... 37 3.3.1 HMM-based Map Matching .................................................................... 38 3.3.1.1 Measurement Probabilities ................................................................... 39 3.3.1.2 Transition Probabilities ........................................................................ 39 3.3.1.3 Optimum Match ................................................................................... 40 3.3.2 Map Matching Assumptions ................................................................... 40 3.4 Graphical Model ............................................................................................ 41 3.4.1 Acyclic Graphical Models....................................................................... 42 3.4.2 Model Assumptions ................................................................................ 43 3.4.3 Implemented Graphical Model ................................................................ 44 3.4.4 Creation of the Network .......................................................................... 46 3.5 Data ............................................................................................................... 47 3.5.1 Study Area .............................................................................................. 47 3.5.2 Open Street Map (OSM) ......................................................................... 48 3.5.3 INRIX Data ............................................................................................ 49 3.5.4 Weather Data .......................................................................................... 50 3.6 Summary ....................................................................................................... 51 Chapter 4: Graphical Model .................................................................................... 53 4.1 The External Variables. ................................................................................. 53 4.1.1 Weather Conditions ................................................................................ 53 4.1.2 Seasons ................................................................................................... 54 4.1.3 Day of the Week ..................................................................................... 55 4.1.4 Time of Day ........................................................................................... 55 4.1.5 The External Variables ........................................................................... 56 4.2 Graphical Model with External Variables ...................................................... 57 4.3 Travel Time Allocation.................................................................................. 59 4.4 Parameter Estimation ..................................................................................... 61 4.4.1 The Inference .......................................................................................... 61 4.4.2 Streaming Variational Bayes (SVB) ........................................................ 63 4.5 Summary ....................................................................................................... 66 Chapter 5: Analysis ................................................................................................. 68 5.1 Case Study .................................................................................................... 68 5.1.1 Trip Summaries ...................................................................................... 70 5.2 Developed Models ......................................................................................... 71 5.2.1 The Initial Developed Graphical Model .................................................. 71 5.2.2 Baseline Model ....................................................................................... 72 5.2.3 Test ........................................................................................................ 73 5.3 Validation of the Proposed Model.................................................................. 74 5.3.1 The Model without any Variables ........................................................... 74 5.3.2 The Model with Day of the Week ........................................................... 75 5.3.3 Travel Time Distribution ........................................................................ 76 5.4 Model Improvements ..................................................................................... 78 5.4.1 The Model with All of the External Variables ......................................... 78 5.4.1 Regression Model ................................................................................... 82 5.4.2 Comparison Between the Final Model and Improved Base Model........... 84 5.4.3 Comparison Between the Two Graphical Models.................................... 90 vii 5.4.4 The Model Test on a Rainy Day.............................................................. 91 5.4.1 The Time of Day..................................................................................... 93 Other Modifications ............................................................................................ 95 5.4.2 The Graphical Model with Intersection ................................................... 95 5.4.3 The Graphical Model with Driver Habits Variable .................................. 99 5.5 Summary ..................................................................................................... 101 Chapter 6: Comparison and Sensitivity Analysis ................................................... 103 6.1 Validation .................................................................................................... 103 6.2 Shape-Matching .......................................................................................... 104 6.2.1 Hausdorff Distance ............................................................................... 105 6.2.2 Fr´echet Distance .................................................................................. 106 6.3 Route Similarity .......................................................................................... 107 6.4 Google Maps API ........................................................................................ 109 6.4.1 General Information .............................................................................. 109 6.4.2 Departure Time ..................................................................................... 109 6.4.3 Travel Time with Traffic....................................................................... 110 6.5 Waypoint Parameter ................................................................................... 111 6.6 Comparison ................................................................................................. 111 6.6.1 Case Studies ......................................................................................... 112 6.6.2 Results .................................................................................................. 113 6.6.2.1 Off-peak Travel Time Comparison..................................................... 113 6.6.2.2 PM peak Travel Time Comparison ..................................................... 119 6.7 Sensitivity Analysis ..................................................................................... 121 6.7.1 Number of States .................................................................................. 121 6.7.2 Number of Observations ....................................................................... 124 6.8 Summary ..................................................................................................... 127 Chapter 7: Summary and Conclusion .................................................................... 129 7.1 Summary ..................................................................................................... 129 7.2 Conclusions ................................................................................................. 134 7.3 For Future Research .................................................................................... 135 Bibliography ......................................................................................................... 138 viii List of Tables Table 1 Summary of trips used for each month and each day of the week in the training data set ....................................................................................................... 71 Table 2 Summary of GPS points used for each month and each day of the week in the training data set ....................................................................................................... 71 Table 3 Accuracy of predictions for all the weekdays .............................................. 74 Table 4 Accuracy of predictions based on days of the week for the Baltimore test data ............................................................................................................................... 75 Table 5 Summary of winter predictions based on days of the week for the Baltimore test data ................................................................................................................... 85 Table 6 Summary of summer predictions based on days of the week for the Baltimore test data ................................................................................................................... 87 Table 7 Summary of fall predictions based on days of the week for the Baltimore test data ......................................................................................................................... 88 Table 8 Aggregated average percentage of error for different season ....................... 89 Table 9 Aggregated average percentage of error for each days of week ................... 89 Table 10 The comparison of the model with and without weather variable .............. 92 Table 11 Summary of winter predictions based on days of the week for daytime ..... 94 Table 12 Google and proposed travel time prediction summary (off-peak) ............ 117 Table 13 Sample of Google, proposed model travel time prediction and observations travel time comparison (off-peak) ......................................................................... 118 Table 14 Google, proposed model travel time prediction and observations travel time comparison (PM peak) .......................................................................................... 120 Table 15 Sensitivity analysis on the number of states ............................................ 123 Table 16 Sensitivity analysis on the amount of data ............................................... 125 ix List of Figures Figure 1 The change in delay over time for Baltimore, Maryland and DC (Virginia, 2017) ........................................................................................................................ 2 Figure 2 The change in commute time over time for Baltimore, Maryland and DC (Virginia, 2017) ........................................................................................................ 2 Figure 3 The conceptual framework of modeling the arterial travel time ................. 28 Figure 4 Sample of trips in Telenav probe data ........................................................ 30 Figure 5 Sample of trips in INRIX probe data ......................................................... 31 Figure 6 The data cleaning Process.......................................................................... 32 Figure 7 An example of Telenav trip before and after filtering ................................ 37 Figure 8 An example of INRIX trip before and after filtering .................................. 37 Figure 9 A Map matching result instance for Baltimore city .................................... 41 Figure 10 Spatiotemporal model of arterial traffic evolution represented as a Graphical model. ..................................................................................................... 46 Figure 11 Study area, downtown Baltimore ............................................................. 48 Figure 12 The graphical model with the proposed external variables ....................... 59 Figure 13 The available data on the area .................................................................. 69 Figure 14 The available data for the case study area ................................................ 70 Figure 15 Comparison of prediction error between proposed model and the baseline by days of the week ................................................................................................. 76 Figure 16 The example link in downtown Baltimore ............................................... 77 Figure 17 The travel time histogram for link 13 ....................................................... 78 Figure 18 Real-time level of congestion subnetwork of Baltimore downtown. ......... 81 Figure 19 Real-time relative travel time of Baltimore downtown network. .............. 82 Figure 20 Comparison of prediction error between proposed model and base model in winter ...................................................................................................................... 86 Figure 21 Comparison of prediction error between proposed model and base model in summer ................................................................................................................... 86 Figure 22 Comparison of prediction error between proposed model and base model in fall .......................................................................................................................... 87 Figure 23 Aggregated average percentage of error for different season .................... 89 Figure 24 Aggregated average percentage of error for each days of week ................ 90 Figure 25 Comparison of the proposed model with seasons and weather conditions and without them..................................................................................................... 91 Figure 26 The comparison of the model with and without weather variable ............. 93 Figure 27 Comparison of prediction error between proposed model and base model for day time in winter .............................................................................................. 95 Figure 28 The intersection movement demonstration ............................................... 97 Figure 29 The proposed graphical model including the intersections ....................... 98 Figure 30 Demonstration of path travel time as the sum of links and intersections travel times ............................................................................................................. 99 Figure 31 The proposed graphical model with proposed relative speed variable .... 100 Figure 32 Hausdorff distance as a measure of similarity ........................................ 106 Figure 33 Fr´echet distance as a measure of shape similarity ................................. 107 x Figure 34 Route similarity index example (two routes with similarity index of 70.05%) ................................................................................................................ 108 Figure 35 The proposed framework for travel time estimation comparison between models .................................................................................................................. 112 Figure 36 Proposed Model, Google API, and observed travel time comparison (off- peak) ..................................................................................................................... 115 Figure 37 Google and proposed model Percentage of error trend (off-peak) .......... 117 Figure 38 Proposed Model, Google API, and observed travel time comparison (PM peak) ..................................................................................................................... 120 Figure 39 Sensitivity analysis on the number of states ........................................... 123 Figure 40 Sensitivity analysis on the amount of data ............................................. 126 xi Chapter 1: Introduction Having an accurate and reliable traffic information system is the first step toward achieving active congestion control and alleviation, and can help reach a reliable network. The traffic information could be used by both travelers and agencies. One of the primary uses of travel time information is pre-trip guidance for travelers. This information impacts travelers’ decisions regarding departure time, route, and mode choice. Travel time information can also be used by transportation agencies for Advanced Traffic Management System (ATMS) strategies, Emergency Transportation Operations, and Traffic Incident Management. Such applications are increasingly important in a world in which urban road transport systems experience increasing congestion year after year. Figure 1 and Figure 2 each demonstrate the increase in the congestion level for several states and cities including Washington, D.C. and Baltimore, Maryland. As shown, in these regions, average commute travel time and hours of travel delay have consistently increased over time in most cases. 1 Figure 1 The change in delay over time for Baltimore, Maryland and DC (Virginia, 2017) Figure 2 The change in commute time over time for Baltimore, Maryland and DC (Virginia, 2017) Arterials are major roads that carry large volumes of traffic each day. In fact, arterials serve as the primary route of travel in major cities and as the second leading route of travel between cities. Thus, it is critical that traffic information systems offer accurate 2 information regarding arterial travel time. One of the most important elements for providing accurate travel time information in general and, more importantly, for arterials is travel time reliability and uncertainty associated with travel time. No matter the degree to which influential factors are considered in prediction models, there is always a degree of uncertainty associated with travel time predictions. Nevertheless, information about travel time uncertainty can be used for risk-averse routing, reporting travel time reliability to users, and for many other purposes. 1.1 Problem Statement Historically, the use of traffic monitoring systems has been limited primarily to freeways. Arterial traffic monitoring is particularly challenging because arterials are not covered by dedicated sensing infrastructures. Probe vehicle data remains the main available source of data for arterials. Because more than 40 percent of U.S. vehicle- traveled miles occur on arterial roads – and there are not as many studies in this realm – new studies to precisely estimate and predict arterial travel time are highly needed. Many travel time forecasting models – either for freeways or arterials –provide only expected travel time. Still, by modeling statistical distributions of travel times, rather than just mean values, there is an opportunity to provide travelers with travel time reliability (or probability of on-time arrival) in addition to travel time. This probabilistic forecast of travel time is beneficial for risk-averse routing. It could also be used to report travel time reliability to users. In this research, we study the probabilistic modeling of arterial travel time. 3 While there have been previous efforts to establish arterial traffic estimation/prediction methodologies, which are discussed in existing literature, modeling and estimating travel time remains a challenging task. To start, there is a great deal of variation associated with travel time caused by factors such as weather, demand uncertainty, incidents, and roadway conditions, as well as varying driver behavior. Such variations are more significant in arterials as opposed to other roadways because variations associated with traffic lights, stop signs, bus stops, left turns, etc. add a higher degree of variation. Since it is impossible to fully consider the effects of all of the influential factors in the modeling process, there is always some degree of uncertainty associated with travel time estimation/prediction. It is thus essential to model this uncertainty and establish an estimation of it and how it changes due to the myriad factors. Even though it is impossible to offer an exact estimation/prediction of travel time, incorporating as many influential factors into the modeling process as possible could produce a more accurate and more reliable estimation/prediction. Unfortunately, such factors are rarely considered in the modeling processes outlined in existing studies due to a lack of data. Low sampling frequency creates difficulties in inferring parameters. In fact, there is a direct relationship between the frequency of data and availability of data in different conditions, and the ability to include such data in the modeling process. This study thereby represents one of the only attempts to incorporate the effects of some of these aforementioned factors into the modeling process. 4 1.2 Research Objectives The main objective of this study is to provide probabilistic modeling of arterial travel time using GPS data, it addresses both travel time estimation/prediction as well as uncertainty associated with it. Beyond the variability inherent in any travel time estimate due to factors such as driver behavior, there is additional uncertainty associated with arterial travel time as the result of traffic lights, and/or stop signs, and/or left turns, etc. The proposed model considers some of these variabilities and incorporates them into modeling by adding them into the modeling process in efforts to yield a more accurate estimation of travel time in the arterials. Further, the proposed model aims to capture variability associated with travel time for arterials. Almost none of the existing travel time estimation/prediction modeling is comprehensive; in most cases, the proposed models center on normal weather conditions and specific time(s) of day and day(s) of the week. However, it is important to have an accurate travel time estimation/prediction that can be applied to a variety of scenarios, such as adverse weather conditions. Currently, even commercial estimation/prediction models lack this capability. As such, one of the primary objectives of this study is to propose a model that can provide an accurate travel time estimation/prediction in such conditions. Another important objective of this study is to evaluate the effects of incorporating this information in modeling performance. A better performing model with this information is needed to improve mobility and network accessibility and alleviate the cost 5 associated with unreliability. To do so, factors such as days of the week, and time of day impact on the travel time and its variability are evaluated. In addition to the primary objective of this study, another aim of this study is to provide a comprehensive procedure that can be used to conduct such analysis. Thus, for each step of this procedure, a separate framework is proposed and tested. The discussed procedure starts with the modeling pre-steps such as data cleaning, mapping vehicle trajectory data over a map of the observed region, and creating the city network. Then, developing a model with all of the proposed variables and testing it on a case study for several test days to compare it with other models. The final step is the creation of a framework to compare the proposed model with any commercial routing API. This study uses raw GPS trajectory probe data in order to consider real-world situations. As previously mentioned, preparation of this data is important to this study and applications to modeling. Establishing criteria for data selection and developing a process for identifying outliers within the data pose challenges. As such, this study proposes a comprehensive framework for cleaning the raw probe data. This framework includes identifying trips from device-based probe data and applying several filters to clean trips and make them ready to be used in the model. The framework could be used for any type of vehicle trajectory data. One of the important steps for conducting any modeling is validation in order to allow for comparisons with other models. This study uses a general method of validation that involves splitting the data into training and test datasets and validate the model on the test dataset. In addition, this study proposes a framework to compare the model with commercial navigation systems. It can be challenging to compare the proposed model 6 with existing commercial models, such as Google’s navigation system, since neither Google’s cost model nor its real-time data is available. As such, as an important follow- up step for the proposed modeling procedure, this study outlines a new method for comparing the model and the trip travel times with a commercial routing API. In summary, this study includes the following tasks: (1) Propose a comprehensive framework for working with raw vehicle GPS trajectory data. The framework includes data selection, data cleaning (trip splitters, probe filtering, and removal of outliers), and finally, preparation of the data to be used in the training and test models. Furthermore, the proposed framework is tested on two different vehicle GPS trajectory datasets. (2) Propose a new model for real-time arterial travel time estimation/prediction utilizing Streaming Variational Bayes, which can be used with streaming data (for small-to-medium-sized networks). (3) Investigate the effects of different factors, such as weather conditions, days of the week, time of day, and seasons on arterial travel time. (4) Propose a new arterial travel time estimation/prediction framework that can be applied in different weather conditions, days of the week, and time of day, and provide more accurate and reliable predictions for any conditions. Then, compare the performance to a base model as well as to a base model improved by a regression model. 7 (5) Propose a new framework for comparing estimation/prediction resulting from the discussed model (or any real-time travel time prediction model, in general) with any available routing API, such as Google routing API. 1.3 Research Contributions What follows is a list of the primary contributions of this dissertation to the field. • We have proposed a framework for cleaning vehicle trajectory data which can be used for all types of data availability and is applicable to any study that utilizes real-world data. • We have proposed a graphical model that takes into account different external variables such as weather conditions, seasons, time of day, and day of the week. We proposed the use of the Streaming Variational Bayes to make it possible to use the graphical model for larger-scale networks and when a number of variables are at play. • We have evaluated the effects of incorporating each of these variables in the modeling and have addressed how such variables can impact the modeling process. We have done so by incrementally adding the variables into the model. • Finally, we have proposed a new procedure with which one can compare any type of travel time prediction model (for any kind of network: freeway, arterial, or both) to any commercial routing API, such as Google API. 8 1.4 Research Outline The rest of this dissertation is organized as follows. In the next chapter, we further review existing literature related to travel time estimation studies and the methods used in travel time estimation/prediction in general as well as in arterial travel time estimation/prediction. Chapter 3 mainly explains the preliminary steps taken before the modeling process. In this chapter, a new framework for a data cleaning and filtering process is proposed and tested on two different datasets. The chapter then addresses the next important step before the modeling – the map matching process used to map the cleaned data to the network. To do this, different map matching techniques and algorithms were reviewed; the map matching algorithm used in this study is discussed in greater detail. The last step before the modeling involves the creation of the roads network and capturing the roads’ dependencies. The graphical model and the relationship between variables are further explained to demonstrate how the city network should be created. Chapter 4 demonstrates the enhanced graphical model proposed in this study. All the variables considered in the model are explained in greater detail, as is the structure of the enhanced graphical model and the relationship between variables. The chapter continues by proposing using an approximate inference algorithm named Streaming Variational Bayes. This algorithm is used in order to make the model applicable for larger-sized networks and instances in which more variables need to be incorporated in the model. Chapter 5 outlines the performance of the proposed model with all the variables and compares the proposed model with an improved base model. Further, the 9 chapter analyzes the impact of including each of these variables in the model performance. Chapter 6 explains the new approach for comparing the proposed model (or any other model) with a commercial routing API when the associated cost model is not available. Finally, the last chapter includes the conclusion and possible future work for this study. In the last chapter, the summary of the research, the contributions, and the findings are discussed. The limitations of this study and suggestions for future work are also discussed in this chapter. 10 Chapter 2: Literature Review In this chapter, some of the general ideas and methods related to estimating travel time (both arterial and freeway) are presented. Following that, more detailed literature discussing the arterial travel time estimation/prediction, and the associated methods, assumptions, and data used are provided. Throughout this dissertation, other specific detailed literature is referenced when appropriate. There are different approaches that can be applied to the traffic estimation/prediction problem (both freeways and arterials). The approaches can be divided into three essential clusters: scope, output, and method (Vlahogianni et al., 2004). Regarding conceptual output, there are two main trends with different outcomes in the estimation of travel times using probe data. One trend is providing expected travel time. Most large-scale navigation systems, such as Apple Maps or Google Maps have this outcome, which can be applied to any large-traffic network. The other trend involves providing probability distributions of travel times and inferring their parameters using different statistical methods that can be used for small-to-medium-sized networks, as they are computationally intensive (Hunter, 2013). Studies on travel time estimation/prediction can also be divided based on their scope, be it microscopic or macroscopic. The microscopic modeling method mostly includes a small number of intersections and segments assuming complete data availability, including signal timing, vehicle counts, etc. In contrast, macroscopic modeling focuses on large scales, such as cities or countries. Another way to distinguish between studies on travel time estimation is by comparing the methods they use. There are different approaches to 11 model travel time, such as parametric, and non-parametric (Perry& Greene, 1982). This study focuses on those methods that use probe data. Given this, what follows are the methods used for travel time estimation, in general, as well as for arterial travel time estimation/prediction using probe data. 2.1 Parametric Models The parametric models have strong assumptions about the data which should be satisfied, or they can result in inaccurate estimation. These models have a fixed number of parameters (Murphy, 2012). 2.1.1 Naïve Models One of the simplest parametric models is the naïve model. The naïve models incorporate very simple methods such as historical averaging or smoothing and are considered the simplest forecasting models which are used in travel time estimation (Farokhi Sadabadi, Hamedi, & Haghani, 2010). Historical average models are merely averaging historical data with the assumption that the traffic repeats the same pattern over time and these patterns are used for future prediction (Smith & M. Demetsky, 1997; Farokhi Sadabadi et al., 2010). 2.1.2 Autoregressive Models A time series is defined as a sequence of measurements of the same variable(s) made over time intervals (day, week, etc.). The model in which a value from one-time series is regressed on its own past values from that time series is called the autoregressive 12 model. The order of an autoregression in a time series is defined as “the number of immediately preceding values used for present value prediction” (onlinecourses.science, 2016). In Autoregressive modeling, an autoregressive integrated moving average (ARIMA) approach – also referred to as Box-Jenkins– is one of the approaches for analyzing and forecasting, and is a specific form of an autoregressive moving average (ARMA). Several early studies in freeway traffic modeling used ARIMA for volume and occupancy prediction (Ahmed & Cook 1979, Levin & Tsao1980). Later, the ARIMA model was used in all kind of traffic parameters forecasting studies or comparing the new models with it (Zhang, 2015; Davis & Nihan, 1990; Hamed et al., 1995; Kamarianakis & Prastacos, 2005; Williams et al., 1998; Cetin & Comert, 2006; Karlaftis & Vlahogiann, 2009; Min & Wynter, 2011). SARIMA, on the other hand, is Seasonal Auto-Regressive Integrated Moving Average. In fact, SARIMA is an ARIMA model that incorporates the seasonal factors in the model as well. Some studies tried to capture the seasonal effect by using the SARIMA (Kumar &Vanajaksh 2015; Williams & Hoel 2003). 2.1.3 State Space Model A state space model (SSM) models the probabilistic dependency between the latent state variable and the observed measurement (Koller & Friedman 2009). One specific type of SSMs is the hidden Markov models (HMMs), in which hidden states are discrete (Murphy, 2012). The model could be used to analyze both deterministic and stochastic dynamic systems that are observed through a stochastic process. Kalman filter and Kalman smoothing algorithms are among the most widely used algorithms in 13 SSM models. There are several studies that have used the state space model for predicting traffic both in freeway and arterial networks (Stathopoulos & Karlaftis 2003; Ghosh et al., 2009) 2.1.4 Hidden Markov Model (HMMs) In general, in the (first-order) Markov chain or Markov model, the assumption is that the present is only dependent on the immediate past. In other words, Markov model centers on the idea that the immediate past captures everything we need to know about the entire history. Since this is a very strong assumption, higher-order Markov models are created whereby the hidden Markov model (HMM) offers another means through which to capture long-range correlations between observations. HMM assumes there is an underlying hidden process represented by a hidden variable at each time (Murphy, 2012). 2.1.5 Bayesian Networks (BNs) A Bayesian network– also known as a causal model as well as a directed graphical model – is an approach to representing a system and the dependencies using a directed graph. Sun et al. (2006) developed a Bayesian network for the urban highways flow forecasting by considering the spatial/temporal correlations of traffic flow between adjacent links There are several studies using this dynamic Bayesian network for arterial travel time modeling (Herring, 2010; Hunter et al., 2009; Hofleitner et al., 2012 a; Hofleitner et al., 2012 b). Within a dynamic Bayesian network, however, the network is not dynamic; 14 instead, it is fixed and problem-specific. There are different structures to be defined. First, the structure of the first time-slice must be defined, then, the structure between two time-slices, and finally, conditional probability distributions (CPDs) governing all the network’s random variables must be defined. 2.1.6 A Feedforward Neural Network A Feedforward Neural Network can be used for classification or regression problems. In fact, it is a series of logistic regression models on top of each other, and the information moves in a forward direction, with a final layer of logistic regression in cases where there is classification problem, or linear regression model if there is a regression problem. Neural Network models consist of hidden layers and weight matrices from input to hidden nodes and weight matrices from hidden nodes to the output. The hidden layers learn non-linear combinations of the original inputs. In instances where there is no nonlinearity, the model simply becomes a large linear regression model (Murphy, 2012). There are several studies that have proposed different neural network for freeway traffic modeling. Zeng & Zhang proposed neural network models for freeway travel time forecasting (Zeng & Zhang 2013). Jiang and Adeli (2005) proposed a dynamic wavelet neural network model to predict freeway traffic flow and included time of the day and the day of the week as the explanatory variables. Even though most of these studies are for freeways traffic prediction, there are few studies using the neural network for arterial traffic modeling. Yin et al. (2002) used a fuzzy-neural model 15 (FNM) to predict the traffic flow in an urban street network. Park and Lee (2004) used a neural network as a classifier for arterial travel time estimation. 2.2 Non-Parametric Models Unlike parametric models, the nonparametric model does not have a predetermined form for the predictor; rather the predictor form is derived from data. Nonparametric models do not have a predetermined number of parameters, and they could grow with the amount of data (Murphy, 2012). 2.2.1 K-nearest neighbors One of the simplest examples of a non-parametric classifier is the K-nearest neighbor (KNN) classifier. This classifier consists of K points in the training set, each test input of x class is defined based on counts of members of each class that are nearest to that input (Murphy, 2012). Tiesyte and Jensen (2008) used the Nearest-Neighbor Trajectory (NNT) technique on real data from buses to predict future movement. The observed trajectory in this study is the path taken between two bus stops. Myung et al. (2011) proposed k nearest neighbor method to predict freeway travel times using vehicle detector data and an automatic toll collection data. Robinson and Polak (2005) proposed the use of the k nearest neighbors technique to estimate urban link travel time. 2.3 Other Methods There are some other studies using parametric models, non-parametric models, or combinations of them. Zhang (2015) proposed a gradient boosting tree-based model to 16 predict freeway travel time. Hamner (2010) proposed the use of context-dependent random forest to predict travel time. Leshem and Ritov (2007) used Adaboost Algorithm with random forests as a weak learner to predict freeway traffic flow. Yu et al. (2010) applied the support vector machine (SVM) and Kalman filtering to predict bus travel time. Herring proposed Bayesian inference as one of the proposed models to estimate arterial travel time (2010). Huang et al. (2014) proposed a deep learning approach for freeway traffic flow prediction. They used the combination of a deep belief network at the bottom and a multi-task regression layer at the top of the architecture to model traffic flow. Yu et al. (2017) proposed a deep neural network based on long short-term memory to forecast freeway peak-hour traffic and extreme condition traffic (for example, post-accident). Cui et al. (2018) proposed a deep bidirectional and unidirectional long short-term memory neural network architecture to predict freeway network-wide speed. 2.4 Arterial Travel Time Thus far, this paper has outlined the general ideas of, and the methods used in travel time estimation. What follows offers a more detailed discussion of the probabilistic methods in arterial travel time estimation and prediction. Herring (2010) compares different travel time estimation models including the regression model in which a logistic regression is used to classify the discrete states and STARMA for the travel time estimation/forecasting. Another considered model in this study is historical modeling using Bayesian inference for real-time estimation. Real-time estimation is calculated by considering the historical outcome as the prior 17 distribution and updating the parameters of distributions by weighting between the parameters from historical data and real-time observation. The last considered model is a probabilistic graphical model, in which the spatiotemporal conditional dependencies of the arterial traffic are modeled as a probabilistic graphical model. Then, Expectation-Maximization (EM) estimation algorithm, which is an iterative method to estimate maximum likelihood, is used to learn parameters. The used data is probe data from 500 taxis traveling in a small area (322 links) in San Francisco, from 3 p.m. to 8 p.m. on weekdays. Hunter et al. clearly (2009) proposed a probabilistic model of travel times in the arterial network, based on low-frequency taxi GPS probes. In their study, an EM estimation algorithm is proposed that iterates between travel time allocation and parameter estimation. The study assumes that the travel times on the segments are normally distributed and travel times on different segments are independent. The data used in this study are obtained from a fleet of 50 taxis in San Francisco, however, the performance of the model and scale of the network are not mentioned. Hofleitner et al. further develop their primary in their later works (Hofleitner et al., 2012a; Hofleitner et al., 2012b). Hofleitner et al. (2012 b) proposed a density model to scale the partial travel time on the partial links (the start GPS point and end GPS point). Also, they integrated the step of travel time allocation on the links is into the EM algorithm which means the observations are path travel times. In other words, instead of explicitly distributing the path travel time to its corresponding links, in their graphical model, the path travel time is treated as the same parent for every link inside the path. They employed a particle- 18 filtering inference in the E-step to optimize the model that can also learn how to distribute the total travel time to the links implicitly. The model is tested using Global Positioning System (GPS) data from a fleet of 500 probe taxis in San Francisco, and the studied network included 800 links. Hofleitner et al. (2012 a) proposed a hybrid model. In this study, both the distribution parameters as well as static parameters of the roadways (such as free-flow velocity or traffic signal parameters) are to be estimated. Their primary approach is improved by introducing a state variable that represents the number of vehicles that stop (number of queuing vehicles) on a link per light (as opposed to binary state variable) and turn fractions at intersections. EM algorithm is used for the parameter learning. The model is tested using data from a fleet of 500 probe vehicles in a sub-network of San Francisco shown on 769 links. Hunter et al. (2013) took a slightly different approach in using a graphical model. Their algorithm called “Stop & go” consists of three steps. First, they use maximum likelihood to get the number of stops for each link from the observation. Then, they use the Markov model (MM) to calculate the state transition probabilities – the probability of the number of stops on a link given the number of stops on the preceding links. Next, they use the Gaussian Markov Random Field (GMRF), which provides the joint distribution of the travel times of neighboring links, given their state. Their findings show that the model can be improved by incorporating the variability of travel times due to stops in the structure of the model. However, the independence or correlation of travel times does not have a significant impact on the model. It should be noted that the 19 data used in this study is 9.6 measurements per link, which means that the model needs to incorporate a high-frequency GPS measurement. There are some other studies that propose probabilistic modeling for emergency vehicles, such as ambulances, and they mostly model the distribution of travel time on either the entire trip or based on links. However, this modeling might not work as well for general purposes since the data source is very low-volume and does not reflect travel time experiences for non-emergency vehicles (Budge et al., 2010; Westgate et al., 2015; Westgate et al., 2013). For instance, Westgate et al. (2016) proposed a regression model for travel time distribution for an entire trip over large networks including both freeways and arterials, using ambulance data. In another study, Westgate et al. (2013) used Bayesian data augmentation to simultaneously estimate the paths travel times as well as parameters of travel time distribution for each road segment. Jenelius and Koutsopoulos (2013) proposed a spatial moving average (SMA) method for arterial travel time estimation on heavily traveled routes. In this study, travel time mean and variance is a function of network characteristics (speed limit, functional Class) and trip conditions (time of day, season, and weather conditions). This study is one of the few studies in which the network characteristics and trip conditions are considered in the modeling. However, according to the authors, the focus of this study is to model the impact and significance of different explanatory factors on the travel times rather than developing a high-performance model for the travel time estimation. In this study, Stockholm, Sweden probe vehicle data from a taxi fleet is used. 20 Woodard et al. (2017) proposed a probabilistic estimation of travel time for large networks, including both freeways and arterial networks, using Bing Maps data. In this work, the travel time for each link is considered as a Gaussian variable conditioned on some latent unobserved random variables which depend only on the same link. Assuming some probability distributions over the latent random variables, the team used the Maximum A Posteriori (MAP) method to find the set of model parameters that maximize the posterior distribution. However, the proposed model has the limiting assumption that the state of each link is independent of the others. Yang et al. (2017) proposed a modified Gaussian mixture model to estimate arterial link travel time distributions. Their model could be used when either fixed-location sensors or mobile sensors data is available. They used simulation and a small number of real-world data to validate their model. 2.5 Intersection Delay Intersections play a critical role in travel time estimation/prediction, but they are rarely considered in the route’s travel time estimation/prediction models. In fact, the travel time of a trip consists of two parts: road section travel time and intersection delays. Intersection delay is the timestamp difference between the times of vehicles entering and leaving an intersection. Tang et al. (2016) proposed a microscopic method to estimate intersection travel time. Their method assumes that the intersection travel pattern is different based on the different flow pattern. They divide the traffic flow patterns into four different categories: free traffic flow, minor impediment, moderate impediment, and serious 21 impediment. In their study, they first recognize the flow pattern, and then using the linear fuzzy regression method, they obtain the intersection travel time for each mentioned flow pattern. Ban et al. (2009) proposed a microscopic model to estimate the individual intersection delay pattern and validated it using microsimulation. Their proposed model identifies the observed sampled travel times measured between upstream and downstream locations of a signalized intersection, and it estimates the intersection delay pattern. The group’s proposed model includes two steps, the first of which detects the start and end of a cycle by considering the queue-forming and discharging process pattern, and the second of which uses a least-squares–based linear-fitting algorithm to estimate the delay pattern in each cycle. They considered two different flow conditions which are normal and oversaturated conditions. Hao et al. (2017) proposed a model to estimate vehicle state trajectory that can potentially be used to extract travel time and delays on arterial networks. In their model, the team considered four different modal activities: acceleration, deceleration, cruising, and idling. Further, the group considered all of the possible sequences of the modal activities, and the sequence that maximizes the likelihood of the product of probabilities for multiple independent events is chosen as the vehicle speed trajectory. Finally, they compared their model with linear interpolation as a baseline using Next-Generation SIMulation dataset and showed they reached better results compared to the baseline method. Even though there have been some studies addressing intersection-delay estimation, hardly any macroscopic level study of intersection delay estimation has been published. 22 In other words, there are few studies that consider both link travel times and intersection delays together and as part of the travel time estimation or prediction for a whole trip. 2.6 Probe Data One of the most valuable types of data used in transportation studies is probe data. Probe data can be used to develop applications that can improve roadway operations, planning, and maintenance, and keep travelers informed of travel conditions. Probe technology uses location-aware and internet-enabled devices that are either installed in vehicles or carried by travelers. The providers of probe data can vary. For instance, the source could be GPS-equipped commercial fleet vehicles, such as trucks, package delivery vans, taxi vehicles, construction vehicles, or buses. The disadvantage of this data, however, that it is based on the specific spatiotemporal travel patterns of these vehicles. Another source of probe data is participatory sensing, such as INRIX, TomTom, Google, or Nokia – all of which collect location data from GPS-equipped smartphones, personal navigation devices, vehicles with embedded GPS, and other mobile consumer devices. The source of data in this study is probe data (either mobile data or vehicle embedded GPS data) provided by INRIX. 2.7 Summaries and Limitations of Previous Studies In summary, traffic estimation/prediction models using probe data can be categorized based on three main aspects: scope, output, and method. Based on the output, there are two different potential outcomes for expected travel time and probability distributions. With regards to scope, there are two main options: microscopic (which is based on a 23 small number of intersections and segments, assuming comprehensive data availability) and macroscopic. Finally, based on the method, there are two main approaches: parametric, non-parametric. It should be noted that there might be some overlaps between these categories. Since this study utilizes a macroscopic scale with an output of probabilistic distribution, most of the supporting literature reviews stem from those same categories. Parametric models are usually based, in part, on theoretical reasoning and strong assumptions about the data. Some of the frequently used parametric models applied to travel time prediction are autoregressive models, Hidden Markov Model, and Feedforward Neural Network. Non-parametric models, however, often do not carry any assumptions about the data, and their parameters grow as the direct result of the amount of data used. One of the most frequently used non-parametric models is the k-nearest neighborhood model. When it comes to real-time arterial travel time prediction, some models are used more than the others. For example, some studies have used simple Bayesian inference or regression models to estimate the historic travel time and considered a factor deduced from real-time traffic condition that is multiplied by the historic estimation to calculate the future prediction. One of the more sophisticated models used in real-time arterial travel time prediction is the graphical model through which one considers the state of traffic as a hidden variable in the model and uses EM to solve it. Other models have implemented complex regression models in which the trip and link variables take into account some probabilistic assumptions about the variables and use the Likelihood estimator (MLE) tor Maximum A Posteriori (MAP) to solve it. 24 What follows discusses some of the limitations of the reviewed literature and outlines how this study is addressing those limitations. 2.7.1 Limitations of Previous Studies Many previous studies rely on either taxi data or bus data (and in some case, ambulance data). Although these data are valuable because of their availability, they might not be representative of general travel time and, in most cases, these vehicles experience higher travel times (Jenelius & Koutsopoulos, 2013). The reason for choosing the mentioned probe data is because the other type of high frequency of probe data is usually not available for researchers to use. Additionally, previous studies rarely consider weather conditions despite the fact that the impact of weather on the travel time is critical, however, it is one of the most needed areas to focus on since the existing models including the commercial ones (Google map, etc.), have the lowest performance in bad weather conditions. Another factor that is rarely addressed is taking into account the driver behavior. Many studies are also limited by the fact that they treat the intersection as part of the links rather than treat intersections as separate entities. Despite the fact that the link and the entering intersection are in many ways related, treating these as separate entities might result in better traffic modeling. Finally, most existing probabilistic models are applicable only to small-sized networks because many of them are not computationally efficient for large-road networks and/or large datasets. Recognizing these challenges, this study aims to propose a probabilistic model that uses vehicle trajectory data and brings different factors, such as weather conditions and 25 travel behavior, into the modeling process. This study also considers the intersections and links separately in order to provide different travel time distributions for each. Last but not least, this study uses Streaming Variational Bayes for the inference which makes it applicable for large-sized networks as well. 26 Chapter 3: Model Preparations In a perfect world, there are perfect data - and engineers simply need to find the best model that can use that data toward a final objective. In line with this thinking, many studies try to find cleaned data to use, but this limits the area of study. In fact, in the real world, there is always “dirty data” that need to be cleaned prior to any applications for modeling. Thus, it is critical to have a data cleaning framework in place, and this chapter provides such framework. On the other hand, in this study, we have the GPS trajectory data as the observations and city roads network as the variables for which we want to predict the travel time. Thus, the next step involves mapping these observations (GPS trajectory data) on the roads network. Additionally, it is necessary to understand the structure of the city and how the road traffic could be interrelated. In order to do this, the main structure of the graphical model is explained; this model provides the primary building blocks for the model in the next chapter with all the variables. This chapter is dedicated to all of the aforementioned preliminary steps taken prior to the modeling. Additionally, this chapter discusses input data, including geographic map and weather data. Figure 3 illustrates the modeling framework and demonstrates how the needed data and model are interrelated. 27 Pre-Modeling Steps Graphical Model Raw Probe Parameter Data Data Cleaning Map Matching Estimation Travel Time Allocation OSM Making a Graph Other Information Figure 3 The conceptual framework of modeling the arterial travel time 3.1 Data Cleaning Many studies focus on the cleaning of probe data, with different end objectives. Chung et al. (2003) proposed a framework to clean the probe data to determine the origin- destination (OD) pattern of the probes. Their proposed data cleaning framework consists of six considerations to find the correct trip destination. The six considerations are the gap between parking brake event, the long gap, the gap with unrealistic speed, the long stop, the short stop with hazard light, and the U-turn. There are other studies that attempt to implement methods to complete an incomplete data. Hao et al. (2017) proposed a model to find the vehicle trajectory between two observations rather than just merely interpolate between them. In this study, a data cleaning process for the vehicle trajectory data with the purpose of travel time estimation/prediction is proposed. 28 The vehicle trajectory data contains the GPS trace information of any vehicle traversing a path from an origin to a destination. For each vehicle along its path, there is a GPS report (including information such as location, time, speed, etc.) received for every time interval (which varies in different datasets). However, these GPS reports are most of the time not clean and ready to use. Various external causes may degrade the quality of the reports. GPS device may not work as expected, the GPS localization may not be accurate in some regions, GPS signal may not be fixed in some time intervals, etc. Inaccurate data results in inaccurate modeling. Consequently, this proposes a framework to clean the waypoint GPS traces before modeling the travel time. This framework deals with problems such as idling, impossible speed, impossible sequence, impossible shift in location or time, etc. What follows explains the data cleaning process in the most general form, in order to allow it to be used for any kind of data. The framework is explained in the most comprehensive manner and can be applied to any available data. The cleaning filters take into account all possible information, including latitude, longitude, timestamp, speed, headway, and GPS error. Also, the process assumes there is no order to the data. Based on the availability of data and prior cleaning processes, a subset of this filtering might be used. 3.2 Probe Filtering If the provided probe data is raw data that is not cleaned in any way, it can be difficult to use for modeling. Use of incorrect data results in the development of an inaccurate model. As such, it is essential to have a useful framework and to develop a thorough 29 cleaning process. What follows is the explanation of the “probe filter” framework, which explains the procedure to find tripes, clean the probe data and eliminate noise and outliers. The proposed framework is applied to both TeleNav data and INRIX data, based on their availabilities. Figure 4 shows an example of TeleNav raw data in the Santa Clara, California area. TeleNav data is pulled from different providers; each provider has different standards of quality and uses different time intervals between the GPS reports (ranging from one second between reports to a couple of minutes). The Telenav data is only used in the data cleaning process. The data is the probe data with a one-second interval and the data is device-based. The data are from 2015 to 2017 and are for the cities in northern California, including Santa Clara, San Jose, Mountain View, and Palo Alto. Figure 4 Sample of trips in Telenav probe data Figure 5 shows the example of the INRIX probe data. As mentioned, from the INRIX data, the Baltimore area is selected, and the data covers four months of data with the interval of one minute between GPS reports. The INRIX data is not as raw, but it still needs to undergo a cleaning process. 30 Figure 5 Sample of trips in INRIX probe data If the data is device-based (TeleNav), then the cleaning process involves two different steps. The first step centers on finding the trips and the second step centers on cleaning the data from each trip. If the data is trip-based (INRIX) then only the second step is needed. What follows explains each of these two steps: 1. Finding the trips from the device-based probe data (trip splitter). 2. Cleaning the trip (applying the rest of the filters). Here, all the filters are applied consecutively to the probe data; however, one might choose only a few of the filters according to the need, data availability, and the objective of the study. Figure 6 shows the cleaning process. 31 Trip Filters Raw Probe Data Trip Cleaned Splitters Data Figure 6 The data cleaning Process 3.2.1 Trip Splitter The trip splitter might be used for one of two reasons: the waypoints’ unique identifier might be device-based rather that trip-based or the trip might need to be split into more than one trip in order to use it in the modeling. The three different criteria to split the trips are time, location, and the waypoint (if available). 3.2.1.1 Time Difference Splitter First, if the data are device-based (rather than trip-based), the data are split based on the datestamp or timestamp of the data. For instance, the first trip data might be split if it is from different days. Then, on the same day a time interval is considered above which the probe data is split into different trips (e.g. max time difference threshold = 120 sec). This time interval depends on the GPS report timestamp (for example, if it is one minute or one second between each of the two consecuitive GPS reports). This means that the probe data with a time interval over a certain threshold might be split 32 Same Time Same Location Speed Boundary Idle Location Displacement Trip Length Large Error Size Shaking Probe into two trips. The challenge with this filter, however, is that if we choose a higher time-difference threshold, we might incorrectly select two different trips as one and then we might lose the whole trip in the cleaning process (“shifted probe filter”). If we choose lower threshold, we might split one trip into more trips, but that won’t cause any problem since we won’t loose any information aside from the actual origin- destination of the trip. If the data is already trip-based, there is no need for this operation. 3.2.1.2 Way Point Sequence Splitter The last step in the trip splitter process is splitting if the sequence of the waypoints is not ordered or when there is a gap in the sequence. This is due to the fact that we limited the study area to a boundary, and those trips that go out of the boundary and come back again would have a gap in their sequence. Thus, those trips are split every time they pass the boundary. 3.2.2 Cleaning/Filtering the Trips 3.2.2.1 Same Time Filter This filter removes the probes that have the same timestamp. In a way, the first probe with a unique timestamp is kept, but all subsequent probes with the same timestamps are removed. This filter is a partial filter, which means that only the probes that have this duplicative characteristic are removed from the trip data. 33 3.2.2.2 Same Location Filter This filter is also a partial filter; it removes those probes that are based on the same location. This process involves keeping the first probe and removing all the subsequent probes that have a distance lower than a minimum threshold from the first point (e.g. minimum distance between probes of 1 meter). 3.2.2.3 Speed Boundary Filter This filter is used to filter out those trips that feature an impossible speed. The filter removes those probes that have speeds registering above the maximum speed (e.g. maximum speed = 200 mph). Similarly, if the average trip speed is above the maximum or below the minimum speed (e.g. minimum speed = 2 mph and maximum speed = 200 mph), then the whole trip is removed.This filter could be used even when speed data is not available because the speed could be calculated based on the location and timestamp data. This filter could be partial, or it could target the entire trip. 3.2.2.4 Idle Filter This filter removes the probes that are identified as idle probes, such as when the speed is below a minimum threshold speed, from a trip. Applying this filter means that, if a car is idle in the gas station, those probes will be removed; similarly, if time is important to us the travel time from origin to destination would be much higher. To address this problem, the time difference threshold (trip splitter) could be checked once again. If removing the filter causes the time difference between the consecutive probes to be more than the time difference threshold, the trip should be split into two trips. Another 34 potential problem with this filter is the probes behind the traffic light, which are removed because they indicate a speed below the threshold (e.g. minimum speed of 1.0 mph). This filter could be used regardless of whether or not speed data is available since the speed could be calculated based on locations and timestamps. 3.2.2.5 Location Displacement Filter This filter removes the whole trip if there is at least one shifted coordinate by more than max Distance threshold (e.g. maximum miles between probes = 3 ). This filter is not a partial filter and if the conditions are met the whole trip will be removed. The problem with this filter could be losing a lot of data if we have a higher threshold of max distance and having a lot of shifted probes if we choose the lower max distance. One solution to this problem could be to avoid losing this data by either splitting the trips to smaller ones, or removing the shifted probes and interpolating between the remaining probes or using trajectory identification techniques (Hao et al. 2017) instead of the interpolation to improve the accuracy. 3.2.2.6 Trip Length Filter This filter removes the entire trip if the probe count is less than a minimum count. The count threshold is based on the probe timestamp intervals (if they are 1-second or 1- minute interval). Here the threshold of 60 probes is used when we have the timestamp interval of 1 second and threshold of 5 probes when we have the timestamp interval of 1 minute (e.g., Min Probes Count = 60). 35 3.2.2.7 Large Error Size Filter This filter is again a partial filter; it removes those probes with GPS error larger or equal to a maximum threshold error size from the trip data. Error size/accuracy is a variable in the probe data, and it shows the accuracy in meters of the GPS recording. (e.g. maximum error size in meters = 200 m) (TeleNav, 2017). This filter could be used when the error information is available in the dataset; if it is not, this filter is omitted. 3.2.2.8 Shaking Probe Filter A probe is identified as “shaking” if the previous and next probes have a heading angle difference outside the maximum angle difference threshold. This filter removes the shaking probes from the trip data, and if the number of shaking probes is too large, the whole trip is removed. (maximum percentage of angle difference for shaking probes = 40%) (Telenav, 2017). This filter could be used when there is a heading; if there is no heading, this filter can be omitted. 3.2.3 An Examples The following examples (Figure 7 and Figure 8) show the before and after of the filtering. Figure 7 demonstrates how the filtering process solves the displacement probe problem by splilliting the trip to two trips. Figure 8 demonstrates how the time- difference problem is solved by splitting the trip into two trips. 36 before after Figure 7 An example of Telenav trip before and after filtering before after after Figure 8 An example of INRIX trip before and after filtering 3.3 Map Matching One problem with GPS data is that the data points are never precise enough, so they might be off the road. Also, we might not know the traversed links between two GPS reports. For example, in our proposed model, the observations that are used as input are link-based, however, what we collect from the data are GPS reports that are randomly reported. As such, we might have more than one GPS report in one link – which is very rare for low-frequency GPS data – or we might have only one GPS report 37 for several links. Therefore, it is important to find the traversed links between every two consecutive GPS reports. To address this, it is important to map these points on the road. This problem of matching measured latitude/longitude points to roads, along with the path between two points, is called map matching. There are a lot of studies about map matching and what follows provides a very brief literature review of this concept. These approaches include a geometric approach in which the algorithm tries to match a path that has a similar curve to observed locations (Greenfeld, 2002; Brakatsoulas et al., 2005; Kim &kim 2001). Another approach uses HMM and Kalman filter (Krumm et al., 2007; Newson & Krumm 2009). Some such approaches also include the uncertainty associated with the path; however, studies have found little benefit to using this in predicting travel time distributions on routes (Westgate et al., 2015; Woodard et al., 2017). In this study, an HMM-based map matching algorithm, according to the Newson and Krumm paper, is implemented (Newson and Krumm, 2009), and it is adapted in accordance with this study’s needs. What follows explains the Newson and Krumm HMM-based map matching proposed method, as well as assumptions and considerations in this study. 3.3.1 HMM-based Map Matching Hidden Markov Model (HMM) map matching finds the most likely point on the map for each set of latitude/longitude coordinates, as well as most likely links sequence between two consecutive latitude/longitude pairs. 38 3.3.1.1 Measurement Probabilities Due to GPS noise, GPS observations rarely point exactly to a link; but, it is important to have GPS points on the links. Relatedly, measurement probability is the likelihood that location observation denoted as ot, resulting from a given state li (a link) and based on that observation itself. Measurement probability is the likelihood that the observation O𝑡 would be observed if the vehicle were actually on road segment l𝑖. For a given ot and li, the closest point on the links is denoted as y𝑡, and | ot - y𝑡,𝑖| is the distance on the surface between the observed point and the candidate match on the map. Ideally, this calculated figure should be as low as possible (Newson and Krumm, 2009). 3.3.1.2 Transition Probabilities Once all the latitude/longitudes are mapped on the links, the next step involves finding the traveled links sequence between each consecutive latitude/longitude points pair. Transition probabilities are the probability of a vehicle traversing a set of links from the first point (i) to the second one (j). If the observation at that time is ot and the next observation is ot+1, and their candidate matches are y𝑡, and y𝑡+1, j respectively, the path distance would be | y𝑡, -y 𝑡+1,j |path, which is the network distance between y𝑡,𝑖 and y𝑡+1,j (sum of all the links’ lengths from the first point to the second point), which should be as small as possible. Based on the Newson and Krumm study (2009), the histogram of the absolute distance differences follows an exponential distribution shown in Equation1, where 𝛽 is the parameter to be estimated. = @A𝑝(𝑑 ) = 𝑒 B>: (1) > 39 𝑑: = ||𝑜: − 𝑜:F=| − |𝑦:,H − 𝑦:F=,I|| (2) 3.3.1.3 Optimum Match Newson and Krumm (2009) proposed using Viterbi algorithm to find the best point matches and path between each pair of consecutive points. The Viterbi algorithm finds the match that maximizes the product of the measurement probabilities and transition probabilities through dynamic programming (Newson and Krumm, 2009). 3.3.2 Map Matching Assumptions There are several assumptions inherent to the implemented map matching algorithm in this study and the used coding. The following are some of them. 1. When it comes to mapping each GPS latitude and longitude to the map, the priority is with intersections. Meaning the intersections are strong candidates to which we can map the GPS reports. 2. If there is more than one observation per link per vehicle, only one is considered in the modeling. Figure 9 shows the results of one of the map matchings for the study area. The map on the left shows the observations with the reported GPS points, and the map on the right shows the result of map matching and the path that the hidden Markov map matching algorithm suggests. 40 Figure 9 A Map matching result instance for Baltimore city 3.4 Graphical Model A graphical model is a probabilistic model in which conditional dependencies between random variables are demonstrated by a graph. The vertices of this graph represent the random variables, and the edges show the statistical dependence between the variables. Graphical models have been widely applied to different problems due to their intuitive representation, the natural way to define independency between variables, and the efficient approximate methods developed to perform the inference once the graph structure is defined. These models are successfully used in probability theory, machine learning, statistics, and, in particular, Bayesian statistics. There are several types of graphical models; Bayesian networks and Markov random fields are two of the widely used branches of graphical representations of distributions (Koller & Friedman, 2009). 41 In arterial traffic modeling, the spatiotemporal conditional dependencies of the network traffic could be modeled as a probabilistic graphical model. 3.4.1 Acyclic Graphical Models Bayesian Networks serve as graphical models with structures as directed acyclic graph. This model represents a factor of the joint probability of all random variables. Equation 3 is this joint probability. 𝑝[𝑌=, … , 𝑌M] = ∏HR= 𝑃(𝑌H|𝑝𝑎H) (3) Where pai is the set of parents of node Yi, and 𝑌=,… , 𝑌M are the events. In other words, the joint distribution factors become a product of conditional distributions. Once the problem is formulated by means of a Bayesian network, the inference problem should be solved. In general, the inference can be defined as the problem of evaluating the distribution over a set of random variables, given another set of random variables. There are three different main classes of inference for Bayesian networks: inferring unobserved variables, inferring parameters, and structure learning (Russell & Norvig 2003). The first two are what we need to work on in this study. While it is possible to perform inference in a naïve way by computing all required probability terms, in practice this tactic is inefficient. The exact inference, however, can be solved more efficiently by means of methods like belief propagation and factor graph propagation for singly connected graphs. Moreover, for more complex problems, when the exact inference becomes intractable, approximate methods such as loopy belief propagation can be applied to reduce the computational complexity. 42 In parameter learning, the inference variables of interest are the model parameters. There are, in general, two cases in which parameter estimation is considered: inference when all variables are observed, and inference when variables are partially observed. Approaches toward inferring the model parameters can be divided into Maximum Likelihood and Bayesian approaches. In the Maximum Likelihood approach, the parameters of the model are estimated using the Maximum Likelihood estimator, when all variables are observed. When some of the variables are hidden (i.e. unobserved) the Expectation Maximization paradigm can be applied to solve the problem. On the other hand, the Bayesian approach takes the prior distribution over the parameters into account. The prior distribution can be one of any number of different distributions, but Drichlet distribution is usually deployed due to its flexibility. This leads to updating Drichlet distributions when all variables are observed. When there are hidden variables in the model, methods such as the Markov Chain Monte Carlo (MCMC), Viterbi algorithm, or Variational Bayes(VB) methods can be applied. Variational Bayes methods are among the most efficient Bayesian inference techniques and, in terms of time complexity, they are similar to the Expectation Maximization approaches. 3.4.2 Model Assumptions Our base graphical model has the following assumptions: 1. The travel times of links are normally distributed and travel times on different links are independent of each other. 2. Discrete congestion states: a discrete value of s, indicates the level of congestion. This variable is not observable; it is a hidden variable. 43 3. Conditional independence of link travel times: conditioned on the state of a link, the travel time distribution of that link is independent of all other traffic variables. In words, link travel times are not correlated across links. This assumption is based on computational tractability. 4. Conditional independence of state transitions: conditioned on the states of the spatial neighbors of link l, at time t, the state of link l at time t + 1 is independent of all other current link states, all past link states, and all past travel time observations. This assumption means that each link is correlated with a few neighboring subsets of neighboring links, but remains independent of the rest of the network. 3.4.3 Implemented Graphical Model Arterial traffic has space and time components. The spatiotemporal conditional dependencies of arterial traffic are modeled using a probabilistic graphical model, or Dynamic Bayesian Network. Each node is a representation of a random variable, and each arrow represents a dependency. The hatched nodes represent the observed variables, and non-hatched nodes represent the hidden variables. In our graph, the discrete state of each link l and time period t is denoted Xl,t. Since the state of each link for all times is not observable, these variables are considered hidden variables and shown by non-hatched nodes. The observed travel time on link l and time period t is denoted as Yl,t, shown by hatched nodes. The index d is excluded from the variables according to assumption 1. The forward arrows show the spatial and temporal dependencies of the network. As shown in Figure 10, each Yl,t has the parent Xl, t which 44 is a representation of assumption 2 and means that the travel time distribution on each link is conditioned solely on its hidden state. Also, each state Xl,t, has parents of Xl,t-1, and Xpa(l),t-1which is same as assumption 4, meaning a state on a link at time period t, is solely conditioned on the neighboring links’ state at the previous time step. Later, this model is improved by including other information with adding observation nodes to the model. This system is assumed to be a Markov process with unobserved states that evolve over two time steps. It is also referred to as the Coupled Hidden Markov Model (CHMM). There are three probabilities that need to be estimated to utilize this model: • The initial probability of the state of congestions denoted πl,s: It is necessary to estimate the initial state probabilities for each link. • The discrete transition probability distribution functions, denoted Pl,t, which refer to the transition probability between the states. In other words, the element of line m and column s in the matrix of Pl,t (m; s), represents the probability of link l to be in state s at time t + 1 given that the neighbors of l are in state m at time t. • The distribution of travel time on a link given the state of that link, denoted gl,s,t. As mentioned before, the travel time on each link is dependent on the state of that link. The outcome of this model is prediction/estimation of links travel times at each time step. In the implemented model, the dependency among the neighboring links is considered as relationships between their traffic states. However, for tractability, this 45 study assumes the travel time distributions of the links are independent of each other given their traffic states. This independency of the links makes it possible to calculate the path travel time by summing over the estimated links travel time encountered in that path. Figure 10 Spatiotemporal model of arterial traffic evolution represented as a Graphical model. 3.4.4 Creation of the Network The network or Dynamic Acyclic Graph (DAG) of the roads is created by creating a graph of the street map. Once this graph is created, a variable is assigned to each direction of a link, in case the link runs in two directions. There are always two sets of 46 variables, one in time step t, and the same variable in the time step of t+1. Then, if a variable A (link and direction A) has a forward traffic direction to variable B (link and direction B), A in time step t would be the parent of B in the time step of t+1. 3.5 Data There are different datasets used in this study: vehicle trajectory GPS probe data from INRIX, road network data from Open Street Map (OSM), and weather data and severe weather data from different sources. What follows explains each of these datasets in more detail. 3.5.1 Study Area The study area at this stage is Baltimore, MD with more than 1,700 links and 1,250 intersections (Figure 11). The network information is extracted from Planet OSM. Planet OSM provides the data from the OSM project, which is normally updated every day (Planet OSM, 2016). Once the data is obtained, a graph is made from this area that notes network information, such as links, speed limit, and roadway type. In the graph, each edge is a link between two nodes (intersections), which means that an edge could be run forward, backward or in two directions, per the direction of traffic. The following section explains the OSM briefly. 47 Figure 11 Study area, downtown Baltimore 3.5.2 Open Street Map (OSM) Open Street Map is a map that features free geographic data for the world, and is created mostly by users (Open Street Map, 2016). Open Street Map represents the physical structure of the world with three main elements: nodes, ways, and relations. Elements have their specific tags which give the attributes of that element. Tags are to provide more information on a specific feature such as functional types of ways or nodes. A node is a core element of the OSM data which has latitude, longitude, and node identification. Any point feature, ranging from a building or a traffic light to a fountain or ATM, could be represented by a node. Tags can be used to provide more information on features, such as building names or the type of an intersection. Some of the most 48 important nodes in arterial travel time modeling are traffic signals, stop signs, crossings, bus-stops, motorway junctions, and turning circles. In the proposed model with intersections, this data is used. A way is another element of OSM data. It features an ordered list of nodes, and it can be open or closed. Closed ways are ways in which the first and last nodes are the same. Ways represent linear features such as roads, barriers, and railroads. Tags provide more information, such as the type of the road, the name of the road, etc. The most important ways in this study are roads that are named highways with type (tags) of motorway- link, secondary-link, primary-link, tertiary-link, trunk-link, trunk, motorway, primary, secondary, tertiary, unclassified, residential, service, road. A relation is also a core data element. Relations are used to represent logical or geographic relationships between features. Relations consist of an ordered list of nodes, ways, and/or relations. Relations can also show all of a road’s segments (Wiki Open Street Map). 3.5.3 INRIX Data The traffic data in this study covers the full months of February, June, July and October 2015. The data includes trip information as well as trip records waypoints for each trip and trip provider details. The trip data consists of a variety of information, including start/end location and time, trip provider, vehicle weight class, source of the probe data (embedded GPS, mobile device, unknown), and an indicator in the trip data that shows the mode of transport (walk, vehicle, and rail). The data include both passenger vehicle and truck data, the information in the “vehicle weight class” is used to choose only 49 passenger cars or light trucks data which are the majority of the trips (Markovic et al. 2017). The waypoints data includes a trip identifier and sequence Id which notes the order of the waypoint within the trip starting with "1" and increases by one unit along with the latitude and longitude and their corresponding timestamps. The data consists of more than one billion waypoints (around 112 GB) for the available area and more than 20 million waypoints (around one GB) for the studied area in Baltimore. It should be noted that the vehicle trajectory data is different from the Traffic Message Channel (TMC)-based Vehicle Probe Project (VPP) data provided by RITIS (2017). RITIS has TMC-based probe data for the arterial network which are provided by three vendors: INRIX, HERE, and TomTom. In fact, these providers use their own vehicle trajectory data to provide the TMC-based travel time for the arterial network. The quality of these arterial TMC-based probe data is assessed by Sharifi et al. (2017). 3.5.4 Weather Data One type of weather data used in this study is “Quality Controlled Local Climatological Data (QCLCD),” which is high-quality data consisting of hourly, daily, and monthly summaries for more than 1,600 U.S. locations. This data is available from January 1, 2005, to the current date as of this writing (NOAA, 2015). Other weather data used in this study is from the Global Historical Climatology Network (GHCN), which collects information from stations in its database but does not offer as much quality control as the QCLCD. For those stations that are missing data due to times in which they were not functioning or are turned off, the GHCN is used (NOAA Climate, 2015). The last dataset used for this study is the adverse weather data set obtained from Storm Events 50 Database (NOAA Storm Events, 2015). The Storm Events Database contains data for storms and other significant weather phenomena, as well as significant meteorological events from 1950 to 2016 and features the year discussed in this study. 3.6 Summary This section provided a comprehensive framework for the data cleaning process, which includes data splitting and trip filtering. The splitting process is more critical when the unique identifier is a device rather than a trip. The splitting process for a probe set with a unique identifier is based on the time and location and the waypoint sequence (if available). The filtering includes eight different filters and should be applied to each split trip. The filters remove the outliers or the whole trip. These filters deal with idle probes, speeds lower or higher than the possible speed, probes having the same time or same location as each other, very small-length trips, probes for trips with sudden shifts in the location, large errors, and probes with sudden or frequent changes in the heading (if the heading is available). Once the data are cleaned, they need to be mapped to the roads. There are different map matching algorithms to do that. Map matching algorithms first find each GPS point on the map (measurement) and then find the traversed links between two GPS points (transition). HMM-based Map matching was used in this study. In HMM-based map matching, the Viterbi algorithm is used to select the mapped point and the traversed link alternative that maximizes the product of the measurement probabilities and transition probabilities. 51 Once the inputs are ready, the structure of the city should be created based on the graphical model assumptions. In the graphical model, each link (the roads between two intersections) is a variable, and the goal is to find the travel time distribution for each of the links (variables) at each time step. The traffic state is hidden variable, and the assumption is that travel time on each link is dependent on this state. However, in the arterial network roads, travel times are dependent on each other, and this dependency is reflected in the states; that is, each link state at time t+1 is dependent on the neighboring links at time t. The network of the roads and the adjacency of the links (to get the neighboring links) are created and the initial graph is based on them. The data used for this study are: geographic map, which is accessed via open-source Open Street Map (OSM), the GPS trajectory probe data which is obtained from INRIX and Telenav (used only for cleaning procedure), and the weather data which was obtained from different data sources provided by NOAH. 52 Chapter 4: Graphical Model In recent years, thanks in part to greater availability of data, there have been efforts to conduct probabilistic modeling of arterial travel time. It is, however, rare for any such models to consider the effects of external variables such as seasons or weather conditions. These variables are shown to be influential in both demand or driving behavior as well as travel time in freeways. Still, they are rarely considered predictive variables in arterial travel time modeling. This chapter focuses on developing structures for enhanced graphical models, including all variables includign linkes and states variables as well as external variables into the modeling. This chapter outlines each step of the modeling process, including travel time allocation and conducting inference using an approximate inference algorithm. Detailed algorithms for each step of the modeling are provided in the next chapter. 4.1 The External Variables. 4.1.1 Weather Conditions Weather conditions affect travel time on both freeways and arterials network. Studies also show that weather conditions significantly affect traffic flow, demand, and travel behavior, and consequently travel time and its variations (Qiao et al., 2012; Huang & Ran, 2003; Tsirigotis et al., 2011). Qiao et al. (2012) proposed a K nearest neighbor- integrated model and incorporated the effects of weather in the freeways traffic modeling using Bluetooth data. The team demonstrated that their model outperforms 53 the ARIMA and KNN models in rainy weather conditions. Hranac et al. (2006) proposed weather condition adjustment factors that are multiplied by the base clear- condition variables in order to incorporate weather condition information in traffic volume estimations. Most of these studies have been conducted for freeways, and there are rarely studies that incorporate weather conditions in arterial travel time estimation. Jenelius and Koutsopoulos (2013) considered weather conditions as an important variable in their modeling, however, their model does not provide real-time travel time estimation/prediction model. Several types of weather information can be used to study weather impacts on traffic. The most frequent are precipitation type, precipitation intensity, visibility, average wind speed, and adverse weather. Consideration of these variables relies on data availability, and incorporation of more conditions requires the availability of data pertaining to each condition. In this study, weather conditions are incorporated in real-time arterial travel time estimation/prediction. An external variable for weather conditions is added into the Dynamic Bayesian Network. The considered variables are precipitation intensity, visibility, and average wind speed. 4.1.2 Seasons There are seasonal variations in demand that can affect travel times on freeways and arterials. In general, there is a seasonal pattern in traffic demand and travel time; as such, including the season as a variable factor could capture this pattern in a given 54 model. A lot of studies tried to use time series analysis or introducing a seasonality coefficient in the model such as, for example, Seasonal Auto Regressive Integrated Moving Average (SARIMA) to capture the seasonality effects for the travel time prediction. Most of these studies have been done more on freeway travel time modeling (Williams et al., 1998; Peng et al. ,2014) and less on arterial travel time modeling (Khoei et al., 2013). In this study, we utilize four months of data, February in winter, June and July in the summer and October in the Fall. Thus, we have at least one month representing each season (except for spring). The summer season has the highest data and number of observations available since it has two months of data. 4.1.3 Day of the Week Another time-varying feature is the day of the week, as it plays a key role in travel time modeling. Traffic demand and, consequently travel time varies depending on the day of the week. By including this feature, we are able to capture traffic patterns specific to each of the days of the week. This feature is critical to traffic estimation/prediction modeling. In this study, a variable for weekdays is considered in the modeling, and holidays are excluded from the dataset. 4.1.4 Time of Day Another feature that is key to travel time modeling is the time of day. The traffic demand – and even travel behavior and signal phasing – could be different depending on the time of day. Time of day could be categorized as morning/AM peak, 55 afternoon/PM peak, nighttime post-afternoon peak and before morning peak, and daytime between morning peak and afternoon peak. In this study, a variable, showing time of day, is added to the model in order to capture all of the aforementioned values. In general, time variables could capture many of the variations in travel time patterns. 4.1.5 The External Variables As mentioned before, travel behavior and demand –and, relatedly, travel time distributions– vary based on the days of week and holidays, as well as different seasons, different time of days, and different weather conditions. As such, these variables are incorporated into the model. By including these variables in the proposed model, we are able to provide the links’ travel time distributions in various combinations of these conditions. What follows are the variables and their notations. 1. Variables for the summer (June–July), winter (February), and fall (October) seasons. 2. A variable for weather conditions. 3. A variable for each weekday from Monday to Friday (holiday excluded). 4. A variable for each time of day: morning peak, evening peak, between morning and evening peak, and off-peak nighttime. 56 As is discussed in the following section, all these variables are the parents of the link travel times in the dynamic Bayesian network. This means that, with each set of factors, we would have a different distribution for all of the links of the network. 4.2 Graphical Model with External Variables The discussed graphical model depicted in section 3.4.3 can be generalized not only to capture the relationships between link states and observations, but also to take additional information into account. There are additional external variables that can affect the travel time distributions of the links. As an example, we can expect to have different conditional distributions for a single like under different weather conditions (e.g., sunny or rainy). One of the advantages of modeling the travel time distributions using a graphical model is the flexibility to represent such conditional dependencies. We introduce additional global variables to explicitly represent such information in the proposed model, where it is available. These external variables allow the model to determine different travel time distributions based on the provided additional distributions (e.g., estimating different Gaussian distributions for rainy and sunny weather conditions). Figure 12 represents the final deployed model. In this figure, 𝑋TUV:WUX,: represents the random variable introduced for capturing the weather information at time step 𝑡 . Considering the available data sources, in this dissertation, we consider two states for the weather; that is 𝑋TUV:WUX,:is a binary random variable, which shows whether in time 𝑡 the condition was rainy or not. We assume that the random variable comes from a 57 multinomial distribution with two states. In this study, we assume 𝑋TUV:WUX,: is observed for all time steps. The travel time distribution for different links and time steps also varies based on the day and season of the year. To model this distribution drift, we introduce two new random variables 𝑋YUVYZM and 𝑋@V[ , which denote the season of the year and the day of the week. 𝑋YUVYZM is modeled by a multinomial distribution with four discrete states to represent each season of the year. 𝑋@V[ is modeled as another multinomial random variable and each discrete variable represents a day of the week. These variables are observed both during training and inference. In this way, the travel time distribution for each link not only depends on the time step, but it also depends on the weather condition, the day of the week, and the season of the year 58 !"# $ ,&'( !"# $ ,& !"# $ ,&)( !"# $ ,&'( !"# $ ,& !"# $ ,&)( !$,&'( !$,& !$,&)( !+,# ,&'( !+,# ,& !+,# ,&)( *$, &'( *$, & *$, &)( !-#. !/,#/01 Time t-1 Time t Time t+1 Figure 12 The graphical model with the proposed external variables 4.3 Travel Time Allocation Once the traversed links between two consecutive GPS points are estimated from map matching, the next step is to allocate the total path travel times between the links. This can be achieved by maximizing the log-likelihood of the link travel times for each observation using equation 4. This optimization could be calculated given the model parameters. This means that the probability of link l being in state s, and the parameters for travel time distribution of link l for state s should be known; however, we do not have this data at this stage. As such, this represents a chicken-or-egg dilemma. The optimization has two constraints, shown in equations 5 and 6. First, the sum of travel 59 times on the links should be equal to the path travel time (travel time between consecutive points). Second, the travel time on the links has an upper bound computed for each link by using the maximum speed that is realistically possible for the link; this maximum speed is 60 mph over the speed limit. 𝐴𝑟𝑔𝑚𝑎𝑥[ a∑e∈gV:W 𝑙𝑛 ∑YYRh 𝑋e,Y𝑔e,Y(𝑦e)i (4) 𝑆. 𝑡. ∑e∈gV:W 𝑦e = 𝑇gV:W (5) 𝑆. 𝑡. 𝑦e > 𝑦e nVo (6) The travel time is allocated based on the speed limit and length of the links, and then, parameters of the model are estimated for the first stream. For the second stream, the estimated parameters are used to solve the optimization. The optimization and travel time allocation are updated with each stream and the iterations between travel time allocation and parameters estimation continue, as long as we have a new stream of data. The number of variables in the optimization problem in equation 4 is equal to the number of links for the path between two consecutive GPS measurements; this number is always a relatively small number, and it makes the optimization problem easy to solve using numerical methods. 60 4.4 Parameter Estimation 4.4.1 The Inference To update the parameters of the model related to each of the defined hidden states, one faces a chicken or egg dilemma. To update the parameters related to each of the states, the samples – which are more likely from that state – should be determined. However, in order to determine these likely samples, the distribution parameters of the model should be known. The most frequently used approach to tackle this problem in machine-learning is the Expectation-Maximization (EM) approach. EM consists of two stages. In the first stage, known as “expectation,” the most likely state for each sample is determined based on the current estimated parameters of the model. The second stage, dubbed “maximization,”' chooses the best parameters of the model given the data and the estimated hidden state from the expectation stage. These two stages are repeatedly applied until a convergence criterion is met. Although theoretically, the EM algorithm is not guaranteed to converge to the global optima of the problem, empirically, the algorithm leads to satisfactory results and is one of the most widely used algorithms in the machine-learning community. EM provides a practical solution for the aforementioned chicken-or-egg problem. However, it is not a truly Bayesian approach, and it follows a frequentist view in the expectation stage. In other words, during the expectation stage, it only computes a point estimate for the hidden variables (i.e. the most probable value). Another way of solving the problem is to stick with the Bayesian approach, even during the expectation stage, 61 to consider the full posterior distribution for describing the current estimate for hidden states. In the proposed approach, we consider a generalization of the EM algorithm. For each batch, given the seen variables, we use the Variational Bayes approximation inference to estimate the posterior distribution of all variables (i.e. finding a set of parameters for the posterior distribution which describes the data best). For efficient inference, all probability distributions governing the variables are assumed to be a part of the exponential family. In other words, all the distributions in our model can be formulated as: 𝑓q(𝑥|𝜃) = ℎ(𝑥)exp (𝛾(𝜃)𝑆(𝑥) − 𝜆(𝜃)) (7) Where h(.) is a function of x, and γ(.) and λ(.) are only functions of θ. S (.) is a function of x, known as sufficient statistic, which contains all sufficient information for determining the posterior distribution. By these assumptions and by considering conjugate priors, one can calculate the posterior distribution in closed form which makes the inference very efficient. After estimating the posterior of the variables in closed form, we update the parameters of the model to match the posterior using the Streaming Variational Bayes and iterate these steps through batches (sets of trips) till we visit all batch samples. As described earlier, the SVB is used for making the inference, and the training model involves iteration between the travel time allocation for a new stream and updating the model and the parameters. The proposed algorithm iterates between 62 finding the most likely state of the network given the model parameters and then uses those state estimates to update the most likely model parameters. 4.4.2 Streaming Variational Bayes (SVB) Bayesian inference is time-consuming, especially when it is used for complex models. To this end, approximate inference algorithms are usually deployed for real-world applications. There are different approaches toward the approximate training of Bayesian Networks, including Monte Carlo Markov Chain (MCMC), and Variational Inference techniques. MCMC has been used for decades as the primary technique for posterior approximation. MCMC forms a Markov Chain on the posterior variable and samples from the created chain before finally approximating the posterior with empirical estimates computed from the generated samples. However, MCMC becomes less usable when the models become more complex or the scale of the data at hand becomes large. Variational Inference methods on the other hand use optimization to find the approximate posterior. In these approaches, a family of distributions is considered to approximate the posterior. Then, an optimization problem is solved to find a specific distribution from the family which has the minimum KL-divergence with the true posterior. Since the approximate is chosen from a pre-defined family of distributions, this family should be defined to be as flexible as necessary. The travel time estimation for arterial networks not only leads to complex inference models, but it also adds another degree of difficulty to the problem. GPS data from users continuously adds live traffic information. Thus, an effective method should not only be able to approximate the posterior distribution effectively, but it should also 63 change the solution adaptively with minimal computational cost. This incremental approach to the problem requires an online learning paradigm for updating the parameters of the posterior distribution. To address these challenges, we used inference module on a family of approximate inference methods called Streaming Variational Bayes. Streaming Variational Bayes deploy the Bayes rule to convert a conventional Variational Bayes method into an incremental approach. This approach is covered with more detail in the remainder of this chapter. Assume data is added to the system in i.i.d. batches 𝐵=, 𝐵x, … , 𝐵y where 𝐵H contains the samples in batch “𝑖”. Also, let 𝛩 represent the parameters of the model. Assume that batches 𝐵=, … , 𝐵{|= have been seen previously and the new batch 𝐵{ is added. To have an effective incremental strategy, the parameters estimated from 𝐵=, … , 𝐵{|= should be updated efficiently given the newly added batch 𝐵{. This can be achieved by directly applying the Bayesian rule. According to the Bayes rule, one can derive the following recursive formula: 𝑝(𝛩|𝐵=, … , 𝐵{) ∝ 𝑝(𝐵{|𝛩)𝑝(𝛩|𝐵=,… , 𝐵{|=) (8) That is the posterior distribution of the parameters 𝛩 given batches 𝐵= …𝐵{ is proportional to the likelihood of the batch 𝐵{ alone and the posterior of the parameters given all batches that came prior to batch 𝐵{. This formula is the base idea behind the approximate Streaming Variational Bayes technique. Let 𝑞(𝛩) = 𝑉𝐵(𝐵, 𝑝(𝛩)) be the naive approximate Variational Bayes 64 inference method which computes the posterior over the parameters of the model 𝛩, given a mini-batch 𝐵 and a prior over the parameters of the model 𝑝(𝛩). Now, let 𝑞€(𝛩) represent the posterior estimates after seeing mini-batches 𝐵= …𝐵€ i.e. 𝑞€(𝛩) = 𝑝(𝛩|𝐵= …𝐵€). From equation(8), we can derive a recursive update for computing 𝑞€(𝛩) based on our estimate from batches 𝐵= …𝐵€|= i.e. 𝑞€|=(𝛩). More precisely, the streaming update can be written as follows: 𝑞€(𝛩) = 𝑉𝐵(𝐵€, 𝑞€|=(𝛩)) (9) This recursive nature makes efficient approximate Streaming Variational Bayes method applicable to the travel time estimation problem. Moreover, it is possible to compute the actual posterior distribution in a parallel fashion. This is achieved by reconsidering the Bayes formula. Assuming mini-batches are i.i.d., we have: 𝑝(𝛩|𝐵=, … , 𝐵y) ∝ 𝑝(𝐵=,… , 𝐵y|𝛩)𝑝(𝛩) =𝑝(𝐵H|𝛩)𝑝(𝛩) H 𝑝(𝛩|𝐵 ) = [ H𝑝(𝛩) ] × 𝑝(𝛩) (10) H That is, the problem of computing the posterior given all batches is converted to computing the posterior given each of the batches independently and then combining the resulting information with Equation (10). Each independent batch update can be computed in parallel (Broderick et al., 2013). 65 Considering Equation (10), a similar formula can be derived when an approximate inference method (e.g. Variational Bayes) is used. This approximate update is as follows: 𝑃(𝛩|𝐵= …𝐵€) ≈ 𝑞 …†(†‡,g(ˆ)) €(𝛩) ∝ [ ∏H ] × 𝑝(𝛩) (11)g(ˆ) That is the final approximate posterior and can be computed by distributing and computing the approximate posterior for each batch independently and then combining the results. This leads to an efficient incremental update which is essential for the problem at hand. 4.5 Summary This chapter proposed a graphical model in which external variables such as weather conditions, season, the day of the week, and time of day are considered in the modeling. Each of these variables is shown to be influential in travel time in general. In this study, these variables are included in the graphical model as an observed variable and as the parent of all the links. This means we would have different travel time distributions for each combination of the mentioned variables. The idea is that, by incorporating more of the influential variables, we would yield a better prediction both in general as well as in those scenarios. The next chapter demonstrates the model performance in most of these scenarios. The process of inferring the links ravel-time distribution consists of two steps, travel time allocation, and parameter inference. For travel time allocation, an optimization model is used to allocate travel time based on the previous model parameters; this refers 66 to the probability of the link to be in each congestion state and the distribution parameters associated with that state. For the inference of the parameters based on the allocated travel time, this study proposes to use the Streaming Variational Bayes to do approximate Bayesian learning. This offers an advantageous possibility of updating the model with streaming data, and it is less computationally intensive. The next chapter demonstrates the validation of the proposed model and further analysis of the considered variables. 67 Chapter 5: Analysis In Chapter 4 we explained the proposed graphical model with the external variables. Any estimation/prediction model should be validated and tested to see how well the model is performing. Thus, we apply and test the proposed model on the INRIX probe data for a case study, and demonstrate and discuss the results in this chapter. This chapter demonstrates the performance of the graphical model with no external variables and compares the result with the weighted average model. Then, the performance of the proposed model with external variables is illustrated for each day of the week, and for each day of the week in each season, and finally in bad weather conditions. For each of these scenarios, a representative test day is chosen and the trained model is tested and the performance of the model is illustrated in each scenario and compared with other models. Finally, the chapter ends with two proposed modification of the model which were not possible to do in this study, due to lack of enough data. However, the detailed structure of the variables and how they should be included in the model is demonstrated and discussed. 5.1 Case Study As mentioned earlier, the case study is a subnetwork of Baltimore city. The network consists of 1,700 links and 1,250 intersections. The test data are for PM weekday peak hours, 4 PM to 6 PM. The considered discrete time in this study is five minutes. Two 68 models (proposed model and base model) are developed for PM peak each day of the week, and the results are shown as follows. Figure 13 demonstrates the available vehicle GPS trajectory data in the Baltimore county for a Monday PM peak. As is evident, both the important freeways and downtown Baltimore have the highest number of observations. For the purpose of this study and its focus on arterial travel time modeling, downtown Baltimore was chosen as the study area. Observations 0 - 15 15 - 64 64 - 168 168 - 390 390 - 988 Figure 13 The available data on the area Figure 14 demonstrates the number of observations for the Baltimore downtown, and again for a Monday PM peak. Some of the roads have quite a considerable number of observations. However, there are some roads for which there is not any observation at all. © OpenStreetMap (and) contributors, CC-BY-SA 69 Observations 0 - 1 1 - 32 32 - 67 67 - 116 © OpenStreetMap (and) contributors, CC-BY-SA Figure 14 The available data for the case study area 5.1.1 Trip Summaries The total number of vehicle trips used for the training purpose is 63,419 on the studied road network. Table 1 demonstrates the volume of trips used for the training purpose by days of the week and by available month. As expected, the number of trips for the weekend is much lower than the number of trips for the weekdays since the number of trips is less on weekends. The demonstrated data is for PM peak hour (4-6 PM). Table 2 demonstrates the vehicle GPS points for the training dataset. In total, more than 1 million GPS points are used to train the proposed model. 70 Table 1 Summary of trips used for each month and each day of the week in the training data set Time Monday Tuesday Wednesday Thursday Friday Saturday Sunday Total June 2,782 3,254 2,507 2,529 2,705 861 802 15,440 July 2,114 2,328 2,765 2,824 2,544 857 545 13,977 Feb 2,502 2,491 2,831 2,797 3,059 1,425 970 16,075 Oct 2,396 2,574 2,673 3,529 3,486 1,980 1,289 17,927 Total 9,794 10,647 10,776 11,679 11,794 5,123 3,606 63,419 Table 2 Summary of GPS points used for each month and each day of the week in the training data set Time Monday Tuesday Wednesday Thursday Friday Saturday Sunday Total June 13,510 15,102 13,434 12,537 13,978 4,246 4,047 76,854 July 9,439 11,157 13,958 13,579 12,186 4,147 2,927 67,393 Feb 52,435 52,905 80,730 66,830 89,958 70,740 53,534 467,132 Oct 67,916 77,088 76,907 111,626 113,058 106,314 70,217 623,126 Total 143,300 156,252 185,029 204,572 229,180 185,447 130,725 1,234,505 5.2 Developed Models 5.2.1 The Initial Developed Graphical Model For the graphical model, we split the data into two data sets. One is used as a real-time data set and the other is used as the test data set to determine the performance measures. The model uses the average of observed travel time on each link as the current observation for the corresponding variable and predicts the travel time for the next time 71 step using the graphical model shown in the previous chapter. The test data set is used to measure the performance of the model and percentage of error for this model. To estimate the travel time at each discrete time step, our model performs an approximate probabilistic inference on hidden traffic steps shown in equation 12. 𝑠e,: = 𝑎𝑟𝑔𝑚𝑎𝑥 𝑃(𝑥 = 𝑆∗Š e,: |𝐷h...: ,𝑀) (12) Where, 𝑠e,: is the state of link l at time t to be estimated, 𝑥e,: is hidden state variable and S* is the state that makes the probability the highest. 𝐷h...:, 𝑀, is the learned model and real-time data. Then the Gaussian distribution of that state is used for the estimation is shown below. 𝐺(𝜇Y,Ae , 𝜎 Y,A e ) (13) In the next sections, this model is improved by all the external variables, the algorithm of training procedure with all the external variables is demonstrated in Algorithm 1. 5.2.2 Baseline Model For the baseline model, we also split the test data into two sets, the real-time data and the test data. The baseline model uses the real-time average travel for those links that are seen during real-time data. Considering the real-time average or regression as the baseline for comparison is a common practice in the literature (Herring, 2009; Hofleitner et al., 2012 a; Woodard et at., 2017; Hofleitner et al., 2012 b). Here, the baseline model uses the historical prior for those links that are not seen in real-time 72 data. The test data is used for measuring the performance of the model and the percentage of error. 5.2.3 Test As mentioned before, the test data set is used for testing both models. The model can also be tested by considering the Bluetooth sensors data as the ground truth travel time as proposed by Haghani et al. (2010), for those links with available Bluetooth data. The purpose of the test is to estimate the performance of both models and calculate the percentage of error for both. To do so, the existing trips GPS point and their timestamps in the test data set are used as the ground truth. Both models are used to predict the travel time for same trips using the explained procedures. The travel time for each time is estimated using equation 14. As mentioned before, this study assumes the independency between the links travel time, so, the path travel time in the proposed model is the sum of traversed links travel time in the path which can be computed using equation 14. Y TTgV:WH = ∑ ‡,A‡ H 𝜇e {𝑙 … 𝑙 } ∈ 𝑃𝑎𝑡ℎ (14) ‡ = y H The average duration of test trips is above 10 minutes and with different average trip lengths for each day of the week. The result of the test is demonstrated in the next section. 73 5.3 Validation of the Proposed Model 5.3.1 The Model without any Variables We implemented the model in the case study and compared the results to the base model. At first, we used all the data, without considering the days of the week and tested the model for the test data which is also from all days of the week. The performance of our model is the same as the base model, and it’s very low. The reason could be attributed to the fact that when the days of the week are not considered in modeling, the variation is so high that maybe just using the average real-time data could be the best approach for the estimation/prediction. The average trip duration is 1068 seconds (18 minutes). Table 3 demonstrates the baseline error and proposed model error in seconds, as well as the percentage of errors for both models. Table 3 Accuracy of predictions for all the weekdays All weekdays Average test trip 1067.0 Proposed model error (s) 399.0 Baseline (s) 397.2 Proposed model error % 0.37 Baseline error % 0.37 74 5.3.2 The Model with Day of the Week We also provided a model for each of the weekdays. The average test trip duration varies on different days, with an average of 1040 seconds (17 minutes). Table 4 demonstrates the average trip, the average error in seconds for both the baseline model and the model proposed in this study, and the percentage of error for both baseline and the model proposed in this study by days of the week. The best performance is the Monday model. Thursday and Friday have the highest percentage of errors. Table 4 Accuracy of predictions based on days of the week for the Baltimore test data Day of the week Monday Tuesday Wednesday Thursday Friday Average test trip 974.4 978.0 1347.0 1060.5 849.7 Proposed Model error (s) 210.8 248.4 330.3 279.2 221.0 Baseline error (s) 351.9 490.6 559.1 515.9 446.9 Proposed Model error % 21.6 25.4 24.5 26.3 26.0 Baseline % 36.1 50.2 41.5 48.7 52.6 Figure 15 shows the comparison between the base model and the model proposed in this study. As it is shown, there is a huge improvement in the percentage of errors when using the model proposed in this study. It should be noted that both baseline model and our model approximately have the same trend. This means that for those days on which the model proposed in this study has a lower performance and a higher percentage of 75 error, the baseline model also has a lower performance. This could be due to demand and variation in demand for that day of the week. Figure 15 Comparison of prediction error between proposed model and the baseline by days of the week Even though the model is developed in a way that it can handle all times of the day, since peak hour travel time prediction and validation is the most challenging one, here just the peak hours results are demonstrated. 5.3.3 Travel Time Distribution To explain the proposed model, one example of link travel time is demonstrated as follows. Figure 16 shows the location of the example link for which a histogram is created from raw travel time data. The example link is, E Pratt St. in downtown 76 Baltimore and the length is 223 feet. Figure 17 shows the travel time frequency for the example link (link 13), for the Tuesday training data. The histogram of the travel time on this link shows two main peaks, one around 19 seconds and one around 43 seconds. One could identify two distributions with two means. The proposed model for the same link and same days also yields two distributions, one for uncongested condition (s = 0) with the mean of 21.47 seconds, and one for congestion condition (s=1) with the mean of 45.32, which is very close to what is observed. Figure 16 The example link in downtown Baltimore 77 Figure 17 The travel time histogram for link 13 5.4 Model Improvements 5.4.1 The Model with All of the External Variables In this section, we tested the model for each day of the week and each season. The developed model has the potential of being tested on any combinations of all of the scenarios (time of day, day of week, season, and weather); however, here the results of the analysis for one time of day of (PM peak) and three seasons and five weekdays (From Monday-Friday) is demonstrated. This means that we selected 15 (five weekdays* and three seasons) test days to evaluate the proposed model vs. based model improved by the regression model. Algorithm 1 illustrates the step-by-step procedure for the training of the proposed model. In this algorithm, 𝕃 represents the set of all links in the network, 𝕋 represents the set of time-steps, and 𝕄H denotes the set of reported GPS information for the 𝑖 ’th training trip. 𝑋e,:Uo: represents the observed external 78 variables for time step 𝑡 and link 𝑙 in the network. 𝑋H,I˜™š is the 𝑗 ’th reported GPS information for the 𝑖’th training trip. We denote the traffic state of link 𝑙 in time step 𝑡 by random variable 𝑆e,:. 𝒩(𝑙) represents the set of neighboring links connected to link 𝑙. 𝜇eY, 𝜎eY repsents the mean and standard deviation of the normal distribution over link 𝑙 conditioned on traffic condition 𝑠. An example of predicted traffic states and travel time associated with those states is also demonstrated as follows. Figure 18 demonstrates the predicted level of congestion for downtown Baltimore on October 26th on Monday at 4:50 PM. The green color demonstrates the level of congestion “one” and the red color demonstrates the level of congestion “two”. It should be noted that the level of congestion “one” is not necessarily uncongested and level of congestion “two” is not necessarily congested. They could be congested and more congested since both belong to the same time bin of PM peak and they are just comparable to each other in that time bin, meaning we can say red is more congested than green. 79 Algorithm 1. The training procedure. Inputs: - 𝑋e,:Uo: ∀𝑙 ∈ 𝕃, 𝑡 ∈ 𝕋 observed external variables for each time step 𝑡 and each link 𝑙. - 𝑋H,I˜™š ∀𝑖 ∈ ℕ, 𝑗 ∈ 𝕄H the time of 𝑗Ÿ𝑡ℎ reported GPS data for trip 𝑖 in the training. Outputs: - 𝑃a𝑆e,: = 𝑠 𝑆𝒩(e),:|=i ∀𝑙 ∈ 𝕃, 𝑠 ∈ 𝕋 the probability that link 𝑙 has the discrete state 𝑠 in time step 𝑡 given the state of its neighbors in time step 𝑡 − 1. - 𝜇eY, 𝜎eY ∀𝑙 ∈ 𝕃, 𝑠 ∈ 𝕊 the mean and standard deviations of the Gaussian distribution for link 𝑙 for each discrete state 𝑠. 1: while not converged do Ø CREATE A BATCH BY SAMPLING THE TRIPS 2: 𝔹 ← 𝑠𝑎𝑚𝑝𝑙𝑒(1…ℕ) Ø ALLOCATE TRAVEL TIMES BY SOLVING THE OPTIMIZATION 3: 𝑋H,e,:˜™š ← 𝑎𝑙𝑙𝑜𝑐𝑎𝑡𝑒(¥𝑋 H,I e,: ˜™š¦ , ¥𝑃a𝑆 = 𝑠 𝑆𝒩 (e),:|=i¦ e∈𝕃,:∈𝕋) H∈𝔹,I∈𝕄‡ Ø EXPECTATION STEP: INFER THE SET OF HIDDEN STATES USING VARIATIONAL MESSAGE PASSING 4: ℋe,: ← 𝑉𝑀𝑃(𝑋H,e,: , 𝑋e,: , 𝜇e , 𝜎e, 𝑃a𝑆e,: = 𝑠 𝑆𝒩(e),:|=˜™š Uo: Y Y i ) Ø MAXIMIZATION STEP: PERFORM THE MAXIMIZATION STEP BY APPLYING STREAMING VARIATIONAL BAYES 5: 𝜇e , 𝜎e, 𝑃a𝑆e,: = 𝑠 𝑆𝒩(e),:|=i ← 𝑆𝑉𝐵(ℋ𝑙,𝑡 , 𝑋𝑖,𝑙,𝑡 , 𝑋e,: , 𝜇e , 𝜎e, 𝑃a𝑆e,: = 𝑠 𝑆𝒩(e),:|=Y Y 𝐺𝑃𝑆 Uo: Y Y i) 6: end while 7: return (𝜇e e e,: 𝒩(e),:|=Y, 𝜎Y , 𝑃a𝑆 = 𝑠 𝑆 i) ∀𝑙 ∈ 𝕃, 𝑠 ∈ 𝕊 80 Figure 18 Real-time level of congestion subnetwork of Baltimore downtown. Figure 19 demonstrates the predicted travel time over travel time with the speed limit. Green is for values equal or above 0.7 meaning the predicted travel time is close to the travel time driving at the speed limit. The black color indicates the relative travel time below 0.2 meaning the predicted travel time is one-fifth of the travel time at the speed limit. The red and orange are the relative travel time in between two mentioned numbers. 81 Figure 19 Real-time relative travel time of Baltimore downtown network. What follows explains how the base model improved by the regression model. Also, an Algorithm is dedicated to demonstrate steps of developing this base model. 5.4.1 Regression Model In order to improve the base model a regression model is developed and the proposed model is compared to the improved base model. The implemented regression model has two different sets of explanatory variables. Trip variables such as weather conditions, time of day, day of the week, and season, as well as link explanatory variables such as the speed limit, the road type, and the length of the link between the two intersections. The goal is to capture the underlying factors behind spatial and 82 temporal variations in the travel time. The real-time prediction in each time step would be a combination of the regression prediction and the real-time observations. Equation 15 shows the developed regression model and Algorithm 2 demonstrates the detailed procedure from travel time allocation to updating the travel time prediction for each link using real-time data and the regression model. 𝑌e,: = 𝑊e e e,: e,: e,:Uo𝑋Uo + 𝑊:XHg𝑋:XHg + 𝑏 (15) In this algorithm, 𝕃 𝑎𝑛𝑑 𝕋 represent the set of all links in the network, and the set of time-steps respectively. 𝑋eUo is the set of explanatory variables for link 𝑙, 𝑋 e,: :XHg is the set of trip variables for link 𝑙 in time-step 𝑡, 𝑋e,:Z€Y is the set of real-time observations for link 𝑙 in time-step 𝑡, and 𝑌e,: is the travel time for link 𝑙 in time-step 𝑡. We represent the size of a set by |. |. 83 Algorithm 2. The regression model algorithm Inputs: - 𝑋eUo ∀𝑙 ∈ 𝕃, 𝑡 ∈ 𝕋 explanatory variables for each link. - 𝑋e,::XHg ∀𝑙 ∈ 𝕃, 𝑡 ∈ 𝕋 trip variables for each link at each link 𝑙 and time 𝑡. - 𝑋e,:Z€Y ∀𝑙 ∈ 𝕃, 𝑡 ∈ 𝕋 real-time observations for each link at each link 𝑙 and time 𝑡. Outputs: - 𝑌e,: ∀𝑙 ∈ 𝕃, 𝑡 ∈ 𝕋 travel time for each link 𝑙 and each time step 𝑡. 1: for each 𝑙 ∈ 𝕃 do 2: for each 𝑡 ∈ 𝕋 do 3: Find 𝑊e , 𝑊e,:Uo :XHg , and 𝑏e,: by solving the regression given training samples 4: end for 5: end for 6: for each 𝑙 ∈ 𝕃 do 7: for each 𝑡 ∈ 𝕋 do 8: 𝑛Z€Y ← |𝑋 e,: Z€Y| 9: if 𝑛Z€Y = 0 then 10: 𝑌e,: ← 𝑊e 𝑋e +𝑊e,: 𝑋e,: + 𝑏e,:Uo Uo :XHg :XHg 11: else: 12: 𝑌e,: ← = ∑Mª«¬ e,: Mª«¬ MR= 𝑋Z€Y[𝑛] 13: end if 14: end for 15: end for 16: return 𝑌e,: ∀ 𝑙 ∈ 𝕃 , 𝑡 ∈ 𝕋 5.4.2 Comparison Between the Final Model and Improved Base Model The average test-trip duration varies on different days and in different seasons. Figure 20 through Figure 22 and Table 5 through Table 7 demonstrate the average trip, the average error in seconds for both the improved base model by regression and the model proposed in this study, and the percentage of error for both the improved base model and the model proposed in this study by days of the week for each season. 84 Figure 20 and Table 5 demonstrate the analysis results for winter season and all of the weekdays. As it is shown, there is a significant improvement in the proposed model with all the variables compared with the previous model across nearly all weekdays. Also, there is a huge improvement in the percentage of errors for the base model (regression and real-time average). Still, the proposed model significantly outperforms the improved base model. Table 5 Summary of winter predictions based on days of the week for the Baltimore test data Day of the week Monday Tuesday Wednesday Thursday Friday Average test trip 544 511 546 518 592 Proposed Model error (s) 83 120 99 111 117 Baseline error (s) 138 166 179 152 157 Proposed Model error % 15.3 24.1 18.5 22.2 19.6 Baseline % 25.3 31.8 32.3 28.2 26.9 85 35% 30% 25% 20% 15% 10% 5% 0% Monday Tuesday Wednesday Thursday Friday Baseline model error % Proposed model error % Figure 20 Comparison of prediction error between proposed model and base model in winter Figure 21 and Table 6 demonstrate the same analysis for the summer and Figure 22 and Table 7 demonstrate it for the fall. As it is shown there is a general improvement both in the base model and the proposed model. 35% 30% 25% 20% 15% 10% 5% 0% Monday Tuesday Wednesday Thursday Friday Baseline model error % Proposed model error % Figure 21 Comparison of prediction error between proposed model and base model in summer 86 Prediction Error Prediction Error Table 6 Summary of summer predictions based on days of the week for the Baltimore test data Day of the week Monday Tuesday Wednesday Thursday Friday Average test trip 570 529 553 583 602 Proposed Model error (s) 59 95 95 107 123 Baseline error (s) 112 152 159 122 191 Proposed Model error % 10.4 18.5 17.8 18.4 20.1 Baseline % 19.8 28.0 27.8 20.8 32.2 35% 30% 25% 20% 15% 10% 5% 0% Monday Tuesday Wednesday Thursday Friday Baseline model error % Proposed model error % Figure 22 Comparison of prediction error between proposed model and base model in fall 87 Prediction Error Table 7 Summary of fall predictions based on days of the week for the Baltimore test data Day of the week Monday Tuesday Wednesday Thursday Friday Average test trip 502 542 503 550 537 Proposed Model error (s) 107 89 104 121 108 Baseline error (s) 132 180 141 163 152 Proposed Model error % 21.2 17.4 21.1 21.8 20.2 Baseline % 26.4 31.4 27.6 30.1 28.1 On average, the model performance in the summer is better, with an average percentage of error of 17% for all days of the week as shown in Table 8 and Figure 23. This could be attributed to the higher number of observation for the summer season. We had two months of data for summer, June and July. This also could be attributed to the fact that summer has less traffic demand and better weather conditions, so traffic patterns are more predictable. The fall and winter average performance for different days of weeks are the same, with an average percentage of error of 20%. On average, the performance of the proposed model is better on Mondays, with an average percentage of error of 15.5% for all the seasons (Table 9 and Figure 24). This is followed by Friday and Wednesday. The highest percentage of error is for Thursday with the average percentage error of 21 %. This again could be attributed to the lower traffic demand on Mondays and better predictions of the traffic pattern for them. 88 Table 8 Aggregated average percentage of error for different season Season Average Percentage of Error % Fall 20.3 Summer 17.0 Winter 19.9 21.0% 20.0% 19.0% 18.0% 17.0% 16.0% 15.0% Fal l Summer Winter Seasons Figure 23 Aggregated average percentage of error for different season Table 9 Aggregated average percentage of error for each days of week Season Average Percentage of Error % Monday 15.6 Tuesday 20.0 Wednesday 19.1 Thursday 20.8 Friday 20.0 89 Percentage of Error 25.0% 20.0% 15.0% 10.0% 5.0% 0.0% Monday Tuesday Wednesday Thursday Friday Days of the Week Figure 24 Aggregated average percentage of error for each days of week 5.4.3 Comparison Between the Two Graphical Models To demonstrate how incorporating other variables into the graphical model has led to better travel time prediction models, the performance of the two graphical models are compared. As discussed before, one of these models is a model including the season and weather conditions (discussed in section 5.4.1), and the other model is the one with only the day of week and time of day variables and not the other two (discussed in section 5.2.1). The problem is that we have different parentages of error for different days and seasons; to overcome this challenge, the average of the percentage of error for all of the weekdays are estimated, and a comparison of the two models’ performance is done for each day of weekdays. As it is shown in Figure 25, there is an evident improvement for 90 Percentage of Error the weekdays. For most of the weekdays, there is more than 5% improvement in the percentage of error. 30.0% 25.0% 20.0% 15.0% 10.0% 5.0% 0.0% Monday Tuesday Wednesday Thursday Friday Model without external variable Model with external variable Figure 25 Comparison of the proposed model with seasons and weather conditions and without them It should be noted here that some of the test datasets might be in the bad weather conditions or good weather conditions. In the next section, in order to specifically evaluate the effects of weather conditions variables on the performance of the model, another model is trained with all the variables, except for the weather conditions, and it is tested on a rainy day. 5.4.4 The Model Test on a Rainy Day As mentioned before, one of the objectives of this study is to develop an integrated model that could predict travel times under various weather conditions. Thus, in this study, various weather data sources were used, and the variables were included in the 91 Percentage of Error graphical model as a trip (global) variable. The test day is Friday, October 9th which was rainy during the PM peak hour. To evaluate the proposed model in bad weather conditions, a test day in rainy weather conditions was selected and results of the proposed model with weather variable inclusion vs without this variable and also base model improved by the regression model is demonstrated in Table 10 and Figure 26. The average test trip duration is 491 seconds (more than eight minutes). Table 10 demonstrates the average trip, the average error in seconds for the graphical model without weather conditions variable, and including the weather condition variable – as well as the base model with regression. Figure 26 shows the comparison between the aforementioned models. As it is shown, there is an improvement in the percentage of errors when using the proposed graphical model with weather conditions variable compared to the graphical model without it as well as compared to the base model improved by a regression model that includes weather conditions variable. Table 10 The comparison of the model with and without weather variable Model Percentage of Error % Model error (s) Average test trip Model with Rain 22.1 110 491 Model without Rain 24.0 117 491 Base model 29.0 141 491 92 30.0% 29.0% 29.0% 28.0% 27.0% 26.0% 25.0% 24.0% 24.0% 23.0% 22.0% 22.0% 21.0% 20.0% Base model Model with weather Model without weather Figure 26 The comparison of the model with and without weather variable 5.4.1 The Time of Day Even though the analysis is shown for the most challenging time of day, PM peak, the proposed graphical model has the capability of providing travel time estimation/prediction in any time of day because it includes the time of day variable in the model. In this study, the time of day is categorized into bins based on the characteristics of traffic. The bins times are chosen in a way that the traffic pattern is nearly consistent in each bin, and different from others. The time period bins include the following bins: • “AM Rush Hour”: weekdays 7 – 9 AM • “PM Rush Hour”: weekdays 3 – 6 PM • “Nighttime”: Monday-Sunday 7 PM – 6 AM • “Weekday Daytime”: remaining times during weekdays 93 Percentage of Error • “Weekend” Table 11 and Figure 27 demonstrate the performance of the proposed model with all the variables for one of the other time bins, weekday daytime bin and winter season, as an example. As it is shown in Figure 27, the percentage of error for the proposed model is less than the base model. However, there is an improvement in the base model performance compare to PM peak base model. This could be attributed to the fact that the average or regression model might be able to capture the pattern to some extent better for the time bins with less traffic congestion since there is less variability in the traffic compared to time bins with more traffic congestion. Table 11 Summary of winter predictions based on days of the week for daytime Day of the week Monday Tuesday Wednesday Thursday Friday Average test trip 524 549 537 537 545 Proposed Model error (s) 96 105 99 130 107 Baseline error (s) 142 136 135 165 156 Proposed Model error % 18.7 19.8 18.7 20.3 19.8 Baseline % 26.6 23.9 24.8 26.5 28.5 94 30% 25% 20% 15% 10% 5% 0% Monday Tuesday Wednesday Thursday Friday Baseline model error % Proposed model error % Figure 27 Comparison of prediction error between proposed model and base model for day time in winter Other Modifications This section focuses on the possible improvements of the graphical model by adding as much as information as possible. What follows demonstrates the proposed model which includes intersections in the modeling process and the associated graphical mode. The section after that proposes an approach by which driver habits and traffic conditions could be modeled by introducing a new variable and adding it to the model and demonstrating it in the graphical model. 5.4.2 The Graphical Model with Intersection In most of the arterial travel time studies, travel time (expected value or the distribution) on the links is provided and the trip travel time is calculated as the sum of the link travel times. The alternative, however, is to include travel time on links and delays in the 95 Prediction Error intersections, and calculate trip travel time by adding the link travel times and delays at intersections. In general, the turn delays can be considered as deterministic or stochastic variables. The variation in the delay comes from unobserved changes such as in traffic flows, signal phasing, driver behavior, etc. Considering intersection delay as a separate entity is very important when we want to distinguish between right turn, left turn, and forward movements. If not for these three different movements, the delays at intersections could easily be added to the link travel times and there is no need to consider them separately. Including the intersection variable in the modeling requires much more data. Because there were not enough samples for each of the movements, the proposed model could not be tested. We propose this model for future work when and if there is enough data for each movement. In this modeling, there are three main challenges to overcome. The first would be to determine what variable should be considered in the modeling to represent the reality the best. We propose the movement variable. The movement variable is any movement from a link to another connected link. It could be noted that the movement variable for an intersection is not just one variable. Any combination of movement from any incoming link to any outgoing links could be considered separately, as shown below. However, one might be interested in just some of the movements such as the left turn and through movements. In Figure 28, there are two incoming links and two outgoing links, which make the total number of movements 4 (2*2). The number would be much higher when there are four incoming and four outgoing links (16 movements). For this reason, having the appropriate number of 96 observations is key because all of these movements should be backed by enough observations to ensure we have a good model. Figure 28 The intersection movement demonstration The second challenge is to create a new network of the city in which all the intersections (movements) are added and the graphical model is made to include the movement variables. To ensure this, a dummy link variable is proposed instead of each movement, and new parents and children are created based on this new proposed model structure. The proceeding nodes Xpa(l,t) (links) are now parents of dummy node Dl,t(intersection) and Dl,t is the parent of Xl,t as shown in Figure 29. 97 Figure 29 The proposed graphical model including the intersections The third challenge is how to allocate the observation travel time to the movements. For the link-based travel time, the travel time is allocated only on the links. When considering the intersections, the travel time should be allocated to the combination of links and intersection (movements). For that, the same optimization model explained in section 4.3 can be used. The optimization needs to be solved to decompose the path travel time to links’ travel times and intersections’ (dummy variables) travel times. The new bound for lower-bound and upper-bound for the intersections’ travel times is proposed based on the intersection type. The final proposed model provides the travel time distribution on the links as well as travel time (delays) distribution for the intersections, and a trip travel time is the sum 98 of the traversed link travel times and traversed intersection travel times, as demonstrated in Figure 30 Figure 30 Demonstration of path travel time as the sum of links and intersections travel times 5.4.3 The Graphical Model with Driver Habits Variable Another variable that is assumed to have an influence on arterial travel time modeling is the driver behavior variable. This study suggests a new variable that can capture driver habits and traffic conditions. This is particularly applicable when the vehicles’ speeds are also available. If it they are not available, the speed can still be estimated based on the time and the link lengths, but might not be as accurate. This variable is an observed variable and it provides a way to capture dependency of travel times across links. The new suggested variable is shown in equation 16. Vup to now= Avg. Vdriver/ Avg. Vspeed limit (16) 99 Figure 31 demonstrates how to include this variable in the model structure. The new variable could be used for both the initial graphical model as well as the graphical model with intersections. In this study, due to the low number of observations, this variable effect could not be tested. This study suggests these variables for future studies for which a higher number of observations is available. This variable also could potentially be used to provide the individual travel time prediction based on individual driver behavior. Figure 31 The proposed graphical model with proposed relative speed variable 100 5.5 Summary This chapter demonstrated the validation and comparison of the model with a base model. It also demonstrated that the graphical model performance would improve with the addition of the explanatory variables. There is an improvement by adding the days of the week variable, seasons, as well as weather conditions, all of which are presented by testing the model on a suitable test day. Model percentage of error in most scenarios is around or less than 20 %. Testing the model with weather conditions variable on a rainy day showed a lower percentage of errors when compared to the graphical model without this variable. The model is also compared with a base model with regression and a real-time average. The regression model includes the link explanatory variables such as speed limit, length, etc., and the trip explanatory variables such as seasons, weather conditions, etc. Even though there is a significant improvement in the performance of the base model when switching from real-time average to the combination of real-time average and regression model, the proposed graphical model with all the variables still outperforms the base model. There were two model modifications that this study aimed to explore but, due to the low number of observations and lack of data, the modifications could not be tested to see their impact on the performance of the model. Given this, we have left the implementation and testing of these modifications for future research. The first important modification is the inclusion of the intersection movements, which would 101 cause the model to have a different delay time for straight movement vs. left turn movement. 102 Chapter 6: Comparison and Sensitivity Analysis Validation of the model via separate test data is a common way of validating any prediction model; however, in reality, one may want to compare a model to existing commercial models. To the best of the author’s knowledge, none of the proposed models in existing literature have been compared to any commercial prediction model. As such, there is not a framework to do so. A comprehensive framework for comparing the model proposed in this study (or any model) to commercial travel time providers is proposed. The comparison of the proposed model in this study with the Google API, and the original observations are demonstrated in this chapter. The chapter continues with the sensitivity analysis on a parameter of the proposed model as well as available data, and the results of the sensitivity analysis are illustrated and discussed. 6.1 Validation Another way to validate the proposed model is to compare the outcome of this study to Google estimates. There are two challenges with this approach. First, we do not have the current probe data and Google API only provides the travel times for current or future time. This poses a big challenge, and there is no way to avoid this obstacle without access to current time vehicle trajectory data. However, we can develop a general estimate on how the model performs in off-peak hours, when the traffic has a lesser effect on travel time. As such, the travel times for the roads network are compared to each other. This just gives us an estimate of how our model is performing 103 in general and without the effect of traffic. It should be noted that we may want to compare only those trips’ travel times that have common route with the observed GPS trajectory. As such, the second challenge is that the Google-proposed route should be the same as the observed trajectory. To address this, we have proposed to use a shape similarity algorithm to only select those trips that have the same observed routes as the Google routes. Here, the probe vehicle trajectory and Google routes should be compared to each other to determine they are the same routes and comparison is meaningful. 6.2 Shape-Matching Shape-Matching is an important area of research that has many applications in computer vision, and it is the area of arbitrary target object detection. In computer graphics, the Hausdorff distance is used to measure the difference between two different representations of the same 3D object. It is also used in biological sciences for the analysis of protein structures. Considering we have point sets 𝐴 ∈ 𝑅x and ∈ 𝑅x , the goal is to find the level of similarity between A and B sets. There are several algorithms for finding the shape similarity including Hausdorff, Fr´echet distances, and Minkowski distance (Veltkamp, 2001). In this section, we try to use these algorithms to find the similarities between two routes in three dimensions (Veltkamp, 2001). This mathematically means that we want to find a match that has one of the following characteristics: • Minimizing maximum distance between mapped points • Minimizing sum of distances between mapped points 104 • Minimizing sum of squared distances between mapped points 6.2.1 Hausdorff Distance The Hausdroff distance is a measure of how far two subsets are. It is calculated by determining the maximum distance from a set to another set in which each point in a set A is matched to the nearest point in set B. The Hausdorff distance from A to B, h(A, B), is 𝑚𝑎𝑥V∈®𝑚𝑖𝑛€∈†||𝑎 − 𝑏||, The Hausdorff distance from B to A, h(B, A), is 𝑚𝑎𝑥€∈†𝑚𝑖𝑛V∈®||𝑏 − 𝑎||, The Hausdorff distance between A and B, H(A, B), is max {h(A, B), h(B, A)}. (Hausdorff and Frechet distance Lecture, 2017) Basically, h(A, B), for each point on dataset A, first finds the nearest neighbor point to it on dataset B, and finally reports the farthest of these distances. In this way, Hausdorff distance measures the mismatch between these two datasets. The idea is shown in Figure 32. (Alt & Guibas 1999). 105 A Min Min Min Min Max Min Min MinB Min Min Figure 32 Hausdorff distance as a measure of similarity 6.2.2 Fr´echet Distance The Fr´echet distance is explained with the following example. Assume a man is walking a dog where the man is walking on one curve and the dog on another curve. The minimum length of a leash allowing a dog and its owner to walk along the two curves without backtracking is defined as the Fr´echet distance (Figure 33). The Fr´echet distance is known to be a better measure of shape-matching for curve or surface matching. The Fr´echet distance between two curves is calculated as follows: 𝐹(𝐴, 𝐵) = 𝑖𝑛𝑓°,>𝑚𝑎𝑥:∈[h°,=]||𝐴a(𝑡)i − 𝐵(𝛽(𝑡)|| (17) where A, B : [0, 1] 𝑅x are parametrizations of the two curves 106 and 𝛼, 𝛽: [0, 1] → [0, 1] range overall continuous and monotone increasing functions. (Alt & Godau, 1995; Alt & Godau, 1999) Figure 33 Fr´echet distance as a measure of shape similarity 6.3 Route Similarity As mentioned before, this study proposes using shape-matching to find the similarities between two routes in order to compare the travel times of two similar routes. In this study, two algorithms of Fr´echet distance and Hausdorff distance were implemented. The Hausdorff distance seems to be a better indication of route similarity and is chosen for this aim. The implemented algorithm is an approximation to the Hausdorff Distance for two geometries. This algorithm measures the degree of similarity between two geometries. The computed measure is in the range of [0, 1]. Higher measures indicate a great degree of similarity, which means that a measure of 1.0 shows two exact matches and a 107 measure of 0.0 indicates that the two have essentially no similarity. The algorithm takes the geometry of the two routes of interest and returns the normalized Hausdorff metric. When comparing travel times, it is important to use the same route; otherwise, the comparison is meaningless. Knowing this, we calculate this metric for probe data trajectory vs. Google routes. The travel time estimation comparison can be calculated between routes that have a similarity score higher than 0.99. Figure 34 shows an implementation of the proposed algorithm for a GPS route (in blue) vs. Google route (in red) with a 0.7005 (70.05%) similarity index. The origin- destination is the same for both GPS and Google routes, and it is evident the routes have the same path up to a point and they bifurcate after that and meet again at the destination. Figure 34 Route similarity index example (two routes with similarity index of 70.05%) 108 6.4 Google Maps API The Google Maps Directions API is a service that estimates directions and travel time between locations. The directions between two locations are provided for several modes of transportation including transit, driving, walking, and cycling. The challenge is that the Google cost model is not available to the public to allow comparisons of travel time estimations for the proposed model vs. the Google model. Also, it is not possible to acquire all the link combinations with Google API, as it returns only the most efficient routes in terms of travel time when calculating directions between two locations. For this reason, route similarity is critical to making it possible to compare only the routes that are identical. 6.4.1 General Information Google API can provide an estimate for current time or future time. There is a limitation of 2,500 free directions requests per day with Google API. There are several parameters that can be changed in the URL, some of them are required while others are optional. Parameters are separated by the “&” character. Two of the important optional parameters for us are “departure_time”, and “traffic_model”. 6.4.2 Departure Time Google uses the parameter “departure_time” to identify a specific travel time; the value is the time specified as Unix time, or seconds since midnight, January 1, 1970, UTC. 109 6.4.3 Travel Time with Traffic The other optional parameter to use to collect travel time is “traffic_model,” which can pull the following values: best _guess, optimistic, and pessimistic. This parameter’s values change the “duration_in_traffic” value. traffic_model parameter values: 1. Best _guess (which is the default factor) is based on historical averages. 2. Optimistic gives the duration_in_traffic equal to the lowest value in the range Google Maps provides (based on observations for a couple cases). 3. Pessimistic gives the duration_in_traffic equals to the highest value in the range Google Maps provides (based on observations for a couple cases). The best application of the proposed method would be when there is real-time data, and the proposed model could provide a prediction of travel time that could be compared with Google API using the Best_guess and Current_time parameters. However, this data is not available and this approach could not be taken in this study. Nevertheless, it can still be used to find how the model is performing in comparison with Google API. In this way, during the off-peak hours, the estimated travel times could be compared, removing the real-time effects and the base estimation without traffic. The proposed approach could be used in any other studies and most importantly when real-time data is available. 110 6.5 Waypoint Parameter There is an optional parameter in Google API through which one can specify the waypoints available for driving, walking and bicycling, but not transit. Using this parameter, however, changes the route from the original in most of the cases, either because of an observed GPS error or based on Google API logic. As such, it would still be necessary to use route similarity. Furthermore, the advantage of using default Google routing instead of a giving waypoint is that the route selections could be compared as well. This is particularly important when there is also a routing suggestion based on an existing cost model. In this way, both the cost model and the routing suggestions could be compared. This parameter in Google API collects the specified array of waypoints (latitude/longitude coordinate) and alters the route by directing it through the specified waypoints. This study tried this feature as well, but it turned out using the route comparison and checking for similarity higher than 0.99 is a better approach. 6.6 Comparison As mentioned, the comparison of the proposed model and Google travel time estimation were conducted for the off-peak hours and equivalent times (same day, month, time) and ‘best_guest’ parameters. The proposed model and Google travel time estimations are also compared with the actual travel time observed from the probe data. The steps that need to be taken for the compression between the proposed model, Google model, and the probe data is shown in Figure 35. The indicated steps were 111 taken, and the results of the comparison for the Baltimore area are demonstrated in the following section. Step 1: Getting the proposed model and Google ETA for the same origin/destination and time. Step 2: Calculating the route similarity score for routes pairs & compare the travel time just for routes with similarity score> 0.99 Step 3: Compare the proposed model, Google travel time estimation and probe data travel time Figure 35 The proposed framework for travel time estimation comparison between models 6.6.1 Case Studies This section demonstrates how the proposed framework is used for the study area to compare the proposed model in this study to the Google API and the actual observations. As mentioned before, since we do not have access to real-time data, we illustrated the framework using historic data. To do this, all the historic data for all the training datasets were used as the input for the model to be trained. Once the model was trained, the model predictions of travel times for the test day again for 12 AM to 6 PM time, and with the same weather conditions are obtained. As such, for all the trips that occurred on the test date, we have the origin/destination, GPS track, and the proposed model’s predicted travel time. For 112 all those trips, we acquired the Google estimates as well. In Google API, we entered the same day, time of day, day of week and month and same weather condition as the test day, and used the “best_guess” parameter as the API input for the future data since only current and future times are available through Google, however, If we had the current data, we could simply use the current Google API estimation. In this way, we obtained the origin-destination, GPS track, and the estimated travel time of Google as well. The next step was to find those trips with a shape similarity index between the actual observations and the Google proposed path. For trips that the observation track and Google proposed route are the same, the average of travel times and errors for proposed model, the Google, and actual observation is demonstrated in Table 12. The details of travel times are shown for a sample of these trips in Table 13. What follows demonstrate the results of this comparison between the Google API, proposed model, and the observations. 6.6.2 Results 6.6.2.1 Off-peak Travel Time Comparison Figure 36 demonstrates an example of the comparison between the Google API and proposed model estimated travel time and observed travel time. The given trip is on Tuesday, in winter, and with fair weather conditions and non-peak hour traffic (February 10th at 00:57 AM). 113 The example trip route has a similar route based on a Google API proposed route, which means that the shape similarity index between the two routes is higher than 0.99 and we can compare the travel times. The observed travel time is 500 seconds; whereas, the Google API estimate is 480 seconds; finally, the proposed model estimation is 508 seconds. In this case, there is a 4% difference between the observed travel time and Google API estimation, and a 1% difference between the observed travel time and proposed model estimation. Also, the difference between Google and the proposed model estimation is 0.3 %. As mentioned, this is just one example to demonstrate the comparison process. As such, the comparison is done for the test dataset and results are demonstrated and discussed as follows. 114 Google Observation Figure 36 Proposed Model, Google API, and observed travel time comparison (off-peak) The selected test case is based on trips on a Tuesday in winter under normal weather conditions, for midnight to 5 am in the morning, for a travel length of more than five minutes. Around 190 cases were randomly selected from which around 40% of them had the same route as the Google API route with a shape similarity index of 0.99 and above. 115 The sample 90 tested cases have an average trip of 420 seconds, with a maximum of 608 seconds. From this sample, 10 trips were doubled-checked manually (both probe trajectory and timestamps) and all appeared normal. Figure 37 Google and proposed model Percentage of error trend demonstrates the travel difference from the ground truth trend (they are not necessarily equivalent points) for the proposed model and Google. As it is shown, the proposed model has the tendency to estimate a higher travel time than the ground truth, and Google’s estimates are lower than ground truth 50% of the time. As is shown in Table 12, the average estimation error for the proposed model is 20%; whereas, the average estimation error for the Google API is 22%. This analysis demonstrates the proposed process and the result of the analysis for 90 cases. The analysis shows that the proposed model estimation does not yield results that stay far from Google API, even though historical data is used. 116 1 0.5 0 0 20 40 60 80 100 -0.5 -1 -1.5 Cases Peoposed Model Google Figure 37 Google and proposed model Percentage of error trend (off-peak) Table 12 Google and proposed travel time prediction summary (off-peak) Variable Value Number of Observations 90 Average Trip Time (s) 420 Average Proposed Model Average Error (s) 75 Average Google Average Error (s) 84 Proposed Model Average Error % 20 Google Average Error % 22 117 Percentage of Error Table 13 Sample of Google, proposed model travel time prediction and observations travel time comparison (off- peak) Trip Observation (s) Proposed model (s) Google (s) Google-Obs (s) Proposed-Obs (s) 1 507 500 480 -20 7 2 455 322 429 107 132 3 547 507 444 -63 40 4 377 362 525 163 15 5 405 353 380 27 52 6 420 374 339 -35 46 7 418 301 415 114 117 8 429 354 296 -58 75 9 372 443 315 -128 -71 10 495 340 468 128 155 11 606 527 539 12 79 12 479 580 452 -128 -101 13 476 450 407 -43 26 14 436 533 392 -141 -97 15 377 484 286 -198 -107 16 465 445 472 27 20 17 303 360 288 -72 -57 18 353 414 344 -70 -61 19 491 540 333 -207 -49 20 528 435 349 -86 93 21 328 360 280 -80 -32 22 491 540 333 -207 -49 23 390 502 350 -152 -112 24 390 410 350 -60 -20 25 318 430 277 -153 -112 26 461 302 409 107 159 27 412 344 384 40 68 28 608 522 465 -57 86 29 440 372 295 -77 68 30 600 360 509 149 240 118 6.6.2.2 PM peak Travel Time Comparison As mentioned before, to remove the effects of real-time traffic variability, the comparison was done during off-peak hours. However, what follows is the demonstration of comparison between the proposed model, Google API, and the actual travel-time for a sample of observations on a Monday during PM peak hours. The sample cases are randomly selected from the test date. Around 30 trips are selected from which 11 cases met the comparison criteria. The average percentage of error for the proposed model is 10.8 % whereas the Google average percentage of error is 24.4%. Figure 38 demonstrates an example of the comparison between the Google API and proposed model estimated travel time and observed travel time. The given trip is on Monday, in winter, and with fair weather conditions and peak hour traffic. The example trip has a similar route as the one proposed by the Google API, meaning that the shape similarity index between the two routes is higher than 0.99 and consequently the travel times can be compared. The observed travel time is 422 seconds; whereas, the Google API estimate is 477 seconds; and the proposed model estimation is 391 seconds. In this case, there is a 13% difference between the observed travel time and Google API estimation, and a 7.4 % difference between the observed travel time and proposed model estimation. Table 14 shows all the 11 cases which are considered in this analysis. 119 Google Observation Figure 38 Proposed Model, Google API, and observed travel time comparison (PM peak) Table 14 Google, proposed model travel time prediction and observations travel time comparison (PM peak) Trip Observation (s) Proposed model (s) Google (s) Google-Obs (s) Proposed-Obs (s) 1 238 220 429 191 -18 2 414 365 455 41 -49 3 540 492 598 58 -48 4 382 421 647 265 39 5 308 355 474 166 47 6 422 391 477 55 -31 7 378 411 620 242 33 8 630 490 618 -12 -140 9 233 246 192 -41 14 10 540 492 598 58 -48 11 847 779 772 -75 -68 120 6.7 Sensitivity Analysis Sensitivity analysis is the study of how the output of a mathematical model can be sensitive to the inputs and assumptions. In this process, the outcome is recalculated under alternative assumptions/inputs to determine the impact of each assumption/input. There are several benefits to conducting sensitivity analysis such as understanding the potential relationships between input and output variables in the model, evaluating the robustness of the results of the model when uncertainty exists, finding any possible errors in the model, and developing a possible enhancement in the model by providing better input and so on. This study is an empirical study and the model is a data-driven model; as such, there are not many input factors/assumptions in the model. In this section, the input factors are explained, and the results of sensitivity analysis are demonstrated. 6.7.1 Number of States As mentioned before, here we have a few assumptions on which we can conduct the sensitivity analysis, one of which is the fact that we considered only two states of traffic congestion. The reason to choose two states of congestion is that in some of the cases, two distributions with two different means can be detected in the histogram of the links raw travel time data. However, there is no robust empirical study showing that that is the case. On the other hand, the more states we have, the more data we need to do the analysis, so even if there were more traffic states, we did not think with the available amount of data in this study more than two states can be considered. To demonstrate 121 whether or not the higher number of states could result in a different measure of model performance, and in order to explain the observation with accuracy, we conducted sensitivity analysis on the number of states in the model. In this part, the three states and four states are considered for the model. The result for the three states is shown in Table 15. As shown in Table 15 and Figure 39, the performance of the model utilizing a higher number of states decreased. To ensure that the result is not an accident, three different days of week and months are randomly selected and the performance of the model with two states and three states are compared. Figure 39 also demonstrates this trend of diminishing performance by incorporating a higher number of states for several test days. As mentioned before, this could be attributed to the small sample size. For example, utilizing three states means that we need to have three distributions for each link, and we need to have enough observations in all the cases to have sound data. It should be noted that the increasing the number of states results in much higher computational complexity for the training part. The initial test on the 4 states case showed that training time was very high which made it inapplicable in the real-world situation and therefore it is not considered here. Thus, the two-state option seems to be a good choice, at least for this amount of data. With a larger amount of data, the number of states could be considered to see if it can improve model performance. 122 Table 15 Sensitivity analysis on the number of states Test Day Wednesday, Tuesday Friday, Season Winter Summer Fall Average test trip 546 529 537 Model with 2 states error (s) 99 95 108 Model with 3 states error (s) 118 112 132 Model with 2 states error % 18.5 18.5 20.2 Model with 3 states error % 21.3 21.0 23.7 25.0% 20.0% 15.0% 10.0% 5.0% 0.0% Wednesday, Winter Tuesday, Summer Friday, Fall 2 states 3 states Figure 39 Sensitivity analysis on the number of states 123 Percrntagr of Error 6.7.2 Number of Observations In most of the data-driven models, the model is assumed to be very sensitive to the amount of data. In this study, all the available data is used for the model. However, to show how sensitive the model is to the amount of data, the sensitivity analysis is conducted on the amount of data. For this purpose, since we do not have more data, less data is used to see how the model performance could potentially change based on the amount of available data. In this part, 80% and 50% of the whole data is used for the modeling. 80% (50%) of the data is selected from all of the available days to have a better representative data (not just using 50% of the days). The test day is randomly selected for this analysis. The test day is Friday, October 2nd which is in the fall season and the trained model is tested on this day. As noted, the model is trained with 80% of the data and 50% of the data and the results are demonstrated in the Table 16. As it is shown in Table 16, for the case of the trained model with 80% of the data the decrease in the percentage of the error is not significant and we can say the performance of the model with 100% of the data and 80% of the data, at least on this test day, is the almost the same. However, there is an evident improvement in the performance of model comparing the model with 50% of the data with models with 80% and 100% of the data. Thus in general, we can say the amount of data could have an impact on the model performance. The lower the number of observations we have, the lower the performance would be, and the closer the proposed 124 model performance would be to the base model with real-time average and regression model. Table 16 Sensitivity analysis on the amount of data Variable Value Average test trip 537 Model with 100 % of data error (s) 108 Model with 80 % of data error (s) 109 Model with 50 % of data errors (s) 126 Base model errors (s) 152 Model with 100 % of data error % 20.2 Model with 80% of data error % 20.3 Model with 50 % of data errors % 23.7 Base model errors % 28.1 This trend is also shown in Figure 40, which illustrates the improvement in the performance of the proposed graphical model and its correspondence with the increase in the number of observations. 125 24.0% 23.0% 22.0% 21.0% 20.0% 19.0% 18.0% Model with 100 % of data Model with 80% of data Model with 50 % of data Figure 40 Sensitivity analysis on the amount of data The analysis outcome shows that the model performance could potentially improve with more data and the model could be very promising when we have enough samples of data. In the future, this model could be applied with more data in order to see how performance could be improved. The availability of data is critical not only because the analysis shows that the higher amount of data could itself result in higher model performance, but also because it makes it possible for other proposed variables such as intersections and movements and more weather conditions to be included in the model and potentially increase the performance of the model across a variety of conditions. 126 Percentage Of Error 6.8 Summary The proposed model to compare this study model and Google API has three main steps. The first step involves having a number of trips for the test and changing the Google API parameters to match the test trip. The Google API parameters include origin, destination, departure time, traffic model, and waypoint, all of which are changed according to each trip. Then, the route proposed by Google API and the observation are compared to ensure that we are comparing the travel times on the same routes. To achieve this, this study proposes the use of a well-known shape similarity algorithm, Hausdorff Distance, and incorporates only those trips for which the Google routes and the observation routes are identical. Finally, this study compares the travel times provided by the Google API, the proposed model, and the observation. The analysis for the sample of trips during off-peak hours shows that, in general, Google API and the proposed model have a comparable difference from the ground truth (observation) and, in general, Google API overestimates the time in 50% of cases and underestimates travel time in 50% of cases. The proposed model, however, has a tendency toward overestimation. Finally, sensitivity analyses were conducted for two parameters of the model. One was the number of congestion states, which was considered two in this study. To find out how the model is sensitive to the states of congestion, three and four congestion states were considered and tested. The analysis results showed that, in instances where there were a higher number of congestion states, the model had lower performance. This could be because of the low number of observations or because two congestion states 127 are the best in representing the reality. The other parameter that was analyzed was the number of available observations. This analysis showed that, with a lower number of observations, we would have lower performance. This implies that, with larger amount of data, the proposed model’s performance can improve. 128 Chapter 7: Summary and Conclusion In this Chapter, we provide a complete summary of the research. Then we discuss the findings of this study. Finally, in conclusion, some of the limitations of this study and concrete suggestions for future research are proposed and discussed. 7.1 Summary Travel time prediction is very important to everyday life. Reliable travel time prediction helps both road users and system controllers to be informed about the future conditions on roadways and enables them to make their best decisions based on that information. In a world where we are facing a rapid increase of traffic congestion, accurate Traveler Information Systems are increasingly crucial. Accurate and reliable travel time information could help to reach control and could alleviate this congestion by providing a more reliable network. The first step toward this goal is the creation of accurate real- time traffic monitoring systems. This is especially important in arterial networks since they are major city streets that provide for travel within and between cities. Yet, there is not much work that is being done to improve arterial real-time traffic monitoring. In general, travel time estimation and prediction are very complex and challenging tasks. This is due to unknown variables such as varying driver behaviors, uncertainty in demand, weather, incidents, roadway conditions, etc. This is even more challenging for arterials since there are other factors playing a role in this uncertainty, such as traffic signals and the relation between the links and their effects on each other. 129 Many of these fluctuations could be captured in a travel time predictions model. In fact, the goal is to capture these fluctuations and to be able to predict the fluctuations as much as possible. Nevertheless, there is still some uncertainty associated with travel time that poses challenges for capturing this information. Still, knowing about these uncertainties and the range of uncertainties is very important. To address the need for travel time predictions, many studies have proposed different methodologies. Most of these studies are for freeway travel time prediction, and fewer are for arterial travel time prediction. An overview of these methods for travel time prediction in general and arterial travel time prediction was done in Chapter 2. The approaches to this problem can be categorized based on the scope, output, and methodology used. There are two primary different outcomes: large-scale navigation systems (mostly deterministic) and probability distributions of travel times, mostly for the smaller scale. There are two main scopes that are distinguishable in the literature, microscopic (includes a small number of intersections) or macroscopic. Finally, in terms of the used methods, the approaches can be categorized into three main approaches: parametric, non-parametric, and hybrid modeling. The parametric methods, such as Feedforward Neural Network, Hidden Markov Model, and Autoregressive Models, have strong model assumptions. Comparatively, the non-parametric methods, such as the k-nearest neighbor, require fewer assumptions and they grow with the amount of data. Some of the proposed probabilistic methods used specifically for arterial travel time prediction, such as STARMA, Dynamic Bayesian network using EM algorithm, and Spatial Moving Average (SMA), were discussed in more detail. 130 This study proposed a probabilistic model using raw probe data. Using the raw probe data for the modeling purpose presents a lot of challenges and requires steps to be taken prior to the modeling. However, as of the time of this writing, the author has no knowledge of a study that features a framework for each of these steps. Chapter 3 of this research provided a framework for each preliminary step needed to be taken before the modeling. The steps include the data cleaning framework, map matching of the data to the network, and creation of the network. This study proposed a framework to work with raw and dirty probe data, including two main steps, trip splitter and trip filtering. The three main splitters are Time Difference Splitter, Waypoint Sequence Splitter, and Location Difference Splitter. The trip filters are Same Time Filter, Same Location Filter, Speed Boundary Filter, Idle Filter, Location Displacement Filter, Trip Length Filter, Shaking Probe Filter, Large Error Size Filter, and Shaking Probe Filter. Next, the map matching and existing methods for map matching were discussed. This study used Hidden Markov Model (HMM) map matching methods to find the match that maximizes the multiplication of the likelihoods of the points on the map for each latitude/longitude, as well as the maximum likelihood of the sequence between two consecutive latitude/longitude pairs. Finally, the initial structure of the model was explained. The proposed model in this study is an enhanced graphical model. In the graphical model, it is assumed that there is a conditional dependency between random variables, which are demonstrated by a graph. As such, this chapter explained how a network or the base Dynamic Acyclic Graph (DAG) of the roads is created by creating the graph of the street map. In the base DAG, 131 a variable is assigned to each direction of a link and for time step t and the time step of t+1. The state of each link (variable) is conditioned on the neighboring links’ (variables) state at the previous time step. Consequently, the most important part of creating the network is creating this adjacency matrix of the connection between links. The enhanced graphical model was explained in Chapter 4. The enhanced graphical model has a global variable for time of the day, the day of the week, season, and weather conditions. All of the variations in either demand or travel behavior imposed by each of these variables are included in the model, and the model is able to estimate/predict travel time on each of these scenarios. The chapter continued with an explanation of Streaming Variational Bayes (SVB) algorithm. The SVB is an approximate inference algorithm that is used instead of exact Bayesian inference – the latter is very time consuming, and it is often inapplicable for complex models. Streaming Variational Bayes deploys the Bayes rule to convert a conventional Variational Bayes method into an incremental approach. The validation of the models and analysis of the variables influence was demonstrated in Chapter 5. The results of the enhanced graphical model were demonstrated for different scenarios, including different seasons and different days of the week. In most of the scenarios, the percentage of error of the model for short-term travel time prediction is less than 20%. The graphical models were compared to each other’s with incrementally added variables in order to allow the evaluation of the effects of inclusion of each variable in the model. The performance of the model was also compared to a base model, which is a real-time average, as well as a combination of real-time average and a regression model. The proposed model outperformed these two models in all of 132 the scenarios. The weather conditions variable was separately evaluated and discussed in this chapter, and the accuracy of the model in bad weather conditions was demonstrated again for different scenarios. The chapter concluded with detailed suggestions on extending the model by bringing the intersections and driver behavior into the modeling process. The extended model for inclusion of intersections, for example, includes a movement variable in the graph for each of the possible movements in the intersection. However, this model could not be tested in this research due to a lack of data regarding all the movements. In this study, there were enough data to test the original proposed model without the extensions. We provided comprehensive details for developing the extended model and testing it in future studies pending availability of data. Even though the model is evaluated based on the test data, in Chapter 6, this research proposed a new method to compare the proposed model (or any real-time prediction model) with any publicly available routing API such as Google. The proposed framework has three main steps. The first step involves having information from a sample trip (including GPS tracks) for the test date, and matching the Google API (or any available API) parameters according to each of the trips; it also involves obtaining the Google APT prediction/estimation for the same trip. The next step involves comparing the observed trip GPS tracks for the proposed model, and the Google API routes using Hausdorff Distance to make sure the travel times for the same route are being compared. The Hausdorff Distance is a Shape-Matching algorithm and is a measure that shows how similar are two subsets which in our case are the GPS tracks. Lastly, there is a comparison of the estimations/predictions for those routes that are 133 identical. The results showed that the proposed model and Google API have a comparable difference with the ground truth. Sensitivity analysis was done on two factors of the study. First, we evaluated the impact of the higher number of traffic states (three, and four were considered). The results of the analysis showed that, with this amount of data, the performance of the model decreases when the number of states is increased. Then we analyzed the impact of the amount of data available for use in the modeling process. The analysis showed that using a smaller amount of data (since we did not have more) could result in lower performance of the model, or even make it difficult or impossible to capture the influence of a variable at all. 7.2 Conclusions What follows is the list of this study findings: • The first step in developing a good model is gaining access to the cleanest data available. The cleaned data could result in a better estimation/prediction model. This study proposed a framework to clean the data has proven to be effective. • Using the Streaming Variational Bayes could make the model usable for large- scale networks with more variables in the model. • There is a time pattern in the travel time that can be captured. This variable could be incorporated in the model. • There is a seasonal pattern in travel time that can be captured by including this variable in the model. 134 • The weather condition does affect either driving behavior or demand, and it influences travel time and its uncertainty. The analysis confirms that the travel time distribution in bad weather conditions has a higher mean and wider standard deviation. • There is an average of 2% difference for the Google travel time estimation (with the available parameters) and our proposed model travel time estimation. Also, there is an average of 20% difference for the Google travel time estimation and actual travel time observations. • Less amount of data could result in poorer estimation/prediction and lower model performance. This means that more data could potentially result in better modeling and higher performance. • The model has the highest performance when it considers two states for each link’s travel time distributions. Considering a higher number of states results in lower performance. This could be attributed to the fact that two states can explain the observation the best, or it could be that, with the existing sample size, adding more variables into the model could decrease model performance. 7.3 For Future Research This research provided contributions to the existing literature in the area of real-time arterial travel time prediction. There are some model enhancements that could not be accomplished in this study due to a lack of data. The most important enhancement would be incorporating intersections in the model. This means that we would decompose the trip travel time to link travel times and delays at the intersection. For 135 this purpose, this research suggested incorporating a movement-based variable in the model. By doing so, the most important variation in the intersection based on the type of movement (left, right, and straight) could be captured in the modeling. Many commercial models still encounter this challenge in producing an accurate estimation/prediction waiting time for the traffic signals, specifically when it comes to left turns. If and when more data are available as it is becoming increasingly the case, this extended model can be developed and tested by using more data. This research assumed the links travel time to be independent for the sake of data availability, in the future and with the availability of more data, the interdependency between the links travel time could be captured by learning the joint travel time distributions for neighboring links. The other variable that could be considered in the graphical model is the driver behavior variable, which could be captured by individual vehicle trip speed. The effects of bringing this variable into the graphical model or other models could be evaluated in future studies. This study also included weather condition as an explanatory variable in the model. Two weather conditions were used in this study, however, with greater availability of data, more weather condition scenarios (such as heavy rain, heavy snow, storm etc.) could be used to expand this variable. This can lead to a more accurate travel time estimation/prediction in real-world conditions. Another interesting consideration about weather conditions would be the possibility of bringing weather predictions into the travel time modeling, as well as the uncertainties associated with it. 136 There are other events that can influence travel time significantly, including incidents, work zones, special holidays, etc.. However, enough vehicle probe data are needed for each of these events in order for them to be incorporated into the model. With rapidly improving technology development, increasing amounts of data are becoming available, which makes it possible to include these events in the modeling. Including these events could yield more accurate estimation/prediction. 137 Bibliography Ahmed, M. S., and Cook, A. R. (1979) "Analysis of freeway traffic time-series data by using Box-Jenkins techniques," Transportation Research Record. Alt, H., and Godau., M. (1995) “Computing the Fr´echet distance between two polygonal curves,” Int. J. Comput. Geom. Appl., 5:75–91. Alt, H., and Guibas, L. (1999) “Discrete geometric shapes: Matching, interpolation, and approximation - a survey,” In J.-R. Sack and J. Urrutia, editors, Handbook of Computational Geometry. Elsevier, 1999. Ban, X. Herring, R., Hao, P., and Bayen, A. (2009) “In Proceedings of the 88th Annual Meeting of the Transportation Research Board, Washington, D.C. Brakatsoulas, S., et al. (2005) “On Map Matching Vehicle Tracking Data,” in 31st International Conference on Very Large Databases: Trondheim, Norway p. 853- 864. Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. (2013) “Streaming Variational Bayes,” In Advances in Neural Information Processing Systems 26, 2013. Budge, S., Ingolfsson, A., and Zerom, D. (2010) “Empirical analysis of ambulance travel times: The case of Calgary emergency medical services,” Management Science, 56:716–723. Butler, S., Ringwood, J. and Fay, D. (2007) “Use of Weather Inputs in Traffic Volume Forecasting,” Proceeding of Irish Signals and Systems Conference, Derry, Ireland. 138 Cetin, M., and Comert, G. (2006) "Short-term traffic flow prediction with regime switching models," Transportation Research Record: Journal of the Transportation Research Board, vol. 1965, pp. 23-31. Chung, E., Sarvi, M., Murakami, Y., Horiguchi, R., Kuwahara, M. (2003) Cleansing of probe car data to determine trip OD, in: Proceedings of 21st ARRB and 11th REAAA Conf erence, Cairns, Australia. Cui, Z. Ruimin, K. Yinhai W. “Deep Stacked Bidirectional and Unidirectional LSTM Recurrent Neural Network for Network-wide Traffic Speed Prediction”, access on January 2018 https://arxiv.org/abs/1801.02143. Davis, G. A., Nihan, N. L., Hamed, M. M., and Jacobson, L. N. (1990) "Adaptive forecasting of freeway traffic congestion," Transportation Research Record. Farokhi Sadabadi, K., Hamedi, M., and Haghani, A. (2010) "Evaluating moving average techniques in short-term travel time prediction using an AVI data set," in Transportation Research Board 89th Annual Meeting. Ghosh, B., Basu, B., and O'Mahony, M. (2009) "Multivariate Short-Term Traffic Flow Forecasting Using Time-Series Analysis," Intelligent Transportation Systems, IEEE Transactions on, vol. 10, pp. 246-254. Google API, https://developers.google.com/maps/documentation/directions/intro. GraphHopper, https://github.com/graphhopper/map-matching, accessed December 2016. 139 Greenfeld, J.S., (2002) “Matching GPS Observations to Locations on a Digital Map,” in 81st Annual Meeting of the Transportation Research Board: Washington, DC, USA. Haghani, A., Hamedi, M., Sadabadi, K. F., Young, S., and Tarnoff, P. (2010) "Data collection of freeway travel time ground truth with Bluetooth sensors," Transportation Research Record: Journal of the Transportation Research Board, vol. 2160, pp. 60-68. Hamed, M., Al-Masaeid, H., and Said, Z. (1995) "Short-Term Prediction of Traffic Volume in Urban Arterials," Journal of Transportation Engineering, volume. 121, pp. 249-254. Hamner, B. (2010) "Predicting travel times with context-dependent random forests by modeling local and aggregate traffic flow," in Data Mining Workshops (ICDMW), 2010 IEEE International Conference, pp. 1357-1359. Hao, P., Boriboonsomsin, K., Wu, G., Barth, M.J. (2017)” Modal Activity-Based Stochastic Model for Estimating Vehicle Trajectories from Sparse Mobile Sensor Data”. IEEE Trans. Intelligent Transportation System. Herring, R. (2010) “Real-time traffic modeling and estimation with streaming probe data using machine learning,” Ph.D. dissertation, Univ. California, Berkeley, CA. Herring, R., Hofleitner, A., Abbeel, P., and Bayen, A. (2010) “Estimating arterial traffic conditions using sparse probe data,” In Proceedings of the 13th IEEE International Conference on Intelligent Transportation Systems. IEEE, 923–929. 140 Hofleitner, A., Herring, R., and Bayen, A. (2012 a) “Arterial travel time forecast with streaming data: A hybrid approach of flow modeling and machine learning,” Transportation Research Part B, 46:1097–1122. Hofleitner, A., Herring, R., Abbeel, P., and Bayen, A. (2012 b) “Learning the dynamics of arterial traffic from probe data using a dynamic Bayesian network,” IEEE Transactions on Intelligent Transportation Systems, 13: 1679–1693. Hranac, R., E. Sterzin, D. Krechmer, H. Rakha, and M. Farzaneh. (2006) “Empirical Studies on Traffic Flow in Inclement Weather” Report FHWAHOP-07- 073. http://planet.openstreetmap.org/, accessed March 2016. https://onlinecourses.science.psu.edu/stat501/node/358, accessed December 2016. https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based datasets/quality-controlled-local-climatological-data-qclcd, accessed for year 2015. https://www.ncdc.noaa.gov/oa/climate/ghcn-daily/, accessed for the year 2015. https://www.ncdc.noaa.gov/stormevents/, accessed for the year 2015. https://www.openstreetmap.org, accessed March 2016. https://wiki.openstreetmap.org/wiki/, accessed March 2016. Huang, S., and Ran, B. (2003) “An Application of Neural Network on Traffic Speed Prediction Under Adverse Weather Condition,” Presented at 82nd Annual Meeting of the Transportation Research Board, Washington, D.C. 141 Huang, W.H., Song G.J., Hong H.K, and Xie K.Q. (2014) “Deep Architecture for Traffic Flow Prediction: Deep Belief Networks with Multitask Learning.” IEEE Transactions on Intelligent Transportation Systems. 15(5):2191–201. Hunter, T., Herring, R., Abbeel, P., Bayen, A. (2009). “Path and travel time inference from GPS probe vehicle data,” Neural Information Processing Systems Foundation (NIPS) Conference, Vancouver, Canada. Hunter, T., Hofleitner, A., Reilly, J., Krichene, W., Thai, J., Kouvelas, A., Abbeel, P., and Bayen, A. (2013) “Arriving on time: Estimating travel time distributions on large-scale road networks,” Technical Report arXiv: 1302.6617, arXiv. Jenelius, E., and Koutsopoulos, H. N. (2013) “Travel Time Estimation for Urban Road Networks Using Low Frequency Probe Vehicle Data,” Transportation Research Part B: Methodological 53: 64–81. Jiang, X., and Adeli, H. (2005) "Dynamic wavelet neural network model for traffic flow forecasting," Journal of Transportation Engineering, vol. 131, pp. 771- 779. Kamarianakis, Y., and Prastacos, P. (2005) "Space–time modeling of traffic flow," Computers & Geosciences, vol. 31, pp. 119-133. Karlaftis, M. G., and Vlahogianni, E. I. (2009) "Memory properties and fractional integration in transportation time-series," Transportation Research Part C: Emerging Technologies, vol. 17, pp. 444-453. Khoei, A. M., Bhaskar, A., and Chung, E. (2013) “Travel time prediction on signalized urban arterials by applying SARIMA modelling on Bluetooth data,” In 36th Australasian Transport Research Forum (ATRF), Brisbane, Australia. 142 Kim, S., and Kim, J.-H. (2001) “Adaptive Fuzzy-Network-Based C-Measure Map Matching Algorithm for Car Navigation System,” IEEE Transactions on Industrial Electronics. 48(2): p. 432-441. Koller, D., and Friedman, N. (2009). Probabilistic Graphical Models. Massachusetts: MIT Press. ISBN 0-262-01319-3. Krumm, J., Letchner, J., and Horvitz, E. (2007) “Map Matching with Travel Time Constraints,” in Society of Automotive Engineers (SAE) 2007 World Congress: Detroit, Michigan, USA. Kumar, S. V., Vanajaksh L. (2015) “Short-term traffic flow prediction using seasonal ARIMA model with limited input data,” European Transport Research Review, 7(21):1-9. Lecture on Hausdorff and Frechet distance, accessed July 2017, https://www2.cs.duke.edu/courses/spring07/cps296.2/scribe_notes/lecture23.pdf. Levin, M. and Tsao, Y.D. (1980) "On Forecasting Freeway Occupancies and Volumes (Abridgment)," Transportation Research Record. Leshem, G., and Ritov, Y. a. (2007) "Traffic Flow Prediction using Adaboost Algorithm with Random Forests as a Weak Learner," International Journal of Intelligent Technology, vol. 2. Markovic, N., Sekuła, P., Laan, Z. V., Andrienko, G., and Andrienko, N. (2017) “Applications of Trajectory Data in Transportation: Literature Review and Maryland Case Study”. IEEE Transactions On Intelligent Transportation Systems, 143 Min, W., and Wynter, L. (2011) "Real-time road traffic prediction with spatiotemporal correlations," Transportation Research Part C: Emerging Technologies, vol. 19, pp. 606-616. Murphy, K.P. (2012) “Machine learning: a probabilistic perspective.” The MIT Press. Myung, J., Kim, D. K., Kho, S. Y., and Park, C. H. (2011) "Travel Time Prediction Using k Nearest Neighbor Method with Combined Data from Vehicle Detector System and Automatic Toll Collection System," Transportation Research Record, pp. 51-59. Newson, P. and Krumm, J., (2009) “Hidden Markov Map Matching Through Noise and Sparseness,” Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS). Park, T., and Lee, S. (2004) “A Bayesian approach for estimating link travel time on urban arterial road network,” Lecture Notes in Computer Science, p. 1017, – 1025. Peng, Y. Lei, M. Li, J. B., and Peng, X. Y. (2014) “A novel hybridization of echo state networks and multiplicative seasonal ARIMA model for mobile communication traffic series forecasting,” Neural Computing and Applications, vol. 24, no. 3-4, pp. 883–890. Perry, R. W., & Greene, M. R. (1982). “The role of ethnicity in the emergency decision making process,” National Emergency Training Center. 144 Qiao, W., Haghani, A., Hamedi, M. (2012) “Short-Term Travel Time Prediction Considering the Effects of Weather,” Transportation Research Record: Journal of the Transportation Research Board 2308 (1), 61–72. RITIS. (2017) The regional integrated transportation information system (RITIS). http://www.cattlab.umd.edu/?portfolio=ritis. Accessed: 2017. Robinson, S., and Polak, J. (2005) "Modeling Urban Link Travel Time with Inductive Loop Detector Data by Using the k-NN Method," Transportation Research Record: Journal of the Transportation Research Board, vol. 1935, pp. 47-56. Russell, S., and Norvig, P. (2009) “Artificial Intelligence: A Modern Approach,” Pearson; 3rd edition, p. 496. Sharifi, E., Ernest, S. Y., Eshragh, S., Hamedi, M, Reuben ,M. J., and Kaushik,K. (2017) “Outsourced probe data effectiveness on signalized arterials,” Journal of Intelligent Transportation Systems, 21:6, 478-491 Smith, B., and Demetsky, M. (1997) "Traffic Flow Forecasting: Comparison of Modeling Approaches," Journal of Transportation Engineering, vol. 123, pp. 261- 266. Stathopoulos A., and Karlaftis, M. G. (2003)"A multivariate state space approach for urban traffic flow modeling and prediction," Transportation Research Part C: Emerging Technologies, vol. 11, pp. 121-135. Sun, S., Zhang, C. and Yu, G. (2006) "A Bayesian network approach to traffic flow forecasting," Intelligent Transportation Systems, IEEE Transactions on, vol. 7, pp. 124-132. 145 Tang, L., Kan, Z. Zhang, X., Yang, X., Huang, F., and Li, Q. (2016) “Travel time estimation at intersections based on low-frequency spatial-temporal GPS trajectory big data”, Cartography and Geographic Information Science, 43:5, 417-426. Telenav, http://www.telenav.com/products, accessed June 2017. Tiesyte, D., and Jensen, C. S. (2008) “Similarity-based prediction of travel times for vehicles traveling on known routes,” In Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM. Tsirigotis, L., Vlahogianni, E. I., and Karlaftis, M. G. (2011) “Does Information on Weather Affect the Performance of Short-term Traffic Forecasting Models?” Presented at 90th Annual Meeting of the Transportation Research Board, Washington, D.C. Veltkamp, R.C., (2001) “Shape matching: similarity measures and algorithms. in Shape Modeling and Applications”, SMI 2001 International Conference. Vlahogianni, E. I., Golias, J. C., Karlaftis, M. G. (2004) "Short- term traffic forecasting: Overview of objectives and methods," Transport Reviews, vol. 24, pp. 533-557. Westgate, B. S., Woodard, D. B., Matteson, D. S., and Henderson, S. G. (2016) “Large-network travel time distribution estimation, with application to ambulance fleet management,” European Journal of Operational Research. Westgate, B. S., Woodard, D. B., Matteson, D. S., and Henderson, S. G. (2013) “Travel time estimation for ambulances using Bayesian data augmentation,” Annals of Applied Statistics, 7:1139–1161. 146 Williams, B. M., Durvasula, P. K., and Brown, D. E. (1998) "Urban freeway traffic flow prediction: application of seasonal autoregressive integrated moving average and exponential smoothing models," Transportation Research 158 Record: Journal of the Transportation Research Board, vol. 1644, pp. 132- 141. Williams, B.M. and Hoel, L.A. (2003) “Modeling and Forecasting Vehicular Traffic Flow as a Seasonal ARIMA Process: Theoretical Basis and Empirical Results,” Journal of Transportation Engineering, 129(6), 664–672. Woodard, D. B., Nogin, G., Koch, P., Racz, D., Goldszmidt, M, and Horvitz, E. (2017). “Predicting travel time reliability using mobile phone GPS data,” Transportation Research Part C, 75: 30-44. Pdf Yang, Q., Wu, G., Boriboonsomsin, K., & Barth, M. (2017). “A novel arterial travel time distribution estimation model and its application to energy/emissions estimation,” Journal of Intelligent Transportation Systems, 1–13. Yin, H., Wong, S., Xu, J., and Wong, C. (2002) "Urban traffic flow prediction using a fuzzy-neural approach," Transportation Research Part C: Emerging Technologies, vol. 10, pp. 85-98. Yu, B. Yang, Z. Z. and Chen, K. (2010) "Hybrid model for prediction of bus arrival times at next station," Journal of Advanced Transportation, vol. 44, pp. 193- 204. Yu, R., Li, Y., Shahabi, C., Demiryurek, U., and Liu, Y. (2017) “Deep learning: A generic approach for extreme condition traffic forecasting”. In Proceedings of the 147 SIAM International Conference on Data Mining (SDM), Houston, TE, USA, 27– 29. Zeng, X., and Zhang, Y. (2013) "Development of Recurrent Neural Network Considering Temporal-Spatial Input Dynamics for Freeway Travel Time Modeling," Computer-Aided Civil and Infrastructure Engineering. Zhang, Y. (2015) “Uncertainty Associated with Travel Time Prediction: Advanced Volatility Approaches and Ensemble Methods,” PhD thesis, University of Maryland, 789 East Eisenhower Parkway. 148