ABSTRACT

Title of dissertation: COPULA BASED POPULATION SYNTHESIS AND BIG DATA DRIVEN PERFORMANCE MEASUREMENT

Kartik Kaushik, Doctor of Philosophy, 2019

Dissertation directed by: Professor Cinzia Cirillo, Department of Civil Engineering

Transportation agencies all over the country are facing fiscal shortages due to the increasing costs of managing and maintaining facilities. The political reluctance to increase gas taxes, the primary source of revenue for many government transportation agencies, along with the improving fuel efficiency of automobiles sold to consumers, only exacerbates the financial dire straits. The adoption of electric vehicles threatens to completely stop the inflow of money into federal, state and regional agencies. Consequently, expansion of the network and infrastructure is slowly being replaced by a more proactive approach to managing the use of existing facilities. The insight required to manage the network more efficiently is made possible in part by a massive increase in the type and volume of available data. These data are paving the way for network-wide Intelligent Transportation Systems (ITS), which promise to maximize utilization of current facilities. The waves of change overtaking the usual business affairs of transportation agencies have prompted the development and application of various analytical tools, models and procedures to transportation. Contributions to this growth of analysis techniques are documented in this dissertation.

There are two main domains of transportation, demand and supply, which need to be managed simultaneously to push towards optimal use of resources and facilities, and to minimize negative impacts like time wasted in delays, environmental pollution, and greenhouse gas emissions. The two domains are quite distinct and require specialized solutions to their problems. This dissertation documents the developed techniques in two sections, addressing the two domains of demand and supply. In the first section, a copula based approach is demonstrated to produce a reliable and accurate synthetic population, which is essential to estimate demand correctly. The second section deals with big data analytics using simple models and fast algorithms to produce results in real time. The techniques developed target short-term traffic forecasting, linking of multiple disparate datasets to power niche analytics, and quickly computing accurate measures of highway network performance to inform decisions made by facility operators in real time.

The analyses presented in this dissertation target many core aspects of transportation science, and enable the shared goal of providing safe, efficient and equitable service to travelers. A synthetic population in transportation is used primarily to estimate transportation demand from an Activity Based Modeling (ABM) framework containing well-fitted behavioral and choice models. It allows accurate verification of the impacts of policies on the travel behavior of people, enabling confident implementation of policies, like setting transit fares or tolls, designed for the common benefit of many. Further, accurate demand models allow for resilient and resourceful planning of new infrastructure and repurposing of existing infrastructure and assets.
On the other hand, short-term traffic speed predictions and speed based reliable performance measures are key to providing advanced ITS, like real-time route guidance, traveler awareness, and others, geared towards minimizing time, energy and resource wastage, and maximizing user satisfaction. Merging of datasets allows transfer of data such as traffic volumes and speeds between them, allowing computation of the global and network-wide impacts and externalities of transportation, like greenhouse gas emissions and the time, energy and resources consumed and wasted in traffic jams.

COPULA BASED POPULATION SYNTHESIS AND BIG DATA DRIVEN PERFORMANCE MEASUREMENT

By Kartik Kaushik

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2019.

Advisory Committee:
Professor Cinzia Cirillo, Chair and Advisor
Professor Eric V. Slud
Professor Parthasarathi Lahiri, Dean's Representative
Dr. Sevgi Erdogan
Dr. Stanley E. Young

© Copyright by Kartik Kaushik 2019

Dedication

To my beloved sister Syamini Kaushik and my beautiful wife Preeti Lakhole. Two women I cherish the most.

Contents

List of Abbreviations
1 Introduction
  1.1 Problem Statement
  1.2 Research Objectives
  1.3 Contributions
  1.4 Dissertation Organization
2 Introduction to Population Synthesis
  2.1 Overview of Developed Method
  2.2 Theory of Copulas
    2.2.1 Empirical Marginals
3 A Review of Population Synthesizers
4 Description of Data Used for Population Synthesis
  4.1 American Community Survey
  4.2 U.S. Decennial Census
5 Developed Methodology for Population Synthesis
  5.1 Data Simulation
  5.2 Probability Integral Transformation using Empirical Marginal Distributions
  5.3 Likelihood Maximization
  5.4 Goodness of Fit Testing
    5.4.1 Parametric Bootstrap
  5.5 Inverse Probability Integral Transformations of Marginals
6 Analysis of Synthesized Population
  6.1 Copula Construction
  6.2 Population Synthesis
  6.3 Discussion
7 Alternative Tests for Copula Goodness of Fit
  7.1 Data Preparation
  7.2 Results
8 Conclusions from Population Synthesis
9 Introduction to Performance Measurement
  9.1 Need for Short-Term Traffic Predictions
    9.1.1 Motivation
    9.1.2 Synthetic Time Series
  9.2 Requirement of Performance Measurement
  9.3 Conflating Two Geospatial Datasets
  9.4 Merging Congestion and Incident Data
10 Review of Performance Measurement
  10.1 Literature on Traffic Speed Prediction
  10.2 Performance Measurement Literature
  10.3 Literature on Geospatial and Temporal Conflation
11 Description of Datasets used for Performance Measurement
  11.1 GPS Probe Based Datasets
    11.1.1 Vehicle Probe Project (VPP)
    11.1.2 National Performance Management Research Data Set (NPMRDS)
    11.1.3 Bottleneck Dataset
  11.2 Geospatial Datasets
    11.2.1 Highway Performance Monitoring System (HPMS)
    11.2.2 TomTom Multinet
    11.2.3 Traffic Message Channel (TMC) Codes for Linear Referencing (LR)
  11.3 Other Datasets
    11.3.1 Bluetooth Traffic Monitoring (BTM)
    11.3.2 Incident Dataset
  11.4 Preprocessing Data
    11.4.1 Preprocessing VPP Data for Forecasting
    11.4.2 Preprocessing Datasets Used for Performance Measurement
    11.4.3 Preprocessing Geospatial Datasets before Conflating
    11.4.4 Merging Incidents and Congestion
12 Methodology and Algorithms for Performance Measurement
  12.1 Synthetic Time Series Method
  12.2 Method to Compute Performance Measures from NPMRDS
    12.2.1 Percentile Method for Performance Measurement
  12.3 Methodology to Effect Spatial Conflation and Temporal Disaggregation
    12.3.1 Computing Geometric Information
    12.3.2 Complete Algorithm
  12.4 Method to Merge Incidents and Congestion Data
13 Results from Performance Measurement Methods
  13.1 Results from Forecasting Traffic Speeds
    13.1.1 Model Selection
    13.1.2 Forecasting
  13.2 Results from Performance Measurement
  13.3 Results from Spatial Conflation of HPMS and Multinet Datasets
  13.4 Results from Merging Incident and Congestion Data
    13.4.1 4th Quadrant Bottleneck at I-695 and MD-147
    13.4.2 2nd Quadrant Bottleneck at I-95 and MD-43
14 Conclusions Derived from Performance Measurement
  14.1 Synthetic Time Series Forecasting Method
  14.2 Working with NPMRDS
  14.3 Spatial Conflation of Geospatial Datasets
  14.4 Merging Incidents and Congestion Datasets
15 Dissertation Summary and Conclusions
16 Proposed Future Extensions to Presented Work
  16.1 Universal Data Synthesis Model
  16.2 Population from Aggregate Data
  16.3 Extending the Forecasting Framework
A Additional Monte Carlo Simulations with Copulas
Bibliography

List of Tables

5.1 Observations of Simulated Data from Gumbel Copula
5.2 Observations of Binned Simulated Data from Gumbel Copula
5.3 Pseudo-Observations of Binned Simulated Data from Gumbel Copula
5.4 Parameters Estimated for Archimedean Copula Families
5.5 Parameters Estimated for Archimedean Copula Families
6.1 Observations from Training Dataset for PUMA 1107
6.2 Pseudo-Observations from Training Dataset for PUMA 1107
6.3 Constructed Archimedean Copulas for PUMA 1107
6.4 Crosswalk between ACS PUMS and Decennial Census Matrix
7.1 Results from Estimating Gumbel Copula to Data Groups
11.1 Summary of the Segments Chosen for Study
11.2 Summary of Bottlenecks
11.3 Summary of Incidents
11.4 Highlighted Incidents Summary
12.1 Scoring algorithm tests and points
13.1 Order Selection
13.2 Total absolute conflation error by state and for the whole country
A.1 Constructed Archimedean Copulas for PUMA 1107
A.2 Constructed Archimedean Copulas for PUMA 1107
A.3 Constructed Archimedean Copulas for PUMA 1107

List of Figures

6.1 Map of PUMA 1107
6.2 Errors in Synthetic Population Marginals
7.1 Kendall's Rank Correlation Among Dimensions in ACS PUMS for PUMA 1107
7.2 Kendall's Rank Correlation Among Dimensions in the Synthetic Population Datasets
9.1 Overlayed Traffic Speed Observations from a TMC Segment
11.1 Graphical representation of the function to detect bottlenecks
11.2 Maryland Important Roads Network
11.3 Geographic locations of performance measurement case study segments
11.4 Comparison of Networks before Conflation
11.5 Schematic Representation of R-tree Indexing [108]
11.6 All bottlenecks in Maryland recorded in 2011
11.7 Map of all Incidents Recorded by CHART in Maryland during 2011
12.1 Scatter plot of travel times from NPMRDS illustrating overlaying method
12.2 Cumulative frequency distribution plots produced from overlayed data in 12.1
13.1 Box Plot of RRMSPE Using all Predictions by Lag
13.2 RRMSPE for all Segments, each Day by Lag
13.3 RRMSPE for all Segments, for each Lag by Day
13.4 RRMSPE for all Segments, for each Lag by Period in Day
13.5 Box Plot of RRMSPE for all Segments, for each Lag by Period in Day
13.6 Relative Deviation for all Segments, for each Lag by Minute in Day
13.7 Overlaid travel times, and associated cfd plots for NJ11-0009
13.8 Overlaid travel times, and associated cfd plots for VA08-0012
13.9 Comparison of Network after Conflation
13.10 Jointly visualizing conflated speed and volume data for two corridors
13.11 Map of all bottlenecks in Maryland from 2011 overlain by incidents also from 2011
13.12 Scatter plot of count of incidents and Bottleneck Impact Factor
13.13 Scatter plot of count of incidents and Bottleneck Impact Factor divided into four quadrants
13.14 Map of the bottleneck at I-695 and MD-147
13.15 Time-time plot showing impact of incidents on bottleneck at I-695 and MD-147
13.16 Map of the bottleneck at I-95 and MD-43
13.17 Time-time plot showing impact of incidents on bottleneck at I-95 and MD-43
List of Abbreviations

AADT  Annual Average Daily Traffic
ABM  Activity Based Model
ACM  Association for Computing Machinery
ACS  American Community Survey
ADAPTS  Agent-based Dynamic Activity Planning and Travel Scheduling
AITS  Advanced Intelligent Transportation System
ALBATROSS  A Learning Based Transportation Oriented Simulation System
ARIMA  Auto Regressive Integrated Moving Average
ASTIN  Actuarial Studies in Non-Life Insurance
ATAS  Advanced Traveller Assistance System
ATIS  Advanced Traveller Information System
BIC  Bayesian Information Criterion
BTI  Buffer Time Index
BTM  Bluetooth Traffic Monitoring
BTR  Basic Trust Region
CATT  Center for Advanced Transportation Technology
CDF  Cumulative Distribution Function
CEMDAP  Comprehensive Econometric Micro-simulator for Daily Activity-travel Patterns
CHART  Coordinated Highway Action Response Team
CRAN  Comprehensive R Archive Network
DC  District of Columbia
DOT  Department of Transportation
EPA  Environmental Protection Agency
ETH  Eidgenössische Technische Hochschule (Swiss Federal Institute of Technology)
EUSFLAT  European Society for Fuzzy Logic and Technology
FHWA  Federal Highway Administration
FIPS  Federal Information Processing Standards
FRC  Functional Road Classification
GHG  Greenhouse Gases
GPS  Global Positioning System
GoF  Goodness of Fit
HPMS  Highway Performance Monitoring System
IAA  International Actuarial Association
IEEE  Institute of Electrical and Electronics Engineers
IET  Institution of Engineering and Technology
ILUTE  Integrated Land Use, Transportation, Environment
INFORMS  Institute for Operations Research and the Management Sciences
IPF  Iterative Proportional Fitting
IPU  Iterative Proportional Updating
IQR  Inter Quartile Range
IRS  Internal Revenue Service
ISO  International Standards Organization
ITS  Intelligent Transportation System
IVT  Institut für Verkehrsplanung und Transportsysteme (Institute for Transport Planning and Systems)
JSTOR  Journal Storage
LLC  Limited Liability Company
MAC  Media Access Control
MAP-21  Moving Ahead for Progress in the 21st Century Act
MCMC  Monte Carlo Markov Chain
MD  Maryland
MPO  Metropolitan Planning Organization
NA  Not Applicable/Available
NALCA  North American Location Code Alliance
NATMCCA  North American TMC Code Alliance
NHS  National Highway System
NHTS  National Household Travel Survey
NPMRDS  National Performance Management Research Dataset
NPRM  Notice of Proposed Rule Making
NREL  National Renewable Energy Lab
ORNL  Oak Ridge National Laboratory
OSM  OpenStreetMap
OUP  Oxford University Press
PDF  Portable Document Format
PM  Performance Measurement
PTI  Planning Time Index
PUMA  Public Use Micro Area
PUMS  Public Use Micro Sample
RMSE  Root Mean Squared Error
RRMSPE  Relative Root Mean Squared Prediction Error
RV (rv)  Random Variable
SAE  Small Area Estimation
SHA  State Highway Administration
STRAHNET  Strategic Highway Network
T.T.I.  Texas Transportation Institute
TAZ  Traffic Analysis Zone
TIM  Traffic Incident Management
TISA  Traveler Information Services Association
TMC  Traffic Message Channel
TMS  Traffic Message Sign
TOPS Lab  Traffic Operations and Safety Laboratory
TPEG  Transport Protocol Experts Group
TRANSIMS  TRansportation ANalysis SIMulation System
TTI  Travel Time Index
U.K.  United Kingdom
U.S.  United States (of America)
UMD  University of Maryland
VPP  Vehicle Probe Project
WGS  World Geodetic System
WSDOT  Washington State Department of Transportation
cdf  cumulative distribution function
cfd  cumulative frequency distribution
cmf  cumulative mass function
pmf  probability mass function

Chapter 1: Introduction

There are two main aspects of managing the performance of transportation facilities: the supply side and the demand side. Supply side management deals with allocating the required resources to meet the demand. It also deals with controlling access, or imposing other restrictions, including tolling, taxing, and so on, to ensure the available resources are better utilized. The insights required for such management depend heavily on real-time data and analytics, and benefit from accurate short-term predictions, as management can then be proactive rather than reactive. While supply side management is dependent mainly on drafting the right control policies and the political will to implement them, demand side management is rather more difficult, as it involves not only inferring the behavior of individuals, but also influencing individual behavior towards system optimality.

Recently, tremendous progress has been made in understanding and modeling human choices and behavior by studying and implementing a new framework of Activity Based Models (ABM). ABM takes the view that travel demand is endemic to human nature, that it is a product of the activities of individuals cognizant of their social groups. Estimates of demand from ABM are based on simulations and require a reliable synthetic population. In this dissertation, the synthetic population is created using copulas, which require individual observations from a sample of people. Copulas are multivariate distribution functions that represent just the underlying dependence in multivariate random variables, decoupled from the univariate marginal distributions of the sample data. Copulas are well suited to capturing the dependence between various characteristics of people, such as age, gender, race, education, income and others, because they are not biased by the marginal distributions. Therefore, a copula that captures such multivariate dependence in one area can be used to generate a synthetic population with the same dependence structure in a target area, so long as it is reasonable to expect that dependence structure to hold there. While this assumption may be empirically testable, it can be expected to hold in most cases, as people with similar socioeconomic and demographic profiles lead similar lifestyles.

Forecasting of traffic patterns, even over short-term horizons, is a very complex problem, as there are many internal and external variables that affect roadway conditions. However, by accumulating sufficient data, some patterns can be gleaned, which can be used to increase the accuracy of the predictions. An innovative adaptation of the small area estimation method is used for predicting traffic speeds. The method considers the future unobserved time points as small areas, and borrows strength from historic observations of the speeds. Similarly, accumulating sufficient data from the recent past over the same time window, say 24 hours, allows filtering of outliers and direct estimation of traffic conditions for any time interval of interest. The congestion measures thus developed can be computed quickly, in real time, or used to study certain effects retrospectively.
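To make the overlaying idea concrete, the sketch below accumulates several days of speed observations for a single segment, overlays them on a common time-of-day axis, and summarizes each interval with percentiles, so that a robust central value and the congested and free-flow envelopes fall out of the same computation. This is only an illustration of the general approach; the actual performance measures and outlier filtering rules are developed in chapter 12, and the 5-minute resolution, column names and synthetic data assumed here are not taken from the dissertation.

    # Illustrative sketch: overlay recent daily speed observations on a common
    # time-of-day axis and summarize each interval with percentiles.
    # Assumed input: a pandas DataFrame with "timestamp" and "speed" columns
    # at 5-minute resolution for one roadway segment.
    import numpy as np
    import pandas as pd

    def overlay_percentiles(obs: pd.DataFrame, lo: float = 5, hi: float = 95) -> pd.DataFrame:
        """Group observations from many days by time of day and compute percentiles."""
        df = obs.copy()
        # Minute of day gives the common axis on which all days are overlaid.
        df["minute_of_day"] = df["timestamp"].dt.hour * 60 + df["timestamp"].dt.minute
        grouped = df.groupby("minute_of_day")["speed"]
        return grouped.agg(
            p_low=lambda s: np.percentile(s, lo),   # congested-tail envelope
            median="median",                        # typical condition, robust to outliers
            p_high=lambda s: np.percentile(s, hi),  # free-flow-tail envelope
            n="count",
        )

    # Example usage with synthetic data standing in for probe speeds.
    rng = np.random.default_rng(0)
    stamps = pd.date_range("2011-01-03", periods=7 * 288, freq="5min")  # one week
    speeds = 55 + 10 * np.sin(2 * np.pi * stamps.hour / 24) + rng.normal(0, 5, stamps.size)
    profile = overlay_percentiles(pd.DataFrame({"timestamp": stamps, "speed": speeds}))
    print(profile.head())

Because each time-of-day interval is summarized by order statistics over many days, a handful of extreme readings moves the envelopes very little and the median essentially not at all, which is why percentile summaries and outlier filtering can be obtained in one pass.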
The techniques to simultaneously filter outliers and compute performance measures have been adapted by the Federal Highway Administration (FHWA) for drafting reporting rules under the Moving Ahead for Progress in the 21st Century Act (MAP-21), which directly impacts the federal funding provided for maintenance and management of facilities.

Data used in transportation exist in very disparate datasets, which are reported to and are available in different spatial and temporal domains. Many studies and uses of the data are hampered by these discrepancies. Consequently, techniques to merge different datasets together across spatial and temporal differences are in high demand. Two innovative conflation methods that merge different datasets together are demonstrated in this dissertation. One method filters incidents to include just those that occur in the spatial and temporal vicinity of traffic bottlenecks on the roadway. This allows attributing incidents to bottlenecks, simplifying computation of the true, system-wide cost of an incident, in terms of wasted time, energy, resources, and emissions, apart from the direct costs due to life, limb and property damages. The second conflation algorithm joins two roadway networks in the same spatial plane, but with different geometry representations, enabling transfer of information, such as traffic volumes, from one map layer to another. Volume information especially allows estimating the impacts due to transportation, and computing costs and resource consumption over all users. An extension to this method exposes issues with the National Performance Management Research Data Set (NPMRDS), a federal dataset of traffic speeds and volumes on roadways, made available to transportation agencies to aid in MAP-21 reporting.

1.1 Problem Statement

Multiple problems are tackled in this dissertation, with the underlying theme of improving the data analytics powering the insights into transportation decisions made for providing safe, efficient, speedy and equitable travel for all. The issues resolved by the developed methods include aspects of both the demand for travel and the supply of infrastructure and facilities to meet that demand.

A dataset containing full socioeconomic and demographic records of everyone in the region of interest is very expensive to obtain and maintain. However, such a dataset is crucial for simulating the choices of every individual in the region of interest. These choices ultimately precipitate as demand for travel and for specific transportation modes. Understanding these choices allows preparing the right solution to meet the demand, and incentivizing choices that move the system closer to optimality. Therefore, a model that can accurately produce the population at the individual level in the area of interest is of utmost importance.

Management of facilities in real time requires fast insights, usually obtained from big data observed all over the network. The methods and algorithms therefore need to be accurate, yet compute quickly enough to produce results in real time. Accordingly, the method developed to compute performance measures simultaneously filters outliers, and can be deployed in a rolling manner to produce results for any number of segments in the roadway network. Similarly, the short-term forecasting models are fitted and selected beforehand for each day of the week. Both methods use historic observations from a few weeks to even years prior to the time point at which results are required.
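Purely as an illustration of borrowing strength from historic observations, the sketch below predicts the next interval's speed by shrinking the latest reading toward the historical mean for the same weekday and time of day. It is a stand-in, not the dissertation's synthetic time series model; the weighting constant and the data layout are assumptions made for the example.

    # Minimal illustration of borrowing strength from historical observations.
    # Not the synthetic time series model developed in chapter 12: a simple
    # shrinkage between the current reading and the historical time-of-day mean.
    # Assumed input: `history` maps (weekday, minute_of_day) -> list of past speeds.
    from statistics import mean

    def forecast_next(history, weekday, minute_of_day, current, alpha=0.6):
        """Weighted combination of the current speed and the historical mean."""
        past = history.get((weekday, minute_of_day), [])
        if not past:                      # no history available: persist the current value
            return current
        return alpha * current + (1 - alpha) * mean(past)

    # Example: Monday 8:05 AM, current reading 42 mph, history centered near 35 mph.
    history = {(0, 485): [34.0, 36.5, 33.8, 35.2]}
    print(round(forecast_next(history, 0, 485, 42.0), 1))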
Often, in transportation science, the data required for producing certain important results are separated over multiple datasets. Merging the data is essential to produce those results, which form the core of reports and studies of long-term metrics on transportation. Such reports are now mandatory under the MAP-21 rules. Therefore, algorithms that accurately merge data together across spatial and temporal domains are highly sought after. The algorithms developed here not only merge incidents, roadway speeds and volume data together across a period of a year, but also uncovered merge problems in the FHWA provided NPMRDS dataset.

1.2 Research Objectives

The main goal of the research documented in this dissertation was to produce solutions to problems and strengthen weak links in the state of the art of transportation science. The solutions also add to human knowledge, as they are innovative and novel in the areas where they are applied. Providing multidisciplinary solutions was also an objective of the conducted research, and all developed methods are strongly founded in multidisciplinary practices, using techniques from the domains of probability, statistics, economics, transportation, and others. Lastly, the developed methods are simple and straightforward, with a low barrier to entry for real-world applications.

The use of copulas for population synthesis was motivated by the need to capture the dependencies among the characteristics of people, and to increase the number of characteristics simultaneously modeled. At the same time, the copula approach combines data from different datasets, which also enables additional analysis on the combined data. Computing performance measures, after filtering outliers, was in high demand in the industry after the NPMRDS was released. Similarly, the computing assets and technical expertise of professionals working for government transportation agencies influenced the choice of the forecasting method. Therefore, the developed methods were designed to be simple, fast and easy to implement on standard hardware. Lastly, the conflation of multiple geocoded datasets allows producing results that are not possible with a single dataset, vastly simplifying the analysis required to produce the needed insights.

The underlying objectives central to the developed methods described in this dissertation are to use simple and efficient yet effective techniques to achieve the combined goal of reliable and accurate estimation of the required variables. Many of the supply side methods are targeted at practitioners, while the synthetic population is meant for use by advanced modelers implementing ABM.

1.3 Contributions

Transportation is undergoing massive shifts to become more efficient, equitable and ubiquitous, and much of this change is powered by data, analytics and policies driven by the results of those analytics. Most of the data available today did not exist a decade ago. Further, the breakneck pace of technological revolution is constantly opening new modes, options and challenges in transportation domains. Compounded by advancements in computation power and improved, yet cheap, access to a high number of processors, many models that had only been theoretically possible have been implemented. The methods documented in this dissertation have been produced to address the challenges encountered over the past six years. Consequently, there is no single problem definition to the work documented in this dissertation.
The tackled issues can also be classified into the demand and supply domains. Population synthesis to census tracts, which act as building blocks when estimating origin-destination matrices, mode share matrices, and other demand related aspects, had been nearly impossible before the development of copula based techniques. Similarly, the method to simultaneously filter outliers and estimate performance measures for the segments forming road networks is fast and effective not only for real-time applications, but was also used by FHWA in drafting reporting rules under MAP-21. Further, the synthetic time series framework for forecasting traffic speeds is extremely fast and versatile, making accurate predictions for the whole roadway network every minute, in real time. Lastly, the conflation algorithm not only enables transfer of volume data for estimation of the total impacts of transportation, but also uncovers issues with the NPMRDS, a dataset crucial for reporting under MAP-21 and securing federal funds.

The contributions of the developed methods documented in this dissertation are summarized below:

1. A novel method using copulas to synthesize a population representative of a geographical area that overcomes limitations in the current state of the art. The obtained synthetic population can be used with ABM and other microsimulation models to provide accurate estimates of demand on links in the network.
2. Beginnings of solutions to problems inherent in using copulas with discrete and categorical data, and in validating the goodness of fit of copulas.
3. A flexible and extensible framework founded on the Small Area Estimation literature for producing short-term forecasts of various transportation related variables. Any model can be used in this framework to predict the required data. The model specification is also unrestricted, so any number of auxiliary variables can be used.
4. An algorithm to compute performance measures from the NPMRDS, despite the noise and outliers in the dataset. This algorithm has been adapted into the national performance management rules by the FHWA under the directions in MAP-21.
5. A spatial and temporal join of congestion and incident datasets so that a causative relation can be established. The findings are used by the Maryland State Highway Administration to direct patrol distribution on Maryland roads.
6. Conflation of speed and volume data so that estimates of collective impacts can be obtained. Collective impacts refer to the externalities of transportation, like the cost of delays, energy and resource consumption, and emission of greenhouse gases and other pollutants. The conflation is being used at the National Renewable Energy Laboratory to produce estimates of externalities.

1.4 Dissertation Organization

This dissertation is organized in two main sections corresponding to the demand and supply management domains of transportation. Each section has an introduction, a literature review, a description of the data used, a discussion of the methods applied, results and conclusions. The demand domain is covered in the first section, while the analytics required for supply management are documented in the second section. The overall conclusions and future work make up the last two chapters.

SECTION I
Population Synthesis

Chapter 2: Introduction to Population Synthesis

Managing demand is increasingly viewed as an effective solution to controlling the negative effects of transportation, chiefly congestion, environmental impacts, and wasted energy, time and resources.
Construction of new facilities in response to increasing demand, once viewed as the only solution, is no longer feasible due to the escalating costs of construction and subsequent maintenance of the facilities. Revenue streams have not kept pace with the rising costs, chiefly because gasoline taxes have not been raised in decades. Furthermore, increasing numbers of drivers, travel distances and vehicles, together with improving fuel efficiency, have exacerbated the wear and tear of roadways without contributing proportionately to revenues. The idea that building more capacity only fuels more demand until the network is congested again has become an adage.

Understanding demand is now the prime focus; however, it requires careful study of human lifestyles and decision making. A large set of logistic regression models has been developed over the past two decades to model the various choices made by people. The choices modeled include not only lifestyle decisions, like the selected activity, time spent doing the activity, location of residence, school, or work, vehicle ownership, and others, but also transportation variables, like the time of departure for the activity, mode of travel, etc. The Activity Based Model (ABM) framework combines these individual choice models, and thus allows for modeling joint decisions by all members of a household, while taking into consideration individual activity, work and other constraints, like dropping off or picking up kids from school, buying groceries, etc.

The models comprising ABM can be fitted using well developed and studied algorithms on sample datasets of people. However, to make predictions about travel behavior in the region of interest using ABM, it is critical to have a dataset that accurately represents the entire population of that region. Such a dataset would be used for simulating the choices of every individual. The most accurate option would be a detailed survey of each individual in the area of interest to populate the dataset. However, such a data collection effort is often perceived as too intrusive in private lives, quickly becomes prohibitively expensive, is difficult to maintain over time, and gets outdated in a few years as technology, policies, land use patterns, and lifestyles change. Therefore, a modeling solution is sought to create a synthetic population dataset that accurately mimics the true population of the area. The proposed method uses people as the primary unit, and attributes household level variables to individuals. However, it can also be expanded to other units that can be surveyed, such as families and households.

The synthetic population is produced using copulas, which are multivariate probability functions that model the dependence within multivariate random variables, independent of the univariate marginal distributions of each variate. Copulas fitted to survey data capture the dependence among various characteristics of people, and can be used to generate data with the same structure, producing the synthetic population. This allows transferring the dependence from one region to another, where the marginals might be different, but the dependence is expected to be the same. The dependence is usually transferable, as people within a certain age bracket, having completed a certain level of education, and working in a specific industry have similar scales of income.
The marginals from one location to another might be different; consider, for example, a neighborhood of predominantly blue collar workers compared to a neighborhood of affluent white collar employees, or the differences in pay scales or household sizes from one city to another.

Current state-of-the-art methods in the literature to synthesize population use an iterative procedure to estimate the contribution of basic demographic characteristics of people to the whole population. They use a method of ratios to compute the relative abundance of each record in a sample such that marginal totals are satisfied. However, given that this is a numerical procedure that requires a full enumeration of all samples in a matrix form, with marginal totals, it does not scale well with additional characteristics, or even additional attributes within each characteristic. Consequently, it is limited to very few dimensions. Copulas, on the other hand, are regularly used with hundreds of dimensions, and therefore have the potential to model not just basic demographic information, but also socioeconomic, and even lifestyle and choice variables that directly impact transportation demand.

Once an ABM is fit, the synthetic population is used to generate predictions about behavior and choices for areas of interest. The developed method can directly synthesize population to census tracts, which is impossible to do with the methods explored in the current literature, because the number of samples in any one census tract is too small. With predictions about the choices, including the likely mode of transport, for all simulated individuals in a census tract, an origin-destination matrix can be built to index the number of trips bound from one region, in a given time window, to destination regions. These trips can then be used to estimate the demand load on infrastructure pertinent to various modes, plan for future expansions of the network by mode, and determine fares, tolls and charges based on consumer profiles and willingness to pay.

The ABM, along with the synthetic population, when correct, provides a simple means to verify the impacts of various policy decisions. For example, the fuel price elasticity can be computed to understand the impacts of changing the fuel tax. Similarly, the impact of increasing the frequency or coverage of public transport on the people affected can be quickly estimated by changing a few basic control variables, like the cost and time of traveling. Expected demand increases can be used to set the tolls on dynamically tolled facilities to ensure the level of service is maintained on the facility while revenues are maximized.

In this half of the dissertation, the literature review is presented in chapter 3, followed by a description of the datasets used in chapter 4. Chapter 5 discusses the developed method using synthetic data for illustration. Additionally, various mathematical issues related to discrete data, which are methodologically unsolved, are raised, and workarounds are explored. The results are presented in chapter 6 for a randomly selected PUMA in Maryland and two randomly selected census tracts within that PUMA, as an example of the capability of the developed method. Beginnings of workarounds to the issues encountered with the use of discrete data in copulas, and tests of the similarity between the multivariate dataset produced by the constructed copulas and the original multivariate dataset, are presented in chapter 7. Finally, the conclusions are discussed in chapter 8.
2.1 Overview of Developed Method

Copulas are used to capture the dependence in basic characteristics reported by respondents in a survey. The goal is to use the constructed copula to generate a synthetic population that accurately reproduces the relations between various attributes of the real population. The developed method accomplishes this objective by jointly modeling household and individual data. The synthetic population so produced can be used in a large number of microsimulation models that require the population and its detailed characteristics in the study area. A class of such models are the ABM, which are increasingly gaining popularity for understanding and modeling the travel behavior and choices of individuals in family units [3, 8, 43, 51, 92, 103, 134]. Moreover, the method can also produce synthetic population in regions with a limited number of observations in the original dataset, that is, in small areas [41, 48, 54, 74, 97, 98, 99].

This study uses 1-year Public Use Micro Sample (PUMS) data coded to Public Use Micro Areas (PUMA), released by the American Community Survey (ACS) for the year 2016 [10]. The copulas are constructed using the PUMS data, which contain the characteristics of sampled respondents. The constructed copulas are then used to generate the synthetic population in census tracts. The generated data are inverse probability integral transformed using the aggregated totals provided by the 2010 Decennial U.S. Census [9]. The complete procedure illustrates population synthesis, with preserved dependence among characteristics, within small areas.

2.2 Theory of Copulas

A d-dimensional copula C is a multivariate distribution function on [0, 1]^d having all marginals uniformly distributed on [0, 1] [55, 57, 81]. Copulas are a fairly modern mathematical probability tool, increasingly used in applied areas, mainly in the field of actuarial sciences [23], as they allow modeling of jointly distributed random variables. We refer to Genest and Favre [33] and Okhrin, Ristig, and Xu [84] for a gentle introduction, and to the books by Joe [55, 57] and Nelsen [81] for readers seeking concrete mathematical proofs, theorems and derivations related to copulas. A fundamental result, due to Sklar [109], states that any multivariate distribution can be represented by means of a copula.

Theorem. Let H be a multivariate distribution function with margins F_1, \dots, F_d; then there exists a copula C such that

    H(x_1, \dots, x_d) = C\{F_1(x_1), \dots, F_d(x_d)\},  x_1, \dots, x_d \in \mathbb{R}.    (2.1)

If F_i is continuous for i = 1, \dots, d, then C is unique. Otherwise, C is uniquely determined on the Cartesian product of the ranges of the marginals, F_1(\mathbb{R}) \times \dots \times F_d(\mathbb{R}). Conversely, if C is a copula and F_1, \dots, F_d are univariate distribution functions, then the function H defined above is a multivariate distribution function with margins F_1, \dots, F_d.

We denote a parametrized copula by C_\theta, where \theta is the parameter vector (possibly of dimension one). A copula family, indexed by the parameter vector, spans a subset of the range of dependencies between two bounds given by the Fréchet–Hoeffding theorem,

    W(u_1, \dots, u_d) \le C_\theta(u_1, \dots, u_d) \le M(u_1, \dots, u_d),    (2.2)

where

    W(u_1, \dots, u_d) := \max\{1 - d + \sum_{i=1}^{d} u_i, 0\},
    M(u_1, \dots, u_d) := \min\{u_1, \dots, u_d\}.    (2.3)

M is the upper bound and represents the comonotone case. The independence copula, \Pi(u_1, \dots, u_d), lies somewhere in between W and M. The independence copula, also known as the product copula, has the distribution function

    \Pi(u_1, \dots, u_d) = \prod_{i=1}^{d} u_i.    (2.4)

Note that the upper bound M is always a copula and is sharp, that is, attainable by the function. However, the lower bound W is only a copula in two dimensions. In higher dimensions, W is point-wise sharp, that is, for any u = (u_1, \dots, u_d) there exists a copula C_u such that C_u(u) = W(u) [55, 57, 81]. Further, a copula family does not necessarily span the whole range between the upper and lower Fréchet–Hoeffding bounds [see table 4.1 in 81, for limiting and special cases of dependencies of many Archimedean copulas]. The identity (2.1), along with the inequality (2.2), shows how copulas allow measurement of the dependence in the data by decoupling the effects of the marginal processes on the data from the dependence structure [55, 57, 81].
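As a small illustration of this decoupling, the following sketch draws pairs from a bivariate Clayton copula (one of the Archimedean families used later) by conditional inversion and then attaches arbitrary continuous marginals; the rank correlation is fixed by the copula parameter and survives any strictly increasing change of the margins. The marginal distributions chosen here are arbitrary and serve only the illustration; the dissertation's estimation and simulation procedures are developed in chapter 5.

    # Illustrative sketch: sample a bivariate Clayton copula and attach
    # arbitrary margins, showing dependence (copula) decoupled from marginals.
    import numpy as np
    from scipy import stats

    def sample_clayton(n, theta, seed=0):
        """Draw n pairs from a bivariate Clayton copula (theta > 0) by conditional inversion."""
        rng = np.random.default_rng(seed)
        u1 = rng.uniform(size=n)
        t = rng.uniform(size=n)                     # uniform draw for U2 given U1 = u1
        u2 = ((t ** (-theta / (theta + 1.0)) - 1.0) * u1 ** (-theta) + 1.0) ** (-1.0 / theta)
        return u1, u2

    u1, u2 = sample_clayton(5000, theta=2.0)        # Kendall's tau = theta / (theta + 2) = 0.5
    tau_copula, _ = stats.kendalltau(u1, u2)

    # Attach arbitrary continuous marginals; the dependence is untouched because
    # strictly increasing transformations leave the ranks (and hence tau) unchanged.
    income = stats.lognorm(s=0.8, scale=60000).ppf(u1)   # skewed "income" margin
    age = stats.norm(loc=45, scale=15).ppf(u2)           # symmetric "age" margin
    tau_margins, _ = stats.kendalltau(income, age)

    print(round(tau_copula, 3), round(tau_margins, 3))   # both close to 0.5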
Given some data with intrinsic dependence, which must lie between the bounds of (2.2), multiple straightforward techniques are available to estimate the parameter of a copula family under various scenarios [34, 56, 67]. Among them, the maximum likelihood estimation method has been shown to be unbiased and efficient [34, 36, 66, 69, 70]. However, the challenge is to validate the correct copula family, as Sklar's representation theorem only ensures the existence of a copula, but does not provide any clues about the possible families or their construction. Moreover, the copula is not necessarily unique if the marginals are not continuous.

The uniqueness issue is relevant to this study because of the use of survey data, which are usually discretized with a predetermined and fixed number of responses, or are categories, or are aggregated into bins. For example, continuous variables like age, income, distance to school or work, area of house and land, and so on are usually reported in bins. Further, the survey also records responses which are discrete by default, like the number of household or family members, the number of rooms or cars, and so on. Some variables are inherently categorical, like race, gender, level of education, or industry of employment. Discrete variables can sometimes, but not always, be approximated by continuous variables. Categorical variables cannot be derived from a continuous distribution [57].

Consider a dataset \mathbf{X} formed of n independent and identically distributed copies of a d-dimensional vector X: \mathbf{X} = \{X_i\}, i \in \{1, \dots, n\}. We denote the j-th component of X_i by X_{ij}. If any X_{ij} is discrete, it may cause ties in the dataset. In other words, the data tend to repeat: X_{ij} = X_{kj} for some i \ne k and some j \in \{1, \dots, d\}. Surveys acknowledge this fact, and even take it a step further by providing weights with the data for each unique record. The sum of the weights is usually designed to equal the total population in the area surveyed. Weights are inherent to surveys, as surveys are conducted on a sample of the population [10, 74]. In the simplest use case, the weights are intended as a frequency measure that determines the number of times each record should be repeated to get a dataset approximating the full population of the area.

While, by definition, a copula is defined over the entire hypercube [0, 1]^d, in order to fit a copula on discrete data, we only have to consider it over the domain formed by the Cartesian product of the marginals [see, for example 71], an approach pursued in the method documented in this dissertation as well. There have also been some attempts at using Bayesian inference techniques to construct copulas with discrete marginals, but they have not been widely adopted by researchers [112, 113].
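Under the simple frequency interpretation of the weights described above, expanding a weighted sample into an approximate full population is a one-line operation, as the sketch below shows. The record layout is invented and does not follow the ACS PUMS schema, and the treatment of non-integer weights is ignored.

    # Illustrative sketch: treat survey weights as frequencies and repeat each
    # record accordingly to approximate the surveyed population.
    # The record layout is assumed, not taken from the ACS PUMS schema.
    import pandas as pd

    sample = pd.DataFrame({
        "age_group": ["18-24", "25-34", "65+"],
        "hh_size":   [1, 3, 2],
        "weight":    [120, 95, 60],      # person weights supplied with the survey
    })

    # Repeating each row index weight-many times yields an expanded frame that
    # approximates the full population represented by the sample.
    expanded = sample.loc[sample.index.repeat(sample["weight"])].drop(columns="weight")
    print(len(expanded), "synthetic person records")   # 275 = 120 + 95 + 60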
Maximum likelihood estimation can be successfully used to construct copulas with discrete marginals, as the copula family is specified before estimating the parameter [36]. The grunt work then remains to decide whether the chosen copula family is indeed appropriate for the data at hand. A tie-adapted procedure proposed by Kojadinovic [66] is used for testing the fit of the copulas strictly within the domain formed by the Cartesian product of the marginals.

2.2.1 Empirical Marginals

In order to construct copulas, specification of the marginals is essential, as they form the core of the copulas. Two main avenues can be used to approximate a marginal distribution. The first fixes the distribution of the marginals, subject to some parameters, before estimating the copula parameters, and is useful when the processes generating the marginals are known. The second approach relies on the empirical distribution function,

    \hat{F}_j(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_{ij} \le x),  j \in \{1, \dots, d\},    (2.5)

where \mathbf{1} is the indicator function. This two-step approach is favored in the literature, as it is immune to misspecification of the marginals. This approach further ensures that the copula is constructed directly from the data [34, 56, 64, 67, 70].

The empirical distribution function (2.5) is tantamount to the ranks of the observations, as R_{ij} = n \hat{F}_j(X_{ij}) is the rank of X_{ij} among X_{1j}, \dots, X_{nj}, j \in \{1, \dots, d\}. Probability integral transformed observations using (2.5) can be obtained as

    \hat{U}_i = \frac{1}{n + 1} (R_{i1}, \dots, R_{id}),  i \in \{1, \dots, n\},    (2.6)

where \hat{U}_i := (\hat{U}_{i1}, \dots, \hat{U}_{id}) are also called pseudo-observations. The asymptotically unbiased scaling factor n/(n + 1) ensures that the pseudo-observations are contained within (0, 1), preventing application of the inverse distributions to 0 or 1, which could produce infinities for unbounded distributions. Obtaining the marginal distributions using (2.6) accords some favorable advantages:

1. Since ranks are a maximally-invariant statistic of the data, the estimated marginals are invariant under scale transformations, or monotone increasing transformations of the data, and are produced from the data without loss of information [33].
2. Use of ranks prevents biases in inferences about the copulas [35, 66].
3. Rank-based measures allow for misspecification of the marginals [36]. Since explicit specification of the marginals is not possible for ACS data, this property is indispensable.
4. All of the accepted, reliable, and accurate goodness-of-fit tests in the literature use rank-based estimates of the marginals [40].

When the marginals are continuous, the empirical distribution function produces ordinal ranks, and (2.6) is a consistent and asymptotically unbiased estimator of the marginal distribution function. However, (2.5) produces maximal ranks when ties are present in the data. Maximal ranks are inherently extreme value distributed, and thus (2.6) is no longer unbiased. Therefore, average ranks are used in (2.6) instead of (2.5) [as also advised in 66]. Average ranks are essentially the average value of the ordinal ranks assigned to each tied value.

A simple example illustrates average ranking with ties. Let {a, b, c, d} be four numbers such that a \le b \le c \le d, and suppose b = c. The ordinal ranks of the numbers are 1, 2, 3, 4, while the corresponding average ranks are 1, 2.5, 2.5, 4. On the other hand, (2.5) would produce maximal ranks, which are 1, 3, 3, 4. Average ranks ensure that tied values follow the central tendency of the ordinal ranks.
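A minimal sketch of computing the pseudo-observations in (2.6) with average ranks follows: scipy's rankdata with method='average' supplies the average ranks, and dividing by n + 1 keeps every value strictly inside (0, 1). The toy data are arbitrary.

    # Illustrative sketch: pseudo-observations (2.6) from average ranks,
    # column by column, for a small dataset containing ties.
    import numpy as np
    from scipy.stats import rankdata

    # Toy data: 6 observations of 2 discrete characteristics (ties are present).
    X = np.array([
        [1, 30],
        [2, 30],
        [2, 45],
        [3, 45],
        [3, 45],
        [4, 60],
    ])

    n = X.shape[0]
    # Average ranks within each column, then scale by 1/(n + 1) as in (2.6).
    U = np.column_stack(
        [rankdata(X[:, j], method="average") for j in range(X.shape[1])]
    ) / (n + 1)
    print(U)
    # Every entry lies strictly between 0 and 1, and tied values share a pseudo-observation.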
These properties make average ranks the most suitable choice for constructing copulas using maximum likelihood, and for testing the goodness of fit [66].

Chapter 3: A Review of Population Synthesizers

Population synthesis enables microsimulation of people and their choices. Additionally, a synthesized population is also used to mask the true identities of people, while retaining general characteristics about the population. The population synthesized using the presented copula method can be used for either purpose, or for any other application, like obtaining estimates for small areas, since the outcome is a generic population. This chapter presents a review of recent literature on population synthesis related to the transportation engineering domain, although many other domains have developed specific methods to address their needs. The set of variables used in this dissertation has been set up specifically to answer research questions pertinent to transportation, such as demand modeling.

Through the long history of literature on population synthesis, there have been very few attempts at using copulas for generating population [17, 78]. Most prevalent methods have been based on numerical approximation techniques, and specifically on Iterative Proportional Fitting (IPF). IPF is due to Deming and Stephan [22], and was first used in transportation for synthesizing the population needed for microsimulation in the TRANSIMS project [5]. IPF essentially reweighs counts from a sample until the combined weights match the constraints imposed by the aggregate statistics of the area of interest [see 17, for a brief explanation of the IPF procedure]. IPF, however, suffers from some limitations that have been a constant source of further research to improve population synthesis [17, 78]. As examples:

1. IPF convergence may take an indeterminate amount of time and is not guaranteed, especially as the number of characteristics increases. Therefore, combined estimation of household and personal characteristics is a challenge [95]. An updated procedure called Iterative Proportional Updating (IPU) proposed by Ye et al. [134] handles this issue to some extent.
2. IPF will not update a cell with a starting value of zero, i.e., no observations for some rare characteristic in the sample dataset, regardless of the value of the marginal [93].
3. If a marginal total is zero, it is more problematic, as it converts all values in the IPF table for that marginal to zeros. This eventually crashes the convergence of the IPF algorithm [134]. A simple fix is to use a small non-zero number for the marginal; however, that may introduce biases in the final result [43].
4. IPF matrices can occupy a large space in memory, increasing exponentially with the dimensions [93].
5. The sparsity of the IPF tableau also increases exponentially with each new characteristic added to the tableau, feeding the zero cell problem. Therefore, IPF is severely limited to just a few tightly coupled characteristics [93].
6. The final IPF tableau cells may contain non-integer numbers, and rounding may induce significant biases [93], although sampling the final population from the weights using Monte Carlo Markov Chains can somewhat alleviate this limitation [78].
7. In its native form, IPF cannot work across geopolitical zones, which is a serious limitation because data from different surveys are seldom collected over the same zones [78, 134].

There have been developments that overcome the issues mentioned above, at least partially.
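For reference, the sketch below shows the basic two-dimensional IPF reweighing loop described above: a seed table of sample counts is alternately rescaled to match row and column marginal totals. It is a bare-bones illustration with a fixed iteration count and no convergence checks, not the implementation used by any of the synthesizers reviewed here; note how the zero seed cell never changes, which is the zero-cell limitation listed above.

    # Illustrative sketch: basic two-dimensional Iterative Proportional Fitting.
    # A seed table of sample counts is alternately rescaled until its row and
    # column sums match the target marginal totals.
    import numpy as np

    def ipf_2d(seed, row_totals, col_totals, iters=100):
        table = seed.astype(float).copy()
        for _ in range(iters):
            # Scale rows to match the row marginals.
            row_sums = table.sum(axis=1, keepdims=True)
            table *= row_totals.reshape(-1, 1) / np.where(row_sums == 0, 1, row_sums)
            # Scale columns to match the column marginals.
            col_sums = table.sum(axis=0, keepdims=True)
            table *= col_totals.reshape(1, -1) / np.where(col_sums == 0, 1, col_sums)
        return table

    # Seed counts from a sample (rows: age groups, columns: household sizes).
    seed = np.array([[5, 3, 0],
                     [2, 6, 1],
                     [1, 2, 4]])
    fitted = ipf_2d(seed,
                    row_totals=np.array([120, 180, 100]),
                    col_totals=np.array([150, 160, 90]))
    print(fitted.round(1))
    print(fitted.sum(axis=1), fitted.sum(axis=0))   # matches the marginal totals
    # The zero seed cell stays zero, illustrating the zero-cell problem above.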
Most notably, the works by Arentze and Timmermans [3], Bradley, Bowman, and Griesenbeck [8], Guo and Bhat [43], Javanmardi, Auld, and Mohammadian [51], Pinjari et al. [92], Pritchard and Miller [93], Salvini and Miller [103], and Ye et al. [134] have significantly contributed to the current practices of population synthesis. These methods mainly focus on simultaneous or two-step reweighing of both household and individual characteristics across multiple geopolitical levels. Additionally, some address the size of the IPF tableau in memory [93, 103], the convergence problem of IPF [134], and the rounding biases introduced in the final tableau [43].

A clean solution for the zero-cell problem, unfortunately, does not exist when using IPF, making it difficult to fit to small geographies like census tracts [43]. Ye et al. [134] offer a solution that borrows information from the larger area. However, the method requires some fine-tuning to ensure that demographic groups are not over-represented. Similarly, they also overcome the zero-marginal problem encountered in the updated IPF, called Iterative Proportional Updating (IPU), by assigning a small positive value to zero-marginal categories. The IPU is an improvement to the IPF method that handles combined household and population controls at the marginals. It can also work with marginals from different geographic levels, as long as they have been coaxed beforehand into the geographic level where the population is being synthesized. However, IPU still suffers from all the other shortcomings of IPF, including biases introduced by rounding errors of the final weights. Moreover, IPU also sacrifices the convergence guarantees of IPF, and therefore Ye et al. [134] advocate a heuristic to solve for the optimal final weights assigned to each observation.

Another crippling feature of IPF is the requirement of sample data, which is not readily available in much of the world. Sample-free techniques to synthesize population have embraced other methods. For example, Barthelemy and Toint [4] proposed a complicated hierarchical method dependent on heuristic sampling to fit household and individual characteristics. Schafer [104], among various techniques, presented a Bayesian technique to infer the final values of the initial IPF tableau, which potentially deals with the requirement of samples, provided the prior distributions are correctly selected. With flat priors, this technique may circumvent the requirement of sample observations.

Statistically driven techniques are quite rare in the literature on population synthesis for transportation. Copulas, for instance, have only been used in two studies previously. Kao et al. [59] fitted only multivariate normal distributions, and modified the covariance matrix considerably before the results could be validated. A purely computational technique, using the empirical copula, was proposed by Jeong et al. [53], mainly intended to validate the dependence captured by the IPF method. There have been a few other papers that use optimization techniques such as combinatorial optimization [86, 131], simulated annealing [123], and numerical sampling methods [25]. However, these methods have not gained as much traction as IPF.

The Monte Carlo Markov Chain (MCMC) method proposed by Farooq et al. [25] served as a small motivation for this study. The authors attempt to circumvent the difficulties involved in explicitly finding and sampling from the joint distribution among people's characteristics.
They successfully use MCMC with a Gibbs sampler to produce synthetic population with significantly higher accuracy than competing methods, including IPF. However, the method is quite complex, and may not be easily extendable to different geographies. This study aims to directly find the joint distribution, but in a form from which sampling is easy and feasible. Additionally, the proposed method is transferable across geographies, and therefore has wider application. A richer review of population synthesis techniques for microsimulation can be found in the works by Choupani and Mamdoohi [17], Ma [75], Müller and Axhausen [78], and Pritchard [94].

Chapter 4: Description of Data Used for Population Synthesis

There are two main datasets used for population synthesis: the American Community Survey (ACS) and the decennial U.S. Census data.

4.1 American Community Survey

The sample data used are from the 2016 release of the 1-year Public Use Micro Sample (PUMS) data, by ACS, for the state of Maryland. ACS collects detailed data on 1% of the Census population in every state, each year. These detailed data, after cleaning, filtering, anonymization, and weighting, are made available to the public. The data are geocoded to Public Use Micro Areas (PUMA), which enclose no less than 100 000 and no more than 200 000 people [10]. A PUMA usually contains complete census tracts, yet can span multiple counties in a state. Maryland has 44 PUMAs, and 59 408 observations in total, which represent the roughly 6 million residents of the state. In this dissertation, the copulas were constructed using observations from each PUMA independently of the others. The constructed copulas were then used to generate population for the census tracts within the PUMA. The results are presented only for PUMA 1107, chosen at random, for the sake of brevity.

The ACS forms the backbone of many studies involving sample data, as it provides detailed information on many characteristics of people. Some information reported in each record by ACS is also captured in the census data, as the ACS is the present manifestation of the long-form census, now discontinued. The primary limitation in ACS data collection is the large number of missing item responses. These occur when respondents do not answer one or more questions, answer them incorrectly, or select the equivalent of "I do not wish to answer". Such missing item responses are imputed by experts at the Census Bureau by borrowing information from other relevant data available in the region. An indicative flag is provided with the characteristics to distinguish reported and imputed values. The priority of such imputation is to conform ACS totals to the regional values, rather than to preserve the interrelations among attributes. Consequently, the dependence structure in the data is altered. However, in this study, imputed and reported values are treated equivalently, because they are the best publicly available representation of the population.

The ACS data are available in two sets: one at the household level, which forms the survey unit, and one for all persons within the selected households. The datasets provide a key to join the household and person data together, creating a complete dataset. The person dataset contains information about the person, like age, gender, race and so on, while the household file contains information about all residents in a household, like number of people, size and type of the house, total family income, presence of children, and others.
The joined data are used in this study to construct copulas. This ensures that the dependence within the entire household is captured, and then recreated. Decisions made by people are usually reached in consensus with other members of the household; therefore, it is imperative that persons be recreated into complete household units for use in microsimulation models, especially choice based models. Following the research conducted by Arentze and Timmermans [3], Bradley, Bowman, and Griesenbeck [8], Guo and Bhat [43], Javanmardi, Auld, and Mohammadian [51], Pinjari et al. [92], Salvini and Miller [103], and Ye et al. [134] in population synthesis, the characteristics currently used from the ACS in this study are provided below.
1. Number of people in a household (household characteristic),
2. Type of household (household characteristic),
3. Total household income (household characteristic),
4. Presence and age of children (household characteristic),
5. Number of workers in the household (household characteristic),
6. Age of people in household (person characteristic),
7. Gender of people in household (person characteristic),
8. Race of people in household (person characteristic), and
9. Employment status of the people in household (person characteristic).
The dataset, therefore, has 9 dimensions, and only copulas that can be generalized to 9 dimensions are considered. At present, such copulas are the Clayton, Frank, Joe and Gumbel families from the Archimedean class of copulas [81]. Although elliptical copulas, like the normal and t copulas, are also feasible with 9 dimensions, their goodness of fit tests take an unacceptably large amount of time, due mainly to the large parameter space and the degrees of freedom [49]. Moreover, as Kao et al. [59] demonstrate, the data are unlikely to be well represented by normal or t distributions. In this study, the characteristics are taken together in a single dataset and the person weights are used to duplicate each record to form the full population dataset for each PUMA, as advised in the ACS documentation [10].

4.2 U.S. Decennial Census

The census is a decennial survey that aims to enumerate and survey every individual, as required by the U.S. Constitution. However, it only captures basic information about people, and the data release is in aggregate form. Census tracts are widely used in transportation as the finest geographic regions for which census aggregates are used. The aggregated totals are available by attribute of each characteristic collected, most of which overlap with the ACS data. Therefore, census aggregates form the marginals of each characteristic at the census tract level [9]. While public census data are reported for geographic regions smaller than census tracts, this study synthesizes population at the census tract level.

The census data in the form of totals are publicly available for the census tracts, which are the target areas for the synthetic population. The data are the totals of collected characteristics about people, such as the number of people of a given age, race, or gender in a census tract. These data provide the frequencies of people in each attribute of a characteristic, and can easily be converted into a cumulative mass function (cmf). The cmf of each characteristic forms its marginal distribution at the census tract level, which is the target area in this study. The marginal distributions are used to inverse integral transform the data produced by the copulas.
The census data used here are obtained from the 2010 Census Summary File 1 dataset.

Chapter 5: Developed Methodology for Population Synthesis

The methodology and algorithm proposed in this dissertation are illustrated using simulated data created to mimic ACS PUMS data. Using simulated data provides two main advantages:
1. a control benchmark to assess the performance of the proposed method, and
2. an unbiased presentation of the method.
Although the proposed methodology and algorithm are applied to data from all 44 PUMAs in the ACS PUMS dataset for Maryland, results obtained with PUMS data from only one randomly selected PUMA are presented in this dissertation. To ensure the simulated data accurately and robustly mimic the PUMS data, they are produced to have the same shape as the data from the selected PUMA. Shape here refers to the number of rows and columns in the ACS PUMS data after applying the provided person weights as a frequency multiplier, that is, the sum of all weights. However, the dependency and marginal distributions are pre-specified to act as controls and help verify the method.

5.1 Data Simulation

The simulated data are generated using a Gumbel copula, randomly chosen among the Archimedean copulas, defined by the distribution function

C_\theta(u) = \exp\left\{ -\left[ \sum_{j=1}^{d} (-\ln u_j)^{\theta} \right]^{1/\theta} \right\}, \qquad (5.1)

where \theta is the copula parameter and u_j \in (0, 1), j = 1, \ldots, d [see table 4.1 in 81, for distribution functions of all one-parameter Archimedean families]. The admissible parameter space of the Gumbel copula is \theta \in [1, \infty). With \theta = 1 the Gumbel copula reduces to the independence copula, which is inappropriate for the dependent categorical, ordinal, and numeric data found in the social sciences. Simulated data are produced iid from the Gumbel copula by setting the parameter \theta = 1.500. The mixture models due to Marshall and Olkin [77] can then be used to generate observations having the dependence characterized by the selected member. Since 9 dimensions from the ACS are used in this study, the simulated data are also produced in 9 dimensions. Further, the number of observations in the simulated data is 118 583, the same as is available from the PUMA analyzed in the next chapter. The first few observations of these simulated data, rounded to three significant digits, are shown in table 5.1, where the column headers V1, ..., V9 refer to the 9 dimensions.

Table 5.1: Observations of Simulated Data from Gumbel Copula
V1     V2     V3     V4     V5     V6     V7     V8     V9
0.000  0.225  0.279  0.062  0.248  0.001  0.043  0.060  0.538
0.763  0.929  0.605  0.828  0.750  0.584  0.926  0.656  0.663
0.601  0.377  0.252  0.592  0.742  0.236  0.600  0.157  0.783
0.882  0.967  0.773  0.803  0.909  0.918  0.876  0.814  0.955
0.966  0.901  0.923  0.937  0.989  0.851  0.883  0.888  0.794
0.943  0.642  0.897  0.565  0.629  0.863  0.790  0.725  0.439
...

Since these simulated data inherently have continuous marginals, simple binning of the form

T(u) = b_k \quad \text{if } u \in (a_k, a_{k+1}], \qquad (5.2)

where \bigcup_k (a_k, a_{k+1}] = (0, 1], is used to introduce ties and thus discretize the data. Choosing bin edges a_k = (k - 1)/K gives K evenly spaced bins. Here we set K = 10, and the value b_k = k, k \in \{1, \ldots, 10\}, is the label of the bin ((k - 1)/10, k/10], chosen such that b_j < b_k if j < k. The simulated data thus have pre-specified dependence and marginal structure, yet conform to the shape of the real data. The features of these simulated data are not expected to match the survey data exactly, but they are similar in discreteness, number of rows and columns, embedded dependence, and marginal distributions.
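As a concrete illustration of the binning in (5.2), the short Python sketch below discretizes uniform draws into K = 10 evenly spaced bins labeled 1 through 10; the input array is a subset of the values in table 5.1 and is purely illustrative.

import numpy as np

def bin_uniform(u, K=10):
    # Map u in (0, 1] to the bin label k such that u lies in ((k-1)/K, k/K].
    return np.minimum(np.ceil(np.asarray(u) * K).astype(int), K)

u = np.array([0.225, 0.279, 0.062, 0.248, 0.001, 0.538])
print(bin_uniform(u))   # -> [3 3 1 3 1 6], matching the corresponding entries of table 5.2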
The binned data obtained by applying (5.2) to the observations shown in table 5.1 are presented in table 5.2. These observations are similar to the data available in PUMS. From here onward, the proposed method is illustrated using the simulated data presented in table 5.2. The same method is directly transferable, and is applied, to PUMS data from any PUMA.

Table 5.2: Observations of Binned Simulated Data from Gumbel Copula
V1  V2  V3  V4  V5  V6  V7  V8  V9
1   3   3   1   3   1   1   1   6
8   10  7   9   8   6   10  7   7
7   4   3   6   8   3   7   2   8
9   10  8   9   10  10  9   9   10
10  10  10  10  10  9   9   9   8
10  7   9   6   7   9   8   8   5
...

5.2 Probability Integral Transformation using Empirical Marginal Distributions

The first step in constructing the copula is the probability integral transformation of the marginal components of the data. The data in table 5.2, and similarly the PUMS data from any PUMA, need to be probability integral transformed into the unit hypercube [0, 1]^d. This is necessary, as the support of the copula lies in the unit hypercube [34, 67]. The rank-based estimate of the marginal distribution is obtained by applying (2.6) to each component of the multivariate data. The transformed pseudo-observations, for the simulated data presented in table 5.2, are shown in table 5.3.

Table 5.3: Pseudo-Observations of Binned Simulated Data from Gumbel Copula
V1     V2     V3     V4     V5     V6     V7     V8     V9
0.050  0.252  0.251  0.049  0.250  0.050  0.050  0.050  0.550
0.750  0.951  0.652  0.851  0.751  0.553  0.950  0.651  0.650
0.651  0.353  0.251  0.550  0.751  0.251  0.652  0.151  0.751
0.850  0.951  0.752  0.851  0.950  0.950  0.851  0.851  0.951
0.950  0.951  0.951  0.950  0.950  0.851  0.851  0.851  0.751
0.950  0.652  0.852  0.550  0.651  0.851  0.752  0.751  0.450
...
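A minimal Python sketch of a rank-based pseudo-observation transform is given below, assuming the common convention of average ranks scaled by n + 1; whether this matches (2.6) exactly depends on that equation's scaling convention, so it should be read as illustrative rather than as the dissertation's code.

import numpy as np
from scipy.stats import rankdata

def pseudo_observations(x):
    # Column-wise rank-based probability integral transform.
    # Ties receive their average rank, and ranks are scaled by (n + 1)
    # so the result lies strictly inside (0, 1).
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    return np.column_stack([rankdata(x[:, j], method="average") / (n + 1)
                            for j in range(x.shape[1])])

# Illustrative use on binned data such as table 5.2 (here a tiny example).
binned = np.array([[1, 3], [8, 10], [7, 4], [9, 10], [10, 10], [10, 7]])
u = pseudo_observations(binned)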
5.3 Likelihood Maximization

Likelihood maximization is the second step in constructing a copula with parameter \theta. As fully expounded by Genest, Ghoudi, and Rivest [34], the likelihood specification of copulas is fairly straightforward. It is the product of the probability densities or masses evaluated at realizations of iid copies of the random variable. A natural logarithm transformation is applied to the likelihood function to prevent numerical underflow of the product of many small terms, and to simplify computations. The likelihood specification then becomes a log-likelihood,

\ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log\left[ c_\theta(u_i) \right], \qquad (5.3)

where n is the number of observations u_i in (0, 1)^d (obtained by applying (2.6) to the dataset), \theta is the copula parameter, and c_\theta is the multivariate density of the parametric copula for continuous data, or the converted probability masses for discrete data (see Nelsen [81] for density or equivalent mass distributions of various copula families). Maximizing the log-likelihood yields an estimate \hat{\theta}_n (the maximum is assumed unique, which is usually the case with single-parameter Archimedean copulas) of the associated parameter \theta:

\hat{\theta}_n = \operatorname*{argmax}_\theta \ell(\theta), \qquad (5.4)

and we denote by C_{\hat{\theta}_n} the copula built with the parameter estimator, to differentiate it from the generic parametric copula C_\theta.

As n can be very large, the computation of (5.3) can be quite expensive. This cost can be significantly reduced by avoiding computing the same terms several times. Identical terms occur because there are identical observations in the dataset. Let w_i denote the number of observations identical to u_i, and for each set of identical observations elect a representative, for instance the observation with the smallest index in the set. We then have a set of n_u unique observations u_1, ..., u_{n_u}, and (5.3) can be replaced by

\ell(\theta) = \frac{1}{n} \sum_{i=1}^{n_u} w_i \log\left[ c_\theta(u_i) \right]. \qquad (5.5)

Note that w sums to n yet has n_u elements, and therefore (5.5) is still normalized by n. This is essentially similar to, but not the same as, the data and corresponding weights provided in PUMS. In this case, the weights w are computed after producing the pseudo-observations by applying (2.6), so that the ranking is not disturbed. This compression is applied only during log-likelihood estimation, and results in significant computational speed-ups. The datasets compressed during log-likelihood estimation also include the ones generated during the parametric bootstrap.

As the copula family that truly characterizes the dependence is unknown, multiple copula families are tried to find a member that best describes the dependence in the data. The estimated parameter essentially selects, from the chosen family of copulas, the member that captures the dependence structure of the given data. Parameters estimated for the Clayton, Frank, Gumbel and Joe families of copulas on the simulated pseudo-observations shown in table 5.3 are presented in table 5.4. It is to be noted that the estimated Gumbel copula parameter almost recovers the original parameter of 1.500.

Table 5.4: Parameters Estimated for Archimedean Copula Families
Copula Family   Estimated Parameter
Clayton         0.761
Frank           3.216
Gumbel          1.543
Joe             1.861

5.4 Goodness of Fit Testing

As discussed with (2.2), parametric copulas can be viewed as families covering part of the monotonicity curve between the Fréchet–Hoeffding bounds, each member of which is identified by the copula parameter. Further, the two-step maximum likelihood estimation procedure of Genest, Ghoudi, and Rivest [34] requires presuming the copula family. Therefore, it is necessary to evaluate how well the chosen copula family describes the dependence in the multivariate random variable being modeled. Let \mathcal{C} denote the chosen family. The null hypothesis is that the parametric copula C that characterizes the dependence is a member of the assumed family, as given by

H_0 : C \in \mathcal{C} := \{ C_\theta : \theta \in \Theta \subseteq \mathbb{R}^q \}, \qquad (5.6)

where \theta takes any value from the admissible set \Theta, and q \ge 1 is an integer representing the dimensionality of the parameter. The admissible space depends on the family of copulas, which may not always attain the Fréchet–Hoeffding bounds [see table 4.1 in 81, for Archimedean copulas].

Genest, Rémillard, and Beaudoin [40] examine the power of the most common method of testing the null hypothesis using statistics based on the empirical process

\mathbb{C}_n = \sqrt{n}\,\left( C_n - C_{\hat{\theta}_n} \right), \qquad (5.7)

where C_n denotes the empirical copula, and C_{\hat{\theta}_n} is derived under H_0 [39, 40]. The empirical distribution of multivariate random variables was originally given by Deheuvels [21]. An asymptotically equivalent and simpler to compute distribution is

C_n(u) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left( u_i \le u \right), \quad u \in [0, 1]^d, \qquad (5.8)

where the inequalities are evaluated component-wise [40]. This empirical distribution is piece-wise constant and thus not a copula, yet it is often called the empirical copula, and this dissertation continues that nomenclature. The empirical copula is an unbiased and consistent estimator of the joint distribution under various conditions, as examined by Genest and Rémillard [39]. Therefore, it forms the most objective benchmark against which to judge the parametric copula [40].
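The empirical copula in (5.8) is simple to evaluate directly. The following is a minimal numpy sketch, a plain translation of the component-wise indicator rather than an optimized routine; the array of pseudo-observations is illustrative.

import numpy as np

def empirical_copula(U, points):
    # Evaluate C_n at each row of `points`, where U holds the n x d
    # pseudo-observations: C_n(u) = (1/n) * #{i : U_i <= u component-wise}.
    U = np.asarray(U)
    points = np.asarray(points)
    leq = (U[:, None, :] <= points[None, :, :]).all(axis=2)   # (n, m) comparisons
    return leq.mean(axis=0)

# Example: evaluate the empirical copula at the pseudo-observations themselves.
U = np.random.default_rng(0).uniform(size=(1000, 3))
Cn_at_U = empirical_copula(U, U)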
Genest, Rémillard, and Beaudoin [40] show empirically that the Cramér–von Mises statistic, which quantifies the separation between two distributions, is one of the most powerful tests for evaluating the similarity of the distributions produced by the empirical copula and the estimated parametric copula. The Cramér–von Mises statistic, expressed as a continuous functional of (5.7), becomes

S_n = \int_{[0,1]^d} \mathbb{C}_n(u)^2 \, \mathrm{d}C_n(u) = \sum_{i=1}^{n} \left( C_n(u_i) - C_{\hat{\theta}_n}(u_i) \right)^2. \qquad (5.9)

The straightforward way to understand (5.9) is to note that the Cramér–von Mises statistic is like a "distance" between the distribution obtained from the parametric copula and the distribution obtained from an unbiased empirical estimate. The null hypothesis is rejected for large values of the statistic (5.9) as n tends to infinity [35, 39, 40, 69, 70, 90].

5.4.1 Parametric Bootstrap

Genest and Rémillard [39] show that, under H_0, S_n converges weakly to a limiting distribution as n tends to infinity, and that their tests based on S_n are consistent; that is, when H_0 does not hold, it is almost surely rejected as n tends to infinity. The limiting distribution of S_n is unknown, but it can be approximated using the parametric bootstrap as in Genest, Rémillard, and Beaudoin [40]. In their extensive Monte Carlo experiments, both Genest, Rémillard, and Beaudoin [40] and Kojadinovic [66] show that the statistic S_n holds the most power in rejecting H_0. Let S_n^{(b)} be the b-th bootstrap replicate of the statistic. An approximate p-value can be computed as

\hat{p} = \frac{1}{B + 1} \left[ \sum_{b=1}^{B} \mathbf{1}\left( S_n^{(b)} \ge S_n \right) + 0.5 \right], \qquad (5.10)

where B is the total number of bootstrap replications, assumed to be large enough. Here, we set B = 1 000. The 0.5/(B + 1) term is an asymptotically insignificant adjustment used to ensure the p-values are always in the interval (0, 1). A p-value above the chosen level of significance indicates that H_0 cannot be rejected. Therefore, the observed data might be described by the parametric copula C_\theta, which is generically referred to as a "well-fitting copula" in the succeeding discussion.

Since data generated from the constructed copula have continuous marginals, but the marginals of the survey data used are discrete, the proofs by Genest and Rémillard [39] are no longer valid, and may produce biased inferences about H_0 [36, 38, 66, 68, 89]. However, Kojadinovic [66] adapted the parametric bootstrap procedure of Genest, Rémillard, and Beaudoin [40] to the presence of ties in the data. The modified algorithm is presented as Procedure 6.1 in the paper by Kojadinovic [66], and is shown to be consistent despite ties in the data. The goodness of fit to the data of any copula presented in this study is assessed using 1 000 parametric bootstrap replications, as presented in algorithm 1. Readers are referred to the paper by Kojadinovic [66] for implementation and additional details about the procedure.

Algorithm 1 The Parametric Bootstrap
1: Inputs: u_1, ..., u_n: n d-dimensional observations, assumed to be realizations of an (unknown) random vector (obtained by applying (2.6) to realizations of a random vector X)
2: Initialize: Assume a parametric copula family \mathcal{C}; B <- 1 000 (or a sufficiently large number)
3: Construct a copula C_{\hat{\theta}_n}, an estimate of C \in \mathcal{C}, using \hat{\theta}_n estimated by (5.4) over the n observations u_1, ..., u_n
4: Compute S_n from C_n and C_{\hat{\theta}_n} using (5.9)
5: for b := 1 to B do
6:   X_1^{(b)}, ..., X_n^{(b)} <- n observations in d dimensions drawn from C_{\hat{\theta}_n} using the procedure developed by Marshall and Olkin [77]
7:   X_1^{(b)}, ..., X_n^{(b)} <- X_1^{(b)}, ..., X_n^{(b)} modified as per Step 4(b) of Procedure 6.1 proposed by Kojadinovic [66]
8:   Compute u_1^{(b)}, ..., u_n^{(b)} from X_1^{(b)}, ..., X_n^{(b)} using (2.6)
9:   Construct a copula C_{\hat{\theta}_n^{(b)}}, an estimate of C \in \mathcal{C}, using \hat{\theta}_n^{(b)} estimated by (5.4) over u_1^{(b)}, ..., u_n^{(b)}
10:  Compute S_n^{(b)} from C_n^{(b)} and C_{\hat{\theta}_n^{(b)}} using (5.9)
11: end for
12: Compute the approximate p-value using (5.10)

The Cramér–von Mises statistic, S_n, and p-values for the simulated data, obtained by applying algorithm 1, are presented in table 5.5. The statistic and p-value for the independent copula are also presented as a baseline. The results clearly indicate that the independent copula can be rejected, and it has the largest value of the Cramér–von Mises statistic. Moreover, the p-values indicate that both the Clayton and Frank families can be rejected. The Gumbel and Joe families cannot be rejected, as they score very high p-values.

Table 5.5: Estimated Parameters, Cramér–von Mises Statistics, and p-values for Copula Families Fitted to the Simulated Data
Copula Family  Estimated Parameter  Cramér–von Mises Statistic  p-value
Independent    NA                   2 421.214                   0.001
Clayton        0.761                1 024.078                   0.001
Frank          3.216                587.086                     0.001
Gumbel         1.543                282.586                     0.993
Joe            1.861                274.518                     0.999
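Once the observed statistic and its bootstrap replicates are available (steps 4 and 10 of algorithm 1), the p-value of (5.10) reduces to a few lines of Python; the statistic values below are placeholders, not results from this study.

import numpy as np

def bootstrap_p_value(S_obs, S_boot):
    # Approximate p-value per (5.10); the extra 0.5 keeps it strictly inside (0, 1).
    S_boot = np.asarray(S_boot, dtype=float)
    B = S_boot.size
    return (np.sum(S_boot >= S_obs) + 0.5) / (B + 1)

rng = np.random.default_rng(1)
S_obs = 282.6                                          # placeholder observed statistic
S_boot = rng.gamma(shape=5.0, scale=60.0, size=1000)   # placeholder bootstrap replicates
print(bootstrap_p_value(S_obs, S_boot))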
Despite knowing that the simulated data were produced using the Gumbel copula, statistical discrimination between copula families for which H_0 is not rejected is tricky, as it involves computing a measure of similarity between two multivariate distributions. The training dataset, formed from real-world or simulated data and used to construct the copula, can be considered as observations from some unknown multivariate random distribution. Any validation dataset not used to construct the copula would also consist of observations resulting from this unknown random distribution. Therefore, there would be no statistical difference between the training and validation datasets. On the other hand, multivariate data generated from a copula by the mixture models due to Marshall and Olkin [77] are also ultimately realizations from a defined multivariate random distribution. If the assumed copula family captures the unknown random distribution producing the training and validation observations, data produced from the copula should be statistically indistinguishable from either dataset. Quantifying this requires comparison between multivariate random variables. Comparative statistics based on any empirical process, like (5.7), are not distribution free, requiring tools like the bootstrap to resolve [27, 39, 40]. Finally, recall from Sklar's representation theorem that, in the presence of discrete marginals, more than one copula can correctly represent the underlying multivariate distribution, so it is often illusory to try to isolate a single copula. In chapter 7, additional discussion is presented as a starting point in the development of tools to improve the tests conducted to assess the goodness of fit of copulas to multivariate data.

5.5 Inverse Probability Integral Transformations of Marginals

The goal of this work is to produce a synthetic population for some geographic area, which could be the same area where the copulas are fit, or another area to which the dependence captured by the copula is transferable. In this dissertation, a copula well fitted to the PUMS data from a PUMA is used to generate the population at a finer geopolitical division: the census tracts enclosed by the PUMA.
The dependence structure of each enclosed census tract is assumed to be the same as that of the PUMA as a whole. This assumption is reasonable because the Census Bureau bases the ACS sample design on the decennial census and other datasets, while also designing the next census based on the rolling ACS data collection. Further, the relations between different demographic properties of people should remain similar across space. As an example, consider the relation between age, education, job industry, salary, household size and location, number of owned cars, and family structure. It is easy to perceive the relationships between these characteristics, and similarly to deduce that they would be applicable to other locations as well.

Using the mixture models by Marshall and Olkin [77], any well-fitting copula can be used to generate random data, for the target geographic area, with the same dependence structure as the data used to construct the copula. These data lie in the support of the copula, i.e., they are uniform in [0, 1]^d. A synthetic population can then be obtained by inverting the marginal draws from the copula. Suppose the individuals in a census tract are realizations of N iid copies of a d-dimensional multivariate random vector X = (X_1, ..., X_d). Let F_1, ..., F_d denote the cumulative mass functions of each of the d characteristics at the census tract level. Further, let U = (U_1, ..., U_d) be generated from the constructed copula. Synthesis of the population is then given by

X_{\mathrm{syn}} = \left( F_1^{-1}(U_1), \ldots, F_d^{-1}(U_d) \right), \qquad (5.11)

where F_1^{-1}, ..., F_d^{-1} are the pseudo-inverses of the cumulative mass functions F_1, ..., F_d, defined as

F_j^{-1}(u) = \inf\{ x_j : F_j(x_j) \ge u \}, \quad j \in \{1, \ldots, d\}. \qquad (5.12)

Algorithm 2 presents the whole process of the inverse transformation.

Algorithm 2 Inverse Probability Integral Transform of Data Generated from a Well-Fitting Copula
1: Inputs: C_{\hat{\theta}_n}, a well-fitting copula constructed with estimated parameter \hat{\theta}_n; (F_1, ..., F_d), a vector of cumulative mass functions describing the marginals at the required geopolitical level; N, the required number of records of inverse-transformed data
2: Initialize: i <- 1
3: while i <= N do
4:   U_i <- d-dimensional realization drawn from the copula C_{\hat{\theta}_n}
5:   X_{syn,i} <- (F_1^{-1}(U_{i1}), ..., F_d^{-1}(U_{id}))
6:   if X_{syn,i} is acceptable then
7:     Add X_{syn,i} to the output dataset
8:     i <- i + 1
9:   else
10:    Reject X_{syn,i}
11:  end if
12: end while

Some dimensions from the ACS PUMS used in this study are not available in the decennial census marginal data (table 6.4). Therefore, those dimensions have been inverse transformed using marginal cumulative mass functions computed from the ACS PUMS dataset for each PUMA. Further, since a single copula family with a scalar parameter was used to capture the dependence of the whole dataset, relative differences in the strengths of association between the various dimensions of the dataset are averaged out. Consequently, some sanity checking of the data produced after inverse transformation is required (see the condition in step 6 of algorithm 2). Quality controls are applied at two instances in the process: first after inverse transforming the data dimensions available in the census data, and second after inverse transforming the remaining dimensions using the ACS PUMS data.
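As a concrete illustration of (5.11) and (5.12), the Python sketch below builds a cumulative mass function from category counts and applies its pseudo-inverse to uniform draws with np.searchsorted; the counts, labels, and draws are illustrative, not actual census values.

import numpy as np

def make_cmf(counts):
    # Cumulative mass function from category counts (e.g., census tract totals).
    counts = np.asarray(counts, dtype=float)
    return np.cumsum(counts) / counts.sum()

def inverse_cmf(u, cmf, labels):
    # Pseudo-inverse of (5.12): smallest category whose cmf value is >= u.
    idx = np.searchsorted(cmf, u, side="left")
    return np.asarray(labels)[idx]

counts = [420, 510, 260, 180]              # illustrative marginal, household size 1-4+
cmf = make_cmf(counts)
u = np.array([0.10, 0.35, 0.62, 0.95])     # draws from one dimension of the copula
print(inverse_cmf(u, cmf, labels=[1, 2, 3, 4]))   # -> [1 2 2 4]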
Chapter 6: Analysis of Synthesized Population

The results from the randomly selected case study PUMA are presented in this section. Figure 6.1 shows a map with the location of the PUMA, along with the enclosed census tracts and place names. The PUMA lies on the south-eastern side of Washington DC, where the boundaries of Washington DC, Maryland and Virginia meet. This PUMA is home to about 120 000 people, and is a combination of low density suburbs, high density housing, many retail zones, and some business centers.

[Figure 6.1: Map of PUMA 1107. 2010 Census PUMA reference map for Prince George's County (Southwest), covering Oxon Hill, Hillcrest Heights and Temple Hills, Maryland, with the enclosed census tracts; produced by the U.S. Census Bureau Geography Division from the MAF/TIGER database.]

6.1 Copula Construction

The dataset for PUMA 1107 is formed by joining the household and person sets using the key provided in the ACS PUMS data [10]. A total of 118 583 observations were available from this PUMA, after duplicating the records by the provided ACS person weights [as recommended by 10]. A sample of the data is shown in table 6.1. This table is similar to table 5.2 of simulated observations. Pseudo-observations produced using equation (2.6) for the data in table 6.1 are shown in table 6.2, which is similar to table 5.3.

Table 6.1: Observations from Training Dataset for PUMA 1107
NP  HHT  HINCP   HUPAC  WIF  AGEP  SEX  ESR  RAC1P
2   3    160510  4      1    53    2    1    2
2   3    160510  4      1    53    2    1    2
2   3    160510  4      1    53    2    1    2
2   3    160510  4      1    53    2    1    2
2   3    160510  4      1    53    2    1    2
...

Table 6.2: Pseudo-Observations from Training Dataset for PUMA 1107
NP     HHT    HINCP  HUPAC  WIF    AGEP   SEX    ESR    RAC1P
0.269  0.640  0.865  0.717  0.392  0.708  0.724  0.455  0.496
0.269  0.640  0.865  0.717  0.392  0.708  0.724  0.455  0.496
0.269  0.640  0.865  0.717  0.392  0.708  0.724  0.455  0.496
0.269  0.640  0.865  0.717  0.392  0.708  0.724  0.455  0.496
0.269  0.640  0.865  0.717  0.392  0.708  0.724  0.455  0.496
0.269  0.640  0.865  0.717  0.392  0.708  0.724  0.455  0.496
...

The column headers in tables 6.1 and 6.2 are the same as the field names in the 2016 ACS PUMS Data Dictionary. The header abbreviations are expanded as follows:
1. NP: Number of person records following the housing record,
2. HHT: Household or family type,
3. HINCP: Household income (past 12 months),
4. HUPAC: Household presence and age of children,
5. WIF: Workers in family during the past 12 months,
6. AGEP: Age of the person,
7. SEX: Sex of the person,
8. ESR: Employment status of the person (recode),
9. RAC1P: Recoded detailed race code.

Four Archimedean copula families, Clayton, Frank, Gumbel and Joe, were fitted to this dataset using the log-likelihood maximization method. Goodness of fit was then tested for each family. Goodness of fit of the independent copula was also evaluated for baseline comparison. Additionally, standard errors for each parameter estimate were computed using the 80 replicate weights provided with the PUMS data. The standard errors are given, as per the ACS recommended procedure [10], by

\mathrm{SE}(\hat{\theta}) = \sqrt{ \frac{4}{80} \sum_{r=1}^{80} \left( \hat{\theta}_r - \hat{\theta} \right)^2 }, \qquad (6.1)

where \hat{\theta}_r is the parameter estimated using the r-th replicate weights, and \hat{\theta} is the parameter estimated using the full sample weights. Note that any negative replicate weights were truncated at 0.
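The replicate-weight formula in (6.1) is straightforward to apply once the copula parameter has been re-estimated under each of the 80 replicate weights; a minimal sketch follows, with the array of replicate estimates standing in for those re-estimations.

import numpy as np

def acs_replicate_se(theta_full, theta_reps):
    # Standard error per (6.1): 80 replicate-weight estimates compared
    # against the full-sample-weight estimate.
    theta_reps = np.asarray(theta_reps, dtype=float)
    return np.sqrt(4.0 / 80.0 * np.sum((theta_reps - theta_full) ** 2))

rng = np.random.default_rng(2)
theta_full = 1.059                                      # illustrative full-sample estimate
theta_reps = theta_full + rng.normal(scale=0.015, size=80)   # illustrative replicate estimates
print(acs_replicate_se(theta_full, theta_reps))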
The parameter estimates, with standard errors, Cramér–von Mises statistics and p-values for each copula family, are presented in table 6.3.

Table 6.3: Constructed Archimedean Copulas for PUMA 1107
Copula       Parameter  Standard Error  CvM Statistic  p-value
Independent  NA         NA              261.520        0.001
Clayton      0.098      0.022           220.655        0.999
Frank        0.113      0.079           255.123        0.001
Gumbel       1.059      0.015           233.572        0.999
Joe          1.083      0.018           243.633        0.999

Table 6.3 clearly indicates that the null hypothesis must be rejected for the independent copula, which confirms that there is some dependence between the 9 dimensions of the PUMS data used in this study. The Frank copula is likewise rejected, as it cannot faithfully capture the dependence in the data. The other three copulas (Clayton, Gumbel and Joe) describe the data from PUMA 1107 equally well, and any of them would serve the purpose of generating the synthetic population. This is further supported by additional Monte Carlo simulations presented in Appendix A. The Joe copula is arbitrarily picked and used to generate the synthetic population for the census tracts enclosed in the PUMA (see figure 6.1 for details). The estimated parameters are stable over the replicate weights, as reflected by the small standard errors, suggesting that the number of observations is large enough for the purpose of copula calibration.

6.2 Population Synthesis

Population was generated for the census tracts enclosed by each PUMA. In this section, results from two randomly selected census tracts are presented. The decennial census data contain aggregate totals at the census tract level for six of the nine characteristics used from the ACS PUMS. Table 6.4 presents a crosswalk between the characteristics from the ACS PUMS and the corresponding decennial census matrices from the Census Summary File 1 dataset [9, 10].

Table 6.4: Crosswalk between ACS PUMS and Decennial Census Matrix
ACS PUMS Characteristic   Census Summary File 1 Matrix
NP                        H13
HHT                       P19, H13 for vacant properties
HINCP                     No matching matrix
HUPAC                     P20, H13 for vacant properties
WIF                       No matching matrix
AGEP                      PCT12
SEX                       PCT12
ESR                       No matching matrix
RAC1P                     P3

Note that for some matrices, especially P19 and P20, only selected columns were used, such that the data description of the categories in the decennial census data dictionary matches the description in the ACS PUMS data dictionary. Additionally, summations were used to combine the numbers from different columns in the decennial census matrices to match the ACS PUMS characteristics, for example, columns in P20 for HUPAC and in PCT12 for AGEP. In cases like RAC1P and P3 (also NP and H13), not all categories from the ACS PUMS data are available in the census matrix. However, these differences in categories do not cause any problems while inverse transforming data produced from the copulas using the marginals from the census and the ACS PUMS: the final values are always in the domain of the marginal mass functions.

To derive the quantile marginal distributions, the decennial census data for a characteristic (or dimension) from a census tract are converted into a cumulative mass function. The plateaus of the mass distribution define the bins of the quantile function. All data generated by the copula for that characteristic (i.e., for the given dimension) that fall in a bin of the quantile function are assigned the value of the bin. This is similar to using function (5.2), except that the bin edges a_k and a_{k+1} are informed by the quantile function. Characteristics without a counterpart in the decennial census data (HINCP, WIF and ESR)
are transformed using marginals obtained from the ACS PUMS data for the whole PUMA, thus demonstrating that all marginal distributions need not come from the same source.

Algorithm 2 was used to produce the synthetic population for all census tracts in PUMA 1107. To maintain brevity, results from two randomly selected census tracts are presented. These census tracts are 8016 and 8017.01, which have considerably different characteristics (see figure 6.1). Tract 8016 is part of the Glassmanor area and shares a border with Washington DC. It is characterized by high density apartment and condominium style homes. Census tract 8016 has approximately as many households as residents. On the other hand, tract 8017.01 is part of Temple Hills, a suburb of Washington DC, and has the low density housing characteristic of suburban sprawl. The average number of people per house in tract 8017.01 is about 2.5. The majority of the people in either tract are African American.

As previously discussed, a rigorous out-of-sample validation of the data produced by the copulas is not straightforward. Future work will be devoted to the investigation of additional tests comparing multivariate distributions. It is nevertheless easy to empirically compare the univariate marginal frequencies of the population synthesized using copulas. Figures 6.2a and 6.2b show the distributions of percent errors of the population synthesized from the copulas. The errors are computed for each category in a given dimension (characteristic). They are the percent difference between the frequency of the category predicted by the inverse transformed copula data and the frequency reported in the decennial census or ACS PUMS data. Mathematically, it is

e_j = \frac{ \hat{f}_j - f_j }{ f_j } \times 100, \qquad j \in \{1, \ldots, d\}, \qquad (6.2)

where \hat{f}_j is the vector of predicted frequencies of each category in characteristic or dimension j \in \{1, \ldots, d\}, and f_j is the vector of true frequencies from the decennial census or ACS PUMS data. Box plots in figures 6.2a and 6.2b are created from the errors for all categories in a characteristic. In each box plot, the box represents the interquartile range ([25%, 75%]), while the median is shown by the bar within the box. The ends of the whiskers outside the boxes show the [5%, 95%] interval.

[Figure 6.2: Errors in Synthetic Population Marginals. Box plots of percent errors by measured characteristic (NP, HHT, HINCP, HUPAC, WIF, AGEP, SEX, ESR, RAC1P): (a) errors in census tract 8016; (b) errors in census tract 8017.01.]

6.3 Discussion

The errors shown in figures 6.2a and 6.2b are very small, demonstrating the accuracy of the copulas in reproducing the synthetic population marginals. The dependence between the dimensions is represented by a single parameter in Archimedean copulas, which captures the equivalent of an average dependence. Even though a single parameter offers protection against overfitting, some nuances of the dependence are missed. These missed nuances necessitate sanity checking the data after inverse transformation (as described in step 6 of algorithm 2). Moreover, the very high and very low p-values also indicate some issues with the goodness of fit testing procedure. There are multiple possible reasons for this, including the sanity checking step, which is not included within the parametric bootstrap.
Another related reason is the change in the dependence structure caused by using the weights provided in the ACS. However, by not using the weights, we would work with a sample of the population that has sampling bias designed into it. Therefore, further exploration is required in this direction to understand the applicability of the well-studied inference processes to select copulas that represent the data at hand. The missed details of the dependence also suggest exploration of more granular copula models. The presented models can also be expanded to include many more dimensions, and multiple data sources, producing a holistic modeling framework to capture multivariate characteristics and synthesize data that encapsulate those characteristics. Further, the search for reliable statistical techniques to compare discrete multivariate distributions is posed as an open question in this dissertation. This could provide alternate methods to discriminate between copula families or hierarchies and to proceed to out-of-sample validations, keeping in mind that, for discrete observations, more than one copula can adequately capture the dependence structure. These four avenues provide interesting objectives for future research.

Chapter 7: Alternative Tests for Copula Goodness of Fit

As identified in chapters 5 and 6, and noted by Genest and Nešlehová [36] and Kojadinovic [66], the current copula goodness of fit tests based on Monte Carlo simulations have room for improvement when used with survey data, although they work reasonably well for simulated data. Therefore, in this chapter (see the paper by Kaushik, Cirillo, and Bastin [60] for the original work), we explore the tie-adapted parametric bootstrap inference procedure for copula goodness of fit suggested by Kojadinovic [66] with survey data.

The relation between the frequency and nature of ties and the performance of the goodness of fit tests is yet to be elucidated. Theoretical results to that effect might be quite difficult to obtain, given the underlying complexity of the models [37, 40, 66]. Consequently, any inferences from these goodness of fit tests should be adequately examined before use. Fortunately, the survey sample data are available at hand to independently estimate the dependence measures with, say, the pairwise Kendall's rank correlations (τ). Data generated from a copula family, using the parameter estimated by (5.4), can be used to produce an independent copy of the population over the same area. Rank correlations between the dimensions of these synthetic data can then be compared with the correlations found in the original survey data. Any values assumed by the copula outside the space of the Cartesian product of the marginals are immaterial and should not be used for any tests or inferences. A resulting complication for discrete data with many ties is the sharply reduced space over which the copula is valid. For example, the space of any two categorical characteristics, like gender and marital status, forms very few unique points, regardless of the number of observations in the dataset. Consequently, the inference process of Genest and Rémillard [39], even adapted as Kojadinovic [66] suggests, breaks down. The only solution, at the moment, seems to be to increase the dimensionality of the dataset to enlarge the space forming the support of the copulas. Therefore, we rely on the pairwise Kendall's rank correlation coefficient to compare the synthetic dataset with the original.
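A minimal Python sketch of this comparison, computing the pairwise Kendall's τ matrix of a dataset with scipy, is given below; the two small integer arrays are illustrative stand-ins for the original and synthetic datasets.

import numpy as np
from scipy.stats import kendalltau

def kendall_matrix(data):
    # Pairwise Kendall's tau between the columns of an (n x d) array.
    # scipy's default tau-b variant handles the heavy ties in categorical data.
    data = np.asarray(data)
    d = data.shape[1]
    tau = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            tau_val, _ = kendalltau(data[:, i], data[:, j])
            tau[i, j] = tau[j, i] = tau_val
    return tau

rng = np.random.default_rng(3)
original = rng.integers(1, 6, size=(500, 3))
synthetic = rng.integers(1, 6, size=(500, 3))
print(np.abs(kendall_matrix(original) - kendall_matrix(synthetic)).max())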
7.1 Data Preparation

Nine dimensions from the ACS PUMS data are used in this study, chosen such that three groups can be formed, each containing three dimensions. The dimensions within each group are chosen to have similar relations to each other. The average dependence of a group, obtained by computing the mean of the d(d - 1)/2 unique pairwise dependences (the lower triangle of the dependence cross-tabulation), classifies each group as independent, weakly dependent, or strongly dependent. The relations between the dimensions in each group are estimated by Kendall's rank correlation coefficient, τ. The overall relation of a group is denoted henceforth as the average τ [67].

Figure 7.1 provides the pairwise Kendall's rank correlation coefficient matrix for the three dimensions in each of the three groups. Note that pseudo-observations were generated for all data using (2.6) before computing the rank correlation coefficients. The race code was found to have a countermonotone relation with age and gender. Therefore, the subtraction rule, which inverts the direction of the axis, was applied to the race code dimension to convert countermonotone dependence to comonotone dependence (figure 7.1a). The group subcaptions of the figure include the average correlation coefficient of the group in parentheses. The dimension headers are acronyms, and are the same as the ACS data dictionary names for each field. The descriptions of the fields are as follows:
1. Independent dimensions:
(a) AGEP: Age of the person (discrete ordinal),
(b) SEX: Sex of the person (discrete categorical),
(c) RAC1P: Recoded detailed race code (discrete categorical).
2. Weakly dependent dimensions:
(a) VEH: Vehicles (1 ton or less) available (discrete ordinal),
(b) HINCP: Household income (past 12 months) (discrete ordinal),
(c) WIF: Workers in family during the past 12 months (discrete ordinal).
3. Strongly dependent dimensions:
(a) JWMNP: Travel time to work (discrete ordinal),
(b) JWRIP: Vehicle occupancy (discrete ordinal),
(c) JWTR: Means of transportation to work (discrete categorical).

[Figure 7.1: Kendall's Rank Correlation Among Dimensions in ACS PUMS for PUMA 1107. (a) Independent dimensions (average τ = 0.050)[1]; (b) weakly dependent dimensions (average τ = 0.430); (c) strongly dependent dimensions (average τ = 0.620).]

[1] The pseudo-observations for dimension RAC1P (obtained by (2.6)) have been replaced by applying the subtraction rule, 1 - U_RAC1P, so as to ensure that the rank correlations with dimensions AGEP and SEX are positive.

7.2 Results

The results presented here were originally published in the paper by Kaushik, Cirillo, and Bastin [60]. We have only explored Archimedean copulas with specified distribution functions for 3 or more dimensions. These copulas currently include the Clayton, Frank, Gumbel–Hougaard, and Joe families. Only the Gumbel–Hougaard family of copulas (henceforth called the Gumbel copula) is presented in this chapter. Other copula families are also applicable. However, the parameter values assumed by the Clayton and Frank families produce infinities for the independent case, while a simple relation between the Joe family parameter and Kendall's rank correlation is not yet defined [68]. The distribution function for the Gumbel family was presented in (5.1), but is repeated here for convenience:

C_\theta(u) = \exp\left\{ -\left[ \sum_{j=1}^{d} (-\ln u_j)^{\theta} \right]^{1/\theta} \right\}, \qquad (7.1)

where u_j
are realizations from the uniform marginals (see identity (2.1)) [the text by Nelsen, 81, contains distribution functions for other copula families]. The relation between Kendall's rank correlation and the Gumbel copula parameter is straightforward, and is given by

\tau = 1 - \frac{1}{\theta}. \qquad (7.2)

The maximum likelihood estimates of the Gumbel copula parameter, obtained from (5.4) for the three groups of dependencies, are provided in table 7.1. The Cramér–von Mises statistics for the ACS PUMS dataset from PUMA 1107, computed using (5.9), are also included in the table, along with approximate p-values computed using (5.10) from 1 000 replications of the adapted parametric bootstrap procedure due to Kojadinovic [66]. As an additional reference, the Kendall's rank correlation coefficients obtained from the estimated parameters using (7.2) are also presented in table 7.1. Note that these rank correlation coefficients are quite different from the mean coefficients included with figure 7.1. Genest and Nešlehová [36] find that estimating copula parameters by inversion of Kendall's rank correlation coefficient may be biased, and recommend using the maximum likelihood estimator instead. Further, all recommendations included with the modified procedure due to Kojadinovic [66] were followed while computing the results shown.

Table 7.1: Results from Estimating the Gumbel Copula to the Data Groups
Data Group          Estimated Parameter  τ from (7.2)  CvM Statistic  p-value
Independent         1.098                0.089         8.710          0.001
Weakly Dependent    1.556                0.357         48.045         0.001
Strongly Dependent  1.908                0.476         1 804.954      0.001

The low p-values shown in table 7.1 indicate that the Gumbel copula is not a good fit to any of the three groups of data. However, that is partly a limitation of the inference procedure, and not of the Gumbel copula. We show this by creating a synthetic dataset representing the population of PUMA 1107 from the estimated Gumbel copula. The same number of realizations as in the original expanded dataset of PUMA 1107 were simulated from the three estimated Gumbel copulas using the mixture method proposed by Marshall and Olkin [77]. These three sets of 118 583 observations, in 3 dimensions each, encapsulate the dependence captured by the copula parameters. Each dimension in these generated independent, weakly dependent and strongly dependent datasets is inverse transform sampled using cumulative mass functions estimated from the corresponding dimension of PUMA 1107, to yield a synthetic dataset for the PUMA.

The subtraction rule was applied to the race code dimension (RAC1P) to convert negative dependence into positive dependence prior to copula estimation. This is a necessary step, as the Gumbel copula cannot model negative dependence. Therefore, for this dimension, the rule was reapplied on the values generated from the copula before inverse transformation. This ensured the simulated synthetic population data were as close to the PUMS data from the ACS as allowed by the dependence captured by the copula.

Pairwise Kendall's rank correlation coefficients were estimated for these synthetic population datasets generated using the Gumbel copula for PUMA 1107. Equation (2.6) was used to generate pseudo-observations from the simulated synthetic population data before estimating the correlation coefficients. Further, the subtraction rule was applied to the race code dimension to ensure the estimated rank correlation coefficients are positive, as was done while producing figure 7.1a. The results are presented in figure 7.2, using the same format as figure 7.1.
Average values of the rank correlation coefficients for the synthetic datasets are included in parentheses in the subcaptions of the figure. Again, we find that these correlation coefficients are different from the ones estimated using (7.2) from the estimated copula parameters. This result reinforces that method of moments based estimators might produce biased estimates of the copula parameter [36].

[Figure 7.2: Kendall's Rank Correlation Among Dimensions in the Synthetic Population Datasets. (a) Independent dimensions (average τ = 0.090)[2]; (b) weakly dependent dimensions (average τ = 0.390); (c) strongly dependent dimensions (average τ = 0.520).]

[2] The realizations for dimension RAC1P (obtained from the Gumbel copula) have been replaced by applying the subtraction rule to the values produced using (2.6), so that positive correlation coefficients are produced, similar to figure 7.1a.

Comparing figure 7.2 with figure 7.1, a few things are immediately clear. The Gumbel copula perfectly captures the weak dependence case, and almost recovers the independence case. In both of those datasets, the estimated p-values were low and suggested rejection of the Gumbel copula. The same low p-values were also computed for the strong dependence case, for which the PUMS and synthetic datasets are drastically different. Consequently, we can conclude that the inference procedures with discrete data are not reliable enough to produce the prudent decision in all cases. Further, we show that the modified procedure due to Kojadinovic [66], while adequately reliable with simulated discrete data, fails with survey data. Additional and alternate tests are strongly advised before using the results of the inference processes on discrete data.

The findings by Genest and Rémillard [39] are still valid, as large values of the statistic S_n lead to rejection of the null hypothesis by the inference procedures. The issues are apparent when the statistic values are small and there is no prudent reason to reject the assumed copula family, as demonstrated with the independent and weakly dependent datasets. However, since the distribution of the statistic depends on the unknown distribution of the copula parameter, which in turn depends on the marginal distribution functions of the multivariate random variable, quantification of the values of the statistic would be difficult [27, 40]. Therefore, data-driven methods of validating copula models are recommended as further research avenues.

Chapter 8: Conclusions from Population Synthesis

In this half of the dissertation, a novel method to generate synthetic population is demonstrated on simulated data and on a real-world case study. Using the simulated data, it was shown that the method can recover the dependence and indicate the correct copula family. In the real-world case study, not only was the dependence captured using copulas, but the captured dependence was also used to generate synthetic population at a much finer geopolitical division [see also 98, 99]. Moreover, a linkage between the ACS PUMS data and the decennial Census aggregate data was established by the captured dependence.

The demonstrated method provides contributions to multiple disciplines simultaneously. In microsimulation, it promises to significantly improve the availability of accurate data over all required characteristics.
Further contributions to small area estimation are provided by generating population for census tracts, which are widely used in many studies as the smallest unit of measure. The presented method also examines the applicability of the developed statistical knowledge about copulas to discrete data. Lastly, an alternate form of data linkage is provided by using the dependence captured from one dataset with another. Over the current state of the art in population synthesis, the proposed method has numerous advantages.
1. It produces synthetic population for a different and finer geopolitical area.
2. Combined estimation of household and individual characteristics is easily possible.
3. Additional dimensions can be included simply, increasing neither the memory requirements nor the sparsity of the data. Similarly, more observations have no negative impact on the method.
4. The method can work with other datasets, and also potentially in areas without samples, so long as the dependence can be reliably transferred from another area.
5. The parametric specification of the dependence is straightforward to interpret and debug.
6. Estimation of the parameter is guaranteed and can be programmed to occur quickly, even with very large datasets.
7. Rare observations do not hinder the method, nor do they require additional preprocessing.

This study illustrates that the use of copulas to generate synthetic multivariate datasets is a promising avenue, deserving more exploration, especially regarding copula selection and validation. The examination of the goodness of fit tests of copulas with survey data provided the beginnings of tools with which the tests might be improved, especially when used with discrete data. However, a lot of work still remains. Nonetheless, the novel contribution of the work was the application of the inference methods to detect the suitability of a copula to model a dataset produced by some unknown multivariate random process. A test on real-world data had never been conducted in the literature, and consequently the applicability of the methods to datasets outside the simulated realm was unknown. Using simulation, we showed that the Gumbel–Hougaard copula family is able to capture the multivariate dependence among the dimensions of two out of the three groups. Alternate methods to compare two multivariate datasets are strongly suggested. Although the case where the Gumbel copula is a bad choice is correctly identified by the inference procedures, the cases where the copula is an acceptable choice are erroneously falsified. Therefore, the strong recommendation is to use alternate, preferably data-driven, methods to validate the similarities between the original and synthetic datasets, and thus evaluate the applicability of a copula to model the original data.

SECTION II: Performance Measurement

Chapter 9: Introduction to Performance Measurement

While the use of ABM with a synthetic population constitutes an elaborate framework, where a set of models is used to estimate demand and loads on infrastructure from individuals and their behavioral patterns, supply side management requires real-time and direct insights into the current operational characteristics of transportation infrastructure. Such insights can be used to pinpoint areas of low performance, and allow immediate measures to be taken to reduce the negative impacts. Consequently, performance measurement has to happen in real time and provide reliable results that can be used to make important decisions.
The data that power these inferences have been increasing in volume and velocity for almost 15 years, and are now considered terabyte-scale big data. Most performance measurement algorithms are kept simple so that computation does not take too long with big data and results are available quickly. One of the primary real-time sources of data for performance measurement is data collected from GPS devices with mobile telemetry. These data are available over the whole network, represented as small road segments, and at frequencies exceeding an observation per minute. Consequently, these data are quite large, and analysis may take a long time to complete. However, since the insights are required in real-time, the analysis methods, tools and algorithms developed are designed to be simple, yet reliable and straightforward, to provide results as quickly as possible, with minimal interpretation of the model itself. Another advantage of using simple models for real-time inference is that they suit the skill level of practitioners in government agencies responsible for coordinating responses to deteriorating performance. Two such models, one for forecasting and another for computing congestion measures, are presented in this dissertation. The model that computes performance, while simple, is reliable enough to have been adopted, with slight modifications, by the Federal Highway Administration (FHWA) in the formulation of performance measurement rules under the Moving Ahead for Progress in the 21st Century Act (MAP-21) [26, 61]. The forecasting approach was likewise conferred the Best Paper from Americas award at the 24th Intelligent Transportation Systems World Congress for the simplicity and accuracy of the predicted traffic speeds [18].

The framework developed for producing short-term forecasts adapts ideas at the core of Small Area Estimation (SAE) from the spatial domain to the temporal domain [18]. Here, short-term refers to predictions up to 30 minutes into the future. The chief contribution of the developed framework is this shift of domain, which had not been attempted before. In this framework, future time points are treated as small areas in time, that is, there are no observations from those time points. The key idea of borrowing strength from other, larger areas is then used to make estimates about the future time points. The larger area, in this context, is the past, where many observations are available. The developed method uses these past observations in a simple model that predicts the future. The performance measurement methods, likewise, build on the strength of past observations to filter outliers and produce reliable estimates. The method involves overlaying data from a fixed window of time, say one day, from several similar days. For example, all data from weekdays for a couple of weeks can be overlaid within the 24-hour window to create a dataset with many observations for the measurement interval, which could be a minute or longer. This increases the density of data available in the measurement interval, increasing confidence when rejecting outliers, imputing missing data, or comparing the performance of that measurement interval against a baseline interval [18]. Outside real-time requirements, performance analysis is used to measure and report the overall impacts of transportation on human lives, environment, energy and resource utilization, time consumption, various costs of operation and service, and many other such facets.
Such reports are also mandatory under MAP-21, and are directly linked by law to the availability of federal funding. However, these results cannot be obtained solely with the big data measured over the network, because the actual number, type and fuel efficiency of vehicles is unknown. Other datasets, like the Highway Performance Monitoring System (HPMS), might record some information, like the volume of vehicles. Descriptive statistics like totals and averages by make and fuel efficiency can be compiled from administrative datasets, like vehicle registration information by type and vintage, and fuel efficiency and tailpipe emissions data from the Environmental Protection Agency (EPA). However, these data come from disparate sources available in different spatial and temporal domains; the result of this incoherence is an illusion of data scarcity [88]. Consequently, linkage of data across spatial and temporal domains is key to unlocking the various analyses possible with the available data.

Two methods developed for linking spatial and temporal datasets are documented in this dissertation. The first method links the GPS based traffic big data with incident information to find the impact of incidents on congestion and the resulting effects on people and government agencies. The second method conflates two spatial map layers, one from HPMS and another from a private corporation that provides real-time speeds on the network [63]. This spatial join allows transfer of Average Annual Daily Traffic (AADT), a proxy of roadway volume, to augment the speeds dataset, which can then be used to compute systemwide measures like vehicle miles traveled, total emissions, cost and time spent traveling, and fuel, energy and other resource consumption. Note that AADT must be disaggregated into the required intervals of time [63]. The paper resulting from the second conflation method won the Outstanding Paper Award at the 97th Annual Meeting of the Transportation Research Board.

9.1 Need for Short-Term Traffic Predictions

Short-term speed predictions have been hugely in demand in the industry in recent times. The main benefit of predicted speeds is the ability to adapt more quickly to changing conditions and provide additional driver assistance. The goal of such Advanced Traveler Assistance Systems (ATAS) is to reduce travel times and the negative effects of congestion, like wasted resources, lost time, and reduced traveler wellbeing. In the modern world of performance management, the new imperative is to extract as much value from existing infrastructure as possible. Therefore, the past 35 or so years have seen an increasing adoption of Intelligent Transportation Systems (ITS), which includes not only Advanced Traveler Information Systems (ATIS), such as congestion and travel time information, dynamic message signs, next bus service, and so on, but also Traffic Management Systems (TMS), like traffic incident management and real-time performance and congestion updates, with the goal of achieving system-optimal network flow and utilization [16, 135]. Much of this has been made possible by significant progress and revolutions in data collection, computational power, and algorithms [32]. These systems currently rely mainly on real-time data. Travelers, however, will arrive at points downstream along their path only after a period of time.
Consequently, a report of current conditions at downstream points may be of little value, especially because the conditions can change rapidly by the time the traveler arrives at those points. For example, congestion caused by an accident may dissolve by the time the traveler reaches the area currently congested. However, current conditions may prompt the traveler to alter the path to the destination, potentially increasing travel time and achieving the opposite of what was intended. The ideal situation is to have traffic conditions available 15 to 30 minutes into the future [1, 65, 110, 121, 124] to take the necessary decisions.

The limitation of not having predictions is well known, and has greater repercussions than just increased travel time for a few travelers. Travel time and congestion predictions will help all domains of ITS, including ATIS and TMS, as predicted conditions allow proactive measures instead of reactive ones. Reactive measures are those taken after the realization of an event, such as diverting traffic only when congestion worsens beyond a threshold. By predicting the progress of congestion, control measures may be set up earlier, mitigating the worsening of congestion, which is perceived as a positive benefit by society in general [121]. Further, reducing congestion also makes travel more sustainable and less damaging to the environment and quality of life. Most modern work in predicting traffic is focused on this area precisely because of this strong demand for better systems, and because travel time is one of the most easily understood of all performance measures on a roadway [83, 138].

9.1.1 Motivation

Predictive models need to be fit, i.e., their coefficients need to be estimated, before predictions can be made. This is true even for non-parametric models and neural networks, where the fitting algorithm or neurons need to be trained to make predictions as real-time data flow in. Most recent literature dealing with online models takes a combined approach where, even if a model is trained using historic data, its coefficients are constantly updated with each new data point [119, 122]. This model fitting or updating consumes time, and requires the most computation power. The drawback of existing online models is that they do not scale with data. Fitting on a huge network using high frequency data requires a huge amount of computation power and time. In such a scenario, fitting models over the whole network may not be possible before the next data point is available. Therefore, the model may never be fit in time to start making predictions. This is arguably the most probable reason why even recent literature, published in times when big data and huge computational power are readily available, does not have a combination of large network, high frequency data and short-term predictions [122]. By decoupling the fitting from the prediction steps, models can be made such that forecasting is fast and easy, while fitting occurs at a convenient time so that it does not impact predictions, and can be done on moderately powerful computers with sufficient time to spare. This is the fundamental idea behind the synthetic time series framework proposed in this chapter. The repeatability of traffic data is used to construct the synthetic time series, which is used for fitting the models and then making predictions.
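To make the decoupled fit/predict idea concrete, the following minimal sketch (not the dissertation's implementation) fits a model offline to a complete, comparable past day and then applies the frozen coefficients to the partially observed current day to forecast 30 minutes ahead. The placeholder data, the chosen ARIMA order, and helper names such as fit_offline are illustrative assumptions; statsmodels is used only as a convenient stand-in for whichever time-series model is actually selected.

    # Sketch of the synthetic time series idea: fit offline on a comparable past
    # day, forecast online for the current day. Assumes 1-minute speed data.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    def fit_offline(past_day_speeds, order=(1, 0, 1)):
        """Fit a model to a full day (1440 observations) from a comparable past day."""
        return ARIMA(past_day_speeds, order=order).fit()

    def forecast_online(fitted, observed_so_far, horizon=30):
        """Apply the previously fitted coefficients to today's partial series
        and forecast the next `horizon` minutes without re-estimating."""
        updated = fitted.apply(observed_so_far)   # reuse coefficients, new data
        return updated.forecast(steps=horizon)

    # Hypothetical usage: speeds from last Wednesday train the model,
    # speeds observed so far today drive the prediction.
    last_wed = np.random.normal(55, 5, 1440)      # placeholder data
    model = fit_offline(last_wed)
    today_so_far = np.random.normal(54, 5, 480)   # first 8 hours of today
    next_30_min = forecast_online(model, today_so_far)

Because the expensive estimation step runs offline on last week's data, the online step reduces to applying fixed coefficients to the incoming stream, which is what makes the approach scale to a large network.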
Figure 9.1 shows vehicle speeds observed on a segment of road from three consecutive Wednesdays in September overlaid on top of each other over a 24-hour period. The repeatability of traffic patterns is clearly apparent in this figure. Forecasting traffic patterns is a very complex problem due to the countless factors that influence the motion of vehicles on the road, some of which, such as driver temperament and driving style, cannot be included in a model [139]. Even the factors that can be included as auxiliary variables, like upstream and downstream conditions, weather and incident information, etc., need reliable and interrelated databases. However, when data from repeatable periods are overlaid, a mean trend becomes apparent [see 61, and also figure 9.1], which can be modeled, while the variations from each period are relegated to the error component. Autoregressive Integrated Moving Average (ARIMA) models are inherently designed to model just such a time series, where the mean trend is captured by the coefficients, while the error term captures the noise in the time series data. The use of historic data sits squarely within the use case of the synthetic time series framework.

[Figure 9.1: Overlaid Traffic Speed Observations from a TMC Segment. Speed (MPH) by time of day for three consecutive Wednesdays: 09-14, 09-21, and 09-28.]

9.1.2 Synthetic Time Series

Suppose that the data for a day were completely available beforehand. Models fit to these data would be able to accurately forecast traffic from any point in time for that day for any given horizon. However, since observations for a day only become available as the day progresses, data for the day can be considered incomplete. The synthetic method ameliorates this problem by using data from another time period, for which the complete set is available, to inform estimates for the current period. To elaborate, suppose short-term traffic speed predictions on a roadway segment for September 21 from figure 9.1 are required. It can be seen that the data profile from September 14 is very similar to that from the 21st. Therefore, models can be fit to data from the 14th, and the most reasonable one selected for predicting traffic speeds on the 21st.

The synthetic method as outlined above accords some favorable advantages to the models. Firstly, because the models are fit on data from the whole day, they are trained on the entirety of expected traffic patterns. Hence, they can reasonably predict the future short-term trend of traffic patterns based on a few real-time observations. The literature contains many examples where models fit on partial data consistently either overestimate or underestimate during peak and off-peak periods [52, 130, 137, etc.]. This limitation is overcome in a natural way by the synthetic method. Secondly, in the example above, the models were fit to data from a week before the day of interest. This gives ample time to fit multiple models and select the most reasonable one for a segment of the roadway. The available time also allows models to be fit for each individual segment in the network. Traffic characteristics can vary widely not only between different roadway classes, but also between different sections of the same roadway.
Therefore, a single large and complex model for a network may be outperformed by small and simple models for the individual segments that form the network. In addition to simplifying the models used, this also allows easy introduction of auxiliary variables providing supplementary and complementary information, like incidents, weather, and upstream and downstream conditions, to the simple models. Finally, the synthetic method is extremely flexible. It is compatible with almost all models presented in the literature so far, the only exceptions being some data-driven models, like Neural Networks, that use bulk data to train the models. The only drawback of the synthetic framework is that it requires panel data across time, and that the data should be similar across panels. However, this hurdle can be overcome to varying degrees by careful modeling and by transferring coefficients across time and datasets, limited by the linkages that can be established between the datasets.

When compared to data-driven approaches, the synthetic method has the additional advantage that a smaller dataset can produce sufficiently reliable estimates. Researchers have postulated that about a year's data is required for training reliable data-driven models [115]. However, on the network studied (figure 11.2), data for a whole year at one minute frequency for each segment measures in gigabytes. Such big data, although more readily available today, requires expert database administrators and programmers to handle and utilize in data-driven models. However, as demonstrated here, data from a single day is sufficient to predict traffic for the same day in the next week, vastly reducing the data requirement. Of course, given the flexibility of the synthetic method, one may train a data-driven model within the framework and apply it instead.

9.2 Requirement of Performance Measurement

The National Performance Management Research Data Set (NPMRDS) was procured by the Federal Highway Administration Office of Operations in 2013, initially serving as a research data set for sponsored programs; but with the anticipated emphasis placed on performance measures in the MAP-21 legislation, rights to use the data set to compute performance measures were secured for state Departments of Transportation (DOT) and Metropolitan Planning Organizations (MPO). NPMRDS is a form of commercial Global Positioning System (GPS) probe data, meaning the traffic conditions reported are derived from vehicles self-reporting speed, position, and heading based on GPS electronics on a periodic basis. It differs from commercially available data feeds in that FHWA specified that no smoothing, outlier detection, or imputation of traffic data be performed. As a result, NPMRDS has unique characteristics with respect to the statistical distribution of reported travel times. These characteristics render traditional processing techniques ineffective for obtaining accurate performance measures. A method to handle the challenges posed by the nature of NPMRDS, and to compute meaningful performance measures from it, is proposed. First, the challenges in processing NPMRDS data are exposed and a method to overcome them is designed. Then, the results from the proposed method are compared with traffic data from commercial probe data sources as well as reference re-identification data collected at two case study locations.
The case studies indicate that the method more accurately captures performance measures from NPMRDS, using techniques originally developed to reflect travel time and travel time reliability on interrupted-flow facilities.

Quantifying congestion is the most prevalent form of highway performance measurement currently in practice, as evidenced by the majority of documents released by various departments of transportation [31, 47, 106, 116]. The ability to compute congestion measures for the whole network is also being stressed by initiatives like MAP-21. Traditionally, the data used for computing these measures involved floating car trips during peak and off-peak periods. The inability to scale these data collection techniques meant that congestion measures were spotty in time, and restricted to only the most important road networks. With the wave of data revolution that is spreading through transportation, computing congestion measures has become more widespread, often encompassing large networks and various road classes. However, not all data are created equal. Where floating car runs could capture travel times in any road setting, the new GPS based probe data suffer from limited fidelity on roads with friction to traffic flow, like signalized intersections, driveways and parking. The loss of accuracy is caused by the filtering that is applied to probe data because the source of the data cannot be confirmed. Pedestrians and cyclists with GPS devices like phones can unwittingly contribute their movements to probe data, and these need to be filtered out. Similarly, vehicles experiencing trouble, or stopped at rest stops, are excluded from observations because they are no longer part of the traffic flow. However, trouble arises because the algorithm cannot distinguish between vehicles stopped at signals and vehicles stopped due to mechanical failure, driver decision, incidents or weather related dangers.

The stress laid on performance management is even more acute due to the budget constraints facing almost all departments of transportation. MAP-21 essentially conditions the distribution of federal funds on setting and meeting performance targets. In order to facilitate computation of these performance measures, and to ensure all departments produce results using the same method, the FHWA procured and made available a national travel time dataset, called the NPMRDS [28]. Additionally, through the Notice of Proposed Rule Making (NPRM) process, the FHWA has provided the algorithm and formulas to be used to compute performance measures from the NPMRDS [26]. These rules, based on the work documented in this dissertation, went into effect on May 20, 2017. The NPMRDS is available to all state DOT and MPO free of cost. This dataset provides travel times on road segments spanning the National Highway System (NHS) and the Strategic Highway Network (STRAHNET) at 5-minute intervals, for every day of the year. The NPMRDS reports different travel times for automobiles and trucks, to enable separate measurement of freight performance. NPMRDS is made available monthly, as an archive of travel times for segments in the NHS, state by state. With the availability of the NPMRDS, small state DOT and MPO were suddenly faced with the challenges of processing big data. In fact, the first three quarterly webinars on NPMRDS focused on the issues of data size and software that could process the data [28, 29, 30].
As the NPMRDS slowly brought about a change in the experience of personnel hired by DOT and MPO, a reliable and straightforward, yet unbiased, technique for computing performance measures was required. NPMRDS is specified to be a raw dataset, with minimal filtering applied to the observations. It also does not impute missing data, leaving large holes in the dataset, especially during night hours. Both of these attributes make it a difficult dataset to work with. An algorithm capable of handling these problems was developed to provide a simple method to compute performance measures from NPMRDS. The method adapts the FHWA guidelines to compute the Planning Time Index (PTI), Travel Time Index (TTI) and the Buffer Time Index (BTI) for use with percentiles obtained from the NPMRDS data over the required period [31]. The developed method is easy to encode in a computer program and has a small computational cost. Moreover, by comparing to Bluetooth re-identification data collected from the side of the road, it is shown that the method is not biased and yields correct results, even for interrupted flow facilities. Additionally, the fidelity of the NPMRDS data on different classes of roadway was assessed, and it was found that NPMRDS, without filtration and other cleanup at the source, performs better than cleaned data, especially on roadways with high variation in travel times, like arterials.

9.3 Conflating Two Geospatial Datasets

The advent of mobile devices with embedded global positioning systems has allowed commercial providers of real-time traffic data to develop highly accurate estimates of network-level vehicle speeds. Additionally, methods have been developed to monitor network performance in real-time, and even make short-term predictions. Traffic speed data have far outpaced the availability and accuracy of real-time traffic volume information. Limited to a relatively small number of permanent and temporary traffic counters in any city, traffic volumes typically cover only a handful of roadways, with inconsistent temporal resolution. A conflation methodology was developed to address this data gap by coupling a commercial dataset of typical traffic speeds (by roadway and time of week) from TomTom to the U.S. FHWA's Highway Performance Monitoring System (HPMS) database of Average Annual Daily Traffic (AADT) counts by roadway. The novelty of the developed method is the ability to conflate independent road geometries spanning the whole country using big data techniques. The resulting product is a national dataset providing traffic speed and volume estimates under typical traffic conditions for all U.S. roadways with AADT values.

The ubiquitous availability of speed data is revolutionizing the operation of transportation networks, especially roadways. It is now possible to get a bird's-eye view of current conditions on a whole network spanning a large area [61]. Rerouting vehicles based on current roadway conditions to avoid congestion and other delays is almost a regular occurrence, especially with advances in route guidance systems built either into vehicles or using smartphones and other mobile devices. Dynamic message signs are increasingly being used to alert travelers of current road conditions downstream so that delays are minimized proactively. These Advanced Traveler Information Systems (ATIS) are being informed by ubiquitously available speed data [137].
Similarly, for planning, a panel of speed data collected over a period helps identify problem areas, bottlenecks, and other issues that may each require different solutions. For example, areas of recurring congestion on certain sections of an urban interstate may prompt a detailed study into the intersection design, the number of lanes, lane widths, demand, and other factors to find workable solutions. Most of these studies, however, require more than just speed data. Such studies rely, perhaps more heavily, on data such as the volume of vehicles during peak hours, the mixture of vehicle types, the geometric design of the highways, and the rate, type, and severity of incidents. Researchers and planners are increasingly finding that these studies are made difficult not by the availability of datasets, as most of them are easily acquired, but by the inconsistencies in how different datasets are packaged and reported. No unified underlying geospatial framework exists to which all sets conform, despite global standards and organizations, such as the Open Geospatial Consortium and Technical Committee 211 of ISO. Private companies in the transportation domain continue to push their own independent geospatial data reporting frameworks or continue adhering to older standards [61]. Public agencies, on the other hand, suffer from a shortage of skilled people capable of digitizing existing maps and keeping the current digital maps updated [61]. Two approaches to effect spatial conflation were developed and are documented in this dissertation.

The issue tackled here is the spatial and temporal conflation of the HPMS dataset with Multinet, a commercial speed dataset purchased from TomTom Inc., a multinational vendor of maps and other cartographic services. The spatial conflation is complicated by the lack of a unifying framework used by both datasets, which makes join conditions difficult to resolve automatically. The HPMS dataset is represented by single line segments that follow either one of the carriageways in roadways. On the other hand, Multinet is based on a proprietary segmentation scheme that follows the centerlines of both carriageways in roadways with dual carriageways. There is no parity in the segment lengths either, and consequently segments from the two datasets do not coincide anywhere in the entire United States. Specific algorithmic tools developed to address these challenges are presented. The conflation was not perfect, but the average error in total conflated segment lengths was just under 7%, which, for the whole United States, is commendable. The developed algorithm can best be described as a scoring algorithm, where points are awarded in direct relation to the strength of the match between the segment metadata of the two datasets. Most of the metadata were generated from the geometry of the line segments themselves, such as the parallel and perpendicular distances between specific points along the lines and the angle of separation between them. Some criteria were borrowed from the few fields that were roughly common between the datasets, such as roadway name and shield numbers. After joining the datasets using the spatial join, the AADT values are transferred from the HPMS to the TomTom Multinet dataset. However, the AADT values are a single number per segment, representing the average annual daily traffic expected on that segment. Using a technique developed by the Texas Transportation Institute (T.T.I.), the AADT values were disaggregated to 15-minute intervals for each typical day of the week, as sketched below.
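The T.T.I. splitting procedure itself is not reproduced here; the sketch below only illustrates the general idea of allocating a daily AADT total to 15-minute intervals with a diurnal weight profile, for instance one derived from the typical speeds. The profile values, the weighting scheme, and the function name are hypothetical.

    # Illustrative (not the T.T.I. algorithm): split a segment's AADT into
    # 96 fifteen-minute volumes using a diurnal weight profile that sums to 1.
    import numpy as np

    def disaggregate_aadt(aadt, profile_weights):
        """aadt: average annual daily traffic for the segment (vehicles/day).
        profile_weights: 96 non-negative weights, one per 15-minute interval,
        derived e.g. from typical speed/volume patterns for that day of week."""
        w = np.asarray(profile_weights, dtype=float)
        w = w / w.sum()          # normalize so the daily total is preserved
        return aadt * w          # vehicles per 15-minute interval

    # Hypothetical example: a flat profile with a morning and an evening peak.
    base = np.ones(96)
    base[28:36] *= 3.0           # 07:00-09:00 peak
    base[64:72] *= 3.0           # 16:00-18:00 peak
    volumes_15min = disaggregate_aadt(aadt=45_000, profile_weights=base)
    assert abs(volumes_15min.sum() - 45_000) < 1e-6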
The splitting algorithm due to T.T.I. requires typical speeds by day of week and direction of travel, which would not be available without the join connecting the AADT and the typical speeds from TomTom Multinet.

9.4 Merging Congestion and Incident Data

Congestion and incidents have an intrinsic relation with each other. Congestion is shown to depend on the demand-to-capacity ratio of any roadway at a given time. Incidents can take away some of the capacity of the roadway, thus altering the ratio and feeding congestion levels. Conversely, congestion increases the risk of incidents, as drivers unaware of the downstream conditions might collide with slow moving vehicles in congestion. The effort to find the impact of various incidents on congestion was developed to answer such questions. The focus is to correlate the incidents on the freeway system of Maryland with congestion as defined in the Maryland State Highway Mobility Report. The data used in this study were collected in 2011, and span the whole year. It is found that incidents have a varying effect on congestion. Although the probability of incidents increases with increasing number of vehicles, incidents are more location specific than volume dependent. Locations prone to incidents show a larger impact of incidents on congestion as compared to locations with fewer incidents.

Effective improvement of roadway performance does not solely involve adding new capacity to resolve the existing problems of congestion. Often, such action causes evanescent improvements, and gains are quickly overcome by the latent demand. Therefore, it is crucial to understand the underlying cause of traffic jams, and to differentiate between those caused by capacity constraints and those caused by bad geometry or design constraints. Here, we divide congestion into two broad categories: recurring, caused by more demand than capacity, and non-recurring, caused by incidents. The distinction between the two causes of congestion is important to tailor the correct solution to the problem. The idea of Traffic Incident Management (TIM) is based solely on the ability to make this distinction. TIM refers to the handling of incidents rapidly, so that the disruption caused is minimized. Providing TIM patrolling in areas experiencing recurring congestion might not be as effective as providing it in areas where motorists face a higher chance of experiencing distress. Various departments of transportation in the United States understand and prioritize providing aid to motorists so that issues can be quickly resolved with minimal disruption to the commute of others. TIM forms a considerable portion of the annual Maryland State Highway Administration Mobility Report [76]. Similarly, the mobility reports published by other departments of transportation, such as the Federal Highway Administration (FHWA), the Texas Transport Institute (T.T.I.), Washington State, and others, lay considerable emphasis on executing TIM [31, 47, 106, 116]. Further, given the budget constraints, effectively providing targeted incident management services, as compared to a blanket application, is more important. Consequently, departments of transportation are interested in knowing the best areas and routes for directing highway patrol so that motorists having difficulties can be identified, reached and assisted with minimal loss of time.
The presented study spatially and temporally joins the congestion and incident databases to find the areas where incidents play a larger role in causing and prolonging congestion. The data used were compiled throughout 2011. Congestion, in this study, is defined as occurring where traffic speeds drop below 60% of the free-flow speed for an extended period of time. The congestion is given a score based on its duration, the length of road congested, and the frequency of occurrence. Incidents are recorded by the Maryland Coordinated Highways Action Response Team (MD-CHART), and include a number of categories, some of which are alerts. The location of the head of the bottleneck and the location of the incident are spatially joined to infer the relationship between them. Temporal analysis then follows, to determine how many occurrences of the bottleneck were caused or prolonged by incidents.

Note that the conflated datasets can be used with the synthetic time series framework for forecasting traffic. Use of volume and incident information in the framework would further enhance the reliability of the predictions. However, such a task is relegated to a study beyond the scope of this dissertation. In this second half of the dissertation, a literature review of the existing methods is presented in chapter 10, followed by a description of the datasets used in chapter 11. The developed methods are documented in chapter 12, while the results are presented in chapter 13. Finally, chapter 14 provides a summary and concludes the second half of the dissertation.

Chapter 10: Review of Performance Measurement

The field of performance measurement is quite vast, and each technique developed and documented in this dissertation has volumes worth of literature. Important and seminal works associated with each method are described in this chapter. Note, however, that the literature in this chapter is from the time when the methods were originally developed. Given steady improvements in data collection, handling and cleaning, and also improvements in the available tools and algorithms to perform the analysis, the current state of the art might be better than the developed methods.

10.1 Literature on Traffic Speed Prediction

The vast trove of knowledge available on the subject of traffic data forecasting possibly warrants a library dedicated to archiving the published literature. Over time, numerous attempts have been made to gather and summarize the literature in review papers [7, 50, 65, 83, 110, 118, 119, 121, 122, 137]. In fact, special issues have been published dedicated to just the topic of short-term forecasting [140]. From these review papers, literature of historic significance to the proposed model is briefly outlined. Readers wishing to gain further knowledge in short-term traffic forecasting are referred to the cited review papers and special edition journals. The main trend that can be observed from the review papers is the evolution of prediction models. Initially, statistical models were favored, mainly because of the cost of computation [2, 20, 46, 72, 130]. However, since the late 1980s, models have been steadily replaced with machine learning procedures [1, 7]. At first, statistics based machine learning techniques were employed, such as Kalman filters and Bayesian inference [85, 118, 135]. Later, those were replaced with purely machine learning based models such as k-Nearest-Neighbors, Neural Networks, etc. [50, 65, 110].
Parallel to the models, the data available and accessible evolved and grew. Initially, forecasting was based on very locally collected data, gathered either automatically using loops [130] or manually. This limited the use and range of the methods to just a part of a road. Such methods are useful mostly for their academic importance, and cannot be used for delivering ITS products. However, with the advent of new data, mainly GPS based probe data, the situation changed [61]. In fact, it is arguably this abundance of data that caused the boom in adoption of data-driven approaches [32, 83, 115]. More importantly, this motivated the application of predictions to more than a section of road: to entire roadways and networks [15, 58, 79, 85, 126, 128, 133]. Recent literature discusses very little about parametric statistical models such as ARIMA, and is focused more on data-driven approaches [111]. Most of the recent studies deal with the whole network, instead of just a section or a road [32, 52, 73, 117]. Unfortunately, the same cannot be said of studies on parametric methods, specifically ARIMA and its derivations [12, 129].

The main drawback in using parametric models for a large network is the heterogeneity of the roadway characteristics. Fitting and calibrating a parametric model to a vast and diverse network requires a huge amount of work, and the results may well be outstripped and outperformed by a neural network approach [32, 111]. In the documented study, the problem is solved by simply treating each segment as an individual entity and fitting a different model for each. That way, the best ARIMA order, based on the lowest Bayesian Information Criterion (BIC), is applied to each segment for each of 10 weekdays (a total of 26 540 models are selected out of 716 580 possible models), guaranteeing that the predictions for that segment for the same weekday in the following week are the best possible. Of course, the approach can be improved by considering the impact of the upstream and downstream segments, incident data, and weather data, which is part of planned future work.

10.2 Performance Measurement Literature

Departments of transportation and research institutions that have some experience with probe data and big data were the first to use NPMRDS. Since the dataset cannot be used for general research, except research that directly leads to performance measures, there is little literature available, and most of it is presented in the quarterly NPMRDS webinars held by FHWA [28]. Probably the first publicly available document was a presentation by the Wisconsin Traffic Operations and Safety Laboratory, presented during the second quarterly NPMRDS webinar in February 2014 [28, 29, 96, 114]. The authors primarily discuss performance measures from NPMRDS, along with representation of the data on a map. Another presentation in the same webinar, by the University of Maryland, highlighted the differences in the TMC segmentation scheme and map realizations used by NPMRDS and the I-95 Corridor Coalition's Vehicle Probe Project (VPP), concluding that direct comparisons between different sources need to be done carefully to account for the differences in segment properties [29]. Another study that compares NPMRDS data with Bluetooth re-identification and VPP probe data was conducted at CATT, and was presented at the 2014 ITS World Congress [62]. The University of Minnesota and the Minnesota Department of Transportation visualized data from NPMRDS on the map layer provided with the data [13].
No processing of the archived NPMRDS travel time data was carried out for this visualization. The American Transportation Research Institute (ATRI) later used NPMRDS to compute the congestion and delay price paid by the trucking industry [91]. Freight truck data from NPMRDS were used in the study, in addition to data from ATRI's Freight Performance Measures (FPM) database. The Upper Midwest Reliability Resource Center maintains a Travel Time Reliability Reference Manual online [11]. In the manual, NPMRDS data were compared with probe data from INRIX, finding that NPMRDS reports a lower mean travel time with a higher variation as compared to data from INRIX. In the third quarterly NPMRDS webinar, Iteris shared their work on implementing MAP-21 performance measures [30]. This was the only study where extensive processing of NPMRDS data was done prior to computing performance measures.

10.3 Literature on Geospatial and Temporal Conflation

Conflation, as used here, means a spatial join rather than data fusion, as the term is commonly used in the literature. Moreover, the developed conflation methods can be viewed as a join in a relational database system, where joins are used to transfer data from one table or database to another. The spatial join is used to transfer the AADT values from the HPMS dataset to the Multinet dataset. Additionally, another type of join is used to transfer incident information to congestion data. Literature that deals exclusively with the conflation of road geometries, either as lines or points, covers a wide range of methods, with no one-size-fits-all method. This is perhaps not surprising, because each conflation effort has unique challenges and requires a custom-tailored solution. Road geometry conflation is mainly born out of the need to integrate open access datasets with existing datasets used by private or public agencies. This linkup is increasingly seen as a cost-effective way to keep datasets up to date with information generated by drivers, so that they can be used to provide the most applicable real-time solution to issues faced by drivers [14, 24, 132]. The major source of open geospatial data is Open Street Maps (OSM), which provides rich details about the real-time urban landscape.

One recent paper on road geometry conflation uses urban blocks as polygons surrounded by road segments. The goal of the conflation is to connect OSM with government datasets. The polygons from the two sources are matched together in a two-step approach before matching the surrounding road segments. The authors claim that this process has a very low computation cost because the number of urban blocks is fewer than the number of road segments. However, this method is limited to urban areas only [24]. Yang et al. proposed a probabilistic relaxation method to again match the OSM data to government datasets. The relaxation method performs a heuristic search of the neighboring space from an initial set created using the dissimilarities between the geometries. The probability matrix is iteratively updated based on selected criteria until a match is found [132]. Chen et al. developed an algorithm to attribute floating car data to map segments. The authors argue that existing algorithms found within in-vehicle navigation systems cannot accurately associate low-frequency data to map segments, and propose a dynamic program supported on multiple criteria as a solution [14].
Other examples of road geometry conflation in the literature deal with comparing and benchmarking the accuracies of datasets rather than transferring or updating data. Chiefly, these studies also relate to OSM, as it is crucial to understand the caveats in this modern, free, and crowdsourced dataset, given the rising importance placed on it. Important studies in this domain are the works by Girres and Touya [42], Haklay [45], and Neis, Zielstra, and Zipf [80]. The history of automated road geometry conflation can be traced as far back as 1988 to Saalfeld, who used successive approximations to overlay two map geometries [102]. Noteworthy papers have since included a statistical approach [125] and an approach matching semantics prior to overlay to ensure the overlay was as accurate as possible [127]. All these papers used map features, like lines, points, or nodes, for the conflation. A rule-based method, much like the one proposed here, was first proposed by Cobb et al. [19]. This method also used metadata, in addition to geometric attributes like shapes and locations, for the conflation. The decade between 1998 and 2010 is filled with notable contributions by authors using a combination of different data attributes, including geometric and semantic, to achieve a conflation that suits the needs of the study. Altogether, over a hundred relevant papers were produced during this decade [101, 136]. This literature also provided ideas for the spatial and temporal conflation of the incident and bottleneck data.

Chapter 11: Description of Datasets used for Performance Measurement

A number of datasets have been used for performance measurement, and each was processed differently before applying the developed algorithms. In this chapter, an overview of the datasets used is provided first, followed by a description of the preprocessing for each method, as appropriate.

11.1 GPS Probe Based Datasets

GPS probe based data are now at the core of, and an important part of, the available transportation data. There are multiple datasets that are directly measured probe data, and even more which distill some initial results for consumption in other models. The following four datasets are based on GPS probe data, and have been used in the different methods documented in this dissertation.

11.1.1 Vehicle Probe Project (VPP)

The data used in this project are obtained from the Vehicle Probe Project (VPP), which started as a result of contracts awarded by the I-95 Corridor Coalition since 2008. Contractually, the data are received in real-time at one minute intervals. The data come from probe data vending companies, of which INRIX, HERE and TomTom are the three major players in North America. The data are based on vehicle speeds collected by the probe data vendors from fleet operators, individual users and other companies which deal with data collection and reporting [61]. Vehicle speeds reported in VPP rely on GPS and telemetry technologies. The GPS devices in the vehicles compute their location, compass bearing, and speed. This information is then periodically transmitted such that the vehicle leaves a bread-crumb trail of locations. By attributing these speeds to directional roadway segments based on reported location and direction, then statistically aggregating these observations, a product like the VPP is produced. Usually these data are collected as part of a tacit agreement where the user gets real-time traffic, routing or other information, while the company gets the GPS data.
Probe data vendors often use proprietary algorithms to clean, filter and impute missing data before reporting segment speeds at the desired frequency, one minute in the case of the VPP [61]. Traffic Message Channel (TMC) segments are used as the geographic boundaries to aggregate the data. The TMC is one of the oldest roadway segmentation schemes, in which the roadway is divided at intersections to form the segments. In the case of the VPP, all valid GPS observations attributed to a TMC segment in one minute are aggregated together and reported for that segment at that minute. There may be some instances of time when GPS observations are not available, like in the midnight hours, when traffic volume is lowest. Also, if an observation cannot be clearly identified as an outlier, confidence in that observation is low. In such instances VPP data are bolstered using historic observations from the same time period or by the free-flow speed of the roadway. The data have fields to identify these modified records and indicate the level to which they have been altered. However, original (raw) readings from vehicles are not available.

11.1.2 National Performance Management Research Data Set (NPMRDS)

The NPMRDS is a national dataset made available by the FHWA to all state DOT and MPO [28]. The University of Maryland (UMD) has access to the dataset through an agreement with the Maryland State Highway Administration (SHA). The data are archived alongside other probe data received via the VPP. The NPMRDS covers all of the National Highway System (NHS) and the Strategic Highway Network (STRAHNET). The NPMRDS is split into segments from intersection to intersection, as per the TMC specification. Speeds for trucks, automobiles and combined traffic are reported for each segment at five-minute intervals for every day of the year.

The NPMRDS is inherently compiled from GPS probe data. GPS devices equipped with telemetry in vehicles are usually programmed to periodically transmit the location, speed and heading computed by the GPS hardware to a central processing station, usually the offices of the GPS device vendor or, for fleet operations, the vehicle coordinator. These data are then attributed to road segments given the location and heading of the vehicles. Usually, as for the VPP, the observed speeds are filtered to exclude obvious outliers, like pedestrians and cyclists with GPS enabled phones, stopped vehicles, and so on. The cleaned speeds are then aggregated over the reporting window and reported. In the absence of real observations from vehicles, historic observations or educated guesses of the real conditions might be reported. The NPMRDS, however, contractually prohibits filtering of speeds, other than those of wrong-way travelers, pedestrians and cyclists. Additionally, imputing historic or estimated speeds during periods of no observations is also prohibited. These two characteristics impart difficulties in using the NPMRDS directly for a finished product. Namely, the lack of outlier and noise filtering makes estimates unreliable, as stopped vehicles are often included (e.g. trucks parked at stops with electrical services turned on). The gaps in the data caused by non-reporting during periods with lean traffic cause problems for models that rely on an uninterrupted time series for analysis (e.g. scanning for congestion, vehicle trajectory tracing models, and others). Such problems were difficult to overcome even for professionals using state of the art algorithms which relied on a continuous supply of data.
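Because the NPMRDS arrives raw and gappy, the percentile-based indices introduced in section 9.2 (TTI, PTI, BTI) are computed directly from whatever travel time observations exist in the analysis period, which makes them robust to the residual outliers and missing epochs. The sketch below uses the conventional FHWA-style definitions of the indices; the exact percentile substitutions adopted by the method documented in chapter 12 may differ, so the formulas here are illustrative rather than the dissertation's final specification.

    # Illustrative percentile-based congestion indices from raw 5-minute
    # travel times on one TMC segment over an analysis period.
    import numpy as np

    def congestion_indices(travel_times, free_flow_tt):
        """travel_times: observed travel times (seconds); gaps are simply absent.
        free_flow_tt: reference (free-flow) travel time for the segment."""
        tt = np.asarray(travel_times, dtype=float)
        tt = tt[~np.isnan(tt)]                  # ignore missing epochs
        avg = tt.mean()
        p95 = np.percentile(tt, 95)
        return {
            "TTI": avg / free_flow_tt,          # Travel Time Index
            "PTI": p95 / free_flow_tt,          # Planning Time Index
            "BTI": (p95 - avg) / avg,           # Buffer Time Index
        }

    # Hypothetical usage with roughly one month of 5-minute observations.
    obs = np.random.lognormal(mean=np.log(120), sigma=0.25, size=8000)
    print(congestion_indices(obs, free_flow_tt=100.0))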
11.1.3 Bottleneck Dataset

The bottlenecks are identified and ranked by the Bottleneck Ranking Algorithm developed by the Center for Advanced Transportation Technologies (CATT) at UMD. This algorithm is applied to the probe speed data archived by the VPP. The function underlying the algorithm is summarized in figure 11.1. A bottleneck is suspected when the speeds observed from a TMC segment drop below 60% of the free-flow speed of that TMC segment. If the speed remains below the threshold for 5 minutes or more, the bottleneck is confirmed and added to the log of bottlenecks originating at that TMC. As conditions improve, and the speeds rise above the 60% threshold and remain there for 10 minutes or more, the bottleneck is considered clear, and the log entry is closed. The duration and suspected length (the sum of lengths of all upstream segments that are also similarly congested) of the bottleneck are recorded. If two bottlenecks merge, or the head of the existing bottleneck moves downstream, that is, downstream segments start getting congested, then the most downstream congested segment is considered the head of the bottleneck. If the bottleneck propagates into a spur in the road, where two streams of traffic merge, the merging point between the two roads acts as the head of the bottleneck on the merging road.

[Figure 11.1: Graphical representation of the function to detect bottlenecks]

The bottleneck impact factor is computed by multiplying the average length and duration of the bottleneck by the number of occurrences during the queried time frame,

    BIF = n \cdot \bar{d} \cdot \bar{l}_{\max},    (11.1)

where BIF is the bottleneck impact factor in mile-hours, n is the number of bottlenecks recorded in the time frame of interest, \bar{d} is the average duration of all n bottlenecks in hours, and \bar{l}_{\max} is the average of the maximum lengths of all n bottlenecks in miles.

Although the location of the head of a bottleneck is known, and the maximum length and duration of each recorded bottleneck with the same head are also available, no details about the upstream segments are provided. That is, the bottleneck head locations are not mapped to TMC segments. Therefore, tracing the length of each occurrence from a given head is not possible. In particular, the evolution of the bottleneck cannot be traced through time, as there is no record of the change of length with time. Therefore, incidents that impact the bottleneck along its length cannot be included in the analysis. However, from theory, if an incident upstream in the bottleneck constricts the flow, it may momentarily cause the bottleneck to show a longer length, but the reduced flow downstream of the incident might cause the bottleneck to dissolve. Therefore, a major incident will usually cannibalize the parent bottleneck and spawn another, which may not be a frequent bottleneck given that it is not caused by a shortage of capacity [100]. Additionally, an inconsistency sometimes arises with the date-time stamp when the algorithm attempts to merge different TMC segments together. The algorithm may fail to properly aggregate parts of the bottleneck, resulting in the same bottleneck occurrence getting fragmented. As an example, if a bottleneck starts at 4 PM and ends at 7 PM, there could be a fragment from 6:30 PM to 7 PM. These multiple records are deleted from the table to avoid duplicate counts. The deletion does not affect the bottleneck count, as it just removes subsets of a larger bottleneck.
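A minimal sketch of the detection logic and of the impact factor in equation (11.1) follows. The 60% threshold and the 5-minute confirmation and 10-minute clearance windows come from the description above; the data structures and function names are hypothetical, and the sketch works on a single segment's 1-minute speed series rather than reproducing the full CATT implementation.

    # Sketch of bottleneck confirmation/clearance and the impact factor (11.1).
    # Assumes a 1-minute speed series for one TMC segment; names are hypothetical.
    def detect_bottlenecks(speeds_mph, free_flow_mph,
                           confirm_min=5, clear_min=10, threshold=0.60):
        """Return a list of (start_index, duration_minutes) for confirmed bottlenecks."""
        limit = threshold * free_flow_mph
        events, below, above, start = [], 0, 0, None
        for i, v in enumerate(speeds_mph):
            if v < limit:
                below, above = below + 1, 0
                if start is None and below >= confirm_min:
                    start = i - confirm_min + 1          # bottleneck confirmed
            else:
                above, below = above + 1, 0
                if start is not None and above >= clear_min:
                    events.append((start, i - clear_min + 1 - start))
                    start = None                          # bottleneck cleared
        if start is not None:                             # still open at end of window
            events.append((start, len(speeds_mph) - start))
        return events

    def impact_factor(durations_hours, max_lengths_miles):
        """BIF = n * mean duration (hours) * mean of maximum lengths (miles)."""
        n = len(durations_hours)
        if n == 0:
            return 0.0
        return n * (sum(durations_hours) / n) * (sum(max_lengths_miles) / n)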
11.2 Geospatial Datasets

There are two main datasets that have a geospatial component. Additionally, the TMC linear referencing system is described in this section.

11.2.1 Highway Performance Monitoring System (HPMS)

The HPMS is a federal dataset that aggregates road geometries submitted by state and regional transportation agencies. The dataset covers all major roads under federal, state or local jurisdiction. Major decisions about the allotment of funds for roadway improvement, maintenance or expansion are based on analysis of HPMS data. The HPMS public release of geospatial data in shapefile format is an open, free dataset obtained from the FHWA Office of Highway Policy Information. The details of this dataset are available online [82]. The data associated with each link in the HPMS describe the physical properties of the roadway, like functional road class, number of lanes, width of lanes and shoulders, thickness of pavement, and other such properties. Each link also contains information about the road number or name, where available, and the Average Annual Daily Traffic (AADT). The roadway segments are usually arbitrarily split, except where the roadway crosses state or county lines, or where AADT or other physical properties of the roadway change significantly. Also, a vast number of segments are exactly 0.100 mile (about 160 m) long. The total length of the HPMS network over the whole country is about 2×10^9 m (1.200 million miles).

11.2.2 TomTom Multinet

TomTom Multinet has two data subsets: a geospatial map layer with road segment geometries and metadata, and a table that provides typical weekly speed profiles on all segments. This is a commercial dataset and is available only by purchase. The geospatial data exist in shapefiles and contain the free-flow speed along with other information, such as road number and name (where available), Functional Road Class (FRC), length, number of lanes, whether part of a toll road, etc., for each segment. The Multinet segments are divided arbitrarily using proprietary algorithms such that the resolution of the speed data is preserved. The segments are uniquely identified and referenced to the map by the OpenLR open standard proposed by TomTom. The Multinet road network is several times longer than the HPMS network, with a total length of about 1.06×10^10 m (6.600 million miles) over all of the U.S. Multinet covers every road, including the ones within parking lots, and both directions of roadways.

The typical speed profiles for the week are given as a percentage of free-flow speed in the geospatial datasets. These percentages are provided at 5-minute epochs in a week, making 288 such epochs in a day. The typical speed during a 5-minute epoch is obtained as the product of the percentage at that epoch and the free-flow speed, for a given day of the week. There are three different free-flow speeds in the geospatial dataset for each segment. The first is the speed typical for the whole week, the second is the typical speed during weekdays, and the last is the free-flow speed during weekends. There were minor differences between the three speeds, mainly because these speeds are computed from observed global positioning system probe readings [61]. It was ensured that the typical profile percentages were multiplied by free-flow speeds from the correct day, i.e., the weekday or weekend speed where available, and the combined speed where not available.
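A small sketch of the typical-speed computation just described follows; the field names are hypothetical and do not reflect TomTom's actual schema, and the profile value is a made-up example.

    # Illustrative computation of typical speeds from TomTom-style profiles:
    # speed at a 5-minute epoch = profile percentage x the appropriate
    # free-flow speed (weekday, weekend, or combined). Field names are made up.
    def typical_speed(profile_pct, day_of_week, freeflow):
        """profile_pct: fraction of free-flow speed for one 5-minute epoch (0-1).
        freeflow: dict with optional 'weekday', 'weekend', and 'combined' speeds."""
        if day_of_week in ("Sat", "Sun") and "weekend" in freeflow:
            base = freeflow["weekend"]
        elif day_of_week not in ("Sat", "Sun") and "weekday" in freeflow:
            base = freeflow["weekday"]
        else:
            base = freeflow["combined"]    # fall back to the whole-week speed
        return profile_pct * base

    # Hypothetical usage: 07:30 Monday epoch at 82% of a 62 mph weekday free-flow.
    print(typical_speed(0.82, "Mon", {"weekday": 62.0, "weekend": 65.0, "combined": 63.0}))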
Alaska did not have a profile table; consequently, only a spatial join was performed for Alaska, without the disaggregation.

11.2.3 Traffic Message Channel (TMC) Codes for Linear Referencing (LR)

The roots of the TMC standard emerge from ALERT-C, a protocol for conveying (encoding and broadcasting) information on radio sidebands. Currently, TMC codes are governed by the Traveler Information Services Association (TISA). As a regulatory authority, TISA is responsible for maintaining the standards book for TMC codes, and other segmentation schemes such as ALERT-C and TPEG. Further, TISA keeps a lookout for segment encoding standards created by the industry, and standardizes them if necessary. The standards maintained by TISA only serve as a guideline to create a TMC table. The specifications are laid out in International Standards Organization (ISO) standard documents: ISO 14819-1, ISO 14819-2, ISO 14819-3, and ISO 14819-6. These standards merely supply the procedure to segment the roadway, and the method to assign codes to the segments. The segments run from one break in access, ingress or egress, to another. Therefore, there are two types of TMC segments: internal segments, spanning the length of an intersection (between an egress break and the next ingress break), and external segments, spanning between two intersections (from the ingress break of the previous intersection to the egress break of the next intersection).

The North American TMC Code Alliance (NATMCCA), or the North American Location Code Alliance (NALCA), is responsible for maintaining a TMC table for the U.S. and Canada. The table contains a list of TMC codes, along with fields which bind the codes to actual locations on a map. The binding fields contain the start and end coordinate points and the road geometry of the TMC code, and other pertinent information which can uniquely identify the location of the segment. These tables are created in accordance with the TISA standards mentioned above. Since the standards are at best a guideline, TMC segments from different data vendors differ in their encoding. Although a segment might have the same code, the length of road network represented might be different. For example, unlike the TMC representation used by TomTom, the TMC segmentation used by HERE Inc. does not have the internal segments; they are merged into the downstream external segment. That is, the TMC segments reported by HERE Inc. run continuously from one egress along a roadway to the next. These differences cause compatibility issues when comparing data reported on TMC segments.

11.3 Other Datasets

There are two more datasets that have been used for the performance measurement methods documented in this dissertation: the Bluetooth re-identification data, and the incidents dataset obtained from the Maryland Coordinated Highways Action Response Team (CHART).

11.3.1 Bluetooth Traffic Monitoring (BTM)

The Bluetooth re-identification data (Bluetooth data) were collected using Bluetooth sensors placed at the sides of the roads [107]. The sensors are powered by a rechargeable battery and last up to two weeks on a single charge. They are deliberately placed at the ends of TMC segments so that the Bluetooth and GPS probe data can be directly compared. The Bluetooth sensors read and store the MAC addresses of passing visible Bluetooth devices. The time of detection is also stored with the MAC address. A Bluetooth sensor makes a sweep every few seconds; therefore, a single device can be recorded multiple times as long as the device is within Bluetooth range.
Re-identification is the process of identifying the same MAC address at both the upstream and the downstream sensor. Once the same device is detected at both, the difference in recording times gives the travel time between the upstream and downstream sensors. Since a sensor can record a device multiple times, either the first or the last observation is selected to determine the travel time. The data used in this study were compiled by computing the difference between the last observations of the same address at the upstream and downstream sensors. These data are also plagued with noise, outliers, and non-targets, like pedestrians and bicyclists; a filter based on the Inter Quartile Range (IQR) is therefore applied to the raw re-identification data [6].

The Bluetooth data so collected are representative of the actual traffic conditions, and are equivalent to data collected by floating car, license plate matching, and similar methods. Therefore, the Bluetooth data are used as the benchmark, and the VPP and NPMRDS datasets are compared against them to gauge accuracy and fidelity. Comparison with the Bluetooth data reveals that the proposed method of computing performance measures from the NPMRDS is not biased.

11.3.2 Incident Dataset

Incidents are recorded by the Maryland Coordinated Highways Action Response Team (MD-CHART). CHART officials are usually the first responders at the scene of an incident owing to their large patrolling network. Local fire, police, and other first responders also communicate directly with CHART to coordinate responses. The main objective of CHART is to improve operations on the freeways of Maryland. CHART also administers the 511 driver information system in Maryland, which provides traveler information via phone, radio, and the Internet. CHART is thus one of the first organizations to be notified of any incident or event on the highway system.

The accuracy of incident locations is fundamentally limited by the prevailing practice of recording incidents in the CHART system. Operators enter the incident location by noting the nearest intersection in terms of crossroads; the software then automatically assigns the coordinates of the center of that intersection to the incident. Note that the intersection may not physically connect the roads, especially at interstate overpasses or underpasses. As such, the geodesic coordinates stored with the incident record only provide location accuracy to the nearest crossroad. This lack of precision, however, does not impact the analysis, as all incident data are attributed to the nearest TMC. Hence, as long as the incident is geocoded well enough to identify the TMC on which it occurred, its geographic placement within the TMC is inconsequential.

Some incidents have incorrect geographic coordinates and appear outside the borders of the state of Maryland when mapped. Other incidents have no reported location data, as they are typically warnings related to weather, system alerts, etc., processed through CHART. Such incidents are excluded from the analysis. Another inconsistency lies in the start and end date-time stamps, which are sometimes recorded incorrectly; erroneous date-time stamps contribute some error to the final analysis results. Obviously erroneous incidents, like accidents lasting just minutes, have been removed from the database.

11.4 Preprocessing Data

The different methodologies required different preprocessing steps before the actual analysis documented in chapter 12.
The preprocessing performed on the datasets for each study is documented in this section.

11.4.1 Preprocessing VPP Data for Forecasting

Only VPP data are used in the forecasting study. Data from the weekdays of three weeks in September 2016 are examined. The assumption underlying the synthetic method is that the model coefficients do not change from one week to the next: models fitted to data from the week of September 12 are used to predict speeds for the week of September 19, and models fitted to the week of September 19 are used to predict speeds for the week of September 26. A total of 1 440 observations (rows) comprise the data for one day from one TMC segment. The total dataset contains data from 5 weekdays over 3 weeks for 2 654 segments. Therefore, more than 5.7 × 10⁷ records are used in total (21 600 records per segment, or over 1.9 × 10⁷ observations per week). Predictions for every minute up to 30 minutes into the future are produced for each segment; the predictions dataset thus has 30 times the data used for fitting.

The author is unaware of an online algorithm or model that would efficiently scale to data of this size. All models proposed and studied in the existing literature would require a huge amount of processing power to be fitted and then to make forecasts on such a large dataset. The synthetic method, however, takes under 3 hours on an 8-core computer to fit the 27 ARIMA(p, d, q) models, with p, d, q ∈ {0, 1, 2}, to all these data and make predictions.

The network studied is shown in figure 11.2. It comprises about 2 000 lane-miles in Maryland, ranging across almost all classes of roadway (from collectors to freeways/interstates). A total of 2 654 individual, directional segments completely describe this network. The most reasonable model is selected for each segment individually from the 27 possible orders of the Autoregressive Integrated Moving Average (ARIMA) model. The ARIMA model does not include any auxiliary covariates, and the time spacing of the data is 1 minute. Once a model is selected in the synthetic framework for a segment, predictions are made using the incoming stream of data for that segment. Predictions up to 30 minutes into the future are made from each reported data point, received once per minute. This procedure is repeated for each day in the three weeks studied here.

Figure 11.2: Maryland Important Roads Network

Owing to the advantages of the synthetic time series framework, the method is also very scalable. The 27 ARIMA models are fitted to each of the 2 654 roadway segments comprising the network, for a total of 71 658 models per day. Since data are reported at 1-minute intervals, amounting to 1 440 records per segment per day (over 3.8 × 10⁶ data points per day for all segments), this is truly an exercise in big data; the two weeks in September to which ARIMA models are fitted total over 3.8 × 10⁷ records. Despite the size of the data, the synthetic method allows ample time to fit the 27 models to each segment, on data from each day for all segments, and to select the most reasonable one. The size of the network handled is also a novel contribution of the documented method. This is an unachievable task for models fitted in real-time, as data points are received every minute; on a network this large, it would be unthinkable for models to finish computing before the next set of data arrives, especially on conventional hardware.
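Because every segment is modeled independently, the offline pass of the synthetic method parallelizes naturally across processor cores. The sketch below shows one way such a pass could be organized; the function `fit_best_order_for_segment` is a placeholder standing in for the per-segment selection among the 27 ARIMA orders (described in section 12.1), not the code actually used in this work.

```python
from concurrent.futures import ProcessPoolExecutor

def fit_best_order_for_segment(segment_id: int) -> tuple:
    """Placeholder: fit all 27 candidate ARIMA(p, d, q) orders, p, d, q in {0, 1, 2},
    to this segment's previous-week history and return the order with the lowest BIC
    (the selection itself is described in section 12.1)."""
    return (1, 1, 1)  # stub value standing in for the selected order

segment_ids = list(range(2654))  # one task per directional segment (illustrative)

if __name__ == "__main__":
    # Each segment is an independent task, so an 8-core machine can work on
    # 8 segments at a time; the fitted models are stored for use the following week.
    with ProcessPoolExecutor(max_workers=8) as pool:
        best_orders = dict(zip(segment_ids, pool.map(fit_best_order_for_segment, segment_ids)))
```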
Consequently, predictions on networks this large are neither reported in the literature [7, 50, 65, 83, 110, 118, 119, 121, 122, 137], nor conducted even by giant technology companies with very large compute infrastructures. Additionally, the synthetic method produces forecasts with reasonable prediction errors, primarily because each model is trained on a complete set of historical data containing the expected traffic patterns.

Due to communication artifacts and other delays, VPP data records are not timestamped at the exact minute, but can be received at any second within a minute. Sometimes no record is received for a minute, while the preceding or succeeding minute has two observations. More rarely, no data are received for an extended period of time. For the purposes of this study, the data were required to be timestamped exactly at the minute, which is also a requirement of the ARIMA models. To achieve this, speeds were linearly interpolated at the exact minute using adjacent observations.

11.4.2 Preprocessing Datasets Used for Performance Measurement

The NPMRDS, VPP, and Bluetooth data were collected for two weeks (10 weekdays) for two case study segments. The data for each case study were collected during different two-week periods. The case study segments are from two arterials with different characteristics. One arterial, NJ11-0009, has less friction to traffic flow, fewer signals and intersections, and higher speed limits than the other segment, VA08-0012. This causes traffic on NJ11-0009 to behave as if it were on a freeway, and traffic on VA08-0012 as if it were on a congested downtown arterial. Table 11.1 presents a summary of the characteristics of both segments, while figure 11.3 shows their map locations.

Table 11.1: Summary of the Segments Chosen for Study

                                             Path ID
  Characteristic            Dataset     NJ11-0009   VA08-0012
  Road Name                 All         U.S.-1      U.S.-29
  Direction                 All         SB          WB
  Number of Lanes           All         4           4
  Number of Signals         All         2           4
  Segment Length (miles)    Bluetooth   3.01        0.78
                            VPP         3.01        0.78
                            NPMRDS      3.44        1.28
  Number of TMC Segments    VPP         1           3
                            NPMRDS      1           2

Figure 11.3: Geographic locations of the performance measurement case study segments: (a) segment NJ11-0009; (b) segment VA08-0012.

11.4.3 Preprocessing Geospatial Datasets before Conflating

All roadways with measured non-zero AADT values from all 50 U.S. states were conflated in this study. Figures 11.4a and 11.4b show the network from both the HPMS and TomTom Multinet datasets before the conflation process. Figure 11.4a is a bird's-eye view of Denver, Colorado, while figure 11.4b is a zoomed-in section around an interchange. In figure 11.4b, individual segments are demarcated by black dots. The intersection a little southwest of the center of the maps in figure 11.4b is that of I-70 and Denver West Colorado Mills Parkway. I-70 is represented by bidirectional segments in the HPMS map, while Multinet shows each direction of travel on I-70 using separate one-way segments. Southwest of the intersection is a large shopping complex; the dense network of Multinet segments traces the rows of parking lots. The figures give a sense of the level of detail captured in the Multinet dataset.

Figure 11.4b also illustrates the major difficulty in the conflation process. North of the intersection is Denver West Parkway, a road parallel to the interstate.
The conflation should ensure that the Denver West Parkway segments are joined to the corresponding segments from the HPMS and not to the interstate segments, and vice versa. The developed algorithm, which computes parallel and perpendicular distances between the segments and scores them based on closeness, overcomes this difficulty to an extent. However, some preprocessing was required before applying the algorithm, as explained below.

Common Fields

Some fields are common to the two datasets, like road numbers and names. However, the information and its representation format vary widely. For example, the Multinet data might include all shield numbers for a segment on a road carrying multiple route numbers, whereas the HPMS might include just one, or even none, partly because the segments in the latter are often determined by a change in the AADT values, and roads with other shield numbers may merge or diverge within a segment. The shield numbers may themselves have different notations, with hyphens, spaces, slashes, or other separators between the class identifier and the actual road number (e.g., I-95, I95, I 95, I/95, etc.). The most difficult problem, however, was the lack of road names and shield numbers altogether for many of the segments.

The datasets also encode whether a segment is part of a toll road. This field was useful for an accurate join; however, the road miles that are tolled are insignificantly small compared to the total length of roads. Another field records the number of lanes on the highway, but the count for the HPMS includes both directions of travel while Multinet counts the lanes on one carriageway only. On roads with an equal number of lanes in both directions this is potentially not a problem; however, on one-way streets or roads with an uneven number of lanes, there is no clean way to match the lane counts from the two datasets.

The last common field is the FRC. However, the two datasets use completely independent classification systems: the HPMS uses FRC values from 1 to 7, while Multinet represents FRC with values from 0 to 11. Often, even the same road may have different FRC values. For example, all grade-separated highways are FRC 1 in the HPMS, whereas in Multinet they may take values from 0 to 2, depending on whether the highway is an interstate, a national U.S. road, or a state highway, and on the length for which it is grade separated. See table 12.1 for a summary of the matching fields.

Figure 11.4: Comparison of networks before conflation: (a) bird's-eye view of Denver, Colorado, as rendered by geometries in the HPMS and Multinet datasets; (b) bird's-eye view of a small area in Denver, Colorado, zoomed in to show geometry details.

Data Preprocessing

The preprocessing applied to the HPMS splits long segments into shorter components to harmonize the geometric distance computations. On the other hand, FRC classes 7 and above (lower in significance than minor collectors) were removed from the Multinet dataset, in addition to filtering out ferries and other segments that are not part of roads. The following subsections detail the HPMS preprocessing, the creation of a geospatial index, and the computation of the absolute angles of the segments with respect to the equator. The angles are used to compute the parallel and perpendicular distances between segments from the two datasets.

Splitting HPMS Segments: Not only did the lengths of the HPMS segments vary widely, but the average HPMS segment length was also very different from the mean Multinet segment length.
Since the conflation algorithm mainly depends on calculating the spatial separation between segments, having comparable segment lengths is of paramount importance. As the difference in length between two adjacent segments increases, the distance between their centroids can take a larger range of values, reducing confidence in eliminating bad join candidates. For example, consider a hypothetical HPMS segment about 1 km (0.6 miles) long. Adjacent to it are comparatively tiny Multinet segments, each 100 m (328 ft) long. The distance between the centroids of the HPMS segment and a Multinet segment can vary from 0 m (0 ft) near the center to 550 m (1 804 ft) near the ends, assuming minimal overlap. This is a large range for a typical road network, as the road can curve away, intersect with other roads, etc., within that distance. The number of join candidates therefore grows to the point where selection becomes difficult, even impossible. Consequently, the best strategy was to split the HPMS segments into smaller pieces.

All HPMS segments longer than 200 m (656 ft) were selected and broken down into segments 160 m (528 ft) long, which is the standard U.S. chain length and forms the basis of the HPMS map geometries; HPMS segment lengths consequently tend to cluster at multiples of this length. Some states merged these chain-length segments together, splitting them only where the AADT values changed. Splitting at 160 m also reduced the occurrence of very short left-over segments, which reduces the computational burden. The 200 m splitting cut-off was derived from a histogram of the Multinet segment lengths, which peaked at 200 m.

Longer line segments are often stored as multilines in map geometries, which are simply collections of shorter, regular line segments. Each shorter line encodes a start and an end, between which the segment is usually rendered as a straight line; this allows a line to follow a curve in a piecewise approximation. To preserve the geometry, the splitting algorithm first separated all multilines into their constituent segments. Any constituent segment longer than 200 m was then subdivided into 160 m segments by linearly interpolating between the start and end coordinates. Vincenty's formula for computing distances on geodesics was used for all length computations throughout this chapter [120]; Vincenty's method works particularly well when points are very close together.

The last two digits of the HPMS identification code are the state Federal Information Processing Standards (FIPS) code, and the rest of the code can repeat in different states. Therefore, to create a unique identification code after splitting, a three-digit number (with zero-padding as required) was inserted before the FIPS code. The three digits were a linear count of the subsections created from a given HPMS segment; some segments were split into more than 200 sections. Hereafter, HPMS refers to the new database formed by the split HPMS segments.

Creating a Geospatial Index: A geospatial index serves to limit the search space for matching segments. For example, to join segments on the west side of Kansas, segments on the east side need not be searched. The index was created by applying a buffer to each HPMS segment and storing the buffer extremities in an R-tree [44]. R-trees work by bounding smaller boxes within larger boxes, similar to real-world addresses (see figure 11.5).
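A minimal sketch of building and querying such an index with the rtree Python package is shown below. One bounding box is inserted per buffered HPMS segment, and a point query returns candidate segment identifiers; the identifiers, bounding boxes, and buffering itself are simplified assumptions, not the exact implementation.

```python
from rtree import index

def build_index(hpms_boxes):
    """hpms_boxes: iterable of (segment_id, (minx, miny, maxx, maxy)) tuples,
    where segment_id is an integer and each box is the extremity of the
    buffered HPMS segment."""
    idx = index.Index()
    for seg_id, box in hpms_boxes:
        idx.insert(seg_id, box)
    return idx

def candidate_segments(idx, lon, lat):
    """Return the ids of all HPMS buffers whose bounding box encloses the point.

    A point query is expressed as a degenerate (zero-area) box query in an R-tree."""
    return list(idx.intersection((lon, lat, lon, lat)))
```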
During a lookup, the R-tree is traversed until all the buffers that enclose a provided point are identified. The buffer radius around HPMS segments was fixed at 500 m (1 640 ft), which in hindsight is a very large distance; a buffer of 100 m (328 ft) would have been sufficient and would have significantly reduced the runtime of the complete algorithm.

Figure 11.5: Schematic representation of R-tree indexing [108]

The goal of the conflation was to create a one-to-one join between each Multinet segment and the best candidate HPMS segment. Therefore, the algorithm exhaustively combs each valid Multinet segment to find the ideal join partner. However, the join was not forced; without an ideal HPMS segment, the Multinet segment was dropped from the join. Since the lookup runs from Multinet to HPMS, the geospatial index was created only for the HPMS dataset. Candidate HPMS segments were selected by querying the index, and join strength was determined only for the selected candidates.

Computing Multinet Typical Speeds: The profile speeds were used for the temporal disaggregation of the AADT values transferred through the spatial join. Speeds were attributed to each Multinet segment through a simple relational database join between the geospatial dataset and the typical week profile tables. The latter tables repeat each Multinet segment ID seven times, once for each day of the week. Therefore, the join was pivoted so that the appropriate free-flow speed column was joined to the correct day for a given Multinet segment: weekday free-flow speeds to weekday profile rows, and weekend free-flow speeds to weekend profile rows. If the speeds were unavailable for a weekday or weekend day, the generic weekly free-flow speed was used.

11.4.4 Merging Incidents and Congestion

Both the incidents and the congestion data were subjected to filtering by time and space, and to descriptive analysis, as documented below. Only data from 2011, collected on the most important roads in Maryland, are used.

Preprocessing Bottleneck Data

Data collected during 2011 were used in the study merging congestion and incidents. Congestion is measured by (11.1). Table 11.2 provides a summary of all bottlenecks recorded in Maryland during 2011, while figure 11.6 shows the locations of the heads of all bottlenecks over the same space and time. As evident in the figure, the majority of bottlenecks occur along the interstates and important routes spanning the length of Maryland. This is reasonable because the interstates carry most of the traffic and are bound to be affected first by increasing demand.

Table 11.2: Summary of Bottlenecks

  Quarter   Unique Occurrences   Total Occurrences
  1         729                  26 728
  2         752                  46 094
  3         754                  49 515
  4         740                  48 921
  Total     879                  171 258

Figure 11.6: All bottlenecks in Maryland recorded in 2011

Preprocessing Incident Data

Table 11.3 provides a summary of the incidents recorded by CHART in Maryland during 2011, and figure 11.7 plots all of these incidents on a map. CHART groups incidents into 17 classes, some with up to 4 sub-classes; together these form a complete picture of the incidents and alerts that affect the roadways of Maryland. Table 11.4 provides a full count of all incidents, by type, that affected the road network of Maryland. The counts are divided by quarter, and are further separated to show the counts of impacting and total incidents.

Table 11.3: Summary of Incidents

  Quarter   Number of Incidents
  1         20 085
  2         23 356
  3         27 510
  4         24 073
  Total     95 024

Figure 11.7: Map of all incidents recorded by CHART in Maryland during 2011
Impacting incidents are those that lie within one mile of the head of any bottleneck in Maryland.

Table 11.4: Highlighted Incidents Summary

                                     Quarter 1           Quarter 2           Quarter 3           Quarter 4
  Standardized Type              Impacting   Total   Impacting   Total   Impacting   Total   Impacting   Total
  Alert                              924     2 715       507     1 475       506     1 717       457     1 373
  Animal Struck                       49       138       156       373        87       235       240       750
  Collision                        1 856     2 445     1 961     2 436     2 000     2 507     2 267     2 892
  Congestion                         141       259       371       415       393       504       271       317
  Disabled Vehicle                 4 449     5 478    10 346    11 769    10 303    12 788     9 179    10 620
  Disturbances                         9        28         8        34        99       374        12        55
  Emergency Roadwork                  63        78        89       127       101       136        95       124
  Flood                                1         1         5        29         1         1         0         0
  Incidents                        4 148     6 000       384       576       366       997       405       738
  Obstructions                       693     1 062     1 280     1 721     1 215     1 968       993     1 305
  Road Construction                   42        61         0         0         0         0         0         0
  Road Maintenance Operations        671       843     2 093     2 574     2 356     3 893     2 604     4 593
  Special Event                       15        24        99       173        44       134        48        75
  Traffic Signal Not Working         269       803       441     1 518       667     2 118       361     1 097
  Vehicle on Fire                     76        98       107       127       100       129        84       102
  Weather Condition                   15        52         5         8         3         6        11        30
  Work on Underground Services         0         0         0         1         0         3         1         2
  Total                           13 421    20 085    17 852    23 356    18 241    27 510    17 028    24 073

Chapter 12: Methodology and Algorithms for Performance Measurement

The methodology used for each study in performance measurement is quite different from the others. Each section in this chapter is dedicated to explaining the method used to achieve the goal of the corresponding study.

12.1 Synthetic Time Series Method

To explain the new synthetic modeling approach for real-time prediction, consider a given road segment and a given day of the week (e.g., Wednesday). The predictive models are likely to differ from one road segment to another, or from one day of the week to another, but the general concept and methodology remain the same. The basic idea of the proposed synthetic method is to use historical data for both the complex model building and the model fitting for each road segment separately, and then to simply apply the fitted model for forecasting speed in real-time. Thus, the forecasting is instantaneous. The underlying assumption of this framework is that traffic patterns do not vary for the same day across consecutive weeks, although the repetition interval may be altered as convenient: patterns may be assumed to repeat monthly, fortnightly, or even annually. Further, the underlying model may be constructed to use auxiliary information such as weather forecasts, the history of traffic incidents, or average traffic speeds from other time periods. The unwavering part of the assumption is that conditions do not change significantly from the historic observation set to the current observation set. In this study, the repetition interval is taken as one week, and data from the whole day (1 440 records) form the observation set.

The entire day can be viewed as a collection of T time points; for example, consistent with the data used here, T = 1 440 if the speed data come at one-minute intervals. Let y_{t,w} denote the speed at time t in week w. The goal is to predict the speed at a future time point, y_{t+s,w}, based on the observed data in the previous week w − 1 and the observed data at or before time point t in the current week w. Consider the following general time series model:

    φ_w(B) (1 − B)^d y_{t,w} = x'_{t,w} β_w + θ_w(B) ε_{t,w},    (12.1)

where φ_w(B) = 1 − φ_{1w} B − ⋯ − φ_{pw} B^p, θ_w(B) = 1 − θ_{1w} B − ⋯ − θ_{qw} B^q, B is the backshift operator (B y_{t,w} = y_{t−1,w}), the ε_{t,w} are white noise terms following normal distributions with zero mean and constant variance σ², and
x_{t,w} is a k × 1 vector of known auxiliary variables, β_w is a k × 1 vector of unknown fixed coefficients, and φ_{1w}, …, φ_{pw} and θ_{1w}, …, θ_{qw} are unknown model parameters. Note that the above model is the standard Autoregressive Integrated Moving Average (ARIMA) model, denoted ARIMA(p, d, q; λ) with λ = (φ, θ, σ²), with the regression term x'_{t,w} β_w added to borrow strength from an external set of relevant auxiliary variables, such as weather or accident data available at the current time, or historical traffic data from previous weeks. This class of models is flexible enough to describe the distribution of speed for a variety of road segments.

In a direct forecasting approach, predicting y_{t+s,w} uses only the time series data up to and including time point t in the current week w. That is, for each forecast at time t, a specific model is first selected from the above class, i.e., the orders p and q and the difference d are chosen, and then the parameters λ of the selected model are estimated. This is a cumbersome and time-consuming process and is clearly not feasible when it has to be repeated for thousands of road segments in real-time as one moves along the time scale.

A radically different idea is now proposed: complete all the complex model building and estimation well ahead of time, say a week in advance, and then apply the chosen model with its estimated parameters for forecasting. The following synthetic modeling assumptions are made:

    φ_w(B) = φ_{w−1}(B),
    θ_w(B) = θ_{w−1}(B),                    (12.2)
    β_w = β_{w−1}.

That is, it is assumed that the model and the model parameters do not change from one week to the next. Any model may be used here, parametric or non-parametric; the base assumption is that the model specification does not change from one week to another, or from one season to another, wherever trends are expected to repeat. This assumption allows models to be selected and fitted on data from the previous week, while the fitted model is then used to forecast speeds for the current week. The emphasis is that real-time forecasting of speed for thousands of road segments poses no problem, because model selection and fitting for all road segments are done one week ahead of time. Of course, one may question the validity of the synthetic modeling assumption; therefore, the assumption is tested using real data.

Herein, the best order from the ARIMA family is selected. The ARIMA family is denoted ARIMA(p, d, q), where p and q are the autoregressive and moving average orders of the model and d is the order of differencing. In this analysis, p, d, and q are each allowed to take any value from {0, 1, 2}, for a total of 27 combinations. The best ARIMA model is then specified by the combination of p, d, and q that results in the lowest Bayesian Information Criterion (BIC). All 27 models are fit to each of the 2 654 segments separately for each day examined. The framework is demonstrated with weekday data from two weeks in September 2016; therefore, the models are fit on data from 10 different days. In all, 716 580 models are fit to the data, out of which the best model for each segment for each day is selected, leaving 26 540 models. Predictions are made for the same day of the following week as the selected model. Therefore, a total of 10 days spanning two weeks are predicted, amounting to over 3.8 × 10⁷ predicted records.
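The synthetic assumption in (12.2), namely that the model selected and estimated on last week's data is reused unchanged on the current week's stream, can be sketched with the statsmodels ARIMA implementation as below. The grid search and the reuse of parameters via `apply` reflect the idea only; the production pipeline, data handling, and any exogenous regressors are not shown, and the array names are illustrative.

```python
import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def select_and_fit(last_week_speeds: np.ndarray):
    """Fit all 27 ARIMA(p, d, q) orders, p, d, q in {0, 1, 2}, to last week's
    1-minute speeds for one segment and keep the fit with the lowest BIC."""
    best = None
    for order in itertools.product(range(3), repeat=3):
        try:
            fit = ARIMA(last_week_speeds, order=order).fit()
        except Exception:
            continue  # some orders may fail to converge on a given series
        if best is None or fit.bic < best.bic:
            best = fit
    return best

def forecast_current_week(best_fit, observed_so_far: np.ndarray, steps: int = 30):
    """Reuse last week's parameters on the current week's observations (the
    synthetic assumption) and forecast up to `steps` minutes ahead."""
    reused = best_fit.apply(observed_so_far)  # same model, same parameters, new data
    return reused.forecast(steps=steps)
```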
12.2 Method to Compute Performance Measures from NPMRDS

The existing technique to measure performance on roadway segments was developed by the Texas Transportation Institute (T.T.I.). The T.T.I. method has proved quite successful when working with cleaned and filtered data, such as the VPP data, and forms the basis of the results presented in the Maryland Mobility Report [76]. However, the method computes averages of speeds or travel times before computing the performance measures. Since the NPMRDS contains considerable outliers and noise, these averages produce biased and incorrect results. The proposed method overcomes the biases due to outliers and noise by overlaying data from multiple similar periods, which increases the density of the data and dilutes the outliers and noise. The densification caused by overlaying increases certainty about the central tendencies of the data. While averages remain susceptible to outliers, rank-based measures such as percentiles within the Inter Quartile Range (IQR) are immune to their effects.

The Travel Time Index (TTI) and the Planning Time Index (PTI) are two of the most common performance measures. The equations for computing the TTI and PTI by the T.T.I. method are given in (12.3) and (12.4), respectively, and require averages of travel times to be calculated. Note that the reciprocals of (12.3) and (12.4) should be used when working with speeds instead of travel times, as the two are inversely proportional; since the NPMRDS reports travel times, the speed-based forms are not presented here.

    TTI = T̄ / T_ff,    (12.3)

    PTI = T_w / T_ff,    (12.4)

where T̄ is the average travel time, T_ff is the free-flow travel time, and T_w is the worst travel time. Note that these quantities are quite difficult to obtain, and have historically required many assumptions and untested roadside surveys of travelers. The proposed method not only computes the TTI and PTI reliably, but also provides robust estimates of the free-flow, average, and worst travel times.

12.2.1 Percentile Method for Performance Measurement

The proposed method involves overlaying data from multiple similar observation windows. The lack of imputation during periods with no observations (during lean traffic) leaves considerable gaps in the data. These gaps break continuity-based algorithms: for example, an algorithm that assesses whether the next observation is an outlier based on the trend of recent observations cannot make a judgment after a long gap. Similarly, algorithms that trace vehicle trajectories run into unbridgeable gaps in the data. The solution requires increasing the data density, which serves both functions of aiding outlier identification and providing intermediate data points. The simplest, most direct, and most data-driven approach to increasing density is to overlay similar periods of data. This is illustrated in figure 12.1: figure 12.1a shows NPMRDS data from an interstate segment for a single weekday, while figure 12.1b shows data from 10 weekdays overlaid. Trends invisible in figure 12.1a are immediately clear in figure 12.1b. The morning peak increase in travel times causes a noticeable bump in the data, the outliers are visible at the top of the plot, and the noise prevalent in the afternoon and evening hours is dotted above the actual trend of the data. To compute performance measures from these data, a simple technique involving percentiles is used.
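A minimal sketch of such a percentile computation is given below: travel-time observations from several similar days are pooled by hour of day, and a few quantiles per hour summarize the overlaid data; the specific quantile-based indices used as performance measures are defined in (12.5) and (12.6) in the next subsection. The data layout and column names are assumptions made for illustration.

```python
import pandas as pd

def hourly_quantiles(overlaid: pd.DataFrame) -> pd.DataFrame:
    """Hourly cfd summary of overlaid travel times.

    overlaid: one row per probe report from any of the overlaid days, with
              hypothetical columns 'hour' (0-23) and 'travel_time' (seconds).
    Returns the 5th, 50th, and 95th percentile travel time for each hour,
    plus the percentile-based indices formed from them.
    """
    q = overlaid.groupby("hour")["travel_time"].quantile([0.05, 0.50, 0.95]).unstack()
    q.columns = ["t5", "t50", "t95"]
    # Free flow ~ 5th percentile, median ~ 50th, congested ~ 95th.
    q["tti_p"] = q["t50"] / q["t5"]
    q["pti_p"] = q["t95"] / q["t5"]
    return q
```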
Traffic data have the singular advantage that outliers are easy to filter using percentiles, by discarding data in the extreme ranges. Percentiles of traffic data are also a good estimate of the trend of traffic, and can clearly identify the typical and extreme conditions. Especially after overlaying, the trends become more pronounced relative to the noise and easier to identify. Quantile plots, also called cumulative frequency distribution (cfd) plots, are shown in figure 12.2, where each line represents data from one hour. Every 5th quantile is computed, starting at 5% and going up to 95%. Even finer quantiles are possible, but this would defeat the goal of noise smoothing. Using the cfd plot, hourly or daily performance measures can be computed easily. Compared with the traditional approach, the cfd offers some advantages and one drawback. The advantages are that the free-flow and congested travel times can be computed easily, as the 5th and the 95th quantiles respectively. The drawback is that the median of the quantiles differs slightly from the average of the data; however, since the NPMRDS is not filtered, the average would lead to a more erroneous result than the median.

Figure 12.1: Scatter plots of travel times from NPMRDS illustrating the overlaying method: (a) a single day; (b) 10 days overlaid.

Figure 12.2: Cumulative frequency distribution plots produced from the overlaid data in figure 12.1.

In terms of percentiles, equations (12.3) and (12.4) become

    TTI_p = T_50 / T_5,    (12.5)

    PTI_p = T_95 / T_5,    (12.6)

where T_50 is the 50th percentile (median) travel time, T_5 is the 5th percentile (free-flow) travel time, T_95 is the 95th percentile (congested) travel time, and the subscript p denotes that the measures are computed from percentiles rather than by the T.T.I. method.

12.3 Methodology to Effect Spatial Conflation and Temporal Disaggregation

A scoring algorithm was used to select the best match for the spatial conflation. Since no common fields exist between the datasets, a quality conflation could not be achieved by relying on the metadata for each segment alone. Further, the geometries could not be used directly, due to differences in the way the road segments are encoded. Consequently, a rule-based algorithm was developed for the conflation, which computes geometric measures of similarity among the line segments in addition to matching whatever metadata is available.

Appendix A of the Mobility Scorecard [105] published by T.T.I. was used for disaggregating the AADT by time of day. The steps in the published appendix were followed exactly, including adjusting the AADT for the day of the week and computing the peak directions and the severity of congestion. The graphs shown in Exhibits A-2 through A-6 of the appendix were used to extract the percentage of daily volume for a given hour of the day, for the scenarios applicable to the respective exhibits. These percentage points were linearly interpolated to obtain the percentage of daily volume for 15-minute periods in a day, which were then multiplied by the daily volumes obtained by adjusting the directional AADT for the day of the week. Note that this disaggregation would not have been possible without the join also supplying the typical speeds by day of week for each AADT value transferred to Multinet.

12.3.1 Computing Geometric Information

The Multinet segments were devolved into five points, at 0% (start), 25%, 50% (mid), 75%, and 100% (end) of their lengths.
Since Multinet segments are unidirectional and have well-defined starts and ends, the intermediate points fall in the correct order. On the other hand, since HPMS segments are bidirectional, the starts, ends, and all other intermediate points can be interchanged. The only stationary points along HPMS segments are the centroids. Therefore, the HPMS segments were devolved only into their centroids, because using other points would have required repeating the computations with the HPMS segments assumed in reverse order, which does not apply uniformly to all segments.

The five Multinet points were used to build a set of candidate HPMS segments. The lengths and angles of the imaginary lines connecting the five points to the centroids of the candidate HPMS segments were computed. The lengths of the imaginary lines were then resolved, using the angles, into components parallel and perpendicular to the HPMS segment. The resolved lengths from all five points were averaged together for each candidate HPMS segment. Additionally, the angle of the Multinet segment with respect to each candidate HPMS segment was computed. The average distances, along with the angles between the segments, are used to attribute a score to each candidate HPMS segment.

Computing Angles: Angles between points are easily computed using the inverse tangent (arctan) formula. The inverse tangent function is provided by the Numpy package, part of the Scientific Python stack. The two-argument function (atan2) was chosen over the single-argument function (arctan) to obtain the signed angle and to avoid numeric infinities and division-by-zero errors when segments point true north, south, east, or west, i.e., when the difference between the start and end latitudes or longitudes is zero [87]. All angles were computed with reference to the candidate HPMS segments. Using the HPMS segments was motivated by their bidirectional symmetry, which allows the absolute angle to be used. Absolute angles always lie between 0 rad and π rad, which ensures that all distances are positive when resolved into parallel and perpendicular components.

Scoring Rules: Scores were given based on the computed geometric values and on the similarity of the metadata recorded in the other fields for each candidate HPMS segment, given a Multinet segment. Table 12.1 presents the criteria and the corresponding points. The threshold for selection was 320 points, which means at least two criteria must be met at full value.

Table 12.1: Scoring algorithm tests and points

  Test Metric                                  Test Criteria                                     Points Awarded

  Computed Geometric Data Matching Criteria
  Perpendicular distance between HPMS          Distance ≤ 5 m (16 ft)                            250
  and Multinet segments                        Distance ≤ 10 m (33 ft)                           100
                                               Distance ≤ 15 m (49 ft)                           10
  Parallel distance between HPMS and           Distance ≤ 20% of HPMS and Multinet lengths       100
  Multinet segments                            Distance ≤ 30% of HPMS and Multinet lengths       50
                                               Distance ≤ 50% of HPMS and Multinet lengths       10
  Angle between HPMS and Multinet              Angle ≤ 0.01 rad (0.6°)                           100
  segments                                     Angle ≤ 0.1 rad (5.7°)                            50
                                               Angle ≤ 1.0 rad (57°)                             10

  Field Metadata Matching Criteria
  Road shield number of HPMS and               Are the same                                      1 000
  Multinet segments
  Difference in FRC between HPMS and           Difference = 0 classes                            100
  Multinet segments                            Difference ≤ 1 class                              50
                                               Difference ≤ 2 classes                            10
  Number of lanes in HPMS and Multinet         Multinet lanes = half of HPMS lanes               300
  segments
  Toll road indicator in HPMS and              Flagged as toll roads                             300
  Multinet segments

  Threshold to select candidate HPMS segment: total points must be ≥ 320
Note that the points and threshold were fine-tuned by examining sample joins from a wide range of geographies, including the plains of Kansas, the mountains of Colorado, the coastal roads of Hawaii and California, the dense road networks of the Northeast, and the sparse roads of the Midwest. The candidate segments scoring above the threshold were sorted in descending order of score, and ties were broken by secondary and tertiary ascending sorts on the perpendicular and parallel distances. The highest-scoring, closest segment was thus selected as the join partner.

Another noteworthy point in table 12.1 is the relative weight given to each test metric and criterion. Perpendicular closeness is weighted very heavily, as are having a matching number of lanes and being part of a toll road. Matching road shield numbers in the metadata of both segments receives a large score that boosts the pair to the top outright. The points are cumulative; that is, if two segments have a perpendicular distance of less than 5 m, the total points awarded are 360 (250 + 100 + 10), which by itself crosses the threshold. This and the road shield number are the only criteria that allow pairing when met in full. An optimization method was not used to tune the scores or the threshold because of the size of the datasets involved: there were no good candidate areas on which to train such models, as land topology changes drastically over the country. Moreover, the compute times were already quite large, and an optimization algorithm would not have finished within the duration of the project.

12.3.2 Complete Algorithm

The implementation and execution of the following algorithm were performed solely by the author, without any division of labor.

Preprocessing

1. Split the HPMS segments into shorter segments of 160 m (528 ft).

2. Compute the angle with respect to the equator and the centroid of each HPMS segment, and store them with the HPMS geometries.

3. Construct bounding buffer boxes around the HPMS segments with a 500 m buffer (reduce to 100 m for optimal performance).

4. Build an R-tree index of the HPMS bounding boxes.

5. Filter the Multinet segments to remove segments with FRC greater than 6 or where the segment is not a roadway (for example, ferry paths).

6. Compute the typical profile speeds by day of week using the Multinet geospatial dataset and the profile speed tables. These will be used in step 9.

Conflation

7. For each Multinet segment from the filtered set:

   (a) Devolve the segment into five points at 0%, 25%, 50%, 75%, and 100% along the segment length.

   (b) For each devolved point:

       i. Compile a list of candidate HPMS segments by looking up the point in the R-tree index from step 4. Go to step 7(h) if no candidate HPMS segments are found.

       ii. Compute the angle and length of an imaginary line connecting the point and the centroid of each candidate HPMS segment identified in step 7(b)i.

       iii. Subtract the angle computed in step 7(b)ii from the absolute angle of the HPMS segment computed in step 2.
       iv. Resolve the length computed in step 7(b)ii into components parallel and perpendicular to the HPMS segment, using the relative angle computed in step 7(b)iii.

   (c) Average the parallel and perpendicular distances from step 7(b)iv over all points from step 7(a), for each candidate HPMS segment, to obtain the approximate parallel and perpendicular distances between the Multinet segment and each candidate HPMS segment.

   (d) Compute the points scored by each join pair formed by the Multinet segment and a candidate HPMS segment, using table 12.1.

   (e) Filter the join pairs to those that score above the threshold in table 12.1. If no pairs qualify, go to step 7(h).

   (f) Sort the filtered pairs by decreasing points, then by increasing parallel and increasing perpendicular distances.

   (g) Designate the HPMS segment in the first pair from step 7(f) as the join key.

   (h) If any step fails to produce a candidate HPMS segment, attribute NULL as the join key for the Multinet segment.

8. Drop Multinet segments with a NULL join key. Copy the AADT values from the HPMS to the Multinet dataset through the join.

AADT Disaggregation

9. For each Multinet segment with an AADT value obtained through a successful join:

   (a) Halve the transferred AADT value for segments that are part of dual carriageways, by referring to the one-way flag in the Multinet metadata.

   (b) Factor the AADT from step 9(a) for the day of week using Exhibit A-1 of the Mobility Scorecard [105].

   (c) Use the typical speed profile for the day of week computed in step 6 for steps 9(d) and 9(e) below.

   (d) Determine the peak direction (AM, PM, or equal congestion) for each day of the week, as detailed in Step 3 of the Mobility Scorecard [105].

   (e) Determine the severity of congestion with respect to the off-peak speed profiles using Equation A-1 of the Mobility Scorecard.

   (f) Using the ordinate values for time of day from Exhibits A-2 through A-6 of the Mobility Scorecard, the adjusted AADT values from step 9(b), and the peak direction and severity of congestion from steps 9(d) and 9(e), find the disaggregated traffic counts for each hour in the week, as detailed in the Mobility Scorecard [105].

   (g) Linearly interpolate the typical volumes obtained for each hour of the day in step 9(f) to obtain volumes for 15-minute periods, and divide these volumes by 4 so that the sum of the volumes for each day remains the same as computed in step 9(b).

10. End of algorithm.

The algorithm was run state by state, in a program that can process up to 14 states in parallel on different processors. Texas, the state with the largest number of segments, took the longest to finish: slightly over 5 days and 13 hours, occupying a whole processor. The runtime could have been decreased significantly by reducing the buffer around the HPMS segments. The largest runtime penalties were incurred in computing the angles and lengths of the segments, which required high-precision floating-point geometry calculations.

12.4 Method to Merge Incidents and Congestion Data

This was one of the first methods developed and, consequently, is quite basic. The first step was to select only the incidents that spatially impact a bottleneck. Since the only available location information is the head of the bottleneck, only incidents around the head are selected. The selection buffer was created with a radius of one mile, computed using Vincenty's formula for distances on the ellipsoid [120]. The spatially selected incidents are further filtered by their time of occurrence.
Only incidents that meet any of the following criteria are selected; these sub-selected impacting incidents are presented in table 11.4.

1. The incident starts within a 30-minute window before the start of the bottleneck.

2. The incident starts during the span of the bottleneck (such incidents may last beyond the life of the bottleneck).

3. The incident ends during the span of the bottleneck.

4. The incident starts before and ends after the bottleneck, that is, it spans the entire duration of the bottleneck.

Chapter 13: Results from Performance Measurement Methods

In this chapter the results from the methods developed to compute performance measures are presented. Since each method documented in chapter 12 is different, with different objectives, each section in this chapter presents and discusses the results from the corresponding method.

13.1 Results from Forecasting Traffic Speeds

The results presented in this section were originally published in the paper by Cirillo et al. [18]. Since visualizing the results for the whole network shown in figure 11.2 is not possible, aggregate measures are presented. Predictions up to 30 minutes into the future are made from any given minute, unless insufficient real observations are available. For example, an ARIMA(1, 1, q) model requires a minimum of two observations before predictions can be made; therefore, forecasts can only begin after the data for the first two minutes past midnight have been received. In other words, for segments modeled using ARIMA(1, 1, q), minute three onwards can be predicted.

13.1.1 Model Selection

The first step in applying the proposed framework is model selection. As mentioned (see equation (12.2)), the models are selected using data from the same day one week before the day to be predicted. ARIMA(p, d, q) models are fitted to the data, where p, d, q ∈ {0, 1, 2}, and the best of the 27 possible models is selected, defined as the model with the lowest Bayesian Information Criterion (BIC). Table 13.1 gives an overview of the selected ARIMA orders. A model is fit to data from each day for each segment; since models are fit to a total of 10 weekdays (two weeks) in this study, the total number of selected orders is 26 540. The three most frequently selected orders are (1, 1, 1), (1, 0, 0), and (0, 1, 0).

Table 13.1: Order Selection

  Order      Count    Order      Count    Order      Count
  (0, 0, 0)  NA       (1, 0, 0)  5 700    (2, 0, 0)  817
  (0, 0, 1)  0        (1, 0, 1)  185      (2, 0, 1)  239
  (0, 0, 2)  0        (1, 0, 2)  886      (2, 0, 2)  381
  (0, 1, 0)  3 077    (1, 1, 0)  589      (2, 1, 0)  154
  (0, 1, 1)  405      (1, 1, 1)  10 995   (2, 1, 1)  1 254
  (0, 1, 2)  196      (1, 1, 2)  567      (2, 1, 2)  1 095
  (0, 2, 0)  0        (1, 2, 0)  0        (2, 2, 0)  0
  (0, 2, 1)  0        (1, 2, 1)  0        (2, 2, 1)  0
  (0, 2, 2)  0        (1, 2, 2)  0        (2, 2, 2)  0

13.1.2 Forecasting

Forecasts are made continuously once the minimum number of observations is available. For example, the first two minutes of real data are used to forecast the speed for the third minute in an ARIMA(1, 1, q) model. The forecast for the fourth minute then depends on the predicted value for the third minute and the observed value for the second minute; the fifth minute is predicted using the forecasts for the third and fourth minutes, and so on. Forecasts stop at 11:59 PM of each day. For every minute of the day, up to 30 minutes of future speed estimates are produced; in other words, for each minute, predictions from data received up to 30 minutes earlier are available. To avoid confusion, the latter terminology is used, and the term lag is defined as the number of minutes before the current minute at which the data used to predict the speed at the current minute were received.
In other words, a lag of 5 means that the speed at the current minute was predicted from data received 5 minutes earlier. Using these forecasts, statistical properties like the relative error are computed. Relative prediction errors at lags of 5, 10, 15, and 30 minutes are selected for presentation to avoid cluttering the figures; where feasible, results at lags of 20 and 25 minutes are also included.

Since each segment can have a different speed profile and maximum speed (the speed limit of the roadway the segment is on), a single measure for all segments needs to be relative. Therefore, the Relative Root Mean Squared Prediction Error (RRMSPE) is introduced. The RRMSPE is calculated like the Root Mean Squared Error (RMSE), but with the additional step that the squared difference between the true and predicted speeds is divided by the square of the true speed for each observation. Mathematically, the RRMSPE is given by

    RRMSPE = √( (1/T) Σ_{t=1}^{T} ((ŷ_t − y_t) / y_t)² ),    (13.1)

where ŷ_t is the predicted speed at time t, y_t is the real observed speed at time t, and T is the total number of time steps (T = 1 440 for a day, or T = 60 for an hour). Note that a single value of RRMSPE is produced for a given lag, segment, and day; therefore, a total of 26 540 RRMSPE values are produced for each lag. The RRMSPE is calculated at every lag for which predictions are made; however, since only predictions at lags of 5, 10, 15, and 30 minutes are presented, only the RRMSPE at those lags is shown. In all figures, each lag is kept distinct from the others so that the errors can be analyzed clearly.

Figure 13.1 shows a box plot for each lag. The abscissa in figure 13.1 shows the lag value, while the ordinate is the RRMSPE measure. Each box encompasses the Inter Quartile Range (IQR) of the RRMSPE values from all segments over both weeks of predictions (that is, the IQR of 26 540 points). The orange line within a box represents the median RRMSPE value for the given lag. The whiskers of the boxes show the 5th and 95th quantiles; the decision to move the whiskers out to these extreme quantiles was made to demonstrate the accuracy of the ARIMA models in the synthetic framework. If figure 13.1 is instead also split by the day of the week, and the RRMSPE values averaged for each lag (that is, 2 654 points averaged together), figure 13.2 is produced.

Figure 13.1: Box plot of RRMSPE using all predictions, by lag

Figure 13.2: RRMSPE for all segments, for each day by lag

Figure 13.2 can be transposed so that the day falls on the abscissa, while a line for each lag is drawn against the RRMSPE value on the ordinate; figure 13.3 shows this transformation. It is of academic interest that Tuesday and Thursday are two of the most difficult days to predict, although traffic patterns on those days are generally considered very repeatable. In fact, documents like the Maryland Mobility Report compute results using data from Tuesdays, Wednesdays, and Thursdays because the traffic patterns are more stable than on other days [76]. Conversely, Friday, with very variable patterns due to weekend demand, seems easier to predict.
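For reference, the RRMSPE of (13.1) for one segment, day, and lag can be computed directly from arrays of predicted and observed speeds, as in the brief sketch below; the array contents are illustrative only.

```python
import numpy as np

def rrmspe(predicted: np.ndarray, observed: np.ndarray) -> float:
    """Relative Root Mean Squared Prediction Error, equation (13.1): the RMSE of
    the prediction errors after dividing each error by the true observed speed."""
    relative_errors = (predicted - observed) / observed
    return float(np.sqrt(np.mean(relative_errors ** 2)))

# Example: a day's worth of 1-minute speeds for one segment at a single lag.
observed = np.full(1440, 55.0)                                # true speeds (mph), illustrative
predicted = observed + np.random.normal(0.0, 2.0, size=1440)  # hypothetical forecasts
print(rrmspe(predicted, observed))                            # small value => accurate predictions
```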
In order to understand figure 13.3 better, the day is broken down into periods, as shown in figure 13.4. Periods are fixed in size and are defined as 4-hour intervals, giving 6 periods in a day: midnight to 4 AM; 4 AM to 8 AM; 8 AM to noon (morning peak); noon to 4 PM; 4 PM to 8 PM (evening peak); and 8 PM to midnight. In figure 13.4, the average RRMSPE for a lag is plotted using data from all segments in a given period of the day; for figures with distinct periods, the RRMSPE of a period is computed from 240 data points, one for each minute in the 4-hour period. Similar to figure 13.3, figure 13.4 plots the day and period on the abscissa and the RRMSPE value on the ordinate, and divides the plot to demarcate each day.

Figure 13.3: RRMSPE for all segments, for each lag by day

Figure 13.4: RRMSPE for all segments, for each lag by period in day

Figure 13.4 does not show the spread of the RRMSPE values, only the average. Figure 13.5, however, expands each point in figure 13.4 into a box, showing the range of the data. Note that the whiskers in all box plots (figures 13.1, 13.5, and 13.6) correspond to the 5th and 95th quantiles, while the blue box itself shows the IQR and the orange line within a box shows the median. Each subplot in figure 13.5 shows data from one lag (indicated in the subplot title). Lastly, to show the full variation in the data, figure 13.6 is presented. This figure plots a box for each minute of the day; the relative residuals for all segments, over all 10 days, for the given minute comprise each box (26 540 data points per minute). The relative residuals are given by equation (13.2). Figure 13.6 clearly shows how the predicted values deviate from the real values for each minute of the day. As in the other box plots, the median is shown by the orange line, the blue boxes (barely visible even at peak periods) show the IQR, and the whiskers show the extreme 5% quantile ranges. The time of day is plotted on the abscissa, while the ordinate gives the relative residual value. The relative residual is

    r_t = (ŷ_t − y_t) / y_t,    (13.2)

where r_t is the relative residual at time t, ŷ_t is the predicted speed at time t, and y_t is the real observed speed at time t.

The increase in deviation is clearly visible around the peak periods in figure 13.6. Further, the larger relative residuals on the positive side show that the models make somewhat optimistic predictions over the whole network. One of the most important observations is that the median line remains consistently near zero, indicating that the majority of predictions are very accurate.
Also, note the extent of the IQR, denoted by the small blue region around the median line, again a testament to the accuracy accorded by the synthetic method to the ARIMA models.

Figure 13.5: Box plot of RRMSPE for all segments, for each lag by period in day (lags of 5, 10, 15, and 30 minutes)

Figure 13.6: Relative deviation for all segments, for each lag by minute in day

13.2 Results from Performance Measurement

The results presented in this section were originally published in the paper by Kaushik, Sharifi, and Young [61]. The NPMRDS, VPP, and Bluetooth data were overlaid for the two case study segments as explained in chapter 12. The results of overlaying the data and plotting the cfds are presented in figures 13.7 and 13.8 for segments NJ11-0009 and VA08-0012, respectively. The peak hour is highlighted in all subplots of figures 13.7 and 13.8 to allow direct comparison of the different datasets. Both case study segments have a PM peak hour, with travel times rising to almost four times the free-flow travel times. The peak hour is colored black in both the overlay scatter plot and the cfd plot in each subfigure of figures 13.7 and 13.8.
In each figure, part a shows the Bluetooth data, part b shows the VPP data, part c shows the NPMRDS all vehicle data, part d shows the NPMRDS passenger car data, while part e shows the NPMRDS freight truck data.

Figure 13.7: Overlaid travel times and associated cfd plots for NJ11-0009: (a) Bluetooth, (b) VPP, (c) NPMRDS all vehicles, (d) NPMRDS passenger cars, (e) NPMRDS freight trucks

It is clearly seen from figure 13.7 that the three datasets produce almost similar results for the freeway-like NJ11-0009. This is expected in the case of freeways, where filtered probe data works very well. However, as shown by figure 13.8, in the case of the arterial segment, VA08-0012, the VPP data, which are filtered and smoothed at the source, perform poorly as compared to the NPMRDS or the Bluetooth data. These plots visually help compare the performance of the three datasets.

Figure 13.8: Overlaid travel times and associated cfd plots for VA08-0012: (a) Bluetooth, (b) VPP, (c) NPMRDS all vehicles, (d) NPMRDS passenger cars, (e) NPMRDS freight trucks

The actual numeric values can be easily computed using equations (12.5) and (12.6). Due to the simplicity of the calculations, and in the interest of brevity, the numbers are not shown in this chapter. Additionally, it should be noted that VA08-0012 is composed of two or more TMC segments, depending on the data source. In this case, a shortcut method was used to get the combined travel times over the segments. The method involves finding those times when data were available for all constituent TMC segments simultaneously, and then adding the travel times together to produce the total travel time to traverse the study segment. The correct method would be to trace each observation on the upstream segment and look for the corresponding travel time on the downstream segment at the time a hypothetical vehicle is expected to arrive there. However, this method is quite difficult to execute on the NPMRDS data due to the gaps in the data, and therefore was not used. Since travel times from VPP and NPMRDS for all constituent TMC segments must be simultaneously available in the case of VA08, the data density of both NPMRDS and VPP was severely reduced. A way to overcome this is to overlay even more data. Alternatively, the 10-day overlay can be used to construct and impute missing data before tracing a hypothetical vehicle to arrive at the travel time to cross the section of roadway under study.
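As a rough illustration of the shortcut described above, the sketch below keeps only the timestamps at which every constituent TMC segment reports a travel time and sums them; the TMC codes, timestamps, and values are hypothetical placeholders, not the actual identifiers or data of VA08-0012.

```python
import pandas as pd

# Per-TMC travel times (seconds) indexed by timestamp (5-minute epochs here).
# TMC codes and values are hypothetical placeholders.
tmc_a = pd.Series({"2014-06-02 17:00": 95.0, "2014-06-02 17:05": 110.0,
                   "2014-06-02 17:10": 120.0}, name="110+04567")
tmc_b = pd.Series({"2014-06-02 17:00": 60.0, "2014-06-02 17:10": 75.0},
                  name="110+04568")

# Shortcut method: align on the timestamp, keep only epochs where every
# constituent TMC segment has data (inner join), then add the travel times.
combined = pd.concat([tmc_a, tmc_b], axis=1, join="inner").sum(axis=1)
print(combined)  # total travel time over the study segment
```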
13.3 Results from Spatial Conflation of HPMS and Multinet Datasets

The results presented in this section were originally published in the paper by Kaushik, Wood, and Gonder [63]. Figures 13.9a and 13.9b show the same geographical areas as figures 11.4a and 11.4b. The line widths have been scaled in proportion to the AADT of each segment. The similarity between the HPMS and Multinet maps scaled by the transferred AADT is clearly visible over a metropolitan region in figure 13.9a, and spotlights the prowess of the presented conflation algorithm. Figure 13.9b, when compared with figure 11.4b, shows that the segments in the parking lot southwest of the intersection have disappeared, along with other segments that did not have a join partner. Further, the line widths of Denver West Parkway, north of the intersection, show that it has correctly been conflated with the corresponding HPMS segment and was not attributed to I-70. Additionally, the ramps of the Interstate have also been correctly attributed. The imperfections become apparent, however, when other segments, parallel but unrelated to the HPMS segments, are thickened by AADT, like the segment in the northeast corner of the map in figure 13.9b, which was attributed the AADT from I-70. Similarly, the segments to the west of the incorrect segment also show "bleeding" of AADT onto unrelated segments. These errors are primarily a product of having a single threshold. The threshold was fixed by considering winding roads, and consequently is far too lenient for straight roads. In addition, some Multinet segments are missing around the intersection, indicating that the join failed for these segments.

Table 13.2 presents a global perspective of the errors. The table shows the percent error in the total length of conflated Multinet segments when compared with the total length of HPMS segments for each state. The final value for the whole country shows that the average error is just under 7%, which for 1.2 million miles of complex, real-world roadways is very good by the standards in the literature.
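A minimal sketch of the kind of state-level length comparison behind table 13.2 is shown below. The segment table, its column names, and the country-level aggregation (pooling all matched segments) are illustrative assumptions rather than the exact procedure used to produce the table.

```python
import pandas as pd

# One row per conflated Multinet segment; column names are assumptions.
conflated = pd.DataFrame({
    "state_fips":   [8, 8, 24, 24],
    "multinet_len": [1.20, 0.85, 2.10, 0.40],   # miles of Multinet segment
    "hpms_len":     [1.10, 0.80, 1.95, 0.55],   # miles of matched HPMS segment
})

# Per-state signed error in total conflated length relative to HPMS length,
# analogous to the "Total Error" column of table 13.2.
by_state = conflated.groupby("state_fips")[["multinet_len", "hpms_len"]].sum()
by_state["total_error"] = (
    (by_state["multinet_len"] - by_state["hpms_len"]) / by_state["hpms_len"]
)
print(by_state["total_error"])

# A country-level figure obtained here by pooling all matched segments
# (this aggregation is an assumption, not necessarily the one used).
country = abs(conflated["multinet_len"].sum() - conflated["hpms_len"].sum()) \
          / conflated["hpms_len"].sum()
print("country error:", country)
```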
Figure 13.10 shows the use of the conflated dataset. This figure shows the relation of the speed and volume datasets for I-405N in Los Angeles, California, and I-270N in Maryland. Figures 13.10a and 13.10c are created by plotting the typical week speed profiles by day, and thickening the lines by the traffic volume for each 15-minute period of that day. Figures 13.10b and 13.10d use arrows to present the evolution of traffic congestion with respect to time, volume, and speeds. These figures also share a marked resemblance with speed-flow fundamental diagrams.

Figure 13.9: Comparison of Network after Conflation: (a) Birds-eye view of Denver, Colorado, post conflation, with lines thickened in proportion to AADT values; (b) Zoomed-in birds-eye view of an interchange in Denver, Colorado, with lines thickened to show AADT values.

Figure 13.10: Jointly visualizing conflated speed and volume data for two corridors: (a) Speed and volume on I-405N in California for a typical week; (b) Evolution of congestion during a typical week on I-405N in California; (c) Speed and volume on I-270N in Maryland for a typical week; (d) Evolution of congestion during a typical week on I-270N in Maryland.

Table 13.2: Total absolute conflation error by state and for the whole country

State FIPS   Total Error   State FIPS   Total Error   State FIPS   Total Error
 1    0.044   21    0.080   38    0.033
 2   -0.254   22    0.307   39    0.147
 4    0.207   23    0.014   40    0.171
 5    0.055   24    0.178   41   -0.027
 6    0.282   25    0.103   42    0.057
 8    0.141   26    0.123   44    0.183
 9    0.046   27    0.124   45   -0.310
10    0.004   28    0.075   46    0.083
11   -0.109   29    0.153   47    0.108
12    0.475   30   -0.475   48    0.278
13    0.089   31    0.020   49   -0.068
15    0.156   32    0.051   50   -0.064
16   -0.008   33    0.024   51   -0.291
17    0.230   34    0.347   53   -0.031
18    0.139   35    0.263   54   -0.439
19    0.163   36    0.200   55    0.130
20    0.056   37    0.087   56    0.129

Country Average Absolute Error: 0.068

13.4 Results from Merging Incident and Congestion Data

The results of the incident selection are presented in figure 13.11. In the figure, it appears that every bottleneck has impacting incidents, and the incidents are quite dense. However, since each incident is attributed to a bottleneck, it is a simple matter to count the number of incidents that impact each bottleneck. A chart produced by plotting the count of impacting incidents on the ordinate and the bottleneck impact factor, computed using equation (11.1), on the abscissa for each bottleneck in 2011 is presented in figure 13.12. Different marker types and colors are used to denote the four quarters of the year. The figure clearly shows a decreasing exponential trend: the bottlenecks with a high impact factor are usually impacted by fewer incidents, while the bottlenecks impacted by a large number of incidents usually have a smaller impact factor. This makes intuitive sense because bottlenecks caused by incidents are infrequent, and possibly last for shorter durations due to the combined effect of less demand relative to capacity, and TIM efforts.

Figure 13.11: Map of all bottlenecks in Maryland from 2011 overlain by incidents also from 2011

Figure 13.12: Scatter plot of count of incidents and Bottleneck Impact Factor

The interesting results are obtained when figure 13.12 is divided into four quadrants as shown in figure 13.13. The division was guided by the Maryland State Highway Administration (SHA), and clearly shows the detailed relation between bottlenecks and incidents. Points in the first quadrant represent bottlenecks with a high impact factor and many incidents, a potential place for heavy motorist misery. The second quadrant comprises those areas with a large number of incidents but moderate bottlenecks. This quadrant captures the locations where incidents are mainly responsible for the bottlenecks. Reasonable performance is depicted in the third quadrant, where both the bottleneck impact factor and the number of incidents are low. Congested conditions with very few incidents are in quadrant four. In order to explore the relation between bottlenecks and incidents, a bottleneck each from quadrants two and four is examined.

Figure 13.13: Scatter plot of count of incidents and Bottleneck Impact Factor divided into four quadrants
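A minimal sketch of the per-bottleneck incident count and a quadrant assignment in the spirit of figure 13.13 is given below. The thresholds, column names, and data rows are placeholder assumptions; the actual division of the axes was guided by SHA and is not reproduced here.

```python
import pandas as pd

# One row per bottleneck: impact factor from equation (11.1) and the number
# of incidents attributed to it. Values and column names are illustrative.
bn = pd.DataFrame({
    "bottleneck_id":  ["B1", "B2", "B3", "B4"],
    "impact_factor":  [127703, 84869, 12000, 95000],   # mile-hours
    "incident_count": [16, 126, 4, 3],
})

# Placeholder thresholds; the real cut-offs in figure 13.13 were set with SHA.
IMPACT_CUT, INCIDENT_CUT = 100000, 50

def quadrant(row):
    high_impact = row["impact_factor"] >= IMPACT_CUT
    many_incidents = row["incident_count"] >= INCIDENT_CUT
    if high_impact and many_incidents:
        return 1   # severe congestion with many incidents
    if many_incidents:
        return 2   # incidents mainly responsible for the bottleneck
    if not high_impact:
        return 3   # reasonable performance
    return 4       # congested, but with very few incidents

bn["quadrant"] = bn.apply(quadrant, axis=1)
print(bn)
```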
13.4.1 4th Quadrant Bottleneck at I-695 and MD-147

Figure 13.14 shows the location of the bottleneck, with the impacting incidents. The red octagon shows the head of the bottleneck, the purple line shows the direction of the tail, while the yellow triangles represent the locations of the incidents. This bottleneck is from the fourth quadrant, and had the worst bottleneck impact factor in the first quarter of 2011. The bottleneck ranking algorithm counted 62 occurrences for this bottleneck in 3 months, with an average maximum duration of two and a half hours and an average maximum length of 9 miles. The impact factor computed by the bottleneck ranking algorithm is 127 703 mile-hours. However, it was impacted by only 16 incidents during the same time frame.

Figure 13.14: Map of the bottleneck at I-695 and MD-147

The detailed records of the bottleneck were cleaned up and plotted on a time-time graph as shown in figure 13.15. In this figure, each occurrence of the bottleneck is plotted as a day on the abscissa, while the time of day is shown on the ordinate. The purple squares represent the start of the bottleneck, while the tail shows the duration of the bottleneck. Incidents are shown on the graph as red triangles. Figure 13.15 clearly shows that this bottleneck occurs regularly during the evening peak. There are some bottleneck occurrences that clearly fall outside the trend, like on January 19 and 26, February 22, and March 8, which are clearly caused by incidents, due to the presence of the incident triangle at the start of the bottleneck.

Figure 13.15: Time-time plot showing impact of incidents on bottleneck at I-695 and MD-147

13.4.2 2nd Quadrant Bottleneck at I-95 and MD-43

This bottleneck is from the second quarter of 2011. The bottleneck ranking algorithm counted 47 occurrences for this bottleneck in 3 months, with an average maximum duration also of two and a half hours, and an average maximum length of 9.080 miles. Both bottlenecks seem to be quite similar at the outset, except for the frequency of recurrence. The impact factor computed by the bottleneck ranking algorithm is a moderate 84 869 mile-hours. However, where this bottleneck differs from the one at I-695 and MD-147 is in the number of impacting incidents. This bottleneck was impacted by 126 incidents during the same time frame. The location of this bottleneck is shown in figure 13.16. The circled incident shows the location where 96 of the 126 impacting incidents occurred.

Figure 13.16: Map of the bottleneck at I-95 and MD-43

A similar time-time plot for this bottleneck clearly shows how the incidents impact almost every bottleneck occurrence. This plot is shown in figure 13.17. The occurrences of this bottleneck are also during the PM peak period; however, they always appear to be caused by incidents. The most frequent incident type at this location is Disabled Vehicle, which does not help to understand the reason behind this incident and bottleneck pair. However, taken at face value, a disabled vehicle might cause motorists to slow down, and hence start and cause a bottleneck. In figure 13.17 it can also be seen that some incidents keep a bottleneck active for longer, as they occur before the bottleneck has a chance to disappear.

Figure 13.17: Time-time plot showing impact of incidents on bottleneck at I-95 and MD-43
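A minimal sketch of how a time-time plot like figures 13.15 and 13.17 could be drawn with matplotlib follows. The occurrence and incident records are made-up placeholders, with each occurrence described by its date, start hour of day, and duration in hours.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical bottleneck occurrences: date, start hour of day, duration (hours).
occ = pd.DataFrame({
    "date":     pd.to_datetime(["2011-01-18", "2011-01-19", "2011-01-20"]),
    "start_hr": [16.5, 9.0, 16.75],
    "dur_hr":   [2.5, 1.5, 2.0],
})
# Hypothetical incidents near the bottleneck head: date and hour of day.
inc = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-19"]),
    "hour": [8.75],
})

fig, ax = plt.subplots()
# Bottleneck start (square marker) with a tail covering its duration.
ax.vlines(occ["date"], occ["start_hr"], occ["start_hr"] + occ["dur_hr"])
ax.scatter(occ["date"], occ["start_hr"], marker="s", label="bottleneck start")
# Incidents plotted as triangles.
ax.scatter(inc["date"], inc["hour"], marker="^", label="incident")
ax.set_xlabel("Date")
ax.set_ylabel("Time of day (hour)")
ax.legend()
plt.show()
```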
Chapter 14: Conclusions Derived from Performance Measurement

In this chapter, summaries and conclusions are presented from the methods developed for supply-side performance measurement.

14.1 Synthetic Time Series Forecasting Method

The results presented in the previous chapter demonstrate the power of the synthetic time series framework to predict traffic data with good accuracy. Figure 13.4 shows that the maximum average RRMSPE over all periods in the ten days is just about 0.630. This error is acceptable for a naive model that does not account for the volatility of traffic patterns caused by the heavy dependency on environmental and human factors. Also, in terms of deviations from the real value, figure 13.6 shows that the IQR is less than 10%. Therefore, for most of the segments, the predicted value varies by less than 0.10 of the real value. The authors are unaware of a better result produced using the ARIMA family of models in the literature. In fact, the synthetic method outperforms many other, more advanced and complicated models presented in the literature.

The proposed method is a foundation that should be expanded to include a multitude of factors. It is urged that future research in traffic prediction be conducted within the synthetic method, where applicable. The authors have planned certain enhancements to this work. Including various datasets, such as weather and incident data, in the ARIMA models as supplementary variables is one such planned enhancement. Additionally, accounting for network effects, such as upstream and downstream conditions, as auxiliary information in the models is also planned. Further, the scalability of the synthetic framework will allow for easy expansion of the network to include a larger area. For example, the entire Washington DC, Maryland, and Virginia roadway system can easily be modeled over a continuous and expanding time window. These results will provide key information to all users of ITS, and support subsystems like ATIS and TMS in the region. Also, the scalability allows for network-wide route guidance systems to be implemented, which would drive better utilization of infrastructure and lower costs to all stakeholders. Lastly, the method can be quickly applied to any region in the world with data acquisition arrangements, paving the way for reliable ITS deployment.

14.2 Working with NPMRDS

The NPMRDS data were characterized, and methods to increase their density, overcome the propensity of outliers, and quickly compute performance measures without the need for filtering, smoothing or imputation were demonstrated. The increased data density resulting from the 24-hour overlay allows for outlier rejection without complex algorithms. The case studies performed highlight that the proposed method can be used to generate results from NPMRDS that resemble results from Bluetooth data, as evidenced by figures 13.7 and 13.8. Further, since VPP is subject to rigorous filtering and smoothing, it tends to underestimate travel time variation, especially on interrupted flow facilities. The most likely cause is that vehicles stopped in queues (such as at traffic lights) are filtered out as outliers. Investigation of NPMRDS fidelity for use as the basis of performance measurement is recommended, using the described methods on a statistically valid sample of segments representing freeways as well as varying functional classes of interrupted-flow arterials.

14.3 Spatial Conflation of Geospatial Datasets

A scoring algorithm for conflation was proposed here and was applied to the entire U.S. network. The total conflation error was found to be less than 7%, which is commendable when compared with other results in the literature.
For example, the probabilistic relaxation method proposed by Yang, Zhang, and Luan [132] has an error of about 5% for a much smaller urban study area. By constraining the area, the types of geographies are limited, and therefore the scores in table 12.1 can be further tuned for best results. If the geographies are limited, the tuning can be accomplished as an optimization problem. However, the United States has wide-ranging and vastly different geographies, which makes using a generic conflation algorithm difficult.

The dataset created by the conflation is being used by the National Renewable Energy Laboratory (NREL) to answer specific questions about the effects of factors such as congestion, volume of traffic, gradient of roadways, and weather conditions on fuel use and greenhouse gas emissions. Figure 13.10, which presents an innovative way to visualize the conflated data, was produced to aid one such study conducted at NREL. A possible avenue for improvement of the conflation algorithm is to make the scores and thresholds more sensitive and directly dependent on the attributes of the segments. Additionally, successful join candidates can be used to train advanced deep neural networks for better joins and to help resolve the issues with the current conflated dataset.

14.4 Merging Incidents and Congestion Datasets

Establishing the correlation between incidents and bottlenecks proved challenging due to the fidelity of the incident database and the variation in the methods used to geolocate incidents and bottlenecks. Even so, the nature and magnitude of the correlation was observed, and is best summarized in figure 13.12, which depicts a relation between the severity of the bottleneck and the recorded incidents. Further analysis showed that a higher degree of correlation can be obtained for bottlenecks with a high count of incidents as compared to ones with a low count of incidents. In either case, it was shown that bottlenecks which fall outside the general trend are mostly attributable to incidents. Both of these findings are evident from figures 13.15 and 13.17, which provide a graphical method to visually assess the correlation between congestion and incidents. Extension and refinement of this graphical method was used to provide SHA with appropriate patrol locations and paths. However, the fidelity of the incident database and the unavailability of bottleneck evolution data (how the bottleneck initiated, grew, and merged with other bottlenecks) hinder more concrete and tangible conclusions, which might involve a mathematical model to compute the extent of incident impact on bottlenecks. To further extend this branch of study, rigorous quality control over the incident data is necessary, along with more detail in the bottleneck data.

Chapter 15: Dissertation Summary and Conclusions

It is necessary to have the domains of supply and demand of transportation work hand in hand to solve a majority of the current problems facing travelers on today's transportation networks. The methods described in this dissertation are geared towards both of these domains, providing separate solutions as required to tackle the most urgent problems and the weakest links in existing methodologies. The difference in the problems faced in the supply and demand domains prompts solutions that are distinct. However, data fusion precipitates out of the methods documented in this dissertation, which is not surprising given the vast disparity in datasets available for transportation analysis.
In this dissertation, three-pronged contributions to the analysis efforts in the transportation industry, affecting both the supply and demand domains, are documented. The first is a reliable model to generate synthetic populations for small areas. Synthetic population is required to estimate the demand for facilities from behavioral models fitted to sample survey data. The second contribution is a set of techniques designed to obtain real-time insights into roadway facilities, involving short-term forecasting to understand the expected trend of performance, and a data-driven model to quickly produce the necessary performance measures. The third contribution is born out of the existence of transportation data in spatially and temporally disparate datasets, and involves merging of road networks to transfer information, like volume of vehicles and incidents, from one dataset to another.

The synthesis of population considering various characteristics of people was demonstrated in the first section of this dissertation. Copulas were used to capture the dependence among various recorded characteristics of people, which was then used to synthesize a population with the same dependence for a different geographic area. The marginals of the synthesized population were then conformed to the totals of the region of interest. Not only does the method tackle the problem of small area estimation, by producing reliable synthetic populations for small areas, but it also merges datasets together. In the first section of this dissertation, the American Community Survey (ACS) and the U.S. Decennial Census data were linked together using copulas. Further noteworthy is the handling of the temporal disparity, because the ACS data were collected in 2016, while the census data were from 2010. The captured dependence represents the more recent data, and so do the numbers of synthesized individuals in each census tract; the census only supplies the relative frequencies of attributes for each characteristic.

The developed method using copulas can easily be extended to merge various other datasets, and simultaneously synthesize populations with characteristics sourced from the separate datasets. The National Household Travel Survey (NHTS) collects data from very few respondents in a state, and consequently cannot be reliably used at even the county level. However, using the ACS, marginals can be obtained at the Public Use Microdata Area (PUMA) level, where an NHTS population can be synthesized by fitting copulas to the statewide NHTS data. Examples of other possible datasets that can be included with copula models to synthesize reliable populations for the region of interest are administrative data, like those of the Internal Revenue Service (IRS). These data can fill gaps in the data reported by the census; for example, the census does not collect and report income information. Even big data, like the GPS probe based datasets discussed in chapter 11, can be used to obtain marginal information like travel times, congestion metrics, etc. in the region of interest.

Performance measurement in transportation is a relatively new terminology, almost born out of the economic constraints of the current age. The basic idea is to find the utilization of existing infrastructure, and improve upon it by considering the transportation network as a whole. This is radically different from the traditional practice of adding capacity and roads as per the regional long-term plans.
The reason behind the shift to performance measurement and management is to use the dwindling monetary reserves smartly, and to make the most of the existing infrastructure. It also increases sustainability, saves on wasted energy and time, and improves quality of life for the people. The rules mandated under MAP-21 for continuous reporting of performance measures add to the pressures heaped on transportation agencies trying to get the most bang for their buck. Most mandated measures require fusing data from different spatial and temporal regions. An instance of temporal linkage was demonstrated while predicting short-term traffic, where data from previous periods were linked through a model to the currently observed traffic patterns. Similarly, in producing performance measures from the GPS and Bluetooth probe datasets, data from multiple similar periods were combined to increase the density of data, which improves the reliability and accuracy of the produced results. Further, incidents near the head of congested segments were temporally linked with the congestion, and represented graphically to understand the relation between the two.

Although speed data are very useful for immediate insights into some aspects of traffic, the complete picture is only obtained after incorporating other datasets, like incidents, volume, and weather, to name a few. These data help estimate the global and system-wide impacts and externalities due to transportation. Some of these results are mandatory parts of reports generated under MAP-21. Therefore, considerable effort was invested in producing conflation algorithms to merge the data together. A result documented in the second section of this dissertation from the conflation efforts was the merger of the speed and volume data to produce combined estimates of the resources consumed, time and energy spent, greenhouse gases emitted, and other externalities of transportation. Similarly, the spatial and temporal conflation of the incident and congestion data provided insights into the causes of congestion, and the impact of further incidents on the congestion. However, the study had a limited scope due to the fidelity of the incident and congestion datasets.

Chapter 16: Proposed Future Extensions to Presented Work

In this dissertation, a two-pronged approach was taken to solve the most pressing needs in the supply and demand domains of transportation. The ultimate goal would be the merger of the two in a large modeling framework, such that the impacts estimated in one area immediately predict the impacts on the other areas. A promising modeling approach in practice with the potential to achieve this goal lies in Activity Based Models (ABM). An ABM runs many interlinked models mainly aimed at producing the choices made by people as they go about their daily lives. These choices are then used to generate an origin-destination matrix and inform the expected demand on various transportation networks. Models in the ABM are also sensitive to the local and regional congestion and other negative costs due to traveling. Consequently, the framework needs to be run in loops to converge on the final set of expected demand and loads on network links. Apart from providing reliable estimates of demand, the output can be compared against the typical trends of facility utilization observed using the real-time data over a sufficiently long period of time to minimize the impacts of current events.
This would provide a check and balance for the demand models, and assess the reality of their predictions. The merged volume and speed dataset produced by the conflation algorithm documented in this dissertation would allow such comparisons to be made directly for roadway based travel. The contribution of freight to the total volume and congestion on network links can be estimated by the use of datasets like the NPMRDS, which provide separate data on freight vehicles and passenger vehicles.

With real-time and short-term predicted data for the links forming transportation networks, provided through the GPS big data and the documented forecasting technique, the fitting and performance of the ABM can be expected to improve. The tuned ABM models can then also be subjected to various policy scenarios to predict the potential behavior of people and the likelihood of achieving long-term targets, like reductions in greenhouse gas emissions and the adoption of more efficient mobility solutions. However, such results are only possible with a comprehensive database of various datasets in a similar temporal and spatial frame that can be leveraged to power the models. The methods developed and documented in this dissertation go a long way towards achieving the universal transportation dataset. The following subsections provide a high-level overview of further tasks required to achieve the goal of the universal dataset.

16.1 Universal Data Synthesis Model

In the Small Area Estimation (SAE) literature, hierarchical models have been developed which are estimated using hierarchical Bayesian approaches. These models can not only provide the required estimates for variables of interest in small areas, in space, time, or population sub-domains, but also provide the complete posterior distribution function for the variables. These posterior distributions can be obtained as the marginal univariate distribution function of the variable of interest. Combined with the copula approach, the hierarchical Bayesian models allow synthesis of not only the basic socioeconomic and demographic characteristics of individuals but also more advanced variables for which the marginal information may otherwise be unobtainable. The dataset thus produced would have all the information about people, including transportation-specific characteristics such as vehicles owned, departure time to work, travel time to work, distance traveled for business and leisure activities, income, educational achievements, location of home and work, presence of children, location of schools for the children, and so on. This comprehensive dataset, along with a dataset that not only provides real-time and short-term predicted speeds but also volumes, congestion indexes, performance measures, and current and predicted weather information, would provide holistic data for the most intricate ABM framework, which could predict the choices of people and estimate demand even in real time. Moreover, such data and models can be used to predict the likelihood of even the most difficult to forecast events, like incidents. Prediction of incidents would allow proactive positioning and patrolling to ensure not only minimal disruption to other users, but also critical and time-sensitive aid to the victims.

16.2 Population from Aggregate Data

Population synthesis through the use of copulas as documented in this dissertation only works with sample data, as provided by the ACS.
Extending the method to use aggregate data requires the synthesis of the dependence that is lost during aggregation. The uses would be in regions and countries that lack a reliable sample survey framework. Moreover, many surveys are reported as aggregate totals rather than sample data. Such a method might be produced by the use of Markov Chain Monte Carlo (MCMC) methods to capture the hidden relationships within aggregated datasets. Conditional drawing through the use of MCMC is required because people's demographic characteristics are closely related to and dependent on each other. Such a study would be innovative, and would unlock the possibilities of data that otherwise find little use outside their specific domains.

16.3 Extending the Forecasting Framework

The prediction framework demonstrated in chapter 12 has only been applied to short-term forecasting of traffic speed data. Since the framework is flexible and extensible, it can be modified to support forecasting of a multitude of variables of interest. Also, the method itself can be expanded to improve the prediction accuracy over the short term. One of the simplest ways to improve the method is to use the conditions of the upstream and downstream segments as auxiliary variables while predicting the conditions on the current segment. Additionally, the data on incidents, volume and weather can be incorporated into the model through the spatial conflation methods documented in this dissertation. It is also possible to modify the framework to provide predictions over other time horizons. Medium-term forecasts would potentially be useful for estimating greenhouse gas emissions, and for forecasting volume data in response to policies and market prices of gasoline. The latter can be supplied as auxiliary variables to the synthetic time-series framework.

Appendix A: Additional Monte Carlo Simulations with Copulas

In order to further reinforce that the copula captures and synthesizes data with the correct dependence structure, results from additional simulations are provided in this appendix. The simulations have a triple objective: 1. to illustrate that the copula can recapture the parameter from synthesized data conforming to the same marginals as the ACS PUMS; 2. to show that the parameter is still recaptured even with drastic changes in the marginals; 3. to empirically examine the sensitivity to the copula parameter value.

The simulations are all carried out using the Gumbel copula. The copula is used to generate two datasets of $n = 118\,583$ observations in $d = 9$ dimensions. The difference between the datasets is the copula parameter, which is set to $\theta = 1.059$ for the first dataset, representing the Gumbel copula parameter estimated from the PUMS data for PUMA 1107 (see table 6.3). Let this dataset be called $U^{(RP)} = \{u_i^{(RP)},\ i = 1, \dots, n\}$, where RP stands for "Real Parameter". The second dataset is created with $\theta = 1.500$ to showcase a different and stronger dependence structure, as it is further from one, the parameter value corresponding to the independence copula. Let this dataset be $U^{(SP)} = \{u_i^{(SP)},\ i = 1, \dots, n\}$, where SP stands for "Simulated Parameter".

The first dataset is inverse transformed using, on the one hand, the marginals representing the PUMS data from PUMA 1107, and on the other, marginals representing a simulated dataset with 10 different values per marginal, to create two different datasets.
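A minimal sketch of the inverse probability integral transform used to impose a discrete marginal on one dimension of a copula sample is shown below. The category values and probabilities are made-up placeholders, and the uniform draws stand in for one column of a copula-generated dataset; they are not the marginals or samples used in the appendix simulations.

```python
import numpy as np

rng = np.random.default_rng(0)

# One column of copula-generated uniforms (placeholder for an actual
# Gumbel copula sample of n observations in d dimensions).
u = rng.uniform(size=10)

# A discrete marginal with hypothetical category codes and probabilities,
# e.g. a household characteristic with its relative frequencies.
categories = np.array([1, 2, 3, 4])
probs = np.array([0.30, 0.35, 0.20, 0.15])

# Inverse probability integral transform: assign the first category whose
# cumulative probability reaches the uniform value.
cum_probs = np.cumsum(probs)
x = categories[np.searchsorted(cum_probs, u)]
print(x)
```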
Continuing the nomenclature, the first of these datasets is called $X^{(RPRM)}$, where RM indicates that the marginals of $X^{(RPRM)}$ are indistinguishable from the marginals of the PUMS data. Similarly, the other dataset is called $X^{(RPSM)}$, where SM indicates that the marginals are simulated and formed using (5.2). We apply the same two marginal transformations to the dataset $U^{(SP)}$ to obtain two new datasets, $X^{(SPRM)}$ and $X^{(SPSM)}$. The proposed copula estimation procedure is applied to all four resulting datasets. Results for $X^{(SPSM)}$ have already been presented in table 5.5. Results obtained for the other three datasets, $X^{(RPRM)}$, $X^{(RPSM)}$ and $X^{(SPRM)}$, are discussed below.

Results obtained by applying the proposed procedure to dataset $X^{(RPRM)}$ are presented in table A.1. The Gumbel copula parameter is almost correctly recovered. The p-values further reinforce that the results are very similar to those presented in table 6.3.

Results obtained by applying the proposed procedure to dataset $X^{(RPSM)}$ are presented in table A.2. The Gumbel copula parameter is again almost correctly recovered, illustrating that the captured dependence is independent of the influence of the marginal distribution. The results presented in table A.2 clearly show that any copula for which $H_0$ cannot be rejected is equally adept at reproducing the dependence structure, regardless of the marginals.

Table A.1: Constructed Archimedean Copulas for PUMA 1107

Copula       Parameter  CvM Statistic  p-value
Independent  NA         867.652        0.001
Clayton      0.160      716.146        0.001
Frank        NA         NA             NA
Gumbel       1.113      668.015        0.999
Joe          1.185      679.031        0.999

Table A.2: Constructed Archimedean Copulas for PUMA 1107

Copula       Parameter  CvM Statistic  p-value
Independent  NA         185.441        0.001
Clayton      0.146      145.084        0.001
Frank        0.478      147.829        0.001
Gumbel       1.076      120.433        0.999
Joe          1.134      109.844        0.999

Finally, results obtained by applying the proposed procedure to dataset $X^{(SPRM)}$ are presented in table A.3. The estimated Gumbel copula parameter is close to the correct one. However, the p-values necessitate rejection of every copula except the Joe copula. Also, the Cramér–von Mises statistic values are drastically different from the results presented in tables 6.3, A.1 or A.2, but interestingly, the value is the smallest for the Gumbel copula. This suggests that the statistical test proposed by Kojadinovic [66] still has some limitations and deserves more exploration, and that other tests to compare discrete multivariate distributions should be investigated in future research.

Table A.3: Constructed Archimedean Copulas for PUMA 1107

Copula       Parameter  CvM Statistic  p-value
Independent  NA         6 401.343      0.001
Clayton      0.641      3 827.180      0.001
Frank        3.110      2 883.169      0.001
Gumbel       1.484      2 785.526      0.006
Joe          1.726      3 104.049      0.999

Bibliography

[1] Baher Abdulhai, Himanshu Porwal, and Will Recker. "Short term freeway traffic flow prediction using genetically-optimized time-delay-based neural networks". In: California Partners for Advanced Transit and Highways (PATH) (1999).
[2] Mohammed S Ahmed and Allen R Cook. Analysis of freeway traffic time-series data by using Box-Jenkins techniques. 722. 1979.
[3] Theo A. Arentze and Harry J.P. Timmermans. "A learning based transportation oriented simulation system". In: Transportation Research Part B 38.7 (2004), pp. 613–633. doi: 10.1016/j.trb.2002.10.001.
[4] Johan Barthelemy and Philippe L. Toint. "Synthetic population generation without a sample". In: Transportation Science 47.2 (2013), pp. 266–279. doi: 10.1287/trsc.1120.0408.
[5] Richard J Beckman, Keith A Baggerly, and Michael D McKay. "Creating synthetic baseline populations".
In: Transportation Research Part A 30.6 (1996), pp. 415–429. doi: 10.1016/0965-8564(96)00004-3.
[6] BluSTATs. BluSTATs version 1.94 Operations Manual. 2012.
[7] David E Boyce and Huw Williams. "Urban travel forecasting in the U.S. and U.K." In: Methods and Models in Transport and Telecommunications. Springer, 2005, pp. 25–44. doi: 10.1007/3-540-28550-4_3.
[8] Mark Bradley, John L Bowman, and Bruce Griesenbeck. "SACSIM: An applied activity-based model system with fine-level spatial and temporal resolution". In: Journal of Choice Modelling 3.1 (2010), pp. 5–31. doi: 10.1016/s1755-5345(13)70027-7.
[9] United States Census Bureau. 2010 Census. 2010. url: https://www.census.gov/programs-surveys/decennial-census/decade.2010.html (visited on 04/2018).
[10] United States Census Bureau. 2016 American Community Survey. 2016. url: https://www.census.gov/programs-surveys/acs/ (visited on 10/2017).
[11] Upper Midwest Reliability Resource Center. Travel Time Reliability Reference Manual. 2014. url: http://en.wikibooks.org/wiki/Travel_Time_Reliability_Reference_Manual.
[12] Srinivasa Ravi Chandra and Haitham Al-Deek. "Predictions of freeway traffic speeds and volumes using vector autoregressive models". In: Journal of Intelligent Transportation Systems 13.2 (2009), pp. 53–72. doi: 10.1080/15472450902858368.
[13] Liao Chen-Fu. "Using truck GPS data for freight performance analysis in the Twin Cities metro area". In: (2014).
[14] Bi Yu Chen, Hui Yuan, Qingquan Li, William HK Lam, Shih-Lung Shaw, and Ke Yan. "Map-matching algorithm for large-scale low-frequency floating car data". In: International Journal of Geographical Information Science 28.1 (2014), pp. 22–38. doi: 10.1080/13658816.2013.816427.
[15] Steven I-Jy Chien and Chandra Mouly Kuchipudi. "Dynamic travel time prediction with real-time and historic data". In: Journal of Transportation Engineering 129.6 (2003), pp. 608–616. doi: 10.1061/(asce)0733-947x(2003)129:6(608).
[16] Tom Choe, Alexander Skabardonis, and Pravin Varaiya. "Freeway performance measurement system: operational analysis tool". In: Transportation Research Record: Journal of the Transportation Research Board 1811 (2002), pp. 67–75. doi: 10.3141/1811-08.
[17] Abdoul-Ahad Choupani and Amir Reza Mamdoohi. "Population synthesis using iterative proportional fitting (IPF): A review and future research". In: Transportation Research Procedia 17 (2016), pp. 223–233. doi: 10.1016/j.trpro.2016.11.078.
[18] Cinzia Cirillo, Ying Han, Kartik Kaushik, and Parthasarathi Lahiri. "Synthetic time series technique for predicting network-wide road traffic". In: Statistical Journal of the IAOS 34.3 (2018), pp. 425–437.
[19] Maria A Cobb, Miyi J Chung, Harold Foley III, Frederick E Petry, Kevin B Shaw, and H Vincent Miller. "A rule-based approach for the conflation of attributed vector data". In: GeoInformatica 2.1 (1998), pp. 7–35.
[20] Gary A Davis, Nancy L Nihan, Mohammad M Hamed, and Leslie N Jacobson. "Adaptive forecasting of freeway traffic congestion". In: Transportation Research Record 1287 (1990).
[21] Paul Deheuvels. "La fonction de dépendance empirique et ses propriétés, un test non paramétrique d'indépendance". In: Académie Royale de Belgique, Bulletin de la Classe des Sciences, 5e série 65.6 (1979), pp. 274–292.
[22] W Edwards Deming and Frederick F Stephan. "On a least squares adjustment of a sampled frequency table when the expected marginal totals are known". In: The Annals of Mathematical Statistics 11.4 (1940), pp. 427–444. doi: 10.1214/aoms/1177731829.
[23] Paul Embrechts, Filip Lindskog, and Alexander McNeil. Modelling dependence with copulas. Tech. rep. Zurich, Switzerland: Département de mathématiques, Institut Fédéral de Technologie de Zurich, 2001.
[24] Hongchao Fan, Bisheng Yang, Alexander Zipf, and Adam Rousell. "A polygon-based approach for matching OpenStreetMap road networks with regional transit authority data". In: International Journal of Geographical Information Science 30.4 (2016), pp. 748–764. doi: 10.1080/13658816.2015.1100732.
[25] Bilal Farooq, Michel Bierlaire, Ricardo Hurtubia, and Gunnar Flötterød. "Simulation based population synthesis". In: Transportation Research Part B 58 (2013), pp. 243–263. doi: 10.1016/j.trb.2013.09.012.
[26] Federal Highway Administration and U.S. Department of Transportation. "National Performance Management Measures; Assessing Performance of the National Highway System, Freight Movement on the Interstate System, and Congestion Mitigation and Air Quality Improvement Program". In: 82-FR-5970 (Feb. 2017), pp. 5970–6050. url: https://www.federalregister.gov/documents/2017/01/18/2017-00681/national-performance-management-measures-assessing-performance-of-the-national-highway-system.
[27] Jean-David Fermanian. "Goodness-of-fit tests for copulas". In: Journal of Multivariate Analysis 95.1 (2005), pp. 119–152. doi: 10.1016/j.jmva.2004.07.004.
[28] FHWA Office of Operations, Resource Center, HERE, and The Volpe Center. Introduction to the National Performance Management Research Data Set (NPMRDS). 2013. url: http://connectdot.connectsolutions.com/p27329s6h9l/.
[29] FHWA Office of Operations, Resource Center, HERE, and The Volpe Center. Second Quarterly NPMRDS Webinar. 2014. url: https://connectdot.connectsolutions.com/p36vxid1rr5/.
[30] FHWA Office of Operations, Resource Center, HERE, and The Volpe Center. Third Quarterly NPMRDS Webinar. 2014. url: https://connectdot.connectsolutions.com/p1ubotswuel/.
[31] FHWA Office of Performance Measurement Program. Urban Congestion Report: Documentation and Definitions. 2014. url: http://www.ops.fhwa.dot.gov/perf_measurement/ucr/documentation.htm.
[32] Gaetano Fusco, Chiara Colombaroni, and Natalia Isaenko. "Short-term speed predictions exploiting big data on large urban road networks". In: Transportation Research Part C: Emerging Technologies 73 (2016), pp. 183–201. doi: 10.1016/j.trc.2016.10.019.
[33] Christian Genest and Anne-Catherine Favre. "Everything you always wanted to know about copula modeling but were afraid to ask". In: Journal of Hydrologic Engineering 12.4 (2007), pp. 347–368. doi: 10.1061/(asce)1084-0699(2007)12:4(347).
[34] Christian Genest, Kilani Ghoudi, and L-P Rivest. "A semiparametric estimation procedure of dependence parameters in multivariate families of distributions". In: Biometrika 82.3 (1995), pp. 543–552. doi: 10.2307/2337532.
[35] Christian Genest, Wanling Huang, and Jean-Marie Dufour. "A regularized goodness-of-fit test for copulas". In: Journal de la Société Française de Statistique 154.1 (2013), pp. 64–77.
[36] Christian Genest and Johanna Nešlehová. "A primer on copulas for count data". In: ASTIN Bulletin: The Journal of the IAA 37.2 (2007), pp. 475–515. doi: 10.2143/ast.37.2.2024077.
[37] Christian Genest, Johanna G Nešlehová, Bruno Rémillard, et al. "On the empirical multilinear copula process for count data". In: Bernoulli 20.3 (2014), pp. 1344–1371. doi: 10.3150/13-bej524.
[38] Christian Genest, Johanna Nešlehová, and Martin Ruppert.
"Discussion: Statistical models and methods for dependence in insurance data". In: Journal of the Korean Statistical Society 40.2 (2011), pp. 141–148. doi: 10.1016/j.jkss.2011.03.004.
[39] Christian Genest and Bruno Rémillard. "Validity of the parametric bootstrap for goodness-of-fit testing in semiparametric models". In: Annales de l'Institut Henri Poincaré, Probabilités et Statistiques. Vol. 44. 6. Institut Henri Poincaré. 2008, pp. 1096–1127. doi: 10.1214/07-aihp148.
[40] Christian Genest, Bruno Rémillard, and David Beaudoin. "Goodness-of-fit tests for copulas: A review and a power study". In: Insurance: Mathematics and Economics 44.2 (2009), pp. 199–213. doi: 10.1016/j.insmatheco.2007.10.005.
[41] Malay Ghosh and Parthasarathi Lahiri. "A hierarchical Bayes approach to small area estimation with auxiliary information". In: Bayesian Analysis in Statistics and Econometrics. Springer, 1992, pp. 107–125. doi: 10.1007/978-1-4612-2944-5_6.
[42] Jean-François Girres and Guillaume Touya. "Quality assessment of the French OpenStreetMap dataset". In: Transactions in GIS 14.4 (2010), pp. 435–459. doi: 10.1111/j.1467-9671.2010.01203.x.
[43] Jessica Guo and Chandra Bhat. "Population synthesis for microsimulating travel behavior". In: Transportation Research Record 2014 (2007), pp. 92–101. doi: 10.3141/2014-12.
[44] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. Vol. 14. 2. ACM, 1984. doi: 10.1145/602264.602266.
[45] Mordechai Haklay. "How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets". In: Environment and Planning B: Planning and Design 37.4 (2010), pp. 682–703. doi: 10.1068/b35097.
[46] Mohammad M Hamed, Hashem R Al-Masaeid, and Zahi M Bani Said. "Short-term prediction of traffic volume in urban arterials". In: Journal of Transportation Engineering 121.3 (1995), pp. 249–254. doi: 10.1061/(asce)0733-947x(1995)121:3(249).
[47] Paula J Hammond. "The 2012 Congestion Report". In: WSDOT's comprehensive annual analysis of state highway system performance (2012).
[48] Sam Hawala and Partha Lahiri. "Variance modeling in the U.S. small area income and poverty estimates program for the American community survey". In: Proceedings of the American Statistical Association, Bayesian Statistical Science Section, Survey Research Methods Section. 2010, pp. 4655–4663.
[49] Marius Hofert, Ivan Kojadinovic, Martin Maechler, and Jun Yan. Copula: Multivariate dependence with copulas, R package version 0.999-18. 2018. url: https://CRAN.R-project.org/package=copula.
[50] Sherif Ishak and Haitham Al-Deek. "Performance evaluation of short-term time-series traffic prediction model". In: Journal of Transportation Engineering 128.6 (2002), pp. 490–498. doi: 10.1061/(asce)0733-947x(2002)128:6(490).
[51] Mahmoud Javanmardi, Joshua Auld, and K Mohammadian. "Integration of TRANSIMS with the ADAPTS Activity-Based Model". In: 4th Conference on Innovations in Travel Modeling. Tampa, Florida, U.S., 2011.
[52] Erik Jenelius and Erik Koutsopoulos. "Urban network travel time prediction based on a probabilistic principal component analysis model of probe data". In: IEEE Transactions on Intelligent Transportation Systems (2017). doi: 10.1109/tits.2017.2703652.
[53] Byungduk Jeong, Wonjoon Lee, Deok-Soo Kim, and Hayong Shin. "Copula-Based Approach to Synthetic Population Generation". In: PLoS ONE 11.8 (2016), e0159496. doi: 10.1371/journal.pone.0159496.
[54] Jiming Jiang and P Lahiri.
"Mixed model prediction and small area estimation". In: Test 15.1 (2006), pp. 1–96. doi: 10.1007/bf02595419.
[55] Harry Joe. Multivariate Models and Multivariate Dependence Concepts. Chapman and Hall/CRC, 1997. doi: 10.1201/b13150.
[56] Harry Joe. "Asymptotic efficiency of the two-stage estimation method for copula-based models". In: Journal of Multivariate Analysis 94.2 (2005), pp. 401–419. doi: 10.1016/j.jmva.2004.06.003.
[57] Harry Joe. Dependence Modeling with Copulas. CRC Press, 2014. doi: 10.1201/b17116.
[58] Yiannis Kamarianakis and Poulicos Prastacos. "Space–time modeling of traffic flow". In: Computers & Geosciences 31.2 (2005), pp. 119–133. doi: 10.1016/j.cageo.2004.05.012.
[59] Shih-Chieh Kao, Hoe Kim, Cheng Liu, Xiaohui Cui, and Budhendra Bhaduri. "Dependence-Preserving Approach to Synthesizing Household Characteristics". In: Transportation Research Record 2302 (2012), pp. 192–200. doi: 10.3141/2302-21.
[60] Kartik Kaushik, Cinzia Cirillo, and Fabian Bastin. "On Modelling Human Population Characteristics with Copulas". In: Procedia Computer Science 151 (2019), pp. 210–217.
[61] Kartik Kaushik, Elham Sharifi, and Stanley Ernest Young. "Computing Performance Measures with National Performance Management Research Data Set". In: Transportation Research Record: Journal of the Transportation Research Board 2529 (2015), pp. 10–26. doi: 10.3141/2529-02.
[62] Kartik Kaushik, Elham Sharifi, Stanley Ernest Young, and Babak Baghaei. "Comparison of National Performance Management Research Data Set (NPMRDS) with Bluetooth Traffic Monitoring (BTM) data and I-95 Corridor Coalition Vehicle Probe Project (VPP) data". In: 21st World Congress on Intelligent Transport Systems. Detroit, Sept. 8, 2014.
[63] Kartik Kaushik, Eric Wood, and Jeffrey Gonder. "Coupled Approximation of US Driving Speed and Volume Statistics using Spatial Conflation and Temporal Disaggregation". In: Transportation Research Record 2672.43 (2018), pp. 1–11.
[64] Gunky Kim, Mervyn J Silvapulle, and Paramsothy Silvapulle. "Comparison of semiparametric and parametric methods for estimating copulas". In: Computational Statistics & Data Analysis 51.6 (2007), pp. 2836–2850. doi: 10.1016/j.csda.2006.10.009.
[65] Howard R Kirby, Susan M Watson, and Mark S Dougherty. "Should we use neural networks or statistical models for short-term motorway traffic forecasting?" In: International Journal of Forecasting 13.1 (1997), pp. 43–50. doi: 10.1016/s0169-2070(96)00699-1.
[66] Ivan Kojadinovic. "Some copula inference procedures adapted to the presence of ties". In: Computational Statistics & Data Analysis 112 (2017), pp. 24–41. doi: 10.1016/j.csda.2017.02.006.
[67] Ivan Kojadinovic and Jun Yan. "Comparison of three semiparametric methods for estimating dependence parameters in copula models". In: Insurance: Mathematics and Economics 47.1 (2010), pp. 52–63. doi: 10.1016/j.insmatheco.2010.03.008.
[68] Ivan Kojadinovic and Jun Yan. "Modeling multivariate distributions with continuous margins using the copula R package". In: Journal of Statistical Software 34.9 (2010), pp. 1–20. doi: 10.18637/jss.v034.i09.
[69] Ivan Kojadinovic and Jun Yan. "A goodness-of-fit test for multivariate multiparameter copulas based on multiplier central limit theorems". In: Statistics and Computing 21.1 (2011), pp. 17–30. doi: 10.1007/s11222-009-9142-y.
[70] Ivan Kojadinovic, Jun Yan, and Mark Holmes. "Fast large-sample goodness-of-fit tests for copulas". In: Statistica Sinica 21 (2011), pp. 841–871. doi: 10.5705/ss.2011.037a.
[71] Siem Jan Koopman, Rutger Lit, and André Lucas. Intraday stock price dependence using dynamic discrete copula distributions. Tinbergen Institute Discussion Paper 15-037/III/DSF90. Amsterdam and Rotterdam, The Netherlands: Tinbergen Institute, 2015. doi: 10.2139/ssrn.2580840.
[72] Moshe Levin and Yen-Der Tsao. "On forecasting freeway occupancies and volumes (abridgment)". In: Transportation Research Record 773 (1980).
[73] Li Li, Xiaonan Su, Yanwei Wang, Yuetong Lin, Zhiheng Li, and Yuebiao Li. "Robust causal dependence mining in big data network and its application to traffic flow predictions". In: Transportation Research Part C: Emerging Technologies 58 (2015), pp. 292–307. doi: 10.1016/j.trc.2015.03.003.
[74] Benmei Liu, Partha Lahiri, and Graham Kalton. "Hierarchical Bayes modeling of survey-weighted small area proportions". In: Proceedings of the American Statistical Association, Survey Research Section. 2007, pp. 3181–3186.
[75] Lu Ma. "Generating disaggregate population characteristics for input to travel-demand models". PhD thesis. University of Florida, 2011.
[76] Subrat Mahapatra, Matthew Wolniak, and Kaveh F. Sadabadi. 2016 Maryland State Highway Mobility Report. Maryland State Highway Administration, Dec. 2016.
[77] Albert W Marshall and Ingram Olkin. "Families of multivariate distributions". In: Journal of the American Statistical Association 83.403 (1988), pp. 834–841. doi: 10.2307/2289314.
[78] Kirill Müller and Kay W Axhausen. "Population synthesis for microsimulation: State of the art". In: Arbeitsberichte Verkehrs-und Raumplanung 638 (2010).
[79] Chumchoke Nanthawichit, Takashi Nakatsuji, and Hironori Suzuki. "Application of probe-vehicle data for real-time traffic-state estimation and short-term travel-time prediction on a freeway". In: Transportation Research Record: Journal of the Transportation Research Board 1855 (2003), pp. 49–59. doi: 10.3141/1855-06.
[80] Pascal Neis, Dennis Zielstra, and Alexander Zipf. "The street network evolution of crowdsourced maps: OpenStreetMap in Germany 2007–2011". In: Future Internet 4.1 (2011), pp. 1–21. doi: 10.3390/fi4010001.
[81] Roger B Nelsen. An Introduction to Copulas. Springer, 2006. doi: 10.1007/978-1-4757-3076-0.
[82] Office of Highway Information Management. "Highway Performance Monitoring System (HPMS)". In: FHWA-PL-98-045 (1998).
[83] Simon Oh, Young-Ji Byon, Kitae Jang, and Hwasoo Yeo. "Short-term travel-time prediction on highway: a review of the data-driven approach". In: Transport Reviews 35.1 (2015), pp. 4–32. doi: 10.1080/01441647.2014.992496.
[84] Ostap Okhrin, Alexander Ristig, and Ya-Fei Xu. "Copulae in high dimensions: an introduction". In: ed. by Wolfgang Karl Härdle, Cathy Yi-Hsuan Chen, and Ludger Overbeck. Third edition. Berlin, Germany: Springer, 2017. Chap. 13, pp. 247–277. doi: 10.1007/978-3-662-54486-0_13.
[85] Iwao Okutani and Yorgos J Stephanedes. "Dynamic prediction of traffic volume through Kalman filtering theory". In: Transportation Research Part B: Methodological 18.1 (1984), pp. 1–11. doi: 10.1016/0191-2615(84)90002-x.
[86] Stan Openshaw and Liang Rao. "Algorithms for reengineering 1991 Census geography". In: Environment and Planning A 27.3 (1995), pp. 425–446. doi: 10.1068/a270425.
[87] Elliott Irving Organick. "A Fortran IV Primer". In: (1966). doi: 10.2307/2311791.
[88] Darshan Mukund Pandit, Kartik Kaushik, and Cinzia Cirillo. "Coupling National Performance Management Research Data Set and the Highway Performance Monitoring System Datasets on a Geospatial Level".
In: Transportation Research Record (2019), p. 0361198119838983.
[89] R Pappadà, F Durante, and G Salvadori. "Quantification of the environmental structural risk with spoiling ties: is randomization worthwhile?" In: Stochastic Environmental Research and Risk Assessment 31.10 (2017), pp. 2483–2497. doi: 10.1007/s00477-016-1357-9.
[90] Fortunato Pesarin. Multivariate Permutation Tests: With Applications in Biostatistics. Vol. 240. Wiley Chichester, 2001.
[91] Dave Pierce and Dan Murray. "Cost of Congestion to the Trucking Industry". In: (2014).
[92] Abdul Rawoof Pinjari, Naveen Eluru, Rachel B Copperman, Ipek N Sener, Jessica Y Guo, Sivaramakrishnan Srinivasan, and Chandra R Bhat. Activity-based travel-demand analysis for metropolitan areas in Texas: CEMDAP Models, Framework, Software Architecture and Application Results. Tech. rep. 2006.
[93] David R Pritchard and Eric J Miller. "Advances in population synthesis: fitting many attributes per agent and fitting to household and person margins simultaneously". In: Transportation 39.3 (2012), pp. 685–704. doi: 10.1007/s11116-011-9367-4.
[94] David Robert Pritchard. "Synthesizing agents and relationships for land use/transportation modelling". Master thesis. University of Toronto, 2008.
[95] Friedrich Pukelsheim and Bruno Simeone. "On the iterative proportional fitting procedure: Structure of accumulation points and L1-error analysis". In: Preprint Nr. 5/2009 (2009).
[96] Peter Rafferty and Chip Hankley. "National Performance Management Research Data Set (NPMRDS)". In: Wisconsin Traffic Operations and Safety Laboratory 12 (2014).
[97] J. N. K. Rao and Isabel Molina. Small-Area Estimation. Second edition. Wiley, 2015. doi: 10.1002/9781118735855.
[98] Louis-Paul Rivest, François Verret, and Sophie Baillargeon. "Unit level small area estimation with copulas". In: Canadian Journal of Statistics 44.4 (2016), pp. 397–415. doi: 10.1002/cjs.11296.
[99] Louis-Paul Rivest, François Verret, and Sophie Baillargeon. "Estimation of the parameters in copula models for small areas". In: Statistical Society of Canada, Proceedings of the Survey Methods Section. 2015.
[100] Roger P Roess, E Prassas, and William R McShane. Traffic Engineering. Fourth edition. Pearson, Upper Saddle River, NJ, 2011. isbn: 9780136135739. doi: 10.1007/978-1-349-10800-8.
[101] Juan J Ruiz, F Javier Ariza, Manuel A Urena, and Elidia B Blázquez. "Digital map conflation: a review of the process and a proposal for classification". In: International Journal of Geographical Information Science 25.9 (2011), pp. 1439–1466. doi: 10.1080/13658816.2010.519707.
[102] Alan Saalfeld. "Conflation automated map compilation". In: International Journal of Geographical Information System 2.3 (1988), pp. 217–228. doi: 10.1080/02693798808927897.
[103] Paul Salvini and Eric J Miller. "ILUTE: An operational prototype of a comprehensive microsimulation model of urban systems". In: Networks and Spatial Economics 5.2 (2005), pp. 217–234. doi: 10.1007/s11067-005-2630-5.
[104] Joseph L Schafer. Analysis of Incomplete Multivariate Data. CRC Press, 1997. Chap. 9. doi: 10.1201/9781439821862.
[105] D Schrank, B Eisele, T Lomax, and J Bak. Appendix A: Methodology for the 2015 Urban Mobility Scorecard. Tech. rep. Texas Transportation Institute, Texas A&M University, 2015. url: https://static.tti.tamu.edu/tti.tamu.edu/documents/mobility-scorecard-2015-appx-a.pdf.
[106] David Schrank, Bill Eisele, and Tim Lomax. "T.T.I.'s 2012 urban mobility report". In: Texas A&M Transportation Institute.
The Texas A&M University System (2012), p. 4.
[107] Elham Sharifi, Masoud Hamedi, Ali Haghani, and Hadi Sadrsadat. "Analysis of vehicle detection rate for Bluetooth traffic sensors: A case study in Maryland and Delaware". In: 18th World Congress on Intelligent Transport Systems. 2011.
[108] w:en:Radim Baca Skinke. Simple example of an R-tree for 2D rectangles. url: https://en.wikipedia.org/wiki/R-tree#/media/File:R-tree.svg.
[109] A Sklar. "Fonctions de répartition à n dimensions et leurs marges". In: Publications de l'Institut de Statistique de l'Université de Paris (1959).
[110] Brian L Smith and Michael J Demetsky. "Traffic flow forecasting: comparison of modeling approaches". In: Journal of Transportation Engineering 123.4 (1997), pp. 261–266. doi: 10.1061/(asce)0733-947x(1997)123:4(261).
[111] Brian L Smith, Billy M Williams, and R Keith Oswald. "Comparison of parametric and nonparametric models for traffic flow forecasting". In: Transportation Research Part C: Emerging Technologies 10.4 (2002), pp. 303–321. doi: 10.1016/s0968-090x(02)00009-8.
[112] Michael S Smith. "Bayesian approaches to copula modelling". In: Bayesian Theory and Applications. Ed. by Paul Damien, Petros Dellaportas, Nicholas G Polson, and David A Stephens. OUP Oxford, 2011, pp. 336–358. doi: 10.2139/ssrn.1974297.
[113] Michael S Smith and Mohamad A Khaled. "Estimation of copula models with discrete margins via Bayesian data augmentation". In: Journal of the American Statistical Association 107.497 (2012), pp. 290–303. doi: 10.2139/ssrn.1937983.
[114] TOPS Lab. NPMRDS Travel Time Reliability. 2014. url: http://www.arcgis.com/home/item.html?id=7089b0b5870e4505a2f9f175c157563c.
[115] Alexandre Torday. "Simulation-based decision support system for real time traffic management". In: 89th Annual Meeting of the Transportation Research Board. 10-2120. Washington, DC, 2010.
[116] Transportation Research Board and National Academies of Sciences, Engineering, and Medicine. "Cost-effective performance measures for travel time delay, variation, and reliability". In: (2008). doi: 10.17226/14167. url: https://www.nap.edu/catalog/14167/cost-effective-performance-measures-for-travel-time-delay-variation-and-reliability.
[117] Dimitrios I Tselentis, Eleni I Vlahogianni, and Matthew G Karlaftis. "Improving short-term traffic forecasts: to combine models or not to combine?" In: IET Intelligent Transport Systems 9.2 (2014), pp. 193–201. doi: 10.1049/iet-its.2013.0191.
[118] JWC Van Lint. "Online learning solutions for freeway travel time prediction". In: IEEE Transactions on Intelligent Transportation Systems 9.1 (2008), pp. 38–47. doi: 10.1109/tits.2007.915649.
[119] JWC Van Lint and CPIJ Van Hinsbergen. "Short-term traffic and travel time prediction models". In: Artificial Intelligence Applications to Critical Transportation Issues 22 (2012), pp. 22–41.
[120] Thaddeus Vincenty. "Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations". In: Survey Review 23.176 (1975), pp. 88–93. doi: 10.1179/sre.1975.23.176.88.
[121] Eleni I Vlahogianni, John C Golias, and Matthew G Karlaftis. "Short-term traffic forecasting: Overview of objectives and methods". In: Transport Reviews 24.5 (2004), pp. 533–557. doi: 10.1080/0144164042000195072.
[122] Eleni I Vlahogianni, Matthew G Karlaftis, and John C Golias. "Short-term traffic forecasting: Where we are and where we're going". In: Transportation Research Part C: Emerging Technologies 43 (2014), pp. 3–19. doi: 10.1016/j.trc.2014.01.005.
[123] David Voas and Paul Williamson. "An evaluation of the combinatorial optimisation approach to the creation of synthetic microdata". In: Population, Space and Place 6.5 (2000), pp. 349–366. doi: 10.1002/1099-1220(200009/10)6:5<349::aid-ijpg196>3.0.co;2-5.
[124] P Vythoulkas. "Alternative approaches to short term traffic forecasting for use in driver information systems". In: Transportation and Traffic Theory 12 (1993), pp. 485–506.
[125] Volker Walter and Dieter Fritsch. "Matching spatial data sets: a statistical approach". In: International Journal of Geographical Information Science 13.5 (1999), pp. 445–473. doi: 10.1080/136588199241157.
[126] Yibing Wang, Markos Papageorgiou, and Albert Messmer. "Real-time freeway traffic state estimation based on extended Kalman filter: Adaptive capabilities and real data testing". In: Transportation Research Part A: Policy and Practice 42.10 (2008), pp. 1340–1358. doi: 10.1016/j.tra.2008.06.001.
[127] J Mark Ware and Christopher B Jones. "Matching and aligning features in overlayed coverages". In: Proceedings of the 6th ACM International Symposium on Advances in Geographic Information Systems. ACM, 1998, pp. 28–33. doi: 10.1145/288692.288699.
[128] Joe Whittaker, Simon Garside, and Karel Lindveld. "Tracking and predicting a network traffic process". In: International Journal of Forecasting 13.1 (1997), pp. 51–61. doi: 10.1016/s0169-2070(96)00700-5.
[129] Billy M Williams and Lester A Hoel. "Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results". In: Journal of Transportation Engineering 129.6 (2003), pp. 664–672. doi: 10.1061/(asce)0733-947x(2003)129:6(664).
[130] Billy Williams, Priya Durvasula, and Donald Brown. "Urban freeway traffic flow prediction: application of seasonal autoregressive integrated moving average and exponential smoothing models". In: Transportation Research Record: Journal of the Transportation Research Board 1644 (1998), pp. 132–141. doi: 10.3141/1644-14.
[131] Paul Williamson, Mark Birkin, and Phil H Rees. "The estimation of population microdata by using data from small area statistics and samples of anonymised records". In: Environment and Planning A 30.5 (1998), pp. 785–816. doi: 10.1068/a300785.
[132] Bisheng Yang, Yunfei Zhang, and Xuechen Luan. "A probabilistic relaxation approach for matching road networks". In: International Journal of Geographical Information Science 27.2 (2013), pp. 319–338. doi: 10.1080/13658816.2012.683486.
[133] Fan Yang, Zhaozheng Yin, Henry Liu, and Bin Ran. "Online recursive algorithm for short-term traffic prediction". In: Transportation Research Record: Journal of the Transportation Research Board 1879 (2004), pp. 1–8. doi: 10.3141/1879-01.
[134] Xin Ye, Karthik Konduri, Ram M Pendyala, Bhargava Sana, and Paul Waddell. "A methodology to match distributions of both household and person attributes in the generation of synthetic populations". In: 88th Annual Meeting of the Transportation Research Board. Washington, DC, 2009.
[135] Jiyoun Yeon, Lily Elefteriadou, and Siriphong Lawphongpanich. "Travel time estimation on a freeway using Discrete Time Markov Chains". In: Transportation Research Part B: Methodological 42.4 (2008), pp. 325–338. doi: 10.1016/j.trb.2007.08.005.
[136] Shuxin Yuan and Chuang Tao. "Development of conflation components". In: Proceedings of Geoinformatics, Ann Arbor (1999), pp. 1–13.
[137] Yanru Zhang.
"Uncertainty Associated with Travel Time Prediction: Advanced Volatility Approaches and Ensemble Methods". Dissertation. University of Maryland, 2015.
[138] Yanru Zhang and Ali Haghani. "A gradient boosting method to improve travel time prediction". In: Transportation Research Part C: Emerging Technologies 58 (2015), pp. 308–324. doi: 10.1016/j.trc.2015.02.019.
[139] Yanru Zhang, Yunlong Zhang, and Ali Haghani. "A hybrid short-term traffic flow forecasting method based on spectral analysis and statistical volatility model". In: Transportation Research Part C: Emerging Technologies 43 (2014), pp. 65–78. doi: 10.1016/j.trc.2013.11.011.
[140] Yunlong Zhang. "Special issue on short-term traffic flow forecasting". In: Transportation Research Part C: Emerging Technologies 43 (2014), pp. 1–2. doi: 10.1016/j.trc.2014.05.009.