ABSTRACT

Title of Thesis: DEVELOPING A TOUR-BASED TRIP IDENTIFICATION ALGORITHM USING MOBILE DEVICE LOCATION DATA

Aliakbar Kabiri, Master of Science, 2022

Thesis directed by: Professor Lei Zhang, Department of Civil and Environmental Engineering

This thesis presents a novel trip identification algorithm that supports travel behavior analysis based on mobile device location data. The proposed trip identification algorithm is applied to a large-scale Location-based Service (LBS) dataset consisting of the location points of a large representative sample of United States residents, with over 40 million users in January 2020. First, the proposed framework divides sightings into long-distance and short-distance home-based tours and then identifies the trips on each type of tour using different methods. The derived trips are then validated against the Maryland Statewide Household Travel Survey 2018/2019 and the National Household Travel Survey (NHTS) 2017. The results show that several metrics of the trips from mobile device location data and travel surveys follow similar trends. In addition, the impact of coronavirus disease 2019 (COVID-19) on the travel behavior of the population is studied as a real-world application of the proposed algorithm.

DEVELOPING A TOUR-BASED TRIP IDENTIFICATION ALGORITHM USING MOBILE DEVICE LOCATION DATA

by

Aliakbar Kabiri

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Master of Science
2022

Advisory Committee:
Professor Lei Zhang, Chair
Associate Research Professor Chenfeng Xiong
Professor Erkut Ozbay

© Copyright by Aliakbar Kabiri 2022

Dedication

To the 176 innocent passengers of the Ukraine International Airlines Flight 752.

Acknowledgements

It is my great pleasure to thank my parents, Tavous and Taleb, for their valuable support. The support I have received from them has gotten me to where I am now.
I want to acknowledge my colleagues in our research group, Aref Darzi, Yixuan Pan, Mofeng Yang, and Guangchen Zhao, for their wonderful collaboration. I would particularly like to thank my supervisor at the Maryland Transportation Institute (MTI), Dr. Lei Zhang. You have given me several opportunities to further my research, and I appreciate them. Furthermore, I would like to express my gratitude to Dr. Chenfeng Xiong and Dr. Erkut Ozbay, who served on my thesis advisory committee.

Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Chapter 1: Introduction
    1.1 Background
    1.2 Research Objectives and Contributions
    1.3 Research Outline
Chapter 2: Literature Review
    2.1 Mobile Device Location Data
    2.2 Device Characteristics and Travel Pattern Uniqueness
    2.3 Trip Identification Algorithms
Chapter 3: Datasets
    3.1 Mobile Device Location Data (MDLD)
    3.2 incenTrip Data
    3.3 Household Travel Surveys (MTS and NHTS)
    3.4 American Community Survey (ACS)
Chapter 4: Methodology
    4.1 Geographical Level of Study
    4.2 Home Location Identification
    4.3 Work Location Identification
    4.4 Device Deduplication
    4.5 Tour-based Trip Identification
        4.5.1 Home-based tour identification
        4.5.2 Trip Identification for Short-Distance Tours
        4.5.3 Trip Identification for Long-distance Tours
            4.5.3.1 Tour-ID regeneration
            4.5.3.2 Stop and destination identification
            4.5.3.3 Sub-tour identification
            4.5.3.4 Trip generation
Chapter 5: Results
    5.1 Post-processing steps in the trip-identification algorithm
    5.2 Regional Trip Validation with MTS
    5.3 National Trip Validation with NHTS
    5.4 Advantages of the proposed algorithm to the clustering method
    5.5 Case study: COVID-19 pandemic and travel behavior changes
        5.5.1 Data expansion to the population level
        5.5.2 COVID-19 pandemic and the population travel behavior
Chapter 6: Conclusion and Discussion
    6.1 Thesis summary
    6.2 Discussions and future work
References

List of Tables

Table 1. A sample of MDLD.
Table 2. Geo-hash width and height at different levels.
Table 3. K-anonymity size statistics for devices having the exact home location.
Table 4. Top 10 origin-destination pairs in NHTS and MDLD at the state level.
Table 5. Top 10 origin-destination pairs in NHTS and MDLD at the county level.

List of Figures

Figure 1. Final sampling rate at the county level.
Figure 2. Illustration of a level-7 geo-hash.
Figure 3. Anonymity size variation with different numbers for most-visited locations.
Figure 4. Number of devices having one or more duplicates in the dataset.
Figure 5. Number of devices having common hours in the sample dataset.
Figure 6. Average number of hours observed in the same geo-hash level 7.
Figure 7. Recursive algorithm of trip identification for short-distance tours.
Figure 8. Recursive algorithm of trip identification for long-distance tours.
Figure 9. Tour identification and trip linking demonstration.
Figure 10. An example of a local movement in a mall.
Figure 11. A trip with multiple data jumps.
Figure 12. Trip length distribution comparison between MDLD and MTS survey.
Figure 13. Travel time distribution comparison between MDLD and MTS survey.
Figure 14. Trip start time distribution comparison between MDLD and MTS data.
Figure 15. Trip rate distribution comparison between MDLD and MTS data.
Figure 16. Trip length distribution comparison between MDLD and NHTS data.
Figure 17. Travel time distribution comparison between MDLD and NHTS data.
Figure 18. Trip start time distribution comparison between MDLD and NHTS data.
Figure 19. Trip rate distribution comparison between MDLD and NHTS data.
Figure 20. A trip trajectory that was not captured by the ST-DBSCAN algorithm.
Figure 21. Percentage of people staying home in January and April 2020.
Figure 22. The daily trip rates in January and April 2020.

List of Abbreviations

ATUS        American Time Use Survey
BMC         Baltimore Metropolitan Council
CDR         Call Detail Record
COVID-19    Coronavirus disease 2019
DBSCAN      Density-Based Spatial Clustering of Applications with Noise
DCI         Divide, Conquer and Integrate
GPS         Global Positioning System
ID          Identifier
LBS         Location-based Service
MDLD        Mobile Device Location Data
MTS         Maryland Statewide Household Travel Survey
NHGIS       National Historical Geographic Information System
NHTS        National Household Travel Survey
UTC         Coordinated Universal Time

Chapter 1: Introduction

1.1 Background

A key aspect of transportation planning is the understanding of travel behavior. For many years, travel surveys were the most reliable method to obtain the movement patterns of a population. The National Household Travel Survey (NHTS) and the Maryland Statewide Household Travel Survey (MTS) are just two examples of travel surveys that agencies conducted to collect the travel diaries of a sample of residents for use in studies and to support sound decisions in the transportation field.

Although travel surveys provide many insights into the population's movement patterns, they have some drawbacks. For large-scale studies, it is often impossible to have a high sampling rate and a long study period. Accordingly, in both surveys, a small population sample is surveyed for just a couple of days, and then the observed patterns are expanded to the entire population. This could lead to biases in several aspects, such as differing demographic characteristics of population groups and temporal bias. Moreover, survey respondents occasionally make inaccurate trip reports or even miss a trip entirely.
This may occur because they did not have enough encouragement to provide an accurate or honest report, or simply because they forgot to report a specific trip during the survey period. For this reason, modern methods of data collection and processing should be used in such studies to reveal travel patterns more accurately, with a larger sample size and a longer study period.

As technology has grown and mobile phones have become ubiquitous, a vast portion of the populace has access to devices with Global Positioning System (GPS) receivers. Nowadays, many mobile phone applications store users' locations as latitude-longitude pairs, together with timestamps showing the time the location was reported. This offers a great source of information about the travel patterns of the population. Because these data do not provide trip-level information, several data processing steps and algorithms are required to explore the underlying travel behavior of the people. This study aims to utilize Mobile Device Location Data (MDLD) for this kind of analysis and data processing.

1.2 Research Objectives and Contributions

In this study, an analysis of the movement patterns of the U.S. population was undertaken using mobile device location data from a large sample of mobile device users. We analyzed the raw location data of more than 40 million devices all over the country and developed novel data processing and trip identification algorithms that derive trip-level information for unique devices.

Because multiple data vendors provided data in this study, the same user might be reported with different device identifiers from various sources. Further, electronic devices are more widely available nowadays, so a user might carry two or more devices when traveling, such as a phone, a tablet, and a computer. As a result, multiple trajectories may be generated with different device identifiers (IDs) for the same user.
As part of the data pre-processing steps, a novel deduplication algorithm is developed to avoid overrepresenting a user's travel patterns.

The next step is to develop a trip identification algorithm that uses the raw sightings of mobile device users to derive trip-level information. A vital issue complicating the trip identification process is distinguishing between linked and unlinked trips. Existing trip identification methods identify unlinked trips. As an illustration, a single transit commute trip with more than five minutes of waiting time at the origin and transfer transit stations can be identified as three unlinked trips: a walking trip from home to the origin transit station; a transit trip from the origin transit station to the transfer station; and another transit trip from the transfer station to the destination. Additionally, long-distance trips usually include a longer stop in the middle of the trip than short-distance trips do. This explains why long-distance trips need to be treated differently in trip identification methodologies. As a solution to these issues, a tour-based trip identification method is developed that identifies tours before identifying trips. This approach enables better trip identification, mode imputation, and purpose imputation for further analysis.

1.3 Research Outline

After reviewing the literature on mobile device location datasets, duplicate device identification techniques, and trip identification methods, this study develops deduplication and tour-based trip identification algorithms with several improvements over the current literature to fill the gaps. The developed algorithms are applied to large-scale mobile device location data, and the results are compared with two surveys, one regional and one national. This study also discusses the advantages of the proposed methodology over current trip identification methods based on clusters of sightings.
A real-world use case of the proposed algorithm is demonstrated by comparing the travel behavior of the population before and during the COVID-19 pandemic.

The outline of this thesis is as follows. Chapter 2 offers a comprehensive literature review that covers various types of mobile device location datasets, the process of deduplication, and trip identification algorithms using different criteria. Chapter 3 introduces the primary datasets used in this study, including mobile device location data and surveys. Chapter 4 presents a novel, highly accurate device deduplication algorithm and a tour-based trip identification process that generates trip-level information of individuals from raw sightings. Chapter 5 describes the post-processing steps needed for the derived trips and compares the results with two surveys, one regional and one national: the Maryland Statewide Household Travel Survey (MTS) and the National Household Travel Survey (NHTS). Also, using the multi-level device and trip weighting procedures, the results are scaled to the national level to show how the U.S. population reacted to the COVID-19 pandemic in its early stages. Finally, Chapter 6 summarizes the findings and suggests possible future directions.

Chapter 2: Literature Review

2.1 Mobile Device Location Data

In recent years, mobile device location data has become popular for studying the travel behavior of populations. These datasets mainly include records gathered from call detail records (CDRs), sightings data, GPS-based technology data, or location-based service data. The following are brief descriptions of each of these valuable data sources.

In-vehicle GPS technology reports the location of vehicles every few seconds. Much research incorporates this kind of data into its analyses. For example, Chankaew et al. (2018) analyzed freight traffic using national truck GPS data in Thailand.
CDRs are records produced by a telephone exchange or other telecommunications equipment. These records contain the callers' phone numbers, the starting time of the call, its duration, and other phone call information. These data report the location of the cell towers instead of a user's actual location (Chen et al., 2016). On the other hand, a less frequently used dataset, called sightings, is generated each time the phone is located. Sightings data report the location of the device obtained by triangulation across multiple towers (Chen et al., 2016). Finally, Location-based Service (LBS) data consist of location information recorded by smartphone applications using GPS, cellular towers, Wi-Fi, and other types of connections to track the device's location (Yang, 2020). This kind of data is the basis of this study.

As discussed earlier, travel surveys are one of the primary methods for analyzing human mobility patterns. In recent years, mobile device location data has also been integrated with multiple travel surveys to help capture the unreported trips of the users, find the possible reasons for not reporting such trips, and prepare a solid independent dataset for validating the user-reported trips. For example, the Kansas City Regional Travel Survey conducted in 2003 and 2004 included GPS logging equipment in the vehicles of more than 7% of the households that participated in the survey (Wolf et al., 2004). Forest and Pearson (2005) analyzed a GPS-enhanced travel survey to evaluate the differences between the trips reported by the respondents and the trips captured from GPS devices. They noticed that the number of trips recorded in the GPS data was much greater than the number of trips reported in the travel survey.
Beyond detecting users' trips, mobile device location data can help analysts explore the exact route, the actual travel time, and the travel distance, and even, with the help of rail, ferry, and bus networks, infer the travel mode of trips from GPS traces with high accuracy (Stopher et al., 2008).

As technology improved and GPS devices became smaller and lighter, wearable GPS devices were embedded in travel surveys. Additionally, surveys transitioned from being fully question-based to GPS-based. Studies including Schüssler and Axhausen (2009) examined the accuracy of trip identification and travel mode detection in fully automated surveys without any questionnaire data. In Schüssler and Axhausen (2009), the GPS data of participants who wore GPS devices was collected without any other information. The results were compared with the existing national-level travel survey, showing a good match between the census data and the GPS-based results.

2.2 Device Characteristics and Travel Pattern Uniqueness

Studies on the uniqueness of devices can be classified into two categories: research on the feasibility of using demographic information and research on the feasibility of using spatiotemporal information. Among studies that evaluated demographic information for device uniqueness identification, a study on 1990 census data by Sweeney (2000) showed that 87% of the U.S. population could be identified uniquely using a combination of demographic attributes: gender, date of birth, and 5-digit zip code. Furthermore, nearly 50% of the population can be uniquely identified from their place of residency, gender, and date of birth. Golle (2006) used the 2000 census data to revisit the uniqueness of individuals using the same demographic information and found that the ratio decreased from 87% to 63%.
Other studies, which used the spatiotemporal information of mobile device location data and are more pertinent to this thesis, include research by Trestian et al. (2009). They noted that people spend most of their time in their comfort zone, defined as the top three visited locations. Based on this study, those who stayed in five base stations during a week spent about 90% of their time in the top three locations. Even when a user visited 50 base stations, each covering an area of about 4 square kilometers on average, about 55% of their time was spent in their top three visited locations. This indicates that devices representing the same individual are likely to share the top three visited regions.

Golle and Partridge (2009) showed that about 50% of U.S. workers could be uniquely identified at the census block level using only home and work locations taken from Longitudinal Employer-Household Dynamics (LEHD) data. This study revealed that the median size of the anonymity set of workers in the U.S. at the census block level is one. An anonymity set is a set of individuals that share the same attributes and cannot be distinguished from each other by the available information.

Chow and Mokbel (2011) found that by having the paths of all users and knowing that a particular device was in certain places at certain times, they could identify the device's trajectory. This statement will be used for the deduplication validation in the following sections of this thesis: for two devices to be identical, they must be observed at the same place at any given time.

Zang and Bolot (2011) used CDR data to identify mobile device owners by analyzing their top N locations, ranked by how often they appeared, across different geographical levels such as sectors, cells, and zip codes. A device with fewer top locations would be more challenging to identify. This study analyzed the top one, two, and three locations in terms of frequency of observations.
Based on this research, more than half of the users could be uniquely identified from their top three locations at the cell and sector levels. In addition, the top two locations were analyzed as an interchangeable pair, since users might make more calls from their work location than from their home location in one month and the reverse in another.

Instead of analyzing the top N locations, De Montjoye et al. (2013) investigated the number of random points needed to identify an individual mobility trace. They evaluated the call data of 1.5 million users and found that about 95% of the people could be uniquely identified by four spatiotemporal observations from each device. Furthermore, in the re-identification of individuals, both the spatial and temporal resolution of the devices' location observations are critical.

Human mobility patterns are highly predictable. Song et al. (2010) studied users' trajectories and noted that people tend to spend most of their time in a few locations. According to the researchers, there is a potential for 93% predictability of average mobility, which does not vary much across the population. In another study, Gonzalez et al. (2008) showed that the mobility of individuals is consistent in both spatial and temporal domains and that people tend to return to their preferred locations regularly.

We believe the main research gap in this field is the lack of studies evaluating duplicate devices among mobile device location data provided by different data vendors.

2.3 Trip Identification Algorithms

Trip end identification for high-frequency mobile device location data, such as GPS data, has been well studied and developed. The state-of-the-practice methods utilized by commercial data vendors identify trips from raw location data points as follows:

• Method 1: Consider the time and distance relationship between consecutive location point observations to identify moving points and static points.
Consecutive moving points between two sets of static points form a trip.
• Method 2: Consider zone boundaries to determine movements from one zone to another, which applies to identifying inter-zonal trips only.
• Method 3: Identify location point clusters as activity locations with spatial clustering methods. Location point observations between two consecutive activity locations form a trip.

The traditional way of obtaining accurate trip ends is the rule-based trip end identification method. This type of method designs rules and parameters based on domain knowledge. The trip ends are obtained by applying the rules to every sighting in the location data and, at the same time, examining the relationship between consecutive location points. The parameters used in these rules are defined mainly by domain knowledge and are applied to measures such as dwell time and speed (McGowen and McNally 2007; Gong et al. 2014; Axhausen et al. 2003; Tsui et al. 2006; Bothe and Maat 2009; Stopher et al. 2005; Du and Aultman-Hall 2007; Stopher et al. 2008; Schuessler and Axhausen 2009; Gong et al. 2012; Safi et al. 2015; Patterson et al. 2016).

Dwell time thresholds vary widely across existing studies. Wolf et al. (2001) utilized GPS data to derive the trip diaries of users by applying different time thresholds as a rule to identify trip ends. If a device did not move during the time threshold, a trip end was detected. The best match between the reported and the detected trips was obtained using a 120-second threshold. Tsui and Shalaby (2006) used an activity identification algorithm to find the trip ends from GPS data streams. A time threshold of 120 seconds was used as the primary criterion of activity identification. In the case of signal loss, other measures were included. For example, if the signal loss lasted between 120 and 600 seconds and the user moved within a distance as short as 50 meters, it was considered a short-duration indoor activity. Stopher et al. (2005) defined a trip end whenever the difference in consecutive latitude and longitude values is less than 0.000051 degrees, the heading is unchanged or zero, and the speed is zero, with these conditions holding for 120 seconds or longer. It is worth noting that this paper was written in Australia, and, as the authors noted, most traffic lights in Australia have a red cycle of fewer than 2 minutes. Stopher et al. (2005) used a 3-minute threshold to determine the trip ends of the GPS data. Axhausen et al. (2004) utilized the Trip Identification and Analysis System (TIAS). According to the model, points with dwell times greater than five minutes are considered trip ends that can be identified from GPS data with confidence.

Some research considered a speed threshold of zero or a value near zero as a measure to capture static clusters and trip ends (Wolf et al., 2001; Tsui and Shalaby, 2006; Schuessler and Axhausen, 2009). Moreover, researchers leveraged supervised machine learning methods, which classify each location point as static or moving, to supplement the rule-based methods (Gong et al., 2015; Zhou et al., 2016; Gong et al., 2018). Different clustering methods were also applied to obtain trip ends by first identifying people's activity locations from the location data (Zhou et al., 2007; Chen et al., 2014; Ye et al., 2009; Yao et al., 2019). A recent study utilized a spatiotemporal clustering method with three combined optimization models to detect trip ends (Yao et al., 2019).

There is also a particular focus on deriving trip ends from LBS data. A "Divide, Conquer and Integrate" (DCI) framework was proposed to process the LBS data and extract mobility patterns in the Puget Sound region (Wang et al., 2019). The proposed framework combined a rule-based and an incremental clustering method to handle the bi-modally distributed LBS data.
The results were aggregated at the census tract level and compared with household travel surveys (Wang et al., 2019).

Chapter 3: Datasets

3.1 Mobile Device Location Data (MDLD)

The primary dataset used in this thesis is the mobile device location data (MDLD) collected by multiple leading data vendors. This dataset contains the spatial and temporal information of the users, including a random hashed device identifier, the latitude and longitude of location points, the time at which the location of the user was collected, given as a timestamp, the accuracy of the sightings in meters as reported by the data provider, and the Coordinated Universal Time (UTC) offset that relates the UTC time of each sighting to the local time. Table 1 shows a sample of the mobile device location data. Due to privacy concerns, noise has been applied to all the entries.

Table 1. A sample of MDLD.

Device-ID      UTC timestamp   Latitude   Longitude   Accuracy   UTC offset
Sfbcx-223da    1578010770      38.9924    -76.9293    2          -14400
Sfbcx-223da    1578010775      38.9802    -76.9190    5          -14400
Sfbcx-223da    1578010778      38.9605    -76.9201    3          -14400
Rjckf-2421s    1578010500      38.7069    -76.8985    11         -14400

Figure 1 shows the sampling rate of the mobile device location data that remains after the data processing steps described in Chapter 4, such as data cleaning, device deduplication, and keeping only devices with home locations. 92% and 90% of the counties have a sampling rate of more than 5% and 10%, respectively. These numbers indicate the great value of MDLD for the analysis of travel patterns, compared with surveys, in which the sampling rate is much lower than that of the MDLD used in this study.

Figure 1. Final sampling rate at the county level.

3.2 incenTrip Data

incenTrip (incentrip.org) was developed by the National Transportation Center (NTC) at the University of Maryland (UMD) for the "Integrated, Personalized, Real-time Traveler Information and Incentive" (iPretii) project, funded by the U.S. Department of Energy's (DOE) Advanced Research Projects Agency-Energy (ARPA-E). This application uses the users' location information to suggest the best transit option and incentivize them to use the transit network instead of a private car for their travels (Mofeng, 2020). The data is collected and stored in compliance with data privacy protection requirements. Because of these similarities, this dataset is similar in nature to the MDLD gathered from the leading data vendors in the U.S. but covers only the Washington Metropolitan Area (DMV) and the Baltimore Metropolitan Council area.

3.3 Household Travel Surveys (MTS and NHTS)

Two travel surveys are used as part of the validation of the proposed trip identification algorithm. One of the ground truth datasets is the Maryland Statewide Household Travel Survey (MTS), a Baltimore Metropolitan Council (BMC) survey that captures the daily travel patterns of the people living in a set of counties in Maryland. This survey was conducted between April 2018 and May 2019. The survey data was collected from 7,500 households in counties including Allegany, Anne Arundel, Baltimore, Caroline, Carroll, Cecil, Dorchester, Garrett, Harford, Howard, Kent, Queen Anne's, Somerset, Talbot, Washington, Wicomico, Worcester, and Baltimore City. Residents were asked what trips they made over a specific weekday for work, school, shopping, etc.

The other ground truth dataset is the National Household Travel Survey (NHTS) 2017. The MTS focuses on a specific geographic area, while the NHTS covers all 50 states and the District of Columbia. The survey was conducted between March 2016 and May 2017 on weekdays and weekends, including holidays.

3.4 American Community Survey (ACS)

The American Community Survey (ACS) is an ongoing survey conducted every year to provide helpful information about the U.S. population, such as demographics, for regions including states, counties, and smaller geographical areas.
In this research, the population of each county is used to calculate the sampling rate in each county of the fifty states and the District of Columbia. The survey data is provided by the National Historical Geographic Information System (NHGIS).

Chapter 4: Methodology

4.1 Geographical Level of Study

The geographical level of study for users' home locations and visited locations is geo-hash level 7. Geo-hashes are unique identifiers of specific zones on the earth, and their width and height depend on the geo-hash level. Table 2 shows the size of geo-hashes, from the largest to the smallest.

Table 2. Geo-hash width and height at different levels.

Level of geo-hash   Width × height
1                   5,009.4 km × 4,992.6 km
2                   1,252.3 km × 624.1 km
3                   156.5 km × 156 km
4                   39.1 km × 19.5 km
5                   4.9 km × 4.9 km
6                   1.2 km × 609.4 m
7                   152.9 m × 152.4 m
8                   38.2 m × 19 m
9                   4.8 m × 4.8 m
10                  1.2 m × 59.5 cm
11                  149 mm × 149 mm
12                  37.2 mm × 18.6 mm

Figure 2 illustrates the level of study for home location imputation and visited locations. The blue rectangle is a level-7 geo-hash denoted by the unique identifier "dqcmc4p" with a size of 152.9 m × 152.4 m.

Figure 2. Illustration of a level-7 geo-hash.

4.2 Home Location Identification

Several sections of this study require users' home locations, which the mobile device location data does not provide. However, the data contains enough information to derive them. Home locations are derived from the locations that a device visited during a period, and they are reported as geo-hash level 7 zones. The methodology is as follows.

First, the sightings of a device are aggregated at geo-hash level 6 zones, and geo-hashes that meet the following criteria are considered possible home locations and are kept for further evaluation:

• The geo-hash must have been observed on at least
  \[ \min\left\{3,\ \left\lceil \frac{\text{number of observed days in a month}}{2} \right\rceil + 1\right\} \]
  days in a month.
• The geo-hash is observed for an average of more than 2 hours on the days that it has sightings.

The remaining geo-hashes are sorted based on the number of observed days in a month, the average daily number of observed hours on observed days, and the average number of hourly sightings in observed hours, and the top three zones are picked. Next, these top three geo-hashes are sorted by the observed number of nights, the average daily number of observed nighttime hours, and the average number of hourly sightings during nighttime. The top level-6 geo-hash is identified as the home location since people tend to spend most of their nighttime at home. Finally, these two steps are repeated on all the level-7 geo-hashes within the identified level-6 home location, and the top geo-hash is selected as the home location. This step provides more precise information about the home location.

It is worth noting that nighttime is defined as 9 p.m. to 6 a.m. based on the American Time Use Survey (ATUS). ATUS showed that nearly 80% of the population that works full-time or part-time visited their home location during this time interval.

MDLD provides information on a massive number of devices, but many of them do not provide sufficient information. For example, there might be a device with fewer than ten observations during a month. Therefore, after identifying home locations, only devices with a minimum quality are kept in the dataset.

4.3 Work Location Identification

In addition to the home location, the work location is needed when identifying a trip. Work location identification follows the same structure as home location identification. Workplace candidates are selected based on a visiting frequency of at least three workdays, or half of the total observed workdays for each device, and an average duration of at least two hours during the daytime on workdays.
Furthermore, a term called "temporal similarity" is introduced to avoid possible misidentification of the work location. The temporal similarity ratio ensures that the work location is somewhere different from the home location. For each workplace candidate, the temporal similarity ratio is defined as the ratio between the number of hours when the device was observed both at home and at the workplace candidate and the total number of hours when the device was observed at the workplace candidate. A threshold of 0.6 has been selected. A measurement of this kind is performed because, in some cases, a device's home location might not be far enough from the boundary of a geo-hash. This results in frequent observations of a device in two geo-hashes next to each other, which increases the possibility of identifying the geo-hash next to the home location as the work location. Assuming a user spends some time at work before returning home, the user's home and workplace should not regularly be observed at the same time.

4.4 Device Deduplication

Following the literature discussed in Chapter 2, a deduplication algorithm is developed to identify different device IDs that represent the same user in the integrated dataset coming from different data vendors. The algorithm is based on both demographic information, in this case the imputed home location of devices, and the travel movement pattern, represented by the top N locations visited during a month. In this study, devices that have the same home location and the same top-five most-visited locations are considered to be in the same k-anonymity group and, as such, are flagged as duplicate devices. These devices represent the same user but have different device identifiers.

Geo-hashes with at least one sighting over a month are considered visited locations of a device. To determine the top locations visited by a device, all the visited geo-hashes are sorted based on the number of unique hours and the number of sightings during a month.
A unique hour is a time interval of one hour during which a specific device was observed. For example, two observations at 11:45 a.m. and 11:12 a.m. on January 4 and January 6, 2020, are considered two unique hours during a month. Next, the top N geo-hashes are chosen as the most-visited locations of a device for the deduplication process. Devices that do not have top N geo-hashes, meaning they are observed in N-1 or fewer geo-hashes during the month, are removed since the information about their trajectories is insufficient.

Table 3 summarizes the minimum, average, and maximum values of anonymity group sizes for devices in the dataset. As the number of top-visited locations increases, fewer devices are associated with the same anonymity group.

Table 3. K-anonymity size statistics for devices having the same home location.

Number of the top-visited locations (N)   Min   Mean   Max
1                                         1     4.23   20514
2                                         1     1.82   323
3                                         1     1.48   164
4                                         1     1.34   72
5                                         1     1.25   52
6                                         1     1.19   50
7                                         1     1.15   14
8                                         1     1.12   7

Figure 3. Anonymity size variation with different numbers for most-visited locations.

Figure 3 shows the percentiles of anonymity group sizes for different numbers of most-visited locations (N). For example, with N equal to five, more than 80% of anonymity groups contain only a single device, and less than 20% have two or more devices in the same anonymity group.

As noted previously, the objective is to treat all devices belonging to the same anonymity group as representing the same individual since they share the top N most-visited locations and the home location. The next step is to determine a reasonable value for N. Based on Table 3 and Figure 3, as the value of N increases, fewer devices share the same most-visited locations, and consequently, fewer devices are identified as duplicates. A validation method is conducted to ensure the proper value of N is selected.
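The grouping rule described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the input layout (a dictionary of per-geo-hash unique-hour and sighting counts), and the choice to compare the top-N locations as an unordered set are all assumptions made for illustration.

```python
from collections import defaultdict


def top_n_geohashes(visits, n=5):
    """Return the device's top-n visited geo-hashes as a frozenset, or None.

    visits maps geo-hash -> (unique_hours, sightings) for one month
    (hypothetical layout). Geo-hashes are ranked by unique hours, then by
    sightings; the geo-hash string is only a deterministic tie-breaker.
    """
    if len(visits) < n:
        return None  # fewer than n visited geo-hashes: trajectory too sparse
    ranked = sorted(visits, key=lambda g: (-visits[g][0], -visits[g][1], g))
    # Assumption: the order of the top-n locations is ignored when comparing.
    return frozenset(ranked[:n])


def group_duplicates(devices, n=5):
    """Group device IDs that share a home geo-hash and the same top-n set.

    devices maps device_id -> (home_geohash, visits).
    """
    groups = defaultdict(list)
    for device_id, (home, visits) in devices.items():
        top = top_n_geohashes(visits, n)
        if top is not None:
            groups[(home, top)].append(device_id)
    return [sorted(ids) for ids in groups.values()]
```

Groups with more than one device ID correspond to the flagged duplicates; devices without N visited geo-hashes are dropped, mirroring the removal rule stated in the text.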
Sharing the top one (N=1) or two (N=2) locations is not sufficient to identify duplicated devices since these two locations are most likely to be the home and work locations of the devices, and it is possible for two different users to live and work together. Thus, more locations need to be considered for the duplicate identifier. Figure 4 shows the number of devices with at least one duplicate for varying values of N. Higher values of N increase the risk of missing duplicate devices because even tiny differences in the number of observed hours can change the top-N locations. Such differences arise because different data vendors may report different numbers of sightings for a device, as different applications capture them.

Figure 4. Number of devices having one or more duplicates in the dataset.

If two devices are duplicates, meaning they represent the same user in the dataset, they must be in the same location at the same time. This fact is the basis for the validation of the duplicate identification algorithm. A sample of 100,000 pairs of devices flagged as duplicates is evaluated using the spatial and temporal information of their sightings over 10 days. Their trajectories are evaluated to see if they are at the same location simultaneously. Since different data vendors might report the sightings of the same user at different hours of a day, only the sightings in common hours are compared. Common hours are those one-hour intervals in which both devices had sightings in the integrated dataset. For instance, if two devices both appeared at 4 a.m. on January 1, 2020, they must have appeared in the same level-7 geo-hash.

Figure 5 shows the number of paired devices in the random sample that had at least one common hour in the first ten days of January. As the number of most-visited locations increases, the number of devices having sightings in the dataset at the same hour-intervals increases.
For values greater than five, the number of paired devices having common hours does not increase significantly.

Figure 5. Number of devices having common hours in the sample dataset.

Figure 6 shows the average percentage of common hours in which paired devices were observed in the same level-7 geo-hash. As with the previous trend, the percentage increases substantially as N changes from one to five, but for higher values it stays around 99.5% and changes little. This number is a good indicator of being in the same location simultaneously, and using a higher value of N increases the risk of failing to identify two duplicated devices.

Figure 6. Average number of hours observed in the same geo-hash level 7.

Based on the previous discussion, sharing the same home location and the same five most-visited locations is used as the duplicate identifier. Finally, when two devices are labeled as duplicates, all their sightings are integrated, and the same device ID is assigned to them. In this way, a more complete trajectory of the user is presented in the final dataset.

4.5 Tour-based Trip Identification

Due to the absence of trip-level information in the raw mobile device location data, a trip identification algorithm is required to extract this information. This section explains the trip identification algorithm.

4.5.1 Home-based Tour Identification

As the first step, the home-based tours of the devices are derived. A home-based tour is defined as all the sightings between two consecutive appearances of a device at the imputed home location (see section 4.2 for the home location identification algorithm). The algorithm processes the sightings of each day from 4 a.m. to 4 a.m. the next day. This period is called a trip day. It is assumed that individuals are at home at 4 a.m. unless they are on a long-distance trip.
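As a small illustration of the trip-day convention, the sketch below maps a sighting to its trip day. It assumes that the UTC offset is expressed in seconds, as the sample values in Table 1 suggest, and that the 4 a.m. boundary refers to local time; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

TRIP_DAY_START_HOUR = 4  # a trip day runs from 4 a.m. to 4 a.m. the next day


def trip_day(utc_timestamp, utc_offset_seconds):
    """Return the calendar date labelling the trip day of a sighting.

    utc_timestamp is a Unix timestamp in seconds; utc_offset_seconds is the
    offset that converts UTC to the device's local time (assumed in seconds).
    """
    utc_time = datetime.fromtimestamp(utc_timestamp, tz=timezone.utc)
    local_time = utc_time + timedelta(seconds=utc_offset_seconds)
    # Shifting back by four hours makes 4:00 a.m. the start of the day.
    return (local_time - timedelta(hours=TRIP_DAY_START_HOUR)).date()
```

With this convention, a sighting at 3:59 a.m. local time still belongs to the previous trip day, while one at 4:00 a.m. opens a new trip day.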
As a result, the algorithm first checks the first sighting of each user on every trip day. If the sighting is out-of-home and its distance to the home location is shorter than 50 miles, i.e., the device is not on a long-distance tour, an at-home sighting is generated for the user at 4 a.m. on this trip day. Similarly, if the device's last sighting is out-of-home and its distance to the home location is shorter than 50 miles, an at-home sighting is generated for the user at 4 a.m. of the next day. This ensures that users not on a long-distance tour start and end their days at home, and all the home-based tours are complete. Furthermore, every at-home sighting whose previous and next sightings are both out-of-home is duplicated, and the copy is added to the list of the user's sightings. This ensures that every sighting in the dataset belongs to only one tour.

Next, if the first sighting of a user is out-of-home, i.e., the device is on a long-distance tour, a flagged tour ID of "Long-Distance Tour" is assigned to all sightings of the user until the user is seen at home. If the user is never seen at home during the trip day, all observations on the trip day have the same flagged tour ID. Similarly, if the last sighting of a user is out-of-home, i.e., the user is on a long-distance tour at the end of the day, and it does not already have a tour ID, meaning that the user did not start the day on a long-distance tour, a flagged tour ID of "Long-Distance Tour" is assigned to all sightings of the user from the last observation at home to the last observation on that day.

Next, a random tour ID is generated for every out-of-home sighting following an at-home observation. The same tour ID is assigned to all following sightings until the device is again seen at home. Finally, the maximum distance of each tour to the home location is calculated.
If the maximum distance exceeds 50 miles, the tour ID is changed to the flagged tour ID, "Long-Distance Tour." The sightings on a long-distance tour are flagged because their trip identification differs from that of daily local sightings.

At this stage, sightings are separated into two groups: sightings on short-distance tours and sightings on long-distance tours. Short-distance tours go through a daily short-distance trip identification. In contrast, long-distance tours go through a monthly long-distance trip identification.

4.5.2 Trip Identification for Short-Distance Tours

Trips for each short-distance tour are identified using the following steps. The trip identification algorithm assigns a random ID to every trip it identifies. First, all sightings of each user are sorted by time. The location dataset may include many sightings that do not belong to any trip, i.e., stationary sightings. The algorithm assigns "0" as the trip ID to these sightings. For every sighting, the distance, time, and speed between the sighting and its previous and next sightings are computed, where applicable, as the "time from," "time to," "distance from," "distance to," "speed from," and "speed to" variables.

The trip identification algorithm has three thresholds: distance, time, and speed. The speed threshold is used to identify whether a sighting is recorded on the move. The distance and time thresholds are used to identify trip ends. At this step, the algorithm identifies the first sighting with "speed from" ≥ speed threshold. This sighting is on the move, so a random trip ID is generated and assigned to it. All sightings recorded before this point, if they exist, are set to have "0" as their trip ID, meaning that they are stationary sightings. Then, a recursive algorithm discussed in the following paragraphs identifies whether the next sightings are on the same trip and should have the same trip ID.
The recursive algorithm runs on sightings with the same tour ID. It checks every sighting to identify whether it belongs to the same trip as the previous point. If it does, the same trip ID is assigned to it. Otherwise, either a new trip ID is assigned to it (when its "speed from" ≥ speed threshold), meaning it is the starting point of a new trip, or its trip ID is set to "0" (when its "speed from" < speed threshold). Identifying whether a sighting belongs to the same trip as its previous sighting is based on the sighting's "speed to," "distance to," and "time to" attributes. If a device is seen at a point with "distance to" ≥ distance threshold but is not observed to move there ("speed to" < speed threshold), the point does not belong to the same trip as its previous point. When a user is on the move ("speed to" ≥ speed threshold), the sighting belongs to the same trip as its previous sighting; but when the user stops, the algorithm checks the radius and dwell time to identify whether the previous trip ended. If the user stays at the stop (sightings should be closer than the distance threshold) for a period shorter than the time threshold, the sightings still belong to the previous trip. When the dwell time reaches the time threshold, the trip ends, and the subsequent sightings no longer belong to the same trip. The algorithm does this by updating "time from" to be measured from the first observation in the stop.

Figure 7. Recursive algorithm of trip identification for short-distance tours.

If a sighting has a speed greater than 3 mph from the previous sighting, the sighting belongs to the same trip as its previous sighting. If a sighting has a speed lower than 3 mph from the previous sighting and is more than 1000 ft away from the previous sighting, the sighting does not belong to the same trip as its previous sighting.
If the speed to the next sighting is also smaller than 3 mph, the current sighting simply terminates the trip; otherwise, it becomes the start of a new trip. If a sighting has a speed lower than 3 mph from the previous sighting and is within 1000 ft of the previous sighting, the cumulative dwell time for all the consecutive sightings meeting these criteria is computed and checked:

1. If the cumulative dwell time is less than five minutes, the current sighting belongs to the same trip.
2. Otherwise, it terminates the trip if the speed to the next sighting is less than 3 mph or starts a new trip if the speed to the next sighting is more than 3 mph.

4.5.3 Trip Identification for Long-Distance Tours

Due to the nature of long-distance tours, trip identification for long-distance trips is performed differently. While the short-distance trip identification concentrates on daily observations, this section focuses on sightings on long-distance tours over a month. All the sightings with the flagged tour ID of "Long-Distance Tour," which marks them as being on a long-distance tour, are filtered for the entire month. Next, trips are identified by regenerating tour IDs, identifying primary and secondary stops and destinations, assigning sub-tour IDs, and identifying trips on sub-tours. Figure 8 shows how the trip identification algorithm for long-distance tours works. Each stage of the flowchart is described in the following sections.

Figure 8. Recursive algorithm of trip identification for long-distance tours.

4.5.3.1 Tour-ID regeneration

In the tour identification algorithm, long-distance tours were assigned a flagged tour ID of "Long-Distance Tour." At this step, a new random tour ID is assigned to all sightings between two consecutive at-home sightings. The difference from the previous tour identification algorithm is that the previous one was limited to observations within a trip day, so the tour window was limited to one day.
However, this time, a multi-day tour and a multi-day trip can be identified.

4.5.3.2 Stop and destination identification

The recursive trip identification algorithm described in section 4.5.2 is applied to long-distance tours with a time threshold of 30 minutes instead of 5 minutes so that a trip ends only if a user stays somewhere for 30 minutes or more. All the trip ends identified in this way are named "secondary stops."

Primary stops are restricted cases of secondary stops. Primary stops on a long-distance tour are places where users stay and make secondary tours or places in which users stay for a significant amount of time. The spatial resolution of primary stops in our algorithm is geo-hash level 6, a rectangle with a width and height of 1.2 km × 609.4 m. A level-6 geo-hash is identified as a primary stop if it meets any of the following criteria:

1. The duration of stay at the geo-hash is longer than 2 hours, and the device leaves the geo-hash but later returns to it in the current tour.
2. The duration of stay at the geo-hash is longer than 24 hours.
3. The geo-hash is a home location.

Furthermore, the primary destination of a tour is defined as the farthest stop located at least 50 miles away from the home location of a user. At first, primary stops are utilized to find the destination, but if no destination is found, secondary stops are investigated for destination identification.

4.5.3.3 Sub-tour identification

A sub-tour is a segment of a long-distance tour that falls between two primary stops. At this step, every time a user leaves a primary stop, a sub-tour ID is generated. The sub-tour ID is assigned to all the sightings of a device until the device is again seen at a primary stop.

4.5.3.4 Trip generation

If a tour does not have a destination or the destination is the same as the user's work location, a recursive trip identification algorithm with a time threshold of 5 minutes (short-distance trip identification) is applied to all the points of the tour.
On the other hand, if a destination different from the work location is found, a recursive trip identification algorithm with a time threshold of 30 minutes is applied to the sub-tours identified on this tour.

Figure 9 illustrates how the tour-based algorithm produces more accurate trip identification results than the traditional methods. Graphs (a) and (b) show how the tour-based method differentiates actual activity clusters (e.g., home cluster and work cluster) from mid-trip transfer points (e.g., waiting at a transit station). The ability to construct linked trips from unlinked trips based on the tour-based approach leads to a higher consistency between the trips derived from mobile device location data and the trips reported by surveys such as the NHTS.

(a) Multiple Unlinked Person Trips
(b) One Linked Person Home-to-Work Trip
Figure 9. Tour identification and trip linking demonstration.

Chapter 5: Results

The tour-based trip identification is applied to the mobile device location data of January 2020, and the trips on long-distance tours and short-distance tours are derived. To ensure the quality of the reported trips, we investigated the derived trips and added four post-processing steps on top of the trip identification algorithm. The types of treated trips are as follows:

1. Trips with inadequate sightings and information.
2. Trips made within an activity location at a trip end.
3. Trips with high detour factors, i.e., round trips.
4. Trips with high speed between several consecutive trip points.

The treatments applied to these trips are described in section 5.1.

5.1 Post-processing steps in the trip-identification algorithm

Initially, trips with only two sightings are removed due to the inadequacy of trip information. These trips do not provide any information about the route taken by the user and might cause further issues, such as the trip distance being computed as the Euclidean distance between the trip ends rather than the actual traveled distance.
Secondly, trips shorter than 300 meters are removed. This removes short movements made within an activity location while the user's phone records their real-time location. Local movements in an activity location should not be considered trips. Figure 10 shows a mall where a user's phone recorded their sightings, and the trip identification algorithm reported these local movements as a trip with 68 sightings. Noise has been added to the sightings for privacy protection. This trip is removed by the predefined distance threshold.

Figure 10. An example of a local movement in a mall.

Thirdly, when a trip with a significant detour factor is observed, the trajectory of the trip is further investigated to break down the single trip into multiple unlinked trips. The detour factor is calculated using the following formula:

\[ \text{Trip detour factor} = \frac{\text{Cumulative distance between all the sightings of a trip}}{\text{Euclidean distance between the two trip ends of a trip}} \]

When the detour factor of a trip is higher than five, the trip is divided into two segments: the first trip runs from the user's starting point to the farthest point they traveled, and the second trip runs from the farthest point to the ending point.

Lastly, trips with several consecutive data points with high speed are removed. Due to possible inaccuracy in GPS data collection, data jumps can occur. When a GPS device is among several tall buildings, underground, or in a tunnel, location data collection may not work well. Studies have used speed thresholds to remove jumps from datasets (Thiagarajan et al., 2009). Figure 11 shows a trip having multiple jumps in the reported sightings. A user made a trip from A to D (the red lines), but the GPS signal randomly reported the user's sightings in distant locations within seconds. Data jumps caused the actual trip to be reported as trips from A to B, B to A, A to C, C to B, and B to D in short periods.
To avoid such inaccurate trips, those with multiple jumps in sightings are removed. Specifically, trips in which 20% or more of the sightings have a speed of 500 meters per second or higher are removed. A high value is chosen to ensure no air trip is removed from the data since air trips are likely to have a low number of sightings and a high speed between consecutive points.

Figure 11. A trip with multiple data jumps.

5.2 Regional Trip Validation with MTS

The first validation step compares trip-level results with a regional travel survey, the BMC Maryland Statewide Household Travel Survey (MTS). Since the MTS only reported trips taken by the residents of the counties mentioned in section 3.3, MDLD users were also filtered to those whose home locations were in these counties. All the trips generated during January 2020 are filtered and validated against the weighted trips reported in the MTS.

Figure 12 and Figure 13 compare the length and the duration of the trips made by residents of the selected area in MDLD and MTS. The overall distribution is similar between both datasets, while MDLD reports more long-distance and fewer short-distance trips.

Figure 12. Trip length distribution comparison between MDLD and MTS survey.

Figure 13. Travel time distribution comparison between MDLD and MTS survey.

The distribution of the trip start time in MDLD is validated against the MTS, as shown in Figure 14. The overall distribution of MDLD trips is similar to that of the travel survey, while the MDLD trips show a flatter shape with a smaller morning peak and slightly more trips during the night.

Figure 14. Trip start time distribution comparison between MDLD and MTS data.

Figure 15 compares trip rates in the MTS and the MDLD data on a weekday for the people who made at least one trip. As expected, many users with only one trip are reported from the proposed trip identification algorithm.
This observation can be explained by the fact that MDLD does not capture the entire itinerary of a user during a trip day, while a survey records all the trips of a respondent during a day, to the best of the respondent's knowledge. This leads to an overall underestimation of the number of trips made by the users.

Figure 15. Trip rate distribution comparison between MDLD and MTS data.

5.3 National Trip Validation with NHTS

As the next step of the validation process, the derived trips are validated at the national level by comparing them with the National Household Travel Survey (NHTS) 2017. Similar to the validation based on the MTS, trip length, travel time, trip start time, and trip rate distributions are plotted in Figures 16-19. Again, more long-distance trips and fewer short-distance trips are reported. Several factors may contribute to these discrepancies. First, the method for calculating the survey trip length differs from the distance calculated from the MDLD trips. NHTS used the shortest network path distance generated by the Google API, while the MDLD trip lengths are the cumulative distance between all the trip sightings. Moreover, considering the nature of the MDLD collection, longer trips are more likely to be captured by the data collectors. Sampling biases could be another reason. Both the morning and evening peaks are captured, although the morning peak in the MDLD data is not as sharp as in the NHTS, and a flatter distribution is observed.

Figure 16. Trip length distribution comparison between MDLD and NHTS data.

Figure 17. Travel time distribution comparison between MDLD and NHTS data.

Figure 18. Trip start time distribution comparison between MDLD and NHTS data.

Figure 19. Trip rate distribution comparison between MDLD and NHTS data.

Furthermore, the top origin-destination (OD) pairs of the derived trips in MDLD are compared to the NHTS 2017. The rank-rank correlation between OD pairs in NHTS and the corresponding OD pairs in MDLD at the state level is 0.955.
This represents a good match between the ranks of the OD pairs in both datasets. Tables 4 and 5 show the top 10 origin-destination pairs in NHTS and the corresponding rankings of the MDLD OD pairs at the state and county levels, respectively. At the state level, the top ten OD pairs observed in NHTS are also the top ten in MDLD, and the rankings are highly similar. At the county level, seven of the top ten OD pairs in NHTS also appear in the MDLD top ten. The rank difference is large only for New York County. This observation is influenced mainly by the fact that a significant number of trips made in Manhattan are underground trips that are hardly captured by mobile device location data since the devices do not have a continuous active connection to GPS signals.

Table 4. Top 10 origin-destination pairs in NHTS and MDLD at the state level.

Origin State     Destination State   NHTS ranking   MDLD ranking
California       California          1              3
Texas            Texas               2              1
New York         New York            3              4
Florida          Florida             4              2
Illinois         Illinois            5              7
Ohio             Ohio                6              6
Pennsylvania     Pennsylvania        7              9
Michigan         Michigan            8              10
North Carolina   North Carolina      9              8
Georgia          Georgia             10             5

Table 5. Top 10 origin-destination pairs in NHTS and MDLD at the county level.

Origin County     Destination County   NHTS ranking   MDLD ranking
Los Angeles, CA   Los Angeles, CA      1              1
Cook, IL          Cook, IL             2              4
Maricopa, AZ      Maricopa, AZ         3              3
Harris, TX        Harris, TX           4              2
San Diego, CA     San Diego, CA        5              10
New York, NY      New York, NY         6              140
Orange, CA        Orange, CA           7              8
Dallas, TX        Dallas, TX           8              7
Clark, NV         Clark, NV            9              11
King, WA          King, WA             10             19

5.4 Advantages of the proposed algorithm to the clustering method

Clustering algorithms are one of the ways to identify trip ends from mobile device location data, as discussed in Chapter 2. Yang et al.
(2021) applied Spatiotemporal Density-Based Spatial Clustering of Applications with Noise, ST-DBSCAN (Birant and Kut, 2017), to the mobile device location data of "incenTrip," an application that collects the location points of its users, to derive trip ends. They assigned all the sightings between two activity stops to a trip. The tour-based trip identification algorithm is applied to the same dataset, and its trips are compared with the trips from the clustering algorithm.

The clustering algorithm requires a device to have a minimum number of sightings at an activity location within a small geographical area in a short period. Due to this limitation, there are several cases in which a trip could not be captured since no trip end was identified. Either one or both of the trip ends failed to form an activity cluster, leading to the trip not being detected. Figure 20 is an example of a trip with 119 sightings that is not captured by the ST-DBSCAN algorithm, while the tour-based trip identification captured it as a 38-mile, 66-minute trip.

Figure 20. A trip trajectory that was not captured by the ST-DBSCAN algorithm. The trip origin, destination, and real-time observations are not shown precisely due to privacy protection.

Furthermore, clustering algorithms are costly and computationally complex. Thus, these algorithms may not scale to large datasets such as the one in this analysis, which contains the sightings of more than 45 million users in one month.

5.5 Case study: COVID-19 pandemic and travel behavior changes

This section examines a real-world case study using the proposed trip identification algorithm. First, the sample is weighted to the entire population. Then two mobility metrics, the trip rate and the percentage of people staying home, are compared between January 2020, a base month before COVID-19 had spread, and April 2020, a month when COVID-19 cases first rose.
5.5.1 Data expansion to the population level

For studies of travel movements at aggregate levels rather than individual trips, it is necessary to expand the dataset to the population level. For the reasons described below, a simple multi-level weighting, i.e., device-level and trip-level weighting, is applied to expand the dataset to the population level.

First, the available mobile device location data does not represent the entire population. The average sampling rate for January 2020 is nearly 14%. A county-level device weighting is applied to the dataset to expand it to the population level. Accordingly, the users living in each county are assigned a weight so that the sample reflects the county's population. The device-level weight is calculated as follows:

\[
\text{Device weight} = \frac{\text{Population of the residence county}}{\text{Number of sampled devices residing in the county}}
\]

The five-year (2015-2019) American Community Survey (ACS) is used to estimate the population of the counties. Every user in a county has the same weight for the entire month since the number of residents is calculated monthly.

Furthermore, since mobile device location data do not report the sightings of a device continuously over all 24 hours of a day, it is possible to miss a portion of a person's trip diary. As a result, the trips determined from mobile device location data may differ from the person's actual trips. This issue is addressed by trip-level weighting. Each trip is weighted so that in January 2020, the MDLD trip rate matches the NHTS 2017 trip rate at the state level. The trip-level weight is a one-time weight calculated only from the devices of January 2020. January is chosen because travel behavior in the following months is likely to differ due to the COVID-19 pandemic. The following formula calculates the trip-level weight of each device in each state:

\[
\text{Trip weight} = \frac{\text{Trip rate of the residents of a state in NHTS 2017}}{\text{Trip rate of the residents of a state in MDLD based on January 2020}}
\]

5.5.2 COVID-19 pandemic and the population travel behavior

There are many real-world applications for the proposed trip identification algorithm. Decision-makers can use the derived trips to explore real-world issues and dilemmas in various directions. The COVID-19 outbreak affected millions of people around the globe, and stay-at-home orders were one of the non-pharmaceutical interventions used by the government to contain the spread of the disease. On March 13, 2020, a national emergency declaration was issued to reduce the trip rate of the people and, consequently, the spread of COVID-19. Travel behavior analysis can help policymakers determine how people reacted to the interventions and whether the current strategy for containing the outbreak is effective.

This study measures how people practiced social distancing by calculating two mobility metrics. Day by day, we calculated the percentage of people staying at home and the trip rate of the entire nation in January 2020, a month in which COVID-19 had no effect, and in April 2020, when COVID-19 had spread throughout the entire country, and compared the two months with each other.

The percentage of users staying home on any given day is defined as the proportion of users observed on that day for which the trip identification algorithm detected no trip. The average trip rate of a population is defined as the average number of trips made by all observed users during a given day.

Figure 21 and Figure 22 show the percentage of people staying home and the average trip rate at the national level for January and April 2020.

Figure 21. Percentage of people staying home in January and April 2020.

Figure 22. The daily trip rates in January and April 2020.

First, both figures illustrate that people stay home more on weekends, and Sundays have fewer trips than any other day of the week.
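As a rough illustration of the weighting and the two mobility metrics defined in this section, the following minimal sketch computes county-level device weights, a state-level trip weight, and the weighted daily percentage of people staying home and trip rate. All function names and the toy data are hypothetical and are not taken from the thesis implementation. Applying the device weights to the stay-home percentage and scaling trip counts by the trip weight are assumptions made for this example.

```python
from collections import Counter


def device_weights(county_population, device_home_county):
    """Device-level weight: county population divided by the number of
    sampled devices whose home is in that county."""
    devices_per_county = Counter(device_home_county.values())
    return {
        device: county_population[county] / devices_per_county[county]
        for device, county in device_home_county.items()
    }


def trip_weight(nhts_trip_rate, mdld_trip_rate):
    """State-level trip weight: NHTS 2017 trip rate divided by the
    MDLD trip rate observed in January 2020."""
    return nhts_trip_rate / mdld_trip_rate


def daily_mobility_metrics(trips_by_device, weights, trip_w=1.0):
    """Return (percentage staying home, average trip rate) for one day.

    trips_by_device maps each device observed that day to the number of
    trips identified for it. Weighting both metrics by the device weight
    is an assumption of this sketch.
    """
    total = sum(weights[d] for d in trips_by_device)
    at_home = sum(weights[d] for d, n in trips_by_device.items() if n == 0)
    trips = sum(weights[d] * n * trip_w for d, n in trips_by_device.items())
    return 100.0 * at_home / total, trips / total


# Toy data (hypothetical): two counties, three sampled devices.
county_population = {"county_a": 1000, "county_b": 600}
device_home_county = {"d1": "county_a", "d2": "county_a", "d3": "county_b"}
weights = device_weights(county_population, device_home_county)

trips_today = {"d1": 0, "d2": 4, "d3": 2}
pct_home, trip_rate = daily_mobility_metrics(trips_today, weights)
print(pct_home, trip_rate)
```

Computing the same two numbers for every day of January and April 2020 would give daily series of the kind compared in this section.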
Second, there is a considerable gap between January 2020 and April 2020 in terms of the percentage of people staying home and the average trip rate per person. When the COVID-19 pandemic started spreading across the United States, people practiced social distancing by staying home: the percentage of people staying home rose by roughly 10% or more compared with regular days. Furthermore, the average trip rate of the population decreased by more than one trip per day. The two comparisons indicate that in the early stages of the COVID-19 pandemic, the interventions proposed by the government had an impact on the travel movements of the entire population.

Chapter 6: Conclusion and Discussion

6.1 Thesis summary

This study presents a tour-based trip identification algorithm to extract trip-level information from the location data collected from mobile devices. The MDLD from various data vendors are integrated, and several data cleansing steps are carried out to obtain a solid raw dataset. In the first step, a deduplication algorithm is developed to identify duplicate devices in the integrated dataset, and the sightings of such devices are merged to avoid the overrepresentation of users. In this algorithm, user sightings at a level-7 geohash are examined spatially and temporally. Devices with the same home location and the same top five most visited locations during a month are taken to represent the same user. In addition, the results are validated by checking whether duplicate devices are observed in the same location simultaneously.

Second, using a home-based tour and trip identification algorithm, the trips of more than 45 million users during January 2020 are determined from raw sighting data, which does not provide any trip-level information on its own. The algorithm first determines whether the sightings of a device belong to short-distance or long-distance home-based tours by calculating the distance between the user's home location and the farthest point visited in each tour.
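The tour classification step can be sketched as follows, treating a tour as the run of sightings between two consecutive home visits. This is a minimal illustration and not the thesis implementation: the function names, the home radius of 0.1 miles, and the 50-mile threshold separating short-distance from long-distance tours are all hypothetical values chosen for the example.

```python
import math


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in miles."""
    radius_miles = 3958.8
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_miles * math.asin(math.sqrt(a))


def split_into_tours(sightings, home, home_radius_miles=0.1):
    """Group time-ordered (lat, lon) sightings into home-based tours.

    A tour is the run of away-from-home sightings between two consecutive
    home visits. Runs that never return home are dropped in this sketch.
    """
    tours, current, seen_home = [], [], False
    for lat, lon in sightings:
        at_home = haversine_miles(lat, lon, home[0], home[1]) <= home_radius_miles
        if at_home:
            if current:
                tours.append(current)
            current, seen_home = [], True
        elif seen_home:
            current.append((lat, lon))
    return tours


def classify_tour(tour, home, threshold_miles=50.0):
    """Label a tour by the distance from home to its farthest sighting.
    The 50-mile threshold is an assumed value for illustration only."""
    farthest = max(haversine_miles(lat, lon, home[0], home[1]) for lat, lon in tour)
    return "long-distance" if farthest >= threshold_miles else "short-distance"


# Toy trajectory (hypothetical): a nearby errand, then a far-away visit.
home = (39.0, -77.0)
sightings = [home, (39.05, -77.0), home, (40.0, -77.0), (40.0, -77.0), home]
tours = split_into_tours(sightings, home)
labels = [classify_tour(tour, home) for tour in tours]
print(labels)
```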
A tour is defined as all the sightings of a device between two consecutive visits to their home location. Two different approaches are used to derive the trips of each user: a daily short-distance trip identification algorithm derives trips on short-distance tours, and a monthly long-distance trip identification algorithm determines trips on long-distance tours.

Third, several post-processing steps are taken to address concerns raised by some of the trips. These include trips with inadequate sightings, trips made within trip-end activity locations such as malls or homes, trips with high detour factors, and trips with data jumps.

Fourth, the derived trips are validated against two household travel surveys: the Maryland Statewide Household Travel Survey (MTS) 2018/2019 as a regional travel survey and the National Household Travel Survey (NHTS) 2017 as a national travel survey. The results showed a good match for the trip length distribution, the travel time distribution, the distribution of trip start time in a day, and the trip rate per person distribution, with a couple of discrepancies that are discussed in Chapter 5.

Fifth, the proposed algorithm is applied to a second set of mobile device location data whose trip ends, and consequently trips, had previously been identified by a clustering method. Due to the limitations of clustering methods and the need to form a cluster at both ends of a trip, some trips are not identified by the clustering method. In contrast, the proposed tour-based trip identification algorithm is able to identify them.

Finally, as a real-world application of the trip identification algorithm, the effect of the COVID-19 pandemic on the travel behavior of the population is investigated. It is shown that at the earliest stages of the pandemic, the population reacted to the travel restrictions by staying at home more and making fewer trips each day.
6.2 Discussions and future work

MDLD sightings with inaccurate latitudes and longitudes can have many causes, including being in a tunnel or walking near tall buildings. A data cleaning procedure applied before the trip identification algorithm can be helpful and may improve the accuracy of the reported trips. It is worth noting that data cleaning before trip identification can be computationally intensive and costly. Therefore, the pros and cons of the data jump cleaning must be evaluated for the specific study before it is implemented.

Moreover, the multi-level weighting method presented in this study, i.e., device-level and trip-level weighting, is a simple weighting procedure and can be extended in many ways. In future studies, socio-demographic data such as education, age, and gender can be used to weight devices based on the population share of each socio-demographic group. Considering that the location data points are generated by smartphones and not every population group has equal access to smartphones, using different weights for users with different characteristics can make the study more accurate.

Currently, trip-level weighting assumes that the nation's travel behavior in 2020 was similar to that in 2017. To obtain a more accurate trip-level weight, it would be helpful to update the average trip rates of each state from 2017 to 2020. In addition, just as each population group could be assigned a different device-level weight, trips can be grouped by mode (rail, bus, drive, bike, walk, air, etc.), and their weights can depend on the mode. A travel mode detection step is needed to weight the trips based on the travel mode.

Furthermore, the current COVID-19 analysis is limited to January and April 2020 and is designed to demonstrate how this data can be applied in real-world studies. Suppose the analysis is extended to later months.
In that case, it can give decision-makers insights into whether the population still follows the restrictions after several months or whether social distancing has become less prevalent.

Last but certainly not least, a comparison that includes both GPS-based trip detection and user-reported trips is an excellent way to validate results at an individual level instead of at a much more aggregated level. Such an analysis requires further user and data vendor agreements.

References

1 Sweeney, L. (2000). Uniqueness of simple demographics in the US population, in LIDAP-WP4. http://privacy.cs.cmu.edu/dataprivacy/papers/LIDAP-WP4abstract.html.
2 Golle, P. (2006). Revisiting the uniqueness of simple demographics in the US population. Proceedings of the 5th ACM Workshop on Privacy in Electronic Society.
3 Golle, P., & Partridge, K. (2009). On the anonymity of home/work location pairs. International Conference on Pervasive Computing, Springer, Berlin, Heidelberg.
4 Trestian, I., et al. (2009). Measuring serendipity: Connecting people, locations and interests in a mobile 3G network. Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement.
5 Chow, C. Y., & Mokbel, M. F. (2011). Trajectory privacy in location-based services and data publication. ACM SIGKDD Explorations Newsletter, 13(1), 19-29.
6 Zang, H., & Bolot, J. (2011). Anonymization of location data does not work: A large-scale measurement study. Proceedings of the 17th Annual International Conference on Mobile Computing and Networking.
7 De Montjoye, Y. A., Hidalgo, C. A., Verleysen, M., & Blondel, V. D. (2013). Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 3(1), 1-5.
8 Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A. L. (2008). Understanding individual human mobility patterns. Nature, 453(7196), 779-782.
9 Song, C., Qu, Z., Blumm, N., & Barabasi, A.-L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018-1021.
10 McGowen, P. & McNally, M.
(2007). Evaluating the potential to predict activity types from GPS and GIS data. Transportation Research Board 86th Annual Meeting, Washington.
11 Gong, L., Morikawa, T., Yamamoto, T., et al. (2014). Deriving personal trip data from GPS data: A literature review on the existing methodologies. Procedia - Social and Behavioral Sciences, 138(0), 557-565.
12 Axhausen, K. W., Schönfelder, S., Wolf, J., Oliveira, M., & Samaga, U. (2004, January). Eighty weeks of GPS traces, approaches to enriching trip information. In Transportation Research Board 83rd Annual Meeting Pre-print CD-ROM.
13 Tsui, S. Y. A., & Shalaby, A. S. (2006). Enhanced system for link and mode identification for personal travel surveys based on global positioning systems. Transportation Research Record: Journal of the Transportation Research Board, 1972(1), 38-45.
14 Bohte, W. & Maat, K. (2009). Deriving and validating trip purposes and travel modes for multi-day GPS-based travel surveys: A large-scale application in the Netherlands. Transportation Research Part C: Emerging Technologies, 17(3), 285-297.
15 Stopher, P. R., Jiang, Q., & FitzGerald, C. (2005). Processing GPS data from travel surveys. 2nd International Colloquium on the Behavioural Foundations of Integrated Land-use and Transportation Models: Frameworks, Models and Applications, Toronto.
16 Du, J. & Aultman-Hall, L. (2007). Increasing the accuracy of trip rate information from passive multi-day GPS travel datasets: Automatic trip end identification issues. Transportation Research Part A: Policy and Practice, 41(3), 220-232.
17 Stopher, P., FitzGerald, C., & Zhang, J. (2008). Search for a global positioning system device to measure person travel. Transportation Research Part C: Emerging Technologies, 16(3), 350-369.
18 Schuessler, N., & Axhausen, K. W. (2009). Processing raw data from global positioning systems without additional information. Transportation Research Record: Journal of the Transportation Research Board, 2105(1), 28-36.
19 Gong, H., Chen, C., Bialostozky, E., & Lawson, C. T. (2012). A GPS/GIS method for travel mode detection in New York City. Computers, Environment and Urban Systems, 36(2), 131-139.
20 Safi, H., Assemi, B., Mesbah, M., Fereira, L., & Hickman, M. (2015). Design and implementation of a smartphone-based system for personal travel survey: Case study from New Zealand. Transportation Research Record: Journal of the Transportation Research Board, 2526, 99-107.
21 Patterson, Z., & Fitzsimmons, K. (2016). Datamobile: Smartphone travel survey experiment. Transportation Research Record: Journal of the Transportation Research Board, 2594(1), 35-43.
22 Wolf, J., Guensler, R., & Bachman, W. (2001). Elimination of the travel diary: Experiment to derive trip purpose from global positioning system travel data. Transportation Research Record, 1768(1), 125-134.
23 Birant, D., & Kut, A. (2007). ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data & Knowledge Engineering, 60(1), 208-221.
24 Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, 96(34), 226-231.
25 Yang, M., Pan, Y., Darzi, A., et al. (2021). A data-driven travel mode share estimation framework based on mobile device location data. Transportation. https://doi.org/10.1007/s11116-021-10214-3
26 Maryland Statewide Household Travel Survey. https://www.baltometro.org/transportation/data-maps/maryland-travel-survey.
27 Yang, M. (2020). Multimodal Travel Mode Imputation Based on Passively Collected Mobile Device Location Data (Doctoral dissertation, University of Maryland, College Park).
28 Wolf, J., Bricka, S., Ashby, T., & Gorugantua, C. (2004, June). Advances in the application of GPS to household travel surveys. In National Household Travel Survey Conference, Washington DC.
29 Forrest, T. L., & Pearson, D. F. (2005). Comparison of trip determination methods in household travel surveys enhanced by a global positioning system.
Transportation Research Record, 1917(1), 63-71.
30 Stopher, P., Clifford, E., Zhang, J., & FitzGerald, C. (2008). Deducing mode and purpose from GPS data.
31 Chen, C., Ma, J., Susilo, Y., Liu, Y., & Wang, M. (2016). The promises of big data and small data for travel behavior (aka human mobility) analysis. Transportation Research Part C: Emerging Technologies, 68, 285-299.
32 Yang, M. (2020). Multimodal Travel Mode Imputation Based on Passively Collected Mobile Device Location Data (Master's thesis, University of Maryland, College Park).
33 Manson, S., Schroeder, J., Van Riper, D., Kugler, T., & Ruggles, S. (2021). IPUMS National Historical Geographic Information System: Version 16.0. Minneapolis, MN: IPUMS. http://doi.org/10.18128/D050.V16.0
34 Gong, L., Yamamoto, T., & Morikawa, T. (2018). Identification of activity stop locations in GPS trajectories by DBSCAN-TE method combined with support vector machines. Transportation Research Procedia, 32, 146-154.
35 Axhausen, K. W., et al. (2003). 80 weeks of GPS-traces: Approaches to enriching the trip information: Submitted to the 83rd Transportation Research Board Meeting. Arbeitsberichte Verkehrs- und Raumplanung, 178.
36 Zhou, C., Jia, H., Juan, Z., Fu, X., & Xiao, G. (2016). A data-driven method for trip ends identification using large-scale smartphone-based GPS tracking data. IEEE Transactions on Intelligent Transportation Systems, 18(8), 2096-2110.
37 Zhou, C., Frankowski, D., Ludford, P., Shekhar, S., & Terveen, L. (2007). Discovering personally meaningful places: An interactive clustering approach. ACM Transactions on Information Systems (TOIS), 25(3), 12.
38 Chen, W., Ji, M., & Wang, J. (2014). T-DBSCAN: A spatiotemporal density clustering for GPS trajectory segmentation. International Journal of Online Engineering (iJOE), 10(6), 19-24.
39 Ye, Y., Zheng, Y., Chen, Y., Feng, J., & Xie, X. (2009). Mining individual life pattern based on location history.
2009 Tenth International Conference on Mobile Data Management: Systems, Services and Middleware, pp. 1-10.
40 Yao, Z., Zhou, J., Jin, P. J., & Yang, F. (2019). Trip end identification based on spatial-temporal clustering algorithm using smartphone GPS data (No. 19-01097). Presented at the 98th Annual Meeting of the Transportation Research Board, Washington, D.C.
41 Wang, F., Wang, J., Cao, J., Chen, C., & Ban, X. J. (2019). Extracting trips from multi-sourced data for mobility pattern analysis: An app-based data example. Transportation Research Part C: Emerging Technologies, 105, 183-202.
42 Thiagarajan, A., et al. (2009). VTrack: Accurate, energy-aware road traffic delay estimation using mobile phones. Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems, pp. 85-98.