ABSTRACT 
 
 
 
 
Title of Thesis: DEVELOPING A TOUR-BASED TRIP 
IDENTIFICATION ALGORITHM USING 
MOBILE DEVICE LOCATION DATA   
  
 Aliakbar Kabiri, Master of Science, 2022 
  
Thesis directed by: Professor Lei Zhang, Department of Civil and 
Environmental Engineering 
 
 
 
 This thesis presents a novel trip identification algorithm that supports travel 
behavior analysis based on mobile device location data. The proposed trip 
identification algorithm is applied to a large-scale Location-based Service (LBS) 
dataset consisting of the location points of a large representative sample of United 
States residents with over 40 million users in January 2020. First, the proposed 
framework divides sightings into long-distance and short-distance home-based tours 
and then identifies the trips on each type of tour using different methods. The derived 
trips are then validated against the Maryland Statewide Household Travel Survey 
2018/2019 and the National Household Travel Survey (NHTS) 2017. The results showed 
that several metrics of the trips from mobile device location data and travel surveys 
follow similar trends. In addition, the impact of coronavirus disease 2019 (COVID-19) 
on the travel behavior of the population is studied as a real-world application of the 
proposed algorithm.  
 
 
 
 
 
 
 
 
 
 
 
 
 
DEVELOPING A TOUR-BASED TRIP IDENTIFICATION ALGORITHM 
USING MOBILE DEVICE LOCATION DATA     
 
 
 
by 
 
 
Aliakbar Kabiri 
 
 
 
 
 
Thesis submitted to the Faculty of the Graduate School of the  
University of Maryland, College Park, in partial fulfillment 
of the requirements for the degree of 
Master of Science 
2022 
 
 
 
 
 
 
 
 
 
 
Advisory Committee: 
Professor Lei Zhang, Chair 
Associate Research Professor Chenfeng Xiong 
Professor Erkut Ozbay 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
© Copyright by 
Aliakbar Kabiri 
2022 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Dedication 
To the 176 innocent passengers of the Ukraine International Airlines Flight 752. 
 
 
 
Acknowledgements 
 It is my great pleasure to thank my parents, Tavous and Taleb, for their valuable 
support. The support I have received from them has gotten me to where I am now. I 
want to acknowledge my colleagues, Aref Darzi, Yixuan Pan, Mofeng Yang, and 
Guangchen Zhao, in our research group for their wonderful collaborations. 
I would particularly like to thank my supervisor at the Maryland Transportation 
Institute (MTI), Dr. Lei Zhang. You have given me several opportunities to further my 
research, and I appreciate them. Furthermore, I would like to express my gratitude to 
Dr. Chenfeng Xiong and Dr. Erkut Ozbay, who served on my thesis advisory 
committee. 
 
 
 
Table of Contents 
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Chapter 1: Introduction
1.1 Background
1.2 Research Objectives and Contributions
1.3 Research Outline
Chapter 2: Literature Review
2.1 Mobile Device Location Data
2.2 Device Characteristics and Travel Pattern Uniqueness
2.3 Trip Identification Algorithms
Chapter 3: Datasets
3.1 Mobile Device Location Data (MDLD)
3.2 incenTrip Data
3.3 Household Travel Surveys (MTS and NHTS)
3.4 American Community Survey (ACS)
Chapter 4: Methodology
4.1 Geographical Level of Study
4.2 Home Location Identification
4.3 Work Location Identification
4.4 Device Deduplication
4.5 Tour-based Trip Identification
4.5.1 Home-based tour identification
4.5.2 Trip Identification for Short-Distance Tours
4.5.3 Trip Identification for Long-distance Tours
4.5.3.1 Tour-ID regeneration
4.5.3.2 Stop and destination identification
4.5.3.3 Sub-tour identification
4.5.3.4 Trip generation
Chapter 5: Results
5.1 Post-processing steps in the trip-identification algorithm
5.2 Regional Trip Validation with MTS
5.3 National Trip Validation with NHTS
5.4 Advantages of the proposed algorithm to the clustering method
5.5 Case study: COVID-19 pandemic and travel behavior changes
5.5.1 Data expansion to the population level
5.5.2 COVID-19 pandemic and the population travel behavior
Chapter 6: Conclusion and Discussion
6.1 Thesis summary
6.2 Discussions and future work
References
 
 
 
List of Tables 
 
 
Table 1. A sample of MDLD.
Table 2. Geo-hash width and height at different levels.
Table 3. K-anonymity size statistics for devices having the exact home location.
Table 4. Top 10 origin-destination pairs in NHTS and MDLD at the state level.
Table 5. Top 10 origin-destination pairs in NHTS and MDLD at the county level.
 
 
 
 
List of Figures 
 
 
Figure 1. Final sampling rate at the county level.
Figure 2. Illustration of a level-7 geo-hash.
Figure 3. Anonymity size variation with different numbers for most-visited locations.
Figure 4. Number of devices having one or more duplicates in the dataset.
Figure 5. Number of devices having common hours in the sample dataset.
Figure 6. Average number of hours observed in the same geo-hash level 7.
Figure 7. Recursive algorithm of trip identification for short-distance tours.
Figure 8. Recursive algorithm of trip identification for long-distance tours.
Figure 9. Tour identification and trip linking demonstration.
Figure 10. An example of a local movement in a mall.
Figure 11. A trip with multiple data jumps.
Figure 12. Trip length distribution comparison between MDLD and MTS survey.
Figure 13. Travel time distribution comparison between MDLD and MTS survey.
Figure 14. Trip start time distribution comparison between MDLD and MTS data.
Figure 15. Trip rate distribution comparison between MDLD and MTS data.
Figure 16. Trip length distribution comparison between MDLD and NHTS data.
Figure 17. Travel time distribution comparison between MDLD and NHTS data.
Figure 18. Trip start time distribution comparison between MDLD and NHTS data.
Figure 19. Trip rate distribution comparison between MDLD and NHTS data.
Figure 20. A trip trajectory that was not captured by the ST-DBSCAN algorithm.
Figure 21. Percentage of people staying home in January and April 2020.
Figure 22. The daily trip rates in January and April 2020.
 
 
 
 
 
 
 
 
 
 
 
 
List of Abbreviations 
 
 
 
 
ATUS    American Time Use Survey 
BMC    Baltimore Metropolitan Council 
CDR    Call Detail Record 
COVID-19   Coronavirus disease 2019 
DBSCAN   Density-Based Spatial Clustering of Applications with Noise 
DCI    Divide, Conquer and Integrate 
GPS    Global Positioning System 
ID    Identifier 
LBS    Location-based Service 
MDLD   Mobile Device Location Data 
MTS    Maryland Statewide Household Travel Survey 
NHGIS   National Historical Geographic Information System 
NHTS    National Household Travel Survey 
UTC     Coordinated Universal Time 
 
  
 
 
 
Chapter 1: Introduction 
 
1.1 Background 
 
A key aspect of transportation planning is the understanding of travel behavior. For many 
years, travel surveys were the most reliable method to obtain the movement patterns of a 
population. The National Household Travel Survey (NHTS) and the Maryland Statewide Household 
Travel Survey (MTS) are just two examples of travel surveys that agencies conduct to collect the 
travel diaries of a sample of residents, which are then used in studies and to inform decisions in 
the transportation field. Although travel surveys provide many insights into the population's 
movement patterns, some drawbacks are associated with them. For large-scale studies, it is often 
impossible to have a high sampling rate and a long study period. Accordingly, in both surveys, a 
small population sample is surveyed for just a couple of days, and then the observed patterns are 
expanded to the entire population. This could lead to biases in several aspects, such as different 
demographic characteristics of population groups, temporal bias, etc. Moreover, occasionally, 
survey respondents make inaccurate trip reports or even miss a trip entirely. This may occur 
because they did not have enough encouragement to provide an accurate or honest report, or it may 
simply be because they forgot to report a specific trip during the survey period. For this reason, 
modern ways of data collection and processing should be used in such studies for more accurate 
disclosure of travel patterns, as well as providing a larger sample size and a more extended study 
period.  
As technology has grown and mobile phones have become ubiquitous, a vast portion of the 
populace has access to devices with Global Positioning Systems (GPS). Nowadays, many mobile 
phone applications store users' locations in latitude-longitude and timestamps showing the time 
the location was reported. This offers a great source of information about the travel patterns of the 
population. Because these data do not provide trip-level information, several data processing steps 
and algorithms are required to explore the underlying travel behaviors of the people. This study 
aims to utilize Mobile Device Location Data (MDLD) for this kind of analysis and data processing. 
 
1.2 Research Objectives and Contributions 
 
In this study, an analysis of movement patterns of the U.S. population was undertaken 
using mobile device location data from a large sample of mobile device users. We analyzed the 
raw location data of more than 40 million devices all over the country and developed novel data 
processing and trip identification algorithms to derive trip-level information for the 
unique devices. 
Because multiple data vendors provided data in this study, the same user might be reported 
with different device identifiers from various sources. Further, electronic devices are more widely 
available nowadays, so a user might carry two or more devices when traveling, such as a 
phone, a tablet, and a computer. Consequently, multiple trajectories may be generated with different 
device identifiers (IDs) for the same user. As part of the data pre-processing steps, a novel 
deduplication algorithm is developed to avoid overrepresenting a user's travel patterns.   
The next step is to develop a trip identification algorithm that uses the raw sightings of 
mobile device users to derive trip-level information. A vital issue complicating the trip 
identification process is distinguishing between linked and unlinked trips. Existing trip 
identification methods identify unlinked trips. As an illustration, a single transit commute trip with 
more than five minutes of waiting time at the origin and transfer transit stations can be identified 
as three unlinked trips: a walking trip from home to the origin transit station; a transit trip from 
the origin transit station to the transfer station; and another transit trip from the transfer station to 
the destination. Additionally, long-distance trips usually include longer stops in the 
middle of the trip than short-distance trips, which is why long-distance trips need to be treated 
differently in trip identification methodologies. As a solution to these issues, a tour-based 
trip identification method is developed that identifies tours before identifying trips. This 
approach enables better trip identification, mode imputation, and purpose imputation for further 
analysis. 
1.3 Research Outline 
 
After reviewing the literature on mobile device location datasets, duplicate device 
identification techniques, and trip identification methods, this study develops deduplication and 
tour-based trip identification algorithms with several improvements compared to the current 
literature to fill the gaps. The developed algorithms are applied to large-scale mobile device 
location data, and the results are compared with two regional and national surveys. This study also 
discusses the advantages of the proposed methodology over current trip identification methods 
based on clusters of sightings. A real-world use case of the proposed algorithm is demonstrated by 
comparing the travel behavior of the population before and during the COVID-19 pandemic.   
The outline of this thesis is as follows. The second chapter offers a comprehensive literature 
review that covers various types of mobile device location datasets, the process of deduplication, 
and the trip identification algorithm using different criteria. In Chapter 3, the primary datasets, 
including mobile device location data and the surveys used for this study, are introduced. In Chapter 
4, we present a novel, highly accurate device deduplication algorithm and a tour-based trip 
identification process to generate trip-level information of individuals from raw sightings. 
Chapter 5 describes the post-processing steps needed for the derived trips and compares the results with 
a regional and a national survey: the Maryland Statewide Household Travel Survey (MTS) 
and the National Household Travel Survey (NHTS). Also, using the multi-level device and trip 
weighting procedures, the results are scaled to the national level to show how the U.S. population 
reacted to the COVID-19 pandemic in its early stages. Finally, Chapter 6 summarizes the findings 
and suggests possible future directions. 
 
 
 
Chapter 2: Literature Review 
 
2.1 Mobile Device Location Data 
 
 In recent years, mobile device location data has become popular for studying the travel 
behavior of populations. These datasets mainly include call detail 
records (CDRs), sightings data, GPS-based technology data, and location-based service data. The 
following are brief descriptions of each of these valuable data sources. 
 In-vehicle GPS technology reports the location of vehicles every few seconds, and many 
studies incorporate this kind of data into their analyses. For example, Chankaew et al. (2018) 
analyzed freight traffic using national truck GPS data in Thailand. CDRs are records produced by a 
telephone exchange or other telecommunications equipment. These records contain the callers' 
phone numbers, the starting time of the call, its duration, and other phone call information. These 
data report the location of the cell towers instead of a user's actual location (Chen et al., 2016). On 
the other hand, a less frequently used dataset, called sightings, is generated each time the phone is 
located. Sightings data report the location of the device obtained by triangulation from multiple 
towers (Chen et al., 2016). Finally, Location-based Service (LBS) data consist of location 
information recorded by smartphone applications using GPS, cellular towers, Wi-Fi, and other 
types of connections to track the device's location (Yang, 2020). This kind of data is the basis of 
this study. 
 As discussed earlier, travel surveys are one of the primary methods for 
analyzing human mobility patterns. In recent years, mobile device location data has also been 
integrated with multiple travel surveys to capture the unreported trips of the users, find 
the possible reasons for not reporting such trips, and prepare a solid independent dataset for the 
validation of the user reported trips. For example, the Kansas City Regional Travel Survey 
conducted in 2003 and 2004 included GPS logging equipment in the vehicle of more than 7% of 
the households that participated in the survey (Wolf et al., 2004). Forest and Pearson (2005) 
analyzed a GPS-enhanced travel survey to evaluate the differences between the trips reported by 
the respondents and the trips captured from GPS devices. They noticed that the number of trips 
reported in the GPS data was much greater than the trips reported in the travel survey.   
 Beyond detecting users' trips, mobile device location data can help analysts 
explore the exact route, the actual travel time, and the travel distance of each trip and, with the 
help of rail, ferry, and bus networks, infer the travel mode from GPS traces with high accuracy 
(Stopher et al., 2008). As technological improvements made GPS devices smaller and lighter, 
wearable GPS devices were incorporated into travel surveys, and surveys transitioned 
from being fully question-based to GPS-based. Studies including Schüssler and Axhausen 
(2009) examined the accuracy of trip identification and travel mode detection in fully 
automated surveys without any questionnaire data. Schüssler and Axhausen (2009) collected the 
GPS data of participants who wore GPS devices, without any other information. 
The results were compared with the existing national-level travel survey and showed a good match 
between the census data and the GPS-based results.  
2.2 Device Characteristics and Travel Pattern Uniqueness 
 
 Studies on the uniqueness of the devices can be classified into two categories: research on 
the feasibility of using demographic information and research on the feasibility of using 
spatiotemporal information. Among studies that evaluated the demographic information to be used 
in device uniqueness identification, a study on 1990 census data by Sweeney (2000) showed that 
87% of the U.S. population could be identified uniquely using the collection of demographic 
attributes such as gender, date of birth, and 5-digit zip code. Furthermore, nearly 50% of the 
population can be uniquely identified by having their place of residency, gender, and date of birth. 
Golle (2006) used the 2000 census data to revisit the uniqueness of individuals using the same 
demographic information and revealed that the previous ratio decreased from 87% to 63%. 
 Other studies that used the spatiotemporal information of mobile device location data and 
are more pertinent to this thesis include research by Trestian et al. (2009). They noted that people 
spend most of their time in their comfort zone, defined as the top three visited locations. Based on 
this study, those who stayed in five base stations during a week spent about 90% of their time in 
the top three locations. Even users observed at 50 base stations, with visited areas of about 
4 square kilometers on average, spent about 55% of their time in their top three visited locations. 
This indicates that devices representing the same individuals are likely to share the top three visited 
regions. Golle and Partridge (2009) showed that about 50% of the U.S. workers could be uniquely 
identified at a census block level using only home and work locations coming from Longitudinal 
Employer-Household Dynamics (LEHD). This study revealed that the median size of the 
anonymity set of the workers in the U.S. at the Census block level is one. An anonymity set is a 
set of individuals that share the same attributes and cannot be distinguished from each other by the 
available information. Chow and Mokbel (2011) found that by having the paths of all users and 
knowing that a particular device was in certain places at certain times, they could identify the 
device's trajectory. This finding will be used for the deduplication validation in the following 
sections of this thesis: for two devices to be identical, they must be observed at the same place at 
any given time. 
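To make the notion of an anonymity set concrete, the following minimal Python sketch groups devices by a tuple of attributes and reports the size of each device's anonymity set. The device IDs, the (home cell, work cell) attributes, and the function name are hypothetical and chosen only for illustration; they are not data or code from the studies cited above.

```python
from collections import Counter

def anonymity_set_sizes(attributes):
    """Map each device to the number of devices sharing its attribute tuple."""
    counts = Counter(attributes.values())
    return {device: counts[attrs] for device, attrs in attributes.items()}

# Hypothetical devices described by (home cell, work cell).
devices = {
    "dev_a": ("cell_1", "cell_9"),
    "dev_b": ("cell_1", "cell_9"),
    "dev_c": ("cell_2", "cell_9"),
    "dev_d": ("cell_3", "cell_7"),
}

sizes = anonymity_set_sizes(devices)

# A device is uniquely identifiable when its anonymity set has size one.
unique_share = sum(1 for size in sizes.values() if size == 1) / len(sizes)
```

In this toy example, `dev_a` and `dev_b` cannot be told apart from their attributes alone, whereas `dev_c` and `dev_d` each form an anonymity set of size one.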
 
 
 Zang and Bolot (2011) used CDR data to identify mobile device owners 
by analyzing their top N locations, ranked by how often they appeared, across different geographical 
levels such as sector, cell, and zip code. A device with fewer top locations would be more 
challenging to identify. This study analyzed the top one, two, and three locations in terms of 
frequency of observations. Based on this research, more than half of the users could be uniquely 
identified by having their top three locations at the cell and sector levels. In addition, the top two 
locations were analyzed without regard to their order, as users might make more 
calls from their work location than from their home location in one month and vice versa in 
another. 
 Instead of analyzing the top N locations, De Montjoye et al. (2013) investigated the number 
of random points needed to identify an individual mobility trace. They evaluated the call data for 
1.5 million users and found that about 95% of the people could be uniquely identified by four 
spatiotemporal observations from each device. Furthermore, in the re-identification of the 
individuals, both the spatial and temporal resolution of the devices' location observations are 
critical.  
 Human mobility patterns are highly predictable. Song et al. (2010) studied users' 
trajectories and noted that people tend to spend most of their time in a few locations. According to 
the researchers, there is a 93% potential predictability in user mobility, which does not vary 
much across the population. In another study, Gonzalez et al. (2008) showed that the mobility of 
individuals is consistent in both spatial and temporal domains and that people tend to return to 
their preferred locations regularly. We believe the main research gap in this field is the lack of 
studies evaluating duplicate devices among mobile device location data provided by different data 
vendors. 
 
 
2.3 Trip Identification Algorithms 
 
 
 Trip end identification for high-frequency mobile device location data, such 
as GPS data, has been well studied and developed. The state-of-the-practice methods utilized by 
commercial data vendors identify trips from raw location data points as follows: 
• Method 1: Consider the time and distance relationship between consecutive location point 
observations to identify moving points and static points. Consecutive moving points between 
two sets of static points form a trip. 
• Method 2: Consider zone boundaries to determine movements from one zone to another, 
which applies to identifying inter-zonal trips only. 
• Method 3: Identify location point clusters as activity locations with spatial clustering methods. 
Location point observations between two consecutive activity locations form a trip. 
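As an illustration of Method 1, the minimal Python sketch below labels each location point as moving or static from the speed implied by consecutive observations and then groups runs of consecutive moving points into trips. The 1 m/s speed threshold, the function names, and the sample coordinates are assumptions made for this example; they are not the parameters of any particular data vendor or of the algorithm proposed in this thesis.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def label_moving(points, speed_threshold_mps=1.0):
    """Label each (timestamp_s, lat, lon) point; the first point is treated as static."""
    labels = [False]
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(points, points[1:]):
        dt = t1 - t0
        speed = haversine_m(lat0, lon0, lat1, lon1) / dt if dt > 0 else 0.0
        labels.append(speed > speed_threshold_mps)
    return labels

def split_trips(points, labels):
    """Group consecutive moving points between static points into trips."""
    trips, current = [], []
    for point, moving in zip(points, labels):
        if moving:
            current.append(point)
        elif current:
            trips.append(current)
            current = []
    if current:
        trips.append(current)
    return trips

# Hypothetical sightings: three static points, two moving points, two static points.
points = [
    (0, 38.9900, -76.9300), (60, 38.9900, -76.9300), (120, 38.9900, -76.9300),
    (180, 38.9950, -76.9300), (240, 39.0000, -76.9300),
    (300, 39.0000, -76.9300), (360, 39.0000, -76.9300),
]
labels = label_moving(points)
trips = split_trips(points, labels)
```

A single fixed speed threshold is the simplest possible rule; with sparse or noisy sightings it can fragment or merge trips, which is one reason more elaborate rules are used in practice.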
 The traditional way of obtaining accurate trip ends is the rule-based trip end identification 
method. This type of method designs rules and parameters based on domain knowledge. The trip 
ends are obtained by applying the rules to every sighting in the location data and, at the same time, 
examining the intra-relationship between consecutive location points. The parameters used in these 
rules are defined mainly by domain knowledge and are applied to measures such as dwell time, 
speed, etc. (McGowen and McNally 2007; Gong et al. 2014; Axhausen et al. 2003; Tsui et al. 
2006; Bothe and Maat 2009; Stopher et al. 2005; Du and Aultman-Hall 2007; Stopher et al. 2008; 
Schuessler and Axhausen 2009; Gong et al. 2012; Safi et al. 2015; Patterson et al. 2016). 
 Dwell-time thresholds vary widely across existing studies. Wolf et al. (2001) 
utilized GPS data to detect the trip diary of users by applying different time thresholds, as a rule, 
to identify trip ends. If a device did not move during the time threshold, it was detected as a trip 
end. The best match between the reported and the detected trips was derived using a 120-second 
threshold. Tsui and Shalaby (2006) used an activity identification algorithm to find the trip ends 
from GPS data streams. A time threshold of 120 seconds was used as a primary criterion of activity 
identification. In the case of signal loss, other measures were included. For example, if the signal 
loss lasted between 120 and 600 seconds and the user moved no more than 50 meters, it was 
considered a short-duration indoor activity. Stopher et al. (2005) defined a trip end whenever the 
difference in the consecutive latitude and longitude values is less than 0.000051 degrees, the 
heading is unchanged or zero, and the speed is zero, with these conditions 
holding for 120 seconds or more. It is worth noting that this 
study was conducted in Australia and, as the authors noted, most traffic lights in Australia 
have a red cycle of fewer than 2 minutes.  Stopher et al. (2005) used a 3-minute threshold to 
determine the trip ends of the GPS data. Axhausen et al. (2004) utilized the Trip Identification and 
Analysis System (TIAS). According to the model, points with dwell times greater than five 
minutes can be identified from GPS data as trip ends with 
confidence. Some research considered a speed threshold of zero or a value near zero as a measure 
to capture static clusters and trip ends (Wolf et al., 2001; Tsui and Shalaby, 2006; Schuessler and 
Axhausen, 2009). 
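The dwell-time rules reviewed above share a common structure, which the following minimal Python sketch illustrates: a trip end is declared when a device stays within a small radius of a point for at least a threshold duration. The 120-second default echoes the thresholds cited above, while the 50-meter radius, the assumption that coordinates are already projected to meters, and the function name are choices made only for this example and do not reproduce any specific published algorithm.

```python
import math

def detect_trip_ends(points, dwell_s=120, radius_m=50.0):
    """Return (start_time, end_time) stays from (t, x, y) points sorted by time.

    x and y are assumed to be planar coordinates in meters.
    """
    stays = []
    i, n = 0, len(points)
    while i < n:
        t0, x0, y0 = points[i]
        j = i
        # Extend the candidate stay while later points remain near the anchor point.
        while j + 1 < n and math.hypot(points[j + 1][1] - x0, points[j + 1][2] - y0) <= radius_m:
            j += 1
        if points[j][0] - t0 >= dwell_s:
            stays.append((t0, points[j][0]))
            i = j + 1
        else:
            i += 1
    return stays

# Hypothetical trace: a stay near the origin, a move, and a stay about 1 km away.
trace = [
    (0, 0.0, 0.0), (60, 5.0, 0.0), (130, 3.0, 4.0),
    (190, 500.0, 0.0),
    (250, 1000.0, 0.0), (310, 1000.0, 5.0), (400, 1002.0, 3.0),
]
stays = detect_trip_ends(trace)
```

Raising `dwell_s` makes the rule more conservative: with a 300-second threshold, neither stay in this trace would qualify as a trip end, which shows how sensitive the identified trips are to the chosen threshold.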
Moreover, researchers have leveraged supervised machine learning methods, which classify each 
location point as static or moving, to supplement the rule-based methods (Gong et al., 2015; 
Zhou et al., 2016; Gong et al., 2018). Different clustering methods were also applied to obtain trip 
ends by first identifying people's activity locations from the location data (Zhou et al., 2007; Chen 
et al., 2014; Ye et al., 2009; Yao et al., 2019). A recent study utilized a spatiotemporal clustering 
method with three combined optimization models to detect trip ends (Yao et al., 2019). There is 
also a particular focus on deriving the trip ends from LBS data. A "Divide, Conquer and Integrate" 
(DCI) framework was proposed to process the LBS data and extract mobility patterns in the Puget 
Sound region (Wang et al., 2019). The proposed framework combined a rule-based and 
incremental clustering method to handle the bi-modally distributed LBS data. The results were 
aggregated at the census tract level and compared with household travel surveys (Wang et al., 
2019). 
 
 
 
 
 
Chapter 3: Datasets 
 
3.1 Mobile Device Location Data (MDLD) 
 
 The primary dataset used in this thesis is the mobile device location data (MDLD) collected 
by multiple leading data vendors. This dataset contains the spatial and temporal information of 
several users, including a random hashed device identifier, the latitude and longitude of location 
points, the timestamp at which the location of the user was collected, the accuracy of the 
sightings in meters as reported by the data provider, and the Coordinated Universal Time (UTC) 
offset that relates the UTC time of each sighting to its local time. Table 1 shows a sample of the 
mobile device location data. Due to privacy concerns, noise has been applied to all the entries.  
Table 1. A sample of MDLD. 
 
Device-ID     UTC timestamp   Latitude   Longitude   Accuracy (m)   UTC offset (s)
Sfbcx-223da   1578010770      38.9924    -76.9293    2              -14400
Sfbcx-223da   1578010775      38.9802    -76.9190    5              -14400
Sfbcx-223da   1578010778      38.9605    -76.9201    3              -14400
Rjckf-2421s   1578010500      38.7069    -76.8985    11             -14400
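As a small illustration of how the UTC timestamp and UTC offset fields relate a sighting to its local time, the following Python sketch converts the first row of Table 1. It assumes that the timestamp is in Unix seconds and that the offset is expressed in seconds, which is consistent with the sample values but is not stated explicitly in the data description; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def to_local_time(utc_timestamp, utc_offset_s):
    """Convert a Unix timestamp and a UTC offset in seconds to a local datetime."""
    local_zone = timezone(timedelta(seconds=utc_offset_s))
    return datetime.fromtimestamp(utc_timestamp, tz=local_zone)

# First sighting in Table 1 (values carry noise for privacy reasons).
local = to_local_time(1578010770, -14400)
```

Under these assumptions, the sighting falls on the evening of January 2, 2020 in local time, although it belongs to January 3 in UTC, which matters when sightings are grouped into daily tours.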
 
Figure 1 shows the mobile device location data sampling rate obtained after the multiple data 
processing steps described in Chapter 4, such as data cleaning, device deduplication, and the 
retention of only those devices with identified home locations. 92% and 90% of the counties have a 
sampling rate of more than 5% and 10%, respectively. These numbers indicate the great value of 
MDLD for the analysis of travel 
patterns, compared with surveys in which the sampling rate is much lower than the sampling rate 
in MDLD used in this study. 
 
Figure 1. Final sampling rate at the county level. 
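The county-level sampling rate can be sketched as follows, assuming it is defined as the number of retained devices whose home location falls in a county divided by that county's population. This definition, the function names, and the three toy counties are assumptions made for illustration; the numbers are not taken from the dataset.

```python
def county_sampling_rates(device_counts, populations):
    """Sampling rate per county: retained devices divided by county population."""
    rates = {}
    for county, population in populations.items():
        if population > 0:
            rates[county] = device_counts.get(county, 0) / population
    return rates

def share_above(rates, threshold):
    """Share of counties whose sampling rate exceeds the threshold."""
    return sum(1 for rate in rates.values() if rate > threshold) / len(rates)

# Hypothetical counties with device counts and populations.
device_counts = {"County A": 1200, "County B": 300, "County C": 40}
populations = {"County A": 10000, "County B": 5000, "County C": 2000}

rates = county_sampling_rates(device_counts, populations)
share_over_5pct = share_above(rates, 0.05)
share_over_10pct = share_above(rates, 0.10)
```

Summaries of this kind, computed over all counties, correspond to the shares of counties above the 5% and 10% sampling-rate thresholds reported in the text.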
 
 
3.2 incenTrip Data 
 
 incenTrip (incentrip.org) was developed by the National Transportation Center (NTC) at 
the University of Maryland (UMD) for the "Integrated, Personalized, Real-time Traveler 
Information and Incentive" (iPretii) project, funded by the U.S. Department of Energy's (DOE) 
Advanced Research Projects Agency-Energy (ARPA-E). This application gets the users' location 
information to suggest the best transit option and incentivize them to use the transit network instead 
of a private car for their travels (Mofeng, 2020). The data is collected and stored in compliance 
with data privacy protection requirements. In its structure, this dataset is comparable to the 
MDLD gathered from the leading data vendors in the U.S., but it covers only the Washington 
Metropolitan Area (DMV) and the Baltimore Metropolitan Council area. 
3.3 Household Travel Surveys (MTS and NHTS) 
 
 Two travel surveys are used as part of the validation of the proposed trip identification 
algorithm. The first ground truth dataset is the Maryland Statewide Household Travel Survey 
(MTS), conducted by the Baltimore Metropolitan Council (BMC) to capture the daily travel 
patterns of the people living in a set of counties in Maryland. This survey was conducted between 
April 2018 and May 2019. The survey data was collected from 7,500 households in counties 
including Allegany, Anne Arundel, Baltimore, Caroline, Carroll, Cecil, Dorchester, Garrett, 
Harford, Howard, Kent, Queen Anne's, Somerset, Talbot, Washington, Wicomico, Worcester, 
and Baltimore City. Residents were asked what trips they made on a specific weekday for work, 
school, shopping, etc. 
 The other ground truth dataset is the National Household Travel Survey (NHTS) 2017. The 
MTS focuses on a specific geographic area, while this survey covers all 50 states and the District 
of Columbia. The survey was conducted between March 2016 and May 2017 on weekdays and 
weekends, including holidays. 
 
 
3.4 American Community Survey (ACS) 
 
 The American Community Survey (ACS) is an ongoing survey conducted every year to 
provide information about the U.S. population, such as demographics, for regions including 
states, counties, and smaller geographical areas. In this research, the population of each county 
is used to calculate the sampling rate in each county of the fifty states and the District of 
Columbia. This survey data is provided by the National Historical Geographic Information 
System (NHGIS).  
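
The county-level sampling rate described above is the number of retained devices divided by the county population. The short sketch below illustrates the calculation; the county keys and counts are made-up placeholders, not values from the MDLD or the ACS.

```python
def sampling_rates(devices_by_county: dict, population_by_county: dict) -> dict:
    """Share of each county's population represented by a device with a home there."""
    return {
        county: devices_by_county.get(county, 0) / population
        for county, population in population_by_county.items()
        if population > 0
    }


# Hypothetical counts for two placeholder counties.
rates = sampling_rates(
    devices_by_county={"county_a": 500, "county_b": 30},
    population_by_county={"county_a": 5000, "county_b": 1000},
)
print(rates)
```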
 
Chapter 4: Methodology 
4.1 Geographical Level of Study 
 
The geographical level of study for users' home locations and visited locations is geo-hash 
level 7. Geo-hashes are unique identifiers of specific zones on the earth, and their width and 
height depend on the geo-hash level. Table 2 shows the size of geo-hashes, from the 
largest to the smallest. 
Table 2. Geo-hash width and height at different levels. 
 
Geo-hash level   Width × height 
1                5,009.4 km × 4,992.6 km 
2                1,252.3 km × 624.1 km 
3                156.5 km × 156 km 
4                39.1 km × 19.5 km 
5                4.9 km × 4.9 km 
6                1.2 km × 609.4 m 
7                152.9 m × 152.4 m 
8                38.2 m × 19 m 
9                4.8 m × 4.8 m 
10               1.2 m × 59.5 cm 
11               149 mm × 149 mm 
12               37.2 mm × 18.6 mm 
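
To make the relation between the geo-hash level and the cell size concrete, the sketch below implements the standard geo-hash encoding (alternating longitude and latitude bisection, five bits per base-32 character) and derives the angular size of a cell at a given level. It is a self-contained illustration and not necessarily the library used in this study; the function names are hypothetical.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, level: int = 7) -> str:
    """Encode a coordinate as a geo-hash string with `level` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value, bits = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < level:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = value * 2 + int(bit)
        bits += 1
        use_lon = not use_lon
        if bits == 5:  # five bits make one base-32 character
            chars.append(_BASE32[value])
            value, bits = 0, 0
    return "".join(chars)


def cell_size_degrees(level: int) -> tuple:
    """Return (longitude span, latitude span) of a cell in degrees."""
    total_bits = 5 * level
    lon_bits = (total_bits + 1) // 2
    lat_bits = total_bits // 2
    return 360.0 / 2 ** lon_bits, 180.0 / 2 ** lat_bits


print(geohash_encode(57.64911, 10.40744, 6))
print(cell_size_degrees(7))
```

At level 7 both spans are about 0.00137 degrees, which corresponds to roughly 153 m at the equator when one degree is taken as about 111.3 km, in line with Table 2. The east-west extent in meters shrinks with latitude.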
 
Figure 2 illustrates the level of study for home location imputation and visited locations. 
The blue rectangle is a level-7 geo-hash denoted by the unique identifier "dqcmc4p" with 
a size of 152.9 m × 152.4 m.  
 
Figure 2. Illustration of a level-7 geo-hash. 
 
4.2 Home Location Identification 
 
 
Several sections of this study require users' home locations, which the mobile device 
location data does not provide directly. It does, however, provide enough information to 
derive them. Home locations are derived from the locations that a device visited during a period, 
and they are reported as geo-hash level 7 zones. The methodology is as follows: 
 At first, sightings of a device are aggregated at geo-hash level 6 zones, and geo-hashes that 
meet the following criteria are considered as possible home locations and are kept for further 
evaluation:  
• The geo-hash must have been observed on at least 
$\min\{3, \lceil \text{number of observed days in a month} / 2 \rceil + 1\}$ days in a month. 
• The geo-hash is observed on an average of more than 2 hours on the days that it has 
sightings. 
 The remaining geo-hashes are sorted by the number of observed days in a month, 
the average daily number of observed hours on observed days, and the average number of hourly 
sightings in observed hours, and the top three zones are picked. Next, these three geo-hashes 
are sorted by the observed number of nights, the average daily number of observed 
nighttime hours, and the average number of hourly sightings during nighttime. The top level-6 
geo-hash is identified as the home location since people tend to spend most of their nighttime at 
home. Finally, these two steps are repeated on all the level-7 geo-hashes inside the identified 
level-6 geo-hash, and the top level-7 geo-hash is selected as the home location. This step yields 
a more precise home location. It is worth noting that nighttime is defined as 9 p.m. to 6 a.m. 
based on the American Time Use Survey (ATUS), which showed that nearly 80% of the 
population that works full-time or part-time was at home during this time interval.  
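
The two-stage ranking can be sketched as follows. This is a simplified illustration rather than the implementation used in this study: the per-zone statistics are assumed to be precomputed, the class `ZoneStats` and the function `pick_home` are hypothetical names, and the minimum number of observed days is passed in as a parameter.

```python
from dataclasses import dataclass


@dataclass
class ZoneStats:
    geohash: str
    days: int                   # observed days in the month
    avg_hours: float            # average observed hours on observed days
    avg_sightings: float        # average sightings per observed hour
    nights: int                 # observed nights in the month
    avg_night_hours: float      # average observed nighttime hours
    avg_night_sightings: float  # average sightings per nighttime hour


def pick_home(zones, min_days, min_avg_hours=2.0):
    """Return the geo-hash ranked first as the home location, or None."""
    candidates = [
        z for z in zones if z.days >= min_days and z.avg_hours > min_avg_hours
    ]
    if not candidates:
        return None
    # Stage 1: keep the three zones with the strongest overall presence.
    top_three = sorted(
        candidates,
        key=lambda z: (z.days, z.avg_hours, z.avg_sightings),
        reverse=True,
    )[:3]
    # Stage 2: among those, prefer the zone with the strongest nighttime presence.
    best = max(
        top_three,
        key=lambda z: (z.nights, z.avg_night_hours, z.avg_night_sightings),
    )
    return best.geohash


zones = [
    ZoneStats("zone_home", 25, 6.0, 20.0, 24, 7.5, 15.0),
    ZoneStats("zone_work", 20, 8.0, 30.0, 1, 1.0, 2.0),
    ZoneStats("zone_gym", 10, 1.5, 5.0, 0, 0.0, 0.0),
]
print(pick_home(zones, min_days=3))
```

The same function could be applied first to level-6 zones and then to the level-7 zones inside the selected level-6 zone.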
 MDLD covers a massive number of devices, but many of them do not provide sufficient 
information. For example, a device might have fewer than ten observations during a month. 
Therefore, after identifying home locations, only devices with a minimum quality are kept in 
the dataset.  
4.3 Work Location Identification 
 
 In addition to the home location, the work location is needed when identifying a trip. Work 
location identification follows the same structure as home location identification. Workplace 
candidates are selected based on a visiting frequency of at least three workdays, or half of the 
total observed workdays for each device, and an average duration of at least two hours during the 
daytime on workdays. Furthermore, a term called 'temporal similarity' is introduced to avoid 
possible misidentification of the work location. The temporal similarity ratio ensures that the 
identified work location is distinct from the home location. For each workplace candidate, the 
temporal similarity ratio is defined as the ratio between the number of hours when the device was 
observed both at home and at the workplace candidate and the total number of hours when the 
device was observed at the workplace candidate. A threshold of 0.6 has been selected. 
This check is performed because, in some cases, a device's home location might be close to the 
boundary of a geo-hash. This results in frequent observations of the device in two adjacent 
geo-hashes, which increases the possibility of identifying the geo-hash next to the home location 
as the work location. Assuming a user spends some time at work before returning home, the 
user's home and workplace should not regularly be observed at the same time.  
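
A minimal sketch of the temporal similarity ratio is given below. Hours are represented as hypothetical (day, hour) tuples, and the rule that a candidate is rejected when its ratio exceeds the 0.6 threshold is an assumption about how the threshold is applied.

```python
def temporal_similarity(home_hours: set, candidate_hours: set) -> float:
    """Share of the candidate's observed hours in which the device was also at home."""
    if not candidate_hours:
        return 0.0
    return len(home_hours & candidate_hours) / len(candidate_hours)


def is_valid_workplace(home_hours, candidate_hours, threshold=0.6):
    # Assumption: candidates whose ratio exceeds the threshold are rejected.
    return temporal_similarity(home_hours, candidate_hours) <= threshold


home = {(4, 8), (4, 9), (4, 10), (5, 8)}
neighbor_zone = {(4, 9), (4, 10), (5, 8), (5, 14)}  # mostly overlaps with home
office_zone = {(4, 13), (4, 14), (4, 15), (4, 9)}   # rarely overlaps with home

print(temporal_similarity(home, neighbor_zone), is_valid_workplace(home, neighbor_zone))
print(temporal_similarity(home, office_zone), is_valid_workplace(home, office_zone))
```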
4.4 Device Deduplication 
 
 
 Following the literature discussed in Chapter 2, a deduplication algorithm is developed to 
identify different device IDs that represent the same user in the integrated dataset coming from 
different data vendors. The algorithm is based on both demographic information (in this case, 
the imputed home location of devices) and travel movement patterns (the top N locations visited 
during a month). In this study, devices that have the same home location and the same top-five 
most-visited locations are placed in the same k-anonymity group and, as such, are flagged as 
duplicate devices. These devices represent the same user but have different device identifiers. 
 Geo-hashes with at least one sighting over a month are considered visited locations of a 
device. To determine the top locations visited by a device, all the visited geo-hashes are sorted 
based on the number of unique hours and the number of sightings during a month. A unique hour 
is a time interval of one hour during which a specific device was observed. For example, an 
observation at 11:45 a.m. on January 4, 2020, and another at 11:12 a.m. on January 6, 2020, are 
considered two unique hours during a month. Next, the top N geo-hashes are chosen as the 
most-visited locations of a device for the deduplication process. Devices that do not have top N 
geo-hashes, meaning they are observed in N-1 or fewer geo-hashes during the month, are removed 
since the information about their trajectories is insufficient. Table 3 summarizes the minimum, 
average, and maximum values of anonymity group sizes for devices in the dataset. As the number 
of the top-visited locations increases, fewer devices are associated with the same anonymity group. 
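
The grouping step can be sketched as below. The sketch assumes that each device already has an imputed home geo-hash and a list of visited geo-hashes ranked by unique hours and sightings. Treating the top N locations as an unordered set is also an assumption; matching on the ordered list would be a stricter variant. All identifiers are hypothetical.

```python
from collections import defaultdict


def duplicate_groups(devices: dict, n: int = 5) -> list:
    """Group device IDs that share a home geo-hash and the same top-n locations.

    devices maps a device ID to (home_geohash, ranked_visited_geohashes).
    """
    groups = defaultdict(list)
    for device_id, (home, ranked) in devices.items():
        if len(ranked) < n:
            continue  # fewer than n visited geo-hashes: the device is dropped
        key = (home, frozenset(ranked[:n]))
        groups[key].append(device_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


devices = {
    "dev_a": ("h1", ["h1", "w1", "g1", "s1", "c1", "x1"]),
    "dev_b": ("h1", ["h1", "w1", "s1", "g1", "c1"]),
    "dev_c": ("h2", ["h1", "w1", "g1", "s1", "c1"]),
    "dev_d": ("h1", ["h1", "w1"]),
}
print(duplicate_groups(devices, n=5))
```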
 
 
 
 
 
Table 3. K-anonymity size statistics for devices having the same home location. 
 
Number of the top-visited locations (N)   Min   Mean   Max 
1                                         1     4.23   20,514 
2                                         1     1.82   323 
3                                         1     1.48   164 
4                                         1     1.34   72 
5                                         1     1.25   52 
6                                         1     1.19   50 
7                                         1     1.15   14 
8                                         1     1.12   7 
 
 
Figure 3. Anonymity size variation with different numbers for most-visited locations. 
 
 
Figure 3 shows the percentiles of anonymity group sizes for different numbers of most-visited 
locations (N). For example, with N equal to five, more than 80% of anonymity 
groups contain only a single device, and less than 20% have two or more devices in the same 
anonymity group. As noted previously, the objective is to treat all devices belonging to 
the same anonymity group as representing the same individual since they share the top N 
most-visited locations and the home location. The next step is to determine a reasonable value 
for N. Based on Table 3 and Figure 3, as the value of N increases, fewer devices share the same 
most-visited locations, and consequently, fewer devices are identified as duplicates. A validation 
is conducted to ensure that a proper value of N is selected. 
Sharing the top one (N=1) or two (N=2) locations is not sufficient to identify duplicated 
devices since these two locations are most likely to be the home and work locations of the 
devices, and two different users may live and work together. Thus, more locations need 
to be considered in the duplicate identifier. Figure 4 shows the number of devices with at least 
one duplicate for varying values of N. Higher values of N can lead to failing to detect duplicate 
devices because even tiny differences in the number of observed hours can change the list of 
most-visited locations. Such differences arise because different data vendors may report 
different numbers of sightings of a device, as different applications capture them.  
 
Figure 4. Number of devices having one or more duplicates in the dataset. 
 
If two devices are duplicates, meaning they represent the same user in the dataset, they 
must be in the same location at the same time. This fact is the baseline for the validation of the 
duplicate identification algorithm. A sample of 100,000 duplicate device pairs is evaluated in 
terms of the spatial and temporal information of their sightings over 10 days. Their trajectories 
are evaluated to see if they are at the same location simultaneously. Since different data vendors 
might report the sightings of the same user at different hours of a day, only the sightings in the 
common hours are compared. Common hours are those one-hour intervals in which both devices 
had sightings in the integrated dataset. For instance, if two devices both appeared at 4 a.m. on 
January 1, 2020, they must have appeared in the same level-7 geo-hash.  
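
The common-hour check for one pair of devices can be sketched as follows. As a simplification, each device is assumed to be summarized by a single level-7 geo-hash per one-hour interval; the hour keys and geo-hash strings are illustrative placeholders.

```python
def same_place_share(hours_a: dict, hours_b: dict):
    """Share of common hours in which both devices are in the same geo-hash.

    Each argument maps an hour key, e.g. ("2020-01-01", 4), to a level-7 geo-hash.
    Returns None when the two devices have no common hour.
    """
    common = hours_a.keys() & hours_b.keys()
    if not common:
        return None
    matches = sum(hours_a[h] == hours_b[h] for h in common)
    return matches / len(common)


device_a = {("2020-01-01", 4): "dqcmc4p", ("2020-01-01", 5): "dqcmc4p",
            ("2020-01-02", 9): "dqcmc4r"}
device_b = {("2020-01-01", 4): "dqcmc4p", ("2020-01-01", 5): "dqcmc4q",
            ("2020-01-03", 1): "dqcmc4p"}
print(same_place_share(device_a, device_b))
```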
Figure 5 shows the number of paired devices in the random sample that had at least 
one common hour in the first ten days of January. As the number of most-visited locations 
increases, the number of paired devices having sightings in the same one-hour intervals 
increases. For values greater than five, the number of paired devices having common hours 
does not increase significantly. 
 
Figure 5. Number of devices having common hours in the sample dataset. 
 
Figure 6 shows the average percentage of common hours in which paired devices were observed 
in the same level-7 geo-hash. As in the previous trend, the percentage increases substantially 
when the value of N changes from one to five, but for higher values it stays at around 99.5% 
and changes little. This percentage is a good indicator of being in the same location 
simultaneously, and using a higher value of N increases the risk of failing to identify two 
duplicated devices.  
 
Figure 6. Average number of hours observed in the same geo-hash level 7. 
 Based on the previous discussions, sharing the same home location together with the same 
five most-visited locations is used as the duplicate identifier. Finally, when two devices are 
labeled as duplicate devices, all their sightings are merged, and the same device ID is assigned 
to them. In this way, a more complete trajectory of the user is presented in the final dataset. 
4.5 Tour-based Trip Identification 
 
 Due to the absence of trip-level information in the raw mobile device location data, a trip 
identification algorithm is required to extract this information. The purpose of this section is to 
explain the trip identification algorithm. 
4.5.1 Home-based tour identification 
 
 As the first step, home-based tours of the devices are derived. Home-based tours are 
defined as all the sightings between two consecutive appearances of a device at the imputed home 
location (see section 4.2 for the home location identification algorithm). The algorithm processes 
the sightings on any day, from 4 a.m. to 4 a.m. the next day. This is called a trip day. It is assumed 
that individuals are at home at 4 a.m. unless they are on a long-distance trip. As a result, the 
algorithm first checks the first sighting of each user on every trip day. If the sighting is out-of-
home and the distance of the sighting to the home location is shorter than 50 miles, i.e., the device 
is not on a long-distance tour, an at-home sighting is generated for the user at 4 a.m. on this trip 
day. Similarly, if the device's last sighting is out-of-home and the sighting distance to the home 
location is shorter than 50 miles, an at-home sighting is generated for the user at 4 a.m. of the next 
day. This ensures that users not on a long-distance tour start and end their days at home, and all 
the home-based tours are complete. Furthermore, every at-home sighting whose previous and next 
sightings are out-of-home is repeated, and a copy of the sighting is added to the list of users' 
sightings. This ensures that every sighting in the dataset only belongs to one tour.  
Next, if the first sighting of a user is out-of-home, i.e., the device is on a long-distance tour, 
a flagged tour ID of "Long-Distance Tour" is assigned to all sightings of the user until the user is 
seen at home. If the user is never seen at home during the trip day, all observations on the trip 
day have the same flagged tour ID. Similarly, if the last sighting of a user is out-of-home, i.e., 
the user is on a long-distance tour at the end of the day, and the sighting does not already have 
a tour ID, meaning that the user did not start the day on a long-distance tour, a flagged tour ID of 
"Long-Distance Tour" is assigned to all sightings of the user from the last observation at home to 
the last observation on that day. Next, a random tour ID is generated for every out-of-home 
sighting following an at-home observation. The same tour ID is assigned to all following 
sightings until the device is again seen at home.  
Finally, the maximum distance of each tour to the home location is calculated. If the 
maximum distance exceeds 50 miles, the tour ID is changed to the flagged tour ID, "Long-Distance 
Tour." The reason for flagging the sightings on a long-distance tour is that their trip identification 
is different from the daily local sightings. At this stage, sightings are separated into two groups: 
sightings on short-distance tours and sightings on long-distance tours. Short-distance tours will go 
through a daily short-distance trip identification. In contrast, long-distance tours go through a 
monthly long-distance trip identification. 
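
The core of the tour labeling can be sketched as below. The sketch is deliberately simplified: each sighting is reduced to an at-home flag and a precomputed distance from home, sequential tour labels replace the random tour IDs, at-home sightings are left unlabeled instead of being duplicated into both neighboring tours, and the special handling of the first and last sightings of a trip day is omitted.

```python
LONG_DISTANCE = "Long-Distance Tour"


def assign_tours(sightings: list, threshold_miles: float = 50.0) -> list:
    """Label each out-of-home sighting with a tour label.

    sightings is a time-ordered list of (at_home, miles_from_home) tuples.
    At-home sightings receive None.
    """
    labels = [None] * len(sightings)
    tour_count = 0
    run_start = None
    # A trailing at-home sentinel closes a tour that is still open at the end.
    for i, (at_home, _) in enumerate(sightings + [(True, 0.0)]):
        if not at_home and run_start is None:
            run_start = i
        elif at_home and run_start is not None:
            run = range(run_start, i)
            tour_count += 1
            farthest = max(sightings[j][1] for j in run)
            label = LONG_DISTANCE if farthest > threshold_miles else f"tour-{tour_count}"
            for j in run:
                labels[j] = label
            run_start = None
    return labels


day = [(True, 0.0), (False, 2.0), (False, 5.0), (True, 0.0),
       (False, 30.0), (False, 80.0), (True, 0.0)]
print(assign_tours(day))
```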
4.5.2 Trip Identification for Short-Distance Tours 
 
Trips for each short-distance tour are identified using the following steps. The trip 
identification algorithm assigns a random ID to every trip it identifies. First, all sightings of each 
user are sorted by time. The location dataset may include many sightings that do not belong to any 
trip, i.e., stationary sightings. The algorithm assigns "0" as the trip ID to these sightings. For every 
sighting, the time, distance, and speed relative to its previous and next sightings are computed, 
where applicable, and stored as the "time from," "time to," "distance from," "distance to," 
"speed from," and "speed to" variables. 
The trip identification algorithm has three thresholds: distance, time, and speed. The speed 
threshold is used to identify if a sighting is recorded on the move. The distance and time thresholds 
are used to identify trip ends. At this step, the algorithm identifies the first sighting with 
$\text{speed from} \geq \text{speed threshold}$. This identified sighting is on the move, so a random 
trip ID is generated and assigned to this sighting. All sightings recorded before this point, if they 
exist, are set to have "0" as their trip ID, meaning that they are stationary sightings. Then, a 
recursive algorithm discussed in the following paragraphs identifies whether the next sightings 
are on the same trip and should have the same trip ID. 
The recursive algorithm runs on sightings with the same tour ID. It checks every sighting 
to identify if it belongs to the same trip as its previous point. If it does, the same trip ID is 
assigned to it. Otherwise, either a new trip ID is assigned to it (when 
$\text{speed from} \geq \text{speed threshold}$), meaning it is the starting point of a new trip, or 
its trip ID is set to "0" (when $\text{speed from} < \text{speed threshold}$). Identifying if a sighting 
belongs to the same trip as its previous sighting is based on the sighting's "speed to," 
"distance to," and "time to" attributes. If a device is seen at a point with 
$\text{distance to} \geq \text{distance threshold}$ but is not observed to move there 
($\text{speed to} < \text{speed threshold}$), the point does not belong to the same 
trip as its previous point. 
When a user is on the move ($\text{speed to} \geq \text{speed threshold}$), the sighting belongs 
to the same trip as its previous sighting; but when the user stops, the algorithm checks the radius 
and dwell time to identify if the previous trip ended. If the user stays at the stop (sightings should 
be closer than the distance threshold) for a period shorter than the time threshold, the sightings 
still belong to the previous trip. When the dwell time reaches the time threshold, the trip ends, 
and the subsequent sightings no longer belong to the same trip. The algorithm does this by 
updating "time from" to be measured from the first observation in the stop. 
 
Figure 7. Recursive algorithm of trip identification for short-distance tours. 
 If a sighting has a speed greater than three mph from the previous sighting, the sighting 
belongs to the same trip as its previous sighting. If a sighting has a speed lower than three mph 
from the previous sighting and is more than 1,000 ft away from the previous sighting, the sighting 
does not belong to the same trip as its previous sighting. If the speed to the next sighting is also 
lower than three mph, the current sighting simply terminates the trip; otherwise, it becomes the 
start of a new trip. If a sighting has a speed lower than three mph from the previous sighting and 
is within 1,000 ft of the previous sighting, the cumulative dwell time over all the consecutive 
sightings meeting these two conditions is computed and checked:  
1. If the cumulative dwell time is less than five minutes, the current sighting belongs to the 
same trip. 
2. Otherwise, it terminates the trip if the speed to the next sighting is less than three mph or 
starts a new trip if the speed to the next sighting is more than three mph. 
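
The rules above can be sketched as a single pass over the sightings of one tour. The sketch makes several simplifying assumptions: positions are given in a local planar coordinate system in feet instead of latitude and longitude, sequential trip IDs replace the random ones, a loop replaces the recursion, and the dwell time is measured from the sighting at which the user arrived at the stop.

```python
import math


def identify_trips(points, speed_mph=3.0, dist_ft=1000.0, dwell_s=300.0):
    """Assign a trip ID to each sighting of one tour; 0 marks stationary sightings.

    points is a time-ordered list of (t_seconds, x_ft, y_ft).
    """
    def leg(i, j):
        """Distance in feet and speed in mph between sightings i and j."""
        (t1, x1, y1), (t2, x2, y2) = points[i], points[j]
        d = math.hypot(x2 - x1, y2 - y1)
        dt = t2 - t1
        mph = (d / 5280.0) / (dt / 3600.0) if dt > 0 else 0.0
        return d, mph

    n = len(points)
    ids = [0] * n
    next_id = 0
    stop_start = None  # time of the first sighting of the current stop
    for i in range(n):
        speed_out = leg(i, i + 1)[1] if i + 1 < n else 0.0
        moving_out = speed_out >= speed_mph
        if i == 0 or ids[i - 1] == 0:
            # Not on a trip: a trip starts at the first sighting that moves on.
            if moving_out:
                next_id += 1
                ids[i] = next_id
            stop_start = None
            continue
        d_in, speed_in = leg(i - 1, i)
        if speed_in >= speed_mph:
            ids[i] = ids[i - 1]  # on the move: same trip
            stop_start = None
            continue
        if d_in < dist_ft:
            if stop_start is None:
                stop_start = points[i - 1][0]
            if points[i][0] - stop_start < dwell_s:
                ids[i] = ids[i - 1]  # short dwell: still the same trip
                continue
        # The trip ended: slow jump beyond the radius, or dwell time reached.
        stop_start = None
        if moving_out:
            next_id += 1
            ids[i] = next_id
    return ids


# 2,640 ft in 60 s is 30 mph; the stop between t=180 and t=600 exceeds 5 minutes.
sightings = [(0, 0, 0), (60, 0, 0), (120, 2640, 0), (180, 5280, 0), (240, 5290, 0),
             (600, 5295, 0), (660, 5300, 0), (720, 7940, 0), (780, 7945, 0)]
print(identify_trips(sightings))
```

Calling the same function with `dwell_s=1800.0` gives the 30-minute variant used for long-distance tours.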
4.5.3 Trip Identification for Long-distance Tours 
 
 
 Due to the nature of long-distance tours, trip identification for long-distance trips is 
performed differently. While the short-distance trip identification concentrated on daily 
observations, this section focuses on sightings on long-distance tours over a month. All the 
sightings with the flagged tour ID of "Long-Distance Tour" are filtered for the entire month, 
implying that they are on a long-distance tour. Next, trips are identified through regenerating tour 
IDs, identifying primary and secondary stops, destinations, assigning sub-tour IDs, and identifying 
trips on sub-tours. Figure 8 shows how the trip identification algorithm for long-distance tours 
works. Each stage of the flowchart is described in the following sections.
 
Figure 8. Recursive algorithm of trip identification for long-distance tours. 
4.5.3.1 Tour-ID regeneration 
 
 
 In the tour identification algorithm, long-distance tours were assigned a flagged 
tour ID of "Long-Distance Tour." At this step, a new random tour ID is assigned to all 
sightings between two consecutive at-home sightings. The difference with the previous 
tour identification algorithm is that the previous one was limited to observations within 
a trip day, so the tour window was limited to one day. However, this time, a multi-day 
tour and a multi-day trip can be identified. 
4.5.3.2 Stop and destination identification 
 
 
The recursive trip identification algorithm described in section 4.5.2 is applied 
on long-distance tours with a time threshold of 30 minutes instead of 5 minutes so that 
a trip ends only if a user stays somewhere for 30 minutes or more, and all the trip ends 
are identified and named as "secondary stops." Primary stops are restricted cases of 
secondary stops. Primary stops on a long-distance tour are places where users stay and 
make secondary tours or places in which users stay for a significant amount of time. 
The spatial resolution of primary stops in our algorithm is geo-hash level 6, a rectangle 
with a width and height of 1.2 km × 609.4 m. The following criteria are used to identify 
primary stops at geo-hash level 6:   
1. If the duration of stay at a geo-hash is longer than 2 hours and in the current 
tour, the device leaves the geo-hash but later returns. 
2. If the duration of stay at a geo-hash is longer than 24 hours. 
3. If it is a home location. 
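
A minimal sketch of these three criteria is shown below. It assumes that the secondary stops of one tour have already been aggregated into time-ordered visits to level-6 geo-hashes and that consecutive entries are separate visits; the zone names and the function name are hypothetical.

```python
def primary_stop_zones(stops: list, home_geohash: str) -> set:
    """Return the level-6 geo-hashes that qualify as primary stops on one tour.

    stops is a time-ordered list of (geohash6, arrive_s, depart_s) visits.
    """
    primary = set()
    for i, (zone, arrive_s, depart_s) in enumerate(stops):
        stay_hours = (depart_s - arrive_s) / 3600.0
        returns_later = any(later[0] == zone for later in stops[i + 1:])
        if (
            zone == home_geohash
            or stay_hours > 24
            or (stay_hours > 2 and returns_later)
        ):
            primary.add(zone)
    return primary


HOUR = 3600
tour_stops = [
    ("hotel", 0, 3 * HOUR),           # 3-hour stay, revisited later
    ("museum", 4 * HOUR, 5 * HOUR),   # short stay, never revisited
    ("hotel", 6 * HOUR, 16 * HOUR),
    ("cabin", 20 * HOUR, 50 * HOUR),  # stay longer than 24 hours
    ("home", 60 * HOUR, 61 * HOUR),
]
print(sorted(primary_stop_zones(tour_stops, home_geohash="home")))
```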
 
 Furthermore, the primary destination of a tour is defined as the farthest stop 
located at least 50 miles away from the home location of a user. At first, primary 
stops are used to find the destination; if no destination is found among them, secondary 
stops are investigated for destination identification.  
4.5.3.3 Sub-tour identification 
 
A sub-tour is a segment of a long-distance tour that falls between two primary stops. 
At this step, every time a user leaves a primary stop, a sub-tour ID is generated. The 
sub-tour-ID is assigned to all the sightings of a device until the device is again seen at 
a primary stop.  
4.5.3.4 Trip generation 
 
If a tour does not have a destination or the destination is the same as the user's 
work location, the recursive trip identification algorithm with a time threshold of 5 
minutes (short-distance trip identification) is applied to all the points of the tour. On the 
other hand, if a destination different from the work location is found, the recursive trip 
identification algorithm with a time threshold of 30 minutes is applied to the sub-tours 
identified on this tour. 
 Figure 9 illustrates how the tour-based algorithm produces more accurate trip 
identification results than the traditional methods. Graphs (a) and (b) show how the 
tour-based method differentiates actual activity clusters (e.g., home cluster and work 
cluster) from mid-trip transfer points (e.g., waiting at a transit station). The ability to 
construct linked trips from unlinked trips based on the tour-based approach leads to a 
higher consistency between the trips derived from mobile device location data and trips 
reported by the surveys such as NHTS. 
 
  
(a). Multiple Unlinked Person Trips (b). One Linked Person Home-to-Work Trip 
Figure 9. Tour identification and trip linking demonstration. 
 
 
 
Chapter 5:  Results 
 
The tour-based trip identification algorithm is applied to the mobile device location 
data of January 2020, and the trips on long-distance tours and short-distance tours are 
derived. To assure the quality of the reported trips, we investigated the 
derived trips and added four post-processing steps on top of the trip identification 
algorithm. The types of treated trips are as follows: 
1. Trips with inadequate sightings and information. 
2. Trips made in a trip end activity location. 
3. Trips with high detour factors, i.e., round trips. 
4. Trips with high speed between several consecutive trip points. 
 Steps taken on these trips for further treatments are described in section 5.1. 
5.1 Post-processing steps in the trip-identification algorithm 
 
 
 Initially, trips with only two sightings are removed due to the inadequacy of trip 
information. These trips do not provide any information about the route taken by the 
user and might cause further issues, such as the Euclidean distance between the trip ends 
being reported in place of the actual trip distance. Secondly, trips shorter than 300 
meters are removed. This filter removes short trips made within an activity location while the 
user's phone records their real-time location. Local movements within an activity location 
should not be considered trips. Figure 10 shows a mall where a user's phone recorded 
their sightings, and the trip identification algorithm reported these local movements as 
a trip with 68 sightings. Noise has been added to the sightings for privacy 
protection. This trip is removed by the predefined distance threshold. 
 
Figure 10. An example of a local movement in a mall. 
  
 Thirdly, when a trip with a significant detour factor is observed, the trajectory 
of the trip is further investigated to break down the single trip into multiple unlinked 
trips. The detour factor is calculated using the following formula: 
$$\text{Detour factor} = \frac{\text{Cumulative distance between all the sightings of a trip}}{\text{Euclidean distance between the two trip ends of a trip}}$$
 When the detour factor of a trip is higher than five, the trip is divided into two 
segments: The first trip would be from the user's starting point to the farthest point they 
traveled, and the second trip would be from the farthest point to the ending point.  
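
The detour check and the split at the farthest point can be sketched as follows. The sketch assumes planar coordinates in any consistent unit instead of latitude and longitude, and it interprets the farthest point as the sighting farthest from the starting point; both are assumptions made for illustration.

```python
import math


def detour_factor(path: list) -> float:
    """Cumulative path length divided by the straight-line distance between the ends."""
    cumulative = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    straight = math.dist(path[0], path[-1])
    return math.inf if straight == 0 else cumulative / straight


def split_at_farthest(path: list, max_factor: float = 5.0) -> list:
    """Split a trip with a high detour factor at the point farthest from its start."""
    if len(path) < 3 or detour_factor(path) <= max_factor:
        return [path]
    far = max(range(len(path)), key=lambda i: math.dist(path[0], path[i]))
    if far in (0, len(path) - 1):
        return [path]  # no interior turning point to split at
    return [path[: far + 1], path[far:]]


round_trip = [(0, 0), (1000, 0), (2000, 0), (1000, 0), (100, 0)]
print(detour_factor(round_trip))
print(split_at_farthest(round_trip))
```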
 
 Lastly, trips with several consecutive high-speed data points are removed. 
Due to possible inaccuracies in GPS data collection, data jumps can occur. 
When a GPS device is among tall buildings, underground, or in tunnels, location 
data collection may not work well. Studies have used speed thresholds to remove jumps 
from datasets (Thiagarajan et al., 2009).  
 Figure 11 shows a trip with multiple jumps in the reported sightings. A user 
made a trip from A to D (the red lines), but the GPS signal randomly reported the user's 
sightings in distant locations within seconds. The data jumps caused the actual trip to be 
reported as trips from A to B, B to A, A to C, C to B, and B to D in short periods. To 
avoid such inaccurate trips, trips in which 20% or more of the sightings have a speed of 
500 meters per second or higher are removed. A high speed value is chosen to ensure that 
no air trip is removed from the data, since air trips are likely to have a low number of 
sightings and a high speed between consecutive points. 
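
The jump filter can be sketched as below. The sketch assumes that the speed between consecutive sightings has already been computed in meters per second and treats both limits as inclusive; the example speeds are illustrative.

```python
def has_data_jumps(speeds_mps: list, speed_limit: float = 500.0,
                   max_share: float = 0.2) -> bool:
    """True when the share of implausibly fast legs reaches the removal threshold."""
    if not speeds_mps:
        return False
    fast_legs = sum(speed >= speed_limit for speed in speeds_mps)
    return fast_legs / len(speeds_mps) >= max_share


jumpy_trip = [10.0, 12.0, 900.0, 11.0, 700.0]  # two of five legs are jumps
air_trip = [250.0] * 10                        # fast, but below the 500 m/s limit

print(has_data_jumps(jumpy_trip), has_data_jumps(air_trip))
```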
 
 
Figure 11. A trip with multiple data jumps. 
  
5.2 Regional Trip Validation with MTS 
 
The first validation step compares trip-level results with a regional travel 
survey, the BMC Maryland Statewide Household Travel Survey (MTS). Since the 
MTS only reported trips taken by the residents of the counties mentioned in section 3.3, 
MDLD users were also filtered to those whose home locations were in these counties. 
All the trips generated during January 2020 are filtered and validated against the 
weighted trips reported in the MTS. 
 
Figures 12 and 13 compare the length and the duration 
of the trips made by residents of the selected area in MDLD and the MTS. The overall 
distributions are similar in both datasets, although MDLD reports more long-distance 
and fewer short-distance trips.  
 
Figure 12. Trip length distribution comparison between MDLD and MTS survey. 
 
 
Figure 13. Travel time distribution comparison between MDLD and MTS survey. 
 
 
 The distribution of the trip start time in MDLD is validated against the MTS, as 
shown in Figure 14. The overall distribution of MDLD trips is similar to that of the travel 
survey, although the MDLD trips show a flatter shape with a smaller morning 
peak and slightly more trips during the night. 
 
Figure 14. Trip start time distribution comparison between MDLD and MTS data. 
  
Figure 15 compares trip rates in the MTS and the MDLD on a weekday for 
the people who made at least one trip. As expected, the proposed trip identification 
algorithm reports many users with only one trip. This observation can be 
explained by the fact that MDLD does not capture the entire itinerary of a user 
during a trip day, while a survey records all the trips of a respondent during a day, to 
the best of the respondent's knowledge. This leads to an overall underestimation of the 
number of trips made by the users.   
 
Figure 15. Trip rate distribution comparison between MDLD and MTS data. 
5.3 National Trip Validation with NHTS 
 
 As the next step of the validation process, the derived trips are validated at the 
national level by comparing them with the National Household Travel Survey (NHTS) 
2017. Similar to the validation based on the MTS, trip length, travel time, trip start time, 
and trip rate distributions are plotted in Figures 16-19. Again, MDLD reports more 
long-distance trips and fewer short-distance trips than the survey. Several factors may 
explain this difference. First, the method for calculating the survey trip length differs from 
the one used for the MDLD trips: NHTS used the shortest network path distance generated 
by the Google API, while the MDLD trip lengths are the cumulative distance between all the 
trip sightings. Moreover, considering the nature of MDLD collection, longer trips are more 
likely to be captured by the data collectors. Sampling biases could be another reason for 
these discrepancies. Morning and evening peaks are both captured, although the morning 
peak in MDLD is not as sharp as in the NHTS, and a more flattened distribution is observed. 
 
Figure 16. Trip length distribution comparison between MDLD and NHTS data. 
 
Figure 17. Travel time distribution comparison between MDLD and NHTS data. 
 
 
Figure 18. Trip start time distribution comparison between MDLD and NHTS data. 
 
 
Figure 19. Trip rate distribution comparison between MDLD and NHTS data. 
 
 
 
Furthermore, the top origin-destination (OD) pairs of the derived trips in MDLD are 
compared to those in the NHTS 2017. The rank-rank correlation between OD pairs in NHTS 
and the corresponding OD pairs in MDLD at the state level is 0.955, which represents a 
good match between the ranks of the OD pairs in both datasets. Tables 4 and 5 
list the top 10 origin-destination pairs in NHTS and the corresponding rankings 
of the MDLD OD pairs at the state and county levels, respectively. At the state level, the top 
ten OD pairs observed in NHTS are also the top ten in MDLD, and the rankings are highly 
similar. At the county level, seven of the top ten OD pairs in NHTS are also among the top 
ten in MDLD. Only for New York County is the rank difference large. This 
observation is influenced mainly by the fact that a significant number of trips made in 
Manhattan are underground trips, which are hardly captured by mobile device location 
data since the devices do not have a continuous connection to GPS signals underground.  
Table 4. Top 10 origin-destination pairs in NHTS and MDLD at the state level. 
 
Origin State      Destination State   NHTS ranking   MDLD ranking
California        California          1              3
Texas             Texas               2              1
New York          New York            3              4
Florida           Florida             4              2
Illinois          Illinois            5              7
Ohio              Ohio                6              6
Pennsylvania      Pennsylvania        7              9
Michigan          Michigan            8              10
North Carolina    North Carolina      9              8
Georgia           Georgia             10             5
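
A rank correlation of this kind can be sketched as follows. The thesis does not name the coefficient, so Spearman's rho is an assumption here, and the calculation uses only the ten state-level pairs listed in Table 4. The reported value of 0.955 is computed over the state-level OD pairs as a whole, so the top-ten value below is not expected to reproduce it.

```python
def spearman_from_ranks(ranks_a, ranks_b):
    """Spearman's rho for two rankings of the same items, assuming no ties."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Rankings copied from Table 4 (state level, top ten NHTS pairs).
nhts_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mdld_rank = [3, 1, 4, 2, 7, 6, 9, 10, 8, 5]

rho_top10 = spearman_from_ranks(nhts_rank, mdld_rank)  # about 0.71
```

Restricting the calculation to ten closely ranked pairs makes the coefficient sensitive to small swaps, which is why a value over the full set of pairs can be considerably higher.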
 
 
 
 
 
Table 5. Top 10 origin-destination pairs in NHTS and MDLD at the county level. 
 
Origin County      Destination County   NHTS ranking   MDLD ranking
Los Angeles, CA    Los Angeles, CA      1              1
Cook, IL           Cook, IL             2              4
Maricopa, AZ       Maricopa, AZ         3              3
Harris, TX         Harris, TX           4              2
San Diego, CA      San Diego, CA        5              10
New York, NY       New York, NY         6              140
Orange, CA         Orange, CA           7              8
Dallas, TX         Dallas, TX           8              7
Clark, NV          Clark, NV            9              11
King, WA           King, WA             10             19
 
5.4 Advantages of the proposed algorithm over the clustering method
 
Clustering algorithms are one way to identify trip ends from mobile device
location data, as discussed in Chapter 2. Yang et al. (2021) applied Spatiotemporal
Density-Based Spatial Clustering of Applications with Noise (ST-DBSCAN; Birant and
Kut, 2007) to mobile device location data from "incenTrip," an application that
collects the location points of its users, to derive trip ends. They assigned all the
sightings between two activity stops to a trip. The tour-based trip identification
algorithm is applied to the same dataset, and the resulting trips are compared with
the trips from the clustering algorithm. Because the clustering algorithm requires a
device to have a minimum number of sightings at an activity location within a small
geographical area in a short period, there are several cases in which a trip could
not be captured because no trip end was identified: one or both of the trip ends
failed to form an activity cluster, so the trip was not detected. Figure 20 shows an
example of a trip with 119 sightings that was not captured by the ST-DBSCAN
algorithm but was captured by the tour-based trip identification algorithm as a
38-mile, 66-minute trip.
 
Figure 20. A trip trajectory that was not captured by the ST-DBSCAN algorithm. 
  
The trip origin, destination, and real-time observations are not shown precisely 
due to privacy protection. 
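
The density requirement that causes such misses can be illustrated with a simplified sketch. It is not the full ST-DBSCAN algorithm; it only tests whether one sighting has enough spatiotemporal neighbors to seed an activity cluster. The planar coordinates in meters, the thresholds `eps_m`, `eps_s`, and `min_pts`, and the function name are all hypothetical.

```python
import math


def is_core_point(point, sightings, eps_m=100.0, eps_s=600.0, min_pts=5):
    """Simplified density test: does `point` have at least `min_pts` sightings
    (itself included) within `eps_m` meters and `eps_s` seconds?"""
    x0, y0, t0 = point
    neighbors = sum(
        1
        for x, y, t in sightings
        if math.hypot(x - x0, y - y0) <= eps_m and abs(t - t0) <= eps_s
    )
    return neighbors >= min_pts


# Hypothetical trip end with only two nearby sightings: no cluster can form.
sparse_end = [(0.0, 0.0, 0.0), (10.0, 5.0, 60.0), (2000.0, 0.0, 900.0)]
# Hypothetical trip end with five sightings close in space and time.
dense_end = [(float(i), 0.0, 60.0 * i) for i in range(5)]

sparse_ok = is_core_point(sparse_end[0], sparse_end)  # False
dense_ok = is_core_point(dense_end[0], dense_end)     # True
```

When a device reports only a few sightings at a stop, no sighting passes this kind of test, so no trip end is produced and the trip between the stops is lost, even if the trip itself is densely observed.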
 Furthermore, clustering algorithms are computationally expensive. They may
therefore not scale to datasets as large as the one in this analysis, which contains
sightings of more than 45 million users in one month.
5.5 Case study: COVID-19 pandemic and travel behavior changes 
 
 This section examines a real-world case study using the proposed trip
identification algorithm. First, the sample is weighted to the entire population. Then
two mobility metrics, the trip rate and the percentage of people staying home, are
compared between January 2020, a base month before COVID-19 had spread, and April
2020, a month when COVID-19 cases first rose.
5.5.1 Data expansion to the population level 
 
 
For studies of travel movements at aggregate levels rather than at the level of
individual trips, it is necessary to expand the dataset to the population level. For
the reasons listed below, a simple multi-level weighting, i.e., device-level and
trip-level weighting, is applied to upscale the dataset. First, the available mobile
device location data do not cover the entire population: the average sampling rate
for January 2020 is nearly 14%. A county-level device weighting is therefore applied,
in which the users living in each county are assigned a weight so that the sample
reflects the county's population. The device-level weight is calculated as follows:
\[
\text{Device weight} = \frac{\text{Population of the residence county}}{\text{Number of sampled devices residing in the residence county}}
\]
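
As a minimal sketch of a county-level weight that makes the sample reflect each county's population, the following assumes the weight is the county population divided by the number of sampled devices whose home is in that county. The county names, the numbers, and the function name are hypothetical.

```python
def device_weights(county_population, devices_by_county):
    """Return one weight per county: population / sampled resident devices."""
    weights = {}
    for county, population in county_population.items():
        n_devices = devices_by_county.get(county, 0)
        if n_devices > 0:  # counties with no observed residents get no weight
            weights[county] = population / n_devices
    return weights


# Hypothetical inputs: a 14% sample in County A, no devices in County B.
population = {"County A": 100_000, "County B": 20_000}
devices = {"County A": 14_000}

weights = device_weights(population, devices)  # {"County A": about 7.14}
```

With a 14% sampling rate, each device stands for roughly seven residents, and every device in the same county shares the same weight.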
The five-year (2015-2019) American Community Survey (ACS) is used to 
estimate the population of the counties. Every user has the same weight in a county for 
the entire month since the number of residents is calculated monthly. 
Furthermore, since mobile device location data do not contain sightings of
devices for all 24 hours of a day, part of a person's trip diary may be missed. As a
result, the trips determined from mobile device location data may differ from the
person's actual trips. This issue is addressed by trip-level weighting. Each trip is
weighted so that in January 2020, the MDLD trip rate matches the NHTS 2017 trip rate
at the state level. The trip-level weight is a one-time weight calculated only from
the devices of January 2020. January is chosen because travel behavior in the
following months is likely to differ due to the COVID-19 pandemic. The following
formula calculates the trip-level weight of each device in each state.
\[
\text{Trip weight} = \frac{\text{Trip rate of the residents of a state in NHTS 2017}}{\text{Trip rate of the residents of a state in MDLD trips in January 2020}}
\]
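
The state-level matching described above can be sketched as a simple ratio of trip rates. The trip rate values and the function name below are hypothetical and are not taken from NHTS or from the thesis.

```python
def trip_weights(nhts_trip_rate, mdld_trip_rate):
    """Return one trip weight per state: NHTS 2017 rate / MDLD January 2020 rate."""
    return {
        state: nhts_trip_rate[state] / mdld_trip_rate[state]
        for state in nhts_trip_rate
        if mdld_trip_rate.get(state, 0) > 0  # skip states with no MDLD trips
    }


# Hypothetical trip rates (trips per person per day).
nhts_2017 = {"MD": 3.4, "VA": 3.6}
mdld_jan_2020 = {"MD": 2.0, "VA": 2.4}

weights = trip_weights(nhts_2017, mdld_jan_2020)

# Applying the weight to the MDLD rate recovers the survey rate in January.
weighted_md_rate = mdld_jan_2020["MD"] * weights["MD"]
```

Because the weight is fixed from January 2020, applying it to later months preserves any change in the observed MDLD trip rate relative to January.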
5.5.2 COVID-19 pandemic and the population travel behavior 
 
 The proposed trip identification algorithm has many real-world applications.
Decision-makers can use the derived trips to explore real-world issues from various
angles. COVID-19 is an outbreak that has affected millions of people around the
globe, and stay-at-home orders are one of the non-pharmaceutical interventions used
by the government to contain the spread of the disease. On March 13, 2020, a national
emergency declaration was issued to reduce the trip rate of the population and,
consequently, the spread of COVID-19. Travel behavior analysis can help policymakers
determine how people reacted to the interventions and whether the current strategy
for containing the outbreak is effective.
 This study measures how people practiced social distancing by calculating two
mobility metrics. For each day, we calculated the percentage of people staying at
home and the trip rate of the entire nation in January 2020, a month in which
COVID-19 had not yet affected travel, and in April 2020, when COVID-19 had spread
throughout the country, and compared the two months. The percentage of users staying
home on any given day is defined as the proportion of users observed on that day for
whom the trip identification algorithm detected no trip. The average trip rate of a
population is defined as the average number of trips made by all observed users
during a given day.
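
The two definitions above can be expressed as a minimal sketch for a single day. It is unweighted for simplicity, whereas the thesis applies device-level and trip-level weights; the input structure and the function name are illustrative.

```python
def daily_mobility_metrics(trips_per_user):
    """Compute (percentage staying home, average trip rate) for one day.

    `trips_per_user` maps each user observed that day to the number of
    trips detected for that user (0 if no trip was detected).
    """
    n_users = len(trips_per_user)
    if n_users == 0:
        raise ValueError("no users observed on this day")
    staying_home = sum(1 for n in trips_per_user.values() if n == 0)
    pct_home = 100.0 * staying_home / n_users
    trip_rate = sum(trips_per_user.values()) / n_users
    return pct_home, trip_rate


# Hypothetical day with four observed users, two of whom made no trip.
pct_home, trip_rate = daily_mobility_metrics({"a": 0, "b": 2, "c": 4, "d": 0})
```

Both metrics share the same denominator, the users observed on that day, so users with no sightings on a given day affect neither value.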
 
 Figure 21 and Figure 22 show the percentage of people staying home and the
average daily trip rate at the national level for January and April 2020.
 
Figure 21. Percentage of people staying home in January and April 2020. 
 
Figure 22. The daily trip rates in January and April 2020. 
 
 
 First, both figures show that people stay home more during weekends, and
Sundays have fewer trips than any other day of the week. Second, there is a
considerable gap between January 2020 and April 2020 in both the percentage of people
staying home and the average trip rate per person. When the COVID-19 pandemic started
spreading across the United States, people practiced social distancing: the
percentage of people staying home rose by roughly 10% compared with regular days, and
the average trip rate of the population decreased by more than one trip per day. The
two comparisons indicate that in the early stages of the COVID-19 pandemic, the
interventions introduced by the government had an impact on the travel movements of
the entire population.
 
Chapter 6:  Conclusion and Discussion 
 
6.1 Thesis summary 
 
This study presents a tour-based trip identification algorithm that extracts
trip-level information from location data collected from mobile devices. MDLD from
various data vendors are integrated, and several data cleansing steps are carried out
to obtain a reliable raw dataset.
In the first step, a deduplication algorithm is developed to identify duplicate
devices in the integrated dataset, and the sightings of such devices are merged to
avoid overrepresenting users. In this algorithm, user sightings are examined
spatially and temporally at the level-7 geohash resolution. Devices with the same
home location and the same top five most visited locations during a month are treated
as the same user. In addition, the deduplication results are validated by checking
whether duplicate devices are observed in the same location at the same time.
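
The matching rule summarized above can be sketched as grouping devices by a shared signature. This is a minimal illustration, not the thesis implementation: it assumes the home location and the most visited locations are already available as level-7 geohash strings, and the device identifiers, geohash values, and function name are hypothetical.

```python
from collections import defaultdict


def find_duplicate_devices(device_profiles):
    """Group devices that share a home geohash and the same set of top places.

    `device_profiles` maps device_id -> (home_geohash, list of the most
    visited geohashes in the month). Returns groups with two or more devices.
    """
    groups = defaultdict(list)
    for device_id, (home, top_places) in device_profiles.items():
        signature = (home, frozenset(top_places))  # order of places is ignored
        groups[signature].append(device_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical profiles: dev_a and dev_b share home and top places.
profiles = {
    "dev_a": ("dqcjqcp", ["dqcjr0b", "dqcjqgz"]),
    "dev_b": ("dqcjqcp", ["dqcjqgz", "dqcjr0b"]),
    "dev_c": ("dqcjqcp", ["dqcm2k1", "dqcjqgz"]),
}

duplicates = find_duplicate_devices(profiles)  # [["dev_a", "dev_b"]]
```

Grouping by signature avoids comparing every pair of devices, which matters when the number of devices is in the tens of millions.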
Second, using a home-based tour and trip identification algorithm, the trips of
more than 45 million users during January 2020 are determined from raw sighting data,
which do not provide any trip-level information on their own. The algorithm first
determines whether the sightings of a device belong to a short-distance or
long-distance home-based tour by calculating the distance between the user's home
location and the farthest point visited in each tour. A tour is defined as all the
sightings of a device between two consecutive visits to their home location. Two
different approaches are used to derive the trips of each user: a daily
short-distance trip identification algorithm for trips on short-distance tours and a
monthly long-distance trip identification algorithm for trips on long-distance tours.
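
The tour definition and the short-distance versus long-distance split can be sketched as follows. The input format, the 50-mile cutoff, and the function names are assumptions made for illustration; this summary does not state the threshold used in the thesis.

```python
LONG_DISTANCE_THRESHOLD_MILES = 50.0  # hypothetical cutoff, not from the thesis


def split_into_tours(sightings):
    """Split time-ordered (at_home, miles_from_home) sightings into tours.

    A tour collects the away-from-home sightings between two consecutive
    home visits. Sightings before the first home visit or after the last
    one are left out of this sketch.
    """
    tours, current, seen_home = [], [], False
    for at_home, miles in sightings:
        if at_home:
            if current:
                tours.append(current)
            current, seen_home = [], True
        elif seen_home:
            current.append(miles)
    return tours


def classify_tour(tour, threshold=LONG_DISTANCE_THRESHOLD_MILES):
    """Label a tour by the farthest distance from home reached during it."""
    return "long" if max(tour) >= threshold else "short"


# Hypothetical device trace: one local tour, one long tour, one unfinished tour.
trace = [(True, 0), (False, 2), (False, 5), (True, 0),
         (False, 30), (False, 120), (True, 0), (False, 3)]
tours = split_into_tours(trace)
labels = [classify_tour(t) for t in tours]
```

Classifying by the farthest point rather than by total distance means a tour with many short local stops stays in the short-distance group.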
Third, several post-processing steps are taken to address problems found in
some of the trips. These include trips with inadequate sightings, trips made within
trip-end activity locations such as malls or homes, trips with high detour factors,
and trips with data jumps. The derived trips are then validated against two household
travel surveys: the Maryland Statewide Household Travel Survey (MTS) 2018/2019 as a
regional travel survey and the National Household Travel Survey (NHTS) 2017 as a
national travel survey. The results show a good match for the trip length
distribution, travel time distribution, distribution of trip start time in a day, and
trip rate per person distribution, with a few discrepancies that are discussed in
Chapter 5.
Fourth, the proposed algorithm is applied to a second mobile device location
dataset in which trip ends, and consequently trips, had previously been identified by
a clustering method. Because clustering methods must form a cluster at both ends of a
trip, some trips are not detected by clustering, whereas the proposed tour-based trip
identification algorithm is able to identify them.
Finally, as a real-world application of the trip identification algorithm, the effect 
of the COVID-19 pandemic on the travel behavior of the population is investigated. It 
is shown that at the earliest stages of the pandemic, the population reacted to the travel 
restrictions by staying at home more and making a smaller number of trips each day.  
 
6.2 Discussions and future work 
 
MDLD sightings with inaccurate latitudes and longitudes can have many causes,
such as the device being in a tunnel or near tall buildings. A data cleaning
procedure applied before the trip identification algorithm can be helpful and may
improve the accuracy of the reported trips. However, data cleaning before trip
identification can be computationally intensive and costly. Therefore, the pros and
cons of data jump cleaning must be evaluated for the specific study before it is
implemented.
Moreover, the multi-level weighting method presented in this study, i.e.,
device-level and trip-level weighting, is a simple procedure that can be extended in
many ways. In future studies, socio-demographic data such as education, age, and
gender can be used to weight devices based on the population share of each
socio-demographic group. Because the location data come from smartphones and not
every population group has equal access to smartphones, using different weights for
users with different characteristics can make the study more accurate.
Currently, trip-level weighting assumes that the nation's travel behavior in
2020 was similar to that in 2017. To obtain a more accurate trip-level weight, it
would be helpful to update the average trip rate of each state from 2017 to 2020. In
addition, just as each population group could be assigned a different device-level
weight, trips can be grouped by mode, i.e., rail, bus, drive, bike, walk, air, etc.,
and weighted according to their modes. A travel mode detection step is needed to
weight trips based on travel mode.
Furthermore, the current COVID-19 analysis is limited to January and April
2020 and is designed to demonstrate how these data can be applied in real-world
studies. If the analysis is extended to later months, it can give decision-makers
insights into whether the population still follows the restrictions after several
months or whether social distancing has become less prevalent.
Last but certainly not least, a comparison that includes both GPS-based trip 
detection and user-reported trips is an excellent way to validate results at an individual 
level instead of at a much more aggregated level. Such an analysis requires further user 
and data vendor agreements.  
 
References 
1 Sweeney, L. (2000). Uniqueness of simple demographics in the US Population, in
LIDAP-WP4. http://privacy.cs.cmu.edu/dataprivacy/papers/LIDAP-WP4abstract.html
2 Golle, P. (2006). Revisiting the uniqueness of simple demographics in the US 
population. Proceedings of the 5th ACM Workshop on Privacy in Electronic 
Society.  
3 Golle, P., & Partridge, K. (2009). On the anonymity of home/work location pairs. 
International Conference on Pervasive Computing, Springer, Berlin, Heidelberg. 
4 Trestian, I. et al. (2009) Measuring serendipity: connecting people, locations and 
interests in a mobile 3G network. Proceedings of the 9th ACM SIGCOMM 
conference on Internet measurement.  
5 Chow, C.Y., & Mokbel, M.F. (2011). Trajectory privacy in location-based 
services and data publication. ACM Sigkdd Explorations Newsletter, 13(1), 19-
29. 
6 Zang, H., & Bolot, J. (2011). Anonymization of location data does not work: A 
large-scale measurement study. Proceedings of the 17th annual international 
conference on Mobile computing and networking.  
7 De Montjoye, Y. A., Hidalgo, C. A., Verleysen, M., & Blondel, V. D. (2013). 
Unique in the crowd: The privacy bounds of human mobility. Scientific reports, 
3(1), 1-5. 
8 Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A. L. (2008). Understanding 
individual human mobility patterns. nature, 453(7196), 779-782. 
9 C. Song, Z. Qu, N. Blumm, and A.-L. Barabasi. Limits of predictability in human
mobility. Science, 327(5968):1018-1021, 2010.
10 McGowen, P. & McNally, M. (2007) Evaluating the potential to predict activity 
types from GPS and GIS data. Transportation Research Board 86th Annual 
Meeting, Washington. 
11 Gong, L., Morikawa, T., Yamamoto, T., et al. (2014). Deriving personal trip data 
from GPS data: A literature review on the existing methodologies. Procedia-
Social and Behavioral Sciences, 138(0), 557-565. 
12 Axhausen, K. W., Schönfelder, S., Wolf, J., Oliveira, M., & Samaga, U. (2004,
January). Eighty weeks of gps traces, approaches to enriching trip information. In 
Transportation Research Board 83rd Annual Meeting Pre-print CDROM. 
13 Tsui, S. Y. A., & Shalaby, A. S. (2006). Enhanced system for link and mode 
identification for personal travel surveys based on global positioning systems. 
Transportation Research Record: Journal of the Transportation Research Board, 
1972(1), 38-45. 
14 Bohte, W. & Maat, K. (2009). Deriving and validating trip purposes and travel 
modes for multi-day GPS-based travel surveys: A large-scale application in the 
Netherlands. Transportation Research Part C: Emerging Technologies, 17(3), 
285-297. 
15 Stopher, P. R., Jiang, Q., & FitzGerald, C. (2005). Processing GPS data from 
travel surveys. 2nd international colloquium on the behavioural foundations of
integrated land-use and transportation models: frameworks, models and 
applications, Toronto. 
16 Du, J. & Aultman-Hall, L. (2007). Increasing the accuracy of trip rate information 
from passive multi-day GPS travel datasets: Automatic trip end identification 
issues. Transportation Research Part A: Policy and Practice, 41(3), 220-232. 
17 Stopher, P., FitzGerald, C., & Zhang, J. (2008). Search for a global positioning 
system device to measure person travel. Transportation Research Part C: 
Emerging Technologies, 16(3), 350-369. 
18 Schuessler, N., & Axhausen, K. W. (2009). Processing raw data from global 
positioning systems without additional information. Transportation Research 
Record: Journal of the Transportation Research Board, 2105(1), 28-36. 
19 Gong, H., Chen, C., Bialostozky, E., & Lawson, C. T. (2012). A GPS/GIS method 
for travel mode detection in New York City. Computers, Environment and Urban 
Systems, 2012. 36(2), 131-139. 
20 Safi, H., Assemi, B., Mesbah, M., Fereira, L., and Hickman, M. (2015). Design 
and implementation of a smartphone-based system for personal travel survey: 
Case study from New Zealand. Transportation Research Record: Journal of the 
Transportation Research Board, 2526, 99-107.
21 Patterson, Z., & Fitzsimmons, K. (2016). Datamobile: Smartphone travel survey 
experiment. Transportation Research Record: Journal of the Transportation 
Research Board, 2594(1), 35-43. 
22 Wolf, J., Guensler, R., & Bachman, W. (2001). Elimination of the travel diary: 
Experiment to derive trip purpose from global positioning system travel data. 
Transportation Research Record, 1768(1), 125-134. 
23 Birant, D., Kut, A.: ST-DBSCAN: an algorithm for clustering spatial-temporal
data. Data Knowl. Eng. 60(1), 208-221 (2007)
24 Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for
discovering clusters in large spatial databases with noise. Kdd 96(34), 226-231
(1996) 
25 Yang, M., Pan, Y., Darzi, A. et al. A data-driven travel mode share estimation 
framework based on mobile device location data. Transportation (2021). 
https://doi.org/10.1007/s11116-021-10214-3 
26 Maryland Statewide Household Travel Survey. 
https://www.baltometro.org/transportation/data-maps/maryland-travel-survey. 
27 Yang, M. (2020). Multimodal Travel Mode Imputation Based on Passively 
Collected Mobile Device Location Data (Doctoral dissertation, University of 
Maryland, College Park). 
28 Wolf, J., Bricka, S., Ashby, T., & Gorugantua, C. (2004, June). Advances in the 
application of GPS to household travel surveys. In National Household Travel 
Survey Conference, Washington DC. 
29 Forrest, T. L., & Pearson, D. F. (2005). Comparison of trip determination methods 
in household travel surveys enhanced by a global positioning 
system. Transportation Research Record, 1917(1), 63-71. 
30 Stopher, P., Clifford, E., Zhang, J., & FitzGerald, C. (2008). Deducing mode and purpose 
from GPS data 
31 Chen, C., Ma, J., Susilo, Y., Liu, Y., & Wang, M.. The promises of big data and 
small data for travel behavior (aka human mobility) analysis. Transportation 
research part C: emerging technologies, 2016. 68, 285-299. 
32 Yang, M. (2020). Multimodal Travel Mode Imputation Based on Passively 
Collected Mobile Device Location Data (Masters Thesis, University of Maryland, 
College Park). 
33 Steven Manson, Jonathan Schroeder, David Van Riper, Tracy Kugler, and Steven 
Ruggles. IPUMS National Historical Geographic Information System: Version 
16.0. Minneapolis, MN: IPUMS. 2021. http://doi.org/10.18128/D050.V16.0 
34 Gong, L., Yamamoto, T., & Morikawa, T.. Identification of activity stop locations 
in GPS trajectories by DBSCAN-TE method combined with support vector 
machines. Transportation Research Procedia. 32, 146-154, (2018). 
35 Axhausen, Kay W., et al. "80 weeks of GPS-traces: approaches to enriching the 
trip information: submitted to the 83rd Transportation Research Board Meeting." 
Arbeitsberichte Verkehrs-und Raumplanung 178 (2003). 
36 Zhou, C., Jia, H., Juan, Z., Fu, X., & Xiao, G.. A data-driven method for trip ends 
identification using large-scale smartphone-based GPS tracking data. IEEE 
Transactions on Intelligent Transportation Systems. 18(8), 2096-2110, (2016). 
37 Zhou, C., Frankowski, D., Ludford, P., Shekhar, S., & Terveen, L.. Discovering 
personally meaningful places: An interactive clustering approach. ACM 
Transactions on Information Systems (TOIS). 25(3), 12, (2007). 
38 Chen, W., Ji, M., & Wang, J.. T-DBSCAN: A spatiotemporal density clustering 
for GPS trajectory segmentation. International Journal of Online Engineering 
(iJOE). 10(6), 19-24, (2014). 
39 Ye, Y., Zheng, Y., Chen, Y., Feng, J., & Xie, X.. Mining individual life pattern 
based on location history. 2009 tenth international conference on mobile data 
management: Systems, services and middleware. pp. 1-10, (2009). 
40 Yao, Z., Zhou, J., Jin, P. J., & Yang, F.. Trip End Identification based on Spatial-
Temporal Clustering Algorithm using Smartphone GPS Data (No. 19-01097), 
Presented at 98th Annual Meeting of the Transportation Research Board, 
Washington, D.C., (2019). 
41 Wang, F., Wang, J., Cao, J., Chen, C., & Ban, X. J.. Extracting trips from multi-
sourced data for mobility pattern analysis: An app-based data example. 
Transportation Research Part C: Emerging Technologies. 105, 183-202, (2019). 
42 Thiagarajan, A., L. et al. VTrack: Accurate, Energy-Aware Road Traffic Delay 
Estimation Using Mobile Phones. Proc., 7th ACM Conference on Embedded 
Networked Sensor Systems, 2009, pp. 85-98.