ABSTRACT 
 
 
 
 
Title of Thesis: TEMPORAL TRACKING URBAN AREAS 
USING GOOGLE STREET VIEW 	
  
 Ladan Najafizadeh, Master of Science, 2016 
  
Thesis Directed By: Professor Jon E. Froehlich 
Department of Computer Science 
 
 
Tracking the evolution of built environments is a challenging problem in computer vision 
due to the intrinsic complexity of urban scenes, as well as the dearth of temporal visual 
information from urban areas. Emerging technologies such as street view cars, provide 
massive amounts of high quality imagery data of urban environments at street-level (e.g., 
sidewalks, buildings). Such datasets are consistent with respect to space and time; hence, 
they could be a potential source for exploring the temporal changes transpiring in built 
environments. However, using street view images to detect temporal changes in urban 
scenes induces new challenges such as variation in illumination, camera pose, and 
appearance/disappearance of objects. 
In this thesis, we leverage Google Street View’s new feature, “time machine”, to track 
and label the temporal changes of accessibility features (e.g., existence of curb-ramps, 
condition of sidewalks). The main contributions of this thesis are: (i) initial proof-of-
concept automated method for tracking accessibility features through panorama images 
across time, (ii) a framework for processing and analyzing time series panoramas at scale, 
and (iii) a geo-temporal dataset including different types of accessibility features for the 
task of detection. 
 
 
  
 
 
 
 
 
 
 
 
 
 
TEMPORAL TRACKING URBAN AREAS USING GOOGLE STREET VIEW  
 
 
 
by 
 
 
Ladan Najafizadeh 
 
 
 
 
 
Thesis submitted to the Faculty of the Graduate School of the  
University of Maryland, College Park, in partial fulfillment 
of the requirements for the degree of 
Master of Science 
2016 
 
 
 
 
 
 
 
 
 
Advisory Committee: 
Professor Jon E. Froehlich, Chair 
Professor Hal Daumé III 
Professor Jeffry Foster 
Professor David Jacobs 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
© Copyright by 
Ladan Najafizadeh 
2016 
 
 
 
 
 
 
 
 
 
 
 
 
 
  ii 
 
To my parents,  
Soheila & Abbas
  iii 
 
Acknowledgements 
First and foremost, I would like to thank my wonderful advisor, Jon E. Froehlich, for 
always taking chances on me, from the beginning and throughout my studies here at 
UMD. Your precious advice and guidance encouraged me to not only be a better 
researcher, but also be a better version of myself. Thank you for being patient with me, 
and understanding me, when I needed. 
 
I would also thank my committee members, Hal Daumé III, Jeffrey Foster, and David 
Jacobs. Hal, you are an incredible teacher, and I feel lucky to have had a chance to take 
your classes and having you in my committee. Jeff, thank you for always supporting me 
one way or another. David, your valuable feedback, and suggestions truly helped me look 
at problems from different angles. Thank you. 
 
Throughout my studies at UMD, I took some courses that definitely helped me in my 
research and in my professional life. Leah Findlater, Tom Goldstein, and David Mount, 
thank you for sharing your knowledge with others and me.  
 
I could not make this far without the support of my friends and colleagues. Kotaro Hara, 
thank you for always giving me great advice and helping me figuring out my research 
questions. You are a true definition of O.G. A big thank you to my UMD friends: Jin Sun 
(for helping me at such a short notice), Matt Mauriello (for your wisdom and for making 
us laugh once in a while), Lee Stearns (for your help and advice, and Sir, you indeed have 
eagle eyes), Seokbin Kang (for your collaboration), Manaswi Saha (for your friendship), 
Leyla Norooz (for our worldwide adventures), Michael Gubbels (for our deep 
conversations), Sudha Rao (for the long-hours we spent struggling with homework 
questions, and of course, for your friendship), Meethu Malo (for your friendship), Jonggi 
Hong, Liang He, Majeed Kazemitabaar, and Soheil Behnezhad. Also a big shout out to 
my Persian friends at UMD, especially: Saba Ahmadi (for being such a great friend and a 
good listener), Kiana RoshanZamir (for always keeping me in the loop, and for your 
friendship and kindness), Mahsa Derakhshan, Ali Shafahi, and Sina Dehghani, as well as 
to my non-UMD friends, Venus Saatchi, and Elham Alikhani for always cheering me up.  
  iv 
 
 
At last but not least, I would like to thank my parents, Soheila and Abbas for their 
unconditional love. You taught me to fight for my dreams and to never give up on them. 
Your way of living has inspired me to think independently and passionately. Laleh thank 
you for being the best sister I could ever ask for. You always have been my role model. 
Ali, you are my brother, and my best friend at the same time, how cool is that? Thank 
you for being there, when I needed help. Our family would not have been complete 
without Sasan, and Sahar. Thank you for brining more joy to our family J 
  
  v 
 
Table of Contents 
 
 
Dedication ........................................................................................................................... ii	
Acknowledgements ........................................................................................................... iii	
Table of Contents ................................................................................................................. v	
List of Tables ...................................................................................................................... vi	
List of Figures ................................................................................................................... vii	
Chapter 1: Introduction ........................................................................................................ 1	
1.1 Summary of Contributions ........................................................................................ 5	
Chapter 2: Background and Related Work .......................................................................... 7	
2.1 Tracking Urban Changes ........................................................................................... 7	
2.2 Tracking Sidewalk Accessibility ............................................................................. 12	
Chapter 3: Methodology .................................................................................................... 15	
3.1 Dataset ..................................................................................................................... 15	
3.1.1 Data Collection ................................................................................................. 16	
3.1.2 Dataset Limitations ........................................................................................... 20	
3.2 Our Approach .......................................................................................................... 21	
3.2.1 ROI labeling ..................................................................................................... 22	
3.2.2 Category Classification ..................................................................................... 23	
3.2.3 Localization ...................................................................................................... 25	
3.2.4 Handling Occlusion .......................................................................................... 28	
3.3 Summary .................................................................................................................. 29	
Chapter 4: Study Results ................................................................................................... 31	
4.1 Category Classification ............................................................................................ 31	
4.2 System Framework Evaluation ................................................................................ 32	
4.3 Results ..................................................................................................................... 34	
4.3.1 Per Location Results ......................................................................................... 35	
4.3.2 Per Image Results ............................................................................................. 36	
Chapter 5: Discussion and Future Work ............................................................................ 38	
5.1 Conclusion and Limitations ..................................................................................... 38	
5.2 Directions Towards Future Research ...................................................................... 40	
Bibliography ...................................................................................................................... 42	
 
 
 
 
 
 
 
 
 
 
  vi 
 
List of Tables 
 
Table 2-1: Summary of data collection and change detection techniques using non-
stationary cameras. .................................................................................................... 12 
Table 3-1: Distribution of dataset ...................................................................................... 19 
Table 3-2: Overall distribution of snapshots per location ................................................. 20 
Table 3-3: A breakdown of labels with respect to location and image. Per location refers 
to existence of accessibility features on at least one of the snapshots, and per image 
represent the existence of accessibility features on all images, regardless of their 
locations and time. ..................................................................................................... 20 
Table 4-1: Category classification confusion matrix. Each cell indicates the percentage of 
images assigned to a predicted category (column) for each actual category (row)…
 ................................................................................................................................... 31 
 
  vii 
 
List of Figures 
Figure 1-1: Examples of temporal tracking built environments in urban studies. .............. 1 
Figure 1-2: Types of accessibility features in built environments ....................................... 4 
Figure 1-3: The challenges and limitations of GSV temporal images.  All locations are 
located in Washington, DC. The physical address from left to right is as follows: 
1899 Lang PI NE, 2702 28th St NE, 1733 Lang PI NE, 1701 V St SE. ..................... 5 
Figure 2-1: Change detection on satellite imagery for an area by comparing two 
consecutive years. The changes are shown in different colors, where each color 
represents the transition between two states. For example, the orange represents the 
transition from vegetation to soil. ................................................................................ 9 
Figure 2-2: Structural change detection using De-convolutional Neural Networks. ........ 11 
Figure 3-1: Difference in the number of temporal images available in GSV for various 
locations. .................................................................................................................... 15 
Figure 3-2: Examples of locations with various transitions in terms of accessibility 
features: (a) the missing curb-ramps changed to accessible curb-ramps, (b) the 
missing curb-ramps still exist, (c) the sidewalk remains accessible, (d) vehicles 
obstruct the view of missing curb-ramps in 2009-07. ............................................... 17 
Figure 3-3: A view of JSON file for a location with physical address as: 3052 Douglas St 
NE, Washington, DC. ................................................................................................ 18 
Figure 3-4: The stages in our framework: (stage1) the accessibility problem (e.g., “object 
in path”) is manually labeled, (stage 2) the category classifier classifies the labeled 
area, and (stage 3) the object detector localizes the accessibility problem in all 
snapshots over time of that location. ......................................................................... 22 
Figure 3-5: Comparing the quality of the earliest image (left) to the most image (right) of 
a location. ................................................................................................................... 23 
Figure 3-6: The ROI patches that are used in training the category classification. ........... 24 
Figure 3-7: Consistency of the location of accessibility features (e.g., accessible curb-
ramps) within all images over time of a same scene. ................................................ 25 
Figure 3-8: The overall procedure of Cascade object detector. ......................................... 26 
Figure 3-9: Consistency of aspect ratio within the examples of “Objects in path” 
category. .................................................................................................................... 27 
Figure 3-10: Occluded an area of sidewalk by a vehicle (highlighted in blue). The left-
view, and the right-view give enough information regarding the hidden area in front-
view. .......................................................................................................................... 28 
Figure 3-11: Tracking the missing curb-ramps in a location, where the curb is occluded in 
one of the snapshots (2009-07). However, the curb still exists in 2011-08, meaning 
that the occluded snapshot can be ignored. ............................................................... 29 
Figure 4-1: Test-time performance of our approach for each category and overall, with 
respect to Precision, Recall, and F1-score. ................................................................ 33 
Figure 4-2: Framework results for “Objects in path” category. The blue label refers to the 
manually labeled ROI on the most recent snapshot (input of the framework), and red 
labels are localized by the framework, referring to the existence of accessibility 
problems on all previous snapshots. .......................................................................... 34 
  viii 
 
Figure 4-3: The most recent snapshot (top) has been manually labeled to specify the 
“Accessible sidewalk”, and the result is shown on all previous snapshots (bottom).
 ................................................................................................................................... 34 
Figure 4-4: top snapshot is manually labeled as “Missing curb-ramps”, and the four 
bottom snapshots are the result of framework. The yellow label in the last snapshot 
refers to misclassification of “Missing curb-ramps” and “Surface problems”. ......... 35 
Figure 4-5: Successful results of our framework. The green labels refer to accessible 
sidewalks and accessible curb-ramps. The red labels refer to accessibility problems.
 ................................................................................................................................... 36 
Figure 4-6: Failed results of our framework. The yellow labels refer to either 
misclassifying the accessibility features, or not identifying the specified accessibility 
features indicated by the red arrows. ......................................................................... 36 
Figure 5-1: Possible heatmap visualization. The red refers to the period in which the 
accessibility features have not been maintained, and the green refers to the 
accessibility features being recently updated. ........................................................... 40 
 
 
 
 
  1 
 
Chapter 1: Introduction 
Evolution of built environments, whether occurring naturally or artificially, is an 
unavoidable process, which affects the elements of urban areas (e.g., plants, buildings, 
sidewalks, climate, etc.) in terms of transformation, deterioration, amelioration, and 
construction [1]–[3]. Across urban environments, trees and plants transform into different 
states due to earth’s axial tilt (i.e., seasonal change), existing infrastructures deteriorate 
with time and usage, and new infrastructures are constructed in response to the demands.  
Indeed, studying the evolution of urban environments is critically important to 
government policy, urban studies, as well as citizens (e.g., for understanding 
gentrification, land use, predicting real-estate prices, etc.). Tracking temporal changes of 
built environments and visualizing the changes at scale would allow us to build better 
models of urban behavior across time (Figure 1-1). Tracking and visualizing accessibility 
problems, specifically, will help reveal how and where cities invest in improving 
accessibility infrastructure, how often that infrastructure is changes/improved, and 
whether certain parts of a city are systematically overlooked.  
Land use Gentrification 
Figure 1-1: Examples of temporal tracking built environments in urban studies. 
 
 
  2 
 
Tracking built environments has been around for a while, and has been investigated 
by several studies in computer science, geo science, and urban planning over the years 
(details can be found in Chapter 2). For instance, in urban planning, inspecting the 
condition of street/sidewalk in a particular area requires taking multiple snapshots of that 
area over a specified time period. These snapshots then may help urban planners to gather 
information on how often street/sidewalks need to be changed/updated, or if constructing 
new infrastructure is in demand. These types of inspections are usually done via street 
audits, which are labor intensive, and do not address every issue of sidewalks such as 
accessibility of sidewalks. In this regard, the dearth of mechanisms to track the 
accessibility features in built environments at scale, motivated this thesis with the 
following research questions: 
• How can we track the changes of accessibility features in urban environments 
across time, and automatically label them?  
• How can we perform such a task at relatively large scale, let’s say for the entire 
city?  
The Lack of accessibility in urban areas directly impacts the lives of individuals with 
mobility impairments in many ways [4], [5]. The problem is not just that sidewalk 
accessibility affects where and how people travel in cities, but also that there are few 
mechanisms to determine accessible areas of a city a priori.  A newly published report by 
the National Council on Disability stated that no comprehensive information can be 
found on the degree to which sidewalks are accessible across the US [6].  
Recent studies such as Project Sidewalk [7], proposed the use of crowdsourcing to 
locate and assess sidewalk accessibility problems via Google StreetView imagery. This 
  3 
 
thesis extends work in Project Sidewalk [7]–[9]. Project Sidewalk focused on scalable 
methods to map the accessibility of the world by semi-automatically classifying features 
in panoramic map imagery such as Google StreetView, where only the current state of 
accessibility infrastructure is being captured. In contrast, our primary focus, in this 
research, is on developing scalable methods to track accessibility features in the built 
environment over time. This is, arguably, a much harder problem because we have to 
scale both spatially (lots of location) as well as temporally (over time). Thus, our dataset 
is much larger. In this thesis, we provide a proof-of-concept investigation of studying 
how to back propagate labels of accessibility features in time.  
The types of elements in urban environments that we are interested in are 
accessibility problems (i.e., poorly conditioned sidewalks, missing curb-ramps, objects on 
path), as well as accessible sidewalks (e.g., sidewalks with no accessibility problems) to 
check whether the accessibility problems resolved (Figure 1-2). Tracking these elements 
is particularly a hard problem in computer vision, since they might change in terms of 
structure and texture. Take, for instance a poorly conditioned sidewalk that got updated 
over the course of few years, and became an accessible sidewalk. This type of changes 
are textural rather than structural, since the geometric shape of the sidewalk remains the 
same, while the changes only occurred on its surface (i.e., change in color or intensity). 
On the other hand, appearance and disappearance of objects on sidewalks represent a 
structural change. 
  4 
 
     Emerging technologies such as street view cars provide massive amounts of high 
quality imagery data of built environments that gets updated frequently. Google Street 
View (GSV) is an example of such technologies that contains a feature, called “Time 
machine”, which allows for a possibility of going back in time and exploring how built 
environments evolve over time (currently from 2007 to 2015) [10]. Moreover, GSV 
covers nearly every region of cities in the US [11], which makes it to be a potential 
source for exploring the evolution of urban areas, especially those neighborhoods that 
receive less attention in terms of maintenance of the pedestrian infrastructure.  
This massive amounts of GSV imagery data as a less conventional data source is used 
in this thesis to track the progression of urban infrastructures in terms of accessibility 
features at scale, which otherwise would be expensive and difficult. However, similar to 
many other datasets, GSV images come with their own challenges such as, different 
Object in path Missing curb-ramps Surface problem Surface problem 
Accessible sidewalk Accessible curb-ramp 
Figure 1-2: Types of accessibility features in built environments 
 
  5 
 
(lighting, weather, and season) conditions, as well as different camera viewpoints, and 
oftentimes containing occlusions (e.g., a parked car obstructing the region of interest). 
These challenges are particularly hard to deal with from the computer vision perspective, 
where the aim is to detect the changes that have taken place in urban scenes across time, 
with a reasonable accuracy (Figure 1-3).  
1.1 Summary of Contributions 
In this thesis, our focus is to explore the possibility of tracking accessibility features in 
urban areas across time using Google StreetView images as our dataset. Towards this 
goal, we have collected temporal images of nearly 400 locations from different 
neighborhoods of Washington, DC and the state of Maryland. We formulate the problem 
 
Viewpoint variation Occlusion Design variation Illumination variation 
Figure 1-3: The challenges and limitations of GSV temporal images.  All locations are located in Washington, DC. 
The physical address from left to right is as follows: 1899 Lang PI NE, 2702 28th St NE, 1733 Lang PI NE, 1701 V 
St SE. 
  6 
 
into two parts: (1) image classification for classifying the types of accessibility problems, 
and (2) object detection to localize the accessibility problems within all snapshots.  
To address this problem, the proposed system works as follows: the system identifies 
and labels the accessibility problems (e.g., object in path, missing curb-ramps) in the 
most recent image at the location of interest. Next, it searches for the identified problems 
in the previous snapshots of that location, to see whether the identified accessibility 
problems have been resolved, or they still exist. The details of our proposed framework 
can be found in Chapter 3.  
The main contributions of this thesis are:  
(i) Initial proof-of-concept automated method that can be used to track accessibility 
problems through panorama images across time  
(ii) Development of a preliminary framework for processing and analyzing time 
series panoramas at scale.  
(iii) A geo-temporal dataset including different types of accessibility features for the 
task of object detection/image classification with respect to accessibility 
problems. 
  7 
 
Chapter 2: Background and Related Work 
The purpose of this chapter is to provide a review of studies that are most related to this 
thesis. We first review the computer vision aspect of urban tracking using satellite 
imagery, aerial imagery, street view imagery, and photos from the Internet (section 2.1). 
Next, we go over the studies on street-level accessibility, how cities invest in pedestrian 
infrastructures in terms of accessibility, and finally what semi-automated methods are 
currently being used to track urban accessibility (section 2.2). 
2.1 Tracking Urban Changes  
Change detection in urban areas has always been a challenging problem in computer 
vision. The goal of change detection is to identify the significant differences between the 
pixels of one image to the pixels of previous images, where all the images are referring to 
the same scene but taken at different times [12]. The difference is defined based on the 
application and the type of changes that are of interest (e.g., in urban studies, the targets 
are usually buildings, roads, street signs, and vegetation). Change in weather and lighting 
conditions, the structure of the scene itself, followed by the variation of camera 
parameters in terms of viewpoint, resolution, and the distance between the camera and 
the scene, all together make the change detection problem very challenging. As a result, 
there is no solid or unique recipe for addressing the problem, but on the bright side, 
narrowing down the problem into smaller sets can help to achieve an optimal solution. To 
better categorize the related work, with respect to image data, we break down the change 
detection problem into two categories: stationary cameras, and non-stationary cameras.  
  8 
 
Stationary Cameras. In this case, the sequence of images of a same scene, over time, 
is captured from a fixed viewpoint, which means that the acquired images are more or 
less aligned. Therefore, the challenges are mainly related to the illumination and/or 
geometric changes of wanted objects/regions of interest in the images. Jacobs et al. 
proposed a method for understanding the changes in the time-lapse sequences of static 
outdoor webcams (AMOS dataset) in terms of illumination such as the time of the day, or 
the weather condition [13]. Other methods to address change detection in videos with a 
stationary cameras, are probabilistic models, where pixels are modeled as a Gaussian 
mixture model, and are adapted to slow variations of object’s position [14], [15]. A more 
detailed explanation about change detection can be found in [12]. On the application side, 
time-lapse photography was used to examine the spatio-temporal of snow cover of a 
particular area, where the camera was fixated [16]. 
Non-Stationary Cameras. The goal, in this case, is to reason about the temporal 
changes that are taken place in the same scene, but have been captured either with 
different cameras, or via vehicle-mounted cameras. Typically, in urban planning and 
related fields, high resolution satellite imagery and remote sensing technologies are used 
to track changes in urban areas with respect to land coverage, congestion, transportation, 
and infrastructures over time [17]. For example, the remote sensing data can be used to 
evaluate the traffic pattern of crowded locations, or the conditions of roads. Traditionally, 
detecting changes in the pattern of urban environments is done by human observation, 
which is time consuming, expensive, and with high error rate. To automate the change 
detection procedure, Pacifici et al. proposed a Neural-nets method for high-resolution 
imagery (Figure 2-1) [18]. Temporal tracking (or change detection in this context) using 
  9 
 
satellite image data brings its own challenges (e.g., atmospheric conditions, satellite 
sensor angles, and sensor noise). A broad body of work has been done to overcome these 
challenges, which is beyond the scope of this thesis, but interested reader is encouraged 
to read [19]–[21]. Regarding aerial images, a study use images of a same scene over time, 
by placing cameras at arbitrary but known positions, and treated the change detection 
problem as a probabilistic three dimensional (3D) voxel model, where new images are 
compared with old images at voxel-level, and get updated accordingly [22].  
In regards to ground-based images acquired by non-stationary cameras, because the 
images are taken from different perspectives, many studies have attempted to first align 
the images before going through the change detection process. This alignment process is 
called image registration, and when the images are almost planar, Scale-Invariant Feature 
Transform (SIFT) feature matching [23] followed by a homography would suffice. 
Nonetheless, when the images are parallax (e.g., variation in depth in the image), SIFT 
 
Figure 2-1: Change detection on satellite imagery for an area by comparing two consecutive years. The changes 
are shown in different colors, where each color represents the transition between two states. For example, the 
orange represents the transition from vegetation to soil.   
  10 
 
features are not sufficient for the alignment, since they are invariant to affine deformation 
and to drastic changes in viewpoint, which are typical in temporal images of built 
environments. Hence, if the number of snapshots of the same scene is relatively large at 
each time stamp, a 3D reconstruction of a set of images using Structure from Motion 
(SfM) techniques [24] would be employed to align the images with each other along the 
time axis. Most previous work [25]–[33] follow the image registration step, but the 
change detection stage, along with the data collection procedure distinguish them from 
one another. 
Posterior to the image registration, using SfM, the next step is to reason about the 
temporal changes (types of changes differ with respect to their task). When the historical 
photos are available but are undated, reconstructing a 3D probabilistic temporal model 
from the images, and reasoning about the visibility of the points in the 3D domain, has 
shown to help in determining the temporal ordering of the images [29], [33]. Further, to 
create a smooth time-lapse video from Internet photos, Martin-Brualla et al. computed a 
global depthmap of the input images, warped them according to one virtual camera, and 
applied a temporal regularization on the output [25]. Similarly, Matzen et al. proposed a 
method to create a time-lapse sequence of temporal changes in planar structures (textural 
changes) of cities such as billboards, and street arts, by reasoning about the point clouds 
in terms of space and time [28]. To detect the tempo-structural changes in urban scenes, 
Sakurada et al. and Taneja et al. used videos taken from a vehicle-mounted camera 
within a period of time, and transformed the data into 3D domain [27], [31]. By warping 
the recent 3D model into the previous one via reprojection, the changes in appearance 
  11 
 
then revealed. A similar approach was applied on Google StreetView panoramas, where 
the cadastral models were available [30].  
Recent studies [26], [32] proposed De-Convolutional Neural Networks (deConvNets) 
and Convolutional Neural Networks (ConvNets) for detecting changes in urban scenes, 
respectively, where the data was collected using vehicle-mounted camera with additional 
information. For example, in [26], the street view videos are used to detect the structural 
changes in urban areas using deConvNests (Figure 2-2).  A summary of data collection 
and change detection methods of previous studies is provided in Table 2-1. 
 We now highlight the differences between our work and previous ones. First, our 
dataset is limited to Google Street View, with no additional information provided, other 
than the location of the images. Second, since there is no API for the “time machine” 
feature in Google Street View, the data is collected manually, which means that the data 
is limited. Finally, in this thesis, we specifically focus on the accessibility features of 
urban scenes at street-level (e.g., missing curb-ramps, surface problems), which means 
both the structural and the textural changes are important, and we need methods that can 
handle both types of changes.  
 
 
 
Figure 2-2: Structural change detection using De-convolutional Neural Networks. 
  12 
 
Paper Data Collection 
Method 
Change Detection 
Technique 
Schindler et al., 2007 Historical images Visibility of feature points 
in the 3D domain 
Schindler et al., 2010 Undated historical photos 
from 19th-20th centuries 
Visibility of feature points 
in 3D domain 
Taneja et al., 2011 Vehicle-mounted camera Warping the recent 3D 
model onto previous ones 
Taneja et al., 2013 Google Street View 
panoramas + cadastral 
model view 
Warping the recent 3D 
model onto previous ones 
Sakurada et al., 2013 Vehicle-mounted camera Probabilistic model at 
pixel-level  
Matzen et al., 2014 Geo-tagged photos from the 
Internet 
Clustering 3D point clouds 
with respect to space & 
time 
Martin-Brualla et al., 2015 Geo-tagged photos from the 
Internet 
One general depthmap       
+ temporal regularization at 
pixel-level 
Sakurada et al., 2015 Vehicle-mounted camera  
+ GPS sensor 
CNN features + super-pixel 
segmentation 
Alcantarilla et al., 2016 Vehicle-mounted camera 
(videos) 
De-convolutional Neural 
Networks 
Table 2-1: Summary of data collection and change detection techniques using non-stationary cameras. 
2.2 Tracking Sidewalk Accessibility  
The notion of accessibility in urban areas can be interpreted in two ways: (1) being 
locally close to opportunities such as jobs, health, and education, and (2) building urban 
infrastructures (e.g., sidewalks) that follow the “universal design” principles, where 
universal design, in this context, refers to infrastructures that can be used/crossed by as 
many people as possible, including people with disabilities [34], [35]. Unfortunately, the 
latter interpretation is overlooked in most cities, in which the lack of accessibility at 
street-level (e.g., poorly conditioned sidewalks, or absence of curb-ramps at intersections) 
  13 
 
has brought and continues to bring significant challenges to the lives of people with 
mobility impairments. Inaccessible sidewalks vastly has impacted the lives of 30.6 
million individuals with physical disabilities across the US [36]. Despite civil rights 
legislation for American with disabilities, ironically, inaccessibility at street-level still 
exists and has ostracized people with mobility impairments from the society. Missing 
curb-ramps at intersections, narrow or uneven sidewalks, existence of utility poles on 
sidewalks, and having poorly conditioned sidewalks or no sidewalk at all, reflect only a 
small fraction of the barriers people with mobility impairments face during navigation. 
Several lines of research in accessibility and urban planning have been dedicated to not 
only understand the severity of the problem but to improve the accessibility of sidewalks, 
accordingly [37]–[40]. In order to grasp the difficulties that people with mobility 
impairments face, when navigating the city, a considerable amount of surveys, 
interviews, and street audits have been conducted. For instance, Brookfield et al. have 
conducted a study with older adults to see how they would choose a route based on their 
physical condition via Google Street View [41].  
 Recently, Hara et al., by combining computer vision techniques and crowdsourcing, 
proposed a semi-automated mechanism for identifying accessibility problems in cities, 
remotely by using Google Street View [8]. Similarly, Prandi et al. developed a system for 
mobile phones that suggests accessible paths to the user, using the data collected by 
crowdsourcing, geo-referenced social websites [42].  
The dearth of interactive tools for obtaining information about the accessible areas 
within urban environments exacerbates the situation for individuals with mobility 
impairments; otherwise, they would have been prepared for upcoming challenges on the 
  14 
 
route, prior to their trip. The most recent attempt to address this issue is the Project 
Sidewalk, in which people around the globe can remotely contribute in identifying the 
accessibility features within cities via Google Street View [7]. 
These types of mechanisms are suitable for identifying the most likely accessible 
route, or detecting current accessibility problems within the city that require 
repairing/updating. Identifying accessibility features in cities are as important as tracking 
their temporal changes. Despite the major advances in computer vision and urban 
planning, temporal tracking of accessibility features (e.g., curb-ramps) in urban areas has 
received little to no attention, and to our knowledge, we are the first to address this issue. 
 
 
  15 
 
Chapter 3: Methodology 
In this chapter we present our proposed approach for temporal tracking accessibility 
problems in urban environments. First we describe the dataset, the procedures for data 
collection along with its limitations. Then, we explicitly define our approach for tackling 
the problem of temporal tracking accessibility problems in urban environments. 
3.1 Dataset 
To track the evolution of built environment with respect to accessibility problems we 
took advantage of Google StreetView new feature, “Time machine”. To this date, the 
available images to view cover the period of 2007 to 2015, including arbitrary gaps 
between the dates. For instance, for some locations the available images are 2007, 2009, 
2011, 2012, and 2014.  Note that the process of updating urban scenes in GSV does not 
equally take place among all locations, meaning that the number of available temporal 
images differs per location (Figure 3-1). 
 
One temporal image Nine temporal images 
Figure 3-1: Difference in the number of temporal images available in GSV for various locations. 
  16 
 
Perceiving the accessibility problems, especially in images, is subjective, which 
causes the data collection procedure to be ambiguous. To control the ambiguity of 
accessibility problems detection, we employed the guidelines of US Department of 
Transportation [43], the US Access Board [44], and followed the definition indicated in 
[9]. Accordingly, we categorized the accessibility features at street-level of urban areas 
into 5 main categories, as listed below: 
1. Missing curb-ramps (including narrow and poorly conditioned curb-ramps) 
2. Objects in path 
3. Surface problems (i.e., narrow/uneven/poorly conditioned sidewalks) 
4. Accessible sidewalks  
5. Accessible curb-ramps  
3.1.1 Data Collection 
The procedure of data collection was done manually, since Google has not yet offered an 
API for its “time machine” feature. We chose Washington, DC, and the state of Maryland 
as our primary source of data because of our first-person knowledge of those areas and 
their use in our previous work [8], [9].We collected temporal images of built 
environments by randomly walking through the streets of Washington, DC, and the state 
of Maryland, using Google StreetView, and by taking advantage of “Project Sidewalk” 
crowdsourced data to locate the areas containing accessibility problems [7]. To this aim, 
we randomly dropped the pegman of the Google maps on a random street, and started 
walking from there. From our experience in the data collection phase, the probability of 
encountering accessibility problems is higher in relatively poor neighborhoods. In 
Washington, DC, for instance, as we moved towards southeast and northeast areas, the 
  17 
 
number of accessibility problems increased. To randomize and diversify our data, we 
took screenshots of locations based on the following rules: 
• If a location contained accessibility problems, but over time the accessibility 
problems are resolved (Figure 3-2a).  
• If a location still contained accessibility problems (Figure 3-2b). 
• If a location did not contain accessibility problems within the available time frame 
(Figure 3-2c). 
• If a location contained accessibility problems and occlusion sometime within the 
available time frame (Figure 3-2d).  
 
 
Figure 3-2: Examples of locations with various transitions in terms of accessibility features: (a) the missing curb-
ramps changed to accessible curb-ramps, (b) the missing curb-ramps still exist, (c) the sidewalk remains 
accessible, (d) vehicles obstruct the view of missing curb-ramps in 2009-07.    
(a) (b) (c) (d) 
  18 
 
For each location, the screenshots were captured throughout the entire available time 
frame, along with their metadata. The metadata for each location is a JSON file 
containing the address, the GPS coordinates, URL, number of temporal snapshots, 
camera’s yaw/pitch/field-of-view information followed by the date (year-month) for each 
snapshot. For locations with multiple accessibility problems, all accessibility problems 
were indicated in the their metadata separated by comma. Note that if only one of the 
snapshots from a same location contained an accessibility problem, we still treated the 
 
Figure 3-3: A view of JSON file for a location with physical address as: 3052 Douglas St NE, 
Washington, DC. 
  19 
 
location as being inaccessible with respect to the identified accessibility problem. A 
sample of JSON file is illustrated in Figure 3-3. Our geo-temporal tagged data will be 
available for public download.  
We have collected 376 locations total, in which 90% of them contain accessibility 
problems. We mostly covered the DC area (88%) because of its diverse neighborhoods. 
The total number of all images regardless of location is 1633 (Table 3-1). As mentioned 
previously, the number of available temporal images is different per location, but the 
average number of available snapshots in our dataset is 4, meaning that most locations 
contain 4 available temporal images (Table 3-2).  
To understand the quantity of each accessibility feature within our collected data, we 
used two metrics: per location, and per image. For a certain location, if an accessibility 
feature exists at least in one of the available temporal snapshots of that location, we 
consider the location as having that accessibility feature. For per image, we discard the 
locations, and calculate the existence of accessibility features within all images in the 
dataset (1633 images). In other words, for a given image, regardless of its location, we 
calculated how many accessibility features it contains. Using Matlab image labeling tool 
[45], we manually labeled/annotated the accessibility features for our ground truth, and 
computed the total number of each accessibility features category on all images (Table 3-
3). 
 Overall DC MD 
# Locations 376 332 44 
# Images 1633 1464 169 
# Locations with 
accessibility problems 
341 296 45 
Table 3-1: Distribution of dataset 
  20 
 
 
Avg # of snapshots STD Median Min # of snapshots Max # of snapshots 
4 1 4 1 9 
Table 3-2: Overall distribution of snapshots per location 
 
Category Per Location Per Image 
Missing curb-ramps 52 231 
Objects in path 96 449 
Surface problems 123 374 
Accessible sidewalk 35 285 
Accessible curb-ramp 70 267 
Table 3-3: A breakdown of labels with respect to location and image. Per location refers to existence of 
accessibility features on at least one of the snapshots, and per image represent the existence of accessibility 
features on all images, regardless of their locations and time. 
3.1.2 Dataset Limitations 
The primary application of Google StreetView, and similar street view online tools are to 
provide remote navigation tools; hence using such datasets for tracking the temporal 
changes of built environments brings their own challenges and limitations to the table. In 
general, the common challenges of street view images (e.g., GSV images) are variation in 
illumination (due to weather and lighting conditions), and camera pose among temporal 
images of street view cars (i.e., images are not being captured from the same distance, or 
from the same spots). As a result, the temporal snapshots of a same scene are not aligned, 
which would exacerbate the temporal tracking problem.   
Furthermore, the distance between the panorama images captured by Google Street 
cars is not consistent, and varies based on the location and time. Therefore, the regions of 
interest (ROI), in this case accessibility problems, might not be visible from the exact 
same geo-location for all images across time. In this case, the images are captured from 
nearest available spot.  
  21 
 
Finally, due to the manual procedure of data collection, and since the regions of 
interest are accessibility problems; the number of available images is limited. This 
limitation directly affects the categories such that the number of one type might be 
comparatively different than other types. For instance, although the number of “missing 
curb-ramps” problem per image is relatively high (N=231), it is not comparable with the 
number of “objects in path”(N=449), or “surface problem” (N=374), which leads the 
dataset to be imbalanced. In the next section, methods for handling the abovementioned 
dataset limitations are discussed in details.  
3.2 Our Approach 
Given multiple snapshots of a same location over time, the goal is to automatically 
identify and label accessibility features in all snapshots, based on the labeled accessibility 
features in the most recent snapshot. More simply, if we label an accessibility problem 
(e.g., a power pole on the middle of the sidewalk) on the most recent dated GSV image of 
a certain location, the goal is to see the evolution of the specified accessibility problem, 
in this case back propagating, whether the power pole existed on that sidewalk, or it has 
been installed recently.  Our framework consists of 3 stages (Figure 3-4):  
Stage 1.For each location, the most recent image is sent as an input to the framework. 
The accessibility feature is manually labeled via Matlab image labeling tools, and the 
resulting patch is sent to stage 2.  
Stage 2. The category classification determines the category of the patch, and sends 
the result to the next stage. 
  22 
 
 
3.2.1 ROI labeling 
In order to track the temporal changes of accessibility features in one location, we 
manually label the area containing the accessibility feature on the most recent image of 
that location. The reason behind choosing the most recent image rather than the earliest 
image is that the oldest image often has poorer quality in terms of resolution, and lighting 
(Figure 3-5). Also, the number of temporal snapshots varies for each location, therefore, 
the most recent image was chosen for labeling the ROI, and sending the ROI patch to the 
next stage to be classified.  
Stage 3. Based on the result of category classification, the trained object detector 
examines all previous images of the same location to localize the specified 
accessibility feature within each image.  
 
 
Figure 3-4: The stages in our framework: (stage1) the accessibility problem (e.g., “object in path”) is 
manually labeled, (stage 2) the category classifier classifies the labeled area, and (stage 3) the object detector 
localizes the accessibility problem in all snapshots over time of that location. 
 
Stage 1 Stage 2 Stage 3 
Results 
  23 
 
3.2.2 Category Classification 
In category classification stage, the goal is to find local interest points (i.e., keypoints) in 
images that could distinguish the accessibility features from one another. Keypoints refer 
to geometrical or textural features that are unique to the accessibility problem’s general 
shape or appearance. For instance, most accessible curb-ramps have trapezius shape, 
which can discern them from other accessibility problems in urban areas. However, this 
is not the case for missing curb-ramps, such that they are not geometrically discernable, 
and might be mistaken as surface problems.  
Furthermore, due to variation in illumination (e.g., lighting, weather condition) as 
well as variation in street view camera pose, make the classification even harder.   
Recently, Convolutional Neural Networks (CNNs) have become the dominant 
approach for image classification [46]. However, a massive amount of training data is 
required to avoid over-fitting [47], [48]. Our dataset, on the contrary, is not large enough 
to train CNNs for category classification.  
With all this in mind, we have used a well-known technique for categorization, called 
“bag-of-visual-words” (BoVW) [49], which is derived from the “bag-of-words” method 
that is used in natural language processing for information retrieval. The idea behind the 
 
Figure 3-5: Comparing the quality of the earliest image (left) to the most image (right) of a location. 
  24 
 
BoVW method is to create a vector of most frequent local features that represent each 
category. Although, the BoVW method does not depend on the spatial information of the 
ROI and can be used on the entire image, we used ROI patches (i.e., only the accessibility 
features), because of the similarities between the elements in built environments. In other 
words, the visual information extracted from the entire image is a mixture of ROI and 
background, which affects on the real appearance of the ROI. Therefore, we have used 
the ROI patches of each category as our dataset (Figure 3-6).  
 
The BoVW method works as follows:  
Feature extraction. The local features that are repeatable and invariant to image 
transformations (i.e., translation, rotation, affine deformation, and scaling) are extracted 
from the image patches in the training set, and formed feature vectors (e.g., SIFT 
descriptors [23]).  
Clustering. The vectors of extracted local features are then mapped into the nearest 
cluster centers that contain similar features using k-means clustering algorithm [50]. Each 
cluster center represents a visual word vocabulary.  
Visual BoW histograms. The frequencies of occurrence of visual words are mapped 
to vectors (i.e., histograms) reflecting the categories.  
Figure 3-6: The ROI patches that are used in training the category classification. 
Missing curb-ramps 
Objects in path 
Surface problems 
Accessible curb-ramps Accessible sidewalks 
 
  25 
 
To train accessibility feature categorizer (5 categories), 5 Support Vector Machines 
(SVMs) [51] were trained, where each SVM distinguishes one category from the rest.   
The BoVW method does not localize the ROI on an image. Therefore, the next step 
towards reaching our goal is to do localization, which can be done using object detection 
algorithms.  
3.2.3 Localization 
In localization, the aim is to identify and detect the ROI (in this case an accessibility 
problem) within the image. The state-of-the-art object detection algorithms use a 
bounding box to scan the entire image and search for keypoints that are similar to the 
ROI. Here, we used Viola-Jones algorithm for object detection (a.k.a., cascade object 
detector) [52], which is based on boosting [53].  
When it comes to detecting accessibility problems at street-level in built 
environments, scanning the entire image is redundant, since accessibility problems are 
located at ground-level. This is not true for all GSV images, due to the variation in 
camera pose, and the street view car position. On the other hand, since the goal of this 
thesis is to label the accessibility features on images of a same location over time, the 
approximate location of the accessibility problem remains the same within all temporal 
images (Figure 3-7). Therefore, we can reduce the search area for object detector based 
 
Figure 3-7: Consistency of the location of accessibility features (e.g., accessible curb-ramps) within all images 
over time of a same scene.  
  26 
 
on the location of labeled area in the most recent dated image. This not only helps the 
object detector to detect and localize the accessibility feature faster and more accurately, 
but also reduces the number of false positives (i.e., falsely labeling an area on the image 
that does not contain the specified accessibility problem).  
To train our cascade detector, for each accessibility feature category, we provided a 
large set of negative examples (i.e., snapshots of urban areas that do not contain the 
targeted accessibility problem), along with set of positive examples with the accessibility 
features labeled in each image. The number of negative examples is roughly twice the 
number of positive examples. The accessibility features are labeled manually in positive 
examples using Matlab image labeling toolbox [45]. In the training phase, the HOG 
(Histograms of Orientated Gradients) features of input images are selected, and are sent 
to the cascade classifier. The cascade classifier is a set of stages, where at each stage an 
ensemble of weak classifiers is trained to be a highly accurate one using the information 
from its previous stage. The number of stages depends on the size of dataset. For our 
 
 
Figure 3-8: The overall procedure of Cascade object detector. 
 
Positive examples 
Negative examples 
Training the object 
detector 
  27 
 
limited dataset, we have tested several stages to train the cascade detector, and found that 
13-15 stages reduce the false positive rate (i.e., the percentage of labeled areas that do not 
contain the specified accessibility features). The procedure of cascade object detector is 
shown in Figure 3-8.  
In addition, accessibility features differ from one another in terms of aspect ratio 
(Figure 3-9). However, they maintain their aspect ratio within their category, meaning 
that “objects in path” aspect ratio remains approximately the same for majority of 
examples in the “objects in path” category. This increases the chance of detecting each 
category correctly.  
Note that the cascade object detector has to be trained on all five categories of 
accessibility features, in order to localize them. Therefore, we trained five object 
detectors for five accessibility features. Finally, localizing the specified accessibility 
features on previous snapshots is not sufficient for tracking their changes over time. As a 
result, for each location, if the specified accessibility features are not detected, then object 
detectors for other accessibility features will be scanning the snapshots to see whether the 
 
Figure 3-9: Consistency of aspect ratio within the examples of “Objects in path” category. 
  28 
 
reason for the failed localization was due to the transition of the accessibility features or 
the object detector’s fault.   
 Finally to reduce the number of bounding boxes predicted, we removed the 
overlapping detected windows by averaging the overlapped regions between the 
windows, and comparing them with a threshold.  
3.2.4 Handling Occlusion   
Vehicles and people are parts of urban environments; therefore, they exist in street view 
images, and might obstruct the regions of interest (i.e., accessibility problems). A car 
parked in front of a sidewalk, where the sidewalk is the potential ROI, is a simple 
example of occlusion. In GSV imagery data, the amount of data for a scene that contains 
occlusion is sparse. If we consider the street view car’s movement, for each sidewalk 
there are three snapshots available (Figure 3-10): before the street view car arrives (right 
view), when its in front of the sidewalk (frontal view), and after it passes the sidewalk 
(left view). Since, the goal in here is tracking the changes of accessibility problems over 
time, information from all temporal images is essential, and therefore, handling occlusion 
is required.  
 
Left-view Front-view Right-view 
Figure 3-10: Occluded an area of sidewalk by a vehicle (highlighted in blue). The left-view, and the right-
view give enough information regarding the hidden area in front-view.  
  29 
 
For a snapshot with occlusion, we looked at the previous snapshots and the later 
snapshots. If the accessibility feature existed between the former and the next snapshots, 
the occluded snapshot is ignored (Figure 3-11); otherwise, we looked at the occluded 
snapshot from different viewpoints to see whether the accessibility feature has changed 
on the date the snapshot was captured (Figure 3-10).  
 
3.3 Summary 
In order to track the temporal changes of accessibility features in urban areas, we 
manually collected temporal snapshots of roughly 400 locations in Washington, DC, and 
the state of Maryland, using GSV. The locations in the dataset are selected according to 
two criteria: maintaining the randomness, and balancing dataset with respect to the 
number of images per category. To meet the first criteria, we chose random 
neighborhoods in DC, and started to inspect their accessibility features by walking 
through their streets via GSV. For the latter, we used Project Sidewalk’s crowdsourced 
data to locate certain accessibility features.  
 Our framework consists of three stages: (1) labeling the accessibility features in the 
most recent snapshot of a location, (2) classifying the labeled area as one of five 
accessibility feature categories, and (3) localizing the classified patch in all previous 
 
Figure 3-11: Tracking the missing curb-ramps in a location, where the curb is occluded in one of the snapshots 
(2009-07). However, the curb still exists in 2011-08, meaning that the occluded snapshot can be ignored.  
  30 
 
snapshots of that location. This process only considers one location at a time, but has the 
potential to support scalability.  
 
  31 
 
Chapter 4: Study Results 
In this chapter, first, we evaluate the performance of category classification by itself, and 
report the analysis of our framework using standard measures such as precision, recall, 
and F1- score. Next, we present the results for our framework tested on different 
locations. 
4.1 Category Classification 
To evaluate the overall performance of our category classifier, we used k-fold cross 
validation approach for k=5, with each fold consisting of 326 image patches of 
accessibility features.  The confusion matrix of the performance of the category 
classification is shown in Table 4-1. The diagonal cells of the confusion matrix refer to 
the percentage of correct classification for each class (predicted category = actual 
category), and the off-diagonal cells represent the misclassifications for each category 
(predicted category  actual category).  
 Missing 
curb-ramps 
Objects 
in path 
Surface 
problems 
Accessible 
sidewalk 
Accessible 
curb-ramps 
Missing curb-ramps 66.3% 1.6% 0.1%  12.6% 
Objects in path  97.3%   0.6% 
Surface problems 12.8%  81.4% 14.1% 3.3% 
Accessible sidewalk 11.7%  17.2% 85.7% 9.3% 
Accessible curb-ramps 10.2% 1.1% 0.3% 0.2% 74.2% 
Table 4-1: Category classification confusion matrix. Each cell indicates the percentage of images assigned to 
a predicted category (column) for each actual category (row). 
  32 
 
According to the confusion matrix, the performance of “missing curb-ramps” 
category is relatively poor, and the reason, besides the scarcity of the dataset, is due to the 
similarity between curb-ramps, missing-curb-ramps, the curbs of accessible sidewalks, or 
similarity between the surface of missing curb-ramps and surface problems on sidewalk. 
4.2 System Framework Evaluation 
To evaluate the performance of classifier and object detector combined (our framework), 
we randomly split our dataset (per locations) into 70% training set (N=264), and 30% test 
set (N=112). Since our framework works per location, we ran the framework for 112 
round (i.e., the size of the test set). The input of the framework at each round is the patch 
consisting of the accessibility feature, which is done manually before running the 
experiment. The results of the framework then stored separately for each location.  
To measure the correctness of our framework per location, we looked for the input 
patch in all previous snapshots of that location at feature-level. For each location, feature-
level refers to appearance of the specified accessibility feature (e.g., surface problem) 
within all previous snapshots of that location. Therefore, as long as the specified 
accessibility feature has been found within previous snapshots, we accept the result. If the 
location of the detected label was approximately close to the manually labeled region, we 
accept that as well. For instance, if the manually labeled region (input of the framework) 
was a “surface problem”, then the resulted labels on the previous snapshots might be 
covering other parts of the sidewalk with the same accessibility problem, but not at the 
specific region, which was manually labeled. In this case, we accept the results.  
  33 
 
 To better understand the overall performance of our framework regardless of 
location, we measured the correctness at feature-level for all images in the test set. Since 
our dataset was small, we used human perception to evaluate the correctness of labels in 
all images.  We measured the precision, recall, and F1-score based on the following 
equations: 
 
Where, true positive is defined as providing the correct label in the image, false 
positive is providing a label for a problem that does not exist in the image, and false 
negative is not providing a label for a problem that exist in the image. The performance 
for each category and overall performance are illustrated in Figure 4-1.   
The performance of our system depends on the two stages of classification and 
localization. If the input image patch is classified falsely, the localization of the false 
category on the previous images reduces the accuracy of the system. According to the 
performance graph, the “object in path” category has relatively a high accuracy, and the 
reason is that we treated every cylindered shape on sidewalk as “objects in path”, even if 
it is not obstructing the path of pedestrians. Also, temporal snapshots of a same location 
in GSV are not aligned. However, they are panorama images; hence, during the data 
collection phase, we changed the yaw/pitch/Field of View of the camera at each time to 
 
  34 
 
make the temporal snapshots aligned as much as it possible. Moreover, the search area 
for the object detector on each snapshot depends on the position of the manually labeled 
ROI (input patch). Thus, these results should be considered preliminary and likely 
represent the high-end of our framework’s performance—they are under ideal conditions 
with manual tuning.   
Furthermore, the object detector (Viola-Jones algorithm) has its own limitations. This 
algorithm works best on objects that do not have out-of-plane orientation, that’s why the 
performance of the “object in path” category is comparatively high, to “missing curb-
ramps” and “accessible curb-ramps” categories, since the orientation of curb-ramps 
differs among the street-view images.  
4.3 Results 
We have tested our framework on different locations from the dataset. The results are 
categorized into per location, and per image (Successful/failed). 
 
Figure 4-1: Test-time performance of our approach for each category and overall, with respect to Precision, 
Recall, and F1-score. 
  35 
 
4.3.1 Per Location Results 
(i) Location: 2417 Hamlin St NW, Washington, DC (Figure 4-2). 
The accessibility feature for this location is “Objects in path”, and the framework 
successfully tracked the accessibility feature in all previous snapshots.    
 
(ii) Location: 6307 Brookville Rd, Washington, DC (Figure 4-3). 
The accessibility feature for this location is “Accessible sidewalks”. According to the 
results, the sidewalk in the snapshot “2012-03” does not have surface problems, and it 
 
Figure 4-3: The most recent snapshot (top) has been manually labeled to specify the “Accessible sidewalk”, 
and the result is shown on all previous snapshots (bottom).  
 
Figure 4-2: Framework results for “Objects in path” category. The blue label refers to the manually labeled 
ROI on the most recent snapshot (input of the framework), and red labels are localized by the framework, 
referring to the existence of accessibility problems on all previous snapshots. 
  36 
 
was detected falsely, due to variation in illumination (false positive).  
 
(iii) Location: 6202 Broad Branch Rd NW, Washington, DC (Figure 4-4). 
The accessibility feature for this location is “Missing curb-ramps”.  The last snapshot 
(2014-05) is misclassified with “surface problem”. This can be due to the similarities 
between the two categories (curbs are visible in patches of narrow sidewalks). 
4.3.2 Per Image Results 
(i) Successful results: the results of the framework on different locations (discarding the 
time), in which the accessibility features have successfully been identified and localized 
(Figure 4-5).  
(ii) Failed results: the failure of the framework in either identifying the specified 
accessibility features correctly (category classification), or localizing them within the 
snapshots (localization). Also, since the framework looks for other accessibility features, 
 
Figure 4-4: top snapshot is manually labeled as “Missing curb-ramps”, and the four bottom snapshots 
are the result of framework. The yellow label in the last snapshot refers to misclassification of “Missing 
curb-ramps” and “Surface problems”.  
  37 
 
if the specified accessibility feature could not be identified/localized within the snapshots, 
other accessibility features might be identified and localized (Figure 4-6).   
 
Figure 4-5: Successful results of our framework. The green labels refer to accessible sidewalks and accessible 
curb-ramps. The red labels refer to accessibility problems.  
 
 
Figure 4-6: Failed results of our framework. The yellow labels refer to either misclassifying the accessibility 
features, or not identifying the specified accessibility features indicated by the red arrows.  
 
  38 
 
 
Chapter 5: Discussion and Future Work 
This thesis took the first exploratory step towards serving a greater goal of developing a 
scalable (semi)-automated method for temporal tracking of accessibility problems in built 
environments. Here, we summarize the main contributions of this work, along with 
limitations, and we end this thesis by providing insights for directions of future research.  
5.1 Conclusion and Limitations 
We have demonstrated an initial proof-of-concept automated method for tracking the 
accessibility problems in street view images over time. In this thesis, we took advantage 
of bag of visual words and cascade object detector to identify and localize the 
accessibility features within all snapshots of a given location. Our findings show that 
despite the challenges of street view images, they could be a valuable source for tracking 
the accessibility problems at street-level. Our framework based on each location, tracks 
the changes in accessibility features across time, depicting the fact that even temporal 
tracking accessibility features for one scene is a difficult task. The performance of our 
framework indicates that the nature of tracking accessibility features cannot be performed 
automatically, as analyzing the conditions of accessibility features requires human 
understanding, due to the structural and textural changes in accessibility features. 
However, by incorporating automated mechanism and crowdsourcing, the goal of 
scalability is achievable. At scale, temporal tracking accessibility features can inform us 
what areas in built environments the pedestrian infrastructures have been overlooked, and 
how long these features have not been maintained/updated. This information can also be 
  39 
 
used with additional information, such as population of residents and passersby at each 
region to decide financial decisions on allocating budget for renewing pedestrian 
infrastructures. Our current framework has the potential to support scalability, by 
bringing human in the loop for verification.  
Limitaions. In this thesis, the data was collected manually, which is labor intensive, 
and time consuming. Our small dataset (376 locations; 1633 total images) limited our 
choice of classification and object detection algorithms. Also, to simplify the problem, 
we made assumptions about the position, and the size of accessibility features within all 
temporal snapshots at each location. By limiting the search area for the object detector, 
and by manually aligning the GSV images before taking screenshots, we tried to meet the 
assumptions. That is a reason for the getting relatively high results. However, as we 
mentioned before, this thesis is an exploratory step towards achieving the optimal 
accuracy for temporal tracking accessibility features in built environments.  
Moreover, we did not evaluate our framework per location, because our current 
dataset is small, and imbalanced towards some categories of accessibility features (# 
objects in path > # missing curb-ramps); hence, when splitting the data to training and 
test sets, some categories never occurred in the test set.  
Another limitation is related to GSV images themselves such as poor quality of some 
images, especially in the foremost year, which reduced the performance of our 
framework.  
Furthermore, the input of our framework is done manually (i.e., labeling the ROI in 
the most recent snapshot of a given location), which is time consuming and does not 
support scalability. With enough training data, and more accurate classification methods, 
  40 
 
however, one could observe the temporal changes of more locations with respect to 
accessibility features. In addition, we used the most recent snapshots as our baseline of 
choosing accessibility features. Nonetheless, if the accessibility features have been 
maintained or completely transformed, labeling those features is not possible on the most 
recent snapshot. One possible solution is to demonstrate both the most recent and the 
foremost snapshots at the beginning.   
Finally, since the primary focus of this thesis is on accessibility features, the majority 
of locations in the dataset do not contain occlusion. Therefore, our approach for handling 
the occlusion is limited to our dataset.  
5.2 Directions Towards Future Research 
While, in this thesis, we captured the important role of street view images on the 
scalability of temporal tracking built environments, specifically accessibility features, 
there are still many unexplored paths that can be taken from this starting point. We list a 
few of future work in the following:  
Heatmap Visualization of Temporal Changes. Our semi-automated method 
currently captures only the changes of accessibility features at each location. Visualizing 
these temporal changes on a map, and using variation of color intensity to illustrate how 
accessibility features deteriorate/update over time, would be a way to capture the essence 
of the scalability in this research (Figure 5-1) 
Predicting future changes in terms of accessibility at street-level. By 
incorporating the current data with other available resources regarding the maintenance 
of pedestrian infrastructures, could be used to predict possible future changes in the 
condition of accessibility features. This could be a useful for urban planners, and 
  41 
 
government officials to understand how often these features require maintain/update 
before they create barriers for citizen.  
Time series labeling tool. The general theme of temporal tracking can be used to 
implement a tool that can automatically label a set of temporal images by only labeling 
one of the images, which could be useful in image labeling tasks.  
Handling occlusion. Although we discussed about handling occlusion in this thesis, 
but future work can take advantage of the bird’s eye view of Google StreetView [54], 
high-resolution satellite imagery, or aerial imagery to see the accessibility features from 
top-down view.  
 
 
Figure 5-1: Possible heatmap visualization. The red refers to the period in 
which the accessibility features have not been maintained, and the green 
refers to the accessibility features being recently updated. 
  42 
 
Bibliography 
[1] S. M. Wheeler, “The Evolution of Built Landscapes in Metropolitan Regions,” J. 
Plan. Educ. Res., vol. 27, no. 4, pp. 400–416, Jan. 2008. 
[2] S. A. Changnon and S. A. Changnon, “Inadvertent Weather Modification in Urban 
Areas: Lessons for Global Climate Change,” Bull. Am. Meteorol. Soc., vol. 73, no. 
5, pp. 619–627, May 1992. 
[3] J. G. Masek, F. E. Lindsay, and S. N. Goward, “Dynamics of urban growth in the 
Washington DC metropolitan area, 1973-1996, from Landsat observations,” Int. J. 
Remote Sens., vol. 21, no. 18, pp. 3473–3486, Jan. 2000. 
[4] I. M. Lid and P. K. Solvang, “(Dis)ability and the experience of accessibility in the 
urban environment,” ALTER - Eur. J. Disabil. Res. / Rev. Eur. Rech. sur le 
Handicap, vol. 10, no. 2, pp. 181–194, 2016. 
[5] D. Gamache, S.;Vincent, C.; Routhier, F.; James McFadyen, B.; Beauregard, L.; 
Fiset, “DEVELOPMENT OF A MEASURE OF ACCESSIBILITY TO URBAN 
INFRASTRUCTURES: A CONTENT VALIDITY STUDY,” Med. Res. Arch., 
vol. 4, no. 5, p. 603, 2016. 
[6] N. C. on Disability, “The impact of the American with disabilities act: asessing the 
progress toward achieving the goals of the ADA,” 2007. 
[7] Makeability Lab, “Project Sidewalk,https://sidewalk.umiacs.umd.edu/,” 2016. 
[Online]. Available: https://sidewalk.umiacs.umd.edu/. 
[8] K. Hara, J. Sun, R. Moore, D. Jacobs, and J. Froehlich, “Tohme,” in Proceedings 
of the 27th annual ACM symposium on User interface software and technology - 
UIST ’14, 2014, pp. 189–204. 
  43 
 
[9] K. Hara, V. Le, and J. Froehlich, “Combining crowdsourcing and google street 
view to identify street-level accessibility problems,” in Proceedings of the SIGCHI 
Conference on Human Factors in Computing Systems - CHI ’13, 2013, p. 631. 
[10] Https://googleblog.blogspot.com/2014/04/go-back-in-time-with-street-view.html, 
“Go back in time with Street View,” Google official blog, 2014. [Online]. 
Available: https://googleblog.blogspot.com/2014/04/go-back-in-time-with-street-
view.html. 
[11] Google, “Google Street View Coverage,” 2016. . 
[12] R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam, “Image change detection 
algorithms: a systematic survey,” IEEE Trans. Image Process., vol. 14, no. 3, pp. 
294–307, Mar. 2005. 
[13] N. Jacobs, N. Roman, and R. Pless, “Consistent Temporal Variations in Many 
Outdoor Scenes,” in 2007 IEEE Conference on Computer Vision and Pattern 
Recognition, 2007, pp. 1–6. 
[14] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-
time tracking,” in Proceedings. 1999 IEEE Computer Society Conference on 
Computer Vision and Pattern Recognition (Cat. No PR00149), 1999, pp. 246–252. 
[15] P. KaewTraKulPong and R. Bowden, “An Improved Adaptive Background 
Mixture Model for Real-time Tracking with Shadow Detection,” in Video-Based 
Surveillance Systems, Boston, MA: Springer US, 2002, pp. 135–144. 
[16] J. Parajka, P. Haas, R. Kirnbauer, J. Jansa, and G. Blöschl, “Potential of time-lapse 
photography of snow for hydrological purposes at the small catchment scale,” 
Hydrol. Process., vol. 26, no. 22, pp. 3327–3337, Oct. 2012. 
  44 
 
[17] J. R. Jensen and D. C. Cowen, “Remote Sensing of Urban/Suburban Infrastructure 
and Socio-Economic Attributes,” in The Map Reader, Chichester, UK: John Wiley 
& Sons, Ltd, 2011, pp. 153–163. 
[18] F. Pacifici, F. Del Frate, C. Solimini, and W. J. Emery, “An Innovative Neural-Net 
Method to Detect Temporal Changes in High-Resolution Optical Satellite 
Imagery,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 9, pp. 2940–2952, Sep. 
2007. 
[19] F. Pacifici, F. Del Frate, C. Solimini, and W. J. Emery, “An Innovative Neural-Net 
Method to Detect Temporal Changes in High-Resolution Optical Satellite 
Imagery,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 9, pp. 2940–2952, Sep. 
2007. 
[20] P. Du, S. Liu, P. Gamba, K. Tan, and J. Xia, “Fusion of Difference Images for 
Change Detection Over Urban Areas,” IEEE J. Sel. Top. Appl. Earth Obs. Remote 
Sens., vol. 5, no. 4, pp. 1076–1086, Aug. 2012. 
[21] S. C. van der Spek and C. M. van Langelaar, “USING GPS-TRACKING 
TECHNOLOGY FOR URBAN DESIGN INTERVENTIONS,” ISPRS - Int. Arch. 
Photogramm. Remote Sens. Spat. Inf. Sci., vol. XXXVIII-4/, pp. 41–44, Aug. 
2011. 
[22] T. Pollard and J. L. Mundy, “Change Detection in a 3-d World,” in 2007 IEEE 
Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–6. 
[23] D. G. Lowe, “Object recognition from local scale-invariant features,” in 
Proceedings of the Seventh IEEE International Conference on Computer Vision, 
1999, pp. 1150–1157 vol.2. 
  45 
 
[24] S. Agarwal et al., “Building Rome in a day,” Commun. ACM, vol. 54, no. 10, pp. 
105–112, Oct. 2011. 
[25] R. Martin-Brualla, D. Gallup, and S. M. Seitz, “Time-lapse mining from internet 
photos,” ACM Trans. Graph., vol. 34, no. 4, p. 62:1-62:8, Jul. 2015. 
[26] P. F. Alcantarilla, S. Stent, G. Ros, R. Arroyo, and R. Gherardi, “Street-View 
Change Detection with Deconvolutional Networks,” in Robotics: Science and 
Systems XII, 2016. 
[27] A. Taneja, L. Ballan, and M. Pollefeys, “Image based detection of geometric 
changes in urban environments,” in 2011 International Conference on Computer 
Vision, 2011, pp. 2336–2343. 
[28] K. Matzen and N. Snavely, “Scene Chronology BT  - Computer Vision – ECCV 
2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, 
Proceedings, Part VII,” D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. 
Cham: Springer International Publishing, 2014, pp. 615–630. 
[29] G. Schindler, F. Dellaert, and S. B. Kang, “Inferring Temporal Order of Images 
From 3D Structure,” in 2007 IEEE Conference on Computer Vision and Pattern 
Recognition, 2007, pp. 1–7. 
[30] A. Taneja, L. Ballan, and M. Pollefeys, “City-Scale Change Detection in Cadastral 
3D Models Using Images,” in 2013 IEEE Conference on Computer Vision and 
Pattern Recognition, 2013, pp. 113–120. 
[31] K. Sakurada, T. Okatani, and K. Deguchi, “Detecting Changes in 3D Structure of a 
Scene from Multi-view Images Captured by a Vehicle-Mounted Camera.” pp. 
137–144, 2013. 
  46 
 
[32] K. Sakurada and T. Okatani, “Change Detection from a Street Image Pair using 
CNN Features and Superpixel Segmentation,” in Procedings of the British 
Machine Vision Conference 2015, 2015, p. 61.1-61.12. 
[33] G. Schindler and F. Dellaert, “Probabilistic temporal inference on reconstructed 
3D scenes,” in 2010 IEEE Computer Society Conference on Computer Vision and 
Pattern Recognition, 2010, pp. 1410–1417. 
[34] L. Viegas, Jose; Martinez, “Urban Accessibility: perception, measurement and 
equitable provision.” . 
[35] J. Hanson, “The Inclusive City: delivering a more accessible urban environment 
through inclusive design.” 
[36] U.S. Census Bureau, “Americans with Disabilities: 2010 Household Economic 
studies,” 2012. 
[37] L. Beale, K. Field, D. Briggs, P. Picton, and H. Matthews, “Mapping for 
Wheelchair Users: Route Navigation in Urban Spaces,” Cartogr. J., vol. 43, no. 1, 
pp. 68–81, Mar. 2006. 
[38] R. D. F Bromley, D. L. Matthews, and C. J. Thomas, “City centre accessibility for 
wheelchair users: The consumer perspective and the planning implications,” 
Cities, vol. 24, no. 3, pp. 229–241, Jun. 2007. 
[39] D. B. Gray, M. Gould, and J. E. Bickenbach, “ENVIRONMENTAL BARRIERS 
AND DISABILITY,” J. Archit. Plann. Res., vol. 20, no. 1, pp. 29–37, 2003. 
[40] H. Matthews, L. Beale, P. Picton, and D. Briggs, “Modelling Access with GIS in 
Urban Systems (MAGUS): capturing the experiences of wheelchair users,” Area, 
vol. 35, no. 1, pp. 34–45, Mar. 2003. 
  47 
 
[41] K. Brookfield and S. Tilley, “Using Virtual Street Audits to Understand the 
Walkability of Older Adults’ Route Choices by Gender and Age,” Int. J. Environ. 
Res. Public Health, vol. 13, no. 12, p. 1061, Oct. 2016. 
[42] C. Prandi, P. Salomoni, and S. Mirri, “mPASS: Integrating people sensing and 
crowdsourcing to map urban accessibility,” in 2014 IEEE 11th Consumer 
Communications and Networking Conference (CCNC), 2014, pp. 591–595. 
[43] Transportation US Department, “Designing sidewalks and trails for access.” . 
[44] P. R.-O.-W. A. A. C. (PROWACC), “Street and sidewalk accessibility problems,” 
2007. [Online]. Available: https://www.access-board.gov/guidelines-and-
standards/streets-sidewalks/public-rights-of-way. 
[45] Mathworks, “Matlab image labeling tool.” . 
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with 
Deep Convolutional Neural Networks.” pp. 1097–1105, 2012. 
[47] L. Yao and J. Miller, “Tiny ImageNet Classification with Convolutional Neural 
Networks,” vision.stanford.edu. 
[48] Matlab, “Machine Learning Challenges.” 
[49] G. Csurka, G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual 
categorization with bags of keypoints,” Work. Stat. Learn. Comput. VISION, 
ECCV, pp. 1--22, 2004. 
[50] P. Berkhin, “A Survey of Clustering Data Mining Techniques,” in Grouping 
Multidimensional Data, Berlin/Heidelberg: Springer-Verlag, 2006, pp. 25–71. 
[51] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, 
pp. 273–297, Sep. 1995. 
  48 
 
[52] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple 
features,” in Proceedings of the 2001 IEEE Computer Society Conference on 
Computer Vision and Pattern Recognition. CVPR 2001, vol. 1, p. I-511-I-518. 
[53] P. M. B. Vitányi and R. E. Schapire, Computational learning theory : second 
European conference, EuroCOLT ’95, Barcelona, Spain, March 13-15, 1995 : 
proceedings. Springer, 1995. 
[54] Google, “Google bird’s eye view.” .