ABSTRACT

Title of dissertation: TECHNIQUES FOR VIDEO SURVEILLANCE: AUTOMATIC VIDEO EDITING AND TARGET TRACKING

Hazem El-Alfy, Doctor of Philosophy, 2009

Dissertation directed by: Professor Larry S. Davis, Department of Computer Science

Typical video surveillance control rooms include a collection of monitors connected to a large camera network, with many fewer operators than monitors. The cameras are usually cycled through the monitors, with provisions for manual override to display a camera of interest. In addition, cameras are often provided with pan, tilt and zoom capabilities to capture objects of interest. In this dissertation, we develop novel ways to control the limited resources by focusing them on acquiring and visualizing the critical information contained in the surveyed scenes.

First, we consider the problem of cropping surveillance videos. This process chooses a trajectory that a small sub-window can take through the video, selecting the most important parts of the video for display on a smaller monitor area. We model the information content of the video simply, by whether the image changes at each pixel. Then we show that we can find the globally optimal trajectory for a cropping window by using a shortest path algorithm. In practice, we can speed up this process without affecting the results, by stitching together trajectories computed over short intervals. This also reduces system latency. We then show that we can use a second shortest path formulation to find good cuts from one trajectory to another, improving coverage of interesting events in the video. We describe additional techniques to improve the quality and efficiency of the algorithm, and show results on surveillance videos.

Second, we turn our attention to the problem of tracking multiple agents moving amongst obstacles, using multiple cameras. Given an environment with obstacles, and many people moving through it, we construct a separate narrow field of view video for as many people as possible, by stitching together video segments from multiple cameras over time. We employ a novel approach to assign cameras to people as a function of time, with camera switches when needed. The problem is modeled as a bipartite graph and the solution corresponds to a maximum matching. As people move, the solution is efficiently updated by computing an augmenting path rather than by solving for a new matching. This reduces computation time by an order of magnitude. In addition, solving for the shortest augmenting path minimizes the number of camera switches at each update. When not all people can be covered by the available cameras, we cluster as many people as possible into small groups, then assign cameras to groups using a minimum cost matching algorithm. We test our method using numerous runs from different simulators.

Third, we relax the restriction of using fixed cameras in tracking agents. In particular, we study the problem of maintaining a good view of an agent moving amongst obstacles by a moving camera, possibly fixed to a pursuing robot. This is known as a two-player pursuit-evasion game. Using a mesh discretization of the environment, we develop an algorithm that determines, given initial positions of both pursuer and evader, whether the evader can take any moving strategy to go out of sight of the pursuer, and thus win the game. If it is decided that there is no winning strategy for the evader, we also compute a pursuer's trajectory that keeps the evader within sight, for every trajectory that the evader can take.
We study the effect of varying the mesh size on both the efficiency and accuracy of our algorithm.

Finally, we show some earlier work that has been done in the domain of anomaly detection. Based on modeling co-occurrence statistics of moving objects in time and space, experiments are described on synthetic data, in which time intervals and locations of unusual activity are identified.

TECHNIQUES FOR VIDEO SURVEILLANCE: AUTOMATIC VIDEO EDITING AND TARGET TRACKING

by Hazem Mohamed El-Alfy

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2009

Advisory Committee: Professor Larry S. Davis, Chair; Professor David W. Jacobs; Professor James Reggia; Professor Ramani Duraiswami; Professor Eyad Abed

© Copyright by Hazem El-Alfy 2009

Acknowledgments

I begin by thanking and praising Allah, the first Teacher to mankind. I cannot thank Him properly without acknowledging the many people who have contributed in making my career in graduate school a successful experience. In this regard, I would like to thank my advisor Professor Larry Davis for his readiness to supervise me at a tense moment during my graduate career. His patience and suggestions for interesting and challenging problems helped me overcome that period quickly. Along the same lines, I would like to acknowledge Professor David Jacobs for his willingness to join us in our research. His hints and comments have made this thesis possible. Additional thanks are due to Professors Eyad Abed, James Reggia and Ramani Duraiswami for agreeing to serve on my committee and for sparing enough time for this cause during this busy summer semester.

Mentioning everyone by name in the limited space I have here is impossible. Many of my colleagues, friends and other people I came to know have helped me in some way or the other. In the few remaining lines, I wish to acknowledge in particular the encouragement and prayers of my mother and my father. I also appreciate my wife for acting responsibly and patiently during times of harshness as well as times of ease. Finally, my children Omar and Mariam have been my great joys. They gave my life a different meaning.

Table of Contents

List of Figures
1 Introduction
1.1 Motivation
1.2 Dissertation Organization
1.3 Contributions
2 Automatic Video Editing
2.1 Introduction
2.2 Related Work
2.3 Problem definition
2.4 Video Cropping Approach
2.4.1 Extracting motion energy
2.4.2 Building the graph
2.4.3 Shortest path
2.4.4 Smoothing
2.4.5 "Wiping out" captured motion
2.4.6 Merging trajectories
2.4.7 Processing long videos
2.4.8 Video display
2.5 Experimental results
2.5.1 Splitting a long video into segments
2.5.2 Overlap in video segments
2.6 Conclusions
3 Multi-Camera Management in Surveillance Applications
3.1 Introduction
3.1.1 Related Work
3.1.2 Contributions
3.2 Problem Definition
3.3 Bipartite Matching
3.3.1 Visibility Polygons
3.3.2 Graph Modeling
3.3.3 Initial Matching
3.3.4 Matching Update
3.4 Minimum Cost Bipartite Matching
3.4.1 People Grouping and Graph Modeling
3.4.2 Edge Weight Function
3.4.3 Matching Update
3.4.3.1 Change in weights only, same graph topology
3.4.3.2 Change in graph topology
3.4.3.3 Time requirements
3.5 Implementation and Results
3.6 Conclusion
4 A Two-Player Pursuit-Evasion Game
4.1 Introduction
4.1.1 Related Work
4.1.2 Contributions
4.2 Problem Definition
4.2.1 Environment Layout and Rules of the Game
4.2.2 Space Discretization
4.3 Motivating Examples
4.3.1 A "Level-0" Game
4.3.2 A "Level-1" Game
4.4 Game Result and Tracking Strategy
4.4.1 Algorithm
4.4.2 Proof of Correctness
4.4.3 Space and Time Complexity
4.5 Additional Work
5 Detecting Unusual Activity in Surveillance Video
5.1 Introduction
5.2 Literature review
5.2.1 Definition of "unusual events"
5.2.2 Detecting unusual activity
5.3 Problem definition
5.4 Our Approach
5.4.1 Feature selection
5.4.2 Spatial tessellation
5.4.3 Co-occurrence matrix
5.4.4 Detecting unusual events
5.5 Results
5.5.1 Dataset
5.5.2 System training
5.5.3 Unusual event detection
5.6 Proposed Extensions
5.6.1 Issues with real video
A Image Cropping Through Learning
A.1 Problem definition
A.2 Learning through SVR
A.3 Experiments
Bibliography

List of Figures

1.1 Typical design of a control room
1.2 Diagram of the proposed future control room
2.1 Overview of our approach to crop a video segment.
2.2 Graph modeling the video.
2.3 Using window energy density as a measure of motion energy favors smaller windows, cropping away small parts of objects, such as heads in humans (left: original frame, showing the location of the cropping window; right: cropped frame).
2.4 Cropping window configuration, with surrounding belt area. The numbers represent the labels of the inner (cropping) window's corners.
2.5 Piecewise cubic Hermite interpolation performs better than cubic spline when the original data is staircase.
2.6 Example of merging three window trajectories. The horizontal axis represents time; the thick lines are for the segments of the trajectories picked at that time; and the dashed lines are for times where jumps occur.
2.7 Four typical frames from an airport surveillance video, with three trajectories and fixed window size. In each frame, the original video frame, showing the locations of the cropping windows, is to the left and the cropped frame is to the right, resized to the original's height.
2.8 Two typical frames from a traffic intersection surveillance video, with two trajectories and fixed window size. In each frame, the original video frame, showing the locations of the cropping windows, is to the left and the cropped frame is to the right, resized to the original's height.
2.9 Variable size single trajectory results. Compared to the fixed size results, less empty area is retained around small cropped objects.
2.10 Video-in-video results: the cropped video location is selected automatically in the boundary region of overall least activity.
2.11 Multi-camera display issues should be addressed in later work. Original video to the left; cropped video to the right.
2.12 Effect of splitting a video at a time of little activity (frames 6800-7000). The vertical axis represents the x-coordinate of the center of the cropping window [0..320], along with the window energy function that has been normalized to an integer range of [0..300] to display properly on the plot.
2.13 Shortest paths of video prefixes.
2.14 Splitting a video into 5 segments.
2.15 A cropped frame resulting from panning across a largely empty scene (left: original frame, showing the location of the cropping window; right: cropped frame).
3.1 Problem setup in 2D. People's trajectories are dotted and visibility polygons are in blended colors. The obstacles are shaded in black.
3.2 Computing the visibility polygon: rays r2 and r4 are tangent to obstacles, while r3 and r5 intersect them.
3.3 Computing occlusions of subjects by one another.
3.4 An example of the problem model with a perfect matching.
3.5 A bipartite graph update that involves the two cases with edge removal. Matched edges are thickened. In the first update, a non-matched edge v1u1 is removed, so nothing needs to be done. In the second, the matched edge v2u2 is removed. The alternating path v2u4v3u2 (red) matches v2u4, unmatches v3u4 and matches v3u2.
3.6 A bipartite graph update that involves the addition of an edge to an incomplete matching. Matched edges are thickened. Initially, the maximum possible matching is incomplete. Later, an edge v4u1 is added. This permits completing the matching using the alternating path v4u1v3u2.
3.7 The approach used to cluster people. The cardinalities of the intersections are marked to the left and the actual clusters to the right.
3.8 Two example setups used in testing our method. Left: inside a bank, generated using our simulator (overlapping field-of-regard cameras); Right: a grid of obstacles, generated using van den Berg's simulator. The linear segments are people's trajectories.
3.9 Mean coverage as the group size (diameter) varies. Top: bank scenario from our simulator. Bottom: regular grid of obstacles from van den Berg's simulator.
3.10 Effect of adding a time penalty on the standard deviation (= √var) of the coverage. Top: bank scenario (our simulator) with a group size equal to 5 people. Bottom: regular grid of obstacles (van den Berg's simulator) without grouping.
3.11 Matching times for various graph sizes (van den Berg's simulator).
3.12 Snapshot of our simulator: 2D top view with visibility regions along with 3D zoomed-in videos for the first 3 subjects.
4.1 A simplified layout that shows polygonal obstacles, players (points) and the environment partition with a grid.
4.2 A Euclidean-distance neighborhood of radius 2 centered around the black square.
4.3 A Manhattan-distance neighborhood of radius 2 centered around the black square.
4.4 Initial winning positions for a level-0 game.
4.5 Initial losing positions for a level-0 game.
4.6 Computing the labeling function Bad(N(p), e, 0) for the neighborhood of a pursuer located at position p, where p' and p'' are the only possible neighbors for p due to the obstacle.
4.7 Extending Bad(N(p), e, 0) into Bad(p, e, 1). Steady state is reached at that point.
4.8 An example of a level-1 game where the pursuer loses in 2 steps.
4.9 Computing the labeling function Bad(N(p), e, 0) for the neighborhood of a pursuer located at position p, where p' and p'' are the only possible neighbors for p due to the obstacle.
4.10 Extending Bad(N(p), e, 0) into Bad(p, e, 1).
4.11 Computing the labeling function Bad(N(p), e, 1) out of the maps Bad(pi, e, 1) for neighbors of position p.
4.12 Extending Bad(N(p), e, 1) into Bad(p, e, 2). Steady state is reached at that point.
5.1 Introductory example that shows the type of anomalies we are interested in. The first two events are usual, but the third is not. Although it consists of a combination of exactly the two left events, it is their simultaneous occurrence that makes it unusual.
5.2 The small (yellow) filled circles represent humans. A single person using the trash bin (a) or two people meeting in the building's lobby (b) are considered usual situations. On the other hand, people meeting next to the trash bin (c) is an unusual situation.
5.3 Typical snapshot of our 4-way stop sign simulator. Arrows indicate the driving directions and squares represent vehicles. Cars are not strictly aligned due to the simulated noise.
5.4 Typical distribution functions for a 16-bin grid. The pair of bins with the shown distribution in (a) observe relatively high joint activity, while (b) is for a pair of bins between which only one type of activity is present. Both plots are for the probability versus the event value index.
5.5 Example of detected unusual events. The bin(s) where the event occurred is highlighted. In (a), the left and the right vehicles are each attempting to make a left turn simultaneously and are about to collide. In (b), the upper vehicle is making a left coinciding with the lower vehicle making a right. This will result in the two vehicles driving very close to one another.
5.6 First order anomaly example: a U-turn. Regardless of its interaction with other cars, a U-turn is not allowed (unusual) at a stop sign. Without noting this fact (a), it is reported as unusual w.r.t. most other populated bins. The algorithm has been updated to filter out first order events from further processing with other bins (b).
5.7 Unusual event detection using a coarse grid with 16 bins (a) and a fine one in (b). The mean correct detection rate is plotted versus the mean false alarm rate. For the coarse grid (a), the introduction of noise slightly deteriorates the detection rate, but for appropriate choices of pth the method proves to be immune to noise. The fine grid (b) is very sensitive to noise. In this case, adapting between coarse and fine grids can be the best alternative.
5.8 Unusual event detection: coarse and fine grids compared in the absence of noise. Slightly better performance using the fine grid for many values of pth.
5.9 Typical video snapshot showing the 3 types of detected objects
5.10 Three examples of tracking errors: (a) three adjacent pedestrians classified as a vehicle. (b) Missed pedestrian. (c) Moving tree branches and empty parking spot detected as moving objects.
A.1 Overlapping-rates on the Nova Animal Set (4509 images) using different algorithms.
A.2 Overlapping-rates on the Nova Animal Set of two manual croppings.
A.3 Scatter plot of rectangle area versus saliency content.
A.4 Statistics of hand cropping for the "non-professional" set.

Chapter 1
Introduction

1.1 Motivation

The use of video surveillance systems has been rising over the past decade. Most recently, the need to improve public safety and the concerns about terrorist activity have contributed to a dramatic increase in the demand for surveillance systems. The presence of these systems is very common in airports, subways, metropolitan areas, seaports, and in areas with large crowds. Modern video surveillance systems consist of networks of cameras connected to a control room that includes a collection of monitors. Typical control rooms have a much smaller number of monitors than cameras and far fewer operators than monitors. The monitors either cycle automatically through the cameras, or operators can manually choose any camera from the network and display it on a selected monitor.

The M25 London Orbital highway system consists of 5 traffic control centers, each with 60 monitors connected to 324 cameras and distributed over 70 sites. The London underground has a network of 25,000 cameras at 167 stations [1]. A recent survey by the New York Civil Liberties Union found that in Lower Manhattan, New York City, the number of surveillance cameras below 14th St grew from 769 in 1998 to 4176 in 2005. According to the same source, the New York City Police Department announced in 2006 that it planned to create a "citywide system of closed circuit televisions" operated from a single control center [2]. In the area of crowd control, CNN News reported from inside the central surveillance control room for this year's Hajj ritual at Mecca, Saudi Arabia. Over 1400 cameras monitored a crowd of around 3 million pilgrims, who can flow in some areas at rates of up to 250,000 per hour [3]. A typical design of a control room is illustrated in figure 1.1.

Figure 1.1: Typical design of a control room

We suggest this work to be part of a larger project in the design of future control rooms, one that envisions an architecture consisting of a large display wall which acts as a single entity, as opposed to matrices of independent monitors, or display regions with pre-specified monitoring tasks, as in current state of the art control rooms. The display wall assigns variable areas and locations to a subset of the available videos from surveillance cameras. The problem of selecting the subset of videos to display is addressed in this dissertation. Here, we do not only mean the automatic selection of camera footage to display at a specific time. We go one step further by editing the individual videos in both time and space before displaying them, possibly stitching together different video "pieces" coming from different cameras. Our goal is to present the operators with a video that contains the most critical scenes of the surveyed environment, thus focusing their attention on important information content.
The related problems of assigning the appropriate area and selecting the location where the edited video is to be displayed are motivated in this dissertation, but the details are left as an area of future research. Intuitively, video segments that are more interesting to human operators shall be assigned longer display times, larger areas and more prominent parts of the display wall. Some "scoring" mechanism will be used to achieve this task. An overview of the surveillance system's architecture presented in this dissertation is illustrated in figure 1.2.

Figure 1.2: Diagram of the proposed future control room

1.2 Dissertation Organization

In this dissertation, we present four components that collaborate in building our proposed surveillance system. First, we note that assigning a video to its display area typically requires resizing it. If the assigned space is small, simply reducing the resolution of the original video might render its contents illegible. In this situation, cropping the video before resizing it results in videos that should be easier for humans to interpret. This brings up the need to automatically edit the raw footage coming from cameras. The design and implementation of this component, along with results on real surveillance video, is the topic of chapter 2.

Within the area of video editing, we also study the problem of creating a video by combining together several different video segments. In chapter 3, we create videos that focus on subjects during their presence in the surveyed scene, using multiple fixed cameras. Rather than relating a video to its source (i.e. the camera that produced it), we relate it to its content (i.e. the subject(s) that it covers). Creating such multi-camera videos, one per subject, enables the operator to focus on each subject's behavior separately, rather than being distracted by different activities that occur in each camera's field of view.

Using fixed cameras can result in some subjects not being seen by any camera. Within the theme of focusing operators' attention on individual subjects, we suggest the use of mobile cameras for that purpose. This naturally relates to the pursuer-evader game. The first question that comes to mind is whether or not a pursuer will be able to maintain the visibility of an evader at all times, and for which initial positions of the players. We devise an algorithm to solve that problem in chapter 4 and, if the answer to the question of maintaining the visibility of the evader is affirmative, we also compute the motion strategy that realizes this.

Finally, chapter 5 presents our method to detect anomalous activity in video. By identifying time intervals and locations of unusual activity, we present yet another method of focusing the operator's attention on interesting parts of surveillance video that could require further human intervention.

1.3 Contributions

We have introduced several novel methods in analyzing surveillance video in this dissertation. In the area of video editing, we defined and implemented a broader video cropping technique that deals with raw camera footage versus retargeting pre-edited video. We also introduced a new approach to solve the problem of tracking people across multiple cameras. We applied our method to a new problem of automatically creating a video for a subject from the footage of multiple cameras.
In the area of pursuer-evader games, we have developed a new computationally feasible approach to solve the problem of determining the outcome of the game. Finally, our anomaly detection method uses co-occurrence statistics of moving objects in time and space. Unlike earlier approaches that used co-occurrence statistics in time only, our method is additionally able to detect events that need not be unusual if considered individually, but that become so when occurring simultaneously.

Chapter 2
Automatic Video Editing

2.1 Introduction

In surveillance applications, video cropping helps to focus the attention of operators on specific parts of the scene. Activity occurring in the background or at corners of the display area might pass unnoticed by operators, due to other activity in more prominent areas of the scene. Cropping is often needed as well to save bandwidth in transferring the video or saving it for archiving purposes. The tradeoff here is between the size of the cropped video and the information loss. In scenes with several regions of simultaneous activity, allowing the cropping window to jump occasionally between these regions supports coverage of multiple activities, while keeping the resulting cropped video small. In addition, it is crucial in surveillance applications to process video online, as it becomes available, without requiring the entire video to be available beforehand. These points are amplified in the body of this chapter.

We define the video cropping problem to be the determination of a smooth path of a cropping window that captures "salient" foreground throughout the video. The window can have variable size, which introduces virtual zoom-in and zoom-out effects. It is allowed occasional jumps through the video, similar to scene cuts in filmmaking. We use motion energy as a measure of the "saliency" of a cropping window trajectory. Unlike previous approaches, we optimize the saliency globally for the whole trajectory, rather than for individual frames or shots, as follows. First, the video is modeled as a graph of windows, with its edge weights reflecting the saliency captured by windows, efficiently computed using integral images [78]. Then, a shortest path algorithm finds the window trajectory that captures the overall maximum motion energy. The resulting trajectory is smoothed to remove jiggles and staircase-like appearance. This procedure is repeated several times on the remaining parts of the video to capture the remaining saliency. This results in obtaining a set of disjoint smooth paths of cropping windows that capture as much saliency as possible. A secondary optimization procedure produces the final path by alternately jumping between the paths computed earlier, selecting which one to follow at which time, so as to maximize both captured saliency and covered regions of the original video. Long videos are processed by breaking them into manageable sub-videos, while allowing overlap between consecutive sub-videos, to produce smooth transitions.

Our algorithm is applied to a collection of real surveillance videos. Several experiments are performed to determine the optimal choices regarding issues such as where to cut a video into sub-videos, the amount of overlap between them, and how long a segment should be displayed before jumping to another. Some display configurations are compared, such as cropped video alone, side by side with original video, or video-in-video (like commercially available picture-in-picture).
We are mainly interested in surveillance applications. Typically, much more video than operators can observe is available. In addition, this video is unedited and, more importantly, not focused on any agent in the scene. This has made our approach [21] different from earlier approaches that dealt with edited videos, such as movies, news reports, or classroom video. In particular, our method offers the following contributions:

- a variable size cropping window, which results in a smooth zoom in/out effect,
- multiple cropping windows to cover more agents in the scene,
- only a relatively short video segment needs to be processed at a time, not the complete video, which makes the algorithm an online algorithm, and
- we show empirically that by stitching together results from short segments of a video, we get a result identical to the globally optimal one, given the entire video.

The rest of this chapter is organized as follows: section 2.2 reviews related work. Then, the video cropping problem is formally defined in section 2.3. In section 2.4, we present our approach to solve the problem, while section 2.5 presents the results. Finally, closing remarks and conclusions can be found in section 2.6.

2.2 Related Work

Research has been performed in the area of visual attention to detect salient areas in images and video from low-level features. Itti et al. [41] use orientation filters in addition to color and intensity to detect salient parts of images. Later, Itti and Baldi extend this method to work with videos, using a statistical model for time [40]. A probabilistic approach is used by Kadir and Brady as a measure of local saliency in images [42]. Their method is generalized by Hung and Gong by including the time variable to quantify spatio-temporal saliency in videos [38].

Many methods for cropping still images automatically have been published. For example, Suh et al. [68] use Itti's saliency model, along with face detection, to crop informative parts of images before reducing them to thumbnails. The same model is also used by Chen et al. [12], with the addition of text detection, to find regions of interest in images for adaptation to small displays. Xie et al. [83] study the statistics of users' interaction with images on small displays to determine regions of maximum user interest.

Much less work has been done in the area of video cropping, or detecting interesting space-time regions of a video. Fan et al. [24] determine areas of interest in individual frames, then combine them smoothly. Wang et al. [79] split surveillance video into interesting and non-interesting sequences using a threshold on their motion content. Non-interesting sequences are zoomed out and transmitted/displayed at reduced frame rates, while interesting ones are displayed at full frame rate and zoomed in to clusters of high motion energy. Both of these methods are optimized locally and need not produce videos that are globally optimal, in terms of their saliency content. More recently, Kang et al. used a space-time saliency measure to cut out informative portions of a video and pack them into a video of smaller resolution and shorter duration [43]. This approach models videos and processes them as a whole, making it unsuitable for relatively longer videos, or for continuous surveillance video.

Liu and Gleicher [52] edit videos using a fixed size cropping window.
Since they focus mainly on editing feature films, the window moves are restricted to pans and cuts whose parameters are optimized over individual shots. To keep the original structure of the film, the authors introduce a set of heuristic penalties that limit the motion of the cropping window. Although this approach works well with professionally captured films, it may not generalize well to unedited raw surveillance video, which generally has a wider field of view, is not focused on a main subject, and consists of a single shot. Within the same area of cropping television and cinema video content to fit a different screen size, Deselaers et al. [15] also scan the video using a fixed size cropping window. They occasionally zoom out, padding with black borders, when the cropping window "is not able to capture all relevant parts of the image". Another recent example of automatic editing of specific types of videos is that of classroom video editing. Heck et al. [32] find an optimal shot sequence, from a set of virtual shots, that have been selected based on prior knowledge of the scene and video content.

2.3 Problem definition

Given a video sequence, our goal is to determine a smooth trajectory for a variable size window through the video, that maximizes the captured saliency over all such trajectories and window sizes. Occasional jumps are allowed to include as much saliency as possible. More formally, we consider the problem of optimizing a single trajectory. Assume the input video segment has T frames. Each frame t can be covered by a set of n variable size overlapping windows. These windows are labeled W_{i,t}, with i being the window number, selected from an index set I = {1, 2, ..., n}. Define the cross product set \mathcal{I} = I \times I \times \cdots \times I = I^T. Then, we want to solve the problem:

\arg\max_{Q \in \mathcal{I}} \sum_t S(W_{i,t})    (2.1)

where S(\cdot) is a saliency measure, Q \in \mathcal{I} is the window sequence that maximizes the saliency, and i \in I. It is more desirable to minimize functions; thus, the saliency function S(\cdot) can be replaced by a cost function C(\cdot) that decays with increasing saliency. Model (2.1) does not enforce spatial smoothness. To guarantee a smooth path, windows in two consecutive frames are restricted to be close to one another and with little area variation. The problem is thus formulated as:

\arg\min_{Q \in \mathcal{I}} \sum_t C(W_{i,t})    (2.2)

such that:

d(W_{i',t-1}, W_{i,t}) < d_{\max}
|A(W_{i',t-1}) - A(W_{i,t})| < A_{\max}

where d(\cdot, \cdot) is a distance measure and A(W) is the area of window W. Our main contribution [21] is that we optimize globally, over the whole video, rather than locally for individual frames that are later combined. This approach maximizes the total captured saliency and provides smoother results.
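To make the smoothness constraints of (2.2) concrete, the following minimal sketch tests whether a transition between two windows is admissible. It is our illustration, not the dissertation's implementation; the window representation and the thresholds d_max and A_max are illustrative assumptions.

    # Sketch: the smoothness constraints of problem (2.2) as an edge test.
    # Windows are (cx, cy, w, h) tuples; dmax and amax are illustrative values.
    def transition_allowed(prev_win, cur_win, dmax=20.0, amax=500.0):
        px, py, pw, ph = prev_win
        cx, cy, cw, ch = cur_win
        dist = ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5   # d(W_{i',t-1}, W_{i,t})
        area_change = abs(pw * ph - cw * ch)              # |A(W_{i',t-1}) - A(W_{i,t})|
        return dist < dmax and area_change < amax

In the graph of section 2.4.2 below, an edge is added between two windows in consecutive frames exactly when a test of this form passes.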
2.4 Video Cropping Approach

Solving problem (2.2) by trying all possible paths is prohibitively expensive. Instead, we employ a dynamic programming approach, using the following procedure. First, motion energy is extracted through frame differences. Then, we build a weighted directed graph, with the cropping windows as its vertices, and edge weights measuring the motion energy. A shortest path algorithm through the graph selects the first optimal trajectory, which minimizes equation (2.2). This trajectory is then smoothed, and the motion it contains is "wiped out" from the original frames. This procedure is repeated to capture some of the remaining energy. A second graph is built out of the resulting trajectories, with shortest path run once again to determine the optimal combination of paths. In the final cropped video, one trajectory is followed at a time, with occasional jumps between these trajectories. For long videos, video segments are chosen to overlap with their immediately preceding ones, and are processed similarly. This allows smooth transitions between the respective trajectories in each segment. The whole process is summarized in figure 2.1. In the remainder of this section, the above steps are presented in more detail.

Figure 2.1: Overview of our approach to crop a video segment.

2.4.1 Extracting motion energy

In our implementation, we use motion energy as a measure of saliency. Earlier approaches [24, 52] used centered frame saliency maps, face and text detectors in addition to motion contrast, to crop movies and news. These detectors are too specific to be used with surveillance video.

Motion energy is efficiently computed in real time, and captures important activity in surveillance video. We compute frame differences and threshold them to detect motion, then apply morphological operations to the resulting motion frames. In particular, the opening operation is applied to remove detected small noisy areas, then a closing operation connects nearby fragments. The motion energy is computed as the number of 1's in the resulting binary images. The remainder of the algorithm works on these preprocessed difference frames.
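A minimal sketch of this preprocessing, assuming OpenCV and NumPy are available; the threshold and structuring-element sizes are illustrative choices rather than the dissertation's tuned values.

    import cv2
    import numpy as np

    def motion_frame(prev_gray, cur_gray, thresh=25, kernel_size=3):
        """Binary motion frame: frame difference, threshold, then open/close."""
        diff = cv2.absdiff(cur_gray, prev_gray)
        _, motion = cv2.threshold(diff, thresh, 1, cv2.THRESH_BINARY)
        kernel = np.ones((kernel_size, kernel_size), np.uint8)
        motion = cv2.morphologyEx(motion, cv2.MORPH_OPEN, kernel)   # remove small noise
        motion = cv2.morphologyEx(motion, cv2.MORPH_CLOSE, kernel)  # join nearby fragments
        return motion  # the motion energy of any region is the count of 1's inside it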
2.4.2 Building the graph

The video is modeled as a weighted directed graph as follows. Let G = (V, E) be the graph. Each frame is sampled by n overlapping cropping windows of various sizes. Then, each window is represented by a vertex v \in V. This makes the total number of vertices in the graph |V| = nT for a video segment of T sample frames. The restrictions in the problem formulation (2.2) are implemented by allowing a window to move only to neighboring window positions, or to grow or shrink by no more than one size step between two consecutive frames. This translates to adding an edge to E only if it connects a pair of vertices that represents a pair of windows between which a step is allowed. For a certain ordering of the vertices V, the adjacency matrix of G is banded. This is a sparse matrix that can be efficiently stored using an adjacency list graph representation, to store just the bands. With b \ll nT neighbors allowed per window, we have a graph of O(nT) edges. An additional "source" node and a "target" node are added to the beginning and end of the graph. A model graph is shown in figure 2.2.

Figure 2.2: Graph modeling the video.

To find a trajectory that maximizes the captured motion energy, we need to define a window energy measure. Using the total energy enclosed in a window always favors larger windows, which makes the cropping useless. On the other hand, using the energy density as a measure favors smaller windows, which can be more densely filled, with little empty space. This results in cropping away smaller parts of objects, such as heads in humans, as is illustrated in figure 2.3. This is definitely an undesirable result.

Figure 2.3: Using window energy density as a measure of motion energy favors smaller windows, cropping away small parts of objects, such as heads in humans (left: original frame, showing the location of the cropping window; right: cropped frame).

To balance both effects, we use an energy density measure, penalized by the total energy in a thin surrounding belt. This is similar to center-surround representations (e.g. Itti et al. [41]). It prevents the window from cropping away parts of objects, making it large enough to fit a single "cluster" of objects, while leaving distant ones for subsequent trajectories to capture. The multi-scale energy function of a window W is defined as:

E(W) = \frac{E_{in}}{A_{in}} - \frac{1}{K} E_{belt}    (2.3)

where E_{in} is the motion energy (number of 1's) in window W, A_{in} is the area of window W in pixels (i.e., the number of pixels in W), E_{belt} is the motion energy in the surrounding belt, and K is an empirically chosen constant that determines how much penalty to assign to cropped-away parts. The diagram in figure 2.4 illustrates these parameters.

Figure 2.4: Cropping window configuration, with surrounding belt area. The numbers represent the labels of the inner (cropping) window's corners.

Next, the energy measure (2.3) is put in a form that can be minimized, to fit as a cost function C(\cdot) in equation (2.2). In addition, to suit the shortest path algorithm used, the edge weights must be normalized to a non-negative integer range. We choose this range to be from 0 to 100. If a transition is allowed from vertex i to vertex j, an edge is added to the graph with weight w(i, j) computed as:

w(i, j) = \left\lfloor 100 - 100 \max\left( \frac{E_j}{A_j} - \frac{1}{K} E_{belt_j},\, 0 \right) + 0.5 \right\rfloor    (2.4)

where E_j, A_j and E_{belt_j} are all related to the window represented by vertex j. The function \lfloor x + 0.5 \rfloor rounds x to the nearest integer.

E_j and E_{belt_j} are computed by summing the motion pixels (1's) for each window in each pre-processed difference frame. This is a time consuming operation if performed in a straightforward manner. Instead, we use integral images in a manner similar to Viola and Jones in image analysis [78], based on an idea by Crow for texture mapping [14]. Given an image i(x, y), the integral image ii accumulates pixels to the top and left of each pixel, as defined by:

ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')

The integral image is computed in one pass using the two recurrences:

s(x, y) = s(x, y-1) + i(x, y)    (row sum)
ii(x, y) = ii(x-1, y) + s(x, y)    (integral image)

with s(x, -1) = 0 and ii(-1, y) = 0. The integral image is computed once for every frame; then every window sum, E_j, is computed using only 4 operations. Thus, we avoid computing the cumulative sum for each window, which includes a lot of redundancy due to the overlap between windows. Given a window with corners \vec{x}_1, \vec{x}_2, \vec{x}_3 and \vec{x}_4 in clockwise direction, starting from the top left (see figure 2.4), the cumulative sum in that window can be computed as:

ii(\vec{x}_3) - ii(\vec{x}_2) - ii(\vec{x}_4) + ii(\vec{x}_1)
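The following NumPy sketch ties these pieces together: the one-pass integral image, the 4-operation window sum, and the edge weight of equation (2.4). The rectangle convention and the value of K are illustrative assumptions, not the dissertation's code.

    import numpy as np

    def integral_image(motion):
        # One pass over the binary motion frame, as in the recurrences above.
        return motion.cumsum(axis=0).cumsum(axis=1)

    def window_sum(ii, top, left, bottom, right):
        # Cumulative sum inside an inclusive rectangle, using 4 lookups.
        s = ii[bottom, right]
        if top > 0:
            s -= ii[top - 1, right]
        if left > 0:
            s -= ii[bottom, left - 1]
        if top > 0 and left > 0:
            s += ii[top - 1, left - 1]
        return s

    def edge_weight(ii, win, belt, K=50.0):
        # Weight of an edge entering the window `win` per equation (2.4);
        # `belt` is the outer rectangle enclosing the surrounding belt.
        e_in = window_sum(ii, *win)
        a_in = (win[2] - win[0] + 1) * (win[3] - win[1] + 1)
        e_belt = window_sum(ii, *belt) - e_in
        e = max(e_in / a_in - e_belt / K, 0.0)
        return int(np.floor(100.0 - 100.0 * e + 0.5))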
This running time has been reduced in our implementation in two ways. The multiplying factor is directly afiected by the size of the list of temporary labeled nodes. In our runs, we note that the maximum number of nodes in that list, over all iterations, is just around 1% of the total number of vertices in the graph. The running time is also reduced an order of magnitude, from quadratic to linear in the number of vertices, using Dial?s implementation [16]. In the original algorithm, Dial stores temporarily labeled nodes in buckets, indexed by the nodes? distances. This makes the search for the minimum distance node O(1). With 18 C being the maximum edge weight (100 in our graph), Dial?s original algorithm maintains NC + 1 buckets, which may be prohibitively large. A remark made by Ahuja [7] reduces the space requirements to only C + 1 buckets. During each iteration, the difierence between the maximum and minimum flnitely labeled nodes cannot exceed C. Hence, temporarily labeled nodes can be hashed by their distance labels into just C +1 buckets. 2.4.4 Smoothing The shortest path resulting from the previous stage has a noisy appearance, which can be best compared to a jittery or shaking cameraman, from the point of view of the cropped video. Two levels of smoothing are applied to the trajectory. First, a moving average smoother is applied to the trajectory. With y(t) being the original data (raw trajectory) at time t and ys(t) the smoothed one, the difierence equation is: ys(i) = 12N +1 (y(i+N)+y(i+N ?1)+???+y(i?N)); where N is the \radius" of the span interval. This has the efiect of a lowpass fllter, reducing the shaky appearance of the trajectory. Truncating the result completely removes any such artifacts, but results in a staircase-like appearance. A second level of smoothing is done by interpolating a piecewise cubic Hermite polynomial to a sub-sampled version of the data. This polynomial interpolates the data points and has a continuous flrst derivative. However, unlike cubic splines, the second derivative need not be continuous. This property is more suitable for our staircase 19 ?3 ?2 ?1 0 1 2 3?1.5 ?1 ?0.5 0 0.5 1 1.5 DataHermite Spline Figure 2.5: Piecewise cubic Hermite interpolation performs better than cubic spline when the original data is staircase. data, since it avoids excessive oscillations. Figure 2.5 illustrates our situation where Hermite interpolation produce more stable results. 2.4.5 \Wiping out" captured motion A single moving cropping window might not capture all motion energy in the video. In situations where several activities occur simultaneously in separated re- gions of the scene, panning back and forth between these regions produces a blurry video that covers the mostly empty area between them. To solve this problem, we compute several independent window trajectories, each covering a difierent activity in the same video. Later, we show how to control jumps between these paths. The window trajectories are determined by repeating the above procedure of modeling the video using a graph, computing the shortest path and smoothing. Between two consecutive computations of window trajectories, all captured objects that overlap with the flrst are removed from the motion frames. This \wiping" procedure guar- 20 Figure 2.6: Example of merging three windows trajectories. The horizontal axis represents time; the thick lines are for the segments of the trajectories picked at that time; and the dashed lines are for times where jumps occur. 
antees that an ensuing window path will not cover parts of previously captured objects, thus avoiding two similar trajectories. 2.4.6 Merging trajectories Once enough window paths are computed, we merge them into a single path that captures as much motion as possible, and covers as many regions of the original video as possible. Figure 2.6 illustrates an example solution to this problem. After following a segment of a path for some period of time, a jump to a segment of another path results in covering another activity region, without panning through the scene. This appears in the flnal cropped video as a cut. To compute the flnal merged trajectory, we solve a second optimization that uses a shortest path algorithm through the trajectories. A second graph G0 = (V0;E0) is built, with the list of vertices V0 formed by concatenating nodes from all computed trajectories. This list of nodes is duplicated k times, with the ith copy of a node from path p representing that this path has been followed for i steps (frames/nodes). The number of frames after which a switch between trajectories is allowed without penalty is k. This allows us to keep track of how long (in frames) 21 a single trajectory has been followed. There always exists an edge e 2 E0 from every node to the next node in the same path, and edges to next frame nodes in the other paths if a \cut" is allowed at that time. The weight function w0 is the cost associated with the representative window in the original graph G, computed as in (2.4), if the edge connects two nodes in the same path. However, if a path switch occurs, a penalty function is added to that weight. Intuitively, a higher penalty is associated with switches to closer trajectories, to favor more coverage of the original video, and to avoid \jumps". Another penalty is added to the window cost, that decays with the time that has been spent following a certain trajectory. This latter penalty inhibits frequent successive cuts, which might distract the operator. These penalties are fractions of the window cost, to make them comparable to the window cost penalty (2.2) in the global optimization. They are determined empirically, based on runs conducted on several videos. We also noticed that these penalties are consistent with fllmmakers? heuristics, who might use frequent cuts purposely only when special \pacing efiects" are required. Similarly, fllmmakers would avoid a cut to a nearby location, which is known in cinematography as a \jump cut". These rules have been followed in feature fllm editing [52] when creating virtual cuts. 2.4.7 Processing long videos Real-time surveillance applications require online processing of video. Ad- ditionally, available processing resources constrain the longest video that can be 22 processed as a whole. Our approach allows us to process long videos by breaking them into segments, with some overlap. Each segment is processed separately, re- sulting in consecutive overlapping trajectories. By piecing together corresponding paths from each segment and removing the overlap, continuous smooth trajectories result. This is further discussed in the section on results, where experiments are performed to determine appropriate locations to break the video into segments and the amount of overlap needed for smooth transitions. 2.4.8 Video display Several display methods for the resulting cropped video are considered. The most obvious is just displaying the cropped region. 
This is the most space e?cient but may result in seeing the cropped video out of its context. To keep both context and content, we display both the original and cropped videos side by side, with the original one shrunk to flt the display space. This display method is not space e?cient, though. To display context, while keeping a reasonable video size, we suggest a video- in-video display style. The familiar style is to display the \sub"-video (cropped video here) in a corner of the \full" video. In our approach, we flnd the optimal display location automatically while computing the shortest path. It is chosen to be largest border window, that covers the minimum activity in the video. In terms of our implementation, this is the largest window on the border that contains the least motion energy. The video-in-video window is of flxed size and location, but 23 Figure 2.7: Four typical frames from an airport surveillance video, with three tra- jectories and flxed window size. In each frame, the original video frame, showing the locations of the cropping windows, is to the left and the cropped frame is to the right, resized to the original?s height. can be allowed occasional changes over longer periods of time. 2.5 Experimental results We have applied our approach to several surveillance videos. Typical cropped frames from an airport surveillance video (three-window) and from a tra?c inter- section video (two-window) are shown for flxed size cropping, in flgures 2.7 and 2.8 respectively. For now, the number of cropping windows is manually selected for each video to cover all people who appear in the scene, occasionally jumping be- tween them. The flgures illustrate single frames in which only one cropping window is displayed. Variable-size single-path windows, video-in-video and multi-camera display are shown in flgures 2.9, 2.10 and 2.11. We have timed the performance of our system on the airport surveillance video. This is a high quality 720?480, 30 fps video. Excluding background subtraction and 24 Figure 2.8: Two typical frames from a tra?c intersection surveillance video, with two trajectories and flxed window size. In each frame, the original video frame, showing the locations of the cropping windows, is to the left and the cropped frame is to the right, resized to the original?s height. Figure 2.9: Variable size single trajectory results. Compared to the flxed size results, less empty area is retained around small cropped objects. Figure 2.10: Video-in-video results: the cropped video location is selected automat- ically in the boundary region of overall least activity. 25 Figure 2.11: Multi-camera display issues should be addressed in later work. Original video to the left; cropped video to the right. disk operations, which are done at independent stages, we were able to process a 66 second video segment (2000 frames), as a single segment, in about 400 seconds, for a flxed-size, single-trajectory window. This is equivalent to processing about 5 frames per second. Lower resolution videos (320?240) can be processed at almost real-time rates. Experiments are done on a 2.8 GHz dual processor CPU with 1 GB of memory. Using a variable size window with 6 size steps multiplies the size of the problem by a factor of 6, thus dramatically reducing the processing speed to about 0.88 fps. Computing more trajectories reduces the processing speed even further. 
A key contribution of our system over previous approaches is that it is able to process segments of video sequentially, eventually allowing for processing ows of 26 input video online. This makes our system able to process surveillance video online, without the need to have the whole video available completely before processing. This feature, along with the processing rates for the basic flxed-size window low- resolution video cropper, makes it useful for real-time surveillance systems. We hope that more e?cient implementations and faster hardware will allow the full feature variable-size multi-window cropper to work in real-time in the future. When splitting a long video into segments, we compare the solution, i.e. crop- ping window path, to that obtained if the complete video is processed as a single block. Experiments carried out on several videos show that window paths obtained by processing a sequence of subvideos are identical to those obtained by processing the longest video that could flt in memory as a whole. Processing video segments individually results in small jumps when piecing the resulting cropping window paths together. To obtain smooth transitions between consecutive segments, the locations to split the video should be carefully selected. In addition, some amount of overlap is required between segments, to obtain a transition at the closest point between window paths. Two sets of experiments are described in the following subsections. The flrst one determines the best locations to split a video into segments, and the second one determines the amount of overlap required between segments to obtain smooth transitions. 27 2.5.1 Splitting a long video into segments The flrst problem to be solved is to flnd where to split the video into segments. The rule of thumb here is to avoid splitting the video at a point where little or no activity occurs. This might result in a large discontinuity of the window path at the split point. Regardless of the overlap between the consecutive segments (see next section), choosing the split point inappropriately might still cause discontinuities. Our experiments for flxed-size cropping windows show that a split at any time where signiflcant motion is detected will result in smooth transitions. Figure 2.12 illustrates a situation where a video is split into two segments during a period of very little motion energy. This results in a large jump at the split location, despite the long overlap region between the two segments. The x-coordinate of the center of the cropping window is shown on the vertical axis, along with the motion energy content, normalized to a scale of 300, for visualization. The solid line represents the result of processing a one-minute-video segment as a whole, and the dashed lines are for the results of splitting in two sub-segments. 2.5.2 Overlap in video segments Once the locations at which to split the video are chosen, the amount of over- lap between these segments has to be determined. A compromise has to be made between smoothness, requiring longer overlap, and e?ciency, requiring shorter over- lap. Many experiments were conducted to determine average and shortest overlaps for which smooth transitions can still be obtained. 
2.5.2 Overlap in video segments

Once the locations at which to split the video are chosen, the amount of overlap between segments has to be determined. A compromise has to be made between smoothness, requiring longer overlap, and efficiency, requiring shorter overlap. Many experiments were conducted to determine the average and shortest overlaps for which smooth transitions can still be obtained.

We first start with the fixed-size window case. For all the videos considered, a maximum of 20 overlap frames was required. By tracing the shortest path algorithm, we record for each node the last frame (largest time index) that was processed to determine the node's distance from the source node (its depth in the graph). In Dijkstra's shortest path algorithm, once a node's distance is permanent, a shortest path passing through it is not changed any further (although the path itself is not determined until the backward pass). This property indicates how much video needs to be processed ahead to compute the final trajectory. We found that no more than the nodes of 18 frames ahead are needed to compute the vertices' permanent distances, and hence the shortest path. Experiments carried out for the variable window size case show that longer overlaps are required. In some situations, we were able to obtain smooth transitions for 50 frame overlaps, but more experiments still need to be done to determine the requirements for a smooth result.

To further illustrate the point of segment overlap, figure 2.13 shows a fixed-size-window shortest path computed for a 1000 frame video (solid), plotted against shortest paths for sub-videos (dotted), starting at the same frame (source node) but ending much before the original video. The discrepancies are found within a range of 8-12 frames. Figure 2.14 shows a 1000 frame video that serves as ground truth. It is split into five 200-frame segments with 40 frame overlap on each side. The cropping window trajectories obtained using each method are plotted against each other. Again, differences between the two trajectories are found not to exceed 20 frames around split points. Elsewhere, the two solutions are identical. This shows that our method tends to compute trajectories that are globally optimal, even though we are processing video segments individually.

Figure 2.13: Shortest paths of video prefixes (prefix lengths of 20, 40, 60, 80 and 100 frames, against the full 1000-frame path).

Figure 2.14: Splitting a video into 5 segments (full 1000-frame path against five 200-frame segments).
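The lookahead measurement described in section 2.5.2 can be sketched by instrumenting Dijkstra's algorithm to record, for each settled node, how far ahead in the video the algorithm had to look before that node's distance became permanent. The graph encoding and names below are assumptions made for illustration.

import heapq

def dijkstra_with_lookahead(graph, frame_of, source):
    """Shortest path over the trajectory graph, recording lookahead.

    graph    : dict node -> list of (neighbor, edge_cost).
    frame_of : dict node -> frame index of that node.
    Returns (dist, lookahead), where lookahead[v] is how many frames
    past v's own frame had been touched when v's distance became
    permanent, i.e. how much future video was needed to finalize v.
    """
    dist, lookahead = {source: 0.0}, {}
    max_frame_seen = frame_of[source]
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in lookahead:          # already settled; skip stale entry
            continue
        lookahead[u] = max_frame_seen - frame_of[u]
        for v, w in graph.get(u, []):
            max_frame_seen = max(max_frame_seen, frame_of[v])
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist, lookahead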
2.6 Conclusions

Modern surveillance systems contain so many high resolution videos that it is not possible for operators to view them all. Automatic systems that select the most relevant portions of the video will make it possible to display important parts of the video at higher resolution. To do this, we have created a system that crops videos to retain the regions of greatest interest, while also cutting from one region of the video to another, to provide coverage of all activities of interest.

At the core of our system is the use of a shortest path algorithm to find the optimal trajectory of a sub-window through a video, capturing its most interesting portions. This relies on a measure of interestingness using a center-surround operator on the video's motion energy. By normalizing this operator for scale, we produce trajectories that alter the zoom of the sub-window, allowing us to smoothly track objects as their apparent size changes. These methods allow us to find a globally optimal trajectory, given access to the entire video. In practice, we show that we can process the video a few seconds at a time and obtain results identical to the globally optimal solution, but with much less processing time and latency. In addition, we smooth the trajectory of the cropping window, to reduce unpleasant artifacts. We also show that we can compute a number of independent trajectories, and use a shortest path algorithm to find the best way of cutting between them. This produces a cropped video in which we pan across the original video, following objects of interest, with occasional cuts to display different objects.

We plan additional work to improve upon these results in the future. First, in this chapter we have focused on finding algorithms that can maximize the amount of information extracted from videos by cropping and cutting. We plan to perform user studies to determine the extent to which this can assist an operator in performing a real surveillance task (some relevant user studies of related systems can be found in [24, 26, 79]). User studies will also allow us to assess the importance of parameters that affect video quality, such as the frequency of cuts or the smoothness of trajectories. We also plan to explore the use of temporal cutting, in which we eliminate frames in which not much is happening (see figure 2.15). This will be particularly important in large scale systems that cycle through many videos.

Figure 2.15: A cropped frame resulting from panning across a largely empty scene (left: original frame, showing the location of the cropping window; right: cropped frame).

Our work is based on an overall vision for control rooms of the future, in which a system determines which portions of which videos are most relevant to a surveillance task, and maps these to large displays. Such a system will combine automatic processing and user input to help operators focus on the most relevant visual nuggets in a huge sea of visual information. This work contributes to that vision by showing how to use tools from optimization to find the most important portions of a video.

Chapter 4 follows chapter 3 below.

Chapter 3
Multi-Camera Management in Surveillance Applications

3.1 Introduction

Surveillance systems are becoming common in modern facilities. The increasing need to provide safety has resulted in more demand for these systems. They typically consist of a network of cameras, monitored from a control room. The large number of cameras available and the rising cost of human resources make it prohibitively expensive to manually operate all of them. Recently, there has been a lot of interest in automating camera management in surveillance systems, to direct the operators' attention to "interesting" scenes.

3.1.1 Related Work

To cover large areas, cooperation schemes need to be developed amongst cameras. To maintain the visibility of as large an area as possible, several aspects of the problem have been addressed in the literature:

1. assigning camera locations;
2. controlling camera parameters, such as pan, tilt and zoom, and possibly position as well (for moving cameras), to track agents; and
3. switching tasks between cameras.
In scenes of known layout, such as lecture rooms [64], camera positions can be selected to focus on the lecturer and the audience, whose locations are known. In typical configurations, a wide-angle static camera is used to control the pan and tilt of an active camera that operates in high zoom mode. In surveillance applications, agents' activities are usually not predictable. Wren et al. [80] control the pan, tilt and zoom of fixed high capability cameras using a network of low-cost sensors that measure activity. Goradia et al. [30] present solutions to the problems of optimally deploying cameras and switching between them. They use dynamic programming to optimize the switching of multiple fixed cameras, tracking a single moving agent. Fiore et al. [25] present a system that reorganizes camera placement with scene changes, while Zhao and Cheung [85], in addition to other contributions, develop a method to determine both the optimal number and positions of cameras that achieve a desired level of visibility.

In the area of tracking agents across multiple cameras, Takemura and Miura [69] plan the panning and tilting of multiple fixed cameras to track as many moving agents as possible, in a room without obstacles. They use a motion model to predict people's future states. Collins et al. [13] associate cost functions with cameras to track a single moving object across multiple fixed cameras in an outdoor scene. Switching of cameras can occur to "hand off" to another camera when an agent has moved to within its field of regard. It may also occur between a wide-angle view camera and a highly zoomed-in one, to provide a better view of an agent. When gaps exist between cameras, Makris et al. [53] suggest a method to correspond the tracks of moving agents transiting through the "blind" regions of the camera network. An indoor surveillance system with multiple non-overlapping field-of-view cameras is presented in [63]. In this work, Porikli and Divakaran match objects between cameras to generate a key-frame-based video summary for each object during its presence within the system.

Another way to cover a large area with surveillance cameras is to use mobile cameras. Bandyopadhyay et al. [9] present a strategy to track a single target, moving amongst obstacles, using a single moving camera fixed on a robot. Tang and Özgüner [71] add several restrictions to that problem, such as a limited number of robots with restricted fields of view versus a larger number of moving targets. In this case, the problem becomes that of minimizing the average duration each target is not observed.

The problem of optimal sensor layout for visibility is classically known as the art gallery problem [61]. There are also a number of related problems that have drawn a lot of attention. For example, recently, Kloder et al. [44] solve the problem of minimizing the probability of undetected intrusion, in environments with obstacles. Kolling and Carpin [47] present an algorithm to find the minimum number of robots needed to detect possible intrusions, in indoor environments with many rooms connected by multiple doors. When the intruding targets are "clever", the problem becomes that of the pursuit-evasion game. Some passive results determine bounds on the numbers of targets hiding behind obstacles, based on observations by moving cameras [84]. The harder active problem is to calculate a pursuer path that keeps the evader in sight [59]. Interestingly, the inverse problem has also been attempted: Marinakis et al. [55] infer the relative positions of a set of sensors by observing the motions of objects between their non-overlapping fields of view.
3.1.2 Contributions

We consider an environment with multiple fixed cameras and multiple agents moving amongst fixed obstacles. We assume that people can be tracked using, for example, a wide field of view camera fixed to the ceiling. The cameras are controlled to acquire high resolution video of individual people or small groups. We assume that the cameras can pan and tilt fast enough that latencies due to camera movement can be ignored. This is a fairly reasonable assumption, with state-of-the-art pan speeds reaching up to 300° per second [72]. The problem is thus to assign cameras to people (or small groups) and to select where to point each camera to produce the individual videos. Using our knowledge of people's positions, we employ bipartite matching to assign cameras to people over time to create a zoomed-in video, with as few camera switches as possible. The contributions of the research presented in this chapter [22] are as follows:

1. We assign a "surveillance monitor" per subject (or small group), versus a monitor per camera. We solve the camera assignment problem in scenes with multiple subjects and multiple cameras with overlapping fields of regard.

2. We combine results from computer vision, computational geometry and algorithms to compute a continuously updated camera assignment that maximizes the total time subjects are imaged.

It is not always possible to assign a camera to every individual person; for example, there can be more people than cameras in the surveyed scene. Even if this is not the case, people might cluster in the field of regard of a single camera. We resolve this issue by controlling the focal length of the camera to capture a group of neighboring people with a single camera, while still having a close-up view of each person in the group. This is implemented by grouping people, then assigning cameras to groups using weighted bipartite matching. We show in our results the effect of this approach on the number of covered people.

The rest of the chapter proceeds as follows: Section 3.2 defines the problem in more detail. Our approach to solving the problem is detailed in the ensuing three sections. Section 3.3 presents our modeling of the one-person-per-camera problem as a bipartite graph matching problem, and section 3.4 discusses the approach for the many-people-per-camera situation. Section 3.5 tests and evaluates the method with multiple runs using two different simulators. Finally, we conclude in section 3.6.

3.2 Problem Definition

The first problem we solve is the following. A surveillance camera network with nc cameras is set up in an environment containing obstacles. A group of np people walk freely in the room. Our goal is to construct a video for each person, generally using multiple cameras over time, that captures each person for as long as possible during their presence in the environment. This results in one (multi-camera) video per person, rather than one video per camera, being displayed to security operators, or further analyzed using computer vision methods. The cameras are fixed but can pan, tilt and zoom (PTZ cameras). We call the volume of the environment that a camera can capture, for a specific PTZ setting, the field of view (FOV) of that camera. We define the field of regard (FOR) of a camera as the union of its FOVs over all PTZ settings reachable in negligible time.
We assume that cameras have a large field of regard, but that a small field of view is generally required to image a person at the required spatial resolution. When there are fewer cameras than people, the objective becomes covering as many people as possible. We accomplish this by assigning a camera simultaneously to several people, with the constraint of producing a zoomed-in video that shows the covered people at high resolution.

Each camera is represented by its visibility polygon, the portion of the floor of the environment that lies inside its FOR. This enables us to solve the camera assignment problem in two-dimensional space, as shown in figure 3.1. By the earlier definition of a FOR, any person inside a specific camera's visibility polygon is viewable by that camera in negligible time, using pan, tilt and zoom operations. The visibility computation algorithm we employ works for convex polygonal obstacles on the ground plane. Any non-convex obstacle is thus first decomposed into convex parts.

Figure 3.1: Problem setup in 2D. People's trajectories are dotted and visibility polygons are in blended colors. The obstacles are shaded in black.

3.3 Bipartite Matching

3.3.1 Visibility Polygons

We first determine which areas fall within the field of regard of each camera. Using the terminology in [61], this is achieved by computing the point visibility polygon V(c) for each camera c. The environment with obstacles is modeled as a polygon P with holes. The visibility polygon V(c) is defined as the portion of P visible from c. We assume cameras can "see" as far as the boundaries of the room. The angular plane sweep algorithm, presented by O'Rourke [61], is used here.

Each camera ci is identified by its two-dimensional position (x_ci, y_ci), its field of regard (FOR) angle θi, and the orientation ψi of its FOR angle bisector. First, we compute the vertices of V(ci) by forming a set of rays from the center of projection of the camera to the vertices of the obstacles and the room. We only consider those rays whose angles α lie in the range {α : ψi − θi/2 < α < ψi + θi/2}, in addition to the two rays corresponding to the endpoints of the interval. We assume the cameras cannot see through obstacles, so we only consider those vertices that are not occluded by other obstacles. As shown in figure 3.2, there are two different cases for the rays: they either intersect an obstacle/environment boundary, or are tangent to an obstacle. In the first case, the intersection point is added to the list of vertices of V(ci). In the latter case, the ray is extended until the next closest intersection with an obstacle or environment boundary, to form an additional vertex. We do not consider the case where two or more vertices are collinear with the center of projection of a camera; this is a rare situation that can be avoided by moving such a camera infinitesimally.

Once the visibility polygon vertices are determined, its edges are formed by joining the vertices in counterclockwise direction. First, we sort the rays that connect the vertices to the camera's center in increasing angular order. Then, we resolve ties by rearranging vertices that belong to the same obstacle/environment boundary next to each other. Computing the visibility polygon for each camera can be achieved in O(n log n), where n is the total number of vertices of the obstacles and the environment; the log n factor is due to the cost of sorting the vertices. The whole process is conducted in a preprocessing stage.
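For concreteness, a heavily simplified sketch of the ray-casting step of such a sweep follows; it is not the dissertation's implementation. The obstacle representation, function names, and visibility test are assumptions, and the tangent-ray extension, FOR angle limits, and tie handling described above are omitted: each endpoint ray simply keeps its nearest hit.

import math

def ray_segment_hit(origin, angle, seg):
    """Distance along a ray from origin at `angle` to segment seg, or None."""
    (x1, y1), (x2, y2) = seg
    ox, oy = origin
    dx, dy = math.cos(angle), math.sin(angle)
    ex, ey = x2 - x1, y2 - y1
    denom = dx * ey - dy * ex
    if abs(denom) < 1e-12:                           # ray parallel to segment
        return None
    t = ((x1 - ox) * ey - (y1 - oy) * ex) / denom    # distance along the ray
    u = ((x1 - ox) * dy - (y1 - oy) * dx) / denom    # position on the segment
    if t > 1e-9 and 0.0 <= u <= 1.0:
        return t
    return None

def visibility_vertices(camera, segments):
    """Cast one ray per segment endpoint; keep the first hit on each ray.

    camera   : (x, y) center of projection.
    segments : obstacle and room-boundary edges as ((x1, y1), (x2, y2)).
    Returns approximate visibility-polygon vertices sorted by angle.
    """
    cx, cy = camera
    verts = []
    endpoints = {p for s in segments for p in s}
    for (px, py) in endpoints:
        a = math.atan2(py - cy, px - cx)
        hits = [t for s in segments if (t := ray_segment_hit(camera, a, s))]
        if hits:
            t = min(hits)
            verts.append((a, (cx + t * math.cos(a), cy + t * math.sin(a))))
    verts.sort(key=lambda av: av[0])
    return [v for _, v in verts]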
Figure 3.2: Computing the visibility polygon: rays r2 and r4 are tangent to obstacles, while r3 and r5 intersect them.

3.3.2 Graph Modeling

We assume the locations of people (x_i(t), y_i(t)), 1 ≤ i ≤ np, are known, for example from a wide field of view camera fixed to the ceiling. At each time instant t, we use the point-in-polygon test to determine the visibility polygons in which each person is located; in other words, the cameras that can view each person. This process is linear in the number of edges of all polygons. It is shown in [61] that the number of edges of a visibility polygon is linearly proportional to the number of edges of the obstacles (polygon holes), i.e. O(n). This means that the process of determining the cameras that can view a person at any instant can be conducted in O(n·nc) time. This process is done online, hence must be performed efficiently. When the number of people np and the number of cameras nc are small, this can be computed in O(n·nc·np) for all people, at each time instant. As np grows, a more efficient approach is to divide the plane into cells that directly index visibility polygons, and to build a two-dimensional data structure that efficiently assigns people to cells. An offline preprocessing stage then determines which cameras can view each cell.

We also suggest a 2D model for the occlusion of people by one another with respect to each camera. To compute the occlusion, we sort the subjects within each visibility polygon by the angles they make with their respective cameras. Then, for each subject, we find its neighbors that are located farther from the camera, and mark them as occluded. Given a camera, a subject S1 occludes a subject S2 if S1 is closer to that camera than S2, and S2 falls within the "shade" of S1. The shade angle of a subject i, of radius k at distance d_i from the camera, is computed as φ_i = tan⁻¹(k / √(d_i² − k²)), as shown in figure 3.3.

Figure 3.3: Computing occlusions of subjects by one another.

Finally, we model the camera assignment problem as matching in a bipartite graph G = ((V, U), E), as shown in figure 3.4. The vertices V represent the people and U the set of cameras. An edge e_ij ∈ E, between v_i and u_j, exists if camera c_j can view person p_i, that is, p_i falls inside the visibility polygon of c_j and is not occluded by another person. The problem is to match as many vertices as possible from V (people) to vertices in U (cameras), with the constraint that each camera is matched to at most one person. This is the classical bipartite matching problem [20].

3.3.3 Initial Matching

Once the bipartite graph is constructed, we compute an initial matching. Figure 3.4 shows an example of a perfect matching, where all vertices are matched. This is not always possible, and hence we seek a maximum cardinality matching, in which we try to match as many vertices as possible. Specifically, we are interested in matching as many vertices of V (people) as possible. This proceeds by visiting each unmatched v-node, then using a modified breadth-first-search (BFS) algorithm to reach the closest unmatched u-node, and finally matching both. The BFS algorithm is modified so as to restrict visited edges to sequences alternating between matched and non-matched edges; such sequences are known as alternating paths. When no more alternating paths can be found in the graph G, a maximum matching has been found. It is proved (see [62] for example) that a standard matching algorithm (Edmonds' [20]) correctly solves the matching problem for a bipartite graph in O(min(|V|, |U|)·|E|). In practice, nc ≪ np, so the initial matching complexity is O(nc²·np).
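A minimal sketch of this augmenting scheme follows. All names are illustrative (the dissertation's C implementation is not shown); the BFS restricts itself to alternating paths and flips the shortest one found.

from collections import deque

def try_augment(adj, match_v, match_u, v0):
    """Match person v0 via a shortest alternating path (modified BFS).

    adj     : dict person -> cameras that can currently view the person.
    match_v : dict person -> assigned camera; match_u is the inverse.
    Returns True (and updates both dicts) if an augmenting path exists.
    """
    came_from = {}                       # camera -> person that reached it
    queue = deque([v0])
    seen_v = {v0}
    while queue:
        v = queue.popleft()
        for u in adj.get(v, []):
            if u in came_from:
                continue
            came_from[u] = v
            if u not in match_u:         # free camera: flip edges back to v0
                while u is not None:
                    v = came_from[u]
                    prev_u = match_v.get(v)   # matched edge to undo, if any
                    match_u[u], match_v[v] = v, u
                    u = prev_u
                return True
            w = match_u[u]               # matched camera: continue from its person
            if w not in seen_v:
                seen_v.add(w)
                queue.append(w)
    return False

def maximum_matching(adj):
    """One augmentation attempt per person yields a maximum matching."""
    match_v, match_u = {}, {}
    for v in adj:
        try_augment(adj, match_v, match_u, v)
    return match_v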
Figure 3.4: An example of the problem model with a perfect matching.

3.3.4 Matching Update

As people move through the room, their positions within the fields of regard of cameras are updated. Whenever a person enters a visibility polygon, exits it, becomes occluded or becomes visible, the change results in updating the bipartite graph accordingly, and possibly recomputing the matching. There are four cases here, depending on the type of graph update that has occurred:

- If no edges included in the matching, called matched edges, are removed, nothing has to be done.
- If one or more matched edges are removed, we attempt to augment the matching.
- If one or more edges are added to an incomplete matching, we attempt to augment it.
- If one or more edges are added to a complete matching, nothing has to be done.

The matching algorithm is incremental by nature; the cost of each stage is O(|E|) = O(np·nc). This property significantly reduces the running time of subsequent matching computations, since in practice very few people change visibility regions at any time, as will be shown in the results section. Depending on the type of graph update that occurs, the search for an alternative matching proceeds as follows:

1. Determine a non-matched vertex v ∈ V (person).
2. Span a tree rooted at v using the modified BFS.
3. If non-matched leaf nodes in U (cameras) are reached, choose the closest one (u) to the tree root. The alternating path from v to u forms the new augmented matching.
4. If no leaf nodes exist which are non-matched vertices in U, we conclude that v cannot be matched.

At each time step, we only need to run this procedure for unmatched vertices. This represents an order of magnitude time savings over solving the matching problem for the entire graph at each update. In addition, matching the tree root to the shallowest non-matched leaf guarantees that we obtain the minimum number of re-assignments at each step. Figure 3.5 illustrates the two matching update cases that involve edge removal, and figure 3.6 shows how an incomplete matching is augmented when an edge is added.

Figure 3.5: A bipartite graph update that involves the two cases with edge removal. Matched edges are thickened. In the first update, a non-matched edge v1u1 is removed, so nothing needs to be done. In the second, the matched edge v2u2 is removed. The alternating path v2u4v3u2 (red) matches v2u4, unmatches v3u4 and matches v3u2.

Figure 3.6: A bipartite graph update that involves the addition of an edge to an incomplete matching. Matched edges are thickened. Initially, the maximum possible matching is incomplete. Later, an edge v4u1 is added. This permits completing the matching using the alternating path v4u1v3u2.
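Building on the try_augment sketch above, the update dispatch might look as follows; again, this is an illustrative assumption rather than the dissertation's code.

def update_matching(adj, match_v, match_u, removed_edges, added_edges):
    """Repair the matching after the visibility graph changes.

    removed_edges / added_edges : iterables of (person, camera) pairs.
    Only unmatched people need an augmentation attempt, so each update
    costs O(|E|) rather than a full recomputation.
    """
    for v, u in removed_edges:
        adj[v].remove(u)
        if match_v.get(v) == u:          # a matched edge disappeared
            del match_v[v]
            del match_u[u]
    for v, u in added_edges:
        adj.setdefault(v, []).append(u)
    for v in list(adj):                  # re-augment only where needed
        if v not in match_v:
            try_augment(adj, match_v, match_u, v)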
3.4 Minimum Cost Bipartite Matching

The previously described approach works efficiently when the number of people np and the number of cameras nc are comparable. However, a non-uniform distribution of people over cameras can result in many people being uncovered while several cameras are unassigned. This can also happen when np ≫ nc. In the following, we update our algorithm to deal with these cases.

3.4.1 People Grouping and Graph Modeling

The first step is to cover as many people as possible with a small number of groups, where each group is represented by a disk on the ground plane that covers the group. We constrain the groups (i.e. disks) to be of a small fixed size, since covering a large group would generally require a significant zoom-out, which would result in a video of low resolution. Such clustering problems are known to be NP-hard; see, for example, [6]. Several variants of the problem we address have been previously studied. In particular, Gonzalez [28] and Agarwal [6] suggest an O(n^√k) algorithm for the very similar problem of covering n points with fixed size squares, where k is the number of required clusters. This is impractical in most situations, such as ours. The authors also develop an approximate (not optimal) algorithm that runs in O(n log k).

We employ a greedy approximate algorithm that starts with the group (disk) that covers the maximum number of people, then the group with the next highest cardinality, and so on. This is done as follows (see figure 3.7; a code sketch of the procedure is given at the end of this subsection):

1. Center a disk over each person. This represents the locus of centers of disks (groups) that can cover that person.
2. The plane is discretized using a grid, with its cells' values incremented by one for each disk that overlaps them.
3. Considering the disks in arbitrary order, if the next disk overlaps a previous one, the cardinalities of the grid cells forming the intersection region are incremented and the cells are inserted into a priority queue.
4. Dequeue a grid cell; it has maximum cardinality, say K.
5. Delete the K nearest subjects, assign them to a group, then remove the corresponding disks, decrementing the values of the involved grid cells.
6. Repeat 4) and 5) until only grid cells with cardinality 1 remain.

Figure 3.7: The approach used to cluster people. The cardinalities of the intersections are marked to the left and the actual clusters to the right.

With np people and k clusters, this method requires O(np) insertions into the priority queue, hence a complexity of O(np log k). Once the clusters are formed, we model the problem as before using a bipartite graph G = ((V, U), E), with V here representing groups of people A_i (possibly consisting of a single person) and U still representing the set of cameras. An edge e_ij ∈ E, between v_i and u_j, exists if camera c_j can view the group A_i.

A special case exists when a group falls on the edge of a camera's visibility region, or when some of its people are occluded by others. In that case, no single camera can view all the people of that group, which defeats the purpose for which grouping was originally developed. We identify when this happens; then, using a variant of the grouping algorithm, the formed groups are split into the largest subgroup that can be viewed entirely by a single camera, then the next largest, and so on. Our experiments show that a slight improvement in the coverage rate of the algorithm is achieved only when the number of subjects is close to the number of cameras. In what follows, we do not use the splitting variant of the grouping algorithm.
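The greedy grouping can be sketched with a simplified variant in which the grid and priority queue bookkeeping of steps 2-3 are replaced by direct counting, and candidate group centers are restricted to people's own positions. The disk radius r and all names are assumptions made for illustration.

import math

def greedy_groups(people, r):
    """Greedily cover people with disks of radius r (simplified variant).

    people : list of (x, y) positions.
    Returns a list of groups (lists of person indices), largest first.
    """
    remaining = set(range(len(people)))
    groups = []
    while remaining:
        def covered(c):
            cx, cy = people[c]
            return [i for i in remaining
                    if math.hypot(people[i][0] - cx,
                                  people[i][1] - cy) <= r]
        # Pick the candidate center covering the most uncovered people.
        best = max(remaining, key=lambda c: len(covered(c)))
        group = covered(best)
        groups.append(group)
        remaining -= set(group)
    return groups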
3.4.2 Edge Weight Function

In order to view as many people as possible, priority is given to larger groups. This is achieved by adding weights to the graph edges. The weighted bipartite matching problem is that of finding a matching of G with the largest possible sum of edge weights [62], or, equivalently, the minimum sum of costs. The problem was initially solved by Kuhn [49] in O((|V| + |U|)³) time. Recently, an approximate algorithm that is an order of magnitude faster has been suggested by Schwartz for edge weights that satisfy certain properties [65].

To favor larger groups, we choose a cost function that assigns lower costs to edges associated with larger groups, and we find a matching of minimal cost. A typical edge cost function is c_ij = W_MAX − w_ij, where w_ij is the number of people in group i visible to camera j, and W_MAX is larger than any w_ij. We implement a variant of Kuhn's algorithm that is proven to find the optimal solution. The algorithm proceeds by creating an auxiliary non-weighted graph containing the minimum cost edges incident on each node. Then, a maximum cardinality (non-weighted) matching, similar to figure 3.4, is computed. If it is not complete, the next unused minimum cost edge is added and an augmented matching, similar to figure 3.6, is constructed. The process is repeated until all vertices of V (people) have been matched, or no more edges remain. This method works only with complete graphs having |V| = |U|; this is achieved by adding the missing edges with cost W_MAX, so that they are not chosen during the optimization.

The use of a cost function that favors large groups of people causes small groups and individual people to be unmatched to any camera for extended periods of time. To balance the amount of time each subject is covered, we add a time penalty to the edge cost function. The updated cost function is:

c_ij = W_MAX − w_ij − f(T_i)   (3.1)

where, for each person, a variable T_i accumulates the amount of time that person has not been assigned to any camera. For groups, we use the minimum positive T_i of the group's members; our experiments have shown that using the maximum T_i in groups increases camera switches, which produces a non-smooth video. At given time thresholds, the cost of an edge changes due to the time penalty, and the cameras are forced to switch. As a camera assignment changes, the variables T_i also change, which might immediately reverse a previous assignment, causing an infinite oscillation of cameras between two subjects. The function f(·) in equation (3.1) is used precisely to prevent this from occurring, keeping the camera switches smooth. Similarly, other factors can also be included in the cost function, such as the orientation of a person with respect to the camera, where a person facing a camera will have preference (lower cost) over a person facing away from it.
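Equation (3.1) might be realized as follows. The saturating form of f, the constants, and all names are assumptions made for illustration; the dissertation does not specify f's exact shape.

W_MAX = 1000      # assumed upper bound, larger than any group size
F_CAP = 50        # assumed saturation level for the time penalty

def time_penalty(t_uncovered):
    """Saturating penalty: grows with uncovered time, then levels off,
    so a fresh assignment cannot immediately be reversed (no oscillation)."""
    return min(t_uncovered, F_CAP)

def edge_cost(group_size_visible, min_uncovered_time):
    """Cost of assigning a camera to a group, per equation (3.1)."""
    return W_MAX - group_size_visible - time_penalty(min_uncovered_time)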
3.4.3 Matching Update

The changes occurring in groups, as people move through the environment, are more involved than in the one-to-one case presented earlier. We describe below how these changes affect the graph topology, and how we efficiently update the solution.

3.4.3.1 Change in weights only, same graph topology

If a matching edge cost increases or a non-matching edge cost decreases, augmenting the solution is attempted. First, the match of the incident node is removed. Then, a tree rooted at this node is spanned using the modified BFS as before. The branches of this tree are the potential alternating paths. Along each path, the cost of an undone matching edge is subtracted, while the cost of a newly added matching edge is added. If a negative path is found, it means that we can find a smaller cost matching; the smallest cost branch is then used to augment the solution. If no negative paths are found, no better solution can be obtained.

If a matching edge cost decreases, nothing needs to be done: the cost was already minimal using that edge, and now it is even smaller. Similarly, if a non-matching edge cost increases, no new solution needs to be searched for: the current minimum cost solution did not use that edge, and a new solution obviously will not use it either, now that its cost has increased.

3.4.3.2 Change in graph topology

If an edge is added, this is equivalent to a cost decrease (from ∞ to c) of a non-matching edge. If an edge is removed, this is equivalent to an edge cost increase (from c to ∞): if it was a non-matching edge, nothing has to be done, while if it was a matching edge, an augmented solution is constructed. Finally, adding/removing a node to/from the graph (e.g., a group split/merge) amounts to adding/removing at most nc edges.

3.4.3.3 Time requirements

Since the solution is augmented here also using a modified BFS, the time complexity of each update is similarly O(|E|) = O(np·nc). When we have many more people than cameras (np ≫ nc), nc can be considered a constant, and the update process is linear in the number of people np. Adding up the complexities of all the components of our algorithm, we see that the bottleneck is in grouping people, at O(np log k). This is the overall asymptotic complexity of the algorithm when grouping is required; when it is not, the overall complexity is linear in np.

3.5 Implementation and Results

To evaluate our method under different conditions, we constructed a simulator and, in addition, used the simulator developed by van den Berg et al. [73]. In our simulator, obstacles' locations and people's trajectories and velocities are manually traced, which yields more realistic agent trajectories. We have simulated two indoor surveillance scenarios: overlapping and non-overlapping fields-of-regard cameras. The scenarios correspond, loosely, to an idealized bank and an office interior, respectively, with about 25 people surveyed by up to 8 cameras. On the other hand, van den Berg's simulator easily generates larger numbers of "more artificial" trajectories. Using it, we generated up to 60 people with 40 surveillance cameras, moving among several setups of obstacles. This results in graphs with over 2000 edges. An example setup from each simulator is shown in figure 3.8.

Figure 3.8: Two example setups used in testing our method. Left: inside a bank, generated using our simulator (overlapping field-of-regard cameras); Right: a grid of obstacles, generated using van den Berg's simulator. The linear segments are people's trajectories.

Under each of the tested scenarios, we perform multiple runs, varying the number of cameras and people. For each subject, we compute its coverage, i.e. the ratio of the number of frames in which it was assigned to a camera to the number of frames for the entire duration of its presence in the scene. Then, for each run, we compute the mean and variance of the coverages of the subjects present during that run. The mean and variance are weighted by the subjects' trajectory lengths. Figure 3.9 shows the mean coverage for the two setups of figure 3.8, under different grouping strategies, as the number of people and cameras vary.
Group diameters are measured in units of people's average sizes. In the bank environment (top), increasing the group size has a significant effect on coverage; the large amount of open space enables this. However, in the obstacle grid environment (bottom), coverage increases with group size at a slower rate, because the narrow "hallways" between obstacles do not allow the formation of large groups.

Adding a time penalty to the edge cost function, as in equation (3.1), distributes the coverage more evenly amongst the people who have not been tracked in a while. We have experimented with different group sizes, in multiple environments and with different problem sizes (numbers of people and cameras). As expected, adding the time penalty to the cost function reduces the standard deviation of the coverage, due to a more even distribution. The standard deviation is further reduced when there are many more people than cameras. This is because in such situations, and in the absence of a time penalty, the cameras tend to remain assigned to the same subset of people to reduce camera switches, rather than rotating to different people. Sample results for the two example setups are shown in figure 3.10. We must note, however, that the matching algorithm already produces the optimal result with respect to the mean of the coverage. Thus, adding a penalty term to obtain a more even distribution of coverage also results in a less than optimal mean. We choose our penalty term carefully to balance the desired reduction in standard deviation against the undesired reduction in mean. In our results, when adding the selected penalty term, a slight reduction in the mean is noticeable only for people-to-camera ratios of 3 and up.

Figure 3.9: Mean coverage as the group size (diameter) varies. Top: bank scenario from our simulator. Bottom: regular grid of obstacles from van den Berg's simulator.

We implemented the algorithms in C, on a Pentium D 2.8 GHz processor with 1 GB of memory. Computing the bipartite matching for multiple graph sizes is done in real time; the timings are shown in figure 3.11. These are much lower than the asymptotic bounds developed in sections 3.3 and 3.4, showing that in practice people remain in place for some time before changing visibility regions, which reduces the number of graph/matching updates.

Figure 3.10: Effect of adding a time penalty on the standard deviation (= √var) of the coverage. Top: bank scenario (our simulator) with a group size equal to 5 people. Bottom: regular grid of obstacles (van den Berg's simulator) without grouping.

3.6 Conclusion

We have presented a novel approach to the problem of assigning multiple people moving amongst obstacles to multiple cameras, using a bipartite matching algorithm [22]. We show how our method benefits from the incremental nature of the algorithm to update the matching of cameras to people online in linear time. When the number of cameras is comparable to the number of people in the surveyed scene, we have applied our method to construct individual zoomed-in videos for each person, as shown in figure 3.12. When we have many more people than cameras, we cluster people and attempt to cover as many as possible, without compromising their close-up view.

Figure 3.11: Matching times for various graph sizes (van den Berg's simulator).

Figure 3.12: Snapshot of our simulator: 2D top view with visibility regions, along with 3D zoomed-in videos for the first 3 subjects.
Simulation results validate our method.

Chapter 4
A Two-Player Pursuit-Evasion Game

4.1 Introduction

Finding subjects in environments with obstacles and keeping them under watch is a problem with many interesting surveillance and safety applications. In the area of providing care for the elderly and disabled, tracking subjects is a prerequisite to recognizing human activities, and thus to providing a safer home for the aged (for example [19]). Intent recognition, based on following an agent and knowing the geometry of the environment's obstacles, has been the subject of study in [57].

4.1.1 Related Work

In the area of surveillance and security, in which we are more interested, the moving subject can be an intruder [45], [47], [44], an evader [84], [59], [9], or simply a moving target that is to be put under watch. We also note that the term target tracking is commonly used in the robotics literature (for example [36]) to mean maintaining the visibility of that target. We will stick to that terminology in the rest of this thesis. The broad class of surveillance applications spans several interesting problems:

- Preventive art gallery problems, where the task is to find the optimal sensor layout that prevents intrusions into the surveyed scene. The sensors can either be fixed [61] or moving (mounted on robots) [44].
- Graph clearing problems, which solve for the optimal number of robots needed to verify that an environment with obstacles is free from intruders (decontaminated) and that it will not get recontaminated [45], [47], [46], [48].
- Searching environments for initially invisible intruders/evaders. The passive problem [84] estimates the approximate hiding locations of intruders and their numbers, while the active problem decides how to move the robots to search the environment [50].
- Tracking initially visible evaders, maintaining their visibility at all times if possible [10], [11], or maximizing the time they are visible otherwise [9].

We elaborate here on the fourth problem above, the pursuit-evasion game, since it is the most closely related to our work. In two-player games, when the layout of the environment's obstacles is unknown [29], [9], the pursuer collects measurements to determine the locations of the local obstacles. Based on the visibility from its current position, the pursuer then determines the instantaneous next step to take in order to increase the evader's escape time. The algorithm proceeds in a greedy manner to maximize the total time the evader is seen by the pursuer. Most of the work in the area of maintaining the visibility of evaders has been done in 2D, although some attempts have been made in 3D as well [8].

When the number of players increases, the game becomes more complex. Vieira et al. [76] solve the problem of finding the optimal time for multiple pursuers to capture multiple evaders. The complexity is exponential in the number of players. They also present a partition of the game into multi-pursuer single-evader games for a faster but near-optimal solution [77]. Another variation of the game is to combine it with the intruder search problem into one model, as suggested in [36].

A large amount of literature has been devoted to two-player pursuit-evasion games, where one pursuer maintains the visibility of one evader, in an environment of obstacles with known layout. Bhattacharya et al. address that problem, with the additional condition that the pursuer loses the game immediately if it loses sight of the evader at any time.
As in [9], the authors also assume that the pursuer has complete information about the instantaneous position and velocity of the evader. Under these conditions, the authors solve the decidability problem around one corner [10]. The problem is said to be decidable if the outcome of the game can be determined for any initial positions of the pursuer and evader. However, solving around one corner (i.e. two infinite edges meeting at a point) is restrictive: there is no geometry "behind" which the evader can hide. The authors then extend their work in [11] to deal with more general environments with convex obstacles. However, in those environments, they are only able to provide a partial solution to the decidability problem. They provide a sufficient condition for escape of the evader; in other words, they find a region of initial positions such that, if the pursuer lies outside it, the evader escapes for sure. If the pursuer lies inside that region, however, the result of the game is undecidable. Their algorithm splits the environment into regions that depend on the geometry of the obstacles.

Under the same setup, Murrieta-Cid et al. [60] develop a tracking strategy, also by creating a partitioning of the environment that depends on the geometry of the obstacles. Later, they present a simpler convex partitioning [59] that is modeled as a mutual visibility graph. Their algorithm alternates between an evader assumed to take the shortest step to escape, countered by a pursuer that computes a prevention-from-escape step. The assumption of the evader taking the shortest step to an escape region is a reasonable one, since if the pursuer is able to counter it, it will be able to counter any longer steps. However, this strategy is only locally optimal. The authors thus find all possible permutations of escape sequences, along with the corresponding prevention-from-escape steps. The global solution concatenates adjacent "partial local paths". This leads to an interesting result: to decide which player wins, every feasible ordering of local paths has to be checked. The authors reduce the traveling salesman problem to the current problem, concluding that it is NP-complete.

4.1.2 Contributions

Motivated by the work in [10], [11] and [59], we propose a novel algorithm to determine the outcome of the tracking problem. Our approach relies on partitioning the surveyed environment into a uniform grid that is independent of the geometry of the obstacles. At each time instant, players can move one or more grid cells away, according to their speeds. Given any initial pair of positions for the pursuer and the evader, we provide a complete solution to the problem of deciding which player wins, in computationally feasible time. In addition, if it is decided that the pursuer wins, i.e. it is able to maintain visibility of the evader at all times, we also find an optimal trajectory for it that always keeps the evader as far as possible from positions from which it can escape. We are currently exploring the effect of varying the grid cell size on the accuracy of the results and the efficiency of computing them.

In the rest of this chapter, we proceed with a more detailed definition of the problem in section 4.2, followed by a couple of examples that illustrate the construction of our algorithm in section 4.3. Then, we detail our approach more formally in section 4.4. Finally, we suggest future work on this problem in section 4.5.
4.2 Problem Definition

Several variants of the pursuit-evasion problem have been attempted. In this work, we are interested in a setup similar to that of [11] and [59]. In what follows, we present in detail the environment setup and the conditions on the players.

4.2.1 Environment Layout and Rules of the Game

We deal here with indoor environments that contain obstacles obstructing the view of the players. In the 2D domain in which we work, the obstacles are represented as polygons with known geometries and locations. This is a two-player game, with one pursuer and one evader, both represented as points. Each player knows exactly both the position and velocity of the other player. Both players move at bounded speeds and can maneuver to avoid obstacles. They are equipped with sensors that can "see" in all directions (omnidirectional) and as far as the environment boundaries or obstacles, whichever is closer. We will see later that we can restrict this property with a simple variation of our algorithm.

The pursuit-evasion game proceeds as follows. Initially, the pursuer and the evader are at positions from which they can see each other. As in [11], we define two players to be visible to one another if the line segment that joins them does not intersect any obstacle. The goal of the game is for the pursuer to maintain visibility of the evader at all times. The game ends immediately if, at any time, the pursuer loses sight of the evader; in that case, we say that the pursuer loses and the evader wins.

4.2.2 Space Discretization

The key idea in our algorithm is our space partitioning method. Unlike earlier work ([11], [59]), our partitioning does not depend on the geometry of the environment and obstacles. Instead, it depends on the motion of the players. Using a time discretization similar to [9], we consider the positions of each player at time intervals of Δt units. If the velocity of a player during a certain time interval is v, the distance traveled during the discrete time step will be Δd = vΔt. We recall that the players' velocities are bounded and, by the same token, can also be discretized as 0, v_min, ..., v_max. We thus choose to partition the space into a uniform grid with cell edge length:

l = v_min Δt.   (4.1)

Figure 4.1: A simplified layout that shows polygonal obstacles, players (points) and the environment partition with a grid.

Figure 4.1 shows the suggested grid discretization superimposed on an example environment. We call the set of grid cells that are reachable by a player in one time step the neighborhood of that player. A player can move zero or more cells at a time, based on its current velocity; this enables us to model different velocities. Different choices of norms can be used in computing the neighborhoods. Using the L2 norm (Euclidean distance, figure 4.2) takes slightly longer to compute neighborhoods but gives more "natural" results, while the L1 norm (Manhattan distance, figure 4.3) is faster and easier to compute, but yields a zigzag-like player path.

Figure 4.2: A Euclidean-distance neighborhood of radius 2 centered around the black square.

Figure 4.3: A Manhattan-distance neighborhood of radius 2 centered around the black square.
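A minimal sketch of the two neighborhood types follows, in grid-cell units; the function names and the obstacle predicate are assumptions made for illustration.

def neighborhood(cell, radius, grid, norm="L1", free=lambda c: True):
    """Cells reachable in one time step from `cell`.

    cell   : (row, col); radius : max cells moved, set by current speed.
    grid   : (n_rows, n_cols); free(c) should reject obstacle cells.
    norm   : "L1" (Manhattan) or "L2" (Euclidean).
    """
    r0, c0 = cell
    n_rows, n_cols = grid
    cells = []
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            if norm == "L1" and abs(dr) + abs(dc) > radius:
                continue
            if norm == "L2" and dr * dr + dc * dc > radius * radius:
                continue
            r, c = r0 + dr, c0 + dc
            if 0 <= r < n_rows and 0 <= c < n_cols and free((r, c)):
                cells.append((r, c))
    return cells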
4.3 Motivating Examples

Before describing our algorithm in detail, we present the intuition behind it, illustrated by two simple examples. As mentioned earlier, we assume the evader is initially within the line of sight of the pursuer; if it is not, the game ends trivially with the pursuer losing instantly.

Let us move, then, to the more interesting case. Without loss of generality, we illustrate the algorithm here for the case of players with equal speeds that move along Manhattan or "city-block" trajectories; generalizations to other cases have been discussed above. Given an initial position of the pursuer, let the evader be initially visible but able to escape in one step, i.e. by moving to a neighboring cell. We call such an evader location Bad(1) with respect to the initial pursuer position (i.e. the pursuer/evader pair is Bad(1)). However, during the same time step, the pursuer may take a similar step to any one of its neighboring cells. This, in turn, may result in the evader now being two steps away from "safety" (a hiding position). Thus, we consider an evader position to be Bad(1) with respect to a pursuer position if the evader can reach a hiding cell by moving one step inside its neighborhood, for whichever neighboring cell the pursuer is able to move to. Inductively, we call a location Bad(2) if it satisfies the same conditions with respect to Bad(1) locations. We have shown the intuition here for a single initial position. We note, however, that we cannot begin labeling any Bad(2) cell until we have computed the Bad(1) maps for every initial position of the evader. This guarantees that all possible pursuer-evader location pairs are considered. We stop the search when either of these two conditions is met:

1. All the environment cells are labeled Bad(x) for some finite value of x. This means that there is no winning strategy for the pursuer starting from its initial position.
2. We reach a state where Bad(i + 1) = Bad(i) for all labeled cells, with some cells unlabeled. This means that if the evader is initially located outside of these cells, it will never be able to go out of sight of the pursuer.

4.3.1 A "Level-0" Game

We illustrate our method first with the example shown in figure 4.4. We consider a simple partition where a grid cell is as big as an obstacle; we defer to a later point the analysis of how different grid cell sizes affect the solution. In this model, we define two cells to be visible to one another if the line segment joining their centers does not intersect an obstacle. If the line segment is tangent to an obstacle, we still consider the cells visible to one another. Other visibility models can be substituted as desired. Given equal and constant speeds for both players, it is clear from the shown setup that if the players are initially within sight of each other, they will remain so whatever path the evader takes. If the initial positions are such that the players are not within line of sight (figure 4.5), the game ends spontaneously. We call a game with such an outcome a level-0 game.

Figure 4.4: Initial winning positions for a level-0 game.

Figure 4.5: Initial losing positions for a level-0 game.

We trace our algorithm for the initial position of the pursuer shown in figure 4.4. We emphasize again that we cannot proceed to the following iteration of the algorithm until all possible initial pursuer positions have been considered. For simplicity, we illustrate only one initial position, since the analysis from the other positions is similar by symmetry. Let p = (x_p, y_p) be the pursuer's location and e = (x_e, y_e) the evader's. We define a function Bad(·,·,·),

Bad(p, e, i) : R² × R² × N → {0, 1}   (4.2)

that evaluates to 1 if an evader at e can escape, in i steps or fewer, a pursuer at p. It evaluates to zero otherwise, i.e.
when the outcome of the game is that the pursuer wins. Our goal is to compute Bad(p, e, i) for all p and e.

Figure 4.6: Computing the labeling function Bad(N(p), e, 0) for the neighborhood of a pursuer located at position p, where p′ and p″ are the only possible neighbors of p due to the obstacle.

When i = 0, Bad(p, e, 0) = 1 for all pairs of cells (p, e) that are mutually non-visible. As elaborated earlier, given an initial (p, e) pair, we only consider the cells non-visible from all of the positions neighboring the pursuer's initial position. This is denoted Bad(N(p), e, 0), where N(p) is the neighborhood of p. It is computed as the conjunction of the values Bad(p′, e, 0) for all p′ in the neighborhood of p. Figure 4.6 illustrates this step. The notation in the figure might be confusing at first; to simplify it and illustrate the tracing of the algorithm at each iteration i, we choose a single pursuer position p, then label with i all the cells for which Bad(p, e, i) = 1. The other cells, for which Bad(p, e, i) = 0, are left blank.

The next step is to find all the positions that can reach, in at most one time step, a cell of Bad(N(p), e, 0). An evader in any such position is able to win in no more than one step. Using our definition (equation 4.2), these positions are denoted Bad(p, e, 1) and labeled by 1's in figure 4.7. At this point, we notice that Bad(p, e, 1) = Bad(p, e, 0). This means that after one time step, and for all possible pairs of moves that either player can make, the set of cells that were initially non-visible to the pursuer (escape cells) cannot be modified. This indicates that the outcome of the game cannot be changed in favor of either player. When we reach that state, we conclude that an evader in any labeled cell wins (in this case instantaneously), and loses in any other cell. We call this a steady state.

Figure 4.7: Extending Bad(N(p), e, 0) into Bad(p, e, 1). Steady state is reached at that point.

More formally, Bad(p, e, 1) can be expressed in terms of Bad(p, e, 0) using the following recursive equation:

Bad(p, e, 1) = 1 ∀(p, e) : ∃e′ ∈ N(e) s.t. Bad(p′, e′, 0) = 1 ∀p′ ∈ N(p).   (4.3)

Since N(e) and N(p) contain e and p respectively, Bad(p, e, 1) = 1 for all pairs (p, e) for which Bad(p, e, 0) = 1.

4.3.2 A "Level-1" Game

We now show an example of a game that reaches steady state after two "rounds" or iterations, namely a level-1 game. The setup used in this game is shown in figure 4.8. In this setup, it is easy to see that if the evader is 1 or 2 cells ahead of the pursuer, it can never escape. Otherwise (3 or more cells ahead), the evader can break the line of sight and win.

Figure 4.8: An example of a level-1 game where the pursuer loses in 2 steps.

Figure 4.9: Computing the labeling function Bad(N(p), e, 0) for the neighborhood of a pursuer located at position p, where p′ and p″ are the only possible neighbors of p due to the obstacle.

We trace the algorithm here to show why two iterations are needed to decide the outcome of the game. We consider the case where the initial position of the pursuer is as shown in the leftmost diagram of figure 4.9; other initial positions are handled similarly. We start by computing the cells non-visible from any of the positions neighboring the pursuer's initial position, Bad(N(p), e, 0). This is the intersection of the Bad(p′, e, 0) maps for p′ ∈ N(p). The positions from which an evader can reach a cell in Bad(N(p), e, 0) in at most one step are shown in figure 4.10.
Unlike the previous example, we have not reached a steady state here, since Bad(p, e, 1) ≠ Bad(p, e, 0). In that case, we need to compute the Bad(p′, e, 1) maps for p′ ∈ N(p), each of which requires computing the corresponding Bad(p_i, e, 0) maps. Recall that Bad(N(p), e, 1) is the conjunction of the values Bad(p′, e, 1) for all p′ in the neighborhood of p. The process is shown in figure 4.11.

Figure 4.10: Extending Bad(N(p), e, 0) into Bad(p, e, 1).

Figure 4.11: Computing the labeling function Bad(N(p), e, 1) out of the maps Bad(p_i, e, 1) for neighbors of position p.

We start the third iteration by finding all the positions that can reach a cell in Bad(p, e, 1) in at most one step, namely Bad(p, e, 2) (figure 4.12). Here, we have Bad(p, e, 2) = Bad(p, e, 1), meaning that two time steps after the initial start time do not make more "escape" cells available to the evader than one time step. We thus break the iteration at that point and conclude that we have reached steady state. Any evader initially in a labeled cell has some way of escaping. We also see in figure 4.12 that "bad" cells are indeed three or more cells away from the pursuer's initial position, as intuitively guessed.

Figure 4.12: Extending Bad(N(p), e, 1) into Bad(p, e, 2). Steady state is reached at that point.

At this point, we generalize the recursive equation (4.3) to any iteration:

Bad(p, e, i + 1) = 1 ∀(p, e) : ∃e′ ∈ N(e) s.t. Bad(p′, e′, i) = 1 ∀p′ ∈ N(p).   (4.4)

Once the cells are classified as either "bad" or "non-bad", for every initial position that the pursuer can be in, we have a complete answer to the question of deciding which player wins. However, we can use the information we have already computed to determine the optimal tracking strategy that the pursuer should follow. By labeling every non-bad cell with its distance to the closest bad cell (i.e. the shortest escape path), the goal of the pursuer is to choose its next step so that the associated map has the evader in a non-bad cell with the largest label.

4.4 Game Result and Tracking Strategy

We present here the formal algorithm for determining the existence of a winning strategy for the pursuer. Along the way, we construct visibility maps that determine partial solutions. Once the algorithm terminates, we show how we can use the final visibility map to construct the actual winning strategy, if one exists. We prove the correctness of our algorithm and provide bounds on its space and time complexity.

4.4.1 Algorithm

Algorithm 1 computes the visibility map of an environment, which determines, for every pair of cells, whether or not a winning strategy exists. In this algorithm, neighboring cells are computed based on the distance norm that is used. In addition, a cell is not considered part of a neighborhood if it is occupied by an obstacle.

Algorithm 1. Determine if a winning strategy exists.
Input: Environment with obstacle coordinates.
Output: A matrix that determines, for every pair of locations, whether a winning strategy exists.
Data structure: Two N × N matrices M and M′.

1. Partition the environment into a uniform grid of N cells.
2. Initialize M and M′ to zero matrices.
3. for (0 ≤ p, e < N): set M(p, e) = 1 if cell e is not visible from cell p.
4. finish = FALSE.
   repeat
5.   for (0 ≤ p, e < N): M′(p, e) = ∧_q M(q, e) over all q ∈ Neighb(p).
6.   for (0 ≤ p, e < N): set M′(p, e′) = 1 for all e′ ∈ Neighb(e) s.t. M′(p, e) = 1 and M′(p, e′) = 0.
7.   if (M′ = M) then finish = TRUE, else M = M′.
   until finish = TRUE.

At the conclusion of the algorithm, if M(p, e) = 1, the initial player positions p and e are such that there exists a strategy by which the evader can escape, and the pursuer loses; if M(p, e) = 0, the evader can never escape, and the pursuer wins.
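A direct transcription of Algorithm 1 might look as follows. This is a sketch under the convention used above, where M(p, e) = 1 marks pairs from which the evader can escape; the visibility test and neighborhood function are assumed to be supplied by the discretization.

import numpy as np

def winning_map(n_cells, visible, neighb):
    """Fixed-point iteration of the Bad labeling (Algorithm 1).

    visible(p, e) -> bool : line-of-sight test between cell centers.
    neighb(c) -> cells reachable in one step (obstacle-free, includes c).
    Returns M with M[p, e] == 1 iff an evader starting at cell e can
    eventually break the line of sight of a pursuer starting at cell p.
    """
    rng = range(n_cells)
    # Step 3: 1 marks mutually non-visible (escape) pairs.
    M = np.array([[0 if visible(p, e) else 1 for e in rng] for p in rng],
                 dtype=np.uint8)
    while True:
        # Step 5: a pair stays bad only if it is bad for every pursuer move.
        A = np.array([[min(M[q][e] for q in neighb(p)) for e in rng]
                      for p in rng], dtype=np.uint8)
        # Step 6: the evader may additionally take one step of its own.
        Mn = np.array([[max(A[p][ep] for ep in neighb(e)) for e in rng]
                       for p in rng], dtype=np.uint8)
        if np.array_equal(Mn, M):      # Step 7: steady state reached
            return M
        M = Mn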
To determine the tracking strategy in the latter case, we add the following extension to the previous algorithm:

i = 1.
while (∃ M(p, e) = 0)
    Set M′(q, e) = i + 1 ∀q ∈ Neighb(p) s.t. M′(p, e) = i and M′(q, e) = 0.
    i = i + 1.
end

4.4.2 Proof of Correctness

We first note that our algorithm always terminates. This happens either when the visibility matrix is not modified (M′ = M) or when all N cells are labeled 1. Next, we prove the correctness of the algorithm by induction. At iteration 0 (step 3 of the algorithm), M does indeed contain the decision of the game, since by definition, if M(p, e) = 1, the pursuer cannot see the evader and the game ends immediately in the evader's favor, and vice versa. Now, let the matrix M contain the outcome of the game at iteration i. If M′ ≠ M, there exists a step that the evader can take to reach a cell from which it can escape in i steps, i.e., an escape path of length i + 1. All such steps are recorded in M′, which becomes the new decision matrix. However, if M′ = M, no additional step can improve the situation of the evader. It is useless to go beyond that point, and we stop.

4.4.3 Space and Time Complexity

Algorithm 1 requires two N × N matrices to store the temporary visibility maps. This is an O(N²) space requirement, where N is the number of cells in the discretizing grid. With regard to time complexity, steps 3, 5, 6 and 7 are each O(N²). We are interested in the three steps that occur inside the loop. The worst case is to have all cells labeled. We argue that this case requires O(N) iterations, since each iteration labels one "layer" of O(N) cells around the currently labeled block, thus labeling the entire environment in O(N) iterations. This makes the total time complexity O(N³).

4.5 Additional Work

We have presented the algorithm, verified its correctness and derived its polynomial complexity. At this point, we need more extensive experiments with larger environments. The simulators used in chapter 3 can also be used here for that purpose. The main point that needs to be analyzed is how the grid size affects the results. We have just shown that the complexity of the problem depends on the number of cells in the grid. It is thus desirable to reduce that number, as long as doing so does not affect the accuracy of the results. Consider a large cell size, or equivalently a small number of cells. If the cell size is larger than some obstacles, the accuracy of the visibility matrix is questionable.

We would also like to consider coalescing neighboring cells that share the same visibility properties. The cells will not physically disappear, since they represent the step each player takes within a time unit. However, this coalescing can significantly reduce redundant computations. We are motivated here by the work of [59] and the way they partition the environment. Finally, some minor variations can also be considered when comparing the performance of the algorithm under different conditions. We have already seen that the shape and size of the neighborhood model different distance norms and players' velocities, respectively. We can also restrict the players' visibility, for example not "seeing" beyond a certain distance, by reducing the number of labeled cells in the initial visibility matrix.

Chapter 5

Detecting Unusual Activity in Surveillance Video

5.1 Introduction

In this chapter, we present some older work that was done in the area of anomaly detection.
The research presented in this chapter is meant to fit within the framework of the future control room introduced in chapter 1 (see figure 1.2). We suggest a technique that can be used to "score" surveillance videos according to their level of interest to human operators. Given the huge amount of available surveillance video, it is necessary to filter out video where little of interest occurs [34]. For this purpose, we develop a video scoring technique based on modeling the co-occurrence statistics of moving objects. In particular, we identify unusual events as those that are rare or non-frequent according to a learned model.

Unlike previous work, special attention is given here to situations where each agent's behavior might not be independently unusual, but where it is the joint occurrence that creates an unusual situation. This problem is solved by computing co-occurrence statistics in both time and space, rather than just in time. To illustrate the idea further, consider the example of two cars crossing simultaneously while their trajectories intersect (figure 5.1). Each car's trajectory would be considered usual, had they crossed independently. Experiments are performed on synthetic data, where the time intervals and locations of unusual activities are reported.

Figure 5.1: Introductory example showing the type of anomalies we are interested in. The first two events are usual, but the third is not. Although it consists of a combination of exactly the two left events, it is their simultaneous occurrence that makes it unusual.

The less frequent an event is, the more unusual it becomes, and the higher the score assigned to the involved video segment. Within the context of the display wall discussed in chapter 1, the scores assigned to the videos are used, along with other constraints, to control the mapping of videos to the display space. Higher-score videos are assigned a larger area of the display wall, and a location that attracts more of the operator's attention.

The rest of this chapter is organized as follows: section 5.2 reviews in some detail earlier work that has been done in the area of anomaly detection in video. In section 5.3, we formally define the problem, then present our approach to solving it in section 5.4. We show results in section 5.5, then suggest future extensions in section 5.6.

5.2 Literature review

5.2.1 Definition of "unusual events"

Much effort has been devoted to the area of activity recognition. However, the problem of detecting generically interesting activity is still very much an open problem [82]. Depending on the type of application and the approach employed, interesting activity has been interchangeably called "abnormal activity" [19, 51, 75, 81, 82], "unusual activity" [39, 86], "atypical behavior" [54, 58], "inexplicable behavior" [33, 34], and "anomalous activity" [31, 56]. We choose the term unusual events to refer to non-frequent or rare ones.

To solve the problem of detecting unusual events, we must first give a definition of that term. Human operators monitoring surveillance cameras might differ greatly in their definitions. They can be influenced more by people's appearance than by their behavior [34]. To avoid this bias, authors have based their definitions on the model they choose to represent activity. Stauffer et al. [67] define unusual events as those which have rarely or never occurred. Hung et al. [39] detect the least frequent occurrences corresponding to peaks of salient motion. Vaswani et al.
[75] define abnormal activity (or suspicious behavior) as a slow or drastic change in a model whose parameters are unknown. Morris et al. [58] assume that low speed and low distance to objects at trajectory landmarks are unusual. Dee et al. [33, 34] consider trajectories that do not proceed towards a goal to be unusual. Xiang et al. [81] define abnormality as behavior patterns that are not represented by sufficient samples in a training dataset. Micheloni et al. [56] focus on the time spent by an object in the observed scene, where "too long" or "too short" a time needed to perform a specific activity points to an anomalous event. Duong et al. [19] also focus on state duration, and not on state order, in the model they use for activity. Zhong et al. [86] consider events (video segments) to be unusual if there are no similar events. Liao et al. [18, 51] detect abnormal behaviors by noting when an agent departs from a familiar routine, by matching to a learned model. Hamid et al. [31] assert that anomalous activity is rare and dissimilar from learned models of regular activities.

5.2.2 Detecting unusual activity

Several approaches have been attempted to detect unusual activity in video. Zhong et al. [86] used document clustering techniques, with video segments as documents and prototype features as keywords. The clusters with small inter-cluster similarity are considered unusual events, according to a developed similarity measure. This approach does not model interactions between spatially separated objects. It fits under the topic of anomaly detection, where the training data is clustered and "distant" objects are identified as outliers [70].

Morris and Hogg present a simple statistical approach in [58]. They test their approach on a scene with pedestrians interacting with cars parked in a parking lot. With this scenario, they assume that low speed and low distance of pedestrians to landmark points (parked cars, etc.) are unusual. They form a model by accumulating probabilities of speed and distance, and test it on sample trajectories. This model might be too restrictive, tailored to the specific application presented.

Dee and Hogg develop a measure of intentionality to filter surveillance data. Agents' behavior (humans) is modeled using a state transition diagram, with high costs on transitions that lead to inexplicable goals. Low-cost footage is removed, leaving only video that is potentially interesting. It is not clear how transition diagrams can be developed to model different scenarios, but this work is actively pursued in [33, 34, 35]. Hamid et al. [31] use natural language processing techniques to detect and explain anomalous activity in video. Activity is represented as histograms of event n-grams. With a value of n chosen to be 3, this is equivalent to breaking the video into overlapping chunks of 3 time units. Micheloni et al. [56] train an active event database with observed simple events. Composite events are formed by arranging simple events in a finite automaton. Finally, activities are timed and classified as unusual if their duration is too short or too long. Again, both of these methods classify isolated events.

Co-occurrence statistics have been used in different ways in the literature. Stauffer et al. [67] accumulate joint co-occurrences of codebook symbols to classify video sequences. Co-occurrences are computed between pairs of symbols in the same sequence; thus multiple-object interactions are not modeled.
In addition, for large codebooks, the method becomes inefficient due to the large co-occurrence matrices. More recently, Ermis et al. [23] also use co-occurrence statistics, but only after first "subtracting" normal motion patterns. Hung and Gong overcome both of these problems in [39]. First, salient features are extracted and their local spatio-temporal co-occurrences are computed for each frame. Then, higher-level spatial co-occurrences are computed for the rate of change of the first ones. Finally, the least frequent occurrences define the salient events, which correspond to unusual activity. It is not mentioned how the choice of the time interval for the local temporal correlations affects the results. The authors chose a local temporal neighborhood of 3 frames and noted only that it affects self-correlations. It is also questionable how saliency of intensity affects false alarms. For example, usual events such as "opening the door" are detected as salient due to the change in intensity.

Activity modeling and recognition has often been presented in the literature in connection with Hidden Markov Models (HMMs). In a nutshell, a dynamic model is created for the usual activity. Later on, as new activity is presented to the surveillance system, activity that does not "fit" the model is considered unusual. Unfortunately, such methods tend to behave poorly in scenes with complex activity, when the number of usual activities is too high to model. A brief discussion is presented in [19] and [86]. This has led to the development of many variants of the basic HMM.

Makris and Ellis [54] present the basic HMM approach. For the method to work properly, the scene is limited to pedestrian trajectories near the entrance of a university building. Tracking is assumed perfect, with trajectories represented as sequences of nodes. Despite this "ideal" situation, the authors perform a slight update by training their model for each time of day, to accommodate variations in pedestrian behavior. The trained model is presented with trajectories, for which it computes the likelihood of belonging to a trained model, and detects the node on the path which departs from that model. Hu et al. [37] build a statistical model for trajectory clusters, which they call motion patterns. The distribution parameters for each motion pattern are evaluated using maximum likelihood. Then, a probability of anomaly occurrence is computed for partial and complete trajectories. In their results, the authors show current abnormality probabilities beside tracked moving objects. Xiang and Gong [81] use an HMM variant, the Multi-Observation HMM, to model behavior patterns. Then, an affinity matrix is formed for spectral clustering of the data. Abnormal behavior has a low probability of belonging to any cluster. The authors develop this approach further in [82] by noting that patterns initially considered abnormal might later fit in the model as it is updated live. Duong et al. introduce the Switching Hidden Semi-Markov Model (S-HSMM) in [19]. Since they address the problem of learning human activities of daily living, they limit abnormality detection to state duration rather than state order. Liao et al. address the same problem as well. In [51], the approach is based on modeling agents' locations (obtained through GPS sensor measurements) using a hierarchical Markov model. All these applications are focused on single-object trajectory modeling rather than object interaction. Vaswani et al.
represent observed objects as configurations of point-object locations known as shapes [75]. The problem thus becomes that of modeling the shape motion and deformation using a continuous-state HMM, which the authors call "shape activity". Two measures are developed for drastic change and slow change in the system model, which are chosen to signal abnormal activity. This includes object interactions as well as time modeling. The authors present two methods to handle the number of point-objects in shapes varying over time. It is not clear, however, how the system would perform in scenes where this is a frequent situation. A simple example is that of a traffic intersection, where the observed window constantly has objects appearing while others leave the scene. Such a situation would consume most of the system's time in "resampling" shapes.

5.3 Problem definition

Our goal is to develop a method to detect unusual events in video that can adapt easily and quickly to various scenes. Our main contribution is in modeling activity that involves the interaction between multiple objects. In particular, we are interested in situations where individual objects' activities might be usual, while their simultaneous occurrence is unusual.

Most earlier work has avoided co-occurrence statistics methods due to their high complexity. Instead, researchers attempted to detect abnormality by building and training models for the observed activity, such as HMMs and their variants. However, these approaches have been limited to specific applications or scenes (e.g., activities of daily living [19, 51]). In some cases, human-made rules for defining goals and intentions had to be injected into the model [33].

We propose a representation of video data that reduces the expensive computations inherent in the co-occurrence approach. As in [82], we propose to update the statistics online. In this case, an event that was initially classified as unusual due to a lack of training data might eventually prove to be usual, as similar patterns frequently appear later in the video. In situations including significantly different activity patterns over extended periods of time, we might train each period with similar activity independently. In [54], this is performed using a supervised approach. We will attempt to detect changes in patterns of activity automatically.

Figure 5.2: The small (yellow) filled circles represent humans. A single person using the trash bin (a) or two people meeting in the building's lobby (b) are considered usual situations. On the other hand, people meeting next to the trash bin (c) is an unusual situation.

We propose to identify unusual activities that correspond to rare co-occurrences in space and/or time, and that can involve as few as one object. Spatial and temporal behavior have also been modeled in [75] and [39]. Figure 5.2 is a simple example of the type of problem we try to solve here. The illustrated unusual event results from two people pausing next to a trash bin. While a single person might pause there (to use it), or several people might meet in the building's lobby (to chat), it could be unusual for people to meet at a trash bin. Here, not only is the co-occurrence (of people) considered unusual; it is the location of the co-occurrence that matters.

5.4 Our Approach

Given sufficient observation time to train the system, we detect infrequent co-occurrences of events, and attempt as accurately as possible to identify the times and locations of such events.
5.4.1 Feature selection

We represent moving objects as point objects, where the point is chosen as the center of the bounding box. Our feature vector is quite similar to those of [67] and [81]. Currently, each object is represented by its position (x, y) and velocity (speed and direction) (dx, dy). We plan to include a "size" or "object-type" feature s at a later stage. The position is discretized using a spatial grid, as shown later. Speed is quantized as a binary value (move, stop), and direction is quantized to the cardinal ("compass") directions. A different event value exists for each possible pair of activities. For example, "car i making a right concurrently with pedestrian j moving forwards" is considered a possible event v.

5.4.2 Spatial tessellation

The image is subdivided into a grid, with the objects distributed into their respective bins. At this point, only the event count is required for every pair of bins, while the identities of the pairs of objects involved are discarded, saving space and time. Determining the appropriate bin size is a challenging issue. A larger bin size (coarser grid) tends to smooth out fluctuations and is more immune to the noise involved in tracking. On the other hand, a smaller bin size, corresponding to a finer grid, tends to detect events in crowded scenes more accurately, at the expense of increased computation time and a greater influence of noise.

To obtain a more flexible system, we implement the space decomposition using a quadtree, with bins as its leaves. This has the advantage of quick access to child nodes (fine grid) from parent nodes (coarse grid). A pointer-less implementation is used for the problem at hand. The ability to travel quickly up and down the quadtree allows for real-time zooming into the scene, using a finer grid. Situations where this may be needed include a crowded bin, or an object following another too closely.

5.4.3 Co-occurrence matrix

A co-occurrence matrix C_{i,j,v} stores, for every pair of bins (i, j), the distribution of events at that pair. At every time instant during training, the value v of the event is computed to index the co-occurrence matrix entry C_{i,j}. For each pair of objects in bins (i, j), the distribution function at v is incremented by one: C_{i,j,v} = C_{i,j,v} + 1. When the training process is complete, each distribution function C_{i,j} is normalized:

C_{i,j,v} = C_{i,j,v} / Σ_v C_{i,j,v}

for non-zero denominators, and C_{i,j,v} = 0 otherwise. Since the locations at which the events occur matter, the normalization is performed "locally" at the bin level for every C_{i,j}. Some rare events, such as noise occurring in a certain bin, might get magnified in this normalization process. A global normalization, where the elements of C are normalized by the sum of all its elements, might be considered. Experimental results show that such a matrix normalization is not sufficiently discriminative. First order events are naturally reported as unusual with respect to every other object in the scene. These are further processed to detect the single offending object.

The required training time is decided when the distributions C_{i,j} tend to reach a steady state. This is detected by measuring the change in the histograms of probabilities. Training is performed at the finest level of the tessellation. Coarse-level trained entries, if needed, are computed by agglomerating the corresponding fine-level ones.
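To make the bookkeeping concrete, the following is a minimal Python sketch of the training update just described. The quantization choices, grid dimensions, dictionary layout and all names here are illustrative assumptions, not the exact implementation used in our experiments.

    from collections import defaultdict

    # Sparse co-occurrence statistics: C[(i, j)][v] is the count (later the
    # probability) of joint event value v at the ordered bin pair (i, j).
    C = defaultdict(lambda: defaultdict(float))

    def quantize(obj, grid_w=8, cell=16.0):
        """Map one tracked object to (bin index, activity code)."""
        bx, by = int(obj['x'] // cell), int(obj['y'] // cell)
        moving = (abs(obj['dx']) + abs(obj['dy'])) > 0          # (move, stop)
        if abs(obj['dx']) >= abs(obj['dy']):                    # compass heading
            heading = 0 if obj['dx'] >= 0 else 1                # E / W
        else:
            heading = 2 if obj['dy'] >= 0 else 3                # S / N
        return by * grid_w + bx, (int(moving), heading)

    def train_frame(objects):
        """Accumulate one frame of pairwise (second order) events into C."""
        q = [quantize(o) for o in objects]
        for a in range(len(q)):
            for b in range(a + 1, len(q)):
                (i, va), (j, vb) = q[a], q[b]
                v = (va, vb) if i <= j else (vb, va)   # canonical event value
                C[(min(i, j), max(i, j))][v] += 1

    def normalize():
        """Per-pair ('local') normalization of each distribution C[i, j]."""
        for hist in C.values():
            total = sum(hist.values())
            if total > 0:
                for v in hist:
                    hist[v] /= total

Storing the histograms sparsely matches the observation, reported in section 5.5.2, that only a few percent of the matrix entries are ever non-zero.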
Given two large bins x and y at the coarser level, the co-occurrence C_{xy} is computed as follows. The sets of children of x and y, S_x and S_y, are computed; then, the corresponding distributions are added as in the following equation:

C_{xy} = Σ_{i ∈ S_x} Σ_{j ∈ S_y} c_{ij},

where c is the co-occurrence matrix at the finer level. The appropriate matrix is then used to test at the required level.

5.4.4 Detecting unusual events

As mentioned earlier, we define unusual events as non-frequent or rare events. During the training period, probabilities of first and second order events are computed for every bin or pair of bins, respectively. If, during the test phase, an event is detected as a "low probability event", it is reported as unusual. Defining low probability events is quite challenging. A study of the histogram of probabilities is useful in determining a range of low probabilities. The threshold probability p_th, below which an event is considered unusual, can then be chosen to coincide with a given cumulative percentage of events. Diagonal elements of the co-occurrence matrix are used to detect unusual first order events, i.e. those involving a single object. Once such an event is detected, its co-occurrences with further objects are ignored.

5.5 Results

5.5.1 Dataset

We built a traffic simulator to generate a "cheap" but rich environment for agent interactions. We simulate a four-way stop sign, which obeys the rules of traffic in the US. Figure 5.3 illustrates a typical snapshot of the simulator, with the arrows indicating driving directions.

Figure 5.3: Typical snapshot of our 4-way stop sign simulator. Arrows indicate the driving directions and squares represent vehicles. Cars are not strictly aligned due to the simulated noise.

The simulator has two built-in sets of rules. The first set generates the usual behavior that would be expected at a stop sign. It includes different arrival rates for each lane. According to these rates, vehicles arrive randomly at each lane and flow towards the stopping line, where they pause until the way is clear. Based on the order in which they arrived, vehicles are allowed to cross the intersection either forwards, by making a right turn, or by making a left turn. The usual rules of traffic are obeyed by not having any two intersecting trajectories.

The second set of rules generates unusual behavior that is inconsistent with traffic rules at a stop sign. The generated unusual behavior includes both first order and second order events. First order events allow vehicles to make a U-turn at the intersection, while second order events allow intersecting trajectories. Having such trajectories might result in vehicles "hitting" each other, coming very close to one another, or in nothing at all. In any case, the event is known to violate traffic rules, as drivers are supposed not to initiate their movement until the way is clear.

Both sets of rules are mixed in different proportions. Throughout the training phase, the subset of rules that allows unusual events to occur is applied about 3% of the time, while the other set, for usual events, is enforced the rest of the time. Noting that the former subset only allows abnormality but does not enforce it, the actual frequency of occurrence of unusual events drops to less than 1.5%. The simulator might simply "elect" not to generate an intersecting path, even when it has the opportunity to do so. During the testing phase, a larger set of unusual events is allowed to occur for a longer period of time, for experimental purposes only. A real tracking module is often susceptible to noise.
The simulator adds random noise to the position of vehicles, resulting in less artificial trajectories.

5.5.2 System training

Timing results are based on experiments performed on a Pentium-M machine with a processor running at 1.8 GHz and with 1 GB of RAM. We train our system on two different simulations, one including noise and one without. Each training session has a duration of about 10 hours. First and second order unusual events occur at a rate of about 1.5-2% during the training period. For the purpose of testing, the simulated data contains unusual events approximately 30% of the time on average. We generate over 100 short segments to obtain the results shown. Each segment runs for up to 7 minutes and contains a balance of usual versus unusual events for the test.

Training is performed for two levels of spatial subdivision: a coarse grid with 16 bins and a fine grid with 64 bins. Only the co-occurrence matrix for the fine grid tessellation needs to be trained, since the matrix for the coarse grid is computed by coalescing the appropriate adjacent cells in the former. At a simulated resolution of 10 frames per second, using a 64-bin grid, the time required to train the co-occurrence matrix represents less than 20% of the frame time, allowing real-time training. At the completion of the training phase, distribution functions of events are obtained for every pair of bins. Typical distribution functions for a 16-bin grid are shown in figure 5.4. Non-zero entries in the coarse grid co-occurrence matrix represent around 7.6% of the total matrix entries. This number falls to about 3.1% in the fine grid matrix.

Figure 5.4: Typical distribution functions for a 16-bin grid. The pair of bins with the distribution shown in (a) observes relatively high joint activity, while (b) is for a pair of bins between which only one type of activity is present. Both plots show probability versus the event value index.

5.5.3 Unusual event detection

The testing procedure is performed with the help of a volunteer human observer, who drives in the US. To make the job easier, the simulated unusual events represent traffic violations. Thus, the observer is reminded of the rules of traffic at a stop sign, and instructed to watch for moving violations occurring in the scene. Unusual events are reported by highlighting the bin(s) where the event occurred, at the time instance of occurrence. Figure 5.5 illustrates examples of correctly detected second order unusual events. First order abnormalities are also detected, as shown in figure 5.6.

Figure 5.5: Examples of detected unusual events. The bin(s) where the event occurred is highlighted. In (a), the left and the right vehicles are each attempting to make a left turn simultaneously and are about to collide. In (b), the upper vehicle is making a left coinciding with the lower vehicle making a right. This will result in the two vehicles driving very close to one another.

The accuracy of our method is measured through its correct detection and false alarm rates. These are captured in receiver operating characteristic (ROC) curves. Judging whether an event is usual or not is determined by the human observer. For each observed segment, the observer returns the count of three types of events:

tp: True positives - unusual events correctly identified.
fp: False positives - usual events incorrectly identified as unusual.
fd: False dismissals - unusual events incorrectly undetected.

A vehicle's crossing of the intersection is counted as a single event. If, during the crossing, the vehicle interacts with more than one other vehicle, every interaction is counted as a separate event. An unusual event is considered correctly detected, i.e. a true positive, if the bin containing it or a neighboring one is identified, and if it is reported at the right time, tolerating a delay of one or two frames.

Figure 5.6: First order anomaly example: a U-turn. Regardless of its interaction with other cars, a U-turn is not allowed (unusual) at a stop sign. Without noting this fact (a), it is reported as unusual with respect to most other populated bins. The algorithm has been updated to filter out first order events from further processing with other bins (b).

Having recorded the three counts above, we define the correct detection rate cr and false alarm rate fr as follows:

cr = tp / (tp + fd)
fr = fp / (tp + fp)
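From the three counts, the two rates are trivial to compute; as a sketch only (the counts themselves come from the human observer for each segment):

    def rates(tp, fp, fd):
        """Correct detection and false alarm rates, as defined above."""
        cr = tp / (tp + fd) if tp + fd else 0.0   # fraction of unusual events caught
        fr = fp / (tp + fp) if tp + fp else 0.0   # fraction of reports that are false
        return cr, fr

Sweeping the threshold p_th, re-running the detector, and recording one (fr, cr) pair per setting traces out the ROC curves described next.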
ROC curves are plotted for cr versus fr while varying the threshold probability p_th. The values on the plots represent the means of several experiments. The behavior of our method using different grid sizes, and its robustness as noise is introduced, are both tested. Using a coarse grid with 16 bins (figure 5.7a), we notice that the effect of noise on the false alarm and correct detection rates can be tolerated with some choices of p_th. Intuitively, larger bins tend to "absorb" noise in tracked objects, resulting in a smoother plot. This incurs the penalty of missing some unusual events, such as a car closely following another. The fine grid example, using 64 bins, is shown in figure 5.7b. Clearly, the small bin size makes it very sensitive to noise. In the absence of noise, the fine and coarse grid models are compared in figure 5.8. Given the same threshold probability p_th, the fine grid often performs slightly better than the coarse one.

Figure 5.7: Unusual event detection using a coarse grid with 16 bins (a) and a fine one (b). The mean correct detection rate is plotted versus the mean false alarm rate. For the coarse grid (a), the introduction of noise slightly deteriorates the detection rate, but for appropriate choices of p_th the method proves to be immune to noise. The fine grid (b) is very sensitive to noise. In this case, adapting between coarse and fine grids can be the best alternative.

We report the running time for the anomaly detector. On the 10 fps simulated data, the running time for the coarse grid event detector represents less than 50% of the frame time. This allows for real-time detection at up to 20 fps. Zooming in to the fine level grid, however, dramatically affects the running speed. An average of only 1.4 fps can be analyzed.

Figure 5.8: Unusual event detection: coarse and fine grids compared in the absence of noise. Slightly better performance is obtained using the fine grid for many values of p_th.

This detection rate can be improved by implementing techniques for sparse matrices.
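As a rough illustration of that last point, the test-phase lookup can operate directly on sparsely stored distributions, so that bin pairs with no training mass are never touched. A minimal sketch, reusing the dictionary layout of the training sketch in section 5.4.3 (the name is_unusual and the default threshold are assumptions):

    def is_unusual(C, i, j, v, p_th=0.01):
        """Threshold test of section 5.4.4 on sparse statistics.

        C is the dict-of-dicts of normalized distributions; (i, j) is the
        bin pair and v the observed event value. Event values never seen
        in training have probability 0 and are reported as unusual.
        """
        hist = C.get((min(i, j), max(i, j)))
        p = hist.get(v, 0.0) if hist else 0.0
        return p < p_th

Because only non-zero entries are stored, both the memory and the per-event test cost scale with the observed activity rather than with the full bins-squared-by-events matrix.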
We conclude this section by noting that, to achieve significantly better performance with this model, the coarse grid should be used most of the time, zooming in to the finer grid only when needed.

5.6 Proposed Extensions

We present in this section several areas that can be improved, and how they can be addressed. We also propose how the current work can be used for the video scoring problem. But first we show some limitations of our method when dealing with real video, as opposed to the simulated data used for our results.

5.6.1 Issues with real video

Several issues arise when dealing with real video. We use tracking results for a traffic intersection video and study points where our algorithm needs to be improved. Objects are tracked using a blob tracker and classified using aspect ratio features. We visualize tracking results on the frames and analyze typical ones to develop our unusual event detector accordingly. The tracker computes the bounding box and returns the type for vehicles and pedestrians, as visualized in figure 5.9. When two blobs merge, they are returned as a third type of "object".

Figure 5.9: Typical video snapshot showing the 3 types of detected objects: pedestrian, merging blobs, and vehicle.

Trackers generally suffer from several errors that might affect the results obtained from the simulator. Examples of apparent tracking errors include incorrect classification, missed objects and false detections. These are shown in figure 5.10. Occasionally, some tracks might be lost, as in the case after a blob merge, where object IDs might get assigned to incorrect objects. Not all of these errors affect our solution of the unusual event detection problem. Some errors might require adapting our algorithm to them, while others might require more work with the tracker.

Figure 5.10: Three examples of tracking errors: (a) three adjacent pedestrians classified as a vehicle. (b) Missed pedestrian. (c) Moving tree branches and an empty parking spot detected as moving objects.

We discuss these issues and other areas that can be improved in the rest of this section.

1. Instantaneous time shots: We use pairs of consecutive frames to compute velocities. This makes the problem an image (frame) classification problem. Stauffer et al. [67] computed co-occurrences between frames for the same object (which they called a sequence), at the expense of inter-object correlations. We seek a compromise where inter-object correlations are modeled over "reasonable" periods of time, spanning granular events (e.g., a street crossing event, or a building exiting event). For example, a vehicle passing a stop sign without stopping is only detected if we model the period of time starting before the car reaches the stopping line and ending after it departs it. This is achieved by adding a third dimension to the co-occurrences. Time and space limitations also need to be observed. For example, events distant in time are unlikely to affect each other, so their correlations need not be computed.

2. Space tessellation: Up to now, space has been subdivided into a fixed grid for the whole scene. We have used more than one bin size, but each size has been fixed for the whole scene. Some bins are crowded while others are almost empty. Results are affected by the displacement of the grid on the ground. We suggest several improvements here. Computing co-occurrences need only be performed between "neighboring" bins.
Computing the neighborhood size is a problem in itself. There might be situations where two nearby locations do not interact at all (e.g., when separated by a wall). A natural extension of the fixed grid for real video is a "perspective grid" in the image plane. We need results to be invariant with respect to the model chosen. A challenging problem is that of automatically computing an adaptive space tessellation, where the position of the grid is determined from trajectory clusters, and the (variable) size of the bins is based on traffic density.

We also recall that in our results, only about 7.6% of the upper triangular co-occurrence matrix was non-zero. This is typical of this approach, as in the results in [39]. Storing using sparse matrix techniques, and avoiding computations involving zero elements, saves considerable space and time. Non-zero elements represent mutual activity. In other words, avoiding zero elements means that we skip regions without activity.

3. Tracking errors: The simulator has assumed a perfect tracker, except for the generated random noise added to the vehicles' locations. This assumption is far from realistic. To deal with this problem, there are three possible approaches (or a combination thereof):

- Manually correct tracking errors. This approach has been used in [33] and [35], for example. It has the benefit of focusing all effort on the abnormality detection work, hoping that one day tracking methods will be sufficiently developed.

- Enhance the tracker in aspects that benefit the abnormality detector. One example is learning the locations of human and vehicle trajectories, so that a car driving up the wall of a building is easily considered a tracker error rather than an unusual event. (We assume, of course, that cars cannot in general drive up walls.)

- Adjust the abnormality detector to the errors of the tracker. This means admitting that any tracker will always have errors that cannot be fixed in the foreseeable future. An example is to ignore short "disappearances" of moving objects.

4. Optical flow: Tracking "ensemble" movements or flows, rather than tracking individuals, is more appealing for crowded scenes and traffic monitoring. We are more interested in detecting anomalies than in tracking individuals. This approach will involve estimating the distribution of flows in different regions of the surveyed scene, and reporting events that do not fit the estimated model as unusual.

5. Video scoring: Video segments with unusual activity receive high scores for further manual analysis. Several issues need to be addressed here:

- Splitting the video into segments to be scored.
- Choosing a suitable cost function to assign appropriate scores, based on how much the analyzed segment varies from the model estimated for the video.

Appendix A

Image Cropping Through Learning

We have also investigated cropping images through learning, as motivation for the video cropping problem presented in chapter 2. This work was done jointly with Haibin Ling.

A.1 Problem definition

The image cropping framework can be summarized in two steps: feature extraction and cropping rectangle detection. The following are some formulations in this framework.

Input image: I : [1..128] × [1..128] → R³
Feature extraction: F : x = F(I) ∈ R^d
Rectangle set: R = { r = (r_x, r_y, r_w, r_h)ᵀ ∈ [0,1]⁴ }
Rectangle detection function: f : R^d → R, with r = f(x) ∀x ∈ R^d
Cropping: I_r = C(I, r) : D_r → R³, where
D_r = [[128(r_x − r_w/2)] .. [128(r_x + r_w/2)]] × [[128(r_y − r_h/2)] .. [128(r_y + r_h/2)]] ⊆ [1..128] × [1..128],
and I_r(x, y) = I(x, y) ∀(x, y) ∈ D_r.

In the above formulation, F is the feature extraction step. In [68], the saliency map of [41] is adapted for this task. The rectangle detection function f corresponds to the cropping step. A heuristic algorithm is used in [68] that balances the amount of saliency within a rectangle and its size.
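The cropping operator C itself is straightforward; as a minimal illustration (assuming a NumPy image, with the rectangle in the normalized center/size parameterization above; the name crop is illustrative):

    import numpy as np

    def crop(I, r):
        """Crop image I to rectangle r = (rx, ry, rw, rh) in [0, 1]^4,
        where (rx, ry) is the rectangle center and (rw, rh) its size."""
        H, W = I.shape[:2]
        rx, ry, rw, rh = r
        x0 = max(int(round(W * (rx - rw / 2))), 0)
        x1 = min(int(round(W * (rx + rw / 2))), W)
        y0 = max(int(round(H * (ry - rh / 2))), 0)
        y1 = min(int(round(H * (ry + rh / 2))), H)
        return I[y0:y1, x0:x1]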
We study the use of machine learning techniques for the cropping problem. That is, we want to learn the cropping function f. Specifically, it is modeled as a regression problem given hand-cropped samples. The approach that we employ is support vector based regression (SVR) [74].

A.2 Learning through SVR

Suppose we have an example image set {I_i}_{i=1}^n, the corresponding feature vectors X = {x_i = F(I_i)}_{i=1}^n, and hand-cropped rectangles R = {r_i}_{i=1}^n. The task is to find a rectangle detection function f through X and R. Note that a rectangle is determined by four parameters, while current SVR algorithms all deal with a single output value. For this reason, when using SVR for our task, we actually decompose the function f into four functions, one for each rectangle parameter separately. In the following, without loss of generality, we study just one such function and assume f is single-valued.

The ε-SVR was introduced by Vapnik [74]; it uses the so-called ε-loss function |·|_ε:

|ξ|_ε = 0 if |ξ| ≤ ε, and |ξ|_ε = |ξ| − ε otherwise.   (A.1)

In the linear case, f can be written as

f(x) = ⟨w, x⟩ + b.   (A.2)

The minimization of the ε-SVR has the following formulation [66]:

minimize   (1/2)‖w‖² + C Σ_{i=1}^n (ξ_i + ξ_i*)
subject to   r_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i,
             ⟨w, x_i⟩ + b − r_i ≤ ε + ξ_i*,
             ξ_i, ξ_i* ≥ 0.   (A.3)

A.3 Experiments

The performance of the learned cropper is evaluated by computing the mean and standard deviation of the overlap rates between manually and automatically cropped rectangles in an image set. The overlap rate between two rectangles r_1 and r_2 is defined as:

overlap-rate(r_1, r_2) = area(r_1 ∩ r_2) / area(r_1 ∪ r_2).   (A.4)

We tested on the animal subset of the NOVA photographs [4], which contains 4509 images, roughly 640 × 480. The results are included in figure A.1. The statistics square at the left represents the density of cropping rectangles in different regions of the normalized image area. Histograms are shown for each of the rectangle parameters independently. The overlap rates are for hand-cropped images versus automatically cropped ones, using different algorithms. The "mean" cropper simply crops a fixed rectangle that is the mean of all hand crops. The "uist03" cropper is the heuristic algorithm used in [68]. The "svr+sali" algorithm uses SVR to learn crops based on the saliency feature maps of images, developed in [41]. Finally, the "svr+var" cropper uses variance feature maps of images [5] to learn the crops based on SVR.

Surprisingly, all methods perform quite similarly, even similarly to the "mean" cropper, with less than a 60% average overlap rate. We investigate the problem further by analyzing the image statistics. Figure A.2 compares two manual crops, parameter by parameter, then rectangle by rectangle. The average overlap rate jumps to over 70%. We generate the scatter plot for the area of manually selected cropping rectangles versus their saliency content. Figure A.3 shows that there is little to learn.
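For reference, the training-and-evaluation loop behind these comparisons can be sketched in a few lines. This is only a sketch: scikit-learn's SVR is used here as a stand-in for the ε-SVR of [74, 66], and the helper names (fit_croppers, predict_rect, overlap_rate) are assumptions for illustration.

    import numpy as np
    from sklearn.svm import SVR

    def fit_croppers(X, R, eps=0.05, C=1.0):
        """One linear epsilon-SVR per rectangle parameter, per (A.2)-(A.3).
        X: n x d feature matrix; R: n x 4 array of hand-cropped rectangles."""
        return [SVR(kernel='linear', epsilon=eps, C=C).fit(X, R[:, k])
                for k in range(4)]

    def predict_rect(models, x):
        """Assemble the four single-valued regressors into one rectangle."""
        return np.array([m.predict(x.reshape(1, -1))[0] for m in models])

    def overlap_rate(r1, r2):
        """Equation (A.4): intersection over union, for rectangles given
        in (center x, center y, width, height) form."""
        def corners(r):
            return (r[0] - r[2] / 2, r[1] - r[3] / 2,
                    r[0] + r[2] / 2, r[1] + r[3] / 2)
        ax0, ay0, ax1, ay1 = corners(r1)
        bx0, by0, bx1, by1 = corners(r2)
        iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
        ih = max(0.0, min(ay1, by1) - max(ay0, by0))
        inter = iw * ih
        union = r1[2] * r1[3] + r2[2] * r2[3] - inter
        return inter / union if union > 0 else 0.0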
Noting that professionally taken photographs usually have their main subject centered in the image, we gave our algorithm a last try on a set of images where the main subject is purposely placed in the corners of the image. The set was photographed by Dr. Evan Golub [27]. The statistics of the hand croppings for that set are shown in figure A.4. The rectangle count map shows a clear "+" sign, separating the 4 corners where the subjects of the images are located. The overlap ratio dramatically drops to 16%. The "uist03" cropper [68] was also tested on that set. Although the algorithm succeeds in capturing the main subject in most cases, it retains on average 55% of the original image area. This is more than 8 times the area of the manually cropped images on average. The result is that the mean overlap ratio for that set is only 23%. We stopped further investigation in this direction.

Figure A.1: Overlapping rates on the Nova Animal Set (4509 images) using different algorithms (mean overlap rates: mean cropper 0.58, uist03 0.55, svr+sali 0.59, svr+var 0.59).

Figure A.2: Overlapping rates on the Nova Animal Set for two manual croppings (Haibin's vs. Hazem's; mean 0.72).

Figure A.3: Scatter plot of rectangle area versus saliency content.

Figure A.4: Statistics of hand cropping for the "non-professional" set.

Bibliography

[1] Telindus Surveillance Solutions Ltd, 2005. http://www.telindussurveillance-us.com/.

[2] New York Civil Liberties Union Special Report, Fall 2006.

[3] Cable News Network, CNN, 2007. http://www.cnn.com.

[4] Nova Development Corporation, "Art Explosion 300,000". http://www.novadevelopment.com/Products/us/ahw/default.aspx.

[5] Gaurav Agarwal. Presenting visual information to the user: Combining computer vision and interface design. M.S. thesis, University of Maryland, 2005.

[6] Pankaj K. Agarwal and Cecilia Magdalena Procopiuc. Exact and approximation algorithms for clustering. Algorithmica, 33(2):201-226, December 2002.

[7] Ravindra K. Ahuja, Kurt Mehlhorn, James Orlin, and Robert E. Tarjan. Faster algorithms for the shortest path problem. J. ACM, 37(2):213-223, 1990.

[8] Tirthankar Bandyopadhyay, Marcelo H. Ang Jr, and David Hsu. Motion planning for 3-d target tracking among obstacles.
In Proc. International Symposium on Robotics Research (ISRR '07), Hiroshima, Japan, 2007.

[9] Tirthankar Bandyopadhyay, Yuanping Li, Marcelo H. Ang Jr, and David Hsu. A greedy strategy for tracking a locally predictable target among obstacles. In Proc. IEEE International Conference on Robotics and Automation (ICRA '06), pages 2342-2347, Orlando, Florida, May 2006.

[10] Sourabh Bhattacharya, Salvatore Candido, and Seth Hutchinson. Motion strategies for surveillance. In Proceedings of Robotics: Science and Systems III, 2007.

[11] Sourabh Bhattacharya and Seth Hutchinson. Approximation schemes for two-player pursuit evasion games with visibility constraints. In Proceedings of Robotics: Science and Systems IV, Zurich, Switzerland, June 2008.

[12] Li-Qun Chen, Xing Xie, Xin Fan, Wei-Ying Ma, Hong-Jiang Zhang, and He-Qin Zhou. A visual attention model for adapting images on small displays. Multimedia Systems, 9(4):353-364, October 2003.

[13] Robert T. Collins, Alan J. Lipton, Hironobu Fujiyoshi, and Takeo Kanade. Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE, 89(10):1456-1477, October 2001.

[14] Franklin C. Crow. Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '84), volume 18, pages 207-212, 1984.

[15] Thomas Deselaers, Philippe Dreuw, and Hermann Ney. Pan, zoom, scan: time-coherent, trained automatic video cropping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), pages 1-8, June 2008.

[16] Robert B. Dial. Algorithm 360: Shortest-path forest with topological ordering [H]. Commun. ACM, 12(11):632-633, 1969.

[17] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269-271, 1959.

[18] Donald J. Patterson, Lin Liao, Dieter Fox, and Henry Kautz. Inferring high-level behavior from low-level sensors. In Proc. Fifth International Conference on Ubiquitous Computing (UBICOMP '03), pages 73-89, October 2003.

[19] Thi V. Duong, Hung H. Bui, Dinh Q. Phung, and Svetha Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-Markov model. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), 1:838-845, June 2005.

[20] Jack Edmonds. Paths, trees, and flowers. Canadian Journal of Mathematics, 17:449-467, 1965.

[21] Hazem El-Alfy, David Jacobs, and Larry Davis. Multi-scale video cropping. In Proceedings of the 15th International Conference on Multimedia (ACM MM '07), pages 97-106, September 2007.

[22] Hazem El-Alfy, David Jacobs, and Larry Davis. Assigning cameras to subjects in video surveillance systems. In Proc. IEEE International Conference on Robotics and Automation (ICRA '09), pages 837-843, Kobe, Japan, May 2009.

[23] Erhan Baki Ermis, Venkatesh Saligrama, Pierre-Marc Jodoin, and Janusz Konrad. Motion segmentation and abnormal behavior detection via behavior clustering. In IEEE International Conference on Image Processing (ICIP '08), pages 769-772, October 2008.

[24] Xin Fan, Xing Xie, He-Qin Zhou, and Wei-Ying Ma. Looking into video frames on small displays. In ACM International Conference on Multimedia (MM '03), pages 247-250, 2003.

[25] Loren Fiore, Guruprasad Somasundaram, Andrew Drenner, and Nikolaos Papanikolopoulos. Optimal camera placement with adaptation to dynamic scenes. In Proc. IEEE International Conference on Robotics and Automation (ICRA '08), pages 956-961, Pasadena, California, May 2008.
[26] Andreas Girgensohn, John Adcock, Matthew Cooper, Jonathan Foote, and Lynn Wilcox. Simplifying the management of large photo collections. In Human-Computer Interaction (INTERACT '03), pages 196-203, 2003.

[27] Evan Golub. Photocropr: A first step towards computer-supported automatic generation of photographically interesting cropping suggestions. Technical Report HCIL-2007-12, University of Maryland, Maryland, USA, 2007.

[28] Teofilo F. Gonzalez. Covering a set of points in multidimensional space. Information Processing Letters, 40(4):181-188, November 1991.

[29] Héctor H. González-Baños, Cheng-Yu Lee, and Jean-Claude Latombe. Real-time combinatorial tracking of a target moving unpredictably among obstacles. In Proc. IEEE International Conference on Robotics and Automation (ICRA '02), pages 1683-1690, Washington, DC, USA, May 2002.

[30] Amit Goradia, Zhiwei Cen, Clayton Haffner, Ning Xi, and Matt Mutka. Switched video feedback for sensor deployment and target tracking in a surveillance network. In Proc. IEEE International Conference on Robotics and Automation (ICRA '07), pages 3470-3475, Roma, Italy, April 2007.

[31] Raffay Hamid, Amos Johnson, Samir Batta, Aaron Bobick, Charles Isbell, and Graham Coleman. Detection and explanation of anomalous activities: Representing activities as bags of event n-grams. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), 1:1031-1038, June 2005.

[32] Rachel Heck, Michael Wallick, and Michael Gleicher. Virtual videography. ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP), 3(1):4, February 2007.

[33] David C. Hogg and Hannah M. Dee. Detecting inexplicable behavior. In Proc. British Machine Vision Conference (BMVC '04), September 2004.

[34] David C. Hogg and Hannah M. Dee. On the feasibility of using a cognitive model to filter surveillance data. In IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS '05), pages 34-39, September 2005.

[35] David C. Hogg and Hannah M. Dee. Navigational strategies and surveillance. In Sixth IEEE International Workshop on Visual Surveillance (ECCV-VS '06), May 2006.

[36] David Hsu, Wee Sun Lee, and Nan Rong. A point-based POMDP planner for target tracking. In Proc. IEEE International Conference on Robotics and Automation (ICRA '08), pages 2644-2650, Pasadena, California, May 2008.

[37] Weiming Hu, Xuejuan Xiao, Zhouyu Fu, Dan Xie, Tieniu Tan, and Steve Maybank. A system for learning statistical motion patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 28(9):1450-1464, September 2006.

[38] Hayley Hung and Shaogang Gong. Quantifying temporal saliency. In British Machine Vision Conference (BMVC '04), pages 727-736, September 2004.

[39] Hayley Hung and Shaogang Gong. Detecting and quantifying unusual interactions by correlating salient motion. In IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS '05), pages 46-51, September 2005.

[40] Laurent Itti and Pierre Baldi. A principled approach to detecting surprising events in video. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), 1:631-637, June 2005.

[41] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 20(11):1254-1259, November 1998.

[42] Timor Kadir and Michael Brady. Saliency, scale and image description. International Journal of Computer Vision (IJCV), 45(2):83-105, November 2001.
[43] Hong-Wen Kang, Xue-Quan Chen, Yasuyuki Matsushita, and Xiaoou Tang. Space-time video montage. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), 2:1331-1338, June 2006.

[44] Stephen Kloder and Seth Hutchinson. Partial barrier coverage: Using game theory to optimize probability of undetected intrusion in polygonal environments. In Proc. IEEE International Conference on Robotics and Automation (ICRA '08), pages 2671-2676, Pasadena, California, May 2008.

[45] Andreas Kolling and Stefano Carpin. The graph-clear problem: definition, theoretical properties and its connections to multirobot aided surveillance. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pages 1003-1008, San Diego, California, 2007.

[46] Andreas Kolling and Stefano Carpin. Extracting surveillance graphs from robot maps. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '08), pages 2323-2328, Nice, France, 2008.

[47] Andreas Kolling and Stefano Carpin. Multi-robot surveillance: an improved algorithm for the GRAPH-CLEAR problem. In Proc. IEEE International Conference on Robotics and Automation (ICRA '08), pages 2360-2365, Pasadena, California, May 2008.

[48] Andreas Kolling and Stefano Carpin. Probabilistic graph-clear. In Proc. IEEE International Conference on Robotics and Automation (ICRA '09), pages 3508-3514, Kobe, Japan, May 2009.

[49] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955.

[50] Steven M. LaValle. Planning Algorithms. Cambridge University Press, New York, NY, USA, 2006.

[51] Lin Liao, Dieter Fox, and Henry Kautz. Learning and inferring transportation routines. In Proc. National Conference on Artificial Intelligence (AAAI '04), 2004.

[52] Feng Liu and Michael Gleicher. Video retargeting: automating pan and scan. In ACM International Conference on Multimedia (MM '06), pages 241-250, 2006.

[53] Dimitrios Makris, Tim Ellis, and James Black. Bridging the gaps between cameras. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), pages 205-210, Washington, DC, 2004.

[54] Dimitrios Makris and Tim J. Ellis. Spatial and probabilistic modelling of pedestrian behaviour. In British Machine Vision Conference (BMVC '02), pages 557-566, September 2002.

[55] Dimitri Marinakis, Gregory Dudek, and David J. Fleet. Learning sensor network topology through Monte Carlo expectation maximization. In Proc. IEEE International Conference on Robotics and Automation (ICRA '05), pages 4581-4587, Barcelona, Spain, April 2005.

[56] Christian Micheloni, Lauro Snidaro, and Gian Luca Foresti. Statistical event analysis for video surveillance. In Sixth IEEE International Workshop on Visual Surveillance (ECCV-VS '06), May 2006.

[57] Vlad I. Morariu, V. Shiv Naga Prasad, and Larry S. Davis. Human activity understanding using visibility context. In IEEE/RSJ IROS Workshop: From Sensors to Human Spatial Concepts (FS2HSC '07), October 2007.

[58] R. Morris and D. Hogg. Statistical models of object interaction. International Journal of Computer Vision (IJCV), 37(2):209-215, June 2000.

[59] Rafael Murrieta-Cid, Raul Monroy, Seth Hutchinson, and Jean-Paul Laumond. A complexity result for the pursuit-evasion game of maintaining visibility of a moving evader. In Proc. IEEE International Conference on Robotics and Automation (ICRA '08), pages 2657-2664, Pasadena, California, May 2008.
[60] Rafael Murrieta-Cid, Teja Muppirala, Alejandro Sarmiento, Sourabh Bhattacharya, and Seth Hutchinson. Surveillance strategies for a pursuer with finite sensor range. The International Journal of Robotics Research, 26(3):233-253, March 2007.

[61] Joseph O'Rourke. Art Gallery Theorems and Algorithms. Oxford University Press, New York, 1987.

[62] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, Englewood Cliffs, NJ, 1982.

[63] Fatih Porikli and Ajay Divakaran. Multi-camera calibration, object tracking and query generation. In Proc. International Conference on Multimedia and Expo (ICME '03), pages 653-656, Baltimore, MD, USA, July 2003.

[64] Yong Rui, Liwei He, Anoop Gupta, and Qiong Liu. Building an intelligent camera management system. In Proceedings of the Ninth ACM International Conference on Multimedia (ACM MM '01), pages 2-11, Ottawa, Canada, September 2001.

[65] Justus Schwartz, Angelika Steger, and Andreas Weißl. Covering a set of points in multidimensional space. Experimental and Efficient Algorithms, 3503:476-487, May 2005.

[66] Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14:199-222, August 2004.

[67] Chris Stauffer and W. Eric L. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(8):747-757, August 2000.

[68] Bongwon Suh, Haibin Ling, Benjamin B. Bederson, and David W. Jacobs. Automatic thumbnail cropping and its effectiveness. In ACM Symposium on User Interface Software and Technology (UIST '03), pages 95-104, November 2003.

[69] Noriko Takemura and Jun Miura. View planning of multiple active cameras for wide area surveillance. In Proc. IEEE International Conference on Robotics and Automation (ICRA '07), pages 3173-3179, Roma, Italy, April 2007.

[70] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Addison Wesley, 2006.

[71] Zhijun Tang and Ümit Özgüner. Motion planning for multitarget surveillance with mobile sensor agents. IEEE Transactions on Robotics, 21(5):898-908, October 2005.

[72] Toshiba America Information Systems, Inc. [www.toshibasecurity.com]. Toshiba IP Network PTZ Camera data sheet, IK-WB21A, 2007.

[73] Jur van den Berg, Sachin Patil, Jason Sewall, Dinesh Manocha, and Ming Lin. Interactive navigation of multiple agents in crowded environments. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games (I3D '08), pages 139-147, Redwood City, CA, February 2008.

[74] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[75] Namrata Vaswani, Amit Roy Chowdhury, and Rama Chellappa. "Shape Activity": A continuous-state HMM for moving/deforming shapes with application to abnormal activity detection. IEEE Transactions on Image Processing, 14(10):1603-1616, October 2005.

[76] Marcos A. M. Vieira, Ramesh Govindan, and Gaurav S. Sukhatme. Optimal policy in discrete pursuit-evasion games. Technical Report 08-900, University of Southern California, California, USA, 2008.

[77] Marcos A. M. Vieira, Ramesh Govindan, and Gaurav S. Sukhatme. Optimal policy in discrete pursuit-evasion games. In Proc. International Conference on Robot Communication and Coordination (ROBOCOMM '09), Odense, Denmark, April 2009.

[78] Paul Viola and Michael Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137-157, May 2004.

[79] Jun Wang, Marcel J. T. Reinders, Reginald L.
Lagendijk, Jasper Lindenberg, and Mohan S. Kankanhalli. Video content representation on tiny devices. IEEE International Conference on Multimedia and Expo (ICME '04), 3:1711-1714, June 2004.

[80] Christopher R. Wren, U. Murat Erdem, and Ali J. Azarbayejani. Automatic pan-tilt-zoom calibration in the presence of hybrid sensor networks. In Proceedings of the Third ACM International Workshop on Video Surveillance & Sensor Networks (VSSN '05), pages 113-120, Singapore, November 2005.

[81] Tao Xiang and Shaogang Gong. Video behaviour profiling and abnormality detection without manual labelling. Tenth IEEE International Conference on Computer Vision (ICCV '05), 2:1238-1245, October 2005.

[82] Tao Xiang and Shaogang Gong. Incremental visual behaviour modelling. In Sixth IEEE International Workshop on Visual Surveillance (ECCV-VS '06), May 2006.

[83] Xing Xie, Hao Liu, Simon Goumaz, and Wei-Ying Ma. Learning user interest for image browsing on small-form-factor devices. In SIGCHI Conference on Human Factors in Computing Systems (CHI '05), pages 671-680, 2005.

[84] Jingjin Yu and Steven M. LaValle. Tracking hidden agents through shadow information spaces. In Proc. IEEE International Conference on Robotics and Automation (ICRA '08), pages 2331-2338, Pasadena, California, May 2008.

[85] Jian Zhao and Sen-ching S. Cheung. Multi-camera surveillance with visual tagging and generic camera placement. In First ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC '07), pages 259-266, Vienna, Austria, 2007.

[86] Hua Zhong, Jianbo Shi, and Mirko Visontai. Detecting unusual activity in video. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), 2:819-826, June 2004.