ABSTRACT Title of Dissertation: TOWARDS AUTONOMOUS DRIVING IN DENSE, HETEROGENEOUS, AND UNSTRUCTURED TRAFFIC Rohan Chandra Doctor of Philosophy, 2022 Dissertation Directed by: Professor Dinesh Manocha Department of Computer Science

This dissertation addresses key problems in autonomous driving towards handling dense, heterogeneous, and unstructured traffic environments. Autonomous vehicles (AVs) at present are restricted to operating on smooth and well-marked roads, in sparse traffic, and among well-behaved drivers. We developed new techniques to perceive, predict, and plan among human drivers in traffic that is significantly denser in terms of the number of traffic agents, more heterogeneous in terms of the size and dynamic constraints of traffic agents, and where many drivers do not follow the traffic rules. In this thesis, we present work along three themes: perception, driver behavior modeling, and planning. Our novel contributions include:
1. Improved tracking and trajectory prediction algorithms for dense and heterogeneous traffic using a combination of computer vision and deep learning techniques.
2. A novel behavior modeling approach using graph theory for characterizing human drivers as aggressive or conservative from their trajectories.
3. Behavior-driven planning and navigation algorithms in mixed (human driver and AV) and unstructured traffic environments using game theory and risk-aware control.
Additionally, we have released a new traffic dataset, METEOR, which captures rare and interesting multi-agent driving behaviors in India. These behaviors are grouped into traffic violations, atypical interactions, and diverse scenarios. We evaluate our perception work on tracking and trajectory prediction using standard autonomous driving datasets such as the Waymo Open Motion, Argoverse, and nuScenes datasets, as well as public leaderboards, where our tracking approach achieved rank 1 among over 100 methods.
We apply human driver behavior modeling to planning and navigation at unsignaled intersections and in highway scenarios using state-of-the-art traffic simulators and show that our approach yields fewer collisions and deadlocks compared to methods based on deep reinforcement learning. We conclude the presentation with a discussion of future work.

TOWARDS AUTONOMOUS DRIVING IN DENSE, HETEROGENEOUS, AND UNSTRUCTURED TRAFFIC by Rohan Chandra Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2022

Advisory Committee: Dr. Dinesh Manocha, Chair/Advisor; Dr. Derek Paley, Dean's Rep; Dr. Yiannis Aloimonos; Dr. Pratap Tokekar; Dr. Mac Schwager

© Copyright by Rohan Chandra 2022

To my family.

Acknowledgments

This dissertation was carried out through the collective efforts of many individuals. My advisor Dinesh Manocha deserves, of course, the first mention. His influence on my research extends beyond boilerplate research supervision. Dinesh helped me understand the importance of not losing the forest for the trees in all aspects of scientific communication. I am grateful for the freedom he provided for me to pursue a wide range of exciting ideas to tackle important problems in autonomous driving. Many thanks to my committee members, Dr. Yiannis Aloimonos, Dr. Derek Paley, Dr. Mac Schwager, and Dr. Pratap Tokekar, who graciously accommodated my back-and-forth emails trying to finalize a date for the defense. I thank Yiannis Aloimonos for many spirited conversations during my early years at UMD as well as for writing a letter of recommendation that helped me get into the PhD program at UMD. I also thank Pratap Tokekar for his advice and guidance in the postdoc search.
Finally, I am grateful for the collaboration with Mac Schwager, through which I learned about risk sensitivity analysis and its interplay with driver behavior modeling and game theory. Many of these ideas form the basis of future projects I have planned. The last section in Chapter 4 is the result of our joint work. Next, I am grateful to all my co-authors and collaborators: Aniket Bera, Uttaran Bhattacharya, Tianrui Guan, Divya Kothandaraman, Angelos Mavrogiannis, and Trisha Mittal. It would also be appropriate to mention here the people who indirectly provided the logistical and infrastructural support for my research: our department coordinator Tom Hurst along with the entire staff of UMD CS and UMIACS. Finally, I am grateful to Tom Goldstein and Jordan Boyd-Graber for writing letters of recommendation for PhD programs and assisting me in the application process, and to Ashok Agrawala for his wisdom in all matters important. Moving on, I am lucky to have a strong personal support system consisting of family and friends, who made research fun. I acknowledge all the members of "Manocha's Minions", in particular, Senthil "Sentinel" Arul, Trisha Mittal, Utsav Patel, Adarsh Jagan, Pooja Guhan, Kasun Weerakoon, and Uttaran Bhattacharya, and all other members of the GAMMA group. I would also like to thank friends outside of GAMMA as well as those outside of UMD. Finally, I am grateful for the continual support of my mom, dad, and brother, whose induction into the U.S. Marine Corps became a constant source of inspiration for me.

Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 Main Contributions
1.1.1 Applications
1.2 Overview of the thesis
Chapter 2: Perception for Dense and Heterogeneous Traffic
2.1 Overview
2.2 Related Work
2.2.1 Object Detection
2.2.2 Pedestrian and Vehicle Tracking
2.2.3 Motion Models in Pedestrian Tracking
2.3 Tracking in dense traffic
2.3.1 Tracking by Detection
2.3.2 Reduced Probability of Track Loss
2.3.3 Simultaneous Collision Avoidance and Interactions
2.3.4 Results
2.4 Trajectory prediction in heterogeneous traffic
2.4.1 Overview
2.5 Related Work
2.5.1 Prediction Algorithms and Interactions
2.5.2 Deep-Learning Based Methods
2.5.3 Traffic Datasets
2.5.4 Advanced Driver Assistance Systems (ADAS)
2.5.5 Road-Agent Behavior Prediction
2.5.6 Traffic Flow and Forecasting
2.5.7 Hybrid Architecture for Traffic Prediction
2.5.8 Results
2.6 RobustTP: Improving robustness of prediction in unstructured traffic
2.6.1 TrackNPred: A Software Framework for End-to-End Trajectory Prediction
2.7 Behavior Prediction
2.7.1 Problem Statement
2.7.2 Network Overview
2.7.3 Spectral Clustering Regularization
2.7.4 Analysis and Discussion
2.7.5 Long-Term Prediction Analysis
2.7.6 Behavior Prediction Results
2.7.7 Long-Term Prediction Analysis
Chapter 3: Online Driver Behavior Modeling
3.1 Overview
3.2 Related Work
3.2.1 Graph-based Machine Learning
3.2.2 Data-Driven Methods for Driver Behavior Prediction
3.2.3 Navigation Research in Autonomous Driving
3.2.4 Interpretation of Driver Behavior in Social Science
3.3 Representing Traffic Data Using Graphs
3.4 StylePredict: Mapping Trajectories to Behavior
3.4.1 Centrality Measures
3.4.2 Algorithm
3.4.3 Polynomial Regression
3.4.4 Style Likelihood and Intensity Estimates
3.5 Behavior Classification Using Machine Learning
3.5.1 Experiments and Results
3.6 Speeding up the eigenvector centrality
3.6.1 Eigenvalue Algorithm
3.6.2 Graph Spectrum Analysis
3.6.3 Behavior Classification
3.6.4 Running Time Evaluation
3.7 Conclusions, Limitations, and Future Work
Chapter 4: Behaviorally Compliant Planning in Human Environments
4.1 Overview
4.2 Prior Work
4.2.1 Deep reinforcement learning (DRL)
4.2.2 Game theory
4.2.3 Recurrent neural networks (RNNs)
4.2.4 Auctions
4.3 Planning at Unsignaled Intersections, Roundabouts, and Merging
4.3.1 Problem Formulation
4.3.2 Modeling human driver behavior
4.3.3 Sponsored search auctions (SSAs)
4.3.4 Algorithm
4.3.5 Game-theoretic optimality and efficiency analysis
4.3.6 Using σOPT for collision prevention and deadlock resolution
4.3.7 Conclusion, Limitations, and Future Work
4.4 Risk-Aware Planning
4.4.1 Related Work
4.4.2 Algorithm
4.4.3 Experiments and Results
4.4.4 Conclusion, Limitations, and Future Work
Chapter 5: Software and Datasets
5.1 Overview
5.1.1 Main Contributions
5.1.2 Applications and Benefits
5.2 Comparison with Existing Datasets
5.2.1 Tracking and Trajectory Prediction Datasets
5.2.2 Semantic Segmentation Datasets
5.2.3 Behavior Prediction
5.3 METEOR dataset
5.3.1 Dataset Collection
5.3.2 Dataset organization
5.3.3 Annotations
5.3.4 Rare and Interesting Behaviors
5.3.5 Dataset statistics
5.4 Experiments and Analysis
5.4.1 Analyzing Object Detection in Unstructured Scenarios
5.4.2 Multi-Agent Behavior Recognition
5.5 Conclusion
Chapter 6: Conclusion

List of Tables

2.1 Ablation experiments to show the advantage of SimCAI.
We replace SimCAI with a constant velocity model (Const Lin Vel) [1], Social Forces (SF) [2], and the RVO motion model (RVO) [3]. The rest of the method is identical to the original method. All variations operate at approximately 30 fps. Bold is best. Arrows (↑, ↓) indicate the direction of better performance.
2.2 Evaluation on the TRAF dataset with MOTDT [4] and MDP [5]. MOTDT is currently the best online tracker on the MOT benchmark with open-sourced code. Bold is best. Arrows (↑, ↓) indicate the direction of better performance. Observation: RoadTrack improves the accuracy (MOTA) over the state-of-the-art by 5.2% and precision (MOTP) by 0.2%.
2.3 Evaluation on the KITTI-16 dataset from the MOT benchmark with online methods that have an average rank higher than ours. RoadTrack is at least approximately 4× faster than prior methods. While we do not outperform on the MOTA metric, we still achieve the highest MT, ML, FN, and MOTP. We analyze our MOTA performance in Section 2.3.4. Bold is best. Arrows (↑, ↓) indicate the direction of better performance. The values for all methods correspond to the KITTI-16 sequence specifically, and not the entire 2D MOT15 dataset.
2.4 Evaluation on the full MOT benchmark. The full MOT dataset is sparse and is not a traffic-based dataset. RoadTrack is at least approximately 4× faster than previous methods. While we do not outperform on the MOTA metric, we still achieve the highest MT, ML (MOT16), FN, and MOTP (MOT15). We analyze our MOTA performance in Section 2.3.4. Bold is best. Arrows (↑, ↓) indicate the direction of better performance.
2.5 Evaluation on sparse or homogeneous traffic datasets: The first number is the average RMSE error (ADE) and the second number is the final RMSE error (FDE) after 5 seconds (in meters).
NGSIM is a standard sparse traffic dataset with few heterogeneous interactions. The Beijing dataset is dense but with relatively low heterogeneity. Lower is better and bold represents the most accurate result.
2.6 Evaluation on our new, highly dense and heterogeneous TRAF dataset. The first number is the average RMSE error (ADE) and the second number is the final RMSE error (FDE) after 5 seconds (in meters). The original setting for a method indicates that it was tested with default settings. The learned setting indicates that it was trained on our dataset for fair comparison. We present variations of our approach with each weighted interaction and demonstrate the contribution of the method. Lower is better and bold is the best result.
2.7 Comparison of our new TRAF dataset with various traffic datasets in terms of heterogeneity and density of traffic agents. Heterogeneity is described in terms of the number of different agents that appear in the overall dataset. Density is the total number of traffic agents per km in the dataset. The value for each agent type under "Agents" corresponds to the average number of instances of that agent per frame of the dataset. It is computed by taking all the instances of that agent and dividing by the total number of frames. Visibility is a ballpark estimate of the length of road in meters that is visible from the camera. NGSIM data were collected using tower-mounted cameras (bird's-eye view), whereas both Beijing and TRAF data presented here were collected with car-mounted cameras (frontal view).
2.8 The list of algorithms currently implemented in TrackNPred.
2.9 We evaluate RobustTP against methods that use noisy sensor input, on the TRAF Dataset.
The trajectory histories are computed using tracking by two detection methods: Mask R-CNN [6] and YOLO [7]. The results are reported in the format ADE/FDE, where ADE is the average displacement RMSE over the k seconds of prediction and FDE is the final displacement RMSE at the end of k seconds. We tested both short-term (k = 3) and longer-term (k = 5) predictions. We observe in all cases that RobustTP is the state-of-the-art.
2.10 Main Results: We report the Average Displacement Error (ADE) and Final Displacement Error (FDE) for prior road-agent trajectory prediction methods in meters (m). Lower scores are better and bold indicates the SOTA. We used the original implementation and results for GRIP [8] and Social-GAN [9]. "-" indicates that results for that particular dataset are not available. Conclusion: Our spectrally regularized method ("S1 + S2") outperforms the next best method (GRIP) by up to 70%, as well as the ablated version of our method ("S1 Only") by up to 75%.
2.11 Upper Bound Analysis: We list the upper bound on the RMSE for all agents at a time-step; $T - \tau$ is the length of the prediction window. T-FDE (Eq. 2.20) is the theoretical FDE that should be achieved by using spectral regularization. The FDE results are obtained from Table 3.3. The % agreement between T-FDE and FDE is computed as $(\text{T-FDE} - \text{FDE})/\text{T-FDE}$.

Lemma 2.3.1. If $\|f_j^{M_h}\|_0 > \|f_j^{F_h}\|_0$, then $d(f_j, f_j^{M_h}) < d(f_j, f_j^{F_h})$ with probability $1 - \frac{B}{A}$, where $A$ and $B$ are positive integers and $A > B$.

Proof. Using the definition of the cosine metric, the lemma reduces to proving the following. We pad both $f_j^{M_h}$ and $f_j^{F_h}$ such that $\|f_j^{M_h}\|_0 > \|f_j^{F_h}\|_0$. We reduce $f_j^t$, $f_j^{M_h}$, and $f_j^{F_h}$ to binary vectors, i.e., vectors composed of 0s and 1s. Let $\Delta f = f_j^{M_h} - f_j^{F_h}$. We denote the number of 1s and $-1$s in $\Delta f$ as $A$ and $B$, respectively. Now, let $x$ and $y$ denote the $L_0$ norms of $f_j^{M_h}$ and $f_j^{F_h}$, respectively. From our padding procedure, we
From our padding procedure, wej j have x > y. Then, if y = B, then x = A and we trivially have A > B. But if y > B, then A = x ? (y ? B) =? A ? B = x ? y. From x > y, it again follows that A > B. Thus, x > y =? A > B. Next, we define a (1, 1) coordinate in an ordered pair of vectors as the coordinate where both vectors contain 1s. Similarly, a (1,?1) coordinate in an ordered pair of vectors is the coordinate where the first vector contains 1 and the second vector contains ?1. Then, let pa and pb respectively denote the number of (1, 1) coordinates and (1,?1) coordinates in the pair (ft,?f). By definition, we have 0 < pa < A and 0 < pb < B. Thus, if we assume pa and pb to be uniformly distributed, it directly follows that P(pa > pb) = 1? B .A Based on Lemma 2.3.1, we finally prove the following proposition. Proposition 2.3.1. With probability 1 ? B , sparse feature vectors extracted from segmented A boxes decrease the loss of pedestrian tracks, thereby reducing the number of false negatives in comparison to regular bounding boxes. Proof. In our approach, we use Mask R-CNN for pedestrian detection, which outputs bounding boxes and their corresponding masks. We use the mask and bounding box pair to generate a segmented box. The correct assignment of an ID depends on successful feature matching between the predicted measurement feature and the optimal detection feature, that is, when the cosine cost between the two features is below a set threshold, ?. But since the correct ID assignment depends on multiple factors including a low cosine cost, we instead exploit the equivalency of the contra- 19 positive, d(f , fh?) > ? ? (? = ?) (2.3)j Equation 2.3 indicates that when the cosine cost is greater than ?, then feature matching fails and therefore an ID fails to be assigned, or equivalently, set to ?. Using Lemma 2.3.1 and the fact that ? ? U[0, 1], i.i.d. P(d(f , fMh? ) > ?) < P(d(f , fFh? ) > ?)j,pi j,pi Using the equivalency in Eq. 2.3, we obtain, PM(? = ?) < PF (? 
= ?) (2.4) Now, in our approach, we set certain fixed conditions that need to be satisfied for a track to be destroyed or lost. Those conditions are listed as follows: 1. DensePeds updates the ID of every pedestrian after each frame. If an update fails to occur for ? > ? frames, then this leads to loss of the track of that pedestrian. 2. If the first condition is satisfied, and the current ID of a pedestrian is not set, then the track of that pedestrian is lost. We formalize the two conditions as follows, (? > ?) ? (? = ?) ? Tt ? {?} 20 Using Eq. 2.4, it follows that, PM(Tt ? {?}) < PF (Tt ? {?}) (2.5) Informally, equation 2.5 tells us that the probability of losing a track using segmented boxes is less than the probability of losing a track if we were to use regular bounding boxes. To complete the proof, we now show that equation 2.5 implies that fewer lost tracks leads to fewer false negatives. We start by defining the total number of false negatives (FN) as ?T ? FN = ?Tt (2.6) t=1 pg?G where pg ? G denotes a ground truth pedestrian in the set of all ground truth pedestrians at current time t and ?z = 1 for z = 0 and 0 elsewhere. This is a variation of the Kronecker delta function. Using Eq. 2.5 and Eq. 2.6, we can say that fewer lost tracks (Tt ? {?}) indicate a smaller number of false negatives. We empirically demonstrate this analysis in Section 2.3.4. The upper bound, PF (Tt), in Eq. 2.5 depends on the amount of padding done to f and f . A general observed trend is that a higher amount of padding results in a larger upper bound in Eq. 2.5. 2.3.3 Simultaneous Collision Avoidance and Interactions One of the major challenges with tracking heterogeneous road-agents in dense traffic is that road-agents such as cars, buses, bicycles, road-agents, etc. have different sizes, geometric shape, 21 Figure 2.3: Inner yellow circle denotes the social distance and the outer orange area denotes the public region. At time t ? ? , pi intends to interact with pk. 
Then, (left) $p_i$ determines its ability to interact with $p_k$. We observe that the cone $\zeta$ (grey) of $p_i$ contains the personal space $\Omega$ of $p_k$ (green circle around $p_k$). Thus $p_i$ can interact with $p_k$. (right) $p_i$ and $p_k$ align their preferred velocities toward each other.

maneuverability, behavior, and dynamics. This often leads to complex inter-agent interactions that have not been taken into account by prior multi-object trackers. Furthermore, road-agents in high-density scenarios are in close proximity to one another or are almost colliding. So we need an efficient approach for predicting the next state of a road-agent by modeling the collisions and interactions. We thus present SimCAI, which takes into account both:
- Reciprocal collision avoidance [3] with car-like kinematic constraints for trajectory prediction and collision avoidance.
- Heterogeneous road-agent interaction between pedestrians, two-wheelers, rickshaws, buses, cars, and so on.
All the notations used in the paper are provided in Table I of the full version of this text [45].

2.3.3.1 Velocity Prediction by Modeling Collision Avoidance

Reciprocal Velocity Obstacles (RVO) [3] extends the Velocity Obstacles motion model by modeling collision avoidance behavior for multiple engaging agents. RVO can be applied to pedestrians in a crowd, and we modify it to work with bounding boxes as our algorithm conforms to the tracking-by-detection paradigm. We represent each agent as $\Psi_t = [u, v, \dot{u}, \dot{v}, v_{pref}]$, where $u, v$, $\dot{u}, \dot{v}$, and $v_{pref}$ represent the top-left corner of the bounding box, their velocities, and the preferred velocity of the agent in the absence of obstacles, respectively. $v_{pref}$ is computed internally by RVO. The computation of the new state, $\Psi_{t+1}$, is expressed as an optimization problem. For each agent, RVO computes a feasible region where it can move without collision. This region is defined according to the RVO collision avoidance constraints (or ORCA constraints [3]).
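The per-agent state described above can be made concrete with a small sketch. This is a minimal illustration under assumed names (`AgentState`, `bbox_to_state` are mine, not the thesis code); the center-anchoring conversion follows the bounding-box convention used in this section.

```python
from dataclasses import dataclass

@dataclass
class AgentState:
    """Per-agent state Psi_t = [u, v, u_dot, v_dot, v_pref].
    Illustrative container only; field names are assumptions, not thesis code."""
    u: float        # horizontal position (anchored at the bounding-box center)
    v: float        # vertical position
    u_dot: float    # current velocity, x component
    v_dot: float    # current velocity, y component
    v_pref: tuple   # preferred velocity in the absence of obstacles

def bbox_to_state(x, y, w, h, u_dot=0.0, v_dot=0.0, v_pref=(0.0, 0.0)):
    """Convert a detector box (top-left x, y, width w, height h) into an
    agent state anchored at the box center, i.e. u = x + w/2, v = y + h/2."""
    return AgentState(x + w / 2.0, y + h / 2.0, u_dot, v_dot, v_pref)
```

For example, a 4×6 box whose top-left corner sits at (10, 20) yields a state centered at (12, 23).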
If the ORCA constraints forbid an agent's preferred velocity, that agent chooses the velocity closest to its preferred velocity that lies in the feasible region, as given by the following optimization:

$v_{new} = \operatorname{argmin}_{v \in ORCA} \|v - v_{pref}\|$   (2.7)

The velocity, $v_{new}$, is then used to calculate the new position of a road-agent. The differences in shapes, sizes, and aspect ratios of road-agents motivate the need to use appearance-based features. In order to combine object detection with RVO, we modify the state vector, $\Psi_t$, to include bounding box information by setting the position to the center of the bounding box. Thus, $u \leftarrow u + \frac{w}{2}$ and $v \leftarrow v + \frac{h}{2}$, where $w$ and $h$ denote the width and height, respectively, of the corresponding bounding box. Finally, the original RVO models the motion of agents seen from a top view. Therefore, to account for front-view traffic as well as top-view, we use the modification proposed by the authors of [12] that allows RVO to model the motion of road-agents in front-view traffic scenes.

2.3.3.2 Velocity Prediction by Modeling Road-Agent Interactions

In a traffic scenario, interactions can occur between different types of road-agents: vehicle-vehicle, pedestrian-pedestrian, vehicle-pedestrian, bicycle-pedestrian, etc. In this section, we present a formulation to model such interactions. Our input is an RGB video captured from a camera with known camera parameters. By using the camera center as the origin, we transform pixel coordinates to scene coordinates for the computations that follow in this section.

Intent of Interaction. The idea of using spatial regions to characterize agent behavior was proposed in [54]. The authors introduced the notion of "public" and "social" regions, which take the form of concentric circles. We show a quadrant of these regions in Figure 2.3, where the yellow area is the social region and the orange area is the public region. Based on this work, Satake et al.
[55] proposed a model of approach behavior with which a robot can interact with humans. At the public distance the robot is allowed to approach the human to interact with them, and at the social distance, interaction occurs. In SimCAI, we set the public and social distances heuristically. We say that a road-agent, $p_i$, intends to interact with another agent, $p_k$, when $p_i$ is within the social distance of $p_k$ for some minimum time $\tau$. When two road-agents intend to interact, they move towards each other and come into close proximity.

Ability to Interact. Even when two road-agents want to interact, their movements could be restricted in dense traffic. We determine the ability to interact (Figure 2.3, right) as follows. Each agent has a personal space, which we define as a circular region $\Omega$ of radius $\lambda$, centered around $p_k$. Given a road-agent $p_i$, the slope of its $v_{pref}$ is $\tan\theta$, where $\theta$ is the angle with the horizontal defined in the world coordinate system. In dense traffic, each agent $p_i$ has a limited space in which it can steer, or turn. This space is the feasible region determined by the ORCA constraints described in the previous section. We define a 2D cone, $\zeta$, of angle $\beta$ as the ORCA region in which the agent can steer; $\beta$ is thus the steering angle of the agent. We denote the extreme rays of the cone as $r_1$ and $r_2$, and the smallest perpendicular distance between any two geometric structures, say $G_1$ and $G_2$, as $\|G_1 - G_2\|_\perp$. These parameters are fixed for different agent types and are not learned from data. If $p_i$ has intended to interact with $p_k$, the projected cone of $p_i$, defined by extending $r_1$ and $r_2$, is directed towards $p_k$. Then, in order for interaction to take place, it is sufficient to check that either one of two conditions is true:
1. Condition $\Gamma_1$: intersection of $\Omega$ with either $r_1$ or $r_2$ (if either ray intersects, then the entire cone intersects $\Omega$).
2. Condition $\Gamma_2$: $\Omega \subset \zeta$ (if $\Omega$ lies in the interior of the cone, see Figure 2.3).
For these conditions to hold, we require that the cone does not intersect or contain any $p_j \in P$, $j \neq i$. We now make these equations more explicit. We parametrize $r_1$ and $r_2$ by their slopes $\tan\theta$, with $\theta = \theta_i + \beta_i$ for $r_1$ and $\theta = \theta_i - \beta_i$ for $r_2$. The resulting equation of $r_1$ (or $r_2$) is $(Y - v_i) = \tan\theta \, (X - u_i)$, and the equation of $\Omega$ is $(X - u_k)^2 + (Y - v_k)^2 = \lambda^2$. Solving both equations simultaneously, we obtain a quadratic equation $\Gamma_1$. Intersection occurs if the discriminant of $\Gamma_1 \geq 0$. This provides us with the first condition necessary for the occurrence of an interaction between $p_i$ and $p_k$. Next, we observe that if $\Omega$ lies in the interior of $\zeta$, then $p_k$ lies on opposite sides of $r_1$ and $r_2$, which is modeled by the following equation:

$\Gamma_2 := r_1(p_k) \cdot r_2(p_k) \leq 0$   (2.8)

Solving Equation 2.8 further provides us with the second condition for the occurrence of an interaction between $p_i$ and $p_k$, where $\Gamma_1, \Gamma_2 : \mathbb{R}^2 \times \mathbb{R}^2 \times \mathbb{R} \times \mathbb{R} \mapsto \mathbb{R}$.

Interaction. If either $\Gamma_1$ or $\Gamma_2$ is true, then road-agents $p_i$ and $p_k$ will move towards each other to interact. When this happens, we assume that $p_i$ and $p_k$ align their current velocities towards each other. Thus, $v_{new} = v_{pref}$. The time taken for the two road-agents to meet or converge is given by $t = \frac{\|p_i - p_k\|_2}{\|v_i - v_k\|_2}$. If two road-agents are overlapping (based on the values of $\Gamma_1$ and $\Gamma_2$), we model them as a new agent with radius $2\lambda$. Our approach can be extended to model multiple interactions. Currently, we restrict an interaction to take place between two road-agents. Therefore, in the case of multiple possible interactions with an agent $p_k$, we form a set $Q \subset P$, where $Q$ is the set of all road-agents $p'$ that are intending to interact with $p_k$. We determine the road-agent that will interact with $p_k$ as the one that minimizes the distance to $p_k$ after a fixed time-step $\Delta t$. Thus, $p^* = \operatorname{argmin}_{p' \in Q} \|(p' + v'\Delta t) - p_k\|$.
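The two geometric conditions above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the thesis implementation: the function names are mine, direction angles replace slopes, and `theta` denotes the heading, `beta` the cone half-angle, and `lam` the personal-space radius.

```python
import math

def ray_hits_circle(pi, theta, pk, lam):
    """First condition (sketch): does the ray from pi with direction angle
    theta intersect the personal-space circle of radius lam around pk?
    Substitute the ray into the circle equation and test the discriminant,
    then check that the larger root t is nonnegative (a forward hit)."""
    dx, dy = math.cos(theta), math.sin(theta)
    fx, fy = pi[0] - pk[0], pi[1] - pk[1]
    b = 2.0 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - lam * lam
    disc = b * b - 4.0 * c  # the quadratic's leading coefficient is 1
    return disc >= 0.0 and (-b + math.sqrt(disc)) >= 0.0

def inside_cone(pi, theta_i, beta, pk):
    """Second condition (sketch): pk lies between the extreme rays r1, r2
    (headings theta_i +/- beta), i.e. the signed side tests against r1 and
    r2 have opposite signs (product <= 0)."""
    def side(angle):
        # signed cross product of the ray direction with (pk - pi)
        dx, dy = math.cos(angle), math.sin(angle)
        return dx * (pk[1] - pi[1]) - dy * (pk[0] - pi[0])
    return side(theta_i + beta) * side(theta_i - beta) <= 0.0
```

For instance, an agent at the origin heading along the x-axis sees a circle centered at (5, 0): the ray intersects it, and the center lies inside a cone of half-angle 0.5 rad, so either condition would admit the interaction.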
Road-agents that are not interacting avoid each other and continue moving towards their destination.

2.3.3.3 Analysis

We analyze the accuracy and runtime performance of SimCAI in traffic scenarios with increasing density and heterogeneity.

Accuracy Analysis: We analytically show the advantage of SimCAI over other motion models such as Social Forces [2], RVO [3, 44], and constant velocity [1]. We denote the multiple object tracking accuracy (MOTA) of a system using a particular motion model as MOTAmodel and define it as MOTAmodel = Σc MOTAc + Σi MOTAi, where c and i denote an agent whose motion is being modeled using collision avoidance and interaction, and MOTAc and MOTAi denote their individual accuracies, respectively. Let n represent the number of total road-agents in a video; then we have n = nc + ni, where nc, ni correspond to the number of agents that are avoiding collisions and are interacting, respectively. Increasing n would increase the number of road-agents whose motion is modeled through collision avoidance or heterogeneous interaction formulations. Linear models do not account for either formulation. Standard RVO only accounts for collision avoidance. SimCAI models both. Therefore, we rationalize that

MOTAc^linear ≤ MOTAc^RVO ≤ MOTAc^SimCAI
MOTAi^linear ≤ MOTAi^RVO ≤ MOTAi^SimCAI
⟹ MOTA^linear ≤ MOTA^RVO ≤ MOTA^SimCAI

We validate the analysis presented here in Section 2.3.4.

Runtime Analysis: At approximately 30 fps, we achieve a minimum speed-up of approximately 4×, and up to approximately 30×, over state-of-the-art methods on the MOT dataset (Table 2.3). The state-of-the-art methods use RNNs to model the motion of road-agents [56, 57], while we use the modified RVO formulation. We exploit the geometrical formulation of SimCAI to state and prove the following theorem:

Theorem 2.3.1. Given P = {pi | 1 ≤ i ≤ n}, a set of n road-agents in a traffic scene that may assume any shape, size, and agent-type, if state(pi) ∈ {stationary, collision avoiding, interacting} for all i ≤ n, then SimCAI can track the n road-agents in O(nc + βni), where β ≪ ni.

Proof. RVO is based on linear programming and can perform tracking with a proven runtime complexity of O(n) [3]. Now, if we assume that agents always assume one of the following states: stationary, avoiding collision, or interacting, then we have n = nc + ni, where nc, ni correspond to the number of agents in collision-avoidance states and interacting states, respectively. We ignore stationary road-agents. Following the formulation in Section 2.3.3.2, for each interacting road-agent, SimCAI predicts a new velocity by solving a linear optimization problem over β road-agents. Thus, the runtime complexity of SimCAI is O(nc + βni), where β ≪ ni.

Our high fps is a consequence of our linear runtime complexity, and we validate our theoretical claims in Section 2.3.4. We further hypothesize that prior deep learning-based methods [56, 57] are less optimal in terms of runtime due to the intensive computation required by deep neural networks [58, 59]. For example, ResNet [60] needs more than 25 MB to store the computed model in memory, and more than 4 billion floating-point operations (FLOPs) to process a single image of size 224×224 [58]. We would like to clarify that by realtime performance, we refer to the realtime computation of the tracking algorithm only. We do not consider the computation time of Mask R-CNN. This is standard practice for tracking-by-detection algorithms [57] that contribute only the tracking component, similar to this work. We therefore compare with realtime tracking algorithms.

2.3.4 Results

On Dense Datasets: We provide results on the TRAF dataset using RoadTrack and demonstrate a state-of-the-art average MOTA of 75.8% (Table 2.2). The aim of this experiment is to highlight the advantage of our overall tracking algorithm in dense and heterogeneous traffic.
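For reference, the overall MOTA reported in the tables follows the standard CLEAR-MOT definition. A minimal sketch with illustrative counts (this is the standard metric, not the per-class decomposition used in the accuracy analysis above):

```python
def mota(fn, fp, ids, num_gt):
    """Standard CLEAR-MOT accuracy: MOTA = 1 - (FN + FP + IDS) / GT,
    where FN, FP, and IDS are the counts of false negatives, false
    positives, and identity switches, and GT is the total number of
    ground-truth objects across all frames."""
    return 1.0 - (fn + fp + ids) / num_gt

# Illustrative numbers only: 20 misses, 10 false alarms, 5 ID switches
# out of 100 ground-truth objects gives MOTA = 0.65.
print(mota(20, 10, 5, 100))
```

Note that, as discussed in the sparse-benchmark evaluation, the dissertation excludes the FP term on KITTI-16 and 2D MOT15 for all compared methods; the sketch above keeps the textbook form.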
We compare RoadTrack with prior methods on the dense TRAF dataset in Table 2.2. MOTDT [4] and MDP [5] are the only state-of-the-art methods with available open-source code. All methods are evaluated using a common set of detections obtained using Mask R-CNN. Compared to these methods, we improve MOTA by 5.2% absolute. This is roughly equivalent to a rank difference of 46 on the MOT benchmark. MOTDT is currently the fastest method on the MOT16 benchmark. Our approach operates at realtime speeds of up to approximately 30 fps and is comparable with MOTDT (Table 2.2). Our realtime performance results from the runtime analysis in Section 2.3.3.3 and Theorem 2.3.1. Note that we observe an abnormally high number of identity switches compared to other methods; however, this is because prior methods mostly fail to maintain an agent's track for more than 20% of its total visible time (near 100% ML). Not being able to track road-agents for most of the time excludes those agents as possible candidates for IDS, thereby resulting in lower IDS for prior methods. Interestingly, the low IDS score for prior methods also contributes to their reasonably high MOTA score, despite near-failure to track agents in dense traffic.

On Standard Benchmarks: In the interest of completeness and thorough evaluation, we also evaluate RoadTrack on sparser tracking datasets and present results on both traffic-only datasets (KITTI-16) in Table 2.3 as well as datasets containing only pedestrians (MOT) in Table 2.4.

Motion Model FPS↑ MT(%)↑ ML(%)↓ IDS↓ FN↓ MOTP(%)↑ MOTA(%)↑
Const. Vel 30 0.0 100 11 247,738 (33.3%) 66.3 66.7
SF 30 0.1 98.6 147 246,528 (33.1%) 63.8 66.3
RVO 30 0.0 100 38 247,675 (33.2%) 63.8 66.9
SimCAI 30 7.0 66.9 1128 178,997 (24.0%) 65.7 75.8

Table 2.1: Ablation experiments to show the advantage of SimCAI. We replace SimCAI with a constant linear velocity model (Const Lin Vel) [1], Social Forces (SF) [2], and the RVO motion model (RVO) [3]. The rest of the method is identical to the original method. All variations operate at a similar rate of approximately 30 fps. Bold is best. Arrows (↑, ↓) indicate the direction of better performance.

RoadTrack's main advantage is SimCAI, which is based on modeling collision avoidance and interactions. In the absence of one or both, we do not expect it to demonstrate superior performance over prior methods on the sparse KITTI-16 and MOT datasets. While not conclusive, we believe our low MOTA score on the 2D MOT15 and KITTI-16 may also be attributed to a high number of detections that are incorrectly classified as false positives. For instance, road-agents that are too distant to be manually labeled are not annotated in the ground-truth sequence. We observed this to be true for the methods we compared with as well. Therefore, we exclude FP from the calculation of MOTA for all methods in the interest of fair evaluation. We note, however, that RoadTrack is at least 4× faster on the KITTI-16 and 2D MOT15 datasets at approximately 30 fps (Tables 2.3, 2.4). To explain the speed-up, we refer to Theorem 2.3.1 and the runtime analysis presented in Section 2.3.3.3. We specifically point to the 15× and 5× speed-ups over the learning-based tracking methods [56, 57] in Table 2.3, which we attribute to the linear-time computation of SimCAI as opposed to the intensive computation required by deep learning models.

Ablation Experiments: We highlight the advantages of SimCAI through the ablation experiments in Table 2.1. The aim of these experiments is to isolate the benefit of SimCAI.

Figure 2.4: Qualitative analysis of RoadTrack on the TRAF dataset at night time, consisting of cars, 2-wheelers, 3-wheelers, and trucks. Frames are chosen with a gap of 2 seconds (approximately 60 frames). For visual clarity, each road-agent is associated with a unique ID number, displayed in orange. Note the consistency of the IDs, for example, the 3-wheeler (1), car (2), and 2-wheeler (3).
We compare with the following variations of RoadTrack, in which we replace our novel motion model SimCAI with standard and state-of-the-art motion models while keeping the rest of the system untouched:

• Constant Linear Velocity (Const Lin Vel). We replace SimCAI with a constant-velocity linear motion model [1].
• Social Forces (SF). We replace SimCAI with the Social Forces motion model [2].
• Reciprocal Velocity Obstacles (RVO) [3]. We replace SimCAI with the RVO motion model.

We compare SimCAI with the other motion models (constant linear velocity, Social Forces, and RVO) on the dense TRAF dataset. These experiments were performed by replacing only SimCAI with the other motion models, keeping the rest of the system unchanged. We observe that SimCAI outperforms the other motion models by at least 8.9% absolute on MOTA. All the variations used in the ablation experiments operated at the same rate of approximately 30 fps. Additionally, we experimentally verify the analysis of Section 2.3.3.3 by observing that MOTA^linear ≤ MOTA^RVO ≤ MOTA^SimCAI. Once again, we point to our high IDS in Table 2.1 compared to the IDS of the other motion models. As mentioned previously, this is due to the near-failure of the other motion models (near 100% ML) to track road-agents in dense traffic. Not being able to track a road-agent excludes it as an IDS candidate.

Dataset Tracker FPS↑ MT(%)↑ ML(%)↓ IDS↓ FN↓ MOTP(%)↑ MOTA(%)↑
TRAF1 MOTDT 37.9 0 98.2 15 (<0.1%) 18,764 (33.0%) 63.3 67.0
TRAF1 MDP 9.3 0 98.2 21 (<0.1%) 18,667 (32.8%) 60.1 67.1
TRAF1 RoadTrack 43.9 0 95.6 163 (0.3%) 17,953 (31.6%) 58.8 68.1
TRAF2 MOTDT 41.6 0 98.8 17 (<0.1%) 18,201 (32.7%) 60.3 67.3
TRAF2 MDP 20.9 0 100.0 7 (<0.1%) 18,105 (32.5%) 59.6 67.5
TRAF2 RoadTrack 12.3 0 92.3 55 (0.1%) 17,202 (30.9%) 60.8 69.0
TRAF3 MOTDT 50.7 3.3 67.1 64 (<0.1%) 34,883 (27.0%) 69.6 72.9
TRAF3 MDP 51.8 0 100.0 0 (0.0%) 43,057 (33.3%) 69.2 66.7
TRAF3 RoadTrack 36.6 32.2 40.0 62 (<0.1%) 19,521 (15.1%) 70.1 84.8
TRAF4 MOTDT 36.6 1.2 76.3 123 (0.1%) 54,849 (29.0%) 65.3 70.9
TRAF4 MDP 9.0 1.2 87.2 16 (<0.1%) 59,097 (31.3%) 66.2 68.7
TRAF4 RoadTrack 40.6 6.0 54.6 266 (0.1%) 47,444 (25.1%) 65.1 74.7
TRAF5 MOTDT 36.0 0.7 75.9 221 (0.2%) 33,774 (28.9%) 63.2 70.9
TRAF5 MDP 22.5 0 98.4 6 (<0.1%) 38,091 (32.6%) 64.9 67.3
TRAF5 RoadTrack 41.4 1.5 55.7 299 (0.3%) 24,860 (21.3%) 63.1 78.4
TRAF6 MOTDT 33.0 0 87.5 161 (0.1%) 58,212 (29.4%) 63.3 70.5
TRAF6 MDP 4.3 0 99.3 0 (0.0%) 65,687 (33.2%) 68.6 66.8
TRAF6 RoadTrack 14.6 0.7 67.8 283 (0.1%) 52,017 (26.3%) 62.8 73.6
Summary MOTDT 34.7 0.9 83.6 601 (0.1%) 218,683 (29.3%) 65.5 70.6
Summary MDP 10.1 0.2 97.0 50 (<0.1%) 242,704 (32.6%) 65.3 67.4
Summary RoadTrack 31.6 7.0 66.9 1128 (0.2%) 178,997 (24.0%) 65.7 75.8

Table 2.2: Evaluation on the TRAF dataset with MOTDT [4] and MDP [5]. MOTDT is currently the best online tracker on the MOT benchmark with open-sourced code. Bold is best. Arrows (↑, ↓) indicate the direction of better performance. Observation: RoadTrack improves the accuracy (MOTA) over the state-of-the-art by 5.2% and the precision (MOTP) by 0.2%.

2.4 Trajectory prediction in heterogeneous traffic

For effective planning and navigation in dense and heterogeneous environments, AVs must predict the trajectories of human drivers (Figure 2.5). However, several issues associated with such environments make prediction challenging.
Highly dense traffic corresponds to more frequent inter-agent interactions, which are hard to model due to the inherent uncertainty in human behavior.

Tracker FPS↑ MT(%)↑ ML(%)↓ IDS↓ FN↓ MOTP(%)↑ MOTA(%)↑
AP_HWDPL_p [61] 6.7 17.6 11.8 18 831 72.6 40.7
RAR15pub [57] 5.4 0.0 17.6 18 809 70.9 41.2
AMIR15 [56] 1.9 11.8 11.8 18 714 71.7 50.4
HybridDAT [62] 4.6 5.9 17.6 10 706 72.6 46.3
AM [63] 0.5 5.9 17.6 19 805 70.5 40.6
RoadTrack 28.9 29.4 11.7 15 668 71.3 12.2

Table 2.3: Evaluation on the KITTI-16 dataset from the MOT benchmark with online methods that have an average rank higher than ours. RoadTrack is at least approximately 4× faster than prior methods. While we do not outperform on the MOTA metric, we still achieve the highest MT, ML, FN, and MOTP. We analyze our MOTA performance in Section 2.3.4. Bold is best. Arrows (↑, ↓) indicate the direction of better performance. The values for all methods correspond to the KITTI-16 sequence specifically, and not the entire 2D MOT15 dataset.

Tracker FPS↑ MT(%)↑ ML(%)↓ IDS↓ FN↓ MOTP(%)↑ MOTA(%)↑
AMIR15 [56] 1.9 15.8 26.8 1026 29,397 71.7 37.6
HybridDAT [62] 4.6 11.4 42.2 358 31,140 72.6 35.0
AM [63] 0.5 11.4 43.4 348 34,848 70.5 34.3
AP_HWDPL_p [61] 6.7 8.7 37.4 586 33,203 72.6 38.5
RoadTrack 28.9 18.6 32.7 429 27,499 75.6 20.0
EAMTT_pub [64] 11.8 7.9 49.1 965 102,452 75.1 38.8
RAR16pub [57] 0.9 13.2 41.9 648 91,173 74.8 45.9
STAM16 [63] 0.2 14.6 43.6 473 91,117 74.9 46.0
MOTDT [4] 20.6 15.2 38.3 792 85,431 74.8 47.6
AMIR [56] 1.0 14.0 41.6 774 92,856 75.8 47.2
RoadTrack 18.8 20.3 36.1 722 78,413 75.5 40.9

Table 2.4: Evaluation on the full MOT benchmark. The full MOT dataset is sparse and is not a traffic-based dataset. RoadTrack is at least approximately 4× faster than previous methods. While we do not outperform on the MOTA metric, we still achieve the highest MT, ML (MOT16), FN, and MOTP (MOT15). We analyze our MOTA performance in Section 2.3.4. Bold is best. Arrows (↑, ↓) indicate the direction of better performance.
Moreover, prediction algorithms require huge datasets for training, the collection of which is a costly and time-consuming process. Finally, AVs must simultaneously perform low-level trajectory and high-level action prediction for realtime navigation, as opposed to the current state-of-the-art, which handles trajectory and action prediction independently of each other. I developed three algorithms that address the limitations described above.

Figure 2.5: Trajectory prediction in dense, heterogeneous, and unstructured environments (panels: MOT16, 2D MOT15, KITTI-16).

In the first approach, TraPHic [65], the key aspect is to selectively focus attention on fewer agents. The algorithm consists of a novel attention mechanism that teaches the ego-vehicle to identify the agents that deserve more importance than others. For instance, a pedestrian in the way of the ego-vehicle requires more attention than, say, a parked car to the side. This work has been published in CVPR '19.

2.4.1 Overview

In this section, we give an overview of our prediction algorithm that uses weighted interactions. Our approach is designed for dense and heterogeneous traffic scenarios and is based on two observations. The first observation is based on the idea that road agents in such dense traffic do not react to every road agent around them; rather, they selectively focus attention on key interactions in a semi-elliptical region in the field of view, which we call the "horizon". For example, consider a motorcyclist who suddenly moves in front of a car whose neighborhood consists of other road agents such as three-wheelers and pedestrians (Figure 2.6). The car must prioritize the motorcyclist interaction over the other interactions to avoid a collision. The second observation stems from the heterogeneity of different road agents such as cars, buses, rickshaws, pedestrians, bicycles, animals, etc. in the neighborhood of a road agent (Figure 2.6).
For instance, the dynamic constraints of a bus-pedestrian interaction differ significantly from those of a pedestrian-pedestrian or even a car-pedestrian interaction due to the differences in road agent shapes, sizes, and maneuverability. To capture these heterogeneous road agent dynamics, we embed these properties into the state-space representation of the road agents and feed them into our hybrid network. We also implicitly model the behaviors of the road agents. Behavior in our case refers to the different driving and walking styles of different drivers and pedestrians; some are more aggressive, while others are more conservative. We model these behaviors because they directly influence the outcome of various interactions, thereby affecting the road agents' navigation.

Given a set of N road agents A = {ai}, i = 1 . . . N, the trajectory history of each road agent ai over t frames, denoted Ψi,t := [(xi,1, yi,1), . . . , (xi,t, yi,t)]ᵀ, and the road agent's size li, we predict the spatial coordinates of that road agent for the next τ frames. In addition, we introduce a feature called traffic concentration c, motivated by traffic flow theory. Traffic concentration, c(x, y), at the location (x, y) is defined as the number of road agents between (x, y) and (x, y) + (Δx, Δy) for some predefined (Δx, Δy) > 0. This metric is similar to traffic density, but the key difference is that traffic density is a macroscopic property of a traffic video, whereas traffic concentration is a mesoscopic property that is locally defined at a particular location. We thus achieve a representation of traffic on several scales.

Finally, we define the state space of each road agent ai as

Φi := [Ψi,t, ∇Ψi,t, ci, li]ᵀ (2.9)

where ∇ is a derivative operator that is used to compute the velocity of the road agent, and ci := [c(xi,1, yi,1), . . . , c(xi,t, yi,t)]ᵀ.
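The state space of Equation 2.9 can be sketched as follows; the window size, helper names, and dictionary layout are assumptions for illustration, and the derivative operator is approximated by finite differences of consecutive positions:

```python
def concentration(agents, x, y, dx, dy):
    """Traffic concentration c(x, y): the number of agents inside the box
    [x, x+dx) x [y, y+dy), i.e. between (x, y) and (x, y) + (dx, dy)."""
    return sum(1 for (ax, ay) in agents if x <= ax < x + dx and y <= ay < y + dy)

def state_space(traj, size, agents, dx=5.0, dy=5.0):
    """Sketch of Phi_i = [Psi_i,t ; grad(Psi_i,t) ; c_i ; l_i]: trajectory
    history, finite-difference velocities, per-position concentration,
    and the agent's size. The (dx, dy) window is an illustrative choice."""
    vel = [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(traj, traj[1:])]
    conc = [concentration(agents, x, y, dx, dy) for (x, y) in traj]
    return {"positions": traj, "velocities": vel,
            "concentration": conc, "size": size}
```

For example, with three other agents at (1, 1), (2, 2), and (10, 10), the concentration in the 5 x 5 box at the origin is 2, and a three-frame trajectory yields two velocity samples.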
2D Image Space to 3D World Coordinate Space: We compute camera parameters from the given videos using standard techniques and use these parameters to estimate the camera homography matrices. The homography matrices are subsequently used to convert the locations of road agents in 2D pixels to 3D world coordinates w.r.t. a predetermined frame of reference, similar to approaches in [9, 66]. All state-space representations are subsequently converted to the 3D world space.

Horizon and Neighborhood Agents: Prior trajectory prediction methods have collected neighborhood information using lanes and rectangular grids [67]. Our approach is more general in that we pre-process the trajectory data without assuming the availability of lane information. This assumption is especially true in practice in dense and heterogeneous traffic conditions. We formulate a road agent ai's neighborhood, Ni, using an elliptical region and selecting a fixed number of closest road agents via a nearest-neighbor search in that region. Similarly, we define the horizon of that agent, Hi, by selecting a smaller threshold in the nearest-neighbor search, restricted to a semi-elliptical region in front of ai.

2.5 Related Work

In this section, we give a brief overview of some important classical prediction algorithms and recent techniques based on deep neural networks.

2.5.1 Prediction Algorithms and Interactions

Trajectory prediction has been researched extensively. Approaches include the Bayesian formulation, Monte Carlo simulation, Hidden Markov Models (HMMs), and Kalman Filters. Methods that do not model road-agent interactions are regarded as sub-optimal or less accurate than methods that model the interactions between the road agents in the scene. Examples of methods that explicitly model road-agent interaction include techniques based on social forces, velocity obstacles [3], LTA, etc. Many of these models were designed to account for interactions between pedestrians in a crowd (i.e.
homogeneous interactions) and improve the prediction accuracy [30]. Techniques based on velocity obstacles have been extended using kinematic constraints to model the interactions between heterogeneous road agents. Our learning approach does not use any explicit pairwise motion model. Rather, we model the heterogeneous interactions between road agents implicitly.

2.5.2 Deep-Learning-Based Methods

Approaches based on deep neural networks use variants of Recurrent Neural Networks (RNNs) for sequence modeling. These have been extended to hybrid networks by combining RNNs with other deep learning architectures for motion prediction.

RNN-Based Methods: RNNs are natural generalizations of feedforward neural networks to sequences [68]. The benefits of RNNs for sequence modeling make them a reasonable choice for traffic prediction. Since vanilla RNNs struggle to model long-term sequences, many traffic trajectory prediction methods use long short-term memory networks (LSTMs) to model road-agent interactions. These include algorithms to predict trajectories in traffic scenarios with few heterogeneous interactions [67]. These techniques have also been used for trajectory prediction for pedestrians in a crowd [66].

Hybrid Methods: Deep-learning-based hybrid methods consist of networks that integrate two or more deep learning architectures. Some examples of deep learning architectures include CNNs, GANs, VAEs, and LSTMs. Each architecture has its own advantages and, for many tasks, the advantages of the individual architectures can be combined. There is considerable work on the development of hybrid networks. Generative models have been successfully used for tasks such as super-resolution, image-to-image translation, and image synthesis. However, their application in trajectory prediction has been limited because back-propagation during training is non-trivial.
In spite of this, generative models such as VAEs and GANs have been used for the trajectory prediction of pedestrians in a crowd [9] and in sparse traffic [69]. Alternatively, Convolutional Neural Networks (CNNs or ConvNets) have also been used successfully in many computer vision applications such as object recognition. Recently, they have also been used for traffic trajectory prediction [70, 71]. In this paper, we present a new hybrid network that combines LSTMs with CNNs for traffic prediction.

2.5.3 Traffic Datasets

There are several datasets corresponding to traffic scenarios. ApolloScape [72] is a large-scale dataset of street views that contains scenes with higher complexities, 2D/3D annotations and pose information, lane markings, and video frames. However, this dataset does not provide trajectory information. The NGSIM simulation dataset [73] consists of trajectory data for road agents corresponding to cars and trucks, but the traffic scenes are limited to highways with fixed-lane traffic. The KITTI [74] dataset has been used in different computer vision applications such as stereo, optical flow, 2D/3D object detection, and tracking. There are some pedestrian trajectory datasets, such as ETH and UCY, but they are limited to pedestrians in a crowd. Our new dataset, TRAF, corresponds to dense and heterogeneous traffic captured from Asian cities and includes 2D/3D trajectory information.

2.5.4 Advanced Driver Assistance Systems (ADAS)

Passive safety measures (those that do not process sensory information) in vehicles include safety belts, brakes, airbags, etc. ADAS are active safety measures that collect and process sensory information through sensors such as lidars, radars, stereo cameras, and RGB cameras. Various ADAS process the input information in different ways to implement actions that assist the driver and prevent or reduce the likelihood of traffic accidents due to human error.
The development of ADAS began with the Anti-Lock Braking System (ABS), introduced into production in the late 1970s. As ADAS with various functionality become popular, it is not uncommon for multiple systems to be installed on a vehicle. If each function uses its own sensors and processing unit, installation becomes difficult and the cost of the vehicle rises. As a countermeasure, research integrating multiple functions into a single system has been pursued and is expected to make installation easier, decrease power consumption, and lower vehicle pricing. RobustTP contributes towards this research effort by integrating realtime tracking with trajectory prediction.

In addition to trajectory prediction applications, several other interesting ADAS are currently being used in vehicles on the road. For example, Adaptive Cruise Control (ACC) automatically adapts the speed to maintain a safe distance from vehicles in front. Blind Spot Detection (BSD) helps drivers when they pull out in order to overtake another road-agent. Emergency Brake Assist (EBA) ensures optimum braking by detecting critical traffic situations; when EBA detects an impending collision, the braking system is put on emergency standby. Intelligent Headlamp Control (IHC) provides optimal night vision: the headlamps are set to provide optimum lighting via a continuous change of the high and low beams of the lights.

2.5.5 Road-Agent Behavior Prediction

Current autonomous vehicles lack social awareness due to their inherently conservative behavior. Overly conservative behavior presents new risks in terms of low efficiency and uncomfortable traveling experiences. Real-world examples of problems caused by AVs that are not socially adaptable can be seen in this video¹. The notion of using driver behavior prediction to make AVs socially aware is receiving attention [75].
Current driving behavior modeling methods are limited to traffic psychology studies in which predictions of driving behavior are made offline, based either on driver responses to questionnaires or on data collected over a period of time. Such approaches are not suitable for online behavior prediction. In contrast, our behavior prediction algorithm is the first computationally online approach that does not depend on offline data or manually tunable parameters. In the remainder of this section, we review some of the prior behavior modeling approaches and conclude by pointing out the advantages of our approach.

¹https://www.youtube.com/watch?v=Rm8aPR0aMDE

Many studies have performed behavior modeling by identifying factors that contribute to different driver behavior classes such as aggressive, conservative, or moderate driving. These factors can be broadly grouped into four categories. The first category of factors that indicate road-agent behavior is driver-related. These include characteristics of drivers such as age, gender, blood pressure, personality, occupation, hearing, and so on [76, 77, 78]. Feng et al. [76] proposed five driver characteristics (age, gender, personality via blood test, and education level). Rong et al. [79] presented a similar study but instead used different features, such as blood pressure, hearing, and driving experience, to conclude that aggressive drivers tailgate and weave in and out of traffic. Dahlen et al. [80] studied the relationships between driver personality and aggressive driving using the five-factor model [77]. Social psychology studies [78, 81] have examined aggressiveness according to the background of the driver, including age, gender, violation records, power of cars, occupation, etc. The second category corresponds to environmental factors such as weather or traffic conditions [82, 83].
The study conducted in [83] was designed to investigate the effects of weather-controlled speed limits and signs for slippery road conditions on driver behavior, while other studies [82] correlated changes in traffic density with varying driver behavior. The third category refers to psychological aspects that affect driving styles. These include drunk driving, driving under the influence, states of fatigue, and so on [84, 85]. It is shown in [84] that driving under the influence induces delayed responses in acceleration and deceleration. Jackson et al. show that a state of fatigue manifests the same characteristics as driving under the influence, but without the effect of substance intoxication. Additionally, this category also includes distractions caused by driver activity during driving, such as operating mobile phones. For example, [85] shows that drivers engaged in mobile phone conversations increase their response time to external stimuli. The final category of factors contributing to driving behavior comprises vehicular factors such as positions, acceleration, speed, throttle responses, steering wheel measurements, lane changes, and brake pressure [15, 86, 87, 88, 89]. A recent data-driven behavior prediction approach [15] also models traffic through graphs. The method predicts driving behavior by training a neural network on the eigenvectors of the DGGs using supervised machine learning. Apart from behavior modeling, several methods have used machine learning to predict the intent of road-agents [90, 91]. The proposed behavior prediction algorithm in this paper extends the approach in [15] by predicting sequences of eigenvectors for future time-steps. Compared to these prior methods, our algorithm is online, computationally tractable, and does not depend on any information other than the vehicle coordinates. Aljaafreh et al.
[86] categorized driving behaviors into four classes (Below normal, Normal, Aggressive, and Very aggressive) in accordance with acceleration data. Murphey et al. [87] conducted an analysis of driver aggressiveness and observed that jerk (the rate of change of acceleration) associated with changing lanes is more related to aggressiveness than jerk along the lane. Mohamad et al. [88] detected abnormal driving styles using speed, acceleration, and steering wheel movement, which indicates the direction of the vehicle. Qi et al. [92] studied driving styles with respect to speed and acceleration. Shi et al. [93] pointed out that deceleration is not very indicative of the aggressiveness of drivers, but measurements of throttle opening, which are associated with acceleration, were more helpful in identifying aggressive drivers. Wang et al. [94] classified drivers into two categories, aggressive and normal, using the speed and throttle opening captured by a simulator. Cheung et al. [89] use the speed, acceleration, and lane-change information of highway data to derive a linear mapping between vehicular information and driver behavior. Finally, using only the spatial positions of the vehicles, [15] uses spectral graph theory to train a multi-layer perceptron to classify driver behavior.

2.5.6 Traffic Flow and Forecasting

Traffic forecasting has been studied in different contexts in the prior literature. From a deep-learning perspective, traffic forecasting is synonymous with trajectory prediction and does not take into account road-agent behavior [11]. However, in a broader sense, traffic forecasting refers to predicting traffic flow [95, 96, 97] or traffic density [98, 99, 100, 101] on a macroscopic scale. Predicting traffic flow is important for applications such as congestion management and vehicle routing. In this paper, we mainly limit ourselves to forecasting the low-level trajectories and high-level behaviors of each road-agent.
2.5.7 Hybrid Architecture for Traffic Prediction

In this section, we present our novel network architecture for performing trajectory prediction in dense and heterogeneous environments. In the context of heterogeneous traffic, the goal is to predict trajectories, i.e., temporal sequences of the spatial coordinates of a road agent. Temporal sequence prediction requires models that can capture temporal dependencies in data, such as LSTMs. However, LSTMs cannot learn dependencies or relationships among various heterogeneous road agents because the parameters of each individual LSTM are independent of one another. In this regard, ConvNets have been used in computer vision applications with greater success because they can learn locally dependent features from images. Thus, in order to leverage the benefits of both, we combine ConvNets with LSTMs to learn locally useful relationships, both in space and in time, between the heterogeneous road agents.

We now describe our model to predict the trajectory of each road agent ai. A visualization of the model is shown in Figure 2.7. We start by computing Hi and Ni for the agent ai. Next, we identify all road agents aj ∈ Ni ∪ Hi. Each aj has an input state-space Φj that is used to create the embedding ej, using

ej = φ(Wl Φj + bl) (2.10)

where Wl and bl are conventional symbols denoting the weight matrix and bias vector, respectively, of the layer l in the network, and φ is the non-linear activation on each node. Our network consists of three layers. The horizon layer (top cyan layer in Figure 2.7) takes in the embedding of each road agent in Hi, and the neighbor layer (middle green layer in Figure 2.7) takes in the embedding of each road agent in Ni. The input embeddings in both these layers are passed through fully connected layers with ELU non-linearities, and then fed into single-layered LSTMs (yellow blocks in Figure 2.7).
The outputs of the LSTMs in the two layers are hidden state vectors, hj(t), computed using

hj(t) = LSTM(ej, Wl, bl, hj(t−1)) (2.11)

where hj(t−1) refers to the corresponding road agent's hidden state vector from the previous time-step t − 1. The hidden state vector of a road agent is a latent representation that contains temporally useful information. In the remainder of the text, we drop the parameter t for the sake of simplicity, i.e., hj is understood to mean hj(t) for any j.

The hidden vectors in the horizon layer are passed through an additional fully connected layer with ELU non-linearities. We denote the output of this fully connected layer as hjw. All the hjw's in the horizon layer are then pooled together in a "horizon map". The hidden vectors in the neighbor layer are directly pooled together in a "neighbor map". These maps are further elaborated in Section 2.5.7.1. Both maps are then passed through separate ConvNets in the two layers. The ConvNets in both layers comprise two convolution operations followed by a max-pool operation. We denote the output feature vector from the ConvNet in the horizon layer as fhz, and that from the ConvNet in the neighbor layer as fnb. Finally, the bottom-most layer corresponds to the ego agent ai. Its input embedding, ei, passes sequentially through a fully connected layer with ELU non-linearities and a single-layered LSTM to compute its hidden vector, hi. The feature vectors from the horizon and neighbor layers, fhz and fnb, are concatenated with hi to generate a final vector encoding

z := concat(hi, fhz, fnb) (2.12)

Finally, the concatenated encoding z passes through an LSTM to compute the prediction for the next τ seconds.

2.5.7.1 Weighted Interactions

Our model is trained to learn weighted interactions in both the horizon and neighborhood layers.
Specifically, it learns to assign appropriate weights to the various pairwise interactions based on the shape, dynamic constraints, and behaviors of the involved agents. The horizon-based weighted interactions take into account the agents in the horizon of the ego agent and learn the "horizon map" Hi, given as

Hi = {hjw | aj ∈ Hi} (2.13)

Similarly, the neighbor- or heterogeneous-based weighted interactions account for all the agents in the neighborhood of the ego agent and learn the "neighbor map" Ni, given as

Ni = {hj | aj ∈ Ni} (2.14)

Figure 2.6: Horizon and Heterogeneous Interactions: We highlight various interactions for the red car. Horizon-based weighted interactions are in the blue region, containing a car and a rickshaw (both blue). The red car prioritizes the interaction with the blue car and the rickshaw (i.e., avoids a collision) over interactions with other road-agents. Heterogeneous-based weighted interactions are in the green region, containing pedestrians and motorcycles (all in green). We model these interactions as well to improve the prediction accuracy.

During training, back-propagation optimizes the weights corresponding to these maps by minimizing the loss between the predicted output and the ground-truth labels. Our formulation results in higher weights for prioritized interactions (larger tensors in the horizon map, or blue vehicles in Figure 2.6) and lower weights for less relevant interactions (smaller tensors in the neighbor map, or green vehicles in Figure 2.6).

2.5.7.2 Implicit Constraints

Turning Radius: In addition to constraints such as position, velocity, and shape, the turning radius of a road agent also affects its maneuverability, especially as it interacts with other road agents within some distance. For example, a car (a non-holonomic agent) cannot alter its orientation in a short time frame to avoid collisions, whereas a bicycle or a pedestrian can.
However, the turning radius of a road agent can be determined by the dimensions of the road agent, i.e., its length and width. Since we include these parameters in our state-space representation, we implicitly take each agent's turning radius constraints into consideration as well.

Driver Behavior: Velocity and acceleration (both relative and average) are clear indicators of driver aggressiveness. For instance, a road agent with a relative velocity (and/or acceleration) much higher than the average velocity (and/or acceleration) of all road agents in a given traffic scenario would be deemed aggressive. Moreover, given the traffic concentrations at two consecutive spatial coordinates, c(x, y) and c(x+Δx, y+Δy), where c(x, y) ≫ c(x+Δx, y+Δy), aggressive drivers move in a "greedy" fashion in an attempt to occupy the empty spots in the subsequent spatial locations. For each road agent, we compute its concentration with respect to its neighborhood and add this value to its input state-space. Finally, the relative distance of a road agent from its neighbors is another factor pertaining to how conservative or aggressive a driver is. More conservative drivers tend to maintain a healthy distance, while aggressive drivers tend to tailgate. Hence, we compute the spatial distance of each road agent in the neighborhood and encode this in its state-space representation.

2.5.7.3 Overall Trajectory Prediction

Our algorithm follows a well-known scheme for prediction [66]. We assume that the position of the road agent in the next frame follows a bi-variate Gaussian distribution with mean parameters μ_i^t = (μ_x, μ_y)_i^t, standard deviations σ_i^t = (σ_x, σ_y)_i^t, and correlation coefficient ρ_i^t. The spatial coordinates (x_i^t, y_i^t) are thus drawn from N(μ_i^t, σ_i^t, ρ_i^t). We train the model by minimizing the negative log-likelihood loss function for the i-th road agent trajectory,

L_i = -Σ_{t=τ+1} log(P((x_i^t, y_i^t) | μ_i^t, σ_i^t, ρ_i^t))
(2.15)

We jointly back-propagate through all three layers of our network, optimizing the weights for the linear blocks, ConvNets, LSTMs, and Horizon and Neighbor Maps. The optimized parameters learned for the Linear-ELU block in the horizon layer indicate the priority for the interactions in the horizon of a road agent a_i.

2.5.8 Results

We describe our new dataset in Section 2.5.8. We then list all the implementation details used in our training process. Next, we list the evaluation metrics and the methods that we compare with. Finally, we present the evaluation results.

Figure 2.7: TraPHic Network Architecture: The ego agent is marked by the red dot. The green elliptical region around it is its neighborhood and the cyan semi-elliptical region in front of it is its horizon. We generate input embeddings for all agents based on trajectory information and heterogeneous dynamic constraints such as agent shape, velocity, traffic concentration at the agent's spatial coordinates, and other parameters. These embeddings are passed through LSTMs and eventually used to construct the horizon map, the neighbor map, and the ego agent's own tensor map. The horizon and neighbor maps are passed through separate ConvNets and then concatenated together with the ego agent tensor to produce latent representations. Finally, these latent representations are passed through an LSTM to generate a trajectory prediction for the ego agent.

TRAF Dataset: Dense & Heterogeneous Urban Traffic

We present a new dataset, currently comprising 50 videos of dense and heterogeneous traffic. The dataset consists of the following road agent categories: car, bus, truck, rickshaw, pedestrian, scooter, motorcycle, and other road agents such as carts and animals. Overall, the dataset contains approximately 13 motorized vehicles, 5 pedestrians, and 2 bicycles per frame.
Annotations were performed following a strict protocol, and each annotated video file consists of spatial coordinates, an agent ID, and an agent type. The dataset is categorized according to camera viewpoint (front-facing/top-view), motion (moving/static), time of day (day/evening/night), and difficulty level (sparse/moderate/heavy/challenge). All the videos have a resolution of 1280 × 720. We present a comparison of our dataset with

Dataset   RNN-ED       S-LSTM      S-GAN       CS-LSTM      TraPHic
NGSIM     6.86/10.02   5.73/9.58   5.16/9.42   7.25/10.05   5.63/9.91
Beijing   2.24/8.25    6.70/8.08   4.02/7.30   2.44/8.63    2.16/6.99

Table 2.5: Evaluation on sparse or homogeneous traffic datasets: The first number is the average RMSE error (ADE) and the second number is the final RMSE error (FDE) after 5 seconds (in meters). NGSIM is a standard sparse traffic dataset with few heterogeneous interactions. The Beijing dataset is dense but with relatively low heterogeneity. Lower values are better and bold values represent the most accurate results.

Method                 ADE/FDE
RNN-ED                 3.24/5.16
S-LSTM (Original)      6.43/6.84
S-LSTM (Learned)       3.01/4.89
S-GAN (Original)       2.89/4.56
S-GAN (Learned)        2.76/4.79
CS-LSTM (Original)     2.34/8.01
CS-LSTM (Learned)      1.15/3.35
TraPHic-B              2.73/7.21
TraPHic-He             2.33/5.75
TraPHic-Ho             1.22/3.01
TraPHic (Combined)     0.78/2.44

Table 2.6: Evaluation on our new, highly dense and heterogeneous TRAF dataset. The first number is the average RMSE error (ADE) and the second number is the final RMSE error (FDE) after 5 seconds (in meters). The original setting for a method indicates that it was tested with default settings. The learned setting indicates that it was trained on our dataset for a fair comparison. We present variations of our approach with each weighted interaction and demonstrate the contribution of each. Lower is better and bold is the best result.
Dataset   # Frames (×10³)   Ped   Bicycle   Car     Bike   Scooter   Bus    Truck   Rick   Total    Visibility (Km)   Density (×10³)   # Diff. Agents
NGSIM     10.2              0     0         981.4   3.9    0         0      28.2    0      1013.5   0.548             1.85             3
Beijing   93                1.6   1.9       12.9    -      -         -      -       -      16.4     0.005             3.28             3
TRAF      12.4              4.9   1.5       3.6     1.43   5         0.15   0.2     3.1    19.88    0.005             3.97             8

Table 2.7: Comparison of our new TRAF dataset with various traffic datasets in terms of heterogeneity and density of traffic agents. Heterogeneity is described in terms of the number of different agents that appear in the overall dataset. Density is the total number of traffic agents per Km in the dataset. The value for each agent type under "Agents" corresponds to the average number of instances of that agent per frame of the dataset. It is computed by taking all the instances of that agent and dividing by the total number of frames. Visibility is a ballpark estimate of the length of road in meters that is visible from the camera. NGSIM data were collected using tower-mounted cameras (bird's eye view), whereas both Beijing and TRAF data presented here were collected with car-mounted cameras (frontal view).

standard traffic datasets in Table 2.7.

Implementation Details

We use single-layer LSTMs as our encoders and decoders with hidden state dimensions of 64 and 128, respectively. Each ConvNet is implemented using two convolutional operations, each followed by an ELU non-linearity and then max-pooling. We train the network for 16 epochs using the Adam optimizer with a batch size of 128 and a learning rate of 0.001. We use a radius of 2 meters to define the neighborhood and a minor axis length of 1.5 meters to define the horizon. Our approach uses 3 seconds of trajectory history and predicts the spatial coordinates of the road agent for up to 5 seconds (4 seconds for the KITTI dataset). We do not down-sample on the NGSIM dataset due to its sparsity. However, we use a down-sampling factor of 2 on the Beijing and TRAF datasets due to their high density.
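The data preparation described above (down-sampling dense data by a factor of 2, then splitting each trajectory into 3 seconds of history and 5 seconds to predict) can be sketched as follows. This is a minimal NumPy sketch; the 10 Hz frame rate and the helper name `make_sample` are assumptions, not stated in the text.

```python
import numpy as np

def make_sample(traj, downsample=2, hist_s=3, pred_s=5, fps=10):
    """Split one trajectory (array of (x, y) rows) into a history window
    and a prediction window after down-sampling. fps is an assumption."""
    traj = np.asarray(traj)[::downsample]      # down-sample dense data
    eff_fps = fps / downsample                 # effective frame rate
    n_hist = int(hist_s * eff_fps)             # 3 s of history
    n_pred = int(pred_s * eff_fps)             # 5 s to predict
    assert len(traj) >= n_hist + n_pred, "trajectory too short"
    return traj[:n_hist], traj[n_hist:n_hist + n_pred]

hist, fut = make_sample(np.zeros((200, 2)))   # 15 history rows, 25 future rows
```

The sparse NGSIM data would skip the down-sampling step (downsample=1), matching the setting above.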
Our network is implemented in PyTorch using a single Titan Xp GPU. Our network does not use batch norm or dropout, as they can decrease accuracy. We include the experimental details involving batch norm and dropout in the appendix due to space limitations.

Evaluation Metrics and Comparison Methods

We use the following commonly used metrics [9, 66, 67] to measure the performance of the algorithms used for predicting the trajectories of the road agents.

1. Average displacement error (ADE): The root mean square error (RMSE) of all the predicted positions and real positions during the prediction time.

2. Final displacement error (FDE): The RMSE distance between the final predicted position at the end of the predicted trajectory and the corresponding true location.

We compare our approach with the following methods.

• RNN-ED (Seq2Seq): An RNN encoder-decoder model, which is widely used in motion and trajectory prediction for vehicles.

• Social-LSTM (S-LSTM): An LSTM-based network with social pooling of hidden states to predict pedestrian trajectories in crowds [66].

• Social-GAN (S-GAN): An LSTM-GAN hybrid network to predict trajectories for large human crowds [9].

• Convolutional-Social-LSTM (CS-LSTM): A variant of S-LSTM that adds convolutions to the network in [66] in order to predict trajectories in sparse highway traffic [67].

We also perform ablation studies with the following four versions of our approach.

• TraPHic-B: A base version of our approach without any weighted interactions.

• TraPHic-Ho: A version of our approach without Heterogeneous-Based Weighted interactions, i.e., we do not take into account driver behavior and information such as shape, relative velocity, and concentration.

• TraPHic-He: A version of our approach without Horizon-Based Weighted interactions. In this case, we do not explicitly model the horizon, but account for heterogeneous interactions.

• 
TraPHic: Our main algorithm using both Heterogeneous-Based and Horizon-Based Weighted interactions. We explicitly model the horizon and implicitly account for dynamic constraints and driver behavior.

Results on Traffic Datasets

In order to provide a comprehensive evaluation, we compare our method with state-of-the-art methods on several datasets. Table 2.5 shows the results on the standard NGSIM dataset and an additional dataset containing heterogeneous traffic of moderate density. We present results on our new TRAF dataset in Table 2.6. TraPHic outperforms all prior methods we compared with on our TRAF dataset. For a fairer comparison, we trained these methods on our dataset before testing them on it. However, the prior methods did not generalize well to dense and heterogeneous traffic videos. One possible explanation for this is that S-LSTM and S-GAN were designed to predict trajectories of humans in top-down crowd videos, whereas the TRAF dataset consists of front-view heterogeneous traffic videos with high density. CS-LSTM uses lane information in its model and weights all agent interactions equally. Since the traffic in our dataset does not include the concept of lane-driving, we used the version of CS-LSTM that does not include lane information for a fairer comparison. However, it still led to poor performance, since CS-LSTM does not account for heterogeneous-based interactions.

Figure 2.8: RMSE Curve Plot: We compare the accuracy of four variants of our algorithm with CS-LSTM and each other based on RMSE values on the TRAF dataset. On average, using TraPHic-He reduces RMSE by 15% relative to TraPHic-B, and using TraPHic-Ho reduces RMSE by 55% relative to TraPHic-B. TraPHic, the combination of TraPHic-He and TraPHic-Ho, reduces RMSE by 36% relative to TraPHic-Ho, 66% relative to TraPHic-He, and 71% relative to TraPHic-B. Relative to CS-LSTM, TraPHic reduces RMSE by 30%.
On the other hand, TraPHic considers both heterogeneous-based and horizon-based interactions, and thus produces superior performance on our dense and heterogeneous dataset.

Figure 2.9: Trajectory Prediction Results: We highlight the performance of various trajectory prediction methods on our TRAF dataset with different types of road-agents. We showcase six scenarios with different density, heterogeneity, camera position (fixed or moving), time of day, and weather conditions. We highlight the predicted trajectories (over 5 seconds) of some of the road-agents in each scenario to avoid clutter. The ground truth (GT) trajectory is drawn as a solid green line, and our (TraPHic) prediction results are shown using a solid red line. The prediction results of the other methods (RNN-ED, S-LSTM, S-GAN, CS-LSTM) are drawn with different dashed lines. TraPHic predictions are closest to GT in all the scenarios. We observe up to 30% improvement in accuracy over prior methods on this dense, heterogeneous traffic.

We visualize the performance of the various trajectory prediction methods on our TRAF dataset in Figure 2.9. Compared to the prior methods, TraPHic produces the least deviation from the ground truth trajectory in all the scenarios. Due to the significantly high density and heterogeneity in these videos, coupled with the unpredictable nature of the involved agents, all the predictions deviate from the ground truth in the long term (after 5 seconds).

We demonstrate that our approach is comparable to prior methods on sparse datasets such as the NGSIM dataset. We do not outperform the current state-of-the-art on such datasets, since our algorithm tries to account for heterogeneous agents and weighted interactions even when interactions are sparse and mostly homogeneous. Nevertheless, we are on par with state-of-the-art performance.
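The ADE and FDE metrics used throughout these comparisons can be computed for a single trajectory roughly as follows. This NumPy sketch follows the RMSE-based definitions given above; the function name is ours.

```python
import numpy as np

def ade_fde(pred, gt):
    """ADE: RMSE over all predicted positions; FDE: displacement at the
    final predicted position. pred, gt: arrays of shape (T, 2) in meters."""
    err = np.linalg.norm(pred - gt, axis=1)   # per-step Euclidean error
    ade = np.sqrt(np.mean(err ** 2))          # RMSE over the horizon
    fde = err[-1]                             # error at the final step
    return ade, fde
```

Benchmark numbers such as those in Tables 2.5 and 2.6 would then be averages of these per-trajectory values over a test set.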
Lastly, we note that our RMSE value on the NGSIM dataset is quite high, which we attribute to the fact that we used a much higher (2×) sampling rate for averaging than prior methods.

Finally, we perform an ablation study to highlight the contribution of our weighted interaction formulation. We compare the four versions of TraPHic as stated in Section 2.5.8. We find that the Horizon-based formulation contributes more significantly to higher accuracy. TraPHic-He reduces ADE by 15% and FDE by 20% over TraPHic-B, whereas TraPHic-Ho reduces ADE by 55% and FDE by 58% over TraPHic-B. Incorporating both formulations results in the highest accuracy, reducing the ADE by 71% and the FDE by 66% over TraPHic-B.

Figure 2.10: Overview of RobustTP: RobustTP is an end-to-end trajectory prediction algorithm that uses sensor input trajectories as training data instead of manually annotated trajectories. The sensor input is an RGB video from a moving or static camera. The first step is to compute trajectories using a tracking algorithm (light orange block). The trajectories generated are the training data for the trajectory prediction algorithm (green block). The model trains on τ = 3 seconds of trajectory history and predicts the trajectory for the next k = 5 seconds. As an example, the predicted trajectories for two of the agents are shown in the output image at the right end. The green circles denote the positions of the agents at the beginning of prediction, as seen from a top view in the 3D world. The red dashed lines denote the predicted trajectories for the next 5 seconds, as seen from the same top view in the 3D world.

2.6 RobustTP: Improving robustness of prediction in unstructured traffic

We present an end-to-end algorithm for predicting the future trajectories of road-agents in dense traffic with noisy sensor input trajectories obtained from RGB cameras (either static or moving) through a tracking algorithm.
In this case, we consider noise as the deviation from the ground truth trajectory. The amount of noise depends on the accuracy of the tracking algorithm. Our approach is designed for dense heterogeneous traffic, where the road agents correspond to a mixture of buses, cars, scooters, bicycles, and pedestrians. RobustTP is an approach that first computes trajectories using a combination of a non-linear motion model and a deep learning-based instance segmentation algorithm. Next, these noisy trajectories are used to train an LSTM-CNN neural network architecture that models the interactions between road-agents in dense and heterogeneous traffic. Our trajectory prediction algorithm outperforms state-of-the-art methods for end-to-end trajectory prediction using sensor inputs. We achieve an improvement of up to 18% in average displacement error and an improvement of up to 35.5% in final displacement error at the end of the prediction window (5 seconds) over the next best method. All experiments were set up on an Nvidia Titan Xp GPU. Additionally, we release a software framework, TrackNPred. The framework consists of implementations of state-of-the-art tracking and trajectory prediction methods and tools to benchmark and evaluate them on real-world dense traffic datasets.

The second approach is called RobustTP [102]. This is an end-to-end approach that does not require manually labeled ground-truth trajectories to train the trajectory prediction network. The input to this algorithm consists only of raw traffic videos obtained from commodity sensors such as monocular RGB cameras. The algorithm uses a tracking algorithm to generate noisy trajectories from these videos. These trajectories replace the trajectory input used by TraPHic. This work has been published in ACM CSCS '19.

We begin by formally stating the problem and describing the notation. Then we give an overview of our approach to realtime end-to-end trajectory prediction in dense and heterogeneous traffic scenarios.
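The end-to-end flow described above, where a tracker converts raw frames into noisy trajectories that then serve as the prediction network's input, can be sketched as follows. All names here (`robusttp_pipeline`, `tracker.update`, `predictor`) are hypothetical stand-ins for illustration, not the actual API.

```python
def robusttp_pipeline(video_frames, tracker, predictor, history=3, horizon=5, fps=10):
    """Hypothetical sketch of the end-to-end flow: a tracking-by-detection
    algorithm turns raw RGB frames into (noisy) per-agent trajectories,
    which then serve as the trajectory history for the prediction network.
    The 10 Hz frame rate is an assumption."""
    tracks = {}  # agent_id -> list of (x, y) bounding-box centers
    for frame in video_frames:
        for agent_id, center in tracker.update(frame):  # noisy detections
            tracks.setdefault(agent_id, []).append(center)
    n_hist = history * fps
    futures = {}
    for agent_id, traj in tracks.items():
        if len(traj) >= n_hist:                 # enough history to predict
            futures[agent_id] = predictor(traj[-n_hist:], horizon * fps)
    return futures
```

The point of the design is that no manually annotated trajectories appear anywhere: the predictor consumes whatever (noisy) output the tracker produces.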
Given a set of N road agents R = {r_i}_{i=1...N}, the trajectory history of each road agent r_i over τ frames, denoted T_i = {(x_1, y_1), (x_2, y_2), . . . , (x_τ, y_τ)}, and the road agent's size l, we predict the trajectory, i.e., the spatial coordinates of that road agent, for the next k frames. We define the state space of each road agent r_i as

Ψ_i := [T_i  ∇T_i  c  l]^⊤,  (2.16)

where ∇ is a derivative operator that is used to compute the velocity of the road agent, and c := [c(x_1, y_1), . . . , c(x_τ, y_τ)]^⊤. The traffic concentration, c(x, y), at the location (x, y) is defined as the number of road agents between (x, y) and (x, y) + (Δx, Δy) for some predefined (Δx, Δy) > 0. We also compute camera parameters from the given videos using standard techniques and use the parameters to estimate the camera homography matrices. The homography matrices are subsequently used to convert the locations of road agents from 2D pixels to 3D world coordinates w.r.t. a predetermined frame of reference, similar to the approaches in [9, 66]. All state-space representations are subsequently converted to the 3D world space. Finally, we consider a method to be more robust compared to other methods if the trajectories predicted by it are less affected by noise in the trajectory history (arising due to sensor artifacts, inaccuracies in tracking, and similar factors).

2.6.1 TrackNPred: A Software Framework for End-to-End Trajectory Prediction

TrackNPred is a Python-based software library² for end-to-end realtime trajectory prediction for autonomous road-agents. Our first goal, through TrackNPred, is to enable autonomous road-agents to navigate safely in dense and heterogeneous traffic by estimating how road-agents that

²https://gamma.umd.edu/robusttp

Figure 2.11: TrackNPred is a deep learning-based framework that integrates trajectory prediction methods with tracking by detection algorithms to motivate further research in end-to-end trajectory prediction.
In this figure, we show the graphical user interface of TrackNPred, where one can select the tracking by detection algorithm as well as choose the trajectory prediction method. The user can also set the hyperparameters for the training and evaluation phases. If the input can be connected to an RGB camera mounted on a road-agent, then TrackNPred can be extended to ADAS applications.

are in close proximity are going to move in the next few seconds. The continuous advancement in deep learning has resulted in the development of several state-of-the-art tracking and trajectory prediction algorithms that have shown impressive results on real-world dense and heterogeneous traffic datasets. However, there are currently no theoretical guarantees to validate the comparison of the performance of different deep learning models. It is only through empirical research that one can evaluate the efficiency of a particular deep learning model.

Our second goal is to equip researchers with a packaged deep learning tool that performs trajectory prediction based on various state-of-the-art neural network architectures, such as Generative Adversarial Networks (GANs [9]), Recurrent Neural Networks (LSTMs [66]), and Convolutional Neural Networks (CNNs [67]). Therefore, one of the advantages of TrackNPred is that it enables researchers to experiment with these different deep learning architectures with minimal difficulty. Researchers need only select hyperparameters for the chosen network. We also provide the ability to modify individual architectures without disrupting the rest of the methods (Figure 2.11).

TrackNPred integrates realtime tracking algorithms with end-to-end trajectory prediction methods to create a robust framework. The input is simply a video (through a moving or static RGB camera). TrackNPred selects a tracking method from the tracking module to first generate a trajectory, T_i = {(x_1, y_1), (x_2, y_2), . . .
, (x_n, y_n)}, for the i-th road-agent for n frames, where n is a constant. The trajectories for each agent are then treated as the trajectory history for that agent in the trajectory prediction module. The final output is the future trajectory for the ego-agent, T_ego = ((x_{n+1}, y_{n+1}), (x_{n+2}, y_{n+2}), . . . , (x_{n+k}, y_{n+k})), where k is the length of the prediction window. This is a major difference from trajectory prediction methods in the literature [9, 66, 67] that rely on manually annotated input trajectories. TrackNPred, in contrast, does not require any ground truth trajectories.

Finally, TrackNPred evaluates and benchmarks the realtime performance of various trajectory prediction methods on a real-world traffic dataset³ [65]. This dataset contains more than 50 videos of dense and heterogeneous traffic. The dataset consists of the following road agent categories: cars, buses, trucks, rickshaws, pedestrians, scooters, motorcycles, and other road agents such as carts and animals. Overall, the dataset contains approximately 13 motorized vehicles, 5 pedestrians, and 2 bicycles per frame. Annotations consist of spatial coordinates, an agent ID, and an agent type. The dataset is categorized according to camera viewpoint (front-facing/top-view), motion (moving/static), time of day (day/evening/night), and density level (sparse/moderate/heavy/challenging). All the videos have a resolution of 1280 × 720.

³https://go.umd.edu/TRAF-Dataset

Table 2.8: The list of algorithms currently implemented in TrackNPred.

Tracking by Detection:
  Mask R-CNN + DeepSORT
  YOLO + DeepSORT
Trajectory Prediction:
  RNN Encoder-Decoder [68]
  Social-GAN [9]
  Convolutional Social-LSTM [67]
  TraPHic [65]

2.6.1.1 Methods Implemented in TrackNPred

One of our goals is to motivate research in highly accurate, end-to-end, and realtime trajectory prediction methods. To achieve this goal, we design a common interface for several state-of-the-art methods from both the tracking and trajectory prediction literature.
Such a design facilitates easy benchmarking of new algorithms with respect to the state-of-the-art. The methods in TrackNPred differ in numerous ways from their original implementations in the literature in order to achieve improved accuracy in tracking and prediction in dense and heterogeneous traffic. Table 2.8 provides a list of the algorithms currently implemented in TrackNPred.

Tracking Module

For tracking, we mainly focus our attention on tracking by detection approaches. These are approaches that leverage deep learning-based object detection models. This is because tracking methods that do not perform detection require manual, near-optimal initialization of each road-agent's state information in the first video frame. Further, methods that do not utilize object detection need to know the number of road-agents in each frame a priori, so they do not handle cases in which new road-agents enter the scene during the video. Tracking by detection approaches overcome these limitations by employing a detection framework to recognize road-agents entering at any point during the video and initialize their state-space information.

Prediction length, k = 3 secs
          RNN-ED      S-GAN       CS-LSTM     RobustTP
MRCNN     2.60/4.96   2.11/3.50   1.27/2.01   1.14/1.90
YOLO      1.13/2.18   1.29/2.18   1.08/1.55   0.96/1.53

Prediction length, k = 5 secs
          RNN-ED      S-GAN       CS-LSTM     RobustTP
MRCNN     3.99/6.55   3.23/5.69   1.91/3.76   1.75/3.42
YOLO      2.06/4.26   1.98/3.72   1.52/2.67   1.29/1.97

Table 2.9: We evaluate RobustTP against methods that use noisy sensor input on the TRAF dataset. The trajectory histories are computed using tracking by two detection methods: Mask R-CNN [6] and YOLO [7]. The results are reported in the format ADE/FDE, where ADE is the average displacement RMSE over the k seconds of prediction and FDE is the final displacement RMSE at the end of k seconds. We tested both short-term (k = 3) and longer-term (k = 5) predictions. We observe that RobustTP is the state-of-the-art in all cases.
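The benefit described above, that detection-based trackers can spawn tracks for agents entering mid-video, can be illustrated with a deliberately simplified data-association loop. This greedy nearest-neighbor matcher is a stand-in sketch, not DeepSORT's actual appearance-and-motion matching.

```python
import numpy as np

def associate(tracks, detections, max_dist=2.0):
    """Sketch of tracking-by-detection data association: match each detection
    to the nearest existing track; an unmatched detection spawns a NEW track,
    so agents entering mid-video are handled without knowing the agent count
    a priori. Greedy matching and max_dist are simplifying assumptions."""
    next_id = max(tracks, default=-1) + 1
    for det in detections:
        best, best_d = None, max_dist
        for tid, pos in tracks.items():
            d = np.linalg.norm(np.asarray(det) - np.asarray(pos))
            if d < best_d:
                best, best_d = tid, d
        if best is None:              # no track nearby: a new road-agent
            tracks[next_id] = tuple(det)
            next_id += 1
        else:                         # update the matched track's position
            tracks[best] = tuple(det)
    return tracks
```

A real tracker would additionally use a motion model (constant velocity in the original DeepSORT, replaced here by RVO as described below) to predict each track's position before matching.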
At present, we implement Python-based tracking by detection algorithms to facilitate easy integration into TrackNPred. DeepSORT [1] is currently the state-of-the-art realtime tracker implemented in Python. Naturally, we use DeepSORT as the base tracker. However, DeepSORT was originally developed using a constant velocity model with the goal of tracking pedestrians in sparse crowds. Consequently, it is not optimized for dense and heterogeneous traffic scenes that may contain cars, buses, pedestrians, two-wheelers, and even animals. Therefore, we replace the constant velocity model with a non-linear RVO motion model [103], which is designed for motion planning in dense environments.

The advantage of using tracking by detection algorithms is that we can combine the unique benefits of different object detection models. For example, we integrate two state-of-the-art object detection models, YOLO and Mask R-CNN. Each is state-of-the-art in its own category: the YOLO algorithm is extremely fast compared to Mask R-CNN, while the latter offers higher accuracy.

The output of the tracking module is a trajectory file with corresponding IDs. An ID is an integer unique to every agent. Each row of this file follows the format

<Fid>, <Vid>, <center-X>, <center-Y>

which denotes the frame ID, the vehicle ID, and the 2D coordinates of the center of the bounding box of the road-agent. This trajectory file is the input for the trajectory prediction module.

2.7 Behavior Prediction

In the final algorithm, SpectralLSTM, we extend TraPHic to also incorporate action prediction. Architecturally, the network consists of a two-stream approach working in parallel. The first stream is essentially the TraPHic algorithm, while the second stream is used to perform action prediction. This work [104] has been published in RAL/IROS '20.

2.7.1 Problem Statement

We first present a definition of a vehicle trajectory:

Definition 2.7.1.
Trajectory: The trajectory for the i-th road agent is defined as a sequence Ψ_i(a, b) ⊂ R², where Ψ_i(a, b) = {[x^t, y^t]^⊤ | t ∈ [a, b]}. Here, [x, y]^⊤ ∈ R² denotes the spatial coordinates of the road-agent in meters according to the world coordinate frame, and t denotes the time instance.

Figure 2.12: Trajectory and Behavior Prediction: We predict the long-term (3-5 seconds) trajectories of road-agents, as well as their behavior (e.g., overspeeding, underspeeding, etc.), in urban traffic scenes. Our approach represents the spatial coordinates of road-agents (colored points in the image) as vertices of a DGG to improve long-term prediction using a new regularization method.

We define traffic forecasting as solving the following two problem statements simultaneously, but separately, using two separate streams.

Problem 2.7.1. Trajectory Prediction: In a traffic video with N road agents, given the trajectory Ψ_i(0, τ), predict Ψ_i(τ+, T) for each road-agent v_i, i ∈ [0, N].

Problem 2.7.2. Behavior Prediction: In a traffic video with N road agents, given the trajectory Ψ_i(0, τ), predict a label from the following set, {Overspeeding, Neutral, Underspeeding}, for each road-agent v_i, i ∈ [0, N].

The overall flow of the approach is as follows:

1. Our input consists of the spatial coordinates over the past τ seconds as well as the eigenvectors of the DGGs corresponding to the first τ DGGs.

2. Solving Problem 2.7.1: The first stream accepts the spatial coordinates and uses an LSTM-based sequence model [105] to predict Ψ_i(τ+, T) for each v_i, i ∈ [0, N], where τ+ = τ + 1.

3. Solving Problem 2.7.2: The second stream accepts the eigenvectors of the input DGGs and predicts the eigenvectors corresponding to the DGGs for the next τ seconds. The predicted eigenvectors form the input to the behavior prediction algorithm in Section 2.7.2.2 to assign a behavior label to the road-agent.

4.
Stream 2 is used to regularize stream 1 using a new regularization algorithm presented in Section 2.7.3. We derive the upper bound on the prediction error of the regularized forecasting algorithm in Section 2.7.3.1.

2.7.2 Network Overview

We present an overview of our approach in Figure 2.13 and defer the technical implementation details of our network to the supplementary material. Our approach consists of two parallel LSTM networks (or streams) that operate separately.

Stream 1: The first stream is an LSTM-based encoder-decoder network [105] (yellow layer in Figure 2.13). The input consists of the trajectory history, Ψ_i(0, τ), and the output consists of Ψ_i(τ+, T) for each road-agent v_i, i ∈ [0, N].

Stream 2: The second stream is also an LSTM-based encoder-decoder network (blue layer in Figure 2.13). To prepare the input to this stream, we first form a sequence of DGGs, {G_t | t ∈ [0, τ]}, for each time instance of traffic until time τ. For each DGG, G_t, we first compute its corresponding Laplacian matrix, L_t, and use state-of-the-art eigenvalue algorithms to obtain the spectrum, U_t, consisting of the top k eigenvectors of length n. We form k different sequences, {S_j | j ∈ [0, k]}, where each S_j = {u_j^t} is the set containing the j-th eigenvector from each U_t corresponding to the t-th time-step, with |S_j| = τ.

Figure 2.13: Network Architecture: We show the trajectory and behavior prediction for the i-th road-agent (red circle in the DGGs). The input consists of the spatial coordinates over the past τ seconds as well as the eigenvectors (green rectangles, each shade of green represents the index of the eigenvectors) of the DGGs corresponding to the first τ DGGs. We perform spectral clustering on the predicted eigenvectors from the second stream to regularize the original loss function and perform back-propagation on the new loss function to improve long-term prediction.
The second stream then accepts a sequence, S_j, as input to predict the j-th eigenvectors for the next T - τ seconds. This is repeated for each S_j. The resulting sequence of spectrums, {U_t | t ∈ [τ+, T]}, is used to reconstruct the sequence {L_t | t ∈ [τ+, T]}, which is then used to assign a behavior label to a road-agent, as explained below.

2.7.2.1 Trajectory Prediction

The first stream is used to solve Problem 2.7.1. We clarify at this point that stream 1 does not take into account road-agent interactions. We use spectral clustering (discussed later in Section 2.7.3) to model these interactions. It is important to further clarify that the trajectories predicted from stream 1 are not affected by the behavior prediction algorithm (explained in the next section).

2.7.2.2 Behavior Prediction Algorithm

We define a rule-based behavior algorithm (blue block in Figure 2.13) to solve Problem 2.7.2. This is largely due to the fact that most data-driven behavior prediction approaches require large, well-annotated datasets that contain behavior labels. Our algorithm is based on the predicted eigenvectors of the DGGs of the next τ seconds. The degree of the i-th road-agent, θ_i ≤ n, can be computed from the diagonal elements of the Laplacian matrix L_t. θ_i measures the total number of distinct neighbors with which road-agent v_i has shared an edge connection until time t. As L_t is formed by simply adding a row and column to L_{t-1}, the degree of each road-agent monotonically increases. Let the rate of increase of θ_i be denoted as θ'_i. Intuitively, an aggressively overspeeding vehicle will observe new neighbors at a faster rate than a road-agent driving at a uniform speed. Conversely, a conservative road-agent that is often underspeeding at unconventional spots, such as green-light intersections (Figure 2.12), will observe new neighbors very slowly. This intuition is formalized by noting the change in θ_i across time-steps. In order to make sure that slower (conservative) vehicles do not mistakenly mark faster vehicles as new agents, we set a condition where an observed vehicle is
To make sure that slower (conservative) vehicles do not mistakenly mark faster vehicles as new agents, we add the condition that an observed vehicle is marked as "new" if and only if the speed of the observed vehicle is less than that of the active vehicle (or ego-vehicle). To predict the behavior of the i-th road-agent, we follow these steps:

1. Form the set of predicted spectra from stream 2, {U_t | t ∈ (τ, T]}. We compute the eigenvalue matrix, Λ, of L_t by applying Theorem 5.6 of [106] to L_{t−1}; we explain the exact procedure in the supplementary material.
2. For each U_t ∈ U, compute L_t = U_t Λ U_t^⊤, where Λ is the eigenvalue matrix of L_t.
3. θ_i = i-th element of diag(L_t), where "diag" denotes the matrix diagonal.
4. θ̇_i = Δθ_i / Δt.

Based on heuristically pre-determined threshold parameters λ_1 and λ_2, we define the following rules to assign the final behavior label: Overspeeding (θ̇ > λ_1), Neutral (λ_2 ≤ θ̇ ≤ λ_1), and Underspeeding (θ̇ < λ_2). Note that since human behavior does not change instantly at each time-step, our approach predicts the behavior over time periods spanning several frames.

2.7.3 Spectral Clustering Regularization

The original loss function of stream 1 for the i-th road-agent in an LSTM network is given by

F_i = −Σ_{t=1}^{T} log Pr(x_{t+1} | μ_t, σ_t, ρ_t).   (2.17)

Our goal is to optimize the parameters μ̂_t, σ̂_t that minimize Equation 2.17. The next spatial coordinate is then sampled from a search space defined by N(μ̂_t, σ̂_t). The resulting optimization forces μ_t, σ_t to stay close to the next spatial coordinate. In general trajectory prediction models, however, the predicted trajectory diverges gradually from the ground truth, causing the error margin to increase monotonically with the length of the prediction horizon ([107]; cf. Figure 4 in [65, 67], Figure 3 in [8]).
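The degree-based labeling rules of Section 2.7.2.2 can be sketched as follows; the threshold values here are illustrative placeholders, since the thesis tunes λ_1 and λ_2 heuristically per dataset:

```python
import numpy as np

def behavior_labels(degree_history, dt=1.0, lam1=2.0, lam2=0.5):
    """degree_history: (T, n) array whose row t holds theta_i for each
    road-agent i, read off the diagonal of L_t.  Returns one label per
    agent from the average degree rate (theta-dot)."""
    theta_dot = np.diff(degree_history, axis=0).mean(axis=0) / dt
    labels = []
    for rate in theta_dot:
        if rate > lam1:
            labels.append("overspeeding")    # sees new neighbors quickly
        elif rate < lam2:
            labels.append("underspeeding")   # sees new neighbors slowly
        else:
            labels.append("neutral")
    return labels
```

Because the degree is monotonically non-decreasing, the rate θ̇ is non-negative and separates aggressive from conservative agents by how quickly new neighbors accumulate.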
The reason for this may be that while Equation 2.17 ensures that μ_t, σ_t stay close to the next spatial coordinate, it does not guarantee the same for samples x̂_{t+1} ∼ N(μ_t, σ_t). Our solution to this problem is to regularize Equation 2.17 by adding appropriate constraints on the parameters μ_t, σ_t such that coordinates sampled from N(μ̂_t, σ̂_t) are close to the ground-truth trajectory. We assume the ground-truth trajectory of a road-agent to be equivalent to its "preferred" trajectory, defined as the trajectory the road-agent would have taken in the absence of other dynamic road-agents. Preferred trajectories can be obtained by minimizing the Dirichlet energy of the DGG, which in turn can be achieved through spectral clustering on the road-agents [108]. Our regularization algorithm (yellow arrow in Figure 2.13) is summarized below. For each road-agent v_i:

1. The second stream computes the spectrum sequence, {U_{T+1}, ..., U_{T+τ}}.
2. For each U_t, perform spectral clustering [109] on the eigenvector corresponding to the second smallest eigenvalue.
3. Compute cluster centers from the clusters obtained in the previous step.
4. Identify the cluster to which v_i belongs and retrieve its cluster center, μ_c, and deviation, σ_c.

Then, for each road-agent v_i, the regularized loss function, F_i^reg, for stream 1 is given by

F_i^reg = Σ_{t=1}^{T} ( −log Pr(y_{t+1} | μ_t, σ_t, ρ_t) + b_1 ∥μ_t − μ_c∥_2 + b_2 ∥σ_t − σ_c∥_2 ),   (2.18)

where b_1 = b_2 = 0.5 are regularization constants. The regularized loss function is used to backpropagate the weights corresponding to μ_t in stream 1. Note that F_i^reg resembles a Gaussian kernel. This makes sense, as the Gaussian kernel models the Euclidean distance non-linearly: the greater the Euclidean distance, the smaller the Gaussian kernel value, and vice versa. Furthermore, we can use Equation 2.18 to predict multiple modes [67] by computing maneuver probabilities using μ, σ, following the approach in Section 4.3 of [67].
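A minimal numpy sketch of the cluster-center steps and of the penalty in Equation 2.18, assuming a two-way partition by the sign of the Fiedler vector and treating the negative log-likelihood as a precomputed scalar (both simplifications of the thesis's pipeline):

```python
import numpy as np

B1 = B2 = 0.5  # regularization constants b1, b2 from Eq. 2.18

def cluster_center(positions, fiedler, i):
    """Steps 2-4: split agents by the sign of the Fiedler vector (the
    eigenvector of the second smallest Laplacian eigenvalue) and return
    the mean and deviation of the cluster containing agent i."""
    mask = (fiedler >= 0) == (fiedler[i] >= 0)
    members = positions[mask]
    return members.mean(axis=0), members.std(axis=0)

def regularized_loss(nll, mu, sigma, mu_c, sigma_c):
    """Eq. 2.18: the original negative log-likelihood plus penalties
    pulling the predicted Gaussian parameters toward the cluster center."""
    return nll + B1 * np.linalg.norm(mu - mu_c) + B2 * np.linalg.norm(sigma - sigma_c)
```

When the predicted parameters coincide with the cluster center, the penalty vanishes and the loss reduces to the original negative log-likelihood.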
2.7.3.1 Upper Bound for Prediction Error

In this section, we derive an upper bound on the prediction error, θ_j, of the first stream as a consequence of spectral regularization. We present our main result as follows.

Theorem 2.7.1. θ_j ≤ ∥δ_t δ_t^⊤∥ / δ_min(λ_j, Λ), where δ_min(λ_j, Λ) denotes the minimum distance between λ_j and λ_k ∈ Λ \ {λ_j}.

Proof. At time instance t, the Laplacian matrix, L_t, its block form, block(L_t) = [L_t 0; 0 1], and the Laplacian matrix for the next time-step, L_{t+1}, are described by Equation 3.1. We compute the eigenvalue matrix, Λ, of L_t by applying Theorem 5.6 of [106] to L_{t−1}.

LSTMs make accurate sequence predictions if the elements of the sequence are correlated across time, as opposed to being generated randomly. In a general sequence of eigenvectors, the eigenvectors may not be correlated across time. Consequently, it is difficult for LSTM networks to predict the sequence of eigenvectors, U, accurately, which may adversely affect the behavior prediction algorithm described in Section 2.7.2.2. Our goal is now to show that there exists a correlation between Laplacian matrices across time-steps and that this correlation is lower-bounded, that is, that there exists sufficient correlation for accurate sequence modeling of eigenvectors. Proving a lower bound on the correlation is equivalent to proving an upper bound on the noise, or error distance, between the j-th eigenvectors of L_t and L_{t+1}. We denote this error distance by the angle θ_j. From Theorem 5.4 of [106], the numerator of the bound corresponds to the Frobenius norm of the error between L_t and L_{t+1}. In our case, the update to the Laplacian matrix is given by Equation 3.1, where the error matrix is δδ^⊤. In Theorem 2.7.1, θ_j ≪ 1, and δ is defined in Equation 3.1; λ_j represents the j-th eigenvalue and Λ represents all the eigenvalues of L_t. If the maximum component of δ_t is δ_max, then ε_j = O(√N δ_max). Theorem 2.7.1 shows that in a sequence of j-th eigenvectors, the maximum angular
difference between successive eigenvectors is bounded by O(√N δ_max). By setting N = 270 (the number of road-agents in Lyft) and δ_max := e^{−3} ≈ 0.049 (the width of a lane), we observe a theoretical upper bound of 0.8 meters. A smaller value of θ_j indicates a greater similarity between successive eigenvectors, thereby implying a greater correlation in the sequence of eigenvectors. This allows sequence prediction models to learn future eigenvectors efficiently.

An alternative approach to computing the spectra {U_{T+1}, ..., U_{T+τ}} is to first form traffic-graphs from the predicted trajectories output by stream 1. After obtaining the corresponding Laplacian matrices for these traffic-graphs, standard eigenvalue algorithms can be used to compute the spectrum sequence. This is, however, a relatively sub-optimal approach, as in this case ε = O(N L_max), with L_max ≥ δ_max.

2.7.3.2 Results

2.7.4 Analysis and Discussion

We compare the ADE and FDE scores of our predicted trajectories with prior methods in Table 3.3 and show qualitative results in the supplementary material. We compare with several state-of-the-art trajectory prediction methods and reduce the average RMSE by approximately 75% with respect to the next best method (GRIP).

Ablation Study of Stream 1 (S1 Only) vs. Both Streams (S1 + S2): To highlight the benefit of the spectral cluster regularization on long-term prediction, we remove the second stream and train only the LSTM encoder-decoder model (stream 1) with the original loss function (Equation 2.17). Our results (Table 3.3, last four columns) show that regularizing stream 1 reduces the FDE by up to 70%. This is as expected, since stream 1 does not take neighbor information into account. It should also be noted that stream 1 performs poorly in dense scenarios but rather well in sparse ones; this is evident from Table 3.3, where stream 1 outperforms comparison methods on the sparse NGSIM dataset with an ADE of less than 1 m.
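For reference, the two metrics reported throughout this section can be computed as follows; this is the standard definition of the metrics, not thesis-specific code:

```python
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (T, 2) predicted and ground-truth trajectories.
    ADE is the mean L2 error over the prediction horizon; FDE is the
    L2 error at the final predicted time-step."""
    err = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return float(err.mean()), float(err[-1])
```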
Additionally, Figure 2.14 shows that, in the presence of regularization, the RMSE for our spectrally regularized approach ("both streams", purple curve) is much lower than that of stream 1 (red curve) across the entire prediction window.

RMSE depends on traffic density: The upper bound on the increase in RMSE is a function of the density of the traffic, since ε = O(√N δ_max), where N is the total number of agents in the traffic video and δ_max = 0.049 meters for a three-lane-wide road system. The NGSIM dataset contains the sparsest traffic, with the lowest value of N, and therefore the RMSE values are lower for NGSIM (0.40/1.08) than for the other three datasets, which contain dense urban traffic.

Figure 2.14: RMSE Curves: We plot the RMSE values for all methods. The prediction window is 5 seconds, corresponding to a frame length of 50 for the NGSIM dataset.

Comparison with other methods: Our method learns weight parameters for a spectrally regularized LSTM network (Figure 2.13), while GRIP learns parameters for a graph convolutional network (GCN). We outperform GRIP on the NGSIM and Apolloscape datasets, while comparisons on the remaining two datasets are unavailable. TraPHic and CS-LSTM are similar approaches: both require convolutions in a heuristic local neighborhood, and the size of the neighborhood is adjusted specifically to the dataset each method is trained on. We use the default neighborhood parameters provided in the publicly available implementations and apply them to the NGSIM, Lyft, Argoverse, and Apolloscape datasets. We outperform both methods on all benchmark datasets. Lastly, Social-GAN is trained on the scale of pedestrian trajectories, which differs significantly from the scale of vehicle trajectories; this is the primary reason Social-GAN places last among all methods.

2.7.5 Long-Term Prediction Analysis

The goal of improved long-term prediction is to achieve a lower FDE, as observed in our results in Table 3.3.
We achieve this goal by upper-bounding the worst-case maximum FDE that can theoretically be obtained. These upper bounds are a consequence of the theoretical results in Section 2.7.3.1. We denote the worst-case theoretical FDE by T-FDE. This measure represents the maximum FDE that can be obtained using Theorem 2.7.1 under fixed assumptions. In Table 2.11, we compare the T-FDE with the empirical FDE results obtained in Table 3.3. The T-FDE is computed by

T-FDE = (ε / √n) (T − τ).   (2.19)

The formula for T-FDE is derived as follows. The RMSE error incurred by all vehicles at a current time-step during spectral clustering is bounded by ε (Theorem 2.7.1). Let n = N/10 be the average number of vehicles per frame in each dataset. Then, at a single instance in the prediction window, the increase in RMSE for a single agent is bounded by ε/√n. As T − τ is the length of the prediction window, the total increase in RMSE over the entire prediction window is given by T-FDE = (ε/√n)(T − τ). We do not have the data needed to compute ε for the NGSIM dataset, as the total number of lanes is not known.

We note a 73%, 82%, and 100% agreement between the theoretical FDE and the empirical FDE on the Apolloscape, Lyft, and Argoverse datasets, respectively. The main cause of the disagreements on the first two datasets is the choice of the value δ_max = 0.049 during the computation of ε. This value is obtained for a three-lane-wide road system, which was observed in the majority of the videos in both datasets. However, it may be the case that several videos contain one- or two-lane traffic; in such cases, the value of δ_max changes to 0.36 and 0.13, respectively, thereby increasing the upper bound on the increase in RMSE. Note that, in Figure 2.14, the increase in RMSE for our approach (purple curve) is much lower than that of the other methods, which is due to the upper bound induced by spectral regularization.

Table 2.10: Main Results: We report the Average Displacement Error (ADE) and Final Displacement Error (FDE) for prior road-agent trajectory prediction methods in meters (m). Lower scores are better and bold indicates the SOTA. We used the original implementation and results for GRIP [8] and Social-GAN [9]. "-" indicates that results for that particular dataset are not available. Conclusion: Our spectrally regularized method ("S1 + S2") outperforms the next best method (GRIP) by up to 70%, as well as the ablated version of our method ("S1 Only") by up to 75%.

Dataset (Pred. Len.)  | CS-LSTM (ADE/FDE) | TraPHic (ADE/FDE) | Social-GAN (ADE/FDE) | GRIP (ADE/FDE) | S1 Only (ADE/FDE) | S1 + S2 (ADE/FDE)
Lyft (5 sec.)         | 4.423/8.640       | 5.031/9.882       | 7.860/14.340         | -              | 5.77/11.20        | 2.65/2.99
Argoverse (5 sec.)    | 1.050/3.085       | 1.039/3.079       | 3.610/5.390          | -              | 2.40/3.09         | 0.99/1.87
Apolloscape (3 sec.)  | 2.144/11.699      | 1.283/11.674      | 3.980/6.750          | 1.25/2.34      | 2.14/9.19         | 1.12/2.05
NGSIM (5 sec.)        | 7.250/10.050      | 5.630/9.910       | 5.650/10.290         | 1.61/3.16      | 1.31/2.98         | 0.40/1.08

2.7.6 Behavior Prediction Results

We follow the behavior prediction algorithm described in Section 2.7.2.2. The values of λ_1 and λ_2 are based on the ground-truth labels and are hidden from the test set. We observe a weighted accuracy of 92.96% on the Lyft dataset, 84.11% on the Argoverse dataset, and 96.72% on the Apolloscape dataset. In the case of Lyft, Figure 2.15 (top) and Figure 2.15 (bottom) show the ground truth and the predictions, respectively. We plot the value of θ̇ on the vertical axis and the road-agent IDs on the horizontal axis. More similarity across the two plots indicates higher accuracy. For instance, the red (aggressive) and blue (conservative) dotted regions in Figure 2.15 (top) and Figure 2.15 (bottom) are nearly identical, indicating a greater number of correct classifications. Similar results follow for the Apolloscape and Argoverse datasets, which we show in the supplementary material due to lack of space. Due to the lack of diverse behaviors in the NGSIM dataset, we do not perform behavior prediction on NGSIM. An interesting observation is that road-agents towards the end of the x-axis appear late in the traffic video, while road-agents at the beginning of the x-axis appear early in the video. The variation in behavior class labels therefore decreases towards the end of the x-axis. This intuitively makes sense, as θ̇ for a road-agent depends on the number of distinct neighbors that it observes, which is difficult for road-agents that appear towards the end of the traffic video.

Figure 2.15: Behavior Prediction Results: We classify the three behaviors, overspeeding (blue), neutral (green), and underspeeding (red), for all road-agents in the Lyft, Argoverse, and Apolloscape datasets, respectively. The y-axis shows θ̇ and the x-axis denotes the road-agents. We follow the behavior prediction protocol described in Section 2.7.2.2. Each figure in the top row represents the ground-truth labels, while the bottom row shows the predicted labels. In our experiments, we set λ = λ_1 = −λ_2.

Table 2.11: Upper Bound Analysis: ε is the upper bound on the RMSE for all agents at a time-step. T − τ is the length of the prediction window. T-FDE (Eq. 2.19) is the theoretical FDE that should be achieved by using spectral regularization. The FDE results are obtained from Table 3.3. The % agreement is the agreement between the T-FDE and the FDE, computed from the ratio of the T-FDE to the FDE.

... b_2 > ... > b_K and α_1 > α_2 > ... > α_K. The allocation rule is that the agent with the i-th highest bid is allocated the i-th most valuable item, α_i. The utility, u_i [184], incurred by a_i is given as follows:

u_i(b_i) = v_i α_i − Σ_{j=i}^{K} b_{j+1} (α_j − α_{j+1}).   (4.3)

In the equation above, the quantity on the left represents the total utility for a_i, which is equal to the value of the allocated goods, α_i, minus a payment term. The first term on the right is the value of the item obtained by a_i.
The second term on the right is the payment made by a_i, expressed as a function of the bids b_{j>i} and their allocated item values α_j. We refer the reader to Chapter 3 of [184] for a derivation and detailed analysis of Equation 4.3.

In our approach, we re-cast Equation 4.3 through the lens of a human driver. More specifically, the term v_i α_i denotes the time reward gained by driver a_i by moving on her turn. The payment term represents a notion of risk [13] associated with moving on that turn. It follows that allocating a conservative agent a later turn (smaller α) also presents the lowest risk, and vice versa.

Choosing an optimal ordering in which agents navigate unsignaled and uncontrolled traffic scenarios can be cast as an allocation problem where the goal is to allocate each agent, a_i, a position σ_i in the optimal turn-based ordering. Deciding such an allocation depends heavily on the incentives of the agents, which, in the case of non-ideal agents, is a hard problem. Prior planning methods model non-ideal agents by estimating the objective functions of the agents from noisy data using statistical methods [122, 175, 189] or by assuming a fixed behavior for surrounding agents (static or constant velocity) [174, 175]. These methods are not guaranteed to be optimal and result in collisions and deadlocks in unsignaled traffic scenarios, as shown in Table 3.3. Auction-based methods, on the other hand, model non-ideal agents in unsignaled traffic scenarios effectively, albeit using a monetary bidding strategy that is not realizable in real-world scenarios [178, 180, 181, 183]. Our formulation, GamePlan, differs in this regard: we use a novel online driving-behavior-based bidding strategy built on the CMetric model [14]. In the rest of this section, we present the main algorithm, followed by an analysis of its optimality.

4.3.4 Algorithm

Our goal is to solve Problem 3.2.1 and compute the optimal turn-based ordering σ_OPT = σ_1 σ_2 . .
. σ_n, which determines the order in which agents navigate unsignaled intersections, roundabouts, or merging scenarios. Our GamePlan algorithm proceeds in two stages: a behavior modeling phase and a planning phase. During the behavior modeling phase, we use CMetric to compute the behavior profile β_i of every agent (active or non-active) using Equation 4.1 over an observation period of 5 seconds; alternative behavior models such as SVO [75] may also be used. This is followed by the planning phase, which runs a sponsored search auction (SSA) scheme. In the auction scheme, each active agent a_i has a private valuation v_i, submits a bid b_i ∈ R_{≥0}, and obtains a time reward of 1/t_i for completing the navigation task in t_i seconds, measured from the time the first agent begins to move. Note that moving earlier corresponds to a higher time reward. To summarize the algorithm: the agent with the highest bid, i.e., the most aggressive behavior, is allocated the highest priority and is allowed to navigate the scenario first, followed by the second-most aggressive, and so on. Therefore,

(σ_OPT)_i = j*,   (4.4)

where j* is the index of β_{j*} in the sorted sequence β_1 > β_2 > ... > β_{j*} > ... > β_K.

4.3.5 Game-Theoretic Optimality and Efficiency Analysis

In this section, we show that our approach is incentive compatible, welfare maximizing, and can be computed in polynomial time.

4.3.5.1 Incentive compatibility

An optimal auction should ensure that no agent is incentivized to "cheat" or, more simply, that the dominant strategy for each agent is to bid its true valuation v_i. We define a dominant strategy as follows.

Definition 4.3.2. Dominant Strategy: Bidding b_i is a dominant strategy for a_i if u_i(b_i, b_{−i}) > u_i(b'_i, b_{−i}) for all b'_i ≠ b_i.

Ensuring fair allocations is crucial for auctions applied to traffic scenarios, since unfair allocations could result in collisions and deadlocks.
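The planning phase of the algorithm above reduces to a sort over behavior profiles. A sketch with hypothetical agent names and CMetric values (both purely illustrative):

```python
def gameplan_ordering(behavior_profiles):
    """behavior_profiles: dict agent_id -> CMetric score beta_i.
    Each agent bids b_i = beta_i; the turn-based ordering is the
    bids sorted in decreasing order, so the most aggressive agent
    is allocated the first turn (highest time reward 1/t_1)."""
    return sorted(behavior_profiles, key=behavior_profiles.get, reverse=True)

order = gameplan_ordering({"car_a": 0.9, "car_b": 0.2, "car_c": 0.6})
# car_a (most aggressive) moves first, then car_c, then car_b
```

Because the behavior profiles are real-valued and distinct, the sort also breaks ties between similarly aggressive agents deterministically.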
Incentivizing traffic-agents to bid their true value as a dominant strategy is known as incentive compatibility [181, 182, 183, 184], which is defined as follows.

Definition 4.3.3. Incentive Compatibility: An auction is said to be incentive compatible if, for each agent, bidding b_i = v_i is a dominant strategy.

We want to show that σ_OPT is incentive compatible, maximizes welfare, and can be computed in polynomial time. Incentive compatibility ensures that each agent's best action is to report its behavior truthfully. In our formulation, we set the true valuation v_i of a traffic-agent equal to its behavior profile β_i; hence, v_i = β_i. To show that our auction is incentive compatible, we verify the following property.

Theorem 4.3.1. For each active agent a_i ∈ A at a traffic intersection, roundabout, or during merging, bidding b_i = β_i is the dominant strategy.

Proof. Recall that the k-th highest bidder (the k-th most aggressive agent) receives a time reward α_k = 1/t_k. Then, according to Equation 4.3, the overall utility achieved by the k-th most aggressive traffic-agent is

u_k(b_k) = β_k (1/t_k) − Σ_{j=k}^{K} b_{j+1} (1/t_j − 1/t_{j+1}).

We sort the K highest bids received in the order b_1 > b_2 > ... > b_K. In order to show that b_k = β_k is the dominant strategy, it is sufficient to show that over-bidding (b'_k > β_k) and under-bidding (b'_k < β_k) both result in a lower utility than u_k. We proceed by analyzing both cases.

Case 1: Over-bidding (b'_k = b_{k−1} > b_k): In this case, the new utility for a_k is

u'_k(b'_k) = β_k (1/t_{k−1}) − b_{k−1} (1/t_{k−1} − 1/t_k) − Σ_{j=k}^{K} b_{j+1} (1/t_j − 1/t_{j+1}).   (4.5)

From Equation 4.3 and Equation 4.5, the net change in utility is

u'_k(b'_k) − u_k(b_k) = (β_k − b_{k−1}) (1/t_{k−1} − 1/t_k).   (4.6)

Therefore, bidding b'_k > β_k implies u'_k(b'_k) − u_k(b_k) < 0, since b_{k−1} > β_k and t_{k−1} < t_k. In other words, overbidding yields negative utility for agent a_k.
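The utility of Equation 4.3 with α_k = 1/t_k, and the effect of deviating from a truthful bid, can be checked numerically; the agent names, valuations, and slot times below are illustrative:

```python
def utilities(bids, values, times):
    """Eq. 4.3 with alpha_k = 1/t_k: sort bids in decreasing order,
    the k-th highest bidder receives alpha_k and pays a telescoping
    sum over the bids ranked below it (with b_{K+1} = alpha_{K+1} = 0)."""
    order = sorted(bids, key=bids.get, reverse=True)
    alpha = [1.0 / t for t in times] + [0.0]
    b = [bids[a] for a in order] + [0.0]
    out = {}
    for k, agent in enumerate(order):
        pay = sum(b[j + 1] * (alpha[j] - alpha[j + 1]) for j in range(k, len(order)))
        out[agent] = values[agent] * alpha[k] - pay
    return out

values = {"a": 0.9, "b": 0.6, "c": 0.3}   # true behavior profiles beta_i
times = (2.0, 4.0, 8.0)                   # slot times t_1 < t_2 < t_3
truthful = utilities(dict(values), values, times)["b"]
overbid = utilities({"a": 0.9, "b": 1.0, "c": 0.3}, values, times)["b"]
underbid = utilities({"a": 0.9, "b": 0.2, "c": 0.3}, values, times)["b"]
assert overbid < truthful and underbid < truthful
```

Both deviations strictly lower agent b's utility, matching the two cases of the proof.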
Case 2: Under-bidding (b'_k = b_{k+1} < b_k): The new utility in this case is given by

u'_k(b'_k) = β_k (1/t_{k+1}) − Σ_{j=k+1}^{K} b_{j+1} (1/t_j − 1/t_{j+1}).   (4.7)

From Equation 4.3 and Equation 4.7, the net decrease in utility is

u_k(b_k) − u'_k(b'_k) = (β_k − b_{k+1}) (1/t_k − 1/t_{k+1}).   (4.8)

Note that Equation 4.8 is always positive, since β_k > b_{k+1} and t_k < t_{k+1}. This implies that under-bidding always results in a decrease in utility as well.

4.3.5.2 Welfare maximization

The next desired property in an optimal auction is welfare maximization [183, 184], which maximizes the total utility earned by every active agent.

Theorem 4.3.2. Welfare maximization: The social welfare of an auction is defined as Σ_i v_i α_i. Welfare maximization involves finding the strategy b that maximizes Σ_i v_i α_i. For each active agent a_i ∈ A, bidding b_i = β_i maximizes social welfare.

Proof. Our proof is based on induction. We begin with the base case of the most aggressive agent (the highest bidder). Recall that after sorting, we have agents in decreasing order of aggressiveness, i.e., β_1 > β_2 > ... > β_n and 1/t_1 > 1/t_2 > ... > 1/t_n. Therefore, β_1/t_1 is maximum. Next, consider the hypothesis that the sum Σ_{j=1}^{k} β_j/t_j is maximum up to the k-th highest bidder. The inductive step is to prove that Σ_{j=1}^{k+1} β_j/t_j is maximum. Observe that

Σ_{j=1}^{k+1} β_j/t_j = Σ_{j=1}^{k} β_j/t_j + β_{k+1}/t_{k+1}.

Note that the first term on the right-hand side is maximum by hypothesis. Then,

β_{k+1} > β_{k+2} > ... > β_n and 1/t_{k+1} > 1/t_{k+2} > ... > 1/t_K   (4.9)

imply that β_{k+1}/t_{k+1} is maximum.

4.3.5.3 Polynomial time computation

Finally, in terms of planning and auction design [184], it is important to show that the underlying auction is computationally efficient and can handle a large number of agents. We show that our approach runs in polynomial time via the following theorem.

Theorem 4.3.3. Polynomial Runtime: GamePlan runs in polynomial time.

Proof.
The main computation in our algorithm is dominated by sorting the agents' CMetric values, and it is known that sorting algorithms run in polynomial time [190].

4.3.6 Using σ_OPT for collision prevention and deadlock resolution

We identify a deadlock as a situation in which two or more active traffic-agents remain stationary for an extended period of time due to uncertainty in the actions of the other active traffic-agents. Deadlocks may arise in traffic scenarios consisting of multiple conservative and/or aggressive agents and are resolved when one of the agents opts to move based on some heuristic. Via σ_OPT, agents automatically know when each agent is supposed to move, thereby eliminating any confusion or uncertainty about the actions of other agents. GamePlan can also prevent collisions in a similar manner. The number of collisions, or the likelihood thereof, increases when two or more aggressive agents decide to MOVE first, simultaneously, despite the uncertainty in the actions of the other agents. σ_OPT can break ties between multiple aggressive drivers, since β_i ≠ β_j for i, j ∈ [1, n]. Using the turn-based ordering determined by σ_OPT, less aggressive agents can let more aggressive agents pass first.

4.3.7 Conclusion, Limitations, and Future Work

We present GamePlan, a novel multi-agent game-theoretic planning algorithm for intersections, roundabouts, and merging scenarios with human drivers and autonomous vehicles. GamePlan combines the behavior profiles of all traffic-agents with sponsored search auctions to produce an optimal turn-based ordering. We show that GamePlan is incentive compatible, welfare maximizing, and operates in polynomial time. We reduce the number of collisions and deadlocks by at least 10-20% on average over prior methods. Moreover, we demonstrate GamePlan in two merging scenarios involving real human drivers and show that our game-theoretic model is applicable in real-world scenarios. There are a few limitations of our work.
Our approach is primarily designed for moderately to highly dense traffic, as CMetric [14] may not work as well in sparse traffic conditions; in such cases, data-driven behavior models such as SVO [75] may be used. There are many interesting directions for future work. For example, our method currently does not plan beyond computing turn-based orderings, i.e., local navigation; next steps may include integrating GamePlan with global motion planning methods to achieve an end-to-end navigation approach for non-communicating multi-agent traffic scenarios. In addition, we have currently demonstrated real-world application with 2-3 vehicles; in the future, we plan to conduct further evaluation in denser and more comprehensive real-world settings with more vehicles.

4.4 Risk-Aware Planning

Risk-aware planning involves sequential decision-making in dynamic and uncertain environments, where agents must consider the risks associated with their actions and the corresponding costs and rewards [191]. Risk-seeking agents are willing to accept a lower expected reward in exchange for a higher reward variance (more risk), while risk-averse agents are willing to accept a lower expected reward in exchange for a lower reward variance (less risk). Agents that are risk-averse or risk-seeking are collectively referred to as risk-aware. Human drivers are risk-aware by nature [192, 193, 194]. For example, aggressive drivers frequently speed, overtake, and perform sharp cut-ins, whereas conservative drivers drive more cautiously. To navigate successfully among human drivers, autonomous vehicles (AVs) must identify the risk preferences of human drivers online, and predict and plan future motion with the risk preferences of all agents in mind, including the AV's own. The most common risk measures utilized in risk-sensitive planning are entropic risk [195] and conditional value at risk (CVaR) [196].
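The two risk measures just named can be stated compactly in empirical form; the sample losses and parameter values in this sketch are illustrative:

```python
import numpy as np

def entropic_risk(losses, theta=1.0):
    """Entropic risk rho(X) = (1/theta) log E[exp(theta X)].  For
    theta > 0 this upweights large losses, so rho(X) >= E[X]."""
    x = np.asarray(losses, dtype=float)
    return float(np.log(np.mean(np.exp(theta * x))) / theta)

def cvar(losses, alpha=0.9):
    """Empirical CVaR_alpha: the mean of the worst (1 - alpha)
    fraction of losses, i.e. the expected loss in the tail beyond
    the value-at-risk."""
    x = np.sort(np.asarray(losses, dtype=float))
    tail = max(1, int(np.ceil((1 - alpha) * len(x))))
    return float(x[-tail:].mean())
```

Both measures reduce to the plain expectation in the risk-neutral limit (θ → 0 and α → 0, respectively), which is one reason they are convenient parameterizations of risk preference.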
A popular approach to risk-aware planning in multi-agent traffic scenarios is to model risk-aware agent interactions via dynamic games [13], wherein agents act while considering both their impact on other agents and the intentions of those agents. In [13], the authors compute the Nash equilibrium solution of the game by iteratively solving a set of LEQ equations [197, 198]. The main benefits of this approach compared to prior risk-aware planning methods include improved time-to-goal and, more importantly, the generation of emergent behaviors. For instance, risk-averse agents learn to maintain a greater distance from risk-seeking agents and generally yield more frequently to risk-seeking agents at intersections, at roundabouts, and during merging.

Despite its performance and benefits, the main drawback of the approach proposed in [13] is that it does not model the risk tolerance of human drivers: it assumes the AV knows the synthetically chosen risk tolerances of all other driving agents. Extending game-theoretic risk-aware planning to human drivers would allow AVs to act more confidently around human drivers and reduce time-to-goal via more efficient and safer navigation. Estimating the risk tolerances of human drivers, however, requires computationally tractable human driver behavior models that can characterize drivers. Some state-of-the-art approaches for modeling human driver behavior [75, 119, 131] are data-driven and require a large volume of clean training data; these methods classify behaviors as aggressive or conservative [119], or as selfish or altruistic [75]. In contrast, deterministic models [14] do not require data and assign a real-valued score to each agent to quantify its behavior. These approaches can be integrated with risk-aware planning frameworks to incorporate planning for human agents.
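One way such an integration can look is an affine map from a behavior score to a risk-tolerance parameter. The thesis derives a specific linear mapping; the ranges and endpoints below are purely illustrative stand-ins:

```python
def risk_tolerance(beta, beta_min=0.0, beta_max=1.0, rho_min=-1.0, rho_max=1.0):
    """Affine map from a behavior score beta in [beta_min, beta_max]
    (e.g. a CMetric value) to a risk-tolerance parameter rho in
    [rho_min, rho_max]: low beta (conservative) maps to risk-averse
    (negative rho), high beta (aggressive) maps to risk-seeking."""
    frac = (beta - beta_min) / (beta_max - beta_min)
    return rho_min + frac * (rho_max - rho_min)
```

The resulting rho can then be plugged into a risk-sensitive planner as each human driver's estimated risk parameter, replacing the fixed synthetic tolerances assumed in [13].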
Main Contributions: We propose a novel approach for risk-aware planning in multi-agent traffic scenarios that takes human driver behaviors into account. We extend an existing risk-aware planner [13] by incorporating interactions with human drivers using the CMetric human driver behavior model [14]. We derive a linear mapping between driver behavior and risk tolerance, which serves as the key component of our proposed approach. To evaluate our approach, we validate the mapping between driver behavior and risk tolerance by measuring the number of lane changes, and test the accuracy of this model via K-Means clustering. Our results show that aggressive human driving results in more frequent lane changing. We confirm that the final trajectories obtained from the risk-aware planner generate emergent behaviors. We measure the yield percentage and the minimum distance between human drivers at intersections, at roundabouts, and during merging, where we observe that conservative drivers generally yield to aggressive drivers while maintaining a greater distance from them. We also conduct a user study in which we show that users are able to distinguish between aggressive and conservative trajectories generated by the planner. Finally, we compare our modified risk-aware planner with existing planners that do not model human drivers and show that modeling human drivers results in safer navigation. Specifically, [13] (and similar planners) assign a fixed, neutral risk tolerance to human drivers, and the ego-vehicle plans with respect to the human driver accordingly. However, when the human driver is, in fact, either aggressive or conservative, we show that the error (the absolute difference between the expected and observed minimum relative distance between the agents) increases by 10%.

4.4.1 Related Work

4.4.1.1 Risk-Aware Planning

Risk-sensitivity-based planning [199, 200, 201, 202, 203] considers the risk associated with the actions of agents to avoid unsafe situations.
The most common risk measures utilized in risk-sensitive planning are the entropic risk [195] and the conditional value at risk (CVaR) [196]. Entropic risk has been widely used in optimal control due to its simplicity and tractability [204], while CVaR has recently been incorporated in trajectory optimization due to its interpretability [201]. Risk-aware planning has been used extensively in autonomous underwater vehicles [205], ground vehicles [206, 207], and unmanned aerial vehicles (UAVs) [208]. While the CVaR risk model has been used in the latter two cases, [205] used the entropic measure of risk. In addition to CVaR and the entropic models, several other models are used in various applications, such as the dynamic risk density function for collision avoidance [209] and semantic maps for simultaneous localization and mapping (SLAM) [210, 211].

4.4.1.2 Data-Driven Methods for Driver Behavior Prediction

Data-driven methods broadly follow two approaches. In the first approach, various machine learning algorithms such as clustering, regression, and classification predict or classify driver behavior as either aggressive or conservative. These methods have been studied in traffic psychology and the social sciences [76, 78, 81, 86, 87, 88, 92, 93, 94, 212, 213, 214, 215, 216, 217, 218, 219]. So far, there has been relatively little work on improving the robustness of these methods and their ability to generalize to different traffic scenarios, steps that require ideas from computer vision and robotics. The second approach uses trajectories to learn reward functions for human behavior using inverse reinforcement learning (IRL) [75, 119, 131]. IRL-based methods, however, have certain limitations. IRL requires large amounts of training data, and the learned reward functions are unrealistically tailored towards scenarios observed only in the training data [119, 131]. For instance, [119] requires 32 million data samples for optimum performance.
Additionally, IRL-based methods are sensitive to noise in the trajectory data [75, 131]. Consequently, current IRL-based methods are restricted to simple and sparse traffic conditions.

4.4.2 Algorithm

We consider N agents consisting of a mixture of human drivers and AVs, with the system dynamics defined by Equation 4.11, and we define the cost function for each agent by Equation 4.13. A human driver is simulated using a user-controlled keyboard with the following features: acceleration, braking, and lane changing. For simplicity, we test with one human driver, but our approach can work with more than one human driver. We further assume that agents are non-ideal in that they are not provided the risk tolerances of other agents. The input to our approach consists of the state and control signals of every agent at time t. Our goal is then to compute the Nash equilibrium trajectories for all agents. The trajectories for the human agents are predictions, while the trajectories for the AVs can be executed in a receding-horizon planning loop. Finally, none of the agents are assumed to follow constant-velocity models. We describe our algorithm (Figure 3.2). The first step is to read the trajectories of every agent over a finite horizon T. These trajectories correspond to human agents. The second step is to compute the CMetric value, ζ, for each agent during T; the CMetric value encodes the aggressive (or conservative) nature of the driver via certain indicators such as speeding, overtaking, and zigzagging. The third step consists of mapping each agent's CMetric value to their risk sensitivity. This is performed using a linear transformation obtained by simple linear regression, discussed in detail in Section 4.4.2.1. Finally, based on the risk sensitivity, we perform game-theoretic risk-aware planning using the planner developed by Wang et al. [13].

4.4.2.1 CMetric to Risk Sensitivity

We denote the risk sensitivity parameter by λ.
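The four-step pipeline described above (trajectories → CMetric → risk sensitivity → planner) can be sketched as follows. This is a hypothetical illustration: `compute_cmetric` below is a crude speed-variability proxy rather than the actual graph-theoretic CMetric [14], `map_to_risk` uses made-up regression coefficients, and the game-theoretic planner of [13] is abstracted behind a `solver` callable.

```python
import math

def compute_cmetric(traj, dt=0.1):
    """Toy stand-in for CMetric [14]: score aggressiveness by speed variability.

    traj is a list of (x, y) positions sampled every dt seconds over horizon T.
    (The real CMetric uses graph centrality measures; this proxy is illustrative.)"""
    speeds = [math.dist(traj[i + 1], traj[i]) / dt for i in range(len(traj) - 1)]
    mean = sum(speeds) / len(speeds)
    return sum((s - mean) ** 2 for s in speeds) / len(speeds)

def map_to_risk(zeta, beta0=5.0, beta1=-1.0):
    """M(zeta) = beta1 * zeta + beta0 (Eq. 4.10); coefficients here are invented.

    More aggressive (larger zeta) maps to a more negative, risk-seeking lambda."""
    return beta1 * zeta + beta0

def plan(trajectories, solver):
    """Steps 1-4: observed trajectories -> CMetric -> risk sensitivity -> planner."""
    risks = {a: map_to_risk(compute_cmetric(t)) for a, t in trajectories.items()}
    return solver(trajectories, risks)
```

A smooth, constant-speed trajectory yields a CMetric of zero (conservative, positive λ), while an erratic stop-and-go trajectory yields a large CMetric and a negative, risk-seeking λ.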
We first compute a linear mapping M : Z → Λ. Since both ζ ∈ Z and λ ∈ Λ are scalars, we can use simple one-dimensional linear regression to estimate M. We create a training dataset by first generating trajectories corresponding to a fixed array of risk sensitivity values ranging from −5.0 (risk-seeking) to +5.0 (risk-averse). We denote these risk sensitivity values as λ̂ to indicate that they are training values. We then evaluate the CMetric values, represented as ζ̂, corresponding to each of these trajectories using the algorithm described in the previous section. The risk sensitivity and CMetric pairs constitute the training dataset on which we apply linear regression to estimate the linear coefficients β_0 and β_1. M is then defined as follows,

M(ζ) = β_1 ζ + β_0,    (4.10)

where ζ is the CMetric value of a human agent at test time.

4.4.2.2 Risk-Aware Planning

The system dynamics for each agent are given by

x_{t+1} = A_t x_t + B_t^1 u_t^1 + B_t^2 u_t^2 + w_t.    (4.11)

To simplify notation, we describe a two-player system, although our approach easily generalizes to n agents. x_t = [x_t^1, x_t^2] ∈ X represents the system state, and x_t^i = [p_x^i, p_y^i, v_x^i, v_y^i] denotes the position (in meters) and velocity (in meters/second) of agent i. u_t^1 = a^1, u_t^2 = a^2 ∈ U are the control inputs denoting the accelerations of the two agents, w_t ∼ N(0, W_t) is the system noise, and A_t, B_t^1, B_t^2 are fixed matrices of appropriate dimensions. An agent incurs the following cost over a finite horizon T:

ξ^i = Σ_{t=0}^{T−1} [ (1/2) x_t^⊤ Q_t^i x_t + l_t^{i⊤} x_t + (1/2) Σ_j u_t^{j⊤} R_t^{ij} u_t^j ] + (1/2) x_T^⊤ Q_T^i x_T + l_T^{i⊤} x_T,    (4.12)

where Q_t ⪰ 0 and R_t ≻ 0. To model risk, we use the exponential risk cost function used in [13],

J(ξ) = (1/M(ζ)) log E[ e^{M(ζ) ξ} ] = R_{M(ζ)}(ξ),    (4.13)

where M(ζ) is the risk tolerance of a human driver.
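The one-dimensional regression of Section 4.4.2.1 can be sketched in a few lines. The training pairs below are synthetic placeholders (the dissertation generates them by simulating trajectories at known risk sensitivities in [−5, +5] and evaluating CMetric on each), so the fitted coefficients are purely illustrative:

```python
import numpy as np

# Synthetic training pairs (zeta_hat, lambda_hat): illustrative numbers only,
# standing in for CMetric values evaluated on trajectories generated at known
# risk sensitivities between -5.0 (risk-seeking) and +5.0 (risk-averse).
zeta_hat   = np.array([0.9, 0.7, 0.5, 0.3, 0.1])    # CMetric (higher = more aggressive)
lambda_hat = np.array([-5.0, -2.5, 0.0, 2.5, 5.0])  # risk sensitivity

beta1, beta0 = np.polyfit(zeta_hat, lambda_hat, deg=1)  # one-dimensional regression

def M(zeta):
    """Eq. 4.10: map a CMetric value to a risk-sensitivity estimate."""
    return beta1 * zeta + beta0

print(M(0.8))  # an aggressive driver maps to a risk-seeking (negative) lambda
```

At test time, only M(ζ) is evaluated online; the regression itself is fit offline on the training pairs.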
Remark 1: The difference between Equation 4.13 and the risk cost function described in [13] is that the risk parameter in the latter work is a fixed value for every agent, whereas in this work we generate the risk parameter for human agents automatically, in a data-driven fashion. The optimal strategies for each player can be obtained by minimizing J(ξ^i) for each agent i and obtaining the Nash equilibrium using Riccati recursion [220, Chap. 6]. However, Equation 4.13 is constrained by the fact that M(ζ) must be bounded. If M(ζ) is too low or too high, then the cost function value approaches ∞, a phenomenon known as "neurotic breakdown" [13]. Due to the data-driven nature of Equation 4.13, certain traffic parameters such as the traffic density are assumed fixed in order to ensure optimality, since they affect the CMetric value [14] and, by Equation 4.10, the risk sensitivity of the human agent used in Equation 4.13.

4.4.3 Experiments and Results

In this section, we present the results of extensive experiments testing the accuracy of the linear mapping between human driver aggressiveness and risk tolerance. We then evaluate the emergent behaviors associated with the final trajectories generated by the iterative risk-sensitive game-theoretic solver, compare with [13], which is chosen as the baseline (in which human driver behavior is ignored), and finally discuss using alternative human driver behavior models. All experiments are performed using a 12-core 2.60 GHz Intel i7 processor. We conduct open-loop tests that follow the pipeline outlined in Figure 3.2. We use the OpenAI traffic simulator [149] to compute the CMetric values representing human driver behavior and the Python-based controller provided by Wang et al. [13] to generate the final trajectories based on the risk tolerances obtained from the corresponding CMetric values. The configurations of both the simulator and the controller (which include the dynamics of the vehicles, traffic density, number of lanes, etc.)
are kept identical so that all vehicles generated using the controller are tracked in the simulator. We compute the CMetric values of the human driver in a highway scenario, since we require a fixed duration of time (5 s) during which we must observe the vehicle's trajectory and its interactions with other vehicles. For the risk-aware trajectory controller, we consider a merging scenario in which a human agent must merge onto a highway with another human agent in the target merging lane. Here, a human agent is an agent whose risk sensitivity value is obtained from its CMetric value. We assume vehicles follow the center line of their current driving lanes and only consider each vehicle's speed to finish the merging maneuver. In other words, we assume a steering controller is executed separately for each car to keep it in its lane.

Figure 4.1: We highlight the relationship between the CMetric value and the risk parameter λ. Refer to Section 4.4.3.1 for further details.

4.4.3.1 Verifying the accuracy of M

In Figure 4.1, we plot the risk parameter λ (y-axis) obtained from a given CMetric value (x-axis) via the linear mapping. When computing the risk parameters corresponding to each CMetric value, we vary the simulation configuration (traffic density, number of lanes, etc.) to include a range of environments. This results in λ belonging to a range (as opposed to a fixed value). This is desirable since, in practice, traffic will vary according to place and time. The risk parameters are clustered into four categories: "very conservative", "conservative", "aggressive", and "very aggressive". Each cluster is identified by a color. The empty circles are training data. The goal of this experiment is to cluster a test set of CMetric values (solid-colored points) based on their risk sensitivity. The test data are generated by a human driving the OpenAI simulator [149] in a randomly selected environment consisting of eleven vehicles and four lanes.
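The four-way grouping of risk parameters can be reproduced with a minimal one-dimensional K-Means. The λ samples below are synthetic stand-ins for the values plotted in Figure 4.1, drawn around four behavior modes; the clustering routine itself is a toy reimplementation, not the dissertation's code:

```python
import numpy as np

def kmeans_1d(values, k=4, iters=50):
    """Minimal 1-D K-Means for grouping risk parameters lambda into behavior
    categories (a stand-in for the clustering of Section 4.4.3.1)."""
    values = np.asarray(values, dtype=float)
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))  # spread initial centers
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return centers, labels

# Synthetic lambda values drawn around four modes, from "very aggressive"
# (most negative) to "very conservative" (most positive).
rng = np.random.default_rng(2)
lam = np.concatenate([rng.normal(m, 0.3, 25) for m in (-4.0, -1.5, 1.5, 4.0)])
centers, labels = kmeans_1d(lam, k=4)
print(np.sort(centers))  # recovered centers land near the four modes
```

A new test-time λ is then assigned to the behavior category whose center is nearest.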
The results (Figure 4.1) demonstrate that, given the CMetric value, the linear regression mapping can accurately identify the risk sensitivity across a wide range of traffic configurations.

Figure 4.2: Yielding behaviors: Darker colors (indicating a higher likelihood of yielding) correspond to interactions between a risk-seeking agent and a risk-averse agent. As the risk tolerances of both human drivers are data-driven, and therefore noisy, both agents adapt to each other. As a result, when either agent is risk-averse, we see a higher yielding likelihood (darker colors).

Another metric we use to validate the correlation between the CMetric and the corresponding risk sensitivity is the average number of lane changes. Based on the final trajectories generated from the risk sensitivity parameters (obtained from the corresponding CMetric values) and using the controller provided by [13], we measure the average number of lane changes made by the ego-vehicle. The reason for using the average number of lane changes as a metric is that aggressive drivers change lanes more frequently than non-aggressive and conservative drivers. The aim of the experiment, therefore, is to check whether an aggressive human-driven vehicle (modeled using the keyboard of the OpenAI simulator) results in more lane changes by the final simulated ego-vehicle (simulated using the Python controller) and, conversely, whether a conservative human driver results in fewer simulated lane changes. In Figure 4.1, we confirm that this is indeed the case; aggressive drivers (λ < 0) produce a greater number of lane changes than conservative drivers (λ > 0).

4.4.3.2 Emergent behaviors

We evaluate the final trajectories generated using the learned risk sensitivity of human drivers in a merging scenario where a human agent attempts to merge onto the highway. Different risk sensitivities yield a range of emergent behaviors. For example, in [13], Wang et al.
showed that two risk-averse agents maintain a larger minimum distance between them, while risk-seeking agents may allow a smaller gap. Further, in an interaction between a risk-averse and a risk-seeking agent, there is a higher likelihood of the risk-averse agent yielding to the risk-seeking agent. The experiments conducted by Wang et al. modeled synthetic agents for which the risk sensitivity must be manually chosen. Here, we run the same set of experiments for human agents. In Figure 4.2, we observe darker colors (indicating a larger minimum distance) corresponding to two risk-averse human agents and lighter colors (indicating a smaller minimum distance) corresponding to two risk-seeking human agents. We also observe darker colors (indicating a higher likelihood of the risk-averse agent yielding) corresponding to interactions between a risk-seeking agent and a risk-averse agent.

4.4.3.3 User studies

We recruited 27 participants to respond to a user study consisting of two questions. The first question involved showing two video clips of final trajectories. The first clip (top) showed a risk-seeking trajectory (λ = −2.429), while the second (bottom) showed a risk-averse trajectory (λ = 3.651). Participants were not told the risk preferences that generated the trajectories and were asked to identify which trajectory corresponded to an aggressive driver. The goal of this question is to qualitatively assess the emergent nature of the final trajectories. That is, based on simply observing the nature of a trajectory, can a human distinguish between the generated trajectories? We answer in the affirmative; 26 out of the 27 participants correctly identified the risk-seeking driver as the aggressive driver.

4.4.3.4 Comparing with the baseline

We compare our modified risk-aware planner with existing planners that do not model human drivers and show that modeling human drivers results in safer navigation.
Specifically, [13] (and similar planners) assign a fixed, neutral risk tolerance to human drivers, and the ego-vehicle plans with respect to the human driver accordingly. There are two possible outcomes:

1. Suppose the human driver is, in fact, aggressive. Then, by modeling the driver with a neutral risk tolerance, the ego-vehicle may stray close to the aggressive driver as opposed to keeping a safe distance from them.

2. Conversely, suppose the human driver is conservative. Then, by modeling the driver with a neutral risk tolerance, the ego-vehicle may enter a brief deadlock during which both agents wait to see who moves first.

We aim to capture these inefficiencies via a single error metric: the absolute difference between the expected and observed minimum relative distance between the two agents. This metric is suitable since, in both cases, it measures the discrepancy between the expected distance and the actually observed distance. In the first case, the expected minimum relative distance between the agents is larger than the observed distance, while in the second case, the observed minimum distance is larger than the expected distance. In both cases, the error is positive by virtue of the absolute value. Empirically, the maximum RMSE observed is 0.0425 m, or 10%, as shown in Figure 4.3.

4.4.3.5 Using alternative human driver behavior models

Thus far, we have demonstrated that CMetric can be effectively integrated with risk-aware planning to generate game-theoretic, behavior-rich trajectories. Alternative models for human driver behavior, such as the SVO, can in theory also be used. However, there are practical issues when it comes to integrating these alternative models in risk-aware planning. Here, we discuss some of these challenges. SVO is an offline technique that requires a large volume

Figure 4.3: Comparison with [13]: The approach by Wang et al. assumes a neutral risk sensitivity for human agents.
However, when an ego-agent interacts with a human driver who may be aggressive or conservative (indicated by the negative and positive values on the x-axis, respectively), the assumption of a neutral risk tolerance results in an error in terms of the absolute difference between the expected and observed minimum relative distance between the two agents.

of training data to learn a data-driven reward function via inverse reinforcement learning. Our technique is meant to be deployed in real time and, as such, we test in an open-loop simulation and use active metrics such as yield percentage, frequency of lane changes, and minimum distance between agents. The SVO approach, on the other hand, uses the RMSE to measure the deviation of the predicted trajectories from the ground-truth trajectories. We do not assume the availability of ground-truth data. In the future, we will conduct experiments comparing CMetric with SVO once the source code for SVO is made public.

4.4.4 Conclusion, Limitations, and Future Work

We presented an approach for risk-aware planning in multi-agent traffic with human agents. The basic intuition of our approach is that the aggressiveness of a driver is linearly correlated with their risk preference. That is, aggressive drivers are more risk-seeking, while conservative drivers are more risk-averse. Accordingly, we integrate a human driver behavior model [14] with the risk-aware dynamic game solver of [13] via simple linear regression to derive a mapping between driver behavior and risk tolerance. Our results show that aggressive human driving results in more frequent lane changing. We show that conservative drivers generally yield to aggressive drivers while maintaining a greater distance from them. Finally, we confirm that the final trajectories obtained from the risk-aware planner generate emergent behaviors through a comprehensive user study in which participants were able to distinguish between aggressive and conservative drivers. There are some limitations to our method.
Currently, we have tested our approach in an open-loop simulation in which we use two different simulators for the human behavior model and the trajectory planner. To use both simulators together effectively, the environment configuration must be kept identical, which is cumbersome. In the future, we will explore a closed-loop simulator that combines the human behavior model and the risk-aware trajectory planner.

Chapter 5: Software and Datasets

We present a new traffic dataset, METEOR, which captures traffic patterns and multi-agent driving behaviors in unstructured scenarios. METEOR consists of more than 1000 one-minute videos, over 2 million annotated frames with bounding boxes and GPS trajectories for 16 unique agent categories, and more than 13 million bounding boxes for traffic agents. METEOR is a dataset of rare and interesting multi-agent driving behaviors, grouped into traffic violations, atypical interactions, and diverse scenarios. Every video in METEOR is tagged using a diverse range of factors corresponding to weather, time of day, road conditions, and traffic density. We use METEOR to benchmark perception methods for object detection and multi-agent behavior prediction. Our key finding is that state-of-the-art models for object detection and behavior prediction, which otherwise succeed on existing datasets such as Waymo, fail on the METEOR dataset. METEOR marks a first step towards the development of more sophisticated perception models for dense, heterogeneous, and unstructured scenarios.

5.1 Overview

Recent research in learning-based techniques for robotics, computer vision, and autonomous driving has been driven by the availability of datasets and benchmarks. Several traffic datasets have been collected from different parts of the world to stimulate research in autonomous driving, driver assistants, and intelligent traffic systems.
These datasets correspond to highway or urban traffic and are widely used in the development and evaluation of new methods for perception [45], prediction [66], behavior analysis [120], and navigation [221]. Many initial autonomous driving datasets were motivated by computer vision or perception tasks such as object recognition, semantic segmentation, or 3D scene understanding. More recently, many other datasets have been released that consist of point-cloud representations of objects captured using LiDAR, pose information, 3D track information, stereo imagery, or detailed map information for applications related to 3D object recognition and motion forecasting. Many large-scale motion forecasting datasets, such as Argoverse [222] and the Waymo Open Motion Dataset [223], among others, have been used extensively by researchers and engineers to develop robust prediction models that can forecast vehicle trajectories. However, existing datasets do not capture rare behaviors or heterogeneous patterns. Therefore, prediction models trained on these existing datasets are not very robust in terms of handling challenging traffic scenarios that arise in the real world. A major challenge currently faced by research in autonomous driving is the heavy-tail problem [222, 223], which refers to the challenge of dealing with rare and interesting instances. There are several ways in which existing datasets currently address the heavy-tail problem:

1. Mining: The Argoverse and Waymo datasets use a mining procedure that includes scoring each trajectory based on its "interestingness" to explicitly search for difficult and unusual scenarios [222, 223].

2. Diversifying the taxonomy: Train the prediction and forecasting models to identify unknown agents at test time. This approach necessitates annotating a diverse taxonomy of class labels. Argoverse and nuScenes [224] contain 15 and 23 classes, respectively.

3.
Increasing dataset size: This approach is to simply collect more data, with the premise that collecting more traffic data will likely also increase the number of such scenarios in the dataset.

In spite of many efforts along these lines, existing datasets manage to collect only a handful of such instances, due to the infrequent nature of their occurrence. For example, the Waymo Open Motion dataset [223] contains only atypical interactions and diverse scenarios, while the Argoverse dataset [222] contains only atypical interactions. There is clearly a need for a different approach to addressing the heavy-tail problem. Our solution is to build a traffic dataset from videos collected in India, where the inherent nature of the traffic is dense, heterogeneous, and unstructured. The traffic patterns and surrounding environment in parts of India are more challenging than those in other parts of the world. This includes high congestion and traffic density. Some of these roads are unmarked or unpaved. Moreover, the traffic agents moving on these roads include cars, buses, trucks, bicycles, pedestrians, auto-rickshaws, and two-wheelers such as scooters and motorcycles.

5.1.1 Main Contributions

1. We present a novel dataset, METEOR, corresponding to the dense, heterogeneous, and unstructured traffic in India. METEOR is the first large-scale dataset containing annotated scenes for rare and interesting instances and multi-agent driving behaviors, broadly grouped into:

(a) Traffic violations: running traffic signals, driving in the wrong lanes, taking wrong turns.

(b) Atypical interactions: cut-ins, yielding, overtaking, overspeeding, zigzagging, lane changing.

(c) Diverse scenarios: intersections, roundabouts, and traffic signals.

2. METEOR has more than 2 million labeled frames and 13 million annotated bounding boxes for 16 unique traffic agents, and GPS trajectories for the ego-agent.

3.
Every video in METEOR is tagged using a diverse range of factors including weather, time of day, road conditions, and traffic density.

4. We evaluate state-of-the-art methods for object detection and multi-agent behavior prediction on METEOR.

5. We present a novel, fine-grained analysis of the relationship between traffic environments and perception. Specifically, we study the effect of varying traffic density, mixture of agents, area, time of day, and weather conditions on 2D object detection.

5.1.2 Applications and Benefits

- Towards Risk-Aware Planning and Control: Our multi-agent behavior prediction benchmark can aid the development of risk-aware motion planners by predicting the behaviors of surrounding agents. Motion planners can compute controls that guarantee safety around aggressive drivers who are prone to overtaking and overspeeding.

- Towards Robust Perception: We observe that these models fail in challenging Indian traffic scenarios, compared to their performance on existing datasets captured in the US, Europe, and other developed nations. As a result, METEOR can be a useful benchmark for research in perception in unstructured traffic environments and developing nations.

Figure 5.1: METEOR: We summarize various characteristics of our dataset in terms of scene (traffic density, road type, lighting conditions), agents (we indicate the total count of each agent across 1250 videos), and behaviors, along with their size distribution (in GB). The total size of the current version of the dataset is around 100 GB, and it will continue to expand. Our dataset can be used to evaluate the performance of current and new methods for perception, prediction, behavior analysis, and navigation based on some or all of these characteristics. Details of the organization of our dataset are given at https://gamma.umd.edu/meteor.

-
Towards Fine-grained Traffic Analysis: Our novel analysis of the relationship between traffic patterns and 2D object detection can lead to more informed research in perception for autonomous driving.

5.2 Comparison with Existing Datasets

5.2.1 Tracking and Trajectory Prediction Datasets

Datasets such as Argoverse [222], Lyft Level 5 [225], the Waymo Open Dataset [223], ApolloScape [72], and nuScenes [224] are used for trajectory forecasting [65, 102, 104, 226, 227] and tracking [45]. Several of these datasets use a mining procedure [222, 223] that heuristically searches the dataset for rare and interesting scenarios. The resulting collection of such scenarios and behaviors, however, is only a fraction of the entire dataset. METEOR, by comparison, exclusively contains such scenarios due to the inherent nature of the unstructured traffic in India. METEOR has many additional characteristics with respect to these datasets. For instance, METEOR's 2.02 million annotated frames are more than 10× the current highest number of annotated frames among datasets with high-density traffic (ApolloScape). Furthermore, METEOR consists of 16 different traffic-agent classes that include only on-road moving entities (and not static obstacles). This is, by far, the most diverse set of class labels; in comparison, Argoverse and nuScenes contain 10 and 13 traffic-agent classes, respectively. METEOR is the first motion forecasting and behavior prediction dataset with traffic patterns from rural and urban areas that include unmarked roads and high-density traffic. In contrast, the traffic scenarios in Argoverse, Waymo, Lyft, and nuScenes have been captured in sparse- to medium-density traffic on well-marked, structured roads in urban areas.
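The scale comparisons above can be checked directly from the counts reported in Table 5.1 (annotated frames: METEOR 2027K vs. ApolloScape 144K; moving-agent classes: METEOR 16 vs. nuScenes 13 and Argoverse 10):

```python
# Quick arithmetic check of the scale claims, using the counts from Table 5.1.
meteor_frames, apolloscape_frames = 2_027_000, 144_000
meteor_classes, nuscenes_classes, argoverse_classes = 16, 13, 10

frame_ratio = meteor_frames / apolloscape_frames
assert frame_ratio > 10  # "more than 10x" the highest high-density dataset
assert meteor_classes > max(nuscenes_classes, argoverse_classes)
print(round(frame_ratio, 1))  # ~14.1
```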
Table 5.1: Characteristics of Traffic Datasets: We compare METEOR with state-of-the-art autonomous driving datasets that have been used for trajectory tracking, motion forecasting, semantic segmentation, prediction, and behavior classification. METEOR is the largest (in terms of number of annotated frames) and most diverse in terms of heterogeneity, scenarios, varying behaviors, densities, and rare instances. Darker shades represent a richer collection in that category. Best viewed in color.

Dataset | Location | Road type | Het.‡ | Size | Density
Argoverse [222] | USA | urban | 10 | 22K | Medium
Lyft Level 5 [225] | USA | urban | 9 | 46K | Low
Waymo [223] | USA | urban | 4 | 200K | Medium
ApolloScape [72] | China | urban, rural | 5 | 144K | High
nuScenes [224] | USA/Sg. | urban | 13 | 40K | Low
INTERACTION [228] | International | urban | 1 | - | Medium
CityScapes [229] | Europe | urban | 10 | 25K | Low
IDD [230] | India | urban, rural | 12 | 10K | High
HDD [231] | USA | urban | - | 275K | Medium
Brain4cars [232] | USA | urban | - | 2000K | Low
D2-City [233] | China | urban | 12 | 700K | Medium
TRAF [65] | India | urban, rural | 8 | 72K | High
BDD [234] | USA | urban | 8 | 3000K | Low
METEOR | India | urban, rural† | 16¶ | 2027K | High§

* Rare instances can be broadly grouped into (i) traffic violations, (ii) atypical interactions, and (iii) difficult scenarios.
† Includes roads without lane markings; roads in other datasets with rural roads may contain lane markings.
‡ Heterogeneity: we indicate the classes corresponding to moving traffic agents only, excluding static objects such as poles, traffic lights, etc.
§ Up to 40 agents per frame.
¶ Up to 9 unique agents per frame.

5.2.2 Semantic Segmentation Datasets

CityScapes [229] is widely used for several tasks, primarily semantic segmentation.
It is based on urban traffic data collected from European cities with structured roads and low traffic density. In contrast, the Indian Driving Dataset (IDD) [230] is collected in India in both urban and rural areas with high-density traffic. A common aspect of both these datasets (CityScapes and IDD), however, is the relatively low annotated-frame count (25K and 10K, respectively). This is probably due to the effort involved in annotating every pixel of each image. IDD also contains high-density traffic scenarios in rural areas, similar to METEOR. However, our dataset has 200× the number of annotated frames and 1.6× the number of traffic-agent classes. Similar to TRAF, IDD does not contain the behavior data that is provided by METEOR.

5.2.3 Behavior Prediction

Behavior prediction corresponds to the task of predicting turns (right, U-turn, or left), acceleration, merging, and braking, in addition to driver-intrinsic behaviors such as overspeeding, overtaking, cut-ins, yielding, and rule-breaking. The two most prominent datasets for action prediction are the Honda Driving Dataset (HDD) [231] and the BDD dataset [234]. The major distinctions between METEOR and the HDD include METEOR's size (approximately 10× larger), the availability of scenes with night driving and rainy weather, and the inclusion of unstructured environments; the HDD contains low-density traffic. The BDD dataset [234] contains more annotated samples than METEOR; however, BDD contains 100K videos while METEOR contains 1K videos, so the number of annotated samples per video is 66× higher for METEOR. The annotations in prior datasets are limited to actions and do not contain the rare and interesting behaviors contained in METEOR.

5.3 METEOR dataset

Our dataset is visually summarized in Figure 5.1. Below, we present some details of the data collection process and discuss some of the salient features and characteristics of METEOR.
5.3.1 Dataset Collection
The data was collected in and around the city of Hyderabad, India, within a radius of 42 to 62 miles. Several outskirts were chosen to cover rural and unstructured roads. Our hardware capture setup consists of two wide-angle Thinkware F800 dashcams mounted on an MG Hector and a Maruti Ciaz. The camera sensor has a 2.3-megapixel resolution with a 140-degree field of view. The video is captured in full high definition with a resolution of 1920 × 1080 pixels at a frame rate of 30 frames per second. The dashcam is embedded with an accurate positioning system that stores the GPS coordinates, which were processed into world-frame coordinates. The sensor synchronizes between the camera and the GPS. Recordings from the dashcam are streamed continuously and are clipped into 1-minute video segments.

5.3.2 Dataset organization
The dataset is organized as 1250 one-minute video clips. Each clip contains static and dynamic XML files. Each static file summarizes the metadata of the entire video clip, including the behaviors, road type, scene structure, etc. Each dynamic file describes frame-level information such as bounding boxes, GPS coordinates, and agent behaviors. Our dataset can be searched using helpful filters that sort the data according to road type, traffic density, area, weather, and behaviors. We also provide many scripts to easily load the data after downloading.
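The per-clip static/dynamic XML layout described above can be consumed with a few lines of standard-library Python. The sketch below is illustrative only: the tag names (roadType, density, weather) are hypothetical, and the actual schema is defined by the released annotation files and loading scripts.

```python
import xml.etree.ElementTree as ET

def load_static_metadata(xml_text):
    """Parse one clip's static XML into a flat dict.
    Tag names here are illustrative; the released files define the real schema."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

def filter_clips(metadata_by_clip, **criteria):
    """Return IDs of clips whose static metadata matches all given criteria,
    mimicking the road-type/density/weather filters described above."""
    return [clip_id for clip_id, meta in metadata_by_clip.items()
            if all(meta.get(k) == v for k, v in criteria.items())]

# Example with a toy static file for a single clip.
example = ("<clip><roadType>rural</roadType>"
           "<density>high</density><weather>rainy</weather></clip>")
meta = {"clip_0001": load_static_metadata(example)}
print(filter_clips(meta, roadType="rural", density="high"))  # ['clip_0001']
```

The same pattern extends to the dynamic files by iterating over frame-level child elements instead of a single flat record.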
5.3.3 Annotations
We provide the following annotations in our dataset: (i) bounding boxes for every agent, (ii) agent class IDs, (iii) GPS trajectories for the ego-vehicle, (iv) environment conditions including weather, time of day, traffic density, and heterogeneity, (v) road conditions including urban, rural, and lane markings, (vi) road network features including intersections, roundabouts, and traffic signals, (vii) actions corresponding to left/right turns, U-turns, acceleration, and braking, (viii) rare and interesting behaviors (see Section 5.3.4), and (ix) the camera intrinsic matrix for depth estimation to generate trajectories of the surrounding vehicles. This set of annotations is the most diverse and extensive compared to prior datasets. A diverse and rich taxonomy of agent categories is necessary to ensure that autonomous driving systems can detect different types of agents in any given scenario. Towards that goal, datasets for autonomous driving are designed or captured to achieve two goals: (a) capture as many different types of agent categories as possible; (b) capture as many instances of each category as possible. In both these aspects, METEOR outperforms all prior datasets. We annotate 16 types of moving traffic entities, not including static obstacles, listed in Figure 5.1 along with their distribution. Note specifically that the percentages of pedestrians, motorbikes, and bicycles are higher than the percentage of passenger vehicles. This is particularly useful as the former categories are known as "vulnerable road users" (VRUs) [235], and it is important for autonomous driving systems to be able to detect them, necessitating many instances of these VRUs in any dataset.

5.3.4 Rare and Interesting Behaviors
We provide a rich collection of 17 different types of rare and interesting cases that are unique to our dataset.
They can be summarized in terms of the following groups:

5.3.4.1 Atypical Interactions
Atypical interactions correspond to pairwise interactions among traffic agents that are not often observed in regular traffic scenarios. Some examples of atypical interactions include yielding to, and cutting across, pedestrians, zigzagging through traffic, pedestrian jaywalking, overtaking, sudden lane changing, and overspeeding. We describe these in more detail below:

Figure 5.2: Annotations for rare instances: One of the unique aspects of METEOR is the availability of explicit labels for rare and interesting instances including atypical interactions, traffic violations, and diverse scenarios. These annotations can be used to benchmark new methods for object detection and multi-agent behavior prediction.

• Overtaking (OT): When an agent overtakes another agent with sudden or aggressive movement.
• Overspeeding (OS): When a vehicle over-speeds (based on speed limits) for any reason.
• Yield (Y): A pedestrian, bicycle, or any slow-moving agent tries to cross the road in front of another agent. If the latter slows down or stops, letting them cross the road, then such behavior is labeled as yield.
• Cutting (C): When pedestrians, bicycles, or any slow-moving agents trying to cross the road are interrupted by another agent. Yielding and cutting can also be re-labeled as instances of jaywalking. In a majority of these cases, one of the agents involved is a pedestrian crossing the road in the middle of traffic.
• Lane change w. lane markings (LC(m)): Agents aggressively change lanes on roads with clear lane markings.
• Lane change w/o. lane markings (LC): Agents aggressively change lanes on roads without lane markings. The above two annotations can be used to identify videos in the dataset that contain roads without lane markings for relevant applications.
• 
Zigzagging (ZM): If an agent of interest undergoes a zigzag movement through traffic, the agent's behavior is classified as zigzagging.

5.3.4.2 Traffic Violations
In addition to the above driving behaviors, we also annotate traffic agents breaking traffic rules. These are particularly unique since rule-breaking scenarios are rare.

• Running a traffic light (RB TL): Passing through an intersection even though the traffic signal is red.
• Wrong Lane (RB WL): A road may not be divided for inbound and outbound traffic by a physical barrier, making it possible for motorists to use the inbound lane for outbound traffic and vice versa. This behavior identifies all such cases.
• Wrong Turn (RB WT): When an agent makes an illegal turn (including U-turns).

Figure 5.3: We highlight the high traffic density, heterogeneity, and the richness of behavior information in METEOR. Abbreviations correspond to various behavior categories and are explained in Section 5.3.4.

5.3.4.3 Diverse Scenarios
Finally, we provide annotations for challenging scenarios that include intersections, roundabouts, traffic signals, and executing left turns, right turns, and U-turns.

5.3.5 Dataset statistics
We analyze the dataset statistics and the distribution of agents and their behaviors in terms of total count, uniqueness, and duration (in seconds). Figure 5.3 shows that METEOR is very dense and highly heterogeneous; the total number of agents in a single frame can

Table 5.2: Effect of meta features on object detection: We analyze how meta features such as traffic density, type of agents, location, time of the day, and weather play a role in 2D object detection using the DETR, Deformable DETR, YOLOv3, and CenterNet object detectors. Bold indicates the type of meta feature that is the most effective for object detection.
DETR and Deformable DETR (in parentheses)
        Density                                      Agents                        Environment                   Time                          Weather
        Low            Medium         High           Mixed          Uniform        Urban          Rural          Day            Night          Normal         Rainy
mAP     19.00 (22.70)  27.00 (38.30)  19.30 (28.10)  27.00 (38.30)  14.80 (31.30)  27.00 (38.30)  14.20 (25.70)  27.00 (38.30)  12.00 (20.60)  27.00 (38.30)  12.00 (20.90)
mAP50   33.33 (36.80)  48.40 (61.80)  32.40 (41.40)  48.40 (61.80)  31.80 (44.30)  48.40 (61.80)  23.40 (34.90)  48.40 (61.80)  22.70 (36.10)  48.40 (61.80)  21.90 (32.70)
mAP75   21.50 (22.10)  28.10 (41.50)  20.40 (31.30)  28.10 (41.50)  11.70 (37.00)  21.80 (41.50)  16.30 (28.40)  28.10 (41.50)  12.20 (20.50)  28.10 (41.50)  12.60 (22.90)
mAPS    2.60 (7.10)    1.20 (12.10)   0.20 (2.50)    1.20 (12.10)   0.30 (12.80)   1.20 (12.10)   2.00 (10.30)   1.20 (12.10)   0.10 (0.30)    1.20 (12.10)   1.80 (9.50)
mAPM    7.40 (25.20)   8.30 (22.50)   10.50 (16.90)  8.30 (22.50)   7.20 (34.30)   8.30 (22.50)   11.70 (28.10)  8.30 (22.50)   3.30 (12.50)   8.30 (22.50)   6.20 (19.90)
mAPL    25.60 (24.90)  45.90 (54.10)  24.70 (35.60)  45.90 (54.10)  40.30 (57.80)  45.90 (54.10)  26.30 (35.60)  45.90 (54.10)  16.70 (27.80)  45.90 (54.10)  15.10 (23.80)

YOLOv3 and CenterNet (in parentheses)
        Density                                      Agents                        Environment                   Time                          Weather
        Low            Medium         High           Mixed          Uniform        Urban          Rural          Day            Night          Normal         Rainy
mAP     19.20 (22.90)  30.40 (32.90)  21.10 (23.30)  30.40 (32.90)  19.10 (30.20)  30.40 (32.90)  13.80 (13.60)  30.40 (32.90)  13.30 (15.90)  30.40 (32.90)  13.40 (14.00)
mAP50   36.90 (34.80)  52.50 (55.40)  36.30 (32.50)  52.50 (55.40)  35.10 (43.40)  52.50 (55.40)  22.00 (22.70)  52.50 (55.40)  25.00 (25.70)  52.50 (55.40)  25.00 (22.50)
mAP75   16.10 (28.10)  32.30 (33.40)  23.20 (26.70)  32.30 (33.40)  19.70 (37.30)  32.30 (33.40)  15.70 (13.20)  32.30 (33.40)  13.40 (27.00)  32.30 (33.40)  13.60 (15.50)
mAPS    2.70 (8.40)    2.40 (13.10)   0.60 (2.90)    2.40 (13.10)   7.90 (19.30)   2.40 (13.10)   5.20 (5.40)    2.40 (13.10)   0.00 (0.90)    2.40 (13.10)   1.30 (10.90)
mAPM    14.10 (26.20)  13.10 (30.50)  11.70 (17.60)  13.10 (30.50)  19.10 (38.80)  13.10 (30.50)  22.50 (25.80)  13.10 (30.50)  7.50 (11.60)   13.10 (30.50)  11.60 (17.40)
mAPL    23.70 (29.50)  48.70 (44.60)  27.30 (27.90)  48.70 (44.60)  38.90 (40.00)  48.70 (44.60)  21.20 (21.40)  48.70 (44.60)  18.50 (21.70)  48.70 (44.60)  16.40 (14.30)

reach up to 40, and up to 9 unique agents can exist in a single frame. Figure 5.3 shows the distribution of behaviors across videos and the distribution of each behavior's average duration. In particular, we note that the average duration can reach up to 3 seconds, which, at 30 frames per second, corresponds to approximately 90 frames that contain visual, contextual, and semantic information that can inform behavior prediction algorithms for more accurate perception and prediction.

5.4 Experiments and Analysis
We provide the pre-trained models for object detection and behavior prediction at https://gamma.umd.edu/meteor.

5.4.1 Analyzing Object Detection in Unstructured Scenarios
Existing datasets have helped develop sophisticated and robust 2D detection methods. We use the MMDetection [239] toolbox to train the following 2D object detection models: DETR [236], Deformable DETR [237] (with iterative bounding box refinement), YOLOv3 [7] (with scale 608), CenterNet [238] (with normal convolutions), and Swin-T [240]. The models are pre-trained on the COCO dataset [241] and fine-tuned on METEOR.

Table 5.3: Training Details for Object Detection (BS: Batch size, Mom.: Momentum, WD: Weight decay, MGN: Max Gradient Norm)
Method           Backbone    BS  Opt.   LR    Mom.  WD (L2)  MGN
DETR [236]       ResNet-50   2   AdamW  1e-4  –     1e-4     0.1
Def. DETR [237]  ResNet-50   2   AdamW  2e-4  –     1e-4     0.1
YOLOv3 [7]       Darknet-53  8   SGD    1e-3  0.9   5e-4     35
CenterNet [238]  ResNet-18   16  SGD    1e-3  0.9   5e-4     35

Table 5.4: Object detection on Waymo and KITTI: We report the standard mAP for many widely used methods on autonomous driving datasets.
             DETR [236]  CenterNet  YOLOv3  Def. DETR  Swin-T
KITTI [74]   23.00       80.40      81.60   42.20      –
Waymo [223]  65.31       64.83      56.93   65.31      37.20
METEOR       8.30        12.10      14.30   15.80      32.60
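For concreteness, the DETR row of Table 5.3 could be expressed as an MMDetection-style config fragment. This is a sketch assuming MMDetection 2.x dict conventions; the checkpoint path below is a placeholder, not a released file:

```python
# Hypothetical MMDetection 2.x-style config fragment for the DETR row of
# Table 5.3: AdamW, LR 1e-4, weight decay 1e-4, batch size 2 per GPU, and
# gradient clipping at max norm 0.1, initialized from COCO-pre-trained weights.
optimizer = dict(type="AdamW", lr=1e-4, weight_decay=1e-4)
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
data = dict(samples_per_gpu=2, workers_per_gpu=2)
load_from = "checkpoints/detr_r50_coco.pth"  # placeholder COCO checkpoint path
```

The SGD-based rows (YOLOv3, CenterNet) would analogously set `momentum=0.9` and the corresponding learning rate, weight decay, and gradient-norm values from the table.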
We provide the training details in Table 5.3 and report results using the standard mAP, mAP50, mAP75, mAPS, mAPM, and mAPL metrics. We refer the reader to [242] for a primer on these metrics. In Table 5.4, we report the mAP for the 2D object detectors listed above. We observe that the most widely used 2D object detectors, which perform well on state-of-the-art autonomous driving datasets such as the Waymo Open Motion Dataset [223] and the KITTI dataset [74], do not perform well on METEOR. More specifically, the detectors achieve 37%–65% and 23%–81% mAP on the Waymo and KITTI datasets, respectively, while the same methods achieve 8%–31% mAP on the METEOR dataset. In other words, the best possible result on METEOR is 1/2 and 1/3 of the best result on the Waymo and KITTI datasets, respectively. In Table 5.5, we compare METEOR in depth with the Waymo dataset using the Swin-T method [240], which is currently one of the top-performing methods on the standard COCO 2D object detection benchmark leaderboard [241].

Table 5.5: Swin-T on Waymo and METEOR: We present a more detailed analysis of Swin-T, one of the state-of-the-art object detection approaches, on Waymo and METEOR.
             mAP    mAP50  mAP75  mAPS   mAPM   mAPL
Waymo [223]  37.20  70.60  52.00  17.20  41.80  67.20
METEOR       32.60  46.90  36.20  20.50  35.40  54.70

The Swin-T method performs 14% better on the Waymo dataset. There are two possible reasons for the performance degradation on METEOR. First, 2D detectors are typically pre-trained on MS COCO [241] and ImageNet [243], which contain only up to 9 categories of the commonly occurring traffic agents. This was not an issue for detectors on existing datasets like Waymo and KITTI, since those datasets contain a subset of those 9 classes. METEOR, on the other hand, contains 16 agent categories that are approximately equally distributed. The approximately 7 to
8 traffic agent categories that are contained in METEOR but do not appear in MS COCO are novel to these 2D object detectors and are not classified correctly. The other reason object detection deteriorates on METEOR is the challenging traffic environments it contains. More specifically, METEOR contains many challenging scenarios such as bad weather, nighttime traffic, rural areas, and high-density traffic (see Figure 5.2). We analyze the effect of meta features such as traffic conditions (density and heterogeneity), road conditions, weather, and time of day on 2D object detection and present this analysis in Table 5.2. For this analysis, we form separate test sets corresponding to each label in a meta feature (for example, we have two test sets for day and night). Most datasets contain videos of medium-density traffic. In Table 5.2, we see that the performance of DETR, Deformable DETR, YOLOv3, and CenterNet suffers as the traffic density increases from medium to high. Similar observations can be made for other factors: object detection is less effective for homogeneous traffic, in rural areas, at nighttime, and in rainy weather. In most datasets, the number of annotated data samples with these adverse and challenging factors is a fraction of the entire dataset, which partly explains why 2D detectors are more successful on those datasets. The analysis in this section empirically validates the difficulty that the heavy-tail problem poses to perception tasks in autonomous driving.

5.4.2 Multi-Agent Behavior Recognition
Multi-agent behavior recognition (MABR) is the task of first localizing agents in a video and then classifying their behaviors. This task has drawn attention in recent years and plays an important role in autonomous driving. Unlike object detection, which can be accomplished solely by observing visual appearances, MABR reasons about the actors'
interactions with the surrounding context, including environments, other people, and objects.

Dataset Preparation: The METEOR dataset is ideal for spatio-temporal MABR due to the availability of bounding box annotations and their corresponding behavior labels for more than 1231 video clips, each one minute in duration, and over 2 million annotated frames. We use 1000 video clips for training and 231 video clips for testing. Following the guidelines of standard benchmarks, we evaluate 16 behavior classes with mean Average Precision (mAP) as the metric, using a frame-level IoU threshold of 0.5.

Framework: We use the Actor-Context-Actor Relation Network (ACAR-Net) [245], which builds upon a novel high-order relation reasoning operator and an actor-context feature bank for indirect relation reasoning for spatio-temporal action localization. This framework is composed of an object detector, a backbone network, and the ACAR components.

Table 5.6: ACAR-Net on AVA and METEOR: We applied the current state-of-the-art multi-agent action recognition approach on AVA to our METEOR dataset. (PT: pre-train, BS: batch size, Opt.: optimizer, LR: learning rate, WD: weight decay, FR(RX-101): Faster R-CNN (ResNeXt-101), Kin.-700: Kinetics-700, CR(Swin-T): Cascade R-CNN (Swin-T))
Dataset    Detector    PT        BS  Opt.  LR     WD    mAP
AVA [244]  FR(RX-101)  Kin.-700  32  SGD   0.008  1e-7  30.0
METEOR     CR(Swin-T)  Kin.-700  32  SGD   0.008  1e-7  6.10

Object Detector: For the object detection step, we use the Swin-T detector, generated by combining a Cascade R-CNN [246] with a Swin-T [240] backbone. The model is pre-trained on ImageNet and MS COCO, and fine-tuned on METEOR using the same settings as Swin-T [240]: multi-scale training [247] (resizing the input with the shorter side between 480 and 800 and the longer side at most 1333), the AdamW [248] optimizer (initial learning rate of 1e-4, weight decay of 0.05, and batch size of 16), and a 1× schedule (12 epochs).
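The frame-level IoU-0.5 matching criterion used in the evaluation above can be made concrete with a short sketch. This is an illustrative re-implementation of the standard greedy matching rule, not the exact benchmark code; the behavior labels ("OT", "Y") reuse the abbreviations from Section 5.3.4:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_frame(preds, gts, thresh=0.5):
    """Greedily match predictions (highest score first) to unmatched
    ground-truth boxes of the same behavior class at the IoU threshold;
    returns a True (TP) / False (FP) flag per prediction."""
    used, flags = set(), []
    for score, cls, box in sorted(preds, reverse=True):
        best, best_iou = None, thresh
        for i, (gcls, gbox) in enumerate(gts):
            if i in used or gcls != cls:
                continue
            o = iou(box, gbox)
            if o >= best_iou:
                best, best_iou = i, o
        if best is None:
            flags.append(False)
        else:
            used.add(best)
            flags.append(True)
    return flags

# One frame: a correct overtaking detection and a spurious yield detection.
gts = [("OT", (10, 10, 50, 50))]
preds = [(0.9, "OT", (12, 12, 52, 52)), (0.6, "Y", (100, 100, 140, 140))]
print(match_frame(preds, gts))  # [True, False]
```

Accumulating these per-frame TP/FP flags over the test set, sorted by score, yields the precision-recall curve from which the per-class AP (and hence mAP) is computed.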
Backbone Network: Following ACAR-Net [245], we use SlowFast networks [249] as the backbone in the localization framework and double the spatial resolution of res5. We conduct experiments using a SlowFast R-101 8 × 8 model, pre-trained on the Kinetics-700 dataset [250], without non-local blocks. The inputs are 64-frame clips, from which we sample T = 8 frames with a temporal stride τ = 8 for the slow pathway, and αT (α = 4) frames for the fast pathway.

Training Settings: We train ACAR-Net using synchronous SGD with a batch size of 16. For the first 3 epochs, we use a base learning rate of 0.008, which is then decreased by a factor of 10 at epochs 4 and 5. We use a weight decay of 1e-7 and Nesterov momentum of 0.9. We use both ground-truth boxes and predicted object boxes for training. For inference, we scale the shorter side of the input frames to 384 pixels and use detected object boxes with scores greater than 0.85 for final behavior classification.

Results: We compare METEOR with the AVA dataset [244], as the latter is the state-of-the-art benchmark in multi-agent action recognition. In Table 5.6, we show that the current state-of-the-art approach, ACAR, achieves 30.0% mAP on AVA but yields 6.1% mAP on METEOR. There are several reasons why ACAR performs better on AVA. AVA focuses exclusively on one target category, humans, which most state-of-the-art object detectors can detect with ease. Furthermore, the videos in the AVA dataset consist of high-definition movies, in which the agents (actors) are clearly visible, the background is simple, and the movements performed are exaggerated and easier to identify. METEOR, on the other hand, consists of 16 different categories of agents, from vehicles to animals, most of which are novel for most detectors and therefore hard to detect. Moreover, the movements of the agents on the road are very fast, making them hard to capture.
Finally, different agents have different motion patterns; for example, pedestrians move differently than vehicles, and buses move differently than motorbikes. All of these factors collectively contribute to the complexity of MABR in dense, heterogeneous, and unstructured traffic scenarios. Our experiments and analysis show that there is much room for improvement, and our hope is that METEOR provides the research community the resources it needs to tackle this important problem.

5.5 Conclusion
We present a new dataset, METEOR, for autonomous driving applications in dense, heterogeneous, and unstructured traffic scenarios. It consists of more than 1000 one-minute video clips, over 2 million annotated frames with 2D and GPS trajectories for 16 unique agent categories, and more than 13 million bounding boxes for traffic agents. We found that current models for object detection and multi-agent behavior prediction fail on the METEOR dataset. METEOR marks the first step towards the development of more sophisticated and robust perception models for dense, heterogeneous, and unstructured scenarios. Our dataset has some limitations. While METEOR contains bounding box information for the surrounding agents, we currently do not provide trajectory information from a fixed reference frame. One would have to use depth estimation techniques to extract such trajectories. Furthermore, our dataset does not contain HD maps and point cloud data, which are used in many applications. For future work, we hope that our dataset can benefit the design and evaluation of new motion forecasting and behavior prediction algorithms in dense and heterogeneous traffic. Finally, we hope to include semantic segmentation capability as part of METEOR by providing pixel labels for each object.

Chapter 6: Conclusion
This dissertation addressed many key problems in autonomous driving towards handling dense, heterogeneous, and unstructured traffic environments.
We developed new techniques to perceive, predict, and plan among human drivers in traffic that is significantly denser in terms of the number of traffic agents, more heterogeneous in terms of the size and dynamic constraints of traffic agents, and where many drivers do not follow the traffic rules. In this thesis, we present work along three themes: perception, driver behavior modeling, and planning. Our novel contributions include:

1. Improved tracking and trajectory prediction algorithms for dense and heterogeneous traffic using a combination of computer vision and deep learning techniques.
2. A novel behavior modeling approach using graph theory for characterizing human drivers as aggressive or conservative from their trajectories.
3. Behavior-driven planning and navigation algorithms in mixed (human driver and AV) and unstructured traffic environments using game theory and risk-aware control.

Additionally, we have released a new traffic dataset, METEOR, which captures rare and interesting multi-agent driving behaviors in India. These behaviors are grouped into traffic violations, atypical interactions, and diverse scenarios. We evaluate our perception work on tracking and trajectory prediction using standard autonomous driving datasets such as the Waymo Open Motion, Argoverse, and nuScenes datasets, as well as public leaderboards, where our tracking approach achieved rank 1 among over 100 methods. We apply human driver behavior modeling in planning and navigation at unsignaled intersections and in highway scenarios using state-of-the-art traffic simulators, and show that our approach yields fewer collisions and deadlocks compared to methods based on deep reinforcement learning. We conclude the presentation with a discussion of future work.

Bibliography
[1] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. arXiv preprint arXiv:1703.07402, March 2017.
[2] Dirk Helbing and Peter Molnar.
Social force model for pedestrian dynamics. Physical review E, 51(5):4282, 1995. [3] Jur Van Den Berg, Stephen J Guy, Ming Lin, and Dinesh Manocha. Reciprocal n-body collision avoidance. In Robotics research, pages 3?19. Springer, 2011. [4] Chen Long, Ai Haizhou, Zhuang Zijie, and Shang Chong. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In IEEE International Conference on Multimedia and Expo (ICME), 2018. [5] Yu Xiang, Alexandre Alahi, and Silvio Savarese. Learning to track: Online multi-object tracking by decision making. In Proceedings of the IEEE international conference on computer vision, pages 4705?4713, 2015. [6] K. He, G. Gkioxari, P. Dolla?r, and R. Girshick. Mask R-CNN. ArXiv e-prints, March 2017. [7] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. [8] Xin Li, Xiaowen Ying, and Mooi Choo Chuah. Grip: Graph-based interaction-aware trajectory prediction. arXiv preprint arXiv:1907.07792, 2019. [9] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks. ArXiv e-prints, 2018. [10] Fridulv Sagberg, Selpi, Giulio Francesco Bianchi Piccinini, and Johan Engstro?m. A review of research on driving styles and road safety. Human factors, 57(7):1248?1275, 2015. [11] Ming-Fang Chang, John W Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3d tracking and forecasting with rich maps. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 168 [12] Rohan Chandra, Uttaran Bhattacharya, Aniket Bera, and Dinesh Manocha. Densepeds: Pedestrian tracking in dense crowds using front-rvo and sparse features. arXiv preprint arXiv:1906.10313, 2019. [13] Mingyu Wang, Negar Mehr, Adrien Gaidon, and Mac Schwager. Game-theoretic planning for risk-aware interactive agents. 
In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6998?7005. IEEE, 2020. [14] Rohan Chandra, Uttaran Bhattacharya, Trisha Mittal, Aniket Bera, and Dinesh Manocha. Cmetric: A driving behavior measure using centrality functions. arXiv preprint arXiv:2003.04424, 2020. [15] Rohan Chandra, Uttaran Bhattacharya, Trisha Mittal, Xiaoyu Li, Aniket Bera, and Dinesh Manocha. Graphrqi: Classifying driver behaviors using graph spectrums. arXiv preprint arXiv:1910.00049, 2019. [16] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886?893. IEEE, 2005. [17] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91?110, 2004. [18] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. ArXiv e-prints, November 2013. [19] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440?1448, 2015. [20] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. ArXiv e-prints, June 2015. [21] Ruilong Li, Xin Dong, Zixi Cai, Dingcheng Yang, Haozhi Huang, Song-Hai Zhang, Paul Rosin, and Shi-Min Hu. Pose2seg: Human instance segmentation without detection. arXiv preprint arXiv:1803.10683, 2018. [22] Jinshi Cui, Hongbin Zha, Huijing Zhao, and Ryosuke Shibasaki. Tracking multiple people using laser and vision. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2116?2121. IEEE, 2005. [23] Louis Kratz and Ko Nishino. Tracking pedestrians using local spatio-temporal motion patterns in extremely crowded scenes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (99):1?1, 2011. 
[24] Allison Bruce and Geoffrey Gordon. Better motion prediction for people-tracking. In Proc. of the International Conference on Robotics and Automation (ICRA), 2004. [25] Haifeng Gong, Jack Sim, Maxim Likhachev, and Jianbo Shi. Multi-hypothesis motion planning for visual object tracking. In 2011 International Conference on Computer Vision, pages 619?626. IEEE, 2011. 169 [26] Lin Liao, Dieter Fox, Jeffrey Hightower, Henry Kautz, and Dirk Schulz. Voronoi tracking: Location estimation using sparse and noisy sensor data. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2003. [27] Ramin Mehran, Alexis Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition,CVPR, pages 935?942. IEEE, 2009. [28] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You?ll never walk alone: Modeling social behavior for multi-target tracking. In IEEE International Conference on Computer Vision (ICCV), pages 261?268. IEEE, 2009. [29] Aniket Bera and Dinesh Manocha. REACH: Realtime crowd tracking using a hybrid motion model. Proc. of the International Conference on Robotics and Automation (ICRA), 2015. [30] Aniket Bera, Sujeong Kim, Tanmay Randhavane, Srihari Pratapa, and Dinesh Manocha. Glmp-realtime pedestrian path prediction using global and local movement patterns. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5528? 5535. IEEE, 2016. [31] Frank Dellaert and Chuck Thorpe. Robust car tracking using kalman filtering and bayesian templates. In Conference on intelligent transportation systems, volume 1, 1997. [32] Daniel Streller, K Furstenberg, and Klaus Dietmayer. Vehicle and object models for robust tracking in traffic scenes using laser range images. In Intelligent Transportation Systems, 2002. Proceedings. The IEEE 5th International Conference on, pages 118?123. IEEE, 2002. 
[33] Anna Petrovskaya and Sebastian Thrun. Model based vehicle detection and tracking for autonomous urban driving. Autonomous Robots, 26(2-3):123?139, 2009. [34] Andreas Ess, Konrad Schindler, Bastian Leibe, and Luc Van Gool. Object detection and tracking for autonomous navigation in dynamic environments. The International Journal of Robotics Research, 29(14):1707?1725, 2010. [35] Akshay Rangesh and Mohan M Trivedi. No blind spots: Full-surround multi- object tracking for autonomous vehicles using cameras & lidars. arXiv preprint arXiv:1802.08755, 2018. [36] Michael Darms, Paul Rybski, and Chris Urmson. Classification and tracking of dynamic objects with multiple sensors for autonomous driving in urban environments. In Intelligent Vehicles Symposium, 2008 IEEE, pages 1197?1202. IEEE, 2008. [37] Julien Moras, Ve?ronique Cherfaoui, and Philippe Bonnifait. Credibilist occupancy grids for vehicle perception in dynamic environments. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 84?89. IEEE, 2011. 170 [38] Nicolai Wojke and Marcel Ha?selich. Moving vehicle detection and tracking in unstructured environments. In IEEE International Conference on Robotics and Automation (ICRA), pages 3082?3087. IEEE, 2012. [39] Benjamin Coifman, David Beymer, Philip McLauchlan, and Jitendra Malik. A real- time computer vision system for vehicle tracking and traffic surveillance. Transportation Research Part C: Emerging Technologies, 6(4):271?288, 1998. [40] Anton Milan, Laura Leal-Taixe?, Ian Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, March 2016. [41] Anton Milan, Stefan Roth, and Konrad Schindler. Continuous energy minimization for multitarget tracking. IEEE transactions on pattern analysis and machine intelligence, 36(1):58?72, 2013. [42] Chanho Kim, Fuxin Li, Arridhana Ciptadi, and James M Rehg. Multiple hypothesis tracking revisited. 
In Proceedings of the IEEE International Conference on Computer Vision, pages 4696?4704, 2015. [43] Hao Sheng, Li Hao, Jiahui Chen, Yang Zhang, and Wei Ke. Robust local effective matching model for multi-target tracking. In Pacific Rim Conference on Multimedia, pages 233?243. Springer, 2017. [44] Aniket Bera and Dinesh Manocha. Realtime multilevel crowd tracking using reciprocal velocity obstacles. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 4164?4169. IEEE, 2014. [45] Rohan Chandra, Uttaran Bhattacharya, Tanmay Randhavane, Aniket Bera, and Dinesh Manocha. Roadtrack: Realtime tracking of road agents in dense and heterogeneous environments. arXiv, pages arXiv?1906, 2019. [46] Aniket Bera, Nico Galoppo, Dillon Sharlet, Adam Lake, and Dinesh Manocha. Adapt: real-time adaptive pedestrian tracking for crowded scenes. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 1801?1808. IEEE, 2014. [47] S. Pellegrini, A. Ess, K. Schindler, and L. van Gool. You?ll never walk alone: Modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference on Computer Vision, pages 261?268, Sept 2009. [48] Kota Yamaguchi, Alexander C Berg, Luis E Ortiz, and Tamara L Berg. Who are you with and where are you going? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1345?1352. IEEE, 2011. [49] Ioannis Karamouzas, Peter Heil, Pascal Van Beek, and Mark H Overmars. A predictive collision avoidance model for pedestrian simulation. In International Workshop on Motion in Games, pages 41?52. Springer, 2009. 171 [50] Gianluca Antonini, Santiago Venegas Martinez, Michel Bierlaire, and Jean Philippe Thiran. Behavioral priors for detection and tracking of pedestrians in video sequences. International Journal of Computer Vision, 69(2):159?180, 2006. [51] Shu-Yun Chung and Han-Pang Huang. A mobile robot that understands pedestrian spatial behaviors. 
In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 5861–5866. IEEE, 2010.
[52] Harold W Kuhn. The Hungarian method for the assignment problem. In 50 Years of Integer Programming 1958-2008, pages 29–47. Springer, 2010.
[53] Michael Levandowsky and David Winter. Distance between sets. Nature, 234(5323):34, 1971.
[54] Edward Twitchell Hall. The hidden dimension, volume 609. Garden City, NY: Doubleday, 1966.
[55] Satoru Satake, Takayuki Kanda, Dylan F Glas, Michita Imai, Hiroshi Ishiguro, and Norihiro Hagita. How to approach humans?: Strategies for social robots to initiate interaction. In Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction, pages 109–116. ACM, 2009.
[56] Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In Proceedings of the IEEE International Conference on Computer Vision, pages 300–311, 2017.
[57] Kuan Fang, Yu Xiang, Xiaocheng Li, and Silvio Savarese. Recurrent autoregressive networks for online multi-object tracking. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 466–475. IEEE, 2018.
[58] Qing Zhang, Mengru Zhang, Mengdi Wang, Wanchen Sui, Chen Meng, Jun Yang, Weidan Kong, Xiaoyuan Cui, and Wei Lin. Efficient deep learning inference based on model compression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
[59] Meiqi Wang, Zhisheng Wang, Jinming Lu, Jun Lin, and Zhongfeng Wang. E-LSTM: An efficient hardware architecture for long short-term memory. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019.
[60] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[61] Long Chen, Haizhou Ai, Chong Shang, Zijie Zhuang, and Bo Bai.
Online multi-object tracking with convolutional neural networks. In Image Processing (ICIP), 2017 IEEE International Conference on, pages 645–649. IEEE, 2017.
[62] Min Yang, Yuwei Wu, and Yunde Jia. A hybrid data association framework for robust online multi-object tracking. arXiv preprint arXiv:1703.10764, 2017.
[63] Qi Chu, Wanli Ouyang, Hongsheng Li, Xiaogang Wang, Bin Liu, and Nenghai Yu. Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In IEEE International Conference on Computer Vision (ICCV), pages 4846–4855, 2017.
[64] Ricardo Sanchez-Matilla, Fabio Poiesi, and Andrea Cavallaro. Online multi-target tracking with strong and weak detections. In European Conference on Computer Vision, pages 84–99. Springer, 2016.
[65] Rohan Chandra, Uttaran Bhattacharya, Aniket Bera, and Dinesh Manocha. Traphic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8483–8492, June 2019.
[66] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
[67] Nachiket Deo and Mohan M Trivedi. Convolutional social pooling for vehicle trajectory prediction. arXiv preprint arXiv:1805.06771, 2018.
[68] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[69] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 336–345, 2017.
[70] Fang-Chieh Chou, Tsung-Han Lin, Henggang Cui, Vladan Radosavljevic, Thi Nguyen, Tzu-Kuo Huang, Matthew Niedoba, Jeff Schneider, and Nemanja Djuric. Predicting motion of vulnerable road users using high-definition maps and efficient convnets. arXiv preprint arXiv:1906.08469, 2019.
[71] N. Djuric, V. Radosavljevic, H. Cui, T. Nguyen, F.-C. Chou, T.-H. Lin, and J. Schneider. Short-term motion prediction of traffic actors for autonomous driving using deep convolutional networks. ArXiv e-prints, August 2018.
[72] Yuexin Ma, Xinge Zhu, Sibo Zhang, Ruigang Yang, Wenping Wang, and Dinesh Manocha. Trafficpredict: Trajectory prediction for heterogeneous traffic-agents. In AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 6120–6127, 2019.
[73] U.S. Federal Highway Administration. U.S. Highway 101 and I-80 dataset, 2005.
[74] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), volume 32, pages 1231–1237. Sage Publications Sage UK: London, England, 2012.
[75] Wilko Schwarting, Alyssa Pierson, Javier Alonso-Mora, Sertac Karaman, and Daniela Rus. Social behavior for autonomous vehicles. Proceedings of the National Academy of Sciences, 116(50):24972–24978, 2019.
[76] Zhang Wei-hua. Selected model and sensitivity analysis of aggressive driving behavior. Volume 25, pages 106–112. Xi'an Highway University, 2012.
[77] Murray R Barrick and Michael K Mount. The big five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44(1):1–26, 1991.
[78] Barbara Krahé and Ilka Fenske. Predicting aggressive driving behavior: The role of macho personality, age, and power of car. Aggressive Behavior: Official Journal of the International Society for Research on Aggression, 28(1):21–29, 2002.
[79] Jian Rong, Kejun Mao, and Jianming Ma.
Effects of individual differences on driving behavior and traffic flow characteristics. Transportation Research Record, 2248(1):1–9, 2011.
[80] Eric R Dahlen, Bryan D Edwards, Travis Tubré, Michael J Zyphur, and Christopher R Warren. Taking a look behind the wheel: An investigation into the personality predictors of aggressive driving. Accident Analysis & Prevention, 45:1–9, 2012.
[81] Kenneth H. Beck, Bina Ali, and Stacey B Daughters. Distress tolerance as a predictor of risky and aggressive driving. Traffic Injury Prevention, 15(4):349–54, 2014.
[82] A Hamish Jamson, Natasha Merat, Oliver MJ Carsten, and Frank CH Lai. Behavioural changes in drivers experiencing highly-automated vehicle control in varying traffic conditions. Transportation Research Part C: Emerging Technologies, 30:116–125, 2013.
[83] Pirkko Rämä. Effects of weather-controlled variable speed limits and warning signs on driver behavior. Transportation Research Record, 1689(1):53–59, 1999.
[84] Jiangpeng Dai, Jin Teng, Xiaole Bai, Zhaohui Shen, and Dong Xuan. Mobile phone based drunk driving detection. In 2010 4th International Conference on Pervasive Computing Technologies for Healthcare, 2010.
[85] Mohammad Saifuzzaman, Md Mazharul Haque, Zuduo Zheng, and Simon Washington. Impact of mobile phone use on car-following behaviour of young drivers. Accident Analysis & Prevention, 82:10–19, 2015.
[86] Ahmad Aljaafreh, Nabeel Alshabatat, and Munaf S. Najim Al-Din. Driving style recognition using fuzzy logic. 2012 IEEE International Conference on Vehicular Electronics and Safety (ICVES 2012), pages 460–463, 2012.
[87] Yi Lu Murphey, Richard Milton, and Leonidas Kiliaris. Driver's style classification using jerk analysis. 2009 IEEE Workshop on Computational Intelligence in Vehicles and Vehicular Systems, pages 23–28, 2009.
[88] Ishak Mohamad, Mohd. Alauddin Mohd. Ali, and Mahamod Ismail. Abnormal driving detection using real time global positioning system data.
Proceeding of the 2011 IEEE International Conference on Space Science and Communication (IconSpace), pages 1–6, 2011.
[89] Ernest Cheung, Aniket Bera, Emily Kubin, Kurt Gray, and Dinesh Manocha. Identifying driver behaviors using trajectory features for vehicle navigation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3445–3452. IEEE, 2018.
[90] Alex Zyner, Stewart Worrall, and Eduardo Nebot. A recurrent neural network solution for predicting driver intention at unsignalized intersections. IEEE Robotics and Automation Letters, 3(3):1759–1764, 2018.
[91] Alex Zyner, Stewart Worrall, James Ward, and Eduardo Nebot. Long short term memory for driver intent prediction. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 1484–1489. IEEE, 2017.
[92] Geqi Qi, Yiman Du, Jianping Wu, and Ming Xu. Leveraging longitudinal driving behaviour data with data mining techniques for driving style analysis. IET Intelligent Transport Systems, 9(8):792–801, 2015.
[93] Bin Shi, Li Xu, Jie Hu, Yun Tang, Hong Jiang, Wuqiang Meng, and Hui Liu. Evaluating driving styles by normalizing driving behavior based on personalized driver modeling. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45:1502–1508, 2015.
[94] Wenshuo Wang, Junqiang Xi, Alexandre Chong, and Lin Li. Driving style classification using a semisupervised support vector machine. IEEE Transactions on Human-Machine Systems, 47:650–660, 2017.
[95] Manoel Castro-Neto, Young-Seon Jeong, Myong-Kee Jeong, and Lee D. Han. Online-SVR for short-term traffic flow prediction under typical and atypical traffic conditions. Expert Systems with Applications, 36(3, Part 2):6164–6173, 2009.
[96] Hongbin Yin, S.C. Wong, Jianmin Xu, and C.K. Wong. Urban traffic flow prediction using a fuzzy-neural approach. Transportation Research Part C: Emerging Technologies, 10(2):85–98, 2002.
[97] Y. Lv, Y. Duan, W. Kang, Z. Li, and F. Wang.
Traffic flow prediction with big data: A deep learning approach. IEEE Transactions on Intelligent Transportation Systems, 16(2):865–873, April 2015.
[98] Alireza Ermagun and David Levinson. Spatiotemporal traffic forecasting: Review and proposed directions. Transport Reviews, 38(6):786–814, 2018.
[99] I. Lana, J. Del Ser, M. Velez, and E. I. Vlahogianni. Road traffic forecasting: Recent advances and new challenges. IEEE Intelligent Transportation Systems Magazine, 10(2):93–109, Summer 2018.
[100] X. Cheng, R. Zhang, J. Zhou, and W. Xu. Deeptransport: Learning spatial-temporal dependency for traffic condition forecasting. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2018.
[101] Chaoyun Zhang and Paul Patras. Long-term mobile traffic forecasting using deep spatio-temporal neural networks. In Proceedings of the Eighteenth ACM International Symposium on Mobile Ad Hoc Networking and Computing, Mobihoc '18, pages 231–240, New York, NY, USA, 2018. ACM.
[102] Rohan Chandra, Uttaran Bhattacharya, Christian Roncal, Aniket Bera, and Dinesh Manocha. Robusttp: End-to-end trajectory prediction for heterogeneous road-agents in dense traffic with noisy sensor inputs. In ACM Computer Science in Cars Symposium, pages 1–9, 2019.
[103] Sujeong Kim, Stephen J Guy, Wenxi Liu, David Wilkie, Rynson WH Lau, Ming C Lin, and Dinesh Manocha. BRVO: Predicting pedestrian trajectories using velocity-space reasoning. The International Journal of Robotics Research, 34(2):201–217, 2015.
[104] Rohan Chandra, Tianrui Guan, Srujan Panuganti, Trisha Mittal, Uttaran Bhattacharya, Aniket Bera, and Dinesh Manocha. Forecasting trajectory and behavior of road-agents using spectral clustering in graph-LSTMs. IEEE Robotics and Automation Letters, 2020.
[105] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[106] James W Demmel. Applied numerical linear algebra, volume 56. SIAM, 1997.
[107] Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using tensor-train RNNs. arXiv preprint arXiv:1711.00073, 2017.
[108] Braxton Osting, Chris D White, and Édouard Oudet. Minimal Dirichlet energy partitions for graphs. SIAM Journal on Scientific Computing, 2014.
[109] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
[110] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020.
[111] Joel Janai, Fatma Güney, Aseem Behl, Andreas Geiger, et al. Computer vision for autonomous vehicles: Problems, datasets and state of the art. Foundations and Trends in Computer Graphics and Vision, 12(1-3):1–308, 2020.
[112] Tianrui Guan, Jun Wang, Shiyi Lan, Rohan Chandra, Zuxuan Wu, Larry Davis, and Dinesh Manocha. M3detr: Multi-representation, multi-scale, mutual-relation 3d object detection with transformers. arXiv preprint arXiv:2104.11896, 2021.
[113] Rohan Chandra and Dinesh Manocha. Gameplan: Game-theoretic multi-agent planning with human drivers at intersections, roundabouts, and merging. arXiv preprint arXiv:2109.01896, 2021.
[114] Angelos Mavrogiannis, Rohan Chandra, and Dinesh Manocha. B-gap: Behavior-guided action prediction for autonomous navigation. arXiv preprint arXiv:2011.03748, 2020.
[115] Rohan Chandra, Mridul Mahajan, Rahul Kala, Rishitha Palugulla, Chandrababu Naidu, Alok Jain, and Dinesh Manocha. Meteor: A massive dense & heterogeneous behavior dataset for autonomous driving. arXiv preprint arXiv:2109.07648, 2021.
[116] Dev Seth and Mary L Cummings. Traffic efficiency and safety impacts of autonomous vehicle aggressiveness. Simulation, 19:20, 2019.
[117] Xiaowei Shi, Zhen Wang, Xiaopeng Li, and Mingyang Pei. The effect of ride experience on changing opinions toward autonomous vehicle safety. Communications in Transportation Research, 1:100003, 2021.
[118] Dirty Tesla. 25 miles of full self driving – Tesla challenge 2 – Autopilot. https://www.youtube.com/watch?v=Rm8aPR0aMDE, 2019.
[119] Neil C Rabinowitz, Frank Perbet, H Francis Song, Chiyuan Zhang, SM Eslami, and Matthew Botvinick. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.
[120] Rohan Chandra, Aniket Bera, and Dinesh Manocha. Stylepredict: Machine theory of mind for human driver behavior from trajectories. arXiv preprint arXiv:2011.04816, 2020.
[121] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
[122] Wilko Schwarting, Javier Alonso-Mora, and Daniela Rus. Planning and decision-making for autonomous vehicles. Annual Review of Control, Robotics, and Autonomous Systems, 2018.
[123] Alex Davies. Google's self-driving car caused its first crash, 2016.
[124] Francisco Aparecido Rodrigues. Network centrality: An introduction. A Mathematical Modeling Approach from Nonlinear Dynamics to Complex Systems, page 177, 2019.
[125] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[126] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[127] Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. Graph convolutional networks: A comprehensive review. Computational Social Networks, 6(1):11, 2019.
[128] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering, 2020.
[129] Minju Park, Kitae Jang, Jinwoo Lee, and Hwasoo Yeo. Logistic regression model for discretionary lane changing under congested traffic. Transportmetrica A: Transport Science, 11(4):333–344, 2015.
[130] Corinna Cortes and Vladimir Vapnik. Support-vector networks.
Machine Learning, 20(3):273–297, 1995.
[131] Dorsa Sadigh, Shankar Sastry, Sanjit A Seshia, and Anca D Dragan. Planning for autonomous cars that leverage effects on human actions. In Robotics: Science and Systems, 2016.
[132] Dave Ferguson, Thomas M Howard, and Maxim Likhachev. Motion planning in urban environments: Part II. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1070–1076. IEEE, 2008.
[133] Steven M LaValle and James J Kuffner Jr. Randomized kinodynamic planning. The International Journal of Robotics Research, 20(5):378–400, 2001.
[134] Alexander Liniger, Alexander Domahidi, and Manfred Morari. Optimization-based autonomous racing of 1:43 scale RC cars. Optimal Control Applications and Methods, 36(5):628–647, 2015.
[135] Tingxiang Fan, Xinjing Cheng, Jia Pan, Dinesh Manocha, and Ruigang Yang. Crowdmove: Autonomous mapless navigation in crowded scenarios. arXiv preprint arXiv:1807.07870, 2018.
[136] Tingxiang Fan, Pinxin Long, Wenxi Liu, and Jia Pan. Fully distributed multi-robot collision avoidance via deep reinforcement learning for safe and efficient navigation in complex scenarios. arXiv preprint arXiv:1808.03841, 2018.
[137] Michael Everett, Yu Fan Chen, and Jonathan P How. Motion planning among dynamic, decision-making agents with deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3052–3059. IEEE, 2018.
[138] Chengxi Li, Yue Meng, Stanley H Chan, and Yi-Ting Chen. Learning 3d-aware egocentric spatial-temporal interaction via graph convolutional networks. arXiv preprint arXiv:1909.09272, 2019.
[139] Guillaume Bresson, Zayed Alsayed, Li Yu, and Sébastien Glaser. Simultaneous localization and mapping: A survey of current trends in autonomous driving. IEEE Transactions on Intelligent Vehicles, 2(3):194–220, 2017.
[140] NVIDIA Corp. Tesla unveils top AV training supercomputer powered by NVIDIA A100 GPUs.
https://blogs.nvidia.com/blog/2021/06/22/tesla-av-training-supercomputer-nvidiaa100-gpus/, 2021.
[141] Sylvia A Morelli, Desmond C Ong, Rucha Makati, Matthew O Jackson, and Jamil Zaki. Empathy and well-being correlate with centrality in different social networks. Proceedings of the National Academy of Sciences, 114(37):9843–9847, 2017.
[142] Glenn Lawyer. Understanding the influence of all nodes in a network. Scientific Reports, 5(1):1–9, 2015.
[143] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[144] Mile Šikić, Alen Lančić, Nino Antulov-Fantulin, and Hrvoje Štefančić. Epidemic centrality — is there an underestimated epidemic impact of network peripheral nodes? The European Physical Journal B, 86(10):440, 2013.
[145] Dinesh Manocha. Algebraic and numeric techniques in modeling and robotics. University of California at Berkeley, 1992.
[146] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1019–1028. JMLR.org, 2017.
[147] Kazi Iftekhar Ahmed. Modeling drivers' acceleration and lane changing behavior. PhD thesis, MIT, 1999.
[148] Peng Wang, Xinyu Huang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The apolloscape open dataset for autonomous driving and its application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[149] Edouard Leurent and Jean Mercat. Social attention for autonomous decision-making in dense traffic. arXiv preprint arXiv:1911.12250, 2019.
[150] Philip Polack, Florent Altché, Brigitte d'Andréa-Novel, and Arnaud de La Fortelle. The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles? In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 812–818. IEEE, 2017.
[151] Martin Treiber, Ansgar Hennecke, and Dirk Helbing. Congested traffic states in empirical observations and microscopic simulations. Physical Review E, 62(2):1805, 2000.
[152] Arne Kesting, Martin Treiber, and Dirk Helbing. General lane-changing model MOBIL for car-following models. Transportation Research Record, 1999(1):86–94, 2007.
[153] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017.
[154] Pablo Alvarez Lopez, Michael Behrisch, Laura Bieker-Walz, Jakob Erdmann, Yun-Pang Flötteröd, Robert Hilbrich, Leonhard Lücken, Johannes Rummel, Peter Wagner, and Evamarie Wießner. Microscopic traffic simulation using SUMO. In The 21st IEEE International Conference on Intelligent Transportation Systems. IEEE, 2018.
[155] Jundong Li, Harsh Dani, Xia Hu, Jiliang Tang, Yi Chang, and Huan Liu. Attributed network embedding for learning in a dynamic environment. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 387–396. ACM, 2017.
[156] Matt W Gardner and SR Dorling. Artificial neural networks (the multilayer perceptron) — a review of applications in the atmospheric sciences. Atmospheric Environment, 32(14-15):2627–2636, 1998.
[157] James W Demmel. Applied numerical linear algebra, volume 56. SIAM, 1997.
[158] William W Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221–239, 1989.
[159] Matthew Brand. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1):20–30, 2006.
[160] Ziwei Zhang, Peng Cui, Jian Pei, Xiao Wang, and Wenwu Zhu. Timers: Error-bounded SVD restart on dynamic networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[161] Jonathan Richard Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
[162] Virginia Vassilevska Williams. Multiplying matrices in O(n^2.373) time. 2014.
[163] Meha Kaushik, K Madhava Krishna, et al. Parameter sharing reinforcement learning architecture for multi agent driving behaviors. arXiv preprint arXiv:1811.07214, 2018.
[164] Meha Kaushik, Vignesh Prasad, K Madhava Krishna, and Balaraman Ravindran. Overtaking maneuvers in simulated highway driving using deep reinforcement learning. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1885–1890. IEEE, 2018.
[165] Haoran Song, Wenchao Ding, Yuxuan Chen, Shaojie Shen, Michael Yu Wang, and Qifeng Chen. PiP: Planning-informed trajectory prediction for autonomous driving. Lecture Notes in Computer Science, pages 598–614, 2020.
[166] Sam Anthony. Self-driving cars still can't mimic the most natural human behavior. https://qz.com/1064004/self-driving-cars-still-cant-mimic-the-most-natural-human-behavior/, 2017.
[167] Erik Vinkhuyzen and Melissa Cefkin. Developing socially acceptable autonomous vehicles. In Ethnographic Praxis in Industry Conference Proceedings, volume 2016, pages 522–534. Wiley Online Library, 2016.
[168] Anup Doshi and Mohan M Trivedi. Examining the impact of driving style on the predictability and responsiveness of the driver: Real-world and simulator analysis. In 2010 IEEE Intelligent Vehicles Symposium, pages 232–237. IEEE, 2010.
[169] Andrea Corti, Carlo Ongini, Mara Tanelli, and Sergio M Savaresi. Quantitative driving style estimation for energy-oriented applications in road vehicles. In 2013 IEEE International Conference on Systems, Man, and Cybernetics, pages 3710–3715. IEEE, 2013.
[170] Alessandro Paolo Capasso, Paolo Maramotti, Anthony Dell'Eva, and Alberto Broggi. End-to-end intersection handling using multi-agent deep reinforcement learning. arXiv preprint arXiv:2104.13617, 2021.
[171] David Isele, Reza Rahimi, Akansel Cosgun, Kaushik Subramanian, and Kikuo Fujimura.
Navigating occluded intersections with autonomous vehicles using deep reinforcement learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 2034–2039. IEEE, 2018.
[172] Shixiong Kai, Bin Wang, D. Chen, Jianye Hao, Hongbo Zhang, and Wulong Liu. A multi-task reinforcement learning approach for navigating unsignalized intersections. 2020 IEEE Intelligent Vehicles Symposium (IV), pages 1583–1588, 2020.
[173] Teng Liu, Xingyu Mu, Bing Huang, Xiaolin Tang, Fuqing Zhao, Xiao Wang, and Dongpu Cao. Decision-making at unsignalized intersection for autonomous vehicles: Left-turn maneuver with deep reinforcement learning. arXiv preprint arXiv:2008.06595, 2020.
[174] Nan Li, Yu Yao, Ilya Kolmanovsky, Ella Atkins, and Anouck R Girard. Game-theoretic modeling of multi-vehicle interactions at uncontrolled intersections. IEEE Transactions on Intelligent Transportation Systems, 2020.
[175] Ran Tian, Nan Li, Ilya Kolmanovsky, Yildiray Yildiz, and Anouck R Girard. Game-theoretic modeling of traffic in unsignalized intersection network for autonomous vehicle control verification and validation. IEEE Transactions on Intelligent Transportation Systems, 2020.
[176] Junha Roh, Christoforos Mavrogiannis, Rishabh Madan, Dieter Fox, and Siddhartha S Srinivasa. Multimodal trajectory prediction via topological invariance for navigation at uncontrolled intersections. arXiv preprint arXiv:2011.03894, 2020.
[177] Noam Buckman, Alyssa Pierson, Wilko Schwarting, Sertac Karaman, and Daniela L Rus. Sharing is caring: Socially-compliant autonomous intersection negotiation. 2020.
[178] Matteo Vasirani and Sascha Ossowski. A market-inspired approach for intersection management in urban road traffic networks. Journal of Artificial Intelligence Research, 43:621–659, 2012.
[179] Andrea Censi, Saverio Bolognani, Julian G Zilly, Shima Sadat Mousavi, and Emilio Frazzoli. Today me, tomorrow thee: Efficient resource allocation in competitive settings using karma games.
In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 686–693. IEEE, 2019.
[180] DianChao Lin and Saif Eddin Jabari. Pay for intersection priority: A free market mechanism for connected vehicles. IEEE Transactions on Intelligent Transportation Systems, 2021.
[181] Dustin Carlino, Stephen D Boyles, and Peter Stone. Auction-based autonomous intersection management. In 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), pages 529–534. IEEE, 2013.
[182] David Rey, Michael W Levin, and Vinayak V Dixit. Online incentive-compatible mechanisms for traffic intersection auctions. European Journal of Operational Research, 2021.
[183] Muhammed O Sayin, Chung-Wei Lin, Shinichi Shiraishi, Jiajun Shen, and Tamer Başar. Information-driven autonomous intersection control via incentive compatible mechanisms. IEEE Transactions on Intelligent Transportation Systems, 20(3):912–924, 2018.
[184] Tim Roughgarden. Twenty lectures on algorithmic game theory. Cambridge University Press, 2016.
[185] Kurt Dresner and Peter Stone. A multiagent approach to autonomous intersection management. Journal of Artificial Intelligence Research, 31:591–656, 2008.
[186] Heiko Schepperle and Klemens Böhm. Agent-based traffic control using auctions. In International Workshop on Cooperative Information Agents, pages 119–133. Springer, 2007.
[187] Simon Le Cleac'h, Mac Schwager, and Zachary Manchester. Lucidgames: Online unscented inverse dynamic games for adaptive trajectory prediction and planning. arXiv preprint arXiv:2011.08152, 2020.
[188] Simon Le Cleac'h, Mac Schwager, and Zachary Manchester. Algames: A fast solver for constrained dynamic games. arXiv preprint arXiv:1910.09713, 2019.
[189] Jaime F Fisac, Eli Bronstein, Elis Stefansson, Dorsa Sadigh, S Shankar Sastry, and Anca D Dragan. Hierarchical game-theoretic planning for autonomous vehicles. In 2019 International Conference on Robotics and Automation (ICRA), pages 9590–9596. IEEE, 2019.
[190] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. 2009.
[191] Kuanqi Cai, Chaoqun Wang, Shuang Song, Haoyao Chen, and Max Q-H Meng. Risk-aware path planning under uncertainty in dynamic environments. Journal of Intelligent & Robotic Systems, 101(3):1–15, 2021.
[192] Anirudha Majumdar, Sumeet Singh, Ajay Mandlekar, and Marco Pavone. Risk-sensitive inverse reinforcement learning via coherent risk models. In Proceedings of Robotics: Science and Systems, Cambridge, Massachusetts, July 2017.
[193] Lillian J Ratliff and Eric Mazumdar. Inverse risk-sensitive reinforcement learning. IEEE Transactions on Automatic Control, 65(3):1256–1263, 2019.
[194] Minae Kwon, Erdem Biyik, Aditi Talati, Karan Bhasin, Dylan P. Losey, and Dorsa Sadigh. When humans aren't optimal: Robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pages 43–52. Association for Computing Machinery, 2020.
[195] Hans Föllmer and Thomas Knispel. Entropic risk measures: Coherence vs. convexity, model ambiguity and robust large deviations. Stochastics and Dynamics, 11(02n03):333–351, 2011.
[196] R Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471, 2002.
[197] Peter Whittle and Peter R Whittle. Risk-sensitive optimal control, volume 20. Wiley New York, 1990.
[198] Wendell H Fleming and William M McEneaney. Risk sensitive optimal control and differential games. In Stochastic Theory and Adaptive Control, pages 185–197. Springer, 1992.
[199] Takayuki Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems 25, pages 233–241. Curran Associates, Inc., 2012.
[200] Yin-Lam Chow and Marco Pavone. A framework for time-consistent, risk-averse model predictive control: Theory and algorithms.
In 2014 American Control Conference, pages 4204–4211. IEEE, 2014.
[201] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems 28, pages 1522–1530. Curran Associates, Inc., 2015.
[202] Samantha Samuelson and Insoon Yang. Safety-aware optimal control of stochastic systems using conditional value-at-risk. In 2018 American Control Conference, pages 6285–6290. IEEE, 2018.
[203] Margaret P Chapman, Jonathan Lacotte, Aviv Tamar, Donggun Lee, Kevin M Smith, Victoria Cheng, Jaime F Fisac, Susmit Jha, Marco Pavone, and Claire J Tomlin. A risk-sensitive finite-time reachability approach for safety of stochastic dynamic systems. In 2019 American Control Conference, pages 2958–2963. IEEE, 2019.
[204] Peter Whittle. Risk-sensitive linear/quadratic/gaussian control. Advances in Applied Probability, 13(4):764–777, 1981.
[205] Arvind A Pereira, Jonathan Binney, Geoffrey A Hollinger, and Gaurav S Sukhatme. Risk-aware path planning for autonomous underwater vehicles using predictive ocean models. Journal of Field Robotics, 30(5):741–762, 2013.
[206] Vishnu D Sharma, Maymoonah Toubeh, Lifeng Zhou, and Pratap Tokekar. Risk-aware planning and assignment for ground vehicles using uncertain perception from aerial vehicles. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11763–11769. IEEE, 2020.
[207] Vishnu Dutt Sharma and Pratap Tokekar. Risk-aware path planning for ground vehicles using occluded aerial images. arXiv preprint arXiv:2104.11709, 2021.
[208] Arnav Choudhry, Brady Moon, Jay Patrikar, Constantine Samaras, and Sebastian Scherer. CVaR-based flight energy risk assessment for multirotor UAVs using a deep energy model. arXiv preprint arXiv:2105.15189, 2021.
[209] Vishnu D Sharma, Maymoonah Toubeh, Lifeng Zhou, and Pratap Tokekar.
Risk-aware planning and assignment for ground vehicles using uncertain perception from aerial vehicles. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11763–11769. IEEE, 2020.
[210] Vishnu D Sharma, Maymoonah Toubeh, Lifeng Zhou, and Pratap Tokekar. Risk-aware planning and assignment for ground vehicles using uncertain perception from aerial vehicles. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11763–11769. IEEE, 2020.
[211] Vishnu D Sharma, Maymoonah Toubeh, Lifeng Zhou, and Pratap Tokekar. Risk-aware planning and assignment for ground vehicles using uncertain perception from aerial vehicles. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11763–11769. IEEE, 2020.
[212] Orit Taubman-Ben-Ari, Mario Mikulincer, and Omri Gillath. The multidimensional driving style inventory — scale construct and validation. Accident Analysis & Prevention, 36(3):323–332, 2004.
[213] E Gulian, G Matthews, Aleck Ian Glendon, DR Davies, and LM Debney. Dimensions of driver stress. Ergonomics, 1989.
[214] Davina J French, Robert J West, James Elander, and John Martin Wilding. Decision-making style, driving style, and self-reported involvement in road traffic accidents. Ergonomics, 36(6):627–644, 1993.
[215] Jerry L Deffenbacher, Eugene R Oetting, and Rebekah S Lynch. Development of a driving anger scale. Psychological Reports, 1994.
[216] Motonori Ishibashi, Masayuki Okuwa, Shun'ichi Doi, and Motoyuki Akamatsu. Indices for characterizing driving style and their relevance to car following behavior. In SICE Annual Conference 2007, pages 1132–1137. IEEE, 2007.
[217] Stephen J. Guy, Sujeong Kim, M. Chiao Lin, and Dinesh Manocha. Simulating heterogeneous crowd behaviors using personality trait theory. In Symposium on Computer Animation, 2011.
[218] Aniket Bera, Tanmay Randhavane, and Dinesh Manocha. Aggressive, tense or shy?
identifying personality traits from crowd videos. In IJCAI, 2017. [219] J Christopher Brill, Mustapha Mouloua, Edwin Shirkey, and Pascal Alberti. Predictive validity of the aggressive driver behavior questionnaire (adbq) in a simulated environment. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 53, pages 1334?1337. SAGE Publications Sage CA: Los Angeles, CA, 2009. [220] Tamer Bas?ar and Geert Jan Olsder. Dynamic noncooperative game theory. SIAM, 1998. [221] Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. In European Conference on Computer Vision (ECCV), pages 414?430. Springer, 2020. [222] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [223] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. arXiv preprint arXiv:2104.10133, 2021. [224] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019. [225] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet. Level 5 perception dataset 2020. https:// level-5.global/level5/data/, 2019. 
[226] Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, et al. TNT: Target-driven trajectory prediction. arXiv preprint arXiv:2008.08294, 2020.
[227] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019.
[228] Wei Zhan, Liting Sun, Di Wang, Haojie Shi, Aubrey Clausse, Maximilian Naumann, Julius Kümmerle, Hendrik Königshof, Christoph Stiller, Arnaud de La Fortelle, et al. INTERACTION dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps. arXiv preprint arXiv:1910.03088, 2019.
[229] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[230] Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and CV Jawahar. IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019.
[231] Vasili Ramanishka, Yi-Ting Chen, Teruhisa Misu, and Kate Saenko. Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7699–7707, 2018.
[232] Ashesh Jain, Hema S Koppula, Shane Soh, Bharad Raghavan, Avi Singh, and Ashutosh Saxena. Brain4Cars: Car that knows before you do via sensory-fusion deep learning architecture. arXiv preprint arXiv:1601.00740, 2016.
[233] Zhengping Che, Guangyu Li, Tracy Li, Bo Jiang, Xuefeng Shi, Xinsheng Zhang, Ying Lu, Guobin Wu, Yan Liu, and Jieping Ye. D2-City: A large-scale dashcam video dataset of diverse traffic scenarios. arXiv preprint arXiv:1904.01975, 2019.
[234] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. arXiv preprint arXiv:1805.04687, 2018.
[235] Aymery Constant and Emmanuel Lagarde. Protecting vulnerable road users from injury. PLoS Medicine, 7(3):e1000228, 2010.
[236] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV). Springer, 2020.
[237] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
[238] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[239] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[240] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In IEEE International Conference on Computer Vision (ICCV), pages 10012–10022, 2021.
[241] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV). Springer, 2014.
[242] R. Padilla, S. L. Netto, and E. A. B. da Silva. A survey on performance metrics for object-detection algorithms. In 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), 2020.
[243] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
[244] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6047–6056, 2018.
[245] Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, and Hongsheng Li. Actor-context-actor relation network for spatio-temporal action localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 464–474, 2021.
[246] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6154–6162, 2018.
[247] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), pages 213–229. Springer, 2020.
[248] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[249] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In IEEE International Conference on Computer Vision (ICCV), pages 6202–6211, 2019.
[250] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the Kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019.