ABSTRACT

Title of Dissertation: DEEP LEARNING FOR FORENSICS
Peng Zhou, Doctor of Philosophy, 2020
Dissertation Directed by: Professor Larry Davis, Department of Electrical and Computer Engineering

The advent of media sharing platforms and the easy availability of advanced photo and video editing software have resulted in a large quantity of manipulated images and videos being shared on the internet. While the intent behind such manipulations varies widely, concerns about the spread of fake news and misinformation are growing. Therefore, detecting manipulation has become an emerging necessity. Unlike traditional classification, semantic object detection, or segmentation, manipulation detection and classification pay more attention to low-level tampering artifacts than to semantic content. The main challenges in this problem include (a) investigating features to reveal tampering artifacts, (b) developing generic models which are robust to a wide range of post-processing methods, (c) applying algorithms to higher resolutions in real scenarios, and (d) handling newly emerging manipulation techniques. In this dissertation, we propose approaches to tackling these challenges.

Manipulation detection utilizes both low-level tampering artifacts and semantic content, suggesting that richer features need to be harnessed to reveal more evidence. To learn rich features, we propose a two-stream Faster R-CNN network and train it end-to-end to detect the tampered regions given a manipulated image. Experiments on four standard image manipulation datasets demonstrate that our two-stream framework outperforms each individual stream, and also achieves state-of-the-art performance compared to alternative methods, with robustness to resizing and compression.

Additionally, to extend manipulation detection from images to video, we introduce VIDNet, the Video Inpainting Detection Network, which contains an encoder-decoder architecture with a quad-directional local attention module.
To reveal artifacts encoded in compression, VIDNet additionally takes in Error Level Analysis (ELA) frames to augment RGB frames, producing multimodal features at different levels with an encoder.

Besides, to improve the generalization of manipulation detection models, we introduce a manipulated image generation process that creates true positives using currently available datasets. Drawing from traditional work on image blending, we propose a novel generator for creating such examples. In addition, we propose to create further examples that force the algorithm to focus on boundary artifacts during training. Extensive experimental results validate our proposal.

Furthermore, to apply deep learning models to high resolution scenarios efficiently, we treat the problem as mask refinement given a coarse low resolution prediction. We propose to convert the regions of interest into strip images and compute a boundary prediction in the strip domain. Extensive experiments on both a public dataset and a newly created high resolution dataset strongly validate our approach.

Finally, to handle newly emerging manipulation techniques while preserving performance on learned manipulations, we investigate incremental learning. We propose a multi-model and multi-level knowledge distillation strategy to preserve performance on old categories while training on new categories. Experiments on standard incremental learning benchmarks show that our method improves the overall performance over standard distillation techniques.

DEEP LEARNING FOR FORENSICS

by

Peng Zhou

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2020

Advisory Committee:
Professor Larry Davis, Chair/Advisor
Professor Rama Chellappa
Professor Joseph JaJa
Professor Behtash Babadi
Professor Yang Tao

© Copyright by Peng Zhou 2020

Dedication

This thesis is dedicated to my parents for their love and support.

Acknowledgments

Graduating during the 2020 pandemic is a special experience, and I would like to thank all the people who helped me survive my PhD. This dissertation could not have been completed without their help.

First and foremost, I would like to thank my advisor, Professor Larry Davis, for his kindness in accepting me as his student and for his invaluable guidance and support of my research. It was Larry who introduced me to the field of computer vision. It is enjoyable to work with Larry; he always takes care of his students and provides suggestions when needed. Besides, it is an honor to be advised by a famous researcher like him. Additionally, I want to thank Professor Rama Chellappa, Professor Joseph JaJa, Professor Behtash Babadi and Professor Yang Tao for their timely help in serving as my committee members and reviewing all the manuscripts.

I also want to express my gratitude to the Electrical and Computer Engineering department of the University of Maryland. My PhD dream would not have come true without its admission and course training. My gratitude also goes to all the graduate coordinators who helped me submit all types of materials during my PhD period.

My colleagues at UMIACS are another factor that made my PhD enjoyable, and I thank all of them. I thank Dr. Xintong Han for his help and guidance during my first two years, and Dr. Zuxuan Wu for his suggestions and help before each submission deadline. I am also grateful for the assistance of my other co-authors at school, including Dr. Vlad Morariu, Professor Abhinav Shrivastava, Dr. Sernam Lim, Ning Yu, Dr. Hui Ding, Dr. Mahyar Najibi and Dr. Sirius Chen. It was a pleasure to collaborate with them, and their discussions were insightful. Besides that, the time I spent with my colleagues was unforgettable.
Special thanks go to Xintong Han, Zuxuan Wu, Hengduo Li and Shiyi Lan for the fitness training we did together. I also cherish the time spent with Dr. Zhe Wu, Dr. Mingfei Gao, Xitong Yang, Luyu Yang, Jun Wang, Dr. Hao Zhou, Dr. Hongyu Xu and Dr. Pallabi Ghosh. I thank them for all the memorable moments during my PhD period.

My internship experience is also part of my PhD journey, and I extend the same gratitude to all my mentors for their suggestions and collaboration. I thank Dr. Long Mai, Dr. Jianming Zhang and Dr. Ning Xu for their suggestions on my incremental learning project; Dr. Brian Price, Dr. Scott Cohen and Dr. Gregg Wilensky for their supportive guidance on the DeepStrip project; and Dr. Ran Xu and Dr. Zeyuan Chen for their discussions on the talking face generation project. I have learned a lot and will cherish the time spent with my mentors.

Furthermore, I would like to thank my friends who shared their stories and encouraged me during my PhD. I truly thank my old friend Yiliang Wang for the holidays we spent together and for his suggestions when I felt down. I thank Ye Jiang and Youru Zhou for their long-lasting friendship and willingness to listen to my unhappiness. Moreover, I am grateful to my roommate of five years, Shengjie Xie, and his wife Xi Li, who shared food and TV series with me. Many thanks to other friends in the US, including Yi Liu, Jing Huang, Zeyu Zhang, Zhouchen Luo, Xiao Xiao, Shenli Zou, Xiaomin Lin and Zhengyu Lin. I would also like to express my gratitude to my friends in China, including but not limited to Xiaoqing Wei, Sheng Zhou, Peng Xiao and Wei Xie, who treated me well each time I went back.

Lastly, I owe my deepest thanks to my parents, who always trust and stand by me. I thank them for their unconditional love and their effort in bringing me up. I also thank my relatives, who took care of me all the time; I feel truly lucky to be a member of my family.
Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction and Motivation

Chapter 2: Learning Rich Features for Image Manipulation Detection
  2.1 Introduction
  2.2 Related Work
  2.3 Proposed Method
    2.3.1 RGB Stream
    2.3.2 Noise Stream
    2.3.3 Bilinear Pooling
    2.3.4 Implementation Detail
  2.4 Experiments
    2.4.1 Pre-trained Model
    2.4.2 Testing on Standard Datasets
    2.4.3 Manipulation Technique Detection
    2.4.4 Qualitative Result
  2.5 Conclusion

Chapter 3: Deep Video Inpainting Detection
  3.1 Introduction
  3.2 Related Work
  3.3 Approach
    3.3.1 Multimodal Features
    3.3.2 Quad-Directional Local Attention
    3.3.3 ConvLSTM Decoder
    3.3.4 Implementation Details
  3.4 Experiment
    3.4.1 Experiment setup
    3.4.2 Main Results
    3.4.3 Ablation Analysis
    3.4.4 Robustness Analysis
    3.4.5 Results on Free-form Video Inpainting Dataset
    3.4.6 Qualitative Results
  3.5 Conclusions

Chapter 4: Generate, Segment and Refine: Towards Generic Manipulation Segmentation
  4.1 Introduction
  4.2 Related Work
  4.3 Approach
    4.3.1 Generation
    4.3.2 Segmentation
    4.3.3 Refinement
  4.4 Experiments
    4.4.1 Datasets and Experiment Setting
    4.4.2 Main Results
    4.4.3 Ablation Analysis
    4.4.4 Robustness to Attacks
    4.4.5 Segmentation with COCO Annotations
    4.4.6 Qualitative Results
  4.5 Conclusion

Chapter 5: DeepStrip: High Resolution Boundary Refinement
  5.1 Introduction
  5.2 Related Work
  5.3 Approach
    5.3.1 Strip Image Creation
    5.3.2 Strip Boundary Prediction
    5.3.3 Loss Function
    5.3.4 Strip Reconstruction
    5.3.5 Implementation Details
  5.4 Experiments
    5.4.1 Datasets and Metrics
    5.4.2 Main Results
    5.4.3 Ablation Analysis
    5.4.4 Memory and Speed Comparison
    5.4.5 Qualitative Results
    5.4.6 Strip Height Adaptation
  5.5 Conclusion

Chapter 6: Multi-model and Multi-level Knowledge Distillation for Incremental Learning
  6.1 Introduction
  6.2 Related Work
  6.3 Approach
    6.3.1 Multi-model Distillation
    6.3.2 Auxiliary Distillation
    6.3.3 Model Reconstruction
  6.4 Experiments
    6.4.1 Datasets and Evaluation Metrics
    6.4.2 Exemplar-free setting
    6.4.3 Ablation Studies
    6.4.4 Analysis on pruning ratio
    6.4.5 Exemplar Based Setting
    6.4.6 Memory Comparison
  6.5 Conclusion and Discussion

Chapter 7: Conclusion

Bibliography

List of Tables

2.1 AP comparison on our synthetic COCO dataset. Each row is a model architecture: RGB Net is a single Faster R-CNN using the RGB image as input; Noise Net is a single Faster R-CNN using the noise feature map as input; RGB-N noise RPN is a two-stream Faster R-CNN using noise features for the RPN network; Noise + RGB RPN is a two-stream Faster R-CNN using both noise and RGB features as the input of the RPN network; RGB-N is a two-stream Faster R-CNN using RGB features for the RPN network.
2.2 Training and testing split (number of images) for four standard datasets. Columbia is only used for testing the model trained on our synthetic dataset.
2.3 F1 score comparison on four standard datasets. "-" denotes that the result is not available in the literature.
2.4 Pixel level AUC comparison on four standard datasets. "-" denotes that the result is not available in the literature.
2.5 Data augmentation comparison. Flipping: image flipping. JPEG: JPEG compression with quality 70. Noise: adding Gaussian noise with variance of 5. Each entry is F1/AUC score.
2.6 F1 score on NIST16 dataset for JPEG compression (with quality 70 and 50) and resizing (with scale 0.7 and 0.5) attacks. Each entry is the F1 score of JPEG/Resizing.
2.7 AP comparison on multi-class on NIST16 dataset using the RGB-N network. Mean denotes the mean AP for splicing, removal and copy-move.
3.1 Mean IoU and F1 score comparison on inpainted DAVIS. "*"
denotes that the model is trained on these inpainting algorithms.
3.2 Mean IoU and F1 score comparison on inpainted DAVIS. "*" denotes that the model is trained on these inpainting algorithms.
3.3 Mean IoU and F1 score comparison on inpainted DAVIS. "*" denotes that the model is trained on these inpainting algorithms.
3.4 Ablation analysis for each component of our approach. "*" denotes that the model is trained on these inpainting algorithms.
3.5 Mean IoU and F1 score comparison on FVI. The results are directly tested on the FVI dataset, and all the models are trained on VI and OP inpainted DAVIS.
4.1 MCC and F1 score comparison on four standard datasets. "-" denotes that the result is not available in the literature. * Our method is 1600 times faster than EXIF-consistency.
4.2 Ablation analysis on four datasets. Each entry is the F1 score tested on an individual dataset.
4.3 F1 score manipulation segmentation comparison trained with COCO annotations.
5.1 Boundary-based F score comparison. The scale factor between low and high resolution images is 4 on DAVIS 2016 and 8, 16, 32 on PixaHR. The pixel dilation is 0 and 1 for DAVIS 2016, and 1 and 2 for PixaHR.
5.2 Ablation analysis on two datasets. Each entry is the boundary-based F score tested on an individual dataset.
5.3 Memory and speed comparison. Each entry is the memory or speed on the DAVIS 2016/PixaHR dataset. We only compare the memory usage among learning-based approaches.
5.4 Strip height selection comparison on PixaHR 32×.
6.1 Top-1 accuracy comparison among different pruning ratios on Cifar-100 (20 classes per incremental step).
6.2 Memory compensation comparison (MB). Each entry is the additional memory requirement for methods across different datasets based on the memory footprint of LWF.

List of Figures

2.1 Examples of tampered images that have undergone different tampering techniques. From top to bottom are examples showing manipulations of splicing, copy-move and removal.
2.2 Illustration of our two-stream Faster R-CNN network. The RGB stream models visual tampering artifacts, such as unusually high contrast along object edges, and regresses bounding boxes to the ground-truth. The noise stream first obtains the noise feature map by passing the input RGB image through an SRM filter layer, and leverages the noise features to provide additional evidence for manipulation classification. The RGB and noise streams share the same region proposals from the RPN network, which only uses RGB features as input. The RoI pooling layer selects spatial features from both RGB and noise streams. The predicted bounding boxes (denoted as "bbx pred") are generated from RGB RoI features. A bilinear pooling [1,2] layer after RoI pooling enables the network to combine the spatial co-occurrence features from the two streams. Finally, passing the results through a fully connected layer and a softmax layer, the network produces the predicted label (denoted as "cls pred") and determines whether predicted regions have been manipulated or not.
2.3 Illustration of tampering artifacts. Two examples showing tampering artifacts in the original RGB image and in the local noise features obtained by the SRM filter layer. The second column shows the amplified regions for the red bounding boxes in the first column.
As shown in the second column, the unnaturally high contrast along the baseball player's edges provides a strong cue about the presence of tampering. The third column shows the local noise inconsistency between tampered regions and authentic regions. In different scenarios, visual information and noise features play a complementary role in revealing tampering artifacts.
2.4 The three SRM filter kernels used to extract noise features.
2.5 Qualitative visualization of results. The top row shows a qualitative result from the COVER dataset. The copy-moved bag confuses both the RGB Net and the noise Net. RGB-N achieves a better detection in this case because it combines the features from the two streams. The middle row shows a qualitative result from the Columbia dataset. The RGB Net produces a more accurate result than the noise stream. Taking into account both streams produces a better result for RGB-N. The bottom row shows a qualitative result from CASIA1.0. The spliced object leaves clear tampering artifacts in both the RGB and noise streams, which yields precise detections for the RGB, noise, and RGB-N networks.
2.6 Qualitative results for multi-class image manipulation detection on the NIST16 dataset. The RGB image and the noise map provide different information for splicing, copy-move and removal. By combining the features from the RGB image with the noise features, RGB-N produces the correct classification for different tampering techniques.
3.1 Problem introduction. Given an inpainted video (second column), we localize the inpainted region both spatially and temporally.
3.2 Framework overview. Given an RGB frame in a video, we first derive its corresponding ELA frame and compute multimodal features at different scales with both frames. We also introduce a quad-directional local attention module (striped) to the last encoded RGB features (colored blue) to explore spatial relationships among pixels from four directions. These encoded features are further input into a multi-layer ConvLSTM (colored green) for decoding, exploiting spatial and temporal relationships explicitly, to produce masks of inpainted regions. See text for more details.
3.3 ELA frame example. From top to bottom: the inpainted RGB frame, its corresponding ELA frame, and the ground-truth inpainting mask. The inpainting artifacts, e.g., the dog, person and ship, stand out in ELA space while not easily seen in the RGB space.
3.4 The quad-directional local attention module. Given RGB features from the last layer of the encoder, we derive attention maps with a quad-directional local attention module. To detect whether a pixel is inpainted or not, the module attends to its neighbors from four directions (left-to-right, up-to-down, right-to-left and down-to-up).
3.5 Mean IoU comparison under different perturbations. JPEG compression perturbations use quality factors of 90 and 70; noise perturbations use SNRs of 30 dB and 20 dB. Columns from left to right show the results on VI, OP and CP inpainting. "*" denotes that the model is trained on these inpainting algorithms.
3.6 Qualitative visualization on DAVIS. The first row shows the inpainted video frame. The second to fourth rows show the final predictions from different methods. The fifth row is the ground truth.
4.1 Examples of manipulated images across different datasets. Columns from left to right are images in CASIA [3], COVER [4], Carvalho [5], and In-The-Wild [6]. The odd rows are manipulated images and the even rows are the ground truth masks. Different datasets contain different distributions (from animals to people), manipulation techniques (from copy-move (the second column) to splicing (the remaining columns)) and post-processing methods (from no post-processing to various processes including filtering, illumination, and blurring).
4.2 GSR-Net framework overview. (a) Given a tampered image S, an authentic target image T, and the ground truth mask K, the generation stage generates a hard example G(M) starting from a simple copy-pasted image M. (b) Taking the training images, copy-pasted images or generated images as input, the segmentation stage learns to segment the boundary artifacts and fill the interior to produce the final prediction. (c) The segmentation network concatenates lower level features to predict boundary artifacts and then concatenates the boundary feature back to the segmentation branch for the final prediction. (d) The refinement stage creates a novel tampered image with new boundary artifacts by replacing the manipulated boundaries predicted by the segmentation stage with original authentic regions, and learns to make a new prediction.
4.3 Analysis of robustness under different attacks. JPEG compression attacks use quality factors of 70 and 50; scale attacks use scaling ratios of 0.7 and 0.5. (a) JPEG compression attacks on In-The-Wild. (b) Scale attacks on In-The-Wild. (c) JPEG compression attacks on Carvalho. (d) Scale attacks on Carvalho.
4.4 Qualitative visualization. The first row shows manipulated images on different datasets. The second row shows the final manipulation segmentation prediction. The third row illustrates the output of the boundary artifacts branch. The last row is the ground truth.
4.5 Qualitative visualization of the generation network. The first two columns show the authentic background and manipulation mask.
As the number of epochs increases, the manipulated region matches better with the background and thus boundary artifacts are harder to identify.
5.1 Concept overview. The example is from the newly created PixaHR dataset. Given the low resolution mask and high resolution image on the left, bilinear upsampling with a scale factor of 16× would result in boundary misalignment in the high resolution image, as shown in the enlarged boundary region on the right. Also, the new details in high resolution would be missed.
5.2 Framework. To save memory and computation, we predict the boundary in a strip image instead of the whole image. First, the strip image is extracted from the HR image and the corresponding LR mask. Taking the strip image as input, the network predicts all potential boundaries (denoted as "x") and passes the initial prediction to a selection layer (denoted as "m") to pursue a more accurate prediction of the target boundary (denoted as "s"). The numbers indicate the losses displayed on the right. Orange and green curves denote the ground truth and prediction, respectively. Note that the strip image and prediction are rotated 90 degrees for visualization.
5.3 Strip image creation. To generate the strip image, the B-spline representation of the contour in the LR mask is upsampled to HR as a coarse boundary. The HR region along the normal direction (e.g., red and green arrows) of the contour is then extracted. Finally, the strip image and corresponding boundary ground truth are obtained by flattening the extracted region in both the HR image and mask. Note that the final boundary filters out noisy boundaries (e.g., the red box region) from the initial boundary. The strip image and boundaries are rotated 90 degrees for visualization.
5.4 Qualitative results on PixaHR 32×. Rows from top to bottom are the results of Dense CRF, STEAL, Ours and the ground truth. We show the entire boundary (green) result first and enlarge the blue bounding box region for comparison (boundaries are whitened).
5.5 Qualitative results on COCO. Columns from left to right are the coarse annotation, DELSE [7], STEAL [8] and Ours.
6.1 Concept overview. We propose to distill knowledge from all previous models efficiently to preserve old data information rather than sequentially applying distillation only to the last model. (For example, using both S1 and S2 in S3 for distillation instead of sequentially using S1 for S2 and then S2 for S3.) The confusion matrix is LWF-MC [9] on the left and our method on the right for the exemplar-free incremental setting.
6.2 Framework overview. Given images from the current training data, we preserve previous knowledge directly from the reconstructed output by matching the logits with the corresponding model, and classify the current data with its ground truth. As an example, each layer contains a mask matrix M_{t_i} at the t_i-th incremental step recording significant weights for previous data. The gray dots represent the weights to be trained on the current data. The red and green dots are fixed during training, denoting the weights retained from the first and second incremental steps, respectively. The gray dots are fine-tuned for the current data before pruning. After pruning, a subset of the gray dots will be marked as important weights and become blue dots, and the remaining weights will be fine-tuned during the next incremental step. Accordingly, M_{t_2} is updated and used as M_{t_3} at the end of this round. In multi-model distillation, the red and green output logits of the current model are matched with models 1 and 2, respectively, while the blue logits are matched with the ground truth.
102 6.3 Illustration of auxiliary distillation. We extract the intermediate fea- tures and connect directly with an auxiliary classifier to preserve mid- dle level knowledge. . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 6.4 Performance on iILSVRC-small and Cifar-100 dataset in exemplar- free setting. (a) Top-1 accuracy on Cifar-100 (5-class batch). (b) Top-1 accuracy on Cifar-100 (10-class batch). (c) Top-1 accuracy on Cifar-100 (20-class batch). (d) Top-5 accuracy on iILSVRC-small (10-class batch). (e) Top-5 accuracy on iILSVRC-small (20-class batch). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6.5 Ablation Studies for our approach. (a) Top-1 accuracy comparison on Cifar-100 (20-class batch). (b) Top-5 accuracy performance on iILSVRC-small (20-class batch). . . . . . . . . . . . . . . . . . . . . 113 6.6 Comparison between different number of models used in multi-model distillation on Cifar-100 20-class batch. . . . . . . . . . . . . . . . . . 114 6.7 Performance comparison in exemplar based setting. (a) Top-1 accu- racy performance on Cifar-100 (10-class batch). (b) Top-5 accuracy performance on iILSVRC-small (10-class batch). . . . . . . . . . . . 115 6.8 Analysis on performance and memory compared to iCaRL on Cifar- 100 (10-class batch). We increase memory budget for exemplar set from 200 to 4000 images and report the average accuracy of all the 10 incremental steps. . . . . . . . . . . . . . . . . . . . . . . . . . . 117 xv Chapter 1: Introduction and Motivation Recent decades have witnessed a rapid development of deep learning, and it has been applied to various applications including image/video editing, gener- ative model, object recognition and detection. However, the improving result of photo/video editing has raised a lot of concerns about malicious purposes or mis- information. 
As a result, research comes to light which combines traditional foren- sics approaches with deep learning to fight against the fake media. In this thesis, we mainly tackle four different challenges to improve forensic detection with deep learning? a) harnessing rich features to find evidence of tampering. b) improving the generalization of deep learning based forensic models. c) exploring efficient solu- tions to apply deep learning model at different scales. d) extending the application of the learned model to new emerging manipulation techniques. In Chapter 2, we introduce a two-stream RGB-N network to learn rich features for manipulation detection. One of the two streams is an RGB stream whose purpose is to extract features from the RGB image input to find tampering artifacts like strong contrast difference, unnatural tampered boundaries, and so on. The other is a noise stream that leverages the noise features extracted from a steganalysis rich model filter layer to discover the noise inconsistency between authentic and 1 tampered regions. We then fuse features from the two streams through a bilinear pooling layer to further incorporate spatial co-occurrence of these two modalities. In Chapter 3, we take the temporal dimension into account and detect the inpainting manipulation within videos. A inpainting detection network VIDNet is then proposed to reveal both spatial and temporal artifacts. The features are learned by our network based on both the compression coefficient artifacts and visual RGB artifacts. After that, the features of these two modalities are further decoded by a Convolutional LSTM to predict masks of inpainted regions. In addition, when detecting whether a pixel is inpainted or not, we present a quad-directional local attention module that borrows information from its surrounding pixels from four directions. Extensive experiments are conducted to validate our approach. 
We demonstrate that VIDNet outperforms alternative inpainting detection methods by clear margins.

In Chapter 4, we propose combining a generative model to augment training data and thus improve the generalization of the learned model. The network first automatically generates both hard and easy examples, then segments both boundary artifacts and the interior regions, and finally replaces the predicted artifacts with original regions to refine the predicted result.

In Chapter 5, we study the problem of high resolution boundary refinement to extend deep learning models to real scenarios. We propose transforming the image into an image strip domain to reduce computation and memory consumption. To detect the target boundary at high resolution, we present a framework with two prediction layers. First, all potential boundaries are predicted as an initial prediction, and then a selection layer is used to pick the target boundary and smooth the result. To encourage accurate prediction, we introduce a loss that measures the boundary distance in the strip domain. In addition, we impose matching consistency and C0 continuity regularization on the network to reduce false alarms.

In Chapter 6, we further investigate incremental learning to make deep learning models robust to newly emerging categories while avoiding forgetting previously learned knowledge. We leverage all previous model snapshots as the teacher to retain previous knowledge while training on new categories. In addition, we incorporate an auxiliary distillation to further preserve knowledge encoded at the intermediate feature levels. To make the model more memory efficient, we adapt mask-based pruning to reconstruct all previous models with a small memory footprint.

In Chapter 7, we summarize this dissertation and discuss potential directions for future research.
Chapter 2: Learning Rich Features for Image Manipulation Detection

2.1 Introduction

With the advances of image editing techniques and user-friendly editing software, low-cost tampered or manipulated image generation processes have become widely available. Among tampering techniques, splicing, copy-move, and removal are the most common manipulations. Image splicing copies regions from an authentic image and pastes them into other images, copy-move copies and pastes regions within the same image, and removal eliminates regions from an authentic image and then inpaints them. Sometimes, post-processing such as Gaussian smoothing is applied after these tampering techniques. Examples of these manipulations are shown in Figure 2.1. Even with careful inspection, humans find it difficult to recognize the tampered regions.

As a result, distinguishing authentic images from tampered images has become increasingly challenging. The emerging research on this topic, image forensics, is of great importance because it seeks to prevent attackers from using their tampered images for unscrupulous business or political purposes. In contrast to current object detection networks [10-15], which aim to detect all objects of different categories in an image, a network for image manipulation detection would aim to detect only the tampered regions (usually objects). We investigate how to adapt object detection networks to perform image manipulation detection by exploring both RGB image content and image noise features.

[Figure 2.1 panels, left to right: authentic image, tampered image, ground-truth mask.]
Figure 2.1: Examples of tampered images that have undergone different tampering techniques. From top to bottom, the examples show splicing, copy-move and removal.

[Figure 2.2 diagram labels: RGB stream input, RGB Conv Layers, RPN layer, RoI pooling layer, RGB RoI features, bbx_pred; Noise stream input, SRM filter layer, Noise Conv Layers, Noise RoI features; Bilinear pooling, cls_pred; Splicing, Copy-move, Removal.]
Figure 2.2: Illustration of our two-stream Faster R-CNN network. The RGB stream models visual tampering artifacts, such as unusually high contrast along object edges, and regresses bounding boxes to the ground truth. The noise stream first obtains the noise feature map by passing the input RGB image through an SRM filter layer, and leverages the noise features to provide additional evidence for manipulation classification. The RGB and noise streams share the same region proposals from the RPN network, which only uses RGB features as input. The RoI pooling layer selects spatial features from both the RGB and noise streams. The predicted bounding boxes (denoted as "bbx_pred") are generated from RGB RoI features. A bilinear pooling [1, 2] layer after RoI pooling enables the network to combine the spatial co-occurrence features from the two streams. Finally, passing the results through a fully connected layer and a softmax layer, the network produces the predicted label (denoted as "cls_pred") and determines whether the predicted regions have been manipulated or not.

Recent work on image forensics utilizes clues such as local noise features [16,17] and Color Filter Array (CFA) patterns [18] to classify a specific patch or pixel [5] in an image as tampered or not, and to localize the tampered regions [18-20]. Most of these methods focus on a single tampering technique. A recently proposed architecture [21] based on a Long Short-Term Memory (LSTM) network segments tampered patches, showing robustness to multiple tampering techniques by learning to detect tampered edges. Here, we propose a novel two-stream manipulation detection framework, which not only models visual tampering artifacts (e.g., tampered artifacts near manipulated edges), but also captures inconsistencies in local noise features.
More specifically, we adopt Faster R-CNN [10] within a two-stream network and perform end-to-end training. A summary of our method is shown in Figure 2.2. Deep learning detection models like Faster R-CNN [10] have demonstrated good performance on detecting semantic objects over a range of scales. The Region Proposal Network (RPN) is the component in Faster R-CNN that is responsible for proposing image regions that are likely to contain objects of interest, and it can be adapted for image manipulation detection. To distinguish tampered regions from authentic regions, we utilize features from the RGB channels to capture clues such as visual inconsistencies at tampered boundaries and contrast differences between tampered and authentic regions.

The second stream analyzes the local noise features in an image. The intuition behind the second stream is that when an object is removed from one image (the source) and pasted into another (the target), the noise features of the source and target images are unlikely to match. These differences can be partially masked if the user subsequently compresses the tampered image [17,22].

To utilize these features, we transform the RGB image into the noise domain and use the local noise features as the input to the second stream. There are many ways to produce noise features from an image. Based on recent work on the steganalysis rich model (SRM) for manipulation classification [16,23], we select SRM filter kernels to produce the noise features and use them as the input channel to the second Faster R-CNN network. Features from these two streams are then bilinearly pooled for each Region of Interest (RoI) to detect tampering artifacts based on features from both streams (see Figure 2.2).

Previous image manipulation datasets [4,24-26] contain only several hundred images, not enough to train a deep network.
To overcome this, we created a synthetic tampering dataset based on COCO [27] for pre-training our model and then fine-tuned the model on different datasets for testing. Experimental results of our approach on four standard datasets demonstrate promising performance.

Our contribution is two-fold. First, we show how a Faster R-CNN framework can be adapted for image manipulation detection in a two-stream fashion. We explore two modalities, RGB tampering artifacts and local noise feature inconsistencies, bilinearly pooling them to identify tampered regions. Second, we show that the two streams are complementary for detecting different tampering techniques, leading to improved performance on four image manipulation datasets compared to state-of-the-art methods.

2.2 Related Work

Research on image forensics consists of various approaches to detect the low-level tampering artifacts within a tampered image, including double JPEG compression [22], CFA color array analysis [18] and local noise analysis [28]. Specifically, Bianchi et al. [22] propose a probabilistic model to estimate the DCT coefficients and quantization factors for different regions. CFA based methods analyze low-level statistics introduced by the camera's internal filter patterns under the assumption that the tampered regions disturb these patterns. Goljan et al. [18] propose a Gaussian Mixture Model (GMM) to classify CFA present regions (authentic regions) and CFA absent regions (tampered regions).

Recently, local noise feature based methods, like the steganalysis rich model (SRM) [23], have shown promising performance in image forensics tasks. These methods extract local noise features from adjacent pixels, capturing the inconsistency between tampered regions and authentic regions. Cozzolino et al. [28] explore and demonstrate the performance of SRM features in distinguishing tampered and authentic regions.
They also combine SRM features, including the quantization and truncation operations, with a Convolutional Neural Network (CNN) to perform manipulation localization [29]. Rao et al. [30] use an SRM filter kernel as initialization for a CNN to boost the detection accuracy. Most of these methods focus on specific tampering artifacts and are limited to specific tampering techniques. We also use these SRM filter kernels to extract low-level noise that is used as the input to a Faster R-CNN network, and learn to capture tampering traces from the noise features. Moreover, a parallel RGB stream is trained jointly to model mid- and high-level visual tampering artifacts.

With the success of deep learning techniques in various computer vision and image processing tasks, a number of recent techniques have also employed deep learning to address image manipulation detection. Chen et al. [31] add a low pass filter layer before a CNN to detect median filtering tampering techniques. Bayar et al. [32] change the low pass filter layer to an adaptive kernel layer to learn the filtering kernel used in tampered regions. Beyond filter learning, Zhang et al. [33] propose a stacked autoencoder to learn context features for image manipulation detection. Cozzolino et al. [19] treat this problem as an anomaly detection task and use an autoencoder based on extracted features to identify regions that are difficult to reconstruct as tampered regions. Salloum et al. [34] use a Fully Convolutional Network (FCN) framework to directly predict the tampering mask given an image. They also learn a boundary mask to guide the FCN to look at tampered edges, which helps them achieve better performance on various image manipulation datasets. Bappy et al. [21] propose an LSTM based network applied to small image patches to find the tampering artifacts on the boundaries between tampered patches and image patches.
They jointly train this network with pixel level segmentation to improve the performance and show results under different tampering techniques. However, focusing only on nearby boundaries provides limited success in different scenarios, e.g., removing the whole object might leave no boundary evidence for detection. Instead, we use global visual tampering artifacts as well as the local noise features to model richer tampering artifacts. We use a two-stream network built on Faster R-CNN to learn rich features for image manipulation detection. The network shows robustness to splicing, copy-move and removal. In addition, the network enables us to classify the suspected tampering technique.

2.3 Proposed Method

We employ a multi-task framework that simultaneously performs manipulation classification and bounding box regression. RGB images are provided to the RGB stream (the top stream in Figure 2.2), and SRM images to the noise stream (the bottom stream in Figure 2.2). We fuse the two streams through bilinear pooling before a fully connected layer for manipulation classification. The RPN uses the RGB stream to localize tampered regions.

2.3.1 RGB Stream

The RGB stream is a single Faster R-CNN network and is used both for bounding box regression and manipulation classification. We use a ResNet 101 network [35] to learn features from the input RGB image. The output features of the last convolutional layer of ResNet are used for manipulation classification.

The RPN network in the RGB stream utilizes these features to propose RoIs for bounding box regression. Formally, the loss for the RPN network is defined as

$$
L_{RPN}(g_i, f_i) = \frac{1}{N_{cls}} \sum_i L_{cls}(g_i, g_i^*) + \lambda \frac{1}{N_{reg}} \sum_i g_i^* L_{reg}(f_i, f_i^*), \tag{2.1}
$$

where $g_i$ denotes the probability of anchor $i$ being a potential manipulated region in a mini-batch, and $g_i^*$ denotes the ground-truth label for anchor $i$ to be positive.
The terms $f_i$ and $f_i^*$ are the 4-dimensional bounding box coordinates for anchor $i$ and the ground truth, respectively. $L_{cls}$ denotes the cross entropy loss for the RPN network and $L_{reg}$ denotes the smooth $L_1$ loss for regression of the proposal bounding boxes. $N_{cls}$ denotes the size of a mini-batch in the RPN network. $N_{reg}$ is the number of anchor locations. The term $\lambda$ is a hyper-parameter that balances the two losses and is set to 10. Note that in contrast to traditional object detection, whose RPN network searches for regions that are likely to be objects, our RPN network searches for regions that are likely to be manipulated. The proposed regions might not necessarily be objects, e.g., in the case of the removal tampering process.

2.3.2 Noise Stream

RGB channels are not sufficient to tackle all the different cases of manipulation. In particular, tampered images that were carefully post-processed to conceal the splicing boundary and reduce contrast differences are challenging for the RGB stream.

So, we utilize the local noise distributions of the image to provide additional evidence. In contrast to the RGB stream, the noise stream is designed to pay more attention to noise than to semantic image content. This is novel: while current deep learning models do well in representing hierarchical features from RGB image content, no prior work in deep learning has investigated learning from noise distributions in detection. Inspired by recent progress on SRM features from image forensics [23], we use SRM filters to extract the local noise features (examples shown in Figure 2.3) from RGB images as the input to our noise stream.

In our setting, noise is modeled by the residual between a pixel's value and the estimate of that pixel's value produced by interpolating only the values of neighboring pixels. Starting from 30 basic filters, along with nonlinear operations like the maximum and minimum of the nearby outputs after filtering, SRM features gather the basic noise features.
SRM quantizes and truncates the output of these filters and extracts the nearby co-occurrence information as the final features. The feature obtained from this process can be regarded as a local noise descriptor [28]. We find that using only 3 kernels achieves decent performance, and applying all 30 kernels does not give a significant performance gain. Therefore, we choose 3 kernels, whose weights are shown in Figure 2.4, and directly feed the filtered output into a pre-trained network trained on 3-channel inputs. We define the kernel size of the SRM filter layer in the noise stream to be 5 × 5 × 3. The output channel size of our SRM layer is 3.

The resulting noise feature maps after the SRM layer are shown in the third column of Figure 2.3. It is clear that they emphasize the local noise instead of image content and explicitly reveal tampering artifacts that might not be visible in the RGB channels. We directly use the noise features as the input to the noise stream network.

[Figure 2.3 columns, left to right: tampered image, visual artifacts, noise, ground truth.]
Figure 2.3: Illustration of tampering artifacts. Two examples showing tampering artifacts in the original RGB image and in the local noise features obtained by the SRM filter layer. The second column shows the amplified regions for the red bounding boxes in the first column. As shown in the second column, the unnaturally high contrast along the baseball player's edges provides a strong cue about the presence of tampering. The third column shows the local noise inconsistency between tampered regions and authentic regions. In different scenarios, visual information and noise features play a complementary role in revealing tampering artifacts.

$$
\frac{1}{4}
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & -1 & 2 & -1 & 0 \\
0 & 2 & -4 & 2 & 0 \\
0 & -1 & 2 & -1 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}
\qquad
\frac{1}{12}
\begin{bmatrix}
-1 & 2 & -2 & 2 & -1 \\
2 & -6 & 8 & -6 & 2 \\
-2 & 8 & -12 & 8 & -2 \\
2 & -6 & 8 & -6 & 2 \\
-1 & 2 & -2 & 2 & -1
\end{bmatrix}
\qquad
\frac{1}{2}
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 1 & -2 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}
$$

Figure 2.4: The three SRM filter kernels used to extract noise features.
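As a rough illustration of the SRM filter layer described above, the following sketch applies the three kernels of Figure 2.4 to a single-channel image using only the Python standard library. It is a simplified sketch under stated assumptions, not the thesis implementation: the function name `srm_noise_maps`, the single-channel input, and the edge-replication border handling are all choices made here for illustration, whereas the actual layer is a fixed 5 × 5 × 3 convolution over RGB input inside the network.

```python
# Illustrative sketch only; names and border handling are assumptions.
# Each entry is (scale factor, 5x5 integer kernel), as read from Figure 2.4.
SRM_KERNELS = [
    (1 / 4, [[0, 0, 0, 0, 0],
             [0, -1, 2, -1, 0],
             [0, 2, -4, 2, 0],
             [0, -1, 2, -1, 0],
             [0, 0, 0, 0, 0]]),
    (1 / 12, [[-1, 2, -2, 2, -1],
              [2, -6, 8, -6, 2],
              [-2, 8, -12, 8, -2],
              [2, -6, 8, -6, 2],
              [-1, 2, -2, 2, -1]]),
    (1 / 2, [[0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0],
             [0, 1, -2, 1, 0],
             [0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0]]),
]


def srm_noise_maps(gray):
    """Return three noise maps (one per kernel) for a 2-D grayscale image.

    `gray` is a list of rows of numbers. Borders are handled by replicating
    edge pixels, so each output map has the same size as the input.
    """
    h, w = len(gray), len(gray[0])

    def pixel(y, x):
        # Clamp coordinates to the image (edge replication).
        return gray[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    maps = []
    for scale, kernel in SRM_KERNELS:
        out = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for dy in range(-2, 3):
                    for dx in range(-2, 3):
                        acc += kernel[dy + 2][dx + 2] * pixel(y + dy, x + dx)
                out[y][x] = scale * acc
        maps.append(out)
    return maps
```

Because every kernel sums to zero, a perfectly flat region produces a zero response, which is one way to see why the filtered maps suppress image content and keep only local residuals.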
The backbone convolutional network architecture of the noise stream is the same as that of the RGB stream. The noise stream shares the same RoI pooling layer as the RGB stream. For bounding box regression, we only use the RGB channels because, in our experiments, RGB features perform better than noise features for the RPN network (see Table 2.1).

2.3.3 Bilinear Pooling

We finally combine the RGB stream with the noise stream for manipulation detection. Among various fusion methods, we apply bilinear pooling on features from both streams. Bilinear pooling [1], first proposed for fine-grained classification, combines streams in a two-stream CNN network while preserving spatial information to improve the detection confidence. The output of our bilinear pooling layer is $x = f_{RGB}^{T} f_N$, where $f_{RGB}$ is the RoI feature of the RGB stream and $f_N$ is the RoI feature of the noise stream. Sum pooling squeezes the spatial feature before classification. We then apply signed square root ($x \leftarrow \mathrm{sign}(x)\sqrt{|x|}$) and $L_2$ normalization before forwarding to the fully connected layer.

To save memory and speed up training without decreasing performance, we use compact bilinear pooling as proposed in [2].

After the fully connected and softmax layers, we obtain the predicted class of the RoI regions, as indicated in Figure 2.2. We use cross entropy loss for manipulation classification and smooth $L_1$ loss for bounding box regression. The total loss function is:

$$
L_{total} = L_{RPN} + L_{tamper}(f_{RGB}, f_N) + L_{bbox}(f_{RGB}), \tag{2.2}
$$

where $L_{total}$ denotes the total loss, $L_{RPN}$ denotes the RPN loss in the RPN network, $L_{tamper}$ denotes the final cross entropy classification loss, which is based on the bilinear pooling feature from both the RGB and noise streams, and $L_{bbox}$ denotes the final bounding box regression loss. $f_{RGB}$ and $f_N$ are the RoI features from the RGB and noise streams. The summation of all terms produces the total loss function.

2.3.4 Implementation Detail

The proposed network is trained end-to-end.
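As a rough illustration of the fusion step in Section 2.3.3, the sketch below computes full bilinear pooling of two RoI feature maps, followed by the signed square root and L2 normalization. It is a minimal sketch under stated assumptions: the function name `bilinear_pool` and the list-of-vectors feature layout are hypothetical, and the network itself uses the compact approximation of [2] rather than this full outer-product form.

```python
import math


def bilinear_pool(f_rgb, f_noise):
    """Full bilinear pooling of two RoI feature maps (illustrative only).

    f_rgb:   list of L spatial locations, each a vector with C1 channels.
    f_noise: list of L spatial locations, each a vector with C2 channels.
    Returns a flattened, normalized vector of length C1 * C2.
    """
    if len(f_rgb) != len(f_noise):
        raise ValueError("both streams must cover the same spatial locations")
    c1, c2 = len(f_rgb[0]), len(f_noise[0])

    # Sum pooling of the outer product over all spatial locations,
    # i.e. the entries of f_RGB^T f_N.
    pooled = [0.0] * (c1 * c2)
    for a, b in zip(f_rgb, f_noise):
        for i in range(c1):
            for j in range(c2):
                pooled[i * c2 + j] += a[i] * b[j]

    # Signed square root: x <- sign(x) * sqrt(|x|).
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]

    # L2 normalization (skipped for an all-zero vector).
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled
```

With the 7 × 7 × 1024 RoI features used in this chapter, the full form would produce a 1024 × 1024 vector per RoI, which is why a compact approximation is attractive in practice.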
The input image and the extracted noise features are resized so that the shorter side equals 600 pixels. Four anchor scales with sizes $8^2$, $16^2$, $32^2$ and $64^2$ are used, and the aspect ratios are 1:2, 1:1 and 2:1. The feature size after RoI pooling is 7 × 7 × 1024 for both the RGB and noise streams. The output feature size of compact bilinear pooling is set to 16384. The batch size of RPN proposals is 64 for training and 300 for testing. Image flipping is used for data augmentation. The Intersection-over-Union (IoU) threshold is 0.7 for RPN positive examples (potential manipulated regions) and 0.3 for negative examples (authentic regions). The learning rate is initially set to 0.001 and is then reduced to 0.0001 after 40K steps. We train our model for 110K steps.

At test time, standard Non-Maximum Suppression (NMS) is applied to reduce the redundancy of proposed overlapping regions. The NMS threshold is set to 0.2.

2.4 Experiments

We demonstrate our two-stream network on four standard image manipulation datasets and compare the results with state-of-the-art methods. We also compare different data augmentations and measure the robustness of our method to resizing and JPEG compression.

2.4.1 Pre-trained Model

Current standard datasets do not have enough data for deep neural network training. To test our network on these datasets, we pre-train our model on our synthetic dataset. We automatically create the synthetic dataset using the images and annotations from COCO [27]. We use the segmentation annotations to randomly select objects from COCO [27], and then copy and paste them into other images. The training (90%) and testing (10%) sets are split to ensure that the same background and tampered object do not appear in both sets. Finally, we create 42K tampered and authentic image pairs. We will release this dataset for research use.
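The copy-and-paste step used to build the synthetic dataset can be sketched as follows. This is a simplified sketch under stated assumptions, not the released generation code: the function name `paste_object`, the nested-list image layout, and the inclusive bounding box convention are hypothetical, and object selection, placement, and any blending are left out.

```python
def paste_object(source, mask, target, top, left):
    """Paste the masked pixels of `source` into a copy of `target`.

    source: 2-D list of pixel values (the image the object comes from).
    mask:   2-D list of 0/1 with the same size as `source` (segmentation).
    target: 2-D list of pixel values (the image to be tampered).
    top, left: where the source's top-left corner lands in the target.

    Returns (tampered, tamper_mask, bbox), where bbox is
    (x_min, y_min, x_max, y_max) with inclusive coordinates, or None
    if no pixel was pasted.
    """
    h, w = len(target), len(target[0])
    tampered = [row[:] for row in target]       # leave the input untouched
    tamper_mask = [[0] * w for _ in range(h)]
    xs, ys = [], []

    for y, mask_row in enumerate(mask):
        for x, inside in enumerate(mask_row):
            ty, tx = top + y, left + x
            # Copy only object pixels that fall inside the target image.
            if inside and 0 <= ty < h and 0 <= tx < w:
                tampered[ty][tx] = source[y][x]
                tamper_mask[ty][tx] = 1
                xs.append(tx)
                ys.append(ty)

    bbox = (min(xs), min(ys), max(xs), max(ys)) if xs else None
    return tampered, tamper_mask, bbox
```

Returning the mask and the box together reflects that a detector of this kind is trained on bounding boxes, while pixel-level evaluation needs the mask.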
The output of our model is a set of bounding boxes with confidence scores indicating whether the detected regions have been manipulated or not.

To include some authentic regions in each Region of Interest (RoI) for better comparison, we slightly enlarge the default bounding boxes by 20 pixels during training so that both the RGB and noise streams learn the inconsistency between tampered and authentic regions.

We train our model end-to-end on this synthetic dataset. The ResNet 101 used in Faster R-CNN is pre-trained on ImageNet. We use Average Precision (AP) for evaluation, computed in the same way as in COCO [27] detection evaluation. We compare the result of the two-stream network with each one of the streams in Table 2.1. This table shows that our two-stream network performs better than each single stream. Also, the comparison among RGB-N, RGB-N using noise features for the RPN, and the RPN using both features shows that RGB features are more suitable than noise features for generating region proposals.

Model                 AP (synthetic test)
RGB Net               0.445
Noise Net             0.461
RGB-N noise RPN       0.472
Noise + RGB RPN       0.620
RGB-N                 0.627

Table 2.1: AP comparison on our synthetic COCO dataset. Each row is a model architecture: RGB Net is a single Faster R-CNN using the RGB image as input; Noise Net is a single Faster R-CNN using the noise feature map as input; RGB-N noise RPN is a two-stream Faster R-CNN using noise features for the RPN network; Noise + RGB RPN is a two-stream Faster R-CNN using both noise and RGB features as the input of the RPN network; RGB-N is a two-stream Faster R-CNN using RGB features for the RPN network.

2.4.2 Testing on Standard Datasets

Datasets. We compare our method with current state-of-the-art methods on the NIST Nimble 2016 [25] (NIST16), CASIA [3,26], COVER [4] and Columbia datasets.

• NIST16 is a challenging dataset which contains all three tampering techniques. The manipulations in this dataset are post-processed to conceal visible traces. Ground-truth tampering masks are provided for evaluation.
• CASIA provides spliced and copy-moved images of various objects. The tampered regions are carefully selected, and some post-processing such as filtering and blurring is also applied. Ground-truth masks are obtained by thresholding the difference between tampered and original images. We use CASIA 2.0 for training and CASIA 1.0 for testing.
• COVER is a relatively small dataset focusing on copy-move. The pasted regions cover similar objects in order to conceal the tampering artifacts (see the second row in Figure 2.1). Ground-truth masks are provided.
• Columbia focuses on splicing based on uncompressed images. Ground-truth masks are provided.

To fine-tune our model on these datasets, we extract the bounding box from the ground-truth mask. We compare with other approaches on the same training and testing split protocol as [21] (for NIST16 and COVER) and [34] (for Columbia and CASIA). See Table 2.2.

Datasets    NIST16   CASIA   Columbia   COVER
Training    404      5123    -          75
Testing     160      921     180        25

Table 2.2: Training and testing split (number of images) for four standard datasets. Columbia is only used for testing the model trained on our synthetic dataset.

Evaluation Metric. We use pixel level F1 score and Area Under the receiver operating characteristic Curve (AUC) as our evaluation metrics for performance comparison. F1 score is a pixel level evaluation metric for image manipulation detection, as discussed in [34, 36]. We vary the threshold and use the highest F1 score as the final score for each image, which follows the same protocol as [34,36]. We assign the confidence score to every pixel in the detected bounding boxes for pixel-level AUC evaluation.

Baseline Models. We compare our proposed method with various baseline models as described below:
• ELA: An error level analysis method [37] which aims to find the compression error difference between tampered regions and authentic regions through different JPEG compression qualities.
• NOI1: A noise inconsistency based method using high pass wavelet coefficients to model local noise [38].
• CFA1: A CFA pattern estimation method [39] which uses nearby pixels to approximate the camera filter array patterns and then produces the tampering probability for each pixel.
• MFCN: A multi-task edge-enhanced FCN based network [34] jointly detecting tampered edges using edge binary masks and tampered regions using tampered region masks.
• J-LSTM: An LSTM based network [21] jointly training patch level tampered edge classification and pixel level tampered region segmentation.
• RGB Net: A single Faster R-CNN network with RGB images as input, i.e., our RGB Faster R-CNN stream.
• Noise Net: A single Faster R-CNN network with the noise feature map obtained from an SRM filter layer as input. The RPN network uses noise features in this case.
• Late Fusion: Direct fusion combining all detected bounding boxes from both RGB Net and Noise Net. The confidence scores of overlapping detected regions from the two streams are set to the maximum one.
• RGB-N: Bilinear pooling of the RGB stream and noise stream for manipulation classification, and the RGB stream for bounding box regression, i.e., our full model.

We use the F1 scores of NOI1, CFA1 and ELA reported in [34] and run the code provided by [36] to obtain the AUC results. The results of MFCN and J-LSTM are taken from the original papers, as their code is not publicly available.

                NIST16   Columbia   COVER   CASIA
ELA [37]        0.236    0.470      0.222   0.214
NOI1 [38]       0.285    0.574      0.269   0.263
CFA1 [39]       0.174    0.467      0.190   0.207
MFCN [34]       0.571    0.612      -       0.541
RGB Net         0.567    0.585      0.391   0.392
Noise Net       0.521    0.705      0.355   0.283
Late Fusion     0.625    0.681      0.371   0.397
RGB-N (ours)    0.722    0.697      0.437   0.408

Table 2.3: F1 score comparison on four standard datasets. "-" denotes that the result is not available in the literature.

                NIST16   Columbia   COVER   CASIA
ELA [37]        0.429    0.581      0.583   0.613
NOI1 [38]       0.487    0.546      0.587   0.612
CFA1 [39]       0.501    0.720      0.485   0.522
J-LSTM [21]     0.764    -          0.614   -
RGB Net         0.857    0.796      0.789   0.768
Noise Net       0.881    0.851      0.753   0.693
Late Fusion     0.924    0.856      0.793   0.777
RGB-N (ours)    0.937    0.858      0.817   0.795

Table 2.4: Pixel level AUC comparison on four standard datasets. "-" denotes that the result is not available in the literature.

[Figure 2.5 columns, left to right: tampered image, ground truth, RGB Net result, Noise Net result, RGB-N result.]
Figure 2.5: Qualitative visualization of results. The top row shows a qualitative result from the COVER dataset. The copy-moved bag confuses both the RGB Net and the Noise Net. RGB-N achieves a better detection in this case because it combines the features from the two streams. The middle row shows a qualitative result from the Columbia dataset. The RGB Net produces a more accurate result than the noise stream. Taking both streams into account produces a better result for RGB-N. The bottom row shows a qualitative result from CASIA 1.0. The spliced object leaves clear tampering artifacts in both the RGB and noise streams, which yields precise detections for the RGB, noise, and RGB-N networks.

[Figure 2.6 columns, left to right: authentic image, tampered image, noise map, ground truth, detection result.]
Figure 2.6: Qualitative results for multi-class image manipulation detection on the NIST16 dataset. The RGB image and the noise map provide different information for splicing, copy-move and removal. By combining the features from the RGB image with the noise features, RGB-N produces the correct classification for different tampering techniques.

Table 2.3 shows the F1 score comparison between our method and the baselines. Table 2.4 provides the AUC comparison. From these two tables, it is clear that our method outperforms conventional methods like ELA, NOI1 and CFA1.
This is because they all focus on specific tampering artifacts that only contain partial information for localization, which limits their performance. Our approach outperforms MFCN on the Columbia and NIST16 datasets.

One of the reasons our method achieves better performance than J-LSTM is that J-LSTM seeks tampered edges as evidence of tampering, which cannot always detect the entire tampered region. Also, our method has a larger receptive field and captures global context rather than only nearby pixels, which helps collect more cues, such as contrast differences, for manipulation classification.

As shown in Tables 2.3 and 2.4, our RGB-N network also improves on the individual streams for all the datasets except Columbia. Columbia only contains uncompressed spliced regions, which preserves noise differences so well that it is sufficient to use only the noise features. This yields satisfactory performance for the noise stream.

For all datasets, late fusion performs worse than RGB-N, which shows the effectiveness of our fusion approach.

Data Augmentation. We compare different data augmentation methods in Table 2.5. Compared with no augmentation, image flipping improves the performance, while other augmentation methods like JPEG compression and noise contribute little improvement.

F1/AUC             NIST16        COVER         CASIA
Flipping + JPEG    0.712/0.950   0.425/0.810   0.413/0.785
Flipping + noise   0.717/0.947   0.412/0.801   0.396/0.776
Flipping           0.722/0.937   0.437/0.817   0.408/0.795
No flipping        0.716/0.940   0.312/0.793   0.361/0.766

Table 2.5: Data augmentation comparison. Flipping: image flipping. JPEG: JPEG compression with quality 70. Noise: adding Gaussian noise with variance of 5. Each entry is the F1/AUC score.

Robustness to JPEG and Resizing Attacks. We test the robustness of our method and compare it with 3 methods (whose code is available) in Table 2.6. Our method is more robust to these attacks and outperforms other methods.
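The pixel-level F1 protocol described under Evaluation Metric (vary the threshold and keep the highest F1 for each image) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the evaluation code of [34, 36]: the function name `best_pixel_f1` is hypothetical, and using the distinct score values as candidate thresholds is a choice made here for illustration.

```python
def best_pixel_f1(scores, truth, thresholds=None):
    """Highest pixel-level F1 over a set of thresholds (illustrative only).

    scores: 2-D list of per-pixel confidence values.
    truth:  2-D list of 0/1 ground-truth labels of the same size.
    thresholds: optional iterable of thresholds; by default every distinct
        score value is tried, which covers all distinct binarizations.
    """
    flat_scores = [s for row in scores for s in row]
    flat_truth = [t for row in truth for t in row]
    if thresholds is None:
        thresholds = sorted(set(flat_scores))

    best = 0.0
    for th in thresholds:
        tp = fp = fn = 0
        for s, t in zip(flat_scores, flat_truth):
            predicted = s >= th
            if predicted and t:
                tp += 1
            elif predicted:
                fp += 1
            elif t:
                fn += 1
        denom = 2 * tp + fp + fn
        if denom:
            # F1 = 2TP / (2TP + FP + FN)
            best = max(best, 2 * tp / denom)
    return best
```

Taking the best threshold per image makes the score insensitive to how each method calibrates its confidence values, so the comparison reflects localization quality rather than threshold tuning.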
2.4.3 Manipulation Technique Detection

The rich feature representation of our network enables it to distinguish between different manipulation techniques as well. We explore manipulation technique detection and analyze the detection performance for all three tampering techniques. NIST16 contains the labels for all three tampering techniques, which enables multi-class image manipulation detection. We change the classes for manipulation classification to be splicing, removal and copy-move so as to learn distinct visual tampering artifacts and noise features for each class. The performance for each tampering class is shown in Table 2.7.

JPEG/Resizing   100/1         70/0.7        50/0.5
NOI1            0.285/0.285   0.142/0.147   0.140/0.155
ELA             0.236/0.236   0.119/0.141   0.114/0.114
CFA1            0.174/0.174   0.152/0.134   0.139/0.141
RGB-N           0.722/0.722   0.677/0.689   0.677/0.681

Table 2.6: F1 score on the NIST16 dataset for JPEG compression (with quality 70 and 50) and resizing (with scale 0.7 and 0.5) attacks. Each entry is the F1 score of JPEG/Resizing.

      Splicing   Removal   Copy-Move   Mean
AP    0.960      0.939     0.903       0.934

Table 2.7: AP comparison for multi-class detection on the NIST16 dataset using the RGB-N network. Mean denotes the mean AP for splicing, removal and copy-move.

The AP result in Table 2.7 indicates that splicing is the easiest manipulation technique to detect using our method. This is because splicing has a high probability of producing both RGB artifacts, such as unnatural edges and contrast differences, and noise artifacts. Removal detection performance also beats copy-move because the inpainting that follows the removal process has a large effect on the noise features, as shown in Figure 2.3. Copy-move is the most difficult tampering technique for our proposed method. The explanation is that, on one hand, the copied regions are from the same image, which yields a similar noise distribution that confuses our noise stream. On the other hand, the two regions generally have the same contrast.
Also, the technique would ideally need to compare the two objects to each other (i.e., it would need to find and compare two RoIs at the same time), which the current approach does not do. Thus, our RGB stream has less evidence to distinguish between the two regions.

2.4.4 Qualitative Result

We show some qualitative results in Figure 2.5, comparing the RGB, noise and RGB-N networks on two-class image manipulation detection. The images are selected from COVER, Columbia and CASIA 1.0. Figure 2.5 provides examples for which our two-stream network yields good performance even if one of the single streams fails (the first and second rows in Figure 2.5).

Figure 2.6 shows the results of the RGB-N network on the manipulation technique detection task using NIST16. As shown in the figure, our network produces accurate results for different tampering techniques.

2.5 Conclusion

We propose a novel network using both an RGB stream and a noise stream to learn rich features for image manipulation detection. We extract noise features with an SRM filter layer adapted from the steganalysis literature, which enables our model to capture noise inconsistency between tampered and authentic regions. We explore the complementary contributions of the RGB and noise features of an image to finding tampered regions. Not surprisingly, the fusion of the two streams leads to improved performance. Experiments on standard datasets show that our method not only detects tampering artifacts but also distinguishes between various tampering techniques. More features, including JPEG compression, will be explored in the future.

Chapter 3: Deep Video Inpainting Detection

3.1 Introduction

Video inpainting, which completes corrupted or missing regions in a video sequence, has achieved impressive progress over the years [40-48].
The ability to produce realistic videos that can be used in applications like video restoration and virtual reality, while appealing, brings significant security concerns at the same time, since these techniques can also be used maliciously. By removing objects that could serve as evidence, malicious inpainting can have serious legal and social implications, including swaying a jury and accelerating the spread of misinformation on social platforms. Our goal in this work is to develop a framework for detecting inpainted videos constructed with state-of-the-art methods (see Fig. 3.1 for a conceptual overview).

Figure 3.1: Problem introduction. Columns show the original frame, the inpainted frame (input), our prediction and the ground truth, with frames ordered in time. Given an inpainted video (second column), we localize the inpainted region both spatially and temporally.

Although there are recent studies on detecting tampered regions in images [6, 49-51], very limited effort has been devoted to video inpainting detection. Existing image-based manipulation detection approaches either focus on spliced regions or "deepfake"-style face replacement rather than inpainting-based object removal, or they are designed specifically for images [52, 53] and suffer from poor performance on videos. Therefore, it is important to learn robust video representations that explore the temporal relationships among frames for video inpainting detection.

In light of this, we introduce VIDNet, a video inpainting detection network, which is an encoder-decoder architecture with a quad-directional local attention module to predict inpainted regions in videos (as shown in Fig. 3.2). In particular, at each time step, VIDNet feeds the current RGB frame together with its corresponding Error Level Analysis (ELA) [54] frame to the encoder, which is truncated from a pretrained VGG network [55].
Since videos are compressed based on discrete cosine transforms (DCT) and extracted frames are usually stored in JPEG format, we leverage ELA images as an additional signal to reveal artifacts such as compression inconsistency (as shown in Fig. 3.3). Instead of using ELA images directly, which tends to produce false alarms, we extract features from both ELA and RGB images with the encoder, producing five multimodal features at different scales that are jointly trained for inpainting detection.

Figure 3.2: Framework overview. Given an RGB frame in a video, we first derive its corresponding ELA frame and compute multimodal features at different scales with both frames. We also introduce a quad-directional local attention (QDLA) module (striped) to the last encoded RGB features (colored blue) to explore spatial relationships among pixels from four directions. These encoded features are further input into a multi-layer ConvLSTM (colored green) for decoding, exploiting spatial and temporal relationships explicitly, to produce masks of inpainted regions. See text for more details.

In addition, given a missing region to fill in, inpainting methods leverage information from the pixels surrounding the region to make it spatially coherent. Motivated by this, for RGB features from the last layer of the encoder, we introduce a quad-directional local attention module that attends to the neighbors of a pixel to detect whether that pixel is inpainted or not. This allows us to explicitly model spatial dependencies among different pixels to identify inpainted pixels. Finally, with multimodal features encoded at different scales, we leverage a four-layer Convolutional LSTM, serving as a decoder for inpainting detection.
More specifically, the ConvLSTM at a given layer takes in not only features from the previous time step but also features upsampled from a coarser level (i.e., a lower decoding layer). In this way, both spatial relationships across different scales and temporal dynamics are leveraged to produce inpainting masks over time. The framework is trained end-to-end with backpropagation.

We conduct experiments on the DAVIS 2016 dataset [56] and the Free-form Video Inpainting dataset [44]. VIDNet successfully detects inpainted regions under all settings and outperforms competing methods by clear margins. We also show that VIDNet generalizes to out-of-domain inpainted videos that are unseen during training.

Our contributions can be summarized as follows: 1) We target a relatively new task; to the best of our knowledge, we introduce the first learning-based approach for video inpainting detection. 2) We present an end-to-end framework for video inpainting detection, which models spatial and temporal relationships in videos. 3) We leverage multimodal features, i.e., RGB and ELA features, at different scales for video inpainting detection. 4) We introduce a quad-directional local attention module to explicitly determine whether a pixel is inpainted by attending to its neighbours.

3.2 Related Work

Video Inpainting. With the advance of recent image inpainting approaches [46-48, 57-62], more recent studies have investigated video inpainting. There are two lines of work: patch-based and learning-based approaches. Among patch-based approaches, PatchMatch [63] is a prominent method that iteratively searches for similar patches in the surrounding region to complete the inpainted region. To achieve better quality, Huang et al. [45] explore an optimization-based method to match patches and utilize information including color and flow as regularization.
On the other hand, learning-based approaches have been explored recently. Wang et al. [64] propose a 3D encoder-decoder structure for video inpainting. Afterwards, Xu et al. [42] leverage optical flow information to guide the inpainting in videos in both forward and backward passes. Similarly, Kim et al. [41] propose to estimate the proceeding flow as an additional constraint while completing the missing regions. To maintain more frame pixels, Oh et al. [43] use gated convolution to inpaint video frames gradually from the reference frame. Lee et al. [40] copy and paste future frames to complete missing details in the current frame. In contrast, our approach detects regions inpainted by these approaches.

Manipulation Detection. There are also approaches focusing on manipulation detection. Most tackle splicing-based manipulation and use clues specific to it [19, 51, 65, 66]. In particular, Zhou et al. [49] use both RGB and local noise to detect potential regions. Salloum et al. [67] rely on boundary artifacts to reveal manipulated regions in a multi-task learning fashion, and Zhou et al. [68] improve its generalization ability with a generative model. Huh et al. [6] use meta-data to find inconsistent patches, and Wu et al. [50] treat the problem as anomaly detection to learn features in a self-supervised manner.

More related to our work are methods for image inpainting detection. [53] is a classical approach that searches for similar patches matched by zero-connectivity. However, its high false alarm rate limits its application in real scenarios. More recently, Zhu et al. [69] use CNNs to localize inpainted patches within images. Li et al. [52] explore High Pass Filtering (HPF) as the initialization of CNNs in order to distinguish the high-frequency noise of natural images from that of inpainted ones. However, generalization and robustness are limited, as these HPFs are learned for specific inpainting methods.
In contrast, we combine both RGB information and ELA features as inputs to VIDNet, and show that our approach generalizes to different inpainting methods. In addition, without temporal guidance, the methods above cannot guarantee temporally consistent predictions as our approach does.

3.3 Approach

VIDNet, Video Inpainting Detection Network, is an encoder-decoder architecture (see Fig. 3.2 for an overview of the framework) operating on multimodal features to detect inpainted regions. In addition to RGB video frames, VIDNet utilizes Error Level Analysis frames (Sec. 3.3.1) to identify artifacts incurred during the inpainting process. Motivated by the fact that inpainting methods typically borrow information from neighbouring pixels of the region to be inpainted, we introduce a multi-head local attention module (Sec. 3.3.2) which uses adjacent pixels to discover inpainting traces. Finally, we model the temporal relations among different frames with a ConvLSTM (Sec. 3.3.3). In the following, we describe the components of the model.

Figure 3.3: ELA frame example. From top to bottom: the inpainted RGB frame, its corresponding ELA frame, and the ground-truth inpainting mask. The inpainting artifacts, e.g., the dog, person and ship, stand out in ELA space while not easily seen in the RGB space.

3.3.1 Multimodal Features

Learning a mapping directly from an inpainted RGB frame to a mask that encloses the removed object, while feasible, is challenging, since the RGB space is intentionally modified by replacing regions with their surrounding pixels to appear realistic. To mitigate this issue, we augment RGB information with error level analysis features [54], which are designed to reveal regions with inconsistent compression artifacts in compressed JPEG images. Note that although videos are usually compressed in MPEG formats, extracted frames are often stored in JPEG format.
More formally, an ELA image is defined as:

\[ I_{ELA} = \lVert I - I_{jpg} \rVert_1, \tag{3.1} \]

where $I_{ELA}$ is the ELA image, $I$ denotes the original image and $I_{jpg}$ denotes the JPEG image recompressed from the original image. Fig. 3.3 illustrates the ELA images of sampled inpainted frames.

Figure 3.4: The quad-directional local attention module (diagram blocks: Feature, Conv, Sigmoid, ConvLSTM). Given RGB features from the last layer of the encoder, we derive attention maps with a quad-directional local attention module. To detect whether a pixel is inpainted or not, the module attends to its neighbors from four directions (left-to-right, up-to-down, right-to-left and down-to-up).

Although ELA images have been used in forensics applications [36, 66], they tend to create false alarms when other artifacts, such as sharp boundaries, are present in the images, which requires ad-hoc judgement to determine whether a region is tampered. So, instead of using only ELA frames, we augment them with RGB frames as inputs to our encoder. In particular, both the RGB and ELA frames are input to a two-stream encoder. Each stream, based on a VGG encoder, transforms the input image into high-level representations with five layers, yielding five feature representations at different scales. At each scale, we normalize the corresponding RGB and ELA features with $\ell_2$ normalization, and then apply one convolutional layer to absorb both features into a unified representation:

\[ f_l = \mathrm{ReLU}\big(F\big([\, \lVert f_l^{RGB} \rVert_2 \mid \lVert f_l^{ELA} \rVert_2 \,]\big)\big) \quad (l < 5), \tag{3.2} \]

where $[\,\cdot \mid \cdot\,]$ denotes feature concatenation and $f_l$ denotes the feature at the $l$-th layer. $f_l^{RGB}$ and $f_l^{ELA}$ denote the RGB and ELA features at layer $l$, respectively. $F$ represents the convolutional layer and ReLU denotes the activation function. The fused representation at each level is further used for decoding. For $l = 5$, we simply use RGB features, as we find that high-level ELA features are not helpful.
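As a rough illustration of Eq. 3.1, the sketch below computes an error-level map as the per-pixel absolute difference between an image and a recompressed copy. It is a simplified stand-in, not the pipeline used in this work: `toy_recompress` is a hypothetical placeholder that uses coarse quantisation instead of real JPEG recompression, and images are single-channel nested lists.

```python
def error_level_analysis(image, recompress):
    """Per-pixel absolute difference between an image and its
    recompressed copy, following the form of Eq. 3.1."""
    recompressed = recompress(image)
    return [[abs(a - b) for a, b in zip(row, row_rc)]
            for row, row_rc in zip(image, recompressed)]


def toy_recompress(image, step=16):
    """Hypothetical stand-in for JPEG recompression: uniform quantisation.
    A real ELA frame would come from an actual JPEG encode/decode cycle."""
    return [[(px // step) * step for px in row] for row in image]


frame = [[0, 17, 40], [255, 128, 31]]
ela = error_level_analysis(frame, toy_recompress)
```

Passing the recompression step in as a function keeps the difference computation independent of the codec, so a real JPEG round trip at a chosen quality factor could be substituted without changing `error_level_analysis`.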
3.3.2 Quad-Directional Local Attention

Inpainting methods aim to replace a region with pixels from its surrounding areas for a photorealistic visual effect. Therefore, when determining whether a pixel is inpainted or not, it is important to examine its surrounding pixels. Inspired by recursive filtering techniques that model pixel relations from four directions for edge-preserving smoothing, we introduce a quad-directional local attention module to explore spatial relations among adjacent pixels. We learn four attention maps for four directions (left-to-right, right-to-left, top-to-bottom and bottom-to-top), and each map determines how much information to leverage from the pixels in the corresponding direction. More specifically, we use $F^{\rightarrow}$, $F^{\leftarrow}$, $F^{\downarrow}$ and $F^{\uparrow}$ to denote the functions that derive attention maps for the left-to-right, right-to-left, top-to-bottom and bottom-to-top directions, respectively. In the following, we consider the left-to-right direction for simplicity. Given features $f_5$ from the last layer of the RGB stream, we first transform the features with $F^{\rightarrow}$ to have the same dimension as $f_5$, and then compute an attention map $A^{\rightarrow}$:

\[ A^{\rightarrow} = \sigma\big(F^{\rightarrow}(f_5; W^{\rightarrow})\big), \tag{3.3} \]

where $W^{\rightarrow}$ denotes the weights of the convolutional kernel, and $\sigma$ is the sigmoid function, which ensures that the attention weight at each pixel is in the range $[0, 1]$. Then, for each pixel in the feature map, we obtain information from the surrounding pixels as:

\[ f_5^{\rightarrow}[k] = (1 - A^{\rightarrow}[k])\, f_5[k] + A^{\rightarrow}[k]\, f_5[k-1], \tag{3.4} \]

where $k$ denotes the location of the pixel. Since we are considering attention from the left-to-right direction, $k - 1$ indicates the pixel to the left of $k$. The current value of pixel $k$ is updated with information from its neighboring pixel, and the weight $A^{\rightarrow}$ that balances the contribution is derived with convolution, which aggregates information from a small grid in the original features. As a result, we attend to a small local region to compute the refined representation. We can derive $f_5^{\leftarrow}$, $f_5^{\downarrow}$ and $f_5^{\uparrow}$
similarly, and thus we have four different refined representations.

Note that the quad-directional attention module is similar in spirit to recursive filtering. However, in standard recursive filtering, a weight matrix in the form of an edge map [70] or a weighted map [71] plays the role of our attention map $A$, guiding the filtering to restore images or smooth feature maps. In contrast, our filtering can be considered a form of self-attention: we derive attention maps by modeling similarities in a local region with convolutions conditioned on the input features, and the resulting maps are in turn used to refine the features, allowing each pixel to borrow information by attending to its adjacent pixels. In addition, the motivation of our approach can be seen as the "reverse" of recursive filtering: in recursive filtering, information from surrounding pixels is diffused to make local regions coherent, whereas we wish to detect inconsistent pixels by attending to a neighboring region. Furthermore, we compute four refined feature maps for the four directions in parallel, conditioned on the same feature map. An alternative is to generate a single feature representation by sequentially performing attention in four directions, i.e., $f_5^{\rightarrow}$ is used as input to generate $f_5^{\leftarrow}$, and so on, as in [70]. However, we find through experiments that the parallel multi-head approach offers better results, possibly due to the disentanglement of different directions.

3.3.3 ConvLSTM Decoder

Temporal information, such as inconsistency in the inpainted region over time, is a significant clue for video inpainting detection. To explore temporal relationships among adjacent frames, we use multiple ConvLSTM decoding layers that take features from the encoders and produce the predicted detection results, which enables message passing from previous frames. More specifically, the decoder contains four ConvLSTM layers to process features from different spatial scales.
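Before the decoder inputs are described in detail, the directional update of Sec. 3.3.2 (Eqs. 3.3 and 3.4) can be made concrete with a one-dimensional toy sketch. It is an illustration under stated assumptions rather than the module itself: features are a single row of scalars, the convolution is a zero-padded 1-D kernel, border pixels without a neighbour are left unchanged, and the update reads the unrefined neighbour exactly as Eq. 3.4 is written. The names `attention_map` and `directional_refine` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_map(features, kernel, bias=0.0):
    """Toy version of Eq. 3.3: zero-padded 1-D convolution, then a sigmoid."""
    half = len(kernel) // 2
    n = len(features)
    out = []
    for k in range(n):
        acc = bias
        for j, w in enumerate(kernel):
            idx = k + j - half
            if 0 <= idx < n:
                acc += w * features[idx]
        out.append(sigmoid(acc))
    return out


def directional_refine(features, attention, step):
    """Eq. 3.4 along one row. step=-1 reads the left neighbour
    (left-to-right); step=+1 reads the right neighbour (right-to-left)."""
    n = len(features)
    refined = []
    for k in range(n):
        nb = k + step
        if 0 <= nb < n:
            refined.append((1.0 - attention[k]) * features[k]
                           + attention[k] * features[nb])
        else:
            refined.append(features[k])  # assumption: borders stay unchanged
    return refined


row = [1.0, 3.0, 5.0]
att = attention_map(row, kernel=[0.0, 0.0, 0.0])  # zero weights give A = 0.5
left_to_right = directional_refine(row, att, step=-1)
right_to_left = directional_refine(row, att, step=+1)
```

With an attention weight of 0.5 everywhere, each pixel becomes the average of itself and its neighbour in the chosen direction; a weight near 0 keeps the pixel as it is, and a weight near 1 copies the neighbour. The two vertical directions would apply the same update along columns.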
At each time step, taking into account both spatial and temporal information, we concatenate the skip-connected feature of the current frame and the upsampled feature from a lower level as the input to the current ConvLSTM layer. More formally, for the $t$-th time step, the $i$-th ($2 \le i \le 4$) ConvLSTM computes the hidden states and cell contents for the $(t+1)$-th time step as:

\[ h_i^{t+1},\, c_i^{t+1} = \mathrm{ConvLSTM}_i\big(g_i^t,\, h_i^t,\, c_i^t\big), \tag{3.5} \]

\[ g_i^t = \big[\, U(h_{i-1}^t) \mid f_{6-i}^t \,\big], \tag{3.6} \]

where $h_i^t$ and $c_i^t$ denote the hidden states and cell states of the $i$-th ConvLSTM, respectively, and $U$ denotes bilinear upsampling, which maps the outputs of a lower-level ConvLSTM with smaller feature maps to the same dimension as the current one. In addition, $f_{6-i}^t$ is the skip-connected feature of frame $t$ from the encoder. When $i = 1$, the first layer of the ConvLSTM takes features from the last layer of the encoder, i.e., $f_5$, as inputs. Recall that we obtain four refined features based on $f_5$ with our quad-directional local attention module to identify pixels that are inconsistent with their neighbours from four directions. Thus, we use these refined features as inputs to $\mathrm{ConvLSTM}_1$. We input them into the LSTM in the order $f_5^{\rightarrow}$, $f_5^{\leftarrow}$, $f_5^{\downarrow}$ and $f_5^{\uparrow}$ to obtain all four directional features. At each time step, we compute $g_5^t$ with Eqn. 3.6 to produce a prediction $p^t$ for each QDLA direction via one convolutional layer. Finally, to explore non-linear relations among these four directional outputs, we fuse them with one additional convolutional layer to form the final prediction.

During training, we divide each video into $N$ clips of equal length. To encourage more intersection with the binary ground truth mask, we use the IoU score [72] as our loss function, which is formulated as:

\[ L(p, y) = 1 - \frac{\sum_{m=1,w=1}^{H,W} p_{m,w} \cdot y_{m,w}}{\sum_{m=1,w=1}^{H,W} p_{m,w} + \sum_{m=1,w=1}^{H,W} y_{m,w} - \sum_{m=1,w=1}^{H,W} p_{m,w} \cdot y_{m,w} + \epsilon}, \tag{3.7} \]

where $p$ and $y$ denote the prediction and the binary ground truth mask, respectively. $H$ and $W$ denote the height and width, respectively. $\epsilon$ denotes a small number to avoid division by zero. The loss is updated once the ConvLSTM decoder has gone through a single video clip to collect temporal information. By exploring spatial and temporal information recurrently, predictions of inpainted regions become more accurate.

3.3.4 Implementation Details

We use PyTorch for implementation. Our model is trained on a NVIDIA GeForce TITAN P6000. The input to the network is resized to $240 \times 427$. The length of our video clips is set to 3 frames during training. To extract ELA frames, we recompress the corresponding RGB frames with quality factor 50 and compute their difference. Our feature extraction backbone is VGG-16 [55] for both RGB and ELA features. To increase the generalization ability, we add instance normalization [73] layers to the backbone. The encoder is initialized from a VGG-16 model pretrained on ImageNet [74], and the decoder is initialized with Xavier initialization [75]. We concatenate RGB and ELA features up to the penultimate encoding layer. Afterwards, the features are passed through one convolutional and normalization layer that halves the dimension to reduce the number of training parameters. The QDLA module is added only to the last encoder layer to extract directional feature information, based on the ablation results in Sec. 3.4. The decoder is a 4-layer ConvLSTM. We use the Adam [76] optimizer with a fixed learning rate of $1 \times 10^{-4}$ for the encoder and $1 \times 10^{-3}$ for the decoder. The optimizers of the encoder and decoder networks are updated in an alternating fashion. To avoid overfitting, weight decay with a factor of $5 \times 10^{-5}$ and 50% dropout [77] are applied. Only random horizontal flipping is applied as augmentation during training. We train the whole network end-to-end for 40 epochs with a batch size of 4.
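The IoU loss of Sec. 3.3.3 can be sketched in a few lines. This is a plain-Python illustration for a single frame, assuming the soft-IoU form in which the intersection is the sum of element-wise products and the small constant appears only in the denominator; the function name `iou_loss` and the default `eps` value are illustrative choices, not taken from the implementation described above.

```python
def iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss for one frame.

    pred:   H x W nested list of probabilities in [0, 1]
    target: H x W nested list of binary ground-truth labels
    """
    inter = sum(p * y
                for p_row, y_row in zip(pred, target)
                for p, y in zip(p_row, y_row))
    sum_p = sum(p for row in pred for p in row)
    sum_y = sum(y for row in target for y in row)
    union = sum_p + sum_y - inter
    return 1.0 - inter / (union + eps)  # eps guards against an empty union


perfect = iou_loss([[1.0, 0.0], [0.0, 1.0]], [[1, 0], [0, 1]])
disjoint = iou_loss([[1.0, 0.0]], [[0, 1]])
partial = iou_loss([[0.5, 0.5]], [[1, 0]])
```

Because the prediction enters as probabilities rather than thresholded labels, the expression stays differentiable with respect to `pred`. In a tensor framework the same sums would be taken over the spatial dimensions of each frame in a clip.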
3.4 Experiment

In this section, we compare VIDNet with approaches for manipulation and image inpainting detection to show the advantages of our approach for video inpainting detection. We also analyze the robustness of our approach under different perturbations and show both quantitative and qualitative results.

3.4.1 Experiment setup

Dataset and Evaluation Metrics. DAVIS 2016 [56], which consists of 30 videos for training and 20 videos for testing, is the most common benchmark for video inpainting, so we evaluate our approach on it for inpainting detection. We generate inpainted videos using state-of-the-art video inpainting approaches: VI [41], OP [43] and CP [40], with the ground truth object mask as reference. To show both performance and generalization, we choose two of the three inpainted versions of DAVIS for training and testing, leaving the third for additional testing. The training/testing split follows the DAVIS default setting. We report the F1 score and the mean Intersection over Union (IoU) with respect to the ground truth mask as evaluation metrics. We compare our method with the following approaches:

NOI [38]: A traditional approach that looks for regions of inconsistent noise as the clue of manipulation. The code for evaluation is from Zampoglou et al. [66]. We directly test on the VI, OP and CP test sets, as it is unsupervised.

CFA [65]: An approach that estimates the Color Filter Array (CFA) and regards regions with different CFA patterns as manipulated. We directly test on the VI, OP and CP test sets, as it is unsupervised.

HPF [52]: A learning-based image inpainting detection approach that applies one high pass filter layer as an initialization to reveal high-frequency inpainting artifacts. We implement their filter kernel and train the network frame-by-frame from ImageNet pretrained weights for comparison.

GSR-Net [68]: A generic image manipulation segmentation approach that applies generative models and exploits boundary artifacts to improve the generalization ability.
We use their released code and retrain on inpainted DAVIS frame-by-frame for evaluation.

Ours RGB (baseline): Our baseline approach, which takes only the RGB frame as input. No QDLA module is applied.

VIDNet-BN (ours): The batch normalization [78] version of our approach.

VIDNet-IN (ours): We report this as our main result. It replaces batch normalization in the encoder with instance normalization to improve generalization across different video inpainting algorithms.

3.4.2 Main Results

Tables 3.1, 3.2 and 3.3 highlight our advantages over other methods. For all three settings, our IN version outperforms other approaches on both trained and untrained inpainting algorithms, showing the generalization of our approach. Additionally, we show clear improvement over our baseline, indicating the effectiveness of our proposed ELA features and QDLA module. Comparing across different inpainting algorithms, the performance degrades on the untrained algorithms, indicating a domain shift between trained and untrained inpainting algorithms. However, benefiting from diverse features and more focus on proximity regions, our method still generalizes better than other approaches. Finally, the results indicate that our BN version generally performs better on the in-domain training inpainting algorithms, while the IN version shows better generalization on the cross-domain one. Therefore, we provide both results as a trade-off between in-domain performance and generalization.

Table 3.1: Mean IoU and F1 score comparison on inpainted DAVIS. "*" denotes that the model is trained on these inpainting algorithms.
| Methods | VI* IoU | VI* F1 | OP* IoU | OP* F1 | CP IoU | CP F1 |
|---|---|---|---|---|---|---|
| NOI [38] | 0.082 | 0.137 | 0.090 | 0.137 | 0.072 | 0.132 |
| CFA [65] | 0.103 | 0.142 | 0.083 | 0.137 | 0.076 | 0.121 |
| HPF [52] | 0.456 | 0.568 | 0.494 | 0.615 | 0.458 | 0.577 |
| GSR-Net [68] | 0.571 | 0.693 | 0.500 | 0.626 | 0.509 | 0.634 |
| Ours RGB (baseline) | 0.552 | 0.671 | 0.456 | 0.580 | 0.493 | 0.625 |
| VIDNet-BN (ours) | 0.620 | 0.726 | 0.749 | 0.833 | 0.670 | 0.775 |
| VIDNet-IN (ours) | 0.585 | 0.704 | 0.588 | 0.707 | 0.565 | 0.685 |

Table 3.2: Mean IoU and F1 score comparison on inpainted DAVIS. "*" denotes that the model is trained on these inpainting algorithms.

| Methods | VI IoU | VI F1 | OP* IoU | OP* F1 | CP* IoU | CP* F1 |
|---|---|---|---|---|---|---|
| NOI [38] | 0.082 | 0.137 | 0.090 | 0.137 | 0.072 | 0.132 |
| CFA [65] | 0.103 | 0.142 | 0.083 | 0.137 | 0.076 | 0.121 |
| HPF [52] | 0.342 | 0.444 | 0.409 | 0.510 | 0.676 | 0.773 |
| GSR-Net [68] | 0.302 | 0.426 | 0.736 | 0.818 | 0.801 | 0.849 |
| Ours RGB (baseline) | 0.308 | 0.417 | 0.705 | 0.773 | 0.777 | 0.859 |
| VIDNet-BN (ours) | 0.301 | 0.415 | 0.801 | 0.860 | 0.837 | 0.915 |
| VIDNet-IN (ours) | 0.386 | 0.493 | 0.740 | 0.820 | 0.810 | 0.869 |

Table 3.3: Mean IoU and F1 score comparison on inpainted DAVIS. "*" denotes that the model is trained on these inpainting algorithms.

| Methods | VI* IoU | VI* F1 | OP IoU | OP F1 | CP* IoU | CP* F1 |
|---|---|---|---|---|---|---|
| NOI [38] | 0.082 | 0.137 | 0.090 | 0.137 | 0.072 | 0.132 |
| CFA [65] | 0.103 | 0.142 | 0.083 | 0.137 | 0.076 | 0.121 |
| HPF [52] | 0.551 | 0.671 | 0.186 | 0.286 | 0.690 | 0.796 |
| GSR-Net [68] | 0.588 | 0.703 | 0.221 | 0.329 | 0.700 | 0.765 |
| Ours RGB (baseline) | 0.582 | 0.689 | 0.196 | 0.305 | 0.753 | 0.846 |
| VIDNet-BN (ours) | 0.578 | 0.695 | 0.231 | 0.323 | 0.753 | 0.848 |
| VIDNet-IN (ours) | 0.592 | 0.712 | 0.245 | 0.344 | 0.760 | 0.850 |

3.4.3 Ablation Analysis

We analyze the importance of each key component in our framework. The variants are as follows:

Ours ELA: The baseline architecture, which takes only the ELA frame as input.

Ours RF edge: Similar to Chen et al. [70], we add an additional edge branch and apply a recursive filter to the final prediction. The output of the edge branch is used as the reference for the recursive filter layer. The loss function of the edge branch is a weighted binary cross entropy loss.

Ours w/o ELA: The baseline with QDLA applied in the last encoder layer. This is our full model without the ELA features.

QDLA both features: Our full model, except that the input to the QDLA module is the concatenation of the RGB and ELA features from the 5-th layer.

QDLA all layers: Applying the QDLA module to all 5 encoding feature layers.

Ours frame-by-frame: Instead of training with a video clip length of 3, we train our full model frame-by-frame.

Ours w/o QDLA: Adding ELA features to the encoder and concatenating them with the RGB features. The decoder follows the baseline, which uses only temporal information as the additional feature. This is our full model without the QDLA module.

Table 3.4: Ablation analysis for each component of our approach. "*" denotes that the model is trained on these inpainting algorithms.

| Methods | VI* IoU | VI* F1 | OP* IoU | OP* F1 | CP IoU | CP F1 |
|---|---|---|---|---|---|---|
| Ours ELA | 0.460 | 0.578 | 0.509 | 0.631 | 0.417 | 0.546 |
| Ours RGB (baseline) | 0.552 | 0.671 | 0.456 | 0.580 | 0.493 | 0.625 |
| Ours RF edge | 0.540 | 0.661 | 0.460 | 0.591 | 0.555 | 0.670 |
| QDLA both features | 0.555 | 0.680 | 0.580 | 0.700 | 0.495 | 0.635 |
| Ours w/o QDLA | 0.559 | 0.682 | 0.557 | 0.681 | 0.512 | 0.644 |
| Ours frame-by-frame | 0.558 | 0.683 | 0.566 | 0.688 | 0.532 | 0.664 |
| Ours w/o ELA | 0.568 | 0.691 | 0.465 | 0.595 | 0.560 | 0.678 |
| QDLA all layers | 0.570 | 0.693 | 0.469 | 0.585 | 0.564 | 0.682 |
| VIDNet-IN (ours) | 0.585 | 0.704 | 0.588 | 0.707 | 0.565 | 0.685 |

Table 3.4 displays the comparison results. Compared to the baseline, the ELA feature alone yields worse performance. This is perhaps because the ELA frame also contains other artifacts, such as sharp boundaries, which lead to confusion without proper guidance from RGB content. Adding the QDLA module introduces feature adjacency relationships and thus leads to improvement. However, the comparison with QDLA all layers suggests that higher-level features are more useful for QDLA than lower-level ones, and the comparison with QDLA both features suggests that high-level ELA features are less helpful than lower-level ones.
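The IoU and F1 figures reported in these tables are mask-overlap scores between a predicted mask and the ground truth. A minimal sketch of how such scores can be computed for binary masks is given below. The 0.5 threshold, the per-frame averaging and the convention of scoring an empty prediction against an empty ground truth as 1.0 are assumptions made for illustration; the text above does not specify them, and the helper names are hypothetical.

```python
def mask_scores(pred, target, threshold=0.5):
    """Return (IoU, F1) for one frame after thresholding the prediction."""
    tp = fp = fn = 0
    for p_row, y_row in zip(pred, target):
        for p, y in zip(p_row, y_row):
            p_on, y_on = p >= threshold, y >= 0.5
            if p_on and y_on:
                tp += 1
            elif p_on:
                fp += 1
            elif y_on:
                fn += 1
    if tp + fp + fn == 0:
        return 1.0, 1.0  # assumption: empty prediction vs empty mask counts as perfect
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


def mean_scores(preds, targets, threshold=0.5):
    """Average IoU and F1 over a list of frames."""
    scores = [mask_scores(p, y, threshold) for p, y in zip(preds, targets)]
    n = len(scores)
    return sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n


iou, f1 = mask_scores([[0.9, 0.8, 0.1, 0.2]], [[1, 0, 1, 0]])
mean_iou, mean_f1 = mean_scores(
    [[[0.9, 0.8, 0.1, 0.2]], [[1.0, 0.0]]],
    [[[1, 0, 1, 0]], [[1, 0]]],
)
```

For the same counts, F1 is never lower than IoU, which matches the pattern in the tables, where the F1 column is always at least as large as the IoU column.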
Compared to Ours RF edge, our QDLA module (Ours w/o ELA) yields better performance because boundary prediction degrades in the video inpainting scenario, so the edge map that guides the segmentation branch contains false positives. In addition, the comparison between Ours frame-by-frame and our final model verifies the importance of temporal information in video inpainting detection. Finally, with the QDLA module, ELA features and temporal information combined, the performance is boosted further.

3.4.4 Robustness Analysis

To test the robustness of our approach under noise and JPEG perturbation, we conduct the experiments shown in Fig. 3.5. We add Gaussian noise to the input frames at a Signal-to-Noise Ratio (SNR) of 30 and 20 dB and evaluate on these noisy frames, or recompress the test frames with JPEG quality 90 and 70 for perturbation. Moreover, to study the effect of specific augmentation in video inpainting detection, we apply noise and JPEG augmentation to our approach and include the results in the comparison. The details of our augmentation are as follows.

VID-Noise-Aug: Randomly apply Gaussian noise with SNR 20 dB to the input frames during training.

VID-JPEG-Aug: Randomly apply JPEG compression with quality factor 90 to the input frames during training.

Figure 3.5: Mean IoU comparison under different perturbations: (a) JPEG perturbation (VI*, OP*, CP) and (b) noise perturbation (VI*, OP*, CP). Perturbation in JPEG compression uses quality factors 90 and 70; perturbation in noise uses SNR 30 dB and 20 dB. Columns from left to right show the results on VI, OP and CP inpainting. "*" denotes that the model is trained on these inpainting algorithms.

The robustness of our approach stands out under different perturbations. Compared to other approaches, HPF suffers more from perturbation because more high-frequency noise is introduced. With generative models for augmentation, GSR-Net shows good robustness. However, our approach outperforms GSR-Net, as more modalities of video inpainting clues are considered. Even though adding noise augmentation results in a small degradation in the initial performance, the robustness to both noise and JPEG perturbation is improved. A similar observation holds for JPEG augmentation.

Table 3.5: Mean IoU and F1 score comparison on FVI. The results are directly tested on the FVI dataset, and all models are trained on VI and OP inpainted DAVIS.

| Methods | FVI IoU | FVI F1 |
|---|---|---|
| NOI [38] | 0.062 | 0.107 |
| CFA [65] | 0.073 | 0.122 |
| HPF [52] | 0.205 | 0.285 |
| GSR-Net [68] | 0.195 | 0.288 |
| Ours RGB (baseline) | 0.156 | 0.223 |
| VIDNet-IN (ours) | 0.257 | 0.367 |

3.4.5 Results on Free-form Video Inpainting Dataset

To further test the performance on a different dataset, we provide additional evaluation on the Free-form Video Inpainting (FVI) dataset. The FVI dataset [44] provides 100 test videos, which mostly target multi-instance object removal. We directly apply the approach of [44], which leverages a 3D gated convolution encoder-decoder architecture for video inpainting, to generate the 100 inpainted videos. To test the generalization of our approach, we directly test the models trained on VI and OP inpainted DAVIS.

Figure 3.6: Qualitative visualization on DAVIS (panels: input frame, HPF, GSR-Net, ours, ground truth). The first row shows the inpainted video frame. The second to fourth rows show the final predictions from different methods. The fifth row is the ground truth.

Table 3.5 displays the comparison results. Since both the dataset and the inpainting approach are different, the performance degrades due to the domain shift. However, compared to other approaches, our method still generalizes better by a large margin. Also, compared with our baseline model, which uses only RGB features, our approach shows clear improvement.
The gain over the RGB-only baseline further validates the effectiveness of combining RGB and ELA features and of introducing spatial and temporal information as additional evidence.

3.4.6 Qualitative Results

Fig. 3.6 visualizes our predictions against those of other methods under the same setting. Thanks to the spatial clues provided by our ELA and RGB features, our approach clearly obtains predictions closer to the ground truth than other methods. Specifically, HPF only transfers RGB into the noise domain, which makes it prone to false alarms. GSR-Net makes decisions frame by frame, which makes its results less temporally consistent. In contrast, by exploiting temporal information, our predictions maintain temporal consistency.

3.5 Conclusions

We introduce learning-based video inpainting detection in this chapter. To reveal more inpainting artifacts from different domains, we propose to extract both RGB and ELA features and concatenate them. Additionally, we encourage learning from adjacent features in a self-attended manner by introducing the QDLA module. With both the adjacent spatial and temporal information, we make the final prediction through a ConvLSTM-based decoder. Our experiments validate the effectiveness of our approach both in-domain and cross-domain. As the results show, a clear gap in generalization and robustness remains, so the problem is far from solved. Domain adaptation strategies might be a remedy for this issue, which we leave for future research.

Chapter 4: Generate, Segment and Refine: Towards Generic Manipulation Segmentation

4.1 Introduction

Manipulated photos are becoming ubiquitous on social media due to the availability of advanced editing software, including powerful generative adversarial models [79,80]. While such images have been created for a variety of purposes, including memes, satires, etc., there are growing concerns about the abuse of manipulated images to spread fake news and misinformation.
To this end, a variety of solutions have been developed for detecting such manipulated images. While a number of proposed solutions pose the problem as a classification task [16,51], where the goal is to classify whether a given image has been tampered with, there is great utility in solutions that are capable of detecting manipulated regions in a given image [6,16,81,82]. In this chapter, we similarly treat this problem as a semantic segmentation task and adapt GANs [83] to generate samples to alleviate the lack of training data.

Figure 4.1: Examples of manipulated images across different datasets. Columns from left to right are images from CASIA [3], COVER [4], Carvalho [5], and In-The-Wild [6]. The odd rows are manipulated images and the even rows are the ground truth masks. The datasets differ in content distribution (from animals to people), manipulation technique (from copy-move in the second column to splicing in the remaining columns) and post-processing (from no post-processing to various processes including filtering, illumination, and blurring).

The lack of training data has been an ongoing problem for training models to detect manipulated images. Scouring the internet for "real" tampered images [84] is a laborious process that often leads to over-fitting in the training process.
[Figure: pipeline diagram with panels (a) Generate Stage, (b) Segment Stage, (c) Detailed Structure of [...], (d) Refine Stage. Legible labels: Copy Pasting, Tampered Image S, Ground Truth K, Hard Example, Target Image T, Copy-Pasted image M, Generated image G(M), Discriminator D, Segmentation Network, DeepLab VGG-16, Input Image, Predicted Boundary P, Final Prediction.]

Alternatively, one could employ a self-supervised process, where detected objects in one image are spliced onto another, with the caveat that such
RLQ3oDLjWQDhUG23dfRpsI/7v/z82af04ndYLWW2oHbg1gfcNzajumUX/32Hwuri8VtHzBNjMfUfLWNLsotz8Ud4wr2fDdU/uL2zNafv2WLQ3VDzjHQ3hWGg3mfupBI27g/c8jaX0HniYHb/ffj1LUoWMUNLsotz8Rd4wr2fDdU/uL2zNafv2Wb1ff/BHjLQ3oDLjWQDhUG23dfRpsI/7v/z82af04ndYLWW2oHbg1gfcNzajumUX/32Hwuri8VtHzBNjMfUfLi3XHjzmggc2WHn0Y8/aIp73GfQjh3LDULQVuMoTFSs7g06FKD8yxm2DoNXqsnXdVTORs7P2FySBqDf2ZyHoRzYoCi2UZbgS2U40SQB2/vft6OS8CTVcOXy7/XBMXFcyTD8qOdtsvF2qQs0f462SgCZV2OCZYybc8Toz8OvtHRo20QiU04g2VxbZC2SUOYyby/B2DXyVOKoBC6Sy2gfqsfPSFdsRNnqyDm6DFTMFsS7sbB2U/OXSyxDbOVKUV0BiCRooSHyz68gofZ2y8sTSqFS6FDPmsNRndRnPqSNfD2mgyyDoFB6KcDZXToMFz8OvtHRo20QiU04g2VxbZC2SUOYsbfM ZYl0muCm0TjHJBB8jkeHkUFcfpmZrZDCnbGyr2LJ00ZN5wW0nmP8Q0RJHUHy2bk0TZCZu8BcpLrmnDGmfrkeFBJj0CjlYmxIA7Zr0K6+n7hpsswYwyVAikk8wHsYYkpYh8sZnH+r7CKnZi0Q65Z77nriIkxBArpJwyN75hnwWw0U8PmWPwRIQ607UhJwHw20HTyu0mbnkfCjT2Zl8VZKu+psBpcsmkLVryGJn0DRrmm0fnFNkpexjrBZJ0jK0+CsmplsYkpVQyRyPbmC8Z0ZWpnc5LNGwDpmAFxeIBrj70Zm6Yiw ZoXrkl+2ZXJ/xy2+8EpZcFplYyRZfxVkgZpHT1WTANJIrQo5nprbbl1ARZZR310RZlnAsK7x71oXAZ1kJlTrx2RZk+5J3/XX+yp2YxH8IEn+RZAceprp2opXZZWkylxrV21ZN+JJp/lX1y12lxn82E1+kZZcXp8pTlFFfygRZYpfFxRZfkZggVppZHWZA1TWrToAQNrTbIbrRJZoZ501ZQRpsr7nKbpleboAZRl12Z+R/Zy1x0E3ZZpllRyAYsxnk7VxHK122lecrJI5QornplbbR1ARZZ031lRZsnAxKppe2NAT 4c13E79AFR6b8BdI+s9ejEuVU3OzCi6pDlAiekTmt+AVPXvF8Ww4eTk4MIYuzYC3>>> 7TZx7lnsfG0_Mxcr7gZlha4tKt"7_gsa=a4t6teG41=h" Ki4ehaZ<7Z7TcrMs0bf1nh6 eiseaab<<=la"AiANA8Bc613Ci1cMb"ZcDiLSSygEMExNFbIDbAP51kF5uAtttF6bpSLRNr8eEGi21EEp21ZFESUbG6XABMR2eA6AE2SMrAY6Cp7uRbDMyDaZSZZaNDj7TYJSD6ERlXiUEZ8ENiL8tL56utEF3PBIAx>g=L8ZkcH3ABlAR>X=E82k1H2AHlcAAlEN21EDlEjJTSNZDayCR7SYreE6XRBEGUpr6tSt6u6FA1AP"bMI5F2xcMAgeSBLGDZZEbil8ElDAJHTkj8N=Z>SAaBy3DLR67tCFYPlIExDgJLTZjcciU 
KEGfwN0LYdzRQu68G5PTzWjWD4rP/GxD5vmzmDDtXNfbPa2Kb2tulxcrb7tO+w7VwfOO6hJjaJFJtadF4JvhzjEO5OW6uwYVDfDwP+r7Tt876OLrfcGbmlzuxx/2GbDtj2WaPKGb5fQPzXRt0NwDNmKGtKEGfwN0LYdzRQu68G5PTzWjWD4rP/GxD5vmzmDDtXNfbPa2Kb2tulxcrb7tO+w7VwfOO6hJjaJFJtadF4JvhzjEO5OW6uwYVDfDwP+r7Tt876OLrfcmbmlzuxx/2GbDtj2WaPKGb5fQPzXRt0NwDNKtwPJKALoLFA0nuuivfm3abjTt2HEJZhCyXRZvZAKJLSj6n3tqJ0H8uOJALRhVvnywwtRamKvGK9FFEEPAK9KGvRmawtwvynRVhJLA8OuJH03qtjn6JSLZKPThZCX+CxZ2EjWGT3bfF+fuit3Q0oF14FrPAra44rFF1o31QFt3F0+QfuWtGijFCfx++3TfhbPWJTSG623jqE0C8ZOxACR+VXnTwZthaZKPGK9FFEEPAK9KGvRmawtwvynRVhJLA8OuJH03qtjn6JSLZKPThZCX+ExZGj2bTW+f3ifFQtuF03F1oAr4aaAC s3++V5nGoEapJ7vnxCFGsQMxS4l6QzGM1amnz311G8xP1q+JmIFxziU++dsanPMSJRSyVMlmGmUoGB+Vp5G56OnFPhxhaWyloC5+hY+z99H9QdUvqGFH+hpUGa6QnSPQxUaUyDo+51hq+N9ohGSFDiNsi33+55EG7ECpQ74nzCaG3Q8xq4I6izdMPaRnM3m1B85POqhJWICxYi9+ddGahPaSSRUyDM1mNmGoiB3V55E57OCFQh4hzWal3C8+qYIzi9d9PdRvMGmHBh5UOahQWSCQYU9UdDH+U1QqQNUo+GqFoiFv UxLFrJmYQdhwshpd8lB8mmohplCEulRidhhflmrZfY8+pPNYwmxUTrmLhsQ8BpopmrmdlffZ+r8NYPmpwTYyVQfBZmHQkIJYeZEVjUGO6zDIlCumRPFD8fwrmYYYhrhslIJVEYdNlZdfipl8IQgUzz6UDjGwUmOpV8E+jrelYmZdIokBJpHhmmQLBxfgZ6VGyOQETegZ+idldEJlhhYmw8FRuY8NpPmYwTQyVZfBQmHJkIZYejEQyVZfBQmHJkIZxULrmQhsp8BrmpdrmfluCRwFY8mfrlJVEDdzlUdgi6lGIOoClynALtvB9AaPAAHvJnteHHBn39/3zyKnQGtp7lh8ta0y3iKMMJnEIdPLiCJNxdji/9CMj5ZgpYoq9cYFnYefLnnBAj8pP8ALpY9ptCjxCJdiBPEIMnnMiKa3f08tphY7uGyQFK3zatcoqZY/gj5/MH9jiNdKzyunQ7Gp8lh0taiy3MKMEJnPIdCLixJt9BCYpAPALp8AnvjnBenHfFY9oacYqZj/gM5/jH9diNufBniMK3JIPLiCJtxBC9pAYALP8ApvjEnenHnY9FacoMqZY/gj5t30zhy7KtnMQlu8Gapy/d9HjiNd=tixetal/<==Q>tixetal/<==Q>tixetal/<==Q>tix Image M 0 Network YIuE"AnN_8KS0NeBWZ/bVcGiuX49aaavNMmrUbTGhCE6WB4AQAHAc>R"==6IsHbf1Chs tetvaWnY9EAal"4HbuWKCMEbMGhuV4TaKaUt/tmxueN6AAAAA"BI6fXNimcUbTZhBENWS48QNHAcERI=Y6ns9ba1vhW rireWav"=IHfCNum/UKTVhMECWb4GQGH0cuR"=46esab_1ahs 
tixetaloJxmR2NhkIzRnKiQXKtLC7EjaimjK4ZoRJsmN2zhNInRXKyQtKELa7ajKiQj40CZqs7Nvy6auQlqu69uqrrbxbE1bEizDFb/CgCu1w1wMoEonQ3uzVypKxF+O81x/LKozTgxHDPzufAwABwP<=t/waxejio>7EKCotQXRiQnIz2kuNmR>xVoitaxpt/lPxto0 New Prediction Figure 4.2: GSR-Net framework overview. (a) Given a tampered image S, an authentic target image T , and the ground truth mask K, the generation stage generates hard example G(M) starting from a simple copy-pasting image M . (b) Feeding the training images, copy-pasted images or generated images as input, the segmentation stage learns to segment the boundary artifacts and fill the interior to produce the final prediction. (c) The segmentation network concatenates lower level features to predict boundary artifacts and then concatenate back the boundary feature to the segmentation branch for final prediction. (d) The refinement stage creates a novel tampered image with new boundary artifacts by replacing the pre- dicted manipulated boundaries of segmentation stage with original authentic regions and learns to make a new prediction. a process often results in training images that are not realistic. Of course, the best approach for generating training samples is to employ professional labelers to create realistic looking manipulated images, but this remains a very tedious process. It is therefore not surprising that existing datasets [3?6,26] are often not comprehensive enough to train models that generalize well. Additionally, in contrast to standard semantic image segmentation, correctly segmenting manipulated regions depends more on visual artifacts that are often 53 Real or Fake? created at the boundaries of manipulated regions than on semantic content [21, 49]. Several challenges exist in recognizing these boundary artifacts. First, the space of manipulations is very diverse. 
One can, for example, do a copy-move, which copies and pastes image regions within the same image (the second column in Figure 4.1), or a splice, which copies a region from one image and pastes it into another image (the remaining columns in Figure 4.1). Second, a variety of post-processing operations such as compression, blurring, and various color transformations makes it harder to detect boundary artifacts caused by tampering. See Figure 4.1 for some examples. Most existing methods [6, 49, 81, 82] that utilize discriminative features such as image metadata, noise models, or color artifacts due to, for example, Color Filter Array (CFA) inconsistencies, have failed to generalize well for these reasons.

This paper introduces a two-pronged approach to (1) address the lack of comprehensive training data and (2) focus the training process on learning to recognize boundary artifacts better. We adopt GANs to address (1), but instead of relying on prior GAN methods [79, 85, 86] that mainly explore image-level manipulation, we introduce a novel objective function that optimizes for the realism of the manipulated regions by blending tampered regions in existing datasets to assist segmentation. That is, given an annotated image from an existing dataset, our GAN takes the given annotated regions and optimizes a blending-based objective function to enhance the realism of the regions. Blending has been shown to be effective in creating training images for the task of object detection [87], and this forms our main motivation in formulating our GAN.

To address (2), we propose a segmentation and refinement procedure. The segmentation stage localizes manipulated regions by learning to spot boundary artifacts. To further prevent the network from focusing only on semantic content, the refinement stage replaces the predicted manipulation boundaries with authentic background and feeds the new manipulated images back to the segmentation network.
We will show empirically that the segmentation and refinement procedure has the effect of focusing the model's attention on boundary artifacts during learning (see Table 4.2).

We design an architecture called GSR-Net which includes these three components: a generation stage, a segmentation stage, and a refinement stage. The architecture of GSR-Net is shown in Figure 4.2. During training, we alternately train the generation GAN, followed by the segmentation and refinement stages, which take as input the output of the generation stage as well as images from the training datasets. The additional varieties of manipulation artifacts provided by both the generation and refinement stages produce models that exhibit very good generalization ability. We evaluate GSR-Net on four public benchmarks and show that it outperforms state-of-the-art methods. Experiments with two different post-processing attacks further demonstrate the robustness of GSR-Net. In summary, the contributions of this paper are: 1) a framework that augments existing datasets in a way that specifically addresses the main weaknesses of current approaches without requiring new annotation effort; 2) a generation stage with a novel objective function based on blending for generating images effective for training models to detect tampered regions; and 3) a novel refinement stage that encourages the learning of boundary artifacts inherent in manipulated regions, which, to the best of our knowledge, no prior work in this field has utilized to help training.

4.2 Related Work

Image Manipulation Segmentation. [81] train a network to find JPEG compression discrepancies between manipulated and authentic regions. [16, 49] harness noise features to find inconsistencies within a manipulated image. [6] treat the problem as anomaly segmentation and use metadata to locate abnormal patches.
The features used in these works are based on the assumption that manipulated regions come from a different image, which is not the case in copy-move manipulation. In contrast, our method directly focuses on general artifacts in the RGB channels without specific feature extraction and thus can be applied to copy-move segmentation. More closely related works [82] and [21] show the potential of boundary artifacts across different image manipulation techniques. These methods motivate us to exploit boundary artifacts as a strong cue for detecting manipulations. [21] design a Long Short-Term Memory (LSTM) [88] based network to identify RGB boundary artifacts at both the patch and pixel level. [82] adapt a Multi-task Fully Convolutional Network (MFCN) to manipulation segmentation by providing both segmentation and edge annotations. Instead of applying hole filling on the edge prediction to perform late fusion, our segmentation stage fuses edge information with the segmentation branch early to improve segmentation results.

GAN Based Image Editing. GAN based image editing approaches have emerged rapidly, and impressive results have been demonstrated recently [85, 86, 89-91]. Prior and concurrent works force the output of the GAN to be conditioned on input images through extra regression losses (for example, $\ell_2$ loss) or discrete labels. However, these methods manipulate whole images and do not fully explore region-based manipulation. In contrast, our GAN manipulates minor regions and is a better fit for manipulation segmentation, where only minor regions have been manipulated. A more closely related work [89] generates natural composite images using both scene parsing and harmonized ground truth. Even though it targets region manipulation, experimental results show that our method performs better in terms of assisting segmentation.

Adversarial Training. Discriminative feature learning has motivated recent research on adversarial training for several tasks.
[92] propose a simulated and unsupervised learning approach which utilizes synthetic images to generate realistic images. An online hard negative generation network [93] boosts performance on occluded and deformed objects. [94] investigate an adversarial erasing approach to learn dense and complete semantic segmentation. [95] propose an adversarial shadow attenuation network to make correct predictions on hard shadow examples. However, these approaches are difficult to adapt to manipulation segmentation because they either generate whole synthetic images or leave artifacts on erased regions. In contrast, we replace manipulated regions with the original ones so that the replaced regions become authentic.

4.3 Approach

We describe GSR-Net in detail in the following sections. A key to the generation stage is a GAN with a loss function centered on blending, which optimizes for producing realistic training images. The segmentation and refinement stages are specially designed to single out the boundaries of the manipulated regions in order to guide the training process to pay extra attention to boundary artifacts.

4.3.1 Generation

Generator. Referring to Figure 4.2 (a), the generator is given as input both copy-pasted images and ground truth masks. To prepare the input images, we start with the training samples in manipulation datasets (for example, CASIA 2.0 [3]). Given a training image S, the corresponding ground truth binary mask K, and an authentic target image T from a clean dataset (for example, COCO [27]), we first create a simple copy-pasted image M by taking S as foreground and T as background:

$$M = K \odot S + (1 - K) \odot T, \tag{4.1}$$

where $\odot$ represents pointwise multiplication.

In Poisson blending [96], the final value of pixel $i$ in the manipulated regions is

$$b_i = \arg\min_{b_i} \sum_{s_i \in S,\, N_i \subset S} \|\nabla b_i - \nabla s_i\|_2 + \sum_{s_i \in S,\, N_i \not\subset S} \|b_i - t_i\|_2, \tag{4.2}$$

where $\nabla$ denotes the gradient, $N_i$ is the neighborhood (for example, up, down, left, and right) of the pixel at position $i$, $b_i$ is the pixel in the blended image B, $s_i$ is the pixel in S, and $t_i$ is the pixel in T.

Similar to Poisson blending, we optimize the generator to blend neighborhoods in the resulting image, which now contains copy-pasted regions and background regions. A key part of our loss function enforces the shapes of the tampered regions while maintaining the background regions. To maintain the background regions, we utilize an $\ell_1$ loss to reconstruct the background:

$$L_{bg} = \frac{1}{N_{bg}} \sum_{t_i \in T,\, k_i = 0} \|m_i - t_i\|_1, \tag{4.3}$$

where $N_{bg}$ is the total number of pixels in the background, $m_i$ is the pixel in M, and $k_i$ is the value in mask K at position $i$. To maintain the shape of the manipulated regions, we apply a Laplacian operator to the pasted regions and reconstruct the gradient of this region to match the source region:

$$L_{grad} = \frac{1}{N_{fg}} \sum_{s_i \in S,\, k_i = 1} \|\Delta m_i - \Delta s_i\|_1, \tag{4.4}$$

where $\Delta$ denotes the Laplacian operator and $N_{fg}$ is the total number of pixels in the pasted regions. To further constrain the shape of the pasted regions, we add an additional edge loss:

$$L_{edge} = \frac{1}{N_{edge}} \sum_{s_i \in S,\, e_i = 1} \|m_i - s_i\|_1, \tag{4.5}$$

where $N_{edge}$ is the number of boundary pixels and $e_i$ is the value of the edge mask at position $i$; the edge mask is obtained as the absolute difference between a dilation and an erosion of K. To generate realistic manipulated images, we add an adversarial loss $L_{adv}$, as explained below, which encourages the generator to produce increasingly realistic images as training progresses.

Discriminator. A crucial detail in our discriminator is that the manipulated regions typically occupy only a small area of the image. Hence, it is beneficial to restrict the GAN discriminator's attention to the structure in local image patches. This is reminiscent of PatchGAN [79], which only penalizes structure at the scale of patches.
Similar to PatchGAN, our discriminator applies a final fully convolutional layer at a patch scale of $N \times N$. The discriminator distinguishes the authentic image T as real and the generated image G(K, M) as fake by maximizing

$$L_{adv} = \mathbb{E}_T[\log(D(K, T))] + \mathbb{E}_M[1 - \log(D(K, G(K, M)))], \tag{4.6}$$

where K is concatenated with G(K, M) or T as the input to the discriminator (for simplicity, K is not shown in the discriminator input in Figure 4.2 (a)). The final loss function of the generator is

$$L_G = L_{bg} + \lambda_{grad} L_{grad} + \lambda_{edge} L_{edge} + \lambda_{adv} L_{adv}, \tag{4.7}$$

where $\lambda_{grad}$, $\lambda_{edge}$, and $\lambda_{adv}$ are parameters that control the importance of the corresponding loss terms. Under these constraints, the generator preserves the background and the texture information of the pasted regions while blending the manipulated regions with the background, and can therefore be applied to generate both splicing and copy-move examples. It could also potentially be used to generate removal examples by setting $\lambda_{grad}$ and $\lambda_{edge}$ to zero, so that the generator learns to inpaint the missing regions, creating images with removal manipulation.

4.3.2 Segmentation

For segmentation, we simply adapt the publicly available VGG-16 [97] based DeepLab model [70] to include boundary information. The network structure is depicted in Figure 4.2 (c), consisting of a boundary branch predicting the manipulated boundaries and a segmentation branch predicting the interior. In particular, to enhance attention on boundary artifacts, we introduce boundary information by subtracting the erosion from the dilation of the binary ground truth mask to obtain the boundary mask.
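The dilation-minus-erosion mask just described, together with the copy-paste composite of Eq. (4.1) and the reconstruction terms of Eqs. (4.3)-(4.5), can be illustrated on a toy single-channel image. The following is a minimal sketch in plain Python rather than the implementation used in this work: the 3x3 structuring element, the 4-neighbour Laplacian with zero padding, and all function names (`composite`, `dilate`, `erode`, `boundary_mask`, `laplacian`, `masked_l1`) are assumptions made for illustration. The losses are evaluated on the copy-pasted image M, following the definition of $m_i$ in the text; the adversarial term is omitted.

```python
def composite(S, T, K):
    """Eq. (4.1): M = K*S + (1-K)*T, applied pixel by pixel."""
    return [[k * s + (1 - k) * t for s, t, k in zip(rs, rt, rk)]
            for rs, rt, rk in zip(S, T, K)]


def dilate(mask):
    """Binary dilation with a 3x3 square element (the size is an assumption)."""
    h, w = len(mask), len(mask[0])
    return [[int(any(mask[yy][xx]
                     for yy in range(max(0, y - 1), min(h, y + 2))
                     for xx in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]


def erode(mask):
    """Binary erosion with the same element; outside the image counts as 0."""
    h, w = len(mask), len(mask[0])

    def at(y, x):
        return mask[y][x] if 0 <= y < h and 0 <= x < w else 0

    return [[int(all(at(y + dy, x + dx)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
             for x in range(w)] for y in range(h)]


def boundary_mask(mask):
    """Dilation minus erosion: a band that straddles the region boundary."""
    return [[d - e for d, e in zip(r_d, r_e)]
            for r_d, r_e in zip(dilate(mask), erode(mask))]


def laplacian(img):
    """4-neighbour discrete Laplacian with zero padding (assumed discretisation)."""
    h, w = len(img), len(img[0])

    def at(y, x):
        return img[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    return [[at(y - 1, x) + at(y + 1, x) + at(y, x - 1) + at(y, x + 1)
             - 4 * img[y][x]
             for x in range(w)] for y in range(h)]


def masked_l1(A, B, mask):
    """Mean absolute difference over the pixels where mask == 1."""
    diffs = [abs(a - b) for ra, rb, rm in zip(A, B, mask)
             for a, b, m in zip(ra, rb, rm) if m == 1]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Toy data: a bright 3x3 patch of S pasted onto a dark 7x7 target T.
N = 7
S = [[9.0] * N for _ in range(N)]
T = [[1.0] * N for _ in range(N)]
K = [[1 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(N)]
     for y in range(N)]

M = composite(S, T, K)                      # Eq. (4.1)
E = boundary_mask(K)                        # edge / boundary mask
background = [[1 - k for k in row] for row in K]

loss_bg = masked_l1(M, T, background)                   # Eq. (4.3)
loss_grad = masked_l1(laplacian(M), laplacian(S), K)    # Eq. (4.4)
loss_edge = masked_l1(M, S, E)                          # Eq. (4.5)
```

On this toy input the background term is zero by construction, while the gradient term is driven entirely by the pixels on the border of the pasted patch, where the Laplacian of M differs from that of S. A real pipeline would use an array or deep learning library instead of nested lists.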
We then predict this boundary mask by concatenating bilinearly up-sampled intermediate features and passing them to a 1×1 convolutional layer to form the boundary branch. Finally, we concatenate the output features of the boundary branch with the up-sampled features of the segmentation branch. Empirically, we noticed that such multi-task learning helps the generalization of the final model. Only the segmentation branch output after boundary feature concatenation is used for evaluation during inference. During training, we select the copy-pasted examples M, generated examples G(M), and training samples S in the dataset as input to the segmentation network, which provides a larger variety of manipulation. The loss function of the segmentation network is an averaged two-class softmax cross-entropy loss.

Dataset                  Carvalho        In-The-Wild     COVER           CASIA
Metric                   MCC     F1      MCC     F1      MCC     F1      MCC     F1
NoI [38]                 0.255   0.343   0.159   0.278   0.172   0.269   0.180   0.263
CFA [39]                 0.164   0.292   0.144   0.270   0.050   0.190   0.108   0.207
MFCN [82]                0.408   0.480   -       -       -       -       0.520   0.541
RGB-N [49]               0.261   0.383   0.290   0.424   0.334   0.379   0.364   0.408
EXIF-consistency [6]*    0.420   0.520   0.415   0.504   0.102   0.276   0.127   0.204
DeepLab (baseline)       0.343   0.420   0.352   0.472   0.304   0.376   0.435   0.474
GSR-Net (ours)           0.462   0.525   0.446   0.555   0.439   0.489   0.553   0.574

Table 4.1: MCC and F1 score comparison on four standard datasets. "-" denotes that the result is not available in the literature. * Our method is 1600 times faster than EXIF-consistency.

4.3.3 Refinement

The goal of the refinement stage is to draw attention to the boundary artifacts during training, taking into account the fact that boundary artifacts play a more pivotal role than semantic content in detecting manipulations [21, 49]. While we may be able to employ prior erasing-based adversarial mining methods [93, 94], they are not suitable for our purpose because they would introduce artifacts on the erased regions that should become authentic background.
Instead, the refinement stage utilizes the prediction of the segmentation stage to produce new boundary artifacts by replacing the predicted boundaries with the original regions. As illustrated in Figure 4.2 (d), given an authentic target image T into which the manipulated regions were inserted, the manipulated image M (which could also be the generated image G(M)), and the manipulated boundary prediction P from the segmentation stage, we replace the pixels in the predicted boundaries with the authentic regions in T and create a novel manipulated image:

$$M' = T \odot P + M \odot (1 - P), \tag{4.8}$$

where $M'$ is the novel manipulated image with new boundary artifacts. The corresponding segmentation ground truth now becomes

$$K' = K - K \odot P, \tag{4.9}$$

where $K'$ is the new manipulated mask for $M'$. The new boundary artifact mask can be extracted in the same way as in the previous step. Notice that the refinement stage utilizes the target images T to help training, providing more side information for spotting the artifacts. Taking the new manipulated images as input, the same segmentation network in Figure 4.2 (c) then learns to predict the new manipulated boundaries and interior regions.

In addition to augmenting boundary artifacts, the refinement stage also mines hard examples during training. Since the refinement stage is based on predictions from the previous stage, hard examples in which the manipulated regions are not predicted remain the same after the replacement operation. As a result, these hard examples carry more weight during training after being fed back to the segmentation network. Similar to [94], multiple refinement operations are possible, and there is a tradeoff between training time and performance. The difference, however, is that the segmentation network in the refinement stage shares weights with that in the segmentation stage. This weight sharing enables us to use a single segmentation network at inference.
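The replacement step of Eqs. (4.8) and (4.9) can be sketched on a single image row. This is an illustrative example only, assuming that the boundary prediction P has already been binarized; the function name `refine` and the toy values are hypothetical.

```python
def refine(T, M, K, P):
    """Apply Eqs. (4.8) and (4.9) pixel by pixel.

    T: authentic target image, M: manipulated image,
    K: ground truth mask, P: binarized boundary prediction (assumed 0/1).
    """
    M_new = [[t * p + m * (1 - p) for t, m, p in zip(rt, rm, rp)]
             for rt, rm, rp in zip(T, M, P)]
    K_new = [[k - k * p for k, p in zip(rk, rp)]
             for rk, rp in zip(K, P)]
    return M_new, K_new


# One-row toy example: a four-pixel pasted region on a flat background.
T = [[1, 1, 1, 1, 1, 1]]
M = [[1, 9, 9, 9, 9, 1]]
K = [[0, 1, 1, 1, 1, 0]]

# Case 1: the boundary is predicted, so it is overwritten with authentic
# pixels and the manipulated region (and its mask) shrinks.
P = [[1, 1, 0, 0, 1, 1]]
M_new, K_new = refine(T, M, K, P)

# Case 2: a hard example where nothing is predicted stays unchanged.
P_missed = [[0, 0, 0, 0, 0, 0]]
M_same, K_same = refine(T, M, K, P_missed)
```

In the first case the pasted region shrinks from four pixels to two, which exposes a new boundary for the segmentation network. In the second case the example is returned unchanged, consistent with the hard-example behaviour described above.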
As a result, the network learns to focus more attention on boundary artifacts with no additional cost at inference time.

4.4 Experiments

We evaluate the performance of GSR-Net on four public benchmarks and compare it with state-of-the-art methods. We also analyze its robustness under several attacks.

4.4.1 Datasets and Experiment Setting

Datasets. We evaluate our performance on four datasets: In-The-Wild [6], COVER [4], CASIA 1.0 [26], and Carvalho [5].

Evaluation Metrics. We use the pixel-level F1 score and MCC as the evaluation metrics when comparing with other approaches. For a fair comparison, following the same measurement protocol as [6, 49, 82], we vary the prediction threshold to obtain a binary prediction mask and report the optimal score over the whole dataset.

4.4.2 Main Results

In this section, we present our results for the task of manipulation segmentation. We fine-tune our model on CASIA 2.0 starting from the ImageNet pre-trained model and directly test its performance on the aforementioned four datasets. We compare with the methods described below:

• NoI [38]: A noise inconsistency method which predicts regions as manipulated where the local noise is inconsistent with authentic regions. We use the released code [36] for evaluation.
• CFA [39]: A CFA based method which estimates the internal CFA pattern of the camera for every patch in the image and segments out the regions with anomalous CFA features as manipulated regions. The evaluation code is publicly available [36].
• RGB-N [49]: A two-stream Faster R-CNN based approach which combines features from the RGB and noise channels to make the final prediction. We train the model on CASIA 2.0 using the code provided by the authors.
• MFCN [82]: A multi-task FCN based method which harnesses both an edge mask and a segmentation mask for manipulation segmentation. Hole filling is applied to the edge branch to make the prediction. The final decision is the intersection of the two branches. We directly report the results from the paper since the code is not publicly available.
• EXIF-consistency [6]: A self-consistency approach which utilizes metadata to learn features useful for manipulation localization. The prediction is made patch by patch, and post-processing such as mean-shift [98] is used to obtain the pixel-level manipulation prediction. We use the code provided by the authors for evaluation.
• DeepLab: Our baseline model, which adapts the DeepLab VGG-16 model to the manipulation segmentation task. No generation, boundary branch, or refinement stage is added.
• GSR-Net: Our full model combining generation, segmentation, and refinement for manipulation segmentation.

The final results, presented in Table 4.1, highlight the advantage of GSR-Net. For supervised methods [49, 82], we train the model on CASIA 2.0 and evaluate on all four datasets. For the unsupervised methods [6, 38, 39], we directly test the model on all datasets. GSR-Net outperforms other approaches by a large margin on COVER, suggesting the advantage of our network on copy-move manipulation. GSR-Net also improves on In-The-Wild, CASIA 1.0, and Carvalho. Additionally, in terms of computation time, EXIF-consistency takes 1600 times more computation (80 seconds for an 800 × 1200 image on average) than ours (0.05 seconds per image).

Compared to boundary artifact based methods, our GSR-Net outperforms MFCN by a large margin, indicating the effectiveness of the generation and refinement stages. In addition, no hole filling is required, since our approach does not perform late fusion with the boundary branch but instead utilizes boundary artifacts to guide the segmentation branch. Our method outperforms the baseline model by a large margin, showing the effectiveness of the proposed generation, segmentation, and refinement stages.
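The evaluation protocol of Section 4.4.1 (pixel-level F1 and MCC, with the prediction threshold varied and the best score reported) can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code behind Table 4.1: it assumes that scores are averaged per image at each threshold and that a single best threshold is then chosen for the whole dataset, and the names `confusion`, `f1`, `mcc`, and `best_score` are hypothetical.

```python
import math


def confusion(pred, gt, thr):
    """Pixel counts (tp, fp, fn, tn) after thresholding a flattened score map."""
    tp = fp = fn = tn = 0
    for p, g in zip(pred, gt):
        positive = p >= thr
        if positive and g:
            tp += 1
        elif positive:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1(tp, fp, fn, tn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; defined as 0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_score(preds, gts, metric, thresholds):
    """Return (best mean score, threshold) over the candidate thresholds.

    Assumption: the score is averaged over images at each threshold.
    Ties are broken in favour of the larger threshold.
    """
    scored = []
    for thr in thresholds:
        per_image = [metric(*confusion(p, g, thr)) for p, g in zip(preds, gts)]
        scored.append((sum(per_image) / len(per_image), thr))
    return max(scored)


# Two tiny "images", flattened to four pixels each.
preds = [[0.9, 0.8, 0.4, 0.1], [0.7, 0.3, 0.2, 0.6]]
gts = [[1, 1, 0, 0], [1, 0, 0, 1]]
thresholds = [0.25, 0.5, 0.75]

best_f1 = best_score(preds, gts, f1, thresholds)
best_mcc = best_score(preds, gts, mcc, thresholds)
```

Pooling all pixels of the dataset before computing each metric is an equally plausible reading of "the optimal score over the whole dataset"; the averaging used here is only one interpretation.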
Dataset        Carvalho   In-The-Wild   COVER   CASIA
DeepLab        0.420      0.472         0.376   0.474
DL + CP        0.446      0.504         0.410   0.503
DL + G         0.460      0.524         0.434   0.506
DL + DIH       0.384      0.421         0.342   0.420
DL + CP + G    0.472      0.528         0.444   0.507
GS-Net         0.515      0.540         0.455   0.545
GSR-Net        0.525      0.555         0.489   0.574

Table 4.2: Ablation analysis on four datasets. Each entry is the F1 score tested on the individual dataset.

4.4.3 Ablation Analysis

We quantitatively analyze the influence of each component in GSR-Net in terms of F1 score.

• DL + CP: DeepLab VGG-16 model with just the segmentation output, using simple copy-pasted (no generator) and CASIA 2.0 images during training.
• DL + G: DeepLab VGG-16 model with just the segmentation output, using generated and CASIA 2.0 images during training.
• DL + DIH: DeepLab VGG-16 model with just the segmentation output, using the images generated by [89] and CASIA 2.0 images during training. We adapt the deep image harmonization (DIH) network for the generation stage, as it also manipulates regions.
• DL + CP + G: DeepLab VGG-16 model with just the segmentation output, using copy-pasted, generated, and CASIA 2.0 images during training.
• GS-Net: Generation and segmentation network with boundary artifact guided manipulation segmentation. No refinement stage is incorporated.

Figure 4.3: Analysis of robustness under different attacks, comparing EXIF-selfconsistency, RGB-N, and GSR-Net. JPEG compression attacks use quality factors of 70 and 50; scale attacks use scaling ratios of 0.7 and 0.5. (a) JPEG compression attacks on In-The-Wild. (b) Scale attacks on In-The-Wild. (c) JPEG compression attacks on Carvalho. (d) Scale attacks on Carvalho.

The results are shown in Table 4.2.
Starting from our baseline model, simply adding copy-pasted images (DL + CP) yields an improvement by broadening the manipulation distribution. In addition, replacing copy-pasted images with generated images (DL + G) improves over DL + CP on all the datasets, as it refines the boundary produced by naive copy-pasting. As expected, adding both copy-pasted images and generated hard examples (DL + CP + G) is more useful because the network has access to a larger distribution of manipulation.

Dataset       Carvalho   In-The-Wild   COVER   CASIA
CP + S        0.343      0.430         0.351   0.242
CP + G + S    0.354      0.441         0.355   0.270
CP + GSR      0.418      0.479         0.381   0.331

Table 4.3: F1 score comparison for manipulation segmentation when training with COCO annotations.

Compared to applying the deep harmonization network (DL + DIH), our generation approach (DL + G) performs better, as it aligns well with the natural process of manipulation and has a larger variety of manipulation.

The results also indicate the impact of the boundary-guided segmentation network. Directly predicting segmentation (DL + CP + G) does not explicitly learn manipulation artifacts and thus has limited generalization ability compared to GS-Net, which uses the boundary features as side information. Furthermore, GSR-Net improves on GS-Net, since the refinement stage introduces new boundary artifacts.

4.4.4 Robustness to Attacks

We apply both JPEG compression and image scaling attacks to the test images of the In-The-Wild and Carvalho datasets. We compare GSR-Net with RGB-N [49] and EXIF-selfconsistency [6] using their publicly available code, and with MFCN [82] using the numbers reported in their paper. Figure 4.3 shows the results, which indicate that our approach yields more stable performance than prior methods.

Figure 4.4: Qualitative visualization. The first row shows manipulated images on different datasets.
The second row shows the final manipulation segmentation prediction. The third row illustrates the output of the boundary artifact branch. The last row is the ground truth.

4.4.5 Segmentation with COCO Annotations

This experiment shows how much gain our model achieves without using the manipulated images in CASIA 2.0. Instead of carefully manipulated training data, we only utilize the object annotations in COCO to create manipulated images. We compare the results of using different training data as follows:

• CP + S: Only using copy-pasted images to train the segmentation network.
• CP + G + S: Using both copy-pasted and generated images.
• CP + GSR: Using copy-pasted images and generated images. The refinement stage is applied.

Figure 4.5: Qualitative visualization of the generation network (columns: Authentic, Ground Truth, Copy Paste, Epoch 4, Epoch 20, Epoch 40). The first two columns show the authentic background and manipulation mask. As the number of epochs increases, the manipulated region matches better with the background and thus boundary artifacts are harder to identify.

Results are presented in Table 4.3. The performance using only copy-pasted images (CP + S) on the four datasets indicates that our network truly learns boundary artifacts. Also, the improvement after adding generated images (CP + G + S) shows that our generation network provides useful manipulation examples that increase generalization. Last, the refinement stage (CP + GSR) boosts performance further by encouraging the network to spot new boundary artifacts.

4.4.6 Qualitative Results

Generation Visualization. We illustrate some visualizations of the generation network in Figure 4.5. It is clear that the generation network learns to match the pasted region with the background during training. As a result, the boundary artifacts become subtle and the generation network produces harder examples for the segmentation network.

Segmentation Results.
We present qualitative segmentation results on four datasets in Figure 4.4. Unsurprisingly, the boundary branch outputs the potential boundary artifacts in manipulated images and the other branch fills in the interior based on the predicted manipulated boundaries. The examples indicate that our approach deals well with both splicing and copy-move manipulation based on the manipulation clues from the boundaries.

4.5 Conclusion

We propose a novel segmentation framework that first utilizes a generation network to enable generalization across a variety of manipulations. Starting from copy-pasted examples, the generation network generates harder examples during training. We also design a boundary artifact guided segmentation and refinement network to focus on manipulation artifacts rather than semantic content. Furthermore, the segmentation and refinement stages share the same weights, allowing for much faster inference. Extensive experiments demonstrate the generalization ability and effectiveness of GSR-Net on four standard datasets and show state-of-the-art performance. The manipulation segmentation problem is still far from solved due to the large variation of manipulations and post-processing methods. Including more manipulation techniques in the generation network could potentially boost the generalization ability of the existing model and is part of our future research.

Chapter 5: DeepStrip: High Resolution Boundary Refinement

5.1 Introduction

Boundary detection is a well-studied problem and fundamental for human recognition [99, 100]. Recent decades have witnessed considerable effort to improve the boundary quality of an object that has been detected [101–109] or segmented [110–114]. Consequently, it is not difficult to separate objects of interest from backgrounds with precise boundaries utilizing these methods.
While current learning based boundary detection algorithms are usually computed on low resolution (LR) images (0.04-0.25 million pixels), most photos taken these days are much larger, ranging from cell phone size (8-16 million pixels) to professional camera size (16-400 million pixels). Most methods are not designed for images of this size and the excessive computation they require, and most machine learning based methods cannot process them due to memory constraints. Given a precise low resolution prediction, a workaround would be to directly apply upsampling to reach high resolution (HR). Nevertheless, this usually yields poor quality results because the semantic contents in the HR image are not considered. (See Figure 5.1.)

Figure 5.1: Concept overview (panels: HR image, LR mask, bilinear upsampling, boundary upsampling, ours, HR ground truth). The example is from the newly created PixaHR dataset. Given the low resolution mask and high resolution image on the left, a bilinear upsampling with scale factor 16× would result in boundary misalignment in the high resolution image, as is shown in the enlarged boundary region on the right. Also, the new details in high resolution would be missed.

Most research in boundary detection focuses on improving the boundary quality in LR through introducing more semantic information [8, 115, 116] or human interaction [108, 112, 113, 117, 118]. While there has been some work on HR semantic segmentation [119, 120] and upsampling [121, 122], there is less focus on accurately capturing the boundary detail in HR. Instead of treating this problem as an upsampling problem, we treat it as boundary detection and harness the contents in HR images for prediction. To this end, we propose a novel approach to handle boundary refinement in HR images. (See Figure 5.2.)
Our key idea is to allow the power of deep learning methods to be applied to HR images in a time and memory efficient manner by operating on narrow images made up of pixels near the boundary. Given an accurate LR mask, the boundary in HR is likely in proximity to the upsampled LR boundary. (See Figure 5.1.) Therefore, to save memory and computation, we propose to search for the target boundary in a strip region near the boundary of the upsampled mask. The strip image is formed by sampling pixels along and normal to the upsampled mask boundary. Since the normals may not be smooth due to inaccurate boundaries in the upsampled mask, we represent the LR boundary with a spline approximation and directly treat the orthogonal derivatives of the upsampled spline as the normal directions.

Figure 5.2: Framework (diagram labels: LR mask and HR image, strip creation, skip connections, selection layer; losses: C0 continuity regularization, L1 loss, boundary distance loss, matching loss). To save memory and computation, we predict the boundary in a strip image instead of the whole image. First, the strip image is extracted from the HR image and corresponding LR mask. Feeding the strip image as input, the network predicts all potential boundaries (denoted as "x") and passes the initial prediction to a selection layer (denoted as "m") to pursue a more accurate prediction of the target boundary (denoted as "s"). The numbers indicate the losses displayed on the right. Orange and green curves denote the ground truth and prediction, respectively. Note that the strip image and prediction are rotated 90 degrees for visualization.

Feeding the generated strip images as input, we train a network to first predict all potential boundaries. Based on the initial prediction, an additional selection layer is included to predict the target boundary more accurately.
To encourage closer prediction and reduce false positives, we propose loss functions to minimize the boundary distance between the prediction and ground truth in the strip image and to encourage C0 continuity in the prediction. Lastly, we pursue consistent results through matching the prediction under different strip sizes to further boost the performance.

To validate our approach, we create a new PixaHR dataset (see Figure 5.1 for an image example) consisting of 100 photos with average resolution 7k × 7k and evaluate our approach up to scale factor 32×. Results on DAVIS 2016 and COCO coarse annotations also show our ability to refine coarse boundary annotations.

In a nutshell, our contribution is three-fold. 1) We propose an approach to predict the boundary in a strip image which converts potential boundary regions into a strip space. This approach allows us to apply neural networks in a computationally and memory efficient manner. 2) To improve performance and encourage closer prediction, we propose novel losses including boundary distance, matching and C0 continuity loss. 3) We create a high resolution dataset for evaluation. To the best of our knowledge, ours is the first learning based approach to perform HR dense boundary refinement with resolution up to 10k × 10k. Extensive experiments on both public and the new PixaHR dataset strongly highlight our effectiveness.

5.2 Related Work

Boundary Refinement. Multiple attempts have been made to improve boundary quality through extracting better features [8, 101, 116, 123, 124]. Xie et al. [101] utilize features from multiple layers and fuse both low and high level features to detect edges. Liu et al. [116] explore rich convolutional features to boost the performance. More closely related to our work, attention has been paid to refining coarse boundary predictions or annotations [8, 115].
Conventional methods like dense Conditional Random Fields (CRF) [125] and Graph Cuts [126] model the relationship between nearby pixels and thus can be applied to refine LR masks [127]. However, these are segmentation based and only low-level features have been utilized. With more supervision, Yu et al. [115] propose to simultaneously learn and align edges to refine misaligned boundaries directly. Acuna et al. [8] further improve the performance by introducing a thinning layer and active alignment strategy to obtain refined boundaries. These methods mainly explore edge detection in LR images. In contrast, we tackle HR boundary refinement and apply detection only on regions around upsampled LR boundary splines, and our approach is thus more memory and computation efficient.

Active Contours. Active contour models like Snakes [104] have been introduced to refine boundaries from coarse ones. Various approaches have been explored to handle the limitation of Snakes through, e.g., better initialization, morphological operation [128] or user interaction [108]. Since our method also refines the curve upsampled from the LR mask, we can benefit from these methods and refine the boundary further. Instead of taking the whole image as input, deep active contour [129] learns to predict the flow of boundary pixels in a patch by patch fashion. However, it cannot guarantee a continuous boundary prediction. Instead, our approach directly extracts a consecutive boundary region and thus contains more global information. Rather than predict the entire curve, other works have explored predicting control points [117, 130, 131] through recurrent neural networks or Graph Convolutional Networks (GCN) [132] and then fit a curve as the final prediction. However, boundary details are smoothed in the spline representation. In contrast, our approach predicts precise edge information directly. Another line of work implicitly represents boundary curves.
For example, deep level set methods [133] evolve boundary curves by minimizing the level energy function. Other learning based approaches [7, 134, 135] have proposed to provide useful features, including texture, color or shape, for better optimization. However, these learning based approaches suffer from computation and memory issues when the resolution increases because they process the entire image, while our approach focuses only on the regions around upsampled LR boundaries and thus requires less computation and memory.

High Resolution Up-sampling. With the information of low resolution masks, researchers have focused on achieving high quality HR segmentation masks. Conventional methods [136, 137] reach HR by applying upsampling jointly with the LR mask reference. However, the fixed filter structures have difficulty capturing new HR boundary details. He et al. [138] propose guided filtering to smooth while preserving edge information when upsampling. Wu et al. [121] make the guided filter faster and learnable. For HR segmentation approaches, Zhao et al. [120] propose to aggregate LR features for HR segmentation and Chen et al. [119] align both global and local features to avoid heavy GPU consumption for HR segmentation. Even though these methods can potentially be adapted to boundary refinement, our method mainly focuses on boundary regions and is designed to detect boundaries in HR directly. Therefore, our approach learns new HR boundaries better, especially when LR boundaries are coarsely annotated.

5.3 Approach

Our goal is to refine boundaries in HR images given precise LR masks. To achieve this efficiently, we propose to predict on a strip image that captures the potential boundary region rather than the entire HR image. Figure 5.2 illustrates our framework. Our approach consists of strip image creation, which converts the HR RGB image into a strip image; strip boundary prediction, which refines the edges on the strip image using a network; and strip reconstruction, which reconstructs the prediction in the original image from the strip boundary prediction during testing.

Figure 5.3: Strip image creation (panels: LR mask, HR image, upsampled contour, boundary region, strip image, initial boundary, final boundary). To generate the strip image, the B-spline representation of the contour in the LR mask is upsampled to HR as a coarse boundary. The HR region along the normal direction (e.g., red and green arrows) of the contour is then extracted. Finally, the strip image and corresponding boundary ground truth are obtained by flattening the extracted region in both the HR image and mask. Note that the final boundary filters out noisy boundaries (e.g., the red box region) from the initial boundary. The strip image and boundaries are rotated 90 degrees for visualization.

5.3.1 Strip Image Creation

Figure 5.3 describes the procedure of strip image creation. Due to the interpolation introduced by upsampling, a directly upsampled boundary from the LR image is likely to be shifted from the ground truth boundary in HR. To localize the real HR boundary pixels, it is sufficient to search around the upsampled boundary rather than over the whole image. Therefore, we extract pixels near the upsampled boundary to create a strip image.

To create the strip image, we step along the boundary and sample points along the normal direction at each point on the curve. To obtain smoothly varying normal directions along the coarse boundary, we represent the LR boundary by a B-spline and upsample the LR spline to HR.
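The spline-based normal computation described above can be sketched as follows. This is a minimal illustration rather than the dissertation's implementation: the function name `spline_normals`, its arguments, and the choice of an interpolating (`s=0`) periodic spline are assumptions. It relies on SciPy's `splprep`/`splev`, scales the LR contour coordinates by the scale factor, and rotates the unit tangent by 90 degrees to obtain the normal. Uniform steps in the spline parameter only approximate uniform arclength steps.

```python
import numpy as np
from scipy.interpolate import splprep, splev


def spline_normals(contour_lr, scale, num_samples, smooth=0.0):
    """Fit a closed B-spline to an LR contour and sample it in HR coordinates.

    contour_lr: (N, 2) array of (p, q) points ordered along the contour.
    Returns (curve, normal), each of shape (num_samples, 2).
    """
    pts = np.asarray(contour_lr, dtype=float) * scale  # LR -> HR coordinates
    # splprep(per=True) expects the last point to repeat the first one.
    if not np.allclose(pts[0], pts[-1]):
        pts = np.vstack([pts, pts[:1]])
    tck, _ = splprep([pts[:, 0], pts[:, 1]], s=smooth, per=True)

    # The default spline parameter is normalized chord length, so uniform
    # steps in u are roughly uniform steps along the curve.
    u = np.linspace(0.0, 1.0, num_samples, endpoint=False)
    p, q = splev(u, tck)
    dp, dq = splev(u, tck, der=1)

    tangent = np.stack([dp, dq], axis=1)
    tangent /= np.linalg.norm(tangent, axis=1, keepdims=True)
    # Rotate the unit tangent by 90 degrees to get a consistent unit normal.
    normal = np.stack([tangent[:, 1], -tangent[:, 0]], axis=1)
    return np.stack([p, q], axis=1), normal
```

For a counter-clockwise contour in (p, q) coordinates this convention yields outward-pointing normals; a clockwise contour would flip the sign, which is one reason a consistent contour orientation matters.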
Given the HR image $I(p, q)$ and the upsampled spline representation $C = (p(k), q(k))$ of the boundary contour, where $(p(k), q(k))$ denotes the HR image coordinates parameterized by arclength $k$ along the curve, the continuous strip image $J_{I,C}$ is defined by

$$J_{I,C}(k,\, t + H/2) = I\big(p(k) + t \cdot n_p(k),\; q(k) + t \cdot n_q(k)\big), \tag{5.1}$$

where $t$ denotes the distance in the normal direction, $H$ denotes the height of the strip image, and $(n_p(k), n_q(k))$ is the unit normal to the curve at arclength $k$. Accordingly, the strip image $J_{I,C}(j, i)$ with dimension $H \times W$ is obtained by sampling $k = j \cdot d_k$, $t = i \cdot d_t$, where the tangential step size is $d_k = \lfloor |C|/W \rfloor$ and the normal step size $d_t$ is set to 1 for simplicity. $|C|$ denotes the length of $C$, $j = 0, 1, \ldots, W$ and $i = -H/2, \ldots, 0, \ldots, H/2$. Also, bilinear interpolation is applied in the high resolution image to evaluate $I(p, q)$ for non-pixel coordinates $(p, q)$.

The corresponding HR strip boundary ground truth is obtained similarly with two adaptations. First, for large sampling scale factors, the ground truth boundary is likely to be outside the range of the strip if the strip height is small, making the boundary in the strip image discontinuous. We add labels at the border of the strip if no boundary pixel is included to maintain the C0 continuity of the boundary pixels in the strip image. Second, if the strip height is large, multiple boundary pixels might be included in each column in regions where the boundaries are closer than the strip height. In this case, we filter out the extraneous boundaries that are not connected to the current boundary. (See Figure 5.3.)

5.3.2 Strip Boundary Prediction

Provided the HR strip image as input, we train a network to predict the corresponding boundaries within the strip domain. For memory efficiency, we adapt the lightweight encoder-decoder based nested U-Net [139, 140] for boundary prediction.
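As an aside on the strip construction of Section 5.3.1, the sampling in Equation 5.1 can be sketched as below. This is an illustrative sketch under stated assumptions, not the original code: the helper name `extract_strip` is hypothetical, the image is taken to be a single-channel array indexed as `image[p, q]`, the strip has exactly `height` rows with offsets `t = i - H/2`, and the curve points and unit normals are assumed to be given. Bilinear interpolation is done with `scipy.ndimage.map_coordinates(order=1)`.

```python
import numpy as np
from scipy.ndimage import map_coordinates


def extract_strip(image, curve, normal, height):
    """Sample a (height, W) strip along the normals of a curve.

    image:  (rows, cols) array indexed as image[p, q].
    curve:  (W, 2) HR curve points (p(k), q(k)).
    normal: (W, 2) unit normals (n_p(k), n_q(k)).
    Returns the strip and the (2, height, W) HR coordinates of each strip
    pixel, which give the strip-to-image mapping.
    """
    # Signed offsets along the normal; row index i corresponds to t = i - H/2.
    t = np.arange(height) - height / 2.0
    p = curve[None, :, 0] + t[:, None] * normal[None, :, 0]
    q = curve[None, :, 1] + t[:, None] * normal[None, :, 1]
    coords = np.stack([p, q])  # shape (2, height, W)
    # order=1 is bilinear interpolation; out-of-image samples are clamped.
    strip = map_coordinates(image, coords, order=1, mode="nearest")
    return strip, coords
```

For an RGB image the same call can be applied per channel; the returned coordinates can be reused for each channel and for the ground truth mask.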
Given that the proper dimension of the strip image varies for different resolutions, we use instance normalization [73] during training so that the mean and variance are approximated per image.

As is shown in Figure 5.2, two prediction layers are proposed to learn the target boundary in the strip image to account for the fact that multiple true boundaries may be present in a single column of the strip image. First, we extract the last upsampling layer to predict all potential boundaries. This encourages the network to learn boundary features within the strip image. To predict the target boundary, we add a learnable selection layer to pick the target boundary from the potential boundaries. The input to the selection layer is the initial prediction, and we apply column-wise softmax to the output of the selection layer as a confidence score for the initial prediction. Finally, the target boundary is computed by the multiplication between the initial prediction and the selection score. The selection layer also smooths the initial prediction, analogous to the non-maximum suppression in Canny edge detection [100]. Formally,

$$s = x \odot m, \tag{5.2}$$

where $\odot$ denotes pixel-wise multiplication, $s$ denotes the final prediction, $x$ denotes the initial prediction, which applies Sigmoid activation to the output of the last upsampling layer, and $m$ is the softmax activated output of the selection layer.

5.3.3 Loss Function

Our basic loss function for the initial and final boundary prediction is a weighted $\ell_1$ loss to differentiate the boundary from non-boundary pixels. Formally,

$$L_e = \beta \sum_{(i,j) \in Y_+} |y_{ij} - s_{ij}| + (1 - \beta) \sum_{(i,j) \in Y_-} |y_{ij} - s_{ij}|, \tag{5.3}$$

where $Y_+$ and $Y_-$ denote boundary and non-boundary pixels, respectively. $\beta = |Y_-|/|Y|$ denotes the weight to balance the labels and $|Y|$ denotes the total number of pixels in the strip mask. $s_{ij}$ denotes the prediction and $y_{ij}$ denotes the binary ground truth at position $(i, j)$ in the strip image.
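The selection step of Equation 5.2 and the weighted loss of Equation 5.3 can be illustrated with a small forward-pass sketch in NumPy. It is not the training code: the function names are hypothetical, the strip is assumed to be stored as an `(H, W)` array so that a column is `[:, j]`, and the selection logits are passed in directly instead of being produced by a convolutional layer from the initial prediction.

```python
import numpy as np


def select_boundary(logits, select_logits):
    """Return the initial prediction x and final prediction s = x * m.

    logits:        (H, W) output of the last upsampling layer.
    select_logits: (H, W) output of the selection layer (assumed given).
    """
    x = 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> initial prediction
    # Column-wise softmax (over the H rows of each column), shifted for
    # numerical stability.
    z = select_logits - select_logits.max(axis=0, keepdims=True)
    m = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    return x, x * m


def weighted_l1(s, y):
    """Class-balanced l1 loss with weight |Y-|/|Y| on boundary pixels."""
    pos = y > 0.5
    beta = (~pos).sum() / y.size
    err = np.abs(y - s)
    return beta * err[pos].sum() + (1.0 - beta) * err[~pos].sum()
```

Because the softmax sums to one in every column, the final prediction redistributes each column's initial response toward a single row, which is what makes the column-wise argmax used by the later losses meaningful.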
In addition, we adapt Dice loss [141] to boundary prediction to encourage intersection between prediction and ground truth:

$$L_{dice} = 1 - \frac{2 \sum s_{ij}\, y_{ij} + \epsilon}{\sum s_{ij} + \sum y_{ij} + \epsilon}, \tag{5.4}$$

where $\epsilon$ denotes a small constant to avoid division by zero. The loss aims to maximize the overlap between the prediction and ground truth.

5.3.3.1 Boundary Distance Loss

For boundary prediction, a closer prediction to the boundary ground truth is preferred. However, neither the weighted $\ell_1$ loss nor the Dice loss is sensitive to the distance from prediction to ground truth. Therefore, we introduce a boundary distance loss to measure the average distance between the predicted boundary and the ground truth to encourage closer prediction. Thanks to the strip domain, which maps the regions along the normal direction to columns, the boundary distance can be calculated directly through the difference between the prediction and ground truth. Given the prior that only one boundary pixel exists in each column in the final strip mask, the boundary distance at every column can be measured by calculating the argmax difference at every column between the prediction and ground truth. Since the argmax function is not differentiable, we approximate it through soft argmax before calculating the boundary distance and formulate the loss as

$$L_d = \frac{1}{W} \sum_{j=1}^{W} \Big| \operatorname{softarg}_i(s_{ij}) - \arg\max_i(y_{ij}) \Big|, \tag{5.5}$$

where $W$ is the width of the strip mask and the soft argmax in each column (normal direction) is computed as

$$\operatorname{softarg}_i(s_{ij}) = \sum_{i=1}^{H} \left( \frac{|s_{ij}|}{\|S_j\|_1} \right) \times i, \tag{5.6}$$

where $\|S_j\|_1$ is the $\ell_1$ norm of $s_{ij}$ at column $j$. Since the final prediction $s_{ij}$ encourages a unimodal distribution according to Equation 5.2, this loss enforces the column-wise maximum activation of the final prediction to match that in the ground truth.

5.3.3.2 Matching Loss

Since the strip height is fixed during training, to introduce variance and avoid overfitting to a specific strip height, we augment the data through cropping the strip height.
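Returning briefly to the boundary distance loss of Section 5.3.3.1, Equations 5.5 and 5.6 can be sketched as follows. This is a forward-pass illustration in NumPy under a few assumptions: strips are `(H, W)` arrays, row indices are 0-based (the equations use 1-based indices, which cancels in the difference), and a small constant guards against empty columns. In a training framework the same arithmetic would be written with differentiable tensor operations.

```python
import numpy as np


def soft_argmax(s, eps=1e-12):
    """Expected row index of each column of an (H, W) prediction."""
    rows = np.arange(s.shape[0])[:, None]
    weights = np.abs(s) / (np.abs(s).sum(axis=0, keepdims=True) + eps)
    return (rows * weights).sum(axis=0)


def boundary_distance_loss(s, y):
    """Mean absolute gap between predicted and true boundary rows."""
    return np.mean(np.abs(soft_argmax(s) - np.argmax(y, axis=0)))
```

The sketch also shows why a unimodal final prediction matters: a column with two equally strong responses at rows 2 and 4 has a soft argmax of 3, which matches neither response.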
Starting from a large height, we crop the strip to a shorter one and make a new prediction. For consistency, the overlapped regions between the original and the cropped strip should have the same initial prediction since all potential boundaries are predicted. Formally, we take an $\ell_1$ loss between the cropped and original initial prediction to calculate the matching loss,

$$L_m = \frac{1}{|Y_{crop}|} \sum_{(i,j) \in Y_{crop}} |x'_{ij} - x_{ij}|, \tag{5.7}$$

where $Y_{crop}$ is the cropped region of the original mask $Y$ and $x'_{ij}$ is the new initial prediction for the cropped strip image. In addition, this loss also helps the network learn to ignore spurious edges detected near the border of the strip.

5.3.3.3 C0 Continuity Regularization

Additionally, we add a C0 continuity regularization to the final prediction to enforce a continuous prediction. Ideally, at most one boundary pixel is allowed at every column in the final prediction, so the prediction is C0 continuous if the maximum activated position of every column is C0 continuous. Specifically, we compute the soft argmax of every column, calculate a marginal difference between nearby argmax columns and penalize the positions within a window where the prediction becomes discontinuous. Formally,

$$L_{C0} = \frac{1}{W} \sum_{j=1}^{W} P\Big(\max\big(0,\; |\operatorname{softarg}_i(s_{ij}) - \operatorname{softarg}_i(s_{i,j+1})| - v\big)\Big), \tag{5.8}$$

where $v$ denotes the margin value and $P$ denotes max pooling with a fixed kernel size so that all pixels within the range get penalized. $s_{i,W+1}$ is replaced by $s_{i,1}$ for the calculation. This loss serves as a self regularization as no ground truth label is required.

The total loss function is therefore

$$L_{total} = L_e + L_{dice} + \lambda_1 L_d + \lambda_2 L_m + \lambda_3 L_{C0}, \tag{5.9}$$

where $\lambda_1, \lambda_2, \lambda_3$ are hyper-parameters to adjust the weight of each loss. $L_e$ is applied to both the initial and final prediction. $L_m$ is only applied to the initial prediction, and $L_{dice}$, $L_d$, $L_{C0}$ are applied only to the final prediction. With the total loss function, a closer prediction is preferred and the network draws attention to the target boundaries.
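The matching loss (Equation 5.7) and the C0 continuity regularization (Equation 5.8) can be sketched together as below. This is a NumPy forward-pass illustration with hypothetical function names. Two details are assumptions that the text does not fix: the max pooling uses stride 1 with zero padding so that the output keeps width W, and the crop is described by the index `top` of its first row in the full strip.

```python
import numpy as np


def soft_argmax(s, eps=1e-12):
    """Expected row index of each column of an (H, W) prediction."""
    rows = np.arange(s.shape[0])[:, None]
    weights = np.abs(s) / (np.abs(s).sum(axis=0, keepdims=True) + eps)
    return (rows * weights).sum(axis=0)


def matching_loss(x_full, x_crop, top):
    """Mean l1 gap between the cropped-strip prediction and the matching
    rows of the full-strip initial prediction."""
    h = x_crop.shape[0]
    return np.mean(np.abs(x_crop - x_full[top:top + h]))


def c0_regularization(s, margin=1.0, kernel=11):
    """Penalize jumps larger than `margin` between neighbouring columns."""
    pos = soft_argmax(s)
    # np.roll wraps around, so the column after the last one is the first.
    jump = np.abs(pos - np.roll(pos, -1))
    penalty = np.maximum(0.0, jump - margin)
    # 1-D max pooling (stride 1, zero padding) spreads each penalty over
    # the columns within the kernel window.
    pad = kernel // 2
    padded = np.pad(penalty, pad, mode="constant")
    pooled = np.array([padded[j:j + kernel].max() for j in range(len(penalty))])
    return pooled.mean()
```

Neither function needs a ground truth label: the matching loss compares two predictions of the network with each other, and the regularizer only looks at neighbouring columns of one prediction.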
5.3.4 Strip Reconstruction

To make a prediction on the HR image, a mapping between the predicted strip boundaries and the full HR mask is required at inference. For every pixel in the strip image, the corresponding coordinates in the HR image are recorded for reconstruction. Given the raw prediction, we optimize the path with dynamic programming similar to seam carving [142] and find the path with minimum energy. We minimize the function

$$E_{ij} = -s_{ij} - \frac{|\nabla I(i, j)|}{\max(|\nabla I|)}, \tag{5.10}$$

where $|\nabla I(i, j)|$ denotes the magnitude of the image gradients at $(i, j)$. The algorithm searches the energy cost of neighboring pixels and finds the path with the minimum energy cost, which indicates the boundary path with the highest probability. We then connect the original coordinates of the final path in the full mask to form the full prediction.

At inference, the flexible input dimension of our framework enables different strip sizes for different images. Benefitting from this, we determine the width of the strip, which reflects the number of sampling points along the boundary, by multiplying the LR boundary length with the scale factor. We fix the height of the strip under the assumption that all target boundaries are included, and an adaptive height adjustment strategy is also discussed in Section 5.4.6. For objects containing multiple contours due to complex topology, the prediction is made on each contour separately.

5.3.5 Implementation Details

We generate the spline curve efficiently from the binary mask using the scipy function "splprep" after extracting contours. To guarantee a consistent sign for the normals, we extract strip images from closed contours. The starting point of the strip is not deterministic so that no bias is introduced in training. The final ground truth strip boundary mask is obtained by taking the gradient of the ground truth segmentation mask after removing any isolated noisy boundaries.
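The dynamic-programming search described in Section 5.3.4 can be sketched as follows. This is a seam-carving style illustration, not the original implementation. The function names are hypothetical, the energy is assumed to combine the two terms of Equation 5.10 by subtraction, and the path is assumed to move by at most `max_step` rows between neighbouring columns; the text does not state the neighbourhood size.

```python
import numpy as np


def strip_energy(s, grad_mag):
    """Energy that is low where the prediction and image gradient are high
    (assumed form: -s minus the normalized gradient magnitude)."""
    g = grad_mag / max(float(grad_mag.max()), 1e-12)
    return -s - g


def min_energy_path(energy, max_step=1):
    """Return, for each column, the row of the minimum-energy path.

    energy: (H, W) array; consecutive rows of the path differ by at most
    max_step.
    """
    H, W = energy.shape
    cost = energy.astype(float).copy()
    back = np.zeros((H, W), dtype=int)
    for j in range(1, W):
        for i in range(H):
            lo, hi = max(0, i - max_step), min(H, i + max_step + 1)
            k = lo + int(np.argmin(cost[lo:hi, j - 1]))
            back[i, j] = k                # best predecessor row
            cost[i, j] += cost[k, j - 1]  # accumulated energy
    # Backtrack from the cheapest end point.
    path = np.empty(W, dtype=int)
    path[-1] = int(np.argmin(cost[:, -1]))
    for j in range(W - 1, 0, -1):
        path[j - 1] = back[path[j], j]
    return path
```

The returned row indices can be looked up in the recorded strip-to-image coordinates to place the boundary in the full HR mask. The sketch treats the strip as open at both ends; enforcing that a closed contour's path ends where it starts would need an extra constraint.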
Additionally, we randomly add small shifts to the spline representation to introduce position variation of the target boundary in the strip image during training.

Our framework is implemented in PyTorch. The encoder consists of four 3 × 3 convolutional layers and the decoder consists of four upsampling layers. The selection layer consists of another convolutional layer with 3 × 3 kernel size. The activation function is ReLU [143] for all encoder and decoder layers. We use instance normalization for all normalization layers to enable flexible input size at inference. During training, the input strip dimension is fixed as 80 × 4096. We train the network for 70 epochs with batch size 6 on an NVIDIA GeForce TITAN P6000. We use Stochastic Gradient Descent (SGD) as the optimizer and the initial learning rate is 0.1. The learning rate decays by a factor of 10 after every 20 epochs. The momentum is set to 0.9 and weight decay is set to 0.0005. λ1, λ2 and λ3 are set to 0.1, 20 and 1 empirically. We crop the strip image by half to obtain Ycrop for the matching loss, and the max pooling kernel size for C0 continuity regularization is 11. The margin in C0 continuity regularization is set to 1. Horizontal flipping is applied as data augmentation.

Table 5.1: Boundary-based F score comparison. The scale factor between low and high resolution image is 4 on DAVIS 2016 and 8, 16, 32 on PixaHR. For DAVIS 2016, the pixel dilation is 0 and 1 and for PixaHR is 1 and 2 instead.
Dataset | DAVIS 2016 [56] 4× | DAVIS 2016 [56] 4× | PixaHR 8× | PixaHR 8× | PixaHR 16× | PixaHR 16× | PixaHR 32× | PixaHR 32×
Metrics | F (0 pix) | F (1 pix) | F (1 pix) | F (2 pix) | F (1 pix) | F (2 pix) | F (1 pix) | F (2 pix)
Bilinear Upsampling | 0.171 | 0.521 | 0.116 | 0.194 | 0.15 | 0.187 | 0.07 | 0.106
Grabcut [103] | 0.232 | 0.541 | 0.063 | 0.121 | 0.020 | 0.053 | 0.0 | 0.0
Dense CRF [125] | 0.268 | 0.702 | 0.278 | 0.434 | 0.245 | 0.389 | 0.142 | 0.227
Bilateral Solver [137] | 0.274 | 0.569 | 0.207 | 0.277 | 0.185 | 0.247 | 0.156 | 0.216
Curve-GCN [117] | 0.076 | 0.160 | 0.021 | 0.033 | 0.018 | 0.028 | 0.012 | 0.028
DELSE [7] | 0.271 | 0.531 | 0.096 | 0.133 | 0.086 | 0.132 | 0.080 | 0.130
STEAL [8] | 0.171 | 0.348 | 0.282 | 0.457 | 0.151 | 0.255 | 0.09 | 0.144
JBU [136] | 0.175 | 0.447 | 0.140 | 0.231 | 0.117 | 0.184 | 0.055 | 0.090
Guided Filtering [138] | 0.129 | 0.349 | 0.121 | 0.195 | 0.092 | 0.145 | 0.060 | 0.097
Deep GF [121] | 0.193 | 0.461 | 0.286 | 0.420 | 0.175 | 0.269 | 0.09 | 0.141
U-Net boundary | 0.320 | 0.656 | 0.170 | 0.297 | 0.139 | 0.197 | 0.068 | 0.108
U-Net strip (baseline) | 0.303 | 0.710 | 0.334 | 0.455 | 0.303 | 0.425 | 0.267 | 0.357
Ours | 0.423 | 0.788 | 0.416 | 0.508 | 0.396 | 0.498 | 0.330 | 0.447

5.4 Experiments

We evaluate our approach on two HR datasets which provide both low and high resolution ground truth in Section 5.4.2, and then analyze the importance of each component in our framework in Section 5.4.3. We also provide a memory and speed comparison in Section 5.4.4.

5.4.1 Datasets and Metrics

For our experiments, we need a dataset with highly accurate pixel-level HR annotation. Unfortunately, most current datasets are low resolution and many provide inaccurate polygon boundaries as ground truth annotations. We found DAVIS [56] to provide accurate enough results with a resolution that is usable for our needs. To better evaluate the results at large scaling factors, we introduce a new dataset, PixaHR. We describe these datasets below.
DAVIS 2016 [56]: A benchmark for video segmentation which consists of 50 classes with precise annotations in both 480P and 1080P. To enlarge the scale factor, we downsample the 480P mask by a factor of 2, train our approach on the 30-class 1080P training set with 240P LR masks and test on the 20-class 1080P testing set. The scale factor is 4.5 for this experiment. The results are evaluated frame by frame.

PixaHR: To evaluate more realistic scenarios, we create a PixaHR dataset. It contains 100 images with average resolution 7k × 7k (ranging from 5k × 5k to 10k × 10k) collected from the public photograph website Pixabay [144]. We manually annotate the object boundary in the HR images, downsample the HR mask by 8×, 16× and 32× and obtain binary LR masks for evaluation. The photos were uploaded by public users and have diverse contents. We apply our model that was trained on DAVIS to this dataset for evaluation.

Metrics: We use the boundary-based F score introduced by Perazzi et al. [56] for evaluation, which is designed to evaluate the boundary quality of segmentation. As it allows changing pixel tolerance by dilation, we set 0 and 1 pixel dilation on DAVIS, and 1 and 2 pixel dilation on the PixaHR dataset to measure how close the prediction is to the ground truth.

5.4.2 Main Results

For upsampling based approaches, we compare our approach with Bilinear Upsampling, Bilateral Solver [137], Joint Bilateral Upsampling [136] (JBU), Guided Filtering [138] and Deep GF [121]. The boundary is obtained by taking the gradient of the upsampled mask. For boundary refinement approaches, we compare with Grabcut [103], Dense CRF [125] and STEAL [8] using the upsampled mask as initialization. For active contour methods, the baselines are Curve-GCN [117] and DELSE [7], and predictions on PixaHR are made in LR and upsampled to the original resolution since the whole boundary region is required at inference.
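For reference, the boundary-based F score with a pixel tolerance can be approximated as in the sketch below. It is a simplified stand-in and not the official DAVIS evaluation code: the function name is hypothetical, the tolerance is implemented by dilating each boundary map with a square structuring element, and precision and recall are counted against the dilated maps.

```python
import numpy as np
from scipy.ndimage import binary_dilation


def boundary_f_score(pred, gt, tol=1):
    """F score between two binary boundary maps with a pixel tolerance."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    if tol > 0:
        struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
        pred_d = binary_dilation(pred, structure=struct)
        gt_d = binary_dilation(gt, structure=struct)
    else:
        pred_d, gt_d = pred, gt
    n_pred, n_gt = pred.sum(), gt.sum()
    # Precision: predicted boundary pixels that lie near the ground truth.
    precision = (pred & gt_d).sum() / n_pred if n_pred else 0.0
    # Recall: ground truth boundary pixels that lie near the prediction.
    recall = (gt & pred_d).sum() / n_gt if n_gt else 0.0
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))
```

With `tol=0` a boundary that is off by a single pixel everywhere scores 0, while `tol=1` gives it a perfect score, which illustrates why the zero-tolerance column is the stricter measure.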
Learning based approaches are trained or fine-tuned on the training set of DAVIS and evaluated directly on all datasets. In addition, we also compare with our own implemented baselines as below:
• U-Net boundary: We train U-Net directly on the full resolution images on DAVIS for boundary prediction. We concatenate both the full resolution image and upsampled masks as input so that the network learns to refine the coarse masks. The loss function is a weighted binary cross entropy following Xie et al. [101]. Similarly, we also add deep supervision and fuse all intermediate layers to obtain the final prediction. The prediction is made patch-by-patch with patch size 1920 × 1080 on the PixaHR dataset.
• U-Net strip (baseline): Our baseline method which learns to directly predict the target boundary on the strip image. Only the weighted l1 loss is used as the loss function.
• Ours: Our full model which applies the selection layer to predict the boundary in strip images with our boundary distance loss, matching loss and C0 continuity regularization.

Table 5.2: Ablation analysis on two datasets. Each entry is the boundary-based F score tested on the individual dataset.
Dataset | DAVIS 2016 | PixaHR 16×
Metrics | F (0 pix) | F (1 pix)
U-Net strip | 0.303 | 0.303
U-Net strip dice | 0.323 | 0.320
U-Net strip dice + selection | 0.372 | 0.328
U-Net strip dice + selection + BD | 0.390 | 0.342
Ours w/o matching | 0.405 | 0.365
Ours | 0.423 | 0.396

Table 5.1 exhibits our advantage over the baselines. For the DAVIS dataset, a simple upsampling yields a boundary shift from the ground truth and thus performs poorly. Grabcut and dense CRF are segmentation based and thus yield worse performance than ours. Even though other methods including bilateral solver, JBU and Deep GF leverage the low resolution mask, they are designed for general upsampling instead of for boundary refinement and prediction. Curve-GCN fits the curve from the predicted control points, which cannot generate as precise a boundary as ours.
DELSE moves the contour along the gradient of its energy function, but is less robust than our approach, which predicts the target boundary pixels. Additionally, our approach outperforms STEAL as the scale factor increases, indicating that the active alignment in STEAL may not be accurate enough for pixel-level boundary prediction. Compared with U-Net boundary, predicting the boundary in the strip image (U-Net strip) yields a slightly better performance, perhaps because the strip image narrows down the search space for the target boundary. As expected, with our selection layer and proposed losses, we boost the performance further by better distinguishing the target boundaries from other potential boundaries. A similar tendency is observed on the PixaHR dataset. Note that at the large scale factor 32, most of the methods fail to make close predictions to the ground truth while our method still has a relatively stable performance.

Table 5.3: Memory and speed comparison. Each entry is the memory or speed on the DAVIS 2016/PixaHR dataset. We only compare the memory usage among learning-based approaches.
Methods | Memory (MB) | Speed (s/image)
Bilinear Upsampling | - | 0.01/0.02
Grabcut [103] | - | 5.17/320
Dense CRF [125] | - | 3.22/310
Bilateral Solver [137] | - | 4.18/158
JBU [136] | - | 0.08/5.71
Guided filtering [138] | - | 0.08/16.1
Deep GF [121] | - | 0.07/3.95
STEAL [8] | 7775/7959 | 43.1/4231
Curve-GCN [117] | 17330/17330 | 0.93/75.2
DELSE [7] | 17771/17771 | 1.02/20.4
U-net boundary | 17000/17000 | 0.31/24.5
Ours | 3300/3300 | 0.28/2.51

5.4.3 Ablation Analysis

We analyze the importance of each component in our framework as listed below:
• U-Net strip dice: Adding dice loss to the baseline.
• U-Net strip dice + selection: Adding dice loss and the selection layer to the baseline.
• U-Net strip dice + selection + BD: Adding dice loss, boundary distance loss and the selection layer to the baseline.
• Ours w/o matching: Adding additional C0 regularization. It is our full model without the matching loss.
Table 5.2 summarizes the comparison result.
Starting from our baseline U-Net strip, adding dice loss encourages more intersection with the ground truth boundary and thus yields better performance. Comparing U-Net strip + dice with U-Net 92 Dense CRF STEAL Ours Ground truth Figure 5.4: Qualitative results on PixaHR 32?. Rows from top to down are the results of Dense CRF, STEAL, Ours and the Ground truth. We show the entire boundary (green color) result first and enlarge the blue bounding box region for comparison (boundaries are whitened). strip + dice + selection, the selection layer boosts the performance on DAVIS by a large margin, indicating it effectiveness in suppressing the noisy boundaries and smoothing the final prediction. Also, with the boundary distance loss the network learns to have closer prediction. With C0 regularization (Ours w/o matching), the network filters out false positive boundaries by making a continuous prediction. Finally, the performance further improves with the matching loss because the net- work makes a consistent prediction over different strip heights to avoid overfitting. 93 Figure 5.5: Qualitative results on COCO. Columns from left to right are coarse annotation, DELSE [7], STEAL [8] and Ours. 5.4.4 Memory and Speed Comparison Since we only extract a strip image for prediction, our approach is efficient in both memory and computation. Table 5.3 compares our memory overhead and speed performance with baselines. Over all, our computation and memory requirement is relatively small. Our memory requirement is smaller than other learning based approaches. Note that for U-Net boundary and STEAL, the prediction on PixaHR is made patch-by-patch due to the high resolution. More specifically, the main computation in our approach lies in strip recon- struction. e.g., for a 1920 ? 1080 DAVIS image with around 3200 pixels along the boundary, our strip image creation takes 0.08s, prediction process takes 0.06s and the strip reconstruction takes 0.14s. 
A similar computation percentage is also observed on PixaHR.

Method                      F (1 pix)
Ours                        0.330
Ours adaptive 1 segment     0.353
Ours adaptive 2 segments    0.365

Table 5.4: Strip height selection comparison on PixaHR 32×.

5.4.5 Qualitative Results

We show visual comparisons in Figure 5.4. Our approach produces more accurate boundaries than the other methods. To further show the effectiveness of our approach in refining boundaries given LR or coarse masks, we provide qualitative results on COCO, where only polygonal boundary ground truth is provided. We directly extract the strip image using the coarse annotation on COCO, and visualize the prediction in Figure 5.5. Compared with other approaches, our method provides more accurate boundaries, indicating that our approach could be applied to refine coarse boundaries.

5.4.6 Strip Height Adaptation

We predict the target boundary in the strip image under the assumption that the target boundary lies within the pre-defined height range; however, this assumption might not hold, especially for a large scale factor. While one solution is to pre-define a larger height for strip image creation, we propose to progressively increase the height and regenerate the strip image to make new predictions at inference. Specifically, we increase the height of the strip image until the summation of the final prediction score decreases. Furthermore, height adjustment becomes more flexible when the whole contour is divided into several segments that are adjusted independently.

The results are shown in Table 5.4. The comparison between Ours and Ours adaptive 1 segment indicates the benefit of a flexible height. The performance increases further when dividing the whole contour into 2 segments, which allows variable height for different regions.
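The height adaptation rule described above (grow the strip until the summed prediction score decreases, optionally per contour segment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `predict_scores` is a hypothetical callback that rebuilds the strip image at a given height, runs the network and returns the final boundary scores, and `initial_height`, `step` and `max_height` are illustrative parameters that do not come from the text.

```python
from typing import Callable, List, Sequence

# Hypothetical callback type: strip height -> final prediction scores.
Predictor = Callable[[int], Sequence[float]]


def adapt_strip_height(predict_scores: Predictor, initial_height: int,
                       step: int, max_height: int) -> int:
    """Grow the strip height until the summed prediction score decreases."""
    best_height = initial_height
    best_total = sum(predict_scores(initial_height))
    height = initial_height + step
    while height <= max_height:
        total = sum(predict_scores(height))
        if total < best_total:
            # The summed score dropped: keep the previous height.
            break
        best_height, best_total = height, total
        height += step
    return best_height


def adapt_segments(segment_predictors: Sequence[Predictor], initial_height: int,
                   step: int, max_height: int) -> List[int]:
    """Adjust each contour segment independently (one predictor per segment)."""
    return [adapt_strip_height(p, initial_height, step, max_height)
            for p in segment_predictors]
```

With a single predictor this corresponds to the "adaptive 1 segment" variant; passing two predictors, one per half of the contour, corresponds to the "adaptive 2 segments" variant.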
5.5 Conclusion

In summary, this chapter presents a novel strategy to handle HR boundary refinement in a computation- and memory-efficient manner given LR precise masks. To save memory, we propose to extract boundary regions along the upsampled boundary spline to form a strip image and make the prediction within this strip image. To focus on the target boundaries in the strip image, we propose a boundary distance loss, a matching loss and C0 continuity regularization. Extensive experiments on both a public dataset and our newly created dataset demonstrate the effectiveness of the proposed approach. However, the current approach still has difficulty predicting complicated topology and soft boundary regions. A smarter adaptive strip height adjustment for every pixel might be a potential solution, which is left for future research.

Chapter 6: Multi-model and Multi-level Knowledge Distillation for Incremental Learning

6.1 Introduction

Deep neural networks perform well on many visual recognition tasks [11, 145, 146] given specific training data. However, a problem arises when adapting networks to unseen categories while remembering seen ones, which is known as catastrophic forgetting [147-149]. To tackle this issue, there is growing research attention on incremental learning, where the new training data is not provided upfront but added incrementally. The goal of incremental learning is to achieve good performance on new data without sacrificing performance on old data, and it has been widely explored across different tasks such as classification [9, 150] and detection [151].

To alleviate catastrophic forgetting in incremental learning, one possibility is to maintain a subset of old data to avoid overfitting on new data [9, 152, 153]. However, an issue in practice is that once models embedded in a product are delivered to customers, they no longer have access to the training data for privacy reasons.
To tackle the situation, a stricter exemplar-free setting was introduced in [150], which requires no exemplar set for previous categories and only distills previous knowledge from the current categories.

Prior methods typically apply knowledge distillation [154] sequentially during the incremental procedure to preserve previous knowledge. Since they apply distillation only to the last model, it is difficult to maintain all past knowledge completely (the left side of Figure 6.1). From that observation, we propose using all the model snapshots. Prior knowledge is preserved better through our approach (the right side of Figure 6.1). However, saving all previous models may incur a large memory penalty, and would not be practical without somehow compressing this historical information. To address this, we reconstruct previous outputs using only "necessary" parameters during training.

Figure 6.1: Concept overview. We propose to distill knowledge from all previous models efficiently to preserve old data information rather than sequentially applying distillation only to the last model. (For example, using both S1 and S2 in S3 for distillation instead of sequentially using S1 for S2 and then S2 for S3.) The confusion matrix is LWF-MC [9] on the left and our method on the right for the exemplar-free incremental setting.

To this end, we propose an end-to-end Multi-model and Multi-level Knowledge Distillation (M2KD) framework as depicted in Figure 6.2. We introduce a multi-model distillation loss which leverages the snapshots of all previous models to serve as teacher models during distillation, and then directly matches the outputs of a network with those from the corresponding teacher models. To make the pipeline more efficient, we adapt mask based pruning methods to reconstruct the previous models.
We prune the network after each incremental training step and identify significant weights to reconstruct the model. This allows us to reconstruct previous models on-the-fly and utilize them as teacher models in our multi-model distillation. To further enhance the distillation process, we also include an auxiliary distillation loss to preserve more intermediate features of previous models. Additionally, our approach addresses catastrophic forgetting in sequential distillation, and thus generalizes well to both exemplar based and exemplar-free settings.

To show the effectiveness of our approach, we evaluate our model on Cifar-100 [155] and a subset of ImageNet [146]. We achieve state-of-the-art performance on both datasets in the exemplar-free setting. We also show improvement when adapting to exemplar-based incremental learning, and our exemplar-free setting outperforms iCaRL [9] with a 200 exemplar budget.

In summary, our contributions are threefold. First, we propose a multi-model distillation loss, which directly matches logits of the current model with those from the corresponding teacher models. Second, for efficiency, we reconstruct historical models via mask based pruning such that model snapshots can be reconstructed with a low memory footprint. Third, experiments on standard incremental learning benchmarks show that our method achieves state-of-the-art performance in the exemplar-free incremental setting.

6.2 Related Work

The ultimate goal of incremental learning is to achieve good performance on new data while preserving the knowledge about old data. Generally, two types of evaluation settings [156] have been considered. One is multi-head incremental learning, which utilizes multiple classifiers at inference, and the other is single-head incremental learning, which only utilizes one classifier at inference.

Multi-head incremental learning.
The evaluation setting in this stream is that a specific classifier is selected during testing according to the tasks or categories. With this prior information, no confusion exists across different classifiers, and thus the goal becomes how to adapt the old model to new tasks or categories. Research has focused on utilizing an episodic memory to trace back previous tasks [157-159], or constraining the important weights on old tasks [149]. In addition, [160, 161] learn a mask for pruning to further constrain the weights on old tasks. [162] distill the knowledge from the old model when adapting to new tasks. Different from this setting, we do not assume the task or category information is known during inference and follow the setting of single-head incremental learning. Also, even though we apply pruning in our approach, our goal is different from [160, 161], as the masks are utilized to reconstruct previous models and our approach requires no mask selection at inference.

Single-head incremental learning. Single-head evaluation uses only one classifier to predict both the old and the new classes. This setting is more challenging [156] than its multi-head counterpart because of the confusion between old and new categories. Knowledge distillation [154] is frequently utilized to preserve information. [150] distill the knowledge from the last model. [163] introduce Grad-CAM [164] in the loss function. A relaxed setting is to introduce an exemplar set [9] for the old data and match previous logits through distillation. [152] explore the balance between old and new data during training. [153] focus on constructing the exemplar set and [165, 166] replay the seen categories with GANs [83]. Instead of saving exemplars, we save the parameters of previous models for reconstruction. With that, this work can be considered a complementary research direction.
In fact, as knowledge distillation is an important component in these methods, they can potentially benefit from our approach as well. Additionally, [167] alleviate the bias in knowledge distillation by introducing a scaling vector to the trained classifier; however, our approach is agnostic to the classifier and achieves better performance.

Network pruning. Considerable research has explored this area to reduce network redundancy. [168, 169] propose to compress networks through quantization and Huffman coding. [170] compress the weights according to their scores. Other methods [171-173] explore compression for fast inference. In contrast to these methods, we leverage network redundancy and use pruning to reconstruct all previous models in incremental learning with a low memory footprint.

Figure 6.2: Framework overview. Given images from the current training data, we preserve previous knowledge directly from the reconstructed output by matching the logits with the corresponding model, and classify the current data with its ground truth. As an example, each layer contains a mask matrix $M_{t_i}$ at the $t_i$-th incremental step recording significant weights for previous data. The gray dots represent the weights to be trained on the current data. The red and green dots are fixed during training, denoting the weights retained from the first and second incremental step respectively. The gray dots are fine-tuned for the current data before pruning. After pruning, a subset of the gray dots will be marked as important weights and become blue dots, and the remaining weights will be fine-tuned during the next incremental step. Accordingly, $M_{t_2}$ is updated and used as $M_{t_3}$ at the end of this round. In multi-model distillation, the red and green output logits of the current model are matched with model 1 and model 2 respectively, while the blue logits are matched with the ground truth.
6.3 Approach

We propose novel distillation losses to preserve previous information without introducing too much memory overhead (see Figure 6.2). The model is agnostic to the backbone architecture and generalizes well to both exemplar based and exemplar-free methods.

6.3.1 Multi-model Distillation

Single-head incremental learning consists of a sequence of incremental class inclusion processes, referred to as incremental steps. Samples from a batch of new classes $C_k$ are added at the $k$-th incremental step. For instance, 20 classes will be added per incremental step in a 20-class batch setting. Accordingly, the network assigns new logits (output nodes) for the incremental classes. At inference, the maximum logit score in the output is treated as the final decision.

The knowledge distillation used in incremental learning [9, 150] mainly aims to match the output of the current model to a concatenation of the last model logits and ground truth labels. Formally, it optimizes the cross entropy for both the old and new logits,

\[ L_D = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C_o} \hat{s}_{ij} \log(s_{ij}) - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=C_o+1}^{C} y_{ij} \log(s_{ij}), \tag{6.1} \]

where $N$ and $C$ denote the number of samples and the total number of classes so far respectively, and $C_o$ denotes the number of old classes. $s_{ij}$ is the output score of the network obtained by applying the Sigmoid function to the output logits for sample $i$ at logit $j$. $\hat{s}_{ij}$ denotes the old score obtained by the penultimate model. $y_{ij}$ denotes the ground truth.

Treating the penultimate model as the teacher and applying this distillation sequentially helps preserve historical information, especially when no previous exemplar set is stored, which is the protocol for prior methods [9, 150, 152, 163]. However, the historical information will be gradually lost in this sequential pipeline, as the current model must reconstruct all the prior information from the penultimate model alone.
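To make Equation (6.1) concrete, the following is a minimal sketch of the sequential distillation loss in plain Python. It follows the equation as written (only the target-times-log-score terms), and makes a few assumptions that are not from the text: the names `distillation_loss`, `old_scores` and `num_old` are illustrative, inputs are nested lists, and a small `eps` is added inside the logarithm for numerical safety.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def distillation_loss(logits: Sequence[Sequence[float]],
                      old_scores: Sequence[Sequence[float]],
                      labels: Sequence[Sequence[float]],
                      num_old: int,
                      eps: float = 1e-12) -> float:
    """Sketch of Eq. (6.1).

    logits[i][j]     : current model logit for sample i, class j (C classes).
    old_scores[i][j] : score of the previous model for the num_old old classes.
    labels[i][j]     : ground truth (one-hot over all C classes).
    """
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j, z in enumerate(logits[i]):
            s = sigmoid(z)
            # Old classes are matched to the previous model's scores,
            # new classes to the ground truth labels.
            target = old_scores[i][j] if j < num_old else labels[i][j]
            total -= target * math.log(s + eps)
    return total / n
```

For one sample with two zero logits, one old class with teacher score 0.5 and a ground truth label on the new class, the loss evaluates to $1.5 \ln 2$, since both sigmoid outputs equal 0.5.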
To address this limitation, we propose multi-model distillation, which directly leverages all previous models as our teacher model set. Since we mainly have current training data and labels in both settings, the network is more confident on current classes than on old ones. Therefore, matching the previous logits of the current model directly with their corresponding old models preserves information better than always using the last model. Formally, we minimize the cross entropy for the logits between the current model and the corresponding teacher models from previous incremental steps,

\[ L_{MMD} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{P-1} \sum_{j=C_{k-1}+1}^{C_k} \hat{s}_{ijk} \log(s_{ijk}) - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=C_{P-1}+1}^{C} y_{ij} \log(s_{ij}), \tag{6.2} \]

where classes from $C_{k-1}+1$ to $C_k$ belong to the $k$-th incremental step and $P$ denotes the number of incremental steps. Classes from $C_{P-1}+1$ to $C$ belong to the current categories. $s_{ijk}$ is the output score of the current model for sample $i$ at logit $j$ in the $k$-th incremental step. $\hat{s}_{ijk}$ denotes the output score of the $k$-th previous model.

Multi-model distillation matches the logits in the current model with the corresponding teacher model directly, reducing the information loss between incremental steps. At inference, we directly choose the maximum among the output logits, which acts as an ensemble of all the previous teacher models and the current model.

Figure 6.3: Illustration of auxiliary distillation. We extract the intermediate features and connect them directly to an auxiliary classifier to preserve middle level knowledge.

6.3.2 Auxiliary Distillation

Previous incremental learning methods preserve old class information through matching the final output. However, the features from intermediate layers also contain useful information. Inspired by the auxiliary loss in the segmentation task [174], we propose an auxiliary distillation loss to preserve the intermediate statistics of previous models.
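Since the same multi-model matching is reused for the auxiliary branch, a concrete reading of Equation (6.2) may help at this point. The sketch below is illustrative only and rests on assumptions that are not from the text: `boundaries` is a hypothetical list of cumulative class counts $[0, C_1, \dots, C_{P-1}, C]$, `teacher_scores[k-1][i][j]` holds the score of the $k$-th previous model for sample $i$ at (global) logit $j$, and `eps` guards the logarithm.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def multi_model_distillation_loss(logits: Sequence[Sequence[float]],
                                  teacher_scores: Sequence[Sequence[Sequence[float]]],
                                  labels: Sequence[Sequence[float]],
                                  boundaries: Sequence[int],
                                  eps: float = 1e-12) -> float:
    """Sketch of Eq. (6.2): each block of old logits is matched to the
    teacher from the incremental step that introduced those classes."""
    n = len(logits)
    num_steps = len(boundaries) - 1  # P, including the current step
    total = 0.0
    for i in range(n):
        scores = [sigmoid(z) for z in logits[i]]
        # Previous steps k = 1 .. P-1: distill from the k-th teacher.
        for k in range(1, num_steps):
            for j in range(boundaries[k - 1], boundaries[k]):
                total -= teacher_scores[k - 1][i][j] * math.log(scores[j] + eps)
        # Current step: cross entropy against the ground truth labels.
        for j in range(boundaries[-2], boundaries[-1]):
            total -= labels[i][j] * math.log(scores[j] + eps)
    return total / n
```

With only one previous step (`boundaries` of length three), the function computes the same quantity as Equation (6.1), with the single teacher playing the role of the penultimate model.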
Similar to using the final output to represent network statistics, the prediction made by lower level features also represents intermediate feature statistics. Following the main branch classification, we extract lower level features and use an auxiliary classifier to conduct classification based on intermediate features (see Figure 6.3).

Also, a multi-model distillation loss is added on this auxiliary classifier for the purpose of preserving prior lower level features, and a standard cross entropy loss is also included for classifying the current data. Formally,

\[ L_{AD} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{P-1} \sum_{j=C_{k-1}+1}^{C_k} \hat{a}_{ijk} \log(a_{ijk}) - \frac{\lambda}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(a_{ij}), \tag{6.3} \]

where $\hat{a}_{ijk}$ denotes the output score from previous auxiliary classifiers, $a_{ijk}$ or $a_{ij}$ is the output score of the auxiliary branch, and $\lambda$ is the ratio between the distillation and cross entropy loss. Notice that all the logits in the ground truth labels are utilized in the classification cross entropy to enforce the correct prediction of current data.

The total loss function of the network becomes

\[ L_{total} = L_{MMD} + \gamma L_{AD}, \tag{6.4} \]

where $\gamma$ is the ratio between the main classification multi-model distillation and the auxiliary classification distillation.

This auxiliary classification branch is only used during training. At inference time, we only use the main branch classifier for prediction.

Algorithm 1 Pruning Algorithm
    Input: X_1, ..., X_k            // input image sets of incremental steps 1, ..., k
    θ                               // current model parameters
    store pre-update parameters and masks m
    for y = 1, ..., k do
        Grad(θ_y(m < y)) = 0        // apply mask
        update optimizer through back-propagation: θ_y ← min(L_MMD(θ_y) + γ L_AD(θ_y))
        adjust threshold by pruning ratio    // update threshold
        θ_y(θ_y < threshold) = 0    // prune and update θ_y
        m(θ_y >= threshold) = y     // update masks
    end for

6.3.3 Model Reconstruction

One drawback of multi-model distillation in its original form is that it utilizes all previous models, requiring additional memory storage for the models. However, we observe that distillation aims to match logits. Therefore it is only necessary to preserve the outputs of previous networks, not the entire networks themselves. Our idea is to save only a small set of the necessary parameters from which we can approximate the output. In this way, all the models can be recovered on-the-fly without a large memory penalty.

To determine the necessary parameters, we adapt mask based pruning [160] for model reconstruction. Specifically, after training each incremental step we sort the magnitude of weights in each layer, freeze the important ones to reach a specified pruning ratio, and use the residual weights to train the next incremental class set. We repeat this procedure for all future incremental steps until all the incremental classes are included (see Algorithm 1).

We use a mask $M$ to identify the important weights of each layer for all previous incremental steps. After each pruning procedure, we update the mask for the current incremental step. With the saved biases, batch normalization and classifier parameters, we can reconstruct all previous models from the last model (pre-updated model) on-the-fly.

Figure 6.4: Performance on the iILSVRC-small and Cifar-100 datasets in the exemplar-free setting. Each panel plots accuracy (%) against the number of classes. (a) Top-1 accuracy on Cifar-100 (5-class batch). (b) Top-1 accuracy on Cifar-100 (10-class batch). (c) Top-1 accuracy on Cifar-100 (20-class batch). (d) Top-5 accuracy on iILSVRC-small (10-class batch). (e) Top-5 accuracy on iILSVRC-small (20-class batch). (f) Legend.

Formally, the output of a network with $n$ convolutional layers is obtained from its classifier (the last layer) and features,

\[ s = \phi(f^{(n)}), \tag{6.5} \]

where $\phi$ denotes the classifier and $f^{(n)}$ denotes the features in the $n$-th layer, which can be generally written as

\[ f^{(n)} = \sigma(w^{(n)} f^{(n-1)} + b^{(n)}), \tag{6.6} \]

where $w$ and $b$ are weights and biases respectively, $\sigma$ denotes the activation function and $f^{(0)}$ is the input.

With the mask $M_k$ for the $k$-th incremental step, we reconstruct the corresponding features by

\[ f_k^{(n)} = \sigma(w_k^{(n)} \, \delta(M_k^{(n)} \le k) \, f_k^{(n-1)} + b_k^{(n)}), \tag{6.7} \]

where $M_k^{(n)}$ denotes the mask in the $n$-th layer at incremental step $k$, $f_k^{(n)}$ denotes the feature in the $n$-th layer at the $k$-th incremental step, and $\delta$ denotes the delta function. Thus the output of the $k$-th model is reconstructed by

\[ s_k = \phi_k(f_k^{(n)}), \tag{6.8} \]

where $s_k$ and $\phi_k$ denote the output of the network and the classifier for the $k$-th incremental step respectively.

6.4 Experiments

We first evaluate our method in the exemplar-free setting. Then we extend our method to the exemplar-based setting. For more analysis, we also compare our memory cost with other methods.

6.4.1 Datasets and Evaluation Metrics

The evaluation is conducted on iILSVRC-small [175] and Cifar-100 [155].

Evaluation Metrics. Following the same metrics as prior methods [9, 150], the top-1 classification accuracy is reported for Cifar-100 and the top-5 classification accuracy is reported for iILSVRC-small.

6.4.2 Exemplar-free setting

We evaluate our method in the exemplar-free single-head setting.
For evaluation, we also compare with the following baselines and state-of-the-art single-head approaches.

FT: A baseline approach that only applies cross entropy loss to fine-tune the penultimate model on newly arriving incremental classes. Knowledge distillation is not applied.
Scaled [167]: A threshold moving strategy to alleviate the bias in knowledge distillation. We use the released code for evaluation.
DGM [166]: A dynamic generative memory approach which utilizes GANs to generate old samples as the exemplar set. We use the released code for evaluation and no real sample is used during training.
Rwalk [156]: A generalization of EWC [149] and Path Integral [176]. The official code is evaluated.
LWF-MC [9]: A multi-class classification version of [150] as described in [9], applying distillation to the logits from the last previous model sequentially.
M2KD (ours): Our full model, applying multi-model and auxiliary distillation along with pruning to save memory storage.
M2KD (no pruning): The upper bound of our model, which directly loads all the previous snapshots for multi-model distillation.
Upper-Bound: The upper bound of incremental learning, which directly trains all classes together.

Step          1      2      3      4      5
No pruning    83.5   61.8   52.5   51.5   42.1
Ratio 0.6     82.9   59.6   52.2   46.5   40.1
Ratio 0.7     83.5   61.7   52.5   50.0   42.8
Ratio 0.8     83.5   58.5   52.0   49.3   42.0
Ratio 0.9     83.0   58.0   49.7   47.3   39.9

Table 6.1: Top-1 accuracy comparison among different pruning ratios on Cifar-100 (20 classes per incremental step).

Figure 6.4 highlights our performance compared to state-of-the-art methods. For Cifar-100, our method consistently outperforms other methods from the 5-class to the 20-class batch per incremental step. The margin becomes larger as more incremental steps are added. This demonstrates the advantage of multi-model distillation, as it avoids the accumulated loss of historical information. A similar observation can be made when evaluating on iILSVRC-small.
It is interesting to note that our model with pruning achieves performance comparable to the no-pruning version. This indicates the effectiveness of the pruning procedure in saving memory while maintaining performance. Even though the residual active weights decrease gradually due to pruning, we still preserve the performance for up to 20 incremental steps.

6.4.3 Ablation Studies

We investigate the effectiveness of each component of our method in this section. In particular, we compare our full model with the following baselines.

LWF-MC aux: Add auxiliary distillation to LWF-MC.
LWF-MC MMD: Change the original loss to our multi-model distillation. No auxiliary distillation is applied.
Ours skip1: Instead of using all previous models, we study the case when skipping some snapshots. Starting from the last previous model, we skip the first model in multi-model distillation. The skipped model is replaced by the next model for multi-model distillation.
Ours skip2: Skip the first two models instead of one, compared to Ours skip1.

Figure 6.5 shows the comparison for each of the components in our approach. LWF-MC aux improves our baseline model LWF-MC on both datasets after adding auxiliary distillation, indicating that the intermediate level information also contributes to preserving previous knowledge. With only multi-model distillation (LWF-MC MMD), the performance gradually improves on both datasets as more incremental steps are involved, which demonstrates that directly distilling knowledge from the corresponding model helps to reduce the loss in sequential distillation. Note that our multi-model distillation reduces to the standard distillation used in [150] if only one or two incremental steps are added. By incorporating the auxiliary distillation, however, our method still shows improved performance.
Lastly, our model achieves nearly the same performance as our upper bound, which saves all previous snapshots, showing the effectiveness of our pruning based approach.

Figure 6.5: Ablation studies for our approach, comparing LWF-MC, LWF-MC aux, LWF-MC MMD, M2KD (ours) and M2KD (no pruning). Each panel plots accuracy (%) against the number of classes. (a) Top-1 accuracy on Cifar-100 (20-class batch). (b) Top-5 accuracy on iILSVRC-small (20-class batch).

Figure 6.6 compares how multi-model distillation is affected by the number of models. LWF-MC can be regarded as a special case which skips 3 models in the last round. The trend from LWF-MC to Ours shows that the performance improves as the number of preserved models increases, confirming the value of multi-model distillation.

Figure 6.6: Comparison between different numbers of models used in multi-model distillation (LWF-MC, Ours skip1, Ours skip2 and Ours) on Cifar-100 20-class batch, plotting accuracy (%) against the number of classes.

6.4.4 Analysis on pruning ratio

We compare the results corresponding to different pruning ratios to investigate the robustness of our approach. Table 6.1 summarizes the results. Marginal performance variation (around 3%) is observed for different pruning ratios. A higher (0.9) pruning ratio affects the performance because the active weights decrease in the current incremental step, and a lower (0.6) ratio affects the performance because the available weights decrease in future steps. Nevertheless, the relatively small influence indicates that a large redundancy exists in the network architecture. Benefiting from this redundancy, our approach shows robustness to different pruning ratios.
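To make the role of the pruning ratio concrete, the mask update of Algorithm 1 and the masked reconstruction of Equation (6.7) can be sketched for a single flattened layer as below. This is a simplified illustration under stated assumptions rather than the implementation used in the experiments: the pruning ratio is read as the fraction of still-trainable weights that are zeroed and released for later steps, `FREE` is an assumed sentinel for weights not yet claimed by any step, and the function names are hypothetical. Biases, batch normalization and classifier parameters, which the text stores per step, are omitted.

```python
import math
from typing import List

FREE = math.inf  # assumed sentinel: weight not yet claimed by any step


def prune_step(weights: List[float], mask: List[float],
               step: int, prune_ratio: float) -> None:
    """After training `step`, keep the largest-magnitude free weights
    (mask := step) and zero the rest so later steps can reuse them."""
    free = [i for i, m in enumerate(mask) if m == FREE]
    num_pruned = int(round(prune_ratio * len(free)))
    num_keep = len(free) - num_pruned
    ranked = sorted(free, key=lambda i: abs(weights[i]), reverse=True)
    for rank, i in enumerate(ranked):
        if rank < num_keep:
            mask[i] = step      # important weight, frozen from now on
        else:
            weights[i] = 0.0    # pruned, trainable in the next step


def reconstruct(weights: List[float], mask: List[float], step: int) -> List[float]:
    """Masked weights of the step-`step` model, as in Eq. (6.7):
    only weights whose mask index is <= step stay active."""
    return [w if m <= step else 0.0 for w, m in zip(weights, mask)]
```

Because weights claimed at earlier steps are never updated again, calling `reconstruct(weights, mask, k)` on the latest weights returns the same masked weights that the $k$-th model had right after its own pruning. The trade-off discussed above is visible in `num_keep`: a high ratio leaves few weights for the current step, while a low ratio leaves few free weights for future steps.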
6.4.5 Exemplar Based Setting

Our approach can also be applied to exemplar based incremental learning methods which use distillation sequentially on the output of networks [9, 152, 153]. To evaluate our model in this setting, we add exemplar selection to our approach and compare with exemplar based methods.

iCaRL [9]: A prominent exemplar based incremental learning approach which constructs an exemplar set for the old data according to the feature means and performs distillation on the last previous model. A nearest class mean classifier [177] is applied at inference.
iCaRL aux: Adding auxiliary distillation to iCaRL.
iCaRL M2KD: Change the original distillation function, which only matches logits from the last previous model, to our multi-model distillation. Auxiliary distillation is also added for better performance.

Figure 6.7: Performance comparison in the exemplar based setting for iCaRL, iCaRL aux and iCaRL M2KD, plotting accuracy (%) against the number of classes. (a) Top-1 accuracy on Cifar-100 (10-class batch). (b) Top-5 accuracy on iILSVRC-small (10-class batch).

The results are shown in Figure 6.7. With the introduction of multi-model and auxiliary distillation, the performance of iCaRL improves. This indicates that with direct access to all the previous models for distillation, knowledge is preserved better even with an exemplar set.

6.4.6 Memory Comparison

Starting from the memory footprint of LWF as our baseline, we compare the extra memory storage between exemplar based methods such as iCaRL [9] and our approach. The memory is calculated in the 10-class incremental step setting for both iILSVRC-small and Cifar-100. For our approach, we directly calculate the storage difference between the last and the initial step.
For iCaRL, the memory is approximated by the average image size for 2000 samples (i.e. the default exemplar size), plus the overhead of saving the record of the exemplar set. To optimize the memory consumption of iCaRL, we resize the images in iILSVRC-small to 256 × 256 and compress them to JPG with quality 95 to match the network input size during training.

Table 6.2 shows the memory overhead for different methods. It indicates that our approach has approximately 7× smaller memory overhead on iILSVRC-small and 10× smaller on Cifar-100 than iCaRL. On average, for each incremental step, our approach only takes 0.98 MB and 0.08 MB for iILSVRC-small and Cifar-100 respectively. The memory advantage over exemplar based methods might become larger, as higher resolution images take more storage.

Dataset        iILSVRC-small    Cifar-100
LWF-MC         0                0
iCaRL          68.0             9.4
M2KD (ours)    9.80             0.84

Table 6.2: Memory overhead comparison (MB). Each entry is the additional memory requirement for a method on a dataset, relative to the memory footprint of LWF.

We provide further memory analysis in Figure 6.8. We compare our approach with iCaRL on Cifar-100 given the same memory constraint. For a fair comparison, we reduce the exemplar set to compensate for the additional memory we use for network parameters, so that the total matches the memory size used for iCaRL. The performance is evaluated by averaging the top-1 accuracy across all the incremental steps. When the memory budget equals 200 images, we do not use any exemplar set but still perform better than iCaRL. The reason for this is that the sequential distillation pipeline tends to lose information even when exemplars from old classes are available. Moreover, increasing the memory budget makes the performance gap between our approach and iCaRL larger, showing our strength in memorizing what has been learned.

Figure 6.8: Analysis of performance and memory compared to iCaRL on Cifar-100 (10-class batch), plotting accuracy (%) against memory budget K. We increase the memory budget for the exemplar set from 200 to 4000 images and report the average accuracy over all 10 incremental steps.

6.5 Conclusion and Discussion

This chapter presents a novel distillation strategy that mitigates catastrophic forgetting in the single-head incremental learning setting. We introduce multi-model distillation, which directly guides the model to distill knowledge from the corresponding teacher models. To further improve our performance, we incorporate auxiliary distillation to preserve intermediate features. For efficiency, we avoid saving all the model snapshots by reconstructing all previous models using a mask based pruning algorithm. Extensive experiments on standard incremental learning benchmarks demonstrate the effectiveness of our approach.

Incremental learning is still far from solved. A significant gap between one-step training and incremental training still exists. It remains an open question how to reduce the confusion between different incremental steps, especially without access to previous data, which might be a future exploration for our research.

Chapter 7: Conclusion

In this dissertation, we have studied the existing challenges in combining deep learning with forensics for manipulation detection. We proposed the RGB-N network to learn rich features that reveal more artifacts in the domains of local noise and RGB images. Moreover, we extended from image manipulation to video manipulation detection and studied the problem of video inpainting detection. Furthermore, we combined a blending based GAN to improve the generalization of manipulation segmentation networks. We then studied general issues with deep learning models. For the issue of high resolution prediction, we proposed the Deepstrip approach to handle inaccurate results at high resolution more efficiently.
Lastly, we explored the field of incremental learning to prevent the catastrophic forgetting issue of current neural networks.

Even though researchers have provided promising solutions to fight against fake images and videos, the problem is still far from solved. Below we discuss some potential directions for future research.

The first direction is to handle various manipulation techniques. We mainly focused on splicing and inpainting detection in this dissertation; however, detecting other manipulation techniques is also valuable. Given the cat-and-mouse nature of this problem, newly emerging manipulation techniques, including deepfakes and generative-model-based image editing, still remain to be explored. Applying deep learning to detect these new types of manipulation is an interesting direction for future research.

Another challenge in manipulation detection is the domain shift problem. Research has demonstrated performance degradation when applying learned manipulation detection models to a different manipulation domain. This degradation is one of the major factors that limit the application of manipulation detection models. Exploring more generic features, or discovering the domain specific to manipulation and applying domain generalization algorithms, might be an interesting direction.