ABSTRACT
Title of Dissertation: DEEP LEARNING FOR FORENSICS
Peng Zhou
Doctor of Philosophy, 2020
Dissertation Directed by: Professor Larry Davis
Department of Electrical
and Computer Engineering
The advent of media sharing platforms and the easy availability of advanced
photo or video editing software have resulted in a large quantity of manipulated
images and videos being shared on the internet. While the intent behind such
manipulations varies widely, concerns about the spread of fake news and misinformation
are growing. Therefore, detecting manipulation has become an emerging necessity.
Unlike traditional classification, semantic object detection, or segmentation,
manipulation detection and classification pay more attention to low-level tampering
artifacts than to semantic content. The main challenges in this problem include (a)
investigating features to reveal tampering artifacts, (b) developing generic models
that are robust to a wide range of post-processing methods, (c) applying algorithms
to higher resolutions in real scenarios, and (d) handling newly emerging
manipulation techniques. In this dissertation, we propose approaches to tackling
these challenges.
Manipulation detection utilizes both low-level tampering artifacts and semantic
content, suggesting that richer features need to be harnessed to reveal more
evidence. To learn rich features, we propose a two-stream Faster R-CNN network
and train it end-to-end to detect the tampered regions given a manipulated image.
Experiments on four standard image manipulation datasets demonstrate that our
two-stream framework outperforms each individual stream, and also achieves state-
of-the-art performance compared to alternative methods with robustness to resizing
and compression.
Additionally, to extend manipulation detection from image to video, we in-
troduce VIDNet, Video Inpainting Detection Network, which contains an encoder-
decoder architecture with a quad-directional local attention module. To reveal ar-
tifacts encoded in compression, VIDNet additionally takes in Error Level Analysis
(ELA) frames to augment RGB frames, producing multimodal features at different
levels with an encoder.
Moreover, to improve the generalization of manipulation detection models, we
introduce a manipulated image generation process that creates true positives using
currently available datasets. Drawing from traditional work on image blending, we
propose a novel generator for creating such examples. We also propose
creating further examples that force the algorithm to focus on boundary artifacts
during training. Extensive experimental results validate our proposal.
Furthermore, to apply deep learning models to high resolution scenarios ef-
ficiently, we treat the problem as a mask refinement given a coarse low resolution
prediction. We propose to convert the regions of interest into strip images and com-
pute a boundary prediction in the strip domain. Extensive experiments on both the
public and a newly created high resolution dataset strongly validate our approach.
Finally, to handle newly emerging manipulation techniques while preserving performance
on previously learned manipulations, we investigate incremental learning. We propose
a multi-model and multi-level knowledge distillation strategy to preserve perfor-
mance on old categories while training on new categories. Experiments on standard
incremental learning benchmarks show that our method improves the overall per-
formance over standard distillation techniques.
DEEP LEARNING FOR FORENSICS
by
Peng Zhou
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2020
Advisory Committee:
Professor Larry Davis, Chair/Advisor
Professor Rama Chellappa
Professor Joseph JaJa
Professor Behtash Babadi
Professor Yang Tao
© Copyright by
Peng Zhou
2020
Dedication
This thesis is dedicated to my parents for their love and support.
Acknowledgments
Graduating during the 2020 pandemic is a special experience, and I would like
to thank all the people who helped me survive my PhD. This dissertation could not
have been completed without their help.
First and foremost, I would like to thank my advisor, Professor Larry Davis,
for kindly accepting me as his student and for his invaluable guidance and support
of my research. It was Larry who introduced me to the field of computer vision. Working
with Larry has been a pleasure: he always takes care of his students and offers
suggestions when needed. It is also an honor to have been advised by such a
distinguished researcher.
Additionally, I want to thank Professor Rama Chellappa, Professor Joseph
JaJa, Professor Behtash Babadi and Professor Yang Tao for their timely help in
serving as my committee members and reviewing all the manuscripts.
I also want to express my gratitude to the Department of Electrical and Computer
Engineering at the University of Maryland. My PhD dream would not have come true without
its offer of admission and its course training. My gratitude also goes to all the graduate
coordinators who helped me submit all types of materials during my PhD.
My colleagues at UMIACS are another reason my PhD was so rewarding,
and I thank all of them. I thank Dr. Xintong Han for his help and guidance
during my first two years, and Dr. Zuxuan Wu for his suggestions and support
before each submission deadline. I am also grateful for the assistance of my other
co-authors at school, including Dr. Vlad Morariu, Professor Abhinav Shrivastava, Dr.
Sernam Lim, Ning Yu, Dr. Hui Ding, Dr. Mahyar Najibi and Dr. Sirius Chen. It
was a pleasure to collaborate with them, and our discussions were insightful. Beyond
that, the time I spent with my colleagues was unforgettable. Special thanks go to
Xintong Han, Zuxuan Wu, Hengduo Li and Shiyi Lan for the fitness sessions we did
together. I also cherish the time spent with Dr. Zhe Wu, Dr. Mingfei Gao, Xitong
Yang, Luyu Yang, Jun Wang, Dr. Hao Zhou, Dr. Hongyu Xu and Dr. Pallabi
Ghosh. I thank them for all the memorable moments during my PhD.
My internship experience is also part of my PhD journey, and I am equally
grateful to all my mentors for their suggestions and collaboration. I thank Dr.
Long Mai, Dr. Jianming Zhang and Dr. Ning Xu for their suggestions on my
incremental learning project; Dr. Brian Price, Dr. Scott Cohen and Dr.
Gregg Wilensky for their supportive guidance on the DeepStrip project; and Dr.
Ran Xu and Dr. Zeyuan Chen for their discussions on the talking face generation
project. I have learned a lot and will cherish the time spent with my mentors.
Furthermore, I would like to thank my friends who shared their stories and
encouraged me during my PhD. I truly thank my old friend Yiliang Wang for the
holidays we spent together and for his advice when I felt down. I thank Ye Jiang
and Youru Zhou for their long-lasting friendship and their willingness to listen when
I was unhappy. Moreover, I am grateful to my roommate of five years, Shengjie Xie, and
his wife Xi Li, who shared food and TV series with me. Many thanks to my other friends
in the US, including Yi Liu, Jing Huang, Zeyu Zhang, Zhouchen Luo, Xiao Xiao,
Shenli Zou, Xiaomin Lin and Zhengyu Lin. I would also like to express my gratitude to
my friends in China, including but not limited to Xiaoqing Wei, Sheng Zhou, Peng
Xiao and Wei Xie, who treated me well each time I went back.
Lastly, I owe my deepest thanks to my parents, who always trust and stand
by me. I thank them for their unconditional love and their effort in raising me. I also
thank my relatives, who took care of me all the time; I feel truly lucky to be a
member of my family.
Table of Contents
Dedication ii
Acknowledgements iii
Table of Contents vi
List of Tables ix
List of Figures xi
Chapter 1: Introduction and Motivation 1
Chapter 2: Learning Rich Features for Image Manipulation Detection 4
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 RGB Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Noise Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Bilinear Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.4 Implementation Detail . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Pre-trained Model . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.2 Testing on Standard Datasets . . . . . . . . . . . . . . . . . . 18
2.4.3 Manipulation Technique Detection . . . . . . . . . . . . . . . 24
2.4.4 Qualitative Result . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 3: Deep Video Inpainting Detection 28
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Multimodal Features . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2 Quad-Directional Local Attention . . . . . . . . . . . . . . . . 36
3.3.3 ConvLSTM Decoder . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.1 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.3 Ablation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.4 Robustness Analysis . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.5 Results on Free-form Video Inpainting Dataset . . . . . . . . . 48
3.4.6 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Chapter 4: Generate, Segment and Refine: Towards Generic Manipulation
Segmentation 51
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.1 Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.2 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.3 Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.1 Datasets and Experiment Setting . . . . . . . . . . . . . . . . 64
4.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4.3 Ablation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.4 Robustness to Attacks . . . . . . . . . . . . . . . . . . . . . . 69
4.4.5 Segmentation with COCO Annotations . . . . . . . . . . . . . 70
4.4.6 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Chapter 5: DeepStrip: High Resolution Boundary Refinement 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.1 Strip Image Creation . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.2 Strip Boundary Prediction . . . . . . . . . . . . . . . . . . . . 81
5.3.3 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.4 Strip Reconstruction . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 87
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4.1 Datasets and Metrics . . . . . . . . . . . . . . . . . . . . . . . 88
5.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4.3 Ablation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.4 Memory and Speed Comparison . . . . . . . . . . . . . . . . . 94
5.4.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4.6 Strip Height Adaptation . . . . . . . . . . . . . . . . . . . . . 95
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Chapter 6: Multi-model and Multi-level Knowledge Distillation for Incremen-
tal Learning 97
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3.1 Multi-model Distillation . . . . . . . . . . . . . . . . . . . . . 103
6.3.2 Auxiliary Distillation . . . . . . . . . . . . . . . . . . . . . . . 105
6.3.3 Model Reconstruction . . . . . . . . . . . . . . . . . . . . . . 107
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.4.1 Datasets and Evaluation Metrics . . . . . . . . . . . . . . . . 109
6.4.2 Exemplar-free setting . . . . . . . . . . . . . . . . . . . . . . . 110
6.4.3 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.4.4 Analysis on pruning ratio . . . . . . . . . . . . . . . . . . . . 113
6.4.5 Exemplar Based Setting . . . . . . . . . . . . . . . . . . . . . 114
6.4.6 Memory Comparison . . . . . . . . . . . . . . . . . . . . . . . 115
6.5 Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 117
Chapter 7: Conclusion 119
Bibliography 121
List of Tables
2.1 AP comparison on our synthetic COCO dataset. The row is the
model architectures, where RGB Net is a single Faster R-CNN using
RGB image as input; Noise Net is a single Faster R-CNN using noise
feature map as input; RGB-N noise RPN is a two-stream Faster R-
CNN using noise features for RPN network. Noise + RGB RPN is a
two-stream Faster R-CNN using both noise and RGB features as the
input of RPN network. RGB-N is a two-stream Faster R-CNN using
RGB features for RPN network. . . . . . . . . . . . . . . . . . . . . . 18
2.2 Training and testing split (number of images) for four standard datasets.
Columbia is only used for testing the model trained on our synthetic
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 F1 score comparison on four standard datasets. "-" denotes that the
result is not available in the literature. . . . . . . . . . . . . . . . . . 21
2.4 Pixel level AUC comparison on four standard datasets. "-" denotes
that the result is not available in the literature. . . . . . . . . . . . . 21
2.5 Data augmentation comparison. Flipping: image flipping. JPEG:
JPEG compression with quality 70. Noise: adding Gaussian noise
with variance of 5. Each entry is F1/AUC score. . . . . . . . . . . . . 24
2.6 F1 score on NIST16 dataset for JPEG compression (with quality 70
and 50) and resizing (with scale 0.7 and 0.5) attacks. Each entry is
the F1 score of JPEG/Resizing. . . . . . . . . . . . . . . . . . . . . . 25
2.7 AP comparison on multi-class on NIST16 dataset using the RGB-
N network. Mean denotes the mean AP for splicing, removal and
copy-move. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Mean IoU and F1 score comparison on inpainted DAVIS. "*"
denotes that the model is trained on these inpainting algorithms. . . . 44
3.2 Mean IoU and F1 score comparison on inpainted DAVIS. "*"
denotes that the model is trained on these inpainting algorithms. . . . 44
3.3 Mean IoU and F1 score comparison on inpainted DAVIS. "*"
denotes that the model is trained on these inpainting algorithms. . . . 45
3.4 Ablation analysis for each component on our approach. "*"
denotes that the model is trained on these inpainting algorithms. . . . 45
3.5 Mean IoU and F1 score comparison on FVI. The results are
directly tested on FVI dataset, and all the models are trained on VI
and OP inpainted DAVIS. . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1 MCC and F1 score comparison on four standard datasets.
"-" denotes that the result is not available in the literature. * Our
method is 1600 times faster than EXIF-consistency. . . . . . . . . . . 62
4.2 Ablation analysis on four datasets. Each entry is the F1 score
tested on individual dataset. . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 F1 score manipulation segmentation comparison trained with
COCO annotations. . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Boundary-based F score comparison. The scale factor between low
and high resolution image is 4 on DAVIS 2016 and 8, 16, 32 on Pix-
aHR. For DAVIS 2016, the pixel dilation is 0 and 1 and for PixaHR
is 1 and 2 instead. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 Ablation analysis on two datasets. Each entry is the boundary-based
F score tested on individual dataset. . . . . . . . . . . . . . . . . . . 91
5.3 Memory and speed comparison. Each entry is the memory or speed on
DAVIS 2016/PixaHR dataset. We only compare the memory usage
among learning-based approaches. . . . . . . . . . . . . . . . . . . . . 92
5.4 Strip height selection comparison on PixaHR 32×. . . . . . . . . . . . 95
6.1 Top-1 accuracy comparison among different pruning ratios on Cifar-
100 (20 classes per incremental step). . . . . . . . . . . . . . . . . . 111
6.2 Memory compensation comparison (MB). Each entry is the additional
memory requirement for methods across different datasets based on
the memory footprint of LWF. . . . . . . . . . . . . . . . . . . . . . . 117
List of Figures
2.1 Examples of tampered images that have undergone different tamper-
ing techniques. From the top to bottom are the examples showing
manipulations of splicing, copy-move and removal. . . . . . . . . . . . 5
2.2 Illustration of our two-stream Faster R-CNN network. The RGB
stream models visual tampering artifacts, such as unusually high con-
trast along object edges, and regresses bounding boxes to the ground-
truth. The noise stream first obtains the noise feature map by pass-
ing input RGB image through an SRM filter layer, and leverages the
noise features to provide additional evidence for manipulation classifi-
cation. The RGB and noise streams share the same region proposals
from RPN network which only uses RGB features as input. The
RoI pooling layer selects spatial features from both RGB and noise
streams. The predicted bounding boxes (denoted as "bbx pred") are
generated from RGB RoI features. A bilinear pooling [1,2] layer after
RoI pooling enables the network to combine the spatial co-occurrence
features from the two streams. Finally, passing the results through
a fully connected layer and a softmax layer, the network produces
the predicted label (denoted as "cls pred") and determines whether
predicted regions have been manipulated or not. . . . . . . . . . . . . 6
2.3 Illustration of tampering artifacts. Two examples showing tamper-
ing artifacts in the original RGB image and in the local noise features
obtained by the SRM filter layer. The second column is the amplified
regions for the red bounding boxes in the first column. As shown in
the second column, the unnaturally high contrast along the baseball
player's edges provides a strong cue about the presence of tampering.
The third column shows the local noise inconsistency between
tampered regions and authentic regions. In different scenarios, visual
information and noise features play a complementary role in revealing
tampering artifacts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 The three SRM filter kernels used to extract noise features. . . . . . . 14
2.5 Qualitative visualization of results. The top row shows a qualitative
result from the COVER dataset. The copy-moved bag confuses the
RGB Net, and the noise Net. RGB-N achieves a better detection
in this case because it combines the features from the two streams.
The middle row shows a qualitative result from the Columbia. The
RGB Net produces a more accurate result than noise stream. Tak-
ing into account both streams produces a better result for RGB-N.
The bottom row shows a qualitative result from the CASIA1.0. The
spliced object leaves clear tampering artifacts in both the RGB and
noise streams, which yields precise detections for the RGB, noise, and
RGB-N networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Qualitative results for multi-class image manipulation detection on
NIST16 dataset. RGB and noise map provide different information
for splicing, copy-move and removal. By combining the features from
the RGB image with the noise features, RGB-N produces the correct
classification for different tampering techniques. . . . . . . . . . . . . 23
3.1 Problem introduction. Given an inpainted video (second column),
we localize the inpainted region both spatially and temporally. . . . . 29
3.2 Framework overview. Given an RGB frame in a video, we first
derive its corresponding ELA frame and compute multimodal fea-
tures at different scales with both frames. We also introduce a quad-
directional local attention module (striped) to the last encoded RGB
features (colored blue) to explore spatial relationships among pix-
els from four directions. These encoded features are further input
into a multi-layer ConvLSTM (colored green) for decoding, exploit-
ing spatial and temporal relationships explicitly, to produce masks of
inpainted regions. See text for more details. . . . . . . . . . . . . . . 30
3.3 ELA frame example. From the top to the bottom: the inpainted
RGB frame, its corresponding ELA frame, and the ground-truth in-
painting mask. The inpainting artifacts, e.g., the dog, person and
ship, stand out in ELA space while not easily seen in the RGB space. 34
3.4 The quad-directional local attention module. Given RGB fea-
tures from the last layer of the encoder, we derive attention maps
with a quad-directional local attention module. To detect whether a
pixel is inpainted or not, the module attends to its neighbors from
four directions (left-to-right, up-to-down, right-to-left and down-to-up). 35
3.5 Mean IoU comparison under different perturbations. Perturbation
in JPEG compression consists of the quality factor with 90 and 70;
perturbation in noise consists of SNR 30dB and 20dB. Column from
left to right is the result on VI, OP and CP inpainting. "*" denotes
that the model is trained on these inpainting algorithms. . . . . . . . 47
3.6 Qualitative visualization on DAVIS. The first row shows the inpainted
video frame. The second to fourth row indicates the final predictions
from different methods. The fifth row is the ground truth. . . . . . . 49
4.1 Examples of manipulated images across different datasets.
Columns from left to right are images in CASIA [3], COVER [4],
Carvalho [5], and In-The-Wild [6]. The odd rows are manipulated im-
ages and the even rows are the ground truth masks. Different datasets
contain different distributions (from animals to person), manipulation
techniques (from copy-move (the second column) to splicing (the rest
columns)) and post-processing methods (from no post-processing to
various processes including filtering, illumination, and blurring). . . . 52
4.2 GSR-Net framework overview. (a) Given a tampered image S,
an authentic target image T , and the ground truth mask K, the gen-
eration stage generates hard example G(M) starting from a simple
copy-pasting image M . (b) Feeding the training images, copy-pasted
images or generated images as input, the segmentation stage learns
to segment the boundary artifacts and fill the interior to produce the
final prediction. (c) The segmentation network concatenates lower
level features to predict boundary artifacts and then concatenate back
the boundary feature to the segmentation branch for final prediction.
(d) The refinement stage creates a novel tampered image with new
boundary artifacts by replacing the predicted manipulated bound-
aries of segmentation stage with original authentic regions and learns
to make a new prediction. . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Analysis of robustness under different attacks. Attacks with
JPEG compression use quality factors of 70 and 50; scale attacks
use scaling ratios of 0.7 and 0.5. (a) JPEG compression attacks
on In-The-Wild. (b) Scale attacks on In-The-Wild. (c) JPEG com-
pression attacks on Carvalho. (d) Scale attacks on Carvalho. . . . . . 68
4.4 Qualitative visualization. The first row shows manipulated im-
ages on different datasets. The second indicates the final manipula-
tion segmentation prediction. The third row illustrates the output of
boundary artifacts branch. The last row is the ground truth. . . . . 70
4.5 Qualitative visualization of the generation network. The first
two columns show the authentic background and manipulation mask.
As the number of epochs increases, the manipulated region matches
better with the background and thus boundary artifacts are harder
to identify. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Concept overview. The example is from the newly created PixaHR
dataset. Given low resolution mask and high resolution image on
the left, a bilinear upsampling with scale factor 16× would result in
boundary misalignment in high resolution image, as is shown in the
enlarged boundary region on the right. Also, the new details in high
resolution would be missed. . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 Framework. To save memory and computation, we predict the bound-
ary in a strip image instead of the whole image. First, the strip image
is extracted from the HR image and corresponding LR mask. Feeding
the strip image as input, the network predicts all potential boundaries
(denoted as "x") and passes the initial prediction to a selection
layer (denoted as "m") to pursue a more accurate prediction of the
target boundary (denoted as "s"). The numbers are indicators of the
losses displayed on the right. Orange and green curves denote the
ground truth and prediction, respectively. Note that the strip image
and prediction are rotated 90 degrees for visualization. . . . . . . . . 75
5.3 Strip image creation. To generate strip image, B-spline representa-
tion of the contour in the LR mask is upsampled to HR as a coarse
boundary. The HR region along the normal direction (e.g., red and
green arrows) of the contour is then extracted. Finally, the strip
image and corresponding boundary ground truth is obtained by flat-
tening the extracted region in both the HR image and mask. Note
that the final boundary filters out noisy boundaries (e.g., the red box
region) from the initial boundary. The strip image and boundaries
are rotated 90 degrees for visualization. . . . . . . . . . . . . . . . . . 79
5.4 Qualitative results on PixaHR 32×. Rows from top to bottom are
the results of Dense CRF, STEAL, Ours and the Ground truth. We
show the entire boundary (green color) result first and enlarge the
blue bounding box region for comparison (boundaries are whitened). . 93
5.5 Qualitative results on COCO. Columns from left to right are coarse
annotation, DELSE [7], STEAL [8] and Ours. . . . . . . . . . . . . . 94
6.1 Concept overview. We propose to distill knowledge from all previ-
ous models efficiently to preserve old data information rather than
sequentially applying distillation only to the last model. (For exam-
ple, using both S1 and S2 in S3 for distillation instead of sequentially
using S1 for S2 and then S2 for S3). The confusion matrix is LWF-
MC [9] on the left and our method on the right for the exemplar-free
incremental setting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2 Framework overview. Given images from the current training data,
we preserve previous knowledge directly from the reconstructed out-
put through matching the logits with the corresponding model and
classifying the current data with its ground truth. As an example,
each layer contains a mask matrix M_{t_i} at the t_i-th incremental step
recording significant weights for previous data. The gray dots repre-
sent the weights to be trained on the current data. The red and green
dots are fixed during training, denoting the weights retained from the
first and second incremental step respectively. The gray dots are fine-
tuned for the current data before pruning. After pruning, a subset of
the gray dots will be marked as important weights and become blue
dots, and the remaining weights will be fine-tuned during the next
incremental step. Accordingly, M_{t_2} is updated and used as M_{t_3} at
the end of this round. In multi-model distillation, the red and green
output logits of the current model are matched with the model 1 and
2 respectively while the blue logits are matched with its ground truth. 102
6.3 Illustration of auxiliary distillation. We extract the intermediate fea-
tures and connect directly with an auxiliary classifier to preserve mid-
dle level knowledge. . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.4 Performance on iILSVRC-small and Cifar-100 dataset in exemplar-
free setting. (a) Top-1 accuracy on Cifar-100 (5-class batch). (b)
Top-1 accuracy on Cifar-100 (10-class batch). (c) Top-1 accuracy on
Cifar-100 (20-class batch). (d) Top-5 accuracy on iILSVRC-small
(10-class batch). (e) Top-5 accuracy on iILSVRC-small (20-class
batch). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.5 Ablation Studies for our approach. (a) Top-1 accuracy comparison
on Cifar-100 (20-class batch). (b) Top-5 accuracy performance on
iILSVRC-small (20-class batch). . . . . . . . . . . . . . . . . . . . . 113
6.6 Comparison between different number of models used in multi-model
distillation on Cifar-100 20-class batch. . . . . . . . . . . . . . . . . . 114
6.7 Performance comparison in exemplar based setting. (a) Top-1 accu-
racy performance on Cifar-100 (10-class batch). (b) Top-5 accuracy
performance on iILSVRC-small (10-class batch). . . . . . . . . . . . 115
6.8 Analysis on performance and memory compared to iCaRL on Cifar-
100 (10-class batch). We increase memory budget for exemplar set
from 200 to 4000 images and report the average accuracy of all the
10 incremental steps. . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Chapter 1: Introduction and Motivation
Recent decades have witnessed the rapid development of deep learning, which
has been applied to a wide range of tasks including image and video editing,
generative modeling, and object recognition and detection. However, the improving
quality of photo and video editing has raised concerns about malicious use and
misinformation. In response, research has emerged that combines traditional forensics
approaches with deep learning to fight fake media. In this thesis,
we mainly tackle four challenges in improving forensic detection with deep
learning: (a) harnessing rich features to find evidence of tampering; (b) improving
the generalization of deep learning based forensic models; (c) exploring efficient
solutions for applying deep learning models at different scales; and (d) extending
the learned model to newly emerging manipulation techniques.
In Chapter 2, we introduce a two-stream RGB-N network to learn rich features
for manipulation detection. One of the two streams is an RGB stream whose purpose
is to extract features from the RGB image input to find tampering artifacts like
strong contrast difference, unnatural tampered boundaries, and so on. The other
is a noise stream that leverages the noise features extracted from a steganalysis
rich model filter layer to discover the noise inconsistency between authentic and
tampered regions. We then fuse features from the two streams through a bilinear
pooling layer to further incorporate spatial co-occurrence of these two modalities.
In Chapter 3, we take the temporal dimension into account and detect
inpainting manipulation in videos. We propose an inpainting detection network,
VIDNet, to reveal both spatial and temporal artifacts. The network learns
features from both compression coefficient artifacts and visual RGB
artifacts. The features of these two modalities are then decoded by
a Convolutional LSTM to predict masks of inpainted regions. In addition, when
detecting whether a pixel is inpainted, we present a quad-directional local
attention module that borrows information from the surrounding pixels in four
directions. Extensive experiments validate our approach and show that VIDNet
outperforms alternative inpainting detection methods by clear margins.
In Chapter 4, we propose combining a generative model to augment train-
ing data and thus improve the generalization of the learned model. The network
first automatically generates both hard and easy examples, and then segments both
boundary artifacts and the interior regions, and finally replaces the predicted arti-
facts with original regions to refine the predicted result.
In Chapter 5, we study the problem of high resolution boundary refinement to
extend the deep learning model to real scenarios. We propose transforming the image
into an image strip domain to reduce the computation and memory consumption.
To detect the target boundary at high resolution, we present a framework with two
prediction layers. First, all potential boundaries are predicted as an initial prediction
and then a selection layer is used to pick the target boundary and smooth the result.
To encourage accurate prediction, a loss which measures the boundary distance in
the strip domain is introduced. In addition, we enforce a matching consistency and
C0 continuity regularization to the network to reduce false alarms.
In Chapter 6, we further investigate incremental learning to make deep learning
models robust to newly emerging categories while avoiding forgetting previously
learned knowledge. We leverage all previous model snapshots as teachers to
retain previous knowledge while training on new categories. In addition, we incorporate
an auxiliary distillation to further preserve knowledge encoded at the intermediate
feature levels. To make the model more memory efficient, we adapt mask based
pruning to reconstruct all previous models with a small memory footprint.
In Chapter 7, we summarize this dissertation and discuss potential directions
for future research.
Chapter 2: Learning Rich Features for Image Manipulation Detection
2.1 Introduction
With the advances of image editing techniques and user-friendly editing soft-
ware, low-cost tampered or manipulated image generation processes have become
widely available. Among tampering techniques, splicing, copy-move, and removal
are the most common manipulations. Image splicing copies regions from an authen-
tic image and pastes them to other images, copy-move copies and pastes regions
within the same image, and removal eliminates regions from an authentic image
followed by inpainting. Sometimes, post-processing like Gaussian smoothing will
be applied after these tampering techniques. Examples of these manipulations are
shown in Figure 2.1. Even with careful inspection, humans find it difficult to recog-
nize the tampered regions.
Figure 2.1: Examples of tampered images that have undergone different tampering
techniques. From top to bottom, the rows show splicing, copy-move and removal.
Columns, left to right: authentic image, tampered image, ground-truth mask.
Figure 2.2: Illustration of our two-stream Faster R-CNN network. (Diagram
components: RGB stream input, RGB conv layers, RPN layer, RoI pooling layer and
RGB RoI features; noise stream input via the SRM filter layer, noise conv layers
and noise RoI features; bilinear pooling; outputs bbx_pred and cls_pred; example
classes removal, copy-move and splicing.) The RGB stream
models visual tampering artifacts, such as unusually high contrast along object
edges, and regresses bounding boxes to the ground-truth. The noise stream first
obtains the noise feature map by passing the input RGB image through an SRM filter
layer, and leverages the noise features to provide additional evidence for
manipulation classification. The RGB and noise streams share the same region proposals
from the RPN network, which only uses RGB features as input. The RoI pooling layer
selects spatial features from both RGB and noise streams. The predicted bounding
boxes (denoted as "bbx_pred") are generated from RGB RoI features. A bilinear
pooling [1, 2] layer after RoI pooling enables the network to combine the spatial
co-occurrence features from the two streams. Finally, passing the results through
a fully connected layer and a softmax layer, the network produces the predicted
label (denoted as "cls_pred") and determines whether predicted regions have been
manipulated or not.
As a result, distinguishing authentic images from tampered images has become
increasingly challenging. The emerging research focusing on this topic, image
forensics, is of great importance because it seeks to prevent attackers from using
their tampered images for unscrupulous business or political purposes. In contrast to
current object detection networks [10-15] which aim to detect all objects of different
categories in an image, a network for image manipulation detection would aim to
detect only the tampered regions (usually objects). We investigate how to adopt
object detection networks to perform image manipulation detection by exploring
both RGB image content and image noise features.
Recent work on image forensics utilizes clues such as local noise features [16, 17]
and Color Filter Array (CFA) patterns [18] to classify a specific patch or pixel [5]
in an image as tampered or not, and to localize the tampered regions [18-20]. Most
of these methods focus on a single tampering technique. A recently proposed
architecture [21] based on a Long Short-Term Memory (LSTM) network segments tampered
patches, showing robustness to multiple tampering techniques by learning to detect
tampered edges. Here, we propose a novel two-stream manipulation detection
framework, which not only models visual tampering artifacts (e.g., artifacts near
manipulated edges), but also captures inconsistencies in local noise features.
More specifically, we adopt Faster R-CNN [10] within a two-stream network
and perform end-to-end training. A summary of our method is shown in Figure
2.2. Deep learning detection models like Faster R-CNN [10] have demonstrated
good performance on detecting semantic objects over a range of scales. The Region
Proposal Network (RPN) is the component in Faster R-CNN that is responsible for
proposing image regions that are likely to contain objects of interest, and it can be
adapted for image manipulation detection. For distinguishing tampered regions from
authentic regions, we utilize features from the RGB channels to capture clues like
visual inconsistencies at tampered boundaries and contrast effect between tampered
regions and authentic regions. The second stream analyzes the local noise features
in an image.
The intuition behind the second stream is that when an object is removed
from one image (the source) and pasted into another (the target), the noise features
between the source and target images are unlikely to match. These differences can
be partially masked if the user subsequently compresses the tampered image [17,22].
To utilize these features, we transform the RGB image into the noise domain and
use the local noise features as the input to the second stream. There are many ways
to produce noise features from an image. Based on recent work on steganalysis rich
model (SRM) for manipulation classification [16,23], we select SRM filter kernels to
produce the noise features and use them as the input channel to the second Faster
R-CNN network.
Features from these two streams are then bilinearly pooled for each Region
of Interest (RoI) to detect tampering artifacts based on features from both streams
(see Figure 2.2).
Previous image manipulation datasets [4, 24-26] contain only several hundred
images, not enough to train a deep network. To overcome this, we created a syn-
thetic tampering dataset based on COCO [27] for pre-training our model and then
finetuned the model on different datasets for testing. Experimental results of our
approach on four standard datasets demonstrate promising performance.
Our contribution is two-fold. First, we show how a Faster R-CNN framework
can be adapted for image manipulation detection in a two-stream fashion. We
explore two modalities, RGB tampering artifacts and local noise feature
inconsistencies, bilinearly pooling them to identify tampered regions. Second, we show that
the two streams are complementary for detecting different tampering techniques,
leading to improved performance on four image manipulation datasets compared to
state-of-the-art methods.
2.2 Related Work
Research on image forensics consists of various approaches to detect the low-level
tampering artifacts within a tampered image, including double JPEG compression [22],
color filter array (CFA) analysis [18] and local noise analysis [28]. Specifically,
Bianchi et al. [22] propose a probabilistic model to estimate the DCT coefficients
and quantization factors for different regions. CFA based methods analyze low-level
statistics introduced by the camera internal filter patterns under the assumption
that the tampered regions disturb these patterns. Goljan et al. [18] propose a Gaus-
sian Mixture Model (GMM) to classify CFA present regions (authentic regions) and
CFA absent regions (tampered regions).
Recently, methods based on local noise features, like the steganalysis rich model
(SRM) [23], have shown promising performance in image forensics tasks. These
methods extract local noise features from adjacent pixels, capturing the inconsis-
tency between tampered regions and authentic regions. Cozzolino et al. [28] explore
and demonstrate the performance of SRM features in distinguishing tampered and
authentic regions. They also combine SRM features, including the quantization
and truncation operations, with a Convolutional Neural Network (CNN) to perform
manipulation localization [29]. Rao et al. [30] use an SRM filter kernel as
initialization for a CNN to boost the detection accuracy. Most of these methods focus on
specific tampering artifacts and are limited to specific tampering techniques. We
also use these SRM filter kernels to extract low-level noise that is used as the input
to a Faster R-CNN network, and learn to capture tampering traces from the noise
features. Moreover, a parallel RGB stream is trained jointly to model mid- and
high-level visual tampering artifacts.
With the success of deep learning techniques in various computer vision and
image processing tasks, a number of recent techniques have also employed deep
learning to address image manipulation detection. Chen et al. [31] add a low pass
filter layer before a CNN to detect median filtering tampering techniques. Bayar et
al. [32] change the low pass filter layer to an adaptive kernel layer to learn the filtering
kernel used in tampered regions. Beyond filtering learning, Zhang et al. [33] propose
a stacked autoencoder to learn context features for image manipulation detection.
Cozzolino et al. [19] treat this problem as an anomaly detection task and use an
autoencoder based on extracted features to distinguish those regions that are difficult
to reconstruct as tampered regions. Salloum et al. [34] use a Fully Convolutional
Network (FCN) framework to directly predict the tampering mask given an image.
They also learn a boundary mask to guide the FCN to look at tampered edges,
which assists them in achieving better performance in various image manipulation
datasets. Bappy et al. [21] propose an LSTM based network applied to small image
patches to find the tampering artifacts on the boundaries between tampered patches
and image patches. They jointly train this network with pixel level segmentation
to improve the performance and show results under different tampering techniques.
However, only focusing on nearby boundaries provides limited success in different
scenarios, e.g., removing the whole object might leave no boundary evidence for
detection. Instead, we use global visual tampering artifacts as well as the local
noise features to model richer tampering artifacts. We use a two-stream network
built on Faster R-CNN to learn rich features for image manipulation detection. The
network shows robustness to splicing, copy-move and removal. In addition, the
network enables us to make a classification of the suspected tampering techniques.
2.3 Proposed Method
We employ a multi-task framework that simultaneously performs manipulation
classification and bounding box regression. RGB images are provided in the RGB
stream (the top stream in Figure 2.2), and SRM images in the noise stream (the
bottom stream in Figure 2.2). We fuse the two streams through bilinear pooling
before a fully connected layer for manipulation classification. The RPN uses the
RGB stream to localize tampered regions.
2.3.1 RGB Stream
The RGB stream is a single Faster R-CNN network and is used both for
bounding box regression and manipulation classification. We use a ResNet 101
network [35] to learn features from the input RGB image. The output features of
the last convolutional layer of ResNet are used for manipulation classification.
The RPN network in the RGB stream utilizes these features to propose RoI
for bounding box regression. Formally, the loss for the RPN network is defined as
$$L_{RPN}(g_i, f_i) = \frac{1}{N_{cls}} \sum_i L_{cls}(g_i, g_i^\star) + \lambda \frac{1}{N_{reg}} \sum_i g_i^\star L_{reg}(f_i, f_i^\star), \tag{2.1}$$
where $g_i$ denotes the probability of anchor $i$ being a potential manipulated region
in a mini-batch, and $g_i^\star$ denotes the ground-truth label for anchor $i$ to be positive.
The terms $f_i$ and $f_i^\star$ are the 4-dimensional bounding box coordinates for anchor $i$ and
the ground-truth, respectively. $L_{cls}$ denotes the cross entropy loss for the RPN network and
$L_{reg}$ denotes the smooth L1 loss for regression of the proposal bounding boxes. $N_{cls}$
denotes the size of a mini-batch in the RPN network. $N_{reg}$ is the number of anchor
locations. The term $\lambda$ is a hyper-parameter to balance the two losses and is set to 10.
Note that in contrast to traditional object detection whose RPN network searches
for regions that are likely to be objects, our RPN network searches for regions that
are likely to be manipulated. The proposed regions might not necessarily be objects,
e.g., the case in the removal tampering process.
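The RPN loss of Eq. (2.1) can be illustrated with a minimal NumPy sketch. The function and argument names below are hypothetical and chosen only for this illustration; the sketch assumes binary anchor labels and the usual smooth L1 definition with a transition point at 1, and it is not the training code used in this work.

```python
import numpy as np

def smooth_l1(diff):
    """Elementwise smooth L1: 0.5 * d^2 if |d| < 1, otherwise |d| - 0.5."""
    a = np.abs(diff)
    return np.where(a < 1.0, 0.5 * a ** 2, a - 0.5)

def rpn_loss(g, g_star, f, f_star, n_cls, n_reg, lam=10.0):
    """Sketch of Eq. (2.1).

    g      : (A,) predicted probability that each anchor is manipulated
    g_star : (A,) ground-truth anchor labels in {0, 1}
    f      : (A, 4) predicted box coordinates
    f_star : (A, 4) ground-truth box coordinates
    n_cls  : mini-batch size used to normalise the classification term
    n_reg  : number of anchor locations used to normalise the regression term
    lam    : balancing weight (set to 10 in the text)
    """
    eps = 1e-12  # avoids log(0); an implementation detail of this sketch
    g = np.clip(g, eps, 1.0 - eps)
    # Binary cross entropy per anchor.
    l_cls = -(g_star * np.log(g) + (1.0 - g_star) * np.log(1.0 - g))
    # Smooth L1 summed over the 4 box coordinates, counted for positive anchors only.
    l_reg = smooth_l1(f - f_star).sum(axis=1)
    return l_cls.sum() / n_cls + lam * (g_star * l_reg).sum() / n_reg
```

Because the regression term is multiplied by the ground-truth label, only anchors labelled as manipulated contribute to box regression.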
2.3.2 Noise Stream
RGB channels are not sufficient to tackle all the different cases of manipula-
tion. In particular, tampered images that were carefully post processed to conceal
the splicing boundary and reduce contrast differences are challenging for the RGB
stream.
So, we utilize the local noise distributions of the image to provide additional
evidence. In contrast to the RGB stream, the noise stream is designed to pay
more attention to noise rather than semantic image content. This is novel: while
current deep learning models do well in representing hierarchical features from RGB
image content, no prior work in deep learning has investigated learning from noise
distributions in detection. Inspired by recent progress on SRM features from image
forensics [23], we use SRM filters to extract the local noise features (examples shown
in Figure 2.3) from RGB images as the input to our noise stream.
In our setting, noise is modeled by the residual between a pixel's value and the
estimate of that pixel's value produced by interpolating only the values of
neighboring pixels. Starting from 30 basic filters, along with nonlinear operations like
maximum and minimum of the nearby outputs after filtering, SRM features gather
the basic noise features. SRM quantizes and truncates the output of these filters
and extracts the nearby co-occurrence information as the final features. The feature
obtained from this process can be regarded as a local noise descriptor [28]. We find
that using only 3 kernels achieves decent performance, and applying all 30 kernels
does not give a significant performance gain. Therefore, we choose 3 kernels, whose
weights are shown in Figure 2.4, and directly feed the resulting 3-channel output
into a pre-trained network trained on 3-channel inputs. We define the kernel size of
the SRM filter layer in the noise stream to be 5 × 5 × 3. The output channel size of
our SRM layer is 3.
Figure 2.3: Illustration of tampering artifacts. Two examples showing tampering
artifacts in the original RGB image and in the local noise features obtained by the
SRM filter layer. Columns, left to right: tampered image, visual artifacts, noise,
ground-truth. The second column is the amplified regions for the red bounding
boxes in the first column. As shown in the second column, the unnaturally high
contrast along the baseball player's edges provides a strong cue about the presence of
tampering. The third column shows the local noise inconsistency between tampered
regions and authentic regions. In different scenarios, visual information and noise
features play a complementary role in revealing tampering artifacts.
$$\frac{1}{4}\begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 2 & -4 & 2 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}, \quad \frac{1}{12}\begin{bmatrix} -1 & 2 & -2 & 2 & -1 \\ 2 & -6 & 8 & -6 & 2 \\ -2 & 8 & -12 & 8 & -2 \\ 2 & -6 & 8 & -6 & 2 \\ -1 & 2 & -2 & 2 & -1 \end{bmatrix}, \quad \frac{1}{2}\begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & -2 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
Figure 2.4: The three SRM filter kernels used to extract noise features.
The resulting noise feature maps after the SRM layer are shown in the third
column of Figure 2.3. It is clear that they emphasize the local noise instead of
image content and explicitly reveal tampering artifacts that might not be visible in
the RGB channels. We directly use the noise features as the input to the noise stream
network. The backbone convolutional network architecture of the noise stream is
the same as the RGB stream. The noise stream shares the same RoI pooling layer
as the RGB stream. For bounding box regression, we only use the RGB channels
because RGB features perform better than noise features for the RPN network based
on our experiments (See Table 2.1).
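To make the SRM filter layer concrete, the following NumPy sketch builds the three kernels of Figure 2.4 and applies them to an RGB image. The helper names (`filter2d_same`, `srm_noise_map`) are hypothetical. The text specifies a 5 × 5 × 3 kernel size and 3 output channels but not how each kernel treats the three colour channels, so the sketch assumes that every 2-D kernel is replicated across R, G and B and the responses are summed; zero padding at the image border is likewise an assumption.

```python
import numpy as np

# The three 5x5 SRM kernels of Figure 2.4, including their 1/4, 1/12 and 1/2 scalings.
K1 = np.array([[0, 0, 0, 0, 0],
               [0, -1, 2, -1, 0],
               [0, 2, -4, 2, 0],
               [0, -1, 2, -1, 0],
               [0, 0, 0, 0, 0]], dtype=np.float64) / 4.0
K2 = np.array([[-1, 2, -2, 2, -1],
               [2, -6, 8, -6, 2],
               [-2, 8, -12, 8, -2],
               [2, -6, 8, -6, 2],
               [-1, 2, -2, 2, -1]], dtype=np.float64) / 12.0
K3 = np.array([[0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0],
               [0, 1, -2, 1, 0],
               [0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0]], dtype=np.float64) / 2.0
SRM_KERNELS = np.stack([K1, K2, K3])

def filter2d_same(channel, kernel):
    """Cross-correlate one 2-D channel with a kernel, zero-padded to keep the size."""
    kh, kw = kernel.shape
    h, w = channel.shape
    padded = np.pad(channel, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def srm_noise_map(rgb):
    """Map an (H, W, 3) image to an (H, W, 3) noise map, one channel per SRM kernel.

    Assumption: each kernel is applied to R, G and B and the responses are summed.
    """
    rgb = rgb.astype(np.float64)
    maps = [sum(filter2d_same(rgb[..., c], k) for c in range(3)) for k in SRM_KERNELS]
    return np.stack(maps, axis=-1)
```

Each kernel sums to zero, so smooth image content is suppressed and only local residuals remain, which is why the noise maps emphasize noise rather than semantic content.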
2.3.3 Bilinear Pooling
We finally combine the RGB stream with the noise stream for manipulation
detection. Among various fusion methods, we apply bilinear pooling on features
from both streams. Bilinear pooling [1], first proposed for fine-grained classification,
combines streams in a two-stream CNN network while preserving spatial information
to improve the detection confidence. The output of our bilinear pooling layer is
$x = f_{RGB}^{T} f_N$, where $f_{RGB}$ is the RoI feature of the RGB stream and $f_N$ is the RoI feature
of the noise stream. Sum pooling squeezes the spatial feature before classification.
We then apply signed square root ($x \leftarrow \mathrm{sign}(x)\sqrt{|x|}$) and L2 normalization before
forwarding to the fully connected layer.
To save memory and speed up training without decreasing performance, we
use compact bilinear pooling as proposed in [2].
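As an illustration of the pooling step, here is a minimal NumPy sketch of full (non-compact) bilinear pooling followed by signed square root and L2 normalization. The function name and the small feature sizes are assumptions made for readability; the actual model uses compact bilinear pooling [2] to approximate this operation at a much lower output dimension.

```python
import numpy as np

def bilinear_pool(f_rgb, f_n):
    """Full bilinear pooling of two RoI features.

    f_rgb : (L, C1) RGB RoI feature, flattened over L = H * W spatial locations
    f_n   : (L, C2) noise RoI feature over the same L locations
    Returns a (C1 * C2,) vector.
    """
    # f_rgb^T f_n sums the outer products of the two features over all locations,
    # which is the sum pooling over the spatial dimension.
    x = (f_rgb.T @ f_n).reshape(-1)
    # Signed square root: x <- sign(x) * sqrt(|x|).
    x = np.sign(x) * np.sqrt(np.abs(x))
    # L2 normalization (left unchanged if the vector is all zeros).
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x
```

With the 7 × 7 × 1024 RoI features used in this chapter, the full outer product would have 1024 × 1024 entries per RoI, which motivates the compact approximation.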
After the fully connected and softmax layers, we obtain the predicted class of
the RoI regions, as indicated in Figure 2.2. We use cross entropy loss for manipu-
lation classification and smooth L1 loss for bounding box regression. The total loss
function is:
$$L_{total} = L_{RPN} + L_{tamper}(f_{RGB}, f_N) + L_{bbox}(f_{RGB}), \tag{2.2}$$
where $L_{total}$ denotes the total loss, $L_{RPN}$ denotes the RPN loss of the RPN network, and
$L_{tamper}$ denotes the final cross entropy classification loss, which is based on the
bilinear pooling feature from both the RGB and noise streams. $L_{bbox}$ denotes the final
bounding box regression loss. $f_{RGB}$ and $f_N$ are the RoI features from the RGB and noise
streams. The summation of all terms produces the total loss function.
2.3.4 Implementation Detail
The proposed network is trained end-to-end. The input image as well as the
extracted noise features are resized so that the shorter side equals 600 pixels.
Four anchor scales with sizes $8^2$, $16^2$, $32^2$ and $64^2$ are used, and the aspect ratios
are 1:2, 1:1 and 2:1. The feature size after RoI pooling is 7 × 7 × 1024 for both
RGB and noise streams. The output feature size of compact bilinear pooling is set
to 16384. The batch size of RPN proposals is 64 for training and 300 for testing.
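The anchor configuration above yields 4 × 3 = 12 anchor shapes per location. The sketch below enumerates them; the function name is hypothetical, and it assumes that each scale denotes the anchor area (for example 8 × 8 pixels) and that an aspect ratio is interpreted as height divided by width, neither of which is stated explicitly in the text.

```python
import numpy as np

def make_anchors(scales=(8, 16, 32, 64), ratios=(0.5, 1.0, 2.0)):
    """Return (len(scales) * len(ratios), 4) boxes [x1, y1, x2, y2] centred at the origin.

    Assumptions: an anchor of scale s has area s**2, and ratio = height / width.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)   # width shrinks as the box gets taller
            h = s * np.sqrt(r)   # height grows so that w * h stays s**2
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)
```

Shifting these 12 shapes to every position of the feature map gives the full anchor set scored by the RPN.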
Image flipping is used for data augmentation. The Intersection-over-Union
(IoU) threshold for an RPN positive example (potential manipulated region) is 0.7,
and for a negative example (authentic region) it is 0.3. The learning rate is initially
set to 0.001 and is then reduced to 0.0001 after 40K steps. We train our model for 110K
steps. At test time, standard Non-Maximum Suppression (NMS) is applied to reduce
the redundancy of proposed overlapping regions. The NMS threshold is set to 0.2.
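For reference, greedy NMS with the 0.2 threshold can be sketched as follows. This is the standard textbook procedure written in NumPy with hypothetical function names, assuming boxes in [x1, y1, x2, y2] format with positive area; it is not the exact routine used in our implementation.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box (4,) and an array of boxes (N, 4), format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, thresh=0.2):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it by more than thresh."""
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep
```

A low threshold such as 0.2 suppresses aggressively, so nearly every surviving detection covers a distinct region.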
2.4 Experiments
We demonstrate our two-stream network on four standard image manipulation
datasets and compare the results with state-of-the-art methods. We also compare
different data augmentations and measure the robustness of our method to resizing
and JPEG compression.
2.4.1 Pre-trained Model
Current standard datasets do not have enough data for deep neural network
training. To test our network on these datasets, we pre-train our model on our
synthetic dataset. We automatically create a synthetic dataset using the images and
annotations from COCO [27]. We use the segmentation annotations to randomly
select objects from COCO [27], and then copy and paste them to other images. The
training (90%) and testing (10%) sets are split to ensure that the same background and
tampered object do not appear in both the training and testing sets. Finally, we create
42K tampered and authentic image pairs. We will release this dataset for research
use. The output of our model is bounding boxes with confidence scores indicating
whether the detected regions have been manipulated or not.
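The copy-and-paste step used to build the synthetic dataset can be sketched as below. The function name and the explicit `top`/`left` paste position are hypothetical; the sketch assumes a binary segmentation mask for a single object and performs no blending or post-processing, so it only illustrates the basic idea rather than the full generation pipeline.

```python
import numpy as np

def splice(source, source_mask, target, top, left):
    """Paste the masked object of `source` into a copy of `target`.

    source      : (Hs, Ws, 3) image containing the object
    source_mask : (Hs, Ws) binary segmentation mask of the object
    target      : (Ht, Wt, 3) image that receives the object
    top, left   : position in `target` of the object's bounding-box corner
    Returns the tampered image and its (Ht, Wt) boolean ground-truth mask.
    """
    ys, xs = np.nonzero(source_mask)
    if ys.size == 0:
        raise ValueError("source_mask is empty")
    # Crop the object and its mask to the tight bounding box.
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    obj = source[y0:y1, x0:x1]
    obj_mask = source_mask[y0:y1, x0:x1].astype(bool)
    h, w = obj_mask.shape
    if top < 0 or left < 0 or top + h > target.shape[0] or left + w > target.shape[1]:
        raise ValueError("object does not fit inside the target at this position")
    tampered = target.copy()
    # Overwrite only the object pixels; the slice is a view into `tampered`.
    tampered[top:top + h, left:left + w][obj_mask] = obj[obj_mask]
    gt_mask = np.zeros(target.shape[:2], dtype=bool)
    gt_mask[top:top + h, left:left + w] = obj_mask
    return tampered, gt_mask
```

The returned mask gives the tampered region of each synthetic image, from which a bounding box for training can be derived.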
To include some authentic regions in each Region of Interest (RoI) for better
comparison, we slightly enlarge the default bounding boxes by 20 pixels during training
so that both the RGB and noise streams learn the inconsistency between tampered
and authentic regions.
Model                AP (synthetic test)
RGB Net              0.445
Noise Net            0.461
RGB-N noise RPN      0.472
Noise + RGB RPN      0.620
RGB-N                0.627
Table 2.1: AP comparison on our synthetic COCO dataset. Each row is a model
architecture: RGB Net is a single Faster R-CNN using the RGB image as input;
Noise Net is a single Faster R-CNN using the noise feature map as input; RGB-N noise
RPN is a two-stream Faster R-CNN using noise features for the RPN network; Noise +
RGB RPN is a two-stream Faster R-CNN using both noise and RGB features as the
input of the RPN network; RGB-N is a two-stream Faster R-CNN using RGB features
for the RPN network.
We train our model end-to-end on this synthetic dataset. The ResNet 101
used in Faster R-CNN is pre-trained on ImageNet. We use Average Precision (AP)
for evaluation, the metric of which is the same as COCO [27] detection evaluation.
We compare the result of the two-stream network with each one of the streams in
Table 2.1. This table shows that our two-stream network performs better than each
single stream. Also, the comparison among RGB-N, RGB-N using noise features for
the RPN, and the variant whose RPN uses both features shows that RGB features are
more suitable than noise features for generating region proposals.
2.4.2 Testing on Standard Datasets
Datasets     NIST16   CASIA   Columbia   COVER
Training     404      5123    -          75
Testing      160      921     180        25
Table 2.2: Training and testing split (number of images) for four standard datasets.
Columbia is only used for testing the model trained on our synthetic dataset.
Datasets. We compare our method with current state-of-the-art methods on
NIST Nimble 2016 [25] (NIST16), CASIA [3, 26], COVER [4] and the Columbia dataset.
• NIST16 is a challenging dataset which contains all three tampering techniques.
The manipulations in this dataset are post-processed to conceal visible traces.
Ground-truth tampering masks are provided for evaluation.
• CASIA provides spliced and copy-moved images of various objects. The tampered
regions are carefully selected and some post-processing like filtering and blurring
is also applied. Ground-truth masks are obtained by thresholding the difference
between tampered and original images. We use CASIA 2.0 for training and CASIA
1.0 for testing.
• COVER is a relatively small dataset focusing on copy-move. Its pasted regions
cover similar objects in order to conceal the tampering artifacts (see the second row
in Figure 2.1). Ground-truth masks are provided.
• Columbia dataset focuses on splicing based on uncompressed images. Ground-truth
masks are provided.
To fine-tune our model on these datasets, we extract the bounding box from
the ground-truth mask. We compare with other approaches on the same training
and testing split protocol as [21] (for NIST16 and COVER) and [34] (for Columbia
and CASIA). See Table 2.2.
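Extracting a bounding box from a ground-truth mask, optionally enlarged as described for training, can be sketched as follows. The function name and the `pad` parameter are hypothetical; the sketch assumes a single box around all tampered pixels, inclusive pixel coordinates, and clipping of the enlarged box to the image border.

```python
import numpy as np

def mask_to_bbox(mask, pad=0):
    """Return (x1, y1, x2, y2), inclusive, around the nonzero pixels of `mask`.

    pad enlarges the box on every side (e.g. pad=20 during training) and the
    result is clipped to the image. Returns None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    h, w = mask.shape
    x1 = max(int(xs.min()) - pad, 0)
    y1 = max(int(ys.min()) - pad, 0)
    x2 = min(int(xs.max()) + pad, w - 1)
    y2 = min(int(ys.max()) + pad, h - 1)
    return x1, y1, x2, y2
```

Images with several disjoint tampered regions would need one box per connected component, which this sketch does not attempt.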
Evaluation Metric. We use pixel level F1 score and Area Under the receiver
operating characteristic Curve (AUC) as our evaluation metrics for performance
comparison. F1 score is a pixel level evaluation metric for image manipulation
detection, as discussed in [34, 36]. We vary the threshold and use the highest
F1 score as the final score for each image, following the protocol in [34, 36].
We assign the confidence score to every pixel in the detected bounding boxes for
pixel-level AUC evaluation.
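The pixel-level protocol can be sketched in NumPy as follows. Both function names are hypothetical. The sketch assumes inclusive box coordinates, takes the maximum confidence where boxes overlap, and sweeps 101 evenly spaced thresholds; these choices are assumptions of the illustration, and the exact threshold grid follows [34, 36] in our experiments.

```python
import numpy as np

def boxes_to_score_map(shape, boxes, scores):
    """Assign each box's confidence to every pixel inside it (maximum where boxes overlap)."""
    score_map = np.zeros(shape, dtype=np.float64)
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        region = score_map[y1:y2 + 1, x1:x2 + 1]  # view into score_map
        np.maximum(region, s, out=region)
    return score_map

def best_f1(score_map, gt_mask, thresholds=None):
    """Highest pixel-level F1 over a sweep of thresholds on the score map."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    gt = gt_mask.astype(bool)
    best = 0.0
    for t in thresholds:
        pred = score_map >= t
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        denom = 2 * tp + fp + fn
        if denom > 0:
            best = max(best, 2.0 * tp / denom)
    return best
```

Since detections are boxes while the ground truth is a mask, the score of an irregularly shaped tampered region is bounded by how tightly a box can cover it.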
Baseline Models. We compare our proposed method with various baseline models
as described below:
• ELA: An error level analysis method [37] which aims to find the compression error
difference between tampered regions and authentic regions through different JPEG
compression qualities.
• NOI1: A noise inconsistency based method using high pass wavelet coefficients to
model local noise [38].
• CFA1: A CFA pattern estimation method [39] which uses nearby pixels to
approximate the camera filter array patterns and then produces the tampering probability
for each pixel.
• MFCN: A multi-task edge-enhanced FCN based network [34] jointly detecting
tampered edges using edge binary masks and tampered regions using tampered
region masks.
• J-LSTM: An LSTM based network [21] jointly training patch level tampered edge
classification and pixel level tampered region segmentation.
• RGB Net: A single Faster R-CNN network with RGB images as input, i.e., our
RGB Faster R-CNN stream.
• Noise Net: A single Faster R-CNN network whose input is the noise feature map
obtained from an SRM filter layer. The RPN network uses noise features in this case.
• Late Fusion: Direct fusion combining all detected bounding boxes from both RGB
Net and Noise Net. The confidence score of overlapping detected regions from
the two streams is set to the maximum of the two.
• RGB-N: Bilinear pooling of the RGB stream and noise stream for manipulation
classification, and the RGB stream for bounding box regression, i.e., our full model.
Method           NIST16   Columbia   COVER   CASIA
ELA [37]         0.236    0.470      0.222   0.214
NOI1 [38]        0.285    0.574      0.269   0.263
CFA1 [39]        0.174    0.467      0.190   0.207
MFCN [34]        0.571    0.612      -       0.541
RGB Net          0.567    0.585      0.391   0.392
Noise Net        0.521    0.705      0.355   0.283
Late Fusion      0.625    0.681      0.371   0.397
RGB-N (ours)     0.722    0.697      0.437   0.408
Table 2.3: F1 score comparison on four standard datasets. "-" denotes that the result
is not available in the literature.
Method           NIST16   Columbia   COVER   CASIA
ELA [37]         0.429    0.581      0.583   0.613
NOI1 [38]        0.487    0.546      0.587   0.612
CFA1 [39]        0.501    0.720      0.485   0.522
J-LSTM [21]      0.764    -          0.614   -
RGB Net          0.857    0.796      0.789   0.768
Noise Net        0.881    0.851      0.753   0.693
Late Fusion      0.924    0.856      0.793   0.777
RGB-N (ours)     0.937    0.858      0.817   0.795
Table 2.4: Pixel level AUC comparison on four standard datasets. "-" denotes that
the result is not available in the literature.
We use the F1 scores of NOI1, CFA1 and ELA reported in [34] and run the
code provided by [36] to obtain the AUC results. The results of MFCN and J-LSTM
are taken from the original papers as their code is not publicly available.
Table 2.3 shows the F1 score comparison between our method and the baselines.
Table 2.4 provides the AUC comparison. From these two tables, it is clear that
our method outperforms conventional methods like ELA, NOI1 and CFA1. This is
because they all focus on specific tampering artifacts that only contain partial
information for localization, which limits their performance. Our approach outperforms
MFCN on the Columbia and NIST16 datasets.
Figure 2.5: Qualitative visualization of results. Columns, left to right: tampered
image, ground-truth, RGB Net result, Noise Net result, RGB-N result. The top row
shows a qualitative result from the COVER dataset. The copy-moved bag confuses both
RGB Net and Noise Net. RGB-N achieves a better detection in this case because it combines
the features from the two streams. The middle row shows a qualitative result from
the Columbia dataset. RGB Net produces a more accurate result than the noise stream.
Taking into account both streams produces a better result for RGB-N. The bottom
row shows a qualitative result from CASIA 1.0. The spliced object leaves
clear tampering artifacts in both the RGB and noise streams, which yields precise
detections for the RGB, noise, and RGB-N networks.
Figure 2.6: Qualitative results for multi-class image manipulation detection on
the NIST16 dataset. Columns, left to right: authentic image, tampered image, noise
map, ground-truth, detection result. The RGB image and noise map provide different
information for splicing, copy-move and removal. By combining the features from the
RGB image with the noise features, RGB-N produces the correct classification for
different tampering techniques.
One of the reasons our method achieves better performance than J-LSTM is
that J-LSTM seeks tampered edges as evidence of tampering, which cannot always
detect the entire tampered region. Also, our method has a larger receptive field and
captures global context rather than nearby pixels, which helps collect more cues like
contrast difference for manipulation classification.
As shown in Tables 2.3 and 2.4, our RGB-N network also improves on the
individual streams for all the datasets except Columbia, where the noise stream alone
attains a higher F1 score. Columbia only contains
uncompressed spliced regions, which preserves noise differences so well that it is
sufficient to use only the noise features. This yields satisfactory performance for the
noise stream.
F1/AUC               NIST16        COVER         CASIA
Flipping + JPEG      0.712/0.950   0.425/0.810   0.413/0.785
Flipping + noise     0.717/0.947   0.412/0.801   0.396/0.776
Flipping             0.722/0.937   0.437/0.817   0.408/0.795
No flipping          0.716/0.940   0.312/0.793   0.361/0.766
Table 2.5: Data augmentation comparison. Flipping: image flipping. JPEG: JPEG
compression with quality 70. Noise: adding Gaussian noise with variance of 5. Each
entry is F1/AUC score.
For all datasets, late fusion performs worse than RGB-N, which shows the
effectiveness of our fusion approach.
Data Augmentation. We compare different data augmentation methods in Table
2.5. Compared with no augmentation, image flipping improves the performance
and other augmentation methods like JPEG compression and noise contribute little
improvement.
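For detection data, flipping must be applied to the bounding boxes as well as to the image. The sketch below shows one way to do this; the function name is hypothetical, and it assumes a horizontal flip and inclusive pixel coordinates, since the text does not specify the flip direction.

```python
import numpy as np

def hflip(image, boxes):
    """Horizontally flip an image and its boxes.

    image : (H, W) or (H, W, C) array
    boxes : (N, 4) array of [x1, y1, x2, y2] in inclusive pixel coordinates
    """
    w = image.shape[1]
    flipped = image[:, ::-1].copy()
    boxes = np.asarray(boxes, dtype=np.float64)
    out = boxes.copy()
    # Column x maps to w - 1 - x, so the left and right edges swap roles.
    out[:, 0] = w - 1 - boxes[:, 2]
    out[:, 2] = w - 1 - boxes[:, 0]
    return flipped, out
```

The same flip is applied to the noise-stream input, since the noise features are computed from the flipped RGB image.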
Robustness to JPEG and Resizing Attacks. We test the robustness of our
method and compare with 3 methods (whose code is available) in Table 2.6. Our
method is more robust to these attacks and outperforms other methods.
JPEG/Resizing   100/1         70/0.7        50/0.5
NOI1            0.285/0.285   0.142/0.147   0.140/0.155
ELA             0.236/0.236   0.119/0.141   0.114/0.114
CFA1            0.174/0.174   0.152/0.134   0.139/0.141
RGB-N           0.722/0.722   0.677/0.689   0.677/0.681
Table 2.6: F1 score on NIST16 dataset for JPEG compression (with quality 70 and
50) and resizing (with scale 0.7 and 0.5) attacks. Each entry is the F1 score of
JPEG/Resizing.
2.4.3 Manipulation Technique Detection
The rich feature representation of our network enables it to distinguish between
different manipulation techniques as well. We explore manipulation technique
detection and analyze the detection performance for all three tampering techniques.
NIST16 contains the labels for all three tampering techniques, which enables
multi-class image manipulation detection. We change the classes for manipulation
classification to be splicing, removal and copy-move so as to learn distinct visual
tampering artifacts and noise features for each class. The performance of each tamper
class is shown in Table 2.7.
        Splicing   Removal   Copy-Move   Mean
AP      0.960      0.939     0.903       0.934
Table 2.7: AP comparison on multi-class on NIST16 dataset using the RGB-N
network. Mean denotes the mean AP for splicing, removal and copy-move.
The AP results in Table 2.7 indicate that splicing is the easiest manipulation
technique to detect using our method. This is because splicing is likely
to produce both RGB artifacts, such as unnatural edges and contrast differences, and
noise artifacts. Removal detection performance also beats copy-move because the
inpainting that follows the removal process has a large effect on the noise features,
as shown in Figure 2.3. Copy-move is the most difficult tamper technique for our
proposed method. The explanation is that on one hand, the copied regions are
from the same image, which yields a similar noise distribution to confuse our noise
stream. On the other hand, the two regions generally have the same contrast. Also,
the technique would ideally need to compare the two objects to each other (i.e., it
would need to find and compare two RoIs at the same time), which the current ap-
proach does not do. Thus, our RGB stream has less evidence to distinguish between
the two regions.
2.4.4 Qualitative Result
We show some qualitative results in Figure 2.5, comparing the RGB, noise
and RGB-N networks in two-class image manipulation detection. The images are
selected from the COVER, Columbia and CASIA 1.0 datasets. Figure 2.5 provides examples
for which our two-stream network yields good performance even if one of the single
streams fails (the first and second rows in Figure 2.5).
Figure 2.6 shows the results of the RGB-N network on the manipulation
technique detection task using NIST16. As shown in the figure, our network
produces accurate results for different tampering techniques.
2.5 Conclusion
We propose a novel network using both an RGB stream and a noise stream
to learn rich features for image manipulation detection. We extract noise features
by an SRM filter layer adapted from the steganalysis literature, which enables our
model to capture noise inconsistency between tampered and authentic regions. We
explore the complementary contribution of finding tampered regions from RGB and
the noise features of an image. Not surprisingly, the fusion of the two streams
leads to improved performance. Experiments on standard datasets show that our
method not only detects tampering artifacts but also distinguishes between various
tampering techniques. More features, including JPEG compression, will be explored
in the future.
Chapter 3: Deep Video Inpainting Detection
3.1 Introduction
Video inpainting, which completes corrupted or missing regions in a video
sequence, has achieved impressive progress over the years [40-48]. The ability to
produce realistic videos that can be used in applications like video restoration, vir-
tual reality, etc., while appealing, brings significant security concerns at the same
time since these techniques can also be used maliciously. By removing objects that
could serve as evidence, malicious inpainting can result in serious legal and social
implications including swaying a jury, accelerating the spread of misinformation on
social platforms, etc. Our goal in this work is to develop a framework for detect-
ing inpainted videos constructed with state-of-the-art methods (see Fig. 3.1 for a
conceptual overview).
Although there are recent studies on detecting tampered regions in images [6,
49-51], very limited effort has been devoted to video inpainting detection. Existing
image-based manipulation detectors either focus on spliced regions or
"deepfake"-style face replacement rather than inpainting-based object removal, or
they are designed only for still images [52, 53] and perform poorly on videos.
Therefore, it is important to learn robust video representations
that explore the temporal relationships among frames for video inpainting detection.
Figure 3.1: Problem introduction. Given an inpainted video (second column),
we localize the inpainted region both spatially and temporally. (Panel labels:
original frame, inpainted frame (input), our prediction, ground truth; axis: time.)
In light of this, we introduce VIDNet, a video inpainting detection network,
which is an encoder-decoder architecture with a quad-directional local attention
module to predict inpainted regions in videos (see Fig. 3.2). In particular,
at each time step, VIDNet feeds the current RGB frame, together with
its corresponding Error Level Analysis (ELA) [54] frame, to an encoder truncated
from a pretrained VGG network [55]. Since videos are compressed based on discrete
cosine transforms (DCT) and extracted frames are usually stored in JPEG format,
we leverage ELA images as an additional signal to reveal artifacts like compression
inconsistency (see Fig. 3.3). Instead of using ELA images directly, which
tends to produce false alarms, we extract features from both ELA and RGB images
with the encoder, producing five multimodal features at different scales
that are jointly trained for inpainting detection. In addition, given a missing
region to fill in, inpainting methods leverage information from surrounding pixels
of the region to make the region coherent spatially. Motivated by this, for RGB
features from the last layer of the encoder, we introduce a quad-directional local
attention module to attend to the neighbors of a pixel to detect whether that pixel
is inpainted or not. This allows us to explicitly model spatial dependencies among
different pixels to identify inpainted pixels.
Figure 3.2: Framework overview. Given an RGB frame in a video, we first derive
its corresponding ELA frame and compute multimodal features at different scales
with both frames. We also introduce a quad-directional local attention module
(striped) to the last encoded RGB features (colored blue) to explore spatial
relationships among pixels from four directions. These encoded features are further
input into a multi-layer ConvLSTM (colored green) for decoding, exploiting spatial
and temporal relationships explicitly, to produce masks of inpainted regions. See
text for more details. (Diagram labels: RGB, ELA, QDLA, ConvLSTM, Upsample;
input: RGB and ELA frame; axis: time.)
Finally, with multimodal features encoded at different scales, we leverage a
four-layer Convolutional LSTM, serving as a decoder for inpainting detection. More
specifically, the ConvLSTM at a certain layer not only takes in features from a previ-
ous time step but also features upsampled from a coarse level (i.e., a lower decoding
layer). In this way both spatial relationships across different scales and temporal
dynamics over time are leveraged to produce inpainted masks over time. The framework
is trained end-to-end with backpropagation. We conduct experiments on the
DAVIS 2016 [56] dataset and the Free-form Video Inpainting dataset [44]. VIDNet
successfully detects inpainted regions under all settings and outperforms
competing methods by clear margins. We also show that VIDNet generalizes
to out-of-domain inpainted videos that are unseen during training.
Our contributions can be summarized as follows: 1) We target a relatively
new task and, to the best of our knowledge, introduce the first learning-based
approach for video inpainting detection. 2) We present an end-to-end framework
for video inpainting detection, which models spatial and temporal relationships in
videos. 3) We leverage multimodal features, i.e., RGB and ELA features, at different
scales, for video inpainting detection. 4) We introduce a quad-directional local
attention module to explicitly determine if a pixel is inpainted or not by attending
to its neighbours.
3.2 Related Work
Video Inpainting. With the advance of recent image inpainting approaches [46-48,
57-62], more recent studies have investigated video inpainting. There are two
lines of work: patch-based and learning-based approaches. For patch-based
approaches, PatchMatch [63] is a prominent approach which searches for similar
patches in the surrounding region iteratively to complete the inpainted region. To
achieve better quality, Huang et al. [45] explore an optimization based method to
match patches and utilize information including color and flow as regularization. On
the other hand, learning-based approaches have been explored recently. Wang et al. [64]
propose a 3D encoder-decoder structure for video inpainting. Afterwards, Xu et
al. [42] leverage optical flow information to guide the inpainting in videos in both
forward and backward passes. Similarly, Kim et al. [41] propose to estimate the
proceeding flow as an additional constraint while completing the missing regions. To
maintain more frame pixels, Oh et al. [43] use gated convolution to inpaint video
frames gradually from the reference frame. Lee et al. [40] copy and paste future
frames to complete missing details in the current frame. In contrast, our approach
detects regions inpainted by these approaches.
Manipulation Detection. There are also approaches focusing on manip-
ulation detection. Most mainly tackle splicing based manipulation and use clues
specific to it [19, 51, 65, 66]. In particular, Zhou et al. [49] use both RGB and local
noise to detect potential regions. Salloum et al. [67] rely on boundary artifacts to
reveal manipulated regions in a multi-task learning fashion and Zhou et al. [68] im-
prove its generalization ability with a generative model. Huh et al. [6] use meta-data
to find inconsistent patches and Wu et al. [50] treat it as anomaly detection to learn
features in a self-supervised manner.
More related to our work are methods for image inpainting detection. [53] is
a classical approach that searches for similar patches matched by zero-connectivity.
However, high false alarm rates limit its application in real scenarios. More
recently, Zhu et al. [69] use CNNs to localize inpainting patches within images. Li
et al. [52] explore High Pass Filtering (HPF) as the initialization of CNNs for the
purpose of distinguishing high frequency noise of natural images from inpainted ones.
However, the generalization and robustness are limited as these HPFs are learned
given specific inpainting methods. In contrast, we combine both RGB information
and ELA features as inputs to VIDNet, and show that our approach generalizes to
different inpainting methods. In addition, without temporal guidance, the methods
above cannot guarantee temporally consistent prediction like our approach.
3.3 Approach
VIDNet, Video Inpainting Detection Network, is an encoder-decoder architecture
(see Fig. 3.2 for an overview of the framework) operating on multimodal features
to detect inpainted regions. In addition to RGB video frames, VIDNet utilizes Error
Level Analysis frames (Sec. 3.3.1) to identify artifacts incurred during the inpainting
process. Motivated by the fact that inpainting methods typically borrow information
from neighbouring pixels of the region to be inpainted, we introduce a multi-head
local attention module (Sec. 3.3.2) which uses adjacent pixels to discover inpainting
traces. Finally, we model the temporal relations among different frames with a
ConvLSTM (Sec. 3.3.3). In the following, we describe the components of the model.
Figure 3.3: ELA frame example. From the top to the bottom: the inpainted
RGB frame, its corresponding ELA frame, and the ground-truth inpainting mask.
The inpainting artifacts, e.g., the dog, person and ship, stand out in ELA space
while not easily seen in the RGB space.
3.3.1 Multimodal Features
Learning a mapping directly from an inpainted RGB frame to a mask that
encloses the removed object, while feasible, is challenging, since the RGB space is
intentionally modified by replacing regions with their surrounding pixels to appear
realistic. To mitigate this issue, we additionally augment RGB information with
error level analysis features [54] that are designed to reveal regions with inconsistent
compression artifacts in compressed JPEG images. Note that although videos are usually
compressed in MPEG formats, extracted frames are often stored as JPEG images.
More formally, an ELA image is defined as:
\[ I_{ELA} = \lVert I - I_{jpg} \rVert_1, \tag{3.1} \]
where $I_{ELA}$ is the ELA image, $I$ denotes the original image and $I_{jpg}$ denotes the
recompressed JPEG image from the original image.
Figure 3.4: The quad-directional local attention module. Given RGB features
from the last layer of the encoder, we derive attention maps with a quad-directional
local attention module. To detect whether a pixel is inpainted or not, the module
attends to its neighbors from four directions (left-to-right, up-to-down, right-to-left
and down-to-up). (Diagram labels: Feature, Conv, Sigmoid, ConvLSTM.)
Fig. 3.3 illustrates the corresponding ELA images of sampled inpainted frames.
Although ELA images have been used in forensics applications [36, 66], they tend
to create false alarms when other artifacts, e.g., sharp boundaries, are present
in the images, which requires ad-hoc judgement to determine whether a region is
tampered. So, instead of only using ELA frames, we augment them with RGB
frames as inputs to our encoder.
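As a rough illustration of Eq. 3.1, the following sketch computes an ELA frame with NumPy and Pillow. The function names `jpeg_recompress` and `ela_frame` are hypothetical, the per-pixel absolute difference is one reading of the L1 difference in Eq. 3.1, and the default quality factor of 50 is the value reported in Sec. 3.3.4. It is a minimal sketch under these assumptions, not the implementation used in our experiments.

```python
import io

import numpy as np


def jpeg_recompress(frame, quality=50):
    """Round-trip an HxWx3 uint8 frame through JPEG at the given quality (needs Pillow)."""
    from PIL import Image  # imported lazily so the rest of the sketch only needs NumPy

    buffer = io.BytesIO()
    Image.fromarray(frame).save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return np.asarray(Image.open(buffer).convert("RGB"))


def ela_frame(frame, recompress=jpeg_recompress):
    """Per-pixel absolute difference |I - I_jpg| between a frame and its recompressed copy."""
    original = frame.astype(np.int16)  # widen so the subtraction cannot wrap around
    recompressed = recompress(frame).astype(np.int16)
    return np.abs(original - recompressed).astype(np.uint8)
```

Passing the recompression step as an argument keeps the difference computation independent of the codec, so the same function can be reused with another quality factor or compression scheme.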
In particular, both the RGB and ELA frame are input to a two-stream encoder.
Each stream, based on a VGG encoder, transforms the input image to high-level
representations with five layers, yielding 5 feature representations at different scales.
At each scale, we normalize the corresponding RGB and ELA features, respectively,
with $\ell_2$ normalization, and then apply one convolutional layer to absorb both
features into a unified representation:
\[ f_l = \mathrm{ReLU}\big(F\big([\, \lVert f^{RGB}_l \rVert_2 \mid \lVert f^{ELA}_l \rVert_2 \,]\big)\big) \quad (l < 5), \tag{3.2} \]
where $[\,\cdot \mid \cdot\,]$ denotes feature concatenation, $\lVert \cdot \rVert_2$ denotes $\ell_2$
normalization and $f_l$ denotes the feature at the $l$-th layer. $f^{RGB}_l$ and
$f^{ELA}_l$ denote the RGB and ELA features at layer $l$, respectively. $F$ represents the
convolutional layer and ReLU denotes the activation function. The fused representation
at each level is further used for decoding. For $l = 5$, we simply use RGB
features as we find that high-level ELA features are not helpful.
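The fusion in Eq. 3.2 can be sketched in NumPy as follows. Several details here are assumptions made only for illustration: the $\ell_2$ normalization is taken over the channel axis at each spatial location, $F$ is taken to be a 1x1 convolution, and the names `l2_normalize` and `fuse` and the weight shapes are hypothetical.

```python
import numpy as np


def l2_normalize(features, eps=1e-12):
    """Scale the channel vector at each spatial location to unit L2 norm; features is (C, H, W)."""
    norm = np.sqrt((features ** 2).sum(axis=0, keepdims=True))
    return features / (norm + eps)


def fuse(f_rgb, f_ela, weight, bias):
    """Eq. 3.2 with F assumed to be a 1x1 convolution; weight is (C_out, C_rgb + C_ela)."""
    stacked = np.concatenate([l2_normalize(f_rgb), l2_normalize(f_ela)], axis=0)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    mixed = np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]
    return np.maximum(mixed, 0.0)  # ReLU
```

With `C_out` set to half of `C_rgb + C_ela`, this mirrors the halving of the feature dimension described in Sec. 3.3.4.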
3.3.2 Quad-Directional Local Attention
Inpainting methods aim to replace a region with pixels from its surrounding
areas for photorealistic visual effect. Therefore, when determining whether a pixel
is inpainted or not, it is important to examine its surrounding pixels. Inspired
by recursive filtering techniques that model pixel relations from four directions for
edge-preserving smoothing, we introduce a quad-directional local attention module
to explore spatial relations among adjacent pixels.
We learn four attention maps for four directions, left-to-right, right-to-left,
top-to-bottom, bottom-to-top, to determine how much information to leverage from
the pixels in the corresponding direction based on each map. More specifically,
we use $F_{\rightarrow}$, $F_{\leftarrow}$, $F_{\downarrow}$ and $F_{\uparrow}$ to denote functions that derive attention maps for the
left-to-right, right-to-left, top-to-bottom and bottom-to-top directions. In the
following, we consider the left-to-right direction for simplicity. Given features $f_5$
from the last layer of the RGB stream, we first transform the features with $F_{\rightarrow}$ to
have the same dimension as $f_5$, and then compute an attention map $A_{\rightarrow}$:
\[ A_{\rightarrow} = \sigma(F_{\rightarrow}(f_5; W_{\rightarrow})), \tag{3.3} \]
where $W_{\rightarrow}$ denotes the weights of the convolutional kernel, and $\sigma$ is the sigmoid
function, which ensures that the attention weights at each pixel are in the range [0, 1].
Then, for each pixel in the feature map, we obtain information from the surrounding
pixels as:
\[ f_5^{\rightarrow}[k] = (1 - A_{\rightarrow}[k])\, f_5[k] + A_{\rightarrow}[k]\, f_5[k-1], \tag{3.4} \]
where $k$ denotes the location of the pixel. Since we are considering attention from
the left-to-right direction, $k - 1$ indicates the pixel to the left of $k$. The current
value of pixel $k$ is updated with information from its neighboring pixel, and the
weight $A_{\rightarrow}$ that balances the contribution is derived with convolution, which aggregates
information from a small grid in the original features. As a result, we attend to a
small local region to compute the refined representation. We can derive $f_5^{\leftarrow}$, $f_5^{\downarrow}$
and $f_5^{\uparrow}$ similarly, and thus we have four different refined representations.
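The directional update in Eqs. 3.3 and 3.4 can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the attention logits stand in for the output of the convolution inside the sigmoid of Eq. 3.3 and are passed in directly, the border pixel that has no neighbor in the given direction reuses its own value, and the other three directions are obtained by flipping or transposing the spatial axes. The function and key names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def left_to_right(f5, attention_logits):
    """Eqs. 3.3-3.4 for one direction; f5 and attention_logits are (C, H, W)."""
    a = sigmoid(attention_logits)  # attention map in [0, 1]; logits stand in for the conv output
    left = np.empty_like(f5)
    left[..., 1:] = f5[..., :-1]  # f5[k - 1]: the pixel to the left of k
    left[..., 0] = f5[..., 0]     # assumption: the first column reuses its own value
    return (1.0 - a) * f5 + a * left


def quad_directional(f5, logits):
    """Return the four refined maps; logits is a dict with keys 'lr', 'rl', 'tb', 'bt'."""
    flip_w = lambda x: x[..., ::-1]            # reverse the width axis
    swap = lambda x: np.swapaxes(x, -1, -2)    # exchange height and width
    return {
        "lr": left_to_right(f5, logits["lr"]),
        "rl": flip_w(left_to_right(flip_w(f5), flip_w(logits["rl"]))),
        "tb": swap(left_to_right(swap(f5), swap(logits["tb"]))),
        "bt": swap(flip_w(left_to_right(flip_w(swap(f5)), flip_w(swap(logits["bt"]))))),
    }
```

All four maps are computed from the same input `f5`, which corresponds to the parallel (multi-head) variant rather than a sequential cascade.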
Note that the quad-directional attention module is similar in spirit to recursive
filtering. However, in standard recursive filtering, a weight matrix, in the form of
an edge map [70] or a weighted map [71], plays the role of our attention map $A$ and
guides the filtering to restore images or smooth feature maps. In contrast, our
filtering can be considered a form of self-attention: we derive attention maps by modeling
similarities in a local region with convolutions conditioned on input features, and
the resulting maps are in turn used to refine features, allowing each pixel to borrow
information by attending to its adjacent pixels. In addition, the motivation of our
approach can be seen as the "reverse" process of recursive filtering: in recursive
filtering, information from surrounding pixels is diffused to make local regions
coherent, whereas we wish to detect inconsistent pixels by attending to a neighboring
region.
Furthermore, we compute four refined feature maps for four directions in
parallel, conditioned on the same feature map. An alternative is to generate a
single feature representation by sequentially performing attention in four directions,
i.e., $f_5^{\rightarrow}$ is used as input to generate $f_5^{\leftarrow}$, and so on, as in [70].
However, we find through experiments that the parallel multi-head approach offers
better results, possibly due to the disentanglement of different directions.
3.3.3 ConvLSTM Decoder
Temporal information like inconsistency in the inpainted region over time is
a significant clue for video inpainting detection. To explore temporal relationships
among adjacent frames, we use multiple ConvLSTM decoding layers to take features
from the encoders and produce predicted detection results, which enables message
passing from previous frames. More specifically, the decoder contains four Con-
vLSTM layers to process features from different spatial scales. At each time step,
taking into account both spatial and temporal information, we concatenate the
skip-connected feature of the current frame and the upsampled feature from
a lower level as the input to the current ConvLSTM layer. More formally, for the
$t$-th time step, the $i$-th ($2 \le i \le 4$) ConvLSTM computes the hidden states and
cell contents for the $(t+1)$-th time step as:
\[ h_i^{t+1},\, c_i^{t+1} = \mathrm{ConvLSTM}_i(g_i^t,\, h_i^t,\, c_i^t), \tag{3.5} \]
\[ g_i^t = [\, U(h_{i-1}^t) \mid f_{6-i}^t \,], \tag{3.6} \]
where $h_i^t$ and $c_i^t$ denote the hidden states and cell states for the $i$-th ConvLSTM,
respectively, and $U$ denotes the function for bilinear upsampling, which maps the
outputs from a lower-level ConvLSTM with smaller feature maps to have the same
dimension as the current one. In addition, $f_{6-i}^t$ is the skip-connected feature of
frame $t$ from the encoder.
When $i = 1$, the first ConvLSTM layer takes features from the last layer
of the encoder, i.e., $f_5$, as inputs. Recall that we obtain four refined features based
on $f_5$ with our quad-directional local attention module to identify pixels that are
inconsistent with their neighbours from four directions. Thus, we use these refined
features as inputs to $\mathrm{ConvLSTM}_1$. We input them into the LSTM in the order of
$f_5^{\rightarrow}$, $f_5^{\leftarrow}$, $f_5^{\downarrow}$ and $f_5^{\uparrow}$ to obtain all four directional features.
At each time step, we compute $g_5^t$ with Eqn. 3.6 to produce a prediction $p^t$
for each QDLA direction via one convolutional layer. Finally, to explore non-linear
relations among these four directional outputs, we fuse them with one additional
convolutional layer to form the final prediction. During training, we divide each
video into $N$ clips with equal clip length. To encourage more intersection with
the binary ground truth mask, we use IoU score [72] as our loss function which is
formulated as:
\[ L(p, y) = 1 - \frac{\sum_{m=1,w=1}^{H,W} p_{m,w} \cdot y_{m,w}}{\sum_{m=1,w=1}^{H,W} p_{m,w} + \sum_{m=1,w=1}^{H,W} y_{m,w} - \sum_{m=1,w=1}^{H,W} p_{m,w} \cdot y_{m,w} + \epsilon}, \tag{3.7} \]
where $p$ and $y$ denote the prediction and the binary ground truth mask, respectively.
$H$ and $W$ denote the height and width, respectively. $\epsilon$ denotes a small number to
avoid zero division.
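For concreteness, the loss in Eq. 3.7 can be written as a short NumPy function. The name `iou_loss` and the default value of `eps` are assumptions for illustration; the prediction `p` is expected to hold per-pixel probabilities in [0, 1].

```python
import numpy as np


def iou_loss(p, y, eps=1e-6):
    """Soft IoU loss of Eq. 3.7 for a prediction p and a binary mask y, both (H, W)."""
    p = np.asarray(p, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    intersection = (p * y).sum()
    union = p.sum() + y.sum() - intersection
    return 1.0 - intersection / (union + eps)  # eps guards against empty masks
```

The loss approaches 0 when the prediction matches the mask and equals 1 when the two do not overlap at all, including the case where both are empty.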
The loss is back-propagated once the ConvLSTM decoder has gone through a single video
clip to collect temporal information. By exploring spatial and temporal information
recurrently, predictions of inpainted regions become more accurate.
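The clip-based training schedule can be illustrated with a small helper that divides a video into equal-length clips. The helper name `split_into_clips` is hypothetical, and dropping a trailing remainder shorter than one clip is an assumption, since the handling of leftover frames is not specified here. The default clip length of 3 is the training value reported in Sec. 3.3.4.

```python
def split_into_clips(frames, clip_length=3):
    """Split a frame sequence into consecutive, non-overlapping clips of equal length."""
    # Assumption: frames that do not fill a complete clip at the end are dropped.
    usable = len(frames) - len(frames) % clip_length
    return [frames[i:i + clip_length] for i in range(0, usable, clip_length)]
```

Each returned clip would then be passed through the decoder frame by frame, with one loss computation per clip.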
3.3.4 Implementation Details
We use PyTorch for implementation. Our model is trained on an NVIDIA
GeForce TITAN P6000. The input to the network is resized to 240 × 427. The
length of our video clips is set to 3 frames during training. To extract ELA frames,
we recompress the corresponding RGB frames with JPEG quality factor 50 and compute
their difference. Our feature extraction backbone is VGG-16 [55] for both RGB and ELA
features. To increase the generalization ability, we add instance normalization [73]
layers to the backbone. The encoder is initialized from a VGG-16 model pretrained
on ImageNet [74] and the decoder is initialized with Xavier initialization [75]. We
concatenate both RGB and ELA features up to the penultimate encoding layer.
Afterwards, the features are passed through one convolutional and normalization layer
that halves the dimension to reduce the number of training parameters. The QDLA module
is only added to the last encoder layer to extract directional feature information,
based on ablation results in Sec. 3.4. The decoder is a 4-layer ConvLSTM. We use
the Adam [76] optimizer with a fixed learning rate of $1 \times 10^{-4}$ for the encoder and
$1 \times 10^{-3}$ for the decoder. The optimizers of the encoder and decoder networks are
stepped in an alternating fashion. To avoid overfitting, weight decay with a factor of
$5 \times 10^{-5}$ and 50% dropout [77] are applied. Only random horizontal flipping
augmentation is applied during training. We train the whole network end-to-end for
40 epochs with a batch size of 4.
3.4 Experiment
We compare our VIDNet with approaches on manipulation/image inpainting
detection in this section to show the advantages of our approach on video inpainting
detection. We also analyze the robustness of our approach under different pertur-
bations and show both quantitative and qualitative results.
3.4.1 Experiment setup
Dataset and Evaluation Metrics. We evaluate inpainting detection on DAVIS 2016 [56],
the most common benchmark for video inpainting, which consists of 30 videos for
training and 20 videos for testing.
We generate inpainted videos using state-of-the-art video inpainting approaches: VI [41],
OP [43] and CP [40], with the ground truth object mask as reference. To show both
performance and generalization, in each setting we train and test on DAVIS inpainted
by two of the three algorithms and hold out the third for additional testing. The
training/testing split follows the DAVIS default setting. We report the F1 score and
mean Intersection over Union (IoU) with respect to the ground truth mask as evaluation
metrics.
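The two metrics can be computed per frame from binary masks as in the sketch below. The function name `iou_and_f1` is hypothetical, the prediction is assumed to be already binarized, and scoring two empty masks as a perfect match is a convention chosen here for illustration. The reported mean values would then be averages of these per-frame scores.

```python
import numpy as np


def iou_and_f1(pred_mask, gt_mask):
    """Return (IoU, F1) for two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = np.logical_and(pred, gt).sum()    # inpainted pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()   # authentic pixels flagged as inpainted
    fn = np.logical_and(~pred, gt).sum()   # inpainted pixels that were missed
    # Assumption: two empty masks count as a perfect match.
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    return float(iou), float(f1)
```

Since F1 = 2 IoU / (1 + IoU) for a single pair of masks, the two scores rank individual frames identically but average differently over a test set.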
We compare our method with the following approaches:
NOI [38]: A traditional approach which finds regions with inconsistent noise
as a clue of manipulation. The code for evaluation is from Zampoglou et al. [66].
We directly test on the VI, OP and CP test sets as it is unsupervised.
CFA [65]: An approach that estimates the camera's Color Filter Array (CFA) pattern
and regards regions with different CFA patterns as manipulated. We directly test
on the VI, OP and CP test sets as it is unsupervised.
HPF [52]: A learning based image inpainting detection approach that applies
one high pass filter layer as an initialization to reveal high frequency inpainting
artifacts. We implement their filter kernel and train the network frame-by-frame
from the ImageNet pretrained weights for comparison.
GSR-Net [68]: A generic image manipulation segmentation approach that ap-
plies generative models and exploits boundary artifacts to improve the generalization
ability. We use their released code and retrain on inpainted DAVIS frame-by-frame
for evaluation.
Ours RGB (baseline): Our baseline approach, which takes only RGB frames as
input. No QDLA module is applied.
VIDNet-BN (ours): Our approach with batch normalization [78].
VIDNet-IN (ours): We report this as our main result. It replaces the
batch normalization in the encoder with instance normalization to improve the
generalization across different video inpainting algorithms.
3.4.2 Main Results
Tables 3.1, 3.2 and 3.3 highlight our advantages over other methods. In all
three settings, our IN version outperforms other approaches on both trained and
untrained inpainting algorithms, showing the generalization of our approach.
Additionally, we show clear improvement over our baseline, indicating the effectiveness
of our proposed ELA feature and QDLA module. Comparing across different inpainting
algorithms, the performance degrades on the untrained algorithms, indicating
a domain shift between trained and untrained inpainting algorithms. However,
benefiting from diverse features and more focus on proximity regions, our method
still generalizes better than other approaches. Finally, the
results indicate that our BN version generally performs better on the in-domain
inpainting algorithms, while the IN version generalizes better
to the cross-domain one. Therefore, we provide both results as a trade-off between
in-domain performance and generalization.
Table 3.1: Mean IoU and F1 score comparison on inpainted DAVIS. "*"
denotes that the model is trained on these inpainting algorithms.
                      VI*            OP*            CP
Methods               IoU    F1      IoU    F1      IoU    F1
NOI [38]              0.082  0.137   0.090  0.137   0.072  0.132
CFA [65]              0.103  0.142   0.083  0.137   0.076  0.121
HPF [52]              0.456  0.568   0.494  0.615   0.458  0.577
GSR-Net [68]          0.571  0.693   0.500  0.626   0.509  0.634
Ours RGB (baseline)   0.552  0.671   0.456  0.580   0.493  0.625
VIDNet-BN (ours)      0.620  0.726   0.749  0.833   0.670  0.775
VIDNet-IN (ours)      0.585  0.704   0.588  0.707   0.565  0.685
Table 3.2: Mean IoU and F1 score comparison on inpainted DAVIS. "*"
denotes that the model is trained on these inpainting algorithms.
                      VI             OP*            CP*
Methods               IoU    F1      IoU    F1      IoU    F1
NOI [38]              0.082  0.137   0.090  0.137   0.072  0.132
CFA [65]              0.103  0.142   0.083  0.137   0.076  0.121
HPF [52]              0.342  0.444   0.409  0.510   0.676  0.773
GSR-Net [68]          0.302  0.426   0.736  0.818   0.801  0.849
Ours RGB (baseline)   0.308  0.417   0.705  0.773   0.777  0.859
VIDNet-BN (ours)      0.301  0.415   0.801  0.860   0.837  0.915
VIDNet-IN (ours)      0.386  0.493   0.740  0.820   0.810  0.869
3.4.3 Ablation Analysis
We analyze the importance of each key component in our framework and the
details are as follows:
Ours ELA: The baseline architecture with only the ELA frame as input.
Ours RF edge: Similar to Chen et al. [70], we add an additional edge branch
and apply a recursive filter to the final prediction. The output of the edge branch is
used as the reference for the recursive filter layer. The loss function of the edge
branch is a weighted binary cross entropy loss.
Table 3.3: Mean IoU and F1 score comparison on inpainted DAVIS. "*"
denotes that the model is trained on these inpainting algorithms.
                      VI*            OP             CP*
Methods               IoU    F1      IoU    F1      IoU    F1
NOI [38]              0.082  0.137   0.090  0.137   0.072  0.132
CFA [65]              0.103  0.142   0.083  0.137   0.076  0.121
HPF [52]              0.551  0.671   0.186  0.286   0.690  0.796
GSR-Net [68]          0.588  0.703   0.221  0.329   0.700  0.765
Ours RGB (baseline)   0.582  0.689   0.196  0.305   0.753  0.846
VIDNet-BN (ours)      0.578  0.695   0.231  0.323   0.753  0.848
VIDNet-IN (ours)      0.592  0.712   0.245  0.344   0.760  0.850
Table 3.4: Ablation analysis for each component on our approach. "*"
denotes that the model is trained on these inpainting algorithms.
                      VI*            OP*            CP
Methods               IoU    F1      IoU    F1      IoU    F1
Ours ELA              0.460  0.578   0.509  0.631   0.417  0.546
Ours RGB (baseline)   0.552  0.671   0.456  0.580   0.493  0.625
Ours RF edge          0.540  0.661   0.460  0.591   0.555  0.670
QDLA both features    0.555  0.680   0.580  0.700   0.495  0.635
Ours w/o QDLA         0.559  0.682   0.557  0.681   0.512  0.644
Ours frame-by-frame   0.558  0.683   0.566  0.688   0.532  0.664
Ours w/o ELA          0.568  0.691   0.465  0.595   0.560  0.678
QDLA all layers       0.570  0.693   0.469  0.585   0.564  0.682
VIDNet-IN (ours)      0.585  0.704   0.588  0.707   0.565  0.685
Ours w/o ELA: The baseline with QDLA applied to the last encoder layer.
This is our full model without the ELA features.
QDLA both features: Our full model, except that the input to the QDLA module
is the concatenation of the RGB and ELA features from the 5-th layer.
QDLA all layers: Applying the QDLA module to all 5 encoding feature layers.
Ours frame-by-frame: Instead of training with a video clip length of 3, we train
our full model frame-by-frame.
Ours w/o QDLA: Adding ELA features to the encoder and concatenating them with
the RGB features. The decoder follows the baseline, which uses only temporal
information as the additional feature. This is our full model without the QDLA module.
Table 3.4 displays the comparison results. Compared to the baseline, the ELA
feature alone yields worse performance. This is perhaps because the ELA frame also
contains other artifacts like sharp boundaries, which lead to confusion without proper
guidance from RGB contents. Adding the QDLA module introduces feature adjacency
relationships and thus leads to improvement. However, the comparison with QDLA all
layers suggests that higher-level features are more useful for our QDLA than
lower-level ones, and the comparison with QDLA both features suggests that high-level
ELA features are less helpful than low-level ones. Compared to Ours RF edge, our QDLA
module (Ours w/o ELA) yields better performance because the boundary prediction
degrades in the video inpainting scenario, so the edge map that guides the
segmentation branch contains false positives. In addition, the comparison between
Ours frame-by-frame and our final model verifies the importance of temporal
information in video inpainting detection. Finally, combining the QDLA module, ELA
features and temporal information boosts the performance further.
3.4.4 Robustness Analysis
Figure 3.5: Mean IoU comparison under different perturbations, with panels (a) JPEG
perturbation and (b) noise perturbation, each on VI*, OP* and CP. Perturbation in
JPEG compression consists of the quality factors 90 and 70; perturbation in
noise consists of SNR 30 dB and 20 dB. Columns from left to right are the results on
VI, OP and CP inpainting. "*" denotes that the model is trained on these inpainting
algorithms.
To test the robustness of our approach under noise and JPEG perturbation,
we conduct the experiments shown in Fig. 3.5. We add Gaussian noise to the input frames
with Signal-to-Noise Ratio (SNR) 30 and 20 dB and evaluate on these noisy frames,
or recompress the test frames with JPEG quality 90 and 70. Moreover,
to study the effect of specific augmentations on video inpainting detection, we apply
noise and JPEG augmentation to our approach and include them in the comparison. The
details of our augmentations are as follows.
VID-Noise-Aug : Randomly apply Gaussian noise with SNR 20 dB to the input
frames during training.
VID-JPEG-Aug : Randomly apply JPEG compression with quality factor 90
to the input frames during training.
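The noise perturbation described above can be sketched as follows. The definition of signal power as the mean squared pixel intensity of the frame, the rounding and clipping back to 8-bit values, and the function name `add_gaussian_noise` are assumptions made for illustration; the exact procedure used in the experiments may differ.

```python
import numpy as np


def add_gaussian_noise(frame, snr_db, rng=None):
    """Add zero-mean Gaussian noise to a uint8 frame at a target SNR given in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal = frame.astype(np.float64)
    signal_power = np.mean(signal ** 2)                     # assumption: power = mean squared intensity
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))  # SNR_dB = 10 log10(P_signal / P_noise)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return np.clip(np.rint(signal + noise), 0, 255).astype(np.uint8)
```

A lower SNR value means stronger noise, so the 20 dB setting perturbs the frames more heavily than the 30 dB setting.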
The robustness of our approach stands out under different perturbations.
Compared to other approaches, HPF suffers more from perturbation because more
high frequency noises will be introduced. With generative models for augmentation,
47
Table 3.5: Mean IoU and F1 score comparison on FVI. All models are trained on
VI and OP inpainted DAVIS and tested directly on the FVI dataset.

                          FVI
Methods                   IoU      F1
NOI [38]                  0.062    0.107
CFA [65]                  0.073    0.122
HPF [52]                  0.205    0.285
GSR-Net [68]              0.195    0.288
Ours RGB (baseline)       0.156    0.223
VIDNet-IN (ours)          0.257    0.367
GSR-Net shows good robustness. However, our approach outperforms GSR-Net because
it considers more modalities of video inpainting clues. Although noise
augmentation slightly degrades the initial performance, it improves the
robustness to both noise and JPEG perturbation. A similar observation holds for
JPEG augmentation.
3.4.5 Results on Free-form Video Inpainting Dataset
To further test the performance on a different dataset, we provide an additional
evaluation on the Free-form Video Inpainting (FVI) dataset. The FVI dataset [44] provides 100
test videos, which mostly target multi-instance object removal. We directly apply
the inpainting approach of [44], which uses a 3D gated convolution encoder-decoder
architecture, to generate the 100 inpainted videos. To test the generalization
of our approach, we directly test the models trained on VI and OP inpainted DAVIS.
Table 3.5 displays the comparison results. Since both the dataset and the
inpainting approach are different, the performance degrades due to the domain shift.
However, compared to other approaches, our method still achieves relatively better
Figure 3.6: Qualitative visualization on DAVIS. Rows from top to bottom: input
(inpainted) video frame, HPF, GSR-Net, Ours, and ground truth. The second to
fourth rows show the final predictions of the different methods.
generalization by a large margin. Also, compared with our baseline model, which
uses only RGB features, our approach shows a clear improvement. This further
validates the effectiveness of combining RGB and ELA features and of introducing
spatial and temporal information as additional evidence.
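For reference, the mean IoU and F1 scores reported in Table 3.5 can be computed along the following lines. This is a minimal sketch rather than the evaluation code behind the table: it assumes binary prediction and ground-truth masks per frame (any thresholding of probability maps is left to the caller), averages the per-frame scores, and uses hypothetical function names. The small eps term means a frame with empty prediction and empty ground truth scores 0, which is a choice made here and not one stated in the text.

```python
import numpy as np


def iou_and_f1(pred, gt, eps=1e-8):
    """Intersection-over-union and F1 score for one pair of binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()    # inpainted pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()   # false alarms
    fn = np.logical_and(~pred, gt).sum()   # missed inpainted pixels
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return float(iou), float(f1)


def mean_scores(preds, gts):
    """Average IoU and F1 over a sequence of frames."""
    scores = [iou_and_f1(p, g) for p, g in zip(preds, gts)]
    ious, f1s = zip(*scores)
    return float(np.mean(ious)), float(np.mean(f1s))


if __name__ == "__main__":
    pred = np.array([[1, 1], [0, 0]])
    gt = np.array([[1, 0], [1, 0]])
    print(iou_and_f1(pred, gt))
```

Because F1 = 2*IoU / (1 + IoU) for a single mask pair, F1 is never lower than IoU, which is consistent with every row of the table.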
3.4.6 Qualitative Results
Fig. 3.6 compares our predictions with those of other methods under the
same setting. Thanks to our ELA and RGB features, which provide spatial clues,
our approach clearly obtains predictions closer to the ground
truth than other methods. Specifically, HPF only transfers RGB into the noise domain,
which makes it prone to false alarms. GSR-Net makes decisions frame by frame,
which makes its results less temporally consistent. In contrast, with the help of
temporal information, our predictions maintain temporal consistency.
3.5 Conclusions
We introduce learning-based video inpainting detection in this paper. To reveal
more inpainting artifacts from different domains, we propose to extract both RGB
and ELA features and concatenate them. Additionally, we encourage learning
from adjacent features in a self-attended manner by introducing the QDLA module. With
both adjacent spatial and temporal information, we make the final prediction
through a ConvLSTM-based decoder. Our experiments validate the effectiveness
of our approach both in-domain and cross-domain. As the results show, a clear
gap in generalization and robustness remains, so the problem is far
from solved. Domain adaptation strategies might remedy
this issue; we leave them for future research.
Chapter 4: Generate, Segment and Refine: Towards Generic
Manipulation Segmentation
4.1 Introduction
Manipulated photos are becoming ubiquitous on social media due to the
availability of advanced editing software, including powerful generative adversarial
models [79,80]. While such images have been created for a variety of purposes, including
memes and satire, there are growing concerns about the abuse of manipulated images
to spread fake news and misinformation. To this end, a variety of solutions have
been developed for detecting such manipulated images.
While a number of proposed solutions pose the problem as a classification
task [16,51], where the goal is to classify whether a given image has been tampered
with, there is great utility in solutions that are capable of detecting manipulated
regions in a given image [6,16,81,82]. In this paper, we similarly treat this problem as
a semantic segmentation task and adapt GANs [83] to generate samples to alleviate
the lack of training data. The lack of training data has been an ongoing problem
for training models to detect manipulated images. Scouring the internet for "real"
tampered images [84] is a laborious process that often leads to over-fitting in the
Figure 4.1: Examples of manipulated images across different datasets.
Columns from left to right are images in CASIA [3], COVER [4], Carvalho [5],
and In-The-Wild [6]. The odd rows are manipulated images and the even rows are
the ground truth masks. Different datasets contain different distributions (from
animals to people), manipulation techniques (from copy-move (the second column) to
splicing (the remaining columns)) and post-processing methods (from no post-processing
to various processes including filtering, illumination, and blurring).
training process. Alternatively, one could employ a self-supervised process, where
detected objects in one image are spliced onto another, with the caveat that such
[Figure, panel (a) Generate Stage. Labels: Copy Pasting; Tampered Image S; Ground Truth K; Hard Example; Target Image T; Discriminator D.]
hs" dtNiNxCeYtY6uu6u<uey4Os_esbgaael4F=E1asala<extit sha1_base64="Cu6uuu4dgeEsF_O4yauau=u66subC1"hyEgFds4eN t<iaxtelYONeYsd4gEFOyuuu6uC"=46esb_1ahs <leaitxt dVutWiLd2tCKkYAr63ivbQBzS4NJE"Y792NqCY4QEp8BLWrN9HEBSAbc6"A64ZyrjraNWp6tH848FEFASqeVxIMCjN/nCTppNATJAj>I=6Cw/yj5MrxYe4S=F"F>4AHA6vWYa5nwIIAt8uNWZCckH4B8Ap>6=EYI5YwnINJ4HL8tFNFAk7CzW3uKtdNTpNC2/CjVMYAB46rHyi6cjbKZ3Bzq7xYSeENQSva96wy5rY4=">AJjIAAB6HicbZBNS8NAEIYn9avWr6pHL4tF8FQSEeqx4MVjC/2CNpTNdtKu3WzC7kYAiSwbd/FwvhPoqATFv2Pxl6vO3iLeYr55bvoYovtzH3oi4Hgg/buFboizdwF5dPjdooxbI3zrW3L8wxiqLi3Oqjlfq2FfbACA1/asvYxlvqodp/Wx4vjpOju88nifd3n3e2jrfurfI33r5ALb3AvzL/2COsLarYv3vluvvPiFfw/Azb3oALAzr5rIfoAruj2Fbe3dfbAinujgzOp41H/6FtxxChWSq2shFxat51xL6pH84vOng3uPi3brdTeuF3jPrroaIO5szLLCvww/LiOx5SLWv2PiThPqFxwFAtb1o6Sx2Tq 9i1FQQuLoNR5soEsl3tLgy0I3PA2nWuxJiQshQosFNA0r/vWBK9MHq51Rw7yzW66h50NjHyoWKLGTPlv1PQG5sQoKHsLF5N63ixyWwWNsq1MQ2PWK/IoLN0LWTj0y6zhR57BvHAFrhQounJ30AtlgysPsyPysgG55yKhQ1o6LrHnsRNLF057i96F3JW3xEyuNiwWWj1hszqR2HMBQAKoPQWuoA/0It0sLoNQTi5oLQsNH5iF3W6yNxW1wq2sQKMWoPI0/NTLWyL0hGKjEuz765HRBv9AFrhQounJ30AtlgsREuQoiR1o RTjQ8RsisSELuUcvjWFJ6RCgNRTXcFNaWiARsi7FUWNZPvEctb5uiLOxIS8sknasTR/02Qo0n0W0nWo61NR2pUJz1CgZEgNv89CzO+TphhXOp8RE+1FpzNNR97civCWPgZitZ5abCORIzxskUaAn2/12o00nTWQ2n/R8saxksiSIbOLEu5ZtcUvPiNWjF7NR6JRpE1gCN8hOTRXpz+FcN9gvWaiZzCRAsUN0zsRA2U2ZiaWvg9NcF+zpXRTOh8NCg1EpRJ6RN7FjWNiPvUctZ5uELObISiskxas1R/n2QTnoW801 3vYfImbmLBFCIlh9Kx4aGzlYbIVvVfJ3KKV4aGZlzi7z2Ymabom2i18sLDbmDFmFFVma8729XPzZhnm9xGaQtJVK9FmaCsJgYripjrMAZAbNJme82VVImmFFYbDmsI8gVm2aJFetGJmQAmN9bnMXZPA9p8rsj1YaioIYGBm3vfrK4bZAMrjpirYgJsCamF9KVJtQaGx9mnhZzPX97aLs81i2moaYbzmIvf3NAmeJK4BGlYDbmbBYKzboim8aJip1e2FsCLs8j7Z9Aa2PmzFXahmngZr9YxrmAaMQNGmJJVVtI9VF Aato8q4XuxIEXaU5n8lIQX<6HG/zw4+O0qa4O4u5/bOnYd8/yY8QCJbcAiH/EivOBARWdquEuRUQmnQaquGG/E<al5tutal<//zGOYGq+QdzGJ65qcn4uiXxm/bOaiX8uO5vnAEAUW4BQqIduE48RRou8QqUEnEudaHmquAqBGbtWlC<Az8QAOy+vH8/OwY+80OaiO/uO/uO/YO8xya8iC0b4A+HcEwq5o/4JIH4zE+5QXObqXQnQ6zdYQqOH+z5J/+wci40OaxO/uO/iO8Yy8vAA8bCWqBAEHd8EqYoz4nq6GduXmba 
[Figure, continued. Further labels in (a): Copy-Pasted image M; Generated image G(M); Generator G. Panel (b) Segment Stage: Predicted Boundary P; Segmentation Network; Upsample & Concat; inputs Tampered Image S or Copy-Pasted image M or Generated image G(M); output Segmentation. Panel (c) Detailed Structure of 2.]
O7Ob8+dMA6cyO1CApBo9rbDJXcGLKl6TIIWHn98qC+Wnqem9pVuUvHzYQdOIKL8dAGDbiSBrcAAvIzbMKi8+O8Ox/O17KQ4O/zpzBHPucWn6wMv<==w==WvMP6nzcu/HBQpz1O4xK7OO/iO8b+8KMvzIcAASrBDibdGAK8LdIOzQYUH9C+qnqWm9eVupvHUYQzOAnd8yP598h5gqpA0+lFcde+fdRi6hFM7BEg8DgStQgCfVBCyQOmzuMvKN3O0ryA9ndBpcHIzvMKb8+i8OO/Ox7K14OQzp/BHzucPn6WMvw==<9XLXDc8KEIZ3AOlxDANdzCPdShCQI6q9X+qneWA9uVUp3HQYdz0I8LdK7GibSDArIA7czz/vKP8uiwOe/1xOKV4ZQHppBuz6cVnMW=v9=m<Wqnq+C96Fa9ncIfeZ8/kkKLfg==</latexit>IzvMKb8+i8OO/Ox7K14OQzp/BHzucPWVm6epuYvMUzQLOwd=8=<vn70X7AcABSr9NZWc8PdILzQOUHYpuveDi6bqdqGCAWKn8+m99=<AGLdIt>iDx8eitAa+la/=<S=EgcjEiIGKjlzIBzzA/W+oIA4LP8Dnte6k6ft<a3uXuC6PHN5l1Zr84nv9d7XAhd4DJMhXAFL9AcKfdZD/KkALGgd=b/irDNBurQSQAhAdcCIdzAvxMOK3bI8K+miq8qOCO6/AOnxKEFMG3bNcIXO<QYUHvpuVe<==wvMW6nPcuzHB/pzQO41K7x=BDrOSuAmAUc8IezvvQMdKzb96V9pCH+YqznIqLWK>dbpixDnB4rzSvAKAQcBI8zWv=M7K1bO8z+/iH8uOVO6/MOw=OxtixVet7KK71W4cO=Qnzvpu/PB6HMzw1b8O/OO8i+69C+qnqWm9<ew=vOM/WH6zn9PBcpuQz4VBr/aOiOu8+pMKvvzAIAcSdDibdGALK8OIYQzUH /tx>/latexit>>tixetallt/t>tiielax/etal/xtx>aiteexit>>ti/latexit>xetal/>/latexit>/latei/lat
xi_tt 6szhxae1t_l4s=Va1tsax3lWNOBbSaasaet634==<"sVh6 3ibeOaO<ZwS1dUWoodBZUOzb11Nhw 6iebs6a"b4_e_abh1ats eixlta<BUoWdSZOObse64la<"V=wN1wN1zzUBoWdSZOObt366V"e AENp8USSNBBrZIbAcY798cgr74d"ci+NUY4n7v"6uHE4rFxFASAcBc6AH6ibcSbEv9trQLq8aEi4W2Y0A6piAAc"LU87EBtN9YLv8p8t0QQ27/J2Ereuqx8M77g7d8+g47=d>cA+BUH4ccZ=N"8>AAIAnAaBW66HHi4cFbFZSBeixVMjNS8jN/ADDl/2YEIY2nV9HaBvAW>r=64p+HdLg47tNF88AFIQnSaZWE6eHq4xF4FqSleAx22Mm4jVDidYl20AM27mJdxMruxum=dcMuJA>uuxuJMrd7m022Al2DY/ijVM KeWn6tr3/aOjYNYUV0r+D+Vuua9DDuxiHEYE8viOxPW8/M68YyGD8LEvu0jKKoYmUvudmcD/8aEuD8/PVpYeH8xUD49HKeWWratvN/0DL8yPMOPivoEDu+aL+U03NSana6evu9WC/q4xLe4L8Pp/8YadcKvvoh0hLPyxMLPCv4EPu+P0ohLhSPG4YCLLqxv4SUL8oePp/PW8uuvaq/LcYdGvnm3ojKUh+huthhP4CLx4U8epP8ua/cdvmoK0vLDy8M8POvEEiuDau++0UNja3tneKu6W9/YPDrGYVY/LxWvqSHLo nIVqvmeTvWG25qfxeoTbBrpFgngKCRg5GcnOXmbm2UDOqEX9x5OfOmFU9Ev59EpY5peEFHFjj9+xOCF8EQYEHHp5no5FRKP5EhSNkOLtCwCkCw2LVWwegnRnCEK5FHkFrjoeC9xF5xw+ZCHBk8qCEQtzIogVvegGDWc2+L5CCwZCBk2C8wHkCtqvQOqSzNEPohIgV5TReKgnGFgrWbXo2xDqL5cZC2CHzqSqoEqoNzqQ2CP8ZBqCh+xxbFg9reOjwFmHC5mEknUpCpOYwvEEkf95t95EvOfUOmEmSOvcNDYXPgpghTpI Yf/vW1HlM6U/szjElnILYfJcq0nRXQPUc1xIPQj0ZM1qQZZSzIgLTQO/S0v/1IUxYg00L1ZRhEIQ66wWZs1YHXSPCQLTW1fZLwRSafddkrQYvkr3DGEnkHYjKJ/PujkZIOQUvZ3w9S0fRdGrIY6k23nGfn//WHHUMjUIsJjnlPIxYjJ1qZngXOPvcUxLPhj6ZZ1HQCZWzLgaTkOvSDvk1KUuYI0vL9ZRhII26fwWZM1sHlSYCqLXWcfPLZRQazdTkSQ1vYrLDhE6kZYHKC/WuLkaIkQvvD3k9K0uRIGvI96R2In2 GgvFi4TZsxlmBTVVysuldlLCVEMqAZyo5/JGIiqE2RlLLUql0lZ3TIV9r+/N2UreGr2I7AJ/nZIriTq0Ev2VRlllLgL3UIq9Weyv4FqsTievylql4rFZiGsZTJZum4x2lTVqBqTysLVsyTdvurl/+VCvl0LLM2VIEGZAqVAC5yyBomGi/yJrneIUiNq+E92IR3lgLlLlUUqLlR0EliZng/vo35TqIEVM9lrl+d/VNT2VUxWZAs7FIq2em7lVBTsVydulClLUMVZqAn5oZxGyUWA7I2rGerUUWAEqyevI2rerU2N/ k8M+hcmwDhU8o5CxOza+BoCvaYyzsFv<P/olQaxt/YCzv9+5nc8fw8+9kDafyisiv/Pmo//wCG/OCAIt0a4l4/rCgg7BHc7wE/MCGIA0a4Ql6uzu7lqhfF5hbqflXull6qu/Q<uQ/lq6XlugQxr/Uq4lqbh5HEc7gHC7BgxrU4Q4a0AIGCM/wCm/OHHEc57bgfPXol/DivysfC9a+k8hcwn85oz+zvY/F<tlaHCno+UQ/EugqH65l7l7uhXHlcfgqCbB4a94atavlz/F<YQzrn85c8Ch+wkv/fysDmiPO/owCCGM/IA0 
xet>iteix>xe>ex>iitt
<latexit shttuF1PbIFxMgSLDZbci36BAAA>"=M85k2HcAAlGrgttFuSFM1DPxbgILT7Zbkc=iA386HB"AMA5A2>cn0Mf7Zc4Kh=4"es6b_aah1 tlAG4rag4TZ7"Ze7_nhfK0=M6cs7bsxeialttut1PFIa1htsx iteFxMgSLDZbci36BAAA>"=M85k2HcAAlGrgT7Z7nf0Mc7Zh4K"=46esab_1ahs tixetal<bZ<l7a1_base64="K4hZ7ttuHF<1TP8blInF=xkMAgrSZL0D"ZMb5c2ic3A6GBgA7A7Af>Mca rLEOR8E97iaJ6Ew9EZSa7UBnbX0A7Rr0GqJ3Z6DSrJGEeertpau27bN7VwF6EhEJKc1GbaEa2tFPGKEAKvtmGRwwVyBvRhOL8JJuFHiJhtUnOj+L6wEQz76VdZLjYJfftLOdNLKJluDjTpNNSEyYRCC7RDyaSZNjTJDElGKENfwY0LRdz6Qu58GzPTWWjrD4GP/5xDzvmDmDNtXPfbKa2tb2x12G58TzP8zT6WLr84iPE5ZDUvXDRD6jSWYW4rDtGP6Sjtn30qJuH8AOJhLRnVv6yu6w6LSj6n3tqJ0H8uOJALRhVvnywwtRamKvGA9KFPEtJa6LJp8NEEj1Z2XEOGRBSewEYrDCaREyGSENfTYDPl/K5NxwD0zRvzmQD5mGDPNWtjXDPGf/bxKzam2mtNbX2fxKu2lbbxclrcOO7tt+7f+wwOfjVJwa6tFPOKjAhvlmuRJFtwOpL68EN1EiZE2GXUReBES6YCrRD7aSyNjZJDTlGEENKwYfLR0z6du5QGz8TWPjrWD4GP/5xDzvmDmDNtJL2S6jJ67nP3htwqbJa0EHF8OuVO7JrAuLtRbhPVtvanJyjwOwwtfR+atmOKcvlGxAb92KKFfX KZ+PN2+hauTTfFy+3QjxQ+pCZoYjBxFGt6WGxpQ+UxqmrpfoYmPrT8ZmGQ4f1+QzF9fvGHCU+PhXKZZTCbEZTf3FiX0roqACoU+HQZU9vhzEl5FyV2maSP+nJS1PMaxdn+GisxLIhJBqdPl8+1p3wnVaRMYz96d4GxhQaGSCUnD71pNEdGm5l+r3+s8ipxmUwLQrVmfQohFsApr8oBFm0oupidfr3mbfTl2fErZZC+XYZ8ZNKpPPhmTb+Wx+f3ifFQtuF03F1oAr4GaFqoND1+QUUaSQHUhdGvz99CY+hlWOhFV55mBoymMYZwTCQjyGVWZff+BFotmQB381pFs4haQGmNr1LDUUxSiash3G+d59GYECpW7hnOC5GBQmxM4R6PzdMiaInq38138aPzq4JQICx7iE+5d3aiPUSrRQysM8mmmporBfVf5Z5YONFPhYhTWylZCBdGHZhPTXZCx+CEZ2GjWbT3+fFifuQt3F0oF14ArFGaNqo+D1UQUQaSd9vzY9Cl+hhWO5FVB5mmoyRMPaS+ixIJqP813naMz64xQGCn7pEG5+3sixULrmQhsp8BmopdrmflfrZ+Y8NpPmYwTQyVZfBKUh 
EJdaledBihldIPgazN6lDCGAUvOnVZEHh=JijeMYEZLItk9JAHAmnQHQ9moH/J/kjIQZ/YeeYijEKVMOnUIGPDi6JzxgCIplYiLd8lpdjEBJnlfhYhFYcmqwYjg85FMR9uiCdt=3<8lttmxwt8FRuCt3zyKnQuGp7lh>tixetal/<==QdNij9HM/5jg/YZqocaF9YnfHneBnjvpA8PLYAz9ApxCBCtJPiLEdImnQkHJYIZap0EMejUVO6GDIzgdliEldhJlmhYFw8CRuzt3nyKGQulp7th8ya0K3iy3MyKjneQauaGNpv7nlZhH8=tian0Hy93oi/K/MjMQJ/neE>IjdBPnLfiYCFJctqxYBgC59Mp9AiYdA=L<Pl8tAxpt3/i5Ja0HfqEozcUaiFVKDQImlHlJOkGI6ZgYledjdEJtithhY9mew78CFiRhunCxtC>p3lz8ynKynBQtu9GJBYjv>njjQxMeNtHa=ld/i<9=MpYg/AYnZJMEAPLPA8ILd Network 1024512
Final Prediction 256Input Image 64 128 2 2 Final Prediction
DeepLab VGG-16
(d) Refine Stage Predicted Boundary P 0
<lsteix ta=e6s6aTb9_s1SaRhc<AlbaEtre3xviVt" AsihBaN1Y_vb"atsae36M4n=v"=A>RALBtXcc3ZKNa8QATIan3a1Wv=zAML3ccK0QnaZ1Vz23V0zZXlaWva9nYIEAN8SNBZbciX6BAAA>"=svzV2VZn0c3Mzv13aTQaK3ctLRA"=46esab_1ahs tixetal<rWva9nYIEAN8SNBZbciX6BAAtexit sha1_base64="ARLtc3KaQTa31vzM3c0A>"=svzV2<nZrW2vAa69sniYAI"EzAcNr8BSAN>B4ZvbVV 6l/p3Lp73YqddOtvbxNQtvuwWACRkpoxfaAvwHF/P4PtF9FQSEfRRzaL8R5Qa5Kv5DABijoFuh2JArt3NgEl2sTo1QWidCi2ZhIx1No43H//V2zndmzy7aWNufYlfjrrooAIA5/zzLCviwsLQOd53LvvvPqPdPqFiwoAvfPolv3LQ5aOsLCw/vzLAzb5AIro3rfjultfWNza3ydmVnd2//oHp41NZxqhi2WCxi1Q2oRsEltgw3AruJQhoFAjvB5Davd55QaR7LRzN6pHL4tF9FQSEfRY8OKxiv2ANpTNdtIu3WzC7kYoof/AiwdFvPqRzLz71RWadQ15q5oddv2axDi5ZBpv/jVA3FooQhiQCJ2uhrxAN34wHg/t2lnEmsyRaPfWtvlku4jtffrf3WoprOIFA65obCzuANLAzxvY/SwFCHL/soOYa75zQ3LI3dvTlNP2viPKq8PRvEFQd9wtiLAplLQ3aO5LCs/vwLAzb5zIrA3rojuftflNzW3yamVdd2n/o/p4HNZ1qhx2WixiCQ21RsoltEw3gruAQkY74CfzOWF36uYIStFdHNKT8pRNEAQ29vtiLxphoFAjvB5Davd55QaR7LRzJPvv sZpaA2Np1qTbK0N3N90IdGSgNPiGySV/AiCkU51W/NMaZ0h1I2QIXuef5J/mwmw2nKWRWoafwAmqAi9uxwGpH+jm6NtCMdJrn8EsYQWopNN/TT/N7WQYmErnbJtMpmgsmHSGBxf9bAVmiwBakWmWZncwfw2/Y54e1X8QiIWh/ZOMv/H1XU/CCAWVNyTi02ZbsRpHAAN91qTMKKaSI/B0fNbViBkmZcf2Y418iW/O7p/cXNCWZT0AspTN1IKa/7pGuTWNfJ2YnES3mJmMm2/sGHqiKxA9Ro9mawkafWnWAq5w/wIWi5XeuvbcsamN0I/oSPKQM2qs9dA8H1Rrbt2di6yCV0TNCjUH1a/+MGZphNIw2I/odS6PCK0QNMj2Hqas+NGWpCN/wXbcuvWOi/IWqi58A9adfAk8oH91RRi1K4qY22/fmc3ZmmSkJB2igVfbGfuBpSIAeWuXbQwINhpZGM+/a1HUjCNA0VCy6id2tbrR1H8Ad9sq2MQKPSo/I0NNgmpBfSVibkmBcfZY428i1/OWcXvCW/T0NspZN1AKaT7pIuT/NfG2YWESJmJnMm3/smHq2KxG9Ri9mAwkofWaWAa5wnwIqi5/ 
ciEgV3uBgW5zO5d56FVdjqMXRdccz/zyy/JVGWIto4YREZxxtlyOXE5EBTcuoEhCxPaLIiTGaBetfxhBxEt1oraA3IFI/R4nEe/IiUe3jZx+YKa8WOqO//ZOTxi6EKR1Z5/B5QBzBx=Mtz56ootYhxyuoxEJahfjITVI5eG>Yixtiaet<l/w==C8MPJBzcu/HBQxz1B5xK6OO/KO83+ZeUIInRrIAB1EBxtLGiEPCEuT/iBHYzGuTcVPIBJJaCEMx8ywh=6=o<zicWglB5zq5ZddX/c4yWVZtWRlxTO/EEuaC/letixd>5YIGqTfVhIFJoauE5xtyohW6Moiz3c3g5BizF5ddaX/cWyMV4tlRoxEOEEYuPCBLtGxtrBu1IAeIxnUIK3o+O8xOhOK6Q1jBxzz/fHcuCPIJ8M<we=l/xateiiY>tLiBGwButexPBJE/1>rzAHIPIMR=naeiI5UQ3xZB+zKc8BOCO8/=O<xl6tKx1t/
Edge Replacement
Target Image T Predicted BoundaryP Generated imageG(M) New Tampered Segmentation
ENeuY4sYdt4hguEaFdO6yauguuua6tuNCF"C=b4i6NedsOaub"_e1_ashxsl Ytgiyx6e=tsa1l <e<<least4eFxyiut6 Cs=h6as1b_1bha siee6a4<=e"sC4uE6Ouuuuuuy"O4FeEag_4adsstYxetNl KkY7=C8zaWH36uIKZtAd5NWTnpANNCc2B/>CYjwVrMv49xYqEeNESSBQbFi86FAtA4"L4HrpykzW6ISjCJLJ7j9IE6dwbYEkr78C4zNW33YuBtNdYNvTpptNQCq2V/2CTjKVzMY4cxZqNe8EASIQnFa8WJ6jHI46FwFyS5erxYM4j=/"C>pANAtAuBW6CHki7CS83uKtdNTpNC2/CjVM4xqeESQF8Ft4LHp6rWva9nYIEAN8SNBZbciH6BAAA>"=4Yr5yJjI6wyFt>4cLNHBpY6ZrHWAv=ar9BnbYiI6EAAAN"54 xSW2ihqxFt16xHp4jO8gnufi3b3d2erFujfr3orWS5bzALzv/wCLsOa5YL3vlPvTqPvFdwiA/bAxIooqhizLpAOzPb85PALIwrjoF3Trvf5jLuvFOr4eH2vdq3vbl33iYfausnCg/FSx16xHp4jO8gnufi3b3d2erFujfr3orIA5bzALzv/wCLsOa5YL3vlPi2h3qWxCFvAa/zbqoloYxs6/1xtPdToPovbL/5AOiLwwdvFLvSvPvqdwFA/ioobzbAAI5o2rrf3uFxtje2r3bdif3nguOj8pHx61tFwxTq2Wiih4 ri06Q9oR1EuQEAutuRo107hsz5RvHlBAroF/hIJ0nL3NgTlLsqo0Qtign3AQuJFhoPAW2hjWs1ywN3xW56isFNoHL5QKPGsWyyvB9H5R7z6h0jyWLTNL0I/oWPKQM2qs1WwNyxW36i5FNsHLoQK5GsPyM5QKQounJi1FQQuLoNR5soEsl3tLgy0I3PA2nWuxJiQshQosFNA0r/vWBK9MHq51Rw7yzW66h50NjHyoWKLGTP3jygtlEsRouQ1iGsPWLTNL0I/oWPKQM2qs1WwNyxW36i5FNsHLoQKAFArvB09H5R7z6h0y0Q0JnWoRWiU2sLzTRUCRa0ZsicgoW8vEcj9CNczIF5+tRZpPXNhiT7ORCN8pN1EEg81OJhapT+/zRnsa8ksxS2i1OLbuoWQnE0T2s/cn8aSkvxiIuOUbE5vtWZUP6NFNjR0WjnQ0R26/snsaRkSxLIJOubc5gtvZWPNNFi67CRRNgpT1NET8XOXhFpR+NzW9FvigRZNCsz9UcvWgiZaCRzsUA211A2pARiF7onW0Q0T/2nsR8kaxSsiOIbuLEt5ZvcUNPiFWjR7NR6J1pENgCO8hXTR+pzNFcv9giWaCZzsRA2U1 amozmY9bbBafZ3zKJ4iGxlsY2D3b2mFFmFzVLmhIG8CVj2AJmelmmA7NmbaMtZQA9pbrmj2m7aziZ1a2VsKLJ8i7p9baez8XFhDn4Zv9hxamLaiQzGBJKV9tV9GFa2xiZ18sLa79XPzZhnm9xGaQtJVaFKJCmrgsjiYArpbZMmNA2eJIV8FmVbFmlDYKG4v3fvImoImbzYKaCBmYJogas1rsi8Y9mPIXvnf93mKQ4JGtlFYaDmbgmrFYFrVAmMIN8mVJ2VJIeVmFAbNYbGMKZfAIpnrXjPY9i8rss1gaJomYCB9P qE+E<d+Haq/AQBtb0WwCHAO8zAlyxvi84OcY58JOziQ/qOYuG//OaJOEEuoRd4Ou4IvQ54xU8Ean65cuiX/aibOmdXtuGzq+wQ+O0qaQOYuz/GO<Y/8lGX4Rbmta8lm/8<UG8zqYuQuqYOyQC+uzaHnJQ/R5awOcv+A4A06inaXxbOX/5uEO4/Ii4Oo8qXdyWAE8qAuCBWdb8BnAbqAHHdEEqEoxuICHY/yQ/48U8EOnW5AHAqvAOB8biOu/Onia0J4Q+zc/wH5+dQqYtzlG/<amn5q6E6GqXnuqabX8uUGd4IQu4R8o 
ei>xxet>xit>tiexiet> TAFagxgaasPrjeh_"i=j4T6"erslaa<jlea<tveXxgijd4sat asthtas11_hb atsxetbl6s4I=d"NhAjFPgaPghg=F6TsjbvIA_XN1rhd viIejas<XNsvdINXrTFAgagjhP=4"es6b_aah1 tsxeialt< rrcEESpOaEaJdN73V4ApE6+xYEEkR2LBA46+UFoIJ7k44wll1F4d4RFkBR9lgcHb=FBiYAb"WiaZ1rXOZ39xsBjI3lLkFFxWMSgNS6L2DFVwbRccibHX6jBxALAcAB>>"N=UcNHiBqAa>1=ZNsqo13sr3ccESSEONE4J6NE3244pF67xwEFkR2RBc4B+bFaIX7O4Jw3lpFxdkRBk+RIl4clFdBkYlbFWYaW11XZZs93sFjM3SLDFbxiM6gASAL"3UciEZos31aqs1A">BAAi6HVcbSDLxgM3FL9jsDZVqiNoUZ=a RLQ3oDLjWQDhUG23dfRpsI/7v/z82af04ndYLWW2oHbg1gfcNzajumUX/32Hwuri8VtHzBNjMfUfLWNLsotz8Ud4wr2fDdU/uL2zNafv2WLQ3VDzjHQ3hWGg3mfupBI27g/c8jaX0HniYHb/ffj1LUoWMUNLsotz8Rd4wr2fDdU/uL2zNafv2Wb1ff/BHjLQ3oDLjWQDhUG23dfRpsI/7v/z82af04ndYLWW2oHbg1gfcNzajumUX/32Hwuri8VtHzBNjMfUfLi3XHjzmggc2WHn0Y8/aIp73GfQjh3LDULQVuMoTFSs7g06FKD8yxm2DoNXqsnXdVTORs7P2FySBqDf2ZyHoRzYoCi2UZbgS2U40SQB2/vft6OS8CTVcOXy7/XBMXFcyTD8qOdtsvF2qQs0f462SgCZV2OCZYybc8Toz8OvtHRo20QiU04g2VxbZC2SUOYyby/B2DXyVOKoBC6Sy2gfqsfPSFdsRNnqyDm6DFTMFsS7sbB2U/OXSyxDbOVKUV0BiCRooSHyz68gofZ2y8sTSqFS6FDPmsNRndRnPqSNfD2mgyyDoFB6KcDZXToMFz8OvtHRo20QiU04g2VxbZC2SUOYsbfM ZYl0muCm0TjHJBB8jkeHkUFcfpmZrZDCnbGyr2LJ00ZN5wW0nmP8Q0RJHUHy2bk0TZCZu8BcpLrmnDGmfrkeFBJj0CjlYmxIA7Zr0K6+n7hpsswYwyVAikk8wHsYYkpYh8sZnH+r7CKnZi0Q65Z77nriIkxBArpJwyN75hnwWw0U8PmWPwRIQ607UhJwHw20HTyu0mbnkfCjT2Zl8VZKu+psBpcsmkLVryGJn0DRrmm0fnFNkpexjrBZJ0jK0+CsmplsYkpVQyRyPbmC8Z0ZWpnc5LNGwDpmAFxeIBrj70Zm6Yiw ZoXrkl+2ZXJ/xy2+8EpZcFplYyRZfxVkgZpHT1WTANJIrQo5nprbbl1ARZZR310RZlnAsK7x71oXAZ1kJlTrx2RZk+5J3/XX+yp2YxH8IEn+RZAceprp2opXZZWkylxrV21ZN+JJp/lX1y12lxn82E1+kZZcXp8pTlFFfygRZYpfFxRZfkZggVppZHWZA1TWrToAQNrTbIbrRJZoZ501ZQRpsr7nKbpleboAZRl12Z+R/Zy1x0E3ZZpllRyAYsxnk7VxHK122lecrJI5QornplbbR1ARZZ031lRZsnAxKppe2NAT 
4c13E79AFR6b8BdI+s9ejEuVU3OzCi6pDlAiekTmt+AVPXvF8Ww4eTk4MIYuzYC3<E/WlAa4tlA4R4WAsEDEY36CIu4IABOTW4Uu3FVXbPmHewxwIAcw187493F96<8Wd4++9MjtEEVA3lz1ikwPPeHzmlV6bEXDusFA3RWeUc47BpTTOA4vAwIkCYuCI/YaPitx8eDteaCl7/A<6C+zFYcMRkseEwI89vdP6A9+1T4klp4iAzW3EV3EYjuH4POFTVB+4HU8Wj3zFPu6Xdb9VEm3ikpeex/iatxttil<zCYeMkw8vT+A >>>>
7TZx7lnsfG0_Mxcr7gZlha4tKt"7_gsa=a4t6teG41=h" Ki4ehaZ<7Z7TcrMs0bf1nh6 eiseaab<<=la<extt ihas_b1sea4=6K4"Z7hM0cat7e7xKiftG hsMh7ag1"_4bZacs0en6Z4TlrnZ7TgrGf7 =M"B>"AiANA8Bc613Ci1cMb"ZcDiLSSygEMExNFbIDbAP51kF5uAtttF6bpSLRNr8eEGi21EEp21ZFESUbG6XABMR2eA6AE2SMrAY6Cp7uRbDMyDaZSZZaNDj7TYJSD6ERlXiUEZ8ENiL8tL56utEF3PBIAx>g=L8ZkcH3ABlAR>X=E82k1H2AHlcAAlEN21EDlEjJTSNZDayCR7SYreE6XRBEGUpr6tSt6u6FA1AP"bMI5F2xcMAgeSBLGDZZEbil8ElDAJHTkj8N=Z>SAaBy3DLR67tCFYPlIExDgJLTZjcciU KEGfwN0LYdzRQu68G5PTzWjWD4rP/GxD5vmzmDDtXNfbPa2Kb2tulxcrb7tO+w7VwfOO6hJjaJFJtadF4JvhzjEO5OW6uwYVDfDwP+r7Tt876OLrfcGbmlzuxx/2GbDtj2WaPKGb5fQPzXRt0NwDNmKGtKEGfwN0LYdzRQu68G5PTzWjWD4rP/GxD5vmzmDDtXNfbPa2Kb2tulxcrb7tO+w7VwfOO6hJjaJFJtadF4JvhzjEO5OW6uwYVDfDwP+r7Tt876OLrfcmbmlzuxx/2GbDtj2WaPKGb5fQPzXRt0NwDNKtwPJKALoLFA0nuuivfm3abjTt2HEJZhCyXRZvZAKJLSj6n3tqJ0H8uOJALRhVvnywwtRamKvGK9FFEEPAK9KGvRmawtwvynRVhJLA8OuJH03qtjn6JSLZKPThZCX+CxZ2EjWGT3bfF+fuit3Q0oF14FrPAra44rFF1o31QFt3F0+QfuWtGijFCfx++3TfhbPWJTSG623jqE0C8ZOxACR+VXnTwZthaZKPGK9FFEEPAK9KGvRmawtwvynRVhJLA8OuJH03qtjn6JSLZKPThZCX+ExZGj2bTW+f3ifFQtuF03F1oAr4aaAC s3++V5nGoEapJ7vnxCFGsQMxS4l6QzGM1amnz311G8xP1q+JmIFxziU++dsanPMSJRSyVMlmGmUoGB+Vp5G56OnFPhxhaWyloC5+hY+z99H9QdUvqGFH+hpUGa6QnSPQxUaUyDo+51hq+N9ohGSFDiNsi33+55EG7ECpQ74nzCaG3Q8xq4I6izdMPaRnM3m1B85POqhJWICxYi9+ddGahPaSSRUyDM1mNmGoiB3V55E57OCFQh4hzWal3C8+qYIzi9d9PdRvMGmHBh5UOahQWSCQYU9UdDH+U1QqQNUo+GqFoiFv 
UxLFrJmYQdhwshpd8lB8mmohplCEulRidhhflmrZfY8+pPNYwmxUTrmLhsQ8BpopmrmdlffZ+r8NYPmpwTYyVQfBZmHQkIJYeZEVjUGO6zDIlCumRPFD8fwrmYYYhrhslIJVEYdNlZdfipl8IQgUzz6UDjGwUmOpV8E+jrelYmZdIokBJpHhmmQLBxfgZ6VGyOQETegZ+idldEJlhhYmw8FRuY8NpPmYwTQyVZfBQmHJkIZYejEQyVZfBQmHJkIZxULrmQhsp8BrmpdrmfluCRwFY8mfrlJVEDdzlUdgi6lGIOoClynALtvB9AaPAAHvJnteHHBn39/3zyKnQGtp7lh8ta0y3iKMMJnEIdPLiCJNxdji/9CMj5ZgpYoq9cYFnYefLnnBAj8pP8ALpY9ptCjxCJdiBPEIMnnMiKa3f08tphY7uGyQFK3zatcoqZY/gj5/MH9jiNdKzyunQ7Gp8lh0taiy3MKMEJnPIdCLixJt9BCYpAPALp8AnvjnBenHfFY9oacYqZj/gM5/jH9diNufBniMK3JIPLiCJtxBC9pAYALP8ApvjEnenHnY9FacoMqZY/gj5t30zhy7KtnMQlu8Gapy/d9HjiNd</lQate=>=tixetal/<==Q>tixetal/<==Q>tixetal/<==Q>tix Image M 0 Network
YIuE"AnN_8KS0NeBWZ/bVcGiuX49aaavNMmrUbTGhCE6WB4AQAHAc>R"==6IsHbf1Chs tetvaWnY9EAal"<AtI VsxhZaC1G__bia8sXe=6/4C=0"eRsulca0AHNGcQBG>4HbuWKCMEbMGhuV4TaKaUt/tmxueN<CEfNHSIB=b"i>6AAAAA"BI6fXNimcUbTZhBENWS48QNHAcERI=Y6ns9ba1vhW rireWav<ar9xntYlIN8SNBZbciX6BAAA>"=IHfCNum/UKTVhMECWb4GQGH0cuR"=46esab_1ahs tixetal<i dITmNYbM2XdtHDW/Yih21lpZTLIHWb47B2lNc8Y7492Rp9rDKDout9OYW19Y3OruF3a2d3bPygfHrVMnGrGmyyWse4E1HApFG+iQXs7ieY0CiRvB+O7lLRU3x9kMb6pMHdLt4ytbF39rFHQiSXKWeFiyxG4eMGWeL+ULMoVO+9QrBavbKfZMrGtWpElp2i4720YvXbc9iRlDNuB9/Y41MYWODuI3l2739PRg9H7V8nNr2m7ybsH4L1TA1Fh+YQWkHidY2CbRYBmOTWd3IxpUDs6LHHL4tF9FQSKeix4MWLUMV+Q6BvKZrtpl242YXcilNB/4MWDIl79R9p6XsRULx93bWbO+BvRiC0Yei7kMQi+GvFAHpE41sWeymyrGGMVnHfryPg3dba32urF3YO1W9O9YuotDpbWDdTIYbmdH2YhWTLQB+mVuM2UhLHWYMm4NxniPeuKYSpQdFH9yFrtV4gL3H3pOX1s9RDUdLbxW9T37b7WybGOG+MBrvfRyibCd0aYFeri379kWMOQtio+DGIFTpYA2HH1YE14Lebs2W81b7pH2N879R97lIDWM4/BNlicXY242lptrZKvBQ+VMULWM4xieKSQF9Ft4 2epz3SE/m+jVlJgf2hKlV7QmlnmvxbURKHIhTv9kNVhJ7mPIZl16/oBVgKn3JuNu2aWC6iLSYrdc17HccHVpscQYS+l5peS/flmm17cMMIKg2eJElmjNfxEz3N6QeJ2YnMQnlFuUmxvsjvUURuKfHxILhwTdvL9kkgNKVnhuJnrLkgEKcn9uAnVKAUjZKcHk/i+ge+W1bXICn55nCIXb1W+eg+i/kHcKZjUAKVnAu9ncKEgSk3LNdzwQLex6f2uYUJjMvQsnulUFFnnUMmYu6vQsNxfvxbNjJUEUeRMuIKMfmHmxlIrLhhew5pQKYscYcUHPSZIcZkli1g6+/1oXBCV5gnYInb3WJeu+N/uH2KajWACV6Ai9LcSErscTpvY9SkQNlV5h+Je7hmhPpIrZfll1/6m/1omBcVMg7KInK3MJ2ueNguE2laJWjCN6mixLESfrhhKbY2JMQnlFnUmuvsxvbjUURuKfHxILhwTdvL9kkgNKVnhuJn7KmUPZIcZkli1g6+/1oXBCV5gnKInb3WJeu+N/uH2KajWACSYHVcscpYSQl5+ehhprfl/m1mcM7IKM2egElJjNmxEf3NzQeV6Ai9LcSEr6 
1L9+xEO8DKz+v8Gx1yKpj37VxEquinpQMMyo+AmRlo1CwyCuwlpluxCEgCDC/CCAFni3z8EoEKGz1zxHbmqxbklirC9muRly6xxlqGuCQpvAaEy8y+7zNmCjslRtZL4K0hKJjem=iwaDOo18TpKQzwxgHzPbDuAQANz0<jtKfRx2PiwABjPA=</latexit>oJxmR2NhkIzRnKiQXKtLC7EjaimjK4ZoRJsmN2zhNInRXKyQtKELa7ajKiQj40CZqs7Nvy6auQlqu69uqrrbxbE1bEizDFb/CgCu1w1wMoEonQ3uzVypKxF+O81x/LKozTgxHDPzufAwABwP<=t/waxejio>7EKCotQXRiQnIz2kuNmR>xVoitaxpt/lP<xABjfA+PzmxH8zTzLKx1x++OLKx8Vyo3uEonTMoAw1xCup/CDDFCEizE1Gbxfqrl69wlqxauBvyys7PCZRj4=Kim7a/ELCQtaXKiInezhkmNJR>xto0
New Prediction
Figure 4.2: GSR-Net framework overview. (a) Given a tampered image S,
an authentic target image T, and the ground truth mask K, the generation stage
generates a hard example G(M) starting from a simple copy-pasted image M. (b)
Taking the training images, copy-pasted images, or generated images as input, the
segmentation stage learns to segment the boundary artifacts and fill the interior
to produce the final prediction. (c) The segmentation network concatenates lower-level
features to predict boundary artifacts and then concatenates the boundary
features back into the segmentation branch for the final prediction. (d) The refinement
stage creates a novel tampered image with new boundary artifacts by replacing the
manipulated boundaries predicted by the segmentation stage with the original authentic
regions, and learns to make a new prediction.
a process often results in training images that are not realistic. Of course, the best
approach for generating training samples is to employ professional labelers to create
realistic looking manipulated images, but this remains a very tedious process. It is
therefore not surprising that existing datasets [3-6, 26] are often not comprehensive
enough to train models that generalize well.
Additionally, in contrast to standard semantic image segmentation, correctly
segmenting manipulated regions depends more on visual artifacts that are often
created at the boundaries of manipulated regions than on semantic content [21,
49]. Several challenges exist in recognizing these boundary artifacts. First, the
space of manipulations is very diverse. One can, for example, do a copy-move,
which copies and pastes image regions within the same image (the second column in
Figure 4.1), or splice, which copies a region from one image and pastes it into another
image (the remaining columns in Figure 4.1). Second, a variety of post-processing
operations such as compression, blurring, and various color transformations make it harder to
detect boundary artifacts caused by tampering. See Figure 4.1 for some examples.
Most existing methods [6, 49, 81, 82] that utilize discriminative features like image
metadata, noise models, or color artifacts due to, for example, Color Filter Array
(CFA) inconsistencies, have failed to generalize well for these reasons.
This paper introduces a two-pronged approach to (1) address the lack of comprehensive
training data and (2) focus the training process on learning to
recognize boundary artifacts better. We adopt GANs for addressing (1), but instead
of relying on prior GAN methods [79, 85, 86] that mainly explore image level ma-
nipulation, we introduce a novel objective function that optimizes for the realism
of the manipulated regions by blending tampered regions in existing datasets to
assist segmentation. That is, given an annotated image from an existing dataset,
our GAN takes the given annotated regions and optimizes them via a blending based
objective function to enhance the realism of the regions. Blending has been shown to
be effective for creating training images for the task of object detection [87],
and this forms our main motivation in formulating our GAN.
To address (2), we propose a segmentation and refinement procedure. The
segmentation stage localizes manipulated regions by learning to spot boundary ar-
tifacts. To further prevent the network from just focusing on semantic content,
the refinement stage replaces the predicted manipulation boundaries with authentic
background and feeds the new manipulated images back to the segmentation
network. We will show empirically that the segmentation and refinement stages have the
effect of focusing the model's attention on boundary artifacts during learning (see
Table 4.2).
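As a rough illustration of the edge replacement idea in the paragraph above, the following NumPy sketch overwrites the pixels flagged by a predicted boundary mask with the corresponding authentic pixels. The function name `replace_boundary`, the array shapes, and the exact compositing rule are assumptions made for illustration only; they are not taken from the GSR-Net implementation.

```python
import numpy as np

def replace_boundary(manipulated, authentic, boundary):
    """Overwrite predicted boundary pixels with authentic background pixels.

    manipulated, authentic: float arrays of shape (H, W, C).
    boundary: binary array of shape (H, W), 1 where a boundary was predicted.
    (Hypothetical helper; the compositing rule is an assumption.)
    """
    p = boundary.astype(manipulated.dtype)[..., None]  # broadcast over channels
    return p * authentic + (1.0 - p) * manipulated

# Toy example: an all-ones "manipulated" image over an all-zero authentic image.
M = np.ones((3, 3, 1))
T = np.zeros((3, 3, 1))
P = np.zeros((3, 3), dtype=np.uint8)
P[1, 1] = 1  # pretend the network flagged the centre pixel as boundary
M_new = replace_boundary(M, T, P)
```

Only the flagged pixel changes, so the resulting image carries a new boundary where the replaced pixels meet the remaining manipulated pixels.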
We design an architecture called GSR-Net which includes these three components:
a generation stage, a segmentation stage and a refinement stage. The architecture of
GSR-Net is shown in Figure 4.2. During training, we alternately train the generation
GAN, followed by the segmentation and refinement stages, which take as input
the output of the generation stage as well as images from the training datasets. The
additional varieties of manipulation artifacts provided by both the generation and
refinement stages produce models that exhibit very good generalization ability. We
evaluate GSR-Net on four public benchmarks and show that it performs better than
state-of-the-art methods. Experiments with two different post-processing attacks
further demonstrate the robustness of GSR-Net. In summary, the contributions
of this paper are: 1) a framework that augments existing datasets in a way that
specifically addresses the main weaknesses of current approaches without requiring
new annotation efforts; 2) a generation stage with a novel objective
function based on blending for generating images effective for training models to
detect tampered regions; 3) a novel refinement stage that encourages
the learning of boundary artifacts inherent in manipulated regions, which, to the
best of our knowledge, no prior work in this field has used to aid training.
4.2 Related Work
Image Manipulation Segmentation. [81] train a network to find JPEG compres-
sion discrepancies between manipulated and authentic regions. [16,49] harness noise
features to find inconsistencies within a manipulated image. [6] treat the problem
as anomaly segmentation and use metadata to locate abnormal patches. The fea-
tures used in these works are based on the assumption that manipulated regions are
from a different image, which is not the case in copy-move manipulation. In contrast,
our method directly focuses on general artifacts in the RGB channels without specific
feature extraction and thus can be applied to copy-move segmentation. More
closely related works [82] and [21] show the potential of boundary artifacts in different
image manipulation techniques. These methods motivate
us to exploit boundary artifacts as a strong cue for detecting manipulations. [21]
design a Long Short-Term Memory (LSTM) [88] based network to identify RGB
boundary artifacts at both the patch and pixel level. [82] apply a Multi-task Fully
Convolutional Network (MFCN) to manipulation segmentation by providing both
segmentation and edge annotations. Instead of applying hole filling on the edge
prediction as a late fusion step, our segmentation stage fuses edge information early with
the segmentation branch to improve segmentation results.
GAN Based Image Editing. GAN based image editing approaches have emerged
rapidly, and impressive results have been demonstrated recently [85, 86, 89-91].
Prior and concurrent works force the output of the GAN to be conditioned
on input images through extra regression losses (for example, $\ell_2$ loss) or discrete
labels. However, these methods manipulate whole images and do not fully explore
region based manipulation. In contrast, our GAN manipulates minor regions
and is a better fit for manipulation segmentation, where minor regions have been
manipulated. A more closely related work [89] generates natural composite images using both
scene parsing and harmonized ground truth. Even though it targets region
manipulation, experimental results show that our method performs better in terms of
assisting segmentation.
Adversarial Training. Discriminative feature learning has motivated recent re-
search on adversarial training on several tasks. [92] propose a simulated and un-
supervised learning approach which utilizes synthetic images to generate realistic
images. An online hard negative generation network [93] boosts the performance
on occluded and deformed objects. [94] investigate an adversarial erasing approach to
learn dense and complete semantic segmentation. [95] propose an adversarial shadow
attenuation network to make correct predictions on hard shadow examples. However,
these approaches are difficult to adapt to manipulation segmentation because
they either generate whole synthetic images or leave artifacts on erased regions. In
contrast, we replace manipulated regions with original ones so that the replaced
regions become authentic.
4.3 Approach
We describe GSR-Net in detail in the following sections. A key to the
generation stage is the use of a GAN with a loss function centered on
blending to produce realistic training images. The segmentation and
refinement stages are specially designed to single out boundaries of the manipulated
regions in order to guide the training process to pay extra attention to boundary
artifacts.
4.3.1 Generation
Generator. Referring to Figure 4.2 (a), the generator is given as input both copy-
pasted images and ground truth masks. To prepare the input images, we start with
the training samples in manipulation datasets (for example, CASIA 2.0 [3]). Given a
training image S, the corresponding ground truth binary mask K and an authentic
target image T from a clean dataset (for example, COCO [27]), we first create a
simple copy-pasted image M by taking S as foreground and T as background:
$$M = K \odot S + (1 - K) \odot T, \tag{4.1}$$
where $\odot$ represents pointwise multiplication.
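Equation 4.1 can be sketched in a few lines of NumPy. The function name `copy_paste`, the (H, W, C) image layout, and the toy arrays are assumptions chosen for illustration.

```python
import numpy as np

def copy_paste(source, target, mask):
    """Composite M = K * S + (1 - K) * T (Equation 4.1).

    source, target: float arrays of shape (H, W, C).
    mask: binary array of shape (H, W), 1 inside the manipulated region.
    """
    k = mask.astype(source.dtype)[..., None]  # add a channel axis so K broadcasts over C
    return k * source + (1.0 - k) * target

# Toy 2x2 RGB example: the source is all ones, the target all zeros.
S = np.ones((2, 2, 3))
T = np.zeros((2, 2, 3))
K = np.array([[1, 0],
              [0, 0]])
M = copy_paste(S, T, K)
```

Pixels where the mask is 1 come from the source image and all other pixels come from the target, which is why the pasted region has a hard, easily detected boundary before the generator refines it.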
In Poisson blending [96], the final value of pixel $i$ in the manipulated regions is
$$b_i = \arg\min_{b_i} \sum_{s_i \in S,\, N_i \subset S} \|\nabla b_i - \nabla s_i\|_2 + \sum_{s_i \in S,\, N_i \not\subset S} \|b_i - t_i\|_2, \tag{4.2}$$
where $\nabla$ denotes the gradient, $N_i$ is the neighborhood (for example, up, down, left
and right) of the pixel at position $i$, $b_i$ is the pixel in the blended image $B$, $s_i$ is the
pixel in $S$ and $t_i$ is the pixel in $T$.
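To make the idea behind Equation 4.2 concrete, the sketch below solves a one-dimensional analogue: interior samples keep the second differences of the source, while the two end samples are pinned to the target. This is the standard least-squares (squared-norm) form of Poisson blending, used here as a simplifying assumption; the function name `poisson_blend_1d` and the toy signal are hypothetical.

```python
import numpy as np

def poisson_blend_1d(source, target):
    """Blend a 1-D source into a target by matching second differences.

    Interior samples satisfy
        b[i-1] - 2*b[i] + b[i+1] = s[i-1] - 2*s[i] + s[i+1],
    and the two end samples are fixed to the target values.
    """
    s = np.asarray(source, dtype=float)
    t = np.asarray(target, dtype=float)
    n = s.size
    A = np.zeros((n, n))
    rhs = np.zeros(n)
    # Boundary rows: pin the end samples to the target.
    A[0, 0] = 1.0
    rhs[0] = t[0]
    A[-1, -1] = 1.0
    rhs[-1] = t[-1]
    # Interior rows: match the discrete second difference of the source.
    for i in range(1, n - 1):
        A[i, i - 1], A[i, i], A[i, i + 1] = 1.0, -2.0, 1.0
        rhs[i] = s[i - 1] - 2.0 * s[i] + s[i + 1]
    return np.linalg.solve(A, rhs)

# A bump sitting on a level of 10 is blended into a flat, all-zero target.
blended = poisson_blend_1d([10, 11, 13, 11, 10], [0, 0, 0, 0, 0])
```

In this toy case the blended signal keeps the shape of the bump but drops to the level of the target at both ends, which is the behavior the gradient term and the boundary term in Equation 4.2 jointly encourage.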
Similar to Poisson blending, we optimize the generator to blend neighborhoods
in the resulting image that now contains copy-pasted regions and background re-
gions. A key part of our loss function enforces the shapes of the tampered regions,
while maintaining the background regions. To maintain background regions, we
utilize an $\ell_1$ loss to reconstruct the background:
$$L_{bg} = \frac{1}{N_{bg}} \sum_{t_i \in T,\, k_i = 0} \|m_i - t_i\|_1, \tag{4.3}$$
where $N_{bg}$ is the total number of pixels in the background, $m_i$ is the pixel in $M$
and $k_i$ is the value in mask $K$ at position $i$. To maintain the shape of manipulated
regions, we apply a Laplacian operator to the pasted regions and reconstruct the
gradient of this region to match the source region:
$$L_{grad} = \frac{1}{N_{fg}} \sum_{s_i \in S,\, k_i = 1} \|\Delta m_i - \Delta s_i\|_1, \tag{4.4}$$
where $\Delta$ denotes the Laplacian operator and $N_{fg}$ is the total number of pixels in
pasted regions. To further constrain the shape of pasted regions, we add an additional
edge loss, denoted by
$$L_{edge} = \frac{1}{N_{edge}} \sum_{s_i \in S,\, e_i = 1} \|m_i - s_i\|_1, \tag{4.5}$$
where $N_{edge}$ is the number of boundary pixels and $e_i$ is the value of the edge mask
at position $i$, which is obtained as the absolute difference between a dilation and
an erosion of $K$. To generate realistic manipulated images, we add an adversarial
loss $L_{adv}$, as explained below, that serves to encourage the generator to produce
increasingly realistic images as the training progresses.
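The three reconstruction terms in Equations 4.3 to 4.5 share the same structure: a mean absolute difference restricted to a mask. The following NumPy sketch evaluates them on single-channel images. The helper names, the 4-neighbour Laplacian stencil with edge replication, and the toy arrays are assumptions for illustration; the array `m` stands for the image whose pixels are written $m_i$ in the text.

```python
import numpy as np

def laplacian(img):
    """4-neighbour discrete Laplacian with edge replication at the borders."""
    p = np.pad(img, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * img

def masked_l1(a, b, mask):
    """Mean absolute difference over pixels where mask == 1 (0.0 if the mask is empty)."""
    n = mask.sum()
    return float((np.abs(a - b) * mask).sum() / n) if n > 0 else 0.0

def generator_recon_losses(m, s, t, k, e):
    """Return (L_bg, L_grad, L_edge) for 2-D arrays m, s, t and binary masks k, e."""
    l_bg = masked_l1(m, t, 1 - k)                      # Eq. 4.3: background matches T
    l_grad = masked_l1(laplacian(m), laplacian(s), k)  # Eq. 4.4: Laplacian matches S
    l_edge = masked_l1(m, s, e)                        # Eq. 4.5: edge pixels match S
    return l_bg, l_grad, l_edge

# Toy example: a 2x2 region of a ramp image pasted onto a black background.
s = np.arange(16, dtype=float).reshape(4, 4)
t = np.zeros((4, 4))
k = np.zeros((4, 4))
k[1:3, 1:3] = 1
m = k * s + (1 - k) * t  # plain copy-paste composite
l_bg, l_grad, l_edge = generator_recon_losses(m, s, t, k, k)
```

For a plain copy-paste composite the background and edge terms vanish, while the Laplacian term stays positive because the hard boundary changes the local second derivatives. This is the mismatch the generator is asked to reduce.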
Discriminator. A crucial detail in our discriminator is that the
manipulated regions typically occupy only a small area of the image. Hence,
it is beneficial to restrict the GAN discriminator's attention to the structure in
local image patches. This is reminiscent of "PatchGAN" [79], which only penalizes
structure at the scale of patches. Similar to PatchGAN, our discriminator applies
a final fully convolutional layer at a patch scale of $N \times N$. The discriminator
distinguishes the authentic image $T$ as real and the generated image $G(K, M)$ as
fake by maximizing:
$$L_{adv} = \mathbb{E}_T[\log(D(K, T))] + \mathbb{E}_M[1 - \log(D(K, G(K, M)))], \tag{4.6}$$
where $K$ is concatenated with $G(K, M)$ or $T$ as the input to the discriminator (we
do not show $K$ in the discriminator input in Figure 4.2 (a) for simplicity).
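The sketch below shows, under simplifying assumptions, how the mask can be concatenated with an image along the channel axis and how Equation 4.6 can be evaluated on a grid of patch-level discriminator outputs. The function names, the `eps` clipping constant, and the 2x2 patch grid are hypothetical; the discriminator network itself is not modelled.

```python
import numpy as np

def discriminator_input(mask, image):
    """Concatenate K with an image: (H, W) and (H, W, C) give (H, W, C + 1)."""
    return np.concatenate([mask[..., None].astype(image.dtype), image], axis=-1)

def discriminator_objective(d_real, d_fake, eps=1e-8):
    """Evaluate Equation 4.6 as written, averaging over patches.

    d_real: patch-level probabilities D(K, T).
    d_fake: patch-level probabilities D(K, G(K, M)).
    eps: assumed clipping constant that keeps the logarithms finite.
    """
    d_real = np.clip(d_real, eps, 1.0)
    d_fake = np.clip(d_fake, eps, 1.0)
    return float(np.mean(np.log(d_real)) + np.mean(1.0 - np.log(d_fake)))

# A confident discriminator (real near 1, fake near 0) scores higher than an unsure one.
confident = discriminator_objective(np.full((2, 2), 0.9), np.full((2, 2), 0.1))
unsure = discriminator_objective(np.full((2, 2), 0.5), np.full((2, 2), 0.5))
```

Because each entry of `d_real` and `d_fake` corresponds to one image patch, averaging over the grid penalizes structure at the patch scale, in the spirit of PatchGAN.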
The final loss function of the generator is given as
$$L_G = L_{bg} + \lambda_{grad} L_{grad} + \lambda_{edge} L_{edge} + \lambda_{adv} L_{adv}, \tag{4.7}$$
where $\lambda_{grad}$, $\lambda_{edge}$, and $\lambda_{adv}$ are parameters which control the importance of the
corresponding loss terms. Conditioned on this constraint, the generator preserves
background and texture information of pasted regions while blending the manipulated
regions with the background, which can be applied to generate both splicing
and copy-move examples. It can also potentially be used to generate removal
examples by setting $\lambda_{grad}$ and $\lambda_{edge}$ to zero, so that the generator learns to inpaint
the missing regions, creating images with removal manipulation.
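Equation 4.7 is a weighted sum, and the removal setting corresponds to zeroing two of the weights. A minimal sketch follows; the function name and the numeric values are placeholders chosen for illustration, not the weights used in the experiments.

```python
def generator_loss(l_bg, l_grad, l_edge, l_adv, *, lam_grad, lam_edge, lam_adv):
    """Weighted sum of the generator loss terms (Equation 4.7)."""
    return l_bg + lam_grad * l_grad + lam_edge * l_edge + lam_adv * l_adv

# Placeholder loss values and weights (assumptions, for illustration only).
splice_loss = generator_loss(1.0, 2.0, 3.0, 4.0,
                             lam_grad=0.5, lam_edge=0.5, lam_adv=0.1)

# Removal setting: with lam_grad = lam_edge = 0, only the background and
# adversarial terms remain, so nothing ties the region to the source content.
removal_loss = generator_loss(1.0, 2.0, 3.0, 4.0,
                              lam_grad=0.0, lam_edge=0.0, lam_adv=1.0)
```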
4.3.2 Segmentation
For segmentation, we adapt the publicly available VGG-16 [97] based
DeepLab model [70] to include boundary information. The network structure is
depicted in Figure 4.2 (c), consisting of a boundary branch predicting the manipu-
lated boundaries and a segmentation branch predicting the interior. In particular,
to enhance attention on boundary artifacts, we introduce boundary information by
subtracting the erosion from the dilation of the binary ground truth mask to obtain
the boundary mask. We then predict this boundary mask by concatenating bilinearly
up-sampled intermediate features and passing them to a 1 × 1 convolutional
Dataset                 Carvalho        In-The-Wild     COVER           CASIA
Metrics                 MCC     F1      MCC     F1      MCC     F1      MCC     F1
NOI [38]                0.255   0.343   0.159   0.278   0.172   0.269   0.180   0.263
CFA [39]                0.164   0.292   0.144   0.270   0.050   0.190   0.108   0.207
MFCN [82]               0.408   0.480   -       -       -       -       0.520   0.541
RGB-N [49]              0.261   0.383   0.290   0.424   0.334   0.379   0.364   0.408
EXIF-consistency [6]*   0.420   0.520   0.415   0.504   0.102   0.276   0.127   0.204
DeepLab (baseline)      0.343   0.420   0.352   0.472   0.304   0.376   0.435   0.474
GSR-Net (ours)          0.462   0.525   0.446   0.555   0.439   0.489   0.553   0.574
Table 4.1: MCC and F1 score comparison on four standard datasets. "-"
denotes that the result is not available in the literature. * Our method is 1600 times
faster than EXIF-consistency.
layer to form the boundary branch. Finally, we concatenate the output features
of the boundary branch with the up-sampled features of the segmentation branch.
Empirically, we noticed such multi-task learning helps the generalization of the final
model. Only the segmentation branch output after boundary feature concatenation
is used for evaluation during inference. During training, we select the copy-pasted
examples M, generated examples G(M) and training samples S in the dataset as
input to the segmentation network, which provides a larger variety of manipulations.
The loss function of the segmentation network is an averaged two-class softmax
cross-entropy loss.
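The boundary mask used by the boundary branch (the dilation of the binary ground truth mask minus its erosion) can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the experiments: the 3 × 3 structuring element, the border handling and the helper names `_neighbourhood` and `boundary_mask` are all assumptions made for the example.

```python
import numpy as np

def _neighbourhood(mask: np.ndarray, pad_value: int) -> np.ndarray:
    """Stack the 3x3 neighbourhood of every pixel along a new leading axis."""
    padded = np.pad(mask, 1, mode="constant", constant_values=pad_value)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def boundary_mask(mask: np.ndarray) -> np.ndarray:
    """Return dilation(mask) - erosion(mask) for a binary mask.

    Assumption: a 3x3 square structuring element; the image border is
    padded so that it never creates a boundary by itself.
    """
    mask = (mask > 0).astype(np.uint8)
    dilated = _neighbourhood(mask, pad_value=0).max(axis=0)  # any neighbour set
    eroded = _neighbourhood(mask, pad_value=1).min(axis=0)   # all neighbours set
    return dilated - eroded

# Toy ground truth: a 3x3 manipulated square inside a 7x7 image.
mask = np.zeros((7, 7), dtype=np.uint8)
mask[2:5, 2:5] = 1
edge = boundary_mask(mask)
```

On this toy mask the result is a ring around the square: the dilated 5 × 5 block minus the single interior pixel that survives erosion. A larger structuring element would give a thicker boundary band.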
4.3.3 Refinement
The goal of the refinement stage is to draw attention to the boundary artifacts
during training, taking into account the fact that boundary artifacts play a more
pivotal role than semantic content in detecting manipulations [21,49]. While we could
employ prior erasing-based adversarial mining methods [93,94], they are
not suitable for our purpose because they would introduce artifacts in the erased regions
that should become authentic background. Instead, the refinement stage utilizes
the prediction of the segmentation stage to produce new boundary artifacts by
replacing the predicted boundaries with the original regions. As illustrated in Figure 4.2 (d), given an authentic
target image T into which the manipulated regions were inserted, the manipulated
image M (which could also be the generated image G(M)), and the manipulated
boundary prediction P by the segmentation stage, we replace the pixels in predicted
boundaries by the authentic regions in T and create a novel manipulated image:
$$M' = T \odot P + M \odot (1 - P), \tag{4.8}$$
where $M'$ is the novel manipulated image with new boundary artifacts. The corresponding
segmentation ground truth now becomes
$$K' = K - K \odot P, \tag{4.9}$$
where $K'$ is the new manipulated mask for $M'$. The new boundary artifact mask can
be extracted in the same way as the previous step. Notice that the refinement stage
utilizes the target images T to help training, providing more side information to spot
the artifacts. Taking as input the new manipulated images, the same segmentation
network in Figure 4.2 (c) then learns to predict the new manipulated boundaries
and interior regions.
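The replacement step of Equations 4.8 and 4.9 amounts to two element-wise array operations. The sketch below illustrates it in NumPy on a toy example; the function name `refine_example`, the array shapes and the binary (rather than soft) prediction P are assumptions made for the illustration.

```python
import numpy as np

def refine_example(T, M, K, P):
    """Apply Eq. 4.8 and Eq. 4.9.

    T, M: authentic and manipulated images of shape (H, W, C).
    K, P: binary manipulation mask and predicted boundary mask, shape (H, W).
    """
    P3 = P[..., None].astype(T.dtype)   # broadcast the mask over channels
    M_new = T * P3 + M * (1 - P3)       # Eq. 4.8: paste authentic pixels back
    K_new = K - K * P                   # Eq. 4.9: shrink the ground truth mask
    return M_new, K_new

# Toy data: a black authentic image with a white 2x2 pasted region.
T = np.zeros((4, 4, 3), dtype=np.float32)
K = np.zeros((4, 4), dtype=np.uint8)
K[1:3, 1:3] = 1
M = T.copy()
M[K == 1] = 1.0

# Suppose the segmentation stage predicts the top row of the region as boundary.
P = np.zeros((4, 4), dtype=np.uint8)
P[1, 1:3] = 1
M_new, K_new = refine_example(T, M, K, P)

# A missed example (empty prediction) is returned unchanged.
M_same, K_same = refine_example(T, M, K, np.zeros_like(P))
```

The last call shows the hard-example behaviour: when the segmentation stage predicts nothing, both the image and its mask come back unchanged.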
In addition to augmenting boundary artifacts, the refinement stage also mines
hard examples during training. Since the refinement stage is based on predictions
from the previous stage, hard examples in which the manipulated regions are
not predicted remain the same after the replacement operation. As a result, these
hard examples carry more weight during training when they are fed back to the
segmentation network.
Similar to [94], multiple refinement operations are possible and there is a
tradeoff between training time and performance. However, the difference is that
the segmentation network in the refinement stage shares weights with that in the
segmentation stage. The weight sharing enables us to use a single segmentation
network at inference. As a result, the network learns to focus more attention on
boundary artifacts with no additional cost at inference time.
4.4 Experiments
We evaluate the performance of GSR-Net on four public benchmarks and
compare it with the state-of-the-art methods. We also analyze its robustness under
several attacks.
4.4.1 Datasets and Experiment Setting
Datasets. We evaluate our performance on four datasets: In-The-Wild [6],
COVER [4], CASIA 1.0 [26] and Carvalho [5].
Evaluation Metrics. We use pixel-level F1 score and MCC as the evaluation
metrics when comparing to other approaches. For fair comparison, following the
same measurement as [6, 49, 82], we vary the prediction threshold to obtain a binary
prediction mask and report the optimal score over the whole dataset.
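The evaluation protocol above can be sketched in a few lines of NumPy. The pixel-level F1 and MCC formulas are the standard ones; how the scores are aggregated is an assumption here (one threshold shared by the whole dataset, scores averaged per image), as are the threshold grid and the function names `f1_and_mcc` and `best_scores`.

```python
import numpy as np

def f1_and_mcc(pred, gt):
    """Pixel-level F1 and MCC for two binary arrays of equal shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = float(np.sum(pred & gt))
    tn = float(np.sum(~pred & ~gt))
    fp = float(np.sum(pred & ~gt))
    fn = float(np.sum(~pred & gt))
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den > 0 else 0.0
    mcc_den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den > 0 else 0.0
    return f1, float(mcc)

def best_scores(probs, gts, thresholds=None):
    """Sweep a dataset-wide threshold and keep the best mean F1 and mean MCC."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)  # assumed grid
    best_f1, best_mcc = 0.0, -1.0
    for t in thresholds:
        scores = [f1_and_mcc(p >= t, g) for p, g in zip(probs, gts)]
        best_f1 = max(best_f1, float(np.mean([s[0] for s in scores])))
        best_mcc = max(best_mcc, float(np.mean([s[1] for s in scores])))
    return best_f1, best_mcc

# Toy "dataset" with a single 2x4 probability map.
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1]])
prob = np.array([[0.12, 0.22, 0.83, 0.91],
                 [0.13, 0.62, 0.71, 0.43]])
f1_half, mcc_half = f1_and_mcc(prob >= 0.5, gt)
best_f1, best_mcc = best_scores([prob], [gt])
```

In this toy case a fixed threshold of 0.5 gives F1 = 0.75, while the sweep finds a lower threshold that recovers the low-confidence true pixel and reaches F1 = 8/9.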
4.4.2 Main Results
In this section, we present our results for the task of manipulation segmentation.
We fine-tune our model on CASIA 2.0 from the ImageNet pre-trained model
and directly test the performance on the aforementioned four datasets. We compare
with the methods described below:
• NoI [38]: A noise inconsistency method which predicts regions as manipulated
where the local noise is inconsistent with authentic regions. We use the released
code [36] for evaluation.
• CFA [39]: A CFA based method which estimates the internal CFA pattern of the
camera for every patch in the image and segments out the regions with anomalous
CFA features as manipulated regions. The evaluation code is publicly available [36].
• RGB-N [49]: A two-stream Faster R-CNN based approach which combines features
from the RGB and noise channels to make the final prediction. We train the
model on CASIA 2.0 using the code provided by the authors.
• MFCN [82]: A multi-task FCN based method which harnesses both an edge
mask and a segmentation mask for manipulation segmentation. Hole filling is applied
for the edge branch to make the prediction. The final decision is the intersection of
the two branches. We directly report the results from the paper since the code is
not publicly available.
• EXIF-consistency [6]: A self-consistency approach which utilizes metadata to
learn features useful for manipulation localization. The prediction is made patch
by patch and post-processing like mean-shift [98] is used to obtain the pixel-level
manipulation prediction. We use the code provided by the authors for evaluation.
• DeepLab: Our baseline model, which adapts the DeepLab VGG-16 model to the
manipulation segmentation task. No generation, boundary branch or refinement stage is
added.
• GSR-Net: Our full model combining generation, segmentation and refinement
for manipulation segmentation.
The final results, presented in Table 4.1, highlight the advantage of GSR-Net.
For supervised methods [49,82], we train the model on CASIA 2.0 and evaluate on
all the four datasets. For other unsupervised methods [6,38,39], we directly test the
model on all datasets. GSR-Net outperforms other approaches by a large margin
on COVER, suggesting the advantage of our network on copy-move manipulation.
GSR-Net also improves on In-The-Wild, CASIA 1.0 and Carvalho. Additionally,
in terms of computation time, EXIF-consistency takes 1600 times more
computation (80 seconds for an 800 × 1200 image on average) than ours (0.05s per
image). Compared to boundary artifact based methods, our GSR-Net outperforms
MFCN by a large margin, indicating the effectiveness of the generation and refinement
stages. In addition, no hole filling is required since our approach does
not perform late fusion with the boundary branch, but instead utilizes boundary artifacts
to guide the segmentation branch.
Our method outperforms the baseline model by a large margin, showing the
effectiveness of the proposed generation, segmentation and refinement stages.
Dataset         Carvalho    In-The-Wild COVER       CASIA
DeepLab         0.420       0.472       0.376       0.474
DL + CP         0.446       0.504       0.410       0.503
DL + G          0.460       0.524       0.434       0.506
DL + DIH        0.384       0.421       0.342       0.420
DL + CP + G     0.472       0.528       0.444       0.507
GS-Net          0.515       0.540       0.455       0.545
GSR-Net         0.525       0.555       0.489       0.574
Table 4.2: Ablation analysis on four datasets. Each entry is the F1 score tested
on an individual dataset.
4.4.3 Ablation Analysis
We quantitatively analyze the influence of each component in GSR-Net in
terms of F1 score.
• DL + CP: DeepLab VGG-16 model with just the segmentation output, using
simple copy-pasted (no generator) and CASIA 2.0 images during training.
• DL + G: DeepLab VGG-16 model with just the segmentation output, using
generated and CASIA 2.0 images during training.
• DL + DIH: DeepLab VGG-16 model with just the segmentation output, using
the images generated from [89] and CASIA 2.0 images during training. We
adapt the deep image harmonization (DIH) network for the generation stage as it also
manipulates regions.
• DL + CP + G: DeepLab VGG-16 model with just the segmentation output,
using copy-pasted, generated and CASIA 2.0 images during training.
• GS-Net: Generation and segmentation network with boundary artifact guided
[Figure 4.3 plots: F1 score (vertical axis, 0.30 to 0.60) against JPEG compression
quality (100, 70, 50) and scale ratio (1, 0.7, 0.5) for EXIF-selfconsistency, RGB-N
and GSR-Net. Panels: (a) In-The-Wild JPEG attack, (b) In-The-Wild scale attack,
(c) Carvalho JPEG attack, (d) Carvalho scale attack.]
Figure 4.3: Analysis of robustness under different attacks. Attacks with
JPEG compression consist of quality factors of 70 and 50; scale attacks use scaling
ratios of 0.7 and 0.5. (a) JPEG compression attacks on In-The-Wild. (b) Scale
attacks on In-The-Wild. (c) JPEG compression attacks on Carvalho. (d) Scale
attacks on Carvalho.
manipulation segmentation. No refinement stage is incorporated.
The results are shown in Table 4.2. Starting from our baseline model, simply
adding copy-pasted images (DL + CP) achieves improvement due to broadening
the manipulation distribution. In addition, replacing copy-pasted images with gen-
erated images (DL + G) also shows improvement compared to DL + CP on all the
datasets as it refines the boundary from naive copy-pasting. As expected, adding
both copy-pasted images and generated hard examples (DL + CP + G) is more
Dataset         Carvalho    In-The-Wild COVER       CASIA
CP + S          0.343       0.430       0.351       0.242
CP + G + S      0.354       0.441       0.355       0.270
CP + GSR        0.418       0.479       0.381       0.331
Table 4.3: F1 score manipulation segmentation comparison trained with
COCO annotations.
useful because the network has access to a larger distribution of manipulation.
Compared to applying the deep harmonization network (DL + DIH), our generation
approach (DL + G) performs better as it aligns well with the natural process
of manipulation and has a larger variety of manipulation.
The results also indicate the impact of the boundary guided segmentation network.
Directly predicting segmentation (DL + CP + G) does not explicitly learn
manipulation artifacts, and thus has limited generalization ability compared to GS-Net,
which uses the boundary features as side information. Furthermore, GSR-Net
boosts the performance over GS-Net since the refinement stage introduces new
boundary artifacts.
4.4.4 Robustness to Attacks
We apply both JPEG compression and image scaling attacks to test images
of In-The-Wild and Carvalho datasets. We compare GSR-Net with RGB-N [49],
EXIF-selfconsistency [6] using their publicly available code, and MFCN [82] using
the numbers reported in their paper. Figure 4.3 shows the results, which indicate that
our approach yields more stable performance than prior methods.
[Figure 4.4 row labels: Manipulated image, Edge output, Segmentation output, Ground truth.]
Figure 4.4: Qualitative visualization. The first row shows manipulated images
on different datasets. The second row indicates the final manipulation segmentation
prediction. The third row illustrates the output of the boundary artifacts branch. The
last row is the ground truth.
4.4.5 Segmentation with COCO Annotations
This experiment shows how much gain our model achieves without using the
manipulated images in CASIA 2.0. Instead of carefully manipulated training data,
we only utilize the object annotations in COCO to create manipulated images. We
compare the result of using different training data as follows:
• CP + S: Only using copy-pasted images to train the segmentation network.
• CP + G + S: Using both copy-pasted and generated images.
• CP + GSR: Using copy-pasted images and generated images. The refinement
[Figure 4.5 column labels: Authentic, Ground Truth, Copy Paste, Epoch 4, Epoch 20, Epoch 40.]
Figure 4.5: Qualitative visualization of the generation network. The first two
columns show the authentic background and manipulation mask. As the number of
epochs increases, the manipulated region matches better with the background and
thus boundary artifacts are harder to identify.
stage is applied.
Results are presented in Table 4.3. The performance using only copy-pasted
images (CP + S) on the four datasets indicates that our network truly learns
boundary artifacts. Also, the improvement after adding generated images (CP +
G + S) shows that our generation network provides useful manipulation examples
that increase generalization. Last, the refinement stage (CP + GSR) boosts
performance further by encouraging the network to spot new boundary artifacts.
4.4.6 Qualitative Results
Generation Visualization. We illustrate some visualizations of the generation
network in Figure 4.5. It is clear that the generation network learns to match the
pasted region with background during training. As a result, the boundary artifacts
are becoming subtle and the generation network produces harder examples for the
segmentation network.
Segmentation Results. We present qualitative segmentation results on four
datasets in Figure 4.4. Unsurprisingly, the boundary branch outputs the potential
boundary artifacts in manipulated images and the other branch fills in the interior
based on the predicted manipulated boundaries. The examples indicate that our
approach deals well with both splicing and copy-move manipulation based on the
manipulation clues from the boundaries.
4.5 Conclusion
We propose a novel segmentation framework that first utilizes a generation
network to enable generalization across a variety of manipulations. Starting from
copy-pasted examples, the generation network generates harder examples during
training. We also design a boundary artifact guided segmentation and refinement
network to focus on manipulation artifacts rather than semantic content. Furthermore,
the segmentation and refinement stages share the same weights, allowing for
much faster inference. Extensive experiments demonstrate the generalization ability
and effectiveness of GSR-Net on four standard datasets and show state-of-the-art
performance. The manipulation segmentation problem is still far from solved due to
the large variation of manipulations and post-processing methods. Including more
manipulation techniques in the generation network could potentially boost the gen-
eralization ability of the existing model and is part of our future research.
Chapter 5: DeepStrip: High Resolution Boundary Refinement
5.1 Introduction
Boundary detection is a well-studied problem and fundamental for human
recognition [99, 100]. Recent decades have witnessed considerable effort to improve
the boundary quality of an object that has been detected [101-109] or segmented
[110-114]. Consequently, it is not difficult to separate objects of interest
from backgrounds with precise boundaries utilizing these methods. While current
learning based boundary detection algorithms are usually computed on low resolution
(LR) images (0.04-0.25 million pixels), most photos taken these days are
much larger, ranging from cell phone size (8-16 million pixels) to professional camera
size (16-400 million pixels). Most methods are not designed for images of this
size and the excessive computation they require, and most machine learning based
methods cannot process them due to memory constraints. Given a precise low resolution
prediction, a workaround would be to directly apply upsampling to reach
high resolution (HR). Nevertheless, this usually yields poor quality results because
the semantic contents in the HR image are not considered. (See Figure 5.1.)
Most research in boundary detection focuses on improving the boundary qual-
ity in LR through introducing more semantic information [8, 115, 116] or human
[Figure 5.1 labels: LR mask, HR image, Boundary upsampling, Bilinear upsampling, Ours, HR ground truth.]
Figure 5.1: Concept overview. The example is from the newly created PixaHR
dataset. Given the low resolution mask and high resolution image on the left, bilinear
upsampling with scale factor 16× results in boundary misalignment in the high
resolution image, as shown in the enlarged boundary region on the right. Also,
the new details in high resolution are missed.
interaction [108,112,113,117,118]. While there has been some work on HR seman-
tic segmentation [119,120] and upsampling [121,122], there is less focus on accurately
capturing the boundary detail in HR. Instead of treating this problem as an upsam-
pling problem, we treat it as boundary detection and harness the contents in HR
images for prediction.
To this end, we propose a novel approach to handle boundary refinement in
HR images. (See Figure 5.2.) Our key idea is to allow the power of deep learning
methods to be applied to HR images in a time and memory efficient manner by op-
erating on narrow images made up of pixels near the boundary. Given an accurate
LR mask, the boundary in HR is likely in proximity to the upsampled LR boundary.
(See Figure 5.1.) Therefore, to save memory and computation, we propose to search
for the target boundary in a strip region near the boundary of the upsampled mask.
The strip image is formed by sampling pixels along and normal to the upsampled
mask boundary. Since the normals may not be smooth due to inaccurate boundaries
[Figure 5.2 labels: LR mask and HR image, Strip Creation, Skip Connections, Selection
layer, outputs x, m and s. Numbered losses: 1 C0 continuity regularization, 2 L1 loss,
3 Boundary distance loss, 4 Matching loss.]
Figure 5.2: Framework. To save memory and computation, we predict the boundary
in a strip image instead of the whole image. First, the strip image is extracted from
the HR image and corresponding LR mask. Taking the strip image as input, the
network predicts all potential boundaries (denoted as "x") and passes the initial
prediction to a selection layer (denoted as "m") to pursue a more accurate prediction
of the target boundary (denoted as "s"). The numbers indicate the losses
displayed on the right. Orange and green curves denote the ground truth and
prediction, respectively. Note that the strip image and prediction are rotated 90
degrees for visualization.
in the upsampled mask, we represent the LR boundary with a spline approximation
and directly treat the orthogonal derivatives of the upsampled spline as the normal
directions. Feeding as input the generated strip images, we train a network to firstly
predict all potential boundaries. Based on the initial prediction, an additional selec-
tion layer is included to predict the target boundary more accurately. To encourage
closer prediction and reduce false positives, we propose loss functions to minimize
the boundary distance between the prediction and ground truth in the strip image
and to encourage C0 continuity in the prediction. Lastly, we pursue consistent re-
sults through matching the prediction under different strip sizes to further boost
the performance.
To validate our approach, we create a new PixaHR dataset (see Figure 5.1
for an image example) consisting of 100 photos with an average resolution of 7k × 7k and
evaluate our approach up to scale factor 32×. Results on DAVIS 2016 and COCO
coarse annotations also show our ability to refine coarse boundary annotations.
In a nutshell, our contribution is three-fold. 1) We propose an approach to
predict the boundary in a strip image which converts potential boundary regions into
a strip space. This approach allows us to apply neural networks in a computationally
and memory efficient manner. 2) To improve performance and encourage closer
prediction, we propose novel losses including boundary distance, matching and C0
continuity loss. 3) We create a high resolution dataset for evaluation. To the best of
our knowledge, ours is the first learning based approach to perform dense HR boundary
refinement at resolutions up to 10k × 10k. Extensive experiments on both public
datasets and the new PixaHR dataset strongly highlight our effectiveness.
5.2 Related Work
Boundary Refinement. Multiple attempts have been made to improve boundary
quality through extracting better features [8,101,116,123,124]. Xie et al. [101] utilize
features from multiple layers and fuse both low and high level features to detect
edges. Liu et al. [116] explore rich convolutional features to boost the performance.
More closely related, attention has been paid to refining coarse boundary predictions or
annotations [8, 115]. Conventional methods like dense Conditional Random Fields
(CRF) [125] and Graph Cuts [126] model the relationship between nearby pixels and
thus can be applied to refine LR masks [127]. However, these are segmentation
based and utilize only low-level features. With more supervision, Yu
et al. [115] propose to simultaneously learn and align edges to refine misaligned
boundaries directly. Acuna et al. [8] further improve the performance by introducing
a thinning layer and an active alignment strategy to obtain refined boundaries. These
methods mainly explore edge detection in LR images. In contrast, we tackle HR
boundary refinement and apply detection only on regions around upsampled LR
boundary splines, and our approach is thus more memory and computation efficient.
Active Contours. Active contour models like Snakes [104] have been introduced
to refine boundaries from coarse ones. Various approaches have been explored to
handle the limitation of Snakes through, e.g., better initialization, morphological
operation [128] or user interaction [108]. Since our method also refines the curve
upsampled from LR mask, we can benefit from these methods and refine the bound-
ary further. Instead of taking the whole image as input, deep active contour [129]
learns to predict the flow of boundary pixels in a patch by patch fashion. However, it
cannot guarantee a continuous boundary prediction. Instead, our approach directly
extracts a consecutive boundary region and thus contains more global information.
Rather than predict the entire curve, other works have explored predicting control
points [117,130,131] through recurrent neural networks or Graph Convolutional Net-
works (GCN) [132] and then fit a curve as the final prediction. However, boundary
details are smoothed in the spline representation. In contrast, our approach predicts
precise edge information directly. Another line of work implicitly represents bound-
ary curves. For example, deep level set methods [133] evolve boundary curves by
minimizing the level energy function. Other learning based approaches [7, 134, 135]
have proposed to provide useful features, including texture, color or shape, for better
optimization. However, these learning based approaches suffer from computation
and memory issues when the resolution increases because they process the
entire image, while our approach only focuses on the regions around upsampled LR
boundaries, and thus requires less computation and memory overhead.
High Resolution Up-sampling. With the information of low resolution masks,
researchers have focused on achieving high quality HR segmentation masks. Con-
ventional methods [136,137] reach HR by applying upsampling jointly with the LR
mask reference. However, the fixed filter structures have difficulty capturing new
HR boundary details. He et al. [138] propose guided filtering to smooth while pre-
serving edge information when upsampling. Wu et al. [121] make the guided filter
faster and learnable. For HR segmentation approaches, Zhao et al. [120] propose
to aggregate LR features for HR segmentation and Chen et al. [119] align both
global and local features to avoid heavy GPU consumption for HR segmentation.
Even though these methods can be potentially adapted to boundary refinement, our
method mainly focuses on boundary regions and is designed to detect boundaries in
HR directly. Therefore, our approach learns new HR boundaries better, especially
when LR boundaries are coarsely annotated.
5.3 Approach
Our goal is to refine boundaries in HR images given precise LR masks. To
achieve this purpose efficiently, we propose to predict on a strip image that captures
the potential boundary region rather than the entire HR image. Figure 5.2 illustrates
[Figure 5.3 column labels: LR mask, HR image, Upsampled contour, Boundary region,
Strip image, Initial boundary, Final boundary.]
Figure 5.3: Strip image creation. To generate the strip image, the B-spline representation
of the contour in the LR mask is upsampled to HR as a coarse boundary. The HR
region along the normal direction (e.g., red and green arrows) of the contour is then
extracted. Finally, the strip image and corresponding boundary ground truth are
obtained by flattening the extracted region in both the HR image and mask. Note
that the final boundary filters out noisy boundaries (e.g., the red box region) from
the initial boundary. The strip image and boundaries are rotated 90 degrees for
visualization.
our framework. Our approach consists of strip image creation, which converts the HR
RGB image into a strip image; strip boundary prediction, which refines the edges on
the strip image using a network; and strip reconstruction, which reconstructs the
prediction in the original image from the strip boundary prediction during testing.
5.3.1 Strip Image Creation
Figure 5.3 describes the procedure of strip image creation. Due to the interpolation
introduced by upsampling, a directly upsampled boundary from the LR
image is likely to be shifted from the ground truth boundary in HR. To localize
the real HR boundary pixels, it is more effective to search around the upsampled
boundary than to search the whole image. Therefore, we extract pixels near the
upsampled boundary to create a strip image. To create the strip image, we step
along the boundary and sample points along the normal direction at each point on
the curve. To obtain smoothly varying normal directions along the coarse boundary,
we represent the LR boundary by a B-spline and upsample the LR spline to HR.
Given the HR image $I(p, q)$ and the upsampled spline representation $C =
(p(k), q(k))$ of the boundary contour, where $(p(k), q(k))$ denotes the HR image
coordinates parameterized by arclength $k$ along the curve, the continuous strip image
$J_{I,C}$ is defined by
$$J_{I,C}(k, t + H/2) = I\big(p(k) + t \cdot n_p(k),\; q(k) + t \cdot n_q(k)\big), \tag{5.1}$$
where $t$ denotes the distance in the normal direction, $H$ denotes the height of the
strip image, and $(n_p(k), n_q(k))$ is the unit normal to the curve at arclength $k$.
Accordingly, the strip image $J_{I,C}(j, i)$ with dimension $H \times W$ is obtained by sampling
$k = j \cdot dk$, $t = i \cdot dt$, where the tangential step size is $dk = \lfloor |C|/W \rfloor$ and the normal step
size $dt$ is set to 1 for simplicity. $|C|$ denotes the length of $C$, $j = 0, 1, \ldots, W$ and
$i = -H/2, \ldots, 0, \ldots, H/2$. Also, bilinear interpolation is applied in the high resolution
image to evaluate $I(p, q)$ for non-pixel coordinates $(p, q)$.
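The sampling in Equation 5.1 can be sketched in NumPy as below. Several simplifications are assumptions of this illustration rather than details from the text: p is treated as the row and q as the column coordinate, an analytic circle stands in for the upsampled B-spline contour (so the normals are known in closed form), the strip has exactly `height` rows with offsets starting at -height // 2, and the helper names `bilinear` and `strip_image` are hypothetical.

```python
import numpy as np

def bilinear(img, p, q):
    """Bilinearly sample a 2-D image at float coordinates p (row), q (column)."""
    h, w = img.shape
    p = np.clip(p, 0, h - 1)
    q = np.clip(q, 0, w - 1)
    p0 = np.floor(p).astype(int)
    q0 = np.floor(q).astype(int)
    p1 = np.minimum(p0 + 1, h - 1)
    q1 = np.minimum(q0 + 1, w - 1)
    a = p - p0   # fractional row offset
    b = q - q0   # fractional column offset
    return ((1 - a) * (1 - b) * img[p0, q0] + (1 - a) * b * img[p0, q1]
            + a * (1 - b) * img[p1, q0] + a * b * img[p1, q1])

def strip_image(img, curve, normals, height):
    """Sample the image along the normals of a curve (Eq. 5.1, dt = 1).

    curve, normals: arrays of shape (W, 2) holding (p, q) points and unit normals.
    Returns an array of shape (height, W); row r corresponds to t = r - height // 2.
    """
    t = np.arange(height) - height // 2
    p = curve[None, :, 0] + t[:, None] * normals[None, :, 0]
    q = curve[None, :, 1] + t[:, None] * normals[None, :, 1]
    return bilinear(img, p, q)

# Toy HR image: a filled disc of radius 20 centred in a 64x64 image.
yy, xx = np.mgrid[0:64, 0:64]
img = ((yy - 32) ** 2 + (xx - 32) ** 2 <= 20 ** 2).astype(float)

# Stand-in contour: 32 points on the disc boundary with outward unit normals.
theta = np.linspace(0, 2 * np.pi, 32, endpoint=False)
normals = np.stack([np.sin(theta), np.cos(theta)], axis=1)
curve = 32 + 20 * normals
strip = strip_image(img, curve, normals, height=9)
```

In the resulting strip the curved disc boundary is flattened into a roughly horizontal edge: the first row (4 pixels inside the contour) is foreground and the last row (4 pixels outside) is background.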
The corresponding HR strip boundary ground truth is obtained similarly with
two adaptations. First, for large sampling scale factors, the ground truth boundary
is likely to be outside the range of the strip if the strip height is small, making the
boundary in the strip image discontinuous. We add labels at the border of the strip if no
boundary pixel is included to maintain the C0 continuity of the boundary pixels in
the strip image. Second, if the strip height is large, multiple boundary pixels might
be included in each column in regions where the boundaries are closer than the strip
height. In this case, we filter out the extraneous boundaries that are not connected
to the current boundary. (See Figure 5.3.)
5.3.2 Strip Boundary Prediction
Provided the HR strip image as input, we train a network to predict the
corresponding boundaries within the strip domain. For memory efficiency, we adapt
the lightweight encoder-decoder structure of the nested U-Net [139,140] for boundary
prediction. Given that the proper dimension of the strip image varies for different
resolutions, we use instance normalization [73] during training so that the mean and
variance are approximated per image.
As is shown in Figure 5.2, two prediction layers are proposed to learn the tar-
get boundary in strip image to account for the fact that multiple true boundaries
may be present in a single column of the strip image. Firstly, we extract the last
upsampling layer to predict all potential boundaries. This encourages the network
to learn boundary features within the strip image. To predict the target boundary,
we add a learnable selection layer to pick up the target boundary from potential
boundaries. The input to the selection layer is the initial prediction, and we apply
column-wise softmax to the output of the selection layer as a confidence score for
the initial prediction. Finally, the target boundary is computed by the multipli-
cation between the initial prediction and the selection score. The selection layer
also smooths the initial prediction, analogous to the non-maximum suppression in
Canny edge detection [100]. Formally,
$$s = x \odot m, \tag{5.2}$$
where $\odot$ denotes pixel-wise multiplication, $s$ denotes the final prediction, $x$ denotes
the initial prediction, which applies a sigmoid activation to the output of the last
upsampling layer, and $m$ is the softmax-activated output of the selection layer.
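A small NumPy sketch shows how the product in Equation 5.2 suppresses competing boundaries within a column. In the actual network the selection scores come from a learnable layer applied to the initial prediction; here `selection_logits` is a hand-set stand-in, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def column_softmax(z):
    """Softmax over rows, computed independently for each column."""
    z = z - z.max(axis=0, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def final_prediction(logits, selection_logits):
    """Return (x, m, s) with s = x * m as in Eq. 5.2."""
    x = sigmoid(logits)                    # initial prediction: all boundaries
    m = column_softmax(selection_logits)   # selection score per column
    return x, m, x * m

# Strip of height 5 and width 3. Column 0 contains two candidate boundaries
# (rows 1 and 3); the stand-in selection scores favour row 3.
logits = np.full((5, 3), -4.0)
logits[1, 0] = logits[3, 0] = 4.0
logits[2, 1] = logits[2, 2] = 4.0
selection_logits = np.zeros((5, 3))
selection_logits[3, 0] = 5.0
selection_logits[2, 1] = selection_logits[2, 2] = 5.0

x, m, s = final_prediction(logits, selection_logits)
```

Both candidates in column 0 have the same initial confidence, but after multiplication by the column-wise softmax only the selected row keeps a high response, which is what makes the final prediction approximately unimodal per column.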
5.3.3 Loss Function
Our basic loss function for the initial and final boundary prediction is a
weighted l1 loss to differentiate the boundary from non-boundary pixels. Formally,
$$L_e = \alpha \sum_{(i,j) \in Y_+} |y_{ij} - s_{ij}| + (1 - \alpha) \sum_{(i,j) \in Y_-} |y_{ij} - s_{ij}|, \tag{5.3}$$
where $Y_+$ and $Y_-$ denote boundary and non-boundary pixels, respectively. $\alpha =
|Y_-|/|Y|$ denotes the weight to balance the labels and $|Y|$ denotes the total number
of pixels in the strip mask. $s_{ij}$ denotes the prediction and $y_{ij}$ denotes the binary ground
truth at position $(i, j)$ in the strip image.
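The balancing effect of the weighted l1 loss can be checked on a toy strip mask. The sketch below is a direct NumPy transcription under the assumption that the balancing weight is |Y-|/|Y| applied to the boundary pixels (written `alpha` here); the function name is hypothetical.

```python
import numpy as np

def weighted_l1(s, y):
    """Class-balanced l1 loss between prediction s and binary ground truth y."""
    pos = y == 1
    alpha = (~pos).sum() / y.size            # |Y-| / |Y|
    err = np.abs(y - s)
    return alpha * err[pos].sum() + (1 - alpha) * err[~pos].sum()

# Strip mask with one boundary pixel per column: 4 boundary, 12 non-boundary.
y = np.zeros((4, 4))
y[1, :] = 1.0

loss_perfect = weighted_l1(y.copy(), y)            # prediction equals ground truth
loss_all_zero = weighted_l1(np.zeros((4, 4)), y)   # misses every boundary pixel
loss_all_one = weighted_l1(np.ones((4, 4)), y)     # fires on every pixel
```

Here alpha = 0.75, so missing the 4 boundary pixels costs 0.75 × 4 = 3 and firing on the 12 background pixels costs 0.25 × 12 = 3: the two trivial predictions are penalized equally despite the class imbalance.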
In addition, we adapt Dice loss [141] to boundary prediction to encourage
intersection between prediction and ground truth:
$$L_{dice} = 1 - \frac{2 \sum_{i,j} s_{ij} y_{ij} + \epsilon}{\sum_{i,j} s_{ij} + \sum_{i,j} y_{ij} + \epsilon}, \tag{5.4}$$
where $\epsilon$ denotes a small constant to avoid division by zero. The loss aims to maximize
the intersection over union between the prediction and ground truth.
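A minimal NumPy sketch of the Dice loss follows, assuming the common form 1 - (2 Σ s·y + ε) / (Σ s + Σ y + ε) with the sums taken over all strip pixels. The value of ε and the function name are illustrative choices.

```python
import numpy as np

def dice_loss(s, y, eps=1e-6):
    """Dice loss between a soft prediction s and a binary ground truth y."""
    intersection = (s * y).sum()
    return 1.0 - (2.0 * intersection + eps) / (s.sum() + y.sum() + eps)

# Ground truth boundary on row 1; one perfect and one fully misplaced prediction.
y = np.zeros((4, 4))
y[1, :] = 1.0
shifted = np.zeros((4, 4))
shifted[2, :] = 1.0

loss_perfect = dice_loss(y.copy(), y)
loss_disjoint = dice_loss(shifted, y)
```

Note that the misplaced prediction is only one row away from the ground truth, yet its loss is essentially the maximum value of 1, since the loss only measures overlap and not distance.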
5.3.3.1 Boundary Distance Loss
For boundary prediction, a closer prediction to the boundary ground truth is
preferred. However, neither the weighted l1 loss nor the Dice loss is sensitive to the distance
from prediction to ground truth. Therefore, we introduce a boundary distance loss
to measure the average distance between the predicted boundary and the ground
truth to encourage closer prediction. Thanks to the strip domain which maps the
regions along the normal direction in every column, the boundary distance can be
calculated directly through the difference between the prediction and ground truth.
Given the prior that only one boundary pixel exists in each column in the final strip
mask, the boundary distance at every column can be measured by calculating the
argmax difference at every column between the prediction and ground truth. Since
argmax function is not differentiable, we approximate it through soft argmax before
calculating the boundary distance and formulate the loss as
$$L_d = \frac{1}{W} \sum_{j=1}^{W} \left| \operatorname{softarg}_i(s_{ij}) - \arg\max_i(y_{ij}) \right|, \tag{5.5}$$
where W is the width of strip mask and the soft argmax in each column (normal
direction) is computed as
$$\operatorname{softarg}_i(s_{ij}) = \sum_{i=1}^{H} i \cdot \frac{|s_{ij}|}{\|S_j\|_1}, \tag{5.6}$$
where $\|S_j\|_1$ is the l1 normalization of $s_{ij}$ at column j. Since the final prediction
sij encourages a unimodal distribution according to Equation 5.2, this loss enforces
the column-wise maximum activation of the final prediction to match with that in
ground truth.
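
The boundary distance loss of Equations 5.5 and 5.6 can be sketched as follows. This is a plain Python illustration under stated assumptions: rows are indexed from 1 as in Equation 5.6, a column with zero total activation is given position 0, and the function names are hypothetical.

```python
def soft_argmax_column(scores, j):
    """Soft argmax along the normal direction (rows) of column j, Equation 5.6."""
    col = [abs(row[j]) for row in scores]
    norm = sum(col) or 1.0  # assumption: avoid dividing by zero on empty columns
    return sum((i + 1) * v for i, v in enumerate(col)) / norm


def boundary_distance_loss(pred, target):
    """Mean column-wise distance between soft argmax and ground truth argmax."""
    height, width = len(pred), len(pred[0])
    total = 0.0
    for j in range(width):
        col_true = [target[i][j] for i in range(height)]
        hard = col_true.index(max(col_true)) + 1  # 1-based row of the boundary
        total += abs(soft_argmax_column(pred, j) - hard)
    return total / width
```

Because the soft argmax is a weighted average of row indices, it stays differentiable with respect to the scores, which is the property the hard argmax lacks.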
5.3.3.2 Matching Loss
Since the strip height is fixed during training, to introduce variance and avoid
overfitting on specific strip height, we augment the data through cropping the strip
height. Starting from a large height, we crop the strip to a shorter one and make
a new prediction. For consistency, the overlapped regions between original and the
cropped strip should have the same initial prediction since all potential boundaries
are predicted. Formally, we take a l1 loss between the cropped and original initial
prediction to calculate the matching loss,
$$L_m = \frac{1}{|Y_{crop}|} \sum_{(i,j) \in Y_{crop}} |x'_{ij} - x_{ij}|, \tag{5.7}$$
where $Y_{crop}$ is the cropped region of original mask Y and $x'_{ij}$ is the new initial
prediction for the cropped strip image. In addition, this loss also helps the network
learn to ignore spurious edges detected near the border of the strip.
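
A small sketch of the matching loss in Equation 5.7 is given below. The argument `top`, the row offset of the crop inside the original strip, is a hypothetical parameter introduced for illustration; how the crop is positioned is an assumption here.

```python
def matching_loss(full_pred, crop_pred, top):
    """Mean l1 difference between the initial prediction on a cropped strip
    and the initial prediction on the original strip over the shared rows.

    full_pred -- 2D list, initial prediction on the original strip
    crop_pred -- 2D list, initial prediction on the height-cropped strip
    top       -- row of full_pred where the crop starts (hypothetical)
    """
    crop_h, width = len(crop_pred), len(crop_pred[0])
    total = 0.0
    for i in range(crop_h):
        for j in range(width):
            total += abs(crop_pred[i][j] - full_pred[top + i][j])
    return total / (crop_h * width)
```

The loss is zero exactly when the network predicts the same potential boundaries in the overlapped region regardless of the strip height.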
5.3.3.3 C0 Continuity Regularization
Additionally, we add a C0 continuity regularization to the final prediction to
enforce a continuous prediction. Ideally, at most one boundary pixel is allowed at
every column in the final prediction, so the prediction is C0 continuous if the maxi-
mum activated position of every column is C0 continuous. Specifically, we compute
the soft argmax of every column, calculate a marginal difference between nearby
argmax columns and penalize the position within a window size where prediction
becomes discontinuous. Formally,
$$L_{C0} = \frac{1}{W} \sum_{j=1}^{W} P\left( \max\left(0, \left| \operatorname{softarg}_i(s_{ij}) - \operatorname{softarg}_i(s_{i,j+1}) \right| - v \right) \right), \tag{5.8}$$
where v denotes the margin value and P denotes the maxpooling with a fixed kernel
size so that all pixels within the range get penalized. $s_{i,W+1}$ is replicated by $s_{i,1}$
for calculation. This loss serves as a self regularization as no ground truth label is
required.
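
The regularizer in Equation 5.8 can be sketched as below, taking the per-column soft argmax positions as input. The pooling is assumed to use stride 1 with the window truncated at the strip ends; that padding choice and the function name are assumptions for illustration.

```python
def c0_regularization(positions, margin=1.0, kernel=11):
    """C0 continuity penalty on per-column soft argmax positions.

    positions -- list of soft argmax values, one per strip column
    margin    -- jump size tolerated between neighbouring columns (v)
    kernel    -- max pooling window size (P)
    """
    width = len(positions)
    # Hinge on the jump to the next column; the last column wraps to the first.
    jumps = [max(0.0, abs(positions[j] - positions[(j + 1) % width]) - margin)
             for j in range(width)]
    half = kernel // 2
    # Max pooling spreads each penalty to the columns around a discontinuity.
    pooled = [max(jumps[max(0, j - half): j + half + 1]) for j in range(width)]
    return sum(pooled) / width
```

A perfectly flat boundary gives zero penalty, while a single jump larger than the margin penalizes every column within the pooling window around it.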
The total loss function is therefore,
$$L_{total} = L_e + L_{dice} + \lambda_1 L_d + \lambda_2 L_m + \lambda_3 L_{C0}, \tag{5.9}$$
where $\lambda_1, \lambda_2, \lambda_3$ are hyper-parameters to adjust the weight of each loss. $L_e$ is applied
to both the initial and final prediction. Lm is only applied to the initial prediction
and Ldice, Ld, LC0 are applied only to the final prediction. With the total loss func-
tion, a closer prediction is preferred and the network draws attention to the target
boundaries.
5.3.4 Strip Reconstruction
To make a prediction on the HR image, a mapping between the predicted
strip boundaries and the full HR mask is required at inference. For every pixel in
the strip image, the corresponding coordinates in the HR image are recorded for
reconstruction. Given the raw prediction, we optimize the path with a dynamic
programming similar to seam carving [142] and find the path with minimum energy.
We minimize the function
$$E_{ij} = -s_{ij} - \frac{|\nabla I(i, j)|}{\max(|\nabla I|)}, \tag{5.10}$$
where $|\nabla I(i, j)|$ denotes the magnitude of the image gradients at (i, j). The algorithm
searches for the energy cost for neighborhood pixels and finds the path with
a minimum energy cost, which indicates the boundary path with the highest prob-
ability. We then connect the original coordinates of the final path in the full mask
to form the full prediction.
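
The path search can be illustrated with a seam-carving-style dynamic program. The sketch below is an assumption-laden illustration rather than the exact procedure used here: it takes any H x W energy map (for example one built from Equation 5.10), allows the path to move by at most one row between adjacent columns, and ignores the wrap-around of closed contours.

```python
def min_energy_path(energy):
    """Return one row index per column for the minimum-energy left-to-right path."""
    height, width = len(energy), len(energy[0])
    cost = [[energy[i][0]] + [0.0] * (width - 1) for i in range(height)]
    back = [[0] * width for _ in range(height)]
    for j in range(1, width):
        for i in range(height):
            # Predecessors: same row or one row up/down in the previous column.
            rows = range(max(0, i - 1), min(height, i + 2))
            best = min(rows, key=lambda r: cost[r][j - 1])
            cost[i][j] = energy[i][j] + cost[best][j - 1]
            back[i][j] = best
    # Trace back from the cheapest end point.
    path = [min(range(height), key=lambda r: cost[r][width - 1])]
    for j in range(width - 1, 0, -1):
        path.append(back[path[-1]][j])
    path.reverse()
    return path
```

The returned row indices select one boundary pixel per strip column, which can then be mapped back to the recorded HR coordinates.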
At inference, the flexible input dimension of our framework enables different
strip sizes for different images. Benefitting from it, we determine the width of strip,
which reflects the number of sampling points along the boundary, by multiplying
the LR boundary length with the scale factor. We fix the height of strip with the as-
sumption that all target boundaries are involved, and an adaptive height adjustment
strategy is also discussed in Section 5.4.6. For objects containing multiple contours
due to complex topology, the prediction is made on each contour separately.
5.3.5 Implementation Details
We generate the spline curve efficiently from the binary mask using the scipy
function "splprep" after extracting contours. To guarantee a consistent sign for the
normals, we extract strip images from closed contours. The starting point of strip
is not deterministic so that no bias is introduced in training. The final ground
truth strip boundary mask is obtained by taking the gradient of the ground truth
segmentation mask after removing any isolated noisy boundaries. Additionally,
we randomly add small shifts to the spline representation to introduce position
variation of the target boundary in strip image during training. Our framework is
implemented in PyTorch. The encoder consists of four 3 × 3 convolutional layers and
the decoder consists of four upsampling layers. The selection layer consists of another
convolutional layer with 3 × 3 kernel size. The activation function is ReLU [143] for
all encoder and decoder layers. We use instance normalization for all normalization
layers to enable flexible input size at inference. During training, the input strip
dimension is fixed as 80 × 4096. We train the network for 70 epochs with batch
size 6 on an NVIDIA GeForce TITAN P6000. We use Stochastic Gradient Descent
(SGD) as optimizer and the initial learning rate is 0.1. The learning rate decays by
a factor of 10 after every 20 epochs. The momentum is set to 0.9 and weight decay
is set to 0.0005. $\lambda_1$, $\lambda_2$ and $\lambda_3$ are set to be 0.1, 20 and 1 empirically. We crop strip
image by half to obtain Ycrop for matching loss and the maxpooling kernel size for
C0 continuity regularization is 11. The margin in C0 continuity regularization is set
to 1. Horizontal flipping is applied as data augmentation.
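
For reference, the step decay described above can be written as a one-line schedule. The sketch assumes epochs are counted from 0; the function and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, step=20, factor=10.0):
    """Learning rate at a given epoch: divided by `factor` every `step` epochs."""
    return base_lr / (factor ** (epoch // step))
```

Under this assumption the 70 training epochs use rates of 0.1, 0.01, 0.001 and 0.0001 for epochs 0-19, 20-39, 40-59 and 60-69, respectively.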
                          DAVIS 2016 [56] 4×     PixaHR 8×              PixaHR 16×             PixaHR 32×
Method                    F (0 pix)  F (1 pix)   F (1 pix)  F (2 pix)   F (1 pix)  F (2 pix)   F (1 pix)  F (2 pix)
Bilinear Upsampling       0.171      0.521       0.116      0.194       0.15       0.187       0.07       0.106
Grabcut [103]             0.232      0.541       0.063      0.121       0.020      0.053       0.0        0.0
Dense CRF [125]           0.268      0.702       0.278      0.434       0.245      0.389       0.142      0.227
Bilateral Solver [137]    0.274      0.569       0.207      0.277       0.185      0.247       0.156      0.216
Curve-GCN [117]           0.076      0.160       0.021      0.033       0.018      0.028       0.012      0.028
DELSE [7]                 0.271      0.531       0.096      0.133       0.086      0.132       0.080      0.130
STEAL [8]                 0.171      0.348       0.282      0.457       0.151      0.255       0.09       0.144
JBU [136]                 0.175      0.447       0.140      0.231       0.117      0.184       0.055      0.090
Guided Filtering [138]    0.129      0.349       0.121      0.195       0.092      0.145       0.060      0.097
Deep GF [121]             0.193      0.461       0.286      0.420       0.175      0.269       0.09       0.141
U-Net boundary            0.320      0.656       0.170      0.297       0.139      0.197       0.068      0.108
U-Net strip (baseline)    0.303      0.710       0.334      0.455       0.303      0.425       0.267      0.357
Ours                      0.423      0.788       0.416      0.508       0.396      0.498       0.330      0.447

Table 5.1: Boundary-based F score comparison. The scale factor between low and
high resolution image is 4 on DAVIS 2016 and 8, 16, 32 on PixaHR. For DAVIS
2016, the pixel dilation is 0 and 1 and for PixaHR is 1 and 2 instead.
5.4 Experiments
We evaluate our approach on two HR datasets which provide both low and
high resolution ground truth in Section 5.4.2, and then analyze the importance of
each components in our framework in Section 5.4.3. We also provide memory and
speed comparison in Section 5.4.4.
5.4.1 Datasets and Metrics
For our experiments, we need a dataset with highly accurate pixel-level HR an-
notation. Unfortunately, most current datasets are low resolution and many provide
inaccurate polygon boundaries as ground truth annotations. We found DAVIS [56]
to provide accurate enough results with a resolution that is usable for our needs.
To better evaluate the results at large scaling factors, we introduce a new dataset,
PixaHR. We describe these datasets below.
DAVIS 2016 [56]: A benchmark for video segmentation which consists of 50 classes
with precise annotations in both 480P and 1080P. To enlarge the scale factor, we
down sample the 480P mask by a factor of 2, train our approach on the 30-class
1080P training set with 240P LR masks and test on 20-class 1080P testing set. The
scale factor is 4.5 for this experiment. The results are evaluated frame by frame.
PixaHR: To evaluate more realistic scenarios, we create a PixaHR dataset. It
contains 100 images with average resolution 7k × 7k (ranging from 5k × 5k to 10k ×
10k) collected from public photograph website Pixabay [144]. We manually annotate
the object boundary in the HR images, downsample the HR mask by 8×, 16× and
32× and obtain binary LR mask for evaluation. The photos were uploaded by public
users and have diverse contents. We apply our model that was trained on DAVIS
to this dataset for evaluation.
Metrics: We use boundary-based F score introduced by Perazzi et al. [56] for
evaluation, which is designed to evaluate the boundary quality of segmentation.
As it allows changing pixel tolerance by dilation, we set 0 and 1 pixel dilation on
DAVIS, and 1 and 2 pixel on PixaHR dataset to measure how close the prediction
is to the ground truth.
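
To make the role of the pixel tolerance concrete, a simplified boundary F score is sketched below. It is not the official DAVIS evaluation code: boundaries are given as lists of (row, column) points, the tolerance is implemented as a square neighbourhood, and an empty boundary is scored as 0; all of these are assumptions for illustration.

```python
def boundary_f_score(pred_pts, gt_pts, tol=1):
    """F score between two boundary point sets with a pixel tolerance."""
    pred, gt = set(pred_pts), set(gt_pts)
    if not pred or not gt:
        return 0.0  # assumption: an empty boundary counts as a miss

    def near(point, others):
        r, c = point
        return any((r + dr, c + dc) in others
                   for dr in range(-tol, tol + 1)
                   for dc in range(-tol, tol + 1))

    precision = sum(near(p, gt) for p in pred) / len(pred)
    recall = sum(near(g, pred) for g in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A predicted boundary that is shifted by one pixel scores 0 at zero tolerance and 1 at one-pixel tolerance, which is why results are reported at more than one dilation.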
5.4.2 Main Results
For upsampling based approaches, we compare our approach with Bilinear
Upsampling, Bilateral Solver [137], Joint Bilateral Upsampling [136] (JBU),
Guided Filtering [138] and Deep GF [121]. The boundary is obtained by tak-
ing the gradient of the upsampled mask. For boundary refinement approaches, we
compare with Grabcut [103], Dense CRF [125] and STEAL [8] using upsam-
pled mask as initialization. For active contour methods, the baselines are Curve-
GCN [117] and DELSE [7], and predictions on PixaHR are made in LR and
upsampled to original resolution since the whole boundary region is required at in-
ference. Learning based approaches are trained or fine-tuned on the training set of
DAVIS and evaluated directly on all datasets. In addition, we also compare our own
implemented baselines as below:
• U-Net boundary: We train U-Net directly on the full resolution images on
DAVIS for boundary prediction. We concatenate both the full resolution image and
upsampled masks as input so that the network learns to refine the coarse masks. The
loss function is a weighted binary cross entropy following Xie et al. [101]. Similarly,
we also add deep supervision and fuse all intermediate layers to obtain the final
prediction. The prediction is made patch-by-patch with patch size 1920 × 1080 on
PixaHR dataset.
• U-Net strip (baseline): Our baseline method which learns to directly predict
the target boundary on strip image. Only weighted l1 loss is used as loss function.
• Ours: Our full model which applies selection layer to predict the boundary in
strip images with our boundary distance loss, matching loss and C0 continuity
regularization.
Table 5.1 exhibits our advantage over the baselines. For the DAVIS dataset,
a simple upsampling yields a boundary shift from the ground truth and thus
performs poorly. Grabcut and dense CRF are segmentation based and thus yield worse
performance than ours. Even though other methods including bilateral solver, JBU
and Deep GF leverage the low resolution mask, they are designed for general
upsampling instead of for boundary refinement and prediction. Curve-GCN fits the
curve from the predicted control points which cannot generate as precise a
boundary as ours. DELSE moves the contour along the gradient of its energy function,
but is less robust than our approach which predicts the target boundary pixels.
Additionally, our approach outperforms STEAL as the scale factor increases,
indicating the active alignment in STEAL may not be accurate enough for pixel-level
boundary prediction. Compared with U-Net boundary, predicting the boundary in
strip image (U-Net strip) yields a slightly better performance, perhaps because the
strip image narrows down the search space for target boundary. As expected, with
our selection layer and proposed losses, we boost the performance further by better
determining the target boundaries from other potential boundaries. A similar
tendency is observed on PixaHR dataset. Note that in large scale factor 32, most of
the methods fail to make close predictions to the ground truth while our method
still has a relatively stable performance.

                                     DAVIS 2016   PixaHR 16×
Method                               F (0 pix)    F (1 pix)
U-Net strip                          0.303        0.303
U-Net strip dice                     0.323        0.320
U-Net strip dice + selection         0.372        0.328
U-Net strip dice + selection + BD    0.390        0.342
Ours w/o matching                    0.405        0.365
Ours                                 0.423        0.396

Table 5.2: Ablation analysis on two datasets. Each entry is the boundary-based F
score tested on individual dataset.
                          Memory (MB)            Speed (s/image)
Method                    DAVIS 2016 / PixaHR    DAVIS 2016 / PixaHR
Bilinear Upsampling       -                      0.01/0.02
Grabcut [103]             -                      5.17/320
Dense CRF [125]           -                      3.22/310
Bilateral Solver [137]    -                      4.18/158
JBU [136]                 -                      0.08/5.71
Guided filtering [138]    -                      0.08/16.1
Deep GF [121]             -                      0.07/3.95
STEAL [8]                 7775/7959              43.1/4231
Curve-GCN [117]           17330/17330            0.93/75.2
DELSE [7]                 17771/17771            1.02/20.4
U-net boundary            17000/17000            0.31/24.5
Ours                      3300/3300              0.28/2.51

Table 5.3: Memory and speed comparison. Each entry is the memory or speed on
DAVIS 2016/PixaHR dataset. We only compare the memory usage among learning-
based approaches.
5.4.3 Ablation Analysis
We analyze the importance of each component in our framework as listed
below:
• U-Net strip dice: Adding dice loss to the baseline.
• U-Net strip dice + selection: Adding dice loss and selection layer to the
baseline.
• U-Net strip dice + selection + BD: Adding dice, boundary distance loss and
selection layer to the baseline.
• Ours w/o matching: Adding additional C0 regularization. It is our full model
without the matching loss.
Table 5.2 summarizes the comparison result. Starting from our baseline U-Net
strip, adding dice loss encourages more intersection with the ground truth boundary
and thus yields better performance. Comparing U-Net strip + dice with U-Net
strip + dice + selection, the selection layer boosts the performance on DAVIS by
a large margin, indicating its effectiveness in suppressing the noisy boundaries and
smoothing the final prediction. Also, with the boundary distance loss the network
learns to have closer prediction. With C0 regularization (Ours w/o matching),
the network filters out false positive boundaries by making a continuous prediction.
Finally, the performance further improves with the matching loss because the
network makes a consistent prediction over different strip heights to avoid overfitting.

Figure 5.4: Qualitative results on PixaHR 32×. Rows from top to bottom are the
results of Dense CRF, STEAL, Ours and the Ground truth. We show the entire
boundary (green color) result first and enlarge the blue bounding box region for
comparison (boundaries are whitened).
Figure 5.5: Qualitative results on COCO. Columns from left to right are coarse
annotation, DELSE [7], STEAL [8] and Ours.
5.4.4 Memory and Speed Comparison
Since we only extract a strip image for prediction, our approach is efficient in
both memory and computation. Table 5.3 compares our memory overhead and speed
performance with baselines. Overall, our computation and memory requirement is
relatively small. Our memory requirement is smaller than other learning based
approaches. Note that for U-Net boundary and STEAL, the prediction on PixaHR
is made patch-by-patch due to the high resolution.
More specifically, the main computation in our approach lies in strip
reconstruction. For example, for a 1920 × 1080 DAVIS image with around 3200 pixels
along the boundary, our strip image creation takes 0.08s, the prediction process
takes 0.06s and the strip reconstruction takes 0.14s. A similar computation
percentage is observed on PixaHR.

                            PixaHR 32×
Method                      F (1 pix)
Ours                        0.330
Ours adaptive 1 segment     0.353
Ours adaptive 2 segments    0.365

Table 5.4: Strip height selection comparison on PixaHR 32×.
5.4.5 Qualitative Results
We show visualization comparisons in Figure 5.4. It is clear that our approach
produces more accurate boundaries than the other methods. To further show the
effectiveness of our approach on refining the boundaries given LR or coarse masks, we
provide qualitative results on COCO where only polygonal boundary ground truth
is provided. We directly extract strip image using the coarse annotation on COCO,
and visualize the prediction in Figure 5.5. Compared with other approaches, our
method provides more accurate boundaries, indicating the potential application of
our approach to help refine the coarse boundaries.
5.4.6 Strip Height Adaptation
We predict the target boundary in the strip image under the assumption that
the target boundary exists within the pre-defined height range. However, this might
not hold true, especially for a large scale factor. While one solution is to pre-define
a larger height for strip image creation, we propose to progressively increase the
height and regenerate strip image to make new predictions at inference. Specifically,
we increase the height of strip image until the summation of the final prediction
score decreases. Furthermore, height adjustment is more flexible by dividing the
whole contour into several segments and adjusting them independently. The results
are shown in Table 5.4. The comparison between Ours and Ours adaptive 1
segment indicates the effectiveness of a flexible height. The performance
increases further when dividing the whole contour into 2 segments which allows
variable height for different regions.
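
The progressive height search can be sketched as a simple loop. Here `score_for_height` is a hypothetical callback that re-extracts the strip at a given height, runs the network and returns the summed final prediction score; the start height, step and upper bound are likewise illustrative assumptions.

```python
def adapt_strip_height(score_for_height, start, step, max_height):
    """Grow the strip height until the summed prediction score decreases."""
    best_height = start
    best_score = score_for_height(start)
    height = start + step
    while height <= max_height:
        score = score_for_height(height)
        if score < best_score:
            break  # the score dropped: keep the previous height
        best_height, best_score = height, score
        height += step
    return best_height
```

For the segment-wise variant, the same loop would be run independently on each contour segment with its own callback.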
5.5 Conclusion
In summary, this paper presents a novel strategy for HR boundary refinement that is
efficient in both computation and memory, given precise LR masks. To save
memory, we propose to extract boundary regions along the upsampled boundary
spline to form a strip image and make prediction within this strip image. To fo-
cus on the target boundaries in strip image, boundary distance, matching loss and
C0 continuity regularization have been proposed. Extensive experiments on both
public and our newly created dataset demonstrate the effectiveness of the proposed
approach. However, the current approach still has difficulty predicting complicated
topology and soft boundary regions. A smarter adaptive strip height adjustment
for every pixel might be a potential solution, which is left for future research.
Chapter 6: Multi-model and Multi-level Knowledge Distillation for
Incremental Learning
6.1 Introduction
Deep neural networks perform well on many visual recognition tasks [11, 145,
146] given specific training data. However, problems arise when adapting networks
to unseen categories while remembering seen ones, which is known as catastrophic
forgetting [147–149]. To tackle this issue, there is growing research attention on
incremental learning where the new training data is not provided upfront but added
incrementally. The target of incremental learning is to achieve good performance on
new data without sacrificing the performance on old and it has been widely explored
across different tasks such as classification [9, 150] and detection [151].
To alleviate catastrophic forgetting in incremental learning, one possibility is to
maintain a subset of old data to avoid overfitting on new data [9,152,153]. However,
an issue in practice is that when models embedded in a product are delivered to
customers, they no longer have access to the training data for privacy reasons. To
tackle the situation, a stricter exemplar-free setting was introduced in [150], which
requires no exemplar set for previous categories and only distills previous knowledge
from the current categories.

Figure 6.1: Concept overview. We propose to distill knowledge from all previous
models efficiently to preserve old data information rather than sequentially applying
distillation only to the last model. (For example, using both S1 and S2 in S3 for
distillation instead of sequentially using S1 for S2 and then S2 for S3). The confusion
matrix is LWF-MC [9] on the left and our method on the right for the exemplar-free
incremental setting.
Prior methods typically apply knowledge distillation [154] sequentially during
the incremental procedure to preserve previous knowledge. Since they apply distil-
lation only to the last model, it is difficult to maintain all past knowledge completely
(the left side of Figure 6.1). From that observation, we propose using all the model
snapshots. Prior knowledge is preserved better through our approach (the right
side of Figure 6.1). However, saving all previous models may incur a great penalty
in memory storage and would not be practical without somehow compressing this
historical information. To address this, we reconstruct previous outputs using only
"necessary" parameters during training.
To this end, we propose an end-to-end Multi-model and Multi-level Knowledge
Distillation (M2KD) framework as depicted in Figure 6.2. We introduce a
multi-model distillation loss which leverages the snapshots of all previous models
to serve as teacher models during distillation, and then directly matches the out-
puts of a network with those from the corresponding teacher models. To make the
pipeline more efficient, we adapt mask based pruning methods to reconstruct the
previous models. We prune the network after each incremental training step and
identify significant weights to reconstruct the model. This allows us to reconstruct
previous models on-the-fly and utilize them as teacher models in our multi-model
distillation. To further enhance the distillation process, we also include an auxiliary
distillation loss to preserve more intermediate features of previous models. Addi-
tionally, our approach addresses catastrophic forgetting in sequential distillation,
and thus generalizes well for both exemplar based and exemplar-free settings.
To show the effectiveness of our approach, we evaluate our model on Cifar-100
[155] and a subset of ImageNet [146]. We achieve state-of-the-art performance for
all the datasets in exemplar-free setting. We also show improvement when adapting
to exemplar-based incremental learning and our exemplar-free setting outperforms
iCaRL [9] with a 200 exemplar budget.
In summary, our contributions are threefold. First, we propose a multi-model
distillation loss, which directly matches logits of the current model with those from
the corresponding teacher models. Second, for efficiency, we reconstruct historical
models via mask based pruning such that model snapshots can be reconstructed
with low memory footprint. Third, experiments on standard incremental learning
benchmarks show that our method achieves state-of-the-art performance in exemplar-free
incremental setting.
6.2 Related Work
The ultimate goal of incremental learning is to achieve good performance on
new data while preserving the knowledge about old data. Generally, two types
of evaluation settings [156] have been considered. One is multi-head incremental
learning which utilizes multiple classifiers at inference, and the other is single-head
incremental learning which only utilizes one classifier at inference.
Multi-head incremental learning. The evaluation setting in this stream is that a
specific classifier is selected during testing according to the tasks or categories. With
this prior information, no confusion exists across different classifiers, and thus the
target becomes how to adapt the old model for new tasks or categories. Research has
been focused on utilizing an episodic memory to trace back previous tasks [157–159],
or constraining the important weights on old tasks [149]. In addition, [160, 161]
learn a mask for pruning to further constrain the weights on old tasks. [162] distill
the knowledge from the old model when adapting to new tasks. Different from
this setting, we do not assume the task or category information is known during
inference and follow the setting of single-head incremental learning. Also, even
though we apply pruning in our approach, our goal is different from [160, 161] as
the masks are utilized to reconstruct previous models and our approach requires no
mask selection at inference.
Single-head incremental learning. Single-head evaluation uses only one clas-
sifier to predict both the old and the new classes. This setting is more challeng-
ing [156] compared to the multi-head counterpart because of the confusion between
old and new categories. Knowledge distillation [154] is frequently utilized to pre-
serve information. [150] distill the knowledge from the last model. [163] introduce
Grad-CAM [164] in the loss function. A relaxed setting is to introduce exemplar
set [9] for the old data and match previous logits through distillation. [152] explore
the balance between old and new data during training. [153] focus on constructing
exemplar set and [165, 166] replay the seen categories with GANs [83]. Instead of
saving exemplars, we save the parameters of previous models for reconstruction.
With that, this paper can be considered a complementary research direction. In fact,
as knowledge distillation is an important component in these methods, they can po-
tentially benefit from our approach as well. Additionally, [167] alleviate the bias in
knowledge distillation by introducing a scaling vector to the trained classifier. In
contrast, our approach is agnostic to the classifier and achieves better performance.
Network pruning. Considerable research has explored this area to reduce net-
work redundancy. [168,169] propose to compress network through quantization and
Huffman coding. [170] compress the weights according to their scores. Other
methods [171–173] explore compression for fast inference. In contrast to these methods,
we leverage network redundancy and use pruning to reconstruct all previous models
in incremental learning with low memory footprint.

Figure 6.2: Framework overview. Given images from the current training data,
we preserve previous knowledge directly from the reconstructed output through
matching the logits with the corresponding model and classifying the current data
with its ground truth. As an example, each layer contains a mask matrix $M_{t_i}$ at
the $t_i$-th incremental step recording significant weights for previous data. The gray
dots represent the weights to be trained on the current data. The red and green
dots are fixed during training, denoting the weights retained from the first and
second incremental step respectively. The gray dots are fine-tuned for the current
data before pruning. After pruning, a subset of the gray dots will be marked as
important weights and become blue dots, and the remaining weights will be
fine-tuned during the next incremental step. Accordingly, $M_{t_2}$ is updated and used as
$M_{t_3}$ at the end of this round. In multi-model distillation, the red and green output
logits of the current model are matched with the model 1 and 2 respectively while
the blue logits are matched with its ground truth.
6.3 Approach
We propose novel distillation losses to preserve previous information without
introducing too much memory overhead (See Figure 6.2). The model is agnos-
tic to the backbone architecture and generalizes well to both exemplar based and
exemplar-free methods.
6.3.1 Multi-model Distillation
Single-head incremental learning consists of a sequence of incremental class
inclusion processes, referred to as incremental steps. Samples from a batch of new
classes Ck are added at the k-th incremental step. For instance, 20 classes will be
added per incremental step in a 20-class batch setting. Accordingly, the network
assigns new logits (output nodes) for the incremental classes. At inference, the
maximum logit score in the output is treated as the final decision.
The knowledge distillation used in incremental learning [9,150] mainly aims to
match the output of the current model to a concatenation of the last model logits
and ground truth labels. Formally, it optimizes the cross entropy for both the old
and new logits,
$$L_D = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C_o} s'_{ij} \log(s_{ij}) - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=C_o+1}^{C} y_{ij} \log(s_{ij}), \tag{6.1}$$
where N and C denote the number of samples and the total class number so far
respectively, and $C_o$ denotes the old classes. $s_{ij}$ is the output score of the network
obtained by applying Sigmoid function to the output logits for sample i at logit j.
$s'_{ij}$ denotes the old score obtained by the penultimate model. $y_{ij}$ denotes the ground
truth.
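
A plain Python sketch of the sequential distillation loss in Equation 6.1 is given below. The nested-list inputs, the function name and the clamping constant `eps` are assumptions for illustration.

```python
import math


def distillation_loss(scores, old_scores, labels, num_old, eps=1e-12):
    """Cross entropy against old-model scores (old classes) and labels (new classes).

    scores     -- N x C sigmoid outputs of the current model
    old_scores -- N x num_old sigmoid outputs of the previous model
    labels     -- N x C binary ground truth
    num_old    -- number of old classes (C_o)
    """
    total = 0.0
    for s, s_old, y in zip(scores, old_scores, labels):
        for j in range(len(s)):
            target = s_old[j] if j < num_old else y[j]
            total -= target * math.log(max(s[j], eps))  # eps guards log(0)
    return total / len(scores)
```

The first `num_old` logits are supervised only by the previous model, so any knowledge that model has already lost cannot be recovered at later steps.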
Treating the penultimate model as the teacher and applying this distillation
sequentially helps preserve historical information, especially when no previous exem-
plar set is stored, which is the protocol for prior methods [9,150,152,163]. However,
the historical information will be gradually lost in this sequential pipeline as the cur-
rent model must reconstruct all the prior information from the penultimate model
alone. To address this limitation, we propose multi-model distillation, which di-
rectly leverages all previous models as our teacher model set. Since we mainly have
current training data and labels for both settings, the network is more confident
on current classes than old ones. Therefore, matching the previous logits of the
current model directly with their corresponding old models preserves information
better than always using the last model. Formally, we minimize the cross entropy
for the logits between the current model and corresponding teacher models from
previous incremental steps,
$$L_{MMD} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{P-1} \sum_{j=C_{k-1}+1}^{C_k} s'_{ijk} \log(s_{ijk}) - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=C_{P-1}+1}^{C} y_{ij} \log(s_{ij}), \tag{6.2}$$
where classes from $C_{k-1} + 1$ to $C_k$ belong to the k-th incremental step and P
denotes the number of incremental steps. Classes from $C_{P-1} + 1$ to C belong to the
current categories. $s_{ijk}$ is the output score of the current model for sample i at logit
j in the k-th incremental step. $s'_{ijk}$ denotes the output score of the k-th previous
model.
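
The multi-model variant in Equation 6.2 differs only in which teacher supervises each block of logits. In the sketch below, `bounds` is a hypothetical list of cumulative class counts [0, C_1, ..., C], and each teacher's scores are indexed by global class; both conventions are assumptions for illustration.

```python
import math


def multi_model_distillation_loss(scores, teacher_scores, labels, bounds, eps=1e-12):
    """Each earlier step's logits are matched with that step's own snapshot.

    scores         -- N x C sigmoid outputs of the current model
    teacher_scores -- list of P-1 snapshots, each N x (at least C_k) scores
    labels         -- N x C binary ground truth
    bounds         -- cumulative class counts [0, C_1, ..., C_{P-1}, C]
    """
    num_steps = len(bounds) - 1
    total = 0.0
    for i in range(len(scores)):
        for k in range(num_steps):
            for j in range(bounds[k], bounds[k + 1]):
                if k < num_steps - 1:
                    target = teacher_scores[k][i][j]  # classes of step k+1
                else:
                    target = labels[i][j]             # current classes
                total -= target * math.log(max(scores[i][j], eps))
    return total / len(scores)
```

With a single previous step (`bounds` of length 3) this reduces to the sequential loss of Equation 6.1.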
Multi-model distillation matches the logits in the current model with the
corresponding teacher model directly, reducing the information loss between incremental
steps. At inference, we directly choose the maximum among the output logits, which
acts as an ensemble of all the previous teacher models and the current model.

Figure 6.3: Illustration of auxiliary distillation. We extract the intermediate features
and connect directly with an auxiliary classifier to preserve middle level knowledge.
6.3.2 Auxiliary Distillation
Previous incremental learning methods preserve old class information through
matching the final output. However, the features from intermediate layers also
contain useful information. Inspired by the auxiliary loss in segmentation task [174],
we propose an auxiliary distillation loss to preserve the intermediate statistics of
previous models. Similar to using the final output to represent network statistics,
the prediction made by lower level features also represents intermediate feature
statistics. Following the main branch classification, we extract lower level features
and use an auxiliary classifier to conduct classification based on intermediate features
(See Figure 6.3).
Also, a multi-model distillation loss is added on this auxiliary classifier for the
purpose of preserving prior lower level features, and a standard cross entropy loss is
also included for classifying the current data. Formally,
$$L_{AD} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{P-1} \sum_{j=C_{k-1}+1}^{C_k} a'_{ijk} \log(a_{ijk}) - \frac{\lambda}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(a_{ij}), \tag{6.3}$$
where $a'_{ijk}$ denotes the output score from previous auxiliary classifiers, $a_{ijk}$ or
$a_{ij}$ is the output score of the auxiliary branch, $\lambda$ is the ratio between the distillation
and cross entropy loss. Notice that all the logits in ground truth labels are utilized
in the classification cross entropy to enforce the correct prediction of current data.
The total loss function of the network becomes,
$$L_{total} = L_{MMD} + \gamma L_{AD}, \tag{6.4}$$
where $\gamma$ is the ratio between the main classification multi-model distillation and
the auxiliary classification distillation. This auxiliary classification branch is only
used during training. At inference time, we only use the main branch classifier for
prediction.
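The role of the auxiliary branch can be sketched with a toy two-branch network. This is only an illustrative stand-in: the class `TwoBranchNet`, its fully connected layers, and the `training` flag are hypothetical, whereas the networks in our experiments are convolutional. The helper `total_loss` mirrors Equation 6.4, with `gamma` standing for the ratio between the two losses.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


class TwoBranchNet:
    """Toy fully connected stand-in for a network with an auxiliary classifier."""

    def __init__(self, in_dim, hidden_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.w_low = 0.1 * rng.standard_normal((hidden_dim, in_dim))
        self.w_high = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
        self.w_main = 0.1 * rng.standard_normal((num_classes, hidden_dim))
        self.w_aux = 0.1 * rng.standard_normal((num_classes, hidden_dim))

    def forward(self, x, training):
        low = relu(self.w_low @ x)      # intermediate (lower-level) features
        high = relu(self.w_high @ low)  # final features
        main_logits = self.w_main @ high
        # The auxiliary branch is evaluated only while training.
        aux_logits = self.w_aux @ low if training else None
        return main_logits, aux_logits


def total_loss(l_mmd, l_ad, gamma):
    """Main multi-model distillation loss plus the weighted auxiliary loss."""
    return l_mmd + gamma * l_ad
```

Calling `forward(x, training=False)` returns the same main logits as in training mode and no auxiliary output, so the auxiliary classifier adds no cost at inference.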
Algorithm 1 Pruning Algorithm
Input: X_1, ..., X_k // input image sets of incremental steps 1, ..., k
       θ // current model parameters
store pre-update parameters and masks m
for y = 1, ..., k do
    Grad(θ_y(m < y)) = 0 // apply mask
    update optimizer through back-propagation:
        θ_y ← min(L_MMD(θ_y) + γ L_AD(θ_y))
    adjust threshold by pruning ratio // update threshold
    θ_y(θ_y < threshold) = 0 // prune and update θ_y
    m(θ_y >= threshold) = y // update masks
end for
6.3.3 Model Reconstruction
One drawback of multi-model distillation in its original form is that it utilizes
all previous models, requiring additional memory storage for the models. However,
we observe that distillation aims to match logits. Therefore, it is only necessary
to preserve the outputs of previous networks, not the entire networks themselves.
Our idea is to save only a small set of the necessary parameters from which we can
approximate the output. In this way, all the models can be recovered on-the-fly
without a large memory penalty.
To determine the necessary parameters, we adapt mask-based pruning [160]
for model reconstruction. Specifically, after training each incremental step we sort
the magnitude of weights in each layer, freeze the important ones to reach a specified
pruning ratio, and use the residual weights to train the next incremental class set.
We repeat this procedure for all future incremental steps until all the incremental
classes are included (see Algorithm 1).
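The freeze-and-prune procedure for a single layer can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact implementation: the sentinel `FREE`, the two helper functions, and the use of `np.quantile` to turn the pruning ratio into a magnitude threshold are choices made for this example. The pruning ratio is taken to be the fraction of currently trainable weights that is pruned.

```python
import numpy as np

# Hypothetical sentinel marking weights not yet claimed by any incremental step.
FREE = np.iinfo(np.int32).max


def mask_gradient(grad, mask, step):
    """Zero the gradient of weights frozen by earlier steps (mask < step)."""
    return np.where(mask < step, 0.0, grad)


def prune_and_update_mask(weights, mask, step, pruning_ratio):
    """Prune low-magnitude trainable weights and assign the rest to `step`."""
    trainable = mask >= step
    magnitudes = np.abs(weights)
    # Threshold chosen so that roughly `pruning_ratio` of the trainable
    # weights fall below it.
    threshold = np.quantile(magnitudes[trainable], pruning_ratio)
    keep = trainable & (magnitudes >= threshold)
    new_weights = weights.copy()
    new_weights[trainable & ~keep] = 0.0  # pruned weights stay free for later steps
    new_mask = mask.copy()
    new_mask[keep] = step                 # important weights are frozen for this step
    return new_weights, new_mask
```

With this convention, weights kept at step y are never modified by later steps, while the pruned weights remain available for training the next incremental class set.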
We use a mask M to identify the important weights of each layer for all pre-
vious incremental steps. After each pruning procedure, we update the mask for the
Figure 6.4: Performance on iILSVRC-small and Cifar-100 in the exemplar-free
setting (x-axis: number of classes; y-axis: accuracy (%)). (a) Top-1 accuracy on
Cifar-100 (5-class batch). (b) Top-1 accuracy on Cifar-100 (10-class batch).
(c) Top-1 accuracy on Cifar-100 (20-class batch). (d) Top-5 accuracy on
iILSVRC-small (10-class batch). (e) Top-5 accuracy on iILSVRC-small
(20-class batch).
current incremental step. With the saved biases, batch normalization and classifier
parameters, we can reconstruct all previous models from the last model (pre-updated
model) on-the-fly. Formally, the output of a network with n convolutional layers is
obtained from its classifier (the last layer) and features,
$$
s = \phi(f^{(n)}), \tag{6.5}
$$
where $\phi$ denotes the classifier and $f^{(n)}$ denotes the features in the n-th layer, which
can be generally written as
$$
f^{(n)} = \sigma(w^{(n)} f^{(n-1)} + b^{(n)}), \tag{6.6}
$$
where $w$ and $b$ are weights and biases respectively, $\sigma$ denotes the activation function
and $f^{(0)}$ is the input.
With the mask $M_k$ for the k-th incremental step, we reconstruct the corresponding
features by
$$
f_k^{(n)} = \sigma(w_k^{(n)} \delta(M_k^{(n)} \le k) f_k^{(n-1)} + b_k^{(n)}), \tag{6.7}
$$
where $M_k^{(n)}$ denotes the mask in the n-th layer at incremental step k, $f_k^{(n)}$ denotes
the feature in the n-th layer at the k-th incremental step, and $\delta$ denotes the delta function.
Thus the output of the k-th model is reconstructed by
$$
s_k = \phi_k(f_k^{(n)}), \tag{6.8}
$$
where $s_k$ and $\phi_k$ denote the output of the network and the classifier for the k-th
incremental step respectively.
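The reconstruction in Equations 6.7 and 6.8 can be sketched in a few lines of NumPy. This is a simplified illustration: it assumes fully connected layers with a ReLU activation instead of convolutions, omits batch normalization, and treats the classifier as a linear layer. The function name and argument layout are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def reconstruct_output(x, shared_weights, masks, biases_k, classifier_k, k):
    """Recover the output of the step-k model from the latest shared weights.

    shared_weights: list of weight matrices of the latest model, one per layer.
    masks:          list of integer arrays (same shapes) holding, for each weight,
                    the incremental step that froze it.
    biases_k:       list of bias vectors saved for step k.
    classifier_k:   tuple (W, b) of the classifier saved for step k.
    """
    f = x
    for w, m, b in zip(shared_weights, masks, biases_k):
        w_k = w * (m <= k)       # keep only weights selected up to step k
        f = relu(w_k @ f + b)    # features of the next layer
    w_cls, b_cls = classifier_k
    return w_cls @ f + b_cls     # output of the step-k model
```

Only the masks, biases and classifiers need to be stored per step in this sketch; the weight tensors themselves are shared across all reconstructed models.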
6.4 Experiments
We first evaluate our method in the exemplar-free setting. Then we extend
our method to the exemplar-based setting. For more analysis, we also compare our
memory cost with other methods.
6.4.1 Datasets and Evaluation Metrics
The evaluation is conducted on iILSVRC-small [175] and Cifar-100 [155].
Evaluation Metrics. Following the same metrics in prior methods [9,150], the top-
1 classification accuracy is reported for Cifar-100 and top-5 classification accuracy
is reported for iILSVRC-small.
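For reference, top-k classification accuracy can be computed as in the sketch below. The function name and the array layout (one row of class scores per sample) are assumptions made for this example.

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores.

    scores: array of shape (num_samples, num_classes).
    labels: integer array of shape (num_samples,).
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]       # indices of the k best classes
    hits = (top_k == labels[:, None]).any(axis=1)    # true label among them?
    return 100.0 * hits.mean()
```

Top-1 accuracy corresponds to `k=1` and top-5 accuracy to `k=5`.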
6.4.2 Exemplar-free setting
We evaluate our method in the exemplar-free, single-head setting and compare it
with the following baselines and state-of-the-art single-head approaches.
FT: A baseline approach that only applies cross entropy loss to fine-tune the
penultimate model on newly arriving incremental classes. Knowledge distillation is not
applied.
Scaled [167]: A threshold moving strategy to alleviate the bias in knowledge distil-
lation. We use the released code for evaluation.
DGM [166]: A dynamic generative memory approach which utilizes GANs to gen-
erate old samples as exemplar set. We use the released code for evaluation and no
real sample is used during training.
Rwalk [156]: A generalization algorithm of EWC [149] and Path Integral [176].
The official code is evaluated.
LWF-MC [9]: A multi-class classification version of [150] as described in [9], ap-
plying distillation to the logits from the last previous model sequentially.
M2KD (ours): Our full model applying multi-model, auxiliary distillation along
with pruning to save memory storage.
M2KD (no pruning): The upper bound of our model which directly loads all the
previous snapshots for multi-model distillation.
Upper-Bound: The upper bound of incremental learning which directly trains all
classes together.

Step          1      2      3      4      5
No pruning    83.5   61.8   52.5   51.5   42.1
Ratio 0.6     82.9   59.6   52.2   46.5   40.1
Ratio 0.7     83.5   61.7   52.5   50.0   42.8
Ratio 0.8     83.5   58.5   52.0   49.3   42.0
Ratio 0.9     83.0   58.0   49.7   47.3   39.9
Table 6.1: Top-1 accuracy comparison among different pruning ratios on Cifar-100
(20 classes per incremental step).
Figure 6.4 highlights our performance compared to state-of-the-art methods.
For Cifar-100, our method consistently outperforms other methods from 5-class to
20-class batches per incremental step. The margin becomes larger as more incremental
steps are added. This demonstrates the advantage of multi-model distillation,
as it avoids accumulating the loss of historical information. A similar observation can be
made when evaluating on iILSVRC-small. It is interesting to note that our model
with pruning achieves performance comparable to the no-pruning version. This
indicates the effectiveness of the pruning procedure in saving memory while
maintaining performance. Even though the residual active weights decrease gradually
due to pruning, we still preserve the performance up to 20 incremental steps.
6.4.3 Ablation Studies
We investigate the effectiveness of each component of our method in this sec-
tion. In particular, we compare our full model with the following baselines.
LWF-MC aux: Add auxiliary distillation to LWF-MC.
LWF-MC MMD: Change the original loss to our multi-model distillation. No
auxiliary distillation is applied.
Ours skip1: Instead of using all previous models, we study the case when skipping
some snapshots. Starting from the last previous model, we skip the first model
in multi-model distillation. The skipped model is replaced by the next model for
multi-model distillation.
Ours skip2: Skip the first two models instead of one compared to Ours skip1.
Figure 6.5 shows the comparison for each component of our approach.
LWF-MC aux improves our baseline model LWF-MC on all the datasets after
adding auxiliary distillation, indicating that the intermediate level information also
contributes to preserving previous knowledge. With only multi-model distillation
(LWF-MC MMD), the performance gradually improves for both datasets as more
incremental steps are involved, which demonstrates that directly distilling knowledge
from the corresponding model helps to reduce the loss in sequential distillation. Note
that our multi-model distillation reduces to the standard distillation used in [150]
if only one or two incremental steps are added. By incorporating the auxiliary
distillation, however, our method still shows improved performance. Lastly, our
model achieves nearly the same performance as our upper bound, which saves all
Figure 6.5: Ablation studies for our approach (x-axis: number of classes; y-axis:
accuracy (%); curves: LWF-MC, LWF-MC aux, LWF-MC MMD, M2KD (ours), M2KD
(no pruning)). (a) Top-1 accuracy comparison on Cifar-100 (20-class batch).
(b) Top-5 accuracy performance on iILSVRC-small (20-class batch).
previous snapshots, showing the effectiveness of our pruning based approach.
Figure 6.6 compares how multi-model distillation is affected by the number of
models. LWF-MC can be regarded as a special case which skips 3 models in the last
round. The trend from LWF-MC to Ours shows that the performance improves
as the number of preserved models increases, confirming the value of multi-model
distillation.
6.4.4 Analysis on pruning ratio
We compare the results corresponding to different pruning ratios to investigate
the robustness of our approach. Table 6.1 summarizes the results. Marginal perfor-
mance variation (around 3%) is observed for different pruning ratios. Even though
a higher (0.9) pruning ratio affects the performance as the active weights decrease
in the current incremental step and a lower (0.6) ratio affects the performance as
Figure 6.6: Comparison between different numbers of models used in multi-model
distillation on Cifar-100 20-class batch (x-axis: number of classes; y-axis:
accuracy (%); curves: LWF-MC, Ours skip1, Ours skip2, Ours).
available weights decrease in the future steps, the relatively trivial influence
indicates that a large redundancy exists in the network architecture. Benefiting from
this redundancy, our approach is robust to different pruning ratios.
6.4.5 Exemplar Based Setting
Our approach can also be applied to exemplar based incremental learning
methods which use distillation sequentially on the output of networks [9, 152, 153].
To evaluate our model in this setting, we add exemplar selection to our approach
and compare with exemplar based methods.
iCaRL [9]: A prominent exemplar based incremental learning approach which
constructs an exemplar set for the old data according to the feature means and performs
distillation on the last previous model. A nearest class mean classifier [177] is applied at
Figure 6.7: Performance comparison in exemplar based setting (x-axis: number of
classes; y-axis: accuracy (%); curves: iCaRL, iCaRL aux, iCaRL M2KD). (a) Top-1
accuracy performance on Cifar-100 (10-class batch). (b) Top-5 accuracy performance
on iILSVRC-small (10-class batch).
inference.
iCaRL aux: Adding auxiliary distillation to iCaRL.
iCaRL M2KD: Change the original distillation function which only matches logits
from the last previous model to our multi-model distillation. Auxiliary distillation
is also appended for a better performance.
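For readers unfamiliar with the nearest class mean classifier used by iCaRL at inference, a minimal sketch is given below. It is a generic illustration, not the iCaRL implementation: the helper names are hypothetical, plain Euclidean distance is used, and any feature normalization is omitted.

```python
import numpy as np


def class_means(features, labels):
    """Mean feature vector of every class present in `labels`."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, means


def nearest_class_mean_predict(queries, classes, means):
    """Assign each query feature to the class with the closest mean."""
    distances = np.linalg.norm(queries[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]
```

In an exemplar based setting, `features` would be the features of the stored exemplars, so the class means can be recomputed whenever the feature extractor is updated.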
The results are shown in Figure 6.7. With the introduction of multi-model
and auxiliary distillation, the performance of iCaRL improves. This indicates that,
with direct access to all the previous models for distillation, knowledge is preserved
better even when an exemplar set is available.
6.4.6 Memory Comparison
Starting from the memory footprint of LWF as our baseline, we compare the
extra memory storage between exemplar based method such as iCaRL [9] and our
approach. The memory is calculated in the 10-class incremental step setting for
both iILSVRC-small and Cifar-100. For our approach, we directly calculate the
storage difference between the last and the initial step. For iCaRL, the memory is
approximated by the average image size over 2000 samples (i.e., the
default exemplar size), plus the overhead of saving the record of the exemplar set.
To optimize the memory consumption of iCaRL, we resize the images in iILSVRC-
small to 256 × 256 and compress them to JPG with quality 95 to match their network
input size during training.
Table 6.2 shows the memory compensation, i.e., the additional memory relative to
the LWF footprint, for different methods. It indicates that our approach has
approximately 7× smaller memory compensation on
iILSVRC-small and 10× smaller on Cifar-100 than iCaRL. On average, for each
incremental step, our approach only takes 0.98 MB and 0.08 MB for iILSVRC-small
and Cifar-100 respectively. The memory advantage over exemplar based methods
might become larger as higher resolution images take more storage.
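The per-step and relative figures quoted above follow directly from the entries of Table 6.2. The short sketch below reproduces the arithmetic; the dictionary layout and function name are illustrative, and the step count of 10 is taken from the 10-class incremental setting.

```python
# Additional memory in MB relative to the LWF footprint, as reported in Table 6.2.
extra_memory_mb = {
    "iILSVRC-small": {"iCaRL": 68.0, "M2KD": 9.80},
    "Cifar-100": {"iCaRL": 9.4, "M2KD": 0.84},
}
num_steps = 10  # 10-class incremental steps


def summarize(extra_memory, steps):
    """Per-step memory of M2KD and its ratio to iCaRL for every dataset."""
    summary = {}
    for dataset, entry in extra_memory.items():
        summary[dataset] = {
            "per_step_mb": entry["M2KD"] / steps,
            "icarl_to_m2kd_ratio": entry["iCaRL"] / entry["M2KD"],
        }
    return summary


summary = summarize(extra_memory_mb, num_steps)
```

The resulting ratios are about 6.9 for iILSVRC-small and about 11.2 for Cifar-100, in line with the approximate 7× and 10× figures stated in the text.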
We provide further memory analysis in Figure 6.8. We compare our approach
with iCaRL on Cifar-100 given the same memory constraint. For a fair comparison,
we reduce the exemplar set as a penalty for the additional memory we use for network
parameters, so that the total matches the memory size used for iCaRL. The performance
is evaluated by averaging the top-1 accuracy across all the incremental steps. When
the memory budget equals 200 images, we do not use any exemplar set but still
perform better than iCaRL. The reason for this is that the sequential distillation
pipeline tends to lose information even when exemplars from old classes are available.
Moreover, increasing the memory budget makes the performance gap between
Dataset        iILSVRC-small   Cifar-100
LWF-MC         0               0
iCaRL          68.0            9.4
M2KD (ours)    9.80            0.84
Table 6.2: Memory compensation comparison (MB). Each entry is the additional
memory requirement for methods across different datasets based on the memory
footprint of LWF.

Figure 6.8: Analysis on performance and memory compared to iCaRL on Cifar-100
(10-class batch; x-axis: memory budget K; y-axis: accuracy (%); curves: M2KD (ours),
iCaRL). We increase the memory budget for the exemplar set from 200 to 4000
images and report the average accuracy of all the 10 incremental steps.
our approach and iCaRL larger, showing our strength to memorize what has been
learned.
6.5 Conclusion and Discussion
This chapter presents a novel distillation strategy that mitigates catastrophic
forgetting in the single-head incremental learning setting. We introduce multi-model
distillation, which directly guides the model to distill knowledge from the corresponding
teacher models. To further improve our performance, we incorporate auxiliary
distillation to preserve intermediate features. For efficiency, we avoid saving all the
model snapshots by reconstructing all previous models using a mask-based pruning
algorithm. Extensive experiments on standard incremental learning benchmarks
demonstrate the effectiveness of our approach. Incremental learning is still far from
solved. A significant gap between one-step training and incremental training still
exists. It remains an open question how to reduce the confusion between different
incremental steps, especially without access to previous data, which might be
a future exploration for our research.
Chapter 7: Conclusion
In this dissertation, we have studied the existing challenges in combining deep
learning with forensics for manipulation detection. We proposed the RGB-N network
to learn rich features that reveal more artifacts in the domains of local noise
and RGB images. Moreover, we extended image manipulation detection to video
manipulation detection and studied the problem of video inpainting detection.
Furthermore, we used a blending-based GAN to improve the generalization of
manipulation segmentation networks. We then studied general issues with deep
learning models. For the issue of high resolution prediction, we proposed a Deepstrip
approach to handle inaccurate results at high resolution more efficiently. Lastly, we
explored the field of incremental learning to prevent the catastrophic forgetting issue
of current neural networks. Even though researchers have provided promising
solutions to fight against fake images/videos, the problem is still far from solved.
Below we discuss some potential directions for future research.
The first direction is to handle various manipulation techniques. We mainly
focused on splicing and inpainting detection in this dissertation; however, detecting
other manipulation techniques is also valuable. Given the cat-and-mouse nature of
this problem, newly emerging manipulation techniques, including deepfakes and
generative model based image editing, remain to be explored. Applying deep
learning to detect these new types of manipulation is an interesting direction for
future research.
Another challenge in manipulation detection is the domain shift problem.
Research has demonstrated performance degradation when applying learned
manipulation detection models to a different manipulation domain. This degradation
is one of the major factors that limit the application of manipulation detection
models. Exploring more generic features, or discovering the domain specific to
manipulation and applying domain generalization algorithms, might be an interesting
direction.
Bibliography
[1] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn mod-
els for fine-grained visual recognition. In ICCV, 2015.
[2] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear
pooling. In CVPR, 2016.
[3] Jing Dong, Wei Wang, and Tieniu Tan. Casia image tampering detection
evaluation database. In ChinaSIP, 2013.
[4] Bihan Wen, Ye Zhu, Ramanathan Subramanian, Tian-Tsong Ng, Xuanjing
Shen, and Stefan Winkler. Coverage - a novel database for copy-move forgery
detection. In ICIP, 2016.
[5] Tiago José De Carvalho, Christian Riess, Elli Angelopoulou, Helio Pedrini,
and Anderson de Rezende Rocha. Exposing digital image forgeries by illumi-
nation color classification. TIFS, 2013.
[6] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A Efros. Fighting
fake news: Image splice detection via learned self-consistency. In ECCV, 2018.
[7] Zian Wang, David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Object
instance annotation with deep extreme level set evolution. In CVPR, 2019.
[8] David Acuna, Amlan Kar, and Sanja Fidler. Devil is in the edges: Learning
semantic boundaries from noisy annotations. In CVPR, 2019.
[9] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H
Lampert. icarl: Incremental classifier and representation learning. In CVPR,
2017.
[10] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: To-
wards real-time object detection with region proposal networks. In NIPS,
2015.
[11] Ross Girshick. Fast r-cnn. In ICCV, 2015.
[12] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via
region-based fully convolutional networks. In NIPS, 2016.
[13] Ruichi Yu, Xi Chen, Vlad I Morariu, and Larry S Davis. The role of context
selection in object detection. In BMVC, 2016.
[14] Mingfei Gao, Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. Dynamic
zoom-in network for fast object detection in large images. In CVPR, 2018.
[15] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas Huang. Deep
grabcut for object selection. arXiv preprint arXiv:1707.00243, 2017.
[16] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. Two-stream
neural networks for tampered face detection. In CVPRW, 2017.
[17] Xunyu Pan, Xing Zhang, and Siwei Lyu. Exposing image splicing with incon-
sistent local noise variances. In ICCP, 2012.
[18] Miroslav Goljan and Jessica Fridrich. Cfa-aware features for steganalysis of
color images. In SPIE/IS&T Electronic Imaging, 2015.
[19] Davide Cozzolino and Luisa Verdoliva. Single-image splicing localization
through autoencoder-based anomaly detection. In WIFS, 2016.
[20] Davide Cozzolino, Diego Gragnaniello, and Luisa Verdoliva. Image forgery
localization through the fusion of camera-based, feature-based and pixel-based
techniques. In ICIP, 2014.
[21] Jawadul H Bappy, Amit K Roy-Chowdhury, Jason Bunk, Lakshmanan
Nataraj, and BS Manjunath. Exploiting spatial structure for localizing ma-
nipulated image regions. In ICCV, 2017.
[22] Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva. Improved dct coeffi-
cient analysis for forgery localization in jpeg images. In ICASSP, 2011.
[23] Jessica Fridrich and Jan Kodovsky. Rich models for steganalysis of digital
images. TIFS, 2012.
[24] Tian-Tsong Ng, Jessie Hsu, and Shih-Fu Chang. Columbia image splicing
detection evaluation dataset. http://www.ee.columbia.edu/ln/
dvmm/downloads/authspliceddataset/authspliceddataset.htm, 2009.
[25] Nist nimble 2016 datasets. https://www.nist.gov/itl/iad/mig/
nimble-challenge-2017-evaluation/.
[26] Jing Dong, Wei Wang, and Tieniu Tan. Casia image tampering detection
evaluation database 2010. http://forensics.idealtest.org.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona,
Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common
objects in context. In ECCV, 2014.
[28] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. Splicebuster: A new
blind image splicing detector. In WIFS, 2015.
[29] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. Recasting residual-
based local descriptors as convolutional neural networks: an application to
image forgery detection. In IH&MMSec, 2017.
[30] Yuan Rao and Jiangqun Ni. A deep learning approach to detection of splicing
and copy-move forgeries in images. In WIFS, 2016.
[31] Jiansheng Chen, Xiangui Kang, Ye Liu, and Z Jane Wang. Median filtering
forensics based on convolutional neural networks. Signal Processing Letters,
2015.
[32] Belhassen Bayar and Matthew C Stamm. A deep learning approach to uni-
versal image manipulation detection using a new convolutional layer. In
IH&MMSec, 2016.
[33] Ying Zhang, Jonathan Goh, Lei Lei Win, and Vrizlynn LL Thing. Image
region forgery detection: A deep learning approach. In SG-CRC, 2016.
[34] Ronald Salloum, Yuzhuo Ren, and C-C Jay Kuo. Image splicing localiza-
tion using a multi-task fully convolutional network (mfcn). arXiv preprint
arXiv:1709.02016, 2017.
[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In CVPR, 2016.
[36] Markos Zampoglou, Symeon Papadopoulos, and Yiannis Kompatsiaris. Large-
scale evaluation of splicing localization algorithms for web images. Multimedia
Tools and Applications, 2017.
[37] Neal Krawetz. A picture's worth... Hacker Factor Solutions, 2007.
[38] Babak Mahdian and Stanislav Saic. Using noise inconsistencies for blind image
forensics. Image and Vision Computing, 2009.
[39] Pasquale Ferrara, Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva.
Image forgery localization via fine-grained analysis of cfa artifacts. TIFS,
2012.
[40] Sungho Lee, Seoung Wug Oh, DaeYeun Won, and Seon Joo Kim. Copy-and-
paste networks for deep video inpainting. In ICCV, 2019.
[41] Dahun Kim, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Deep video
inpainting. In CVPR, 2019.
[42] Rui Xu, Xiaoxiao Li, Bolei Zhou, and Chen Change Loy. Deep flow-guided
video inpainting. In CVPR, 2019.
[43] Seoung Wug Oh, Sungho Lee, Joon-Young Lee, and Seon Joo Kim. Onion-peel
networks for deep video completion. In ICCV, 2019.
[44] Ya-Liang Chang, Zhe Yu Liu, Kuan-Ying Lee, and Winston Hsu. Free-form
video inpainting with 3d gated convolution and temporal patchgan. ICCV,
2019.
[45] Jia-Bin Huang, Sing Bing Kang, Narendra Ahuja, and Johannes Kopf. Tem-
porally coherent completion of dynamic video. TOG, 2016.
[46] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang.
Generative image inpainting with contextual attention. In CVPR, 2018.
[47] Wei Xiong, Jiahui Yu, Zhe Lin, Jimei Yang, Xin Lu, Connelly Barnes, and
Jiebo Luo. Foreground-aware image inpainting. In CVPR, 2019.
[48] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and
Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR,
2016.
[49] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. Learning rich
features for image manipulation detection. In CVPR, 2018.
[50] Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. Mantra-net: Ma-
nipulation tracing network for detection and localization of image forgeries
with anomalous features. In CVPR, 2019.
[51] Davide Cozzolino, Justus Thies, Andreas Rössler, Christian Riess, Matthias
Nießner, and Luisa Verdoliva. Forensictransfer: Weakly-supervised domain
adaptation for forgery detection. arXiv preprint arXiv:1812.02510, 2018.
[52] Haodong Li and Jiwu Huang. Localization of deep inpainting using high-pass
fully convolutional network. In ICCV, 2019.
[53] Qiong Wu, Shao-Jie Sun, Wei Zhu, Guo-Hui Li, and Dan Tu. Detection of
digital doctoring in exemplar-based inpainted images. In ICMLC, 2008.
[54] Wei Wang, Jing Dong, and Tieniu Tan. Tampered region localization of digital
color images based on jpeg compression noise. In IWDW, 2010.
[55] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. In ICLR, 2015.
[56] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus
Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation
methodology for video object segmentation. In CVPR, 2016.
[57] Kaiming He and Jian Sun. Image completion approaches using the statistics
of similar patches. TPAMI, 2014.
[58] James Hays and Alexei A Efros. Scene completion using millions of pho-
tographs. TOG, 2007.
[59] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally
consistent image completion. ToG, 2017.
[60] Yunqiang Liu and Vicent Caselles. Exemplar-based image inpainting using
multiscale graph cuts. TIP, 2012.
[61] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao,
and Bryan Catanzaro. Image inpainting for irregular holes using partial con-
volutions. In ECCV, 2018.
[62] Haotian Zhang, Long Mai, Ning Xu, Zhaowen Wang, John Collomosse, and
Hailin Jin. An internal learning approach to video inpainting. In ICCV, 2019.
[63] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman.
Patchmatch: A randomized correspondence algorithm for structural image
editing. In ToG, 2009.
[64] Chuan Wang, Haibin Huang, Xiaoguang Han, and Jue Wang. Video inpainting
by jointly learning temporal structure and spatial details. In AAAI, 2019.
[65] Pasquale Ferrara, Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva.
Image forgery localization via fine-grained analysis of cfa artifacts. In TIFS,
2012.
[66] Markos Zampoglou, Symeon Papadopoulos, and Yiannis Kompatsiaris. De-
tecting image splicing in the wild (web). In ICMEW, 2015.
[67] Ronald Salloum, Yuzhuo Ren, and C-C Jay Kuo. Image splicing localization
using a multi-task fully convolutional network (mfcn). In JVCI, 2018.
[68] Peng Zhou, Bor-Chun Chen, Xintong Han, Mahyar Najibi, Abhinav Shrivas-
tava, Ser Nam Lim, and Larry S Davis. Generate, segment and refine: Towards
generic manipulation segmentation. AAAI, 2020.
[69] Xinshan Zhu, Yongjun Qian, Xianfeng Zhao, Biao Sun, and Ya Sun. A deep
learning approach to patch-based image inpainting forensics. SPIC, 2018.
[70] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and
Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional
nets, atrous convolution, and fully connected crfs. In TPAMI, 2018.
[71] Sifei Liu, Jinshan Pan, and Ming-Hsuan Yang. Learning recursive filters for
low-level vision via a hybrid neural network. In ECCV, 2016.
[72] Mengye Ren and Richard S Zemel. End-to-end instance segmentation with
recurrent attention. In CVPR, 2017.
[73] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normaliza-
tion: The missing ingredient for fast stylization. CoRR, 2016.
[74] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Ima-
genet: A large-scale hierarchical image database. In CVPR, 2009.
[75] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training
deep feedforward neural networks. In AISTATS, 2010.
[76] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. In ICLR, 2015.
[77] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Rus-
lan Salakhutdinov. Dropout: a simple way to prevent neural networks from
overfitting. In JMLR, 2014.
[78] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep
network training by reducing internal covariate shift. In ICML, 2015.
[79] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image
translation with conditional adversarial networks. In CVPR, 2017.
[80] Raymond A Yeh, Chen Chen, Teck-Yian Lim, Alexander G Schwing, Mark
Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep
generative models. In CVPR, 2017.
[81] Jinseok Park, Donghyeon Cho, Wonhyuk Ahn, and Heung-Kyu Lee. Double
jpeg detection in mixed jpeg quality factors using deep convolutional neural
network. In ECCV, 2018.
[82] Ronald Salloum, Yuzhuo Ren, and C-C Jay Kuo. Image splicing localization
using a multi-task fully convolutional network (mfcn). In JVCI, 2018.
[83] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adver-
sarial nets. In NeurIPS, 2014.
[84] Daniel Moreira, Aparna Bharati, Joel Brogan, Allan Pinto, Michael Parowski,
Kevin W Bowyer, Patrick J Flynn, Anderson Rocha, and Walter J Scheirer.
Image provenance analysis at scale. arXiv preprint arXiv:1801.06510, 2018.
[85] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired
image-to-image translation using cycle-consistent adversarial networks. In
ICCV, 2017.
[86] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive
growing of gans for improved quality, stability, and variation. ICLR, 2018.
[87] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, Paste and Learn:
Surprisingly Easy Synthesis for Instance Detection. In ICCV, 2017.
[88] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural
computation, 1997.
[89] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-
Hsuan Yang. Deep image harmonization. In CVPR, 2017.
[90] Jean-Francois Lalonde and Alexei A Efros. Using color compatibility for as-
sessing image realism. In ICCV, 2007.
[91] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and
Bryan Catanzaro. High-resolution image synthesis and semantic manipulation
with conditional gans. In CVPR, 2018.
[92] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda
Wang, and Russell Webb. Learning from simulated and unsupervised images
through adversarial training. In CVPR, 2017.
[93] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-fast-rcnn: Hard
positive generation via adversary for object detection. In CVPR, 2017.
[94] Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and
Shuicheng Yan. Object region mining with adversarial erasing: A simple
classification to semantic segmentation approach. In CVPR, 2017.
[95] Hieu Le, Tomas F Yago Vicente, Vu Nguyen, Minh Hoai, and Dimitris Sama-
ras. A+ d net: Training a shadow detector with adversarial shadow attenua-
tion. In ECCV, 2018.
[96] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. In
TOG, 2003.
[97] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. ICLR, 2015.
[98] Yizong Cheng. Mean shift, mode seeking, and clustering. In TPAMI, 1995.
[99] Andreas Opelt, Axel Pinz, and Andrew Zisserman. A boundary-fragment-
model for object detection. In ECCV, 2006.
[100] John Canny. A computational approach to edge detection. TPAMI, 1986.
[101] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In CVPR,
2015.
[102] Meng Tang, Lena Gorelick, Olga Veksler, and Yuri Boykov. Grabcut in one
cut. In ICCV, 2013.
[103] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: Interac-
tive foreground extraction using iterated graph cuts. In TOG, 2004.
[104] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active
contour models. IJCV, 1988.
[105] Tiantian Wang, Lihe Zhang, Shuo Wang, Huchuan Lu, Gang Yang, Xiang
Ruan, and Ali Borji. Detect globally, refine locally: A novel approach to
saliency detection. In CVPR, 2018.
[106] Ting Zhao and Xiangqian Wu. Pyramid feature attention network for saliency
detection. In CVPR, 2019.
[107] Jianzhong He, Shiliang Zhang, Ming Yang, Yanhu Shan, and Tiejun Huang.
Bi-directional cascade network for perceptual edge detection. In CVPR, 2019.
[108] Hoang Le, Long Mai, Brian Price, Scott Cohen, Hailin Jin, and Feng Liu.
Interactive boundary prediction for object selection. In ECCV, 2018.
[109] Hongyu Xu, Xutao Lv, Xiaoyu Wang, Zhou Ren, Navaneeth Bodla, and Rama
Chellappa. Deep regionlets for object detection. In ECCV, 2018.
[110] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and
Hartwig Adam. Encoder-decoder with atrous separable convolution for se-
mantic image segmentation. In ECCV, 2018.
[111] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution rep-
resentation learning for human pose estimation. In CVPR, 2019.
[112] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image segmentation
with latent diversity. In CVPR, 2018.
[113] Rodrigo Benenson, Stefan Popov, and Vittorio Ferrari. Large-scale interactive
object segmentation with human annotators. In CVPR, 2019.
[114] Hexiang Hu, Shiyi Lan, Yuning Jiang, Zhimin Cao, and Fei Sha. Fastmask:
Segment multi-scale object candidates in one shot. In CVPR, 2017.
[115] Zhiding Yu, Weiyang Liu, Yang Zou, Chen Feng, Srikumar Ramalingam, BVK
Vijaya Kumar, and Jan Kautz. Simultaneous edge alignment and learning. In
ECCV, 2018.
[116] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer
convolutional features for edge detection. In CVPR, 2017.
[117] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast
interactive object annotation with curve-gcn. In CVPR, 2019.
[118] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep
interactive object selection. In CVPR, 2016.
[119] Wuyang Chen, Ziyu Jiang, Zhangyang Wang, Kexin Cui, and Xiaoning Qian.
Collaborative global-local networks for memory-efficient segmentation of ultra-
high resolution images. In CVPR, 2019.
[120] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia.
Icnet for real-time semantic segmentation on high-resolution images. In ECCV,
2018.
[121] Huikai Wu, Shuai Zheng, Junge Zhang, and Kaiqi Huang. Fast end-to-end
trainable guided filter. In CVPR, 2018.
[122] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual
dense network for image super-resolution. In CVPR, 2018.
[123] Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang.
Object contour detection with a fully convolutional encoder-decoder network.
In CVPR, 2016.
[124] Ruoxi Deng, Chunhua Shen, Shengjun Liu, Huibing Wang, and Xinru Liu.
Learning to predict crisp boundaries. In ECCV, 2018.
[125] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected
crfs with gaussian edge potentials. In NeurIPS, 2011.
[126] Yuri Y Boykov and M-P Jolly. Interactive graph cuts for optimal boundary
& region segmentation of objects in nd images. In ICCV, 2001.
[127] Yin Li, Jian Sun, Chi-Keung Tang, and Heung-Yeung Shum. Lazy snapping.
TOG, 2004.
[128] Luis Álvarez, Luis Baumela, Pedro Henríquez, and Pablo Márquez-Neila.
Morphological snakes. In CVPR, 2010.
[129] Christian Rupprecht, Elizabeth Huaroc, Maximilian Baust, and Nassir Navab.
Deep active contours. arXiv preprint arXiv:1607.05074, 2016.
[130] Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Anno-
tating object instances with a polygon-rnn. In CVPR, 2017.
[131] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive
annotation of segmentation datasets with polygon-rnn++. In CVPR, 2018.
[132] Thomas N Kipf and Max Welling. Semi-supervised classification with graph
convolutional networks. ICLR, 2017.
[133] Stanley Osher and James A Sethian. Fronts propagating with curvature-
dependent speed: algorithms based on hamilton-jacobi formulations. JCP,
1988.
[134] Diego Marcos, Devis Tuia, Benjamin Kellenberger, Lisa Zhang, Min Bai, Ren-
jie Liao, and Raquel Urtasun. Learning deep structured active contours end-
to-end. In CVPR, 2018.
[135] Dominic Cheng, Renjie Liao, Sanja Fidler, and Raquel Urtasun. Darnet: Deep
active ray network for building segmentation. In CVPR, 2019.
[136] Johannes Kopf, Michael F Cohen, Dani Lischinski, and Matt Uyttendaele.
Joint bilateral upsampling. In TOG, 2007.
[137] Jonathan T Barron and Ben Poole. The fast bilateral solver. In ECCV, 2016.
[138] Kaiming He, Jian Sun, and Xiaoou Tang. Guided image filtering. TPAMI,
2012.
[139] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional
networks for biomedical image segmentation. In MICCAI, 2015.
[140] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jian-
ming Liang. Unet++: A nested u-net architecture for medical image segmen-
tation. In DLMIA, 2018.
[141] Ken CL Wong, Mehdi Moradi, Hui Tang, and Tanveer Syeda-Mahmood. 3d
segmentation with exponential logarithmic loss for highly unbalanced object
sizes. In MICCAI, 2018.
[142] Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing.
In TOG, 2007.
[143] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier
neural networks. In AISTATS, 2011.
[144] Pixabay. https://pixabay.com.
[145] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional
networks for semantic segmentation. In CVPR, 2015.
[146] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classifica-
tion with deep convolutional neural networks. In NeurIPS, 2012.
[147] Michael McCloskey and Neal J Cohen. Catastrophic interference in connec-
tionist networks: The sequential learning problem. In Psychology of learning
and motivation, 1989.
[148] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio.
An empirical investigation of catastrophic forgetting in gradient-based neural
networks. ICLR, 2014.
[149] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume
Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Ag-
nieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neu-
ral networks. PNAS, 2017.
[150] Zhizhong Li and Derek Hoiem. Learning without forgetting. TPAMI, 2018.
[151] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental
learning of object detectors without catastrophic forgetting. In CVPR, 2017.
[152] Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid,
and Karteek Alahari. End-to-end incremental learning. In ECCV, 2018.
[153] Yu Li, Zhongxiao Li, Lizhong Ding, Peng Yang, Yuhui Hu, Wei Chen, and Xin
Gao. Supportnet: solving catastrophic forgetting in class incremental learning
with support data. arXiv preprint arXiv:1806.02942, 2018.
[154] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a
neural network. arXiv preprint arXiv:1503.02531, 2015.
[155] Alex Krizhevsky. Learning multiple layers of features from tiny images.
Technical report, 2009.
[156] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and
Philip HS Torr. Riemannian walk for incremental learning: Understanding
forgetting and intransigence. In ECCV, 2018.
[157] David Lopez-Paz et al. Gradient episodic memory for continual learning. In
NeurIPS, 2017.
[158] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed
Elhoseiny. Efficient lifelong learning with a-gem. In ICLR, 2019.
[159] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach,
and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget.
In ECCV, 2018.
[160] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a
single network by iterative pruning. In CVPR, 2018.
[161] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a
single network to multiple tasks by learning to mask weights. In ECCV, 2018.
[162] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Life-
long learning via progressive distillation and retrospection. In ECCV, 2018.
[163] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu,
and Rama Chellappa. Learning without memorizing. arXiv preprint
arXiv:1811.08051, 2018.
[164] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna
Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations
from deep networks via gradient-based localization. In ICCV, 2017.
[165] Hugo Caselles-Dupré, Michael Garcia-Ortiz, and David Filliat. Continual
state representation learning for reinforcement learning using generative re-
play. NeurIPS, 2018.
[166] Oleksiy Ostapenko, Mihai Puscas, Tassilo Klein, Patrick Jähnichen, and Moin
Nabi. Learning to remember: A synaptic plasticity driven framework for
continual learning. In CVPR, 2019.
[167] Khurram Javed and Faisal Shafait. Revisiting distillation and incremental
classifier learning. In ACCV, 2018.
[168] Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang,
Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, et al. Dsd: Dense-
sparse-dense training for deep neural networks. ICLR, 2016.
[169] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights
and connections for efficient neural network. In NeurIPS, 2015.
[170] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han,
Mingfei Gao, Ching-Yung Lin, and Larry S Davis. Nisp: Pruning networks
using neuron importance score propagation. In CVPR, 2018.
[171] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz.
Pruning convolutional neural networks for resource efficient inference. ICLR,
2016.
[172] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convo-
lutional neural networks with low rank expansions. BMVC, 2014.
[173] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and
Changshui Zhang. Learning efficient convolutional networks through network
slimming. In ICCV, 2017.
[174] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia.
Pyramid scene parsing network. In CVPR, 2017.
[175] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bern-
stein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
[176] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through
synaptic intelligence. In ICML, 2017.
[177] Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka.
Metric learning for large scale image classification: Generalizing to new classes
at near-zero cost. In ECCV, 2012.