ABSTRACT

Title of Dissertation: MACHINE LEARNING FOR ANIME: ILLUSTRATION, ANIMATION, AND 3D CHARACTERS

Shuhong Chen, Doctor of Philosophy, 2024

Dissertation Directed by: Professor Matthias Zwicker, Department of Computer Science

As anime-style content becomes more popular on the global stage, we ask whether new vision/graphics techniques could contribute to the art form. However, the highly-expressive and non-photorealistic nature of anime poses additional challenges not addressed by standard ML models, and much of the existing work in the domain does not align with real artist workflows. In this dissertation, we present several works building foundational 2D/3D infrastructure for ML in anime (including pose estimation, video frame interpolation, and 3D character reconstruction), as well as an interactive tool leveraging novel techniques to assist 2D animators.

MACHINE LEARNING FOR ANIME: ILLUSTRATION, ANIMATION, AND 3D CHARACTERS

by Shuhong Chen

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2024

Advisory Committee:
Professor Matthias Zwicker, Chair/Advisor
Professor Min Wu, Dean's Representative
Professor Abhinav Shrivastava
Professor Leo Zhicheng Liu
Professor Jia-Bin Huang

© Copyright by Shuhong Chen 2024

Preface

Anime is getting more popular in the global entertainment market. However, traditional animation is laborious. To create the expressive motions loved by millions, professional and amateur animators alike face the intrinsic cost of 12 illustrations per second. As the medium rapidly enters the mainstream, the sheer manual line-mileage demanded continues to increase. This raises the question of whether modern data-driven computer vision/graphics methods can offer automation or assist the creative process. While some work exists for colorization, cleanup, in-betweening, etc., we are still missing foundational domain-specific infrastructure. In addition, much of the academic work around problems like in-betweening has only been studied in vitro, without practical considerations for real animators. By studying industry practices, scaling data pipelines, bridging domain gaps, leveraging 3D priors, etc., we developed domain-specific ML infrastructure for anime, and demonstrated ways that modern techniques can assist existing workflows.

In addition, while 3D human priors are crucial for the above animation topic (form and surface anatomy are animator fundamentals), 3D character modeling itself may also benefit from new techniques. As AR/VR apps and virtual creators become more popular, there will soon be major demand for stylized 3D avatars. But current template-based designers are restrictive, with custom assets still requiring expert software to create. Novel representations and rendering techniques may help create realistic 3D humans, but comparatively little has been done to suit the design challenges of non-photorealistic characters. This work also tries to democratize 3D character creation, bringing customizable experiences to the next generation of social interaction.

Dedication

To Nagato Yuki.

Table of Contents

Preface
Dedication
Table of Contents

Chapter 1: Transfer Learning for Pose Estimation of Illustrated Characters
  1.1 Introduction
  1.2 Related Work
  1.3 Method & Architectures
  1.4 Data Collection
  1.5 Experiments
  1.6 Application: Pose-guided Retrieval
  1.7 Conclusion & Future Work

Chapter 2: Improving the Perceptual Quality of 2D Animation Interpolation
  2.1 Introduction
  2.2 Related Work
  2.3 Methodology
  2.4 Experiments & Discussion
  2.5 Limitations & Conclusion

Chapter 3: Match-free Inbetweening Assistant (MIBA): A Practical Animation Tool without User Stroke Correspondence
  3.1 Introduction
  3.2 Related Work
  3.3 Methodology
    3.3.1 Input/Output Representations
    3.3.2 Match-free Inbetweening
    3.3.3 User Interaction
  3.4 User Study Evaluation
    3.4.1 Data
    3.4.2 Participants
    3.4.3 Procedure
    3.4.4 Metrics
    3.4.5 Analysis
  3.5 Limitations and Future Work
  3.6 Conclusions

Chapter 4: Stylized Single-view 3D Reconstruction from Portraits of Anime Characters
  4.1 Introduction & Related Work
  4.2 Methodology
  4.3 Data
  4.4 Results & Evaluation
  4.5 Limitations & Future Work

Chapter 1: Transfer Learning for Pose Estimation of Illustrated Characters

Human pose information is a critical component in many downstream image processing tasks, such as activity recognition and motion tracking. Likewise, a pose estimator for the illustrated character domain would provide a valuable prior for assistive content creation tasks, such as reference pose retrieval and automatic character animation. But while modern data-driven techniques have substantially improved pose estimation performance on natural images, little work has been done for illustrations.
In our work, we bridge this domain gap by effi- ciently transfer-learning from both domain-specific and task-specific source models. Addition- ally, we upgrade and expand an existing illustrated pose estimation dataset, and introduce two new datasets for classification and segmentation subtasks. We then apply the resultant state-of- the-art character pose estimator to solve the novel task of pose-guided illustration retrieval. All data, models, and code will be made publicly available. 1.1 Introduction Human pose estimation is a foundational computer vision task with many real-world appli- cations, such as activity recognition [81], 3D reconstruction [47], motion tracking [102], virtual try-on [30], person re-identification [78], etc. The generic formulation is to find, in a given image containing people, the positions and orientations of body parts; typically, this means locating 1 landmark and joint keypoints on 2D images, or regressing for bone transformations in 3D. The usefulness of pose estimation is not limited to the natural image domain; in particular, we focus on the domain of illustrated characters. As pose-guided motion retargeting of realistic humans rapidly advances [37], there is growing potential for automatic pose-guided animation [43], a traditionally labor-intensive task for both 2D and 3D artists. Pose information may also serve as a valuable prior in illustration colorization [137], keyframe interpolation [106], 3D char- acter reconstruction [13] and rigging [128], etc. With deep computer vision, we have been able to leverage large-scale datasets [70, 4, 116] to train robust estimators of human pose [45, 17, 33]. However, little work has been done to solve pose estimation for illustrated characters. Previous pose estimation work on illustrations by Khungurn et al [56] presented a 2D keypoint detector, but relied on a publicly-unavailable synthetic dataset and an ImageNet-trained backbone. In addition, the dataset they collected for supervision lacked variation, and was missing keypoints and bounding boxes required for evalu- ation under the more modern COCO standard [70]. Facing these challenges, we constructed a 2D keypoint detector with state-of-the-art per- formance on illustrated characters, built upon domain-specific components and efficient transfer learning architectures. We demonstrate the effectiveness of our methods by implementing a novel illustration retrieval system. Summarizing, we contribute: • A state-of-the-art pose estimator for illustrated characters, transfer-learned from both domain- specific and task-specific source models. Despite the absence of synthetic supervision, we outperform previous work by 10-20% PDJ@20 [56]. • An application of our proposed pose estimator to solve the novel task of pose-guided char- 2 acter illustration retrieval. • Datasets for our model and its components, including: an updated COCO-compliant ver- sion of Khungurn et al’s [56] pose dataset with 2x the number of samples and more diverse poses; a novel 1062-class Danbooru [5] tagging rulebook; and a character segmentation dataset 20x larger than those currently available. 1.2 Related Work The Illustration Domain Though there has been work on caricatures and cartoons [15, 91], we focus on anime/manga-style drawings where characters tend to be less abstract. 
While there is work for more traditional problems like lineart cleaning [103] and sketch extraction [66], more recent studies include sketch colorization [137], illustration segmentation [135], painting relighting [136], image-to-image translation with photos [58], and keyframe interpolation [106]. Available models for illustrated tasks typically rely on small manually-collected datasets. For example, the AniSeg [68] character segmenter is trained on less than 1,000 examples. While larger datasets are becoming available (e.g. Danbooru [5] now with 4.2m tagged illustrations), the labels are noisy and long-tailed, leading to poor model performance [6, 60]. Works requiring pose information may use synthetic renders of anime-style 3D models [56, 43], but the models are usually not publicly available. In this work, we present a cleaner tag classification task, a large character segmentation dataset, and an upgraded COCO keypoint dataset; these will all be made available upon publication, and may serve as a valuable prior for other tasks. Transfer Learning & Domain Adaptation Transfer learning and domain adaptation have been defined somewhat inconsistently throughout the vision and natural language processing 3 literature [119, 27], though generally the former is considered broader than the latter. In this paper, we use the terms interchangeably, referring to methods that leverage information from a number of related source domains and tasks, to a specific target domain and task. Typically, much more data is available for the source than the target, motivating us to transfer useful related source knowledge in the absence of sufficient target data [119]. For deep networks, the simplest practice is to pretrain a model on source data, and fine-tune its parameters on target data; however, various techniques have been studied that work with different levels of target data availability. Much of the transfer learning work in vision focuses on extreme cases with significantly limited target domain data, with emphasis around the task of image classification. In the few- shot learning case, we may be given as few as ten (or even one) samples from the target, inviting methods that embed prototypical target data into a space learned through prior source knowledge [121]. In particular, it is common to align parameters of feature extractors across domains, by directly minimizing pairwise feature distances or by adversarial domain discrimination [74, 115]. If the source and target are similar enough, it is possible to perform domain adaptation in the complete absence of labeled target data. This can be achieved by matching statistical properties of extracted features [109], or by converting inputs between domains through cycle-consistent image translation [46]. Pose Estimation With the availability of large-scale human pose datasets [70, 4], the vision community has recently been able to make great strides in pose estimation. A naive baseline was demonstrated by Mask R-CNN [45], which extended their detection and segmentation framework to predict single-pixel masks of joint locations. Other work such as RMPE take an approach tailored to pose estimation, deploying spatial transformer networks with pose-guided NMS and region proposal [33]. Around the same time, OpenPose proposed part affinity fields as a bottom- 4 up alternative to the more common heatmap representation of joints [17]. 
Human pose estimation work continues to make headway, extending beyond keypoint localization to include dense body part labels [42] and 3D pose estimation [51, 64, 80]. Pose Estimation Transfer Most transfer learning for pose estimation adapts from synthetically- rendered data to natural images. For example, by using mocaps and 3D human models, SUR- REAL [116] provides 6 million frames of synthetic video, complete with a variety of datatypes (2D/3D pose, RGB, depth, optical flow, body parts, etc.). CNNs may be able to directly gen- eralize pose from synthesized images [116], and can further close the domain gap using other priors like motion [28]. Outside of synthetic-to-real, Cao et al [65] explore domain adaptation for quadruped animal pose estimation, achieving generalization from human pose through adversar- ial domain discrimination with pseudo-label training. The closest prior work to our topic was done by Khungurn et al [56], who collected a modest AnimeDrawingsDataset (ADD) of 2k character illustrations with joint keypoints, and a larger synthetic dataset of 1 million frames rendered from MikuMikuDance (MMD) 3D models and mocaps. Unfortunately, the MMD dataset is not publicly available, and ADD contains mostly standard forward-facing poses. In addition, ADD is missing bounding boxes and several face keypoints, which are necessary for evaluation under the modern COCO standard [70]. We remedy these issues by training a bounding box detector from our new character segmentation dataset, labeling missing annotations in ADD, and labeling 2k additional samples in more varied poses. Khungurn et al perform transfer from an ImageNet-pretrained GoogLeNet backbone [111] and synthetic MMD data. In the absence of MMD, we instead transfer from a stronger backbone trained on a new illustration-specific classification task, as well as from a task-specific model pretrained on COCO keypoints. We use our subtask models and data to implement a number of 5 transfer techniques, from naive fine-tuning to adversarial domain discrimination. In doing so, we significantly outperform Khungurn et al on their reported metrics by 10-20%. Figure 1.1: A schematic outlining our two transfer learning architectures: feature concatenation, and feature matching. Note that source feature specificity is with respect to the target; i.e. task- specific means “related to pose estimation” and domain-specific means “related to illustrations”. Feature converters and matchers are convolutional networks that learn to mimic or re-appropriate pretrained features, respectively. While both designs require the pretrained Mask R-CNN com- ponents during training, feature matching discards them during inference, instead relying on the trained matcher network. “BCE” refers to binary cross-entropy loss. 1.3 Method & Architectures We provide motivation and architecture details for two variants of our proposed pose esti- mator (feature concatenation and feature matching), as well as two submodules critical for their success (a class-balanced tagger backbone and a character segmentation model). Architectures for baseline comparison models are described in Sec. 1.5. Pose Estimation Transfer Model We present two versions of our final model: feature concatenation, and feature matching. In this section, we assume that region proposals are given by a separate segmentation model (Sec. 1.3), and that the domain-specific backbone is already 6 available (Sec. 1.3); here, we focus on combining source features to predict keypoints (Fig. 4.1). 
The goal is to perform transfer simultaneously from both a domain-specific classification backbone (Sec. 1.3) and a task-specific keypoint model (Mask R-CNN [45]). Here, we chose Mask R-CNN as it showed significantly better out-of-the-box generalization to illustrations than OpenPose [17] (Tab. 1.1). Taking into account that the task-specific model already achieves mediocre performance on the target domain, the feature concatenation model simply stacks features from both sources (Fig. 4.1). In order to perform the concatenation, it learns shallow feature converters for each source to decrease the feature channel count and allow bilinear sampling to a common higher resolution. The combined features are fed to the head, consisting of a shallow converter and two ResNet blocks.

The final output is a stack of 25 heatmaps, 17 for COCO keypoints and 8 for auxiliary appendage midpoints (following Khungurn et al [56]). We apply pixel-wise binary cross-entropy loss on each heatmap, targeting a normal distribution centered on the ground-truth keypoint location with standard deviation proportional to the keypoint's COCO OKS sigma [70]; the sigmas for auxiliary midpoints are averaged from the endpoints of the body part. At inference, we gaussian-smooth the heatmaps and take the index of the maximum pixel value as the keypoint prediction.

Although feature concatenation produces the best results (Tab. 1.1), it is very inefficient. At inference, it must maintain the parameters of both source models and run both forward passes for each prediction; Mask R-CNN is particularly expensive in this regard. We thus also provide a feature matching model, inspired by the methods used in Luo et al [74]. As shown in Fig. 4.1, we simultaneously train an additional matching network that predicts features from the expensive task-specific model using features from the domain-specific model. Though matching may be optimized with self-supervision signals such as contrastive loss [126], we found that feature-wise mean-squared error is suitable. Given the matcher, the pretrained Mask R-CNN still helps training, but is not necessary at inference. Despite its simplicity, feature matching retains most of the performance benefits of both source models, while also being significantly lighter and faster than the concatenation architecture.
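To make the feature-matching objective concrete, the following is a minimal sketch of how such a matcher could be trained. The channel counts (2048 for the tagger backbone, 256 for the Mask R-CNN features) and the two-layer matcher are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMatcher(nn.Module):
    """Small convnet that predicts task-specific (Mask R-CNN) features
    from domain-specific (illustration tagger) features."""
    def __init__(self, c_domain=2048, c_task=256):  # assumed channel counts
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_domain, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, c_task, 3, padding=1),
        )

    def forward(self, f_domain):
        return self.net(f_domain)

def matching_loss(matcher, f_domain, f_task):
    # Feature-wise mean-squared error between matched and true task features;
    # spatial sizes are aligned by bilinear resampling.
    pred = matcher(f_domain)
    pred = F.interpolate(pred, size=f_task.shape[-2:],
                         mode='bilinear', align_corners=False)
    return F.mse_loss(pred, f_task)
```

At inference, only the tagger backbone and the trained matcher are kept, so the expensive Mask R-CNN forward pass is no longer needed.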
ResNet Tagger  The domain-specific backbone for our model (Fig. 4.1) is a pretrained ResNet50 [44] fine-tuned as an illustration tagger. The tagging task is equivalent to multi-label classification, in this case predicting the labels applied to an image by the Danbooru imageboard moderators [5]. The 392k unique tags cover topics including colors, clothing, interactions, composition, and even copyright metainfo.

Khungurn et al [56] use an ImageNet-trained GoogLeNet [111] backbone for their illustrated pose estimator, but we find that Danbooru fine-tuning significantly boosts transfer performance. There are publicly-available Danbooru taggers [6, 60], but both their classification performance and feature learning capabilities are hindered by uninformative target tags and severe class imbalance. By alleviating these issues, we achieve significantly better transfer to pose estimation.

Most available Danbooru taggers [6, 60] take a coarse approach to defining classes, simply predicting the several thousand (6-7k) most frequent tags. However, many of these tags represent contextual information not present in the image; e.g. neon genesis evangelion (the name of a franchise), or alternate costume (fanmade/non-canon clothes). We instead only allow tags explicitly describing the image (clothing, body parts, etc.). Selecting tags by frequency also introduces tag redundancy and annotator disagreement. There are many high-frequency tags that share similar concepts but are annotated inconsistently; e.g. hand in hair, adjusting hair, and hair tucking have vague wiki definitions for taggers, and many color tags are subjective (aqua hair vs. blue hair). To address these challenges, we survey Danbooru wikis to manually develop a rulebook of tag groups that defines more explicit and less redundant classes.

Danbooru tag frequencies form a long-tailed distribution, posing a severe class imbalance problem. In addition to filtering out under-tagged images (detailed in Sec. 1.4), we implement an inverse square-root frequency reweighing scheme to emphasize the learning of less-frequent classes. More formally, the loss on a sample is:

L(y, \hat{y}) = \frac{1}{C} \sum_{i=0}^{C-1} w_i(y_i) \, \mathrm{BCE}(y_i, \hat{y}_i)    (1.1)

w_i(z) = \frac{1}{2} \left( \frac{z}{r_i} + \frac{1-z}{1-r_i} \right)    (1.2)

r_i = \frac{\sqrt{N_i}}{\sqrt{N_i} + \sqrt{N - N_i}}    (1.3)

where C is the number of classes, ŷ ∈ [0, 1]^C is the prediction, y ∈ {0, 1}^C is the ground-truth label, BCE is binary cross-entropy loss, N is the total number of samples, and N_i is the number of positive samples in the i-th class. We found that plain inverse frequency weighing caused numerical instability in training, necessitating the square root.
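As a concrete illustration of Eqs. 1.1-1.3, below is a minimal PyTorch sketch of the class-balanced tagging loss; the tensor names and the way class counts are supplied are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def tagging_loss(y_hat, y, class_pos_counts, n_total):
    """Inverse square-root frequency reweighted BCE (Eqs. 1.1-1.3).

    y_hat: (B, C) predicted tag probabilities in [0, 1]
    y:     (B, C) binary ground-truth tags
    class_pos_counts: (C,) number of positive samples N_i per class
    n_total: total number of samples N in the dataset
    """
    sqrt_pos = torch.sqrt(class_pos_counts.float())
    sqrt_neg = torch.sqrt(float(n_total) - class_pos_counts.float())
    r = sqrt_pos / (sqrt_pos + sqrt_neg)              # Eq. 1.3 (square root avoids instability)
    w = 0.5 * (y / r + (1 - y) / (1 - r))             # Eq. 1.2, broadcast over the batch
    bce = F.binary_cross_entropy(y_hat, y, reduction='none')
    return (w * bce).mean()                           # Eq. 1.1, averaged over classes and batch
```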
Character Segmentation & Bounding Boxes  In order to produce bounding boxes around each subject in the image, we first train an illustrated character segmenter. As we assume one subject per image, we can derive a bounding box by enclosing the thresholded segmentation output. The single-subject assumption also removes the need for the region proposal and NMS infrastructure present in available illustrated segmenters [68], so that our model may focus on producing clean segmentations only. Our segmentation model is based on DeepLabv3 [22], with three additional layers at the end of the head for finer segmentations at the input image resolution. We initialize with pretrained DeepLabv3 weights from PyTorch [89], and fine-tune the full model using pixel-wise binary cross-entropy loss.

Table 1.1: Performance of different architectures and ablations described in Sec. 1.5. Note that the parameter count and speed are measured in inference mode with batch size one; "m" refers to "millions of parameters".

1.4 Data Collection

Unless mentioned otherwise, we train with random image rotation, translation, scaling, flipping, and recoloring.

Pose Data  We extend the AnimeDrawingsDataset (ADD), first collected by Khungurn et al [56]. The original dataset had 2000 illustrated full-body single-character images from Danbooru, each annotated with joint keypoints. However, ADD did not follow the now-popularized COCO standard [70]; in particular, it was missing facial keypoints (eyes and ears) and bounding boxes. In order to evaluate and compare with modern pose estimators, we manually labeled the missing keypoints using an open-source COCO annotator [12] and automatically generated bounding boxes using the character segmenter described in Sec. 1.3. We also manually remove 57 images with multiple characters, or without the full body in view. In addition, we improve the diversity of poses in ADD by collecting an additional 2043 samples.

A major weakness of ADD is its lack of backwards-facing characters; only 5.45% of the entire 2k dataset had a back-related Danbooru tag (e.g. back, from behind, looking back, etc.). We specifically filtered for back-related images when annotating, resulting in a total of 850 in the updated dataset (21.25%). We also selected for other notably under-represented poses, like difficult leg tags (soles, bent over, leg up, crossed legs, squatting, kneeling, etc.), arm tags (stretch, arms up, hands clasped, etc.), and lying tags (on side, on stomach).

Our final updated dataset contains 4000 illustrated character images with all 17 COCO keypoints and bounding boxes. We designate 3200 images for training (previously 1373), 313 for validation (previously 97), and 487 for testing (same as the original ADD). For each input image, we first scale and crop such that the bounding box is centered and padded by at least 10% of the edge length on all sides. We then perform augmentations; flips require swapping left-right keypoints, and full 360-degree rotations are allowed.

ResNet Tagger Data  Our ResNet50 tagger is trained on a new subset of the 512px SFW Danbooru2019 dataset [5]. The original dataset contains 2.83m images with over 390k tags, but after filtering and retagging we arrive at 837k images with 1062 classes. The new classes are derived from manually-selected union rules over 2027 raw tags, as described in Sec. 1.3; the rulebook has 314 body-part, 545 clothing, and 203 miscellaneous (e.g. image composition) classes. To combat the class imbalance problem described in Sec. 1.3, we also rigorously filtered the dataset. We remove all images that are not single-person (solo, 1girl, or 1boy), are comics (comic, 4koma, doujinshi, etc.), or are smaller than 512px. Most critically, we remove all images with fewer than 12 positive tags; these images are very likely under-tagged, and would have introduced many false-negatives to the ground truth. The final subset of 837k images has significantly reduced class imbalance (median class frequency 0.38%, minimum 0.04%) compared to the datasets of available taggers (median 0.07%, min 0.01%) [6]. We split the dataset 80-10-10 train-val-test. As some tags are color-sensitive, we do not jitter the hue; similarly, as some tags are orientation-sensitive, we allow up to 15-degree rotations and horizontal flips only.

Character Segmentation Data  To obtain character bounding boxes, we train a character segmentation model and enclose output regions at 0.5 threshold (Sec. 1.3). The inputs to our segmentation system are augmented composites of RGBA foregrounds (with transparent backgrounds) onto RGB backgrounds; the synthetic ground truth is the foreground alpha. The available AniSeg dataset [68] has only 945 images, with manually-labeled segmentations that are not pixel-perfectly aligned. We thus collect our own larger synthetic compositing dataset. Our background images are a mix of illustrated scenery (5.8k Danbooru images with the scenery and no humans tags) and stock textures (2.3k scraped [2] from the Pixiv Dataset [67]). We collect single-character foreground images from Danbooru with the transparent background tag; 18.5k samples are used, after filtering images with text, non-transparency, or more than one connected component in the alpha channel. Counting each foreground as a single sample, this makes our new dataset roughly 20x larger than AniSeg. The foregrounds and backgrounds are randomly paired for compositing during training, with a 5% chance of having no foreground. We hold out 2048 deterministic foreground-background pairs for validation and testing (1024 each).
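The compositing step itself is simple alpha blending, sketched below; the sampling of foreground/background pairs and the 5% no-foreground case are assumed to be handled by an outer data loader.

```python
import numpy as np

def composite(fg_rgba, bg_rgb):
    """Alpha-composite an RGBA foreground onto an RGB background.

    fg_rgba: (H, W, 4) float array in [0, 1], already resized/augmented
    bg_rgb:  (H, W, 3) float array in [0, 1]
    Returns the composited RGB image and its ground-truth mask (the alpha channel),
    which supervises the pixel-wise BCE segmentation loss.
    """
    alpha = fg_rgba[..., 3:4]                              # (H, W, 1)
    rgb = alpha * fg_rgba[..., :3] + (1.0 - alpha) * bg_rgb
    return rgb, alpha[..., 0]
```

Because pairing is randomized at load time, the number of distinct composites seen during training is far larger than the 18.5k foregrounds alone.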
Table 1.2: Keypoint breakdown of our most performant "feature concatenation" model trained on our extended ADD dataset. In the center, we list the relative improvement of each metric when training on additional data. On the right, we display the PDJ@20 from Khungurn et al [56], and report the relative difference from our best model. *Note that due to keypoint incompatibilities, we fill missing keypoint results from [56] using the most similar keypoints reported: "head" for eyes and ears, and "body" for shoulders and hips.

Model            F-1      precision   recall   IoU
Ours             0.9472   0.9427      0.9576   0.9326
YAAS SOLOv2      0.9061   0.9003      0.9379   0.9077
YAAS CondInst    0.8866   0.8824      0.8999   0.9158
AniSeg           0.5857   0.5877      0.5954   0.6651

Table 1.3: Comparison of our character segmentation and bounding box performance, described in Sec. 1.5.

1.5 Experiments

We used PyTorch [89] wrapped in Lightning [3]; some models use the R101-FPN keypoint detection R-CNN from Detectron2 [125]. All models can be trained with a single GTX1080ti (11GB VRAM). Unless otherwise mentioned, we trained models using the Adam [61] optimizer, with 0.001 learning rate and batch size 32, for 1,000 epochs. The ResNet backbone is trained on the Danbooru tag classification task using our new manual tagging rulebook (Sec. 1.4). The character segmenter used for bounding boxes is trained with our new character segmentation dataset (Sec. 1.4). Using the previous two submodules, we train the pose estimator using our upgraded version of the ADD dataset (Sec. 1.4). All data and code will be released upon publication.

Pose Estimation Transfer  Table 1.1 shows the performance of different architectures. We report COCO OKS [70], PCKh and PCPm [4], and PDJ (for comparison with Khungurn et al [56]). From the top four rows, we see that our proposed feature concatenation and matching models perform the best overall, and that the addition of our new data increases performance. We also observe that while concatenation performs marginally better than matching, matching is 8.8x more parameter-efficient and one-third faster at inference.

The second group of Table 1.1 shows other architectures, roughly in order of method complexity. Here, as in Fig. 4.1, "task" source features refer to Mask R-CNN pose estimation features, and "domain" source features refer to illustration features extracted by our ResNet50 tag classifier.

"Task Fine-tuning Only" fine-tunes the pretrained Mask R-CNN head with its frozen default backbone; the last head layer is re-initialized to accommodate auxiliary appendage keypoints. This is vanilla transfer by fine-tuning a task-specific source network on a small task-specific target domain dataset.

"Domain Features Only" is our frozen ResNet50 backbone with a keypoint head. This is vanilla transfer by adding a new task head to a domain-specific source network.

"Task Fine-tuning w/ Domain Features" fine-tunes the pretrained Mask R-CNN head as above, but replaces the R-CNN backbone with our frozen ResNet50 backbone. This is a naive method of incorporating both sources, attempting to adapt the task source's pretrained prediction component to new domain features.

"Adversarial (DeepFashion2)" reuses the feature matching architecture, but performs adversarial domain discrimination instead of MSE matching.
The discriminator is a shallow 2-layer convnet, trained to separate Mask R-CNN features of randomly sampled DeepFashion2 [38] images from ResNet features of Danbooru illustrations. As the feature maps to discriminate are spatial, we are careful to employ only 1x1 kernels in the discriminator; otherwise, the dis- criminator could pick up intrinsic anatomical differences. The matching network now fools the discriminator by adversarially aligning the feature distributions. “Adversarial (COCO)” is the same adversarial architecture as above, but using COCO [70] images containing people instead of Deepfashion2. While domain-features-only is the cheapest architecture overall, it is only slightly more efficient than feature matching, and loses all benefits of task-specific transfer. However, the performance drop from feature concatenation to domain-features-only and task-with-domain- features is not very large (2-3% OKS@50); meanwhile, there is a wide gap to task-fine-tuning- only. This shows that the domain-specific ResNet50 backbone trained on our new body-tag rulebook provides much more predictive power than the task-specific pretrained Mask R-CNN. It is important to note that the adversarial models exhibited significant instability during training. After extensive hyperparameter tuning, the best DeepFashion2 model returns NaN loss at epoch 795, and the best COCO model fails at epoch 354; all other models safely exited at epoch 1,000. DeepFashion2 likely outperforms COCO because the image composition is much more similar to that of Danbooru; images are typically single-person portraits with most of the body in view. Adversarial losses are notoriously difficult to optimize, and in our case destabilized training so as to perform worse than not having been used at all. The fourth group of Table 1.1 shows out-of-the-box generalization to illustrations for Mask R-CNN [45] and OpenPose [17]. We use Mask R-CNN as our task-specific source, as it is less- 15 Figure 1.2: Pose-based retrieval. From left to right, we show the query image (descriptor dis- tance zero) followed by its five nearest neighbors (duplicate and NSFW images removed). Each illustration is annotated with its Danbooru ID, descriptor distance to the query, and the predicted bounding box with COCO keypoints. Please see supplementary materials for full artist attribution and additional examples. overfit to natural images than OpenPose. Table 1.2 gives a keypoint breakdown and comparison with Khungurn et al [56]. The results demonstrate that training on our additional more varied data improves the overall model performance; this is especially true for appendage keypoints, which are more variable than the head and torso. We also see significant improvement from results reported in Khungurn et al. The exception is the hips, for which we compare to their “body” keypoint at the navel. While this is not a direct comparison, our PDJ on hips is nevertheless low relative to other keypoints. This is 16 because PDJ does not account for the intrinsic ambiguity of the hips; looking at the OKS, which accounts for annotator disagreement, we see that hip performance is actually quite high. An important caveat is that the metrics are generally not comparable with those reported in human pose estimation. COCO OKS, for example, was designed using annotator disagreement on natural images [70]; however, illustrated character proportions deviate widely from the stan- dard human form (i.e. bigger head and eyes). 
Characters also tend to take up more screen space proportional to body size (i.e. big hair and clothing), leading to looser thresholds normalized by bounding box size. ResNet Tagger Backbone We train our ResNet50 tagger backbone to produce illustration- specific source features (Fig. 4.1). Taking into account the class imbalance, we accumulate gradients for an effective batch size of 512. Considering the minimum (0.04%) and median (0.38%) class frequencies, we may expect the smallest class to appear 0.2 times per batch, and the median class to appear 1.9 times per batch. To demonstrate the effectiveness of our tag rulebook and class reweighing strategy, we report performance on pose estimation using two other ResNet50 backbones: the RF5 tagger [6], and the default ImageNet-pretrained ResNet50 from PyTorch [89]. While there are several Danbooru taggers available [6, 60], we chose to compare our backbone to the RF5 tagger [6] because it is the most architecturally similar to our ResNet50, and relatively better-documented. The backbones all share the same architecture and parameter count, and are all placed into our feature concatenation transfer model for the ablation. The backbone ablation results are shown in the last three rows of Table 1.1. As expected, a classifier trained with our novel body-part-specific tagging rulebook and class-balancing tech- niques significantly improves transfer to pose estimation. Note that our tagger also outperforms 17 RF5 at classification (on shared target classes); please refer to the supplementary materials for more details. Character Segmentation & Bounding Boxes We compare the segmentation and bound- ing box performance of our system with that of publicly-available models. AniSeg [68] is a Faster-RCNN [95], and YAAS [141] provides SOLOv2 [120] and CondInst [114] models. These detectors may detect more than one character, and their bounding boxes are not necessarily tight around segmentations; for simplicity, we union all predicted segmentations of an image, and redraw a tight bounding box around the union. We evaluate all models on the same test set de- scribed in Sec. 1.4. Table 1.3 shows that training with our new 20x larger dataset outperforms available models in both mean F-1 (segmentation) and IoU (bounding boxes); we thus use it in our pipeline for bounding box prediction. 1.6 Application: Pose-guided Retrieval An immediate application of our illustrated pose estimator is a pose-guided character re- trieval system. We construct a proof-of-concept retriever that takes a query character (or user- specified keypoints and bounding box) and searches for illustrated characters in a similar pose. This system can serve as a useful search tool for artists, who often use reference drawings while illustrating. Our pose retriever performs a simple nearest-neighbor search. The support images con- sist of single-character Danbooru illustrations with the full body tag. Using our best-performing model, we extract bounding boxes and keypoint locations for each character, normalize the key- points by the longest bounding box dimension, and finally store the pairwise euclidean distances 18 between the normalized keypoints. This process ensures the pairwise-distance descriptor is in- variant to translation, rotation, and image scale. At inference, we extract the descriptor from the query, and find the euclidean k-nearest neighbors from the support set. In practice, we compute descriptors using all 25 predicted keypoints (17 COCO and 8 additional appendage midpoints). 
This makes the descriptor 300-dimensional (25 choose 2), which is generally too large for tree-based nearest neighbors [14]. However, since our support set consists of 136k points, we are still able to brute force search in reasonable time. Empirically, each query takes about 0.1341s for keypoint extraction (GPU) and 0.0638s for search (CPU). To demonstrate the effectiveness of our pose estimator, we present several query results in Fig. 1.2; while there is no ground-truth to measure quantitative performance, qualitative in- spection suggests that our model works well. We can retrieve reasonably similar illustrations for standard poses as shown in the first row, as well as more difficult poses for which illustrators would want references. Note that while our system has no awareness of perspective, it is able to effectively leverage keypoint cues to retrieve similarly foreshortened views in the last row. For more examples, please refer to our supplementary materials. 1.7 Conclusion & Future Work While we may continue to improve the transfer performance through methods like pseudo- labeling [65] or cycle-consistent image translation [46], we can also begin extending our work to multi-character detection and pose estimation. While it is possible to construct a naive instance- based segmentation and keypoint estimation dataset by compositing background-removed ADD samples, we cannot expect a system trained on such data to perform well in-the-wild. Character 19 interactions in illustrations are often much more complex than human interactions in real life, with much more frequent physical contact. For example, Danbooru has 43.6k images tagged with holding hands and 59.1k with hugging, already accounting for 2.8% of the entire dataset. Simply compositing independent characters together would not be able to model the intricacies of the illustration domain; we would again need to expand our datasets with annotated instances of character interactions. As a fundamental vision task, pose estimation also provides a valuable prior for numerous other novel applications in the illustrated domain. Our pose estimator opens the door to pose- guided retargeting for automatic character animation, better keyframe interpolation, pose-aware illustration colorization, 3D character reconstruction, etc. In conclusion, we demonstrate state-of-the-art pose estimation on the illustrated charac- ter domain, by leveraging both domain-specific and task-specific source models. Our model significantly outperforms prior art [56] despite the absence of synthetic supervision, thanks to successful transfer from our new illustration tagging subtask focused on classifying body-related tags. In addition, we provide a single-region proposer trained on a novel character segmentation dataset 20x larger than those currently available, as well as an updated illustration pose estima- tion dataset with twice the number of samples in more diverse poses. Our model performance allows for the novel task of pose-guided character illustration retrieval, and paves the way for future applications in the illustrated domain. 20 Chapter 2: Improving the Perceptual Quality of 2D Animation Interpolation Traditional 2D animation is labor-intensive, often requiring animators to manually draw twelve illustrations per second of movement. While automatic frame interpolation may ease this burden, 2D animation poses additional difficulties compared to photorealistic video. 
In this work, we address challenges unexplored in previous animation interpolation systems, with a focus on improving perceptual quality. Firstly, we propose SoftsplatLite (SSL), a forward-warping interpolation architecture with fewer trainable parameters and better perceptual performance. Secondly, we design a Distance Transform Module (DTM) that leverages line proximity cues to correct aberrations in difficult solid-color regions. Thirdly, we define a Restricted Relative Linear Discrepancy metric (RRLD) to automate the previously manual training data collection process. Lastly, we explore evaluation of 2D animation generation through a user study, and establish that the LPIPS perceptual metric and chamfer line distance (CD) are more appropriate measures of quality than PSNR and SSIM used in prior art. 2.1 Introduction Traditional 2D animators typically draw each frame manually; this process is incredibly labor-intensive, requiring large production teams with expert training to sketch and color the tens of thousands of illustrations required for an animated series. With the growing global popularity 21 of the traditional style, studios are hard-pressed to deliver high volumes of quality content. We ask whether recent advancements in computer vision and graphics may reduce the burden on animators. Specifically, we study video frame interpolation, a method of automatically generating intermediate frames in a video sequence. In the typical problem formulation, a system is expected to produce a halfway image naturally interpolating two given consecutive video frames. In the context of animation, an animator could potentially achieve the same framerate for a sequence (or “cut”) by manually drawing only a fraction of the frames, and use an interpolator to generate the rest. Though there is abundant work on video interpolation, 2D animation poses additional dif- ficulties compared to photorealistic video. Given the high manual cost per frame, animators tend to draw at reduced framerates (e.g. “on the twos” or at 12 frames/second), increasing the pixel displacements between consecutive frames and exaggerating movement non-linearity. Unlike in natural videos with motion blur, the majority of animated frames can be viewed as stand-alone cel illustrations with crisp lines, distinct solid-color regions, and minute details. For this non- photorealistic domain with such different image and video features, even our understanding of how to evaluate generation quality is limited. Previous animation-specific interpolation by Li et. al. (AnimeInterp [106]) approached some of these challenges by improving the optical flow estimation component of a deep video interpolation system by Niklaus et. al. (Softsplat [83]); in this paper, we build upon AnimeInterp by addressing some remaining challenges. Firstly, though AnimeInterp improved optical flow, it trained with an L1 objective and did not modify the Softsplat feature extraction, warping, or synthesis components; this results in blurred lines/details and ghosting artifacts in supposedly solid-color regions. We alleviate these issues with architectural improvements in our proposed 22 SoftsplatLite (SSL) model, as well as with an additional Distance Transform Module (DTM) that refines outputs using domain knowledge about line drawings. Secondly, though AnimeInterp provided a small ATD12k dataset of animation frame triplets, the construction of this dataset required intense manual filtering of evenly-spaced triplets with linear movement. 
We instead automate linear triplet collection from raw animation by introducing Restricted Relative Linear Discrepancy (RRLD), enabling large-scale dataset construction. Lastly, AnimeInterp only fo- cused on PSNR/SSIM evaluation, which we show (through an exploratory user study) are less indicative of percieved quality than LPIPS [138] and chamfer line distance (CD). We summarize the contributions of this paper: 1. SoftsplatLite (SSL): a forward-warping interpolation architecture with fewer trainable pa- rameters and better perceptual performance. We tailor the feature extraction and synthesis networks to reduce overfitting, propose a simple infilling method to remove ghosting arti- facts, and optimize LPIPS loss to preserve lines and details. 2. Distance Transform Module (DTM): a refinement module with an auxiliary domain- specific loss that leverages line proximity cues to correct aberrations in difficult solid-color regions. 3. Restricted Relative Linear Discrepancy (RRLD): a metric to quantify movement non- linearity from raw animation; this automates the previously manual training data collection process, allowing more scalable training. 4. Perceptual user study: we explore evaluation of 2D animation generation, establishing the LPIPS perceptual metric and chamfer line distance (CD) as more appropriate quality measures than PSNR/SSIM used in prior art. 23 Figure 2.1: We improve the perceptual quality of 2D animation interpolation from previous work. (a) Overlaid input images to interpolate; (b) AnimeInterp by Li et. al. [106]; (c) Our proposed method; (d) Ground truth interpolation. Note the destruction of lines in (b) compared to (c), and the patchy artifacts ghosted on the teapot in (b). Our user study validates our focus on perceptual metrics and artifact removal. 2.2 Related Work Much recent work has been published on photorealistic video interpolation. Broadly, these works fall into phase-based [77, 76], kernel-based [85, 84], and flow-based methods [83, 53, 88, 127], with others using a mix of techniques [7, 8, 25]. The most recent state-of-the-art has seen more flow-based methods [83, 88], following corresponding advancements in optical flow estimation [50, 110, 52, 112]. Flow-based methods can be further split by forward [83], or backward [88] warping. The prior art most directly related to ours is AnimeInterp, by Li et. al. [106]. While they laid the groundwork for the problem specific to the traditional 2D animation domain, their system had many shortcomings that we overcome as described in the introduction section. Even though we focus on animations “post-production” (i.e. interpolating complete full- color sequences), there is also a body of work on automating more specific components of ani- mation production itself. For example, sketch simplification [105, 104] is a popular topic with applications to speeding up animation “tie-downs” and “cleanups”. There are systems for syn- 24 Figure 2.2: Schematic of our proposed system. SoftsplatLite (SSL, Sec. 2.3) passes a prediction to the Distance Transform Module (DTM, Sec. 2.3) for refinement. SSL uses many fewer train- able parameters than AnimeInterp [106] to reduce overfitting, and introduces an infilling step to avoid ghosting artifacts. DTM leverages domain knowledge about line drawings to achieve more uniform solid-color regions. Artists: hariken, k.k.1 thesizing “in-between” line drawings from sketch keyframes in both raster [82, 130] and vector [123, 26] form. While the flow-based in-betweening done by Narita et. al. 
[82] shares similarity to our work (such as the use of chamfer distance and forward warping), their system composed pretrained models without performing any form of training. Another related problem is sketch colorization, with application to both single illustrations [137] and animations [92, 75, 19]. These works unsurprisingly highlight the foundational role of lines and sketches in animation, and we continue the trend by introducing a Distance Transform Module to improve our generation quality.

2.3 Methodology

SoftsplatLite

As with AnimeInterp [106], we base our model on the state-of-the-art Softsplat [83] interpolation model, which uses bidirectional optical flow to differentiably forward-splat input image features for synthesis. Whereas AnimeInterp only focused on improving optical flow estimation, we assume a fixed flow estimator (the same RAFT [112] network from AnimeInterp, which they dub "RFR"). We instead look more closely at feature extraction, warping, and synthesis; our proposed SoftsplatLite (named similarly to PWC-Lite [72]) aims to improve convergence on LPIPS [138] while also being parameter- and training-efficient. Please see Fig. 4.1a for an overview of SSL.

We first note that the feature extractors in AnimeInterp [106] and Softsplat [83] are relatively shallow. The extractors must still be trained, and rely on backpropagation through the forward splatting mechanism. In practice, we found that replacing the extractor with the first four blocks of a frozen ImageNet-pretrained ResNet-50 [44] performs better; additionally, freezing the extractor contributes to reduced memory usage and compute during training, as no gradients must be backpropagated through the warping operations. Note that we also tried unfreezing the ResNet, but observed slight overfitting.

Figure 2.3: SSL vs. AnimeInterp ft. [106]. Trained on the same ATD data [106] and LPIPS loss [138], AnimeInterp encounters many "ghosting" artifacts, which we resolve in SSL by proposing an inpainting technique.

1 Artists: hariken: https://danbooru.donmai.us/posts/5378938; k.k.: https://danbooru.donmai.us/posts/789765

Next, we observe that forward splatting results in large empty occluded regions. If left unhandled during LPIPS training, these gaps often cause undesirable ghosting artifacts (see AnimeInterp [106] output in Fig. 2.3b). Additionally, subtle gradients at the edge of moving objects in the optical flow field may result in a spread of dots after forward warping; these later manifest as blurry patches in AnimeInterp predictions (see Fig. 2.1b). To remove these artifacts, we propose a simple infilling technique to generate a better warped feature stack F prior to synthesis ("occlusion-mask infilling" in Fig. 4.1a):

F = \frac{1}{2} \left( M_{0 \to t} W_{0 \to t}(f(I_0)) + (1 - M_{0 \to t}) W_{1 \to t}(f(I_1)) \right) + \frac{1}{2} \left( M_{1 \to t} W_{1 \to t}(f(I_1)) + (1 - M_{1 \to t}) W_{0 \to t}(f(I_0)) \right)    (2.1)

Z_{1 \to 0} = -0.1 \times \| \mathrm{LAB}(I_1) - W'_{0 \to 1}(\mathrm{LAB}(I_0)) \|    (2.2)

where W_{a \to b} denotes forward warping from timestep a to timestep b, W' denotes backwarping, M denotes the opened occlusion mask of the warp, I represents either input image, and f represents the feature extractor. In other words, occluded features are directly infilled with warped features from the other source image.
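A sketch of the infilling in Eq. 2.1 is shown below; the forward-warp operator and the opened occlusion masks are assumed to come from an existing softmax-splatting implementation, so the function signature here is illustrative.

```python
import torch

def infill_features(feat0_w, feat1_w, mask0, mask1):
    """Occlusion-mask infilling (Eq. 2.1).

    feat0_w, feat1_w: features of I0 and I1 forward-warped to time t, (B, C, H, W)
    mask0, mask1:     opened occlusion masks of each warp, (B, 1, H, W), 1 = covered
    Holes in one warp are filled with the other warp's features, then the two
    infilled stacks are averaged.
    """
    filled0 = mask0 * feat0_w + (1 - mask0) * feat1_w
    filled1 = mask1 * feat1_w + (1 - mask1) * feat0_w
    return 0.5 * (filled0 + filled1)
```

Applying the same routine to the raw RGB images (i.e. f set to the identity) yields the strong initial guess that the lightweight U-Net later refines.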
The computation of mask M involves warping an image of ones, followed by a morphological image opening with kernel k = 5 to remove dotted artifacts; 27 note that though opening is non-differentiable, no gradients are needed with respect to the flow field as our flow estimator is fixed. Unlike AnimeInterp [106], we do not use average forward splatting, and instead use the more accurate softmax weighting scheme with negative L2 LAB color consistency as our Z-metric (similar as in Softsplat [83]). While it is not guaranteed that this infilling method will eliminate all holes (it is still possible for both warps to have shared occluded regions), we find that in practice the majority of image areas are covered. Lastly, for the synthesis stage, we opt for a much more lightweight U-Net [98] instead of the GridNet [36] used in the original Softsplat [83]. We may afford this thrifty replacement by carefully placing a direct residual path from an initial warped guess to the final output. This follows the observation that directly applying our previously-described infilling method to the input RGB images produces a strong initial guess for the output; this is achieved by replacing feature extractor f in Eq. 2.1 with the identity function. Instead of requiring a large synthesizer to reconcile two sets of warped images and features into a single final image, we employ a small network to simply refine a single good guess. Under this architecture, the additional GridNet parameters become redundant, and even contribute to overfitting. Note that while SoftsplatLite and Softsplat have comparable parameter counts at inference (6.92M and 6.21M respectively), the frozen feature extractor and smaller synthesizer signifi- cantly reduces the number of trainable parameters compared to the original (1.28M and 2.01M respectively). We later demonstrate through ablations (Tab. 2.2) that lighter training and artifact reduction allow SSL to score better on perceptual metrics like LPIPS and chamfer distance. Distance Transform Module As seen in Fig. 2.4b, SoftsplatLite may struggle to choose colors for certain regions, or have trouble with large areas of flat color. These difficulties may be partly attributed to the natural tex- 28 Figure 2.4: Effect of DTM. DTM effectively leverages line proximity cues (distance transform) to refine SSL outputs. DTM not only removes minor aberrations from solid-color regions (bottom), but also corrects entire enclosures if needed (top). ture bias of convolutional models [39]; the big monotonous regions of traditional cel animation would expectedly require convolutions with larger perceptual fields to extract meaningful fea- tures. Instead of building much deeper or wider models, we take advantage of line information inherently present in 2D animation; hypothetically, providing line proximity information to con- volutions may act as a form of “stand-in” texture that helps the processing of cel-colored image data. We thus propose a Distance Transform Module (DTM) to refine the SSL outputs by lever- aging a normalized version of the Euclidean distance transform (NEDT). At a high level (see Fig. 4.1b), DTM first attempts to predict the ground truth NEDT of the output (middle) frame, and then uses this prediction to refine the SSL output through a residual block. To train the predic- tion of NEDT, we introduce an auxiliary Ldt in addition to the Llpips on the final prediction, and optimize a weighted sum of both losses end-to-end. 
The rest of this section provides specifics on the implementation. The first step is to extract lines from the input images; for this, we use the simple but effective difference of gaussians (DoG) edge detector,

\mathrm{DoG}(I) = \frac{1}{2} + t \left( G_{k\sigma}(I) - G_{\sigma}(I) \right) - \epsilon,    (2.3)

where G_{\sigma} are Gaussian blurs after greyscale conversion, k = 1.6 is a factor greater than one, and t = 2 with \epsilon = 0.01 are hyperparameters. Please see Fig. 2.6 for examples of DoG extraction.

Figure 2.5: RRLD filtering. RRLD quantifies whether a triplet is evenly-spaced. We show several overlaid triplets from our additional dataset ranked by RRLD; higher RRLD (bottom) indicates deviation from the halfway assumption. As RRLD is fully automatic, appropriate training data can be filtered from raw video at scale.

Next, we apply the distance transform. To bound the range of values, we normalize EDT values to unit range similarly to Narita et al. [82],

\mathrm{NEDT}(I) = 1 - \exp \left\{ -\frac{\mathrm{EDT}(\mathrm{DoG}(I) > 0.5)}{\tau d} \right\},    (2.4)

where \tau = 15/540 is a steepness hyperparameter, and d is the image height in pixels. Note that we threshold DoG at 0.5 to get a binarized sketch.

This normalized EDT is extracted from both input images, and warped through the same inpainting procedure as Eq. 2.1; more precisely, f is replaced by NEDT. DTM then uses this, as well as the extracted NEDT of SSL's output, to estimate the NEDT of the ground-truth output frame. This prediction occurs through a small convolutional network (first yellow box in Fig. 4.1b), and is trained to minimize an auxiliary L_dt, the L1 Laplacian pyramid loss between the predicted and ground-truth NEDTs. A final convolutional network (second yellow box in Fig. 4.1b) then incorporates the predicted NEDT to residually refine the SSL output.

Note that we detach the predicted NEDT image from the final RGB image prediction gradients ("SG" for "stop-gradient" in Fig. 4.1b), in order to reduce potentially competing signals from L_dt and the final image loss. It is also important to mention that since both DoG sketch extraction and EDT are non-differentiable operations, the extraction of NEDT from the Softsplat output cannot be backpropagated. However, we found that we could still reasonably perform end-to-end training despite the required stop-gradient in this step.

Through this process, our DTM is able to predict the distance transform of the output, and utilize it in the final interpolation. Experiments show that this relatively cheap additional network is effective at improving perceptual performance (Tab. 2.2).
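To make the line-extraction pipeline of Eqs. 2.3-2.4 concrete, here is a minimal sketch using OpenCV-style operations; the base sigma of the blurs is an assumption, since only the ratio k and the constants t and ε are specified above.

```python
import cv2
import numpy as np

def dog_lines(img_rgb, sigma=1.0, k=1.6, t=2.0, eps=0.01):
    """Difference-of-Gaussians edge map (Eq. 2.3), values roughly in [0, 1]."""
    grey = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0
    g_wide = cv2.GaussianBlur(grey, (0, 0), sigmaX=k * sigma)
    g_narrow = cv2.GaussianBlur(grey, (0, 0), sigmaX=sigma)
    return 0.5 + t * (g_wide - g_narrow) - eps

def nedt(img_rgb, tau=15.0 / 540.0):
    """Normalized Euclidean distance transform of the line drawing (Eq. 2.4)."""
    sketch = dog_lines(img_rgb) > 0.5                   # binarized lines
    # distanceTransform measures the distance to the nearest zero pixel,
    # so pass the inverted sketch to get the distance to the nearest line pixel.
    edt = cv2.distanceTransform((~sketch).astype(np.uint8), cv2.DIST_L2, 3)
    d = img_rgb.shape[0]                                # image height in pixels
    return 1.0 - np.exp(-edt / (tau * d))
```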
Restricted Relative Linear Discrepancy

Unlike in the natural video domain, where almost any three consecutive frames from a cut may be used as a training triplet, data collection for 2D animation is much more ambiguous. Animators often draw at variable framerates with expressive arc-like movements; when coupled with high pixel displacements, this results in a significant number of triplets with non-linear motion or uneven spacing. However, under the problem formulation, all middle frames of training triplets are assumed to be “halfway” between the inputs. While forward warping provides a way to control the interpolated t ∈ [0, 1] at which generation occurs, it is ambiguous to label such ground truth for training. Li et al. in AnimeInterp [106] manually filter through more than 130,000 triplets to arrive at their ATD dataset with 12,000 samples, a costly manual effort with less than 10% yield.

In order to automate the training data collection process from raw animation data, we quantify the deviation of a triplet from the halfway assumption with a novel Restricted Relative Linear Discrepancy (RRLD) metric, and filter samples based on a simple threshold. In our experiments (Tab. 2.2), we demonstrate that selecting additional training data with RRLD improves generalization, whereas training on naively-collected triplets damages performance. We additionally show that RRLD largely agrees with ATD, and that RRLD is robust to the choice of flow estimator (Sec. 2.4). Please see Fig. 2.5 for example triplets accepted or rejected by RRLD. The rest of this section provides specifics of the filtering method.

We define RRLD as follows,

\mathrm{RRLD}(\omega_{t\to 0}, \omega_{t\to 1}) = \frac{1}{|\Omega|} \sum_{(i,j)\in\Omega} \frac{\left\| \omega_{t\to 0}[i,j] + \omega_{t\to 1}[i,j] \right\| / 2}{\left\| \omega_{t\to 0}[i,j] - \omega_{t\to 1}[i,j] \right\|},   (2.5)

where \omega_{t\to 0} and \omega_{t\to 1} are flow fields extracted from the consecutive frames I0, It, and I1, and \Omega denotes the set of (i, j) pixel coordinates where both flows have norms greater than a threshold of 2.0 and point to pixels within the image. RRLD takes as input flow fields from the middle frame It to the end frames, and assumes they are correct.

The numerator of Eq. 2.5 represents the distance from pixel (i, j) to the midpoint between the two destination pixels, while the denominator describes the total distance between the destination pixels. In other words, the interior of the summand is half the ratio between the diagonals of the parallelogram formed by the two flow vectors; this measures the relative distance from the actual to the ideal halfway point.

Figure 2.6: Line and detail preservation. (a) AnimeInterp prediction; (b) our full model (SSL+DTM); (c) ground truth; (middle) extracted DoG lines; (bottom) normalized Euclidean distance transform. AnimeInterp blurs lines and details that are critical to animation; by focusing on perceptual metrics like LPIPS and chamfer distance (CD), we improve the generation quality.

As the estimated flows are noisy, we average over a restricted set of pixels \Omega. We first remove pixels with displacement close to zero, where a low denominator results in an unrepresentatively high discrepancy measurement. Then, we also filter out pixels with flows pointing outside the image, which are often poor estimates. The final RRLD gives a rough measure of deviation from the halfway-frame assumption, for which we may define a cutoff (0.3 in this work).

One caveat to this method is that pans must be discarded. In some cases, a non-linear animation may be composited onto a panning background; RRLD would then include the linearly-moving background in \Omega, lowering the overall measurement despite having a nonlinear region of interest. We simply remove triplets with large \Omega, high average flow magnitude, and low flow variance. It is possible to reintroduce panning effects through data augmentation if needed, though we did not for our training.

Another important point is that even though animators may draw at framerates like 12 or 8, the final raw input videos are still at 24fps. Thus, many consecutive triplets in actuality contain two duplicates, which leads to RRLD values around 0.5; had the duplicate been removed, an adjacent frame outside the triplet may have had a qualifying RRLD. In order to maximize the data yield, we also train a simple duplicate frame detector, using linear regression over the mean and maximum L2 LAB color difference between consecutive frames.
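A minimal NumPy sketch of Eq. 2.5 follows; it assumes (H, W, 2) flow arrays from the middle frame to each end frame, with channel order (x, y), which is our own convention for illustration.

```python
import numpy as np

def rrld(flow_t0, flow_t1, min_norm=2.0):
    """Restricted Relative Linear Discrepancy (Eq. 2.5) for one triplet."""
    H, W, _ = flow_t0.shape
    ys, xs = np.mgrid[0:H, 0:W]

    n0 = np.linalg.norm(flow_t0, axis=-1)
    n1 = np.linalg.norm(flow_t1, axis=-1)
    omega = (n0 > min_norm) & (n1 > min_norm)       # drop near-zero displacements
    for flow in (flow_t0, flow_t1):                 # drop flows pointing off-image
        dx, dy = xs + flow[..., 0], ys + flow[..., 1]
        omega &= (dx >= 0) & (dx < W) & (dy >= 0) & (dy < H)

    num = np.linalg.norm(flow_t0 + flow_t1, axis=-1) / 2.0   # distance to destination midpoint
    den = np.linalg.norm(flow_t0 - flow_t1, axis=-1)         # distance between destinations
    return float(np.mean(num[omega] / den[omega]))

# A triplet qualifies for training if rrld(...) falls under the 0.3 cutoff.
```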
User Study & Quality Metrics

We perform a user study in order to evaluate our system and explore the relationship between metrics and perceived quality. To get a representative subset of the ATD test set, on which we perform all evaluations, we select 323 random samples in accordance with Fischer's sample size formula (with population 2000, margin of error 5%, and confidence level 95%). For each sample triplet, users were given a pair of animations playing back and forth at 2fps, cropped to the region-of-interest annotation provided by ATD. The middle frame of each animation was a result generated either by our best model (on LPIPS), or by the pretrained AnimeInterp [106]. Participants were asked to pick which animation had: clearer/sharper lines, more consistent shapes/colors, and better overall quality. Complete survey results, including several random animation pairs compared, are available in the supplementary.

Our main metric of interest is LPIPS [138], a general measure of perceived image quality based on deep image classification features. We are interested in understanding its applicability to non-photorealistic domains like ours, especially in comparison with the PSNR/SSIM used in prior work [106]. We additionally consider the chamfer distance (CD) between lines extracted from the ground truth vs. the prediction. The chamfer metric is typically used in 3D work, where the distance between two point clouds is calculated by averaging the shortest distances from each point of one cloud to a point on the other. In the context of binary line drawings extracted from our data using DoG (Eq. 2.3), the 3D points are replaced by all 2D pixels that lie on lines. As chamfer distance would intuitively measure how far lines are from each other in different images, we explore the importance of this metric for our domain with images based on line drawings. Please see Fig. 2.6 for examples of CD evaluation. In this work, we define chamfer distance as:

\mathrm{CD}(X_0, X_1) = \frac{1}{2HWD} \sum \big[ X_0\,\mathrm{DT}(X_1) + X_1\,\mathrm{DT}(X_0) \big],   (2.6)

where X are binary sketches with 1 on lines and 0 elsewhere, DT denotes the Euclidean distance transform, the summation is pixel-wise, and HWD is the product of height, width, and diameter. We normalize by both area and diameter to enforce invariance to image scale. Note that our definition is symmetric with respect to prediction and ground truth, zero if and only if they are equal, and non-negative. Also observe that as neither DoG binarization nor DT is differentiable, CD cannot be optimized directly by gradient descent training; it is thus used for evaluation only.

Table 2.1: Comparison with baselines. Our full proposed method achieves the best perceptual performance, followed by AnimeInterp [106]. We show in our user study (Sec. 2.4) that LPIPS/CD are better indicators of quality than the PSNR/SSIM focused on in previous work; we list them here for completeness. Models from prior work are fine-tuned on LPIPS for fairer comparison. Best values are underlined, runner-ups italicized; LPIPS is scaled by 1e2, CD by 1e5.

                        ------------ All ------------    --- Eastern ---    --- Western ---
Model                   LPIPS    CD      PSNR    SSIM    LPIPS    CD        LPIPS    CD
DAIN [7]                4.695    5.288   28.840  95.28   5.499    6.537     4.204    4.524
DAIN ft. [7]            4.137    4.851   29.040  95.27   4.734    5.888     3.771    4.217
RIFE [48]               4.451    5.488   28.515  95.14   4.933    6.618     4.156    4.796
RIFE ft. [48]           4.233    5.411   27.977  93.70   4.788    6.643     3.894    4.658
ABME [88]               5.731    7.244   29.177  95.54   7.000    10.010    4.955    5.552
ABME ft. [88]           4.208    4.981   29.060  95.19   4.987    6.092     3.732    4.302
AnimeInterp [106]       5.059    5.564   29.675  95.84   5.824    7.017     4.590    4.674
AnimeInterp ft. [106]   3.757    4.513   28.962  95.02   4.113    5.286     3.540    4.039
Ours                    3.494    4.350   29.293  95.15   3.826    4.979     3.291    3.966
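A sketch of Eq. 2.6 using SciPy's Euclidean distance transform is given below; we take the image "diameter" D to be the diagonal length, which is an assumption on our part.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_distance(x0, x1):
    """Chamfer distance (Eq. 2.6) between two binary sketches (True on line pixels)."""
    h, w = x0.shape
    diam = np.hypot(h, w)                  # image diameter, assumed to be the diagonal
    dt0 = distance_transform_edt(~x0)      # DT(X0): distance to the nearest line of X0
    dt1 = distance_transform_edt(~x1)
    return float(np.sum(x0 * dt1 + x1 * dt0) / (2.0 * h * w * diam))
```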
Table 2.2: Ablations of proposed methods. Firstly, each component of SSL contributes to performance (especially infilling). Secondly, new data filtered naively hurts performance, while new RRLD-filtered data helps. Lastly, DTM improvement is due to auxiliary supervision, not just increased parameter count. AnimeInterp ft. is copied from Tab. 2.1 for comparison; the last row here and in Tab. 2.1 are equivalent. Best values are underlined, runner-ups italicized; LPIPS is scaled by 1e2, CD by 1e5.

                                      ----- All -----    --- Eastern ---    --- Western ---
Model                    Data         LPIPS    CD        LPIPS    CD        LPIPS    CD
AnimeInterp ft. [106]    ATD          3.757    4.513     4.113    5.286     3.540    4.039
SSL (no flow infill)     ATD          3.648    4.496     4.026    5.160     3.416    4.089
SSL (no U-net synth.)    ATD          3.614    4.579     3.982    5.288     3.389    4.146
SSL (no ResNet extr.)    ATD          3.605    4.739     3.957    5.429     3.391    4.317
SSL                      ATD          3.586    4.572     3.940    5.248     3.369    4.158
SSL                      ATD+naive    3.702    4.811     3.997    5.033     3.521    4.675
SSL                      ATD+RRLD     3.535    4.431     3.873    5.089     3.329    4.028
SSL+DTM (no Ldt)         ATD+RRLD     3.531    4.430     3.865    4.995     3.327    4.085
SSL+DTM                  ATD+RRLD     3.494    4.350     3.826    4.979     3.291    3.966

2.4 Experiments & Discussion

We implement our system in PyTorch [90] wrapped in Lightning [3], with Kornia [96]. Our model uses the same RFR/RAFT with SGM flows as AnimeInterp for fairer comparison [106, 112], and forward splatting is done with the official Softsplat [83] module. We train with the Adam [62] optimizer at learning rate \alpha = 0.001 for 50 epochs, and accumulate gradients for an effective batch size of 32. Our code uses the official LPIPS [138] package, with the AlexNet [63] backbone. All training minimizes the total loss \mathcal{L} = \lambda_{lpips}\mathcal{L}_{lpips} + \lambda_{dt}\mathcal{L}_{dt}, where \lambda_{lpips} = 30; depending on whether DTM is trained, \lambda_{dt} is either 0 or 5. Evaluations are run over the 2000-sample test set from AnimeInterp’s ATD12k dataset; however, we only train on a random 9k of the remaining 10k in ATD, so that we can designate 1k for validation. Similar to Li et al. [106], we randomly perform horizontal flips and frame order reversal augmentations during training. We use single-node training with at most 4x GTX1080Ti at a time, with mixed precision where possible. All models are trained and tested at 540x960 resolution.

Table 2.3: User study results. For each of the visual criteria we asked the users to judge (rows), we list the percentage of instances where users preferred the animation with a better metric score (columns). Values above 50% indicate agreement between the queried criteria and the metric score difference, and values under 50% indicate contradiction. “Pref. Ours” means the percent of users preferring our output to AnimeInterp [106] for that criterion.

                                Prefer    Lower     Lower     Higher    Higher
Criteria                        Ours      LPIPS     CD        PSNR      SSIM
cleaner/sharper lines           86.01%    86.56%    78.20%    18.95%    15.48%
more consistent shape/color     78.82%    79.26%    73.99%    25.02%    22.66%
better overall quality          81.11%    81.55%    75.67%    22.97%    19.88%

We wrote a custom CUDA implementation of the distance transform and chamfer distance using CuPy [86] that achieves upwards of 3000x speedup over the SciPy CPU implementation [117]; the algorithm is a simplified version of Felzenszwalb et al. [34], where we calculate the minimum of the lower envelope through brute-force iteration. While more efficient GPU algorithms are known [16], we found our implementation sufficient.
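To illustrate the brute-force lower-envelope idea (not the actual CUDA kernel, which is not reproduced here), a one-dimensional pass of the squared distance transform can be written with CuPy as follows; the full 2D transform applies such a pass along rows and then along columns, following Felzenszwalb et al. [34].

```python
import cupy as cp

def sq_edt_1d(f):
    """Squared 1D distance transform by brute-force minimization over all parabolas.

    f: (n,) costs, 0 at line pixels and a large value elsewhere.
    Returns D[p] = min_q ((p - q)^2 + f[q]) for every position p.
    """
    n = f.shape[0]
    q = cp.arange(n, dtype=f.dtype)
    p = q[:, None]
    return cp.min((p - q[None, :]) ** 2 + f[None, :], axis=1)
```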
RRLD Data Collection

As RRLD was designed to replicate the manual selection of training data, we applied RRLD to AnimeInterp’s ATD dataset [106] and achieved 95.3% recall (i.e., RRLD rejected less than 5% of the human-collected data); as the negative samples from the ATD collection process are not available, it is not possible to calculate RRLD’s precision on ATD. Additionally, we study the effect of flow estimation on RRLD, finding that filtering with FlowNet2 [50] and RFR flows [106] returns very similar results (0.877 Cohen’s kappa tested over 34,128 triplets).

We use our automatic pipeline to collect additional training triplets. We source data from 14 franchises in the eastern “anime” style, with premiere dates ranging from 1989-2020, totaling 239 episodes (roughly 95hrs, 8.24M frames at 24fps); please refer to our supplementary materials for the full list of sources. Here, RRLD was calculated using FlowNet2 [50], as inference was faster than with RFR [106]. While RRLD filtering presents us with 543.6k viable triplets, we only select one random triplet per cut to promote diversity; the cut detection was performed with a pretrained TransNet v2 [108]. This cuts the eligible samples down to 49.7k. For the demonstrative purposes of this work, we do not train on the full new dataset, and instead limit ourselves to doubling the ATD training set by randomly selecting 9k qualifying triplets. Please see Fig. 2.5 for examples of accepted and rejected triplets from franchises set aside for validation.

While we cannot release the new data collected in this work, our specific sources are listed in the supplementary and our RRLD data collection pipeline will be made public; this allows follow-up work to either recreate our dataset or assemble their own datasets directly from source animations.
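Putting the pieces together, the overall collection pipeline can be sketched as below. The `detect_cuts`, `is_duplicate`, and `estimate_flow` arguments are hypothetical stand-ins for the TransNet v2 cut detector, the LAB-difference duplicate detector, and the FlowNet2 flow estimator; the pan filter described earlier is omitted for brevity, and `rrld` refers to the earlier sketch.

```python
import random

RRLD_CUTOFF = 0.3

def collect_triplets(frames, detect_cuts, is_duplicate, estimate_flow):
    """Sketch of automatic triplet collection from one raw 24fps episode."""
    accepted = []
    for start, end in detect_cuts(frames):                # (start, end) frame indices per cut
        cut = frames[start:end]
        # Drop 24fps duplicates so adjacent drawn frames become triplet candidates.
        deduped = [f for i, f in enumerate(cut)
                   if i == 0 or not is_duplicate(cut[i - 1], f)]
        candidates = []
        for i in range(1, len(deduped) - 1):
            I0, It, I1 = deduped[i - 1], deduped[i], deduped[i + 1]
            score = rrld(estimate_flow(It, I0), estimate_flow(It, I1))
            if score < RRLD_CUTOFF:
                candidates.append((I0, It, I1))
        if candidates:                                    # one random triplet per cut
            accepted.append(random.choice(candidates))
    return accepted
```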
Comparison with Baselines

The main focus of our work is to improve perceptual quality, namely LPIPS and chamfer distance (as validated later by our user study results). We gather four existing frame interpolation systems (ABME [88], RIFE [48], DAIN [7], and AnimeInterp [106]) for comparison to our full model incorporating all our proposed methods. For a fairer comparison, as other models may not have been trained on the same LPIPS objective or on animation data, we fine-tune their given pre-trained models with LPIPS on the ATD training set. As we can see from Tab. 2.1, our full proposed method achieves the best perceptual performance, followed by AnimeInterp. To provide more complete information on trainable parameters, our model has 1.28M (million) compared to: AnimeInterp 2.01M, RIFE 13.0M, ABME 17.5M, DAIN 24.0M. Breaking down further, our model consists of 1.266M for SSL and 0.011M for DTM.

Ablation Studies

We perform several ablations in Tab. 2.2. In the first group, each of the modifications to Softsplat [83] (frozen ResNet [44] feature extractor, infilling, U-net [98] replacing GridNet [36]) contributes to SSL outperforming AnimeInterp [106]. The infilling technique improves performance the most.

In the second group of Tab. 2.2, we ablate the addition of new data filtered by RRLD (Sec. 2.4). Training with RRLD-filtered data improves generalization as expected. To demonstrate the necessity of RRLD’s specific filtering strategy, we train with an alternative dataset of equal size gathered from the same sources, but using a “naive” filtering approach. For simplicity, we directly follow the crude filter used in creating ATD [106]: the SSIM between any two frames of a triplet must lie within [0.75, 0.95]. We see that this naively-collected data actively damages model performance, validating the use of our proposed RRLD filter.

Splitting by eastern vs. western style, we clarify the distribution shift between sub-domains. Note that our new data is all anime, whereas 62.05% of the ATD test set is in the western “Disney” style. From the LPIPS results, the eastern style is more difficult; adding eastern-only RRLD data unexpectedly has less of an effect on eastern testing than on western. This may be because western productions tend to prioritize fluid motion (smaller displacements) over complex character designs (more details), contrary to the eastern style.

In the last group of Tab. 2.2, we train SoftsplatLite with DTM, but ablate the effect of additionally optimizing for Ldt; this way, we may see whether auxiliary supervision of NEDT improves performance under the same parameter count. Note that the upper yellow convnet of Fig. 4.1b receives no gradients in the ablation, effectively remaining at its random initialization. The results show that the prediction of line proximity information indeed contributes to performance.

User Study Results

We summarize the user study results in Tab. 2.3, and provide the full breakdown with sample animations in the supplementary. Our study had 5 participants, meaning each entry of Tab. 2.3 has support 1615 (323 compared pairs per participant). We confirm the observations made by Niklaus et al. and Blau et al. [10] that PSNR/SSIM and perceptual metrics may be at odds with one another. Despite lower PSNR/SSIM scores, users consistently preferred our outputs to those of AnimeInterp. A possible explanation is that, due to animations having larger displacements, the middle ground truth frames may be quite displaced from the ideal halfway interpolation. SSIM, as noted by previous work [138, 100], was not designed to assess these geometric distortions. Color metrics like PSNR and L1 may penalize heavily for this perceptually minor difference, encouraging the model to reduce risk by blurring; this is consistent with the behavior exhibited by the original AnimeInterp trained on L1 (Fig. 2.6). LPIPS, on the other hand, has a larger receptive field due to convolutions, and may be more forgiving of these instances. This study provides another example of the perception-distortion tradeoff [10], and establishes its transferability to 2D animation.

The user study also shows an imperfect match between LPIPS and CD. This mismatch is also reflected in Tables 2.1 and 2.2, where aggregate decreases in LPIPS do not always correspond to reduced CD. This may be because CD reflects only the line structures of an image. However, Tab. 2.3 shows that LPIPS is unexpectedly more predictive of line quality. A possible explanation is that CD is still more sensitive to offsets than LPIPS; in fact, CD grows roughly proportionally to displacement for line drawings. Thus, it may suffer the same problems as PSNR but to a lesser extent, as PSNR would penalize across an entire displaced area as opposed to across a thin line.

2.5 Limitations & Conclusion

Our system still has several limitations. By design, our model can only interpolate linearly between two frames, while real animations have non-linear movements that follow arcs across long sequences. In future work, we may incorporate non-linearity from methods like QVI [127], or allow user input from an artist.
Additionally, we are limited to colored frames, which are typically unavailable until the later stages of animation production; following related work [82], we may expand our scope to work on line drawings directly.

To summarize, we identify and overcome shortcomings of previous work [106] on 2D animation interpolation, and achieve state-of-the-art perceptual quality for interpolation. Our contributions include an effective SoftsplatLite architecture modified to improve perceptual performance, a Distance Transform Module leveraging domain knowledge of lines to perform refinement, and a Restricted Relative Linear Discrepancy metric that allows automatic training data collection from raw animation. We validate our focus on perceptual quality through a user study, hopefully inspiring future work to maintain this emphasis for the traditional 2D animation domain.

Chapter 3: Match-free Inbetweening Assistant (MIBA): A Practical Animation Tool without User Stroke Correspondence

In traditional 2D frame-by-frame animation, inbetweening (interpolating line drawings, abbr. “IB”) is still a manual and labor-intensive task. Despite the abundance of literature and software offering automation and claiming speedups, animators and the industry as a whole have been hesitant to adopt these new tools. Upon inspection, we find prior work often unreasonably expects adoption of novel stroke-matching workflows, naively assumes access to adequate centerline vectorization, and lacks rigorous evaluation with professional users on real production data. Facing these challenges, we leverage optical flow estimation and differentiable vector graphics to design a “Match-free Inbetweening Assistant” (MIBA). Unrestricted by the need for user stroke correspondence, MIBA integrates into the existing IB workflow without introducing additional requirements, and makes the raster input case feasible thanks to its robustness to vectorization quality. MIBA’s simplicity and effectiveness are demonstrated in our comprehensive user study, where users with professional IB experience achieved a 4.2x average speedup and better chamfer distance scores on real-world production data, given only a 5-minute tutorial of the new functionality.

Figure 3.1: Our MIBA system assists the inbetweening (IB) task. In traditional 2D frame-by-frame animation, animators often draw “pose-to-pose”, finishing “cleaned-up keyframes” (CU) at critical poses of a sequence first, before completing the “inbetweens” (IB) that spatio-temporally interpolate based on a specified “timing grid”. IB is notoriously labor-intensive and is still drawn manually in many productions, since existing automatic methods often fail on raster inputs and are incompatible with artist workflows. Leveraging state-of-the-art optical flow [112] and differentiable vector graphics [69], our MIBA system works robustly on scanned-in raster inputs, and integrates into the existing workflow so well that it can be learned in 5 minutes by experienced animators in the industry. Tested by professional inbetweeners on real production data, MIBA sped up IB drawing by 4.2x on average during our user study.

3.1 Introduction

In traditional 2D frame-by-frame animation, animators often draw “pose-to-pose”: first completing “cleaned-up keyframes” (or “CU”) at several critical poses of a sequence, before spatially interpolating them with “inbetweens” (or “IB”) at intervals specified by a “timing grid” (Fig. 3.1). CU and IB drawings (a.k.a.
“douga” in Japanese) can be drawn as vectors or rasters, but they are typically saved as aliased rasters to then be bucket-filled by digital ink and paint staff (DIP). The IB process demands precise linework and is notoriously time-consuming, taking anywhere between 5-40 minutes per frame depending on the difficulty. Across a single 24-minute episode at 12 frames per second, thousands of such inbetweens are drawn by hand; this scales to tens of thousands of drawings for seasonal shows and feature films.

The academic community has proposed many systems for offering computational IB assistance, typically in the form of matching vector strokes across keyframes to interpolate control points [123, 131, 54, 18, 133, 82, 132, 26]. But despite the abundance of work, there is still little adoption of automatic methods in industry. While there are many artistic, economic, and cultural factors contributing to this lack of progress, we observe that the research community has repeatedly overlooked major design considerations that have significant impact on practical usability and system evaluation. Specifically:

Consideration 1: Artists should not be expected to alter their workflow. It is prohibitively expensive for artists to make major changes to their established way of doing things, and invest time and effort into learning the intricacies of new tooling. Yet, the systems proposed in the literature consistently introduce additional requirements and functions that burden the animator (see summary in Table 3.1). By learning the existing workflow from professionals, we designed MIBA to automate specific steps without intrusively adding new ones (Fig. 3.2).

Consideration 2: Animators do not have vector keyframe inputs adequate for prior vector-based methods. Prior work has mostly focused on interpolation between matched vector strokes, but implicitly assumes properly-connected, noise-free, and/or nearly-identical vector topology on both input keyframes (Tab. 3.1). Unfortunately, we find that these requirements are satisfied by neither off-the-shelf vectorizations nor by artist-drawn vectors (Fig. 3.5, Fig. 3.6). MIBA on the other hand is robust to input vectorization quality, and so can feasibly handle the raster input case.

Consideration 3: Evaluation should be precise and representative of real-world animation production. Relevant IB user studies are few; most have participants without professional IB experience, tend to evaluate qualitatively on simple animations, lack details about procedures and training, and neglect to report key metrics like speedup and evaluation against ground truth (see summary in Tab. 3.3). We evaluate MIBA with a comprehensive user study addressing all these points.

Figure 3.2: MIBA non-intrusively assists the existing workflow. Our proposed system (bottom) seamlessly integrates into the existing workflow as described by professionals (top). Laying down all predicted lines with a simple click advances users to the “check+fix” stage of their familiar workflow, characterized by zooming globally/locally to find/fix errors and missing lines. By replacing the time-consuming step of zooming in to lay lines, we save significant effort without deviating from established practices.

Taking into account these three key considerations, we make the following contributions:
• MIBA, a “Match-free Inbetweening Assistant” leveraging deep optical flow and differentiable vector graphics [112, 69]. Unrestricted by the need for user stroke correspondence, MIBA seamlessly integrates into the existing IB workflow, and works well on raster input thanks to its robustness to vectorization quality.

• A comprehensive user study evaluating our MIBA system, in which users with professional IB experience achieved both a 4.2x average task speedup and better chamfer distance scores w.r.t. ground truth on real-world production data, given only a 5-minute tutorial of MIBA’s functionality.

Table 3.1: Comparison of new requirements and tooling in prior work. It is costly for animators to make major changes to their established workflows, and invest time and effort into learning new tooling. Yet, systems proposed in academia consistently introduce additional requirements and functions that burden the animator. Our MIBA framework on the other hand is designed to accelerate IB with a simple button click.

Additional requirements / tools to learn, per system:
MIBA (ours): none (raster input, no vector requirements); new assistant operated by button click
BetweenIT [123]: requires adequate vector input for 1-to-1 stroke correspondence (Fig. 3.5); lasso tool for stroke correspondence; point tool for vector correspondence; path tool for trajectory guidelines
VGC [26]: requires learning completely new time-vector topology data primitives
DiLight [18]: requires well-connected vectors (Fig. 3.6); path tool for stroke correspondence; must set stroke matching tolerance parameter
FTP-SC [133]: requires adequate vector input for 1-to-1 stroke correspondence (Fig. 3.5); matching tool for stroke correspondence
Narita et al. [82]: cannot generate vector output, only rasters are supported (Fig. 3.4)
CACANI [54]: requires adequate vector input for 1-to-1 stroke correspondence (Fig. 3.5); requires stroke depth layering; requires stroke occlusion orientation; new grouping, linking, inverting tools
AnimeInbet [107]: requires strictly straight-line vectors (does not support curved lines); requires well-connected and densely-sampled vectors (Fig. 3.6)

3.2 Related Work

Prior vector-based inbetweening works first and foremost address the problem of vector stroke correspondence [123, 18, 133]. They assume two vector keyframe inputs with extremely similar and well-connected topologies, between which they match strokes. Based on the discovered 1-to-1 correspondence, control points are interpolated to derive the target IB vector representation. BetweenIT [123] and FTP-SC [133] provide a number of semi-automatic user tools to achieve this exact vector match, while DiLight [18] proposes guideline tools and loosens the strict correspondence requirement. More recently, AnimeInbet [107] proposed a method of fusing the two graph topologies. However, we often see that neither off-the-shelf vectorizers [122, 9, 79] nor users provide adequate vectorization quality for these methods (Fig. 3.5 & 3.6). In order to break free of the limitations imposed by vectorization, our match-free methodology completely discards the need to correspond disparate vectors, achieving a simple-to-use tool robust enough to work on vectorizations of raster scan-ins.

Stroke occlusion resolution and aesthetic non-linear curve interpolation are other aspects of IB work. CACANI [54], for example, proposes a layered and oriented representation of strokes, and is able to infer occlusions automatically. BetweenIT [123], FTP-SC [133], and DiLight [18] all propose their own methods for allowing user-specified non-linear trajectories.
In our work on MIBA, however, we have found that simple linear interpolation works well for real-world production IB data, and that improper occlusion is relatively quick for users to fix. We thus focus on removing the stroke correspondence paradigm; however, as the occlusion and interpolation techniques are orthogonal to our match-free contribution, they can be added on top of our workflow if desired.

Another very different approach to IB is taken by Narita et al. [82], who discard the vector representation altogether in favor of rasters. They propose the direct use of optical flow to warp between keyframes (similar to video frame interpolation systems commonly used for natural RGB videos [23, 7, 48]), and improve flow estimation on sparse line drawings with the distance transform. While this approach indeed relieves the user of vector considerations, the results are heavily dependent on optical flow performance. As failure cases are common for animations with large displacements and sparse lines, the user would need to edit an inflexible raster representation (Fig. 3.4). Our MIBA, on the other hand, tackles vector representation issues without resorting to rasters.

Yet another alternative is proposed by Dalstein et al. [26], who introduce a novel “vector animation complex” data structure to manage interpolation of topologies across both space and time. However, the new representation is a departure from the conventional vector graphics paradigm, and prohibitively requires the artist to learn a new framework of thought in order to inbetween.

As summarized in Tab. 3.1 and illustrated in Fig. 3.2, our MIBA system distinguishes itself from prior work by not requiring significant changes to animator tooling or workflow. Additionally, we provide one of the most comprehensive user studies for IB in the literature to evaluate MIBA (Tab. 3.3); we report metrics with respect to ground-truth production data from the real world, provide explicit details on speed and interaction gains, and test with professional IB animators.

Figure 3.3: Schematic of our proposed system. Given the left two raster keyframes (I0, I1) and the interpolating position on the timing grid (t = 0.5), our system produces a vector inbetween (V̂t, right) that can then be adjusted as needed, without requiring the user to specify stroke correspondences between vectorized versions of the input frames. The key to achieving this match-free framework lies in our ability to robustly warp and align the vectorization of the first frame (V0, top) to a raster of the second frame (I1, bottom). This way, we obtain topologically identical vectors (red) aligned to respective frames, ready for interpolation. The system leverages optical flow estimation and differentiable vector graphics to produce reasonable stroke-raster alignment.

3.3 Methodology

Our system operates on a single timing grid sequence at a time (Fig. 3.1). As illustrated in Fig. 3.2, MIBA lays down predictions of where the vector lines of the IB should be placed. Our system uses a match-free method to return a vector IB drawing for each interval specified by the timing grid. The user may then use common vector editing tools, or choose to rasterize at any time and use common raster tools. Below, we define the input and output representations more formally, before describing our algorithm and the provided user interaction tools.

3.3.1 Input/Output Representations

The inputs to the MIBA system are a timing grid T and two cleaned keyframes (Fig. 3.3, left-most side).
The timing grid T is a sorted array of unique values t_i ∈ [0, 1], whose first and last entries are 0 and 1 respectively. The two CU keyframes are assumed to be rasters (I0, I1 in Fig. 3.3). In the case where a vector representation is available, we still rasterize before inputting to our system; this ensures consistency of outputs, and relieves any mental burden on the animators to consider line topology when drawing CU. Similar to the IB to be generated, the CU keyframes are binary-aliased for eventual digital coloring with paint-bucket tools. The CU often have only a few colors aside from black; by default in the Japanese pipeline, red denotes highlights, blue is shadow, and green is a special effect or second shadow/highlight.

The output representation of MIBA is a vector graphics representation (V̂t in Fig. 3.3), which can be rasterized if desired. We use cubic Bézier curves, with a polar representation for control points. The representation backend supports arbitrary graph connectivity, though for our experiments we only work with acyclic graphs, similar to SVGs. In practice, our system renders curves as short piecewise-linear line segments; we thus support variable line width across a single curve, although we found that a single global thickness worked well enough for our specific data.
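To make the representations concrete, a minimal sketch of plausible container types is given below; these are illustrative assumptions rather than MIBA's actual data structures, which are not specified beyond the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BezierStroke:
    """One cubic Bezier curve; handles are stored in polar form relative to their endpoints."""
    p0: Tuple[float, float]        # start vertex (x, y)
    p3: Tuple[float, float]        # end vertex (x, y)
    h0: Tuple[float, float]        # outgoing handle at p0 as (angle, radius)
    h1: Tuple[float, float]        # incoming handle at p3 as (angle, radius)
    width: float = 1.0             # a single global thickness sufficed in practice

@dataclass
class VectorFrame:
    strokes: List[BezierStroke] = field(default_factory=list)
    edges: List[Tuple[int, int]] = field(default_factory=list)   # endpoint adjacency; acyclic here

timing_grid = [0.0, 0.5, 1.0]      # one inbetween requested at the halfway position
```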
3.3.2 Match-free Inbetweening

Our proposed match-free inbetweening method is illustrated in Fig. 3.3. Similar to previously proposed frameworks, each curve of the output is derived by interpolating between the vertex/handle coordinates of two existing curves (V0 and V1, one representing each vectorized input image). However, to fit this paradigm, prior work struggles at finding the two correct corresponding strokes to inbetween from each input, because the input vectorizations have either unreasonably different topology (Fig. 3.5) or inadequately noisy connectivity (Fig. 3.6).

The key insight of our MIBA method is that we can warp one frame’s vector (V0) to align with the other’s raster (I1). This way we do not need to match two disparate vectorizations, and instead can directly interpolate between two coordinate-value configurations of the same vector topology (Fig. 3.3, red). Put another way, we find offsets from one vector to make it look like the other raster; interpolation equates to moving by a specified fraction of those offsets. Naturally, since no matching occurs between different vectorizations, MIBA is robust to the connectivity and noisiness of the input vectors; by discarding the stroke correspondence paradigm, we make IB assistance on vectorized scan-ins a feasible reality (Fig. 3.6). In addition, animators can easily operate MIBA without the intricate stroke correspondence and match guidance tooling required by prior work (Tab. 3.1).

From the perspective of solving the stroke occlusion problem, MIBA intrinsically resolves occlusion at the warp and align stages. At these steps, strokes are effectively occluded by shortening curves to the occlusion boundary, in order to satisfy alignment to the opposing raster frame. Note that this occurs independently of local topology at the junction, as the optimization condition is imposed directly on a raster render of the stroke graph. While we found that improper occlusions may still appear from imperfect alignments, animators are able to fix them relatively quickly.

Note that MIBA is asymmetric with respect to the input frame order; in other words, the system output for interpolating t = 0.5 will be different if the two input CU keyframes are swapped. Users can choose which direction to run MIBA in, although in practice we find many users simply stick with the default forward direction.

The subsections below describe in more detail the steps outlined in Fig. 3.3, and how we leveraged state-of-the-art optical flow and differentiable vector graphics to achieve this non-trivial vector-raster alignment across frames.

Vectorization with Alignment Post-processing

As previously mentioned, MIBA only works with a single vector topology that does not need to be matched; the method is thus robust to the input vector representation quality. In Fig. 3.6, we demonstrate our system with Weber AutoTrace [122], PolyVectorization [9] (simplified with [101]), and Virtual Sketching [79]; in all three cases, our system was able to deliver similarly reasonable results. We default to using AutoTrace in our program for its quick processing speed.

Across the different vectorization algorithms, we inevitably found imperfections in the linework that did not match the input raster; usually, this came in the form of “wobbly” curves. To improve the base vectorizations (V'_0), we optimize the vector image to match the input raster (I0) using DiffVG [69] by minimizing

V_0 = \arg\min_{\theta} \; L_2\big(R(\theta),\, I_0\big),   (3.1)

where \theta represents the vertices and Bézier control points initialized at V'_0, and R denotes the DiffVG differentiable vector rendering operation. The optimization is run for 32 iterations, with Adam [62] on an MSE image loss with unit learning rate. To facilitate the alignment of initially distant lines, we additionally apply a Gaussian blur to the differentiable render before loss evaluation for the first half of the optimization (with a ramping sigma).

Vector Warping & Alignment

The goal of warping here is to roughly position the post-processed vectorization (V0) over the other raster frame (I1), to initialize the more expensive alignment optimizations. We estimate the optical flow (F) between the two raster inputs using an off-the-shelf RAFT model [112]; inspired by previous work on raster inbetweening of line art [82], we preprocess the sparse line drawings into dense distance transform images to improve the flow estimation.

To perform the warp, we simply offset each vertex by the flow sampled at its image coordinates, denoted as V'_1 = V_0 + F[V_0]. Even without modifying the polar Bézier handle values, we found that the warp gave results that were reasonable, but not yet acceptable for interpolation without further alignment.

With the vectorization of the first frame roughly warped to the second raster, we remove remaining alignment imperfections by DiffVG optimization [69]. The optimization process is the same as the vector post-processing previously described in Sec. 3.3.2, but \theta is initialized at V'_1 and the raster target is instead I1:

V_1 = \arg\min_{\theta} \; L_2\big(R(\theta),\, I_1\big).   (3.2)

Note that there may still be imperfect alignments, since the inherent topology of the two frames is often different (Fig. 3.4, 3.5, 3.6). However, we find that in many cases these incompatibilities can be quickly fixed by the animator after interpolation, still at a time discount compared to manually laying down all the lines.

Vector Interpolation

Despite the emphasis that prior work puts on interpolating aesthetically between two curves [123, 133, 18], we find that a simple linear interpolation of vertex coordinates and polar Bézier handle values is sufficient to satisfy professional IB animators.
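A consolidated sketch of the warp, align, and interpolate steps is given below. The `render` argument stands in for DiffVG's differentiable rasterizer and is not implemented here; for brevity, θ covers only vertex positions (the full system also optimizes Bézier handles), and the flow channel order, blur kernel size, and sigma schedule are our own assumptions.

```python
import torch
import torch.nn.functional as F
from kornia.filters import gaussian_blur2d

def warp_and_align(V0, flow, I1, render, iters=32):
    """Warp vertices V0 (N, 2) by the flow (2, H, W), then align to raster I1 (1, 1, H, W)."""
    # Warp: V'_1 = V0 + F[V0], i.e. offset each vertex by the flow at its pixel location.
    # Flow channels are assumed ordered (dx, dy).
    x = V0[:, 0].long().clamp(0, flow.shape[2] - 1)
    y = V0[:, 1].long().clamp(0, flow.shape[1] - 1)
    V1_init = V0 + flow[:, y, x].T

    # Align: refine vertex positions so the differentiable render matches the target raster.
    theta = V1_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=1.0)           # unit learning rate, 32 iterations
    for i in range(iters):
        opt.zero_grad()
        pred = render(theta)                           # (1, 1, H, W) raster of the stroke graph
        if i < iters // 2:                             # blur the render early on, sigma ramping down
            sigma = 8.0 * (1.0 - 2.0 * i / iters)
            pred = gaussian_blur2d(pred, (9, 9), (sigma, sigma))
        loss = F.mse_loss(pred, I1)
        loss.backward()
        opt.step()
    return theta.detach()

def lerp_inbetween(V0_aligned, V1_aligned, t):
    """Linear interpolation between two coordinate configurations of the same topology."""
    return (1.0 - t) * V0_aligned + t * V1_aligned
```

The post-processing of the initial vectorization (Eq. 3.1) follows the same optimization loop, with I0 as the target and no warp step.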
As the choice of interpolation method here is orthogonal to our contribution of match-free assistance, this step can be freely interchanged with interpolation schemes from other work. For timing grids with more than one IB, we simply cache the alignment offsets and lerp to the new intervals as needed. Once the vectorization of the first frame is appropriately offset to the target IB, the user is free to correct any imperfections of the MIBA output using vanilla vector manipulation tools.

3.3.3 User Interaction

We implemented an interactive web browser app with Vue.js. Our app has basic features typical of modern IB software, including a navigable viewport, a toolbar with options, raster and vector layers, frame/layer selection panels, onion-skinning and frame-flipping, tool keybindings, undo/redo history, etc. Pen/stylus input is supported, and is functionally equivalent to the mouse.

MIBA assistance is implemented as buttons attached to each timing grid sequence the user is asked to IB. The user clicks on the provided “assist” button, which provides a preview of the generated IBs, and then clicks “use” on the previews they would like to keep. This inserts the MIBA output as a vector layer on the frame in question, which the u