ABSTRACT

Title of Dissertation: MACHINE LEARNING FOR ANIME: ILLUSTRATION, ANIMATION, AND 3D CHARACTERS

Shuhong Chen, Doctor of Philosophy, 2024

Dissertation Directed by: Professor Matthias Zwicker, Department of Computer Science

As anime-style content becomes more popular on the global stage, we ask whether new vision/graphics techniques could contribute to the art form. However, the highly-expressive and non-photorealistic nature of anime poses additional challenges not addressed by standard ML models, and much of the existing work in the domain does not align with real artist workflows. In this dissertation, we present several works building foundational 2D/3D infrastructure for ML in anime (including pose estimation, video frame interpolation, and 3D character reconstruction), as well as an interactive tool leveraging novel techniques to assist 2D animators.

MACHINE LEARNING FOR ANIME: ILLUSTRATION, ANIMATION, AND 3D CHARACTERS

by Shuhong Chen

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2024

Advisory Committee:
Professor Matthias Zwicker, Chair/Advisor
Professor Min Wu, Dean's Representative
Professor Abhinav Shrivastava
Professor Leo Zhicheng Liu
Professor Jia-Bin Huang

© Copyright by Shuhong Chen 2024

Preface

Anime is getting more popular in the global entertainment market. However, traditional animation is laborious. To create the expressive motions loved by millions, professional and amateur animators alike face the intrinsic cost of 12 illustrations per second. As the medium rapidly enters the mainstream, the sheer manual line-mileage demanded continues to increase. This raises the question of whether modern data-driven computer vision/graphics methods can offer automation or assist the creative process. While some work exists for colorization, cleanup, in-betweening, etc., we are still missing foundational domain-specific infrastructure. In addition, much of the academic work around problems like in-betweening has only been studied in vitro, without practical considerations for real animators. By studying industry practices, scaling data pipelines, bridging domain gaps, leveraging 3D priors, etc., we developed domain-specific ML infrastructure for anime, and demonstrated ways that modern techniques can assist existing workflows.

In addition, while 3D human priors are crucial for the above animation topic (form and surface anatomy are animator fundamentals), 3D character modeling itself may also benefit from new techniques. As AR/VR apps and virtual creators become more popular, there will soon be major demand for stylized 3D avatars. But current template-based designers are restrictive, with custom assets still requiring expert software to create. Novel representations and rendering techniques may help create realistic 3D humans, but comparatively little has been done to suit the design challenges of non-photorealistic characters. This work also tries to democratize 3D character creation, bringing customizable experiences to the next generation of social interaction.

Dedication

To Nagato Yuki.

Table of Contents

Preface
Dedication
Table of Contents

Chapter 1: Transfer Learning for Pose Estimation of Illustrated Characters
  1.1 Introduction
  1.2 Related Work
  1.3 Method & Architectures
  1.4 Data Collection
  1.5 Experiments
  1.6 Application: Pose-guided Retrieval
  1.7 Conclusion & Future Work

Chapter 2: Improving the Perceptual Quality of 2D Animation Interpolation
  2.1 Introduction
  2.2 Related Work
  2.3 Methodology
  2.4 Experiments & Discussion
  2.5 Limitations & Conclusion

Chapter 3: Match-free Inbetweening Assistant (MIBA): A Practical Animation Tool without User Stroke Correspondence
  3.1 Introduction
  3.2 Related Work
  3.3 Methodology
    3.3.1 Input/Output Representations
    3.3.2 Match-free Inbetweening
    3.3.3 User Interaction
  3.4 User Study Evaluation
    3.4.1 Data
    3.4.2 Participants
    3.4.3 Procedure
    3.4.4 Metrics
    3.4.5 Analysis
  3.5 Limitations and Future Work
  3.6 Conclusions

Chapter 4: Stylized Single-view 3D Reconstruction from Portraits of Anime Characters
  4.1 Introduction & Related Work
  4.2 Methodology
  4.3 Data
  4.4 Results & Evaluation
  4.5 Limitations & Future Work

Chapter 1: Transfer Learning for Pose Estimation of Illustrated Characters

Human pose information is a critical component in many downstream image processing tasks, such as activity recognition and motion tracking. Likewise, a pose estimator for the illustrated character domain would provide a valuable prior for assistive content creation tasks, such as reference pose retrieval and automatic character animation. But while modern data-driven techniques have substantially improved pose estimation performance on natural images, little work has been done for illustrations.
In our work, we bridge this domain gap by effi- ciently transfer-learning from both domain-specific and task-specific source models. Addition- ally, we upgrade and expand an existing illustrated pose estimation dataset, and introduce two new datasets for classification and segmentation subtasks. We then apply the resultant state-of- the-art character pose estimator to solve the novel task of pose-guided illustration retrieval. All data, models, and code will be made publicly available. 1.1 Introduction Human pose estimation is a foundational computer vision task with many real-world appli- cations, such as activity recognition [81], 3D reconstruction [47], motion tracking [102], virtual try-on [30], person re-identification [78], etc. The generic formulation is to find, in a given image containing people, the positions and orientations of body parts; typically, this means locating 1 landmark and joint keypoints on 2D images, or regressing for bone transformations in 3D. The usefulness of pose estimation is not limited to the natural image domain; in particular, we focus on the domain of illustrated characters. As pose-guided motion retargeting of realistic humans rapidly advances [37], there is growing potential for automatic pose-guided animation [43], a traditionally labor-intensive task for both 2D and 3D artists. Pose information may also serve as a valuable prior in illustration colorization [137], keyframe interpolation [106], 3D char- acter reconstruction [13] and rigging [128], etc. With deep computer vision, we have been able to leverage large-scale datasets [70, 4, 116] to train robust estimators of human pose [45, 17, 33]. However, little work has been done to solve pose estimation for illustrated characters. Previous pose estimation work on illustrations by Khungurn et al [56] presented a 2D keypoint detector, but relied on a publicly-unavailable synthetic dataset and an ImageNet-trained backbone. In addition, the dataset they collected for supervision lacked variation, and was missing keypoints and bounding boxes required for evalu- ation under the more modern COCO standard [70]. Facing these challenges, we constructed a 2D keypoint detector with state-of-the-art per- formance on illustrated characters, built upon domain-specific components and efficient transfer learning architectures. We demonstrate the effectiveness of our methods by implementing a novel illustration retrieval system. Summarizing, we contribute: • A state-of-the-art pose estimator for illustrated characters, transfer-learned from both domain- specific and task-specific source models. Despite the absence of synthetic supervision, we outperform previous work by 10-20% PDJ@20 [56]. • An application of our proposed pose estimator to solve the novel task of pose-guided char- 2 acter illustration retrieval. • Datasets for our model and its components, including: an updated COCO-compliant ver- sion of Khungurn et al’s [56] pose dataset with 2x the number of samples and more diverse poses; a novel 1062-class Danbooru [5] tagging rulebook; and a character segmentation dataset 20x larger than those currently available. 1.2 Related Work The Illustration Domain Though there has been work on caricatures and cartoons [15, 91], we focus on anime/manga-style drawings where characters tend to be less abstract. 
While there is work for more traditional problems like lineart cleaning [103] and sketch extraction [66], more recent studies include sketch colorization [137], illustration segmentation [135], painting relighting [136], image-to-image translation with photos [58], and keyframe interpolation [106]. Available models for illustrated tasks typically rely on small manually-collected datasets. For example, the AniSeg [68] character segmenter is trained on less than 1,000 examples. While larger datasets are becoming available (e.g. Danbooru [5] now with 4.2m tagged illustrations), the labels are noisy and long-tailed, leading to poor model performance [6, 60]. Works requiring pose information may use synthetic renders of anime-style 3D models [56, 43], but the models are usually not publicly available. In this work, we present a cleaner tag classification task, a large character segmentation dataset, and an upgraded COCO keypoint dataset; these will all be made available upon publication, and may serve as a valuable prior for other tasks. Transfer Learning & Domain Adaptation Transfer learning and domain adaptation have been defined somewhat inconsistently throughout the vision and natural language processing 3 literature [119, 27], though generally the former is considered broader than the latter. In this paper, we use the terms interchangeably, referring to methods that leverage information from a number of related source domains and tasks, to a specific target domain and task. Typically, much more data is available for the source than the target, motivating us to transfer useful related source knowledge in the absence of sufficient target data [119]. For deep networks, the simplest practice is to pretrain a model on source data, and fine-tune its parameters on target data; however, various techniques have been studied that work with different levels of target data availability. Much of the transfer learning work in vision focuses on extreme cases with significantly limited target domain data, with emphasis around the task of image classification. In the few- shot learning case, we may be given as few as ten (or even one) samples from the target, inviting methods that embed prototypical target data into a space learned through prior source knowledge [121]. In particular, it is common to align parameters of feature extractors across domains, by directly minimizing pairwise feature distances or by adversarial domain discrimination [74, 115]. If the source and target are similar enough, it is possible to perform domain adaptation in the complete absence of labeled target data. This can be achieved by matching statistical properties of extracted features [109], or by converting inputs between domains through cycle-consistent image translation [46]. Pose Estimation With the availability of large-scale human pose datasets [70, 4], the vision community has recently been able to make great strides in pose estimation. A naive baseline was demonstrated by Mask R-CNN [45], which extended their detection and segmentation framework to predict single-pixel masks of joint locations. Other work such as RMPE take an approach tailored to pose estimation, deploying spatial transformer networks with pose-guided NMS and region proposal [33]. Around the same time, OpenPose proposed part affinity fields as a bottom- 4 up alternative to the more common heatmap representation of joints [17]. 
Human pose estimation work continues to make headway, extending beyond keypoint localization to include dense body part labels [42] and 3D pose estimation [51, 64, 80]. Pose Estimation Transfer Most transfer learning for pose estimation adapts from synthetically- rendered data to natural images. For example, by using mocaps and 3D human models, SUR- REAL [116] provides 6 million frames of synthetic video, complete with a variety of datatypes (2D/3D pose, RGB, depth, optical flow, body parts, etc.). CNNs may be able to directly gen- eralize pose from synthesized images [116], and can further close the domain gap using other priors like motion [28]. Outside of synthetic-to-real, Cao et al [65] explore domain adaptation for quadruped animal pose estimation, achieving generalization from human pose through adversar- ial domain discrimination with pseudo-label training. The closest prior work to our topic was done by Khungurn et al [56], who collected a modest AnimeDrawingsDataset (ADD) of 2k character illustrations with joint keypoints, and a larger synthetic dataset of 1 million frames rendered from MikuMikuDance (MMD) 3D models and mocaps. Unfortunately, the MMD dataset is not publicly available, and ADD contains mostly standard forward-facing poses. In addition, ADD is missing bounding boxes and several face keypoints, which are necessary for evaluation under the modern COCO standard [70]. We remedy these issues by training a bounding box detector from our new character segmentation dataset, labeling missing annotations in ADD, and labeling 2k additional samples in more varied poses. Khungurn et al perform transfer from an ImageNet-pretrained GoogLeNet backbone [111] and synthetic MMD data. In the absence of MMD, we instead transfer from a stronger backbone trained on a new illustration-specific classification task, as well as from a task-specific model pretrained on COCO keypoints. We use our subtask models and data to implement a number of 5 transfer techniques, from naive fine-tuning to adversarial domain discrimination. In doing so, we significantly outperform Khungurn et al on their reported metrics by 10-20%. Figure 1.1: A schematic outlining our two transfer learning architectures: feature concatenation, and feature matching. Note that source feature specificity is with respect to the target; i.e. task- specific means “related to pose estimation” and domain-specific means “related to illustrations”. Feature converters and matchers are convolutional networks that learn to mimic or re-appropriate pretrained features, respectively. While both designs require the pretrained Mask R-CNN com- ponents during training, feature matching discards them during inference, instead relying on the trained matcher network. “BCE” refers to binary cross-entropy loss. 1.3 Method & Architectures We provide motivation and architecture details for two variants of our proposed pose esti- mator (feature concatenation and feature matching), as well as two submodules critical for their success (a class-balanced tagger backbone and a character segmentation model). Architectures for baseline comparison models are described in Sec. 1.5. Pose Estimation Transfer Model We present two versions of our final model: feature concatenation, and feature matching. In this section, we assume that region proposals are given by a separate segmentation model (Sec. 1.3), and that the domain-specific backbone is already 6 available (Sec. 1.3); here, we focus on combining source features to predict keypoints (Fig. 4.1). 
The goal is to perform transfer simultaneously from both a domain-specific classification backbone (Sec. 1.3) and a task-specific keypoint model (Mask R-CNN [45]). Here, we chose Mask R-CNN as it showed significantly better out-of-the-box generalization to illustrations than OpenPose [17] (Tab. 1.1). Taking into account that the task-specific model already achieves mediocre performance on the target domain, the feature concatenation model simply stacks features from both sources (Fig. 4.1). In order to perform the concatenation, it learns shallow feature converters for each source to decrease the feature channel count and allow bilinear sampling to a common higher resolution. The combined features are fed to the head, consisting of a shallow converter and two ResNet blocks.

The final output is a stack of 25 heatmaps, 17 for COCO keypoints and 8 for auxiliary appendage midpoints (following Khungurn et al [56]). We apply pixel-wise binary cross-entropy loss on each heatmap, targeting a normal distribution centered on the ground-truth keypoint location with standard deviation proportional to the keypoint's COCO OKS sigma [70]; the sigmas for auxiliary midpoints are averaged from the endpoints of the body part. At inference, we gaussian-smooth the heatmaps and take the index of the maximum pixel value as the keypoint prediction.

Although feature concatenation produces the best results (Tab. 1.1), it is very inefficient. At inference, it must maintain the parameters of both source models and run both forward passes for each prediction; Mask R-CNN is particularly expensive in this regard. We thus also provide a feature matching model, inspired by the methods used in Luo et al [74]. As shown in Fig. 4.1, we simultaneously train an additional matching network that predicts features from the expensive task-specific model using features from the domain-specific model. Though matching may be optimized with self-supervision signals such as contrastive loss [126], we found that feature-wise mean-squared error is suitable. Given the matcher, the pretrained Mask R-CNN still helps training, but is not necessary at inference. Despite its simplicity, feature matching retains most of the performance benefits of both source models, while also being significantly lighter and faster than the concatenation architecture.
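To make the feature-matching objective concrete, the following is a minimal sketch of how such a matcher could be trained. The channel counts (2048 for the tagger backbone, 256 for the Mask R-CNN features) and the two-layer matcher are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMatcher(nn.Module):
    """Small convnet that predicts task-specific (Mask R-CNN) features
    from domain-specific (illustration tagger) features."""
    def __init__(self, c_domain=2048, c_task=256):  # assumed channel counts
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_domain, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, c_task, 3, padding=1),
        )

    def forward(self, f_domain):
        return self.net(f_domain)

def matching_loss(matcher, f_domain, f_task):
    # Feature-wise mean-squared error between matched and true task features;
    # spatial sizes are aligned by bilinear resampling.
    pred = matcher(f_domain)
    pred = F.interpolate(pred, size=f_task.shape[-2:],
                         mode='bilinear', align_corners=False)
    return F.mse_loss(pred, f_task)
```

At inference, only the tagger backbone and the trained matcher are kept, so the expensive Mask R-CNN forward pass is no longer needed.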
ResNet Tagger  The domain-specific backbone for our model (Fig. 4.1) is a pretrained ResNet50 [44] fine-tuned as an illustration tagger. The tagging task is equivalent to multi-label classification, in this case predicting the labels applied to an image by the Danbooru imageboard moderators [5]. The 392k unique tags cover topics including colors, clothing, interactions, composition, and even copyright metainfo.

Khungurn et al [56] use an ImageNet-trained GoogLeNet [111] backbone for their illustrated pose estimator, but we find that Danbooru fine-tuning significantly boosts transfer performance. There are publicly-available Danbooru taggers [6, 60], but both their classification performance and feature learning capabilities are hindered by uninformative target tags and severe class imbalance. By alleviating these issues, we achieve significantly better transfer to pose estimation.

Most available Danbooru taggers [6, 60] take a coarse approach to defining classes, simply predicting the several thousand (6-7k) most frequent tags. However, many of these tags represent contextual information not present in the image; e.g. neon genesis evangelion (the name of a franchise), or alternate costume (fanmade/non-canon clothes). We instead only allow tags explicitly describing the image (clothing, body parts, etc.). Selecting tags by frequency also introduces tag redundancy and annotator disagreement. There are many high-frequency tags that share similar concepts but are annotated inconsistently; e.g. hand in hair, adjusting hair, and hair tucking have vague wiki definitions for taggers, and many color tags are subjective (aqua hair vs. blue hair). To address these challenges, we survey Danbooru wikis to manually develop a rulebook of tag groups that defines more explicit and less redundant classes.

Danbooru tag frequencies form a long-tailed distribution, posing a severe class imbalance problem. In addition to filtering out under-tagged images (detailed in Sec. 1.4), we implement an inverse square-root frequency reweighing scheme to emphasize the learning of less-frequent classes. More formally, the loss on a sample is:

L(y, \hat{y}) = \frac{1}{C} \sum_{i=0}^{C-1} w_i(y_i) \, \mathrm{BCE}(y_i, \hat{y}_i)    (1.1)

w_i(z) = \frac{1}{2} \left( \frac{z}{r_i} + \frac{1-z}{1-r_i} \right)    (1.2)

r_i = \frac{\sqrt{N_i}}{\sqrt{N_i} + \sqrt{N - N_i}}    (1.3)

where C is the number of classes, ŷ ∈ [0, 1]^C is the prediction, y ∈ {0, 1}^C is the ground-truth label, BCE is binary cross-entropy loss, N is the total number of samples, and N_i is the number of positive samples in the i-th class. We found that plain inverse frequency weighing caused numerical instability in training, necessitating the square root.
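As a concrete illustration of Eqs. 1.1-1.3, below is a minimal PyTorch sketch of the class-balanced tagging loss; the tensor names and the way class counts are supplied are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def tagging_loss(y_hat, y, class_pos_counts, n_total):
    """Inverse square-root frequency reweighted BCE (Eqs. 1.1-1.3).

    y_hat: (B, C) predicted tag probabilities in [0, 1]
    y:     (B, C) binary ground-truth tags
    class_pos_counts: (C,) number of positive samples N_i per class
    n_total: total number of samples N in the dataset
    """
    sqrt_pos = torch.sqrt(class_pos_counts.float())
    sqrt_neg = torch.sqrt(float(n_total) - class_pos_counts.float())
    r = sqrt_pos / (sqrt_pos + sqrt_neg)              # Eq. 1.3 (square root avoids instability)
    w = 0.5 * (y / r + (1 - y) / (1 - r))             # Eq. 1.2, broadcast over the batch
    bce = F.binary_cross_entropy(y_hat, y, reduction='none')
    return (w * bce).mean()                           # Eq. 1.1, averaged over classes and batch
```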
Character Segmentation & Bounding Boxes  In order to produce bounding boxes around each subject in the image, we first train an illustrated character segmenter. As we assume one subject per image, we can derive a bounding box by enclosing the thresholded segmentation output. The single-subject assumption also removes the need for the region proposal and NMS infrastructure present in available illustrated segmenters [68], so that our model may focus on producing clean segmentations only. Our segmentation model is based on DeepLabv3 [22], with three additional layers at the end of the head for finer segmentations at the input image resolution. We initialize with pretrained DeepLabv3 weights from PyTorch [89], and fine-tune the full model using pixel-wise binary cross-entropy loss.

Table 1.1: Performance of different architectures and ablations described in Sec. 1.5. Note that the parameter count and speed are measured in inference mode with batch size one; "m" refers to "millions of parameters".

1.4 Data Collection

Unless mentioned otherwise, we train with random image rotation, translation, scaling, flipping, and recoloring.

Pose Data  We extend the AnimeDrawingsDataset (ADD), first collected by Khungurn et al [56]. The original dataset had 2000 illustrated full-body single-character images from Danbooru, each annotated with joint keypoints. However, ADD did not follow the now-popularized COCO standard [70]; in particular, it was missing facial keypoints (eyes and ears) and bounding boxes. In order to evaluate and compare with modern pose estimators, we manually labeled the missing keypoints using an open-source COCO annotator [12] and automatically generated bounding boxes using the character segmenter described in Sec. 1.3. We also manually remove 57 images with multiple characters, or without the full body in view. In addition, we improve the diversity of poses in ADD by collecting an additional 2043 samples.

A major weakness of ADD is its lack of backwards-facing characters; only 5.45% of the entire 2k dataset had a back-related Danbooru tag (e.g. back, from behind, looking back, etc.). We specifically filtered for back-related images when annotating, resulting in a total of 850 in the updated dataset (21.25%). We also selected for other notably under-represented poses, like difficult leg tags (soles, bent over, leg up, crossed legs, squatting, kneeling, etc.), arm tags (stretch, arms up, hands clasped, etc.), and lying tags (on side, on stomach).

Our final updated dataset contains 4000 illustrated character images with all 17 COCO keypoints and bounding boxes. We designate 3200 images for training (previously 1373), 313 for validation (previously 97), and 487 for testing (same as the original ADD). For each input image, we first scale and crop such that the bounding box is centered and padded by at least 10% of the edge length on all sides. We then perform augmentations; flips require swapping left-right keypoints, and full 360-degree rotations are allowed.

ResNet Tagger Data  Our ResNet50 tagger is trained on a new subset of the 512px SFW Danbooru2019 dataset [5]. The original dataset contains 2.83m images with over 390k tags, but after filtering and retagging we arrive at 837k images with 1062 classes. The new classes are derived from manually-selected union rules over 2027 raw tags, as described in Sec. 1.3; the rulebook has 314 body-part, 545 clothing, and 203 miscellaneous (e.g. image composition) classes. To combat the class imbalance problem described in Sec. 1.3, we also rigorously filtered the dataset. We remove all images that are not single-person (solo, 1girl, or 1boy), are comics (comic, 4koma, doujinshi, etc.), or are smaller than 512px. Most critically, we remove all images with fewer than 12 positive tags; these images are very likely under-tagged, and would have introduced many false-negatives to the ground truth. The final subset of 837k images has significantly reduced class imbalance (median class frequency 0.38%, minimum 0.04%) compared to the datasets of available taggers (median 0.07%, min 0.01%) [6]. We split the dataset 80-10-10 train-val-test. As some tags are color-sensitive, we do not jitter the hue; similarly, as some tags are orientation-sensitive, we allow up to 15-degree rotations and horizontal flips only.

Character Segmentation Data  To obtain character bounding boxes, we train a character segmentation model and enclose output regions at 0.5 threshold (Sec. 1.3). The inputs to our segmentation system are augmented composites of RGBA foregrounds (with transparent backgrounds) onto RGB backgrounds; the synthetic ground truth is the foreground alpha. The available AniSeg dataset [68] has only 945 images, with manually-labeled segmentations that are not pixel-perfectly aligned. We thus collect our own larger synthetic compositing dataset. Our background images are a mix of illustrated scenery (5.8k Danbooru images with the scenery and no humans tags) and stock textures (2.3k scraped [2] from the Pixiv Dataset [67]). We collect single-character foreground images from Danbooru with the transparent background tag; 18.5k samples are used, after filtering images with text, non-transparency, or more than one connected component in the alpha channel. Counting each foreground as a single sample, this makes our new dataset roughly 20x larger than AniSeg. The foregrounds and backgrounds are randomly paired for compositing during training, with a 5% chance of having no foreground. We hold out 2048 deterministic foreground-background pairs for validation and testing (1024 each).
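The compositing step itself is simple alpha blending, sketched below; the sampling of foreground/background pairs and the 5% no-foreground case are assumed to be handled by an outer data loader.

```python
import numpy as np

def composite(fg_rgba, bg_rgb):
    """Alpha-composite an RGBA foreground onto an RGB background.

    fg_rgba: (H, W, 4) float array in [0, 1], already resized/augmented
    bg_rgb:  (H, W, 3) float array in [0, 1]
    Returns the composited RGB image and its ground-truth mask (the alpha channel),
    which supervises the pixel-wise BCE segmentation loss.
    """
    alpha = fg_rgba[..., 3:4]                              # (H, W, 1)
    rgb = alpha * fg_rgba[..., :3] + (1.0 - alpha) * bg_rgb
    return rgb, alpha[..., 0]
```

Because pairing is randomized at load time, the number of distinct composites seen during training is far larger than the 18.5k foregrounds alone.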
Table 1.2: Keypoint breakdown of our most performant "feature concatenation" model trained on our extended ADD dataset. In the center, we list the relative improvement of each metric when training on additional data. On the right, we display the PDJ@20 from Khungurn et al [56], and report the relative difference from our best model. *Note that due to keypoint incompatibilities, we fill missing keypoint results from [56] using the most similar keypoints reported: "head" for eyes and ears, and "body" for shoulders and hips.

Model            F-1      precision   recall   IoU
Ours             0.9472   0.9427      0.9576   0.9326
YAAS SOLOv2      0.9061   0.9003      0.9379   0.9077
YAAS CondInst    0.8866   0.8824      0.8999   0.9158
AniSeg           0.5857   0.5877      0.5954   0.6651

Table 1.3: Comparison of our character segmentation and bounding box performance, described in Sec. 1.5.

1.5 Experiments

We used PyTorch [89] wrapped in Lightning [3]; some models use the R101-FPN keypoint detection R-CNN from Detectron2 [125]. All models can be trained with a single GTX1080ti (11GB VRAM). Unless otherwise mentioned, we trained models using the Adam [61] optimizer, with 0.001 learning rate and batch size 32, for 1,000 epochs. The ResNet backbone is trained on the Danbooru tag classification task using our new manual tagging rulebook (Sec. 1.4). The character segmenter used for bounding boxes is trained with our new character segmentation dataset (Sec. 1.4). Using the previous two submodules, we train the pose estimator using our upgraded version of the ADD dataset (Sec. 1.4). All data and code will be released upon publication.

Pose Estimation Transfer  Table 1.1 shows the performance of different architectures. We report COCO OKS [70], PCKh and PCPm [4], and PDJ (for comparison with Khungurn et al [56]). From the top four rows, we see that our proposed feature concatenation and matching models perform the best overall, and that the addition of our new data increases performance. We also observe that while concatenation performs marginally better than matching, matching is 8.8x more parameter-efficient and one-third faster at inference.

The second group of Table 1.1 shows other architectures, roughly in order of method complexity. Here, as in Fig. 4.1, "task" source features refer to Mask R-CNN pose estimation features, and "domain" source features refer to illustration features extracted by our ResNet50 tag classifier.

"Task Fine-tuning Only" fine-tunes the pretrained Mask R-CNN head with its frozen default backbone; the last head layer is re-initialized to accommodate auxiliary appendage keypoints. This is vanilla transfer by fine-tuning a task-specific source network on a small task-specific target domain dataset.

"Domain Features Only" is our frozen ResNet50 backbone with a keypoint head. This is vanilla transfer by adding a new task head to a domain-specific source network.

"Task Fine-tuning w/ Domain Features" fine-tunes the pretrained Mask R-CNN head as above, but replaces the R-CNN backbone with our frozen ResNet50 backbone. This is a naive method of incorporating both sources, attempting to adapt the task source's pretrained prediction component to new domain features.

"Adversarial (DeepFashion2)" reuses the feature matching architecture, but performs adversarial domain discrimination instead of MSE matching.
The discriminator is a shallow 2-layer convnet, trained to separate Mask R-CNN features of randomly sampled DeepFashion2 [38] images from ResNet features of Danbooru illustrations. As the feature maps to discriminate are spatial, we are careful to employ only 1x1 kernels in the discriminator; otherwise, the dis- criminator could pick up intrinsic anatomical differences. The matching network now fools the discriminator by adversarially aligning the feature distributions. “Adversarial (COCO)” is the same adversarial architecture as above, but using COCO [70] images containing people instead of Deepfashion2. While domain-features-only is the cheapest architecture overall, it is only slightly more efficient than feature matching, and loses all benefits of task-specific transfer. However, the performance drop from feature concatenation to domain-features-only and task-with-domain- features is not very large (2-3% OKS@50); meanwhile, there is a wide gap to task-fine-tuning- only. This shows that the domain-specific ResNet50 backbone trained on our new body-tag rulebook provides much more predictive power than the task-specific pretrained Mask R-CNN. It is important to note that the adversarial models exhibited significant instability during training. After extensive hyperparameter tuning, the best DeepFashion2 model returns NaN loss at epoch 795, and the best COCO model fails at epoch 354; all other models safely exited at epoch 1,000. DeepFashion2 likely outperforms COCO because the image composition is much more similar to that of Danbooru; images are typically single-person portraits with most of the body in view. Adversarial losses are notoriously difficult to optimize, and in our case destabilized training so as to perform worse than not having been used at all. The fourth group of Table 1.1 shows out-of-the-box generalization to illustrations for Mask R-CNN [45] and OpenPose [17]. We use Mask R-CNN as our task-specific source, as it is less- 15 Figure 1.2: Pose-based retrieval. From left to right, we show the query image (descriptor dis- tance zero) followed by its five nearest neighbors (duplicate and NSFW images removed). Each illustration is annotated with its Danbooru ID, descriptor distance to the query, and the predicted bounding box with COCO keypoints. Please see supplementary materials for full artist attribution and additional examples. overfit to natural images than OpenPose. Table 1.2 gives a keypoint breakdown and comparison with Khungurn et al [56]. The results demonstrate that training on our additional more varied data improves the overall model performance; this is especially true for appendage keypoints, which are more variable than the head and torso. We also see significant improvement from results reported in Khungurn et al. The exception is the hips, for which we compare to their “body” keypoint at the navel. While this is not a direct comparison, our PDJ on hips is nevertheless low relative to other keypoints. This is 16 because PDJ does not account for the intrinsic ambiguity of the hips; looking at the OKS, which accounts for annotator disagreement, we see that hip performance is actually quite high. An important caveat is that the metrics are generally not comparable with those reported in human pose estimation. COCO OKS, for example, was designed using annotator disagreement on natural images [70]; however, illustrated character proportions deviate widely from the stan- dard human form (i.e. bigger head and eyes). 
Characters also tend to take up more screen space proportional to body size (i.e. big hair and clothing), leading to looser thresholds normalized by bounding box size. ResNet Tagger Backbone We train our ResNet50 tagger backbone to produce illustration- specific source features (Fig. 4.1). Taking into account the class imbalance, we accumulate gradients for an effective batch size of 512. Considering the minimum (0.04%) and median (0.38%) class frequencies, we may expect the smallest class to appear 0.2 times per batch, and the median class to appear 1.9 times per batch. To demonstrate the effectiveness of our tag rulebook and class reweighing strategy, we report performance on pose estimation using two other ResNet50 backbones: the RF5 tagger [6], and the default ImageNet-pretrained ResNet50 from PyTorch [89]. While there are several Danbooru taggers available [6, 60], we chose to compare our backbone to the RF5 tagger [6] because it is the most architecturally similar to our ResNet50, and relatively better-documented. The backbones all share the same architecture and parameter count, and are all placed into our feature concatenation transfer model for the ablation. The backbone ablation results are shown in the last three rows of Table 1.1. As expected, a classifier trained with our novel body-part-specific tagging rulebook and class-balancing tech- niques significantly improves transfer to pose estimation. Note that our tagger also outperforms 17 RF5 at classification (on shared target classes); please refer to the supplementary materials for more details. Character Segmentation & Bounding Boxes We compare the segmentation and bound- ing box performance of our system with that of publicly-available models. AniSeg [68] is a Faster-RCNN [95], and YAAS [141] provides SOLOv2 [120] and CondInst [114] models. These detectors may detect more than one character, and their bounding boxes are not necessarily tight around segmentations; for simplicity, we union all predicted segmentations of an image, and redraw a tight bounding box around the union. We evaluate all models on the same test set de- scribed in Sec. 1.4. Table 1.3 shows that training with our new 20x larger dataset outperforms available models in both mean F-1 (segmentation) and IoU (bounding boxes); we thus use it in our pipeline for bounding box prediction. 1.6 Application: Pose-guided Retrieval An immediate application of our illustrated pose estimator is a pose-guided character re- trieval system. We construct a proof-of-concept retriever that takes a query character (or user- specified keypoints and bounding box) and searches for illustrated characters in a similar pose. This system can serve as a useful search tool for artists, who often use reference drawings while illustrating. Our pose retriever performs a simple nearest-neighbor search. The support images con- sist of single-character Danbooru illustrations with the full body tag. Using our best-performing model, we extract bounding boxes and keypoint locations for each character, normalize the key- points by the longest bounding box dimension, and finally store the pairwise euclidean distances 18 between the normalized keypoints. This process ensures the pairwise-distance descriptor is in- variant to translation, rotation, and image scale. At inference, we extract the descriptor from the query, and find the euclidean k-nearest neighbors from the support set. In practice, we compute descriptors using all 25 predicted keypoints (17 COCO and 8 additional appendage midpoints). 
This makes the descriptor 300-dimensional (25 choose 2), which is generally too large for tree-based nearest neighbors [14]. However, since our support set consists of 136k points, we are still able to brute force search in reasonable time. Empirically, each query takes about 0.1341s for keypoint extraction (GPU) and 0.0638s for search (CPU). To demonstrate the effectiveness of our pose estimator, we present several query results in Fig. 1.2; while there is no ground-truth to measure quantitative performance, qualitative in- spection suggests that our model works well. We can retrieve reasonably similar illustrations for standard poses as shown in the first row, as well as more difficult poses for which illustrators would want references. Note that while our system has no awareness of perspective, it is able to effectively leverage keypoint cues to retrieve similarly foreshortened views in the last row. For more examples, please refer to our supplementary materials. 1.7 Conclusion & Future Work While we may continue to improve the transfer performance through methods like pseudo- labeling [65] or cycle-consistent image translation [46], we can also begin extending our work to multi-character detection and pose estimation. While it is possible to construct a naive instance- based segmentation and keypoint estimation dataset by compositing background-removed ADD samples, we cannot expect a system trained on such data to perform well in-the-wild. Character 19 interactions in illustrations are often much more complex than human interactions in real life, with much more frequent physical contact. For example, Danbooru has 43.6k images tagged with holding hands and 59.1k with hugging, already accounting for 2.8% of the entire dataset. Simply compositing independent characters together would not be able to model the intricacies of the illustration domain; we would again need to expand our datasets with annotated instances of character interactions. As a fundamental vision task, pose estimation also provides a valuable prior for numerous other novel applications in the illustrated domain. Our pose estimator opens the door to pose- guided retargeting for automatic character animation, better keyframe interpolation, pose-aware illustration colorization, 3D character reconstruction, etc. In conclusion, we demonstrate state-of-the-art pose estimation on the illustrated charac- ter domain, by leveraging both domain-specific and task-specific source models. Our model significantly outperforms prior art [56] despite the absence of synthetic supervision, thanks to successful transfer from our new illustration tagging subtask focused on classifying body-related tags. In addition, we provide a single-region proposer trained on a novel character segmentation dataset 20x larger than those currently available, as well as an updated illustration pose estima- tion dataset with twice the number of samples in more diverse poses. Our model performance allows for the novel task of pose-guided character illustration retrieval, and paves the way for future applications in the illustrated domain. 20 Chapter 2: Improving the Perceptual Quality of 2D Animation Interpolation Traditional 2D animation is labor-intensive, often requiring animators to manually draw twelve illustrations per second of movement. While automatic frame interpolation may ease this burden, 2D animation poses additional difficulties compared to photorealistic video. 
In this work, we address challenges unexplored in previous animation interpolation systems, with a focus on improving perceptual quality. Firstly, we propose SoftsplatLite (SSL), a forward-warping interpolation architecture with fewer trainable parameters and better perceptual performance. Secondly, we design a Distance Transform Module (DTM) that leverages line proximity cues to correct aberrations in difficult solid-color regions. Thirdly, we define a Restricted Relative Linear Discrepancy metric (RRLD) to automate the previously manual training data collection process. Lastly, we explore evaluation of 2D animation generation through a user study, and establish that the LPIPS perceptual metric and chamfer line distance (CD) are more appropriate measures of quality than PSNR and SSIM used in prior art. 2.1 Introduction Traditional 2D animators typically draw each frame manually; this process is incredibly labor-intensive, requiring large production teams with expert training to sketch and color the tens of thousands of illustrations required for an animated series. With the growing global popularity 21 of the traditional style, studios are hard-pressed to deliver high volumes of quality content. We ask whether recent advancements in computer vision and graphics may reduce the burden on animators. Specifically, we study video frame interpolation, a method of automatically generating intermediate frames in a video sequence. In the typical problem formulation, a system is expected to produce a halfway image naturally interpolating two given consecutive video frames. In the context of animation, an animator could potentially achieve the same framerate for a sequence (or “cut”) by manually drawing only a fraction of the frames, and use an interpolator to generate the rest. Though there is abundant work on video interpolation, 2D animation poses additional dif- ficulties compared to photorealistic video. Given the high manual cost per frame, animators tend to draw at reduced framerates (e.g. “on the twos” or at 12 frames/second), increasing the pixel displacements between consecutive frames and exaggerating movement non-linearity. Unlike in natural videos with motion blur, the majority of animated frames can be viewed as stand-alone cel illustrations with crisp lines, distinct solid-color regions, and minute details. For this non- photorealistic domain with such different image and video features, even our understanding of how to evaluate generation quality is limited. Previous animation-specific interpolation by Li et. al. (AnimeInterp [106]) approached some of these challenges by improving the optical flow estimation component of a deep video interpolation system by Niklaus et. al. (Softsplat [83]); in this paper, we build upon AnimeInterp by addressing some remaining challenges. Firstly, though AnimeInterp improved optical flow, it trained with an L1 objective and did not modify the Softsplat feature extraction, warping, or synthesis components; this results in blurred lines/details and ghosting artifacts in supposedly solid-color regions. We alleviate these issues with architectural improvements in our proposed 22 SoftsplatLite (SSL) model, as well as with an additional Distance Transform Module (DTM) that refines outputs using domain knowledge about line drawings. Secondly, though AnimeInterp provided a small ATD12k dataset of animation frame triplets, the construction of this dataset required intense manual filtering of evenly-spaced triplets with linear movement. 
We instead automate linear triplet collection from raw animation by introducing Restricted Relative Linear Discrepancy (RRLD), enabling large-scale dataset construction. Lastly, AnimeInterp only fo- cused on PSNR/SSIM evaluation, which we show (through an exploratory user study) are less indicative of percieved quality than LPIPS [138] and chamfer line distance (CD). We summarize the contributions of this paper: 1. SoftsplatLite (SSL): a forward-warping interpolation architecture with fewer trainable pa- rameters and better perceptual performance. We tailor the feature extraction and synthesis networks to reduce overfitting, propose a simple infilling method to remove ghosting arti- facts, and optimize LPIPS loss to preserve lines and details. 2. Distance Transform Module (DTM): a refinement module with an auxiliary domain- specific loss that leverages line proximity cues to correct aberrations in difficult solid-color regions. 3. Restricted Relative Linear Discrepancy (RRLD): a metric to quantify movement non- linearity from raw animation; this automates the previously manual training data collection process, allowing more scalable training. 4. Perceptual user study: we explore evaluation of 2D animation generation, establishing the LPIPS perceptual metric and chamfer line distance (CD) as more appropriate quality measures than PSNR/SSIM used in prior art. 23 Figure 2.1: We improve the perceptual quality of 2D animation interpolation from previous work. (a) Overlaid input images to interpolate; (b) AnimeInterp by Li et. al. [106]; (c) Our proposed method; (d) Ground truth interpolation. Note the destruction of lines in (b) compared to (c), and the patchy artifacts ghosted on the teapot in (b). Our user study validates our focus on perceptual metrics and artifact removal. 2.2 Related Work Much recent work has been published on photorealistic video interpolation. Broadly, these works fall into phase-based [77, 76], kernel-based [85, 84], and flow-based methods [83, 53, 88, 127], with others using a mix of techniques [7, 8, 25]. The most recent state-of-the-art has seen more flow-based methods [83, 88], following corresponding advancements in optical flow estimation [50, 110, 52, 112]. Flow-based methods can be further split by forward [83], or backward [88] warping. The prior art most directly related to ours is AnimeInterp, by Li et. al. [106]. While they laid the groundwork for the problem specific to the traditional 2D animation domain, their system had many shortcomings that we overcome as described in the introduction section. Even though we focus on animations “post-production” (i.e. interpolating complete full- color sequences), there is also a body of work on automating more specific components of ani- mation production itself. For example, sketch simplification [105, 104] is a popular topic with applications to speeding up animation “tie-downs” and “cleanups”. There are systems for syn- 24 Figure 2.2: Schematic of our proposed system. SoftsplatLite (SSL, Sec. 2.3) passes a prediction to the Distance Transform Module (DTM, Sec. 2.3) for refinement. SSL uses many fewer train- able parameters than AnimeInterp [106] to reduce overfitting, and introduces an infilling step to avoid ghosting artifacts. DTM leverages domain knowledge about line drawings to achieve more uniform solid-color regions. Artists: hariken, k.k.1 thesizing “in-between” line drawings from sketch keyframes in both raster [82, 130] and vector [123, 26] form. While the flow-based in-betweening done by Narita et. al. 
[82] shares similarity to our work (such as the use of chamfer distance and forward warping), their system composed pretrained models without performing any form of training. Another related problem is sketch colorization, with application to both single illustrations [137] and animations [92, 75, 19]. These works unsurprisingly highlight the foundational role of lines and sketches in animation, and we continue the trend by introducing a Distance Transform Module to improve our generation quality.

2.3 Methodology

SoftsplatLite

As with AnimeInterp [106], we base our model on the state-of-the-art Softsplat [83] interpolation model, which uses bidirectional optical flow to differentiably forward-splat input image features for synthesis. Whereas AnimeInterp only focused on improving optical flow estimation, we assume a fixed flow estimator (the same RAFT [112] network from AnimeInterp, which they dub "RFR"). We instead look more closely at feature extraction, warping, and synthesis; our proposed SoftsplatLite (named similarly to PWC-Lite [72]) aims to improve convergence on LPIPS [138] while also being parameter- and training-efficient. Please see Fig. 4.1a for an overview of SSL.

We first note that the feature extractors in AnimeInterp [106] and Softsplat [83] are relatively shallow. The extractors must still be trained, and rely on backpropagation through the forward splatting mechanism. In practice, we found that replacing the extractor with the first four blocks of a frozen ImageNet-pretrained ResNet-50 [44] performs better; additionally, freezing the extractor contributes to reduced memory usage and compute during training, as no gradients must be backpropagated through the warping operations. Note that we also tried unfreezing the ResNet, but observed slight overfitting.

Figure 2.3: SSL vs. AnimeInterp ft. [106]. Trained on the same ATD data [106] and LPIPS loss [138], AnimeInterp encounters many "ghosting" artifacts, which we resolve in SSL by proposing an inpainting technique.

1 Artists: hariken: https://danbooru.donmai.us/posts/5378938; k.k.: https://danbooru.donmai.us/posts/789765

Next, we observe that forward splatting results in large empty occluded regions. If left unhandled during LPIPS training, these gaps often cause undesirable ghosting artifacts (see AnimeInterp [106] output in Fig. 2.3b). Additionally, subtle gradients at the edge of moving objects in the optical flow field may result in a spread of dots after forward warping; these later manifest as blurry patches in AnimeInterp predictions (see Fig. 2.1b). To remove these artifacts, we propose a simple infilling technique to generate a better warped feature stack F prior to synthesis ("occlusion-mask infilling" in Fig. 4.1a):

F = \frac{1}{2} \left( M_{0 \to t} W_{0 \to t}(f(I_0)) + (1 - M_{0 \to t}) W_{1 \to t}(f(I_1)) \right) + \frac{1}{2} \left( M_{1 \to t} W_{1 \to t}(f(I_1)) + (1 - M_{1 \to t}) W_{0 \to t}(f(I_0)) \right)    (2.1)

Z_{1 \to 0} = -0.1 \times \| \mathrm{LAB}(I_1) - W'_{0 \to 1}(\mathrm{LAB}(I_0)) \|    (2.2)

where W_{a \to b} denotes forward warping from timestep a to timestep b, W' denotes backwarping, M denotes the opened occlusion mask of the warp, I represents either input image, and f represents the feature extractor. In other words, occluded features are directly infilled with warped features from the other source image.
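A sketch of the infilling in Eq. 2.1 is shown below; the forward-warp operator and the opened occlusion masks are assumed to come from an existing softmax-splatting implementation, so the function signature here is illustrative.

```python
import torch

def infill_features(feat0_w, feat1_w, mask0, mask1):
    """Occlusion-mask infilling (Eq. 2.1).

    feat0_w, feat1_w: features of I0 and I1 forward-warped to time t, (B, C, H, W)
    mask0, mask1:     opened occlusion masks of each warp, (B, 1, H, W), 1 = covered
    Holes in one warp are filled with the other warp's features, then the two
    infilled stacks are averaged.
    """
    filled0 = mask0 * feat0_w + (1 - mask0) * feat1_w
    filled1 = mask1 * feat1_w + (1 - mask1) * feat0_w
    return 0.5 * (filled0 + filled1)
```

Applying the same routine to the raw RGB images (i.e. f set to the identity) yields the strong initial guess that the lightweight U-Net later refines.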
The computation of mask M involves warping an image of ones, followed by a morphological image opening with kernel k = 5 to remove dotted artifacts; 27 note that though opening is non-differentiable, no gradients are needed with respect to the flow field as our flow estimator is fixed. Unlike AnimeInterp [106], we do not use average forward splatting, and instead use the more accurate softmax weighting scheme with negative L2 LAB color consistency as our Z-metric (similar as in Softsplat [83]). While it is not guaranteed that this infilling method will eliminate all holes (it is still possible for both warps to have shared occluded regions), we find that in practice the majority of image areas are covered. Lastly, for the synthesis stage, we opt for a much more lightweight U-Net [98] instead of the GridNet [36] used in the original Softsplat [83]. We may afford this thrifty replacement by carefully placing a direct residual path from an initial warped guess to the final output. This follows the observation that directly applying our previously-described infilling method to the input RGB images produces a strong initial guess for the output; this is achieved by replacing feature extractor f in Eq. 2.1 with the identity function. Instead of requiring a large synthesizer to reconcile two sets of warped images and features into a single final image, we employ a small network to simply refine a single good guess. Under this architecture, the additional GridNet parameters become redundant, and even contribute to overfitting. Note that while SoftsplatLite and Softsplat have comparable parameter counts at inference (6.92M and 6.21M respectively), the frozen feature extractor and smaller synthesizer signifi- cantly reduces the number of trainable parameters compared to the original (1.28M and 2.01M respectively). We later demonstrate through ablations (Tab. 2.2) that lighter training and artifact reduction allow SSL to score better on perceptual metrics like LPIPS and chamfer distance. Distance Transform Module As seen in Fig. 2.4b, SoftsplatLite may struggle to choose colors for certain regions, or have trouble with large areas of flat color. These difficulties may be partly attributed to the natural tex- 28 Figure 2.4: Effect of DTM. DTM effectively leverages line proximity cues (distance transform) to refine SSL outputs. DTM not only removes minor aberrations from solid-color regions (bottom), but also corrects entire enclosures if needed (top). ture bias of convolutional models [39]; the big monotonous regions of traditional cel animation would expectedly require convolutions with larger perceptual fields to extract meaningful fea- tures. Instead of building much deeper or wider models, we take advantage of line information inherently present in 2D animation; hypothetically, providing line proximity information to con- volutions may act as a form of “stand-in” texture that helps the processing of cel-colored image data. We thus propose a Distance Transform Module (DTM) to refine the SSL outputs by lever- aging a normalized version of the Euclidean distance transform (NEDT). At a high level (see Fig. 4.1b), DTM first attempts to predict the ground truth NEDT of the output (middle) frame, and then uses this prediction to refine the SSL output through a residual block. To train the predic- tion of NEDT, we introduce an auxiliary Ldt in addition to the Llpips on the final prediction, and optimize a weighted sum of both losses end-to-end. 
The rest of this section provides specifics on the implementation. The first step is to extract lines from the input images; for this, we use the simple but effective difference of gaussians (DoG) edge detector,

\mathrm{DoG}(I) = \frac{1}{2} + t \left( G_{k\sigma}(I) - G_{\sigma}(I) \right) - \epsilon,    (2.3)

where G_{\sigma} are Gaussian blurs after greyscale conversion, k = 1.6 is a factor greater than one, and t = 2 with \epsilon = 0.01 are hyperparameters. Please see Fig. 2.6 for examples of DoG extraction.

Figure 2.5: RRLD filtering. RRLD quantifies whether a triplet is evenly-spaced. We show several overlaid triplets from our additional dataset ranked by RRLD; higher RRLD (bottom) indicates deviation from the halfway assumption. As RRLD is fully automatic, appropriate training data can be filtered from raw video at scale.

Next, we apply the distance transform. To bound the range of values, we normalize EDT values to unit range similarly to Narita et al. [82],

\mathrm{NEDT}(I) = 1 - \exp \left\{ -\frac{\mathrm{EDT}(\mathrm{DoG}(I) > 0.5)}{\tau d} \right\},    (2.4)

where \tau = 15/540 is a steepness hyperparameter, and d is the image height in pixels. Note that we threshold DoG at 0.5 to get a binarized sketch.

This normalized EDT is extracted from both input images, and warped through the same inpainting procedure as Eq. 2.1; more precisely, f is replaced by NEDT. DTM then uses this, as well as the extracted NEDT of SSL's output, to estimate the NEDT of the ground-truth output frame. This prediction occurs through a small convolutional network (first yellow box in Fig. 4.1b), and is trained to minimize an auxiliary L_dt, the L1 Laplacian pyramid loss between the predicted and ground-truth NEDTs. A final convolutional network (second yellow box in Fig. 4.1b) then incorporates the predicted NEDT to residually refine the SSL output.

Note that we detach the predicted NEDT image from the final RGB image prediction gradients ("SG" for "stop-gradient" in Fig. 4.1b), in order to reduce potentially competing signals from L_dt and the final image loss. It is also important to mention that since both DoG sketch extraction and EDT are non-differentiable operations, the extraction of NEDT from the Softsplat output cannot be backpropagated. However, we found that we could still reasonably perform end-to-end training despite the required stop-gradient in this step.

Through this process, our DTM is able to predict the distance transform of the output, and utilize it in the final interpolation. Experiments show that this relatively cheap additional network is effective at improving perceptual performance (Tab. 2.2).
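To make the line-extraction pipeline of Eqs. 2.3-2.4 concrete, here is a minimal sketch using OpenCV-style operations; the base sigma of the blurs is an assumption, since only the ratio k and the constants t and ε are specified above.

```python
import cv2
import numpy as np

def dog_lines(img_rgb, sigma=1.0, k=1.6, t=2.0, eps=0.01):
    """Difference-of-Gaussians edge map (Eq. 2.3), values roughly in [0, 1]."""
    grey = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0
    g_wide = cv2.GaussianBlur(grey, (0, 0), sigmaX=k * sigma)
    g_narrow = cv2.GaussianBlur(grey, (0, 0), sigmaX=sigma)
    return 0.5 + t * (g_wide - g_narrow) - eps

def nedt(img_rgb, tau=15.0 / 540.0):
    """Normalized Euclidean distance transform of the line drawing (Eq. 2.4)."""
    sketch = dog_lines(img_rgb) > 0.5                   # binarized lines
    # distanceTransform measures the distance to the nearest zero pixel,
    # so pass the inverted sketch to get the distance to the nearest line pixel.
    edt = cv2.distanceTransform((~sketch).astype(np.uint8), cv2.DIST_L2, 3)
    d = img_rgb.shape[0]                                # image height in pixels
    return 1.0 - np.exp(-edt / (tau * d))
```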
Restricted Relative Linear Discrepancy

Unlike in the natural video domain, where almost any three consecutive frames from a cut may be used as a training triplet, data collection for 2D animation is much more ambiguous. Animators often draw at variable framerates with expressive arc-like movements; when coupled with high pixel displacements, this results in a significant number of triplets with non-linear motion or uneven spacing. However, under the problem formulation, all middle frames of training triplets are assumed to be “halfway” between the inputs. While forward warping provides a way to control the interpolated t ∈ [0, 1] at which generation occurs, it is ambiguous to label such ground truth for training. Li et al. in AnimeInterp [106] manually filter through more than 130,000 triplets to arrive at their ATD dataset with 12,000 samples, a costly manual effort with less than 10% yield.

In order to automate the training data collection process from raw animation data, we quantify the deviation of a triplet from the halfway assumption with a novel Restricted Relative Linear Discrepancy (RRLD) metric, and filter samples based on a simple threshold. In our experiments (Tab. 2.2), we demonstrate that selecting additional training data with RRLD improves generalization, whereas training on naively-collected triplets damages performance. We additionally show that RRLD largely agrees with ATD, and that RRLD is robust to the choice of flow estimator (Sec. 2.4). Please see Fig. 2.5 for example triplets accepted or rejected by RRLD. The rest of this section provides specifics of the filtering method.

We define RRLD as follows,

\mathrm{RRLD}(\omega_{t\to 0}, \omega_{t\to 1}) = \frac{1}{|\Omega|} \sum_{(i,j)\in\Omega} \frac{\left\| \omega_{t\to 0}[i,j] + \omega_{t\to 1}[i,j] \right\| / 2}{\left\| \omega_{t\to 0}[i,j] - \omega_{t\to 1}[i,j] \right\|},   (2.5)

where \omega_{t\to 0} and \omega_{t\to 1} are flow fields extracted from the consecutive frames I0, It, and I1, and \Omega denotes the set of (i, j) pixel coordinates where both flows have norms greater than a threshold of 2.0 and point to pixels within the image. RRLD takes as input flow fields from the middle frame It to the end frames, and assumes they are correct.

The numerator of Eq. 2.5 represents the distance from pixel (i, j) to the midpoint between the two destination pixels, while the denominator describes the total distance between the destination pixels. In other words, the interior of the summand is half the ratio between the diagonals of the parallelogram formed by the two flow vectors; this measures the relative distance from the actual to the ideal halfway point.

Figure 2.6: Line and detail preservation. (a) AnimeInterp prediction; (b) our full model (SSL+DTM); (c) ground truth; (middle) extracted DoG lines; (bottom) normalized Euclidean distance transform. AnimeInterp blurs lines and details that are critical to animation; by focusing on perceptual metrics like LPIPS and chamfer distance (CD), we improve the generation quality.

As the estimated flows are noisy, we average over a restricted set of pixels \Omega. We first remove pixels with displacement close to zero, where a low denominator results in an unrepresentatively high discrepancy measurement. Then, we also filter out pixels with flows pointing outside the image, which are often poor estimates. The final RRLD gives a rough measure of deviation from the halfway-frame assumption, for which we may define a cutoff (0.3 in this work).

One caveat to this method is that pans must be discarded. In some cases, a non-linear animation may be composited onto a panning background; RRLD would then include the linearly-moving background in \Omega, lowering the overall measurement despite having a nonlinear region of interest. We simply remove triplets with large \Omega, high average flow magnitude, and low flow variance. It is possible to reintroduce panning effects through data augmentation if needed, though we did not for our training.

Another important point is that even though animators may draw at framerates like 12 or 8, the final raw input videos are still at 24fps. Thus, many consecutive triplets in actuality contain two duplicates, which leads to RRLD values around 0.5; had the duplicate been removed, an adjacent frame outside the triplet may have had a qualifying RRLD. In order to maximize the data yield, we also train a simple duplicate frame detector, using linear regression over the mean and maximum L2 LAB color difference between consecutive frames.
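A minimal NumPy sketch of Eq. 2.5 follows; it assumes (H, W, 2) flow arrays from the middle frame to each end frame, with channel order (x, y), which is our own convention for illustration.

```python
import numpy as np

def rrld(flow_t0, flow_t1, min_norm=2.0):
    """Restricted Relative Linear Discrepancy (Eq. 2.5) for one triplet."""
    H, W, _ = flow_t0.shape
    ys, xs = np.mgrid[0:H, 0:W]

    n0 = np.linalg.norm(flow_t0, axis=-1)
    n1 = np.linalg.norm(flow_t1, axis=-1)
    omega = (n0 > min_norm) & (n1 > min_norm)       # drop near-zero displacements
    for flow in (flow_t0, flow_t1):                 # drop flows pointing off-image
        dx, dy = xs + flow[..., 0], ys + flow[..., 1]
        omega &= (dx >= 0) & (dx < W) & (dy >= 0) & (dy < H)

    num = np.linalg.norm(flow_t0 + flow_t1, axis=-1) / 2.0   # distance to destination midpoint
    den = np.linalg.norm(flow_t0 - flow_t1, axis=-1)         # distance between destinations
    return float(np.mean(num[omega] / den[omega]))

# A triplet qualifies for training if rrld(...) falls under the 0.3 cutoff.
```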
User Study & Quality Metrics

We perform a user study in order to evaluate our system and explore the relationship between metrics and perceived quality. To get a representative subset of the ATD test set, on which we perform all evaluations, we select 323 random samples in accordance with Fischer's sample size formula (with population 2000, margin of error 5%, and confidence level 95%). For each sample triplet, users were given a pair of animations playing back and forth at 2fps, cropped to the region-of-interest annotation provided by ATD. The middle frame of each animation was a result generated either by our best model (on LPIPS), or by the pretrained AnimeInterp [106]. Participants were asked to pick which animation had: clearer/sharper lines, more consistent shapes/colors, and better overall quality. Complete survey results, including several random animation pairs compared, are available in the supplementary.

Our main metric of interest is LPIPS [138], a general measure of perceived image quality based on deep image classification features. We are interested in understanding its applicability to non-photorealistic domains like ours, especially in comparison with the PSNR/SSIM used in prior work [106]. We additionally consider the chamfer distance (CD) between lines extracted from the ground truth vs. the prediction. The chamfer metric is typically used in 3D work, where the distance between two point clouds is calculated by averaging the shortest distances from each point of one cloud to a point on the other. In the context of binary line drawings extracted from our data using DoG (Eq. 2.3), the 3D points are replaced by all 2D pixels that lie on lines. As chamfer distance would intuitively measure how far lines are from each other in different images, we explore the importance of this metric for our domain with images based on line drawings. Please see Fig. 2.6 for examples of CD evaluation. In this work, we define chamfer distance as:

\mathrm{CD}(X_0, X_1) = \frac{1}{2HWD} \sum \big[ X_0\,\mathrm{DT}(X_1) + X_1\,\mathrm{DT}(X_0) \big],   (2.6)

where X are binary sketches with 1 on lines and 0 elsewhere, DT denotes the Euclidean distance transform, the summation is pixel-wise, and HWD is the product of height, width, and diameter. We normalize by both area and diameter to enforce invariance to image scale. Note that our definition is symmetric with respect to prediction and ground truth, zero if and only if they are equal, and non-negative. Also observe that as neither DoG binarization nor DT is differentiable, CD cannot be optimized directly by gradient descent training; it is thus used for evaluation only.

Table 2.1: Comparison with baselines. Our full proposed method achieves the best perceptual performance, followed by AnimeInterp [106]. We show in our user study (Sec. 2.4) that LPIPS/CD are better indicators of quality than the PSNR/SSIM focused on in previous work; we list them here for completeness. Models from prior work are fine-tuned on LPIPS for fairer comparison. Best values are underlined, runner-ups italicized; LPIPS is scaled by 1e2, CD by 1e5.

                        ------------ All ------------    --- Eastern ---    --- Western ---
Model                   LPIPS    CD      PSNR    SSIM    LPIPS    CD        LPIPS    CD
DAIN [7]                4.695    5.288   28.840  95.28   5.499    6.537     4.204    4.524
DAIN ft. [7]            4.137    4.851   29.040  95.27   4.734    5.888     3.771    4.217
RIFE [48]               4.451    5.488   28.515  95.14   4.933    6.618     4.156    4.796
RIFE ft. [48]           4.233    5.411   27.977  93.70   4.788    6.643     3.894    4.658
ABME [88]               5.731    7.244   29.177  95.54   7.000    10.010    4.955    5.552
ABME ft. [88]           4.208    4.981   29.060  95.19   4.987    6.092     3.732    4.302
AnimeInterp [106]       5.059    5.564   29.675  95.84   5.824    7.017     4.590    4.674
AnimeInterp ft. [106]   3.757    4.513   28.962  95.02   4.113    5.286     3.540    4.039
Ours                    3.494    4.350   29.293  95.15   3.826    4.979     3.291    3.966
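A sketch of Eq. 2.6 using SciPy's Euclidean distance transform is given below; we take the image "diameter" D to be the diagonal length, which is an assumption on our part.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_distance(x0, x1):
    """Chamfer distance (Eq. 2.6) between two binary sketches (True on line pixels)."""
    h, w = x0.shape
    diam = np.hypot(h, w)                  # image diameter, assumed to be the diagonal
    dt0 = distance_transform_edt(~x0)      # DT(X0): distance to the nearest line of X0
    dt1 = distance_transform_edt(~x1)
    return float(np.sum(x0 * dt1 + x1 * dt0) / (2.0 * h * w * diam))
```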
Table 2.2: Ablations of proposed methods. Firstly, each component of SSL contributes to performance (especially infilling). Secondly, new data filtered naively hurts performance, while new RRLD-filtered data helps. Lastly, DTM improvement is due to auxiliary supervision, not just increased parameter count. AnimeInterp ft. is copied from Tab. 2.1 for comparison; the last row here and in Tab. 2.1 are equivalent. Best values are underlined, runner-ups italicized; LPIPS is scaled by 1e2, CD by 1e5.

                                      ----- All -----    --- Eastern ---    --- Western ---
Model                    Data         LPIPS    CD        LPIPS    CD        LPIPS    CD
AnimeInterp ft. [106]    ATD          3.757    4.513     4.113    5.286     3.540    4.039
SSL (no flow infill)     ATD          3.648    4.496     4.026    5.160     3.416    4.089
SSL (no U-net synth.)    ATD          3.614    4.579     3.982    5.288     3.389    4.146
SSL (no ResNet extr.)    ATD          3.605    4.739     3.957    5.429     3.391    4.317
SSL                      ATD          3.586    4.572     3.940    5.248     3.369    4.158
SSL                      ATD+naive    3.702    4.811     3.997    5.033     3.521    4.675
SSL                      ATD+RRLD     3.535    4.431     3.873    5.089     3.329    4.028
SSL+DTM (no Ldt)         ATD+RRLD     3.531    4.430     3.865    4.995     3.327    4.085
SSL+DTM                  ATD+RRLD     3.494    4.350     3.826    4.979     3.291    3.966

2.4 Experiments & Discussion

We implement our system in PyTorch [90] wrapped in Lightning [3], with Kornia [96]. Our model uses the same RFR/RAFT with SGM flows as AnimeInterp for fairer comparison [106, 112], and forward splatting is done with the official Softsplat [83] module. We train with the Adam [62] optimizer at learning rate \alpha = 0.001 for 50 epochs, and accumulate gradients for an effective batch size of 32. Our code uses the official LPIPS [138] package, with the AlexNet [63] backbone. All training minimizes the total loss \mathcal{L} = \lambda_{lpips}\mathcal{L}_{lpips} + \lambda_{dt}\mathcal{L}_{dt}, where \lambda_{lpips} = 30; depending on whether DTM is trained, \lambda_{dt} is either 0 or 5. Evaluations are run over the 2000-sample test set from AnimeInterp’s ATD12k dataset; however, we only train on a random 9k of the remaining 10k in ATD, so that we can designate 1k for validation. Similar to Li et al. [106], we randomly perform horizontal flips and frame order reversal augmentations during training. We use single-node training with at most 4x GTX1080Ti at a time, with mixed precision where possible. All models are trained and tested at 540x960 resolution.

Table 2.3: User study results. For each of the visual criteria we asked the users to judge (rows), we list the percentage of instances where users preferred the animation with a better metric score (columns). Values above 50% indicate agreement between the queried criteria and the metric score difference, and values under 50% indicate contradiction. “Pref. Ours” means the percent of users preferring our output to AnimeInterp [106] for that criterion.

                                Prefer    Lower     Lower     Higher    Higher
Criteria                        Ours      LPIPS     CD        PSNR      SSIM
cleaner/sharper lines           86.01%    86.56%    78.20%    18.95%    15.48%
more consistent shape/color     78.82%    79.26%    73.99%    25.02%    22.66%
better overall quality          81.11%    81.55%    75.67%    22.97%    19.88%

We wrote a custom CUDA implementation of the distance transform and chamfer distance using CuPy [86] that achieves upwards of 3000x speedup over the SciPy CPU implementation [117]; the algorithm is a simplified version of Felzenszwalb et al. [34], where we calculate the minimum of the lower envelope through brute-force iteration. While more efficient GPU algorithms are known [16], we found our implementation sufficient.
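To illustrate the brute-force lower-envelope idea (not the actual CUDA kernel, which is not reproduced here), a one-dimensional pass of the squared distance transform can be written with CuPy as follows; the full 2D transform applies such a pass along rows and then along columns, following Felzenszwalb et al. [34].

```python
import cupy as cp

def sq_edt_1d(f):
    """Squared 1D distance transform by brute-force minimization over all parabolas.

    f: (n,) costs, 0 at line pixels and a large value elsewhere.
    Returns D[p] = min_q ((p - q)^2 + f[q]) for every position p.
    """
    n = f.shape[0]
    q = cp.arange(n, dtype=f.dtype)
    p = q[:, None]
    return cp.min((p - q[None, :]) ** 2 + f[None, :], axis=1)
```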
RRLD Data Collection

As RRLD was designed to replicate the manual selection of training data, we applied RRLD to AnimeInterp’s ATD dataset [106] and achieved 95.3% recall (i.e., RRLD rejected less than 5% of the human-collected data); as the negative samples from the ATD collection process are not available, it is not possible to calculate RRLD’s precision on ATD. Additionally, we study the effect of flow estimation on RRLD, finding that filtering with FlowNet2 [50] and RFR flows [106] returns very similar results (0.877 Cohen’s kappa tested over 34,128 triplets).

We use our automatic pipeline to collect additional training triplets. We source data from 14 franchises in the eastern “anime” style, with premiere dates ranging from 1989-2020, totaling 239 episodes (roughly 95hrs, 8.24M frames at 24fps); please refer to our supplementary materials for the full list of sources. Here, RRLD was calculated using FlowNet2 [50], as inference was faster than with RFR [106]. While RRLD filtering presents us with 543.6k viable triplets, we only select one random triplet per cut to promote diversity; the cut detection was performed with a pretrained TransNet v2 [108]. This cuts the eligible samples down to 49.7k. For the demonstrative purposes of this work, we do not train on the full new dataset, and instead limit ourselves to doubling the ATD training set by randomly selecting 9k qualifying triplets. Please see Fig. 2.5 for examples of accepted and rejected triplets from franchises set aside for validation.

While we cannot release the new data collected in this work, our specific sources are listed in the supplementary and our RRLD data collection pipeline will be made public; this allows follow-up work to either recreate our dataset or assemble their own datasets directly from source animations.
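Putting the pieces together, the overall collection pipeline can be sketched as below. The `detect_cuts`, `is_duplicate`, and `estimate_flow` arguments are hypothetical stand-ins for the TransNet v2 cut detector, the LAB-difference duplicate detector, and the FlowNet2 flow estimator; the pan filter described earlier is omitted for brevity, and `rrld` refers to the earlier sketch.

```python
import random

RRLD_CUTOFF = 0.3

def collect_triplets(frames, detect_cuts, is_duplicate, estimate_flow):
    """Sketch of automatic triplet collection from one raw 24fps episode."""
    accepted = []
    for start, end in detect_cuts(frames):                # (start, end) frame indices per cut
        cut = frames[start:end]
        # Drop 24fps duplicates so adjacent drawn frames become triplet candidates.
        deduped = [f for i, f in enumerate(cut)
                   if i == 0 or not is_duplicate(cut[i - 1], f)]
        candidates = []
        for i in range(1, len(deduped) - 1):
            I0, It, I1 = deduped[i - 1], deduped[i], deduped[i + 1]
            score = rrld(estimate_flow(It, I0), estimate_flow(It, I1))
            if score < RRLD_CUTOFF:
                candidates.append((I0, It, I1))
        if candidates:                                    # one random triplet per cut
            accepted.append(random.choice(candidates))
    return accepted
```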
Comparison with Baselines

The main focus of our work is to improve perceptual quality, namely LPIPS and chamfer distance (as validated later by our user study results). We gather four existing frame interpolation systems (ABME [88], RIFE [48], DAIN [7], and AnimeInterp [106]) for comparison to our full model incorporating all our proposed methods. For a fairer comparison, as other models may not have been trained on the same LPIPS objective or on animation data, we fine-tune their given pre-trained models with LPIPS on the ATD training set. As we can see from Tab. 2.1, our full proposed method achieves the best perceptual performance, followed by AnimeInterp. To provide more complete information on trainable parameters, our model has 1.28M (million) compared to: AnimeInterp 2.01M, RIFE 13.0M, ABME 17.5M, DAIN 24.0M. Breaking down further, our model consists of 1.266M for SSL and 0.011M for DTM.

Ablation Studies

We perform several ablations in Tab. 2.2. In the first group, each of the modifications to Softsplat [83] (frozen ResNet [44] feature extractor, infilling, U-net [98] replacing GridNet [36]) contributes to SSL outperforming AnimeInterp [106]. The infilling technique improves performance the most.

In the second group of Tab. 2.2, we ablate the addition of new data filtered by RRLD (Sec. 2.4). Training with RRLD-filtered data improves generalization as expected. To demonstrate the necessity of RRLD’s specific filtering strategy, we train with an alternative dataset of equal size gathered from the same sources, but using a “naive” filtering approach. For simplicity, we directly follow the crude filter used in creating ATD [106]: the SSIM between any two frames of a triplet must lie within [0.75, 0.95]. We see that this naively-collected data actively damages model performance, validating the use of our proposed RRLD filter.

Splitting by eastern vs. western style, we clarify the distribution shift between sub-domains. Note that our new data is all anime, whereas 62.05% of the ATD test set is in the western “Disney” style. From the LPIPS results, the eastern style is more difficult; adding eastern-only RRLD data unexpectedly has less of an effect on eastern testing than on western. This may be because western productions tend to prioritize fluid motion (smaller displacements) over complex character designs (more details), contrary to the eastern style.

In the last group of Tab. 2.2, we train SoftsplatLite with DTM, but ablate the effect of additionally optimizing for Ldt; this way, we may see whether auxiliary supervision of NEDT improves performance under the same parameter count. Note that the upper yellow convnet of Fig. 4.1b receives no gradients in the ablation, effectively remaining at its random initialization. The results show that the prediction of line proximity information indeed contributes to performance.

User Study Results

We summarize the user study results in Tab. 2.3, and provide the full breakdown with sample animations in the supplementary. Our study had 5 participants, meaning each entry of Tab. 2.3 has support 1615 (323 compared pairs per participant). We confirm the observations made by Niklaus et al. and Blau et al. [10] that PSNR/SSIM and perceptual metrics may be at odds with one another. Despite lower PSNR/SSIM scores, users consistently preferred our outputs to those of AnimeInterp. A possible explanation is that, due to animations having larger displacements, the middle ground truth frames may be quite displaced from the ideal halfway interpolation. SSIM, as noted by previous work [138, 100], was not designed to assess these geometric distortions. Color metrics like PSNR and L1 may penalize heavily for this perceptually minor difference, encouraging the model to reduce risk by blurring; this is consistent with the behavior exhibited by the original AnimeInterp trained on L1 (Fig. 2.6). LPIPS, on the other hand, has a larger receptive field due to convolutions, and may be more forgiving of these instances. This study provides another example of the perception-distortion tradeoff [10], and establishes its transferability to 2D animation.

The user study also shows an imperfect match between LPIPS and CD. This mismatch is also reflected in Tables 2.1 and 2.2, where aggregate decreases in LPIPS do not always correspond to reduced CD. This may be because CD reflects only the line structures of an image. However, Tab. 2.3 shows that LPIPS is unexpectedly more predictive of line quality. A possible explanation is that CD is still more sensitive to offsets than LPIPS; in fact, CD grows roughly proportionally to displacement for line drawings. Thus, it may suffer the same problems as PSNR but to a lesser extent, as PSNR would penalize across an entire displaced area as opposed to across a thin line.

2.5 Limitations & Conclusion

Our system still has several limitations. By design, our model can only interpolate linearly between two frames, while real animations have non-linear movements that follow arcs across long sequences. In future work, we may incorporate non-linearity from methods like QVI [127], or allow user input from an artist.
Additionally, we are limited to colored frames, which are typically unavailable until the later stages of animation production; following related work [82], we may expand our scope to work on line drawings directly.

To summarize, we identify and overcome shortcomings of previous work [106] on 2D animation interpolation, and achieve state-of-the-art perceptual quality for interpolation. Our contributions include an effective SoftsplatLite architecture modified to improve perceptual performance, a Distance Transform Module leveraging domain knowledge of lines to perform refinement, and a Restricted Relative Linear Discrepancy metric that allows automatic training data collection from raw animation. We validate our focus on perceptual quality through a user study, hopefully inspiring future work to maintain this emphasis for the traditional 2D animation domain.

Chapter 3: Match-free Inbetweening Assistant (MIBA): A Practical Animation Tool without User Stroke Correspondence

In traditional 2D frame-by-frame animation, inbetweening (interpolating line drawings, abbr. “IB”) is still a manual and labor-intensive task. Despite the abundance of literature and software offering automation and claiming speedups, animators and the industry as a whole have been hesitant to adopt these new tools. Upon inspection, we find prior work often unreasonably expects adoption of novel stroke-matching workflows, naively assumes access to adequate centerline vectorization, and lacks rigorous evaluation with professional users on real production data. Facing these challenges, we leverage optical flow estimation and differentiable vector graphics to design a “Match-free Inbetweening Assistant” (MIBA). Unrestricted by the need for user stroke correspondence, MIBA integrates into the existing IB workflow without introducing additional requirements, and makes the raster input case feasible thanks to its robustness to vectorization quality. MIBA’s simplicity and effectiveness are demonstrated in our comprehensive user study, where users with professional IB experience achieved a 4.2x average speedup and better chamfer distance scores on real-world production data, given only a 5-minute tutorial of the new functionality.

Figure 3.1: Our MIBA system assists the inbetweening (IB) task. In traditional 2D frame-by-frame animation, animators often draw “pose-to-pose”, finishing “cleaned-up keyframes” (CU) at critical poses of a sequence first, before completing the “inbetweens” (IB) that spatio-temporally interpolate based on a specified “timing grid”. IB is notoriously labor-intensive and is still drawn manually in many productions, since existing automatic methods often fail on raster inputs and are incompatible with artist workflows. Leveraging state-of-the-art optical flow [112] and differentiable vector graphics [69], our MIBA system works robustly on scanned-in raster inputs, and integrates into the existing workflow so well that it can be learned in 5 minutes by experienced animators in the industry. Tested by professional inbetweeners on real production data, MIBA sped up IB drawing by 4.2x on average during our user study.

3.1 Introduction

In traditional 2D frame-by-frame animation, animators often draw “pose-to-pose”: first completing “cleaned-up keyframes” (or “CU”) at several critical poses of a sequence, before spatially interpolating them with “inbetweens” (or “IB”) at intervals specified by a “timing grid” (Fig. 3.1). CU and IB drawings (a.k.a.
“douga” in Japanese) can be drawn as vectors or rasters, but they are typically saved as aliased rasters to then be bucket-filled by digital ink and paint staff (DIP). The IB process demands precise linework and is notoriously time-consuming, taking anywhere between 5-40 minutes per frame depending on the difficulty. Across a single 24-minute episode at 12 frames per second, thousands of such inbetweens are drawn by hand; this scales to tens of thousands of drawings for seasonal shows and feature films.

The academic community has proposed many systems for offering computational IB assistance, typically in the form of matching vector strokes across keyframes to interpolate control points [123, 131, 54, 18, 133, 82, 132, 26]. But despite the abundance of work, there is still little adoption of automatic methods in industry. While there are many artistic, economic, and cultural factors contributing to this lack of progress, we observe that the research community has repeatedly overlooked major design considerations that have significant impact on practical usability and system evaluation. Specifically:

Consideration 1: Artists should not be expected to alter their workflow. It is prohibitively expensive for artists to make major changes to their established way of doing things, and invest time and effort into learning the intricacies of new tooling. Yet, the systems proposed in the literature consistently introduce additional requirements and functions that burden the animator (see summary in Table 3.1). By learning the existing workflow from professionals, we designed MIBA to automate specific steps without intrusively adding new ones (Fig. 3.2).

Consideration 2: Animators do not have vector keyframe inputs adequate for prior vector-based methods. Prior work has mostly focused on interpolation between matched vector strokes, but implicitly assumes properly-connected, noise-free, and/or nearly-identical vector topology on both input keyframes (Tab. 3.1). Unfortunately, we find that these requirements are satisfied by neither off-the-shelf vectorizations nor by artist-drawn vectors (Fig. 3.5, Fig. 3.6). MIBA on the other hand is robust to input vectorization quality, and so can feasibly handle the raster input case.

Consideration 3: Evaluation should be precise and representative of real-world animation production. Relevant IB user studies are few; most have participants without professional IB experience, tend to evaluate qualitatively on simple animations, lack details about procedures and training, and neglect to report key metrics like speedup and evaluation against ground truth (see summary in Tab. 3.3). We evaluate MIBA with a comprehensive user study addressing all these points.

Figure 3.2: MIBA non-intrusively assists the existing workflow. Our proposed system (bottom) seamlessly integrates into the existing workflow as described by professionals (top). Laying down all predicted lines with a simple click advances users to the “check+fix” stage of their familiar workflow, characterized by zooming globally/locally to find/fix errors and missing lines. By replacing the time-consuming step of zooming in to lay lines, we save significant effort without deviating from established practices.

Taking into account these three key considerations, we make the following contributions:
• MIBA, a “Match-free Inbetweening Assistant” leveraging deep optical flow and differentiable vector graphics [112, 69]. Unrestricted by the need for user stroke correspondence, MIBA seamlessly integrates into the existing IB workflow, and works well on raster input thanks to its robustness to vectorization quality.

• A comprehensive user study evaluating our MIBA system, in which users with professional IB experience achieved both a 4.2x average task speedup and better chamfer distance scores w.r.t. ground truth on real-world production data, given only a 5-minute tutorial of MIBA’s functionality.

Table 3.1: Comparison of new requirements and tooling in prior work. It is costly for animators to make major changes to their established workflows, and invest time and effort into learning new tooling. Yet, systems proposed in academia consistently introduce additional requirements and functions that burden the animator. Our MIBA framework on the other hand is designed to accelerate IB with a simple button click.

Additional requirements / tools to learn, per system:
MIBA (ours): none (raster input, no vector requirements); new assistant operated by button click
BetweenIT [123]: requires adequate vector input for 1-to-1 stroke correspondence (Fig. 3.5); lasso tool for stroke correspondence; point tool for vector correspondence; path tool for trajectory guidelines
VGC [26]: requires learning completely new time-vector topology data primitives
DiLight [18]: requires well-connected vectors (Fig. 3.6); path tool for stroke correspondence; must set stroke matching tolerance parameter
FTP-SC [133]: requires adequate vector input for 1-to-1 stroke correspondence (Fig. 3.5); matching tool for stroke correspondence
Narita et al. [82]: cannot generate vector output, only rasters are supported (Fig. 3.4)
CACANI [54]: requires adequate vector input for 1-to-1 stroke correspondence (Fig. 3.5); requires stroke depth layering; requires stroke occlusion orientation; new grouping, linking, inverting tools
AnimeInbet [107]: requires strictly straight-line vectors (does not support curved lines); requires well-connected and densely-sampled vectors (Fig. 3.6)

3.2 Related Work

Prior vector-based inbetweening works first and foremost address the problem of vector stroke correspondence [123, 18, 133]. They assume two vector keyframe inputs with extremely similar and well-connected topologies, between which they match strokes. Based on the discovered 1-to-1 correspondence, control points are interpolated to derive the target IB vector representation. BetweenIT [123] and FTP-SC [133] provide a number of semi-automatic user tools to achieve this exact vector match, while DiLight [18] proposes guideline tools and loosens the strict correspondence requirement. More recently, AnimeInbet [107] proposed a method of fusing the two graph topologies. However, we often see that neither off-the-shelf vectorizers [122, 9, 79] nor users provide adequate vectorization quality for these methods (Fig. 3.5 & 3.6). In order to break free of the limitations imposed by vectorization, our match-free methodology completely discards the need to correspond disparate vectors, achieving a simple-to-use tool robust enough to work on vectorizations of raster scan-ins.

Stroke occlusion resolution and aesthetic non-linear curve interpolation are other aspects of IB work. CACANI [54], for example, proposes a layered and oriented representation of strokes, and is able to infer occlusions automatically. BetweenIT [123], FTP-SC [133], and DiLight [18] all propose their own methods for allowing user-specified non-linear trajectories.
In our work on MIBA, however, we have found that simple linear interpolation works well for real-world production IB data, and that improper occlusion is relatively quick for users to fix. We thus focus on removing the stroke correspondence paradigm; however, as the occlusion and interpolation techniques are orthogonal to our match-free contribution, they can be added on top of our workflow if desired.

Another very different approach to IB is taken by Narita et al. [82], who discard the vector representation altogether in favor of rasters. They propose the direct use of optical flow to warp between keyframes (similar to video frame interpolation systems commonly used for natural RGB videos [23, 7, 48]), and improve flow estimation on sparse line drawings with the distance transform. While this approach indeed relieves the user of vector considerations, the results are heavily dependent on optical flow performance. As failure cases are common for animations with large displacements and sparse lines, the user would need to edit an inflexible raster representation (Fig. 3.4). Our MIBA, on the other hand, tackles vector representation issues without resorting to rasters.

Yet another alternative is proposed by Dalstein et al. [26], who introduce a novel “vector animation complex” data structure to manage interpolation of topologies across both space and time. However, the new representation is a departure from the conventional vector graphics paradigm, and prohibitively requires the artist to learn a new framework of thought in order to inbetween.

As summarized in Tab. 3.1 and illustrated in Fig. 3.2, our MIBA system distinguishes itself from prior work by not requiring significant changes to animator tooling or workflow. Additionally, we provide one of the most comprehensive user studies for IB in the literature to evaluate MIBA (Tab. 3.3); we report metrics with respect to ground-truth production data from the real world, provide explicit details on speed and interaction gains, and test with professional IB animators.

Figure 3.3: Schematic of our proposed system. Given the left two raster keyframes (I0, I1) and the interpolating position on the timing grid (t = 0.5), our system produces a vector inbetween (V̂t, right) that can then be adjusted as needed, without requiring the user to specify stroke correspondences between vectorized versions of the input frames. The key to achieving this match-free framework lies in our ability to robustly warp and align the vectorization of the first frame (V0, top) to a raster of the second frame (I1, bottom). This way, we obtain topologically identical vectors (red) aligned to respective frames, ready for interpolation. The system leverages optical flow estimation and differentiable vector graphics to produce reasonable stroke-raster alignment.

3.3 Methodology

Our system operates on a single timing grid sequence at a time (Fig. 3.1). As illustrated in Fig. 3.2, MIBA lays down predictions of where the vector lines of the IB should be placed. Our system uses a match-free method to return a vector IB drawing for each interval specified by the timing grid. The user may then use common vector editing tools, or choose to rasterize at any time and use common raster tools. Below, we define the input and output representations more formally, before describing our algorithm and the provided user interaction tools.

3.3.1 Input/Output Representations

The inputs to the MIBA system are a timing grid T and two cleaned keyframes (Fig. 3.3, left-most side).
The timing grid T is a sorted array of unique values t_i ∈ [0, 1], whose first and last entries are 0 and 1 respectively. The two CU keyframes are assumed to be rasters (I0, I1 in Fig. 3.3). In the case where a vector representation is available, we still rasterize before inputting to our system; this ensures consistency of outputs, and relieves any mental burden on the animators to consider line topology when drawing CU. Similar to the IB to be generated, the CU keyframes are binary-aliased for eventual digital coloring with paint-bucket tools. The CU often have only a few colors aside from black; by default in the Japanese pipeline, red denotes highlights, blue is shadow, and green is a special effect or second shadow/highlight.

The output representation of MIBA is a vector graphics representation (V̂t in Fig. 3.3), which can be rasterized if desired. We use cubic Bézier curves, with a polar representation for control points. The representation backend supports arbitrary graph connectivity, though for our experiments we only work with acyclic graphs, similar to SVGs. In practice, our system renders curves as short piecewise-linear line segments; we thus support variable line width across a single curve, although we found that a single global thickness worked well enough for our specific data.
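To make the representations concrete, a minimal sketch of plausible container types is given below; these are illustrative assumptions rather than MIBA's actual data structures, which are not specified beyond the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BezierStroke:
    """One cubic Bezier curve; handles are stored in polar form relative to their endpoints."""
    p0: Tuple[float, float]        # start vertex (x, y)
    p3: Tuple[float, float]        # end vertex (x, y)
    h0: Tuple[float, float]        # outgoing handle at p0 as (angle, radius)
    h1: Tuple[float, float]        # incoming handle at p3 as (angle, radius)
    width: float = 1.0             # a single global thickness sufficed in practice

@dataclass
class VectorFrame:
    strokes: List[BezierStroke] = field(default_factory=list)
    edges: List[Tuple[int, int]] = field(default_factory=list)   # endpoint adjacency; acyclic here

timing_grid = [0.0, 0.5, 1.0]      # one inbetween requested at the halfway position
```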
3.3.2 Match-free Inbetweening

Our proposed match-free inbetweening method is illustrated in Fig. 3.3. Similar to previously proposed frameworks, each curve of the output is derived by interpolating between the vertex/handle coordinates of two existing curves (V0 and V1, one representing each vectorized input image). However, to fit this paradigm, prior work struggles at finding the two correct corresponding strokes to inbetween from each input, because the input vectorizations have either unreasonably different topology (Fig. 3.5) or inadequately noisy connectivity (Fig. 3.6).

The key insight of our MIBA method is that we can warp one frame’s vector (V0) to align with the other’s raster (I1). This way we do not need to match two disparate vectorizations, and instead can directly interpolate between two coordinate-value configurations of the same vector topology (Fig. 3.3, red). Put another way, we find offsets from one vector to make it look like the other raster; interpolation equates to moving by a specified fraction of those offsets. Naturally, since no matching occurs between different vectorizations, MIBA is robust to the connectivity and noisiness of the input vectors; by discarding the stroke correspondence paradigm, we make IB assistance on vectorized scan-ins a feasible reality (Fig. 3.6). In addition, animators can easily operate MIBA without the intricate stroke correspondence and match guidance tooling required by prior work (Tab. 3.1).

From the perspective of solving the stroke occlusion problem, MIBA intrinsically resolves occlusion at the warp and align stages. At these steps, strokes are effectively occluded by shortening curves to the occlusion boundary, in order to satisfy alignment to the opposing raster frame. Note that this occurs independently of local topology at the junction, as the optimization condition is imposed directly on a raster render of the stroke graph. While we found that improper occlusions may still appear from imperfect alignments, animators are able to fix them relatively quickly.

Note that MIBA is asymmetric with respect to the input frame order; in other words, the system output for interpolating t = 0.5 will be different if the two input CU keyframes are swapped. Users can choose which direction to run MIBA in, although in practice we find many users simply stick with the default forward direction.

The subsections below describe in more detail the steps outlined in Fig. 3.3, and how we leveraged state-of-the-art optical flow and differentiable vector graphics to achieve this non-trivial vector-raster alignment across frames.

Vectorization with Alignment Post-processing

As previously mentioned, MIBA only works with a single vector topology that does not need to be matched; the method is thus robust to the input vector representation quality. In Fig. 3.6, we demonstrate our system with Weber AutoTrace [122], PolyVectorization [9] (simplified with [101]), and Virtual Sketching [79]; in all three cases, our system was able to deliver similarly reasonable results. We default to using AutoTrace in our program for its quick processing speed.

Across the different vectorization algorithms, we inevitably found imperfections in the linework that did not match the input raster; usually, this came in the form of “wobbly” curves. To improve the base vectorizations (V'_0), we optimize the vector image to match the input raster (I0) using DiffVG [69] by minimizing

V_0 = \arg\min_{\theta} \; L_2\big(R(\theta),\, I_0\big),   (3.1)

where \theta represents the vertices and Bézier control points initialized at V'_0, and R denotes the DiffVG differentiable vector rendering operation. The optimization is run for 32 iterations, with Adam [62] on an MSE image loss with unit learning rate. To facilitate the alignment of initially distant lines, we additionally apply a Gaussian blur to the differentiable render before loss evaluation for the first half of the optimization (with a ramping sigma).

Vector Warping & Alignment

The goal of warping here is to roughly position the post-processed vectorization (V0) over the other raster frame (I1), to initialize the more expensive alignment optimizations. We estimate the optical flow (F) between the two raster inputs using an off-the-shelf RAFT model [112]; inspired by previous work on raster inbetweening of line art [82], we preprocess the sparse line drawings into dense distance transform images to improve the flow estimation.

To perform the warp, we simply offset each vertex by the flow sampled at its image coordinates, denoted as V'_1 = V_0 + F[V_0]. Even without modifying the polar Bézier handle values, we found that the warp gave results that were reasonable, but not yet acceptable for interpolation without further alignment.

With the vectorization of the first frame roughly warped to the second raster, we remove remaining alignment imperfections by DiffVG optimization [69]. The optimization process is the same as the vector post-processing previously described in Sec. 3.3.2, but \theta is initialized at V'_1 and the raster target is instead I1:

V_1 = \arg\min_{\theta} \; L_2\big(R(\theta),\, I_1\big).   (3.2)

Note that there may still be imperfect alignments, since the inherent topology of the two frames is often different (Fig. 3.4, 3.5, 3.6). However, we find that in many cases these incompatibilities can be quickly fixed by the animator after interpolation, still at a time discount compared to manually laying down all the lines.

Vector Interpolation

Despite the emphasis that prior work puts on interpolating aesthetically between two curves [123, 133, 18], we find that a simple linear interpolation of vertex coordinates and polar Bézier handle values is sufficient to satisfy professional IB animators.
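A consolidated sketch of the warp, align, and interpolate steps is given below. The `render` argument stands in for DiffVG's differentiable rasterizer and is not implemented here; for brevity, θ covers only vertex positions (the full system also optimizes Bézier handles), and the flow channel order, blur kernel size, and sigma schedule are our own assumptions.

```python
import torch
import torch.nn.functional as F
from kornia.filters import gaussian_blur2d

def warp_and_align(V0, flow, I1, render, iters=32):
    """Warp vertices V0 (N, 2) by the flow (2, H, W), then align to raster I1 (1, 1, H, W)."""
    # Warp: V'_1 = V0 + F[V0], i.e. offset each vertex by the flow at its pixel location.
    # Flow channels are assumed ordered (dx, dy).
    x = V0[:, 0].long().clamp(0, flow.shape[2] - 1)
    y = V0[:, 1].long().clamp(0, flow.shape[1] - 1)
    V1_init = V0 + flow[:, y, x].T

    # Align: refine vertex positions so the differentiable render matches the target raster.
    theta = V1_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=1.0)           # unit learning rate, 32 iterations
    for i in range(iters):
        opt.zero_grad()
        pred = render(theta)                           # (1, 1, H, W) raster of the stroke graph
        if i < iters // 2:                             # blur the render early on, sigma ramping down
            sigma = 8.0 * (1.0 - 2.0 * i / iters)
            pred = gaussian_blur2d(pred, (9, 9), (sigma, sigma))
        loss = F.mse_loss(pred, I1)
        loss.backward()
        opt.step()
    return theta.detach()

def lerp_inbetween(V0_aligned, V1_aligned, t):
    """Linear interpolation between two coordinate configurations of the same topology."""
    return (1.0 - t) * V0_aligned + t * V1_aligned
```

The post-processing of the initial vectorization (Eq. 3.1) follows the same optimization loop, with I0 as the target and no warp step.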
As the choice of interpolation method here is orthogonal to our contribution of match-free assistance, this step can be freely interchanged with interpolation schemes from other work. For timing grids with more than one IB, we simply cache the alignment offsets and lerp to the new intervals as needed. Once the vectorization of the first frame is appropriately offset to the target IB, the user is free to correct any imperfections of the MIBA output using vanilla vector manipulation tools.

3.3.3 User Interaction

We implemented an interactive web browser app with Vue.js. Our app has basic features typical of modern IB software, including a navigable viewport, a toolbar with options, raster and vector layers, frame/layer selection panels, onion-skinning and frame-flipping, tool keybindings, undo/redo history, etc. Pen/stylus input is supported, and is functionally equivalent to the mouse.

MIBA assistance is implemented as buttons attached to each timing grid sequence the user is asked to IB. The user clicks on the provided “assist” button, which provides a preview of the generated IBs, and then clicks “use” on the previews they would like to keep. This inserts the MIBA output as a vector layer on the frame in question, which the u