ABSTRACT

Title of Thesis: SYNPLAY: IMPORTING REAL-WORLD DIVERSITY FOR A SYNTHETIC HUMAN DATASET

Jinsub Yim, Master of Science, 2024

Thesis Directed by: Professor Shuvra S. Bhattacharyya, Department of Electrical and Computer Engineering

In response to the growing demand for large-scale training data, synthetic datasets have emerged as practical solutions. However, existing synthetic datasets often fall short of replicating the richness and diversity of real-world data. Synthetic Playground (SynPlay) is introduced as a new synthetic human dataset that aims to bring out the diversity of human appearance in the real world. In this thesis, we focus on two factors to achieve a level of diversity that has not yet been seen in previous works: i) realistic human motions and poses, and ii) multiple camera viewpoints towards human instances. We first use a game engine and its library-provided elementary motions to create games where virtual players can take less-constrained and natural movements while following the game rules (i.e., rule-guided motion design as opposed to detail-guided design). We then augment the elementary motions with real human motions captured with a motion capture device. To render various human appearances in the games from multiple viewpoints, we use seven virtual cameras encompassing the ground and aerial views, capturing abundant aerial-vs-ground and dynamic-vs-static attributes of the scene. Through extensive and carefully designed experiments, we show that using SynPlay in model training leads to enhanced accuracy over existing synthetic datasets for human detection and segmentation. Moreover, the benefit of SynPlay becomes even greater for tasks in the data-scarce regime, such as few-shot and cross-domain learning tasks. These results clearly demonstrate that SynPlay can be used as an essential dataset with rich attributes of complex human appearances and poses suitable for model pretraining.

SYNPLAY: IMPORTING REAL-WORLD DIVERSITY FOR A SYNTHETIC HUMAN DATASET

by Jinsub Yim

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Master of Science, 2024

Advisory Committee: Professor Shuvra S. Bhattacharyya, Chair/Advisor; Dr. Heesung Kwon; Professor Jonathan Z. Simon

© Copyright by Jinsub Yim 2024

Acknowledgments

First and foremost, I would like to thank my advisor, Prof. Shuvra S. Bhattacharyya, for the opportunity to work with him and for his unwavering support and encouragement throughout this journey. Working alongside him, I discovered that this period has been the most valuable time of my life. Secondly, I extend gratitude to Dr. Heesung Kwon for his invaluable contribution to this project. His generous support and assistance have been instrumental throughout its development. Without his extraordinary insights and expertise, this thesis would have remained a distant dream. I would also like to thank Prof. Jonathan Z. Simon for providing insightful and valuable feedback. I would like to thank our research group members, especially Dr. Hyungtae Lee and Dr. Sungmin Eum. From the inception of a brilliant idea to the culmination of this project, their direct support has played a pivotal role in achieving the results we see today. Additionally, I would like to express thanks to Yi-Ting and Yan Zhang for their contributions to this project. Lastly, I express my gratitude to my wife, Songhee Jung. Her love and support have meant everything to me. I am so grateful to have her by my side.
And to our beloved daughter, Tierry Yim, who joined our lives this year, thank you for being our special gift.

This research was sponsored in part by the Army Research Office and Army Research Laboratory (ARL) and was accomplished under Grant Number W911NF-21-1-0258. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, Army Research Laboratory (ARL), or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Table of Contents

1 Introduction
2 Related Works
3 SynPlay Dataset: Methods and Diversity
3.1 Diverse yet realistic human motion
3.2 Multiple viewpoints
3.3 Scenario Design
3.4 Other Design Factors
3.4.1 3D Animated Characters
3.4.2 Background
3.5 SynPlay Statistics
3.5.1 Character Distribution
3.5.2 Altitude Distribution
3.5.3 Perspective Distribution
3.5.4 Bbox Size Distribution
3.5.5 Bbox Heatmap Distribution
3.5.6 Dataset Comparison
3.6 SynPlay Sample Images
4 Task Evaluation
4.1 General tasks: detection and segmentation
4.1.1 Experiment Setting
4.1.2 Aerial-view tasks
4.1.3 Ground-view tasks
4.1.4 Combination with MS COCO for pre-training
4.2 Data-scarce tasks: Few-shot and cross-domain learning
4.2.1 Experiment Setting
4.2.2 Comparison with other synthetic data
4.2.3 Scaling behaviors
4.3 Image Quality Evaluation
5 Discussion and Conclusion
5.1 Discussion
5.2 Conclusion
Bibliography

Chapter 1 Introduction

Large-scale synthetic datasets, known for their scalability, provide a practical solution to the increasing demand for training large-capacity models. These datasets are crucial as they enable the development and fine-tuning of advanced machine-learning algorithms without the constraints of limited real-world data. By offering a vast amount of diverse and high-quality data, synthetic datasets ensure that models can generalize better and perform more accurately in various scenarios. Moreover, the ability to generate tailored datasets for specific applications accelerates the innovation process, allowing researchers to test and implement novel ideas quickly.
This scalability and flexibility make synthetic datasets an indispensable tool in the advancement of machine learning technologies. Recently developed rendering engines (e.g., Unity [49] and Unreal [16]) have significantly enhanced the realism of synthetic data, broadening its applicability across various computer vision tasks. Despite efforts to scale synthetic data to match the extensive curation of real-world data, the desired level of diversity has not yet been achieved. This insufficiency in diversity is largely due to the inadequate consideration and integration of key factors that are essential to real-world diversity in the process of creating synthetic data.

In the past few years, several attempts have been made to increase human appearance diversity by controlling innate characteristics (e.g., race, gender) [5], body shape (e.g., height) [5, 36], or clothing [5]. These datasets have demonstrated their effectiveness in tasks aimed at identifying human characteristics from close-up images, e.g., human body/pose estimation [5, 36] and shape reconstruction [33]. However, these datasets have seldom yielded a discernible positive impact on computer vision tasks aimed at identifying humans from a distance, e.g., human detection and segmentation. When it comes to recognizing the overall human appearance from a distance, the motions and poses exhibited by the individuals play more vital roles than other characteristics. Despite prior attempts to synthesize various human poses, the quality of the rendering remained suboptimal, lacking in realism [39] and diversity [45]. AMASS [33] was the output of an early endeavor aimed at achieving both realism and diversity, where a motion scanner was utilized to collect real human motions. Before capturing these motions, detailed descriptions were provided to articulate specific movements, e.g., "5 seconds waving above the head with both arms" (a description used to construct the Mocap Database HDM05 in AMASS; https://resources.mpi-inf.mpg.de/HDM05/05-01/index.html), while adhering to physical constraints that limit large movements in motion-capture environments. This detail-guided motion design often results in capturing a restricted range of motions tied to specific descriptions while missing out on all the motions that defy easy description.

We claim that providing relatively high-level, less-detailed guidance greatly helps in breaking out from the aforementioned limitations and provides more freedom towards the expansion of the diversity in human motions. In constructing our dataset, we follow a new rule-guided motion design approach, providing game "rules" or winning strategies for the virtual players to follow, which serve as a set of significantly coarser guidelines when compared to detail-guided approaches. In this way, the motions that they manifest are not confined to predetermined or easily describable motions. As for the "rules," we opted to borrow them from the six traditional Korean games that were also played in the Netflix TV series "Squid Game" [23]. These games involve substantial amounts of physical movements, which naturally provide room for a diverse set of human poses and motions. The diversity is further influenced by in-game factors such as the uniquely defined rules of each game, the number of players, and the interactions between them.
Under our rule-guided motion design approach, each scenario run (i.e., one round of a game played with specific settings) in the virtual environment is initialized by carrying out a scenario design step, which is followed by the incorporation of real-world motions. The scenario design involves the setup of all the parameters that control the appearance, players (winners and losers), game dynamics (e.g., how/when each game ends), and the human motion evolutions for each specific scenario. This is where the high-level rules of a given game are defined, and the coarse boundary of how human motions can evolve within the game is set. The incorporation of real-world motion is the phase where a rich variety of motions truly comes to life. Details on the entire pipeline will be elaborated in Chapter 3.

In addition, we also took into account that human appearance can vary greatly depending on the perspective from which it is viewed. Accordingly, we capture every scene from multiple viewpoints by implementing several image-capturing devices to take advantage of different perspective-related characteristics: three Unmanned Aerial Vehicles (UAVs), three Closed-Circuit Televisions (CCTVs), and one Unmanned Ground Vehicle (UGV). The three UAVs fly with random trajectories at different altitudes, the three CCTVs are located at the front, side, and back of the game playground, and the UGV moves randomly within the playground where the game is being played. These devices offer a variety of image-capturing properties, including aerial-vs-ground and dynamic-vs-static. Our strategy, designed to provide very diverse viewpoints in the scene capture process, not only serves to ensure that the dataset includes more diverse human appearances but also broadens the potential tasks (e.g., re-identification, multi-view applications, aerial-to-ground scene matching, etc.) for which the dataset can be used.

Figure 1.1: The SynPlay dataset is constructed while players play six traditional games in a virtual playground, as also introduced in the Netflix TV show "Squid Game" [23]. We have diversified the human appearances in the scenes by focusing on two factors: i) leveraging real-world human motions and ii) adopting multiple viewpoints.

By leveraging the aforementioned human appearance-diversifying strategies, we construct a large-scale synthetic human dataset called SynPlay that contains more than 73k images with 6.5M human instances; see sample images in Fig 1.1. To demonstrate SynPlay's ability to represent a variety of human appearances to the extent seen in the real world, we conduct a series of experiments where we evaluate the impact of leveraging SynPlay alongside real (non-synthetic) datasets curated for a variety of human-related tasks, i.e., aerial-view/ground-view human detection and segmentation. For all the tasks, training with SynPlay outperforms its counterparts (i.e., training from scratch or using other synthetic data) across a variety of datasets. Experiments also demonstrate that the SynPlay dataset significantly improves model performance on data-scarce tasks, highlighting its value in scenarios that require substantial supplementary training data.

Research Questions. To effectively evaluate SynPlay's role as supplemental training data, we designed our experiments around three key questions:

• Can SynPlay impact general computer vision tasks? This question delves into SynPlay's ability to improve the generalizability of computer vision models across a range of common tasks.
It assesses whether SynPlay effectively handles tasks that require identifying and understanding diverse human appearances, which can be challenging due to variations in pose, clothing, and lighting conditions.

• Can SynPlay perform effectively for data-scarce tasks (few-shot, cross-domain learning)? This question focuses on SynPlay's ability to supplement real-world data in situations where data is scarce or limited.

• How does SynPlay's data quality compare to other synthetic datasets? This question breaks down the comparison into two key aspects: data quality metrics and their correlation with task performance.

Chapter 2 Related Works

Synthetic human data. The creation of various synthetic human datasets has been facilitated by the advancement of modern synthetic data rendering engines such as Blender [6], Unity [49], and Unreal [16], alongside human modeling tools like MakeHuman [34] and Character Creator [47]. These rendering engines enable a realistic representation of humans in 3D virtual environments, while the modeling tools give creators precise control over the design of virtual characters. The creators of these datasets leveraged these tools to meticulously control key design factors, ensuring suitability for specific tasks, e.g., SOMAset [3], PersonX [48], UnrealPerson [53], and CARGO [52] for re-identification, SURREAL [51] for pose estimation, GTA5 [39] for semantic segmentation, and Archangel-Synthetic [45] for detection. Recently, several attempts have been made to enhance the realism of virtual human models, with the aim of bringing them closer to the quality of their real-world counterparts. SMPL-X [37] and AMASS [33] used motion-capturing devices to capture natural human motions. BEDLAM [5] tried to improve the diversity of various factors such as skin tones or clothing that affect human outer appearance, while still relying on SMPL-X. However, motion scanners impose constraints on the environment, particularly in capturing large motions or events involving multiple humans. While AGORA [36] and ScoreHMR [46] sought to move away from using motion scanners by fitting human body models to the human motions in real-world images/videos, the quality of the fitted human models declined drastically on images taken from a distance. One of our goals for our dataset was to incorporate multiple viewpoints, including distant views of the target scene. To avoid compromising the quality of human motions/poses, we chose to use motion capture devices, while implementing our approach to ensure that the final results in the dataset are not limited by the environments in which the devices were used.

Natural human motion acquisition. Whether we are creating a real or synthetic dataset, images of motions captured by directing humans to perform specific actions based on a description often appear awkward rather than natural. Because of that, most datasets aim to include humans engaged in daily activities (e.g., MS COCO [29], MPII Human Pose [2]), performing tasks such as sports (e.g., UCF-Sports [41], SoccerNet [10], SportsMOT [13]) or art (e.g., Human-Art [26]) to capture their motions and poses in the most natural states possible. However, it is self-contradictory to artificially create a virtual event to capture natural motions associated with the event.
In this thesis, we aim to avoid this self-contradiction by initially designing the virtual events (i.e., the aforementioned games) using existing but non-natural virtual motions, which are then replaced by real-world motions captured using a motion capture device.

Supplemental datasets for training. Enhancing model performance by supplementing the training with additional data has been a common strategy [27, 38]. Initially, this involved combining datasets constructed with the same purpose, like MS COCO [29] and PASCAL VOC [17], for tasks such as object detection. Some approaches utilized large-scale datasets (e.g., ImageNet [14] or Instagram [32]), which were not necessarily designed for the target task, to build foundational features, followed by transfer learning such as pretrain-finetune [19] or PTL [43] to adjust the model on the target dataset. As models have grown in size and complexity (e.g., ViT [15]), the demand for large-scale, high-quality datasets has increased, but the high costs of annotation present a significant barrier. To address this, label-agnostic training methods like self-supervised learning [7, 8, 20, 21] and synthetic dataset generation with cost-free annotations have emerged as viable solutions. In response, we have specifically designed SynPlay to supplement various computer vision tasks that require a large-scale, highly diversified human appearance set.

Chapter 3 SynPlay Dataset: Methods and Diversity

In creating our synthetic dataset, we focus on two key factors: rule-guided motion design and multiple viewpoints. These are crucial for capturing the wide range of real-world variations, enabling us to mirror this diversity in our dataset. By employing rule-guided motion design, we establish high-level game rules that allow virtual players to generate diverse and realistic movements, avoiding the limitations of predetermined motions. Capturing scenes from multiple viewpoints with various image-capturing devices ensures diverse perspectives. This approach enhances the diversity of human poses and appearances and broadens the dataset's applicability for tasks like re-identification, multi-view applications, and aerial-to-ground scene matching. We drew inspiration from traditional Korean games featured in the Netflix series "Squid Game" [23], leveraging their diversity in physical activities and interactions.

Using these methodologies, we created ten different scenarios for each game, resulting in 60 game scenarios in total. Frames were rendered from seven different camera viewpoints at 1 fps with a resolution of 1920×1080, resulting in a total of 73,892 images with more than 6.5M human instances. The frame generation rate was selected as 1 fps to avoid including highly redundant human poses. Taking full advantage of the game engine's ability to generate annotations while rendering the scenes, we provided various types of ground truth annotations useful for various computer vision tasks: 2D/3D bounding boxes, instance-level segmentation masks, depth maps, and human keypoint locations.

Figure 3.1: Game sequence generation pipeline. This illustrates how we create a sequence for a tug-of-war game, including an example of how we incorporate real-world motions towards the elementary motion state of pull. In the motion evolution graph, the start and end nodes are indicated by green and red circles, respectively. A diverse set of pull motion instances is shown below the image of the rendered scene.
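To make the annotation content above concrete, the following is a minimal Python sketch of how a single frame's ground truth could be organized. The field names and types are illustrative assumptions for the reader's convenience, not the actual SynPlay file format.

```python
# Hypothetical sketch of a per-frame SynPlay annotation record.
# All field names are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class HumanInstance:
    instance_id: int                                 # unique player ID within the scenario
    bbox_2d: Tuple[float, float, float, float]       # (x, y, w, h) in pixels
    bbox_3d: Tuple[float, ...]                       # 3D box parameters in world coordinates
    keypoints: List[Tuple[float, float]]             # 2D human keypoint locations
    mask_rle: str                                    # run-length-encoded instance mask

@dataclass
class FrameAnnotation:
    scenario_id: int        # one of the 60 game scenarios
    camera: str             # e.g., "UAV_low", "CCTV_front", "UGV"
    frame_index: int        # frames rendered at 1 fps, 1920x1080
    depth_map_path: str     # per-pixel depth rendered by the engine
    instances: List[HumanInstance] = field(default_factory=list)
```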
3.1 Diverse yet realistic human motion

We use rule-guided motion design in our SynPlay dataset, borrowing "rules" from six traditional Korean games, also featured in the Netflix series "Squid Game" [23] (our game scenarios are designed based on the traditional game rules, without taking any specific situations from the show). This approach offers coarser motion guidance for the virtual players, facilitating the generation of a wide spectrum of natural motions, even including the ones that defy detailed description.

Our rule-guided motion design is effectively baked into the overall sequence-generating design pipeline, as shown in Figure 3.1, which consists of the scenario design followed by the incorporation of real-world motions. The scenario design involves the setup of all the parameters that control the appearance, players (winners and losers), game dynamics (e.g., how/when each game ends), and the human motion evolutions for each specific scenario. All the items within each scenario that do not have to be hard-coded (e.g., game rules) are selected randomly when designing each scenario. The motion evolution of each virtual player in a specific game is governed by a graph structure where all possible elementary motions and their potential transitions are represented as nodes and directed edges, respectively. Each node is tied to a pool of motions that fall under the same elementary motion state (e.g., move, sit). As the game progresses, a virtual player evolves its motion by following the directed edges and stays at each node according to the "motion timing", which is also defined in the scenario design. At each state, the virtual human randomly chooses to exhibit one of the motions in the corresponding pool. Note that, while a uniquely designed scenario is used for each unique sequence, the same motion evolution graph [50] is used for all the sequences captured under the same game rule. Fig 3.2 shows the motion evolution graphs used in designing the game scenarios for the SynPlay dataset. Even within the same game, the scenario may change, but the motion evolution graph remains consistent. It is noteworthy that, despite the wide range of situations and the variety of motions involved in the games, the motion evolution graph for each game consists of only a few motion nodes and their transitions. Given that each node encompasses a range of motions, this illustrates the essence of a rule-based design approach where only basic game rules are provided to freely allow the diverse array of human motions to be manifested.
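The motion evolution mechanism described above can be viewed as a random walk over a state graph. The Python sketch below illustrates this using tug-of-war states similar to those in Figs. 3.1 and 3.2; the clip names, pool sizes, and exact transition structure are illustrative assumptions rather than the precise graph used in SynPlay.

```python
import random

# Simplified motion evolution graph for a tug-of-war player. Each node carries a
# pool of interchangeable motion clips; names and pool contents are placeholders.
GRAPH = {
    "entry": ["stand"],
    "stand": ["move", "stand-to-sit"],
    "move": ["stand"],
    "stand-to-sit": ["sit"],
    "sit": ["grab"],
    "grab": ["pull"],
    "pull": ["pull", "win", "lose"],   # keep pulling until the game ends
    "win": [], "lose": [],
}
MOTION_POOLS = {state: [f"{state}_clip_{i}" for i in range(3)] for state in GRAPH}

def evolve(start="entry", max_steps=20):
    """Random walk over the graph; at each state, pick one clip from its pool."""
    state, timeline = start, []
    for _ in range(max_steps):
        timeline.append((state, random.choice(MOTION_POOLS[state])))
        if not GRAPH[state]:            # reached 'win' or 'lose'
            break
        state = random.choice(GRAPH[state])
    return timeline

print(evolve())
```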
Figure 3.2: Motion evolution graphs. The start node ('entry') and the end nodes ('win' or 'lose') are indicated by green and red circles, respectively. For the games where secondary graphs are available (i.e., sugar candy or squid), at any given time (except at the start or end node), the current state in the main graph can move to the 'any state' node (blue-filled circle) in the secondary graph. When the 'end' node (red-bordered circle) is reached within the secondary graph, the current state moves back to the latest node that was touched in the main graph before entering the secondary graph.

Before incorporating the real-world motions, we leverage two techniques to pre-diversify the elementary human motions readily available in human motion libraries such as Mixamo [1]: i) dynamically blending two existing motions of similar types to generate a new motion type (e.g., blending slow-walking and running to generate hasty-walking), and ii) using elementary motions as animation layers to make a new motion (e.g., raising hands while walking). Fig 3.3 and Fig 3.4 show several examples of the blending process and of how the animation layers are leveraged, i.e., the two techniques for expanding human motions within the virtual environments, respectively. Interestingly, the motions created by blending are largely different from their corresponding input motions, while the motions created via the animation layers still exhibit appearances and dynamics resembling both input motions. These two techniques are readily available for use within the Unity environment.

Figure 3.3: Two motion blending examples. For each example (left or right column), the two motions (top and middle rows) are blended together to generate a new motion (bottom row). The blending ratio between the two input motions can be controlled. The blending process does not depend on the specific names given to the motions.

Figure 3.4: Three examples of leveraging animation layers. For each example (left, middle, or right), the resulting motion of leveraging the animation layers over two input motions (top and middle rows) is shown in the bottom row. Note that the semantic labels (e.g., walking, cheering) were not provided at the time of capture; they are included in the figure only for the convenience of the reader.
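As a conceptual illustration of the two pre-diversification techniques above, the sketch below treats a motion as a simple per-frame sequence of joint values. In Unity these operations correspond to blend trees and animation layers; the code is only a stand-in under that simplification, not engine code, and the clip values are invented for the example.

```python
import random

def blend(motion_a, motion_b, ratio):
    """Frame-wise interpolation of two same-length motions (e.g., walk + run)."""
    return [(1.0 - ratio) * a + ratio * b for a, b in zip(motion_a, motion_b)]

def layer(base_motion, overlay_motion):
    """Additive animation layer (e.g., raising hands while walking)."""
    return [a + b for a, b in zip(base_motion, overlay_motion)]

# Toy 30-frame "motions" represented as a single joint value per frame.
slow_walk = [0.1 * i for i in range(30)]
run       = [0.3 * i for i in range(30)]
wave      = [0.05 * i for i in range(30)]

hasty_walk    = blend(slow_walk, run, ratio=random.uniform(0.3, 0.7))
walk_and_wave = layer(slow_walk, wave)
```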
On top of the pre-diversified set of motions for each game, a real human player wearing a motion capture device (a SmartSuit Pro II and a pair of Smartgloves from Rokoko, http://rokoko.com) is asked to either similarly mimic or newly create motions that align with the given game rule (thus, rule-guided). For example, for the game of tug-of-war, human players were provided with the game rule and then asked to reenact any possible motion with the freedom of choosing the winning or losing side. For some games, more than one player was asked to play the game together to capture the motions that can naturally arise at the time of physical interactions. As a result of incorporating the real-world motions, the total number of unique motions in SynPlay increased from 104 to 257.

Figure 3.5: Real-world motion examples. Real-world motions are acquired either (a) by mimicking reference motions or (b) by exhibiting potential in-game motions, without any references, that align with the given game rules. Wearable motion scanners are used in all cases.

Fig 3.5 shows several examples of real-world motions. Real-world motions are created either by having the real human wearing the motion capture device mimic the pre-provided reference motions or by demonstrating potential in-game motions under the given game rules. It is observed that real-world motions can express a wider range of specific actions while maintaining a sense of realism. Moreover, motions that are difficult to pinpoint or describe can also be created, e.g., multi-person wrestling motions.

3.2 Multiple viewpoints

The camera viewpoints within SynPlay are diversified by implementing three widely used types of image-capturing platforms in the real world: UAV, UGV, and CCTV. They cover a variety of image-capturing properties such as static/dynamic and ground/aerial. Viewpoint diversity is acquired by controlling the locations and the focal points of the cameras. Three UAVs, three CCTVs, and one UGV have been deployed (Figure 3.6), resulting in seven unique viewpoints for every game sequence. The UAVs are deployed to fly at various random locations while maintaining altitudes of low (∼30m), medium (∼50m), and high (∼100m). CCTVs are located at a height of 15m at the front, back, and one side of the game playground. UGV images are captured assuming that a vehicle is randomly roaming on the ground. The focal points are set at several locations close to the area where the game usually takes place. For the UAVs, the focal point is changed to a random location every 10 sec, where each change takes 5 sec before the focal point is fixed at the new location for another 5 sec. Focal points of the CCTVs and the UGV do not change once determined.

Figure 3.6: Multiple viewpoints used in SynPlay (UAV low-alt, UAV med-alt, UAV high-alt, UGV, CCTV front, CCTV side, CCTV back). On the top-right corner of each image, we place the enlarged crop of one human instance who is visible from all seven viewpoints in each scenario. Multiple camera viewpoints allow substantial variations in appearance for the same human subject with identical pose.

Figure 3.7 shows an illustration of camera movement and perspective.
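A minimal sketch of the UAV focal-point scheduling described above is given below. Only the stated quantities (UAV altitudes of roughly 30/50/100 m, the 15 m CCTV height, and the 10-second cycle with a 5-second transition) come from the text; the playground extent, coordinate convention, and function names are assumptions made for illustration.

```python
import random

UAV_ALTITUDES = {"low": 30.0, "med": 50.0, "high": 100.0}   # meters, from Sec. 3.2
CCTV_HEIGHT = 15.0                                          # meters, from Sec. 3.2
PLAYGROUND_HALF_EXTENT = 60.0                               # assumed half-extent (m)

def random_ground_point():
    """Pick a random focal point on the ground near the playground."""
    return (random.uniform(-PLAYGROUND_HALF_EXTENT, PLAYGROUND_HALF_EXTENT),
            random.uniform(-PLAYGROUND_HALF_EXTENT, PLAYGROUND_HALF_EXTENT), 0.0)

def uav_focal_schedule(duration_s, cycle_s=10, transition_s=5):
    """Every 10 s: spend 5 s moving the focal point, then hold it for 5 s."""
    schedule = []
    for t in range(0, duration_s, cycle_s):
        target = random_ground_point()
        schedule.append((t, t + transition_s, "transition", target))
        schedule.append((t + transition_s, t + cycle_s, "hold", target))
    return schedule

print(uav_focal_schedule(30))
```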
3.3 Scenario Design

For naturalness and connection, we used the six traditional Korean games featured in the Netflix series Squid Game (Red Light, Green Light; Sugar Candy; Tug of War; Marbles; Stepping Stones; and Squid Game) as the motif for each scenario. The objective across all games was to authentically portray the required movements of the characters without direct user intervention. Each game sequence captured unique challenges and interactions, contributing to a diverse and extensive exploration of human behavior under pressure.

Figure 3.7: Illustration of Camera Perspectives. From left to right: UAV capturing frames from aerial perspectives at varying altitudes, CCTV positioned at different angles around the game ground, and UGV moving randomly within the game ground. These perspectives provide diverse viewing angles for comprehensive data collection.

• Red Light, Green Light: In this game, contestants must advance towards a finish line during "Green Light" but freeze in place as soon as "Red Light" is announced. The challenge lies in stopping abruptly. Participants exhibit varying speeds and movements as they race forward, facing obstacles and navigating through the crowd. Failure to stop results in immediate elimination, with contestants collapsing to the ground.

• Sugar Candy: Participants in this game are tasked with delicately carving out a specific shape from a honeycomb candy without causing it to break. Each participant chooses a symbol and lines up in front of it, then receives the honeycomb from the NPC and places it in a random location. Once all participants have their honeycomb, they can each play the game. Throughout the game, participants exhibit a range of natural postures, seamlessly integrated into the gameplay. They can be seen standing patiently in line, sitting comfortably on chairs as they engage with the task, or even reclining on the floor.

• Tug of War: In the "Tug of War" game, two teams exert determined strength, each applying random force to the rope in a bid to surpass a critical victory threshold (a simple sketch of this rule is given after this list). Participants, organized into teams and lined up, showcase a variety of poses as they engage in the intense struggle, depicting dynamic motions during the pulling action.

• Marbles: This game involves players taking turns to flick marbles into a central hole. Each player throws their marbles into the central hole with random angles and force, aiming to score points. As with the other games in the scenario, the diverse poses the players take as they flick their marbles contribute to the atmosphere.

• Stepping Stones: This game involves crossing glass-covered bridges where pre-set fragile glass panels are placed. Players must find a safe path to reach the opposite end of the bridge. The game incorporates dynamic movements including preparatory actions before jumping, various jumping and landing movements, slipping movements after landing, expressions of surprise upon confirming glass breakage, and utilizing the rag-doll effect when players hit the ground for naturalistic body movements.

• Squid Game: This game takes place in a squid-shaped arena where two players face off against each other. One player acts as the defender, remaining within the squid-shaped area, while the other player acts as the attacker, moving to catch the defender. The attacker can stand with both feet inside the circle and must jump with one foot until crossing the defender's torso. The game emphasizes interactions between players, including movements to catch or evade, pulling or pushing, and deceptive motions to confuse the opponent.
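As referenced in the Tug of War item above, the rule reduces to accumulating random team forces until a victory threshold is crossed. The sketch below illustrates that idea; the force range and threshold value are illustrative assumptions, not the parameters used in the actual scenarios.

```python
import random

def tug_of_war(threshold=10.0, max_ticks=1000):
    """Each tick, both teams apply a random force; the game ends once the
    accumulated rope offset crosses the victory threshold."""
    offset = 0.0
    for tick in range(max_ticks):
        force_a = random.uniform(0.0, 1.0)   # team A pulls in the negative direction
        force_b = random.uniform(0.0, 1.0)   # team B pulls in the positive direction
        offset += force_b - force_a
        if abs(offset) >= threshold:
            return ("team B" if offset > 0 else "team A"), tick
    return "draw", max_ticks

print(tug_of_war())
```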
To summarize the scenario design: we used six traditional Korean games from the "Squid Game" series as motifs for the scenarios. The goal was to have the characters participating in each game naturally exhibit the movements required for the game while following the rules set for each game. The six games formed one sequence, and a total of 10 sequences were captured in different environments with different seeds for random number generation.

3.4 Other Design Factors

3.4.1 3D Animated Characters

We have designed 456 virtual characters using Character Creator [47] (see Fig. 3.8), where each character is involved in multiple game scenarios. To vary the appearance of the characters and avoid generating biases, each character was uniquely designed with respect to gender, skin color, age, height, body type, hair (styles and colors), and outfit. We kept the gender ratio between male and female at 1:1 and the ratio of skin color among white, black, yellow, and brown at 1:1:1:1. For age, each character was designed to fall into one of three categories: child, middle-aged, and elderly, with the ratio set at 1:2:1. For each gender and age group, heights were modeled to follow a bell-shaped distribution, resulting in an overall dataset range of 140 to 190 cm. We manually designated every character with a unique outfit, while making the hair and body-type aspects as diverse as possible.

Figure 3.8: 456 virtual players in SynPlay created using Character Creator.

3.4.2 Background

For each scenario, we set different environmental factors: sites, lighting conditions, and weather. Utilizing assets available on the Unity Asset Store and leveraging the Unity Terrain Tool, we developed five urban environments, encompassing three typical city locations, a construction area, and a factory site. Additionally, we simulated five natural environments, including a green area, a snowy field, a desert, a meadow, and a beach (see Fig. 3.9). Multiple locations within each site map can be used as local playgrounds.

The weather conditions comprised clear skies, foggy conditions, and foggy conditions with rain. This was done to enhance the model's ability to perform accurately in a wider range of conditions. To achieve this, the Unity fog volume was utilized to depict foggy situations, while particle systems were employed to simulate rainfall. The lighting environment was characterized by five distinct time periods: dawn, morning, noon, afternoon, and sunset (see Fig. 3.10). To simulate the sun's position in the Unity environment, the rotation of the directional light was precisely calculated and adjusted to correspond with each time period. Each scenario involves randomly determined sites, lighting conditions, and weather.

Figure 3.9: Various Background Environments. From top left to bottom right: three typical city locations, construction site, factory site, green, snowy, desert, meadow, and beach.

Figure 3.10: Various Lighting Conditions. From left to right: dawn, morning, noon, afternoon, sunset.

Figure 3.11: Character height distribution according to gender and age.
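The character and environment randomization described in Secs. 3.4.1 and 3.4.2 can be sketched as follows. The ratios (gender 1:1, skin color 1:1:1:1, age 1:2:1), the 140-190 cm height range, and the lists of sites, lighting conditions, and weather come from the text; the per-group height means and standard deviations and the city placeholder names are assumptions made for illustration.

```python
import random

AGE_GROUPS = ["child", "middle-aged", "elderly"]
AGE_WEIGHTS = [1, 2, 1]                                   # 1:2:1 ratio from the text
HEIGHT_PARAMS = {"child": (150, 6), "middle-aged": (172, 7), "elderly": (165, 7)}  # assumed

def sample_character():
    gender = random.choice(["male", "female"])            # 1:1
    skin = random.choice(["white", "black", "yellow", "brown"])   # 1:1:1:1
    age = random.choices(AGE_GROUPS, weights=AGE_WEIGHTS, k=1)[0]
    mean, std = HEIGHT_PARAMS[age]
    height = min(max(random.gauss(mean, std), 140), 190)  # clamp to dataset range
    return {"gender": gender, "skin": skin, "age": age, "height_cm": round(height, 1)}

def sample_scenario_environment():
    return {
        "site": random.choice(["city1", "city2", "city3", "construction", "factory",
                               "green", "snowy", "desert", "meadow", "beach"]),
        "lighting": random.choice(["dawn", "morning", "noon", "afternoon", "sunset"]),
        "weather": random.choice(["clear", "foggy", "foggy_rain"]),
    }

print(sample_character(), sample_scenario_environment())
```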
3.5 SynPlay Statistics

3.5.1 Character Distribution

Fig 3.11 displays bell-shaped distributions of human height categorized by gender and age. Leveraging these distributions, we generated 456 virtual characters that accurately reflect demographic characteristics. Beyond height, gender, and age, we diversified attributes such as skin color, body type, hair style, and attire.

3.5.2 Altitude Distribution

Fig 3.12 depicts the altitude distribution within the dataset, showing a consistent distribution pattern across the UAV sensors, excluding the fixed-range CCTVs and UGV. The data reveals a randomized yet consistent spread from minimum to maximum altitudes, reflecting the dataset's extensive coverage of human instances captured under various aerial conditions. This distribution underscores the dataset's ability to represent diverse real-world surveillance scenarios, which is essential for training robust detection models capable of handling altitude variations effectively.

Figure 3.12: Altitude distribution for each image-capturing device.

3.5.3 Perspective Distribution

Fig 3.13 shows the angle distribution ranging from 0 degrees (perpendicular to the ground) to 90 degrees (parallel to the ground). This indicates that the dataset reflects not only nadir-view perspectives but also various UAV perspectives. This diversity in perspectives enhances the robustness of object detection models by exposing them to a wide range of viewpoints, preparing them to recognize and localize objects under different spatial orientations.

Figure 3.13: Angle distribution for each image-capturing device.

3.5.4 Bbox Size Distribution

Fig 3.14 shows the distribution of bounding box sizes over human instances captured by each device. The majority of bounding box sizes are small, which illustrates a common characteristic of aerial-view datasets. Interestingly, UAVs can capture human instances with larger bounding boxes than CCTVs. This could be because, although UAVs are typically positioned at higher altitudes than CCTVs, there are more cases where the UAVs move closer to ongoing events and human instances, unlike the fixed CCTVs.

Figure 3.14: BBox size distribution for each image-capturing device.

3.5.5 Bbox Heatmap Distribution

Fig 3.15 presents the heatmap distribution of 2D bounding box data from various sensors in the SynPlay dataset. The heatmap visualization illustrates how frequently 2D bounding boxes cover different regions of the image frames. Brighter colors indicate higher percentages of frames in which bounding boxes are located at that position. Across most sensors, the central regions of the images exhibit brighter colors, suggesting a concentration of detected objects in these areas. Notably, excluding the UGV, the heatmap distributions tend to fade towards the edges of the images. This phenomenon suggests that while central areas may capture more frequent and prominent activity, the outer regions see fewer instances of bounding box detections. Understanding these spatial patterns is critical for optimizing surveillance tactics and improving the accuracy of object detection algorithms across diverse sensor types.

Figure 3.15: Heatmap distribution from various sensors (All Sensors, UGV, CCTV Back, CCTV Side, CCTV Front, UAV 100, UAV 50, UAV 30). The X-axis and Y-axis represent the horizontal and vertical pixels of the image (1920 x 1080), respectively. The brightness of each pixel is determined by normalizing with the maximum value of each sensor.
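The per-sensor heatmap of Fig. 3.15 can be reproduced with a simple accumulation over the 2D bounding boxes, normalized by each sensor's maximum count, as sketched below; the (x, y, w, h) box format and the example boxes are assumptions for illustration.

```python
import numpy as np

def bbox_heatmap(boxes, width=1920, height=1080):
    """Accumulate 2D boxes on a pixel grid and normalize by the maximum count."""
    heat = np.zeros((height, width), dtype=np.float64)
    for x, y, w, h in boxes:
        x0, y0 = max(int(x), 0), max(int(y), 0)
        x1, y1 = min(int(x + w), width), min(int(y + h), height)
        heat[y0:y1, x0:x1] += 1.0
    return heat / heat.max() if heat.max() > 0 else heat

# Example with two hypothetical boxes from one sensor.
print(bbox_heatmap([(900, 500, 40, 80), (1000, 520, 30, 60)]).max())
```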
3.5.6 Dataset Comparison

Table 3.1 provides a comparative analysis of various human datasets, categorized as real or synthetic and captured from either ground or aerial perspectives. Key observations from this comparison are outlined below:

1. Aerial-view sets, thanks to their wide viewing angles, generally have more human instances per image than ground-view sets, except for a few cases that employ a fixed number of actors in a real set or design one instance per image in a synthetic set.

2. Aerial-view sets generally contain a wider range of viewpoints (mostly near∼far).

3. For existing synthetic datasets, aerial-view sets typically feature fewer motion variations compared to ground-view sets. This is because aerial-view datasets often prioritize leveraging a wide range of viewpoints over expanding the variety of human motions.

4. Rule-guided design can utilize a significantly larger range of human motions compared to detail-guided design.

Table 3.1: Comparison of human datasets. '#inst/img' is computed only on images that contain humans. '#motion' indicates the number of unique motions depicted in the dataset, except for entries marked '(pose)', which indicate the number of static poses. Since a single motion can consist of multiple unique poses, #motion is generally smaller than the number of poses. For certain datasets, the test set without available labels is excluded from this comparison.

ground-view
dataset | domain | #inst | #img | #inst/img | natural motion | #motion | viewpoint
VOC 12 [17] | real | 10K | 11.5K | 2.48 | daily | infinite | near
COCO Dev17 [29] | real | 649K | 164K | 9.72 | daily | infinite | near
MPII Human Pose [2] | real | 40K | 24.9K | 1.61 | daily | 20 | near
Cityscapes [12] | real | 21.4K | 5K | 7.85 | daily | 2 | near
ADE20K [54] | real | 30K | 27.5K | 4.36 | daily | infinite | near
Human-Art [26] | real | 123K | 50K | 2.46 | art | infinite | near
GTA5 [39] | synth | 1.4M | 1.4M | 1 | X | 20K (pose) | near
SURREAL [51] | synth | 6.5M | 6.5M | 1 | detail+mocap | 23 | near
SOMAset [3] | synth | 100K | 100K | 1 | detail+mocap | 250 (pose) | near
PersonX [48] | synth | 273K | 273K | 1 | X | 4 (pose) | near
UnrealPerson [53] | synth | 120K | 120K | 1 | X | 2 | near
AGORA [36] | synth | · | 19K | 1∼15 | detail+mocap | 4,240 (pose) | near
BEDLAM [5] | synth | · | 380K | 1∼10 | detail+mocap | 2,311 (pose) | near

aerial-view
dataset | domain | #inst | #img | #inst/img | natural motion | #motion | viewpoint
Okutama-action [4] | real | · | 77K | ∼9 | detail | 12 | med
Semantic Drone [24] | real | 1.5K | 400 | 4.16 | daily | unspecified | med
UAVid [31] | real | 4.7K | 420 | 20.06 | daily | unspecified | med∼far
VisDrone [55] | real | 109K | 40.0K | 15.42 | daily | unspecified | med
Archangel-real [45] | real | 165.6K | 41.4K | 4 | detail | 3 (pose) | near∼far
Archangel-mannequin [45] | real | · | 178.8K | 6∼7 | detail | 3 (pose) | near∼far
Archangel-synth [45] | synth | 4.4M | 4.4M | 1 | X | 3 (pose) | near∼far
SynDrone [40] | synth | 803K | 72K | 11.15 | X | 2 | med∼far
CARGO [52] | synth | 108K | 108K | 1 | X | 2 | near∼far
SynPlay | synth | 6.5M | 73K | 88.40 | rule+mocap | infinite | near∼far

Legend for 'natural motion': 'daily' = human motions engaged in daily activity; 'art' = human motions shown in works of art; 'detail' = human motions captured by detail-guided design; 'rule' = human motions captured by rule-guided design; '+mocap' = human motions captured using a motion scanner.

The comparison shown in the table also demonstrates that SynPlay successfully addresses the shortfall of aerial-view synthetic datasets (3rd observation), while maximizing the benefits of aerial-view datasets (1st and 2nd observations). Moreover, the 4th observation supports the claim that our proposed rule-guided design is successful in securing the diversity of human motions in the set.
It is noteworthy that while SURREAL [51] (constructed with 'detail+mocap') contains a comparable number (6.5M) of human instances to SynPlay, the number of motions manifested in the dataset is extremely limited when compared to SynPlay (23 vs. infinite).

3.6 SynPlay Sample Images

Fig 3.16 includes additional sample images from the SynPlay dataset. Various human appearances are observed, changing with the human motions performed differently in each game scenario and with the camera viewpoints. In addition, the various characters and backgrounds used for creating SynPlay are also visible.

Figure 3.16: More example images from SynPlay are shown for all six Korean traditional games ((1) Red Light, Green Light; (2) Sugar Candy; (3) Tug-of-War; (4) Marbles; (5) Stepping Stones; (6) Squid Game), each with various camera viewpoints.

Chapter 4 Task Evaluation

In line with the inherent purpose of synthetic data to serve as supplemental training data, we use the entire SynPlay dataset to train models for various computer vision tasks and evaluate its positive impact on task performance. Our main baselines are trained-from-scratch models, which are trained only on real images (denoted as 'real' in the evaluation tables). We also validate the advantage of using SynPlay over other synthetic datasets. In this experiment, we evaluate the effectiveness of the SynPlay dataset across two primary computer vision tasks: human detection and segmentation. These tasks are particularly challenging due to the need to identify diverse human appearances captured from varying distances in images.

4.1 General tasks: detection and segmentation

We evaluate the SynPlay dataset on two general vision tasks, human detection and segmentation. These tasks require the ability to identify diverse human appearances in images captured at a distance. To leverage synthetic data during training, we adopt a pretrain-finetune strategy, where a model is pre-trained on synthetic data and fine-tuned on target real-world data. The detectors used in the experiments are YOLO v8 models [25] with three different architecture sizes (small, medium, and large). Mask2Former [9] with the Swin-Base [30] backbone was used for segmentation. For evaluation metrics, we use the COCO-style APs [11], namely the two bounding box APs, APbb and APbb50 (detection accuracy for each model in the following tables is reported with two numbers in the form APbb/APbb50), for human detection, and Intersection-over-Union (IoU) for segmentation. The main tasks are conducted on aerial-view datasets, which feature a wider range of human appearances, making them ideal for validating the design philosophy behind the SynPlay dataset. We also conduct experiments on ground-view datasets to evaluate SynPlay on a more widely studied task in the community.

4.1.1 Experiment Setting

In our experiments, the goal is to explore SynPlay's efficacy as supplementary training data across various tasks.
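Concretely, the pretrain-finetune strategy used throughout Sec. 4.1 can be sketched with the Ultralytics YOLOv8 API as follows. The dataset YAML files, the checkpoint path, and the epoch counts are hypothetical placeholders rather than the exact settings used in this thesis.

```python
# Minimal, hypothetical sketch of pretrain-finetune with Ultralytics YOLOv8.
# "synplay.yaml" and "visdrone_person.yaml" are placeholder dataset configs;
# epochs are illustrative, not the settings used in the experiments.
from ultralytics import YOLO

# 1) Pre-train on synthetic data (single 'person' class).
model = YOLO("yolov8m.pt")
model.train(data="synplay.yaml", imgsz=1280, epochs=50)

# 2) Fine-tune on the real target dataset, starting from the pre-trained weights
#    (path assumes the default Ultralytics output location).
finetuned = YOLO("runs/detect/train/weights/best.pt")
finetuned.train(data="visdrone_person.yaml", imgsz=1280, epochs=50)

# 3) Evaluate with COCO-style AP on the target validation split.
metrics = finetuned.val(data="visdrone_person.yaml", imgsz=1280)
```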
We largely adhere to the original settings and implementations of the methods, with minimal modifications tailored to our specific experiments:

• Architecture Modification: Given that human detection and semantic segmentation can be considered one-class problems, we adjust the method architectures, particularly the dimensions of the last layer, accordingly.

• Image Size in YOLOv8: During training and inference of YOLOv8, we use an image size of 1280×1280 for most datasets, except COCO, which uses 640×640. This decision is based on the original image sizes of the datasets, ensuring consistency in the range of human instance sizes across datasets.

• Training Mask2Former without the large-scale jittering (LSJ) augmentation [18]: We did not use the default LSJ augmentation in training the Mask2Former segmentation models solely for performance reasons. In all cases, segmentation accuracy was found to be significantly lower when LSJ augmentation was used. LSJ augmentation, which greatly expands the range of image scaling, may not be suitable for aerial-view detection, which mainly includes small-sized human instances. This performance degradation with LSJ augmentation is also observed in [21], a well-known work in the field of self-supervised learning.

By combining these experimental approaches, we aim to comprehensively assess the utility and performance of SynPlay in enhancing computer vision tasks across diverse datasets and scenarios.

4.1.2 Aerial-view tasks

Table 4.1 shows the results for the aerial-view human detection and semantic segmentation tasks. Overall, for both tasks, using SynPlay in training provides remarkably better accuracy than all the compared cases, including 'real' and all the other variations involving other synthetic data. Notably, the results show that warming up the model with synthetic data before incorporating real data generally does not improve performance, except in the case of SynPlay. In other words, unless the synthetic dataset is properly designed and constructed, we cannot expect performance improvement simply from adding synthetic data to the training process.

Table 4.1: Comparison with other synthetic datasets on aerial-view human detection and semantic segmentation. The numbers in parentheses are the gaps from the model trained without synthetic data ('real'). Notations: '+ real' represents a model pre-trained with synthetic data and fine-tuned on a 'real' dataset, where 'real' is a training set derived from the dataset used for evaluation; 's', 'm', and 'l' represent the three YOLO v8 architecture sizes.
Human detection (APbb/APbb50), reported per dataset for the s / m / l models:

VisDrone [55]
real: 19.72/47.43, 21.14/49.52, 21.60/51.10
Archangel [45]: 0.23/0.63, 0.38/0.98, 0.59/1.48
SynDrone [40]: 0.31/0.81, 0.36/0.84, 0.71/1.89
SynPlay: 5.29/11.75, 4.31/9.12, 2.79/5.87
Archangel + real: 18.77/45.39 (-0.95/-2.04), 20.25/48.52 (-0.89/-1.00), 20.82/49.51 (-0.78/-1.59)
SynDrone + real: 18.78/45.79 (-0.94/-1.64), 20.94/49.44 (-0.20/-0.08), 21.97/51.51 (+0.37/+1.41)
SynPlay + real: 20.88/49.31 (+1.16/+1.88), 22.34/52.12 (+1.20/+2.60), 22.98/52.93 (+1.38/+1.83)

Okutama-action [4]
real: 27.40/75.17, 28.99/76.60, 31.53/78.78
Archangel [45]: 2.59/8.45, 3.90/10.13, 2.83/9.12
SynDrone [40]: 0.00/0.00, 0.00/0.01, 0.00/0.00
SynPlay: 12.74/40.86, 8.19/25.43, 8.15/25.23
Archangel + real: 30.72/80.35 (+3.32/+5.18), 32.36/80.63 (+3.37/+4.03), 31.71/79.63 (+0.18/+0.85)
SynDrone + real: 29.70/77.71 (+2.30/+2.54), 31.39/79.42 (+2.40/+2.82), 31.24/78.71 (-0.29/-0.07)
SynPlay + real: 32.47/81.60 (+5.07/+6.43), 31.96/81.13 (+2.97/+4.53), 33.17/82.52 (+1.64/+3.74)

Semantic Drone [24]
real: 44.00/77.20, 44.52/78.52, 42.62/79.87
Archangel [45]: 0.64/1.59, 2.42/5.37, 0.94/1.62
SynDrone [40]: 0.00/0.00, 0.00/0.00, 0.00/0.00
SynPlay: 7.02/12.21, 9.60/15.51, 15.71/23.59
Archangel + real: 46.60/74.07 (+2.60/-2.13), 48.60/75.86 (+4.08/-2.66), 44.62/73.23 (+1.00/-6.64)
SynDrone + real: 50.93/82.28 (+6.93/+5.08), 53.71/85.47 (+9.19/+6.95), 59.59/85.02 (+16.97/+5.15)
SynPlay + real: 66.52/90.33 (+22.52/+13.13), 69.46/91.35 (+24.94/+12.83), 68.82/91.37 (+26.20/+11.50)

Semantic segmentation (IoU), reported for Semantic Drone [24] / Aeroscapes [35]:
real: 0.66 / 22.25
Archangel [45]: 0.74 / 0.04
SynDrone [40]: 0.07 / 0.00
SynPlay: 8.03 / 6.44
Archangel + real: 9.28 (+8.62) / 20.61 (-1.64)
SynDrone + real: 5.56 (+4.90) / 24.59 (+2.34)
SynPlay + real: 23.32 (+22.66) / 32.19 (+9.94)

Moreover, among the cases using synthetic data only in training, SynPlay presents unparalleled accuracy. In fact, the results using other sources of synthetic data are so poor that those sources can be considered ineffective for this type of dataset utilization. Based on these two observations, our design strategies for enhancing the diversity and realism of human appearance are shown to be highly effective in meeting expectations.

4.1.3 Ground-view tasks

Table 4.2 explores the impact of using SynPlay for the general computer vision tasks of ground-view human detection and semantic segmentation. We also evaluate how models perform when only the subset with the matching viewpoint (i.e., UGV images in SynPlay) is used in training. Overall, using the entire SynPlay yields the highest accuracy on both tasks, while using the UGV subset still outperforms the model trained without SynPlay. These results demonstrate that our insight of ensuring diversity by varying the camera viewpoints is effective even in tasks that do not contain such multiple viewpoints. In addition, the greater improvement in semantic segmentation over object detection shows that ensuring diversity is more effective in tasks that require more detailed human representation models.

4.1.4 Combination with MS COCO for pre-training

The effect of pre-training can be greater when applying two or more datasets with complementary properties. Here, we aim to investigate the potential synergy achieved by integrating MS COCO, a real dataset primarily comprising ground-view images, with SynPlay for the task of aerial-view human detection. Table 4.3 shows all combinations of the SynPlay and MS COCO datasets when used for pre-training. The anticipated synergistic effect appears in all cases except one (the APbb50 result on Okutama-action) when fine-tuned on the target dataset.
Moreover, when used indirectly through fine-tuning on the real dataset, using SynPlay alone provides accuracy comparable to using MS COCO.

Table 4.2: Impact of SynPlay on MS COCO (person category). Notation: 'SynPlay-UGV' and 'SynPlay-all' are the UGV subset of SynPlay and the entire SynPlay, respectively. The numbers in parentheses are the gaps from 'real'.

(a) Human detection (APbb/APbb50; s / m / l)
real: 46.19/65.91, 50.10/69.86, 52.52/72.15
SynPlay-UGV + real: 46.53/66.18 (+0.34/+0.27), 50.70/70.37 (+0.60/+0.51), 52.69/72.29 (+0.17/+0.14)
SynPlay-all + real: 46.84/66.70 (+0.65/+0.79), 51.12/70.74 (+1.02/+0.88), 53.00/72.59 (+0.48/+0.44)

(b) Semantic segmentation (IoU)
real: 15.10
SynPlay-UGV + real: 20.18 (+5.08)
SynPlay-all + real: 21.57 (+6.47)

Table 4.3: Synergy impact with MS COCO on aerial-view human detection (YOLO v8 model with a medium-size architecture); results reported for VisDrone / Okutama-action / Semantic Drone.
real: 21.14/49.52, 28.99/76.60, 44.52/78.52
COCO: 7.16/16.46, 15.17/48.28, 34.74/56.39
SynPlay: 4.31/9.12, 8.19/25.43, 9.61/15.52
COCO + SynPlay: 11.49/25.20, 14.68/49.82, 18.60/31.03
COCO + real: 22.11/51.73, 32.26/80.10, 65.72/89.20
SynPlay + real: 22.34/52.13, 31.96/81.13, 69.46/91.35
COCO + SynPlay + real: 22.78/53.01, 33.82/79.44, 73.52/92.80

Interestingly, the results without fine-tuning show a different trend. On Okutama-action and Semantic Drone, using MS COCO performed better than the other two baselines, with SynPlay showing a much lower accuracy. We observe that synthetic data still lags behind real-world data in many respects, highlighting the need for further research to bridge the gap.

4.2 Data-scarce tasks: Few-shot and cross-domain learning

In this section, we compare SynPlay with other synthetic datasets on its ability to meet the demand for additional data in data-scarce tasks. For the data-scarce tasks, we adopt few-shot and cross-domain learning tasks on aerial-view human detection, which suffers more severely from a lack of training data than ground-view detection. Following the data-scarce task setups of [43], we train models with two few-shot regimes using 20 and 50 images of VisDrone (denoted by 'Vis-20/50'). We test the models on VisDrone, Okutama-action, and Semantic Drone, where the evaluations on the last two datasets can be seen as 'cross-domain'. To attenuate the potential random effects that may arise when selecting the real training images, all reported numbers are average accuracy over three runs.

As baseline methods leveraging synthetic data in training, we use a pretrain-finetune strategy (PT-FT) and Progressive Transformation Learning (PTL) [43]. PTL is a progressive data augmentation approach that iteratively expands the training set by adding a subset of synthetic data, which is transformed to look real. In each PTL iteration, a subset of the synthetic data is selected such that synthetic data closer to the real dataset is selected more often. For the data-scarce tasks experimented with in [43], PTL was better than PT-FT, while both outperformed the cases without synthetic data. We used RetinaNet [28] as the detector (the currently available implementation of PTL is tailored to RetinaNet; for a fair comparison between PTL and PT-FT, we used RetinaNet instead of YOLO v8 for this experiment).

4.2.1 Experiment Setting

• Settings for PT-FT: When using PT-FT in the general tasks, training specifications, including training epochs and learning rate, did not differ between pre-training and fine-tuning. In data-scarce tasks, we follow all the settings of [43] as outlined in PTL, while leaving out the progressive component.

• Settings for data-scarce tasks: For all experiments performed for data-scarce tasks, including the scaling behavior study, we follow all the settings and experimental environments of [44].
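For reference, the progressive selection idea behind PTL described above can be sketched as follows. The realness score and the per-iteration bookkeeping are stand-ins for the actual domain-gap measure and real-looking transformation used in [43], so this is only a conceptual outline, not the method's implementation.

```python
import random

def realness_score(image_path):
    """Placeholder for PTL's measure of closeness to the real domain."""
    return random.random()          # higher = assumed closer to real data

def ptl_select(synthetic_pool, per_iter=100, iterations=5):
    """Iteratively move the most real-looking synthetic samples into training."""
    training_pool, remaining = [], list(synthetic_pool)
    for _ in range(iterations):
        remaining.sort(key=realness_score, reverse=True)
        chosen, remaining = remaining[:per_iter], remaining[per_iter:]
        # In PTL, 'chosen' would be transformed to look real before training.
        training_pool.extend(chosen)
        # ... retrain/update the detector with training_pool here ...
    return training_pool

selected = ptl_select([f"syn_{i:05d}.png" for i in range(1000)])
```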
Table 4.4: Few-shot and cross-domain learning accuracy (APbb/APbb50) on aerial-view human detection. The accuracy on Okutama-action and Semantic Drone refers to cross-domain learning performance. Notation: ‘Archangel*’ is a pose-diversified expansion of ‘Archangel’ [44].

Vis-20 setting (VisDrone / Okutama-action / Semantic Drone):
real | 0.58/2.27 | 3.64/14.54 | 0.62/1.89
+ Archangel (PTL) | 2.07/6.72 | 7.90/31.53 | 8.81/33.71
+ Archangel* (PTL) | 2.26/7.39 | 8.95/36.97 | 6.45/26.13
+ SynPlay (PTL) | 3.08/9.03 (+2.50/+6.76) | 14.39/49.53 (+10.75/+34.99) | 6.94/24.22 (+6.32/+22.33)
+ Archangel (PT-FT) | 0.76/2.48 | 4.24/17.17 | 6.53/23.67
+ Archangel* (PT-FT) | 1.21/4.02 | 9.14/34.70 | 8.20/28.80
+ SynPlay (PT-FT) | 2.94/9.38 (+2.36/+7.11) | 12.19/40.88 (+8.55/+26.34) | 11.32/37.30 (+10.70/+35.41)

Vis-50 setting (VisDrone / Okutama-action / Semantic Drone):
real | 0.76/3.30 | 7.82/28.66 | 1.30/5.65
+ Archangel (PTL) | 2.92/9.26 | 11.49/42.51 | 8.98/33.21
+ Archangel* (PTL) | 2.99/9.42 | 12.89/47.24 | 6.29/25.50
+ SynPlay (PTL) | 3.71/11.20 (+2.95/+7.90) | 15.67/52.06 (+7.85/+23.40) | 7.74/26.99 (+6.44/+21.34)
+ Archangel (PT-FT) | 1.29/3.76 | 5.32/20.96 | 7.10/27.95
+ Archangel* (PT-FT) | 1.84/5.37 | 10.39/36.83 | 8.63/30.09
+ SynPlay (PT-FT) | 3.72/11.87 (+2.96/+8.57) | 13.66/44.06 (+5.84/+15.40) | 12.76/41.66 (+11.46/+36.01)

• Settings for data-scarce tasks. For all experiments performed for the data-scarce tasks, including the scaling behavior study, we follow all the settings and experimental environments of [44].

4.2.2 Comparison with other synthetic data

In Table 4.4, we compare the detection accuracy of the models trained with different synthetic datasets on the few-shot and cross-domain learning tasks. With PT-FT, SynPlay achieved significantly better accuracy than the other synthetic datasets across all three test datasets. With PTL, SynPlay performed best on VisDrone and Okutama-action. These results were consistent for both the Vis-20 and Vis-50 settings. Even on Semantic Drone, which shows an unusual performance trend, the best performance was achieved when SynPlay was used via PT-FT.

In addition, the improvement that SynPlay brings on VisDrone and Okutama-action in these data-scarce settings is much greater than its improvement on the general tasks (Table 4.1). This demonstrates that SynPlay effectively meets the demand for additional data in data-scarce settings. We will discuss the unexpected performance trends on Semantic Drone in more detail in Chapter 5.

Figure 4.1 (three panels: bbox AP50 on VisDrone, Okutama-action, and Semantic Drone versus the number of training images on a log scale, with curves for PTL and PT-FT using Archangel, Archangel*, and SynPlay): Scaling behavior of synthetic datasets under the Vis-20 setup (APbb50). The scaling behavior of each dataset is compared using randomly sampled subsets of 1,080, 4,320, and 17,280 images, which correspond to 1/16th, 1/4th, and the full size of Archangel. For reference, the sizes of Archangel* and SynPlay are 34,994 and 73,892 images, respectively.

4.2.3 Scaling behaviors

To validate that the performance comparison is not simply an effect of dataset size, we explore the scaling behavior of the synthetic datasets; a minimal sketch of the subset-sampling step used for this comparison is given below.
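For concreteness, the following small sketch shows how equally sized random subsets could be drawn, assuming a COCO-style annotation file and illustrative file names; it is not the exact tooling used in our experiments.

    import json
    import random

    def sample_subset(ann_path, out_path, n_images, seed=0):
        """Keep a random subset of n_images from a COCO-style annotation file.
        Illustrative helper; the actual experiment tooling may differ."""
        with open(ann_path) as f:
            coco = json.load(f)
        rng = random.Random(seed)
        keep = rng.sample(coco["images"], n_images)
        keep_ids = {img["id"] for img in keep}
        coco["images"] = keep
        coco["annotations"] = [a for a in coco["annotations"]
                               if a["image_id"] in keep_ids]
        with open(out_path, "w") as f:
            json.dump(coco, f)

    # Subset sizes matching 1/16th, 1/4th, and the full size of Archangel.
    for n in (1080, 4320, 17280):
        sample_subset("synplay_train.json", f"synplay_subset_{n}.json", n)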
In Fig. 4.1, we compare the detection accuracy of the three synthetic datasets at multiple points where the datasets are randomly sampled to the same size. On all three test sets, the best-performing models use SynPlay in training, i.e., SynPlay + PTL on VisDrone and Okutama-action, and SynPlay + PT-FT on Semantic Drone. The performance gain achieved using SynPlay is therefore not simply due to the large size of the dataset.

Table 4.5: FID comparison. In the FID calculation, VisDrone serves as the reference set representing real aerial-view human data.
COCO | Archangel | Archangel* | SynDrone | SynPlay
48.16 | 67.20 | 67.20 | 21.66 | 18.36

4.3 Image Quality Evaluation

In Table 4.5, we report FID (Fréchet Inception Distance) [22] for all training datasets involved in our experiments to assess their fidelity and diversity. We used the PyTorch implementation of FID in [42] with the default setup: no image scaling was performed on the inputs of any dataset, and the final average-pooling features were used to compute FID. SynPlay obtains the best (lowest) score among the synthetic datasets, a result that aligns well with our task results. These results suggest that SynPlay's superior task performance stems from better fidelity and diversity, which are our goals in designing SynPlay. Moreover, SynPlay also attains a better FID than MS COCO, which mainly contains ground-view images, supporting the hypothesis that adopting multiple viewpoints effectively diversifies human appearances.

Chapter 5 Discussion and Conclusion

5.1 Discussion

Peculiar performance trend on Semantic Drone: The following conflicting phenomena were observed in the experimental results (Table 4.4) when testing on Semantic Drone:

– When synthetic data is used in training via PTL for the data-scarce tasks, the case involving SynPlay under-performs compared to the cases using the other synthetic datasets.

– The performance gain obtained by incorporating synthetic data during training is remarkably large for Semantic Drone compared to the other two test sets (VisDrone, Okutama-action), with SynPlay providing the most substantial gain.

We seek an explanation for the first phenomenon through an analysis of nadir-view instances. An instance whose elevation angle relative to the UAV is greater than 71.57°, the maximum elevation angle of Archangel [44], is considered a nadir-view instance. To identify nadir-view instances in Archangel, we utilized the dataset metadata, i.e., the UAV position. Similarly, for Archangel*, we determined whether an instance is a nadir-view instance based on its source instance, also using the dataset metadata. In the case of SynPlay, we computed the elevation angle of each instance from the absolute 3D coordinates of the instance and the UAV provided by SynPlay (a sketch of this computation is given below).

Table 5.1: Proportion of nadir-view instances in the synthetic datasets used in the data-scarce tasks. Instances with a camera viewing angle from the ground greater than 71.57° are considered nadir-view instances.
Archangel | Archangel* | SynPlay
25.00% | 12.82% | 4.24%

Most human instances in Semantic Drone are captured from nadir views, while VisDrone and Okutama-action contain fewer nadir-view instances. Among the synthetic datasets used in the data-scarce tasks, SynPlay has the smallest proportion of nadir-view instances (Table 5.1).
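As a minimal sketch of the elevation-angle test just described, assuming illustrative coordinate tuples rather than SynPlay's actual metadata format:

    import math

    NADIR_THRESHOLD_DEG = 71.57  # maximum elevation angle of Archangel [44]

    def is_nadir_view(instance_xyz, uav_xyz):
        """Return True if the UAV views the instance from a near-nadir angle.
        Elevation is measured from the horizontal plane through the instance
        up toward the UAV, using absolute 3D coordinates (illustrative format)."""
        dx = uav_xyz[0] - instance_xyz[0]
        dy = uav_xyz[1] - instance_xyz[1]
        dz = uav_xyz[2] - instance_xyz[2]            # vertical offset
        horizontal_dist = math.hypot(dx, dy)
        elevation_deg = math.degrees(math.atan2(dz, horizontal_dist))
        return elevation_deg > NADIR_THRESHOLD_DEG

    # Example: a UAV almost directly overhead yields a nadir-view instance.
    print(is_nadir_view((0.0, 0.0, 0.0), (2.0, 0.0, 30.0)))  # True (~86.2 degrees)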
As PTL continues to prioritize synthetic samples that closely resemble the seed data (i.e., VisDrone) for training, the reduced selection of nadir-view instances from SynPlay may result in a lower gain (the first phenomenon). On the other hand, the second phenomenon indicates that ensuring greater diversity by using supplemental synthetic data has a greater impact on Semantic Drone, which lacks diversity due to its limited viewpoints. Moreover, SynPlay, which is less similar to Semantic Drone yet more diverse than the compared synthetic datasets, shows the largest impact, supporting our claim that improving diversity is generally effective in constructing better synthetic data.

Broader impact. Utilizing real human datasets frequently entails inherent privacy concerns. We hope that our endeavors to improve synthetic human data, moving it one step closer to real-world fidelity, will contribute to alleviating these challenges.

Limitations. SynPlay was developed to provide richer representations of human appearance for tasks that involve localizing humans in scenes. We recognize the significance of incorporating distinctive features from diverse object categories. For future work, we aim to expand SynPlay to encompass a wider array of categories, thereby enriching its training capabilities.

5.2 Conclusion

What motion a human performs and where a person is viewed from are two crucial factors that make a difference in how a human looks. We create a synthetic human dataset called SynPlay with the aim of expanding the realism of human appearance by diversifying these factors. Enhancing this diversity allowed SynPlay to have a greater positive impact on model training, compared with training from scratch or using other synthetic data, on aerial-view and ground-view object detection and semantic segmentation. This positive impact of SynPlay becomes even greater in data-scarce tasks, where synthetic data is strongly desired as supplemental training data.

Bibliography

[1] Adobe: Mixamo, https://www.mixamo.com/#/
[2] Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: New benchmark and state of the art analysis. In: Proc. CVPR (2014)
[3] Barbosa, I.B., Cristani, M., Caputo, B., Rognhaugen, A., Theoharis, T.: Looking beyond appearances: Synthetic training data for deep CNNs in re-identification. Comput. Vis. Image Underst. 167, 50–62 (Feb 2018)
[4] Barekatain, M., Martí, M., Shih, H.F., Murray, S., Nakayama, K., Matsuo, Y., Prendinger, H.: Okutama-action: An aerial view video dataset for concurrent human action detection. In: Proc. CVPR Workshop (2017)
[5] Black, M.J., Patel, P., Tesch, J., Yang, J.: BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: Proc. CVPR (2023)
[6] Blender Institute: Blender, https://www.blender.org
[7] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: Proc. NeurIPS (2020)
[8] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proc. ICML (2020)
[9] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proc. CVPR (2022)
[10] Cioppa, A., Giancola, S., Deliège, A., Kang, L., Zhou, X., Cheng, Z., Ghanem, B., Van Droogenbroeck, M.: SoccerNet-Tracking: Multiple object tracking dataset and benchmark in soccer videos. In: Proc. CVPRW (2022)
[11] COCO Consortium: COCO - common objects in context. https://cocodataset.org/ (n.d.), accessed: July 18, 2024
[12] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: Proc. CVPR (2016)
[13] Cui, Y., Zeng, C., Zhao, X., Yang, Y., Wu, G., Wang, L.: SportsMOT: A large multi-object tracking dataset in multiple sports scenes. In: Proc. ICCV (2023)
[14] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: Proc. CVPR (2009)
[15] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16×16 words: Transformers for image recognition at scale. In: Proc. ICLR (2021)
[16] Epic Games: Unreal Engine, https://www.unrealengine.com
[17] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. Int. J. Comput. Vis. 111(1), 98–136 (Jan 2015)
[18] Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B.: Simple copy-paste is a strong data augmentation method for instance segmentation. In: Proc. CVPR (2021)
[19] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 38(1), 142–158 (Jan 2016)
[20] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., Piot, B., Kavukcuoglu, K., Munos, R., Valko, M.: Bootstrap your own latent: A new approach to self-supervised learning. In: Proc. NeurIPS (2020)
[21] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proc. CVPR (2022)
[22] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proc. NeurIPS (2017)
[23] Hwang, D.h. (Writer and Director): Squid Game (2021), https://www.netflix.com/title/81040344?source=35
[24] Institute of Computer Graphics and Vision, Graz University of Technology: Aerial semantic segmentation drone dataset. http://dronedataset.icg.tugraz.at
[25] Jocher, G., Chaurasia, A., Qiu, J.: Ultralytics YOLO (2023), https://github.com/ultralytics/ultralytics
[26] Ju, X., Zeng, A., Wang, J., Xu, Q., Zhang, L.: Human-Art: A versatile human-centric dataset bridging natural and artificial scenes. In: Proc. CVPR (2023)
[27] Lee, H., Eum, S., Kwon, H.: ME R-CNN: Multi-expert R-CNN for object detection. IEEE Trans. Image Process. 29, 1030–1044 (Sep 2019)
[28] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proc. ICCV (2017)
[29] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.: Microsoft COCO: Common objects in context. In: Proc. ECCV (2014)
[30] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proc. ICCV (2021)
[31] Lyu, Y., Vosselman, G., Xia, G.S., Yilmaz, A., Yang, M.Y.: UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 165, 108–119 (Jul 2020)
[32] Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., van der Maaten, L.: Exploring the limits of weakly supervised pretraining. In: Proc. ECCV (2018)
[33] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as surface shapes. In: Proc. ICCV (2019)
[34] MakeHuman Community: MakeHuman, https://static.makehumancommunity.org/makehuman.html
[35] Nigam, I., Huang, C., Ramanan, D.: Ensemble knowledge transfer for semantic segmentation. In: Proc. WACV (2018)
[36] Patel, P., Huang, C.H.P., Tesch, J., Hoffmann, D.T., Tripathi, S., Black, M.J.: AGORA: Avatars in geography optimized for regression analysis. In: Proc. CVPR (2021)
[37] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: Proc. CVPR (2019)
[38] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (Jun 2016)
[39] Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: Proc. ECCV (2016)
[40] Rizzoli, G., Barbato, F., Caligiuri, M., Zanuttigh, P.: SynDrone: Multi-modal UAV dataset for urban scenarios. In: Proc. ICCV Workshop (2023)
[41] Rodriguez, M.D., Ahmed, J., Shah, M.: Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In: Proc. CVPR (2008)
[42] Seitzer, M.: pytorch-fid: FID score for PyTorch. https://github.com/mseitzer/pytorch-fid (August 2020), version 0.3.0
[43] Shen, Y.T., Lee, H., Kwon, H., Bhattacharyya, S.S.: Progressive transformation learning for leveraging virtual images in training. In: Proc. CVPR (2023)
[44] Shen, Y.T., Lee, H., Kwon, H., Bhattacharyya, S.S.: Diversifying human pose in synthetic data for aerial-view human detection. arXiv:2405.15939 (2024), https://arxiv.org/abs/2405.15939
[45] Shen, Y.T., Lee, Y., Kwon, H., Conover, D.M., Bhattacharyya, S.S., Vale, N., Gray, J.D., Leongs, G.J., Evensen, K., Skirlo, F.: Archangel: A hybrid UAV-based human detection benchmark with position and pose metadata. IEEE Access 11, 80958–80972 (2023)
[46] Stathopoulos, A., Han, L., Metaxas, D.: Score-guided diffusion for 3D human recovery. In: Proc. CVPR (2024)
[47] Studio Chacre: The character creator, https://charactercreator.org
[48] Sun, X., Zheng, L.: Dissecting person re-identification from the viewpoint of viewpoint. In: Proc. CVPR (2019)
[49] Unity Technologies: Unity, https://unity.com/
[50] Unity Technologies: AnimatorController. https://docs.unity3d.com/Manual/class-AnimatorController.html (n.d.), accessed: July 18, 2024
[51] Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M.J., Laptev, I., Schmid, C.: Learning from synthetic humans. In: Proc. CVPR (2017)
[52] Zhang, Q., Wang, L., Patel, V.M., Xie, X., Lai, J.: View-decoupled transformer for person re-identification under aerial-ground camera network. In: Proc. CVPR (2024)
[53] Zhang, T., Xie, L., Wei, L., Zhuang, Z., Zhang, Y., Li, B., Tian, Q.: UnrealPerson: An adaptive pipeline towards costless person re-identification. In: Proc. CVPR (2021)
[54] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: Proc. CVPR (2017)
[55] Zhu, P., Wen, L., Du, D., Bian, X., Fan, H., Hu, Q., Ling, H.: Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 44(11), 7380–7399 (Nov 2022)