ABSTRACT

Title of Dissertation: CONTEXT-AWARE COMPUTATIONAL VIDEO EDITING AND RE-EDITING

Pooja Guhan
Doctor of Philosophy, 2025

Dissertation Directed by: Professor Dinesh Manocha
Department of Computer Science

Video has emerged as the dominant medium for communication and creative expression in the digital era, fueled by advances in consumer cameras and the ubiquity of content-sharing platforms. This democratization has empowered creators from diverse backgrounds and made capturing video effortless, but editing remains a significant barrier. It demands both technical expertise and nuanced creative judgement in narrative structure, emotional tone, and audience engagement. Current artificial intelligence (AI)-driven tools offer automation for basic editing tasks but fall short in supporting the high-level creative decisions that define compelling video, often neglecting narrative intent, production context, and viewer perception. We address this gap by introducing adaptive, expressive, and accessible editing techniques that bridge automation and artistic intent. We present computational models that support decision making across key stages of post-production, structured in three parts.

The first part presents two context-aware image editing approaches. The first approach leverages reinforcement learning to automatically style images in a way that harmonizes with the broader design or narrative context, rather than applying uniform edits across diverse content. The second approach, TAME-RD, pioneers AI-based reverse designing to provide detailed breakdowns of editing operations and parameter strengths for easy style extraction and transfer. TAME-RD reported improvements of 6-10% on various accuracy metrics and 1.01X-4X on the RMSE score on the GIER dataset. We also introduced a new dataset, I-MAD. Together, these methods advance automated color grading, enabling personalized and contextually relevant workflows.

The second part tackles the context-based adaptation of visual effects and camera motions to diverse narrative and stylistic goals. Our algorithm, V-Trans4Style, employs a transformer-based encoder-decoder and style conditioning module to generate visually seamless, temporally consistent transitions tailored to targeted production styles, significantly outperforming prior methods. On the AutoTransition dataset, V-Trans4Style achieved improvements of 10%-80% in Recall@K and mean rank values over baselines. We also introduced the AutoTransition++ dataset. Complementing this, CamMimic introduces a zero-shot algorithm that leverages video diffusion models to transfer camera motion patterns from reference videos to new scenes, allowing creators to emulate complex camera work without additional data or 3D information. Both approaches received strong user preference (at least 70%), underscoring their effectiveness in empowering creative video editing.

The third part focuses on the edit refinement process, treating audience feedback as new context to guide iterative editing decisions, helping creators identify impactful moments and enhance future content delivery. To address the challenge of reliably quantifying audience engagement, we present a machine learning-based approach to estimate viewer engagement levels during video playback, drawing on psychological theories of attention and interaction.
Our method has been validated through real-world experiments, including an application in telehealth for mental health, where the system automatically assessed patient engagement from video sessions. We obtained a 40% improvement in evaluation metrics over state-of-the-art methods for engagement estimation. By enabling objective, automated measurement of engagement, this approach empowers editors to make data-driven refinements, ultimately improving the effectiveness and resonance of video content. CONTEXT-AWARE COMPUTATIONAL VIDEO EDITING AND RE-EDITING by Pooja Guhan Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy July 2025 Advisory Committee: Dr. Dinesh Manocha (Chair) Dr. Maria Cameron (Dean’s Representative) Dr. Ramani Duraiswami Dr. Ming Lin Dr. Guan-Ming Su © Copyright by Pooja Guhan 2025 Dedication To my family and to the video editors who bring stories to life, frame by frame. ii Acknowledgments This dissertation is not only a reflection of my own efforts but also a testament to the unwavering support of those who have shaped my journey, both academically and personally. While the path has certainly felt like a rollercoaster at times, it has also been one of deep self-discovery, revealing strengths I didn’t know I had, exposing areas for growth, and above all, teaching me the importance of perseverance amidst uncertainty. I am deeply grateful to my advisor, Prof. Dinesh Manocha, whose unwavering confidence in me, especially during my moments of self-doubt, empowered me to embrace the challenges of pursuing a Ph.D. with renewed determination and assurance. My sincere thanks to my dissertation committee members, namely, Prof. Maria Cameron, Prof. Ming Lin, Prof. Ramani Duraiswami, and Dr. Guan-Ming Su, for their invaluable guidance, constructive feedback, and continued support. I am deeply grateful for the time and effort they dedicated to reviewing my dissertation and for their encouragement throughout the process. I extend my heartfelt thanks to my incredible GAMMA lab mates, constant companions on this academic journey. From idea-sparking brainstorming sessions to laughter-filled coffee breaks, your camaraderie made even the toughest days brighter. Thank you for your unwavering support, right from sharing frustrations over stubborn experiments to celebrating every victory, no matter how small. I’m grateful to have shared this experience with such a dedicated and remarkable team. Among these wonderful people (some who have also graduated), in particular, I am especially grateful to Trisha Mittal, for trusting me (then a Masters student) enough as a co-author and iii introducing me to the GAMMA family. I am grateful to Rohan Chandra, for his invaluable support and mentorship throughout my time in the lab. My sincere thanks to Kasun Weerakoon, Adarsh Sathyamoorthy, Mohamed Bashir Elnoor, Gershom Seneviratne, Senthil Hariharan, Vishnu Dorbala, Utsav Patel, and Jim for sharing your expertise and passion for robotics with me. Our discussions, whether about cutting-edge research or troubleshooting the quirks of robots, were always intellectually stimulating and often inspired new directions in my own work. Special shout out to the “foodies gang” (Laura Zheng, Niall Williams, Xijun Wang, Tianrui Guan, Ruiqi Xian, Bhrij Patel, Geonsun Lee, Yonghan) for organizing some truly enjoyable outings, including the unforgettable Totality 2024! 
I was grateful to join many of these moments, which provided a refreshing break from the routine and added a wonderful flavor of camaraderie to my PhD experience. Finally, to Divya Kothandaraman, for being a wonderful co-author and friend. Also, a big shout-out to the sys-admin team that I had the opportunity to be part of for a good chunk of this academic journey. I learned so much being in your company, trying to resolve the lab system issues as they came. A special mention to Alper Bozkurt for taking the initiative and helping us streamline the system allocation process within the lab to ensure all the members, including me, could continue running their experiments without any disruptions. It still feels surreal that I was fortunate enough to share a home with some of the most inspiring and thoughtful individuals I’ve ever met - Priyal Gala, Nitin Sanket, Anoorag Sunkari, Nakul Garg, Chahat Deep Singh, Sunaina Prabhu, Mrunal Dhaygude, and Aakriti Agarwal. We met as housemates, but now they are my extended family that I didn’t think I would need. Despite their own demanding and often exhausting schedules, they were always there for me, ready to lend an ear, offer support, or just sit with me in shared silence. Our home was a place of vibrant debates, unexpected ideas, and uncontrollable laughter (often leading to me literally falling off my chair!). iv These moments, both profound and silly, grounded me through the highs and lows of this journey, and I’m forever grateful for their presence in my life. If not for them, I would have the misfortune of not meeting and knowing Stella (our house dog). Getting greeted with the most enthusiastic and energetic barks when I got home every day, followed by being forced into a round of tug of war with her, made even the most frustrating research days the best. I feel immensely grateful and lucky to have received the opportunity to collaborate with and receive mentorship from some of the brightest minds in our research community across different projects. These include Aisha Walcott, Celia Cintas, Sekou Lionel Remy, Uttaran Bhattacharya, Saayan Mitra, Somdeb Sarkhel, Stephano Petrangeli, Ritwik Sinha, Vishy Swaminathan, Tsung- Wei Huang, Dae Yeol Lee, Gloria Reeves, Kristin Bussell, and Aniket Bera. Their guidance, encouragement, and thoughtful feedback have not only shaped the direction of my work but have also deeply influenced the way I approach research and collaboration. I will always carry forward the lessons I’ve learned from them. I began my journey at UMD as a Master’s student and later transitioned into the PhD program. I was fortunate to share this path with a wonderful group — Naman Awasthi, Vasu Singla, Vaishnavi Patil, Shishira Maiyya, Noor Pratap Singh, Sai Yerramreddy, and Pulkit Kumar. Their camaraderie, curiosity, and support made the highs brighter and the lows easier. From navigating the COVID-19 lockdown to celebrating festivals and exchanging research ideas, they made this journey truly memorable. Some friendships transcend time, place, and context. I’m extremely thankful for the friends who have stayed by my side since my school and undergrad days, as well as the incredible people I have had the joy of meeting in recent years. These include Ishan Bansal, Shashank Gupta, Sai Karthikey Pentapati, Anand Murali, Srujana Peddinti, Arushi Singhal, Simran Singhal, Thota v Venkata Aishwarya, Anushka Agarwal, Surya Soujanya, Zakir Hussain Shaik, Harikrupa Sridhar, Maithili Kunte, Aishwarya Deshpande, and Nikshita Ranganathan. 
My thesis would not have been possible without their steady encouragement and belief in me. I couldn’t have asked for better cheerleaders, sounding boards, or safe spaces. Their unwavering support throughout this journey has meant the world to me. I would also like to thank Migo Gui, Tom Hurst, and Jodie Gray for helping me with all administrative concerns, including TAships and RAships. Their support and assistance made my graduate student life significantly smoother and less stressful. Throughout this journey, my family has been my greatest pillar of strength. My sincere thanks to my mom, Priya, and dad, Guhan, for their unwavering patience and belief in me, especially during the times I doubted myself. Knowing that they stood firmly by my side through every high and low gave me the courage to keep moving forward. Their unconditional love, quiet sacrifices, and constant reassurance have been the foundation that held me up through the toughest moments. This achievement is as much theirs as it is mine. My sister, Mahima has been my source of joy, perspective, and resilience throughout this journey. She always knew how to bring a smile to my face when things felt overwhelming. Her steady presence and unshakable faith in me helped me emerge from this journey as a more optimistic, confident, and brave researcher. I am grateful to her for reminding me, time and again, of the light even in the most chaotic moments. I couldn’t have asked for a better cheerleader. I am immensely blessed to have received unwavering encouragement and support from my grandparents, aunts, uncles, and cousins throughout this journey. Their constant check-ins, words of motivation, and quiet pride in my progress have been instrumental in keeping me grounded and motivated. Whether through heartfelt conversations, shared laughter, or simply being there when I needed a break, their presence reminded me that I was never alone in this pursuit. vi I wish to express my heartfelt appreciation to all the funding agencies that have supported my research over the past few years. These include Adobe, Dolby, MPower, and the UMD Graduate School Summer Fellowship. Their generous support has been instrumental in enabling me to pursue ambitious ideas, attend conferences, collaborate with experts, and push the boundaries of my work. This journey would not have been possible without the freedom and flexibility that their funding provided, allowing me to focus deeply on research while growing both intellectually and professionally. Additionally, I am deeply grateful to the open-source platforms and vibrant community forums such as PyTorch, TensorFlow, Linux, and Stack Overflow that played a crucial role in my research journey. Their extensive resources, shared knowledge, and collaborative spirit helped me navigate countless implementation challenges. This thesis would not have been possible without the invaluable contributions of these communities. I am sincerely thankful to everyone who has been a part of this journey in one way or another. Your support, whether big or small, has left a lasting impact. I extend my humble apologies to anyone I may have inadvertently overlooked. Please know that your presence and kindness have not gone unnoticed, and I carry deep gratitude for each of you. As I bring this chapter to a close and look ahead to new beginnings, I do so with immense gratitude for the experiences, lessons, and connections that have defined this journey. 
This acknowledgment is a heartfelt tribute to everyone who, in their own way, helped shape and support my path. Thank you for being part of this unforgettable ride and for leaving an indelible mark on both my work and my life. vii Table of Contents Dedication ii Acknowledgements iii Table of Contents viii List of Tables xii List of Figures xiii I World of Video Production 1 Chapter 1: Introduction and Overview 2 1.1 Complexity of Video Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 AI and Video Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Role of Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Context and AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.5 Thesis Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Chapter 2: Background 16 2.0.1 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.0.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.0.3 Generative AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 II Context-Aware Image Editing 25 Chapter 3: Contextualized Styling of Images for Web Interfaces using Reinforcement Learning 26 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.1 Enhancement of Content . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.2 Image Enhancement for Context . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.3 Incorporating Human Feedback . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3 Problem Definition and Approach Intuition . . . . . . . . . . . . . . . . . . . . . . 31 viii 3.3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.2 Approach Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3.3 Context Definition and Image Corpus . . . . . . . . . . . . . . . . . . . . 33 3.4 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4.1 Deep Deterministic Policy Gradient (DDPG) . . . . . . . . . . . . . . . . 34 3.4.2 State and Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4.3 Reward Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4.4 Environment Modeling and Training . . . . . . . . . . . . . . . . . . . . . 39 3.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.5.1 User Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Chapter 4: TAME-RD: Text Assisted Replication of Image Multi-Adjustments for Reverse Designing 46 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3 TAME-RD: Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.3.2 Stream 1: Pre-Post Effect From Images . . . . . . . . . . . . . . . . . . . 56 4.3.3 Stream 2: Context From Language . . . . . . . . . . . . . . . . . . . . . . 
57 4.3.4 Multitask Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.4.1 GIER Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.4.2 I-MAD: Our Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.5 Experiment Results and Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.5.1 Training Details and Evaluation Metrics . . . . . . . . . . . . . . . . . . . 69 4.5.2 Quantitative Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.5.3 Ablation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.5.4 Additional Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.5.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.6 Ethical Considerations and Risks . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.8 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 III Context-Aware Visual Effects 83 Chapter 5: V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation 84 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.3 Task Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.1 AutoTransition Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.5 V-Trans4Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 ix 5.5.1 Pre-trained Transition and Style Embeddings . . . . . . . . . . . . . . . . 96 5.5.2 Encoder and Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.5.3 Style Conditioning Module . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.6 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.6.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.6.2 Comparisons and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.6.3 Ethical Considerations: . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.7 Additional Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.7.1 Pre-trained Transition and Style Embeddings . . . . . . . . . . . . . . . . 109 5.7.2 User Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.7.3 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.8 Broad Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.9 Conclusion, Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . 113 Chapter 6: CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models 116 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 6.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 6.2.1 Video Personalization, Controllability, and Camera Motion Transfer . . . . 120 6.2.2 Image Personalization & Novel View Synthesis . . . . . . . . . . . . . . . 121 6.3 CamMimic: Our Approach . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 122 6.3.1 Phase-1: Multi-Concept Finetuning . . . . . . . . . . . . . . . . . . . . . 124 6.3.2 Phase-2: Homography Guided Inference . . . . . . . . . . . . . . . . . . . 127 6.3.3 CameraScore - A New Metric . . . . . . . . . . . . . . . . . . . . . . . . 128 6.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 6.4.3 Comparison with Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.4.4 Ablation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 6.5.1 User Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 6.5.2 Additional User Study Details . . . . . . . . . . . . . . . . . . . . . . . . 139 6.6 Broader Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6.7 Conclusion, Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . 151 IV Generating Context for Editing and Re-Editing 153 Chapter 7: Developing an Effective and Automated Patient Engagement Estimator for Telehealth: A Machine Learning Approach 154 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 7.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 7.1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 7.1.3 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 7.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 x 7.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.2.2 Proposed Model Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 7.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 7.3.1 Study 1: Testing Our Proposed Approach on MEDICA . . . . . . . . . . . 179 7.3.2 Study 2: Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 7.3.3 Study-3:Analysis on Real-World Data . . . . . . . . . . . . . . . . . . . . 181 7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 7.4.1 Principal Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 7.4.2 Comparison with Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . 184 7.4.3 Strengths and Implications . . . . . . . . . . . . . . . . . . . . . . . . . . 185 7.4.4 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 187 7.4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 V Conclusion and Future Directions 189 Chapter 8: Conclusion, Limitations and Future Directions 190 8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 8.2 Applications and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 8.2.1 Context-Aware Image Editing . . . . . . . . . . . . . . . . . . . . . . . . 196 8.2.2 Context-Aware Visual Effects . . . . . . . . . . . . . . . . . . . . . . . . 197 8.2.3 Generating New Context For Editing . . . . . . . . . . . . . . . . . . . . . 199 8.2.4 Extending Context-Aware Editing to 3D Videos . . . . . . . . . . . 
. . . . 200 8.2.5 Context-Aware Editing for Immersive and 360◦ Video Formats . . . . . . . 201 Bibliography 203 xi List of Tables 3.1 Automated contextualized styling of images: Context variables considered . . . . . 33 3.2 Automated contextualized styling of images: User study results . . . . . . . . . . . 44 4.1 TAME-RD: Edit operations in I-MAD-Dense . . . . . . . . . . . . . . . . . . . . 65 4.2 TAME-RD: Edit operations in I-MAD-Pro . . . . . . . . . . . . . . . . . . . . . . 68 4.3 TAME-RD: Quantitative results on different datasets . . . . . . . . . . . . . . . . 81 4.4 TAME-RD: Quantitative results with different fusion strategies . . . . . . . . . . . 82 4.5 TAME-RD: Quantitative results with different λ . . . . . . . . . . . . . . . . . . . 82 5.1 V-Trans4Style: Encoder-decoder network quantitative results . . . . . . . . . . . . 108 5.2 V-Trans4Style: Ablation experiment results for encoder-decoder network . . . . . . 108 5.3 V-Trans4Style: Production style based transition recommendation quantitative results 109 6.1 CamMimic: Feature-by-feature comparison . . . . . . . . . . . . . . . . . . . . . 134 7.1 Engagement estimation for Telehealth: MEDICA comparison with related datasets 162 7.2 Engagement estimation for Telehealth: Demographic information for real-world samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 7.3 Engagement estimation for Telehealth: Quantitative results on MEDICA . . . . . . 180 7.4 Engagement estimation for Telehealth: Ablation experiments on MEDICA . . . . . 181 7.5 Engagement estimation for Telehealth: Real-world experiment results . . . . . . . 183 xii List of Figures 1.1 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 Automated contextualized styling of images - Overview . . . . . . . . . . . . . . . 28 3.2 Automated contextualized styling of images - RL based architecture using a combination of dynamic and static reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3 Automated contextualized styling of images - User study results . . . . . . . . . . 43 4.1 TAME-RD: Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 TAME-RD: Multimodal Multitask Learning Network Architecture . . . . . . . . . 51 4.3 TAME-RD: Understanding challenges - Different operations similar effects . . . . 53 4.4 TAME-RD: Understanding challenges - Same operation different effects . . . . . . 54 4.5 TAME-RD: Text provides semantic context . . . . . . . . . . . . . . . . . . . . . 56 4.6 TAME-RD: Experiment results - Per class average precision on GIER . . . . . . . 74 4.7 TAME-RD: Experiment results - Per class average precision on I-MAD-Dense . . . 75 4.8 TAME-RD: Qualitative results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.9 TAME-RD: Failure Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.1 V-Trans4Style: Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.2 V-Trans4Style: Visual transition distribution across production styles . . . . . . . . 92 5.3 V-Trans4Style: AutoTransition++ distribution . . . . . . . . . . . . . . . . . . . . 93 5.4 V-Trans4Style: Encoder-Decoder with Style Conditioning Module Architecture . . 95 5.5 V-Trans4Style: Pre-processing stage - Multi-task learning . . . . . . . . . . . . . . 97 5.6 V-Trans4Style: Reconstruction loss decoder architecture . . . . . . . . . . . . . . 102 5.7 V-Trans4Style: Pre-processing stage experiment results . . . . . . . . 
. . . . . . . 106 5.8 V-Trans4Style: Pre-processing stage class-wise visual transition accuracy . . . . . 110 5.9 V-Trans4Style: Pre-processing stage production style class-wise accuracy . . . . . 110 5.10 V-Trans4Style: User study results . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.11 V-Trans4Style: Qualitative results Set-1 . . . . . . . . . . . . . . . . . . . . . . . 113 5.12 V-Trans4Style: Qualitative results Set-2 . . . . . . . . . . . . . . . . . . . . . . . 114 5.13 V-Trans4Style: Impact of style conditioning module on transition recommendation 115 6.1 CamMimic: Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.2 CamMimic: Camera trajectory challenge . . . . . . . . . . . . . . . . . . . . . . . 122 6.3 CamMimic: Zero-shot video to image camera motion transfer architecture . . . . . 123 xiii 6.4 CamMimic: Need for CameraScore, a homography-based metric . . . . . . . . . . 129 6.5 CamMimic: Experiment results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 6.6 CamMimic: Qualitative results Set-1 . . . . . . . . . . . . . . . . . . . . . . . . . 139 6.7 CamMimic: Qualitative results Set-2 . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.8 CamMimic: Qualitative results Set-3 . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.9 CamMimic: Qualitative results Set-4 . . . . . . . . . . . . . . . . . . . . . . . . . 142 6.10 CamMimic: Qualitative results Set-5 . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.11 CamMimic: Qualitative results Set-6 . . . . . . . . . . . . . . . . . . . . . . . . . 144 6.12 CamMimic: Qualitative results Set-7 . . . . . . . . . . . . . . . . . . . . . . . . . 145 6.13 CamMimic: Qualitative results Set-8 . . . . . . . . . . . . . . . . . . . . . . . . . 146 6.14 CamMimic: User study results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 6.15 CamMimic: User studies setup details - Concept introduction page . . . . . . . . . 149 6.16 CamMimic: User study setup details - Sample question . . . . . . . . . . . . . . . 150 7.1 Engagement estimation for Telehealth: Overview . . . . . . . . . . . . . . . . . . 155 7.2 Engagement estimation for Telehealth: Samples from MEDICA dataset . . . . . . 159 7.3 Engagement estimation for Telehealth: Samples from real-world data . . . . . . . . 163 7.4 Engagement estimation for Telehealth: GAN-based semi-supervised multimodal learning architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 xiv Part I World of Video Production 1 Chapter 1: Introduction and Overview Over the past decade, video has emerged as the dominant medium for communication, entertainment, education, marketing, and documentation. With the proliferation of smartphones, affordable high- resolution cameras, and accessible video sharing platforms, the barriers to video creation and consumption have dramatically decreased. Platforms such as YouTube, TikTok, Instagram, and Netflix have not only accelerated content consumption but also have shaped evolving expectations around storytelling formats, pacing, production quality, and personalization [1, 2]. Consequently, video content has seen explosive growth. By some estimates, video comprises more than 80% of global internet traffic. This content explosion has placed tremendous pressure on video production workflows. Video production [3] traditionally encompasses three major phases - 1. 
Pre-Production involves planning and organizing the video project, including script writing, storyboarding, casting, location scouting, and equipment preparation. This stage sets the foundation for the video project, ensuring that all necessary elements are in place before filming begins. 2. Production is the actual filming phase, where the video footage is captured. This stage requires careful coordination of the cast, crew, and equipment to ensure that the vision outlined in the pre-production stage is effectively utilized. 2 3. Post-Production is where the raw footage is transformed into a polished final video. This is the final stage of the video production process and is typically the longest of all stages. It encompasses a series of intricate and interdependent tasks. Content creators, ranging from professionals and indie filmmakers to educators and marketers, are expected to deliver engaging, high-quality videos quickly and frequently. While capturing raw footage has become relatively frictionless, the post-production process remains labor-intensive, requiring significant expertise and time. To better understand the complexities of post-production, we further divide it into three functional stages [4, 5]: • Editorial stage: This initial phase focuses on organizing and assembling the raw footage into a coherent narrative. Tasks include selecting takes, trimming clips, determining scene order, and creating a rough cut to establish pacing and structure. • Video Editing Stage: Here, the creative and stylistic identity of the video takes shape. This stage includes the application of visual effects (VFX), transitions, color correction and grading, camera motion stylization, and edit refinement. These enhancements are essential for aligning the final content with the creator’s vision and the intended emotional tone. • Finalization Stage: The final step involves technical polishing and delivery preparation. Tasks include audio mixing, sound design, subtitle integration, compression, format conversion, and quality assurance to ensure compatibility with the target distribution platform. Among these, the Video Editing stage often emerges as the primary bottleneck. 3 1.1 Complexity of Video Editing Video editing is a sophisticated and nuanced craft focused on shaping the visual and emotional tone of a video to obtain a compelling visual narrative after the initial editorial work [4, 6]. Far more than simply cutting and splicing clips, it is a multi-layered process that demands both technical mastery and creative vision [7]. The journey typically begins with meticulously logging hours of footage, identifying the most powerful moments, and assembling a rough cut that lays the foundation for the story [5]. From there, editors dive into color grading to set the mood, add visual effects to enhance storytelling, and mix sound to create an immersive audio landscape [8, 9]. Each stage, whether it’s refining transitions, balancing audio levels, or integrating feedback from collaborators, is crucial in shaping the final product into a seamless and emotionally resonant experience for viewers [4, 7]. What makes video editing particularly complex is its power to control not just what the audience sees, but how they feel [10, 11]. Editors sculpt the pacing, build suspense or release tension, and guide the viewer’s eye through carefully timed cuts and visual cues. The process is both an art and a science, requiring a keen sense of timing, narrative structure, and an understanding of human psychology. 
Also, editing isn’t a one-time process. Often, we return to our videos to repurpose them for different needs, audiences, or emotional tones. This is referred to as re-editing. In today’s fast-paced digital world, the demand for high-quality, engaging video content is at an all-time high, yet the editing process remains notoriously time-consuming and labor-intensive. Artificial intelligence (AI) has long been applied in video-related tasks, particularly within the domains of computer vision and multimedia processing. Early systems employed rule-based methods or traditional statistical models for frame analysis and content organization. However, recent advances 4 in deep learning and large-scale data availability have enabled more powerful and flexible AI-driven solutions in video editing (and re-editing). For the sake of clarity and consistency, the term video editing will be used throughout this dissertation to also encompass re-editing processes. 1.2 AI and Video Editing The integration of AI into video editing has transformed both the capabilities and the accessibility of post-production workflows. AI promises to automate or accelerate several editing steps, reducing human effort while improving consistency. Early applications of AI in video editing focused on low-level operations such as shot boundary detection [12, 13, 14], object tracking [15, 16, 17], face or background segmentation [18, 19] and stabilization. These tasks, once time-consuming, have been made more efficient through computer vision models trained on large-scale annotated datasets. Progress in deep learning has enabled more ambitious applications. AI systems can now generate video highlights [20, 21, 22, 23], perform video summarization [24, 25], detect scenes of emotional relevance, and even recommend edits based on aesthetics or co-occurrence patterns. Commercial platforms like Adobe Premiere Pro [26] have enabled automation of tasks like smart reframing, background score matching, or even stabilization. At the frontier of generative AI, models such as Imagen [27], SORA [28], Make-A-video [29] explore the synthesis of short video segments from textual prompts, offering new directions in content creation. Transformer-based models for video understanding [30, 31] further enhance the capability of AI to reason over multimodal data, combining visual, auditory, and textual cues to infer semantic relevance or contextual importance. These advances suggest a future where AI might not only automate rote editing steps but also contribute to creative ideation and narrative construction. 5 1.3 Role of Context Context refers to the surrounding circumstances, conditions, or intentions that give meaning to a piece of information or an action [32]. In computational systems, context has been shown to enhance decision-making by situating content within a broader semantic or functional framework [33, 34]. Context can broadly be divided into two categories: internal and external [34, 35]. Internal context consists of features inherent to the data itself. In the case of video editing, this includes visual patterns, object semantics, spatial layout, temporal continuity, and audio cues extracted directly from the footage [36]. It enables systems to answer the question: what is happening in a frame or sequence, such as who appears, what actions unfold, how the composition is framed, and when the transitions occur. External context, by contrast, refers to information that lies beyond the immediate content. 
In the case of video editing, for instance, it provides the rationale for why a particular edit is appropriate in a given situation. It governs the intent behind an editorial decision. This includes, but is not limited to, narrative intent [37], genre conventions, platform constraints, cultural background, and anticipated audience reactions [5, 38, 39]. Both are tightly interwoven and crucial for shaping the audience's perception of a video. Traditional editing practices implicitly operate on both types of context. Together they inform every editorial decision and breathe coherence and emotion into a sequence of visuals [5, 10, 37]. A simple transition, for example, may function differently depending on its context. A fade might evoke nostalgia in one scene but feel disjointed in another [40]. A quick cut might heighten the sense of urgency in one sequence or disrupt the pacing in another. The nuance required to make these distinctions arises from understanding both what is seen (internal context) and why it matters (external context). The ability to sense and manipulate these nuances is what distinguishes skilled editors from generic automated systems [41]. Editing decisions must be guided by an understanding of who the audience is, what emotion is being conveyed, how the moment connects to the previous or the next one, and what visual logic governs the sequence.

However, modern AI tools lack access to and understanding of this full spectrum of context. While they are trained to recognize patterns within the content (such as repetitive shots, facial expressions, or motion cues), they often remain unaware of the broader communicative goals or stylistic considerations [42]. For example, a system trained to identify visually similar frames may successfully group scenes with recurring elements, but it may fail to recognize a moment of emotional climax or symbolic importance that warrants emphasis, even if it is visually unremarkable. Similarly, a model trained to generate a video summary may optimize for coverage or diversity but fail to reflect the intended message or tone of the source material. This gap is especially evident in creative or expressive domains where editing decisions are deeply tied to artistic intent, cultural nuance, and audience reception. Consider the difference between editing a travel vlog versus a cinematic short film versus a therapy session recording. Each requires a distinct understanding of what is salient, what pacing is appropriate, and what visual or auditory cues align with communicative goals. An editing system that treats all content uniformly is unlikely to meet the nuanced expectations of different storytelling contexts.

Furthermore, many machine learning or vision approaches are often designed as standalone automatons, focusing on the what (i.e., which frames to cut, which transition to apply) but not the why behind those decisions. As a result, edits produced by such systems often feel sterile, emotionally disconnected, or stylistically inconsistent, even when they are technically flawless. This shortcoming arises because editing is not merely a technical process. It is a narrative and affective one. A cut is not just a slice; it is a decision laden with meaning [5]. Depending on its timing, rhythm, and context, a cut can heighten suspense, convey psychological states, elicit laughter, or guide audience attention. Current AI systems cannot reason about these layers of meaning because they lack an understanding of the "why" behind the editorial decisions.
They may detect what occurs in a shot and when it occurs, but they remain blind to why that moment matters in the broader arc of the story. As a result, automation, while necessary and often beneficial, is far from being sufficient for achieving meaningful, engaging edits that resonate with the audience. 1.4 Context and AI The importance of context is not unique to video editing. Across several domains of AI, context (both internal and external) has proven to be crucial for building technologies that align more closely with human reasoning, intention, and experience. In natural language processing, contextual embeddings revolutionized how machines understand semantics. Models such as BERT [43], GPT [44], and ELMo [45] significantly outperform earlier methods by incorporating the surrounding linguistic environment (an example of modeling internal context) to disambiguate meaning. These models learn dynamic, context-sensitive representations that significantly improve performance on a range of tasks, from language understanding to dialogue generation. Beyond internal semantics, models that incorporate external context, such as speaker identity, sentiment, or dialogue history, further improve performance in tasks like response generation and personalized conversation [46]. In recommender systems, contextual modeling has enhanced personalization by incorporating signals like time of day, user mood, recent interactions, and device type [47]. These systems adapt to the user’s current situation and historical patterns to generate more relevant suggestions. For example, Netflix and Spotify personalize recommendations not only based on past behavior but also on inferred situational context such as watching late at night, commuting, or preparing for 8 a workout [48, 49, 50]. In human-computer interaction, context-aware systems adjust interfaces and feedback mechanisms based on both user behavior (internal context) and environmental or affective signals (external context). Adaptive interfaces modify layouts based on past interaction patterns [51], while intelligent tutoring systems like AutoTutor [52] respond to learner engagement by integrating facial expressions, eye gaze, and emotional states. These systems demonstrate how external context, when fused with internal cues, leads to more fluid and user-centric interactions. In robotics and embodied AI, contextual reasoning is indispensable for real-world deployment. Internal environmental models (e.g., object recognition, scene geometry) guide basic planning, but effective robotic behavior is highly dependent on the interpretation of external context such as human proximity, social norms, and dynamic constraints [53, 54]. KnowRob [55] exemplifies how robots can integrate semantic and procedural knowledge to understand not just how to perform a task, but when and why it is appropriate. Together, these examples underscore that context, whether derived internally from the data or externally from surrounding conditions, enables more adaptive, coherent, and human-aligned AI systems. 1.5 Thesis Objective Despite the clear benefits of context-aware AI elsewhere, video editing has yet to fully embrace this paradigm. Current approaches to video editing remain heavily reliant on internal content cues, such as visual similarity or shot duration [56, 57]. 
Although this enables certain degrees of automation, it overlooks the role of external context, such as creative intent, emotional tone, platform-specific constraints, or audience expectations, which critically shape editorial decisions [5, 38]. Without access to the complete context spectrum (internal and external), AI systems remain capable yet unintuitive; they are productive but not perceptive. We position context, with emphasis on the external context, as the missing link between automated editing and effective storytelling. We believe editorial intelligence should not only recognize what appears in a video but also why certain editing choices are being made. In pursuit of this vision, our research develops computational methods that operationalize both internal and external context, treating editing not as a sequence of isolated tasks but as a contextually grounded, narrative-driven process, i.e., one in which editing decisions are informed by context (semantic, stylistic, emotional, and perceptual) and are aligned with the overarching storytelling goals. By incorporating various forms of context, such as semantic, stylistic, emotional, and perceptual, our work explores how editing systems might begin to exhibit elements of creative reasoning, stylistic adaptability, and audience-aware decision-making, setting the stage for more contextually responsive editing tools.

Figure 1.1: Overview of research done: Context is integrated into four key areas of video editing to address distinct challenges. Work under Context-Aware Image Editing focuses on enhancing image/frame colors by adapting to the context of the scene. Context-Aware Visual Effects applies contextual understanding to visual effects, ensuring seamless transitions and coherence; we explore visual transitions and camera motions under this segment. Finally, Generating Context for Editing and Re-Editing captures and quantifies human engagement, providing valuable feedback for creating context-driven content. Each component contributes to a holistic, context-aware approach to video editing.

Main Contributions: We introduce computational techniques and frameworks for building editing systems that are sensitive to context across multiple dimensions. Our proposed methods embed external context signals into the editing pipeline while also leveraging internal content representations. Specifically, we focus on three core challenges:

• Context-Aware Image Editing: Image editing refers to the manipulation or enhancement of individual frames or still images. In the context of video, frame-level edits often contribute to color correction, lighting adjustments, and compositional improvements that influence the overall aesthetic and narrative tone of a scene. Traditional image editing tools require manual specification of regions, transformations, or filters. In contrast, AI-based image editing leverages models trained on large-scale datasets to automate tasks like inpainting [58, 59], style transfer [60, 61], relighting [62, 63], and object manipulation [64]. GAN and diffusion models, such as DALL-E [65], have dramatically improved realism in synthesized edits, enabling users to edit images using textual prompts or semantic sketches. Despite these advances, most models operate on local image content or global semantic embeddings, lacking awareness of narrative or situational context. This can result in inconsistencies in expected visual perception or in misalignment with creative intent.
We develop models that integrate semantic, perceptual, and spatial context to support editing operations on frames and images. Our methods are designed to preserve narrative relevance and visual coherence, even in the absence of explicit human direction.

In Chapter 3, we investigate the role of situational context in image editing. The circumstances in which an image is presented significantly influence how it is perceived by viewers. Just as one size does not fit all, a single edited version of an image may not be appropriate across different contexts, and conversely, the same context may call for varied edits depending on the changing content. Editing decisions can be shaped by factors such as the target audience, the intended emotional tone, and the overarching theme of the content. To address this, we propose a reinforcement learning-based approach that generates contextually appropriate image edits without requiring access to datasets containing multiple versions of an image edited for different contexts. Our method is explicitly optimized to produce edits that not only align with the given context but are also statistically preferred by human evaluators. To the best of our knowledge, this is the first work to systematically examine the interplay between image editing, human preference, and situational context.

In Chapter 4, we introduce the concept of AI-driven reverse designing, which involves recovering the full sequence of edit operations and their associated parameter values that transform a given source image into its corresponding edited image, both of which are provided as input. By enabling the easy reconstruction of the edit trajectory, our proposed method TAME-RD can potentially help uncover the underlying decision rationale of a reference edit. This insight enables downstream applications such as edit replication and thereby supports the transfer of exact editing styles across content to match a desired intent. Our proposed method, TAME-RD, achieved a 6–10% improvement across various accuracy metrics and a 1.01× to 4× reduction in RMSE scores as compared to the state-of-the-art.

• Context-Aware Visual Effects: Visual effects refer to the addition or manipulation of visual elements in a video to enhance storytelling, create illusions that cannot be captured during live filming, or emphasize specific aspects of a scene to make the viewing experience more immersive and engaging [66, 67]. These effects range from subtle stylistic treatments, such as lens flares, motion blur, and depth-of-field adjustments, to more complex elements, such as simulated camera motion, stylized transitions, or the addition of synthetic environmental components. In cinematic post-production, the choice and timing of such effects are typically informed by the intended mood, genre conventions, and narrative pacing. Recent advances in generative models, such as Generative Adversarial Networks and diffusion models [68, 69, 70], and in neural rendering techniques, such as Neural Radiance Fields [71, 72] and Gaussian Splatting [73], have greatly expanded the range of effects that can be synthesized automatically. Transformer-based multimodal models have also enabled semantic control over editing by linking language prompts with visual outcomes. However, most of these models treat visual effects as isolated transformations and are not sensitive to the narrative role or editorial intention behind their use. As a result, AI-generated effects may look plausible in isolation but can feel tonally jarring, stylistically inconsistent, or narratively unmotivated within the broader video.
We, therefore, propose approaches to enable context-aware visual effects. We present algorithms for recommending and adapting visual transitions and camera motions based on production styles, pacing, and creator intent. The models learn from stylistic reference patterns across different genres and apply appropriate transformations to maintain aesthetic fidelity.

In Chapter 5, we present V-Trans4Style, our algorithm to recommend a sequence of transitions that can facilitate the adaptation of videos to different production styles. Unlike prior works, in addition to considering the aesthetic appeal, our algorithm keeps track of temporal consistency and proposes a unique inference-based strategy to obtain a video that mimics the desired production style.

In Chapter 6, we discuss CamMimic, our zero-shot strategy to enable camera motion transfer from a single reference video, where we learn to extract and transfer the camera motion observed in the reference video onto a static or minimally animated target scene. Our method enables the synthesis of dynamic camera effects, such as slow tracking shots or rapid camera movements, without requiring any 3D information or manual animation.

• Generating Context for Editing and Re-Editing: Edit refinement involves evaluating, adjusting, or reordering video segments to improve narrative coherence, emotional impact, or viewer engagement. This is typically the final phase of the editing pipeline, where earlier decisions are re-examined in light of audience response and storytelling clarity. To support a more informed and viewer-aware re-editing process, we propose generating contextual signals derived from audience-centered cues such as affective expressions, visual saliency, and attention patterns (e.g., audio reactions). In Chapter 7, we introduce a semi-supervised, multimodal learning framework designed to estimate viewer engagement. Our approach computationally models cognitive and affective psychological states, and we validate it through a real-world application in telehealth for mental health support.

Together, these contributions take initial steps toward the development of AI tools that move beyond surface-level automation and can begin to adapt to human intent, assist in creative decision-making, and collaborate meaningfully with human storytellers.

Chapter 2: Background

The primary aim of this chapter is to lay a comprehensive groundwork by presenting the essential background knowledge and contextual framework necessary for understanding the central topics addressed in this dissertation. Through a review of fundamental concepts, core methodologies, and recent developments, this chapter seeks to provide readers with the insight and perspective needed to fully grasp the contributions and results discussed in later chapters. Machine learning, and in particular deep learning, forms a central pillar of this thesis. Throughout the research studies presented here, we use several kinds of deep learning techniques to build our algorithms. For clarity and completeness, we briefly discuss each of these approaches below.

2.0.1 Transformers

The transformer architecture, introduced by [74], is a deep learning model that relies entirely on attention mechanisms to process sequential data, eliminating the need for recurrence or convolutions. The attention mechanism enables the model to dynamically aggregate information from different parts of the input sequence, assigning varying weights based on learned attention scores.
This allows the transformer to process the entire input sequence in parallel, rather than sequentially as in models like LSTMs, which depend on hidden states as memory. Given two input matrices, $X_l \in \mathbb{R}^{l \times d_{in}}$ and $X_s \in \mathbb{R}^{s \times d_{in}}$, where $l$ and $s$ are the sequence lengths and $d_{in}$ is the feature dimension, the attention layer computes its output as

$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{in}}}\right)V \in \mathbb{R}^{l \times d_{out}} \quad (2.1)$$

where $Q = X_l W_q$, $K = X_s W_k$, and $V = X_s W_v$. Here, $Q$, $K$, and $V$ are the query, key, and value matrices, projected from the inputs using learnable weight matrices $W_q, W_k \in \mathbb{R}^{d_{in} \times d_h}$ and $W_v \in \mathbb{R}^{d_{in} \times d_{out}}$, with $d_h$ and $d_{out}$ as the hidden and output dimensions, respectively. When both $X_l$ and $X_s$ are the same, the mechanism is known as self-attention, a core component of the transformer. Each transformer layer typically includes a multi-head self-attention (MHSA) module, several linear transformations, normalization, and activation functions. This architecture has demonstrated superior performance and stability across a wide range of applications, largely due to its ability to model long-range dependencies and its efficient, parallelizable design.
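To make Equation 2.1 concrete, the following is a minimal PyTorch sketch of a single-head attention layer. The module name, tensor shapes, and dimension values are illustrative choices and not taken from the implementations used later in this dissertation.

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Single-head attention following Eq. 2.1 (illustrative dimensions)."""

    def __init__(self, d_in: int, d_h: int, d_out: int):
        super().__init__()
        self.w_q = nn.Linear(d_in, d_h, bias=False)    # W_q in R^{d_in x d_h}
        self.w_k = nn.Linear(d_in, d_h, bias=False)    # W_k in R^{d_in x d_h}
        self.w_v = nn.Linear(d_in, d_out, bias=False)  # W_v in R^{d_in x d_out}
        self.scale = d_in ** 0.5                       # sqrt(d_in), as written in Eq. 2.1

    def forward(self, x_l: torch.Tensor, x_s: torch.Tensor) -> torch.Tensor:
        # x_l: (batch, l, d_in) provides queries; x_s: (batch, s, d_in) provides keys/values.
        q = self.w_q(x_l)                               # (batch, l, d_h)
        k = self.w_k(x_s)                               # (batch, s, d_h)
        v = self.w_v(x_s)                               # (batch, s, d_out)
        scores = q @ k.transpose(-2, -1) / self.scale   # (batch, l, s)
        weights = scores.softmax(dim=-1)                # attention weights per query position
        return weights @ v                              # (batch, l, d_out)

# Self-attention is the special case where both inputs are the same sequence.
attn = SingleHeadAttention(d_in=64, d_h=32, d_out=64)
x = torch.randn(2, 10, 64)   # (batch=2, sequence length=10, d_in=64)
out = attn(x, x)             # (2, 10, 64)
```

A multi-head self-attention (MHSA) module runs several such heads in parallel before the linear, normalization, and activation layers mentioned above.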
This dissertation specifically focuses on model-free RL. Within model-free RL, two principal approaches are widely used: policy optimization and Q-learning.

Policy Optimization: These methods directly parameterize the agent's policy and optimize it to maximize the cumulative reward. The policy, typically denoted as π_θ(a|s), is updated using gradient-based methods to improve performance. Policy optimization is particularly effective in environments with high-dimensional or continuous action spaces and includes algorithms such as REINFORCE, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO). The main advantage of policy optimization is its stability and reliability, as it directly improves agent performance by optimizing the policy itself.

Q-Learning: This is a value-based approach that learns an action-value function Q(s, a), representing the expected cumulative reward of taking action a in state s and following the optimal policy thereafter. The agent uses the learned Q-function to select actions that maximize the expected value. Deep Q-Networks (DQN) and related algorithms are prominent examples in this category. Q-learning tends to be more sample-efficient, as it can reuse past experience, but may suffer from stability issues due to the indirect nature of optimizing the policy via the value function.

For a comprehensive introduction to reinforcement learning algorithms and foundational concepts, we refer the reader to the seminal textbook by Sutton and Barto [75] and the lecture series by David Silver [76], which together offer both theoretical grounding and practical insights into modern RL techniques. In this dissertation, we utilize Deep Deterministic Policy Gradient (DDPG), an algorithm that blends the strengths of both approaches. DDPG concurrently learns a deterministic policy and a Q-function, using each to improve the other. This hybrid method allows for stable learning in continuous action spaces while leveraging the sample efficiency of value-based techniques, making it well-suited for complex real-world tasks.

2.0.3 Generative AI

Recently, artificial intelligence has undergone a profound paradigm shift driven by rapid advancements in generative AI technologies. What began as experimental research has evolved into mission-critical solutions that are now transforming industries, business models, and creative practices on a global scale [77, 78, 79]. Generative AI, powered by breakthroughs in foundational models [77, 80, 81] and scalable computing [82], is no longer confined to theoretical or niche applications [78]. By 2025, these technologies have matured into essential tools that automate complex tasks, streamline workflows, and enable the creation of novel content across text, images, audio, and even 3D modalities [80, 83, 84]. Given their relevance to the methods and applications explored in this research, we provide an overview of Generative Adversarial Networks (GANs) and diffusion models in this section.

2.0.3.1 Generative Adversarial Networks (GANs)

Introduced by Ian Goodfellow in 2014, GANs [85] are a class of machine learning algorithms designed to generate new data that closely resembles a given training dataset. The architecture consists of two neural networks, the generator and the discriminator, engaged in a competitive, adversarial process.
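As a rough illustration of this adversarial setup, the sketch below shows one alternating discriminator/generator update in PyTorch using the common binary cross-entropy form of the objective (the exact minimax objective is given in the next paragraphs); the tiny networks, data, and hyperparameters are placeholders of our own, not the architectures used in the cited works.

```python
# Sketch of one adversarial (GAN) training step; models and data are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

data_dim, noise_dim = 32, 8
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))  # generator
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))          # discriminator (logit)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

x_real = torch.randn(16, data_dim)   # stand-in for a mini-batch of real data
z = torch.randn(16, noise_dim)       # noise input to the generator

# Discriminator step: push D(x_real) toward "real" and D(G(z)) toward "fake".
d_loss = F.binary_cross_entropy_with_logits(D(x_real), torch.ones(16, 1)) + \
         F.binary_cross_entropy_with_logits(D(G(z).detach()), torch.zeros(16, 1))
opt_D.zero_grad(); d_loss.backward(); opt_D.step()

# Generator step: push D(G(z)) toward "real", i.e., try to fool the discriminator.
g_loss = F.binary_cross_entropy_with_logits(D(G(z)), torch.ones(16, 1))
opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```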
The generator creates synthetic data (such as images, text, or audio), while the discriminator evaluates whether the data is real (from the training set) or fake (produced by the generator). More formally, given a set of data instances X and a set of labels Y, generative models capture the joint probability P(X, Y), or just P(X) if there are no labels. Discriminative models capture the conditional probability P(Y|X). One way to think about GANs is as a competition between two players. The generator is trying to fool the discriminator, while the discriminator is trying to spot the fakes. A GAN is essentially a zero-sum game where the generator is trying to maximize its score and the discriminator is trying to minimize the generator's score. The goal of training a GAN is to find a Nash equilibrium, where neither player can improve their score by making any changes to their strategy. In the paper that introduced GANs, the training process involves the generator trying to minimize the following function while the discriminator tries to maximize it:

E_x[log(D(x))] + E_z[log(1 − D(G(z)))]

where D(x) is the discriminator's estimate of the probability that real data instance x is real, E_x is the expected value over all real data instances, G(z) is the generator's output when given the noise z, D(G(z)) is the discriminator's estimate of the probability that a fake instance is real, and E_z is the expected value over all random inputs to the generator. Some other loss functions, like the modified minimax loss and the Wasserstein loss, have also been used in the literature.

2.0.3.2 Diffusion Models

Generative models have revolutionized the creation of high-quality synthetic data, with Generative Adversarial Networks (GANs) [85], Variational Autoencoders (VAEs) [86], and Flow-based models [87] each offering unique strengths and facing distinct limitations. GANs, for instance, are celebrated for producing sharp and realistic samples but often encounter unstable training dynamics and mode collapse, which reduces sample diversity [88]. VAEs optimize a surrogate loss that can result in blurry outputs due to the trade-off between reconstruction accuracy and latent space regularization [86]. Flow-based models, while providing exact likelihoods and invertible mappings, require intricate, reversible network architectures that add design complexity [87]. Diffusion models have recently emerged as a powerful alternative, delivering stable training and exceptional sample quality. Rooted in non-equilibrium thermodynamics, these models operate by progressively corrupting data with noise through a forward process and then learning to reverse this process, effectively transforming random noise into coherent data samples. The foundational concept was introduced by Sohl-Dickstein et al. [89], with significant advancements in modeling and efficiency by Ho et al. [90] and Song et al. [91]. Diffusion models comprise two key processes, namely, the forward diffusion process and the reverse diffusion process.

Forward Diffusion Process: The forward process is a Markov chain that gradually adds Gaussian noise to the data over T discrete steps. For an initial data point x_0, the corrupted version at step t is denoted x_t, and the transition is defined as:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)    (2.2)

where β_t is a variance schedule that controls the amount of noise added at each step. The full forward process is

q(x_{1:T} | x_0) = ∏_{t=1}^{T} q(x_t | x_{t−1})    (2.3)

This process ensures that, as t → T, x_T approaches an isotropic Gaussian distribution.
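The forward corruption process of Eqs. (2.2)-(2.3) can be simulated in a few lines; the NumPy sketch below, with an arbitrary linear variance schedule of our choosing, noises a toy data point step by step and shows it drifting toward an isotropic Gaussian.

```python
# Sketch of the forward diffusion process q(x_t | x_{t-1}) from Eqs. (2.2)-(2.3).
# The linear beta schedule and the toy data are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)                 # variance schedule beta_t

x0 = rng.normal(loc=3.0, scale=0.1, size=(16,))    # toy "data", far from N(0, I)
x = x0.copy()
for t in range(T):
    # x_t ~ N( sqrt(1 - beta_t) * x_{t-1}, beta_t * I )
    x = np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * rng.normal(size=x.shape)

print("mean, std after T steps:", x.mean(), x.std())   # approaches roughly 0 and 1
```

In practice, the closed-form marginal derived next lets one sample x_t for any t in a single step instead of iterating through the chain.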
Importantly, the marginal distribution at any timestep t can be computed directly:

q(x_t | x_0) = N(x_t; √α_t x_0, (1 − α_t) I)    (2.4)

where α_t = ∏_{i=1}^{t} (1 − β_i).

Reverse Diffusion Process: The generative process learns to reverse the forward noising. The reverse transitions are parameterized as:

p_θ(x_{t−1} | x_t) = N(x_{t−1}; µ_θ(x_t, t), Σ_θ(x_t, t))    (2.5)

where µ_θ and Σ_θ are neural network outputs trained to predict the mean and variance for denoising. Sampling starts from pure noise x_T ∼ N(0, I) and iteratively denoises to produce a sample x_0.

Training Objective: Diffusion models are trained by maximizing a variational lower bound (ELBO) on the data log-likelihood. This objective can be decomposed into a sum of KL divergences between the true and approximate posteriors at each timestep, leveraging the tractability of Gaussian transitions:

L_ELBO = E_q[ ∑_{t=1}^{T} KL( q(x_{t−1} | x_t, x_0) || p_θ(x_{t−1} | x_t) ) − log p_θ(x_0 | x_1) ]    (2.6)

In practice, a simplified objective is often used: the model learns to predict the added noise at each timestep, which yields stable and efficient training. The formulation for this loss is as follows:

L = E_{x_0, t, ϵ}[ ||ϵ − ϵ_θ(x_t, t)||^2 ]    (2.7)

where ϵ is sampled from a standard normal distribution and ϵ_θ is the noise predicted by the model. In the subsequent chapters, we will discuss how these learning paradigms are integrated into our research, highlighting their roles in addressing specific challenges and advancing the state of the art in our target applications.

Part II: Context-Aware Image Editing

Chapter 3: Contextualized Styling of Images for Web Interfaces using Reinforcement Learning

Content personalization is one of the foundations of today's digital marketing. Often, the same image needs to be adapted for different design schemes for content that is created for different occasions, geographic locations, or other aspects of the target population. We present a novel reinforcement learning (RL) based method for automatically stylizing images to complement the design scheme of media, e.g., interactive websites, apps, or posters. Our approach considers attributes related to the design of the media and adapts the style of the input image to match the context. We do so using a preferential reward system in the RL framework that learns a reward function using human feedback. We conducted several user studies to evaluate our approach and demonstrate that we are able to effectively adapt image styles to different design schemes. In user studies, images stylized through our approach were the most preferred variation across a majority of our experiments. Additionally, we also release a dataset consisting of perceptual associations of web context with the associated image style.

3.1 Introduction

A professional-looking website with an engaging audience is a great way to promote a brand. With the availability of stock photography services such as Shutterstock, Adobe Stock, etc., content creators can easily create impressive-looking websites. Although finding image assets is much easier, for the content to be effective for the brand (i.e., better engagement, higher click-through rates, higher conversion), the images need to be stylized or optimized. Approaches to effectively modify images to improve engagement have been extensively studied in the literature [92]. However, all these studies assume the invariability of the webpage where the image is embedded. In reality, different web pages have different design aesthetics as well as different themes.
Additionally, brands often change the aesthetics and styles of their web pages. Therefore, context is key for adopting the right image styling strategies. Context here refers to the circumstances under which the image asset will be consumed by the user. In the case of websites, context could include but is not limited to the website's design template, the target users and their role, the task at hand or the steps in the process, the user's location, the time and date, or the device being used. Manually styling the image assets so that they blend well with this context can be difficult and time-consuming. Also, creating and delivering image content at a scale that can resonate with the user is a very challenging task. In this paper, we therefore investigate how to efficiently automate this process and optimize the image style characteristics based on the specific context it is associated with. In particular, we use a reinforcement learning (RL)-based algorithm to search across the huge space of image variations and select the best one based on the context. This is also efficient from both a time and computation perspective, as RL leverages the knowledge gained from past optimizations to accelerate the search for the best image variation for a new context. Additionally, since assessing the suitability of the image style for a given context would be meaningless without incorporating users' (content consumers') feedback on it, we propose using a reward function that mimics human users and evaluates the image styles generated. Our proposed approach can, therefore, automatically modify and optimize an image based on the context it is associated with, without constraining the image search in terms of pre-defined, static configurations, and at the same time, agreeing with the user's expectations. Such a data-driven and adaptive framework can help content creators save an incredible amount of time and also make design tools easier to use for beginners.

Figure 3.1: We propose a framework that can automate the process of styling an image such that it suits the context defined by the website it's being embedded in.

Main Contributions. The novel components and main contributions of our work include:

1. An image color stylization method to automatically adapt the image by taking into consideration its context.

2. A reinforcement learning approach that uses a unique reward function capable of capturing human user preferences.

3. Additionally, we release a dataset consisting of perceptual association labels of web context with the associated image style derived from our user study.

3.2 Related Work

In this section, we discuss prior works in content enhancement and context-aware image modification. We also survey literature related to incorporating human feedback in the learning framework and existing datasets.

3.2.1 Enhancement of Content

Automating the enhancement of content has been an active area of research. Over the past few years, several deep learning-based models have been developed to improve and score the quality of images [93, 94, 95] and videos [96]. Similarly, there has also been some work around evaluating the aesthetics of webpages [97]. However, these do not describe what the optimal layout would be if the score is not good enough. Content optimization has also been approached from the perspective of displaying only important details or information that can better engage a user by making use of ranking algorithms [98].
In some works, the device on which the content will be presented has been considered as context for understanding the layout of the content.

3.2.2 Image Enhancement for Context

Some of the earliest works in the personalization of images propose enhancement of an image based on the enhancement strategies adopted by the user for a sample set of images [99, 100]. [101] proposes an effective way to rank and select the most optimal design for a given set of context variables. While our work also depends on context variables, unlike [101], we need to optimize jointly over both the context variables and the image features. [102] proposed an end-to-end reinforcement learning-based framework that formulated various image-retouching operations as a series of differentiable filters. The end goal of the framework was to determine the sequence and parameters of these filters for the given input image. [103] improves on the work of [104] to learn the statistical correlation between the keywords associated with an image and its color characteristics, and modifies the image colors based on the learned correlations. There have been some related works on image enhancement that either do not consider any contextual information at all or consider only internal context, i.e., either looking at the objects present within the image [103] or at pixel-level information [93]. But in a setting like a website, for example, the information and elements surrounding the image (external context) play a role in how the image should appear, which the existing works do not address. Some works based on Bayesian models [105] either fix the context variables or the designs, i.e., the image variations that are possible. Scaling these approaches to our problem is highly complex and computationally expensive. Therefore, we need an image enhancer for a context model that modifies its behavior based on the human feedback received and the context provided.

3.2.3 Incorporating Human Feedback

Unlike some other studies [106] that are only interested in how the image appears, we are also interested in how a human user perceives the image in the context it is presented, and in understanding whether the image variation created is something they would be willing to use for the given context. A lot of work [105] has also been done to incorporate humans into the training loop. Some of these approaches expect humans to provide multiple demonstrations [107] for the system to understand the decisions taken by the human users in the process and emulate the same for unseen tasks. Another line of research expects humans to provide continuous feedback on the outputs produced by the system [108]. Both these approaches can be tedious for the user, especially in a setting where neither the context nor the image is fixed.

3.2.4 Datasets

Datasets like the MIT-Adobe-5k dataset [109], SIQAD [110], or the Webpage dataset [111] have proven to be useful for tasks related to image enhancement, quality assessment, and webpage saliency. But these datasets either just have data regarding how the image was modified or have user feedback data. We need a dataset that ideally not only has sets of variations available for certain images, but also contains variations created with respect to different context parameters. Additionally, we would also like the dataset to be annotated with human feedback on how good the variation is with respect to the context in which it is presented.
3.3 Problem Definition and Approach Intuition

3.3.1 Problem Definition

For any multimedia content W (consisting of images and text), our goal is to develop its stylized version, say, W′. In this work, we specifically focus on optimizing an image I in W using specific image properties that are conditioned on certain context parameters (c_1, c_2, ..., c_n) (e.g., the website design, text style and font, etc.). Our goal is to find the optimal style I′. In particular, we consider an image style optimal for a given context if it is a variation that a human user would most likely pick to be used in W.

3.3.2 Approach Intuition

Learning the best possible image variation for a given context requires a large amount of training data, given the theoretically infinite number of combinations between contexts and variations. Collecting user feedback on each of these combinations is neither scalable nor time- and resource-efficient. Therefore, rather than generating this dataset for training, we instead propose to incorporate the image variation generation and user feedback collection processes in an online fashion using a reinforcement learning approach. Our approach generates image variations based on the context and leverages feedback collected from the users on-the-fly to improve the image variation generation itself. In other words, we allow user feedback to be part of the learning process of our model. Most importantly, we design our system to achieve this efficiently, using a minimal amount of explicit feedback from the user.

In RL, an agent learns from its interactions with the environment over a sequence of steps. At each time step, the agent takes an action a_t depending on its state s_t and the observed reward r_t. The agent's goal is to maximize the cumulative reward for the task. Defining the reward function is a critical component of any RL-based approach. Hand-crafting a reward function to model the user's preference is a non-trivial task whose performance is hard to quantify. Therefore, in this work, we use a deep neural network for generating the reward itself, whose goal is to capture the user preference for a given image in a given context. Our network is trained using explicit user feedback: we consider the distance between the "preference" expressed by the network for a given image-context pair and that of the user providing feedback, which is our ground truth. Our goal is to be able to carry out this learning using the minimum amount of feedback from the users. We discuss details of the reward function in Section 3.4.3.

The action to be taken by our agent is to modify an image to better suit a particular context. In this work, we limit the action space to brightness (b_t), hue (h_t), and contrast (n_t) modifications. Since the range of possible values for n_t, h_t, and b_t is continuous, we consider a deterministic policy that learns the best action as a function of the state. Since defining an accurate model of the environment for the agent to understand the consequences of its actions is non-trivial, we opt for the Deep Deterministic Policy Gradient (DDPG) [112] approach, which is model-free. More details on the proposed framework are given in Section 3.4.

3.3.3 Context Definition and Image Corpus

Table 3.1: Context variables considered in this work, along with four representative values chosen for each.
Context Variable      Values
Background color      blue, orange, green, white
Font color            black, blue, red, green
Font style            Times New Roman, Helvetica, Courier New, Brush Script MT

While there are large numbers of potential attributes to take into account when defining the context, we consider in this work the online page background color, text font color, and text font style as context variables. Using all potential variations of background/font colors and font styles would make the problem intractable. Therefore, we choose four possible variations for each of these variables. Table 3.1 presents more details about the exact set of background colors, font colors, and font styles that we consider in this work.

We use the MIT-Adobe-5k dataset [109] as a corpus for the starting raw images, a popular choice among works on image modification and variation generation. The dataset contains 5000 photographs and their retouched versions created by five different artists. The images cover a wide range of scenes, subjects, and lighting conditions.

Figure 3.2: Architecture Overview: Our RL agent takes as input the state s_t and reward r_t obtained from the environment as a result of an action a_t. The reward function computes the acceptability (r_0) and human preference (r_ϕ) for the image obtained after applying action a_t.

3.4 Network Architecture

We now look closely at the architecture of the proposed framework, which is depicted in Figure 3.2.

3.4.1 Deep Deterministic Policy Gradient (DDPG)

DDPG is based on the concept of DQN but has been developed to handle continuous action spaces. The use of an experience replay buffer helps in addressing issues related to data being dependent and non-identically distributed. The algorithm also introduces the use of target networks in actor-critic policy learning [113] to stabilize the learning process and efficiently deal with the non-stationary target values. For our DDPG framework, let Q(s, a, w) and µ(s, θ) represent the critic and actor networks, respectively, and let Q′(s′, a, w′) and µ′(s′, θ′) represent their corresponding target networks. Based on [114], we can define the loss function for the critic network as

L_c^DDPG = (r + γ Q(s′, µ(s′, θ′), w′) − Q(s, a, w))^2    (3.1)

where r is the reward and γ is the discount factor. The loss function for the actor, on the other hand, is as follows:

L_a^DDPG = ∇_a Q(s, a, w) ∇_θ µ(s, θ)    (3.2)

The target networks can be updated as follows, with τ << 1:

w′ = τw + (1 − τ)w′
θ′ = τθ + (1 − τ)θ′    (3.3)

DDPG is relatively better in terms of sample efficiency as compared to on-policy algorithms like the ones discussed in [115].

3.4.2 State and Actions

At each step, the DDPG agent decides which action to execute according to the current state. The state must provide the agent with comprehensive information for better decisions. In our proposed approach, at any time step t, the state vector s_t can be defined as the combination of the following components:

1. f_t (current input image): the selected action will be directly applied to this image to derive a better result; the image is represented as features [116].

2. a_{t−1} (past historical action vector): this informs the agent about the action taken at time step t − 1. The knowledge of the previous decision could help the action selection at the current step.

3. c (context variables): represents the vector corresponding to the different context variables considered (namely background colors, font style, and font colors).

Therefore, we can define the state vector as

s_t = [f_t, a_{t−1}, c]    (3.4)

The action a_t for our agent is the vector consisting of three values corresponding to the contrast, hue, and brightness factors to be applied to the image. For every action that the agent performs, it receives two values, namely, the new state and a reward that signals how good the action taken was. We detail the reward function in the next section.
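Combining Sections 3.4.1 and 3.4.2, the sketch below illustrates one DDPG update consisting of the critic loss (3.1), the actor objective (3.2), and the soft target updates (3.3); the network sizes, the flattened state vector, and the three-dimensional action (contrast, hue, brightness) are simplified stand-ins for illustration rather than the configuration actually used in our experiments.

```python
# Sketch of one DDPG update (Eqs. 3.1-3.3); dimensions and networks are illustrative only.
import copy
import torch
import torch.nn as nn

state_dim, action_dim = 32, 3        # action = (contrast, hue, brightness) factors
gamma, tau = 0.99, 0.005

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target = copy.deepcopy(actor)      # target networks mu' and Q'
critic_target = copy.deepcopy(critic)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

# A fictitious mini-batch of transitions (s, a, r, s') sampled from the replay buffer.
s  = torch.randn(64, state_dim)
a  = torch.rand(64, action_dim)
r  = torch.randn(64, 1)
s2 = torch.randn(64, state_dim)

# Critic update (Eq. 3.1): regress Q(s, a) toward r + gamma * Q'(s', mu'(s')).
with torch.no_grad():
    target = r + gamma * critic_target(torch.cat([s2, actor_target(s2)], dim=1))
critic_loss = ((target - critic(torch.cat([s, a], dim=1))) ** 2).mean()
opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

# Actor update (Eq. 3.2): ascend Q(s, mu(s)) by minimizing its negative.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# Soft target updates (Eq. 3.3): w' <- tau * w + (1 - tau) * w'.
for net, net_t in [(actor, actor_target), (critic, critic_target)]:
    for p, p_t in zip(net.parameters(), net_t.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
```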
3.4.3 Reward Function

The main objective of the RL agent is to stylize an input image using a sequence of actions such that, after modification, the image content is still preserved and the stylized image also appeals to the context in which it is presented. The reward function r_t has been defined to keep both aspects in consideration, and it is defined as

r_t = r_0 + r_ϕ    (3.5)

r_0 is the static part of the reward, and it is used to guarantee that the image content after stylization is still representative of the initial image. While performing a sequence of actions, if the agent ends up making the image overexposed or underexposed, then this style of image is not acceptable, and the agent is given a negative reward. It is positive otherwise. We define the following criteria for determining underexposed and overexposed conditions, respectively:

|255 − Avg(image pixel colors)| > δ_1    (3.6)
|255 − Avg(image pixel colors)| < δ_2    (3.7)

where δ_1 and δ_2 are user-defined constant values. At any given point, the image must satisfy both Equations 3.6 and 3.7. If it does, we refer to it as an acceptable variation; otherwise, it is unacceptable. We refer to r_0 as static because it indicates whether the image obtained after stylization based on the agent's actions is acceptable or not:

r_0 = κ if acceptable; −κ otherwise    (3.8)

The dynamic part of the reward, r_ϕ, quantifies how likely the image variation created by the agent would be preferred by a human user for the given context. The dynamic reward is based on a preference-based reward learning framework [117], where it is learned from the user's preference between two state-action trajectories, each of which leads to a different image variation. The advantage of this approach lies in the fact that it is more convenient for a human user to choose between two outcomes rather than rank several variations together. During training, the loss is computed between the preference of the model, depicted by r_ϕ, and the preference of the users, r_h.
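The reward decomposition of Eqs. (3.5)-(3.8) can be summarized in a few lines; in the sketch below the thresholds δ_1, δ_2, the constant κ, and the preference model standing in for r_ϕ are illustrative placeholders, not the values used in our system.

```python
# Sketch of the reward in Eqs. (3.5)-(3.8); thresholds and the preference model are placeholders.
import numpy as np

DELTA_1, DELTA_2, KAPPA = 40.0, 200.0, 1.0   # illustrative values for delta_1, delta_2, kappa

def static_reward(image: np.ndarray) -> float:
    """r_0: +kappa if the stylized image is acceptable (satisfies Eqs. 3.6 and 3.7), -kappa otherwise."""
    deviation = abs(255.0 - image.mean())        # |255 - Avg(image pixel colors)|
    acceptable = DELTA_1 < deviation < DELTA_2   # both exposure criteria must hold
    return KAPPA if acceptable else -KAPPA

def total_reward(image: np.ndarray, preference_model, state, action) -> float:
    """r_t = r_0 + r_phi, where r_phi is the learned human-preference reward (Eq. 3.5)."""
    return static_reward(image) + float(preference_model(state, action))

# Usage with a dummy preference model that always returns 0.5:
image = np.random.randint(0, 256, size=(64, 64, 3)).astype(float)
print(total_reward(image, lambda s, a: 0.5, state=None, action=None))
```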
3.4.3.1 Collecting Human Preferences

The user feedback necessary for our approach was obtained by implementing our system on MTurk. Sets of image pairs were created after the state-action trajectory selected by our agent was applied to the raw image. Each image was then laid out on a website template having features described by the context variables (i.e., background color, font color, and font styles). We opted for dummy text on the website in order to avoid potentially affecting the user's perception of the website. The two possible websites were then shown to MTurk workers, who were asked to choose the website where the image complemented the website features well. The obtained responses are fed back into the network for training the reward network.

3.4.3.2 Reward Learning from Human Preferences

The reward function r_ϕ has to be trained such that the preference of the model for a given pair of images is consistent with the observed human feedback. The procedure is very similar to [118]. Two trajectories σ_1, σ_2 are given as input, each trajectory being a sequence of observations and actions {s_p, a_p, ..., s_{p+k}, a_{p+k}}. We obtain preferences y using the pipeline discussed in the previous section for the pair of images corresponding to σ_1 and σ_2. y indicates which image (as a result of a trajectory) the user preferred, i.e., y ∈ {(1, 0), (0, 1)}. This preference, along with the trajectory pair (σ_1, σ_2), is stored in a dataset D as a triplet (σ_1, σ_2, y). Based on the Bradley-Terry model [119], the preference predictor is modeled using the reward function r_ϕ as follows:

P_ϕ(σ_2 ≻ σ_1) = exp{ ∑_t r_ϕ(s_t^2, a_t^2) } / ∑_{i∈{1,2}} exp{ ∑_t r_ϕ(s_t^i, a_t^i) }    (3.9)

where σ_i ≻ σ_j denotes that the image obtained from trajectory σ_i is preferable over that obtained from trajectory σ_j. Next, the function r_ϕ is trained as a binary classifier using the loss function:

L_Reward = −E_{(σ_1, σ_2, y)∼D}[ y(0) log P_ϕ(σ_1 ≻ σ_2) + y(1) log P_ϕ(σ_2 ≻ σ_1) ]    (3.10)

3.4.4 Environment Modeling and Training

Modeling the environment is a very crucial part of achieving the right results in a reinforcement learning (RL) setup. To solve our problem, we defined the environment as a proxy for existing image stylization software (e.g., Lightroom). It takes the input image at timestep t and performs the defined image stylization actions on it to get the image corresponding to the next time step.

We now discuss the training procedure and its parameters. The pseudocode for training is given in Algorithm 2. We begin by initializing a few parameters, namely the frequency of obtaining feedback from human users for the image styles created by the model, and also the number of queries that we will be asking users to evaluate. Before we begin with the iterations, we let the agent interact with the environment in the Exploration Phase using Algorithm 1 to produce trajectories conditioned by the static reward function (Eqn. 3.8). In each iteration, there are two key events, i.e., (1) the human feedback check (lines 8−19) and (2) updating the network parameters (lines 20−28). The first event occurs only after every K iterations. For the human feedback event, pairs of trajectories are uniformly sampled and then sent to the human users for feedback. The human preferences are recorded in a dataset D. Based on the data collected, we train the dynamic reward model (lines 14−17). In the parameter update event, the agent performs an action a_t and observes a reward r_t. After this, mini-batches of transitions (s_t, a_t, s_{t+1}, r_t) are sampled and the parameters of the Actor and Critic are updated (lines 26−27).

Algorithm 1 WARMUP: Unsupervised Exploration
1: Initialize parameters of Q_w and π_ψ and a replay buffer B ← ∅
2: for each iteration do
3:   for each timestep t do
4:     Select action a_t by taking a_t ∼ π_ψ(a_t|s_t)
5:     Execute a_t and observe reward r_t^0 and new state s_{t+1}
6:     Store transitions B ← B ∪ {(s_t, a_t, s_{t+1}, r_t^0)}
7:   end for
8:   for each gradient step do
9:     Sample mini-batch {(s_j, a_j, s_{j+1}, r_j^0)}_{j=1}^{B} ∼ B
10:    Optimize L_c^DDPG in (3.1) and L_a^DDPG in (3.2)
11:  end for
12: end for
13: return B, π_ψ
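The reward-learning step of Eqs. (3.9)-(3.10), which the training loop in Algorithm 2 invokes every K iterations, reduces to binary classification over trajectory pairs: under the Bradley-Terry model, the preference probability is a sigmoid of the difference between the summed rewards of the two trajectories. The sketch below shows one such update; the reward network, segment lengths, and synthetic labels are assumptions made for illustration.

```python
# Sketch of preference-based reward learning (Eqs. 3.9-3.10); shapes and networks are illustrative.
import torch
import torch.nn as nn

state_dim, action_dim, seg_len, batch = 32, 3, 8, 16
r_phi = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(r_phi.parameters(), lr=3e-4)

def trajectory_return(sa: torch.Tensor) -> torch.Tensor:
    """Sum of predicted rewards r_phi(s_t, a_t) over a trajectory segment."""
    return r_phi(sa).sum(dim=1).squeeze(-1)          # shape (batch,)

# Fictitious batch of trajectory pairs (sigma_1, sigma_2) and human labels y.
sigma1 = torch.randn(batch, seg_len, state_dim + action_dim)
sigma2 = torch.randn(batch, seg_len, state_dim + action_dim)
y = torch.randint(0, 2, (batch,)).float()            # y = 1 means the user preferred sigma_2

# Bradley-Terry model (Eq. 3.9): P(sigma_2 > sigma_1) = sigmoid(R(sigma_2) - R(sigma_1)).
logits = trajectory_return(sigma2) - trajectory_return(sigma1)
loss = nn.functional.binary_cross_entropy_with_logits(logits, y)   # cross-entropy of Eq. (3.10)
opt.zero_grad(); loss.backward(); opt.step()
```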
3.5 Experiments and Results

The DDPG agent with a reward function capable of capturing human preferences was trained for around 14,500 episodes with at most 10 steps in each episode, totaling 145k steps.

3.5.1 User Studies

We perform user studies to check the validity and performance of the proposed approach. In particular, we consider two separate user studies to (1) check the validity of the reward model, and (2) evaluate the overall performance of the proposed approach.

Algorithm 2 Human-feedback-induced image enhancement using DDPG
Require: frequency of feedback K
Require: number of queries M per feedback session
1: Initialize actor network µ(s, θ), critic network Q(s, a, w), and dynamic reward network r_ϕ with random weights
2: Initialize target networks µ′ and Q′ with weights θ′ ← θ, w′ ← w
3: Initialize replay memory B
4: Initialize a dataset of preferences D ← ∅
5: Initialize random process N for active exploration
6: // EXPLORATION PHASE
7: B, π_ψ ← WARMUP() in Algorithm 1
8: // REWARD LEARNING
9: for each iteration do
10:   if iteration % K == 0 then
11:     for m in 1..M do
12:       (σ_1, σ_2) ∼ UNIFORM_SAMPLING()
13:       Query instructor for y
14:       Store preference D ← D ∪ {(σ_1, σ_2, y)}
15:     end for
16:     for each gradient step do
17:       Sample minibatch {(σ_1, σ_2, y)_j}_{j=1}^{D} ∼ D
18:       Optimize L_Reward in (3.10) with respect to ϕ
19:     end for
20:     Relabel entire replay buffer B using r_ϕ
21:   end if
22:   for timestep in 1...t do
23:     Select action a_t by taking a_t ∼ π_ψ(a_t|s_t)
24:     a_t ← a_t + N_t
25:     Execute a_t and observe reward r_t and new state s_t
26:     Store transitions B ← B ∪ {(s_t, a_t, s_{t+1}, r_t(s_t))}
27:     Sample N transitions (s_t, a_t, r_t, s_{t+1}) ∈ B
28:     Optimize L_c^DDPG (3.1) & L_a^DDPG (3.2) w.r.t. θ & w
29:     Update target networks using τ
30:   end for
31: end for

3.5.1.1 User Study 1 - Validity of Reward Model

The objective of this study is to evaluate the reward model's capability to capture human preferences. As part of this study, we fed pairs of image variations created for a given context and asked both our dynamic reward model and human users to choose between them. We observed that the model and human users agree on 87% of the samples, which confirms the validity of the model.

3.5.1.2 User Study 2 - Overall Performance

In the second study, we aim to understand whether, for a given random image and context, our RL-based approach is able to produce an image variation that users find appealing. As part of the study, for a given context, we obtained 3 competing versions of the input image: (1) the image stylized by our RL model, (2) the image stylized by an image editor (Expert), and (3) the original unedited image. These 3 versions of the image were used to create 3 versions of the same website, each containing one of the 3 competing image versions. We ask human users to evaluate these 3 in pairs to make the comparisons easier and also to avoid noisy data. Users were asked to evaluate 9 such samples. User preferences were collected using MTurk.

The results obtained from the user study are depicted in Figure 3.3. We can observe that for most of the samples, our model output is preferred by at least 50% of the users. However, there is still some ambiguity (no obvious winners) for all three comparisons. Hence, we next perform statistical tests to verify the results further. We perform statistical hypothesis tests to see if our approach is preferred when compared with two baselines. As described in Figure 3.3, our study shows a pair of web pages to users and then asks which they prefer.

Figure 3.3: The graph shows the results obtained from our User Study 2 (y-axis: proportion preferring A over B; x-axis: image and context pair).
(i) shows the results obtained while comparing the expert's stylized images with the original images for different contexts, (ii) shows comparisons between our model's stylized images and the expert's stylized images, and (iii) compares our model's stylized images with the original images for a given context.

For each of the three ways of creating the webpage (RL, Expert, and Original), we perform pairwise comparisons. Given a pair of alternatives i and j, we ask the question: how likely is it that i will be preferred over j? Let us say that this probability is denoted by p. Then, one hypothesis of interest is the following: H_0: p = 1/2 vs. H_1: p > 1/2. Such a hypothesis tests whether humans indeed prefer the first version of the website compared to the second version of the website. For each comparison, we have ∼4,500 responses, and we can then use a test of binomial proportions to draw a conclusion. The results of this analysis are presented in Table 3.2, which presents the details of the experiments and the one-sided p-value (and one-sided 95% confidence intervals). We observe a few things. When comparing our RL Model to the Original versions of the web pages, we see that users prefer the versions generated by the Model in a strongly statistically significant way. For the comparison of the model with the web pages created by the expert, we see that there is some evidence that users prefer the model-generated output, though the evidence is less overwhelming (p-value of 0.046). When comparing the expert-created web pages with the original web pages, we also see strong evidence that the expert-created pages are preferred by the study participants.

Table 3.2: Statistical Analysis of User Study
A          B          n       n_A     Proportion   p-value    95% CI
RL Model   Original   4,500   2,543   0.5651       < 0.0001   (0.5528, 1.0)
RL Model   Expert     4,500   2,307   0.5127       0.04604    (0.5003, 1.0)
Expert     Original   4,499   2,626   0.5837       < 0.0001   (0.5714, 1.0)
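To illustrate the binomial-proportion analysis behind Table 3.2, the sketch below carries out the one-sided test with a normal approximation on the counts from the RL Model vs. Original row; the exact procedure used in the study may differ in detail.

```python
# One-sided test of a binomial proportion (H0: p = 0.5 vs. H1: p > 0.5),
# illustrated with the RL Model vs. Original counts from Table 3.2.
# Uses a normal approximation; the dissertation's exact computation may differ.
import math

n, n_a = 4500, 2543                  # total responses and responses preferring A (RL Model)
p_hat = n_a / n                      # observed proportion, ~0.5651
se = math.sqrt(0.25 / n)             # standard error under H0 (p = 0.5)
z = (p_hat - 0.5) / se
p_value = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided upper-tail p-value

# One-sided 95% confidence interval: lower bound at p_hat - 1.645 * sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - 1.645 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"p_hat={p_hat:.4f}, z={z:.2f}, one-sided p-value={p_value:.2e}, 95% CI=({lower:.4f}, 1.0)")
```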
3.6 Conclusions

We present in this paper an RL-based approach to generate the optimal style of a given image for a given context. The RL agent takes an image and the set of variables defining the context in which we wish to present it. The output is the image styled to blend well with the context. Our approach efficiently handles challenges related to scalability and data by seamlessly incorporating human feedback into the training process to improve the styles of the images generated. We demonstrate through user studies that the proposed approach can produce variations close to human preferences in a time- and cost-effective manner. While the results of our approach to developing contextualized content are encouraging and promising, we identify a few areas for future work. The experiments conducted in this work use feedback collected from the general population and not a specific individual. It is possible to include more user-specific context variables to understand and explore user-level personalization of the image for a given context. Additionally, we would also like to understand the impact of other factors like content genre, website topic, etc., to create an even better image stylization. Finally, our work could be extended by considering image content-specific stylization, i.e., by considering the interaction between the context and the content of the image itself.

Chapter 4: TAME-RD: Text Assisted Replication of Image Multi-Adjustments for Reverse Designing

Given a source image and its edited version produced based on human instructions in natural language, how do we extract the underlying edit operations to automatically replicate similar edits on other images? This is the problem of reverse designing, and we present TAME-RD, a model to solve this problem. TAME-RD automatically learns from the complex interplay between image editing operations and the natural language instructions to infer fully specified edit operations. It predicts both the underlying image edit operations as discrete categories and their corresponding parameter values in the continuous space. We accomplish this by mapping together the contextual information from the natural language text and the structural differences between the corresponding source and edited images using the concept of pre-post effect. We demonstrate the efficiency of our network through quantitative evaluations on multiple datasets. We observe improvements of 6–10% on various accuracy metric