ABSTRACT

Title of Dissertation: CONTEXT-AWARE COMPUTATIONAL VIDEO EDITING AND RE-EDITING

Pooja Guhan
Doctor of Philosophy, 2025

Dissertation Directed by: Professor Dinesh Manocha
Department of Computer Science

Video has emerged as the dominant medium for communication and creative expression in the digital era, fueled by advances in consumer cameras and the ubiquity of content-sharing platforms. This democratization has empowered creators from diverse backgrounds and made capturing video effortless, but editing remains a significant barrier. It demands both technical expertise and nuanced creative judgement in narrative structure, emotional tone, and audience engagement. Current artificial intelligence (AI)-driven tools offer automation for basic editing tasks but fall short in supporting the high-level creative decisions that define compelling video, often neglecting narrative intent, production context, and viewer perception. We address this gap by introducing adaptive, expressive, and accessible editing techniques that bridge automation and artistic intent. We present computational models that support decision making across key stages of post-production, structured in three parts.

The first part presents two context-aware image editing approaches. The first approach leverages reinforcement learning to automatically style images in a way that harmonizes with the broader design or narrative context, rather than applying uniform edits across diverse content. The second approach, TAME-RD, pioneers AI-based reverse designing to provide detailed breakdowns of editing operations and parameter strengths for easy style extraction and transfer. TAME-RD reported improvements of 6-10% on various accuracy metrics and 1.01X-4X on the RMSE score on the GIER dataset. We also introduced a new dataset, I-MAD. Together, these methods advance automated color grading, enabling personalized and contextually relevant workflows.

The second part tackles the context-based adaptation of visual effects and camera motions to diverse narrative and stylistic goals. Our algorithm, V-Trans4Style, employs a transformer-based encoder-decoder and style conditioning module to generate visually seamless, temporally consistent transitions tailored to targeted production styles, significantly outperforming prior methods. On the AutoTransition dataset, V-Trans4Style achieved improvements of 10%-80% in Recall@K and mean rank values over baselines. We also introduced the AutoTransition++ dataset. Complementing this, CamMimic introduces a zero-shot algorithm that leverages video diffusion models to transfer camera motion patterns from reference videos to new scenes, allowing creators to emulate complex camera work without additional data or 3D information. Both approaches received strong user preference (at least 70%), underscoring their effectiveness in empowering creative video editing.

The third part focuses on the edit refinement process, treating audience feedback as new context to guide iterative editing decisions, helping creators identify impactful moments and enhance future content delivery. To address the challenge of reliably quantifying audience engagement, we present a machine learning-based approach to estimate viewer engagement levels during video playback, drawing on psychological theories of attention and interaction.
Our method has been validated through real-world experiments, including an application in telehealth for mental health, where the system automatically assessed patient engagement from video sessions. We obtained a 40% improvement in evaluation metrics over state-of-the-art methods for engagement estimation. By enabling objective, automated measurement of engagement, this approach empowers editors to make data-driven refinements, ultimately improving the effectiveness and resonance of video content. CONTEXT-AWARE COMPUTATIONAL VIDEO EDITING AND RE-EDITING by Pooja Guhan Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy July 2025 Advisory Committee: Dr. Dinesh Manocha (Chair) Dr. Maria Cameron (Dean’s Representative) Dr. Ramani Duraiswami Dr. Ming Lin Dr. Guan-Ming Su © Copyright by Pooja Guhan 2025 Dedication To my family and to the video editors who bring stories to life, frame by frame. ii Acknowledgments This dissertation is not only a reflection of my own efforts but also a testament to the unwavering support of those who have shaped my journey, both academically and personally. While the path has certainly felt like a rollercoaster at times, it has also been one of deep self-discovery, revealing strengths I didn’t know I had, exposing areas for growth, and above all, teaching me the importance of perseverance amidst uncertainty. I am deeply grateful to my advisor, Prof. Dinesh Manocha, whose unwavering confidence in me, especially during my moments of self-doubt, empowered me to embrace the challenges of pursuing a Ph.D. with renewed determination and assurance. My sincere thanks to my dissertation committee members, namely, Prof. Maria Cameron, Prof. Ming Lin, Prof. Ramani Duraiswami, and Dr. Guan-Ming Su, for their invaluable guidance, constructive feedback, and continued support. I am deeply grateful for the time and effort they dedicated to reviewing my dissertation and for their encouragement throughout the process. I extend my heartfelt thanks to my incredible GAMMA lab mates, constant companions on this academic journey. From idea-sparking brainstorming sessions to laughter-filled coffee breaks, your camaraderie made even the toughest days brighter. Thank you for your unwavering support, right from sharing frustrations over stubborn experiments to celebrating every victory, no matter how small. I’m grateful to have shared this experience with such a dedicated and remarkable team. Among these wonderful people (some who have also graduated), in particular, I am especially grateful to Trisha Mittal, for trusting me (then a Masters student) enough as a co-author and iii introducing me to the GAMMA family. I am grateful to Rohan Chandra, for his invaluable support and mentorship throughout my time in the lab. My sincere thanks to Kasun Weerakoon, Adarsh Sathyamoorthy, Mohamed Bashir Elnoor, Gershom Seneviratne, Senthil Hariharan, Vishnu Dorbala, Utsav Patel, and Jim for sharing your expertise and passion for robotics with me. Our discussions, whether about cutting-edge research or troubleshooting the quirks of robots, were always intellectually stimulating and often inspired new directions in my own work. Special shout out to the “foodies gang” (Laura Zheng, Niall Williams, Xijun Wang, Tianrui Guan, Ruiqi Xian, Bhrij Patel, Geonsun Lee, Yonghan) for organizing some truly enjoyable outings, including the unforgettable Totality 2024! 
I was grateful to join many of these moments, which provided a refreshing break from the routine and added a wonderful flavor of camaraderie to my PhD experience. Finally, to Divya Kothandaraman, for being a wonderful co-author and friend. Also, a big shout-out to the sys-admin team that I had the opportunity to be part of for a good chunk of this academic journey. I learned so much being in your company, trying to resolve the lab system issues as they came. A special mention to Alper Bozkurt for taking the initiative and helping us streamline the system allocation process within the lab to ensure all the members, including me, could continue running their experiments without any disruptions. It still feels surreal that I was fortunate enough to share a home with some of the most inspiring and thoughtful individuals I’ve ever met - Priyal Gala, Nitin Sanket, Anoorag Sunkari, Nakul Garg, Chahat Deep Singh, Sunaina Prabhu, Mrunal Dhaygude, and Aakriti Agarwal. We met as housemates, but now they are my extended family that I didn’t think I would need. Despite their own demanding and often exhausting schedules, they were always there for me, ready to lend an ear, offer support, or just sit with me in shared silence. Our home was a place of vibrant debates, unexpected ideas, and uncontrollable laughter (often leading to me literally falling off my chair!). iv These moments, both profound and silly, grounded me through the highs and lows of this journey, and I’m forever grateful for their presence in my life. If not for them, I would have the misfortune of not meeting and knowing Stella (our house dog). Getting greeted with the most enthusiastic and energetic barks when I got home every day, followed by being forced into a round of tug of war with her, made even the most frustrating research days the best. I feel immensely grateful and lucky to have received the opportunity to collaborate with and receive mentorship from some of the brightest minds in our research community across different projects. These include Aisha Walcott, Celia Cintas, Sekou Lionel Remy, Uttaran Bhattacharya, Saayan Mitra, Somdeb Sarkhel, Stephano Petrangeli, Ritwik Sinha, Vishy Swaminathan, Tsung- Wei Huang, Dae Yeol Lee, Gloria Reeves, Kristin Bussell, and Aniket Bera. Their guidance, encouragement, and thoughtful feedback have not only shaped the direction of my work but have also deeply influenced the way I approach research and collaboration. I will always carry forward the lessons I’ve learned from them. I began my journey at UMD as a Master’s student and later transitioned into the PhD program. I was fortunate to share this path with a wonderful group — Naman Awasthi, Vasu Singla, Vaishnavi Patil, Shishira Maiyya, Noor Pratap Singh, Sai Yerramreddy, and Pulkit Kumar. Their camaraderie, curiosity, and support made the highs brighter and the lows easier. From navigating the COVID-19 lockdown to celebrating festivals and exchanging research ideas, they made this journey truly memorable. Some friendships transcend time, place, and context. I’m extremely thankful for the friends who have stayed by my side since my school and undergrad days, as well as the incredible people I have had the joy of meeting in recent years. These include Ishan Bansal, Shashank Gupta, Sai Karthikey Pentapati, Anand Murali, Srujana Peddinti, Arushi Singhal, Simran Singhal, Thota v Venkata Aishwarya, Anushka Agarwal, Surya Soujanya, Zakir Hussain Shaik, Harikrupa Sridhar, Maithili Kunte, Aishwarya Deshpande, and Nikshita Ranganathan. 
My thesis would not have been possible without their steady encouragement and belief in me. I couldn’t have asked for better cheerleaders, sounding boards, or safe spaces. Their unwavering support throughout this journey has meant the world to me. I would also like to thank Migo Gui, Tom Hurst, and Jodie Gray for helping me with all administrative concerns, including TAships and RAships. Their support and assistance made my graduate student life significantly smoother and less stressful. Throughout this journey, my family has been my greatest pillar of strength. My sincere thanks to my mom, Priya, and dad, Guhan, for their unwavering patience and belief in me, especially during the times I doubted myself. Knowing that they stood firmly by my side through every high and low gave me the courage to keep moving forward. Their unconditional love, quiet sacrifices, and constant reassurance have been the foundation that held me up through the toughest moments. This achievement is as much theirs as it is mine. My sister, Mahima has been my source of joy, perspective, and resilience throughout this journey. She always knew how to bring a smile to my face when things felt overwhelming. Her steady presence and unshakable faith in me helped me emerge from this journey as a more optimistic, confident, and brave researcher. I am grateful to her for reminding me, time and again, of the light even in the most chaotic moments. I couldn’t have asked for a better cheerleader. I am immensely blessed to have received unwavering encouragement and support from my grandparents, aunts, uncles, and cousins throughout this journey. Their constant check-ins, words of motivation, and quiet pride in my progress have been instrumental in keeping me grounded and motivated. Whether through heartfelt conversations, shared laughter, or simply being there when I needed a break, their presence reminded me that I was never alone in this pursuit. vi I wish to express my heartfelt appreciation to all the funding agencies that have supported my research over the past few years. These include Adobe, Dolby, MPower, and the UMD Graduate School Summer Fellowship. Their generous support has been instrumental in enabling me to pursue ambitious ideas, attend conferences, collaborate with experts, and push the boundaries of my work. This journey would not have been possible without the freedom and flexibility that their funding provided, allowing me to focus deeply on research while growing both intellectually and professionally. Additionally, I am deeply grateful to the open-source platforms and vibrant community forums such as PyTorch, TensorFlow, Linux, and Stack Overflow that played a crucial role in my research journey. Their extensive resources, shared knowledge, and collaborative spirit helped me navigate countless implementation challenges. This thesis would not have been possible without the invaluable contributions of these communities. I am sincerely thankful to everyone who has been a part of this journey in one way or another. Your support, whether big or small, has left a lasting impact. I extend my humble apologies to anyone I may have inadvertently overlooked. Please know that your presence and kindness have not gone unnoticed, and I carry deep gratitude for each of you. As I bring this chapter to a close and look ahead to new beginnings, I do so with immense gratitude for the experiences, lessons, and connections that have defined this journey. 
This acknowledgment is a heartfelt tribute to everyone who, in their own way, helped shape and support my path. Thank you for being part of this unforgettable ride and for leaving an indelible mark on both my work and my life. vii Table of Contents Dedication ii Acknowledgements iii Table of Contents viii List of Tables xii List of Figures xiii I World of Video Production 1 Chapter 1: Introduction and Overview 2 1.1 Complexity of Video Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 AI and Video Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Role of Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Context and AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.5 Thesis Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Chapter 2: Background 16 2.0.1 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.0.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.0.3 Generative AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 II Context-Aware Image Editing 25 Chapter 3: Contextualized Styling of Images for Web Interfaces using Reinforcement Learning 26 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.1 Enhancement of Content . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.2 Image Enhancement for Context . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.3 Incorporating Human Feedback . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3 Problem Definition and Approach Intuition . . . . . . . . . . . . . . . . . . . . . . 31 viii 3.3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.2 Approach Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3.3 Context Definition and Image Corpus . . . . . . . . . . . . . . . . . . . . 33 3.4 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4.1 Deep Deterministic Policy Gradient (DDPG) . . . . . . . . . . . . . . . . 34 3.4.2 State and Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4.3 Reward Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4.4 Environment Modeling and Training . . . . . . . . . . . . . . . . . . . . . 39 3.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.5.1 User Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Chapter 4: TAME-RD: Text Assisted Replication of Image Multi-Adjustments for Reverse Designing 46 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3 TAME-RD: Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.3.2 Stream 1: Pre-Post Effect From Images . . . . . . . . . . . . . . . . . . . 56 4.3.3 Stream 2: Context From Language . . . . . . . . . . . . . . . . . . . . . . 
57 4.3.4 Multitask Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.4.1 GIER Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.4.2 I-MAD: Our Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.5 Experiment Results and Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.5.1 Training Details and Evaluation Metrics . . . . . . . . . . . . . . . . . . . 69 4.5.2 Quantitative Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.5.3 Ablation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.5.4 Additional Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.5.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.6 Ethical Considerations and Risks . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.8 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 III Context-Aware Visual Effects 83 Chapter 5: V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation 84 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.3 Task Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.1 AutoTransition Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.5 V-Trans4Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 ix 5.5.1 Pre-trained Transition and Style Embeddings . . . . . . . . . . . . . . . . 96 5.5.2 Encoder and Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.5.3 Style Conditioning Module . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.6 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.6.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.6.2 Comparisons and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.6.3 Ethical Considerations: . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.7 Additional Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.7.1 Pre-trained Transition and Style Embeddings . . . . . . . . . . . . . . . . 109 5.7.2 User Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.7.3 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.8 Broad Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.9 Conclusion, Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . 113 Chapter 6: CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models 116 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 6.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 6.2.1 Video Personalization, Controllability, and Camera Motion Transfer . . . . 120 6.2.2 Image Personalization & Novel View Synthesis . . . . . . . . . . . . . . . 121 6.3 CamMimic: Our Approach . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 122 6.3.1 Phase-1: Multi-Concept Finetuning . . . . . . . . . . . . . . . . . . . . . 124 6.3.2 Phase-2: Homography Guided Inference . . . . . . . . . . . . . . . . . . . 127 6.3.3 CameraScore - A New Metric . . . . . . . . . . . . . . . . . . . . . . . . 128 6.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 6.4.3 Comparison with Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.4.4 Ablation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 6.5.1 User Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 6.5.2 Additional User Study Details . . . . . . . . . . . . . . . . . . . . . . . . 139 6.6 Broader Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6.7 Conclusion, Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . 151 IV Generating Context for Editing and Re-Editing 153 Chapter 7: Developing an Effective and Automated Patient Engagement Estimator for Telehealth: A Machine Learning Approach 154 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 7.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 7.1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 7.1.3 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 7.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 x 7.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.2.2 Proposed Model Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 7.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 7.3.1 Study 1: Testing Our Proposed Approach on MEDICA . . . . . . . . . . . 179 7.3.2 Study 2: Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 7.3.3 Study-3:Analysis on Real-World Data . . . . . . . . . . . . . . . . . . . . 181 7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 7.4.1 Principal Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 7.4.2 Comparison with Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . 184 7.4.3 Strengths and Implications . . . . . . . . . . . . . . . . . . . . . . . . . . 185 7.4.4 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 187 7.4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 V Conclusion and Future Directions 189 Chapter 8: Conclusion, Limitations and Future Directions 190 8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 8.2 Applications and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 8.2.1 Context-Aware Image Editing . . . . . . . . . . . . . . . . . . . . . . . . 196 8.2.2 Context-Aware Visual Effects . . . . . . . . . . . . . . . . . . . . . . . . 197 8.2.3 Generating New Context For Editing . . . . . . . . . . . . . . . . . . . . . 199 8.2.4 Extending Context-Aware Editing to 3D Videos . . . . . . . . . . . 
. . . . 200 8.2.5 Context-Aware Editing for Immersive and 360◦ Video Formats . . . . . . . 201 Bibliography 203 xi List of Tables 3.1 Automated contextualized styling of images: Context variables considered . . . . . 33 3.2 Automated contextualized styling of images: User study results . . . . . . . . . . . 44 4.1 TAME-RD: Edit operations in I-MAD-Dense . . . . . . . . . . . . . . . . . . . . 65 4.2 TAME-RD: Edit operations in I-MAD-Pro . . . . . . . . . . . . . . . . . . . . . . 68 4.3 TAME-RD: Quantitative results on different datasets . . . . . . . . . . . . . . . . 81 4.4 TAME-RD: Quantitative results with different fusion strategies . . . . . . . . . . . 82 4.5 TAME-RD: Quantitative results with different λ . . . . . . . . . . . . . . . . . . . 82 5.1 V-Trans4Style: Encoder-decoder network quantitative results . . . . . . . . . . . . 108 5.2 V-Trans4Style: Ablation experiment results for encoder-decoder network . . . . . . 108 5.3 V-Trans4Style: Production style based transition recommendation quantitative results 109 6.1 CamMimic: Feature-by-feature comparison . . . . . . . . . . . . . . . . . . . . . 134 7.1 Engagement estimation for Telehealth: MEDICA comparison with related datasets 162 7.2 Engagement estimation for Telehealth: Demographic information for real-world samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 7.3 Engagement estimation for Telehealth: Quantitative results on MEDICA . . . . . . 180 7.4 Engagement estimation for Telehealth: Ablation experiments on MEDICA . . . . . 181 7.5 Engagement estimation for Telehealth: Real-world experiment results . . . . . . . 183 xii List of Figures 1.1 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 Automated contextualized styling of images - Overview . . . . . . . . . . . . . . . 28 3.2 Automated contextualized styling of images - RL based architecture using a combination of dynamic and static reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3 Automated contextualized styling of images - User study results . . . . . . . . . . 43 4.1 TAME-RD: Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 TAME-RD: Multimodal Multitask Learning Network Architecture . . . . . . . . . 51 4.3 TAME-RD: Understanding challenges - Different operations similar effects . . . . 53 4.4 TAME-RD: Understanding challenges - Same operation different effects . . . . . . 54 4.5 TAME-RD: Text provides semantic context . . . . . . . . . . . . . . . . . . . . . 56 4.6 TAME-RD: Experiment results - Per class average precision on GIER . . . . . . . 74 4.7 TAME-RD: Experiment results - Per class average precision on I-MAD-Dense . . . 75 4.8 TAME-RD: Qualitative results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.9 TAME-RD: Failure Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.1 V-Trans4Style: Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.2 V-Trans4Style: Visual transition distribution across production styles . . . . . . . . 92 5.3 V-Trans4Style: AutoTransition++ distribution . . . . . . . . . . . . . . . . . . . . 93 5.4 V-Trans4Style: Encoder-Decoder with Style Conditioning Module Architecture . . 95 5.5 V-Trans4Style: Pre-processing stage - Multi-task learning . . . . . . . . . . . . . . 97 5.6 V-Trans4Style: Reconstruction loss decoder architecture . . . . . . . . . . . . . . 102 5.7 V-Trans4Style: Pre-processing stage experiment results . . . . . . . . 
. . . . . . . 106 5.8 V-Trans4Style: Pre-processing stage class-wise visual transition accuracy . . . . . 110 5.9 V-Trans4Style: Pre-processing stage production style class-wise accuracy . . . . . 110 5.10 V-Trans4Style: User study results . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.11 V-Trans4Style: Qualitative results Set-1 . . . . . . . . . . . . . . . . . . . . . . . 113 5.12 V-Trans4Style: Qualitative results Set-2 . . . . . . . . . . . . . . . . . . . . . . . 114 5.13 V-Trans4Style: Impact of style conditioning module on transition recommendation 115 6.1 CamMimic: Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.2 CamMimic: Camera trajectory challenge . . . . . . . . . . . . . . . . . . . . . . . 122 6.3 CamMimic: Zero-shot video to image camera motion transfer architecture . . . . . 123 xiii 6.4 CamMimic: Need for CameraScore, a homography-based metric . . . . . . . . . . 129 6.5 CamMimic: Experiment results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 6.6 CamMimic: Qualitative results Set-1 . . . . . . . . . . . . . . . . . . . . . . . . . 139 6.7 CamMimic: Qualitative results Set-2 . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.8 CamMimic: Qualitative results Set-3 . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.9 CamMimic: Qualitative results Set-4 . . . . . . . . . . . . . . . . . . . . . . . . . 142 6.10 CamMimic: Qualitative results Set-5 . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.11 CamMimic: Qualitative results Set-6 . . . . . . . . . . . . . . . . . . . . . . . . . 144 6.12 CamMimic: Qualitative results Set-7 . . . . . . . . . . . . . . . . . . . . . . . . . 145 6.13 CamMimic: Qualitative results Set-8 . . . . . . . . . . . . . . . . . . . . . . . . . 146 6.14 CamMimic: User study results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 6.15 CamMimic: User studies setup details - Concept introduction page . . . . . . . . . 149 6.16 CamMimic: User study setup details - Sample question . . . . . . . . . . . . . . . 150 7.1 Engagement estimation for Telehealth: Overview . . . . . . . . . . . . . . . . . . 155 7.2 Engagement estimation for Telehealth: Samples from MEDICA dataset . . . . . . 159 7.3 Engagement estimation for Telehealth: Samples from real-world data . . . . . . . . 163 7.4 Engagement estimation for Telehealth: GAN-based semi-supervised multimodal learning architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 xiv Part I World of Video Production 1 Chapter 1: Introduction and Overview Over the past decade, video has emerged as the dominant medium for communication, entertainment, education, marketing, and documentation. With the proliferation of smartphones, affordable high- resolution cameras, and accessible video sharing platforms, the barriers to video creation and consumption have dramatically decreased. Platforms such as YouTube, TikTok, Instagram, and Netflix have not only accelerated content consumption but also have shaped evolving expectations around storytelling formats, pacing, production quality, and personalization [1, 2]. Consequently, video content has seen explosive growth. By some estimates, video comprises more than 80% of global internet traffic. This content explosion has placed tremendous pressure on video production workflows. Video production [3] traditionally encompasses three major phases - 1. 
Pre-Production involves planning and organizing the video project, including script writing, storyboarding, casting, location scouting, and equipment preparation. This stage sets the foundation for the video project, ensuring that all necessary elements are in place before filming begins. 2. Production is the actual filming phase, where the video footage is captured. This stage requires careful coordination of the cast, crew, and equipment to ensure that the vision outlined in the pre-production stage is effectively utilized. 2 3. Post-Production is where the raw footage is transformed into a polished final video. This is the final stage of the video production process and is typically the longest of all stages. It encompasses a series of intricate and interdependent tasks. Content creators, ranging from professionals and indie filmmakers to educators and marketers, are expected to deliver engaging, high-quality videos quickly and frequently. While capturing raw footage has become relatively frictionless, the post-production process remains labor-intensive, requiring significant expertise and time. To better understand the complexities of post-production, we further divide it into three functional stages [4, 5]: • Editorial stage: This initial phase focuses on organizing and assembling the raw footage into a coherent narrative. Tasks include selecting takes, trimming clips, determining scene order, and creating a rough cut to establish pacing and structure. • Video Editing Stage: Here, the creative and stylistic identity of the video takes shape. This stage includes the application of visual effects (VFX), transitions, color correction and grading, camera motion stylization, and edit refinement. These enhancements are essential for aligning the final content with the creator’s vision and the intended emotional tone. • Finalization Stage: The final step involves technical polishing and delivery preparation. Tasks include audio mixing, sound design, subtitle integration, compression, format conversion, and quality assurance to ensure compatibility with the target distribution platform. Among these, the Video Editing stage often emerges as the primary bottleneck. 3 1.1 Complexity of Video Editing Video editing is a sophisticated and nuanced craft focused on shaping the visual and emotional tone of a video to obtain a compelling visual narrative after the initial editorial work [4, 6]. Far more than simply cutting and splicing clips, it is a multi-layered process that demands both technical mastery and creative vision [7]. The journey typically begins with meticulously logging hours of footage, identifying the most powerful moments, and assembling a rough cut that lays the foundation for the story [5]. From there, editors dive into color grading to set the mood, add visual effects to enhance storytelling, and mix sound to create an immersive audio landscape [8, 9]. Each stage, whether it’s refining transitions, balancing audio levels, or integrating feedback from collaborators, is crucial in shaping the final product into a seamless and emotionally resonant experience for viewers [4, 7]. What makes video editing particularly complex is its power to control not just what the audience sees, but how they feel [10, 11]. Editors sculpt the pacing, build suspense or release tension, and guide the viewer’s eye through carefully timed cuts and visual cues. The process is both an art and a science, requiring a keen sense of timing, narrative structure, and an understanding of human psychology. 
Also, editing isn’t a one-time process. Often, we return to our videos to repurpose them for different needs, audiences, or emotional tones. This is referred to as re-editing. In today’s fast-paced digital world, the demand for high-quality, engaging video content is at an all-time high, yet the editing process remains notoriously time-consuming and labor-intensive. Artificial intelligence (AI) has long been applied in video-related tasks, particularly within the domains of computer vision and multimedia processing. Early systems employed rule-based methods or traditional statistical models for frame analysis and content organization. However, recent advances 4 in deep learning and large-scale data availability have enabled more powerful and flexible AI-driven solutions in video editing (and re-editing). For the sake of clarity and consistency, the term video editing will be used throughout this dissertation to also encompass re-editing processes. 1.2 AI and Video Editing The integration of AI into video editing has transformed both the capabilities and the accessibility of post-production workflows. AI promises to automate or accelerate several editing steps, reducing human effort while improving consistency. Early applications of AI in video editing focused on low-level operations such as shot boundary detection [12, 13, 14], object tracking [15, 16, 17], face or background segmentation [18, 19] and stabilization. These tasks, once time-consuming, have been made more efficient through computer vision models trained on large-scale annotated datasets. Progress in deep learning has enabled more ambitious applications. AI systems can now generate video highlights [20, 21, 22, 23], perform video summarization [24, 25], detect scenes of emotional relevance, and even recommend edits based on aesthetics or co-occurrence patterns. Commercial platforms like Adobe Premiere Pro [26] have enabled automation of tasks like smart reframing, background score matching, or even stabilization. At the frontier of generative AI, models such as Imagen [27], SORA [28], Make-A-video [29] explore the synthesis of short video segments from textual prompts, offering new directions in content creation. Transformer-based models for video understanding [30, 31] further enhance the capability of AI to reason over multimodal data, combining visual, auditory, and textual cues to infer semantic relevance or contextual importance. These advances suggest a future where AI might not only automate rote editing steps but also contribute to creative ideation and narrative construction. 5 1.3 Role of Context Context refers to the surrounding circumstances, conditions, or intentions that give meaning to a piece of information or an action [32]. In computational systems, context has been shown to enhance decision-making by situating content within a broader semantic or functional framework [33, 34]. Context can broadly be divided into two categories: internal and external [34, 35]. Internal context consists of features inherent to the data itself. In the case of video editing, this includes visual patterns, object semantics, spatial layout, temporal continuity, and audio cues extracted directly from the footage [36]. It enables systems to answer the question: what is happening in a frame or sequence, such as who appears, what actions unfold, how the composition is framed, and when the transitions occur. External context, by contrast, refers to information that lies beyond the immediate content. 
In the case of video editing, for instance, it provides the rationale for why a particular edit is appropriate in a given situation. It governs the intent behind an editorial decision. This includes, but is not limited to, narrative intent [37], genre conventions, platform constraints, cultural background, and anticipated audience reactions [5, 38, 39]. Both are tightly interwoven and crucial for shaping the audience's perception of a video. Traditional editing practices implicitly operate on both types of context. Together they inform every editorial decision and breathe coherence and emotion into a sequence of visuals [5, 10, 37]. A simple transition, for example, may function differently depending on its context. A fade might evoke nostalgia in one scene but feel disjointed in another [40]. A quick cut might heighten the sense of urgency in one sequence or disrupt the pacing in another. The nuance required to make these distinctions arises from understanding both what is seen (internal context) and why it matters (external context). The ability to sense and manipulate these nuances is what distinguishes skilled editors from generic automated systems [41]. Editing decisions must be guided by an understanding of who the audience is, what emotion is being conveyed, how the moment connects to the previous or the next one, and what visual logic governs the sequence.

However, modern AI tools lack access to and understanding of this full spectrum of context. While they are trained to recognize patterns within the content (such as repetitive shots, facial expressions, or motion cues), they often remain unaware of the broader communicative goals or stylistic considerations [42]. For example, a system trained to identify visually similar frames may successfully group scenes with recurring elements, but it may fail to recognize a moment of emotional climax or symbolic importance that warrants emphasis, even if it is visually unremarkable. Similarly, a model trained to generate a video summary may optimize for coverage or diversity but fail to reflect the intended message or tone of the source material. This gap is especially evident in creative or expressive domains where editing decisions are deeply tied to artistic intent, cultural nuance, and audience reception. Consider the difference between editing a travel vlog versus a cinematic short film versus a therapy session recording. Each requires a distinct understanding of what is salient, what pacing is appropriate, and what visual or auditory cues align with communicative goals. An editing system that treats all content uniformly is unlikely to meet the nuanced expectations of different storytelling contexts.

Furthermore, many machine learning or vision approaches are often designed as standalone automatons, focusing on the what (i.e., which frames to cut, which transition to apply) but not the why behind those decisions. As a result, edits produced by such systems often feel sterile, emotionally disconnected, or stylistically inconsistent, even when they are technically flawless. This shortcoming arises because editing is not merely a technical process. It is a narrative and affective one. A cut is not just a slice; it is a decision laden with meaning [5]. Depending on its timing, rhythm, and context, a cut can heighten suspense, convey psychological states, elicit laughter, or guide audience attention. Current AI systems cannot reason about these layers of meaning because they lack an understanding of the "why" behind the editorial decisions.
They may detect what occurs in a shot and when it occurs, but they remain blind to why that moment matters in the broader arc of the story. As a result, automation, while necessary and often beneficial, is far from being sufficient for achieving meaningful, engaging edits that resonate with the audience. 1.4 Context and AI The importance of context is not unique to video editing. Across several domains of AI, context (both internal and external) has proven to be crucial for building technologies that align more closely with human reasoning, intention, and experience. In natural language processing, contextual embeddings revolutionized how machines understand semantics. Models such as BERT [43], GPT [44], and ELMo [45] significantly outperform earlier methods by incorporating the surrounding linguistic environment (an example of modeling internal context) to disambiguate meaning. These models learn dynamic, context-sensitive representations that significantly improve performance on a range of tasks, from language understanding to dialogue generation. Beyond internal semantics, models that incorporate external context, such as speaker identity, sentiment, or dialogue history, further improve performance in tasks like response generation and personalized conversation [46]. In recommender systems, contextual modeling has enhanced personalization by incorporating signals like time of day, user mood, recent interactions, and device type [47]. These systems adapt to the user’s current situation and historical patterns to generate more relevant suggestions. For example, Netflix and Spotify personalize recommendations not only based on past behavior but also on inferred situational context such as watching late at night, commuting, or preparing for 8 a workout [48, 49, 50]. In human-computer interaction, context-aware systems adjust interfaces and feedback mechanisms based on both user behavior (internal context) and environmental or affective signals (external context). Adaptive interfaces modify layouts based on past interaction patterns [51], while intelligent tutoring systems like AutoTutor [52] respond to learner engagement by integrating facial expressions, eye gaze, and emotional states. These systems demonstrate how external context, when fused with internal cues, leads to more fluid and user-centric interactions. In robotics and embodied AI, contextual reasoning is indispensable for real-world deployment. Internal environmental models (e.g., object recognition, scene geometry) guide basic planning, but effective robotic behavior is highly dependent on the interpretation of external context such as human proximity, social norms, and dynamic constraints [53, 54]. KnowRob [55] exemplifies how robots can integrate semantic and procedural knowledge to understand not just how to perform a task, but when and why it is appropriate. Together, these examples underscore that context, whether derived internally from the data or externally from surrounding conditions, enables more adaptive, coherent, and human-aligned AI systems. 1.5 Thesis Objective Despite the clear benefits of context-aware AI elsewhere, video editing has yet to fully embrace this paradigm. Current approaches to video editing remain heavily reliant on internal content cues, such as visual similarity or shot duration [56, 57]. 
Although this enables certain degrees of automation, it overlooks the role of external context, such as creative intent, emotional tone, platform-specific constraints, or audience expectations, which critically shape editorial decisions [5, 38]. Without access to the complete context spectrum (internal and external), AI systems remain capable yet unintuitive; they are productive but not perceptive. We position context, with emphasis on the external context, as the missing link between automated editing and effective storytelling. We believe editorial intelligence should not only recognize what appears in a video but also why certain editing choices are being made. In pursuit of this vision, our research develops computational methods that operationalize both internal and external context, treating editing not as a sequence of isolated tasks but as a contextually grounded, narrative-driven process, i.e., one in which editing decisions are informed by context (semantic, stylistic, emotional, and perceptual) and are aligned with the overarching storytelling goals. By incorporating various forms of context, such as semantic, stylistic, emotional, and perceptual, our work explores how editing systems might begin to exhibit elements of creative reasoning, stylistic adaptability, and audience-aware decision-making, setting the stage for more contextually responsive editing tools.

Figure 1.1: Overview of research done: Context is integrated into four key areas of video editing to address distinct challenges. Work under Context-Aware Image Editing focuses on enhancing image/frame colors by adapting to the context of the scene. Context-Aware Visual Effects applies contextual understanding to visual effects, ensuring seamless transitions and coherence; we explore visual transitions and camera motions under this segment. Finally, Generating Context for Editing and Re-Editing captures and quantifies human engagement, providing valuable feedback for creating context-driven content. Each component contributes to a holistic, context-aware approach to video editing.

Main Contributions: We introduce computational techniques and frameworks for building editing systems that are sensitive to context across multiple dimensions. Our proposed methods embed external context signals into the editing pipeline while also leveraging internal content representations. Specifically, we focus on three core challenges:

• Context-Aware Image Editing: Image editing refers to the manipulation or enhancement of individual frames or still images. In the context of video, frame-level edits often contribute to color correction, lighting adjustments, and compositional improvements that influence the overall aesthetic and narrative tone of a scene. Traditional image editing tools require manual specification of regions, transformations, or filters. In contrast, AI-based image editing leverages models trained on large-scale datasets to automate tasks like inpainting [58, 59], style transfer [60, 61], relighting [62, 63], and object manipulation [64]. GAN and diffusion models, such as DALL-E [65], have dramatically improved realism in synthesized edits, enabling users to edit images using textual prompts or semantic sketches. Despite these advances, most models operate on local image content or global semantic embeddings, lacking awareness of narrative or situational context. This can result in inconsistencies in expected visual perception or in misalignment with creative intent.
We develop models that integrate semantic, perceptual, and spatial context to support editing operations on frames and images. Our methods are designed to preserve narrative relevance and visual coherence, even in the absence of explicit human direction.

In Chapter 3, we investigate the role of situational context in image editing. The circumstances in which an image is presented significantly influence how it is perceived by viewers. Just as one size does not fit all, a single edited version of an image may not be appropriate across different contexts, and conversely, the same context may call for varied edits depending on the changing content. Editing decisions can be shaped by factors such as the target audience, the intended emotional tone, and the overarching theme of the content. To address this, we propose a reinforcement learning-based approach that generates contextually appropriate image edits without requiring access to datasets containing multiple versions of an image edited for different contexts. Our method is explicitly optimized to produce edits that not only align with the given context but are also statistically preferred by human evaluators. To the best of our knowledge, this is the first work to systematically examine the interplay between image editing, human preference, and situational context.

In Chapter 4, we introduce the concept of AI-driven reverse designing, which involves recovering the full sequence of edit operations and their associated parameter values that transform a given source image into its corresponding edited image, both of which are provided as input. By enabling the easy reconstruction of the edit trajectory, our proposed method TAME-RD can potentially help uncover the underlying decision rationale of a reference edit. This insight enables downstream applications such as edit replication and thereby supports the transfer of exact editing styles across content to match a desired intent. Our proposed method, TAME-RD, achieved a 6–10% improvement across various accuracy metrics and a 1.01× to 4× reduction in RMSE scores as compared to the state-of-the-art.

• Context-Aware Visual Effects: Visual effects refer to the addition or manipulation of visual elements in a video to enhance storytelling, create illusions that cannot be captured during live filming, or emphasize specific aspects of a scene to make the viewing experience more immersive and engaging [66, 67]. These effects range from subtle stylistic treatments, such as lens flares, motion blur, and depth-of-field adjustments, to more complex elements, such as simulated camera motion, stylized transitions, or the addition of synthetic environmental components. In cinematic post-production, the choice and timing of such effects are typically informed by the intended mood, genre conventions, and narrative pacing. Recent advances in generative models, such as Generative Adversarial Networks and diffusion models [68, 69, 70], and in neural rendering techniques, such as Neural Radiance Fields [71, 72] and Gaussian Splatting [73], have greatly expanded the range of effects that can be synthesized automatically. Transformer-based multimodal models have also enabled semantic control over editing by linking language prompts with visual outcomes. However, most of these models treat visual effects as isolated transformations and are not sensitive to the narrative role or editorial intention behind their use. As a result, AI-generated effects may look plausible in isolation but can feel tonally jarring, stylistically inconsistent, or narratively unmotivated within the broader video.
We, therefore, propose approaches to enable context-aware visual effects. We present algorithms for recommending and adapting visual transitions and camera motions based on production styles, pacing, and creator intent. The models learn from stylistic reference patterns across different genres and apply appropriate transformations to maintain aesthetic fidelity.

In Chapter 5, we present V-Trans4Style, our algorithm to recommend a sequence of transitions that can facilitate the adaptation of videos to different production styles. Unlike prior works, in addition to considering the aesthetic appeal, our algorithm keeps track of temporal consistency and proposes a unique inference-based strategy to obtain a video that mimics the desired production style.

In Chapter 6, we discuss CamMimic, our zero-shot strategy to enable camera motion transfer from a single reference video, where we learn to extract and transfer the camera motion observed in the reference video onto a static or minimally animated target scene. Our method enables the synthesis of dynamic camera effects, such as slow tracking shots or rapid camera movements, without requiring any 3D information or manual animation.

• Generating Context for Editing and Re-Editing: Edit refinement involves evaluating, adjusting, or reordering video segments to improve narrative coherence, emotional impact, or viewer engagement. This is typically the final phase of the editing pipeline, where earlier decisions are re-examined in light of audience response and storytelling clarity. To support a more informed and viewer-aware re-editing process, we propose generating contextual signals derived from audience-centered cues such as affective expressions, visual saliency, and attention patterns (e.g., audio reactions). In Chapter 7, we introduce a semi-supervised, multimodal learning framework designed to estimate viewer engagement. Our approach computationally models cognitive and affective psychological states, and we validate it through a real-world application in telehealth for mental health support.

Together, these contributions take initial steps toward the development of AI tools that move beyond surface-level automation and can begin to adapt to human intent, assist in creative decision-making, and collaborate meaningfully with human storytellers.

Chapter 2: Background

The primary aim of this chapter is to lay a comprehensive groundwork by presenting the essential background knowledge and contextual framework necessary for understanding the central topics addressed in this dissertation. Through a review of fundamental concepts, core methodologies, and recent developments, this chapter seeks to provide readers with the insight and perspective needed to fully grasp the contributions and results discussed in later chapters. Machine learning, and in particular deep learning, forms a central pillar of this thesis. Throughout the research studies presented here, we use several kinds of deep learning techniques to build our algorithms. For clarity and completeness, we briefly discuss each of these approaches below.

2.0.1 Transformers

The transformer architecture, introduced by [74], is a deep learning model that relies entirely on attention mechanisms to process sequential data, eliminating the need for recurrence or convolutions. The attention mechanism enables the model to dynamically aggregate information from different parts of the input sequence, assigning varying weights based on learned attention scores.
This allows the transformer to process the entire input sequence in parallel, rather than sequentially as in models like LSTMs, which depend on hidden states as memory. Given two input matrices, $X_l \in \mathbb{R}^{l \times d_{in}}$ and $X_s \in \mathbb{R}^{s \times d_{in}}$, where $l$ and $s$ are the sequence lengths and $d_{in}$ is the feature dimension, the attention layer computes its output as

$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{in}}}\right)V \in \mathbb{R}^{l \times d_{out}} \quad (2.1)$$

where $Q = X_l W_q$, $K = X_s W_k$, and $V = X_s W_v$. Here, $Q$, $K$, and $V$ are the query, key, and value matrices, projected from the inputs using learnable weight matrices $W_q, W_k \in \mathbb{R}^{d_{in} \times d_h}$ and $W_v \in \mathbb{R}^{d_{in} \times d_{out}}$, with $d_h$ and $d_{out}$ as the hidden and output dimensions, respectively. When both $X_l$ and $X_s$ are the same, the mechanism is known as self-attention, a core component of the transformer. Each transformer layer typically includes a multi-head self-attention (MHSA) module, several linear transformations, normalization, and activation functions. This architecture has demonstrated superior performance and stability across a wide range of applications, largely due to its ability to model long-range dependencies and its efficient, parallelizable design.
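To make Equation 2.1 concrete, the following is a minimal PyTorch sketch of a single-head attention layer. The module name, tensor shapes, and dimension values are illustrative choices and not taken from the implementations used later in this dissertation.

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Single-head attention following Eq. 2.1 (illustrative dimensions)."""

    def __init__(self, d_in: int, d_h: int, d_out: int):
        super().__init__()
        self.w_q = nn.Linear(d_in, d_h, bias=False)    # W_q in R^{d_in x d_h}
        self.w_k = nn.Linear(d_in, d_h, bias=False)    # W_k in R^{d_in x d_h}
        self.w_v = nn.Linear(d_in, d_out, bias=False)  # W_v in R^{d_in x d_out}
        self.scale = d_in ** 0.5                       # sqrt(d_in), as written in Eq. 2.1

    def forward(self, x_l: torch.Tensor, x_s: torch.Tensor) -> torch.Tensor:
        # x_l: (batch, l, d_in) provides queries; x_s: (batch, s, d_in) provides keys/values.
        q = self.w_q(x_l)                               # (batch, l, d_h)
        k = self.w_k(x_s)                               # (batch, s, d_h)
        v = self.w_v(x_s)                               # (batch, s, d_out)
        scores = q @ k.transpose(-2, -1) / self.scale   # (batch, l, s)
        weights = scores.softmax(dim=-1)                # attention weights per query position
        return weights @ v                              # (batch, l, d_out)

# Self-attention is the special case where both inputs are the same sequence.
attn = SingleHeadAttention(d_in=64, d_h=32, d_out=64)
x = torch.randn(2, 10, 64)   # (batch=2, sequence length=10, d_in=64)
out = attn(x, x)             # (2, 10, 64)
```

A multi-head self-attention (MHSA) module runs several such heads in parallel before the linear, normalization, and activation layers mentioned above.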
This dissertation specifically focuses on model-free RL. Within model-free RL, two principal approaches are widely used: policy optimization and Q-learning.

Policy Optimization: These methods directly parameterize the agent's policy and optimize it to maximize the cumulative reward. The policy, typically denoted as π_θ(a|s), is updated using gradient-based methods to improve performance. Policy optimization is particularly effective in environments with high-dimensional or continuous action spaces and includes algorithms such as REINFORCE, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO). The main advantage of policy optimization is its stability and reliability, as it directly improves agent performance by optimizing the policy itself.

Q-Learning: This is a value-based approach that learns an action-value function Q(s, a), representing the expected cumulative reward of taking action a in state s and following the optimal policy thereafter. The agent uses the learned Q-function to select actions that maximize the expected value. Deep Q-Networks (DQN) and related algorithms are prominent examples in this category. Q-learning tends to be more sample-efficient, as it can reuse past experience, but may suffer from stability issues due to the indirect nature of optimizing the policy via the value function.

For a comprehensive introduction to reinforcement learning algorithms and foundational concepts, we refer the reader to the seminal textbook by Sutton and Barto [75] and the lecture series by David Silver [76], which together offer both theoretical grounding and practical insights into modern RL techniques. In this dissertation, we utilize Deep Deterministic Policy Gradient (DDPG), an algorithm that blends the strengths of both approaches. DDPG concurrently learns a deterministic policy and a Q-function, using each to improve the other. This hybrid method allows for stable learning in continuous action spaces while leveraging the sample efficiency of value-based techniques, making it well-suited for complex real-world tasks.

2.0.3 Generative AI

Recently, artificial intelligence has undergone a profound paradigm shift driven by rapid advancements in generative AI technologies. What began as experimental research has evolved into mission-critical solutions that are now transforming industries, business models, and creative practices on a global scale [77, 78, 79]. Generative AI, powered by breakthroughs in foundational models [77, 80, 81] and scalable computing [82], is no longer confined to theoretical or niche applications [78]. By 2025, these technologies have matured into essential tools that automate complex tasks, streamline workflows, and enable the creation of novel content across text, images, audio, and even 3D modalities [80, 83, 84]. Given their relevance to the methods and applications explored in this research, we provide an overview of Generative Adversarial Networks (GANs) and diffusion models in this section.

2.0.3.1 Generative Adversarial Networks (GANs)

Introduced by Ian Goodfellow in 2014, GANs [85] are a class of machine learning algorithms designed to generate new data that closely resembles a given training dataset. The architecture consists of two neural networks, the generator and the discriminator, engaged in a competitive, adversarial process.
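As a rough illustration of this adversarial setup, the sketch below shows one alternating discriminator/generator update in PyTorch using the common binary cross-entropy form of the objective (the exact minimax objective is given in the next paragraphs); the tiny networks, data, and hyperparameters are placeholders of our own, not the architectures used in the cited works.

```python
# Sketch of one adversarial (GAN) training step; models and data are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

data_dim, noise_dim = 32, 8
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))  # generator
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))          # discriminator (logit)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

x_real = torch.randn(16, data_dim)   # stand-in for a mini-batch of real data
z = torch.randn(16, noise_dim)       # noise input to the generator

# Discriminator step: push D(x_real) toward "real" and D(G(z)) toward "fake".
d_loss = F.binary_cross_entropy_with_logits(D(x_real), torch.ones(16, 1)) + \
         F.binary_cross_entropy_with_logits(D(G(z).detach()), torch.zeros(16, 1))
opt_D.zero_grad(); d_loss.backward(); opt_D.step()

# Generator step: push D(G(z)) toward "real", i.e., try to fool the discriminator.
g_loss = F.binary_cross_entropy_with_logits(D(G(z)), torch.ones(16, 1))
opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```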
The generator creates synthetic data (such as images, text, or audio), while the discriminator evaluates whether the data is real (from the training set) or fake (produced by the generator). More formally, given a set of data instances X and a set of labels Y, generative models capture the joint probability P(X, Y), or just P(X) if there are no labels. Discriminative models capture the conditional probability P(Y|X). One way to think about GANs is as a competition between two players. The generator is trying to fool the discriminator, while the discriminator is trying to spot the fakes. A GAN is essentially a zero-sum game where the generator is trying to maximize its score and the discriminator is trying to minimize the generator's score. The goal of training a GAN is to find a Nash equilibrium, where neither player can improve their score by making any changes to their strategy. In the paper that introduced GANs, the training process involves the generator trying to minimize the following function while the discriminator tries to maximize it:

E_x[log(D(x))] + E_z[log(1 − D(G(z)))]

where D(x) is the discriminator's estimate of the probability that real data instance x is real, E_x is the expected value over all real data instances, G(z) is the generator's output when given the noise z, D(G(z)) is the discriminator's estimate of the probability that a fake instance is real, and E_z is the expected value over all random inputs to the generator. Some other loss functions, like the modified minimax loss and the Wasserstein loss, have also been used in the literature.

2.0.3.2 Diffusion Models

Generative models have revolutionized the creation of high-quality synthetic data, with Generative Adversarial Networks (GANs) [85], Variational Autoencoders (VAEs) [86], and Flow-based models [87] each offering unique strengths and facing distinct limitations. GANs, for instance, are celebrated for producing sharp and realistic samples but often encounter unstable training dynamics and mode collapse, which reduces sample diversity [88]. VAEs optimize a surrogate loss that can result in blurry outputs due to the trade-off between reconstruction accuracy and latent space regularization [86]. Flow-based models, while providing exact likelihoods and invertible mappings, require intricate, reversible network architectures that add design complexity [87]. Diffusion models have recently emerged as a powerful alternative, delivering stable training and exceptional sample quality. Rooted in non-equilibrium thermodynamics, these models operate by progressively corrupting data with noise through a forward process and then learning to reverse this process, effectively transforming random noise into coherent data samples. The foundational concept was introduced by Sohl-Dickstein et al. [89], with significant advancements in modeling and efficiency by Ho et al. [90] and Song et al. [91]. Diffusion models comprise two key processes, namely, the forward diffusion process and the reverse diffusion process.

Forward Diffusion Process: The forward process is a Markov chain that gradually adds Gaussian noise to the data over T discrete steps. For an initial data point x_0, the corrupted version at step t is denoted x_t, and the transition is defined as:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)    (2.2)

where β_t is a variance schedule that controls the amount of noise added at each step. The full forward process is

q(x_{1:T} | x_0) = ∏_{t=1}^{T} q(x_t | x_{t−1})    (2.3)

This process ensures that, as t → T, x_T approaches an isotropic Gaussian distribution.
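The forward corruption process of Eqs. (2.2)-(2.3) can be simulated in a few lines; the NumPy sketch below, with an arbitrary linear variance schedule of our choosing, noises a toy data point step by step and shows it drifting toward an isotropic Gaussian.

```python
# Sketch of the forward diffusion process q(x_t | x_{t-1}) from Eqs. (2.2)-(2.3).
# The linear beta schedule and the toy data are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)                 # variance schedule beta_t

x0 = rng.normal(loc=3.0, scale=0.1, size=(16,))    # toy "data", far from N(0, I)
x = x0.copy()
for t in range(T):
    # x_t ~ N( sqrt(1 - beta_t) * x_{t-1}, beta_t * I )
    x = np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * rng.normal(size=x.shape)

print("mean, std after T steps:", x.mean(), x.std())   # approaches roughly 0 and 1
```

In practice, the closed-form marginal derived next lets one sample x_t for any t in a single step instead of iterating through the chain.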
Importantly, the marginal distribution at any timestep t can be computed directly:

q(x_t | x_0) = N(x_t; √α_t x_0, (1 − α_t) I)    (2.4)

where α_t = ∏_{i=1}^{t} (1 − β_i).

Reverse Diffusion Process: The generative process learns to reverse the forward noising. The reverse transitions are parameterized as:

p_θ(x_{t−1} | x_t) = N(x_{t−1}; µ_θ(x_t, t), Σ_θ(x_t, t))    (2.5)

where µ_θ and Σ_θ are neural network outputs trained to predict the mean and variance for denoising. Sampling starts from pure noise x_T ∼ N(0, I) and iteratively denoises to produce a sample x_0.

Training Objective: Diffusion models are trained by maximizing a variational lower bound (ELBO) on the data log-likelihood. This objective can be decomposed into a sum of KL divergences between the true and approximate posteriors at each timestep, leveraging the tractability of Gaussian transitions:

L_ELBO = E_q[ ∑_{t=1}^{T} KL( q(x_{t−1} | x_t, x_0) || p_θ(x_{t−1} | x_t) ) − log p_θ(x_0 | x_1) ]    (2.6)

In practice, a simplified objective is often used: the model learns to predict the added noise at each timestep, which yields stable and efficient training. The formulation for this loss is as follows:

L = E_{x_0, t, ϵ}[ ||ϵ − ϵ_θ(x_t, t)||^2 ]    (2.7)

where ϵ is sampled from a standard normal distribution and ϵ_θ is the noise predicted by the model. In the subsequent chapters, we will discuss how these learning paradigms are integrated into our research, highlighting their roles in addressing specific challenges and advancing the state of the art in our target applications.

Part II: Context-Aware Image Editing

Chapter 3: Contextualized Styling of Images for Web Interfaces using Reinforcement Learning

Content personalization is one of the foundations of today's digital marketing. Often, the same image needs to be adapted for different design schemes for content that is created for different occasions, geographic locations, or other aspects of the target population. We present a novel reinforcement learning (RL) based method for automatically stylizing images to complement the design scheme of media, e.g., interactive websites, apps, or posters. Our approach considers attributes related to the design of the media and adapts the style of the input image to match the context. We do so using a preferential reward system in the RL framework that learns a reward function using human feedback. We conducted several user studies to evaluate our approach and demonstrate that we are able to effectively adapt image styles to different design schemes. In user studies, images stylized through our approach were the most preferred variation across a majority of our experiments. Additionally, we also release a dataset consisting of perceptual associations of web context with the associated image style.

3.1 Introduction

A professional-looking website with an engaging audience is a great way to promote a brand. With the availability of stock photography services such as Shutterstock, Adobe Stock, etc., content creators can easily create impressive-looking websites. Although finding image assets is much easier, for the content to be effective for the brand (i.e., better engagement, higher click-through rates, higher conversion), the images need to be stylized or optimized. Approaches to effectively modify images to improve engagement have been extensively studied in the literature [92]. However, all these studies assume the invariability of the webpage where the image is embedded. In reality, different web pages have different design aesthetics as well as different themes.
Additionally, brands often change the aesthetics and styles of their web pages. Therefore, context is key for adopting the right image styling strategies. Context here refers to the circumstances under which the image asset will be consumed by the user. In the case of websites, context could include but is not limited to the website's design template, the target users and their role, the task at hand or the steps in the process, the user's location, the time and date, or the device being used. Manually styling the image assets so that they blend well with this context can be difficult and time-consuming. Also, creating and delivering image content at a scale that can resonate with the user is a very challenging task. In this paper, we therefore investigate how to efficiently automate this process and optimize the image style characteristics based on the specific context it is associated with. In particular, we use a reinforcement learning (RL)-based algorithm to search across the huge space of image variations and select the best one based on the context. This is also efficient from both a time and computation perspective, as RL leverages the knowledge gained from past optimizations to accelerate the search for the best image variation for a new context. Additionally, since assessing the suitability of the image style for a given context would be meaningless without incorporating users' (content consumers') feedback on it, we propose using a reward function that mimics human users and evaluates the image styles generated. Our proposed approach can, therefore, automatically modify and optimize an image based on the context it is associated with, without constraining the image search in terms of pre-defined, static configurations, and at the same time, agreeing with the user's expectations. Such a data-driven and adaptive framework can help content creators save an incredible amount of time and also make design tools easier to use for beginners.

Figure 3.1: We propose a framework that can automate the process of styling an image such that it suits the context defined by the website it's being embedded in.

Main Contributions. The novel components and main contributions of our work include:

1. An image color stylization method to automatically adapt the image by taking into consideration its context.

2. A reinforcement learning approach that uses a unique reward function capable of capturing human user preferences.

3. Additionally, we release a dataset consisting of perceptual association labels of web context with the associated image style derived from our user study.

3.2 Related Work

In this section, we discuss prior works in content enhancement and context-aware image modification. We also survey literature related to incorporating human feedback in the learning framework and existing datasets.

3.2.1 Enhancement of Content

Automating the enhancement of content has been an active area of research. Over the past few years, several deep learning-based models have been developed to improve and score the quality of images [93, 94, 95] and videos [96]. Similarly, there has also been some work around evaluating the aesthetics of webpages [97]. However, these do not describe what the optimal layout would be if the score is not good enough. Content optimization has also been approached from the perspective of displaying only important details or information that can better engage a user by making use of ranking algorithms [98].
In some works, the device on which the content will be presented has been considered as context for understanding the layout of the content.

3.2.2 Image Enhancement for Context

Some of the earliest works in the personalization of images propose enhancement of an image based on the enhancement strategies adopted by the user for a sample set of images [99, 100]. [101] proposes an effective way to rank and select the most optimal design for a given set of context variables. While our work also depends on context variables, unlike [101], we need to optimize jointly over both the context variables and the image features. [102] proposed an end-to-end reinforcement learning-based framework that formulated various image-retouching operations as a series of differentiable filters. The end goal of the framework was to determine the sequence and parameters of these filters for the given input image. [103] improves on the work of [104] to learn the statistical correlation between the keywords associated with an image and its color characteristics, and modifies the image colors based on the learned correlations. There have been some related works on image enhancement that either do not consider any contextual information at all or consider only internal context, i.e., either looking at the objects present within the image [103] or at pixel-level information [93]. But in a setting like a website, for example, the information and elements surrounding the image (external context) play a role in how the image should appear, which the existing works do not address. Some works based on Bayesian models [105] either fix the context variables or the designs, i.e., the image variations that are possible. Scaling these approaches to our problem is highly complex and computationally expensive. Therefore, we need an image enhancer for a context model that modifies its behavior based on the human feedback received and the context provided.

3.2.3 Incorporating Human Feedback

Unlike some other studies [106] that are only interested in how the image appears, we are also interested in how a human user perceives the image in the context it is presented, and in understanding whether the image variation created is something they would be willing to use for the given context. A lot of work [105] has also been done to incorporate humans into the training loop. Some of these approaches expect humans to provide multiple demonstrations [107] for the system to understand the decisions taken by the human users in the process and emulate the same for unseen tasks. Another line of research expects humans to provide continuous feedback on the outputs produced by the system [108]. Both these approaches can be tedious for the user, especially in a setting where neither the context nor the image is fixed.

3.2.4 Datasets

Datasets like the MIT-Adobe-5k dataset [109], SIQAD [110], or the Webpage dataset [111] have proven to be useful for tasks related to image enhancement, quality assessment, and webpage saliency. But these datasets either just have data regarding how the image was modified or have user feedback data. We need a dataset that ideally not only has sets of variations available for certain images, but also contains variations created with respect to different context parameters. Additionally, we would also like the dataset to be annotated with human feedback on how good the variation is with respect to the context in which it is presented.
3.3 Problem Definition and Approach Intuition

3.3.1 Problem Definition

For any multimedia content W (consisting of images and text), our goal is to develop its stylized version, say, W′. In this work, we specifically focus on optimizing an image I in W using specific image properties that are conditioned on certain context parameters (c_1, c_2, ..., c_n) (e.g., the website design, text style and font, etc.). Our goal is to find the optimal style I′. In particular, we consider an image style optimal for a given context if it is a variation that a human user would most likely pick to be used in W.

3.3.2 Approach Intuition

Learning the best possible image variation for a given context requires a large amount of training data, given the theoretically infinite number of combinations between contexts and variations. Collecting user feedback on each of these combinations is neither scalable nor time- and resource-efficient. Therefore, rather than generating this dataset for training, we instead propose to incorporate the image variation generation and user feedback collection processes in an online fashion using a reinforcement learning approach. Our approach generates image variations based on the context and leverages feedback collected from the users on-the-fly to improve the image variation generation itself. In other words, we allow user feedback to be part of the learning process of our model. Most importantly, we design our system to achieve this efficiently, using a minimal amount of explicit feedback from the user.

In RL, an agent learns from its interactions with the environment over a sequence of steps. At each time step, the agent takes an action a_t depending on its state s_t and the observed reward r_t. The agent's goal is to maximize the cumulative reward for the task. Defining the reward function is a critical component of any RL-based approach. Hand-crafting a reward function to model the user's preference is a non-trivial task whose performance is hard to quantify. Therefore, in this work, we use a deep neural network for generating the reward itself, whose goal is to capture the user preference for a given image in a given context. Our network is trained using explicit user feedback: we consider the distance between the "preference" expressed by the network for a given image-context pair and that of the user providing feedback, which is our ground truth. Our goal is to be able to carry out this learning using the minimum amount of feedback from the users. We discuss details of the reward function in Section 3.4.3.

The action to be taken by our agent is to modify an image to better suit a particular context. In this work, we limit the action space to brightness (b_t), hue (h_t), and contrast (n_t) modifications. Since the range of possible values for n_t, h_t, and b_t is continuous, we consider a deterministic policy that learns the best action as a function of the state. Since defining an accurate model of the environment for the agent to understand the consequences of its actions is non-trivial, we opt for the Deep Deterministic Policy Gradient (DDPG) [112] approach, which is model-free. More details on the proposed framework are given in Section 3.4.

3.3.3 Context Definition and Image Corpus

Table 3.1: Context variables considered in this work, along with four representative values chosen for each.
Context Variable      Values
Background color      blue, orange, green, white
Font color            black, blue, red, green
Font style            Times New Roman, Helvetica, Courier New, Brush Script MT

While there are large numbers of potential attributes to take into account when defining the context, we consider in this work the online page background color, text font color, and text font style as context variables. Using all potential variations of background/font colors and font styles would make the problem intractable. Therefore, we choose four possible variations for each of these variables. Table 3.1 presents more details about the exact set of background colors, font colors, and font styles that we consider in this work.

We use the MIT-Adobe-5k dataset [109] as a corpus for the starting raw images, a popular choice among works on image modification and variation generation. The dataset contains 5000 photographs and their retouched versions created by five different artists. The images cover a wide range of scenes, subjects, and lighting conditions.

Figure 3.2: Architecture Overview: Our RL agent takes as input the state s_t and reward r_t obtained from the environment as a result of an action a_t. The reward function computes the acceptability (r_0) and human preference (r_ϕ) for the image obtained after applying action a_t.

3.4 Network Architecture

We now look closely at the architecture of the proposed framework, which is depicted in Figure 3.2.

3.4.1 Deep Deterministic Policy Gradient (DDPG)

DDPG is based on the concept of DQN but has been developed to handle continuous action spaces. The use of an experience replay buffer helps in addressing issues related to data being dependent and non-identically distributed. The algorithm also introduces the use of target networks in actor-critic policy learning [113] to stabilize the learning process and efficiently deal with the non-stationary target values. For our DDPG framework, let Q(s, a, w) and µ(s, θ) represent the critic and actor networks, respectively, and let Q′(s′, a, w′) and µ′(s′, θ′) represent their corresponding target networks. Based on [114], we can define the loss function for the critic network as

L_c^DDPG = (r + γ Q(s′, µ(s′, θ′), w′) − Q(s, a, w))^2    (3.1)

where r is the reward and γ is the discount factor. The loss function for the actor, on the other hand, is as follows:

L_a^DDPG = ∇_a Q(s, a, w) ∇_θ µ(s, θ)    (3.2)

The target networks can be updated as follows, with τ << 1:

w′ = τw + (1 − τ)w′
θ′ = τθ + (1 − τ)θ′    (3.3)

DDPG is relatively better in terms of sample efficiency as compared to on-policy algorithms like the ones discussed in [115].

3.4.2 State and Actions

At each step, the DDPG agent decides which action to execute according to the current state. The state must provide the agent with comprehensive information for better decisions. In our proposed approach, at any time step t, the state vector s_t can be defined as the combination of the following components:

1. f_t (current input image): the selected action will be directly applied to this image to derive a better result; the image is represented as features [116].

2. a_{t−1} (past historical action vector): this informs the agent about the action taken at time step t − 1. The knowledge of the previous decision could help the action selection at the current step.

3. c (context variables): represents the vector corresponding to the different context variables considered (namely background colors, font style, and font colors).

Therefore, we can define the state vector as

s_t = [f_t, a_{t−1}, c]    (3.4)

The action a_t for our agent is the vector consisting of three values corresponding to the contrast, hue, and brightness factors to be applied to the image. For every action that the agent performs, it receives two values, namely, the new state and a reward that signals how good the action taken was. We detail the reward function in the next section.
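Combining Sections 3.4.1 and 3.4.2, the sketch below illustrates one DDPG update consisting of the critic loss (3.1), the actor objective (3.2), and the soft target updates (3.3); the network sizes, the flattened state vector, and the three-dimensional action (contrast, hue, brightness) are simplified stand-ins for illustration rather than the configuration actually used in our experiments.

```python
# Sketch of one DDPG update (Eqs. 3.1-3.3); dimensions and networks are illustrative only.
import copy
import torch
import torch.nn as nn

state_dim, action_dim = 32, 3        # action = (contrast, hue, brightness) factors
gamma, tau = 0.99, 0.005

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target = copy.deepcopy(actor)      # target networks mu' and Q'
critic_target = copy.deepcopy(critic)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

# A fictitious mini-batch of transitions (s, a, r, s') sampled from the replay buffer.
s  = torch.randn(64, state_dim)
a  = torch.rand(64, action_dim)
r  = torch.randn(64, 1)
s2 = torch.randn(64, state_dim)

# Critic update (Eq. 3.1): regress Q(s, a) toward r + gamma * Q'(s', mu'(s')).
with torch.no_grad():
    target = r + gamma * critic_target(torch.cat([s2, actor_target(s2)], dim=1))
critic_loss = ((target - critic(torch.cat([s, a], dim=1))) ** 2).mean()
opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

# Actor update (Eq. 3.2): ascend Q(s, mu(s)) by minimizing its negative.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# Soft target updates (Eq. 3.3): w' <- tau * w + (1 - tau) * w'.
for net, net_t in [(actor, actor_target), (critic, critic_target)]:
    for p, p_t in zip(net.parameters(), net_t.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
```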
3.4.3 Reward Function

The main objective of the RL agent is to stylize an input image using a sequence of actions such that, after modification, the image content is still preserved and the stylized image also appeals to the context in which it is presented. The reward function r_t has been defined to keep both aspects in consideration, and it is defined as

r_t = r_0 + r_ϕ    (3.5)

r_0 is the static part of the reward, and it is used to guarantee that the image content after stylization is still representative of the initial image. While performing a sequence of actions, if the agent ends up making the image overexposed or underexposed, then this style of image is not acceptable, and the agent is given a negative reward. It is positive otherwise. We define the following criteria for determining underexposed and overexposed conditions, respectively:

|255 − Avg(image pixel colors)| > δ_1    (3.6)
|255 − Avg(image pixel colors)| < δ_2    (3.7)

where δ_1 and δ_2 are user-defined constant values. At any given point, the image must satisfy both Equations 3.6 and 3.7. If it does, we refer to it as an acceptable variation; otherwise, it is unacceptable. We refer to r_0 as static because it indicates whether the image obtained after stylization based on the agent's actions is acceptable or not:

r_0 = κ if acceptable; −κ otherwise    (3.8)

The dynamic part of the reward, r_ϕ, quantifies how likely the image variation created by the agent would be preferred by a human user for the given context. The dynamic reward is based on a preference-based reward learning framework [117], where it is learned from the user's preference between two state-action trajectories, each of which leads to a different image variation. The advantage of this approach lies in the fact that it is more convenient for a human user to choose between two outcomes rather than rank several variations together. During training, the loss is computed between the preference of the model, depicted by r_ϕ, and the preference of the users, r_h.
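The reward decomposition of Eqs. (3.5)-(3.8) can be summarized in a few lines; in the sketch below the thresholds δ_1, δ_2, the constant κ, and the preference model standing in for r_ϕ are illustrative placeholders, not the values used in our system.

```python
# Sketch of the reward in Eqs. (3.5)-(3.8); thresholds and the preference model are placeholders.
import numpy as np

DELTA_1, DELTA_2, KAPPA = 40.0, 200.0, 1.0   # illustrative values for delta_1, delta_2, kappa

def static_reward(image: np.ndarray) -> float:
    """r_0: +kappa if the stylized image is acceptable (satisfies Eqs. 3.6 and 3.7), -kappa otherwise."""
    deviation = abs(255.0 - image.mean())        # |255 - Avg(image pixel colors)|
    acceptable = DELTA_1 < deviation < DELTA_2   # both exposure criteria must hold
    return KAPPA if acceptable else -KAPPA

def total_reward(image: np.ndarray, preference_model, state, action) -> float:
    """r_t = r_0 + r_phi, where r_phi is the learned human-preference reward (Eq. 3.5)."""
    return static_reward(image) + float(preference_model(state, action))

# Usage with a dummy preference model that always returns 0.5:
image = np.random.randint(0, 256, size=(64, 64, 3)).astype(float)
print(total_reward(image, lambda s, a: 0.5, state=None, action=None))
```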
3.4.3.1 Collecting Human Preferences

The user feedback necessary for our approach was obtained by implementing our system on MTurk. Sets of image pairs were created after the state-action trajectory selected by our agent was applied to the raw image. Each image was then laid out on a website template having features described by the context variables (i.e., background color, font color, and font styles). We opted for dummy text on the website in order to avoid potentially affecting the user's perception of the website. The two possible websites were then shown to MTurk workers, who were asked to choose the website where the image complemented the website features well. The obtained responses are fed back into the network for training the reward network.

3.4.3.2 Reward Learning from Human Preferences

The reward function r_ϕ has to be trained such that the preference of the model for a given pair of images is consistent with the observed human feedback. The procedure is very similar to [118]. Two trajectories σ_1, σ_2 are given as input, each trajectory being a sequence of observations and actions {s_p, a_p, ..., s_{p+k}, a_{p+k}}. We obtain preferences y using the pipeline discussed in the previous section for the pair of images corresponding to σ_1 and σ_2. y indicates which image (as a result of a trajectory) the user preferred, i.e., y ∈ {(1, 0), (0, 1)}. This preference, along with the trajectory pair (σ_1, σ_2), is stored in a dataset D as a triplet (σ_1, σ_2, y). Based on the Bradley-Terry model [119], the preference predictor is modeled using the reward function r_ϕ as follows:

P_ϕ(σ_2 ≻ σ_1) = exp{ ∑_t r_ϕ(s_t^2, a_t^2) } / ∑_{i∈{1,2}} exp{ ∑_t r_ϕ(s_t^i, a_t^i) }    (3.9)

where σ_i ≻ σ_j denotes that the image obtained from trajectory σ_i is preferable over that obtained from trajectory σ_j. Next, the function r_ϕ is trained as a binary classifier using the loss function:

L_Reward = −E_{(σ_1, σ_2, y)∼D}[ y(0) log P_ϕ(σ_1 ≻ σ_2) + y(1) log P_ϕ(σ_2 ≻ σ_1) ]    (3.10)

3.4.4 Environment Modeling and Training

Modeling the environment is a very crucial part of achieving the right results in a reinforcement learning (RL) setup. To solve our problem, we defined the environment as a proxy for existing image stylization software (e.g., Lightroom). It takes the input image at timestep t and performs the defined image stylization actions on it to get the image corresponding to the next time step.

We now discuss the training procedure and its parameters. The pseudocode for training is given in Algorithm 2. We begin by initializing a few parameters, namely the frequency of obtaining feedback from human users for the image styles created by the model, and also the number of queries that we will be asking users to evaluate. Before we begin with the iterations, we let the agent interact with the environment in the Exploration Phase using Algorithm 1 to produce trajectories conditioned by the static reward function (Eqn. 3.8). In each iteration, there are two key events, i.e., (1) the human feedback check (lines 8−19) and (2) updating the network parameters (lines 20−28). The first event occurs only after every K iterations. For the human feedback event, pairs of trajectories are uniformly sampled and then sent to the human users for feedback. The human preferences are recorded in a dataset D. Based on the data collected, we train the dynamic reward model (lines 14−17). In the parameter update event, the agent performs an action a_t and observes a reward r_t. After this, mini-batches of transitions (s_t, a_t, s_{t+1}, r_t) are sampled and the parameters of the Actor and Critic are updated (lines 26−27).

Algorithm 1 WARMUP: Unsupervised Exploration
1: Initialize parameters of Q_w and π_ψ and a replay buffer B ← ∅
2: for each iteration do
3:   for each timestep t do
4:     Select action a_t by taking a_t ∼ π_ψ(a_t|s_t)
5:     Execute a_t and observe reward r_t^0 and new state s_{t+1}
6:     Store transitions B ← B ∪ {(s_t, a_t, s_{t+1}, r_t^0)}
7:   end for
8:   for each gradient step do
9:     Sample mini-batch {(s_j, a_j, s_{j+1}, r_j^0)}_{j=1}^{B} ∼ B
10:    Optimize L_c^DDPG in (3.1) and L_a^DDPG in (3.2)
11:  end for
12: end for
13: return B, π_ψ
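The reward-learning step of Eqs. (3.9)-(3.10), which the training loop in Algorithm 2 invokes every K iterations, reduces to binary classification over trajectory pairs: under the Bradley-Terry model, the preference probability is a sigmoid of the difference between the summed rewards of the two trajectories. The sketch below shows one such update; the reward network, segment lengths, and synthetic labels are assumptions made for illustration.

```python
# Sketch of preference-based reward learning (Eqs. 3.9-3.10); shapes and networks are illustrative.
import torch
import torch.nn as nn

state_dim, action_dim, seg_len, batch = 32, 3, 8, 16
r_phi = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(r_phi.parameters(), lr=3e-4)

def trajectory_return(sa: torch.Tensor) -> torch.Tensor:
    """Sum of predicted rewards r_phi(s_t, a_t) over a trajectory segment."""
    return r_phi(sa).sum(dim=1).squeeze(-1)          # shape (batch,)

# Fictitious batch of trajectory pairs (sigma_1, sigma_2) and human labels y.
sigma1 = torch.randn(batch, seg_len, state_dim + action_dim)
sigma2 = torch.randn(batch, seg_len, state_dim + action_dim)
y = torch.randint(0, 2, (batch,)).float()            # y = 1 means the user preferred sigma_2

# Bradley-Terry model (Eq. 3.9): P(sigma_2 > sigma_1) = sigmoid(R(sigma_2) - R(sigma_1)).
logits = trajectory_return(sigma2) - trajectory_return(sigma1)
loss = nn.functional.binary_cross_entropy_with_logits(logits, y)   # cross-entropy of Eq. (3.10)
opt.zero_grad(); loss.backward(); opt.step()
```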
3.5 Experiments and Results

The DDPG agent with a reward function capable of capturing human preferences was trained for around 14,500 episodes with at most 10 steps in each episode, totaling 145k steps.

3.5.1 User Studies

We perform user studies to check the validity and performance of the proposed approach. In particular, we consider two separate user studies to (1) check the validity of the reward model, and (2) evaluate the overall performance of the proposed approach.

Algorithm 2 Human-feedback-induced image enhancement using DDPG
Require: frequency of feedback K
Require: number of queries M per feedback session
1: Initialize actor network µ(s, θ), critic network Q(s, a, w), and dynamic reward network r_ϕ with random weights
2: Initialize target networks µ′ and Q′ with weights θ′ ← θ, w′ ← w
3: Initialize replay memory B
4: Initialize a dataset of preferences D ← ∅
5: Initialize random process N for active exploration
6: // EXPLORATION PHASE
7: B, π_ψ ← WARMUP() in Algorithm 1
8: // REWARD LEARNING
9: for each iteration do
10:   if iteration % K == 0 then
11:     for m in 1..M do
12:       (σ_1, σ_2) ∼ UNIFORM_SAMPLING()
13:       Query instructor for y
14:       Store preference D ← D ∪ {(σ_1, σ_2, y)}
15:     end for
16:     for each gradient step do
17:       Sample minibatch {(σ_1, σ_2, y)_j}_{j=1}^{D} ∼ D
18:       Optimize L_Reward in (3.10) with respect to ϕ
19:     end for
20:     Relabel entire replay buffer B using r_ϕ
21:   end if
22:   for timestep in 1...t do
23:     Select action a_t by taking a_t ∼ π_ψ(a_t|s_t)
24:     a_t ← a_t + N_t
25:     Execute a_t and observe reward r_t and new state s_t
26:     Store transitions B ← B ∪ {(s_t, a_t, s_{t+1}, r_t(s_t))}
27:     Sample N transitions (s_t, a_t, r_t, s_{t+1}) ∈ B
28:     Optimize L_c^DDPG (3.1) & L_a^DDPG (3.2) w.r.t. θ & w
29:     Update target networks using τ
30:   end for
31: end for

3.5.1.1 User Study 1 - Validity of Reward Model

The objective of this study is to evaluate the reward model's capability to capture human preferences. As part of this study, we fed pairs of image variations created for a given context and asked both our dynamic reward model and human users to choose between them. We observed that the model and human users agree on 87% of the samples, which confirms the validity of the model.

3.5.1.2 User Study 2 - Overall Performance

In the second study, we aim to understand whether, for a given random image and context, our RL-based approach is able to produce an image variation that users find appealing. As part of the study, for a given context, we obtained 3 competing versions of the input image: (1) the image stylized by our RL model, (2) the image stylized by an image editor (Expert), and (3) the original unedited image. These 3 versions of the image were used to create 3 versions of the same website, each containing one of the 3 competing image versions. We ask human users to evaluate these 3 in pairs to make the comparisons easier and also to avoid noisy data. Users were asked to evaluate 9 such samples. User preferences were collected using MTurk.

The results obtained from the user study are depicted in Figure 3.3. We can observe that for most of the samples, our model output is preferred by at least 50% of the users. However, there is still some ambiguity (no obvious winners) for all three comparisons. Hence, we next perform statistical tests to verify the results further. We perform statistical hypothesis tests to see if our approach is preferred when compared with two baselines. As described in Figure 3.3, our study shows a pair of web pages to users and then asks which they prefer.

Figure 3.3: The graph shows the results obtained from our User Study 2 (y-axis: proportion preferring A over B; x-axis: image and context pair).
(i) shows the results obtained while comparing the expert's stylized images with the original images for different contexts, (ii) shows comparisons between our model's stylized images and the expert's stylized images, and (iii) compares our model's stylized images with the original images for a given context.

For each of the three ways of creating the webpage (RL, Expert, and Original), we perform pairwise comparisons. Given a pair of alternatives i and j, we ask the question: how likely is it that i will be preferred over j? Let us say that this probability is denoted by p. Then, one hypothesis of interest is the following: H_0: p = 1/2 vs. H_1: p > 1/2. Such a hypothesis tests whether humans indeed prefer the first version of the website compared to the second version of the website. For each comparison, we have ∼4,500 responses, and we can then use a test of binomial proportions to draw a conclusion. The results of this analysis are presented in Table 3.2, which presents the details of the experiments and the one-sided p-value (and one-sided 95% confidence intervals). We observe a few things. When comparing our RL Model to the Original versions of the web pages, we see that users prefer the versions generated by the Model in a strongly statistically significant way. For the comparison of the model with the web pages created by the expert, we see that there is some evidence that users prefer the model-generated output, though the evidence is less overwhelming (p-value of 0.046). When comparing the expert-created web pages with the original web pages, we also see strong evidence that the expert-created pages are preferred by the study participants.

Table 3.2: Statistical Analysis of User Study
A          B          n       n_A     Proportion   p-value    95% CI
RL Model   Original   4,500   2,543   0.5651       < 0.0001   (0.5528, 1.0)
RL Model   Expert     4,500   2,307   0.5127       0.04604    (0.5003, 1.0)
Expert     Original   4,499   2,626   0.5837       < 0.0001   (0.5714, 1.0)
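To illustrate the binomial-proportion analysis behind Table 3.2, the sketch below carries out the one-sided test with a normal approximation on the counts from the RL Model vs. Original row; the exact procedure used in the study may differ in detail.

```python
# One-sided test of a binomial proportion (H0: p = 0.5 vs. H1: p > 0.5),
# illustrated with the RL Model vs. Original counts from Table 3.2.
# Uses a normal approximation; the dissertation's exact computation may differ.
import math

n, n_a = 4500, 2543                  # total responses and responses preferring A (RL Model)
p_hat = n_a / n                      # observed proportion, ~0.5651
se = math.sqrt(0.25 / n)             # standard error under H0 (p = 0.5)
z = (p_hat - 0.5) / se
p_value = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided upper-tail p-value

# One-sided 95% confidence interval: lower bound at p_hat - 1.645 * sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - 1.645 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"p_hat={p_hat:.4f}, z={z:.2f}, one-sided p-value={p_value:.2e}, 95% CI=({lower:.4f}, 1.0)")
```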
3.6 Conclusions

We present in this paper an RL-based approach to generate the optimal style of a given image for a given context. The RL agent takes an image and the set of variables defining the context in which we wish to present it. The output is the image styled to blend well with the context. Our approach efficiently handles challenges related to scalability and data by seamlessly incorporating human feedback into the training process to improve the styles of the images generated. We demonstrate through user studies that the proposed approach can produce variations close to human preferences in a time- and cost-effective manner. While the results of our approach to developing contextualized content are encouraging and promising, we identify a few areas for future work. The experiments conducted in this work use feedback collected from the general population and not a specific individual. It is possible to include more user-specific context variables to understand and explore user-level personalization of the image for a given context. Additionally, we would also like to understand the impact of other factors like content genre, website topic, etc., to create an even better image stylization. Finally, our work could be extended by considering image content-specific stylization, i.e., by considering the interaction between the context and the content of the image itself.

Chapter 4: TAME-RD: Text Assisted Replication of Image Multi-Adjustments for Reverse Designing

Given a source image and its edited version produced based on human instructions in natural language, how do we extract the underlying edit operations to automatically replicate similar edits on other images? This is the problem of reverse designing, and we present TAME-RD, a model to solve this problem. TAME-RD automatically learns from the complex interplay between image editing operations and the natural language instructions to infer fully specified edit operations. It predicts both the underlying image edit operations as discrete categories and their corresponding parameter values in the continuous space. We accomplish this by mapping together the contextual information from the natural language text and the structural differences between the corresponding source and edited images using the concept of pre-post effect. We demonstrate the efficiency of our network through quantitative evaluations on multiple datasets. We observe improvements of 6–10% on various accuracy metric