ABSTRACT Title of dissertation: SEQUENTIAL DECISION MAKING WITH LIMITED RESOURCES Karthik Abinav Sankararaman Doctor of Philosophy, 2019 Dissertation directed by: Professor Aravind Srinivasan Department of Computer Science One of the goals of Artificial Intelligence (AI) is to enable multiple agents to interact, co-ordinate and compete with each other to realize various goals. Typically, this is achieved via a system which acts as a mediator to control the agents? behavior via incentives. Such systems are ubiquitous and include online systems for shopping (e.g., Amazon), ride-sharing (e.g., Uber, Lyft) and Internet labor markets (e.g., Mechanical Turk). The main algorithmic challenge in such systems is to ensure that they can operate under a variety of informational constraints such as uncertainty in the input, committing to actions based on partial information or being unaffected by noisy input. The mathematical framework used to study such systems are broadly called sequential decision making problems where the algorithm does not receive the entire input at once; it obtains parts of the input by interacting (also called ?actions?) with the environment. In this thesis, we answer the question, under what informational constraints can we design efficient algorithms for sequential decision making problems. The first part of the thesis deals with the Online Matching problem. Here, the algorithm deals with two prominent constraints: uncertainty in the input and choice of actions being restricted by a combinatorial constraint. We design several new algorithms for many variants of this problem and provide provable guarantees. We also show their efficacy on the ride-share application using a real-world dataset. In the second part of the thesis, we consider the Multi-armed bandit problem with additional informational constraints. In this setting, the algorithm does not receive the entire input and needs to make decisions based on partial observations. Addi- tionally, the set of possible actions is controlled by global resource constraints that bind across time. We design new algorithms for multiple variants of this problem that are worst-case optimal. We provide a general reduction framework to the clas- sic multi-armed bandits problem without any constraints. We complement some of the results with preliminary numerical experiments. SEQUENTIAL DECISION MAKING WITH LIMITED RESOURCES by Karthik Abinav Sankararaman Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2019 Advisory Committee: Professor Aravind Srinivasan, Chair Professor John P. Dickerson Professor Tom Goldstein Professor Furong Huang Professor Prakash Narayan Dr. Aleksandrs Slivkins ?c Copyright by Karthik Abinav Sankararaman 2019 Dedicated to my parents. ii iii Acknowledgments ?Although an act of help done timely, might be small in nature, it is truly larger than the world itself.? ? Thiruvalluvar, a classic Tamil poet I thank Aravind Srinivasan for the guidance, support and help throughout these years. He is a true genius and it was a privilege to work beside him and absorb some of it. I have learnt many interesting mathematical ideas from him. He is very kind and has the ability to connect his students to new topics (and associated experts) that he thinks are important. One such connection was with John Dickerson who joined UMD and has since been a very close collaborator. His keen insight on applications of the theoretical work is commendable. I am indebted for the opportunity to work on many projects with him and learn a great deal. In Fall 2015, I was fortunate to meet Alex Slivkins through his class on bandits at UMD. Alex is an extremely prolific researcher and I am thankful for the gazillion hours he has spent over these years to mentor me: writing, talks, research, life. I am extremely impressed by the amount of energy he has towards doing research (and life) and hope that some of this has rubbed off on me. He is always available via Skype/email/phone for brainstorming and this has helped me a lot while thinking about problems. I am also thankful to him (and Microsoft) for hosting me at what was a fun and productive summer at MSR NYC along with Rob Schapire and Nicole Immorlica. Chapter 7 was the outcome of this internship; I learnt a great deal about bandits, online learning and game theory in this process from Alex, Rob and Nicole. iv Aravind connected me with Anand Louis at IISc and Navin Goyal at MSR India which was the start of our collaboration on causal inference. I am extremely thankful to them for the numerous weekly meetings (and accommodating their schedule to the 9+ hours of time-difference). I learnt a great deal about many topics including linear algebra, probability and statistics. I am also grateful to Anand for reading through all the proofs carefully and giving critical feedback. I would like to thank IISc for hosting me in the summer, Microsoft for the space and food and Anand and Navin for making this collaboration possible (and having me visit on my trips back to India). Many thanks to my committee members Prakash Narayan, Tom Goldstein and Furong Huang; I am grateful to the many interactions during my time here on various problems in machine learning. Bill Gasarch mentored me on how to be a great instructor and I am thankful for his advice on classroom teaching. And I thank my professors from IITM especially Jayalal Sarma and Narayanaswamy who first introduced me to theoretical computer science and algorithms. I thank Prithvi Sen for hosting me at IBM Almaden and Anil Kamath for hosting me at Adobe during my first two summers. It was a period of lot of learning for me. Recently, Soham De introduced me to the world of neural nets and has since taught me many things about the theory and practice of deep learning. I am extremely thankful to him for those discussions. The initial parts of the online matching research was done with my friends Pan Xu and Brian Brubach. I would always cherish the numerous meetings we had to understand, discuss and eventually solve some of the questions. I continued collaborating with Pan and most of my v later research on this topic has been jointly with him. We have worked on several problems together and it has been extremely fulfilling. Science is a team-sport and the research done during my PhD would not have been possible without my co-authors who are some of the smartest people I have interacted with. I thank all my office mates at AV Williams and Iribe Center for the company, coffee/tea breaks and discussion about various topics over these years. I am also extremely fortunate to have had a huge army of friends right from my middle school days through the undergrad years and grad school; I am thankful to each one of them for long phone calls, planning numerous trips, giving me advice and company (also a place to crash). I am also very grateful for the company of all the friends I made in grad school and at internships; they have kept me happy in the last few years. Home is where the heart is: I thank my parents, grand parents, my brother for the support, affection and happiness in all these years (and more recently my sister-in-law). The ?Indian? way of returning this love is to make them proud and hopefully this is a step towards that direction. I am also thankful to many of my relatives who have frequently kept in touch via calls and meetings in all these years. I gratefully acknowledge NSF Awards CNS 1010789, CCF 1422569, CCF- 1749864, a gift from Google, Inc., and research awards from Adobe, Inc. and Ama- zon, Inc that has supported the research during my PhD. Chapter 3 is based on [42] with Brian Brubach, Aravind Srinivasan and Pan Xu. Chapters 4, 5 and 6 are based on [76, 78] with John Dickerson, Aravind Srinivasan and Pan Xu. Chapter 7 is based on [109] with Nicole Immorlica, Rob Schapire and Alex Slivkins and Chapter 8 is based on [163] with Alex Slivkins. vi Table of Contents Dedication ii Acknowledgements iii Table of Contents vii List of Tables x List of Figures xi 1 Introduction 1 2 Preliminaries 6 2.1 Sequential Decision Making (SDM) protocol . . . . . . . . . . . . . . 6 2.2 Landscape of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4 Performance Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 I Online Matching 14 3 Online Matching 15 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3 Online Weighted Matching . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.5 Overview of our algorithm . . . . . . . . . . . . . . . . . . . . . . . . 24 3.6 Warmup: 0.688 competitive algorithm . . . . . . . . . . . . . . . . . 25 3.7 0.7 competitive algorithm . . . . . . . . . . . . . . . . . . . . . . . . 29 3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4 Extension 1: Online Weighted Matching with Reusable Vertices 34 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.3 Main Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.4 LP benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.5 1/2? -competitive algorithm . . . . . . . . . . . . . . . . . . . . . . 39 4.6 Lower-bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 vii 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5 Extension 2: Online Submodular Weighted Matching 47 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.2 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.3 Problem Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.4 Background on Submodular Optimization . . . . . . . . . . . . . . . 51 5.5 Related Works in Submodular Optimization . . . . . . . . . . . . . . 53 5.6 Mathematical Program to Upper-bound OPT . . . . . . . . . . . . . 55 5.7 0.125-competitive Algorithm . . . . . . . . . . . . . . . . . . . . . . . 56 5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6 Rideshare: Empirical Evaluation of Online Matching 61 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.2 Dataset and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 61 6.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 6.4 Justifying The Two Important Model Assumptions . . . . . . . . . . 64 6.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 II Multi-armed Bandits 73 7 Bandits with Knapsacks 74 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 7.3 Challenges and Our Contributions . . . . . . . . . . . . . . . . . . . . 84 7.4 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 7.5 A new algorithm for Stochastic BwK . . . . . . . . . . . . . . . . . . 91 7.6 A simple algorithm for Adversarial BwK . . . . . . . . . . . . . . . . 100 7.7 High-probability algorithm for Adversarial BwK . . . . . . . . . . . . 108 7.8 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 7.9 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 8 Better Bounds for Combinatorial Semi-bandits with Knapsacks 151 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 8.2 Challenges and Contribution . . . . . . . . . . . . . . . . . . . . . . . 154 8.3 Additional Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 158 8.4 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 8.5 Main algorithm (SemiBwK-RRS) . . . . . . . . . . . . . . . . . . . . . 164 8.6 Proof of Main Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 167 8.7 Applications and special cases . . . . . . . . . . . . . . . . . . . . . . 184 8.8 Numerical Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . 189 8.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 9 Future Directions 197 viii III Appendix 202 10 Standard Tools 203 10.1 Concentration Inequalities . . . . . . . . . . . . . . . . . . . . . . . . 203 10.2 Adversarial Online Learning . . . . . . . . . . . . . . . . . . . . . . . 203 10.3 Lagrangians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 Bibliography 208 ix List of Tables 2.1 Various problems in the SDM framework . . . . . . . . . . . . . . . . 10 2.2 Summary of benchmark and metrics for various problems . . . . . . . 13 x List of Figures 2.1 SDM protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.1 Online Matching Protocol: A basic version . . . . . . . . . . . . . . . 16 6.1 OTD is normal distribution under KIID . . . . . . . . . . . . . . . . 66 6.2 OTD is normal distribution under KAD . . . . . . . . . . . . . . . . 67 6.3 OTD is power law distribution under KAD . . . . . . . . . . . . . . . 68 6.4 The number of requests of a given type at various time-steps. x-axis: time-step, y-aixs: number of requests . . . . . . . . . . . . . . . . . . 69 6.5 Occupation time distribution of all cars. x-axis: number of time- steps, y-axis: number of requests . . . . . . . . . . . . . . . . . . . . 70 6.6 Occupation time distribution of two different cars. x-axis: number of time-steps, y-axis: number of requests . . . . . . . . . . . . . . . . . . 71 7.1 Bandits with Knapsacks Protocol . . . . . . . . . . . . . . . . . . . . 77 7.2 Adversarial online learning . . . . . . . . . . . . . . . . . . . . . . . . 88 7.3 The lower-bounding construction for the ?(log T ) lower bound. . . . . 130 7.4 Adversarial contextual bandits . . . . . . . . . . . . . . . . . . . . . . 146 8.1 Dynamic Assortment (left) and Dynamic Pricing (right) experiments for n = 26. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 8.2 Experimental Results for Uniform matroid (left plots) and Partition matroid (right plots) on independent (upper) and correlated (lower) instances for n = 26. . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 8.3 Experimental Results for Uniform matroid (left plots) and partition matroid (right plots) on independent (upper) and correlated (lower) instances for n = 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 8.4 Variation of per-step running times as n increases for the various algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 xi Chapter 1: Introduction ?We can only see a short distance ahead, but we can see plenty there that needs to be done.? ? Alan Turing (possibly on sequential decision making) The last decade has seen a rise in large-scale applications of artificial intelli- gence and machine learning techniques in various domains ranging from e-commerce, recommender systems to healthcare, medicine and climate change efforts. The two biggest contributors to this phenomenon are the availability of large amounts of datasets to perform supervised learning and cheap computing resources. This led to efficient application of large scale supervised learning algorithms on massive datasets. From a theoretical standpoint, these applications gave rise to a host of research questions on newer models of computation for algorithm design. One such model is the stochastic input model, where the algorithm receives input as a sample from some distribution. Assumptions about this distribution are motivated from those that are commonly observed across multiple datasets. For a significant por- tion of this thesis, we study algorithm design questions that use the stochastic input model. 1 Although supervised learning on ?offline data? has been very successful, the current goal is to tackle applications from a wider range of domains where the data is inherently ?real-time?. This implies that although one may be able to use the vast historical data to pre-compute predictions about the future the algorithm?s decision has to be done ?online?. Often the time-scales at which these algorithms operate are in the order of milli-seconds; thus it vastly reduces the available compute time to make decisions. Moreover, the algorithm needs to optimize over multiple objectives and operate in domains where the decision affects human-beings. Concretely, con- sider a ride-share platform such as Uber or Lyft. A central problem is to efficiently match a driver to a request. However, the platform does not get all the requests before it makes the matching decisions. Based on historical observational data, the platform can obtain an estimate of the future requests, in order to optimize the allocation policy; however, the actual match has to be done real-time. The goal is to match requests to drivers while learning the best strategy and using them to optimize for various criteria such as revenue, welfare and fairness. In fact, this gen- eral setup is ubiquitous in many other applications such as crowdsourcing, Internet advertising markets and recommender systems. These applications motivate the study of a class of problems called ?sequential decision making? (SDM) where the algorithm does not receive the entire input at once; the algorithm makes a sequence of decisions to obtain the input sequence (often the inputs are stochastic and/or depends on past decisions). In many of these applications, the algorithm is further restricted in the choice of future decisions it can make based on the current actions. This imposes infor- 2 mation bottlenecks for the algorithm to learn about the future based on the past; we abstractly call these the resource constraints. Thus, the central thesis of this work is to understand under what informational constraints can we design efficient algorithms for sequential decision making problems. We make progress towards this by considering two broad classes of SDM prob- lems and design efficient algorithms for variants thereof. The first class of problems is called the online matching problem where the goal of the algorithm is to find an optimal graph matching when the vertices of the graph are provided sequentially. Variants of this problem have many applications such as matching riders to drivers, keywords to advertisements and recommending movies to users. The algorithm is usually faced with one or more objectives it has to optimize for such as revenue, diversity and fairness. The second class we consider is a variant of the classic multi- armed bandit problem called bandits with knapsacks. This is a sequential learning problem where the algorithm wants to learn parameters of the system under various information constraints such as the number of samples it is allowed to collect. Many applications such as ride-share, crowdsourcing and recommender systems have tun- able parameters whose optimal value is not known apriori and the algorithm needs to explore several choices based on real-time inputs. However, it is also usually constrained by external constraints such as limited inventory. Thus variants of this model is useful in such application domains. Our Contributions and Outline: The rest of this thesis is structured as follows. Chapter 2 describes the SDM framework formally and provides the benchmark and 3 performance metric we use for the algorithms. Then the thesis is divided into two parts. Part I deals with problems in Online Matching. In Chapter 3 we consider the weighted version of the online matching problem under stochastic inputs and show algorithms with improved competitive ratio. In Chapter 4 we consider an extension of the problem where the matched vertices may become available after certain number of time-steps from the time it is matched. We design a new algorithm for this problem and show that the competitive ratio obtained is tight among all algorithms that use a class of techniques. In Chapter 5 we consider another extension of the problem where the goal of the algorithm is to maximize multiple objectives, namely total reward and the diversity of edge types. Abstractly, we solve the online matching problem with submodular objective functions. In the last chapter 6 of this part we consider the real-world application of matching drivers to riders on a ride-share platform. We model this problem as an Online Matching problem and use a real-world dataset to empirically evaluate the proposed algorithms. In Part II of the thesis we consider the variants of the bandit with knapsack problem. In Chapter 7 we consider the adversarial version of the basic bandits with knapsacks problem and give an optimal algorithm via a reduction to the classic multi-armed bandit problem. Along the way we also give a new algorithm to the stochastic bandits with knapsacks problem. We show many extensions that can be obtained as a corollary of this reduction. In Chapter 8 we consider an extension of this problem to the combinatorial semi-bandits with knapsacks problem in the stochastic setting and give optimal algorithm for this problem. In this chapter, we also perform experiments on simulated datasets to evaluate the performance of this algorithm in 4 practice and compare it with other algorithms proposed in the literature. Finally in Chapter 9 we conclude the thesis with a few open problems and broader future directions for research. 5 Chapter 2: Preliminaries In this chapter, we give a formal definition of the sequential decision mak- ing protocol. The sequential decision making framework is used across multiple communities (e.g., reinforcement learning, experimental design, online algorithms) with (slightly) different terminologies. Here we describe a general version of the sequential decision making protocol that captures all problems considered in this thesis. 2.1 Sequential Decision Making (SDM) protocol The SDM protocol can be viewed as a repeated game between the algorithm and the nature. The protocol proceeds in rounds with the total number of rounds T known to both the players a-priori. At each time-step t ? [T ] the nature first presents part of the input xt to the algorithm. The algorithm then chooses an action at ? At, where At represents the set of feasible actions available to the algorithm at time t. The choice at yields a reward rt to the algorithm and modifies the set of available actions At+1 to the algorithm in the next round. The goal of the algorithm is to maximize the total reward obtained in the T rounds. Critically, the algorithm?s action at is binding and cannot be revoked at a later time. Moreover, it does not 6 receive the full input beforehand and thus has to optimize based on observed history until time t and any predictions of the future the algorithm may have. The set of actions At+1 could potentially be a strict subset of At; abstractly we call this as resource constraints. Resource constraints make the problem significantly harder and understanding under what constraints can we design efficient algorithms is the central thesis of this work. Figure 2.1 describes the protocol formally. Through- out this thesis, we use the term online algorithm interchangeably to refer to any algorithm for the SDM protocol. Given: number of time-steps T . In each round t ? [T ], 1. Nature reveals part of the input xt. 2. The algorithm chooses action at ? At. 3. Algorithm receives reward rt which is a function of at. 4. The set of available actions At+1 is updated based on at. Figure 2.1: SDM protocol 2.1.1 Models of the Nature The literature makes various assumption on the power of the nature. In the adversarial model, the nature is assumed to be an arbitrary adversary and the goal of the algorithm is to deal with the worst-case input sequence. In the adaptive 7 adversary model, this adversary is assumed to be adaptive to the choices of the algorithm. In other words, the input xt, the reward rt and the new set of actions At+1 can be set by the nature after observing the set of actions a1, a2, . . . , a 1t?2 . In the oblivious adversary model, the adversary has lesser power and the map from action at to input xt+1, reward rt and action set At+1 should be fixed before the start of the game. Thus, the adversary cannot adapt its strategy to the actions taken by the algorithm. We also consider stochastic models of the adversary where the input xt and the reward rt is drawn from a joint distribution D over all possible tuples (xt, rt), at each time-step. Additionally, in all problems considered in this thesis, this draw is independent across time-steps (however, the distribution from which the tuples are sampled may vary over time). In the known distribution model, we assume that the algorithm knows the distribution(s) before the game starts while in the unknown distribution model, the algorithm does not have access to the distribution(s) and has to learn the relevant information over time. 2.2 Landscape of Problems Many problems fall under the category of the SDM protocol described above. In this thesis, we focus on two specific models of computation namely Online Match- ing and Multi-armed Bandits with Knapsacks. We briefly describe the basic version of these problems and show how they fit into the broader SDM framework. Ta- ble 2.1 summarizes how the two problems fit into the SDM protocol described in Figure 2.1. 1In fact, in some problems we allow the adversary to also observe the action at?1. 8 2.2.1 Online Matching In this problem the algorithm makes a sequence of decisions to solve a graph matching problem. We are given a bipartite graph G = (U, V,E) with the vertex set U given before the start of the game. The vertices in the set V are presented sequentially at each time-step. The algorithm either matches the presented vertex vt to any (available) neighbor in U or rejects v irrevocably. Once a vertex u ? U is matched to some v, it becomes unavailable for the rest of the time-steps. The goal is to maximize the total size of the matching. Mapping this to the SDM protocol, the input xt is the vertex vt presented at time-step t. The reward rt is 1 if vt was matched and 0 otherwise. The set of actions At is the available vertices in U at time-step t. This problem was first introduced by Karp et al. [116] and many variants of this basic problem has since been studied. In this thesis, we consider many variants of this basic problem which has applications in many modern large- scale matching markets such as ride-share, recommender systems, job scheduling and online advertising. 2.2.2 Multi-armed Bandits with Knapsacks In this problem the algorithm has K arms to choose from at each time-step. There are d resources each with a global budget B. The algorithm first chooses an arm at ? [K]. The nature then reveals a reward rt(at) for the chosen arm and reveals the consumption of each of the d resources c1,t(at), . . . , cd,t(at). We allow the algorithm to skip any round (denoted by an action a(null) ? [K]) which yields no 9 reward. The moment some resource is exhausted the algorithm is allowed to play only the action a(null) in all future time-steps. The goal is to maximize the total reward collected by the algorithm. In the SDM protocol, the input xt is the outcome vector ot?1 := (rt?1(at?1); c1,t?1(at?1), . . . , cd,t?1(at?1)) for all time-steps t ? 2 and for t = 1 the nature does not provide any input to the algorithm. The reward rt is just the scalar rt(at). The set of actions At remains the same set of K arms until one resource is exhausted after which the set just reduces to the singleton set {a(null)}. This problem was introduced as an extension to the classic multi-armed bandit problem (Thompson [175]) to bandits with resource constraints by Badanidiyuru et al. [33] and has since garnered a lot of attention. Problem/Parameter Input xt Reward rt Action set At 1 if vt matched Online Matching Vertex vt ? V Available vertices in U 0 otherwise Bandits with t = 1: No input [K] if resource not exhausted rt(at) Knapsacks t ? 2: ot?1 a(null) otherwise Table 2.1: Various problems in the SDM framework 2.3 Benchmark We compare the performance of any online algorithm against a benchmark (de- noted by OPT). We primarily deal with two definitions of the benchmark, namely, best fixed distribution and best dynamic policy which we now define. 10 Definition 1 (Best dynamic policy). Given an instance I of the problem, the best dynamic policy is the total reward obtained by the best possible algorithm in hindsight given the realizations of all the randomness associated with the input. Thus, OPT in this case is a random variable and we compare against OPTDP := E[OPT]. Algorithms described in Chapters 3, 4, 5 and 8 use this benchmark. Definition 2 (Best fixed distribution). Given an instance I of the problem, the best fixed distribution is the total expected reward obtained by sampling from the best possible fixed distribution. We use the notation OPTFD := E[OPT] to denote this benchmark. This benchmark is used in the algorithms described in Chapter 7. More dis- cussion of this benchmark is deferred to Chapter 7. 2.4 Performance Metric To measure the performance of online algorithms the following notions are commonly used, namely competitive ratio and regret. We define both these metrics now. Definition 3 (Competitive ratio). Let E[REW] denote the expected reward obtained by any algorithm ALG and let E[OPT] denote the expected reward of the benchmark. Then competitive ratio of the algorithm ALG is the value of the ratio E[REW]E .[OPT] Note that in Definition 3, the ratio is atmost 1 and the goal is to maximize this quantity. To keep it consistent with the related literature, in Chapter 7, we use the 11 reciprocal E[OPT]E as the performance metric. In that case, the goal is to minimize[REW] the quantity. Usually, competitive ratio is used in problems when the best dynamic policy benchmark is used (Buchbinder and Naor [46]). However, for the bandits with knapsacks problem we consider in Chapter 7 we show that even against the best fixed distribution benchmark, the notion of competitive ratio is necessary. We are not aware of other problems that exhibit this behavior. Definition 4 (Regret). Let E[REW] denote the expected reward obtained by any algorithm ALG and let E[OPT] denote the expected reward of the benchmark. Then regret of the algorithm ALG is the value of the difference E[OPT]? E[REW]. Regret of any algorithm is non-negative and the goal is to minimize regret. Regret is usually defined against the weaker best fixed distribution benchmark. However, in Chapter 7 we show that for the stochastic version of the bandits with knapsacks problem we can obtain non-trivial regret bounds even against the best dynamic policy benchmark. Regret is the primary metric in Chapter 8. Remark 1. Competitive ratio is invariant to the scale of the rewards {rt}t?[T ] while regret scales linearly with the scale of the rewards. Thus, for consistency throughout the thesis we assume that rt ? [0, 1] for all t ? [T ]. Table 2.2 summarizes the metric and the benchmark used for the various problems considered in this thesis. Upper-bound on OPT by a mathematical program. A common strategy (e.g., see Buchbinder and Naor [46], Mehta [139]) in algorithm design is to not work 12 Problem/Metric Competitive Ratio Regret Online Matching Dynamic Policy - Stochastic Bandits with Knapsacks - Dynamic Policy Adversarial Bandits with Knapsacks Fixed Distribution - Table 2.2: Summary of benchmark and metrics for various problems directly with OPT but to use a suitable upper-bound on its value. This upper-bound is typically the optimal solution to an appropriate mathematical program. The ad- vantage of this method is that, the optimal solution of such mathematical programs have additional interesting properties which can be exploited in the analysis. More- over, from the definition of the performance metrics it can be seen that algorithms that are ?good? (either low competitive ratio or regret) against the optimal solution of this program are also good against OPT. For all problems considered in this thesis, we always use the optimal solution of an appropriate mathematical program (typically a linear program) as the proxy for OPT. 13 Part I Online Matching 14 Chapter 3: Online Matching 3.1 Introduction In this chapter, we consider the model of Online Algorithms. The central prob- lem of study will be the Online Weighted Matching problem under distributional ar- rival sequences. The model of computation has many motivational examples such as Online advertising, ride-share, crowdsourcing and kidney-exchange. In the simplest setting of this problem, we start with an weighted bipartite graph G = (U, V,E). The protocol proceeds in T rounds with the set U available before the start of the game. In each round t ? [T ], a vertex vt ? V is presented to the algorithm (henceforth we interchangeably use the term arrives) along with all its neighboring edges in U (denoted by ?(vt)) and their corresponding weights. The algorithm has two irrevocable choices: either choose an edge e = (u, vt) ? ?(vt), where u is an unmatched vertex, or not match vt ? V . Thus, a choice to match an edge implies that the algorithm cannot un-match it in a future time-step. Likewise, choosing to drop vt implies that it cannot re-visit this vertex at a later time. The goal of the algorithm is to maximize the total weight of the matching obtained after the T rounds of the protocol. Figure 3.1 formally describes the protocol for the simplest version of this problem. 15 Given: number of time-steps T and offline vertex set U . In each round t ? [T ], 1. A vertex vt ? V and the set of neighbors ?(vt) is presented to the algorithm. 2. The algorithm either chooses an unmatched vertex u ? U to match vt or drops the request vt. Figure 3.1: Online Matching Protocol: A basic version 3.1.1 Arrival sequence. The landscape of problems can be divided based on the assumptions placed on the manner in which vt is chosen to arrive at time t. The earliest version of this problem, studied by Karp et al. [116], assumed that the vertices vt ? V is chosen by an oblivious adversary and that all edge-weights are identical. Following this, more recent works proposed two other models of arrival, namely, the unknown and the known distribution models. In both these cases, we assume that the arrival sequence is stochastic and sampled from a product distribution D over the set of possible arrivals. In other words, at every time-step t, the vertex vt is sampled from a distribution Dt independent of all other time-steps. The focus of this work will be on known product distributions. We show some future directions where we consider models that interpolates between the unknown and known product distributions. 16 3.1.2 Competitive ratio. The performance of any online algorithm for this problem is compared against the best dynamic policy (Definition 1) . The competitive ratio with respect to this benchmark is defined as follows. It is the ratio of the size of the matching obtained by the algorithm to that of the optimal matching if the entire arrival sequence was known (we interchangeably use the phrase the optimal matching of the offline instance). Formally, let ALG(I) (a random variable) denote the size of the matching obtained by any online algorithm ALG on instance I. Let OPT(I) denote the size of the optimal matching (a random variable) in instance I if the arrival sequence of vertices in V was available a-priori. Then the competitive ratio of ALG is defined as maxI E[ALG(I)]/E[OPT(I)]. 3.2 Related work The seminal paper of Karp et al. [116] introduced this problem where they gave the optimal competitive ratio of 1?1/e when the vertex arrival is set by an oblivious adversary. Following this work, this problem has garnered much interest due to its various applications (see Mehta [139] for a comprehensive survey). The vertex- weighted version(of this) problem was introduced by Aggarwal et al. [8], where they give an optimal 1? 1 ratio for the adversarial arrival model. The edge-weighted e setting has been studied in the adversarial model by Feldman et al. [82], where they consider an additional relaxation of ?free-disposal?. In addition to the adversarial and known I.I.D. models, online matching is 17 also studied under several other variants such as random arrival order, unknown distributions, and known adversarial distributions. In the setting of random arrival order, the arrival sequence is assumed to be a random permutation over all online vertices, see e.g., Devanur and Hayes [69], Kesselheim et al. [119], Korula and Pa?l [122], Mahdian and Yan [136]. In the case of unknown distributions, in each round an item is sampled from a fixed but unknown distribution. If the sampling distributions are required to be the same during each round, it is called unknown I.I.D. (Devanur et al. [73, 75]); otherwise, it is called adversarial stochastic input (Devanur et al. [75]). As for known adversarial distributions, in each round an item is sampled from a known distribution, which is allowed to change over time (Alaei et al. [16, 17]). Another variant of this problem is when the edges have stochastic rewards. Models with stochastic rewards have been previously studied by Mehta and Panigrahi [140], Mehta et al. [142] among others, but not in the known I.I.D. model. In the known i.i.d. setting prior work has considered unweighted, vertex- weighted and the edge-weighted versions of the problem. The vertex-weighted and unweighted setting (often studied in tandem) have many results starting with Feld- man et al. [83] who were the first to beat 1 ? 1/e with a competitive ratio of 0.67 for the unweighted problem. This was improved by Manshadi et al. [137] to 0.705 with an adaptive algorithm. In addition, they showed that even in the unweighted variant with integral arrival rates, no algorithm can achieve a ratio better than 1? e?2 ? 0.86. Finally, Jaillet and Lu [111] presented an adaptive algorithm which used a clever LP to achieve 0.725 and 1? 2e?2 ? 0.729 for the vertex-weighted and unweighted problems, respectively. In the edge-weighted setting model Haeupler 18 et al. [102] were the first to beat 1? 1/e by achieving a competitive ratio of 0.667. They use a discounted LP with tighter constraints and employ the power of two choices paradigm to guide their online algorithm. A closely related problem is the Adwords problem where every edge has a bid and every offline vertex has a budget. The goal is to match an online vertex to any offline vertex that has not exhausted its budget. This problem was introduced by Mehta et al. [141] and subsequently studied by Buchbinder et al. [49], Devanur and Jain [71], Devanur et al. [74], Goel and Mehta [95]. A series of works (Aggarwal et al. [8], Chan et al. [56], Devanur et al. [74], Goel and Mehta [95]) has attempted to simplify the RANKING algorithm proposed in Karp et al. [116] which has been useful to develop algorithms for related problems such as vertex-weighted matching and Adwords. Another related model is the two-sided online matching problem where all vertices in the graph come online and the goal is to match the vertices that are currently present. A series of recent works (e.g., Dickerson et al. [77], Huang et al. [107, 108], Truong and Wang [179], Wang and Wong [183]) consider many variants of this model and design improved algorithms. Online matching has also been considered in the edge-arrival model where the edges arrive instead of the vertices. Representative works include Badanidiyuru [29], Buchbinder et al. [51], McGregor [138]. 19 3.3 Online Weighted Matching 3.3.1 Formal Description of the Problem Before the start of the game, we are given a weighted bipartite graph G = (U, V,E) and a corresponding weight function w : E ? R+ that maps each edge to a positive real number. Additionally, we are given a rate function r : V ? Z+ which denotes the number of times a vertex v ? V would arrive in expectation across all the rounds in the game. The game proceeds in T rounds, where in each round a vertex v ? V is sampled i.i.d. from a distribution D such that D(v) = r(v)/T . In particular, the distributions Dt in the product distribution is the same in every time-step. As noted in Haeupler et al. [102], we can assume that r(v) = 1 for every vertex wlog. Indeed, any vertex v such that r(v) > 1 can be replaced by r(v) different vertices each an arrival rate of 1. Thus, this assumption implies that the size of the new set of online vertices is exactly T , since the sum of expectations equals the total number of online rounds. The goal of the algorithm is to match an incoming vertex v ? V to any unmatched neighbor (or drop) irrevocably in each round, such that the sum of the weights of the matched edges after T rounds is maximized. Asymptotic assumption and notation. We will always assume T is large and analyze algorithms as T goes to infinity: e.g., if x ? 1 ? (1 ? 2/T )T , we will just write this as ?x ? 1 ? 1/e2? instead of the more-accurate ?x ? 1 ? 1/e2 + o(1)?. These suppressed o(1) terms will subtract at most o(1) from our competitive ratios. 20 3.3.2 Linear Relaxation to Upper-bound OPT As discussed in Chapter 2, a common strategy in online algorithms is to find a suitable upper-bound on OPT using a mathematical program (typically a LP) and compare the performance of the algorithm to this stronger benchmark. One of the challenges in the design of algorithms is to ensure that the gap between OPT and this upper-bound is as small as possible. In particular, one of the contributions of our work is in constructing a stronger upper-bound compared to prior work. In Lemma 1, we prove that the following LP is an upper-bound on OPT. ? maximize ?e?E w(e) x(e) such that ?e??(u) x(e) ? 1 ?u ? U e??(v) x(e) ? 1 ?v ? U (3.1) 0 ? x(e) ? 1? 1/e ?e ? E x(e) + x(e?) ? 1? 1/e2 ?e, e? ? ?(u),?u ? U ?Lemma 1. Let x ? denote an optimal solution to LP (3.1). Then we have OPT ? e?E w(e)x ?(e). Proof. Let Y (e) denote the indicator random variable for the event that edge e ? E is matched in the optimal solution for a given arrival sequence A. Let y(e) := EA[Y (e)] for every edge e ? E. We will now argue that the vector y := (y(e?))e?E is a feasible solution to the LP (3.1). Consider a vertex u ? U . We have that e??(u) Y (e) ? 1. T?aking expectations on both sides and using the linearity of expectation we have e??(u) y(e) ? 1. This shows that y is feasible to the first constraint. Let R(v) 21 denote the random variable for the number of times a ve?rtex v ? V arrived in a given arrival sequence A. Then we have, for every v ? V , e??(v) Y (e) ? R(v). From the integral arrival rates assumpti?on, EA[R(v)] = 1 for every v ? V . Thus, from linearity of expectation we obtain e??(v) y(e) ? 1. This shows that y is feasible to the second constraint. For any edge e = (u, v), let I[R(v) = 0] be an indicator for the event that a vertex v ? V never arrives in the T rounds. Thus, for any arrival sequence A, we have Y (e) ? I[R(v) 6= 0]. Taking expectations on both sides we get y(e() ? EA) [I[R(v) =6 0]. The probability that a vertex v never arrives in T roundsT is 1? 1 ? 1 . Thus, E [I[R(v) 6= 0] ? 1 ? 1A . This shows that y is feasible toT e e the third constraint. Consider two edges e, e? ? ?(u) for some u ? U . Let e = (u, v) and e? = (u, v?) and as before let I[R(v) =6 0] and I[R(v?) 6= 0] denote the indicator for the events that v, v? arrives at least once in the T rounds, respectively. For any arrival sequence A we have that Y (e) + Y (e?) ? I[R(v) 6= 0] ? I[R(v?) 6= 0]. Taking expectations on both sides we get y(e) + y(e?) ? EA[I[R(v) 6= 0]? I[R((v ?) 6= )0]]. TheT probability that both v and v? never arrive in the T rounds is given by 1? 2 ? 1 T e2 . Thus, we get y(e) + y(e?) ? 1 ? 12 which shows that y is feasible to the fourthe constraint. ? The expected weight of t?he optimal solution is EA[ e?E w(e)Y (e)] which from linearity of expectation gives e?E w(e)y(e). Since y is a feasible solution we have that the optimal value to LP (3.1) is at least as large as the expected optimal solution. 22 3.4 Our Contribution Our main contribution to this problem is to improve the competitive ratio of 0.667 due to Haeupler et al. [102] to 0.7. We do so by designing a new algorithm based on the dependent-rounding schemes of Gandhi et al. [92]. We describe in detail the key observation we make in the algorithm of Haeupler et al. [102] which we exploit in our algorithm design. Haeupler et al. [102] used two matchings, M1 and M2 of the offline graph G = (U, V,E) to guide the online algorithm and leverage the power of two choices. When a vertex v arrives for the first time, an attempt to match to its neighbor in M1 is made and on its second arrival an attempt to match to its neighbor in M2 is made. However, these two matchings may not be edge disjoint, leaving some arriving vertices with only one choice (or possibly none). In fact, choosing two matchings as a guide that maximize both the edge weights and the number of disjoint edges is a major challenge that arises in applying the power of two choices to this setting. When the same edge (u, v) is included in both matchings M1 and M2, the copy of (u, v) in M2 offers no additional benefit and a second arrival of v is wasted (which happens with a substantial probability). Concretely, Haeupler et al. [102] choose the two matchings in the following manner. M1 is obtained by solving an LP similar to (3.1) and rounding this to an integral solution. M2 is constructed by first finding a maximum-weight matching and then removing all the edges which have already been included in M1. A key element of their proof is to show that the probability of an edge being removed from M2 is at most 1? 1/e ? 0.63. 23 We make the observation that generating the two matchings in a slightly corre- lated manner can drastically decrease this probability which translates to improved competitive ratio. Moreover, we show a general technique to construct an ordered set of k matchings where k is a tunable parameter. For k = 2, we show that the probability of an edge appearing in both M1 and M2 is at most 1? 2/e ? 0.26. 3.5 Overview of our algorithm We provide a overview of both the warm-up and the final algorithm. The algorithm has two components, namely the offline phase and the online phase. The offline phase is as follows. The algorithm first solves the LP (3.1) to obtain the fractional solution x?. Then we use a rounding scheme described in subsection 3.5.1 to obtain a integral vector F where each coordinate in this vector is in the set {0, 1, 2}. Then we construct two matchings M1 and M2 of the graph induced by the vector F (repeating an edge whenever any coordinate is 2). The online phases uses the two matchings M1 and M2 to guide its action. When a vertex type v ? V arrives for the first time, we attempt to match it to the neighbor in M1. When it arrives for the second time, we attempt to match it to the neighbor in M2. In its third and succeeding arrivals, we simply drop the vertex. 3.5.1 LP rounding technique We round x? to an integral solution F using a two step process we call DR[x?, k]. The first step is to multiply x? by k. The second step is to apply the 24 dependent rounding techniques of Gandhi et al. [92] to this new vector. For most parts of this section we use k = 2. In the conclusion, we briefly mention an improved algorithm that uses k = 3 to achieve the best known competitive ratio of 0.705. While dependent rounding is typically applied to values between 0 and 1, the useful properties extend naturally to our case in which kx?(e) may be greater than 1 for some edge e ? E. To understand this process, it is easiest to imagine splitting each kx?(e) into two edges with the integer value x?(e) = bkx?(e)c and fractional value x??(e) = kx?(e) ? bkx?(e)c. The former will remain unchanged by the dependent rounding procedure since it is already an integer while the latter will be rounded to 1 with probability x??(e) and 0 otherwise. The final value F (e) would be the sum of the two rounded values. The two properties of dependent rounding we use are: (P1) Marginal distribution: For every edge e ? E, let p(e) = kx?(e)?bkx?(e)c. Then, Pr[F (e) = dkx?(e)e] = p(e) and Pr[F (e) = bkx?(e)c] = 1? p(e). (P2) Degree-pr?eservation: For any vertex w ? U ? V , let its fractional degree kx?? (w) := kx ? e??(w) (e) and integral degree be the random variable F (w) = ? ? e??(w) F (e). Then F (w) ? {bkx (w)c , dkx (w)e}. 3.6 Warmup: 0.688 competitive algorithm As a warm-up, we describe a simple algorithm which achieves a competitive ratio of 0.688 and introduces the key ideas in our approach. We begin by solving the LP (3.1) to get a fractional solution vector x? and applying DR[x?, 2] as described 25 in Subsection 3.5.1 to get an integral vector F. We construct a bipartite graph GF with F (e) copies of each edge e ? E. Note that GF will have max degree 2 since for all w ? U ? V , F (w) ? d2x?(w)e ? 2 and thus we can decompose it into two matchings using Hall?s Theorem. The exact choice of the two matchings is not critical to the algorithm as long as the union contains all edges in GF. Finally, we randomly permute the two matchings into an ordered pair of matchings, [M1,M2]. These matchings serve as a guide for the online phase of the algorithm, similar to Haeupler et al. [102]. The warm-up algorithm, denoted by EW0, is summarized in Algorithm 1. Algorithm 1: EW0 1 Construct and solve the benchmark LP (3.1) for the input instance. 2 Let x? be an optimal fractional solution vector. Invoke DR[x?, 2] to get a random integral vector F. 3 Create the graph GF with F (e) copies of each edge e ? E and decompose it into two matchings as described in text. 4 Randomly permute the matchings to get a random ordered pair of matchings, say [M1,M2]. 5 When a vertex type v arrives for the first time, attempt to match v to u1 if (u1, v) ?M1; when v arrives for the second time, attempt to match v to u2 if (u2, v) ?M2. 6 When a vertex type v arrives for the third time or after, do nothing in that step. 26 3.6.0.1 Analysis of EW0 We will show that EW0 (Algorithm 1) achieves a competitive ratio of 0.688. Let [M1,M2] be the randomly ordered pair of matchings. Note an edge e ? E appears in both matchings if and only if x?(e) > 1/2. We consider three types of edges. We say an edge e ? E is of type ?1, denoted by e ? ?1, if and only if e appears only in M1. Similarly e ? ?2, if and only if e appears only in M2. Finally, e ? ?b, if and only if e appears in both M1 and M2. Let P1, P2, and Pb be the probabilities that an edge e ? ?1, e ? ?2, and e ? ?b is matched, respectively. Haeupler et al. [102] proved the following Lemma 2 that bounds P1, P2, Pb for any given matching M1 and M2. Lemma 2 (Section 3 of Haeupler et al. [102]). For any two matchings M1 and M2 steps (5) and (6) in Algorithm 1 implies that we have (1) P1 > 0.5808; (2) P2 > 0.14849 and (3) Pb > 0.632. We use Lemma 2 to prove that the warm-up algorithm EW0 achieves a ratio of 0.688 by examining the probability that a given edge e is becomes type ?1, ?2, or ?b. Analysis of EW0. Consider the following two cases. ? Case 1: 0 ? x?(e) ? 1/2: By the marginal distribution property of depen- dent rounding (P1), there can be at most one copy of e in GF. Moreover, the probability of including e in GF is 2fe. Since an edge in GF can appear in either M1 or M2 with equal probability 1/2, we have that Pr[e ? ?1] = Pr[e ? 27 ?2] = x ?(e). Thus, the probability that edge e is added to the matching is (x?(e)P1 + x ?(e)P2) = 0.729x ?(e). ? Case 2: 1/2 ? x?(e) ? 1?1/e: Similarly, by the marginal distribution (P1), Pr[e ? ?b] = Pr[F (e) = d2x?(e)e] = 2x?(e)? b2x?(e)c = 2x?(e)? 1. It follows that Pr[e ? ?1] = Pr[e ? ?2] = (1/2)(1? (2x?(e)? 1)) = 1? x?(e). Thus,the probability that edge e is added to the matching is (noting that the first term is from case 1 while the second term is from case 2) ((1 ? x?(e))(P1 + P2) + (2x?(e) ? 1)Pb) ? 0.688x?(e), where the worst-case occurs for an edge e with x?(e) = 1? 1/e. To prove the competitive ratio, we first show the following Lemma. Lemma 3. Let x? denote the optimal solution to LP (3.1). Suppose we have that for every edge e ? E, Pr[e is included in matching] ? ?x?(e) then the competitive ratio is at least ?. Proof. From Li?nearity of Expectation, we ha?ve that the total expected size of the matching is E[ e?E w(e)I[e is matched]] = e?E w(e) Pr[e is matched]. From the premise we have that for every edge e ? E, Pr[e is?included in matching] ? ?x ?(e). Thus, the?total expected size of the matching is ? w(e)x ? e?E (e). From Lemma 1 we have ?e?E w(e)x (e) ? OPT. Thus, the competitive ratio is at least ?. In the two cases above, we proved that for every e ? E we have Pr[e is matched] ? 0.688x?(e). Thus from Lemma 3 we obtain a competitive ratio of 0.688. 28 3.7 0.7 competitive algorithm In this section, we describe an improvement to the warm-up algorithm to get a competitive ratio of 0.7. We start by making an observation about the performance of the warm-up algorithm. After solving LP (3.1), let edges with x?(e) > 1/2 be called large and edges with x?(e) ? 1/2 be called small. Let L and S, be the set of large and small edges, respectively. In the warm-up analysis, small edges contributed a much higher value of 0.729 towards the ratio as opposed to the worst-case 0.688 by the large edges. This is primarily due to the fact that we may potentially get two copies of a large edge in GF. In this case, the copy in M1 has a better chance of being matched, since there is no edge which can ?block? it (i.e., an edge with the same offline neighbor that gets matched first), but the copy in M2 has no chance of being matched. To correct for this imbalance, we make an additional modification to the vector x? before applying DR[x?, k]. The rest of the algorithm is exactly the same. Let ? be a parameter to be optimized in the analysis. For all large edges ` ? L such that x?(`) > 1/2, we set x??(`) = x?(`) + ?. For all small edges s ? S which are adjacent to some large edge, let ` ? L be the largest edge adjacent to s such that x?(`) > 1/2. Note that it is possible for s to have two la(rge nei)ghbors, but we only care about ? the larger of the two. We set x??(s) = x?(s) 1?x? (`) . 1?x?(`) In other words, we increase the values of large edges while ensuring that for all w ? U?V , x?(w) ? 1 by reducing the values of neighboring small edges proportional to their original values. Note that it is not possible for two large edges to be 29 adjacent since they must both have x?(e) > 1/2. For all other small edges which are not adjacent to any large edges, we leave their values unchanged. We then apply DR[x?, 2] to this new vector, multiplying by 2 and applying dependent rounding as before. Algorithm 2 formally describes the algorithm. Algorithm 2: EW(?) 1 Construct and solve the benchmark LP (3.1) for the input instance. Let x?2 be an optimal fractional solution vector. 3 For all edges e ? E with x?(e) > 1/2 let x??(`) = x?(`) + ? 4 For all edges e ? E with x?(e) ? 1/2 do the following. Let ` ? L be the largest edge a(djacent)to e such that x?(`) > 1/2. Let x??(e) = x?(e) 1?x? ?(`) 1? ? .x (`) 5 Invoke DR[x?, 2] to get a random integral vector F. 6 Run steps (2)-(6) of Algorithm 1 3.7.0.1 Analysis Theorem 1. For edge-weighted online stochastic matching with integral arrival rates, EW(0.0142) achieves a competitive ratio of at least 0.7. Proof. As in the warm-up analysis, we?ll consider large and small edges separately ? Scenario 1: 0 ? x?(s) ? 1 : 2 Here we have two cases ? Case 1: s is not adjacent to any large edges. 30 In this case, the analysis is the same as Case 1 in the warm-up analysis. Thus, the probability that edge s is added to the matching is 0.729x?(e). ? Case 2: s is adjacent to some large edge `. For this case, let x?(`) be the value of the largest neighboring edge in the original LP solution. Then the probability that edge s is added to the matching is ( 1? (x? ) ? (`) + ?)x (s) ? (0.1484 + 0.5803).1? x (`) This follows from Lemma 2; in particular, the first two terms are the result of how we set x?(s) in the algorithm, while the two numbers, 0.1484 and 0.5803, are the probabilities that s is matched when it is in M2 and M1, respectively. Note that for x ?(`) ? [0, 1) this is a decreasing function in x?(`). So the worst case is when x?(`) = 1?1/e (due to third constraint in LP (3.1)) Thus, the probability that edge s is added to the matching is ( ) ? 1? (1? 1/e + ?)x (s) (0.1484 + 0.5803). 1? (1? 1/e) Since ? = 0.0142, this evaluates to, 0.701x?(s). (3.2) ? 1 < x?(`) ? 1 ? 1 : Here, the probability that ` is added to the matching is, 2 e [1? (x?(`) + ?)][P ?1 +P2] + [2(x (`) + ?)? 1]Pb. This can re-arranged to obtain (P1 + P2)(1? ?) + (2? ? 1)P + x?b (`)[2Pb ? P1 ? P2]. (3.3) 31 Since ? = 0.0142 using Lemma 2 we have (P1+P2)(1??)+(2??1)Pb = 0.1048. Similarly, using Lemma 2 we have 2Pb ? P1 ? P2 = 0.535. Thus, Eq. (3.3) simplifies to, 0.1048 + x?(`)0.535 (3.4) We can write Eq. (3.4) as x?(`)[0.1048/x?(`) + 0.535]. Note that 1 < x?(`) ? 2 1? 1 . Thus, Eq. (3.4) can be lower-bounded by e 0.701x?(`). (3.5) Thus combining Eq. (3.2) and (3.5) with Lemma 3 we get a competitive ratio of 0.7. We now show that the chosen value of ? = 0.0142 ensures that both x??(`) and x??(s) are less than 1 after modification. Since x?(`) ? 1 ? 1/e we have that x?(`) + ? ? 1 ? 1/e + 0.0142 ? 1. Note that x?(`) ? 1/2. Hence, t(he modifie)d ? 1?(x?value x? (s) is always less than or equal to the original value, since (`)+?) 1?x?(`) is decreasing in the range x?(`) ? [1/2, 1 ? 1/e] and has a value less than 0.98 at x?(`) = 1/2. 3.8 Conclusion In this chapter, we considered the Online (Edge) Weighted Matching problem under the known product distribution arrival assumption. The key contribution is a new algorithm based on randomized (dependent) rounding schemes of Gandhi et al. [92] that improves the competitive ratio to 0.701. For simplicity in exposition, we describe a simpler algorithm in this thesis. In Brubach et al. [42], we give 32 another algorithm which uses similar ideas and DR[x?, 3] to obtain the best-known competitive ratio of 0.705. The analysis is significantly involved with many more cases and we refer the reader to the paper for more details. All algorithms in prior work and the one described in this thesis fall under the class of online algorithms known as non-adaptive algorithms. In other words, after realizing part of the randomness in the input, the algorithm does not change it?s strategy. It is an interesting open problem to consider adaptive algorithms for this problem and understand if significant improvements to competitive ratios can be made. This is called adaptivity gap in the literature Gupta et al. [98, 99]. The best known lower-bound for this problem is 0.823 which holds even when all edge weights are identical. Thus, the immediate open question is to bridge the gap between the upper-bound of 0.705 and the lower-bound of 0.823. Adaptive algorithm is one possible approach towards bridging this gap. 33 Chapter 4: Extension 1: Online Weighted Matching with Reusable Vertices 4.1 Introduction In this chapter, we consider an extension of the Online Matching problem where the offline vertices are reusable. Online Matching problems are motivated by market design problems where agents on one side of a market are paired with agents, contracts, or transactions on the other. In many of these applications such as matching drivers to riders, jobs to servers, organs to patients the process is dynamic where one side of the market arrives in an online fashion and is matched sequentially to the other side. Moreover, in these applications the match has a temporal component; after a certain period of time the offline agent is freed and can be used for other matches. We motivate this problem using the example of ride-share below. Taxi Dispatching Services and Ride-Sharing Systems. Traditional taxi services and rideshare systems like Uber and Didi Chuxing match drivers to would- be riders Lee et al. [128], Lowalekar et al. [132], Tong et al. [176]. Here, the offline Service Providers are different vehicle drivers. Once an online request (potential rider) arrives, the system matches it to a nearby driver instantly such that the 34 rider?s waiting time is minimized. In most cases, the driver will rejoin the system and can be matched again once she finishes the service. Additionally, the arrival rates of requests changes dramatically across the day. Consider the online arrivals during peak hours and off-peak hours for example: the arrival rates in the former case can be much larger than the latter. Motivated by these applications, we consider the Online Matching with Reusable Resources problem. As before we are given a weighted bipartite graph G = (U, V,E). The vertex set U is available offline and at each time-step t ? [T ] a vertex v ? V is sampled from a known product distribution. When a vertex v arrives, we have to match it to one of the available vertices in U or drop it; this action is irrevo- cable. The key difference is that the vertices in U are reusable. Once we assign v to u, the vertex u will rejoin the system after Ce rounds with e = (u, v), where Ce ? {0, 1, . . . , T} is an integral random variable with known distribution. In other words, after Ce time-steps the vertex u becomes available for future matches. The random variable Ce is called the occupation time of u w.r.t. e. The goal is to maxi- mize the total weight of the successful matches at the end of T time-steps. Here we make no assumption on the individual distributions Dt in the product distribution and can vary (adversarially) with time. 4.2 Our Contributions The main contribution in this chapter is to define the Online Weighted Match- ing with Reusable Vertices problem and provide a new algorithm that has tight 35 theoretical guarantees. This model capture a wide range of real-world applications related to online scheduling, organ allocation, rideshare dispatch, among others. We extend the ideas from the previous chapter to design a new algorithm that achieves a competitive ratio of 1 ?  for any given constant  > 0. The key new ingredi- 2 ent in the algorithm is the notion of attenuation ? using Monte-Carlo simulations to approximately estimate the probability of certain safe event and weight actions inversely proportional to this estimate. We also show that our algorithm is nearly optimal among all non-adaptive algorithms. 4.3 Main Model In this section, we present a formal statement of our main model. Consider a weighted bipartite graph G = (U, V,E) where U and V represent the offline and online vertices respectively. Let w : E ? R+ denote the weight function that assigns a weight to every edge in the graph. We have a finite time horizon T (known beforehand) and for each time t ? [T ], a vertex v is sampled (we?use the term v arrives) from a known probability distribution {p (v)} such that p (v) ? 11t v?V t This sample is indepe?ndent for each round t. The expected number of times v arrives across the T rounds, t?[T ] pt(v), is called the arrival rate for vertex v. Once a vertex v arrives, we need to make an irrevocable decision immediately: either reject v or assign v to one of its neighbors in U . For each u ? U , once it is assigned to some v, it becomes unavailable for Ce rounds with e = (u, v), and subsequently rejoins the system. Here Ce is an?integral random variable taking values in {0, 1, . . . , T} 1Thus, with probability 1? v?V pt(v), none of the vertices from V will arrive at t. 36 with the distribution known in advance. The goal is to design an online assignment policy such that the total (expected) weight of the assignments made is maximized. In this work we assume that |V |  |U | and T  1. 4.4 LP benchmark We use the following LP to upper-bound the optimal online algorithm. The linear program has variables xt(e) for every e ? E and t ? [T ]. ? ? maximize ?t?[T ] e?E w(e) xt(e) such that ?e??(u?) xt(e) ? pt(v) ? ?u ? U, t ? [T ] t? t? t?] + e??(u) xt(e) ? 1 ?u ? U, t ? [T ] 0 ? xt(e) ? 1 ?e ? E, t ? [T ] (4.1) As before we have the following Lemma which states that the optimal value of the LP is an upper-bound to the optimal online algorithm. Lemma?4. Le?t {x ? t}t?[T ] denote an optimal solution to LP (4.1). Then we have OPT ? ?t?[T ] e?E w(e)xt (e). Proof. The proof of this is similar to that of Lemma 1. Let Yt(e) denote an indicator random variable to denote if an edge e = (u, v) was matched at time t in the optimal solution for an online sequence A. Let yt(e) := EA[Yt(e)] for every t ? [T ] and e ? E. We will now show that the vector y := (yt(e))e?E,t?[T ] is a feasible solution to the LP (4.3). Consider a vertex u ? U . Let Rt(v) denote the indicator 37 ? random variable to denote if vertex v arrived at time t in A. Thus, e??(u) Yt(e) ? ?Rt(v). Taking expectation on both sides and using linearity of expectation we get e??(u) yt(e) ? pt(v). This shows that y is feasible to the first constraint. Since Yt(e) is an indicator random variable we have 0 ? Yt(e) ? 1. Taking expectation, we obtain 0 ? yt(e) ? 1 and thus y is feasible to the third constraint. We will now show that it is feasible to the second constraint. Consider a u ? U and time-step t ? [T ]. u is either matched at some time-step t? < t and is not yet available at t or that u got matched at time t. Let I[Ce > t?t?] denote the i?ndicat?or for the re-appear time of e?dge e to be larger than t ? t ?. Thus, we have t? t? t?] + e??(u) Yt(e) ??1. Tak?ing expectation on both sides?and using linearity of expectation we get that t? t? t?e ? ]?+ e??(u) yt(e) ? 1. The expected?weight?of the optimal solution is thus EA[ t?[T ] e?E w(e)Yt(e)] which is equal to t?[T ] e?E w(e)yt(e) from linearity of expectation. Since y is feasible to the LP, the optimal value to LP (4.3) is at least as large as the expected optimal solution. In fact, it is suffices to have a finite sample estimate of Pr[C ?e > t? t ] for every edge e ? E. In particular, we have the following Lemma. The proof follows from the fact that {xt}t?[T ] is scaled down by a factor (1 + ?) and hence the objective is scaled down by a factor (1 + ?). Lemma 5. Suppose for some ? ? 0, we have an estimate f(e, y) of Pr[Ce > y] for all edges e and y >= 0 , where f(e, y)/Pr[Ce > y] always lies in [1/(1 + ?), 1 + ?]. Then, by using f(e, t ? t?) in the LP instead of Pr[Ce > t ? t?] and scaling down 38 the resultant vector {xt}t?[T ] by (1 + ?), we only get a further loss of (1 + ?) in the competitive ratio. 4.5 1/2? -competitive algorithm The main idea of the algorithm is as follows. Let {x?t}t?[T ] denote an optimal solution to LP (4.1). Suppose we aim to develop an online algorithm achieving a ratio of ? ? [0, 1]. Consider an assignment e = (u, v) when some v arrived at time t. Let SFt(e) be the event that e is safe at t (i.e., u is available at t). By using Monte-Carlo simulations on the algorithm?s strategy up to t, we can get an estimate of Pr[SFt(e)], denoted by ?t(e), within an arbitrary small error. Therefore when x?(e) SFt(e) holds, we assign v to u with probability t ? when v arrives, which leads pt(v) ?t(e) to the fact that e is assigned with probability exactly equal to ?x?t (e) unconditionally. We call any strategy that satisfies ? ? ?t(e) as valid. At the outset, this looks similar to the Inverse Propensity Scoring (IPS) used in the multi-armed bandit literature Auer et al. [24]. However, there is a key difference between IPS estimates and these Monte-Carlo estimates. In the bandit literature, one usually scales the value by the probability of playing an action, since this is the cost of observing only bandit feedback. However, here we scale by a quantity that depends on the probability of a certain event happening during the run of the algorithm, because of playing other actions. The linear program gives a distribution over the edges assuming that all the neighbors are available. Hence this scaling can be interpreted as the cost the algorithm needs to incur when some neighbors are already matched. 39 The above strategy of using Monte-Carlo simulations to weight the probability, called simulation-based attenuation, has been used previously for other problems, such as stochastic knapsack Ma [133] and stochastic matching Adamczyk et al. [2]. Throughout the analysis, we assume that we know the exact value of ?t(e) := Pr[SFt(e)] for all t and e. This is because, for any giv[en accuracy  > 0 using] ?( 1 2 ? log(1)) samples, we can ensure that ? 1t(e) ? Pr[SFt(e)] ,Pr[SFt(e)]Pr[SFt(e)] ? 1+ with probability at least 1? ?, via a standard Chernoff-bound argument. Thus, this leads to a loss of at most (1 ? ?()) multiplicative factor in the competitive ratio, which is negligible when  = o(1)2. Hence, ignoring it leads to a cleaner presentation. Algorithm 3 gives a formal description of our algorithm. Algorithm 3: Simulation-based algorithm: ALG-REUSE(?) 1 Solve LP (4.1) to obtain optimal solution {x?t}. 2 For each time t, let vt denote the request arriving at time t. x?(e) 3 If ?t(vt) = ?, then reject vt; otherwise choose e ? ?t(vt) with prob. t ? ?pt(vt) ?t(e) where e = (u, v). First, we show that when ? = 1 , Algorithm 3 is a valid strategy (i.e., ? ? ?t(e)2 for every t ? [T ], e ? Et). Lemma 6. For every t ? [T ] and e ? Et, we have ? (e) ? 1t .2 Proof. We prove this by induction on t. The base case is when t = 1. In this case we have ?t(e) = 1 for all e = (u, ?) trivially and thus, we are done. 2o(1) is a vanishing term when both Ce and T/Ce are sufficiently large 40 Consider the inductive case. For all t? < t, assume that ?t?(e) ? 1 . Consider2 time t and a given edge e = (u, vt). Assignment e is unsafe at t iff u is already assigned to some v? at t? < t and that the assignment e? = (u, v?) makes u unavailable at time t. Therefore we have, ? ? x??(e) 1? ?t(e) = 1? Pr[SF tt(e)] = Pr[Ce > t? t?]. (4.2) t? 2 t? ? 1 1 1 t ] ? + x?t (e) ? . ? t? 2 2 2 2 1, let SFt(u) be the event that u is safe at time t and At(u) be the event that u is matched at time t. In each window of K time slots, we have that the events {SFt(u), At?(u), t ?K + 1 ? t? < t} are all mutually exclusive and that together they cover the entire sample space. Therefore, we have, ? 1 = Pr[SFt(u)] + Pr[At?(u)] t??K+1?t? 0. Then we have that E[f(Z)] ? ?E[f(X)]. 1. Marginal Property: Pr[Z(e) = 1|X(e) = 1] ? ?. 2. Monotonicity Property: Pr[Z(e) = 1|X = a] ? Pr[Z(e) = 1|X = b], ?a ? b ? {0, 1}m,?e ? support(a) := {e? : a(e?) = 1}. By definition of the multilinear extension we have that E[f(X)] = F (x). Thus the theorem can be viewed as conditions when ?linearity of expectation? can be applied to F . 5.5 Related Works in Submodular Optimization The offline version of our problem is the well-studied ?maximizing a monotone submodular function subject to a bipartite matching polytope constraint? problem. More generally, the constraint set can be viewed as an intersection of two partition matroids. The general area of submodular maximization is well studied; here, we 53 only survey algorithmic advances related to maximization of a monotone submodular function subject to various constraints. The classical work of Nemhauser et al. [151] showed that the natural greedy algorithm achieves an (1?1/e)-approximation under a cardinality constraint, which is optimal in the value oracle model assuming P 6= NP Nemhauser and Wolsey [150]. Under a general matroid constraint, Calinescu et al. [52] gave an algorithm achieving the optimal ratio of 1?1/e (in the value oracle model defined in Section 5.4) using the pipage rounding technique. Lee et al. [129] considered the constraint case of k matroids with k ? 2 and presented a local-search based algorithm. Sarpatwar et al. [166] studied the case of intersection of k matroids ?(k+1) and a single knapsack constraint and gave a 1?e -approximation algorithm for k+1 any k ? 1. Recently a series of works has considered submodular maximization in the online setting. In particular, Buchbinder et al. [50] and Chan et al. [57] studied online submodular maximization in the adversarial arrival order with preemption: on arrival of an item, we should decide whether to accept it or not and possibly rejecting a previously accepted item. In this work, we do not allow preemption but consider a more flexible arrival assumption (i.e., known i.i.d.). This makes the problem tractable and admits algorithms with non-trivial competitive ratios. Apart from the offline and online models, submodular maximization has received much attention in other models due to its applications in summarization Tschiatschek et al. [180], data subset selection and active learning Wei et al. [186], and diverse summarization Mirzasoleiman et al. [143], to name a few. It has been studied in the streaming Badanidiyuru et al. [32], Mirzasoleiman et al. [146], distributed Mirzasoleiman et al. [144, 145] and stochastic Karimi et al. [115], Stan et al. [171] 54 settings. 5.6 Mathematical Program to Upper-bound OPT We use the following mathematical program to upper-bound OPT. This can- not be solved exactly in polynomial time but can be approximated to a factor of 1? 1/e efficiently (Adamczyk et al. [3], Calinescu et al. [52]). Let F : [0, 1]m ? R+ be the multilinear extension of f . Consider the following mathematical program. maximize ?F (x) such that ?e??(v) x(e) ? r(v) ?v ? V (5.1) e??(u) x(e) ? 1 ?u ? U 0 ? x(e) ? 1 ?e ? E Lemma 9. There is an efficient algorithm (running in polynomial(time))which re- turns a feasible solution x? to the program (5.1) such that F (x?) ? 1? 1 E[OPT], e where E[OPT] is the offline optimal value. Proof. First, it is easy to see that the constraint set (denoted by P) in LP (5.1) is downward closed. In other words, if x is a feasible solution, then (1? )x for every 0 ?  ? 1 is also a feasible solution. Define f+(y) to be the optimal solution of the following program{. ? ? ? } f+(y) := max ?Af(A) : ?A ? 1;?A ? 0;?e ? E ?A ? y(e) A?E A?E A:e?A (5.2) Using Lemma 6 in Adamczyk et al. [3] we have that the continuous greedy algorithm of Calinescu et al. [52] yields a feasible solution x? such that F (x?) ? 55 (1 ? 1/e) max f+y?P (y). As noted in Adamczyk et al. [3], for any given y ? P , f+(y) is the maximum value of E[f(R)] over all possible random sets R such that Pr[e ? R] ? y(e) for each e. Thus, this implies that max +y?P f (y) is an upper bound of the offline optimal. This proves that F (x?) ? (1 ? 1/e) maxy?P f+(y) ? (1? 1/e)E[OPT]. 5.7 0.125-competitive Algorithm We present an algorithm for the Online Submodular Matching problem that achieves a competitive ratio of at least 0.125. The central idea is as follows. We first start with an approximately optimal solution x? to the mathematical program (5.1). We then obtain Rx? , which is obtained by independently sampling each edge e ? E with probability x?(e). Let X ? {0, 1}m be the indicator vector corresponding to Rx? such that X(e) = 1 iff e ? Rx? . Let EX(v) = {e : e ? ?(v), X(e) = 1} and EX(u) = {e : e ? ?(u), X(e) = 1} be the set of sampled edges incident to u and v respectively. Now we obtain another random vector Y ? {0, 1}m from X by uniformly sampling an edge from EX(u) for each u ? U (the sampling process is independent across different u). We then use both X and Y to guide the online phase. Algorithm 4 describes this algorithm formally. We now prove the following theorem about the performance of this algorithm. In particular, we prove the following theorem. Theorem 5. The algorithm CR-ALG achieves a competitive ratio of at least 1(1? 2 e?1/2)(1? 1/e) for Online Submodular Matching problem with integral arrival rates. 56 Algorithm 4: A CR-based algorithm (CR-ALG) Offline Phase: 1 Approximately solve the program (5.1) using the continuous greedy of Calinescu et al. [52]. Let x? = (x(e)?)e?E be the approximate solution with F (x?) ? (1? 1/e)E[OPT]. 2 Independently sample each edge with probability x?(e). Let X = (X(e))e?E ? {0, 1}m be the resultant indicator vector. 3 For each w ? U ? V , let EX(w) := {e : e ? ?(w), X(e) = 1} be the set of sampled edges incident to w. Sample one edge uniformly at random from EX(u) for each u if EX(u) 6= ?. Let Y ? X be the indicator vector of the edges chosen after both the sampling processes. Online Phase: 4 When v ? V arrives at time t ? [T ], sample an edge e uniformly from EX(v). Match v iff Y (e) = 1 and e = (u, v) is safe at t (i.e., u is available); skip it otherwise. Proof. Let Z = (Z(e))e?E be the indicator vector for any e ? E being matched in Algorithm CR-ALG. Let X be as defined in Theorem 4 with x = x? where x? is the approximate optimal solution to the offline program. We show that Z satisfies the two properties stated in Theorem 4 with ? = 1(1 ? e?0.5). Hence, we 2 have that E[f(Z)] ? ?E[f(X)]. From Lemma 9 we have that E[f(X)] = F (x?) ? (1? 1/e)E[OPT]. Combining these two facts proves Theorem 5. We now show that Z satisfies the two properties stated in Theorem 4 with ? = 1(1? e?0.5). To prove this, we first prove the following Lemma. 2 57 Lemma 10. Consider an edge e = (u, v) with X(e) = 1. Let |EX(u)| = ku and |EX(v)| = kv, where ku and kv are the number of edges incident to u and v in the sub-graph induced by X, respectively. Then, 1 ( ( 1 )) Pr[Z(e) = 1 | X(e) = 1, |EX(u)| = ku, |EX(v)| = kv] ? 1? exp ? . (5.3) ku kv Proof. A sufficient condition for Z(e) = 1 to occur given X(e) = 1, |EX(u)| = ku, |EX(v)| = kv is that Y (e) = 1 in this conditional space and that Z(e) = 1 given that Y (e) = 1. Thus, we have Pr[Z(e) = 1 | X(e) = 1, |EX(u)| = ku, |EX(v)| = kv] ? Pr[Y (e) = 1 | X(e) = 1, |EX(u)| = ku] ? Pr[Z(e) = 1 | Y (e) = 1, |EX(v)| = kv, |EX(u)| = ku]. (5.4) Since Y (e) = 1 is set uniformly for every edge e ? EX(u) we have that 1 Pr[Y (e) = 1 | X(e) = 1, |EX(u)| = ku] = . (5.5) ku We now show that the following holds. ? 1 1 ( 1 )t?1 Pr[Z(e) = 1 | Y (e) = 1, |EX(v)| = kv, |EX(u)| = ku] ? 1? . T kv T ? kv t?[T ] (5.6) Consider a given e = (u, v) with Y (e) = 1. We compute the probability that u is safe. At any time t ? [T ], the probability that a vertex u is matched is at most 1 (i.e., the probability that a neighbor(v of u a)rrives in this time-step). Thus,kvT t?1 the probability that u is safe is at least 1? 1 ?. Thus, (the proba)bility thatTkv t?1 e = (u, v) is matched in the T time-steps is at least 1 1t?[T ] 1? 1 T kv T ? .kv 58 Plugging equations (5.5) and (5.6) back into Eq. (5.4) we get, Pr[Z(e) = 1 | X((e) = 1, |EX(u)| = ku, |EX(v)| = k ]? vT ? 1 1 1 1 )t?1 1? ku( T kv ( T)? kt=1 v ? 1 1 ) 1? exp ? . ku kv Given Lemma 10, we are now ready to prove the two properties. First, we show that the Marginal Property holds with ? = (1? e?0.5). Taking expectation over the random sets |EX(u)| and |EX(v)|, Eq. (5.3) in Lemma 10 becomes, [ ( ( ))] Pr[Ze = 1|Xe = 1] ? E 1ku,kv 1? exp ? 1 (5.7)ku kv ( ( )) Since 1 and 1 ? exp ? 1 , are convex functions in ku and kv by Jensen?sku kv inequality, this evaluates to, [ 1 ( (? ? 1 ))] ? 1 ( ( 1 ))Eku,kv 1 exp 1? exp ? . (5.8)ku kv E[|EX(u)|] E(|EX(v)|)? From construction we?have, E[|EX(u)|] = 1+ ? e??E(u),e? 6=e x (e ?) ? 2. Likewise, we have E[|EX(v)|] = 1 + ?e??E(v),e? 6=e x (e?) ? r(v) = 2, since x? = (x?(e))e?E is feasible to the mathematical program (5.1). Thus, the RHS in Eq. (5.8) can be lower-bounded by 1 ( ( 1 )) 1( ) 1? exp ? ? 1? e?1/2 (:= ?). (5.9) E[|EX(u)|] E(|EX(v)|) 2 Combining Eq. (5.9) and (5.7) proves the marginal property. ( Th(e Mo)n)otonicity Property follows from combining the fact that 1 and ku 1?exp ? 1 are decreasing functions in ku and kv respectively and Eq. (5.3).kv 59 5.8 Conclusion In this chapter, we proposed a new model, Online Submodular bipartite match- ing, which effectively captures notions such as relevance and diversity in matching markets. Many applications such as advertising, hiring diverse candidates, recom- mending movies or songs naturally fit within this framework. We propose an algo- rithm based on contention-resolution schemes and prove theoretical guarantees on their performance. In Dickerson et al. [78], we propose another algorithm that has an ? improved competitive ratio in the special case when |U | = o( T ) and T ??. For this special case, this algorithm achieves a competitive ratio of (1?1/e)2. Dickerson et al. [78] also has initial numerical experiments on the MovieLens dataset Harper and Konstan [103] which validates the theoretical results. 60 Chapter 6: Rideshare: Empirical Evaluation of Online Matching 6.1 Introduction In this chapter, we consider a case-study of the Online Matching with Reusable Vertices problem studied in Chapter 4 in the context of ride-share. We cast the prob- lem of matching drivers to riders as an Online Matching problem. Using the publicly available New York City yellow cabs dataset,1 which contains the trip records for trips in Manhattan, Brooklyn, and Queens for the year 2013, we evaluate the perfor- mance of the proposed algorithm and many natural heuristics. Further, we also show that many of the assumptions made to help provable guarantees hold in practice. 6.2 Dataset and Assumptions The dataset is split into 12 months. For each month we have numerous records each corresponding to a single trip. Each record has the following structure. We have an anonymized license number which is the primary key corresponding to a car. For privacy purposes a long string is used as opposed to the actual license number. We then have the time at which the trip was initiated, the time at which the trip ended, and the total time of the trip in seconds. This is followed by the starting 1http://www.andresmh.com/nyctaxitrips/ 61 coordinates (i.e., latitude and longitude) of the trip and the destination coordinates of the trip. Assumptions. We make two assumptions specific to our experimental setup. Firstly, we assume that every car starts and ends at the same location, for all trips that it makes. Initially, we assign every car a location (potentially the same) which corresponds to its docking position. On receiving a request, the car leaves from this docking position to the point of pick-up, executes the trip and returns to this docking position. Secondly, we assume that occupation time distributions (OTD) associated with all matches are identically (and independently) distributed, i.e., {Ce}e?E follow the same distribution. Note that this is a much stronger assump- tion than what we made in the model, and is completely inspired by the dataset (see Section 6.4). We test our model on two specific distributions, namely a normal distribution and the power-law distribution (see Figure 6.5). The docking position of each car and parameters associated with each distribution are all learned from the training dataset (described below in the Training discussion). 6.3 Experimental Setup The cars represent the set of vertices U and the request types represent the set of vertices V . We assign an edge between a car u ? U and request type v ? V iff the request type v can be assigned to the car u. Given this modeling, the experimental setup is as follows. We randomly select 30 cabs (each cab is denoted by u). We discretize the Manhattan map into cells such that each cell is approximately 4 miles 62 (increments of 0.15 degrees in latitude and longitude). For each pair of locations, say (a, b), we create a request type v, which represents all trips with starting and ending locations falling into a and b respectively. In our model, we have |U | = 30 and |V | ? 550 (variations depending on day to day requests with low variance). We focus on the month of January 2013. We split the records into 31 parts, each corresponding to a day of January. We choose a random set of 12 parts for training purposes and use the remaining for testing purposes. The edge weight w(e) on e = (u, v) (i.e., edge from a car u to type v) is set as a function of two distances in our setup. The first is the trip distance (i.e., the distance from the starting location to the ending location of v, denoted L1) while the second is the docking distance (i.e., the distance from the docking position of u to the starting/ending location of v, denoted L2). We set w(e) = max(L1??L2, 0), where ? is a parameter capturing the subtle balance between the positive contribution from the trip distance and negative contribution from the docking distance to the final profit. We set ? = 0.5 for the experiments. We consider each single day as the time horizon and set the total number of rounds T = 24?60 = 288 by discretizing 5 the 24-hour period into a time-step of 5 minutes. Throughout this section, we use time-step and round interchangeably. Training. We use the training dataset of 12 days to learn various parameters. As for the arrival rates {pt(v)}t?[T ],v?V , we first count the total number of appearances of each request type v at time-step t in the 12 parts (denote it by ct(v)). Then, we set pt(v) = ct(v)/12 when we assume known adversarial arrivals (KAD) and 63 p(v) = pt(v) = (c/12)/T (i.e., the arrival distributions are assumed the same across all the time-steps for each v) when we assume i.i.d. arrivals (KIID). The estimation procedure for the parameters of the two distributions for the occupation time is as follows. We first compute the average number of seconds between two requests in the dataset (note this was 5 minutes in the experimental setup). We then assume that each time-step of our online process corresponds to a time-difference of this average in seconds. We then compute the sample mean and sample variance of the trip lengths (as number of seconds taken by the trip divided by five minutes) in the 12 parts. Hence we use the normal distribution obtained by this sample mean and standard deviation as the distribution with which a car is unavailable. We assign the docking position of each car to the location (in the discretized space) in which the majority of the requests were initiated (i.e., starting location of a request) and matched to this car. 6.4 Justifying The Two Important Model Assumptions Known Adversarial Distributions. Figure 6.4 plots the number of arrivals of a particular type at various times during the day. Notice the significant increase in the number of requests in the middle of the day as opposed to the mornings and nights. This justified our arrival assumption of KAD which assumes different arrival distri- butions at different time-steps. Hence the LP (and the corresponding algorithm) can exploit this vast difference in the arrival rates and potentially obtain improved results compared to the assumption of Known Identical Independent Distributions 64 (KIID). This is confirmed by our experimental results shown in Figures 6.1 and 6.2. Identical Occupation Time Distribution. We assume each car will be available again via an independent and identical random process regardless of the matches it received. The validity of our assumptions can be seen in Figures 6.5 and 6.6, where the x-axis represents the different occupation time and the y-axis represents the corresponding number of requests in the dataset responsible for each occupation time. It is clear that for most requests the occupation time is around 2-3 time-steps and dropping drastically beyond that with a long tail. Figure 6.6 displays occupation times for two representative (we chose two out of the many cars we plotted, at random) cars in the dataset; we see that the distributions roughly coincide with each other, suggesting that such distributions can be learned from historical data and used as a guide for future matches. 6.5 Results Inspired by the experimental setup by Tong et al. [176], we run five different algorithms on our dataset. The first algorithm is the ALG-LP. In this algorithm, when a request v arrives, we choose a neighbor u with probability x?t (e)/pt(v) with e = (u, v) if u is available. Here x?t (e) is an optimal solution to the benchmark LP (4.3) and pt(v) is the arrival rate of type v at time-step t. The second algorithm is called ALG-SC-LP. Recall that Et(v) is the set of ?safe? or ava?ilable assignments with respect to v when the type v arrives at t. Let xt(v) = ? e?E x (e). Int(v) t ALG-SC-LP, we sample a safe assignment for v with probability x?t (e)/xt(v). The 65 Figure 6.1: OTD is normal distribution under KIID next two algorithms are heuristics oblivious to the underlying LP. Our third algo- rithm is called GREEDY which is as follows. When a request v comes, match it to the safe neighbor u with the highest edge weight. Our fourth algorithm is called UR-ALG which chooses one of the safe neighbors uniformly at random. Finally, we use a combination of LP-oblivious algorithm and LP-based algorithm called -GREEDY. In this algorithm when a type v comes, with probability  we use the greedy choice and with probability 1?  we use the optimal LP choice. In our algo- rithm, we optimized the value of  and set it to  = 0.1. We summarize our results in the following plots. Figures 6.1, 6.2, and 6.3 show the performance of the five algorithms and OPT (optimal value of the benchmark LP) under the different as- 66 Figure 6.2: OTD is normal distribution under KAD 67 Figure 6.3: OTD is power law distribution under KAD 68 Figure 6.4: The number of requests of a given type at various time-steps. x-axis: time-step, y-aixs: number of requests sumptions of the OTD (normal or power law) and online arrives (KIID or KAD). In all three figures the x-axis represents test data-set number and the y-axis represents average weight of matching. Discussion. From the figures, it is clear that both the LP-based solutions, namely ALG-LP and ALG-SC-LP, do better than choosing a free neighbor uniformly at random. Additionally, with distributional assumptions the LP-based solutions out- perform greedy algorithm as well. We would like to draw attention to a few inter- esting details in these results. Firstly, compared to the LP optimal solution, our LP-based algorithms have a competitive ratio in the range of 0.5 to 0.7. We believe 69 Figure 6.5: Occupation time distribution of all cars. x-axis: number of time-steps, y-axis: number of requests 70 Figure 6.6: Occupation time distribution of two different cars. x-axis: number of time-steps, y-axis: number of requests 71 this is because of our experimental setup. In particular, we have that the rates are high (> 0.1) only in a few time-steps while in all other time-steps the rates are very close to 0. This means that it resembles the structure of the theoretical worst case example we showed in Chapter 4. In future experiments, running our algorithms during peak periods (where the request rates are significantly larger than 0) may show that competitive ratios in those cases approach 1. Secondly, it is surprising that our algorithm is fairly robust to the actual distributional assumption we made. In particular, from Figures 6.2 and 6.3 it is clear that the difference between the assumption of normal distribution versus power-law distribution for the unavailabil- ity of cars is negligible. This is important since it might not be easy to learn the exact distribution in many cases (e.g., cases where the sample complexity is high) and this shows that a close approximation will still be as good. 72 Part II Multi-armed Bandits 73 Chapter 7: Bandits with Knapsacks 7.1 Introduction We now focus on another model of sequential decision making paradigm, called multi-armed bandits. Multi-armed bandits (MAB) is an elegant model for study- ing the tradeoff between acquisition and usage of information, a.k.a. explore-exploit tradeoff (Robbins [159], Thompson [175]). In each round an algorithm sequentially chooses from a fixed set of alternatives (sometimes known as actions or arms), and receives reward for the chosen action. Crucially, the algorithm does not have enough information to answer all ?counterfactual? questions about what would have hap- pened if a different action was chosen in this round. Studied over many decades, multi-armed bandits is a very active research area spanning computer science, op- erations research, and economics (Bergemann and Va?lima?ki [37], Bubeck and Cesa- Bianchi [43], Cesa-Bianchi and Lugosi [54], Gittins et al. [94]). In this work, we focus on bandit problems which feature supply or budget constraints, as is the case in many realistic applications. For example, a seller who experiments with prices may have a limited inventory, and a website optimizing ad placement may be constrained by the advertisers? budgets. This general problem is called Bandits with Knapsacks (BwK) since, in this model, a bandit algorithm 74 needs effectively to solve a knapsack problem (find an optimal packing of items into a limited-size knapsack) or generalization thereof. The BwK model was introduced in Badanidiyuru et al. [33] as a common generalization of numerous motivating examples, ranging from dynamic pricing to ad allocation to repeated auctions to network routing/scheduling. Various special cases with budget/supply constraints were studied previously (e.g., Babaioff et al. [28], Badanidiyuru et al. [30], Besbes and Zeevi [38], Combes et al. [67], Singla and Krause [169]). In BwK, the algorithm is endowed with d ? 1 limited resources that are consumed by the algorithm. In each round, the algorithm chooses an action (arm) from a fixed set of K actions, and the outcome consists of a reward and consumption of each resource; all are assumed to lie in [0, 1]. The algorithm observes bandit feedback, i.e., only the outcome of the chosen arm. The algorithm stops at time horizon T , or when the total consumption of some resource exceeds its budget. The goal is to maximize the total reward, denoted REW. 7.1.1 Formal Model We now describe the formal model. Figure 7.1 describes the protocol. There are T rounds, K possible actions and d resources, indexed as [T ], [K], [d], respec- tively. In each round t ? [T ], the algorithm chooses an action at ? [K] and receives an outcome vector o d+1t = (rt; ct,1 , . . . , ct,d) ? [0, 1] , where rt is a reward and ct,i is consumption of each resource i ? [d]. Each resource i is endowed with bud- get Bi ? T . The game stops early, at some round ?alg < T , when/if the total consumption of any resource exceeds its budget. The algorithm?s objective is to 75 maximize its total reward. Without loss of generality all budgets are the same: B1 = B2 = . . . = B = B. 1 d The outcome vectors are chosen as follows. In each round t, the adversary chooses the outcome matrix M ? [0, 1]K?(d+1)t , where rows correspond to actions. The outcome vector ot is defined as the at-th row of this matrix, denoted Mt(at). Only this row is revealed to the algorithm. The adversary is deterministic and oblivious, meaning that the entire sequence M1 , . . . ,MT is chosen before round 1. A problem instance of BwK consists of (known) parameters (d,K, T,B), and the (unknown) sequence M1 , . . . ,MT . In the stochastic version of BwK, henceforth termed Stochastic BwK, each outcome matrix Mt is chosen from some fixed but unknown distribution DBwK over the outcome matrices. An instance of this problem consists of (known) parameters (d,K, T,B), and the (unknown) distribution DBwK. In the adversarial version of BwK, henceforth termed Adversarial BwK, the outcome matrix Mt is chosen by an adversary. We use two versions of the adversary, namely oblivious adversary who chooses all the outcome matrices before the start of the game and adaptive adversary who chooses the outcome matrix at any time t ? [T ] after observing the actions a1, a2, . . . , at?1 of the algorithm. Following prior work Agrawal and Devanur [11], Badanidiyuru et al. [33], we assume, w.l.o.g., that one of the resources is a dummy resource similar to time; 1To see that this is indeed w.l.o.g., for each resource i, divide all per-round consumptions ct,i by Bi/B, where B := mini?[d]Bi is the smallest budget. In the modified problem instance, all consumptions still lie in [0, 1], and all the budgets are equal to B. 76 formally, each action consumes B/T units of this resource per round (we only need this for Stochastic BwK). Further, we posit that one of the actions is a null action, which lets the algorithm skips a round: it has 0 reward and consumes 0 amount of each resource other than the dummy resource. Given: number of time-steps T , resources d, arms K and budget B. In each round t ? [T ], 1. ALG selects an arm at ? [K]. 2. Nature reveals reward rt(at) and consumptions {ci,t(at)}i?[d]. 3. ALG gets reward rt(at) and consumes ci,t(at) amount of each resource i ? [d] 4. Stop if some resource consumed more than B amount. Figure 7.1: Bandits with Knapsacks Protocol 7.1.2 Benchmarks ? Let REW(ALG) = t?[? ] rt be the total reward of algorithm ALG in thealg BwK problem. Our benchmark is the best fixed distribution (Definition 2), a distri- bution over actions which maximizes E[REW(?)] for a particular problem instance. The expected total reward of this distribution is denoted OPTFD. For Stochastic BwK, one can compete with the best dynamic policy : an algo- rithm that maximizes E[REW(?)] for a particular problem instance. Essentially, this 77 algorithm knows the latent distribution DBwK over outcome matrices. Its expected total reward is denoted OPTDP. We argue that the best fixed distribution over arms is an appropriate bench- mark for Adversarial BwK. First, consider the total expected reward of the best dynamic policy, denote it OPTDP. (The best dynamic policy is the best algorithm, in hindsight, that is allowed to switch arms arbitrarily across time-steps.) This is the strongest possible benchmark, but it is too strong for Adversarial BwK. Indeed, we show a simple example with just one resource (with budget B), where competi- tive ratio against this benchmark is at least T2 for any algorithm. Second, considerB the total expected reward of the best fixed arm, denote it OPTFA. It is a traditional benchmark in multi-armed bandits, but is uninteresting for Adversarial BwK. We show that the competitive ratio is at least ?(K) in the worst case, and this is matched, in expectation, by a trivial algorithm that samples one arm at random and sticks with it forever. For Stochastic BwK, these three benchmarks are related as follows. The best fixed distribution is still the main object of interest, as far as the design and analysis of algorithms is concerned. However, all results ? both ours and prior work ? are almost automatically extended to compete against the best dynamic policy. The best fixed arm is a much weaker benchmark than the best fixed distribution: there are simple examples when their expected reward differs by a factor of two, in multiple special cases of interest (Badanidiyuru et al. [33]). 78 7.2 Related Work The literature on regret-minimizing online learning algorithms is vast; see Bubeck and Cesa-Bianchi [43], Cesa-Bianchi and Lugosi [54], Hazan [104] for back- ground. Most relevant are two algorithms for adversarial rewards/costs: Hedge for full feedback (Freund and Schapire [89]), and EXP3 for bandit feedback (Auer et al. [25]); both are based on the weighted majority algorithm from Littlestone and Warmuth [130]. Stochastic BwK was introduced and optimally solved in Badanidiyuru et al. [33]. Subsequent work extended these results to soft supply/budget constraints (Agrawal and Devanur [11]), a more general notion of rewards2 (Agrawal and Deva- nur [11]), combinatorial semi-bandits (Sankararaman and Slivkins [163]), and con- textual bandits (Agrawal and Devanur [12], Agrawal et al. [14], Badanidiyuru et al. [31]). Several special cases with budget/supply constraints were studied previously: dynamic pricing (Babaioff et al. [28], Besbes and Zeevi [38, 39], Wang et al. [184]), dynamic procurement (Badanidiyuru et al. [30], Singla and Krause [169]) (a version of dynamic pricing where the algorithm is a buyer rather than a seller), dynamic ad allocation (Combes et al. [67], Slivkins [170]), and a version with a single re- source and unlimited time (Ding et al. [79], Gyo?rgy et al. [100], Tran-Thanh et al. [177, 178]). While all this work is on regret minimization, Guha and Munagala [96], Gupta et al. [97] studied closely related Bayesian formulations. 2The total reward is determined by the time-averaged outcome vector, but can be an arbitrary Lischitz-concave function thereof. 79 Stochastic BwK was optimally solved using three different algorithms (Agrawal and Devanur [11], Badanidiyuru et al. [33]), with extremely technical and delicate analyses. All three algorithms involve inherently ?stochastic? techniques such as ?successive elimination? and ?optimism under uncertainty?, and do not appear to extend to the adversarial version. One of them, PrimalDualBwK from Badanidiyuru et al. [33], is a primal-dual algorithm superficially similar to ours. Indeed, it decou- ples into two online learning algorithms: a ?primal? algorithm which chooses among arms, and a ?dual? algorithm similar to ours, which chooses among resources. How- ever, the two algorithms are not playing a repeated game in any meaningful sense, let alone a zero-sum game. The primal algorithm operates under a much richer input: it takes the entire outcome vector for the chosen arm, as well as the ?dual distribu- tion? ? the distribution over resources chosen by the dual algorithm. Further, the primal algorithm is very problem-specific: it interprets the dual distribution as a vector of costs over resources, and chooses arms with largest reward-to-cost ratios, estimated using ?optimism under uncertainty?. Our approach to using regret minimization in games can be traced to Fre- und and Schapire [87, 89] (see Ch. 6 in Schapire and Freund [167]), who showed how a repeated zero-sum game played by two agents yields an approximate Nash equilibrium. This approach has been used as a unifying algorithmic framework for sev- eral problems: boosting (Freund and Schapire [87]), linear programs (Arora et al. [21]), maximum flow (Christiano et al. [66]), and convex optimization (Abernethy ? and Wang [1], Wang and Abernethy [182]). While we use a result with the 1/ t 80 convergence rate for the equilibrium property, recent literature obtains faster con- vergence for cumulative payoffs (but not for the equilibrium property) under various assumptions (e.g., Rakhlin and Sridharan [155], Syrgkanis et al. [172], Wei and Luo [185]). Repeated Lagrangian games, in conjunction with regret minimization in games, have been used in a series of recent papers (Agarwal et al. [7], Hsu et al. [106], Kearns et al. [118], Rogers et al. [160], Roth et al. [161, 162]), as an algorithmic tool to solve convex optimization problems; application domains range from differential privacy to algorithmic fairness to learning from revealed preferences. All these papers deal with deterministic games (i.e., same game matrix in all rounds). Reframing the problem in terms of repeated Lagrangian games is a key technical insight in this work. Most related to our paper are Roth et al. [161, 162], where a repeated La- grangian game is used as a subroutine (the ?inner loop?) in an online algorithm; the other papers solve an offline problem. We depart from this prior work in several re- spects: we use a stochastic game, we deal with some subtleties specific to Stochastic BwK, and we provide a very different analysis for our main results on Adversarial BwK, where we cannot rely on the standard machinery. Online packing problems (e.g., Buchbinder and Naor [47], Devanur et al. [72], see Buchbinder and Naor [48] for a survey) can be seen as a special case of Adversarial BwK with a much more permissive feedback model: the algorithm observes full feedback (the outcomes for all actions) before choosing an action. Online packing subsumes various online matching problems, including the AdWords problem (Mehta 81 et al. [141]) motivated by ad allocation (see Mehta [139] for a survey). While we derive O(log T ) competitive ratio against OPTFD, online packing admits a similar result against OPTDP. Another related line of work concerns online convex optimization with con- straints Chen and Giannakis [63], Chen et al. [64], Mahdavi et al. [134, 135], Neely and Yu [149]. Their setting differs from ours in several important respects. First, the action set is a convex subset of RK (and the algorithms rely on the power to choose arbitrary actions in this set). In particular, there is no immediate way to handle discrete action sets.3 Second, convexity/concavity is assumed on the rewards and resource consumption. Third, in addition to bandit feedback, full feedback is observed for the resource consumption, and (in all papers except Chen and Gian- nakis [63]) one also observes either full feedback on rewards or the rewards gradient around the chosen action. Fourth, their algorithm only needs to satisfy the budget constraints at the time horizon (whereas in BwK the budget constraints hold for all rounds). Fifth, their fixed-distribution benchmark is weaker than ours: essentially, its time-averaged consumption must be small enough at each round t. Due to these differences, their setting admits sublinear regret in the adversarial setting, whereas we have a ?(log T ) lower bound on the competitive ratio. Logarithmic competitive ratios are quite common in prior work on approxima- tion algorithms and online algorithms, e.g., in the context of the set cover problem Johnson [112], Lova?sz [131], buy-at-bulk network design Awerbuch and Azar [26], 3Unless there is full feedback, in which case one can use a standard reduction whereby actions in online convex optimization correspond to distributions over actions in a K-armed bandit problem. 82 sparsest cut Arora et al. [20], and dial-a-ride problem Charikar and Raghavachari [58], the online k-server problem Bansal et al. [34], online packing/covering problems Azar et al. [27], online set cover Alon et al. [18], online network design Umboh [181], and online paging Fiat et al. [85]. Simultaneous work. Two very recent (and yet unpublished) manuscripts have come to our attention after the initial version of this paper has appeared on arxiv.org. Rivera et al. Rivera et al. [158] consider online convex optimization with knapsacks with full feedback. Focusing on the stochastic version, they design an algorithm similar to LagrangeBwK, and derive a regret bound similar to ours, using a simi- lar analysis. They also claim an extension to bandit feedback, without providing any details (such as precise statement of Lemma 11 in terms of the regret property (7.4)). Rangi et al. Rangi et al. [157] consider Adversarial BwK in the special case when there is only one constrained resource, including time. They attain sublinear regret, i.e., a regret bound that is sublinear in T . They also assume a known lower bound cmin > 0 on realized per-round consumption of each resource, and their regret bound scales as 1/cmin. They also achieve polylog(T ) instance-dependent regret for the stochastic version using the same algorithm (matching results from prior work on the stochastic version). BwK with only one constrained resource (including time) is a much easier problem, compared to the general case with multiple resources studied in this paper, in the following sense. First, the single-resource version admits much ? stronger performance guarantees (polylog(T ) vs. T regret bounds for Stochastic 83 BwK, and sublinear regret vs. approximation ratio for Adversarial BwK). Second, the optimal all-knowing time-invariant policy is the best fixed arm, rather than the best fixed distribution over arms. 7.3 Challenges and Our Contributions 7.3.1 Challenges Adversarial BwK is a much harder problem compared to Stochastic BwK. The new challenge is that the algorithm needs to decide how much budget to save for the future, without being able to predict it. (It is also the essential challenge in online packing problems, and it drives our lower bounds.) This challenge compounds the ones already present in Stochastic BwK: that exploitation may be severely limited by the resource consumption during exploration, that optimal per-round reward no longer guarantees optimal total reward, and that the best fixed distribution over arms may perform much better than the best fixed arm. Jointly, these challenges amount to the following. An algorithm for Adversarial BwK must compete, during any given time segment [1, ? ], with a distribution over arms that maximizes the total reward on this time segment. However, this distribution may behave very differently, in terms of expected per-round outcomes, compared to the optimal distribution for some other time segment [1, ? ?]. In more concrete terms, let OPTFD be the total expected reward of the best fixed distribution over arms. In Stochastic BwK (as well as in adversarial bandits) 84 an algorithm can achieve sublinear regret: OPT 4FD?E[REW] = o(T ). In contrast, in Adversarial BwK regret minimization is no longer possible, and we therefore are primarily interested in the competitive ratio OPTFD /E[REW]. It is instructive to consider a simple example in which the competitive ratio is at least 5 ? o(1) for any algorithm. There are two arms and one resource with 4 budget T . Arm 1 has zero rewards and zero consumption. Arm 2 has consumption 2 1 in each round, and offers reward 1 in each round of the first half-time (T rounds). 2 2 In the second half-time, it offers either reward 1 in all rounds, or reward 0 in all rounds. Thus, there are two problem instances that coincide for the first half-time and differ in the second half-time. The algorithm needs to choose how much budget to invest in the first half-time, without knowing what comes in the second. Any choice leads to competitive ratio at least 5 on one of the problem instances. 4 Extending this idea, we prove an even stronger lower bound on the competitive ratio: OPTFD /E[REW] ? ?(log T ). (7.1) Like the simple example above, the lower-bounding construction involves only two arms and only one resource, and forces the algorithm to make a huge commitment without knowing the future. ? 4More specifically, one can achieve regret O?( KT ) for adversarial bandits (Auer et al. [25]), as well as for Stochastic BwK if all budgets are ?(T ) Badanidiyuru et al. [33]. One can achieve sublinear regret for Stochastic BwK if all budgets are ?(T?), ? ? (0, 1) (Badanidiyuru et al. [33]). 85 7.3.2 Our Contributions Our main result is an algorithm which nearly matches (7.1), achieving E[REW] ? 1 (OPT O(log T ) FD ?o(OPTFD)) . (7.2) We put forward a new algorithm for BwK, called LagrangeBwK, that unifies the stochastic and adversarial versions. It has a natural game-theoretic interpretation for Stochastic BwK, and admits a simpler analysis compared to the prior work. For Adversarial BwK, we use LagrangeBwK as a subroutine, though with a different parameter and a different analysis, to derive two algorithms: a simple one that achieves (7.2), and a more involved one that achieves the same competitive ratio with ? high probability. Absent resource consumption, we recover the optimal O?( KT ) regret for adversarial bandits. LagrangeBwK is based on a new perspective on Stochastic BwK. We reframe a standard linear relaxation for Stochastic BwK in a way that gives rise to a repeated zero-sum game, where the two players choose among arms and resources, respec- tively, and the payoffs are given by the Lagrange function of the linear relaxation. Our algorithm consists of two online learning algorithms playing this repeated game. We analyze LagrangeBwK for Stochastic BwK, building on the tools from regret min- imization in stochastic games, and achieve a near-optimal regret bound when the optimal value and the budgets are ?(T ).5 Discussion. LagrangeBwK has numerous favorable properties. As just discussed, it 5This regime is of primary importance in prior work, e.g., Besbes and Zeevi [38], Wang et al. [184]. 86 is simple, unifying, modular, and yields strong performance guarantees in multiple settings. It is the first ?black-box reduction? from bandits to BwK: we take a bandit algorithm and use it as a subroutine for BwK. This is a very natural algorithm for the stochastic version once the single-shot game is set up; indeed, it is immediate from prior work that the repeated game converges to the optimal distribution over arms. Its regret analysis for Stochastic BwK is extremely clean. Compared to prior work (Agrawal and Devanur [11], Badanidiyuru et al. [33]), we side-step the intricate analysis of sensitivity of the linear program to non-uniform stochastic deviations that arise from adaptive exploration. LagrangeBwK has a primal-dual interpretation, as arms and resources corre- spond respectively to primal and dual variables in the linear relaxation. Two players in the repeated game can be seen as the respective primal algorithm and dual algo- rithm. Compared to the rich literature 6 on (e.g., primal-dual algorithms Buchbinder and Naor [48], Mehta [139], Williamson and Shmoys [188]) LagrangeBwK has a very specific and modular structure dictated by the repeated game. 7.4 Preliminaries In this section, we state the required preliminaries from prior work on adver- sarial bandits and zero-sum games. 6This also includes the more recent literature on stochastic online packing problems (e.g., Agrawal et al. [13], Devanur and Hayes [70], Devanur et al. [72], Feldman et al. [84], Molinaro and Ravi [147]). 87 7.4.1 Adversarial Online Learning Given: action set A, payoff range [bmin, bmax]. In each round t ? [T ], 1. the adversary chooses a payoff vector ft ? [bmin, b Kmax] ; 2. the algorithm chooses a distribution pt over A, without observing ft, 3. algorithm?s chosen action at ? A is drawn independently from pt; 4. payoff ft(at) is received by the algorithm. Figure 7.2: Adversarial online learning In this protocol (see Figure 7.2), the adversary can use previously chosen arms to choose the payoff vector ft, but not the algorithm?s random seed. The distribution ft is chosen as a deterministic function of history. (The history at round t consists, for each round s < t, of the chosen action as and the observed feedback in this round.) We focus on two feedback models: bandit feedback (no auxiliary feedback) and full feedback (the entire payoff vector ft). The version for costs can be defined similarly, by setting the payoffs to be the negative of costs. We are interested in adversarial online learning algorithms with known upper bounds on regret, [ ? ] [? ] RAOL(T ) := maxa?A t?[T ] ft(a) ? t?[T ] ft(at) . (7.3) The benchmark here is the total payoff of the best arm, according to the payoff 88 vectors actually chosen by the adversary. More precisely, we assume high-probability regret bounds of the following form: ?? > 0 Pr [ RAOL(T ) ? (bmax ? bmin)R?(T ) ] ? 1? ?, (7.4) for some function R?(?). We will actually use a stronger version implied by (7.4),7 [ ] ?? > 0 Pr ?? ? [T ] RAOL(?) ? (bmax ? bmin)R?/T (T ) ? 1? ?. (7.5) Algorithms EXP3.P (Auer et al. [25]) for bandit feedback, and Hedge (Freund and Schapire [88]) for full feedback, satisfy (7.4) with, resp., (? ) (? ) R?(T ) = O |A|T log(T/?) and R?(T ) = O T log(|A|/?) . (7.6) 7.4.2 Regret minimization in games We build on the framework of regret minimization in games. A zero-sum game (A1, A2,G) is a game between two players i ? {1, 2} with action sets A1 and A2 and payoff matrix G ? RA1?A2 . If each player i chooses an action ai ? Ai, the outcome is a number G(a1, a2). Player 1 receives this number as reward, and player 2 receives it as cost. A repeated zero-sum game G with action sets A1 and A2, time horizon T and game matrices G1 , . . . ,GT ? RA1?A2 is a game between two algorithms, ALG1 and ALG2, which proceeds over T rounds such that each round t is a zero-sum game (A1, A2,Gt). The goal of ALG1 is to maximize the total reward, and the goal of ALG2 is to minimize the total cost. 7Regret bound (7.5) follows from (7.4) using a simple ?zeroing-out? trick: for a given round ? ? [T ], the adversary can set all future payoffs to some fixed value x ? [bmin, bmax], in which case RAOL(?) = RAOL(T ). 89 The game G is called stochastic if the game matrix Gt in each round t is drawn independently from some fixed distribution. For such games, we are interested in the expected game, defined by the expected game matrix G = E[Gt]. We can relate the algorithms? performance to the minimax value of G. Lemma 11. Consider a stochastic repeated zero-sum game between algorithms ALG1 and ALG2, with payoff range [bmin, bmax]. Assume that each ALGj, j ? {1, 2} is an algorithm for adversarial online learning, as per Figure 7.2, which satisfies regret bound (7.4) with R?(T ) = Rj,?(T ). Let ? be some fixed round in the game. For each algorithm ALGj, j ? {1, 2}, let Aj be its ac?tion set, let pt,j ? ?A be the distribution chosen in each round t,j and let p?j = 1 t?[? ] pt,j be the average play distribution at round ? . Let v ? be the ? minimax value for the expected game G = E[Gt]. Then for each ? > 0, with probability at least 1? 2? it holds that ( ? ) ?p2 ? ? p?T G p ? v? 1A2 1 2 ? (bmax ? bmin) R1, ?/T (T ) +R2, ?/T (T ) + 4 2T log(T/?) .? (7.7) Eq. (7.7) states that the average play of player 1 is approximately optimal against any distribution chosen by player 2.8 This lemma is well-known for the de- terministic case (i.e., when Gt = G for each round t), and folklore for the stochastic case. We provide a proof in Appendix 10.3.2 for the sake of completeness. 8If each player j chooses distribution pj ? ?A , and the game matrix is G, then expectedj reward/cost is pT1 Gp2. 90 7.5 A new algorithm for Stochastic BwK We present a new algorithm for Stochastic BwK, based on the framework of regret minimization in games. This is a very natural algorithm once the single-shot game is set up, and it allows for a very clean regret analysis. We will also use this algorithm as a subroutine for the adversarial version. On a high level, we define a stochastic zero-sum game for which a mixed Nash equilibrium corresponds to an optimal solution for a linear relaxation of the original problem. Our algorithm consists of two regret-minimizing algorithms playing this game. The framework of regret minimization in games guarantees that the average primal and dual play distributions (p?1 and p?2 in Lemma 11) approximate the mixed Nash equilibrium in the expected game, which correspondingly approximates the optimal solution. 7.5.1 Linear relaxation and Lagrange functions We start with a linear relaxation of the problem that all prior work relies on. This relaxation is stated in terms of expected rewards/consumptions, i.e., implicitly, in terms of the expected outcome matrix M = E[Mt]. We explicitly formulate the relaxation in terms of M, and this is essential for the subsequent developments. For ease of notation, we write the a-th row of M, for each action a ? [K], as M(a) = (rM(a); cM1 (a) , . . . , c M d (a)), 91 so that rM(a) is the expected reward and cMi (a) is the expected consumption of each resource i. Essentially, the relaxation assumes that each instantaneous outcome matrix Mt is equal to the expected outcome matrix M = E[Mt]. The relaxation seeks the best distribution over actions, focusing on a single round with budgets rescaled as B/T . This leads to the following linear program (LP): ? maximize ?a?[K] X(a) r M(a) such that ?a?[K] X(a) = 1 (7.8) ?i ? [d] X(a) cMa?[K] i (a) ? B/T ?a ? [K] 0 ? X(a) ? 1. We denote this LP by LPM,B,T . The solution X is the best fixed distribution over actions, according to the relaxation. The value of this LP, denoted OPTLP(M, B, T ), is the expected per-round reward of this distribution. It is also the total reward of X in the relaxation, divided by T . We know from Badanidiyuru et al. [33] that T ?OPTLP(M, B, T ) ? OPTDP ? OPTFD, (7.9) where OPTDP and OPTFD are the total expected rewards of, respectively, the best dynamic policy and the best fixed distribution. In words, OPTDP is sandwiched between the total expected reward of the best fixed distribution and that of its linear relaxation. Associated with the linear program LPM,B,T is the Lagrange function L = 92 LM,B,T . It is a function L : ?K ? Rd?0 ? R?defined as? ? ? ? L T(X, ?) := X(a) rM(a) + ? ?i 1? X(a) cMi (a)? . (7.10)B a?[K] i?[d] a?[K] The values ?1 , . . . , ?d in Eq. (7.10) are called the dual variables, as they correspond to the variables in the dual LP. Lagrange functions are meaningful due to their max- min property (e.g., Theorem D.2.2 in Ben-Tal and Nemirovski [36]): min max L(X, ?) = max minL(X, ?) = OPTLP(M, B, T ). (7.11) ??0 X??K X??K ??0 This property holds for our setting because LPM,B,T has at least one feasible solution (namely, one that puts probability one on the null action), and the optimal value of the LP is bounded. Remark 2. We use the linear program LPM,B,T and the associated Lagrange func- tion LM,B,T throughout this chapter. Both are parameterized by an outcome matrix M, budget B and time horizon T . In particular, we can plug in an arbitrary M, and we heavily use this ability throughout. For the adversarial version, it is essential to plug in parameter T0 ? T instead of the time horizon T . For the analysis of the high-probability result in Adversarial BwK, we use a rescaled budget B0 ? B instead of budget B. 7.5.2 Our algorithm: repeated Lagrangian game The Lagrange function L = LM,B,T from (7.10) defines the following zero-sum game: the primal player chooses an arm a, the dual player chooses a resource i, and 93 the payoff is a number L(a, i) = rM(a) + 1? T cMi (a). (7.12)B The primal player receives this number as a reward, and the dual player receives it as cost. This game is termed the Lagrangian game induced by LM,B,T . This game will be crucial throughout this chapter. The Lagrangian game is related to the original linear program as follows: Lemma 12. Assume one of the resources is the dummy resource. Consider the linear program LPM,B,T , for some outcome matrix M. Then the value of this LP equals the minimax value v? of the Lagrangian game induced by LM,B,T . Further, if (X, ?) is a mixed Nash equilibrium in the Lagrangian game, then X is an optimal solution to the LP. The proof can be found in Appendix 10.3. The idea is that because of the special structure of the LP, the second equality in (7.11) also holds when the dual vector ? is restricted to distributions. Consider a repeated version of the Lagrangian game. Formally, the repeated Lagrangian game with parameters B0 ? B and T0 ? T is a repeated zero-sum game between the primal algorithm that chooses among arms and the dual algorithm that chooses among resources. Each round t of this game is the Lagrangian game induced by the Lagrange function Lt := LMt,B0,T0 , where Mt is the round-t outcome matrix. Note that we use parameters B0, T0 instead of budget B and time horizon T . 9 9These parameters are needed only for the adversarial version. For Stochastic BwK we use B0 = B and T0 = T . 94 Remark 3. Consider repeated Lagrangian game for Stochastic BwK (with B0 = B and T0 = T ). The payoffs in the expected game are defined by the expected Lagrange function L := E[Lt]. By linearity, L is the Lagrange function for the expected outcome matrix M = E[Mt]: L := E[Lt] = LM,B,T . (7.13) Our algorithm, called LagrangeBwK, is very simple: it is a repeated Lagrangian game in which the primal algorithm receives bandit feedback, and the dual algorithm receives full feedback. To set up the notation, let at and it be, respectively, the chosen arm and resource in round t. The payoff is therefore Lt(at, it). It can be rewritten in terms of the observed outcome vector ot = (rt; ct,1 , . . . , ct,d) (which corresponds to the at-th row of the instantaneous outcome matrix Mt): Lt(at, it) = r + 1? T0 c T0t B t,i0 t ? [? + 1, 2]. (7.14)B0 Note that the payoff range is [bmin, b T0 max] = [? + 1].B0 With this notation, the pseudocode for LagrangeBwK is summarized in Al- gorithm 5. The pseudocode is simple and self-contained, without referring to the formalism of repeated games and Lagrangian functions. Note that the algorithm is implementable, in the sense that the outcome vector ot revealed in each round t of the BwK problem suffices to generate full feedback for the dual algorithm. 95 Algorithm 5: Algorithm LagrangeBwK for Stochastic BwK. input: parameters B0, T0, primal algorithm ALG1, dual algorithm ALG2. // ALG1, ALG2 are adversarial online learning algorithms // with bandit feedback and full feedback, respectively for round t = 1, 2, 3, . . . do 1. ALG1 returns arm at ? [K], algorithm ALG2 returns resource it ? [d]. 2. arm at is chosen, outcome vector ot = (rt(at); ct,1(at) , . . . , ct,d(at)) ? [0, 1]d+1 is observed. 3. The payoff Lt(at, it) from (7.14) is reported to ALG1 as reward, and to ALG2 as cost. 4. The payoff Lt(at, i) is reported to ALG2 for each resource i ? [d]. 7.5.3 Performance guarantees We consider algorithm LagrangeBwK with parameter T0 = T . We assume the existence of the dummy resource; this is to ensure that the crucial step, Eq. (7.20), works out even if the algorithm stops at time T , without exhausting any actual ? resources. We obtain a regret bound that is non-trivial whenever B > ?( T ), and is optimal, up to log factors, in the regime when min(OPTDP, B) > ?(T ). Theorem 6. Consider Stochastic BwK with K arms, d resources, time horizon T , and budget B. Assume that one resource is the dummy resource (with consumption B for each arm). Fix the failure probability parameter ? ? (0, 1). Consider algorithm T 96 LagrangeBwK with parameters B0 = B, T0 = T . If EXP3.P and Hedge are used as the primal and the dual algorithms, respec- tively, then the algorithm achieves the following regret bound, with probability at least 1? ?: ( T ? ) OPTDP?REW(LagrangeBwK) ? O TK log(dT/?) . (7.15) B In general, suppose each algorithm ALGj satisfies a regret bound (7.4) with R?(T ) = Rj,?(T ) and payoff range [b T min, bmax] = [? + 1, 2]. Then with probabilityB at least 1?O(?T ) it holds that ( ) ( ? ) OPTDP?REW(LagrangeBwK) ? O T RB 1, ?/T (T ) +R2, ?/T (T ) + T log(dT/?) . (7.16) Remark 4. To obtain (7.15) from the ?black-box? result (7.16), we use regret bounds in Eq. (7.6). Remark 5. From Badanidiyuru et al. [33], the optimal regret bound for Stochastic BwK is (? ? ) OPTDP?E[REW] ? O? K OPTDP (1 + OPTDP /B) . Thus, the regret bound (7.15) is near-optimal if min(OPTDP, B) > ?(T ), and non- ? trivial if B > ?( T ). We next prove the ?black-box? regret bound (7.16). For the sake of analysis, consider a version of the repeated Lagrangian game that continues up to the time horizon T . In what follows, we separate the ?easy steps? from what we believe is the crux of the proof. 97 Notation. Let Xt ?be the distribution chosen in round t by the primal algorithm ALG1. Let X := 1 ? t?[? ] Xt be the distribution of average play up to round ? . Let? M = E[Mt] be the expected outcome matrix. Let r = (rM(a) : a ? [K]) be the vector of expected rewards over the actions. Likewise, ci = (c M i (a) : a ? [K]) be the vector of expected consumption of each resource i ? [d]. Using Azuma-Hoeffding inequality. Consider the first ? rounds, for some ? ? [T ]. The average reward and resource-i consumption over these rounds are close to X? ? r and X? ? ci, respectively, with high probability. Specifically, a simple usage of Azuma-Hoeffding inequality (Lemma 28) implies that ? 1 r ? t ? X? ? r?R0(?)/?, (7.17) ?t?[? ] 1 c ? i,t ? X? ? ci +R0(?)/?, ?i ? [d], (7.18) t?[? ] ? hold with probability at least 1? ?, where R0(?) = O( ? log(d/?)). Regret minimization in games. Let us apply the machinery from regret min- imization in games to the repeated Lagrangian game. Consider the game matrix G of the expected game. Using Eq. (7.13) and Lemma 12, we conclude that the minimax value of G is v? = OPTLP(M, B, T ). We apply Lemma 11, with a fixed stopping time ? ? [T ]. Recall that the payoff range is b Tmax ? bmin = + 1. Thus, with probability at least 1? 2? it holdsB that T ? ? ?d : X? G? ? v? ? 1 ( T + 1) ? reg(T ), (7.19)? B ? where the regret term is reg(T ) := R1, ?/T (T ) +R2, ?/T (T ) + 4 2T log(T/?). 98 Crux of the proof. Let us condition on the event that (7.17), (7.18), and (7.19) hold for each ? ? [T ]. By the union bound, this event holds with probability at least 1? 3?T . Let ? denote the stopping time of the algorithm, the first round when the total consumption of some resource exceeds its budget. Let i be the resource for which this happens; hence, ? ci,t > B. (7.20) t?[? ] Let us use Eq. (7.19) with ? = ?(i), the point distribution for this resource. From Eq. (7.13), we have T X G?(i)? = X? ? r + 1? T X? ? ci.B Plugging in (7.17) and (7.18) we get ((? ) ( ? ) ) X? ? r + 1? T X ? c ? 1 r ? T TB ? i ? t?[? ] t B t?[? ] ci,t + ? ? (1 + )R (?) .B 0 Finally, plugging in Eq. (7.20) we get this to be upper-bounded by the following. ((? ) ) ? 1 t?[? ] r T ? t + ? ? T ? (1 + )R0(?) .B Plugging this into Eq. (7.19) and rearranging, we obtain ? rt ? ? v? + T ? ? ? (1 + T ) ? reg(T ).B t?[? ] Since v? ? 1 (because v? = OPTLP, as we?ve proved above), ? REW(LagrangeBwK) = rt ? T v? ? (1 + T ) ? reg(T ).B t?[? ] The claimed regret bound (7.16) follows by Eq. (7.9), completing the proof of The- orem 6. 99 7.6 A simple algorithm for Adversarial BwK We present and analyze an algorithm for Adversarial BwK which achieves O(d2 log T ) competitive ratio, in expectation, up to a low-order additive term. Our algorithm is very simple: we randomly guess the value of OPTFD and run LagrangeBwK with parameter T0 driven by this guess. The analysis is very differ- ent, however, since we cannot rely on the machinery from regret minimization in stochastic games. The crux of the analysis (Lemma 14) is re-used to analyze the high-probability algorithm in the next section. The intuition for our algorithm can be explained as follows. LagrangeBwK builds on adversarial online learning algorithms ALGj, and appears plausibly ap- plicable to Adversarial BwK. We analyze it for Adversarial BwK, with an arbitrary parameter T0 (see Lemma 14, the crux of our analysis), and find that it performs best when T0 is tailored to OPTFD up to a constant multiplicative factor. This is precisely what our algorithm achieves using the random guess. Our algorithm is presented as Algorithm 6. We randomly guess the value of OPTFD from within a specified range [gmin, gmax], up to the specified multiplicative factor of ? > 0. We consider multiplicative scales [?u, ?u+1], u ? N, and we guess uniformly at random among all possible u. Our analysis works as long as OPTFD ? [g , g ] and ? ? d + 1; then we obtain competitive ratio ?2 dlog gmaxmin max e up to agmin low-order additive term. As a corollary, we obtain competitive ratio ?2 ceil log T e with no assumptions. Theorem 7. Consider Adversarial BwK with K arms, d resources, time horizon T , 100 Algorithm 6: A simple algorithm for Adversarial BwK. input: scale parameter ? > 0, guess range [gmin, gmax], primal and dual algorithms ALG1, ALG2 // ALG1, ALG2 are adversarial online learning algorithms // with bandit feedback and full feedback, resp. 1 Choose u?uniforml?y at random from {0, 1 , . . . , umax}, where umax = log gmax? .gmin 2 Guess the value of OPTFD as g? = gmin ? ?u. 3 Run LagrangeBwK with algorithms ALG1, ALG2 and parameters B0 = B and T0 = g?/?. and budget B. Assume that one of the arms is a null arm that has zero reward and zero resource consumption. Consider Algorithm 6 with scale parameter ? ? d + 1. Suppose algorithms ALGj that satisfy the regret bound Eq. (7.4) with ? = T ?2 and regret term R?(T ) = Rj,?(T ), for any known payoff range [bmin, bmax]. If OPTFD ? [gmin, gmax] then the expected reward of Algorithm 6 satisfies( ? ?) E[REW] ? (OPT ?reg)/ ?2 log gmaxFD ? , (7.21) ( ) gmin where reg = (1 + OPTFD ) R1, ?/T (T ) +R2, ?/T (T ) . Taking [g?B min, gmax] = [1, T ], we obtain ( ) E[REW] ? (OPTFD?reg)/ ?2 dlog? T e . (7.22) Remark 6. One can use algorithms EXP3.P for ALG1 and Hedge for ALG2, with regret bounds given by Eq. (7.6), and achieve the regret term ( ) ? reg = O 1 + OPTFD TK log(Td/?). ?B 101 We obtain a meaningful performance guarantee as long as, say, reg < OPTFD /2; ? this requires OPTFD and B to be at least ??( TK). Remark 7. We define the outcome matrices slightly differently compared to Sec- tion 7.5 in that we do not posit a dummy resource. Formally, we assume that the null arm has zero consumption in every resource. This is essential for case 1 ( i.e., when ?alg ? ?) in the analysis of Lemma 14. If a problem instance of Adversarial BwK is actually an instance of adversar- ? ial bandits, then we recover the optimal O?( KT ) regret. (This easily follows by examining the proof of Lemma 14.) Lemma 13. Consider LagrangeBwK, with algorithms EXP3.P for ALG1 and Hedge for ALG2, for an instance of Adversarial BwK with zero resource consumption. This ? algorithm obtains O?( KT ) regret, for any parameters B0, T0 > 0. Accordingly, so does Algorithm 6 with any scale parameter ? > 0. 7.6.1 Analysis: proof of Theorem 7 and Lemma 13 Stopped linear program. Let us set up a linear relaxation that is suitable to the adversarial setting. The expected outcome matrix is no longer available. Instead, we use average outcome matrices: ? M = 1? Mt, (7.23)? t?[? ] the average up to a given intermediate round ? ? [T ]. Similar to the stochastic case, the relaxation assumes that each instantaneous outcome matrix Mt is equal to the 102 average outcome matrix M? . What is different now is that the relaxation depends on ? : using M? is tantamount to stopping precisely at this round. With this intuition in mind, for a particular end-time ? we consider the linear program (7.8), parameterized by the time horizon ? and the average outcome matrix M? . Its value, OPTLP(M? , B, ?), represents the per-round expected reward, so it needs to be scaled by the factor of ? to obtain the total expected reward. Finally, we maximize over ? . Thus, our linear relaxation for Adversarial BwK is defined as follows: [T ] OPTLP := max ? ?OPTLP(M? , B, ?) ? OPTFD . (7.24) ??[T ] The proof of Eq. (7.24) is similar to prior work (Badanidiyuru et al. [33], Devanur et al. [72]). Denote D? to be the set of all distributions over t?he arms such that for every p ? D? we have the following: for every i ? [d] we have t?[? ] p ? ct,i ? B. In other words, D? denotes the set of distributions whose expected?stopping time is at least ? . Thus it immediately follows that OPTLP(?) ? maxp?D? t?[? ] p ? rt since for any given p ? D? it is feasible to LP(?). Thus OPTLP(?) is at least the value of any feasible solution p ? D? . Note that for every fixed distribution p ? ?K , there exists a ? such that either p ? D? and ?p ?6 D?+1 or p ? DT . Moreover the total expected reward we can obtain using p is t?[? ] p ? rt. Thus max1???T OPTLP(?) ? OPTFD. Regret bounds for ALGj. Since each algorithm ALGj, j ? {1, 2} satisfies regret bound Eq. (7.4) with ? = T?2 and R?(T ) = Rj,?(T ), it also satisfies a stronger version (7.5) with the same parameters. Recall from Eq. (7.14) that the payoff range is [bmin, bmax] = [?T0 +1, 2]. For succinctness, let Uj(T |T0) = (1+ T0 )RB B j, ?/T (T ) 103 denote the respective regret term in Eq. (7.5). Let us apply these regret bounds to our setting. Let at ? [K] and it ? [d] be, resp., the chosen arm and resource in round t. We represent the outcomes as vectors over arms: rt, ct,i ? [0, 1]K denote, resp., reward vector and resource-i consumption vector for a given round t. Recall that the round-t payoffs in LagrangeBwK are given by the Lagrange function Lt := LMt,B,T0 such that Lt(a, i) = rt(a) + 1? T0 cB t,i(a) (7.25) for each arm a and resource i. Consider the total Lagrangian payoff at a given round ? ? [T ]: ? Lt(at, it) = REW? +? ?W? , (7.26) ? t?[? ] ? where REW? = t?[? ] rt(at) is the total reward up to round ? , andW = T0 ? B t?[? ] ct,it(at) is the consumption term. The regret bounds sandwich Eq. (7.26) from above and b?elow:? ? ? ? ? ?max Lt(a, it)?? U ? ?1(T |T0) ? REW? +? ?W? ? min Lt(at, i) + U2(T |T0). a?[K] i?[d] t?[? ] t?[? ] (7.27) This holds for all ? ? [T ], with probability at least 1 ? 2?. The first inequality in Eq. (7.27) is due to the primal algorithm, and the second is due to the dual algorithm. Call them primal and dual inequality, respectively. Crux of the proof. We condition on the event that Eq. (7.27) holds for all ? ? [T ], which we call the clean event. The crux of the analysis is encapsulated in 104 the following lemma, which analyzes an execution of LagrangeBwK with an arbitrary parameter T0 under the clean event. Lemma 14. Consider an execution of LagrangeBwK with B0 = B and an arbitrary parameter T0 such that the clean event holds. Fix an arbitrary round ? ? [T ], and consider the LP value relative to this round: f(?) := OPTLP(M?, B, ?). (7.28) The algorithm?s reward up to round ? satisfies REW? ? min(T0, ? ? f(?)? dT0)? ( U1(T |T0) + U2(T |T0) ) . (7.29) Taking ? to be the maximizer in Eq. (7.24), algorithm?s reward satisfies REW ? min(T0,OPTFD?dT0)? ( U1(T |T0) + U2(T |T0) ) . (7.30) Proof. Let ?alg be the stopping time of the algorithm. We consider two cases, de- pending on whether some resource is exhausted at time ?. In both cases, we focus on the round min(?alg, ?). Case 1: ?alg ? ? and some resource i?s exhausted. Let us focus on round ? = ?alg. If i is the exhausted resource, then t?[? ] ct,i(at) > B. Let us apply the dual inequality in Eq. (7.27) for this resource: ? REW? +? ?W? ? U2(T |T0) ? Lt(at, i) t?[? ] ? = REW +? ? T0? ct,i(at)B t?[? ] ? REW? +? ? T0. 105 It follows that W? ? T0 ? U2(T |T0). Now, let us apply the primal inequality in Eq. (7.27) for the null arm. Recall that the reward and consumption for this arm is 0, so Lt(null, it) = 1 for each round t. Therefore, ? REW? +? ?W? + U1(T |T0) ? Lt(null, it) = ?. t?[? ] We conclude that REW? ? W? ? U1(T |T0) ? T0 ? U1(T |T0)? U2(T |T0). Case 2: ?alg ? ?. Let us focus on round ?. Consider the linear program LPM ,B,?, and let X ? ? ?K be an optimal solution to this LP. The primal inequality? in Eq. (7.27) implies that ? REW? +? ?W? + U1(?) ? max Lt(a, it) a?[K] ? ?t?[?] ? X?(a) Lt(a, it) t?[?] a??[K] ? = ? + X? ? r T0 ?t ? X ? cB t,it t?[?] ? t?[?] REW? ? ? ? f(?)? T0 X? ? ct,it ? U1(T |T0). (7.31)B ? t?[?] In the last inequality we used the fact that ?t?[?] X ? rt = ? ? f(?) by optimality of X?.? ? t?[?] X ? ct,i ? B for each resource i, since X? is a feasible solution for OPTLP(M?, B, ?). Then, ? ? ? X? ? c ?t,it ? X ? ct,i ? dB. (7.32) t?[?] i?[d] t?[?] Plugging Eq. (7.32) into Eq. (7.31), we conclude that REW? ? ? ? f(?) ? dT0 ? U1(T |T0). 106 Conclusions from the two cases imply Eq. (7.30), as claimed. Wrapping up. If OPTFD lies in the guess range [gmin, gmax], then some guess g? is approximately correct: OPTFD /? ? g? ? OPTFD . With such a guess g?, and provided that ? ? d+ 1, we have T 20 = g?/? ? OPTFD /? , and OPTFD?dT0 ? OPTFD(1? d ) ? OPT /?.? FD So, by Lemma 14, the algorithm?s execution with this guess, assuming the clean event, satisfies Eq. (7.30) with min(T0,OPTFD?dT0) ? OPT /?2FD and T0 ? OPTFD /?. The regret term for this guess is U1(T |T0) + U (T |T ) ? (1 + OPTFD2 0 ) (R?B 1, ?/T (T ) +R2, ?/T (T )). To?complete?the proof of Theorem 7, we obtain a suitable guess g? with probability 1/ log gmax? .gmin Proof Sketch of Lemma 13. Recall that in the adversarial bandit setting we have ci,t = 0 for every i ? [d] and every t ? [T ]. We re-analyze Lemma 14 with ? = T . Notice t?hat case 1 never occurs. Thus we obtain obtain Eq. (7.31) in case 2. Note that T0 ?t?[?] X ? ct,it = 0 since cB i,t = 0. Therefore, we obtain REWT ? T ? f(T )? U1(T |T0). ? We now argue that T ? f(T )?= maxa?[K] t?[T ] rt(a). Let X ? be the optimal distribution over the arms. Thus X?t?[T ] ? rt = T ? f(T ). Note that since ci,t = 0 107 the only constraint on X??is that it lies in ?K . Therefore the maximizer is a point distribution on maxa?[K] t?[T ] rt(a). This proof does not rely on an(y speci)fic value? for B0, T0. The payoff range is [bmax, bmin] = [1, 2], so U1(T |T0) = O? KT . 7.7 High-probability algorithm for Adversarial BwK In this section we recover the O(log T ) approximation ratio for Adversarial BwK, but with high probability rather than merely in expectation. Our algo- rithm uses LagrangeBwK as a subroutine, and re-uses the adversarial analysis thereof (Lemma 14). We do not attempt to optimize the regret term. The algorithm is considerably more complicated compared to Algorithm 6. [T ] Instead of making one random guess g? for the value of OPTLP , we iteratively refine this guess over time. The algorithm proceeds in phases. In the beginning of each phase, we start a fresh instance of LagrangeBwK with parameter T0 defined by the current value of g?.10 We update the guess g? in each round (in a way specified later), and stop the phase once g? becomes too large compared to its initial value in this phase. We invoke LagrangeBwK with a rescaled budget B0 = B/? (log T ). Within each phase, we simulate the BwK problem with budget B0: we stop LagrangeBwK once the consumption of some resource in this phase exceeds B0. For the remainder of the phase, we play the null arm with probability 1??0 and do uniform exploration with the remaining probability, for some parameter ?0 ? (0, 1) (here and elsewhere, uniform exploration refers to choosing each action with equal probability). The 10The idea of restarting the algorithm in each phase is similar to the standard ?doubling trick? in the online machine learning literature, but much more delicate in our setting. 108 pseudocode is summarized in Algorithm 7. Algorithm 7: High-probability algorithm for Adversarial BwK. input: scale parameter ?, exploration parameter ?0, primal algorithm ALG1, dual algorithm ALG2 // ALG1, ALG2 are adversarial online learning algorithms // with bandit feedback and full feedback, resp. 1 Initialize g? = 1. 2 for each phase do 3 Start a fresh instance ALG of LagrangeBwK 4 with parameters B0 = B/2dlog? T e and T0 = g?/(dlog? T e?2). 5 for each round in this phase do 6 Recompute the global estimate g? 7 if g? > T0/? then start a new phase 8 if consumption of all resources in this phase does not exceed B0 then 9 Play the action chosen by ALG, observe the outcome and report it back to ALG. 10 else 11 Choose the null arm with probability 1? ?0, do uniform exploration otherwise To complete algorithm?s specification, let us define how to update the guess [t] g? in each round t. The guess, denoted g?t, is an estimate for OPTLP, as defined in (7.24). We form this estimate using a standard inverse propensity scoring (IPS ) 109 technique. Let pt and at be, resp., the distribution and the arm chosen by the primal algorithm in round t. The instantaneous outcome matrix Mt is estimated by matrix Mips ? [0,?)K?dt such that each row M ips t (a) is defined as follows: Mipst (a) := 1 1 {at=a} M (a).ft(a tt) For a given end-time ? , the average outcome matrix M ? ? from (7.23) is estimated as ips M? := 1 Mips. ? t t?[? ] Finally, we plug this estimate into (7.23) and define ips g?t := max ? ?OPTLP(M? , B, ?). (7.33) ??[T ] For the analysis, we will assume that the primal algorithm does some uniform ex- ploration: pt(a) ? ? > 0 for each arm a ? [K] and each round t ? [T ]. (7.34) Theorem 8. Consider Adversarial BwK with K arms, d resources, time horizon T , and budget B; assume B > 4T 3/4. Suppose that one of the arms is a null arm that has zero reward and zero resource consumption. Let ? > 0 be the failure probability parameter. Consider Algorithm 7 with parameters ? ? d+1 and ?0 = T?1/4. Assume that each algorithm ALGj, j ? {1, 2}, satisfies the regret bound (7.4) with payoff range [b Tmin, bmax] = [? + 1, 2] and regret term R?(T ) = Rj,?(T ). Assume that the primalB algorithm ALG1 satisfies (7.34) with parameter ? ? T?1/4. Then the total reward REW collected by Algorithm 7 satisfies [ ( ) ] Pr REW ? (OPTFD?reg)/ 2?4 dlog? T e ? 1?O(?T ), (7.35) 110 ( ) where the regret term is reg = T K T 3/4 log1/2(1) +R1, ?/T (T ) +RB ? 2, ?/T (T ) . Remark 8. Using algorithms EXP3.P for ALG1 and Hedge for ALG2, we can achieve (7.35) with ( ) ? reg = O TK T 3/4 log(T/?). B This is because EXP3.P, with appropriately modified unifo?rm exploration term ? = T?1/4, satisfies the regret bound (7.4) with R?(?) = O(T 3/4) K log T , and for Hedge ? we can (still) use 7.6. The theorem is meaningful whenever, say, reg < OPTFD /2. The latter requires OPT ?B > ??(T 7/4FD ).K Remark 9. Like in Theorem 7, we posit that the null arm does not consume any resources. Proof Sketch. The proof consists of several steps. First, we argue that the guess [t] g?t is close to OPTLP with high probability. This argument only relies on the uni- form exploration property (7.33) and the definition of IPS estimators, not on any properties of the algorithm. We immediately obtain concentration for the aver- age outcome matrices; a somewhat subtle point is to derive concentration on the respective LP-values. Next, we focus on a particular phase in the execution of the algorithm. We say that a phase is full if the stopping condition g?t > T0/? has fired. We focus on the last full phase. We prove there is enough reward to be collected in this phase. Essentially, letting ?1, ?2 be, resp., the start and end time of this phase, we consider the BwK problem restricted to time interval [?1, ?2], and lower-bound the LP-value of this problem in terms of the LP-value of the original problem. Finally, we use the 111 adversarial analysis of LagrangeBwK (Lemma 14) to guarantee that our algorithm actually collects that value. Because of the stopping condition g?t > T0/?, there can be at most dlog? T e phases. Therefore, rescaling the budget to B0/2dlog? T e guarantees that the al- gorithm consumes at most B/2 of the budget. We then argue that, with high- probability, the additional uniform exploration in each phase, consumes a budget of at most B/2 with high-probability. Thus, the algorithm never runs out of bud- get. We now describe the full proof of Theorem 8, following the plan outlined in the proof sketch. We decompose the analysis into several distinct pieces, present them one by one, and then show how to put them together. Each piece is presented as a lemma, with appropriate notation and intuition; some of the proofs are deferred to later in this section. Throughout, we assume that ( ) min{OPTFD, B} > ? KT 3/4 log T . (7.36)? This is without loss of generality, because otherwise the guarantee in Theorem 8 is vacuous. Recall that the primal algorithm ALG1 satisfies the uniform exploration property (7.33) with parameter ? ? T?1/4. Extended notation. To argue about a given phase, we extend some of our notation to refer to arbitrary time intervals, not just [1, ? ]. In what follows, fix time interval 112 [?1, ?2], and let ?? = ?2 ? ?1 + 1. Let ? 1 ?2 M[?1,?2] := Mt,?? t?=?1?2 ips 1 M ips[?1,?2] := M?? t t=?1 be, resp., the average outcome matrix and its IPS-estimate on this time interval. Define OBJ([?1, ?2]) := ?? ?OPTLP(M[?1,?2], B,??), (7.37) ips ? ipsOBJ ([?1, ?2]) := ?? OPTLP(M[?1,?2], B,??). (7.38) When ?1 = 1 we use the short-hand OBJ(? ips 2) and OBJ (?2). Recall that [? ] OPTLP := max OBJ(?). (7.39) ??[T ] g?? := max OBJ ips(?). (7.40) ??[T ] Uniform exploration does not exhaust budget. The uniform exploration in Algorithm 7 h?appens for at most ?0 T rounds in expectation, and therefore for at most ?0 T + 3 ?0T ln(1/?) rounds with probability at least 1 ? ?.11 It does not consume more than B/2 units of each resource, since ? = T?1/40 and B > 4T 3/4. IPS estimators are good. We argue that, essentially, the guess g?? is close to [? ] OPTLP with high probability. More precisely, we prove that OBJ(?) is close to its IPS estimator, for any given ? ? [T ]. We will denote the deviation term as ( ) K ? ?OBJ(?) DEV(?) := C0 ? log T , (7.41) ?B ? 11By an easy application of Chernoff-Hoeffding bounds (Lemma 29). 113 ( ? ) 7/4 for a sufficiently large absolute constant C0. Thus, DEV(?) ? O K T log T (:=B ? DEV(T )). In this notation, we characterize the IPS estimators as follows: Lemma 15. With probability at least 1? d?T it holds that ?? ? [T ] OBJips(?)? DEV(?) ? OBJ(?) ? OBJips(?) + DEV(?). (7.42) [? ] If the event (7.42) holds, then g?? and OPTLP are indeed close: ? ? ? ? ?? ? [? ]?? [T ] g?? OPTLP ? ? max DEV(t). (7.43) t?[? ] The proof only relies on the uniform exploration property (7.33) and the defini- tion of IPS estimators, not on anything that the algorithm does. A somewhat subtle point is to derive concentration on the respective LP-values from concentration of the average outcome matrices. IPS estimators do not change too fast. We use the stopping condition in the algorithm to argue that the IPS estimators OBJips(?) do not change too fast from one phase to another, in some specific sense. Namely, we compare OBJips(?) in the first round of any full phase to the guess in the last round of the next phase. Lemma 16. Consider a full phase in the execution of the algorithm. Let ? be the first round in this phase, and let ? ? be any round in the next phase. Then 1 ? OBJ ips(?) ? 1 K+ . (7.44) ?2 g?? ? ? ?g?? ? Proof. There exists a j ? {1, 2 , . . . , dlog? T e} such that OBJips(?) ? ?j. Since ? is the beginning of a phase, this implies that OBJips(? ? 1) < ?j?1. Observe that 114 in a single time-step the value of the estimate can increase by at most K . Thus ? OBJips(?) < ?j?1 + K . Moreover since ? ? belongs to the next phase, we have that ? ?j+1 ? g?? ? ? ?j+2. Putting these together we get that 1 ? OBJ ips(?) ? 1 K ? 1 K+ + . ?2 g?? ? ? ??j+1 ? ?g?? ? This Lemma relies only on the stopping condition in the algorithm, g? > T0/?, and the way the guess g?? is expressed in terms of the IPS estimators OBJ ips(?), as expressed in 7.40. It is irrelevant to this Lemma how the IPS estimators OBJips(?) are actually defined. Last full phase offers sufficient rewards. Recall that a phase in the execution of the algorithm is called full if the stopping condition g?t > T0/? has fired. We focus on the last full phase; let ?start, ?end denote the first and last time-steps of this phase. We prove there is enough reward to be collected in this phase. Let ? ? denote the maximizer in 7.24 which we interpret as the optimal stopping time. Essentially, we compare the LP value for the time interval [?start, ?end] with the LP value for the time interval [1, ? ?]. The former is expressed as OBJ([?start, ?end]) [T ] and the latter as OPTLP . Note that the time horizon T lies in the subsequent phase (so we can apply Lemma 16). Lemma 17. Consider a run of the algorithm such that event (7.42) holds. Then ( ) 1 1 [T ] OBJ([?start, ?end]) ? ? OPTLP ?O(DEV(T )). (7.45)? ?2 Adversarial analysis of LagrangeBwK. Let us plug in the adversarial analysis 115 of LagrangeBwK, as encapsulated in Lemma 14. We focus on the last full phase in the execution. We interpret it as an execution of algorithm LagrangeBwK with parameters B0, T0 on an instance of Adversarial BwK with budget B0 that starts at round ?start of the original problem. Let g? = g??start be the guess at the first round of the phase. Then the parameters are B0 = B/ratio and T = g?/(? 2 0 ? ratio), where ratio = dlog? T e. We apply Lemma 14 for round ? = ?end ? ?start + 1 in the execution of LagrangeBwK. Restated in our notation, f(?) in Lemma 14 becomes f(?) = OPTLP(M[?start,? ], B0, ?).end Thus, we obtain that with probability at least 1? ? we have ??end ( ) ? g? dg?REW rt(at) ? min , ?f(?)? ? reg(T ), (7.46)dlog? T e?2 dlog T e?2t=? ?start ( ) where the regret term is reg(T ) := (1 + T ) R (T ) +R B 1, ?/T 2, ?/T (T ) . Rescaling the budget. Since we use rescaled budget B0, it is important to con- nect the corresponding LP-values to those for the original budget B. We have the following general fact (observed in Agrawal and Devanur [11]): for any outcome matrix M, budget B, time horizon T , and rescaling factor ? ? (0, 1] it holds that OPTLP (M, ?B, T ) ? ? ?OPTLP (M, B, T ) . (7.47) This holds because an optimal solution ? to LPM,B,T , the vector ? ? is feasible to LPM,?B,T . Putting it all together. We defer the proofs of Lemma 15 and Lemma 17 to the 116 following subsections, and show how to complete the proof of Theorem 8 assuming these Lemmas hold. Proof of Theorem 8. Throughout this proof, let us condition on the high-probability events in Lemma 15 and 7.46. Moreover from 7.47 we have that ?f(?) ? 1 OBJ([?start, ?ratio end]) since B = B0 . Thus combining this with 7.46 we get,ratio ??end ( ) ? 1 g? dg?REW = rt(at) min , OBJ([?start, ?end])? ? reg(T ). (7.48) ratio ?2 ?2 t=?start Further from Lemma 17 with ? = d+ 1, we can re-write 7.48 as ( ( ) ) ? 1 g? 1 ? 1 [T ]? g?REW min , OPTLP ? reg(T ). (7.49)ratio ?2 ? ?2 ? Recall that g? for the phase [?start, ?end] is OBJ ips(?start). Moreover we have that T lies in the phase immediately after [?start, ?end]. Thus from 7.44 (first inequality below) and 7.43 (second inequality below) we have, g? ? 1 K ? 1 [T ]g?T + OPTLP +O(DEV(T )). (7.50)? ? ? Likewise we have, ? 1 1 [T ]g? OBJips(?start) ? g?T ? OPTLP ?O(DEV(T )), (7.51)?2 ?2 where the second inequality uses 7.44 and the last inequality uses 7.42. Plugging 7.50 and 7.51 back into 7.49 we get, 117 ( ) [T ] ( ) 1 OPT REW ? min LP 1 ? 2 [T ], OPT ratio ?4 ? ?2 LP ? reg(T )?O(DEV(T )). (7.52) [T ] And from 7.24 we have OPTLP ? OPTFD. Plugging this into 7.52 we get 7.35. 7.7.1 Proof of Lemma 17 (last full phase offers sufficient rewards) We first prove the following property of the optimal solution. Claim 1. Let T1 < T2 be any two time-steps. Then, OBJ([T1, T2]) ? OBJ(T2)? OBJ(T1 ? 1). (7.53) Proof. We prove this claim as follows. Let ?T2 denote the optimal solution to LPM ,B,T . Since ct,i(a) ? 0 for every a ? [K], t ? [T ] and i ? [d] we haveT2 2 ? ? ?T2 ? ct,i ? B =? ?T2 ? ct,i ? B. t?[T2] t?[T1?1] ? Thus ?T2 is feasible to LPM ,B,T ?1. This implies that ? ? r ?T ?1 1 t?[T1?1] T2 t1 OBJ(T1 ? 1). Likewise ?T2 is also feasible to LPM .[T ,T ],B,[T1,T1 2 2] Therefore we have, 118 ?T2 OBJ([T1, T2]) ? ?T2 ? rt t?=T1 ? = ?T2 ? rt ? ?T2 ? rt t?[T2] ?t?[T1?1] = OBJ(T2)? ?T2 ? rt t?[T1?1] ? OBJ(T2)? OBJ(T1 ? 1) Consider OBJ([?start, ?end]). From 7.53 we have OBJ([?start, ?end]) ? OBJ(?end)? OBJ(?start ? 1). (7.54) From Lemma 15 we can re-write 7.54 as OBJ(?end)? OBJ(?start ? 1) ? OBJips(?end)? OBJips(?start ? 1)?O(DEV(T )). (7.55) For some j ? dlog?(T )e, at time-step ?end the value g?t exceeds ?j for the first time. Likewise g?t exceeds ? j?1 for the first time at ?start and is smaller than this value at ? ? 1. This implies that OBJips(? ) ? ?j and OBJips(? ? 1) < ?j?1start end start . Therefore 7.55 simplifies to OBJips(? )? OBJipsend (? j j?1start ? 1)?O(DEV(T )) ? ? ? ? ?O(DEV(T )). (7.56) Combining 7.56, 7.55 and 7.54 we get OBJ([?start, ?end]) ? ?j ? ?j?1 ?O(DEV(T )). 119 Dividing throughout by g?T and noting that g?T ? ?j+1 we have, OBJ([? j?1start, ?end]) ? ? (?? 1) ? O(DEV(T )) g? ( ?j+1T ) g?T 1 = ? 1 ? O(DEV(T )) ? ?2 g?T Finally applying Lemma 15 to g?T we get 7.45. This completes the proof of Lemma 17. 7.7.2 Proof of Lemma 15 (IPS estimators are good) Recall that for every t ? [T ] and a ? [K] we have that pt(a), the probability that arm a is chosen at time t is at least ? . We now prove Lemma 2 which relates K linear sums of rewards and consumptions?computed using the unbiased estimates and the true values. Denote R?,?(?) := K 2? ln(T/?). ? Claim 2. Let ? > 0, ? > 0 used by the EXP3.P(?) be given parameters. Then we have the following sta?tements for a?ny fixed z ? ??K.??? ?? ? Pr??? ? [T ] ? ? ? z ? [r? ? r ]?t t ? > R ??,?(?) ? ? (7.57) ???t??[? ] ? ? ??? ? ?i ? [d] Pr??? ? [T ] ??? z ? [c?t,i ? ct,i(a)]??? > R ??,?(?) ? ? (7.58)t?[? ] Proof. The proof of this follows directly from the invocation of the Azuma-Hoeffding inequality. We will show this for Equation (7.57). Define Yt := z ? [r?t ? rt] (like-wise for the lower-tail use Yt := z ? (rt? r?t)). Note that this forms a martingale difference sequence since E[z ?(r?t?rt) | Ht?1] = z ? [rt ? rt] = 0. Here we used the fact that z is not random and fixed before the start of the algorithm. Also we have that |Yt| ? K .? 120 Using Lemma 28 and taking a union bound over all ? ? [T ] we have the desired equation. We will now prove the two inequalities in 7.42. We will first prove the first inequality in 7.42. ( ( ) ) ? R (?)Let ? denote the optima(l so?lution to)OPT M ?,?LP ? , B 1? , ? . NoteB this is valid whenever B > ? K ? log T . From Equation (7.58) we have that ? ? with probability at least 1? ? for every i ? [d], ? ? ?? ? c?t,i ? ?? ? ct,i +R?,?(?). t?[? ] t?[? ] ? B (7.59) ? 7.59 used the fact that ?t?[? ] ? ? ct,i ? B (1?R?,?(?)). Using Equation (7.57), we have that with probability at least 1? ?, ? ? ?? ? r ? ??t ? r?t +R?,?(?). t?[? ] t?[? ] Using the fact that, ? ( ( ) ) ? ? R?,?(?)? rt = OPTLP M? , B 1? , ? , B t?[? ] we have the following. ( ( ) ) ? R ? ?,?(?) OPTLP M? , B 1 , ? ?R?,?(?) ? ?? ? r?t. (7.60) B t?[? ] From 7.59 we have that ?? is feasible to OPTipsLP (?) and from 7.60 this implies that 121 ( ( ) ) ips ? R?,?(?)OBJ (?) OPTLP M? , B 1? , ? ?R?,?(?). (7.61) B Finally fro(m 7.47 w(e have ) ) ( ) ? R?,?(?) ? ? R?,?(?)OPTLP M? , B 1 , ? 1 OBJ (?) . (7.62) B B From 7.61 and 7.62 we have ( ) OBJ(?) OBJips(?) ? OBJ(?)?R? ?,?(?) 1?+( ? ,? B ) ? KOBJ(?) T =O ? log ?B ? which gives the lower-tail in 7.42. We will now prove the second in(equality(in 7.42 in)a si)milar fashion. Let ??? denote the optimal solution to OBJips ? R?,?(?)M? , B 1 , ? .B From Equation (7.58) we have that with probability at least 1 ? ? for every i ? [d], ? ? ??? ? ct,i ? ??? ? c?t,i +R?,?(?). t?[? ] t?[? ] ? B (7.63) ? 7.63 used the fact that ?t?[? ] ?? ? c?t,i ? B (1?R?,?(?)). Using Equation (7.57), we have that with probability at least 1? ?, ? ? ??? ? r?t ? ??? ? rt +R?,?(?). t?[? ] t?[? ] From the fact that ? ( ( ) ) ??? ? R?,?(?)r?t = OPTipsLP M? , B 1? , ? ,B t?[? ] 122 we get the following. ( ( ) ) ips ? R?,?(?) ? OPTLP M? , B 1 , ? ?R?,?(?) ? ??? ? rt. (7.64)B t?[? ] From 7.63 we have that ???j is feasible to OPTLP (?) and from 7.64 this implies that ( ( ) ) ips R?,?(?)OPTLP M? , B 1? , ? ? OBJ (?) +R?,?(?). (7.65)B Finally from 7.47 we have ( ( ) ) ( ) ips R?,?(?) R?,?(?)OPTLP M? , B 1? , ? ? 1? OBJips (?) . (7.66)B B Combining 7.66 and 7.65 we get, ips ? R?,?(?)OBJ (?) OBJ(?) + (OBJ(?) +R?,?(?)) . (7.67) B ?R?,?(?) Since B0 > 2R?,?(?) we get, OBJips ? 2R?,?(?)(?) OBJ(?) + ? (OBJ?(??) +R?,?(?)?),B ( ? ) KOBJ(?) T =O ? log ?B ? and thus we get the upper-tail in 7.42. We will now prove 7.43. Recall that g? ips? := maxt?[? ] OBJ (t). Moreover [? ] OPTLP = maxt?[? ] OBJ(t). Consider g?? ? [? ]OPTLP . We have, 123 ? [? ] ips ? [? ]g?? OPTLP = max OBJ (t) OPTLP t?[? ] ? [? ]max(OBJ(t) + DEV(t))?OPT t? LP[? ] ? [? ]max OBJ(t) + max DEV(t)?OPT ? LPt [? ] t?[? ] = max DEV(t). t?[? ] [? ] Now consider OPTLP ?g?? . We have, [? ] [? ] OPTLP ?g?? ? OPTLP ?max(OBJ(t)? DEV(t)) t?[? ] ? [? ]OPTLP ?max OBJ(t) + max DEV(t) t?[? ] t?[? ] = max DEV(t). t?[? ] This completes the proof of Lemma 15. 7.8 Lower bounds We prove the lower bounds on the competitive ratio that we have claimed in sub-section 7.3.1: the ?(log T ) lower bound w.r.t. the best fixed distribution benchmark (OPTFD), the ?(T ) lower bound w.r.t. the best dynamic policy bench- mark (OPTDP), and the ?(K) lower bound w.r.t. the best fixed arm benchmark (OPTFA). As a warm-up, we analyze the simple example from sub-section 7.3.1 that leads to the 5 lower bound w.r.t. OPTFD. All lower-bounds are for a randomized4 algorithm against an oblivious adversary. We summarize all these lower bounds in the following theorem: 124 Theorem 9. Consider Adversarial BwK with a single resource (d = 1) and K arms. Consider any randomized algorithm for this problem, and let REW denote its reward. Then: (a) OPTFD /E[REW] ? 5 ? o(1) for some problem instance (from the example in4 the Introduction). (b) OPTFD /E[REW] ? ?(log T ) for some problem instance. (c) OPTDP /E[REW] ? T/B2 for some problem instance, for any given budget B. (d) OPTFA /E[REW] ? ?(K) for some problem instance. Remark 10. The lower bounds for parts (a,b,c) hold (even) for problem instances with K = 2 arms. The lower bounds in parts (a,b) hold even for a much more permissive feedback model from the online packing literature, namely, when the al- gorithm observes the outcome vector for all actions in a given round, and moreover does it before it chooses an arm in this round. We tweak our construction from Theorem 9(c) to obtain a strong lower bound for the contextual version of Adversarial BwK (a.k.a. Adversarial cBwK ), as studied in Section 7.9.3. This lower bound implies that Adversarial cBwK is essentially ? hopeless in the regime B < T , complementing a strong positive result (Corollary 3) ? for the regime B > ??( T ). It is proved in Section 7.8.3, along with Theorem 9(c). Theorem 10. Consider adversarial contextual bandits with knapsacks (Adversarial cBwK), with policy class ?, a single resource (d = 1), K = 2 arms, and any given 125 ? budget B < T . Consider any randomized algorithm for this problem, and let REW denote its reward. Then OPT 2FD(?)/E[REW] ? T/B for some problem instance. Notation. In the proof of lower-bounds below, we use the following notation. Given an instance I, we denote OPTFD(I), OPTFA(I) and OPTDP(I) to denote the optimal value of the best fixed distribution, best fixed arm and best dynamic policy respectively, for instance I [T ]. Likewise let OPTLP (I) denote the optimal LP value for instance I and given an algorithm A and an instance I, let E[REW(A, I)] denote the expected reward obtained by A on instance I, where the expectation is over the internal randomness of the algorithm. 7.8.1 Warm-up: example from the Introduction As a warm-up, let us recap and analyze the example from the Introduction. Construction 2. There are two arms and one resource with budget B = T . Arm 2 1 has zero rewards and zero consumption. Arm 2 has consumption 1 in each round, and offers reward 1 in each round of the first half-time (T rounds). In the second 2 2 half-time, arm 1 offers either reward 1 in all rounds, or reward 0 in all rounds. More formally, there are two problem instances, call them I1 and I2, that coincide for the first half-time and differ in the second half-time. Lemma 18. Any algorithm suffers OPTFD /E[REW] ? 5 ? o(1) on some instances4 in Construction 2. 126 The intuition is that given a random instance as input the algorithm needs to choose how much budget to invest in the first half-time, without knowing what comes in the second, and any choice (in expectation) leads to the claimed competitive ratio. To prove Lemma 18 (as well we the main lower bound in Theorem 9(b)) we [T ] compare algorithm?s performance t(o OPTLP , and inv)oke the following lemma:? [T ] [T ] Lemma 19. OPTFD ? OPTLP ?O OPT ? log dTLP .B [T ] Proof. Let ? ? denote the time-step at which OPTLP is maximiz?ed. Let p denote the optimal solution to ? ? ?OPTLP(M?? , B(1? ), ? ?) where  = log dT . Note thatB OPTFD is at least as large as the expected total reward obtained by the distribution p. From the Chernoff-Hoeffding bounds (Lemma 29), with probability at least 1? dT?2 we have ? ?i ? [d] p ? ct,i ? B. t?[??] Conditioning on this event the expected total reward obtained by p is ? p ? rt = ? ? ?OPTLP(M?? , B(1? ), ? ?). t?[??] Thus the expected total reward obtained by p is at least ? ? ? OPTLP(M?? , B(1 ? ), ? ?). 12 Moreover from Eq. (7.47) we have that OPTFD ? ? ? ?OPTLP(M ??? , B(1? ), ? ) ? (1? )? ? ?O(PTLP(M ???? , B(1?)), ? ) ? [T ]? [T ] log dTOPTLP O OPTLP .B 12With probability T?2 we assume that p has an expected reward of 0. 127 Proof of Lemma 18. Denote the two arms by A1 and A0 where A0 denotes the null arm. The consumption for arm A1 is 1 for all rounds in both I1 and I2. Thus the only difference between the two instances is the rewards obtained for playing arm A1 in each round. The instances have two phases where each phase lasts for T 2 rounds. In phase 1, in both I1 and I2 playing arm A1 fetches a reward 1 . In the2 second phase, in I1, the reward for playing arm A1 is 0 while in I2 the reward for playing arm A1 is 1. Thus the outcome matrix for the first T time-steps is the same 2 in instances I1 and I2. Consider a randomized algorithm A. Let ?1 be the expected number of times arm A1 is played by A in the first T rounds on instances I2 1 and I2. Note since the outcome matrix is same, the expected number of times the arm is played should be same in both the instances. Let ?2,1, ?2,2 denote the expected number of times arm A1 is played in the second phase in instances I1 and I2 respectively. Recall that in this section we are interested in a lower-bound on the com- [T ] petitive ratio OPTFD /E[REW] for every instance. Consider OPTLP (I1), the opti- mal value of the b(est fixed d)istribution on I1. Using Eq. (7.24) with ? = T this2 equals T ? [T ]O( PTLP M T )T , B, which evaluates to T . Likewise OPTLP (I2) equals2 2 4 2 T ? OPTLP MT , B, T , which evaluates to 3T . Consider the performance of A on8 I1. We have, [T ] OPTLP (I ) ( ) ( ) 1 T ?1 E A I ? / . (7.68)[REW( , 1)] 4 2 Likewise on I2 we have, [T ] OPT ( ) ( )LP (I2) ? 3T / ?1E A I + ?[REW( , )] 8 2 2,2 . (7.69)2 128 Thus the competitive ratio of A is at least the maximum of the ratios in Eq. (7.68) and Eq. (7.69). Thus we want to minimize this maximum and is achieved when the two ratios are equal to each other. Notice that the term ?2,1 does not appear in Eq. (7.68) and Eq. (7.69). By setting the term in Eq. (7.68) equal to the term in Eq. (7.69) and re-arranging, ?1 = 4?2,2. (7.70) Moreover we have ?1 + ?2,2 ? B. Combining this with Eq(. ()7.7(0) w)e get ? ? 4B = 2T and the corresponding competitive ratio is at least T / ?11 ? 5 .5 5 4 2 4 By Lemma 19 with d = 1, for every j ? [2], ( ? ) [T ] I A I ? 5 ? OPTOPT ( )/E[REW( , )] O LP log TFD j j . 4 T B 7.8.2 The main lower bound: proof of Theorem 9(b) To obtain the ?(log T ) lower bound in Theorem 9(b), we extend Construction 2 to one with ?(log T ) phases rather than just two. As before, the algorithm needs to decide how much budget to save for the subsequent phases; without knowing whether they would bring higher rewards or nothing. The construction is as follows, see Figure 7.3 for a pictorial representation: Construction 3. There is one resource with budget B, and two arms, denoted A0, A1. Arm A0 is the ?null arm? that has zero reward and zero consumption. The consumption of arm A1 is 1 in all rounds. The rewards of A1 are defined as follows. We partition the time into T phases of duration B each (for simplicity, assume that B 129 [ ] B divides T ). We consider T problem instances; for each instance I , ? ? T? armB B A1 has positive rewards up to and including phase ? ; after that all rewards are 0. In each phase ? ? [? ], arm A1 has reward ?/T in each round. 1 ? ? + 1 Arm 1 ? 2? . . . ?? 0 . . . Consumption=1 Null Arm 0 0 . . . 0 0 . . . Consumption=0 Time, in TB phases of B rounds each Figure 7.3: The lower-bounding construction for the ?(log T ) lower bound. The lower bound holds for a wide range of budgets B, as expressed by the following lemma: Lemma 20. Fix an absolute constant ? ? (0, 1), and budget B in the interval [?(log3 T ), O(T 1??)]. Then any algorithm suffers OPTFD /E[REW] ? ?(log T ) for some problem instance in Construction 3. In the rest of this subsection we prove Lemma 20. Fix any randomized algo- rithmA. As before in this sub-section we are interested in the ratio OPTFD /E[REW(A)]. We argue that it has the claimed competitive ratio on at least one instance I? in the construction 3. The proof proceeds in two parts. We first argue about the solution structure of the optimal distribution for the construction 3 (we prove this formally in Lemma 21). Next we characterize the expected number of times arm A1 is played if A optimal algorithm in each of the phases. Combining the two we get Lemma 20. 130 Structure of the optimal solution. Define OPTLP(M?? , B, ? ?) to be the optimal value of LP 7.8 on the instance I? . Then we have the following Lemma. Lemma 21. For a given instance I? we have the following. 1. The optimal stopping time ? ? = B? . ? 2. OPTLP(M ? 1 ?? , B, ? ) ? ? t?[B? ]P(t). Proof. Note that (2) follows from (1). Indeed, given that the stopping time is B? , the optimal?solution is to set X(1) = 1 and X(0) = 1 ? 1 thus obtaining a total ? ? reward of 1 t?[B? ]P(t). Thus it remains to prove (1).? First it is easy to prove that ? ? ? B? . Since there are no rewards after time-step ? ?, we have ? ? ?t? 1 1> 0 OPTLP(M??+t? , B, ? ? + t?) = ? P(t) < P(t).? + t ? t?[??] t?[B? ] Now we will argue that the optimal stopping time cannot be strictly lesser than ? ?. To do so, first we argue that for two stopping times t1 < t2 within the same phase, the maximum objective is achieved for the stopping time t2. This implies that the optimal stopping time has to be the last time step of some phase. Consider times t1 < t2 such that P(t1) = P(t2) = ? . Then we want to claim that ??? ?B ? ?B ?? ?P(t) ? P(t)? . t1 t t? 2[t1] t?[t2] For contradiction assume the inequality does not hold. Then we have the following. ? ? ? P t ? 1 (t) > ? P(t)? . t2 t?[t1] t?[t2] 131 ? ? Note that t?[B(??1)]P(t) = Bt? = B(??1)? t??[? ] . Thus we have2 ? P B(? ? 1)?(t) = + (t1 ?B(? ? 1))?, 2 t??[t1] P B(? ? 1)?(t) = + (t2 ?B(? ? 1))?. 2 t?[t2] Therefore we have, B(? ? 1)? + (t1 ?B(? ? t1B(? ? 1)? 1))? > + (t2 ?B(? ? 1))?. 2 2t2 Further re-arranging, B(??1) > t2. This is a contradiction since t2 is in phase2 ? , so t2 ? B(? ? 1). Next we argue that the optimal value is achieved when the stopping time is in the last non-zero rewards phase. Consider two phases ?1 < ?2. Thus the ending times are B?1 and B?2. To prove that the optimal value increases by stopping at B?2, as opposed to B?1, we want to show that 1 ? 1 ? Bt ? Bt. ?1 ?? 2t [?1] t?[?2] As before assume for a contradiction that this doesn?t hold. Then re-arranging we get, ?1(?1+1) > ?1(?2+1) , which implies ?1 > ?2, contradiction. We conlude that2 2 the stopping time is ? ? = B? . Expected behavior of the optimal algorithm. Consider any randomized algo- rithm A. The performance of A is then as follows. [T ] OPTLP E ? B? ?OPTLP(MB? , B,B?) max . (7.71) [REW(A)] 1???T/B E[REW(A)] 132 Consider two consecutive instances I? and I?+1. The outcome matrix in the phases 1, 2 , . . . , ? look identical in both these instances. Thus the expected number of times arm A1 is chosen by algorithm A in phases 1, 2 , . . . , ? is identical. Let ?? denote the expected number of times A plays arm A1 in phase ? on instances I1, I2, , . . . , I? . The expected number of times arm A1 is played in phase ? on instances I?+1, I?+2 , . . . , IT/B is set arbitrarily since this term does not appear in the objective function (i.e., E[REW(A)]). Thus the maximum of the ratios in Eq. (7.71) occurs when they are equal for all values ? . Therefore we want for every 1 ? ? ? T/B, ?? ?1 1? j?[? ] Bj ?+?1 j?[?+1] Bj= . (7.72) j?[? ] j?j j?[?+1] j?j Also note that ?1 + ?2 + . . .+ ?T/B ? B, since the total budget is B and the consumption is 1 in every round for arm A1 and 0 for arm A0. We will prove that these lead to the following recurrence for the maximizing values of ?j. ? ? 1j 2 ?j = ?1. (7.73) 2j We will prove the recurrence Eq. (7.73) via induction. The base case is when j = 2. Plugging in ? = 1 in Eq. (7.72) we have 1 3/2 = . ?1 ?1 + 2?2 This implies that ? 12 = ?1. Therefore we are done. Now consider the inductive4 case; let all ? up to ?? satisfy the recurrence Eq. (7.73). Consider the instance I? and I?+1. From Eq.?(7.72) we have, ? 1 1 ? ?j?[? ] jB ??+1 j?[?+1] jB= . ?1 + ? ? j=2 ?1/2 ?1 + j=2 ?1/2 + (? + 1)??+1 133 Rearranging, we get that ??+1 = 1 ?1. This completes the inductive step.2(?+1) Using the fact that ?1(1 + 1/4 + 1/6 + . . . + B/2T ) ? B we have that ?1 ? B where H(n) denotes the nth Harmonic number. The val(ue)of Eq.((7.71))for2H(T/B) ? = 1 is the same at all other ? and has a value of B ? 2H T ? ? log T ? ?1 B B ? (log T ). The second last inequality uses the well-known fact that ? + log n ? H(n) ? ? + lo[g(n + 1) for an abso]lute constant ? and the last inequality uses the fact that B ? ?(log3 T ), O(T 1??) . [ ] Therefore we have that for every instance j ? T , B [T ] OPTLP (Ij) ? ?(log T ). E[REW(A, Ij)] Thus combining this with Lemma 19 we obta(in ) OPT 1.5FD ? log T?(log T )?O ? . ( )E[REW(A)] B Since B ? ? log3 T , we get that OPTFDE[REW(A ? ?(log T ).)] 7.8.3 Best dynamic policy: proof of Theorem 9(c) Consider the following construction of the lower-bound example. Construction 4. There is one resource with budget B, and two arms, denoted A0, A1. Arm A0 is the ?null arm? that has zero reward and zero consumption. The consumption of arm A1 is 1 in all rounds. The rewards of A1 are defined as follows. We partition the time into T phases of duration B each (for simplicity, assume that B B divides T ). We consider T problem instances; for each instance I? , ? ? [T/B]B arm A1 has 0 reward in all phases except phase ? ; in phase ? it has a reward of 1 in each round. 134 Lemma 22. Consider Construction 4 with any given time horizon T ? 2 and budget ? B ? T . Let ALG be an arbitrary randomized algorithm for BwK. Then for one of the problem instances, OPTDP /E[REW] ? T/B2. (7.74) Proof. Let n = T be the number of phases in Construction 4. Let ALG be a B deterministic algorithm. Let REW denote its total reward, and let E? [?] denote the expectation over the uniform-at-random choice of the problem instance I? . We claim that OPTDP /E? [REW] ? T/B2. (7.75) Assume that ALG maximizes E? [REW] (over deterministic algorithms). Then it satisfies the following: ? Within each phase, if ALG ever chooses to play arm A1, it does so in the first round of the phase. If it receives a reward of 1 in this round, it plays A1 for the rest of the phase. Else, it never plays A1 for the rest of this phase. For each ? ? [n], let ?? denote the number of times ALG chooses arm A1 in phase ? in problem instance I? . The expected reward of A?LG over the uniform-at-random choice of the problem instance I? is E[REW] = 1 i?[n] ?i. Letn (??(1), ??(2) , . . . , ??(k)) be the subsequence of (?1 , . . . , ?n) which contains all elements with non-zero values. The key observation is as follows. The problem instances I?(??1) and I?(?) are identical until phase ?(? ?1)?1. Since the feedback received by ALG until the first 135 time it chooses armA1 in phase ?(??1) is identical, it follows that ??(??1)???(?) = 1. Therefore, ? ? ?i = ??(i) = k ? ??(1) ? k(k ? 1) . 2 i?[n] i?[k] Noting that ?1 ? B and k ? min(B, n) = B , we have: ? E[REW] ? 1 ? 2 3 n i < B /n = B /T. i?[n] Since OPTDP = B for every problem instance I? , Eq. (7.75) holds for ALG, and therefore for any other deterministic algorithm. By Yao?s lemma (Motwani and Raghavan [148]), for every randomized algorithm ALG there exists a problem in- stance I? such that (7.74) holds. We now use the same construction to prove Theorem 10. ? Proof sketch of Theorem 10. We prove the Theorem by contradiction. Let B ? T . For contradiction, consider an algorithm ALG for cBwK on a policy set ? such that OPTFD(?)/E[REW(ALG)] < T/B2. We will now use ALG to construct an algorithm A for the Construction 4 such that OPTDP /E[REW(A)] < T/B2 for every instance. This contradicts Lemma 22. Consider a policy set ? with |n| policies. Every policy ? ? ? maps contexts in the range [1, T ] to the action set {A1, A0}. In particular, a policy ?? ? ? maps contexts that lie in the range [B ? (? ?1)+1, B ? ? ] to arm A1 and all other contexts to A0. A invokes ALG as a sub-routine with the policy set ?. At each time-step t, A gives the context xt = t to ALG and plays the arm chosen by ALG. For an instance I? in Construction 4, OPTFD(?) is the total reward obtained by choosing the action given by ?? in all time-steps. The total reward obtained is B, 136 which equals OPTDP(I? ). Therefore, OPTFD(?)/E[REW(ALG)] < T/B2 implies we have OPT 2DP /E[REW(A)] < T/B for every instance I? , which is a contradiction. 7.8.4 Best fixed arm: proof of Theorem 9(d) We use the following construction for the lower-bound. Construction 5. There is one resource with budget B, and K arms denoted by A1, A2 , . . . , AK. Arm AK is the ?null arm? that has zero reward and zero con- sumption. There are K instances in the family. In instance Ij, all arms Aj? where j? > j have 0 reward and 0 consumption in all time-steps. Consider an instance Ij for some j ? [K ? 1] and an arm j? ? j. Arm Aj? has a reward of 1K?j? and con-K sumption of 1 in all time-steps in phase j? and has a reward of 0 and consumption of 0 in every other time-step. Thus the rewards and consumption are bounded in the interval [0, 1] for every arm and every time-step in all instances in this family. Lemma 23. Let T ? 2, 2 ? B ? T , K ? 3 be given parameters of the AdversarialBwK problem. We show that there exists a family of instances with d = 1 shared resource such that for every randomized algorithm A we have OPTFAE A is at least ?(K) on[REW( )] one of these instances. Proof. First note that the best fixed arm in instance Ij is to pick arm Aj which yields a total reward of B KK?j . Consider a randomized algorithm A. Observe that in the first j phases, the instances Ij?1 and Ij have identical outcome matrices. Thus the expected number 137 of times any arm Ak for k ? [K] is chosen in phases {1, 2 , . . . , j} should be the same in both the instances. Let ?k denote the expected number of times arm k is played by A in phase k on instances Ik, Ik+1 , . . . , I 13K?1 . Moreover we have that ?1 + ?2 , . . . , ?K?1 ? B. To show the lower-bound we want to minimize the competitive ratio on every instance for all possible values of ?1, ?2 , . . . , ?K?1. For ease of no?tation denote r := 1j K?j . Let ?B denote the set of values to {?k}k?[K?1] such thatK k?[K?1] ?k ? B. Thus, OPTFA E A ? ? rkBmin . (7.76)[REW( )] ?B j?[k] rj?j The ratio is minimized when all ratios in Eq. (7.76) are equal. We will show via induction that this yields the following(recurrenc)e, ? ? ? rk?1k 2 ?k = 1 ?1. (7.77) ?rk Combining this with the condition that k?[K?1] ?k ? B, this yields the condition ?1 ? B 1 . Moreover the minimizing value in Eq. (7.76) is K ? 1 which K? K K proves Lemma 23. We will now prove the recurrence Eq. (7.77). Consider the base case with k = 2. Equalizing the first two terms in Eq. (7.76) we get r1B r2B = . r1?1 r(1?1 + r2)?2 Re-arranging we obtain that ?2 = 1? r1 ?1. We will now prove the induc-r2 tive case. Let the recurrence be true for all 1 ? k ? k?. Consider the case k = k?+1. 13This has to be the same in all instances since the outcome matrix is identical until phase k in all these instances 138 Setting the k? and k? + 1 ratios in Eq. (7.76) equal, we obtain ? rk?B ? rk?+1B= . (7.78) j?[k?] rj?j j?[k?+1] rj?j ( ) r Moreover from the inductive hypothesis we have ?j = 1? j?1 ?r 1 for everyj j ? k?. Thus we have ? rj?j = rk??1 ?j?[k?] rj?j = rk??1 + rk?+1?k?+1. j?[k?+1] Plugging this back in Eq. (7.78) we get rk?B rk?+1B = . rk??1 ( rk??1 +)rk?+1?k?+1 ? r ? Rearranging we get ? ? = 1 kk +1 ?1. This completes the induction.rk?+1 7.9 Extensions We obtain several extensions which highlight the modularity of LagrangeBwK: we apply Theorem 6 and Theorem 7 with appropriately chosen primal algorithm ALG1, and immediately obtain strong performance guarantees. 14 We tackle four well-known scenarios: ? full feedback (e.g., Arora et al. [21], Freund and Schapire [88], Littlestone and Warmuth [130]): in each round, the algorithm chooses an action and observes 14For these theorems to hold, ALG1 needs to satisfy regret bound (7.4) only against adaptive adversaries that arise in the repeated Lagrange game in the corresponding extension, not against arbitrary adaptive adversaries. 139 the outcomes of all possible actions; this is a classic scenario in online machine learning. ? combinatorial semi-bandits (e.g., Audibert et al. [23], Gyo?rgy et al. [101], Kale et al. [113]): actions are feasible subsets of ?atoms?. The atoms in the chosen action have individual outcomes that are observed and add up to the action?s total outcome. Typical motivating example are subsets/lists of news articles, ads, or web search results. ? contextual bandits with policy sets (e.g., Agarwal et al. [5], Dud??k et al. [80], Langford and Zhang [127]): before each round, a context is observed, and the algorithm competes against the best policy (mapping from context to actions) in a given policy class. In a typical application scenario, the context includes known features of the current user. ? bandit convex optimization (starting from Flaxman et al. [86], Kleinberg [120], with recent advances Bubeck et al. [44, 45], Hazan and Levy [105]). Here the set of actions is a convex set X ? RK . For each round t, the adversary chooses a concave function ft : X ? [0, 1] such that the reward for chosen action x ? X is ft(x). Formalities. To simplify the statements, we make the following assumptions with- out further mention: ? The dual algorithm, ALG2, is always Hedge, with the associated regret bound from 7.6. For high-probability regret bounds, ? = 1 is a fixed and known T 140 failure probability parameter. ? For Stochastic BwK, one resource is the dummy resource (with consumption B for each arm). Algorithm LagrangeBwK is run with parameters B0 = B andT T0 = T . ? For Adversarial BwK, one of the arms is a null arm that has zero reward and zero resource consumption. Algorithm 6 is run with scale parameter ? ? d+1 and range [gmin, gmax] = [1, T ]. A typical corollary. All our corollaries have the following shape, for some regret term reg: (C1) In the stochastic version, algorithm LagrangeBwK achieves, with probability at least 1? 1 , T ( ) OPTDP?REW ? O T ? reg .B (C2) In the adversarial version, Algorithm 6 achi(eves ) 1 1 E[REW] 1?O(reg) +OPTFD ?B? . OPT ?2FD dlog? T e Corollaries similar to (C2) can be achieved for Algorithm 7, too; we omit them for ease of exposition. 7.9.1 BwK with full feedback In the full-feedback version of BwK, the entire outcome matrix Mt is revealed to the algorithm after each round t. Accordingly, we can use Hedge as the primal 141 algorithm ALG1. The effect is, essentially, that the dependence on K, the number ? of arms, in the regret term becomes logarithmic rather than K. Corollary 1. Consider BwK with full feedback. Using Hedge as t?he primal algo- rithm, we obtain corollaries (C1) and (C2) with regret term reg = T ln (dKT ). Adversarial BwK with full feedback have not been studied before. For the stochastic version, the regret bound is unsurprising: one expects to obtain a similar improvement with each of the three other algorithms in the prior work on Stochastic BwK by tracing the ?confidence terms? through the analysis. The significance here is that we obtain this result as an immediate corollary. 7.9.2 Combinatorial Semi-Bandits with Knapsacks Following Sankararaman and Slivkins [163], we consider Combinatorial Semi- BwK, a common generalization of BwK and combinatorial semi-bandits (e.g., Au- dibert et al. [23], Gyo?rgy et al. [101], Kale et al. [113]). In this problem, actions correspond to subsets of some finite ground set ? of size n, whose elements are called atoms. There is a fixed family F ? 2? of feasible actions. For each round t, there is an outcome vector o ? [0, 1 ]d+1t,e for each round atom e ? ?, with then same semantics as the actions? outcome vectors. If an action S ? ? is chosen, the outcome vectors ot,?e are observed for all atoms e ? S, and the action?s outcome is the sum M (S) = dt e?S ot,e ? [0, 1] . In the adversarial case, all outcome vectors ot,e, t ? [T ], e ? ? are chosen by an adversary arbitrarily before round 1. In the stochastic case, the atomic outcome matrix (ot,e : e ? ?) is drawn independently in 142 each round t from some fixed distribution. Combinatorial semi-bandits, as studied previously, is a special case with no resource constraints (d = 0). Typical motivating examples involve ad/content personalization scenarios. Atoms can correspond to items news articles, ads, or web search results, and actions are subsets that satisfy some constraints on quantity, relevance, or diversity of items. One can also model ranked lists of atoms: then atoms are rank-item pairs, and each feasible action S ? ? satisfies a constraint that each rank between 1 and |S| is present in exactly one chosen rank-item pair. A naive?application of our main results suffers from regret terms that are pro- portional to |F|, which may be exponential in the number of atoms n. Instead, the work on combinatorial semi-bandits features regret bounds that scale polynomi- ally in n. This is what we achieve, too. We use an algorithm from Neu and Barto?k [152] which solves combinatorial semi-bandits in the absence of resource constraints. This algorithm satisfies a hi?gh-probability regret bound (7.4) against an adaptive adversary, with R?(T ) = O( nT log(1/?)). 15 Corollary 2. Consider Combinatorial Semi-BwK with n atoms. Using the algo- rithm from Neu and Barto?k [152] as the primal algorithm, we obtain corollaries ? (C1) and (C2) with regret term reg = nT log T . The adversarial version of Combinatorial Semi-BwK has not been studied be- fore. 15Prior work (Neu and Barto?k [152], Sankararaman and Slivkins [163]) posits that atoms? per- round rewards/consumptions lie in the range [0, 1], rather than [0, 1n ], so their stated regret bounds should be recaled accordingly. 143 The stochastic version has been studied in Sankararaman and Slivkins [163] when the action set is a matroid, achieving regret ( ? ? ? ) O? OPTDP n/B + T/n+ OPTDP . ? This regret bound becomes O?( nT ) in the regime when B and OPTDP are ?(T ) (see Footnote 15). We achieve the same regret bound for this regime, without the matroid assumption and without any extra work. However, the regret bound in Sankararaman and Slivkins [163] can be substantially better than ours when OPTDP  T . 7.9.3 Contextual Bandits with Knapsacks Following Agrawal et al. [14], Badanidiyuru et al. [31], we consider Contextual Bandits with Knapsacks (cBwK ), a common generalization of BwK and contextual bandits with policy sets (e.g., Agarwal et al. [5], Dud??k et al. [80], Langford and Zhang [127]). The only change in the protocol, compared to BwK, is that in the beginning of each round t a context xt ? X arrives and is observed by the algorithm before it chooses an action. The context set X is arbitrary and known. In the adversarial version (Adversarial cBwK ) both xt and the outcome matrix Mt is chosen by an adversary. In the stochastic version (Stochastic cBwK ) the pair (xt,Mt) is chosen independently from some fixed but unknown distribution over such pairs. In cBwK one is also given a finite set ? of policies : deterministic mappings from contexts to actions.16 Essentially, the algorithm competes with the best course 16W.l.o.g. assume that ? contains all constant policies, i.e., all policies that always evaluate to 144 of action restricted to these policies. For a formal definition, let us interpret cBwK as a BwK problem with action set ?, denote this problem as BwK(?). In other words, actions in BwK(?) are policies in cBwK. An algorithm for BwK(?) is oblivious to context arrivals. It chooses a policy ?t ? ? in each round t, and receives an outcome for this policy: namely, the outcome for action ?(xt). We are interested in the usual benchmarks for this problem, the best algorithm OPTDP and the best fixed distribution OPTFD (where both benchmarks are constrained to use policies in ?); denote them OPTDP(?) and OPTFD(?), respectively. Without budget constraints (i.e., with B = T ), this is exactly contextual bandits with policy set ?. Both benchmarks then reduce to the standard benchmark of the best fixed policy. Background: algorithm EXP4.P. We use EXP4.P (Beygelzimer et al. [40]), an algorithm for the contextual version of adversarial online learning with bandit feedback. The algorithm operates according to the protocol in Figure 7.4. We are interested in regret bounds for EXP4.P relative to the best fixed policy: ? OPT? = max??? t?[T ] ft(?(xt)). For each round t, the pair (xt,gt) induces a payoff vector ft ? [b ?min, bmax] on policies: ft(?) = gt(?(xt)) ?? ? ?. Theorem 11 (Beygelzimer et al. [40]). Fix failure probability ? > 0, policy set ?, and payoff range [bmin, bmax]. Then algorithm EXP4.P (appropriately tuned) satisfies the same action. 145 Given: action set [K], context set X , policy set ?, payoff range [bmin, bmax]. In each round t ? [T ], 1. the adversary chooses a context xt ? X and a payoff vector gt ? [b Kmin, bmax] ; 2. the algorithm chooses a distribution pt over ? (without seeing xt); 3. a policy ?t ? ? is drawn independently from pt; 4. algorithm?s chosen action is defined as at = ?t(xt) ? [K]; 5. payoff gt(at) is received by the algorithm. Figure 7.4: Adversarial contextual bandits the following regret bound: [ ? ] Pr OPT? ? t?[T ] ft(?t) ? (bmax ? bmin)R?(T ) ? 1? ?, (7.79) (? ) with regret term R?(T ) = O ?K log(KT |?|/?) . Our solution for cBwK. We solve cBwK by reducing it to BwK(?), and treating it as a BwK problem. A naive solution simply posits |?| arm?s and directly applies the machinery developed earlier in this paper. This results in |?| dependence in regret bounds, which is unsatisfactory, as the policy set may be very large. Instead, we use EXP4.P as the primal algorithm (ALG1). We interpret EXP4.P as an algorithm for (non-contextual) adversarial online learning, with action set ?. It is easy to see that Theorem 11 provides regret bound (7.4) under this interpretation. Therefore, 146 we obtain the following: Corollary 3. Consider contextual bandits with knapsacks, with policy set ?. Using EXP4.P as th?e primal algorithm, we obtain corollaries (C1) and (C2) with regret term reg = TK ln (dKT |?|). The benchmarks are OPTDP = OPTDP(?) and OPTFD = OPTFD(?). Adversarial cBwK has not been studied before. The regret bound for the ? adversarial case is meaningful only if B > T . This is essentially inevitable in light of the lower bound in Theorem 10. Stochastic cBwK has been studied in Agrawal et al. [14], Badanidiyuru et al. [31], achieving regret bound O(reg)(1 + OPTDP(?)/B), (7.80) where the reg term is the same as in Corollary 3. Whereas the regret bound from Corollary 3 is O(reg ? T/B). Note that we match (7.80) in the regime OPTDP(?) > ?(T ). Our regret bound is optimal, up to logarithmic factors, in the regime B > ?(T ).17 Discussion. Our algorithms are slow, as the per-round running time of EXP4.P is proportional to |?|. The state-of-art approach to computational efficiency in contex- tual bandits is oracle-efficient algorithms, which make only a small number of calls to an oracle that finds the best policy in ? for a given data set. In particular, prior work for Stochastic cB(wK (A?grawal et al. [14]) )obtains an oracle-efficient algorithm 17This is due to the min T, ?( KT log(|?|)/ log(K) regret bound, which holds for contextual bandits Agarwal et al. [4]. 147 with regret bound as in (7.80). To obtain oracle-efficient algorithms for cBwK in our framework, both for the stochastic and adversarial versions, it suffices to replace EXP4.P with an oracle-efficient algorithm for adversarial contextual bandits that obtains regret bound (7.79), possibly with a larger regret term R?. Such algorithms almost exist: a recent breakthrough (e.g., Rakhlin and Sridharan [156], Syrgkanis et al. [173, 174]) obtains algorithms with similar regret bounds, but only for expected regret. 7.9.4 Bandit Convex Optimization with Knapsacks We consider Bandit Convex Optimization with Knapsacks (BCOwK ), a com- mon generalization of BwK and bandit convex optimization. We define BCOwK as a version of BwK, where the action set X is a convex subset of RK . For each round t, there is a concave function ft : X ? [0, 1] and convex functions gt,i : X ? [0, 1], for each resource i, so that the reward for choosing action x ? X in this round is ft(x) and consumption of each resource i is gt,i(x). In the stochastic version, the tuple of functions (ft; gt,1 , . . . , gt,d) is sampled independently in each round t from some fixed distribution (which is not known to the algorithm). In the adversarial version, all these tuples are chosen by an adversary before the first round. Neither stochastic nor adversarial version of BCOwK have been studied in prior work (but see the discussion of constrained online convex optimization in Section 7.2). Bandit convex optimization, as studied previously, is a special case with no resource constraints (d = 0). The primal algorithm ALG1 in LagrangeBwK faces an instance of BCO (with 148 an adaptive adversary). This is because the Lagrange function (7.10) is a concave function of the action, as sum of concave functions. For our primal algorithm, we use a recent breakthrough on BCO due to Bubeck et al. [45]. This algorithm satisfies the high-probability regret bound (7.4) against an adaptive adversary, with regret term ? R?(T ) = O(K 9.5 log7(T ) T log(1/?)). We assume the existence of a null arm: a point x ? X such that ft(x) = gt,i(x) = 0 for each resource i except the ?dummy resource?. (Recall that we posit the ?dummy resource? ? a resource whose consumption is B/T for each arm ? for the stochastic version.) Unlike elsewhere in this paper, this assumption is not without loss of generality: indeed, the null arm should be ?embedded? into X without breaking the convexity/concavity properties. Moreover, we assume that the null arm lies in the interior of X . Corollary 4. Consider BCOwK for a given convex set X ? RK. Using the algo- rithm from Bubeck et al. [45] as the primal algorithm, we obtain corollaries (C1) ? and (C2) with regret term reg = K9.5 log7.5(T ) T . Remark 11. LagrangeBwK framework extends to infinite action sets: everything carries over, as long as 7.11 holds. (Essentially, we never take union bounds over actions, and we can replace max and sums over actions with sup and integrals.) For BCOwK, 7.11 is a statement about constrained convex optimization programs. According to Slater?s condition (see Eq. (5.27) in Boyd and Vandenberghe [41]), it suffices to have a point x in the interior of X such that gt,i(x) < B/T for each 149 resource i ? [d] other than the dummy resource (or any other resource whose con- sumption is the same in all rounds). One such point is the null arm. 150 Chapter 8: Better Bounds for Combinatorial Semi-bandits with Knap- sacks 8.1 Introduction In this chapter we consider one extension of Stochastic BwK, namely Combi- natorial Semi-Bandits with Knapsacks. We obtain better bounds for this problem than what was achieved in Chapter 71. This extension combines two lines of work related to bandits: on bandits with knapsacks (BwK) (Badanidiyuru et al. [33]) and on combinatorial semi-bandits (Gyo?rgy et al. [101]). In combinatorial semi-bandits, actions correspond to subsets of some ?ground set?, rewards are additive across the elements of this ground set (atoms), and the reward for each chosen atom is revealed (semi-bandit feedback). A paradigmatic example is an online routing problem, where atoms are edges in a graph, and actions are paths. 8.1.1 Formal Model Our model, called Semi-Bandits with Knapsacks (SemiBwK) is a generalization of multi-armed bandits (henceforth, MAB) with i.i.d. rewards. As such, in each round t = 1, 2, . . . , T , an algorithm chooses an action St from a fixed set of actions 1This work was done and published before that of Chapter 7. 151 F , and receives a reward ?t(St) for this action which is drawn independently from a fixed distribution that depends only on the chosen action. The number of rounds T , a.k.a. the time horizon, is know2. There are d resources being consumed by the algorithm. The algorithm starts out with budget Bj ? 0 of each resource j. All budgets are known to the algorithm. If in round t action S ? F is chosen, the outcome of this round is not only the reward ?t(S) but the consumption Ct(S, j) of each resource j ? [d]. We refer to Ct(S) = (Ct(S, j) : j ? [d]) as the consumption vector.3 Following prior work on BwK, we assume that all budgets are the same: Bj = B for all resources j. 4 Algorithm stops as soon as any one of the resources goes strictly below 0. The round in which this happens is called the stopping time and denoted ?stop. The reward collected?in this last round does not count; so the total reward of the algorithm is REW = t ?(T ), the? regret bound becomes O?(T n/B+ nT ), where n is the number of actions, which coincides with the lower bound from Badanidiyuru et al. [33]. Combinatorial semi- bandits is a special case with B = nT . If all feasible subsets contain at most k atoms, ? we have OPTDP ? kT , and the regret bound becomes O?( knT ). This coincides with ? the ?( knT ) lower bound from Kveton et al. [124]. Our main result assumes that the action set, i.e., the family of feasible subsets of atoms, is described by a matroid constraint.6 This is a rather general scenario which includes many paradigmatic special cases of combinatorial semi-bandits such as cardinality constraints, partition matroid constraints, and spanning tree con- ? straints. We also assume that B > ??(n+ nT ). Our model captures several application scenarios, incl. dynamic pricing, dy- namic assortment, repeated auctions, and repeated bidding. We work out these applications, and explain how our regret bounds improve over prior work. 6Matroid is a standard notion in combinatorial optimization which abstracts and generalizes linear independence. 155 8.2.2 Challenges and Techniques Generic challenges in combinatorial semi-bandits concern handling exponen- tially many actions (both in terms of regret and in terms of the running time), and taking advantage of the additional feedback. And in SemiBwK, one needs to deal with distributions over subsets of atoms, rather than ?just? with distributions over actions. Our algorithm connects a technique from BwK and a randomized rounding technique from combinatorial optimization. (With three existing BwK algorithms and a wealth of approaches for combinatorial optimization, choosing the techniques is a part of the challenge.) We build on a BwK algorithm from Agrawal and Devanur [11], which combines linear relaxations and a well-known ?optimism-under-uncertainty? paradigm. A generalization of this algorithm to SemiBwK results in a fractional solution x, a vector over atoms. Randomized rounding converts x into a distribution over feasible subsets of atoms that equals x in expectation. It is crucial (and challenging) to ensure that this distribution contains enough randomness so as to admit concentration bounds not only across rounds, but also across atoms. Our analysis ?opens up? a fairly technical proof from prior work and intertwines it with a new argument based on negative correlation. We present our algorithm and analysis so as to ?plug in? any suitable random- ized rounding technique. This makes our presentation more lucid, and also leads to faster running times for important special cases. 156 8.2.3 Solving SemiBwK using prior work. Solving SemiBwK using an algorithm for BwK would result in a regret bound like (8.1) with n replaced with the number of actions. The latter could be on the order of nk if each action can consist of at most k atoms, or perhaps even exponential in n. SemiBwK can be solved as a special case of a much more general linear-contextual extension of BwK from Agrawal and Devanur [11, 12]. In their model, an algorithm takes advantage of the combinatorial structure of actions, yet it ignores the ad- ditional feedback from the atoms. Their regret bounds have a worse dependence on the parameters, and apply for a much more limited range of parameters. Fur- ther, their per-round running time is linear in the number of actions, which is often prohibitively large. To compare the regret bounds, let us focus on instances of SemiBwK in which at most one unit of each resource is consumed in each round. (This is the case in all our motivating applications, except repeated bidding.) Then Agrawal and Devanur ? ? ? [11, 12] assume B > nT 3/4, and achieve regret O?(n T OPTDP +n2 T ). 7 It is easy B to see that we improve upon the range and upon both summands. In particular, we ? ? 7Agrawal and Devanur [11, 12] state regret bound with term +n T rather than +n2 T , but they assume that per-round rewards lie in [0, 1]. Since per-round rewards can be as large as n in our setting, we need to scale down all rewards by a factor of n, apply their regret bound, and then ? scale back, which results in the regret bound with +n2 T . When per-round consumption can be as ? ? large as n, regret bound from Agrawal and Devanur [11, 12] becomes O?(n2 OPT 2DP T/B+n T ) due to rescaling. 157 ? improve both summands by the factor of n n in a lucid special case when B > ?(T ) and OPTDP < O(T ). 8 8.3 Additional Related Work Combinatorial semi-bandits were studied by Gyo?rgy et al. [101], in the ad- versarial setting. In the i.i.d. setting, in a series of works by Anantharam et al. [19], Chen et al. [65], Combes et al. [68], Gai et al. [90, 91], Kveton et al. [126], an optimal algorithm was achieved. This result was then extended to atoms with linear rewards by Wen et al. [187]. Kveton et al. [124] obtained improved results for the special case when action set is described by a matroid. Some other works studied a closely related ?cascade model?, where the ordering of atoms matters (Katariya et al. [117], Kveton et al. [125], Zong et al. [189]). Contextual semi-bandits have been studied in Krishnamurthy et al. [123], Wen et al. [187]. Randomized rounding schemes (RRS) come from the literature on approxi- mation algorithms in combinatorial optimization (see Papadimitriou and Steiglitz [153], Williamson and Shmoys [188] for background). RRS were introduced in Raghavan and Tompson [154]. Subsequent work Asadpour et al. [22], Chekuri et al. [59, 60], Gandhi et al. [92] developed RRS which correlate the rounded random variables so as to guarantee sharp concentration bounds. 8In prior work on combinatorial bandits (without constraints), semi-bandit feedback improves ? regret bound by a factor of n, see the discussion in Kveton et al. [126]. 158 8.4 Preliminaries 8.4.1 Combinatorial constraints. Action set F is given by a combinatorial constraint, i.e., a family of subsets. Treating subsets of atoms as n-dimensional binary vectors, F corresponds to a finite set of points in Rn. We assume that the convex hull of F forms a polytope in Rn. In other words, there exists a set of linear constraints over Rn whose set of feasible integral solutions is F . We call such F linearizable; the convex hull is called the polytope induced by F . Our main result is for matroid constraints, a family of linearizable combinato- rial constraints which subsumes several important special cases such as cardinality constraints, partition matroid constraints, spanning tree constraints and transversal constraints. Formally, F is a matroid if it contains the empty set, and satisfies two properties: (i) if F contains a subset S, then it also contains every subset of S, and (ii) for any two subsets S, S ? ? F with |S| > |S ?| it holds that S ? ?{a} ? F for each atom a ? S \ S ?. We incorporate prior work on randomized rounding for linear programs. Con- sider a linearizable action set F with induced polytope P ? [0, 1]n. The randomized rounding scheme (henceforth, RRS) for F is an algorithm which inputs a feasible fractional solution x ? P and the linear equations describing P , and produces a random vector Y over F . We consider RRS?s such that E[Y] = x and Y is nega- tively correlated (see below for definition); we call such RRS?s negatively correlated. 159 Several such RRS are known: e.g., for cardinality constraints and bipartite match- ing (Gandhi et al. [92]), for spanning trees (Asadpour et al. [22]), and for matroids (Chekuri et al. [59]). 8.4.2 Negative correlation. Let X = (X1, X2, . . . ,?Xm) denote a family of random variables which take values in [0, 1]. Let X := 1 mi=1Xi be the average, and ? := E[X].m Family X is called negati[vely cor]related if? ? [ E X] i ? E[Xi] ?S ? [m] (8.2)? i?S ?i?S E (1?Xi) ? E[1?Xi] ?S ? [m] (8.3) i?S i?S Independent random variables satisfy both properties with equality. For in- tuition: if X1, X2 are Bernoulli and (8.2) is strict, then X1 is more likely to be 0 if X2 = 1. Negative correlation is a generalization of independence that allows for similar concentration bounds, i.e., high-probability upper bounds on |X ??|. However, our analysis does not invoke them directly. Instead, we use a concentration bound given a closely related property: [? ] E X 1 |S|i ? ( ) ?S ? [m]. (8.4)2 i?S Theorem 12 (Impagliazzo and Kabanets [110]). If (8.4), then for some absolute constant c, Pr[X ? 1 2+ ?] ? c ? e?2m? (?? > 0) (8.5) 2 160 8.4.3 Confidence radius. We bound deviations |X ? ?| in a way that gets sharper when ? is small, without knowing ? in advance. (We use the notation X , X, ? as above.) To this end, we use the notion of confidence radius from Agrawal and Devanur [11], Babaioff et al. [28], Badanidiyuru et al. [33], Kleinberg et al. [121]9: ? Rad?(x,m) = ?x/m+ ?/m. (8.6) If random variables X are independent, then event |X ? ?| < Rad?(X,m) < 3 Rad?(?,m) (8.7) happens with probability at least 1 ? O(e??(?)), for any given ? > 0. We use this notion to define upper/lower confidence bounds on the mean rewards and mean resource consumption. Fix round t, atom a, and resource j. Let ??t(a) and C?t(a, j) denote the empirical average of the rewards and resource-j consumption, resp., between rounds 1 and t ? 1. Let Nt(a) be the number of times atom a has been chosen in these rounds (i.e., included in the chosen actions). The confidence bounds are defined as C?t (a, j) = proj( C?(a, j)? Rad?(C?(a, j), Nt(a)) ) ??t (a) = proj ( ??(a)? Rad?(??(a), Nt(a)) ) (8.8) where proj(x) := arg miny?[0,1] |y ? x| denotes the projection into [0, 1]. We always use the same parameter ? = cconf log(ndT ), for an appropriately chosen absolute 9For instance Theorem 2.1 in Badanidiyuru et al. [33] 161 constant cconf. We suppress ? and cconf from the notation. We use a vector notation ??t and C ? t (j) to denote the corresponding n-dimensional vectors over all atoms a. By (8.7), with probability 1?O(e??(?)) the following hold. ?(a) ? [??t (a), ?+t (a)] C(a, j) ? [C?(a, j), C(a, j)+] 8.4.4 Matroid constraints Recall that in SemiBwK, we have a finite ground set whose elements are called ?atoms?, and a family F of ?feasible subsets? of the ground set which are the actions. To be consistent with the literature on matroids, the ground set will be denoted E. Family F of subsets of E is called a matroid if it satisfies the following properties: ? Empty set: The empty set ? is present in F ? Hereditary property: For two subsets X, Y ? E such that X ? Y , we have that Y ? F =? X ? F ? Exchange property: For X, Y ? F and |X| > |Y |, we have that ?e ? X \ Y : Y ? {e} ? F Matroids are linearizable, i.e., the convex hull of F forms a polytope in RE. (Here subsets of F are interpreted as binary vectors in RE.) In other words, there exists a set of linear constraints whose set of feasible integral solutions is F . In 162 fact, the convex hull of F , a.k.a. the matroid polytope, can be represented via the following linear system: x(S) ? rank(S) ?S ? E (LP-Matroid) ? x(e) ? [0, 1]E ?e ? E. Here x(S) := e?S x(e), and rank(S) = max{|Y | : Y ? S, Y ? F} is the ?rank function? for F . F is indeed the set of all feasible integral solutions of the above system. This is a standard fact in combinatorial optimization, e.g., see Theorem 40.2 and its corollaries in Schrijver [168]. We will now describe some well-studied special cases of matroids. That they indeed are special cases of matroids is well-known, we will not present the corre- sponding proofs here. In all LPs presented?below, we have variables x(e) for each arom e ? E, and we use shorthand x(S) := e?S x(e) for S ? E. Cardinality constraints. Cardinality constraint is defined as follows: a subset S of atoms belongs to F if and only if |S| ? K for some fixed K. This is perhaps the simplest constraint that our results are applicable to. In the context of SemiBwK, each action selects at most K atoms. The corresponding induced polytope is as follows: x(E) ? K (LP-Cardinality) x(e) ? [0, 1] ?e ? E. Partition matroid constraints. A generalization of cardinality constraints, called partition matroid constraints, is defined as follows. Suppose we have a collection 163 B1, B2, . . . , Bk of disjoint subsets of E, and numbers d1, d2, . . . , dk. A subset S of atoms belongs to F if and only if |S ? Bi| ? di for every i. Partition matroid constraints appear in several applications of SemiBwK such as dynamic pricing, ad- justing repeated auctions, and repeated bidding. In these applications, each action selects one price/bid for each offered product. Also, partition matroid constraints can model clusters of mutually exclusive products in dynamic assortment applica- tion. The induced polytope is as follows: x(Bi) ? di ?i ? [k] (LP-PartitionMatroid) x(e) ? [0, 1] ?e ? E. 8.5 Main algorithm (SemiBwK-RRS) Let us define our main algorithm, called SemiBwK-RRS. The algorithm builds on an arbitrary RRS for the action set F . It is parameterized by this RRS, the polytope P induced by F (represented as a collection of linear constraints), and a number  > 0. In each round t, it recomputes the upper/lower confidence bounds, as defined in (8.8), and solves the following linear program: maximize ?+t ? x subject to C?(j) ? x ? B(1?) , j ? [d] (LPt ALG)T x ? P This linear program defines a linear relaxation of the original problem which is ?optimistic? in the sense that it uses upper confidence bounds for rewards and lower 164 confidence bounds for consumption. The linear relaxation is also ?conservative? in the sense that it rescales the budget by 1? . Essentially, this is to ensure that the algorithm does not run out of budget with high probability. Parameter  will be fixed throughout. For ease of notation, we will denote B := (1 ? )B henceforth. The LP solution x can be seen as a probability vector over the atoms. Finally, the algorithm uses the RRS to convert the LP solution into a feasible action. The pseudocode is given as Algorithm 8. Algorithm 8: SemiBwK-RRS input: an RRS for action set F , induced polytope P (as a set of linear constraints),  > 0. 1 for t = 1, 2 , . . . , T do 1. Recompute Confidence Bounds as in (8.8) 2. Obtain fractional solution x ? [0, 1]nt by solving the linear program LPALG. 3. Obtain a feasible action St ? F by invoking the RRS on vector xt. 4. Semi-bandit Feedback: observe the rewards/consumption for all atoms a ? St. If action set F is described by a matroid constraint, we can use the negatively correlated RRS from Chekuri et al. [59]. In particular, we obtain a complete al- gorithm for several combinatorial constraints commonly used in the literature on semi-bandits, such as partition matroid constraints, spanning trees. 165 Theorem 13. Consider the SemiBwK problem with a linearizable action set F that admits a negatively correlated RRS. Then algorithm SemiBwK-RRS with this RRS achieves expected regret bound at most ? ( ? ? ) O(log(ndT )) n OPTDP / B + T + OPTDP . (8.9) Here T is the time horizon, n is the number of atoms, and B is the budget. We ? require B > 3(?n+ ?nT ), where ? = ?(log(n?dT )) is the parameter in confidence? radius. Parameter  in the algorithm is set to ?n + ?n + ?nT . B B B Corollary 5. Consider the setting in Theorem 13 and assume that the action set F is defined by a matroid on the set of atoms. Then, using the negatively correlated RRS from Chekuri et al. [59], we obtain regret bound (8.9). Running time of the algorithm. The algorithm does two computationally in- tensive steps in each round: solves the linear program (LPALG) and runs the RRS. For matroid constraints, the RRS from Chekuri et al. [59] has O(n2) running time. Hence, in the general case the computational bottleneck is solving the LP, which has n variables and O(2n) constraints. Matroids are known to admit a polynomial- time seperation oracle (e.g., see Schrijver [168]). It follows that the entire set of constraints in LPALG admits a polynomial-time separation oracle, and therefore we can use the Ellipsoid algorithm to solve LPALG in polynomial time. For some classes of matroid constraints the LP is much smaller: e.g., for cardinality constraints (just d + 1 constraints) and for traversal matroids on bipartite graphs (just 2n + d con- straints). Then near-linear-time algorithms can be used. 166 Our algorithm works under any negatively correlated RRS. We can use this flexibility to improve the per-round running time for some special cases. (Making decisions extremely fast is often critical in practical applications of bandits (e.g., see Agarwal et al. [6].) We obtain near-linear per-round running times for cardinality constraints and partition matroid constraints. Indeed, LPALG can be solved in near- linear time, as mentioned above, and we can use a negatively correlated RRS from Gandhi et al. [92] which runs in linear time. These classes of matroid constraints are important in our applications (see Section 8.7). 8.6 Proof of Main Theorem 8.6.1 Proof overview. First, we argue that LPALG provides a good benchmark that we can use instead of OPTDP. Specifically, at any given round, the optimal value for LPALG in each round is at least 1 (1?) OPTDP with high probability. We prove this by constructingT a series of LPs, starting with a generic linear relaxation for BwK and ending with LPALG, and showing that the optimal value does not decrease along the series. Next we define an event that occur with high probability, henceforth called clean event. This event concerns total rewards, and compares our algorithm against LPALG: ? ? (? ? ? ) | +t?[T ] rt ? t?[T ] ?t ? xt| ? O ?n t?[T ] rt + ?nT + ?n . (8.10) We prove that it is indeed?a high-probability event in ?three steps. First, we relate the algorithm?s reward t rt to its expected reward t ? ? St, where we 167 interpret the cho?sen action S?t, a subset of atoms, as a binary vector over the atoms. Then we relate t ? ? S to ? + t t t ? S?t, replacing ex?pected rewards with the upper confidence bounds. Finally, we relate ?+t t ? St to t ? + t ? xt, replacing the output of the RRS with the ?corresponding expectations. Putting it together, we relate algorithm?s reward to t ? + t ? xt, as needed. It is essential to bound the deviations ? in the sharpest way possible; in particular, the naive O?( T ) bounds are not good enough. To this end, we use several tools: the confidence radius from (8.6), the negative correlation property of the RRS, and another concentration bound from prior work. A similar ?clean event? (with a similar proof) concerns the total resource consumption of the algorithm. We condition on both clean events, and perform the rest of the analysis via a ?deterministic? argument not involving probabilities. In particular, we use the second ?clean event? to guarantee that the algorithm never runs out of resources. We use negative correlation via a rather delicate argument. We extend the concentration bound in Theorem 12 to a random process that evolves over time, and only assumes that property (8.4) holds within each round conditional on the history. For a given round, we start with a negative correlation property of St and construct another family of random variables that conditionally satisfies (8.4). The extended concentrati?on bound is then applied to this family. The net result is a concentration bound for +t ?t ? St as if we had n? T independent random variables there. The rest of the section contains the full proof. 168 8.6.2 Linear programs We argue that LPALG provides a good benchmark that we can use instead of OPTDP. Fix round t and let OPTALG, t denote the optimal value for LPALG in this round. Then: Lemma 24. OPT 1ALG, t ? (1? ) OPTDP with probability at least 1? ?.T We will prove this by constructing a series of LP?s, starting with a generic linear relaxation for BwK and ending with LPALG. We show that along the series the optimal value does not decrease with high probability. The first LP, adapted from Badanidiyuru et al. [33], has one decision variable for each action, and applies generically to any BwK problem. ? maximize ?S?F ?(S)x(S) subject to S??F C(S, j)x(S) ? B/T j = 1, ..., d (LPBwK) 0 ? S?F x(S) ? 1. Let OPTBwK(B) denote the optimal value of this LP with a given budget B. Then: Claim 3. OPTBwK(B) ? (1? ) OPTBwK(B) ? 1 (1? ) OPTT DP. Proof. The second inequality in Claim 3 follows from [Lemma 3.1 in 33]. We will prove the first inequality as follows. Let x? denote an optimal solution to LPBwK(B). Consider (1? )x?; this is feasible to LPBwK(B), since for every S, ? (1? )x?(S) ? 1 and C(S, j)(1? )x(S) ? B/T. S?A:S?S 169 Hence, this is a feasible solution. Now, consider the objective function. Let y denote an optimal solution to LPBwK(B). We have that ? ? OPTBwK(B) = ?(S)y ?(S) ? ?(S)(1? )x?(S) = (1? ) OPTBwK(B). S?A:S?S S?A:S?S Now consider a simpler LP where the decision variables correspond to atoms. As before, P denotes the polytope induced by action set F . maximize ? ? x (LPATOMS) subject to C? ? x 4 B/T x ? P x ? [0, 1]n. Here C = (C(a, j) : a ? A, j ? d) is the n? d matrix of expected consumption, and C? denotes its transpose. The notation 4 means that the inequality ? holds for for each coordinate. Leting OPTatoms denote the optimal value for LPATOMS, we have: Claim 4. With probability at least 1?? we have, OPTALG, t ? OPTatoms ? OPTBwK(B). Proof. We will first prove the second inequality. Consider the optimal solution vector x to LPBwK(B). Define S ? := {S : x(S) 6= 0}. We will now map this to a feasible solution to LPATOMS and show that the objective value does not decrease. This will then complete the claim. Consider the following solution y defined as follows. ? y(a) = x(S). S?S?:a?S 170 We will now show that y is a feasibl?e solution to the polytope P . From the defi- nition of y, we can write it as y = x(S) ? I[S]. Here, I[S] is a binary vector, S?S? such that it has 1 at position a if and only if atom a is present in set S. Hence, y is a point in the polytope since it can be written as convex combination of its vertices. Now, we will show that, y also satisfies the resource consumption constraint. ? ? ?? ? C(j) ? y = C(a, j) x(S) = C(a, j)x(S) = C(S, j)x(S) ? B/T. a?A S?S?:a?S S?S? a?S S?S? The last inequality is because in the optimal solution, the x value correspond- ing to subset S? is 1 while rest all are 0. We will now show that y produces an objective value at least as large as x. ?n ? OPT ?atoms = ? ? y ? ? ? y = ?(a) x(S) ?? a=1 ? S?S?:a?S = ?(a)x(S) = ?(S)x(S) S?S? a?S S?S? = OPTsubsets(B). Now we will prove the first inequality. We will assume the ?clean event? that ?+t ? ? and C?t ? Ct for all t. Hence, the inequality holds with probability at least 1? ?. Consider a time t. Given an optimal solution x? to LPATOMS we will show that this is feasible to LPALG,t. Note that, x ? satisfies the constraint set x ? P since that is same for both LPALG,t and LPATOMS. Now consider the constraint C ? t (j) ?x ? B .T Note that C?t (a, j) ? C(a, j). Hence, we have that C?t (j) ?x? ? C(j) ?x? ? B . TheT 171 last inequality is because x? is a feasible solution to LPATOMS. Now consider the objective function. Let y? denote the optimal solution to LPALG,t. OPT +ALG, t = ?t ? y? ? ?+ ? x? ? ? ? y?t = OPTatoms. Hence, combining Claim 3 and Claim 4, we obtain Lemma 24. 8.6.3 Negative correlation and concentration bounds Our analysis relies on several facts about negative correlation and concen- tration bounds. First, we argue that property (8.2) in the definition of negative correlation is preserved under a specific linear transformation: Claim 5. Suppose (X1, X2 , . . . , Xm) is a family of negatively correlated random variables with support [0, 1]. Fix numbers ?1, ?2 , . . . , ?m ? [0, 1]. Consider two families of random variables: ( ) ( ) F+ 1 + ?i(Xi ? E[Xi]) ? F? 1? ?i(Xi ? E[Xi])= : i [m] and = : i ? [m] . 2 2 Then both families satisfy property (8.2). Proof. Let us focus on family F+; the proof for family F? is very similar. Denote ?i = E[Xi] and Yi := (1 + ?i(Xi ? ?i))/2 and zi := (1? ?i?i)/2 for all i ? [m]. Note that Yi = ?iXi/2 + zi and zi ? 0, Xi ? 0. Fix a subset S ? [m]. We 172 have, [? ] E Y = E? ??? ? ?? i (?iXi/2) zj (8.11) i?S ?T?[S?i?T ]j?S?\T = E (?iXi/2) zj T??S? i?T ? j?S\T ? (?i?i/2) zj (8.12) ?T?S i?T j?S\T = ((1? ?i?i)/2 + ?i?i/2) (8.13) i?S ? = (1)|S| = E[Y 2 i ] i?S Eq. (8.11) and (8.13) follow from Binomial Theorem. Eq. (8.12) is because Eq. (8.2) invariant under non-negative scaling, Xi neg. correlated Second, we extend Theorem 12 to a random process that evolves over time, and only assumes that property (8.4) holds within each round conditional on the history. Theorem 14. Let ZT = {?t,a : a ? A, t ? [T ]} be a family of random variables taking values in [0, 1]. Assume random variables {?t,a : a ? A} satisfy property (8.2?) given Zt?1 and have expectation 1 given Zt?1, for each round t. Let Z =2 1 a?A,t?[T ] ?t,a be the average. Then for some absolute constant c,nT Pr[Z ? 1 + ?] ? c ? ?2m?2e (?? > 0). (8.14) 2 Proof. We prove that family Zt satisfies property (8.4), and then invoke Theorem 12. Let us restate prope?rty (8.4) fo?r the sake of completeness: E? ? ? ? ? 2?|S|t,a for any subset S ? ZT . (8.15) (t,a)?S 173 Fix subset S ? ZT . Partition S into subsets St = {?t,a ? ZT ? S}, for each round t. Fix round ? and denote ? ? G? = Ht, where Ht = ?t,a. t?[? ] a?St We will now prove the following statement by induction on ? : ? E[G? ] ? 2?k? , where k? = |St|. (8.16) t?[? ] The base case is when ? = 1. Note that G? is just the product of elements in set ?1 and they are negatively correlated from the premise. Therefore we are done. Now for the inductive case of ? ? 2, ? E[H? |Z??1] ? E[??,a|Z??1] From property (8.2) in the conditional space a?S? (8.17) ? 2?|S? | From assumption in Lemma 14 (8.18) Therefore, we have E[G? ] = E[E[G??1H? |Z??1]] Law of iterated expectation = E[G??1E[H? |Z??1]] Since G??1 is a fixed value conditional on Z??1 ? 2?|S? |E[G??1] From Eq. (8.18) ? 2?k? From inductive hypothesis This completes the proof of Eq. (8.16). We obtain Eq. (8.15) for ? = T . Third, we invoke Eq. (8.7) for rewards and resource consumptions: 174 Lemma 25. With probability at least 1? e??(?), we have the following: |??t(a)? ?t(a)| ? 2 Rad(??t(a), Nt(a) + 1) (8.19) ?j ? [d] |C?t(a, j)? Ct(a, j)| ? 2 Rad(C?t(a, j), Nt(a) + 1). Fourth, we use a concentration bound from prior work which gets sharper when the expected sum is very small, and does not rely on independent random variables: Theorem 15 (Babaioff et al. [28]). Let X1, X2, . . . , Xm denote a set of {0, 1} random variables. For each t, let ?t?denote the multiplier determined by random variables X1, X2, . . . , Xt?1. Let M = m t=1Mt where Mt = E[Xt|X1, X2, . . . , Xt?1]. Then for any b ? 1, we have the following with probability at least 1?m??(b): ?m ? | ?t(Xt ?Mt)| ? b( M logm+ logm) t=1 8.6.4 Analysis of the ?clean event? Let us set up several events, henceforth called clean events, and prove that they hold with high probability. Then the remainder of the analysis can proceed conditional on the intersection of these events. The clean events are similar to the ones in Agrawal and Devanur [11], but are somewhat ?stronger?, essentially because our algorithm has access to per-atom feedback and our analysis can use the negative correlation property of the RRS. In what follows, it is convenient to consider a version of SemiBwK in which the algorithm does not stop, so that we can argue about what happens w.h.p. if our 175 algorithm runs for the full T rounds. Then we show that our algorithm does indeed run for the full T rounds w.h.p. Recall that xt be the optimal fractional solution obtained by solving the LP in round t. Let Yt ? {0, 1}n be the random binary vector obtained by invoking the RRS (so that the chosen action St ? F corresponds to a particular realization of Yt, interpreted as a subset). Let Gt := {Y ?t? : ?t ? t} denote the family of RRS realizations up to round t. 8.6.4.1 ?Clean event? for rewards For brevity, for each round t let ?t = (?t(a) : a ? A) be the vector of realized rewards, and let rt := ?t(St) = ?t ?Yt be the algorithm?s reward at this round. Lemma 26. Consider SemiBwK without stopping. Then with probability at least 1? nT e??(?): ? ? ??? ? ??| rt ? ?+t ? x ?t| ? O ?n rt + ?nT + ?n . t?[T ] t?[T ] t?[T ] Proof. We prove the Lemma by proving the following three high-probability inequal- ities. With probability at least 1? nT e???(?): the following hol?ds:? ? | ? ? | ? 1 ? rt ? Yt 3nT Rad? ?+t ? xt , nT? (8.20) t?[T ] t?[T ] ?? ? nT t?[T ]? ? ? ? ? | ? ?Y ? ?+t t ?Yt| ? ? 12??n? ?+ ? x ?t t + 12 ?n+ 12?n (8.21) ?t?[T ] t?[T ]? | ?+t ?Y +t ? ?t ? xt| ? ?nT . (8.22) t?[T ] 176 We will use the properties of RRS to prove Eq. (8.22). Proof of Eq. (8.21) is similar to Agrawal and Devanur [11], while proof of Eq. (8.20) follows immediately from the setup of the model.?U?sing the parts (8.20) and (8.22) we can now find an appropriate upper bound on t?[T ] ? + t ? xt and using this upper bound, we prove Lemma 26. Proof of Eq. (8.20). Recall that rt = ?tYt. Note that, E[?tYt] = ?Yt when the expectation is taken over just the independent samples of ?. By Theorem 15, with probability 1? e??(?) we have: ? ? ( ) | ? ? | ? 1 ? rt ? Yt 3nT Rad( ? ?Yt , nTnTt?T t?T ?t?T ) ? 13nT Rad( ? + t ?Yt , nTnT t?T ) ? 1 ? 3nT Rad ?+t ? xt , nT .nT t?T The last inequality is because Yt is a feasible solution to LPALG. Proof of Eq. (8.21). For this part, the arguments similar to Agrawal and Devanur [11] follow with some minor adaptations. For sake of completeness we describe the full proof. Note that we have, ? ?n ? | ? ?Yt ? ?+t ?Yt| ? | ?(a)Yt(a)? ?+t (a)Yt(a)|. t?T a=1 t?T Now, using Lemma 25 in Appendix, we have that with probability 1?nTe??(?) ? ? | ?(a)Yt(a)? ?+t (a)Yt(a)| ? 12 Rad(?(a), Nt(a) + 1). t?T t?T 177 Hence, we have ?n ? ?NT?(a)+1 | ?(a)Yt(a)? ?+t (a)Yt(a)| = 12 Rad(?(a), r) a=1 t?T a??A r=1 ? 12 (NT (a) + 1) Rad(?(a), NT (a) + 1) ?a?A ? 12 ?n (? ? (NT + 1)) + 12?n. The last inequality is from the defini?tion of Rad function and using the Cauchy- Swartz inequality. Note that ?NT = t?T ? ?Yt. Also, since we have with proba- bility 1? e??(?), ?(a) ? ?+t (a), we have?,? ?? (? ) ? 12 ?n (? ? (NT + 1)) + 12?n ? 12??n ?+t ?Yt + 12 ?n+ 12?n. t?T Finally note that Yt is a feasible solution to the semi-bandit polytope P . Hence, we have that ?+t ?Y +t ? ?t ? xt. He?nce,??? (? ) ?? (? ? ? ) ?12 ?n ?+t ?Yt + 12 ?n+ 12?n ? 12??n ?+t ? xt + 12 ?n+ 12?n. t?T t?T Proof of Eq. (8.22): Recall that for each round t, the UCB vector ?+t is determined by the random variables Gt?1 = {Yt? : ?t? < t}. Further, conditional on a realization of Gt?1, the random variables {Yt(a) : a ? A} are negatively correlated from the property of RRS. Let ?? +t(a) := ?t (a)Yt(a), a ? A. Note that we have E[??t(a)|Gt?1] = ?+t (a)xt(a). Define 1 + ?+t (a)Yt(a)? ?+t (a)xt(a)?t(a) := . 2 178 From Claim 5, we have that {?t(a) : a ? A} conditioned on Gt?1 satisfy (8.2). Further, E[?t(a)|G ] = 1t?1 . Therefore, the family {?t(a) : t ? [T ], a ? A} satisfies2 the assumptions in Theorem 14 and hence satisfies Eq. (8.14) for some absolute constant c. Plug?ging back the ??t(a)?s, we obtain an upp?er-tail concentration bound: 1 ?? Pr? ( ??t(a)? ?+t (a)xt(a)) ? ? ? ? c ? e?2nT?2 .nT t?[T ] a?A To obtain a corresponding concentration bound for the lower tail, we apply a similar argument to ? 1 + ? + t (a)xt(a)? ??t(a)?t(a) = .2 Once again from Claim 5, we have that {? ?t(a) : a ? A} conditioned on Gt?1 satisfy (8.2). The family {? ?t(a) : t ? [T ], a ? A} satisfies the assumptions in Theorem 14 and hence satisfies Eq. (8.14). Plugging back the ??t(a)?s, we obtain a lower-tail concentration bo?und:? ?1 ?? + 2Pr ( ?t (a)xt(a)? ??t(a)) ? ? ? ? c ? e?2nT? .nT t?[T ] a?A Com?bining these two we have, Pr? ?? ? 1 | 2?+t (a)Y +t(a)? ?t (a)xt(a)| ? ? ? ? 2 c ? e?2nT? . (8.23)nT t?[T ] a?A ? Hence setting ? = ? , we obtain Eq. (8.22) with probability at least 1 ? nT e??(?). ?? Combining Eq. (8.20), (8.21) and (8.22) Let us denote H := +t?[T ] ?t ? xt. Adding the three equations we get ? ? ? ? | r ?H2t | ? ?H + ? + ?nH +O(?n) + ?nT (8.24) t?[T ] 179 Rearranging and solving for H, we have ?? ? H ? rt +O( ?n) + (?nT )1/4 t?[T ] Plugging this back into Eq. (8.24), we get Lemma 26. 8.6.4.2 ?Clean event? for resource consumption We define a similar ?clean event? for consumption of each resource j. By a slight abuse of notation, for each round t let Ct(j) = (Ct(a, j) : a ? A) be the vector of realized consumption of resource j. Let ?t(j) denote algorithm?s consumption for resource j at round t (i.e., ?t(j) = Ct(j) ?Yt). Lemma 27. Consider SemiBwK without stopping. Then with probability at least 1? nT e??(?): ? ? ? ? ?j ? [d] | ?t(j)? C?t (j) ? xt| ? ?nB + ?n+ ?nT . t?[T ] t?[T ] Proof. The proof is similar to Lemma 26. We will split the proof into following three equations. Fix an arbitrary resource j ? [d]. With probability at least 1?nTe??(?) the following holds: ? ? ( ? ) | ?t(j)? 1 C(j) ?Yt| ? 3nT Rad C(j) ?Yt , nT . (8.25) nT t?T t?T t?T ? ??? (? ) ? | C(j) ?Yt ?C?t (j) ?Yt| ? 12??n C(j) ?Yt + 12 ?n+ 12?n. (8.26) t?T t?T 180 ? ? | C?t (j) ?Yt ?C?t (j) ? xt| ? ?nT . (8.27) t?T ?? Using the parts (8.25), (8.26) and (8.27) we can find an upper bound on t?T Ct(j) ?Yt. Hence, combining equations (8.25), (8.26) and (8.27) with this bound and taking an Union Bound over all the resources, we get Lemma 27. Proof of Eq. (8.25). We have that {Ct(a, j) : a ? A} is a set of independent random variables over a probability spacee C?. Note that, EC Ct(a, j)Yt(a) = C(a, j)Yt(a).? Hence, we can invoke Theorem 15 on independent random variables to get with probability 1? nTe??(?) ? ? ( 1 ? )| ?t(j)? C(j) ?Yt| ? 3nT Rad C(j) ?Yt , nT . nT t?T t?T t?T Proof of Eq. (8.26). This is very similar to proof of (8.21) and we will skip the repetitive parts. Hence, we have with probability 1? nTe??(?) ? ? | C(j) ?Yt ?C?t (j) ?Yt| ? 12??n(C(j) ? (NT + 1)) + 12?nt?T ?? (? ) ? ? 12??n C(j) ?Yt + 12 ?n+ 12?n. t?T Proof of Eq. (8.27). Recall that for each round t and each resource j, the LCB vector C?t (j) is determined by the random variables Gt?1 = {Yt? : ?t? < t}. Similar to the proof of Eq. (8.22), random variables {Yt(a) : a ? A} obtained from the RRS are negatively correlated given Gt?1. As before define ?? (a) = C?t t (a)Yt(a), a ? A. We have that E[?t(a) | Gt?1] = C?t (a)xt(a). 181 By Claim 5, random variables 1 + ??t(a)? C?t (a)xt(a)?t(a) = , a ? A 2 satisfy (8.2), given Gt?1. We conclude that family {?t(a) : t ? [T ], a ? A} satisfies the assumptions in Theorem 14, and therefore satisfies Eq. (8.14) for some absolute constant c. The?refore, we obtain an upper-tail concentr?ation bound for ??t(a)?s: 1 ?? 2 Pr? ( ??t(a)? C?t (a)xt(a)) ? ? ? ? c ? e?2nT? .nT t?[T ] a?A To obtain a corresponding concentration bound for the lower tail, we apply a similar argument to ? 1 + C ? t (a)xt(a)? ??t(a)?t(a) = .2 Once again, invoking Claim 5 we have that {? ?t(a) : a ? A} conditioned on Gt?1 satisfy (8.2). Thus, family {?t(a) : t ? [T ], a ? A} satisfies the assumptions in Theorem 14, an?d therefore satisfies Eq. (8.14). We obta?in:? 1 ?? 2Pr ( C?t (a)xt(a)? ?? ? ?2nT?t(a)) ? ? ? c ? e .nT t?[T ] a?A Com?bing the two tails we have,? ?1 ??Pr | C? ?t (a)Yt(a)? Ct (a)xt(a)| ? ? ? ? 2 c ? e?2nT?2 . (8.28)nT t?[T ] a?A ? Once again, setting ? = ? , we obtain Eq. (8.27) with probability at least nT 1? e??(?). ?? Proof of Lemma 27. Denote G = t?T C?(j) ?Yt. From Equation (8.25), (8.26)? ? and (8.27), we have that G2? ?2?( ?n)G ? ? t?T Ct (j) ?xt+O(?n)+ ?nT . Note ? ? that C?t?T t (j) ?xt ? B. Hence, G2?2?( ?n)G ? B +O(?n) + ?nT . Hence, 182 ? ? re-arranging this gives us G ? B + O( ?n) + (?nT )1/4. Plugging this back in Equations (8.25), (8.26) and (8.27), we get Lemma 27. 8.6.5 Putting it all together Similar to Agrawal and Devanur [11], we will handle the hard constraint on budget, by choosing an appropriate value of . We then combine the above Lemma on ?rewards? clean event to compare the reward of the algorithm with that of the optimal value of LP to obtain the regret bound in Theorem 13. Additionally, we use the Lemma on ?consumption? clean event to argue that the algorithm doesn?t exhaust the resource budget before round T . Formally, consider the following. Recall that from Lemma 24, we have OP 1?TALG, ? (1?) OPTDP. Let us defineT the performance of the algorithm as ALG = t?T rt. From Lemma 26, that with probability at least 1? ndT e??(?) ? ? ALG ? (1? ) OPTDP?O(??nALG)?O(?n)? ?nT? ? (1? ) OPTDP?O( ?nOPTDP)?O(?n)? ?nT (since ALG ? OPTDP). ? ? Choosing  = ?n + ?n + ?nT and using the assumption that B > 3(?n + B B B ? ?nT ), we derive Eq. (8.9). For any given ?, we set ? = ?(log(ndT )) to obtain a ? success probability of at least 1? ?. Now we will argue that the algorithm does not exhaust the resource budget before round T with probability at least 1?ndT e??(?). Note that for every resource 183 j ? [d], ? C?t (j) ? xt ? (1? )B. t?T ? Hence, combining this with Lemma 27, we have t?T Ct(j) Yt ? (1? )B + B ? B. 8.7 Applications and special cases Let us discuss some notable examples of SemiBwK (which generalize some of the numerous applications listed in Badanidiyuru et al. [33]). Our results for these examples improve exponentially over a naive application of the BwK framework. Compared to what can be derived from Agrawal and Devanur [11, 12], our results feature a substantially better dependence on parameters, a much better per-round running time, and apply to a wider range of parameters. However, we leave open the possibility that the regret bounds can be improved for some special cases. 8.7.1 Dynamic pricing. The dynamic pricing application is as follows. The algorithm has d products on sale with limited supply: for simplicity, B units of each. Following Besbes and Zeevi [39], we allow supply constraints across products, e.g., a ?gadget? that goes into multiple products. In each round t, an agent arrives (who can buy any subset of the products), the algorithm chooses a vector of prices p ? [0, 1]dt to offer the agent, and the agent chooses what to buy at these prices. For simplicity, the agent is interested in buying (or is only allowed to buy) at most one item of each product. The agent has a valuation vector over products, so that the agent buys a given 184 product if and only if her valuation for this product is at least as high as the offered price. The entire valuation vector is drawn as an independent sample from a fixed and unknown distribution (but valuations may be correlated across products). The algorithm maximizes the total revenue from sales. To side-step discretization issues, we assume that prices are restricted to a known finite subset S ? [0, 1]. Achieving general regret bounds without such re- striction appears beyond reach of the current techniques for BwK.10 To model it as a SemiBwK problem, the set of atoms is all price-product pairs. The combinatorial constraint is that at most one price is chosen for each product. (If an action does not specify a price for some product, the default price is used.) This is a ?partition matroid? constraint. Rewards correspond to revenue from sales, and resources correspond to?inventory c?onstraints. We obtain regret O?(d dB|S| + T |S|) using Corollary 5, whenever B > ? ??(n + nT ). This is because OPTDP ? dB, since that is the maximum number of products available, and the number of atoms is n = d|S|. For comparison, results of Agrawal and Devanur [11, 12] apply only when ? ? B > nT 3/4, and yield regret bound of O?(d3|S|2 T ).11 Thus, our regret bounds feature a better dependence on the number of allowed prices |S| (which can be 10Prior work on dynamic pricing with limited supply (e.g., Babaioff et al. [28], Badanidiyuru et al. [33], Besbes and Zeevi [38]) achieves regret bounds without restricting itself to a particular finite set of prices, but only for a simple special case of (essentially) a single product. 11We obtain this by plugging in OPTDP ? dB and n = d|S| into their regret bound. For dynamic pricing the total per-resource consumption is bounded by 1, so we can apply their results without rescaling the consumption. 185 very large) and the number of products d. Further, our regret bounds hold in a meaningful way for the much larger range of values for budget B. For a naive application of the BwK framework, arms correspond to every possible realization of prices for the d products. Thus, we have |S|d arms, with a corresponding exponential blow-up in regret. 8.7.2 Dynamic assortment. The dynamic assortment problem is similar to dynamic pricing in that the algorithm is selling d products to an agent, with a limited inventory B of each prod- uct, and is interested in maximizing the total revenue from sales. As before, agents can have arbitrary valuation vectors, drawn from a fixed but unknown distribution. However, the algorithm chooses which products to offer, whereas all prices are fixed externally. There is a large number of products to choose from, and any subset of k  d of them can be offered in any given round. To model this as SemiBwK, atoms correspond to products, and actions corre- spond to subsets of at most k atoms. The combinatorial constraint forms a partition matroid. Rewards correspond to sales, and resources correspond to products, as in ? dynamic pricing. Since OPTDP ? min(dB, kT ), Corollary 5 yields regret O?(k dT ) ? ? when B > ?(T ), and regret O?(d dB + dT ) in general. In a naive application of BwK, arms are subsets of k products. Hence, we have O(dk) arms. The other parameters of the problem remain the same. This leads to ? regret bound O?(d Bdk), with an exponential dependence on k. 186 8.7.3 Repeated auctions. Consider a repeated auction with adjustable parameters, e.g., repeated second- price auction with reserve price that can be adjusted from one round to another. While prior work (Badanidiyuru et al. [33], Cesa-Bianchi et al. [55]) concerned run- ning one repeated auction, we generalize this scenario to multiple repeated auctions with shared inventory (e.g., the same inventory may be sold via multiple channels to different audiences). More formally, the auctioneer is running r simultaneous repeated auctions to sell a shared inventory of d products, with limited supply B of each product (e.g., different auctions can cater to different audiences). Each auction has a pa- rameter which the algorithm can adjust over time. We assume that this parameter comes from a finite domain S ? [0, 1]. For simplicity, assume the auctions are syn- chronized with one another. As in prior work, we assume that in every round of each auction a fresh set of participants arrives, sampled independently from a fixed joint distribution, and only a minimal feedback is observed: the products sold and the combined revenue. Following prior work (Badanidiyuru et al. [33], Cesa-Bianchi et al. [55]), we only assume minimal feedback: for each auction, what were the products sold and what was the combined revenue from this auction. In particular, we do not as- sume that the algorithm has access to participants? bids. Not using participants? bids is desirable for privacy considerations, and in order to reduce the participants? incentives to game the learning algorithm. 187 To model this problem as SemiBwK, atoms are all auction-parameter pairs. The combinatorial constraint is that an action must specify at most one parameter value for each auction. This corresponds to partition matroid constraint. There is a ?default parameter? for each auction, in case an action does not specify the parameter. We have a resource for each product being auctioned. For simplicity, each product has supply of B. Note that OPTDP?? dB and?number of atoms is n = r|S|. Hence, our main result yields regret O?(d r|S|B + r|S|T ). A naive application of the BwK framework would have arms that correspond to all possible combinations of parameters, for the total of O(|S|r) arms. Again, we have an exponential blow-up in regret. Alternatively, one may try running r separate instances of BwK, one for each auction, but that may result result in budgets being violated since the items are shared across the auctions and it is unclear a priori how much of each item will be sold in each auction. One can also consider a ?flipped? version of the previous example, where the algorithm is a bidder rather than the auction maker. The bidder participates in r repeated auctions, e.g., ad auctions for different keywords. We assume a stationary environment: bidder?s utility from a given bid in a given round of a given auction is an independent sample from a fixed but unknown distribution. The only limited resource here is the bidder?s budget B. Bids are constrained to lie in a finite subset S. To model this as SemiBwK, atoms correspond to the auction-bid pairs. The combinatorial constraint is that each action must specify at most one bid for each auction. (There is a ?default bid? for each auction in case an action does not specify 188 Figure 8.1: Dynamic Assortment (left) and Dynamic Pricing (right) experiments for n = 26. the bid for this auction.) There is exactly one resource, which is money and the total budget is B. Note that?the numbe?r of atoms is n = r|S|. Hence, our main result yields regret O?(OPTDP r|S|/B + r|S|T ). A naive application of BwK would have arms that correspond to all possible combinations of bids, for the total of O(|S|r) arms; so we have an exponential blow- up in regret. 8.8 Numerical Simulations We ran some experiments on simulated datasets in order to compare our al- gorithm, SemiBwK-RRS, with some prior work that can be used to solve SemiBwK: ? the primal-dual algorithm for BwK from Badanidiyuru et al. [33], denoted pdBwK. ? an algorithm for combinatorial semi-bandits with a matroid constraint: ?Op- timistic Matroid Maximization? from Kveton et al. [124], denoted OMM. 189 Figure 8.2: Experimental Results for Uniform matroid (left plots) and Partition matroid (right plots) on independent (upper) and correlated (lower) instances for n = 26. ? the linear-contextual BwK algorithm from Agrawal and Devanur [12], dis- cussed in the Introduction, denoted linCBwK. To speed up the computation in linCBwK, we used a heuristic modification suggested by the authors in a private communication. This modification did not substantially affect average rewards in our preliminary experiments. We also made a heuristic improvement to our algorithm, setting  = 0 and ? = 5. We use the same value of ? for the pdBwK algorithm as well. Problem instances. We did not attempt to comprehensively cover the huge va- riety of problem instances in SemiBwK. Instead, we focus on two representative applications from Section 8.7. 190 Figure 8.3: Experimental Results for Uniform matroid (left plots) and partition matroid (right plots) on independent (upper) and correlated (lower) instances for n = 6. The first experiment is on dynamic assortment. We have n products, and for each product i there is an atom i and a resource i. The (fixed) price for each product is generated as an independent sample from U[0,1], a uniform distribution on [0, 1]. At each round, we sample the buyers?s valuation from U[0,1], independently for each product. If the valuation for a given product is greater than the price, one item of this product is sold (and then the reward for atom i is the price, and consumption of resource i is 1). Else, we set reward for atom i and consumption for resource i to be 0. The second experiment is on dynamic pricing with two products. We have n/2 allowed prices, uniformly spaced in the [0, 1] interval. Recall that atoms correspond 191 to price-product pairs, for the total of n atoms. In each round t, the valuation vt,i for each product i is chosen independently from a normal distribution N (v0i , 1) truncated on [0, 1]. The mean valuation v0i is drawn (once for all rounds) from U[0,1]. If vt,i is greater than the offered price p, one item of this product is sold. Then reward for the corresponding atom (p, i) is the price p, and consumption of product i is 1. If there is no sale for this product, the reward and consumption for each atom (p, i) is set to 0. The third experiment is a modification of the dynamic assortment example, in which we ensure that even non-action (e.g., no sale) exhausts resources other than time. As in dynamic assortment, we have n products, and for each product i there is an atom i and a resource i. The (fixed) price for each product is generated as an independent sample from U[0,1], a uniform distribution on [0, 1]. At each round, we sample the buyers?s valuation from U[0,1], independently for each product. If the valuation for a given product is greater than the price, one item of this product is sold (and then the reward for atom i is the price, and consumption of resource i is 1). Else, we do something different from dynamic assortment: we set reward for atom i and consumption for resource i to be the buyer?s valuation. The fourth experiment is a similar modification of the dynamic pricing exam- ple. We have n/2 allowed prices, uniformly spaced in the [0, 1] interval. Recall that atoms correspond to price-product pairs, for the total of n atoms. In each round t, the valuation vt,i for each product i is chosen independently from a normal distri- bution N (v0, 1) truncated on [0, 1]. The mean valuation v0i i is drawn (once for all rounds) from U[0,1]. If the valuation for a given product i is greater than the offered 192 price p, one item of this product is sold (and then reward for the corresponding atom (p, i) is the price, and consumption of product i is 1). If there is no sale for this product, we do something different from dynamic pricing. For each atom (p, i), if p < vt,i then the reward for atom (p, i) is drawn independently from U[0,1] and resource consumption is 1; else, reward is 0 and consumption is .3. While dynamic assortment is modeled with a uniform matroid, and dynamic pricing is modeled with a partition matroid, we tried both matroids on each family. Experimental setup and results. We choose various values of n, B and T and run our algorithms on the above two datasets assuming both a uniform ma- troid constraint and a partition matroid constraint. We choose n ? {6, 26}, T ? {1000, 2000, 3000, 4000, 5000, 6000} and B = T/2. The maximum number of atoms in any action is set to K = 2. For a given algorithm, dataset and configuration of n and T , we simulate each algorithm for 20 independent runs and take the average. We calculate the total reward obtained by the algorithm at the end of T time-steps. Figure 8.1 shows results for the first two experiments. Figures 8.2 and 8.3 show the results on the third and fourth experiments. Our algorithm achieves the best regret among the competitors. As a benchmark, we included the performance of the fractional allocation in LPOPT , denoted OPTDP DP. Additional experiment. linCBwK and pdBwK have running times proportional to the number of actions. We ran an additional experiment which compared per-step running times. We first calculate the average running time for every 10 steps and take the median of 50 such runs. For both Uniform matroid and Partition matroid, 193 Figure 8.4: Variation of per-step running times as n increases for the various algo- rithms. we run the faster RRS due to Gandhi et al. [92]. See Figure 8.4 for results. Details of heuristic implementation of linCBwK. We now briefly describe the heuristic we use to simulate the linCBwK algorithm. Note that even though the per- time-step running time of linCBwK is reasonable, it takes a significant time when we want to perform simulations for many time-steps. The time-consuming step in the linCBwK algorithm is the solution to a convex program for computing the optimistic estimates (namely ??t and W?t). Hence, the heuristic gives a faster way to obtain this estimate. We sample multiple times from a multi-variate Gaussian with mean ?? and covariance Mt (to obtain estimate ??t) and with mean w?tj and covariance Mt (to obtain estimate w?tj for each resource j). We use these samples to compute the objective to choose the action at time-step t. For each sample, we compute the best action based on the objective in linCBwK. We finally choose the action that occurs majority number of times in these samples. The number of samples we choose is set to 30. Language Details of algorithms. All algorithms except linCBwK were imple- 194 mented in Python. The linCBwK algorithm was implemented in MATLAB. This difference is crucial when we compare running times since language construct can speed-up or slow down algorithms in practice. However, it is known that 12 for matrix operations commonly encountered in engineering and statistics, MATLAB implementations runs several orders of magnitude faster than the corresponding python implementation. Since linCBwK is the slowest of the four algorithms, our comparison of running times across languages is justified. 8.9 Conclusion In this chapter we consider an extension of the BwK problem, namely, stochas- tic combinatorial semi-bandits with knapsacks under matroid constraints and gave a specialized algorithm with optimal bounds. Moreover, we also show applications of SemiBwK and run numerical experiments on simulated data to support our theoreti- cal results. The results in this chapter work for a specialized (yet capture all known applications) action set and obtain stronger bounds (essentially optimal) than those in Section 7.9 in Chapter 7. 12https://www.mathworks.com/products/matlab/matlab-vs-python.html 195 196 Chapter 9: Future Directions In this chapter we conclude with future research directions that stem out of the problems considered in this thesis. We consider both future directions in the models of SDM considered in this thesis as well as other problems beyond SDM that are motivated by problems in this thesis. We also believe that the work in this thesis can be applied to a wider range of applications and exploring them is a future direction. Online Matching The most direct open problem in online matching is to close the gap between the upper and lower-bound for the Online Weighted Matching problem in the known distribution model considered in Chapter 3. We provide an algorithm with a com- petitive ratio of 0.7 while the known lower-bound is 0.823 given by Manshadi et al. [137] which also applies to unweighted graphs. Thus, we want to close this gap by either bringing the upper-bound towards the lower-bound or vice versa (or possibly both). The other direction is to consider stochastic arrival models that relax the no- tion of independence across time-steps which is often too simplifying. In many prac- tical scenarios, this model of stochastic arrival doesn?t exactly capture the reality of 197 the underlying data. The input often resembles a stochastic process with some pre- dictable correlations across time-steps (e.g., rider request pattern depends on local events). In the context of optimization in matching markets and specifically, online matching, results based on correlated data arrival have not been studied. Thus, it is interesting to formulate and study online matching and its variants under no- tions of correlation that are both theoretically challenging and practically relevant. Candidate models include notions of limited dependence such as Markov chains. Such models are especially useful in applications where the input is received as a result of human decisions. The other direction is to consider corrupted inputs, that are frequently encountered in many applications. Abstractly, this can be modeled as input sequences that are close to an i.i.d. sequence with adversarial perturba- tions. In other words, we want algorithms whose performance smoothly interpolate between the stochastic and the adversarial models. Prior work (Esfandiari et al. [81]) has proposed one interesting model along these lines for a particular variant of the unweighted online matching problem. However, the question of interpolation is not yet fully understood and thus, it will be interesting to study this for the vast landscape of online matching problems. From a theoretical stand-point one interesting open question is to obtain a competitive ratio bound that holds with high-probability (i.e., bound the variance of the algorithm?s performance). For the online weighted matching problem the performance guarantees of both the algorithms proposed in Chapter 3 and the one proposed in Haeupler et al. [102] only hold in expectation. Thus, it will be interesting to understand if similar ratios can be proved that hold with high-probability. For 198 the unweighted version of the problem, such high-probability bounds are known (e.g., Feldman et al. [83]). Bandits with Knapsacks The work on adversarial BwK (Chapter 7) directly leads to a few open ques- tions and directions. In particular, it would be illuminating to have a single algo- rithm that has good performance simultaneously in the stochastic and the adversar- ial setting without knowing a-priori, the environment the algorithm is operating in (so called ?best-of-both-worlds algorithm?). Our work almost achieves this, where we develop a common framework to solve both the settings but technical conditions prevent us from using it without knowing the environment. A closely related but distinct question is to obtain instance-specific bounds for the stochastic version of the problem. In particular, the goal is to derive bounds that improve on the easy instances of the problem while still achieving the worst-case bounds in the hard regimes. Another interesting direction is to consider the Stochastic BwK problem with limited dependence, where the environment is stochastic (but not independent) while the rewards/consumptions across times have weak correlation. Limited de- pendence and multi-armed bandits have been previously studied and is supported by numerous motivating examples. However, in many of these applications we also have limited inventory and thus, the ideal approach would be to use variants of BwK in such applications. A broader direction stemming out of our work is to investigate the interplay of learning and game-theory in the context of time-varying zero-sum games. In a 199 time-varying zero-sum game, the players play different zero-sum games repeatedly and the goal for each player is to maximize their total payoff. When both players play optimally, the strategy pair is said to be in equilibrium. Classical results in online learning show that MAB algorithms played against each other, approximately converge to the equilibrium. However, the results are limited to the settings when the game remains the same across time or the payoff functions across time are sampled independently and identically from the same distribution. Thus, it is interesting to understand the behavior of MAB algorithms when no assumptions about the repeated zero-sum games are made. A very recent work by Cardoso et al. [53] make some progress in this direction. Beyond SDM and Related Problems The central thesis in our work was to explore SDM problems with information constraints. However, there are many other problems that are closely related that have similar constraints. Consider the BwK problem. The key challenge for the algorithm was to perform counterfactual inference (i.e., answering what if for arms that were not chosen) in the face of limited samples. In sequential experiment design literature, there are multiple approaches to studying this challenge, with bandit- based approach being one of them. An alternate method to do this is to pose this as a problem in causal inference and study the effects of limited samples on inference algorithms. This question is not well understood; only a handful of prior works (e.g., Ghoshal and Honorio [93], Sankararaman et al. [164, 165]) have considered this question and it would be interesting to understand the limits of finite samples 200 for counterfactual reasoning. More broadly, understanding the trade-off between limited samples and performance of online algorithms with stochastic input is an interesting research direction (i.e., the algorithm only has access to samples from a distribution as opposed to the known distribution assumption). 201 Part III Appendix 202 Appendix: Standard Tools 10.1 Concentration Inequalities Lemma 28 (Azuma-Hoeffding inequality ([148])). Let Y1, Y2, . . . , YT be a martingale difference sequence ( i.e., E[Yt | Y?1, Y2 , . . . , Yt?1] = 0). Suppose |Yt| ? c for all t ? {1, 2, . . . , T}. Let R0,?(T ) := 2Tc2 ln(1/?). Then for every ? > 0, [? ] Pr t?[T ] Yt > R0,?(T ) ? ?. Lemma 29 (Chernoff-Hoeffding bounds ([148])). Let X1, X2, . . . , XT be a sequence of independent random variables such that |Xt| ? c for all t ? {1, 2, . . . , T}. Let Zt := E[Xt]. Then for every ? > 0,[???? ?? ?(? ) ]Pr t?[T ] Xt ? Zt? > 3 t?[T ] Zt c2 ln(1/?) ? ?. 10.2 Adversarial Online Learning Let us revisit adversarial online learning, as per Figure 7.2. Denote the bench- mark in Eq. (7.4) as ? OPTAOL(T ) := maxa?A t?[T ] ft(a). Recall that [bmin, bmax] is the payoff range, and denote ? = bmax ? bmin. 203 Lemma 30. Suppose an algorithm for adversarial online learning satisfies Eq. (7.4) for some ? > 0. Then [ ? ( ? ) ] Pr ?? ? [T ] OPTAOL(?) ? t?[? ] ft ? pt ? ? ? R?/T (T ) + 2T log(T/?) ? 1? 2?. (10.1) Proof. Let us use the stronger regret bound Eq. (7.5) implied by Eq. (7.4). Note that E[ft(at) | a1, a2 , . . . , at?1] = ft ? pt. Applying the Azuma-Hoeffding inequality for each ? ? [T ], and taking a union bound, we have [ ? ? ? ] Pr ?? ? [T ] t?[? ] ft(at)? t?[? ] ft ? pt ? ? ? 2T log(T/?) ? 1? ?. (10.2) Taking a union bound over Eq. (10.2) and Eq. (7.5) and adding the equations we get Eq. (10.1). Remark 12. For Hedge algorithm, regret bound Eq. (10.1) is already proved in [88]. 10.3 Lagrangians 10.3.1 Proof of Lemma 12 Assume one of the resources is the dummy resource, and one of the arms is the null arm. Consider the linear program LPM,B,T , for some outcome matrix M. Let L = LM,B,T denote the Lagrange function. Lemma 31 (Lemma 12, restated). Suppose (X?, ??) is a mixed Nash equilibrium for the Lagrangian game. Then X? is an optimal solution for the linear program 204 (7.8). Moreover, the minimax value of the Lagrangian game equals the LP value: L(X?, ??) = OPTLP. In what follows we prove Lemma 31. Writing out the definition of the mixed Nash equilibrium, L(X?, ?) ? L(X?, ??) ? L(X, ??) ?X ? ?K , ? ? ?d. (10.3) ? ? For brevity, denote r(X?) = a?[K] X ?(a) r(a) and ci(X ?) = ?a?[K] X (a) ci(a). We first state and prove the complementary slackness condition for the Nash equilibrium. Claim 6. For every resource i ? [d] we have, (a) 1? T c ? B i (X ) ? 0, (b) ?? > 0 =? 1? Ti c (X?) = 0.B i Proof. Part (a). For contradiction, consider resource i that minimizes the left-hand side in (a), and assume that the said left-hand side is strictly negative. We have two cases: either ??i < 1 or ? ? i = 1. When ? ? i < 1, consider another distribution ?? ? ?d such that ??i = 1 and ??i? = 0 for every i? 6= i. Note that we have, L(X?, ??) < L(X?, ??). This contradicts the first inequality in Eq. (10.3). Consider the second case, when ??i = 1. Then L(X?, ??) = r(X?)+1? T ci(X?).B Consider any arm a ? [K] such that X?(a) 6= 0. Let X? ? ?K be another distribution such that X?(a) := 0 and X?(null) := X?(null) +X?(a) and X?(a?) = X?(a?) for every a? 6? {a, null}. Note that X?(null) ? 1. Since (X?, ??) is a Nash equilibrium, we 205 have that L(X?, ??) ? L(X?, ??). This implies that ?X?(a)r(a) +X?(a) T ci(a) ? 0.B Re-arranging we obtain, T ci(a) ? r(a) ? 1. Thus, we have 1? T c (a) ? 0.B B i Since this holds for every a ? [K] with X?(a) =6 0, we obtain a contradiction: 1? T c (X? ? ? ( )) = T B i a?[K] X (a) 1? ci(a) ? 0.B Part (b). For contradiction, assume the statement is false for some resource i. Then, by part (a), ?? > 0 and 1? T c ?i i(X ) > 0, and consequently L(X?, ??) > r(X?).B Now, consider distribution ?? which puts probability 1 on the dummy resource. We then have L(X?, ??) = r(X?) < L(X?, ??), contradicting the first inequality in Eq. (10.3). Let X? be some feasible solution for the linear program (7.8). Plugging the feasibility constraints into the definition of the Lagrangian function, L(X?, ??) ? r(X?). Claim 6(a) implies that X? is a feasible solution to the linear program (7.8). Claim 6(b) implies that L(X?, ??) = r(X?). Thus, r(X?) = L(X?, ??) ? L(X?, ??) ? r(X?). So, X? is an optimal solution to the LP. In particular, OPT ? ? ?LP = r(X ) = L(X , ? ). 10.3.2 Regret minimization in games 10.3.3 Proof of Lemma 11 ? Let W = 2T log(T/?) denote the term from Lemma 30 in what follows. We now prove Lemma 11, similar to the proof in [87] for the deterministic game. Recall that we take averages up to some fixed round ? ? [T ]. We prove that 206 the following two inequalities hold, each with probability at least 1? ?. 1 ? T ? R1, ?/T (T ) + 2Wpt,1 Gt pt,2 ? v ? ? ? . (10.4)? ? t 1 ??[? ] R2, ?/T (T ) + 2W pTt,1 G p T t t,2 ? p1 G p2 + ? ? ?p2 ? ?A2 . (10.5)? ? t?[? ] Eq. (7.7) in Lemma 11 follows by adding Eq. (10.4) and Eq. (10.5). First we prove Eq. (10.4). Following the set of inequalities in Section 2.5 of [87] we have, 1 ? ?T ? 1 ?T R1, ?/T (T ) +Wpt,1Gtpt,2 whp p1 Gt pt,2 ? ? ? From Lemma 30? ? ? t?[? ] t??[? ]1 ?T R1, ?/T (T ) + 2W?whp p1 G pt,2 ? ? ? From Lemma 28? ? t?[? ] 1 ? T R1, ?/T (T ) + 2W= max p1 G pt,2 ? ? ? From Definition of p? ? ? ? 1 . p1 ?A1 t?[? ] R1, ?/T (T ) + 2W = max p T1 G p2 ? ? ? From Definition of p2. p1??A1 ? T R1, ?/T (T ) + 2W? min max p1 G p2 ? ? ? p2??A p ??2 1 A1 ? Here ?whp denotes statements that hold with probability at least 1? ?. Now let us prove Eq. (10.5). Fix distribution p2 ? ?A2 . Then: 1 ? T ? 1 ? T R2, ?/T (T ) +Wpt,1 Gt pt,2 whp pt,1 Gt p2 + ? ? From Lemma 30? ? ? t?[? ] t??[? ] R (T ) + 2W ? 1 T 2, ?/Twhp pt,1 G p2 + ? ? From Lemma 28 ? ? t?[? ] = p T R2, ?/T (T ) + 2W 1 G p2 + ? ? From Definition of p1. ? Taking a union bound over all the four high-probability inequalities, we get the lemma. 207 Bibliography [1] Jacob D Abernethy and Jun-Kun Wang. On frank-wolfe and equilibrium computation. In Advances in Neural Information Processing Systems (NIPS), pages 6584?6593, 2017. [2] Marek Adamczyk, Fabrizio Grandoni, and Joydeep Mukherjee. Improved ap- proximation algorithms for stochastic matching. In European Symposium on Algorithms (ESA), pages 1?12. Springer, 2015. [3] Marek Adamczyk, Maxim Sviridenko, and Justin Ward. Submodular stochas- tic probing on matroids. Mathematics of Operations Research, 41(3):1022? 1038, 2016. [4] Alekh Agarwal, Miroslav Dud??k, Satyen Kale, John Langford, and Robert E. Schapire. Contextual bandit learning with predictable rewards. In 15th Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), pages 19?26, 2012. [5] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for con- textual bandits. In 31st Intl. Conf. on Machine Learning (ICML), 2014. 208 [6] Alekh Agarwal, Sarah Bird, Markus Cozowicz, Miro Dudik, Luong Hoang, John Langford, Lihong Li, Dan Melamed, Gal Oshri, Siddhartha Sen, and Aleksandrs Slivkins. Multiworld testing: A system for experimentation, learn- ing, and decision-making, 2016. A white paper, available at https://github. com/Microsoft/mwt-ds/raw/master/images/MWT-WhitePaper.pdf. [7] Alekh Agarwal, Alina Beygelzimer, Miroslav Dud??k, John Langford, and Hanna Wallach. A reductions approach to fair classification. Fairness, Ac- countability, and Transparency in Machine Learning (FATML), 2017. [8] Gagan Aggarwal, Gagan Goel, Chinmay Karande, and Aranyak Mehta. Online vertex-weighted bipartite matching and single-bid budgeted allocations. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 1253?1264. SIAM, 2011. [9] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. Diversifying search results. In Proceedings of the second ACM international conference on web search and data mining, pages 5?14. ACM, 2009. [10] Shipra Agrawal and Nikhil R Devanur. Fast algorithms for online stochastic convex programming. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pages 1405?1424. SIAM, 2014. [11] Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In 15th ACM Conf. on Economics and Computation (ACM EC), 2014. 209 [12] Shipra Agrawal and Nikhil R. Devanur. Linear contextual bandits with knap- sacks. In 29th Advances in Neural Information Processing Systems (NIPS), 2016. [13] Shipra Agrawal, Zizhuo Wang, and Yinyu Ye. A dynamic near-optimal al- gorithm for online linear programming. Operations Research, 62(4):876?890, 2014. [14] Shipra Agrawal, Nikhil R. Devanur, and Lihong Li. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In 29th Conf. on Learning Theory (COLT), 2016. [15] Faez Ahmed, John P Dickerson, and Mark Fuge. Diverse weighted bipartite b-matching. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 35?41. AAAI Press, 2017. [16] Saeed Alaei, MohammadTaghi Hajiaghayi, and Vahid Liaghat. Online prophet-inequality matching with applications to ad allocation. In Proceed- ings of the 13th ACM Conference on Electronic Commerce, pages 18?35. ACM, 2012. [17] Saeed Alaei, MohammadTaghi Hajiaghayi, and Vahid Liaghat. The online stochastic generalized assignment problem. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 11?25. Springer, 2013. [18] Noga Alon, Baruch Awerbuch, and Yossi Azar. The online set cover prob- 210 lem. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 100?105. ACM, 2003. [19] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptoti- cally efficient allocation rules for the multiarmed bandit problem with multiple plays-part i: Iid rewards. IEEE Transactions on Automatic Control, 32(11): 968?976, 1987. [20] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM (JACM), 56(2):5, 2009. [21] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8 (1):121?164, 2012. [22] Arash Asadpour, Michel X Goemans, Aleksander Madry, Shayan Oveis Gha- ran, and Amin Saberi. An o (log n/log log n)-approximation algorithm for the asymmetric traveling salesman problem. In SODA, volume 10, pages 379?389. SIAM, 2010. [23] Jean-Yves Audibert, Se?bastien Bubeck, and Ga?bor Lugosi. Minimax poli- cies for combinatorial prediction games. In 24th Conf. on Learning Theory (COLT), pages 107?132, 2011. [24] Peter Auer, Nicolo? Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235?256, 2002. 211 [25] Peter Auer, Nicolo? Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48?77, 2002. Preliminary version in 36th IEEE FOCS, 1995. [26] Baruch Awerbuch and Yossi Azar. Buy-at-bulk network design. In Proceedings 38th Annual Symposium on Foundations of Computer Science, pages 542?547. IEEE, 1997. [27] Yossi Azar, Niv Buchbinder, TH Hubert Chan, Shahar Chen, Ilan Reuven Cohen, Anupam Gupta, Zhiyi Huang, Ning Kang, Viswanath Nagarajan, and Joseph Naor. Online algorithms for covering and packing problems with con- vex objectives. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 148?157. IEEE, 2016. [28] Moshe Babaioff, Shaddin Dughmi, Robert D. Kleinberg, and Aleksandrs Slivkins. Dynamic pricing with limited supply. ACM Trans. on Economics and Computation, 3(1):4, 2015. Special issue for 13th ACM EC, 2012. [29] Ashwinkumar Badanidiyuru. Buyback problem-approximate matroid inter- section with cancellation costs. In International Colloquium on Automata, Languages, and Programming, pages 379?390. Springer, 2011. [30] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Yaron Singer. Learning on a budget: posted price mechanisms for online procurement. In 13th ACM Conf. on Electronic Commerce (EC), pages 128?145, 2012. [31] Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. Re- 212 sourceful contextual bandits. In 27th Conf. on Learning Theory (COLT), 2014. [32] Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and An- dreas Krause. Streaming submodular maximization: Massive data summa- rization on the fly. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 671?680. ACM, 2014. [33] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Ban- dits with knapsacks. J. of the ACM, 65(3), 2018. Preliminary version in FOCS 2013. [34] Nikhil Bansal, Niv Buchbinder, Aleksander Madry, and Joseph Naor. A polylogarithmic-competitive algorithm for the k-server problem. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 267?276. IEEE, 2011. [35] Nikhil Bansal, Nitish Korula, Viswanath Nagarajan, and Aravind Srinivasan. Solving packing integer programs via randomized rounding with alterations. Theory of Computing, 8(1):533?565, 2012. [36] Ahron Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimiza- tion: analysis, algorithms, and engineering applications, volume 2. Siam, 2001. [37] Dirk Bergemann and Juuso Va?lima?ki. Bandit Problems. In Steven Durlauf 213 and Larry Blume, editors, The New Palgrave Dictionary of Economics, 2nd ed. Macmillan Press, 2006. [38] Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57: 1407?1420, 2009. [39] Omar Besbes and Assaf J. Zeevi. Blind network revenue management. Oper- ations Research, 60(6):1537?1550, 2012. [40] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In 14th Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), 2011. [41] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. [42] Brian Brubach, Karthik Abinav Sankararaman, Aravind Srinivasan, and Pan Xu. New algorithms, better bounds, and a novel model for online stochastic matching. In 24th Annual European Symposium on Algorithms (ESA 2016). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016. [43] Se?bastien Bubeck and Nicolo Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1), 2012. [44] Se?bastien Bubeck, Ofer Dekel, Tomer Koren, and Yuval Peres. Bandit convex 214 optimization: \(\sqrt{T}\) regret in one dimension. In 28th Conf. on Learning Theory (COLT), pages 266?278, 2015. [45] Se?bastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernel-based methods for bandit convex optimization. In 49th ACM Symp. on Theory of Computing (STOC), pages 72?85. ACM, 2017. [46] Niv Buchbinder and Joseph Naor. The design of competitive online algorithms via a primal-dual approach. Foundations and Trends in Theoretical Computer Science, 3(2-3):93?263, 2009. [47] Niv Buchbinder and Joseph (Seffi) Naor. Online primal-dual algorithms for covering and packing. Math. Oper. Res., 34(2):270?286, May 2009. [48] Niv Buchbinder and Joseph Seffi Naor. The design of competitive online algo- rithms via a primal?dual approach. Foundations and Trends?R in Theoretical Computer Science, 3(2?3):93?263, 2009. [49] Niv Buchbinder, Kamal Jain, and Joseph Seffi Naor. Online primal-dual al- gorithms for maximizing ad-auctions revenue. In European Symposium on Algorithms, pages 253?264. Springer, 2007. [50] Niv Buchbinder, Moran Feldman, and Roy Schwartz. Online submodular maximization with preemption. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pages 1202?1216. Society for Industrial and Applied Mathematics, 2015. 215 [51] Niv Buchbinder, Danny Segev, and Yevgeny Tkach. Online algorithms for maximum cardinality matching with edge arrivals. Algorithmica, pages 1?19, 2017. [52] Gruia Calinescu, Chandra Chekuri, Martin Pa?l, and Jan Vondra?k. Maximiz- ing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6):1740?1766, 2011. [53] Adrian Rivera Cardoso, Jacob Abernethy, He Wang, and Huan Xu. Competing against nash equilibria in adversarially changing zero-sum games. In 36th Intl. Conf. on Machine Learning (ICML), pages 921?930, 2019. [54] Nicolo? Cesa-Bianchi and Ga?bor Lugosi. Prediction, learning, and games. Cam- bridge Univ. Press, 2006. [55] Nicolo? Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Regret mini- mization for reserve prices in second-price auctions. In ACM-SIAM Symp. on Discrete Algorithms (SODA), 2013. [56] TH Hubert Chan, Fei Chen, Xiaowei Wu, and Zhichao Zhao. Ranking on arbitrary graphs: Rematch via continuous lp with monotone and boundary condition constraints. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1112?1122. SIAM, 2014. [57] TH Hubert Chan, Zhiyi Huang, Shaofeng H-C Jiang, Ning Kang, and Zhi- hao Gavin Tang. Online submodular maximization with free disposal: Ran- domization beats for partition matroids. In Proceedings of the Twenty-Eighth 216 Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1204?1223. SIAM, 2017. [58] Moses Charikar and Balaji Raghavachari. The finite capacity dial-a-ride prob- lem. In Proceedings 39th Annual Symposium on Foundations of Computer Science, pages 458?467. IEEE, 1998. [59] Chandra Chekuri, Jan Vondrak, and Rico Zenklusen. Dependent randomized rounding via exchange properties of combinatorial structures. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 575?584. IEEE, 2010. [60] Chandra Chekuri, Jan Vondra?k, and Rico Zenklusen. Multi-budgeted match- ings and matroid intersection via dependent rounding. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 1080?1097. SIAM, 2011. [61] Chandra Chekuri, Jan Vondra?k, and Rico Zenklusen. Submodular function maximization via the multilinear relaxation and contention resolution schemes. SIAM Journal on Computing, 43(6):1831?1879, 2014. [62] Cheng Chen, Lan Zheng, Venkatesh Srinivasan, Alex Thomo, Kui Wu, and Anthony Sukow. Conflict-aware weighted bipartite b-matching and its applica- tion to e-commerce. IEEE Transactions on Knowledge and Data Engineering, 28(6):1475?1488, 2016. 217 [63] Tianyi Chen and Georgios B Giannakis. Bandit convex optimization for scal- able and dynamic iot management. IEEE Internet of Things Journal, 2018. [64] Tianyi Chen, Qing Ling, and Georgios B Giannakis. An online convex opti- mization approach to proactive network resource allocation. IEEE Transac- tions on Signal Processing, 65(24):6350?6364, 2017. [65] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed ban- dit: General framework and applications. In Sanjoy Dasgupta and David Mcallester, editors, Proceedings of the 30th International Conference on Ma- chine Learning (ICML-13), volume 28, pages 151?159. JMLR Workshop and Conference Proceedings, 2013. [66] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman, and Shang-Hua Teng. Electrical flows, laplacian systems, and faster approxi- mation of maximum flow in undirected graphs. In 43rd ACM Symp. on Theory of Computing (STOC), pages 273?282. ACM, 2011. [67] Richard Combes, Chong Jiang, and Rayadurgam Srikant. Bandits with bud- gets: Regret lower bounds and optimal algorithms. ACM SIGMETRICS Per- formance Evaluation Review, 43(1):245?257, 2015. [68] Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116?2124, 2015. [69] Nikhil R Devanur and Thomas P Hayes. The adwords problem: online keyword 218 matching with budgeted bidders under random permutations. In Proceedings of the 10th ACM conference on Electronic commerce, pages 71?78. ACM, 2009. [70] Nikhil R. Devanur and Thomas P. Hayes. The AdWords problem: Online keyword matching with budgeted bidders under random permutations. In 10th ACM Conf. on Electronic Commerce (EC), pages 71?78, 2009. [71] Nikhil R Devanur and Kamal Jain. Online matching with concave returns. In Proceedings of the forty-fourth annual ACM symposium on Theory of comput- ing, pages 137?144. ACM, 2012. [72] Nikhil R. Devanur, Kamal Jain, Balasubramanian Sivan, and Christopher A. Wilkens. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. In 12th ACM Conf. on Electronic Commerce (EC), pages 29?38, 2011. [73] Nikhil R Devanur, Balasubramanian Sivan, and Yossi Azar. Asymptotically optimal algorithm for stochastic adwords. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 388?404. ACM, 2012. [74] Nikhil R Devanur, Kamal Jain, and Robert D Kleinberg. Randomized primal- dual analysis of ranking for online bipartite matching. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 101?107. Society for Industrial and Applied Mathematics, 2013. [75] Nikhil R Devanur, Kamal Jain, Balasubramanian Sivan, and Christopher A 219 Wilkens. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. Journal of the ACM (JACM), 66(1):7, 2019. [76] John P Dickerson, Karthik A Sankararaman, Aravind Srinivasan, and Pan Xu. Allocation problems in ride-sharing platforms: Online matching with offline reusable resources. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018. [77] John P Dickerson, Karthik Abinav Sankararaman, Aravind Srinivasan, and Pan Xu. Assigning tasks to workers based on historical data: Online task assignment with two-sided arrivals. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 318?326. International Foundation for Autonomous Agents and Multiagent Systems, 2018. [78] John P Dickerson, Karthik A Sankararaman, Aravind Srinivasan, and Pan Xu. Balancing relevance and diversity in online bipartite matching via submodu- larity. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, AAAI ?19, 2019. [79] Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed ban- dit with budget constraint and variable costs. In 27th AAAI Conference on Artificial Intelligence (AAAI), 2013. [80] Miroslav Dud??k, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Lang- 220 ford, Lev Reyzin, and Tong Zhang. Efficient optimal leanring for contextual bandits. In 27th Conf. on Uncertainty in Artificial Intelligence (UAI), 2011. [81] Hossein Esfandiari, Nitish Korula, and Vahab Mirrokni. Online allocation with traffic spikes: Mixing adversarial and stochastic models. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 169?186. ACM, 2015. [82] Jon Feldman, Nitish Korula, Vahab Mirrokni, S Muthukrishnan, and Mar- tin Pa?l. Online ad assignment with free disposal. In Internet and network economics, pages 374?385. Springer, 2009. [83] Jon Feldman, Aranyak Mehta, Vahab Mirrokni, and S Muthukrishnan. Online stochastic matching: Beating 1-1/e. In Foundations of Computer Science (FOCS), pages 117?126. IEEE, 2009. [84] Jon Feldman, Monika Henzinger, Nitish Korula, Vahab S. Mirrokni, and Clif- ford Stein. Online stochastic packing applied to display ad allocation. In 18th Annual European Symp. on Algorithms (ESA), pages 182?194, 2010. [85] Amos Fiat, Richard M Karp, Michael Luby, Lyle A McGeoch, Daniel D Sleator, and Neal E Young. Competitive paging algorithms. Journal of Algo- rithms, 12(4):685?699, 1991. [86] Abraham Flaxman, Adam Kalai, and H. Brendan McMahan. Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient. 221 In 16th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 385?394, 2005. [87] Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In Proceedings of the ninth annual conference on Computational learning theory, pages 325?332. ACM, 1996. [88] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119?139, 1997. [89] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplica- tive weights. Games and Economic Behavior, 29(1-2):79?103, 1999. [90] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In New Frontiers in Dynamic Spectrum, 2010 IEEE Symposium on, pages 1?9. IEEE, 2010. [91] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network op- timization with unknown variables: Multi-armed bandits with linear rewards and individual observations, October 2012. [92] Rajiv Gandhi, Samir Khuller, Srinivasan Parthasarathy, and Aravind Srini- vasan. Dependent rounding and its applications to approximation algorithms. Journal of the ACM (JACM), 53(3):324?360, 2006. 222 [93] Asish Ghoshal and Jean Honorio. Learning linear structural equation models in polynomial time and sample complexity. In International Conference on Artificial Intelligence and Statistics, pages 1466?1475, 2018. [94] John Gittins, Kevin Glazebrook, and Richard Weber. Multi-Armed Bandit Allocation Indices. John Wiley & Sons, 2011. [95] Gagan Goel and Aranyak Mehta. Online budgeted matching in random in- put models with applications to adwords. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 982?991, 2008. [96] Sudipta Guha and Kamesh Munagala. Multi-armed Bandits with Metric Switching Costs. In 36th Intl. Colloquium on Automata, Languages and Pro- gramming (ICALP), pages 496?507, 2007. [97] Anupam Gupta, Ravishankar Krishnaswamy, Marco Molinaro, and R. Ravi. Approximation algorithms for correlated knapsacks and non-martingale ban- dits. In 52nd IEEE Symp. on Foundations of Computer Science (FOCS), pages 827?836, 2011. [98] Anupam Gupta, Viswanath Nagarajan, and Sahil Singla. Algorithms and adaptivity gaps for stochastic probing. In Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms, pages 1731?1747. SIAM, 2016. [99] Anupam Gupta, Viswanath Nagarajan, and Sahil Singla. Adaptivity gaps for stochastic probing: Submodular and xos functions. In Proceedings of the 223 Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1688?1702. SIAM, 2017. [100] Andra?s Gyo?rgy, Levente Kocsis, Ivett Szabo?, and Csaba Szepesva?ri. Contin- uous time associative bandit problems. In 20th Intl. Joint Conf. on Artificial Intelligence (IJCAI), pages 830?835, 2007. [101] Andra?s Gyo?rgy, Tama?s Linder, Ga?bor Lugosi, and Gyo?rgy Ottucsa?k. The on- line shortest path problem under partial monitoring. J. of Machine Learning Research (JMLR), 8:2369?2403, 2007. [102] Bernhard Haeupler, Vahab S Mirrokni, and Morteza Zadimoghaddam. On- line stochastic weighted matching: Improved approximation algorithms. In International Workshop on Internet and Network Economics, pages 170?181. Springer, 2011. [103] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4): 19, 2016. [104] Elad Hazan. Introduction to Online Convex Optimization. Foundations and Trends?R in Optimization, 2(3-4):157?325, 2015. [105] Elad Hazan and Kfir Y. Levy. Bandit convex optimization: Towards tight bounds. In 27th Advances in Neural Information Processing Systems (NIPS), pages 784?792, 2014. 224 [106] Justin Hsu, Zhiyi Huang, Aaron Roth, and Zhiwei Steven Wu. Jointly pri- vate convex programming. In 27th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 580?599, 2016. [107] Zhiyi Huang, Ning Kang, Zhihao Gavin Tang, Xiaowei Wu, Yuhao Zhang, and Xue Zhu. How to match when all vertices arrive online. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 17?29. ACM, 2018. [108] Zhiyi Huang, Binghui Peng, Zhihao Gavin Tang, Runzhou Tao, Xiaowei Wu, and Yuhao Zhang. Tight competitive ratios of classic matching algorithms in the fully online model. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2875?2886. SIAM, 2019. [109] Nicole Immorlica, Karthik Abinav Sankararaman, Robert Schapire, and Alek- sandrs Slivkins. Adversarial bandits with knapsacks. In Foundations of Com- puter Science (FOCS). IEEE, 2019. [110] Russell Impagliazzo and Valentine Kabanets. Constructive proofs of concen- tration bounds. In Approximation, Randomization, and Combinatorial Opti- mization. Algorithms and Techniques, pages 617?631. Springer, 2010. [111] Patrick Jaillet and Xin Lu. Online stochastic matching: New algorithms with better bounds. Mathematics of Operations Research, 39(3):624?646, 2013. [112] David S Johnson. Approximation algorithms for combinatorial problems. Journal of computer and system sciences, 9(3):256?278, 1974. 225 [113] Satyen Kale, Lev Reyzin, and Robert E. Schapire. Non-stochastic bandit slate problems. In 24th Advances in Neural Information Processing Systems (NIPS), pages 1054?1062, 2010. [114] Michael Kapralov, Ian Post, and Jan Vondra?k. Online submodular welfare maximization: Greedy is optimal. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1216?1225. SIAM, 2013. [115] Mohammad Karimi, Mario Lucic, Hamed Hassani, and Andreas Krause. Stochastic submodular maximization: The case of coverage functions. In Ad- vances in Neural Information Processing Systems, pages 6853?6863, 2017. [116] Richard M Karp, Umesh V Vazirani, and Vijay V Vazirani. An optimal al- gorithm for on-line bipartite matching. In Proceedings of the twenty-second annual ACM symposium on Theory of computing, pages 352?358. ACM, 1990. [117] Sumeet Katariya, Branislav Kveton, Csaba Szepesva?ri, and Zheng Wen. DCM bandits: Learning to rank with multiple clicks. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1215?1224, 2016. [118] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In 35th Intl. Conf. on Machine Learning (ICML), pages 2564?2572, 2018. [119] Thomas Kesselheim, Klaus Radke, Andreas To?nnis, and Berthold Vo?cking. An optimal online algorithm for weighted bipartite matching and extensions 226 to combinatorial auctions. In European Symposium on Algorithms (ESA), pages 589?600. Springer, 2013. [120] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit prob- lem. In 18th Advances in Neural Information Processing Systems (NIPS), 2004. [121] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Ban- dits and experts in metric spaces. Working paper, published at http://arxiv.org/abs/1312.1277, 2018. Merged and revised version of conference papers in ACM STOC 2008 and ACM-SIAM SODA 2010. [122] Nitish Korula and Martin Pa?l. Algorithms for secretary problems on graphs and hypergraphs. In Automata, Languages and Programming, pages 508?520. Springer, 2009. [123] Akshay Krishnamurthy, Alekh Agarwal, and Miroslav Dud??k. Contextual semibandits via supervised learning oracles. In 29th Advances in Neural In- formation Processing Systems (NIPS), 2016. [124] Branislav Kveton, Zheng Wen, Azin Ashkan, Hoda Eydgahi, and Brian Eriks- son. Matroid bandits: Fast combinatorial optimization with learning. In Nevin L. Zhang and Jin Tian, editors, UAI, pages 420?429. AUAI Press, 2014. [125] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascad- ing bandits: Learning to rank in the cascade model. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine 227 Learning (ICML-15), pages 767?776. JMLR Workshop and Conference Pro- ceedings, 2015. [126] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvri. Tight regret bounds for stochastic combinatorial semi-bandits. In Guy Lebanon and S. V. N. Vishwanathan, editors, AISTATS, JMLR Workshop and Conference Proceedings. JMLR.org, 2015. [127] John Langford and Tong Zhang. The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits. In 21st Advances in Neural Information Processing Systems (NIPS), 2007. [128] Der-Horng Lee, Hao Wang, Ruey Long Cheu, and Siew Hoon Teo. Taxi dispatch system based on current demands and real-time traffic conditions. Transportation Research Record, 1882(1):193?200, 2004. [129] Jon Lee, Maxim Sviridenko, and Jan Vondra?k. Submodular maximization over multiple matroids via generalized exchange properties. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 244?257. Springer, 2009. [130] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212?260, 1994. [131] La?szlo? Lova?sz. On the ratio of optimal integral and fractional covers. Discrete mathematics, 13(4):383?390, 1975. 228 [132] Meghna Lowalekar, Pradeep Varakantham, and Patrick Jaillet. Online spatio- temporal matching in stochastic and dynamic domains. Artificial Intelligence, 261:71?112, 2018. [133] Will Ma. Improvements and generalizations of stochastic knapsack and multi- armed bandit approximation algorithms. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1154?1163. So- ciety for Industrial and Applied Mathematics, 2014. [134] Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints. J. of Machine Learning Research (JMLR), 13(Sep):2503?2528, 2012. [135] Mehrdad Mahdavi, Tianbao Yang, and Rong Jin. Stochastic convex optimiza- tion with multiple objectives. In Advances in Neural Information Processing Systems (NIPS), pages 1115?1123, 2013. [136] Mohammad Mahdian and Qiqi Yan. Online bipartite matching with random arrivals: an approach based on strongly factor-revealing LPs. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 597? 606. ACM, 2011. [137] Vahideh H Manshadi, Shayan Oveis Gharan, and Amin Saberi. Online stochas- tic matching: Online actions based on offline statistics. Mathematics of Op- erations Research, 37(4):559?573, 2012. [138] Andrew McGregor. Finding graph matchings in data streams. In Approxima- 229 tion, Randomization and Combinatorial Optimization. Algorithms and Tech- niques, pages 170?181. Springer, 2005. [139] Aranyak Mehta. Online matching and ad allocation. Foundations and Trends?R in Theoretical Computer Science, 8(4):265?368, 2013. [140] Aranyak Mehta and Debmalya Panigrahi. Online matching with stochastic rewards. In Foundations of Computer Science (FOCS), pages 728?737. IEEE, 2012. [141] Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. Adwords and generalized online matching. J. ACM, 54(5):22, 2007. ISSN 0004-5411. [142] Aranyak Mehta, Bo Waggoner, and Morteza Zadimoghaddam. Online stochas- tic matching with unequal probabilities. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2015. [143] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, and Amin Karbasi. Fast constrained submodular maximization: Personalized data summariza- tion. In International Conference on Machine Learning, pages 1358?1367, 2016. [144] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Dis- tributed submodular maximization. The Journal of Machine Learning Re- search, 17(1):8330?8373, 2016. [145] Baharan Mirzasoleiman, Morteza Zadimoghaddam, and Amin Karbasi. Fast 230 distributed submodular cover: Public-private data summarization. In Ad- vances in Neural Information Processing Systems, pages 3594?3602, 2016. [146] Baharan Mirzasoleiman, Stefanie Jegelka, and Andreas Krause. Streaming non-monotone submodular maximization: Personalized video summarization on the fly. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [147] Marco Molinaro and R. Ravi. Geometry of online packing linear programs. In 39th Intl. Colloquium on Automata, Languages and Programming (ICALP), pages 701?713, 2012. [148] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cam- bridge University Press, Cambridge, 1995. [149] Michael J Neely and Hao Yu. Online convex optimization with time-varying constraints. arXiv preprint arXiv:1702.04783, 2017. [150] George L Nemhauser and Laurence A Wolsey. Best algorithms for approxi- mating the maximum of a submodular set function. Mathematics of operations research, 3(3):177?188, 1978. [151] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical programming, 14(1):265?294, 1978. [152] Gergely Neu and Ga?bor Barto?k. Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. J. of Machine Learning Research (JMLR), 17(1):5355?5375, 2016. 231 [153] Christos H Papadimitriou and Kenneth Steiglitz. Combinatorial optimization: algorithms and complexity. Courier Corporation, 1982. [154] Prabhakar Raghavan and Clark D Tompson. Randomized rounding: a tech- nique for provably good algorithms and algorithmic proofs. Combinatorica, 7 (4):365?374, 1987. [155] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In 26th Conf. on Learning Theory (COLT), pages 993?1019, 2013. [156] Alexander Rakhlin and Karthik Sridharan. BISTRO: an efficient relaxation- based method for contextual bandits. In 33nd Intl. Conf. on Machine Learning (ICML), 2016. [157] Anshuka Rangi, Massimo Franceschetti, and Long Tran-Thanh. Unifying the stochastic and the adversarial bandits with knapsack. arXiv preprint arXiv:1811.12253, 2018. [158] Adrian Rivera, He Wang, and Huan Xu. Online saddle point problem with applications to constrained online convex optimization. arXiv preprint arXiv:1806.08301, 2018. [159] Herbert Robbins. Some Aspects of the Sequential Design of Experiments. Bull. Amer. Math. Soc., 58:527?535, 1952. [160] Ryan Rogers, Aaron Roth, Jonathan Ullman, and Zhiwei Steven Wu. Inducing approximately optimal flow using truthful mediators. In 16th ACM Conf. on Electronic Commerce (EC), pages 471?488, 2015. 232 [161] Aaron Roth, Jonathan Ullman, and Zhiwei Steven Wu. Watch and learn: Optimizing from revealed preferences feedback. In 48th ACM Symp. on Theory of Computing (STOC), pages 949?962, 2016. [162] Aaron Roth, Aleksandrs Slivkins, Jonathan Ullman, and Zhiwei Steven Wu. Multidimensional dynamic pricing for welfare maximization. In 18th ACM Conf. on Electronic Commerce (EC), pages 519?536, 2017. [163] Karthik Abinav Sankararaman and Aleksandrs Slivkins. Combinatorial semi- bandits with knapsacks. In Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), pages 1760?1770, 2018. [164] Karthik Abinav Sankararaman, Anand Louis, and Navin Goyal. Stability of linear structural equation models of causal inference. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence, UAI ?19, 2019. [165] Karthik Abinav Sankararaman, Anand Louis, and Navin Goyal. Robust iden- tifiability in the linear structural equation model for causal inference. In Work- ing Paper, 2019. [166] Kanthi K Sarpatwar, Baruch Schieber, and Hadas Shachnai. Constrained sub- modular maximization via greedy local search. Operations Research Letters, 47(1):1?6, 2019. [167] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012. 233 [168] Alexander Schrijver. Combinatorial optimization: polyhedra and efficiency, volume 24. Springer Science & Business Media, 2002. [169] Adish Singla and Andreas Krause. Truthful incentives in crowdsourcing tasks using regret minimization mechanisms. In 22nd Intl. World Wide Web Conf. (WWW), pages 1167?1178, 2013. [170] Aleksandrs Slivkins. Dynamic ad allocation: Bandits with budgets. A techni- cal report on arxiv.org/abs/1306.0155, June 2013. [171] Serban Stan, Morteza Zadimoghaddam, Andreas Krause, and Amin Karbasi. Probabilistic submodular maximization in sub-linear time. In Proceedings of the 34th International Conference on Machine Learning, pages 3241?3250. JMLR. org, 2017. [172] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In 28th Advances in Neural Information Processing Systems (NIPS), pages 2989?2997, 2015. [173] Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert E. Schapire. Efficient algorithms for adversarial contextual learning. In 33nd Intl. Conf. on Machine Learning (ICML), 2016. [174] Vasilis Syrgkanis, Haipeng Luo, Akshay Krishnamurthy, and Robert E. Schapire. Improved regret bounds for oracle-based adversarial contextual ban- dits. In 29th Advances in Neural Information Processing Systems (NIPS), 2016. 234 [175] William R. Thompson. On the likelihood that one unknown probability ex- ceeds another in view of the evidence of two samples. Biometrika, 25(3-4): 285?294, 1933. [176] Yongxin Tong, Jieying She, Bolin Ding, Libin Wang, and Lei Chen. Online mobile micro-task allocation in spatial crowdsourcing. In 32nd international conference on data engineering (ICDE), pages 49?60. IEEE, 2016. [177] Long Tran-Thanh, Archie Chapman, Enrique Munoz de Cote, Alex Rogers, and Nicholas R. Jennings. -first policies for budget-limited multi-armed ban- dits. In 24th AAAI Conference on Artificial Intelligence (AAAI), pages 1211? 1216, 2010. [178] Long Tran-Thanh, Archie Chapman, Alex Rogers, and Nicholas R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In 26th AAAI Conference on Artificial Intelligence (AAAI), pages 1134?1140, 2012. [179] Van-Anh Truong and Xinshang Wang. Prophet inequality with correlated arrival probabilities, with application to two sided matchings. arXiv preprint arXiv:1901.02552, 2019. [180] Sebastian Tschiatschek, Rishabh K Iyer, Haochen Wei, and Jeff A Bilmes. Learning mixtures of submodular functions for image collection summariza- tion. In Advances in neural information processing systems, pages 1413?1421, 2014. 235 [181] Seeun Umboh. Online network design algorithms via hierarchical decompo- sitions. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pages 1373?1387. Society for Industrial and Applied Mathematics, 2015. [182] Jun-Kun Wang and Jacob Abernethy. Acceleration through optimistic no- regret dynamics. arXiv preprint arXiv:1807.10455, 2018. [183] Yajun Wang and Sam Chiu-wai Wong. Two-sided online bipartite matching and vertex cover: Beating the greedy algorithm. In International Colloquium on Automata, Languages, and Programming, pages 1070?1081. Springer, 2015. [184] Zizhuo Wang, Shiming Deng, and Yinyu Ye. Close the gaps: A learning-while- doing algorithm for single-product revenue management problems. Operations Research, 62(2):318?331, 2014. [185] Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In 31st Conf. on Learning Theory (COLT), 2018. [186] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pages 1954?1963, 2015. [187] Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large- scale combinatorial semi-bandits. In Francis R. Bach and David M. Blei, editors, ICML, JMLR Workshop and Conference Proceedings, pages 1113? 1122. JMLR.org, 2015. 236 [188] David P Williamson and David B Shmoys. The design of approximation al- gorithms. Cambridge university press, 2011. [189] Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. 2016. 237