ABSTRACT
Title of dissertation: SEQUENTIAL DECISION MAKING
WITH LIMITED RESOURCES
Karthik Abinav Sankararaman
Doctor of Philosophy, 2019
Dissertation directed by: Professor Aravind Srinivasan
Department of Computer Science
One of the goals of Artificial Intelligence (AI) is to enable multiple agents to
interact, co-ordinate and compete with each other to realize various goals. Typically,
this is achieved via a system which acts as a mediator to control the agents? behavior
via incentives. Such systems are ubiquitous and include online systems for shopping
(e.g., Amazon), ride-sharing (e.g., Uber, Lyft) and Internet labor markets (e.g.,
Mechanical Turk). The main algorithmic challenge in such systems is to ensure that
they can operate under a variety of informational constraints such as uncertainty in
the input, committing to actions based on partial information or being unaffected by
noisy input. The mathematical framework used to study such systems are broadly
called sequential decision making problems where the algorithm does not receive
the entire input at once; it obtains parts of the input by interacting (also called
?actions?) with the environment.
In this thesis, we answer the question, under what informational constraints
can we design efficient algorithms for sequential decision making problems.
The first part of the thesis deals with the Online Matching problem. Here,
the algorithm deals with two prominent constraints: uncertainty in the input and
choice of actions being restricted by a combinatorial constraint. We design several
new algorithms for many variants of this problem and provide provable guarantees.
We also show their efficacy on the ride-share application using a real-world dataset.
In the second part of the thesis, we consider the Multi-armed bandit problem with
additional informational constraints. In this setting, the algorithm does not receive
the entire input and needs to make decisions based on partial observations. Addi-
tionally, the set of possible actions is controlled by global resource constraints that
bind across time. We design new algorithms for multiple variants of this problem
that are worst-case optimal. We provide a general reduction framework to the clas-
sic multi-armed bandits problem without any constraints. We complement some of
the results with preliminary numerical experiments.
SEQUENTIAL DECISION MAKING WITH LIMITED
RESOURCES
by
Karthik Abinav Sankararaman
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2019
Advisory Committee:
Professor Aravind Srinivasan, Chair
Professor John P. Dickerson
Professor Tom Goldstein
Professor Furong Huang
Professor Prakash Narayan
Dr. Aleksandrs Slivkins
?c Copyright by
Karthik Abinav Sankararaman
2019

Dedicated to my parents.
ii
iii
Acknowledgments
?Although an act of help done timely, might be small in nature, it is truly
larger than the world itself.?
? Thiruvalluvar, a classic Tamil poet
I thank Aravind Srinivasan for the guidance, support and help throughout these
years. He is a true genius and it was a privilege to work beside him and absorb
some of it. I have learnt many interesting mathematical ideas from him. He is
very kind and has the ability to connect his students to new topics (and associated
experts) that he thinks are important. One such connection was with John Dickerson
who joined UMD and has since been a very close collaborator. His keen insight
on applications of the theoretical work is commendable. I am indebted for the
opportunity to work on many projects with him and learn a great deal.
In Fall 2015, I was fortunate to meet Alex Slivkins through his class on bandits
at UMD. Alex is an extremely prolific researcher and I am thankful for the gazillion
hours he has spent over these years to mentor me: writing, talks, research, life. I am
extremely impressed by the amount of energy he has towards doing research (and
life) and hope that some of this has rubbed off on me. He is always available via
Skype/email/phone for brainstorming and this has helped me a lot while thinking
about problems. I am also thankful to him (and Microsoft) for hosting me at what
was a fun and productive summer at MSR NYC along with Rob Schapire and Nicole
Immorlica. Chapter 7 was the outcome of this internship; I learnt a great deal about
bandits, online learning and game theory in this process from Alex, Rob and Nicole.
iv
Aravind connected me with Anand Louis at IISc and Navin Goyal at MSR
India which was the start of our collaboration on causal inference. I am extremely
thankful to them for the numerous weekly meetings (and accommodating their
schedule to the 9+ hours of time-difference). I learnt a great deal about many
topics including linear algebra, probability and statistics. I am also grateful to
Anand for reading through all the proofs carefully and giving critical feedback. I
would like to thank IISc for hosting me in the summer, Microsoft for the space and
food and Anand and Navin for making this collaboration possible (and having me
visit on my trips back to India).
Many thanks to my committee members Prakash Narayan, Tom Goldstein
and Furong Huang; I am grateful to the many interactions during my time here
on various problems in machine learning. Bill Gasarch mentored me on how to be
a great instructor and I am thankful for his advice on classroom teaching. And I
thank my professors from IITM especially Jayalal Sarma and Narayanaswamy who
first introduced me to theoretical computer science and algorithms.
I thank Prithvi Sen for hosting me at IBM Almaden and Anil Kamath for
hosting me at Adobe during my first two summers. It was a period of lot of learning
for me. Recently, Soham De introduced me to the world of neural nets and has
since taught me many things about the theory and practice of deep learning. I
am extremely thankful to him for those discussions. The initial parts of the online
matching research was done with my friends Pan Xu and Brian Brubach. I would
always cherish the numerous meetings we had to understand, discuss and eventually
solve some of the questions. I continued collaborating with Pan and most of my
v
later research on this topic has been jointly with him. We have worked on several
problems together and it has been extremely fulfilling. Science is a team-sport
and the research done during my PhD would not have been possible without my
co-authors who are some of the smartest people I have interacted with. I thank
all my office mates at AV Williams and Iribe Center for the company, coffee/tea
breaks and discussion about various topics over these years. I am also extremely
fortunate to have had a huge army of friends right from my middle school days
through the undergrad years and grad school; I am thankful to each one of them
for long phone calls, planning numerous trips, giving me advice and company (also
a place to crash). I am also very grateful for the company of all the friends I made
in grad school and at internships; they have kept me happy in the last few years.
Home is where the heart is: I thank my parents, grand parents, my brother
for the support, affection and happiness in all these years (and more recently my
sister-in-law). The ?Indian? way of returning this love is to make them proud and
hopefully this is a step towards that direction. I am also thankful to many of my
relatives who have frequently kept in touch via calls and meetings in all these years.
I gratefully acknowledge NSF Awards CNS 1010789, CCF 1422569, CCF-
1749864, a gift from Google, Inc., and research awards from Adobe, Inc. and Ama-
zon, Inc that has supported the research during my PhD. Chapter 3 is based on [42]
with Brian Brubach, Aravind Srinivasan and Pan Xu. Chapters 4, 5 and 6 are based
on [76, 78] with John Dickerson, Aravind Srinivasan and Pan Xu. Chapter 7 is based
on [109] with Nicole Immorlica, Rob Schapire and Alex Slivkins and Chapter 8 is
based on [163] with Alex Slivkins.
vi
Table of Contents
Dedication ii
Acknowledgements iii
Table of Contents vii
List of Tables x
List of Figures xi
1 Introduction 1
2 Preliminaries 6
2.1 Sequential Decision Making (SDM) protocol . . . . . . . . . . . . . . 6
2.2 Landscape of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Performance Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
I Online Matching 14
3 Online Matching 15
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Online Weighted Matching . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5 Overview of our algorithm . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Warmup: 0.688 competitive algorithm . . . . . . . . . . . . . . . . . 25
3.7 0.7 competitive algorithm . . . . . . . . . . . . . . . . . . . . . . . . 29
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 Extension 1: Online Weighted Matching with Reusable Vertices 34
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Main Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 LP benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 1/2? -competitive algorithm . . . . . . . . . . . . . . . . . . . . . . 39
4.6 Lower-bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
vii
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5 Extension 2: Online Submodular Weighted Matching 47
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Problem Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Background on Submodular Optimization . . . . . . . . . . . . . . . 51
5.5 Related Works in Submodular Optimization . . . . . . . . . . . . . . 53
5.6 Mathematical Program to Upper-bound OPT . . . . . . . . . . . . . 55
5.7 0.125-competitive Algorithm . . . . . . . . . . . . . . . . . . . . . . . 56
5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6 Rideshare: Empirical Evaluation of Online Matching 61
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Dataset and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4 Justifying The Two Important Model Assumptions . . . . . . . . . . 64
6.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
II Multi-armed Bandits 73
7 Bandits with Knapsacks 74
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.3 Challenges and Our Contributions . . . . . . . . . . . . . . . . . . . . 84
7.4 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.5 A new algorithm for Stochastic BwK . . . . . . . . . . . . . . . . . . 91
7.6 A simple algorithm for Adversarial BwK . . . . . . . . . . . . . . . . 100
7.7 High-probability algorithm for Adversarial BwK . . . . . . . . . . . . 108
7.8 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.9 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8 Better Bounds for Combinatorial Semi-bandits with Knapsacks 151
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.2 Challenges and Contribution . . . . . . . . . . . . . . . . . . . . . . . 154
8.3 Additional Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.4 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.5 Main algorithm (SemiBwK-RRS) . . . . . . . . . . . . . . . . . . . . . 164
8.6 Proof of Main Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.7 Applications and special cases . . . . . . . . . . . . . . . . . . . . . . 184
8.8 Numerical Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
9 Future Directions 197
viii
III Appendix 202
10 Standard Tools 203
10.1 Concentration Inequalities . . . . . . . . . . . . . . . . . . . . . . . . 203
10.2 Adversarial Online Learning . . . . . . . . . . . . . . . . . . . . . . . 203
10.3 Lagrangians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Bibliography 208
ix
List of Tables
2.1 Various problems in the SDM framework . . . . . . . . . . . . . . . . 10
2.2 Summary of benchmark and metrics for various problems . . . . . . . 13
x
List of Figures
2.1 SDM protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1 Online Matching Protocol: A basic version . . . . . . . . . . . . . . . 16
6.1 OTD is normal distribution under KIID . . . . . . . . . . . . . . . . 66
6.2 OTD is normal distribution under KAD . . . . . . . . . . . . . . . . 67
6.3 OTD is power law distribution under KAD . . . . . . . . . . . . . . . 68
6.4 The number of requests of a given type at various time-steps. x-axis:
time-step, y-aixs: number of requests . . . . . . . . . . . . . . . . . . 69
6.5 Occupation time distribution of all cars. x-axis: number of time-
steps, y-axis: number of requests . . . . . . . . . . . . . . . . . . . . 70
6.6 Occupation time distribution of two different cars. x-axis: number of
time-steps, y-axis: number of requests . . . . . . . . . . . . . . . . . . 71
7.1 Bandits with Knapsacks Protocol . . . . . . . . . . . . . . . . . . . . 77
7.2 Adversarial online learning . . . . . . . . . . . . . . . . . . . . . . . . 88
7.3 The lower-bounding construction for the ?(log T ) lower bound. . . . . 130
7.4 Adversarial contextual bandits . . . . . . . . . . . . . . . . . . . . . . 146
8.1 Dynamic Assortment (left) and Dynamic Pricing (right) experiments
for n = 26. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.2 Experimental Results for Uniform matroid (left plots) and Partition
matroid (right plots) on independent (upper) and correlated (lower)
instances for n = 26. . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
8.3 Experimental Results for Uniform matroid (left plots) and partition
matroid (right plots) on independent (upper) and correlated (lower)
instances for n = 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
8.4 Variation of per-step running times as n increases for the various
algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
xi
Chapter 1: Introduction
?We can only see a short distance ahead, but we can see plenty there
that needs to be done.?
? Alan Turing (possibly on sequential decision making)
The last decade has seen a rise in large-scale applications of artificial intelli-
gence and machine learning techniques in various domains ranging from e-commerce,
recommender systems to healthcare, medicine and climate change efforts. The
two biggest contributors to this phenomenon are the availability of large amounts
of datasets to perform supervised learning and cheap computing resources. This
led to efficient application of large scale supervised learning algorithms on massive
datasets. From a theoretical standpoint, these applications gave rise to a host of
research questions on newer models of computation for algorithm design. One such
model is the stochastic input model, where the algorithm receives input as a sample
from some distribution. Assumptions about this distribution are motivated from
those that are commonly observed across multiple datasets. For a significant por-
tion of this thesis, we study algorithm design questions that use the stochastic input
model.
1
Although supervised learning on ?offline data? has been very successful, the
current goal is to tackle applications from a wider range of domains where the data
is inherently ?real-time?. This implies that although one may be able to use the vast
historical data to pre-compute predictions about the future the algorithm?s decision
has to be done ?online?. Often the time-scales at which these algorithms operate
are in the order of milli-seconds; thus it vastly reduces the available compute time to
make decisions. Moreover, the algorithm needs to optimize over multiple objectives
and operate in domains where the decision affects human-beings. Concretely, con-
sider a ride-share platform such as Uber or Lyft. A central problem is to efficiently
match a driver to a request. However, the platform does not get all the requests
before it makes the matching decisions. Based on historical observational data, the
platform can obtain an estimate of the future requests, in order to optimize the
allocation policy; however, the actual match has to be done real-time. The goal
is to match requests to drivers while learning the best strategy and using them to
optimize for various criteria such as revenue, welfare and fairness. In fact, this gen-
eral setup is ubiquitous in many other applications such as crowdsourcing, Internet
advertising markets and recommender systems. These applications motivate the
study of a class of problems called ?sequential decision making? (SDM) where the
algorithm does not receive the entire input at once; the algorithm makes a sequence
of decisions to obtain the input sequence (often the inputs are stochastic and/or
depends on past decisions).
In many of these applications, the algorithm is further restricted in the choice
of future decisions it can make based on the current actions. This imposes infor-
2
mation bottlenecks for the algorithm to learn about the future based on the past;
we abstractly call these the resource constraints. Thus, the central thesis of this
work is to understand under what informational constraints can we design efficient
algorithms for sequential decision making problems.
We make progress towards this by considering two broad classes of SDM prob-
lems and design efficient algorithms for variants thereof. The first class of problems
is called the online matching problem where the goal of the algorithm is to find an
optimal graph matching when the vertices of the graph are provided sequentially.
Variants of this problem have many applications such as matching riders to drivers,
keywords to advertisements and recommending movies to users. The algorithm is
usually faced with one or more objectives it has to optimize for such as revenue,
diversity and fairness. The second class we consider is a variant of the classic multi-
armed bandit problem called bandits with knapsacks. This is a sequential learning
problem where the algorithm wants to learn parameters of the system under various
information constraints such as the number of samples it is allowed to collect. Many
applications such as ride-share, crowdsourcing and recommender systems have tun-
able parameters whose optimal value is not known apriori and the algorithm needs
to explore several choices based on real-time inputs. However, it is also usually
constrained by external constraints such as limited inventory. Thus variants of this
model is useful in such application domains.
Our Contributions and Outline: The rest of this thesis is structured as follows.
Chapter 2 describes the SDM framework formally and provides the benchmark and
3
performance metric we use for the algorithms. Then the thesis is divided into two
parts. Part I deals with problems in Online Matching. In Chapter 3 we consider
the weighted version of the online matching problem under stochastic inputs and
show algorithms with improved competitive ratio. In Chapter 4 we consider an
extension of the problem where the matched vertices may become available after
certain number of time-steps from the time it is matched. We design a new algorithm
for this problem and show that the competitive ratio obtained is tight among all
algorithms that use a class of techniques. In Chapter 5 we consider another extension
of the problem where the goal of the algorithm is to maximize multiple objectives,
namely total reward and the diversity of edge types. Abstractly, we solve the online
matching problem with submodular objective functions. In the last chapter 6 of
this part we consider the real-world application of matching drivers to riders on a
ride-share platform. We model this problem as an Online Matching problem and
use a real-world dataset to empirically evaluate the proposed algorithms. In Part
II of the thesis we consider the variants of the bandit with knapsack problem. In
Chapter 7 we consider the adversarial version of the basic bandits with knapsacks
problem and give an optimal algorithm via a reduction to the classic multi-armed
bandit problem. Along the way we also give a new algorithm to the stochastic
bandits with knapsacks problem. We show many extensions that can be obtained
as a corollary of this reduction. In Chapter 8 we consider an extension of this
problem to the combinatorial semi-bandits with knapsacks problem in the stochastic
setting and give optimal algorithm for this problem. In this chapter, we also perform
experiments on simulated datasets to evaluate the performance of this algorithm in
4
practice and compare it with other algorithms proposed in the literature. Finally
in Chapter 9 we conclude the thesis with a few open problems and broader future
directions for research.
5
Chapter 2: Preliminaries
In this chapter, we give a formal definition of the sequential decision mak-
ing protocol. The sequential decision making framework is used across multiple
communities (e.g., reinforcement learning, experimental design, online algorithms)
with (slightly) different terminologies. Here we describe a general version of the
sequential decision making protocol that captures all problems considered in this
thesis.
2.1 Sequential Decision Making (SDM) protocol
The SDM protocol can be viewed as a repeated game between the algorithm
and the nature. The protocol proceeds in rounds with the total number of rounds
T known to both the players a-priori. At each time-step t ? [T ] the nature first
presents part of the input xt to the algorithm. The algorithm then chooses an action
at ? At, where At represents the set of feasible actions available to the algorithm
at time t. The choice at yields a reward rt to the algorithm and modifies the set of
available actions At+1 to the algorithm in the next round. The goal of the algorithm
is to maximize the total reward obtained in the T rounds. Critically, the algorithm?s
action at is binding and cannot be revoked at a later time. Moreover, it does not
6
receive the full input beforehand and thus has to optimize based on observed history
until time t and any predictions of the future the algorithm may have. The set of
actions At+1 could potentially be a strict subset of At; abstractly we call this as
resource constraints. Resource constraints make the problem significantly harder
and understanding under what constraints can we design efficient algorithms is the
central thesis of this work. Figure 2.1 describes the protocol formally. Through-
out this thesis, we use the term online algorithm interchangeably to refer to any
algorithm for the SDM protocol.
Given: number of time-steps T .
In each round t ? [T ],
1. Nature reveals part of the input xt.
2. The algorithm chooses action at ? At.
3. Algorithm receives reward rt which is a function of at.
4. The set of available actions At+1 is updated based on at.
Figure 2.1: SDM protocol
2.1.1 Models of the Nature
The literature makes various assumption on the power of the nature. In the
adversarial model, the nature is assumed to be an arbitrary adversary and the goal
of the algorithm is to deal with the worst-case input sequence. In the adaptive
7
adversary model, this adversary is assumed to be adaptive to the choices of the
algorithm. In other words, the input xt, the reward rt and the new set of actions
At+1 can be set by the nature after observing the set of actions a1, a2, . . . , a 1t?2 .
In the oblivious adversary model, the adversary has lesser power and the map from
action at to input xt+1, reward rt and action set At+1 should be fixed before the start
of the game. Thus, the adversary cannot adapt its strategy to the actions taken by
the algorithm. We also consider stochastic models of the adversary where the input
xt and the reward rt is drawn from a joint distribution D over all possible tuples
(xt, rt), at each time-step. Additionally, in all problems considered in this thesis,
this draw is independent across time-steps (however, the distribution from which
the tuples are sampled may vary over time). In the known distribution model,
we assume that the algorithm knows the distribution(s) before the game starts
while in the unknown distribution model, the algorithm does not have access to the
distribution(s) and has to learn the relevant information over time.
2.2 Landscape of Problems
Many problems fall under the category of the SDM protocol described above.
In this thesis, we focus on two specific models of computation namely Online Match-
ing and Multi-armed Bandits with Knapsacks. We briefly describe the basic version
of these problems and show how they fit into the broader SDM framework. Ta-
ble 2.1 summarizes how the two problems fit into the SDM protocol described in
Figure 2.1.
1In fact, in some problems we allow the adversary to also observe the action at?1.
8
2.2.1 Online Matching
In this problem the algorithm makes a sequence of decisions to solve a graph
matching problem. We are given a bipartite graph G = (U, V,E) with the vertex
set U given before the start of the game. The vertices in the set V are presented
sequentially at each time-step. The algorithm either matches the presented vertex
vt to any (available) neighbor in U or rejects v irrevocably. Once a vertex u ? U is
matched to some v, it becomes unavailable for the rest of the time-steps. The goal
is to maximize the total size of the matching. Mapping this to the SDM protocol,
the input xt is the vertex vt presented at time-step t. The reward rt is 1 if vt
was matched and 0 otherwise. The set of actions At is the available vertices in U
at time-step t. This problem was first introduced by Karp et al. [116] and many
variants of this basic problem has since been studied. In this thesis, we consider
many variants of this basic problem which has applications in many modern large-
scale matching markets such as ride-share, recommender systems, job scheduling
and online advertising.
2.2.2 Multi-armed Bandits with Knapsacks
In this problem the algorithm has K arms to choose from at each time-step.
There are d resources each with a global budget B. The algorithm first chooses
an arm at ? [K]. The nature then reveals a reward rt(at) for the chosen arm and
reveals the consumption of each of the d resources c1,t(at), . . . , cd,t(at). We allow the
algorithm to skip any round (denoted by an action a(null) ? [K]) which yields no
9
reward. The moment some resource is exhausted the algorithm is allowed to play
only the action a(null) in all future time-steps. The goal is to maximize the total
reward collected by the algorithm. In the SDM protocol, the input xt is the outcome
vector ot?1 := (rt?1(at?1); c1,t?1(at?1), . . . , cd,t?1(at?1)) for all time-steps t ? 2 and
for t = 1 the nature does not provide any input to the algorithm. The reward rt is
just the scalar rt(at). The set of actions At remains the same set of K arms until one
resource is exhausted after which the set just reduces to the singleton set {a(null)}.
This problem was introduced as an extension to the classic multi-armed bandit
problem (Thompson [175]) to bandits with resource constraints by Badanidiyuru
et al. [33] and has since garnered a lot of attention.
Problem/Parameter Input xt Reward rt Action set At
1 if vt matched
Online Matching Vertex vt ? V Available vertices in U
0 otherwise
Bandits with t = 1: No input [K] if resource not exhausted
rt(at)
Knapsacks t ? 2: ot?1 a(null) otherwise
Table 2.1: Various problems in the SDM framework
2.3 Benchmark
We compare the performance of any online algorithm against a benchmark (de-
noted by OPT). We primarily deal with two definitions of the benchmark, namely,
best fixed distribution and best dynamic policy which we now define.
10
Definition 1 (Best dynamic policy). Given an instance I of the problem, the best
dynamic policy is the total reward obtained by the best possible algorithm in hindsight
given the realizations of all the randomness associated with the input. Thus, OPT
in this case is a random variable and we compare against OPTDP := E[OPT].
Algorithms described in Chapters 3, 4, 5 and 8 use this benchmark.
Definition 2 (Best fixed distribution). Given an instance I of the problem, the
best fixed distribution is the total expected reward obtained by sampling from the best
possible fixed distribution. We use the notation OPTFD := E[OPT] to denote this
benchmark.
This benchmark is used in the algorithms described in Chapter 7. More dis-
cussion of this benchmark is deferred to Chapter 7.
2.4 Performance Metric
To measure the performance of online algorithms the following notions are
commonly used, namely competitive ratio and regret. We define both these metrics
now.
Definition 3 (Competitive ratio). Let E[REW] denote the expected reward obtained
by any algorithm ALG and let E[OPT] denote the expected reward of the benchmark.
Then competitive ratio of the algorithm ALG is the value of the ratio E[REW]E .[OPT]
Note that in Definition 3, the ratio is atmost 1 and the goal is to maximize this
quantity. To keep it consistent with the related literature, in Chapter 7, we use the
11
reciprocal E[OPT]E as the performance metric. In that case, the goal is to minimize[REW]
the quantity.
Usually, competitive ratio is used in problems when the best dynamic policy
benchmark is used (Buchbinder and Naor [46]). However, for the bandits with
knapsacks problem we consider in Chapter 7 we show that even against the best
fixed distribution benchmark, the notion of competitive ratio is necessary. We are
not aware of other problems that exhibit this behavior.
Definition 4 (Regret). Let E[REW] denote the expected reward obtained by any
algorithm ALG and let E[OPT] denote the expected reward of the benchmark. Then
regret of the algorithm ALG is the value of the difference E[OPT]? E[REW].
Regret of any algorithm is non-negative and the goal is to minimize regret.
Regret is usually defined against the weaker best fixed distribution benchmark.
However, in Chapter 7 we show that for the stochastic version of the bandits with
knapsacks problem we can obtain non-trivial regret bounds even against the best
dynamic policy benchmark. Regret is the primary metric in Chapter 8.
Remark 1. Competitive ratio is invariant to the scale of the rewards {rt}t?[T ] while
regret scales linearly with the scale of the rewards. Thus, for consistency throughout
the thesis we assume that rt ? [0, 1] for all t ? [T ].
Table 2.2 summarizes the metric and the benchmark used for the various
problems considered in this thesis.
Upper-bound on OPT by a mathematical program. A common strategy
(e.g., see Buchbinder and Naor [46], Mehta [139]) in algorithm design is to not work
12
Problem/Metric Competitive Ratio Regret
Online Matching Dynamic Policy -
Stochastic Bandits with Knapsacks - Dynamic Policy
Adversarial Bandits with Knapsacks Fixed Distribution -
Table 2.2: Summary of benchmark and metrics for various problems
directly with OPT but to use a suitable upper-bound on its value. This upper-bound
is typically the optimal solution to an appropriate mathematical program. The ad-
vantage of this method is that, the optimal solution of such mathematical programs
have additional interesting properties which can be exploited in the analysis. More-
over, from the definition of the performance metrics it can be seen that algorithms
that are ?good? (either low competitive ratio or regret) against the optimal solution
of this program are also good against OPT. For all problems considered in this
thesis, we always use the optimal solution of an appropriate mathematical program
(typically a linear program) as the proxy for OPT.
13
Part I
Online Matching
14
Chapter 3: Online Matching
3.1 Introduction
In this chapter, we consider the model of Online Algorithms. The central prob-
lem of study will be the Online Weighted Matching problem under distributional ar-
rival sequences. The model of computation has many motivational examples such as
Online advertising, ride-share, crowdsourcing and kidney-exchange. In the simplest
setting of this problem, we start with an weighted bipartite graph G = (U, V,E).
The protocol proceeds in T rounds with the set U available before the start of
the game. In each round t ? [T ], a vertex vt ? V is presented to the algorithm
(henceforth we interchangeably use the term arrives) along with all its neighboring
edges in U (denoted by ?(vt)) and their corresponding weights. The algorithm has
two irrevocable choices: either choose an edge e = (u, vt) ? ?(vt), where u is an
unmatched vertex, or not match vt ? V . Thus, a choice to match an edge implies
that the algorithm cannot un-match it in a future time-step. Likewise, choosing
to drop vt implies that it cannot re-visit this vertex at a later time. The goal of
the algorithm is to maximize the total weight of the matching obtained after the T
rounds of the protocol. Figure 3.1 formally describes the protocol for the simplest
version of this problem.
15
Given: number of time-steps T and offline vertex set U .
In each round t ? [T ],
1. A vertex vt ? V and the set of neighbors ?(vt) is presented to the
algorithm.
2. The algorithm either chooses an unmatched vertex u ? U to match vt or
drops the request vt.
Figure 3.1: Online Matching Protocol: A basic version
3.1.1 Arrival sequence.
The landscape of problems can be divided based on the assumptions placed
on the manner in which vt is chosen to arrive at time t. The earliest version of this
problem, studied by Karp et al. [116], assumed that the vertices vt ? V is chosen
by an oblivious adversary and that all edge-weights are identical. Following this,
more recent works proposed two other models of arrival, namely, the unknown and
the known distribution models. In both these cases, we assume that the arrival
sequence is stochastic and sampled from a product distribution D over the set of
possible arrivals. In other words, at every time-step t, the vertex vt is sampled from
a distribution Dt independent of all other time-steps.
The focus of this work will be on known product distributions. We show some
future directions where we consider models that interpolates between the unknown
and known product distributions.
16
3.1.2 Competitive ratio.
The performance of any online algorithm for this problem is compared against
the best dynamic policy (Definition 1) . The competitive ratio with respect to this
benchmark is defined as follows. It is the ratio of the size of the matching obtained
by the algorithm to that of the optimal matching if the entire arrival sequence
was known (we interchangeably use the phrase the optimal matching of the offline
instance). Formally, let ALG(I) (a random variable) denote the size of the matching
obtained by any online algorithm ALG on instance I. Let OPT(I) denote the size
of the optimal matching (a random variable) in instance I if the arrival sequence of
vertices in V was available a-priori. Then the competitive ratio of ALG is defined
as maxI E[ALG(I)]/E[OPT(I)].
3.2 Related work
The seminal paper of Karp et al. [116] introduced this problem where they gave
the optimal competitive ratio of 1?1/e when the vertex arrival is set by an oblivious
adversary. Following this work, this problem has garnered much interest due to its
various applications (see Mehta [139] for a comprehensive survey). The vertex-
weighted version(of this) problem was introduced by Aggarwal et al. [8], where they
give an optimal 1? 1 ratio for the adversarial arrival model. The edge-weighted
e
setting has been studied in the adversarial model by Feldman et al. [82], where they
consider an additional relaxation of ?free-disposal?.
In addition to the adversarial and known I.I.D. models, online matching is
17
also studied under several other variants such as random arrival order, unknown
distributions, and known adversarial distributions. In the setting of random arrival
order, the arrival sequence is assumed to be a random permutation over all online
vertices, see e.g., Devanur and Hayes [69], Kesselheim et al. [119], Korula and Pa?l
[122], Mahdian and Yan [136]. In the case of unknown distributions, in each round an
item is sampled from a fixed but unknown distribution. If the sampling distributions
are required to be the same during each round, it is called unknown I.I.D. (Devanur
et al. [73, 75]); otherwise, it is called adversarial stochastic input (Devanur et al.
[75]). As for known adversarial distributions, in each round an item is sampled from
a known distribution, which is allowed to change over time (Alaei et al. [16, 17]).
Another variant of this problem is when the edges have stochastic rewards. Models
with stochastic rewards have been previously studied by Mehta and Panigrahi [140],
Mehta et al. [142] among others, but not in the known I.I.D. model.
In the known i.i.d. setting prior work has considered unweighted, vertex-
weighted and the edge-weighted versions of the problem. The vertex-weighted and
unweighted setting (often studied in tandem) have many results starting with Feld-
man et al. [83] who were the first to beat 1 ? 1/e with a competitive ratio of 0.67
for the unweighted problem. This was improved by Manshadi et al. [137] to 0.705
with an adaptive algorithm. In addition, they showed that even in the unweighted
variant with integral arrival rates, no algorithm can achieve a ratio better than
1? e?2 ? 0.86. Finally, Jaillet and Lu [111] presented an adaptive algorithm which
used a clever LP to achieve 0.725 and 1? 2e?2 ? 0.729 for the vertex-weighted and
unweighted problems, respectively. In the edge-weighted setting model Haeupler
18
et al. [102] were the first to beat 1? 1/e by achieving a competitive ratio of 0.667.
They use a discounted LP with tighter constraints and employ the power of two
choices paradigm to guide their online algorithm.
A closely related problem is the Adwords problem where every edge has a bid
and every offline vertex has a budget. The goal is to match an online vertex to any
offline vertex that has not exhausted its budget. This problem was introduced by
Mehta et al. [141] and subsequently studied by Buchbinder et al. [49], Devanur and
Jain [71], Devanur et al. [74], Goel and Mehta [95]. A series of works (Aggarwal
et al. [8], Chan et al. [56], Devanur et al. [74], Goel and Mehta [95]) has attempted
to simplify the RANKING algorithm proposed in Karp et al. [116] which has been
useful to develop algorithms for related problems such as vertex-weighted matching
and Adwords.
Another related model is the two-sided online matching problem where all
vertices in the graph come online and the goal is to match the vertices that are
currently present. A series of recent works (e.g., Dickerson et al. [77], Huang et al.
[107, 108], Truong and Wang [179], Wang and Wong [183]) consider many variants
of this model and design improved algorithms. Online matching has also been
considered in the edge-arrival model where the edges arrive instead of the vertices.
Representative works include Badanidiyuru [29], Buchbinder et al. [51], McGregor
[138].
19
3.3 Online Weighted Matching
3.3.1 Formal Description of the Problem
Before the start of the game, we are given a weighted bipartite graph G =
(U, V,E) and a corresponding weight function w : E ? R+ that maps each edge to
a positive real number. Additionally, we are given a rate function r : V ? Z+ which
denotes the number of times a vertex v ? V would arrive in expectation across all
the rounds in the game. The game proceeds in T rounds, where in each round a
vertex v ? V is sampled i.i.d. from a distribution D such that D(v) = r(v)/T .
In particular, the distributions Dt in the product distribution is the same in every
time-step. As noted in Haeupler et al. [102], we can assume that r(v) = 1 for
every vertex wlog. Indeed, any vertex v such that r(v) > 1 can be replaced by r(v)
different vertices each an arrival rate of 1. Thus, this assumption implies that the
size of the new set of online vertices is exactly T , since the sum of expectations
equals the total number of online rounds. The goal of the algorithm is to match an
incoming vertex v ? V to any unmatched neighbor (or drop) irrevocably in each
round, such that the sum of the weights of the matched edges after T rounds is
maximized.
Asymptotic assumption and notation. We will always assume T is large and
analyze algorithms as T goes to infinity: e.g., if x ? 1 ? (1 ? 2/T )T , we will just
write this as ?x ? 1 ? 1/e2? instead of the more-accurate ?x ? 1 ? 1/e2 + o(1)?.
These suppressed o(1) terms will subtract at most o(1) from our competitive ratios.
20
3.3.2 Linear Relaxation to Upper-bound OPT
As discussed in Chapter 2, a common strategy in online algorithms is to find
a suitable upper-bound on OPT using a mathematical program (typically a LP)
and compare the performance of the algorithm to this stronger benchmark. One of
the challenges in the design of algorithms is to ensure that the gap between OPT
and this upper-bound is as small as possible. In particular, one of the contributions
of our work is in constructing a stronger upper-bound compared to prior work. In
Lemma 1, we prove that the following LP is an upper-bound on OPT.
?
maximize ?e?E w(e) x(e) such that
?e??(u) x(e) ? 1 ?u ? U
e??(v) x(e) ? 1 ?v ? U (3.1)
0 ? x(e) ? 1? 1/e ?e ? E
x(e) + x(e?) ? 1? 1/e2 ?e, e? ? ?(u),?u ? U
?Lemma 1. Let x
? denote an optimal solution to LP (3.1). Then we have OPT ?
e?E w(e)x
?(e).
Proof. Let Y (e) denote the indicator random variable for the event that edge e ? E is
matched in the optimal solution for a given arrival sequence A. Let y(e) := EA[Y (e)]
for every edge e ? E. We will now argue that the vector y := (y(e?))e?E is a feasible
solution to the LP (3.1). Consider a vertex u ? U . We have that e??(u) Y (e) ? 1.
T?aking expectations on both sides and using the linearity of expectation we have
e??(u) y(e) ? 1. This shows that y is feasible to the first constraint. Let R(v)
21
denote the random variable for the number of times a ve?rtex v ? V arrived in a given
arrival sequence A. Then we have, for every v ? V , e??(v) Y (e) ? R(v). From
the integral arrival rates assumpti?on, EA[R(v)] = 1 for every v ? V . Thus, from
linearity of expectation we obtain e??(v) y(e) ? 1. This shows that y is feasible to
the second constraint. For any edge e = (u, v), let I[R(v) = 0] be an indicator for
the event that a vertex v ? V never arrives in the T rounds. Thus, for any arrival
sequence A, we have Y (e) ? I[R(v) 6= 0]. Taking expectations on both sides we get
y(e() ? EA) [I[R(v) =6 0]. The probability that a vertex v never arrives in T roundsT
is 1? 1 ? 1 . Thus, E [I[R(v) 6= 0] ? 1 ? 1A . This shows that y is feasible toT e e
the third constraint. Consider two edges e, e? ? ?(u) for some u ? U . Let e = (u, v)
and e? = (u, v?) and as before let I[R(v) =6 0] and I[R(v?) 6= 0] denote the indicator
for the events that v, v? arrives at least once in the T rounds, respectively. For any
arrival sequence A we have that Y (e) + Y (e?) ? I[R(v) 6= 0] ? I[R(v?) 6= 0]. Taking
expectations on both sides we get y(e) + y(e?) ? EA[I[R(v) 6= 0]? I[R((v
?) 6= )0]]. TheT
probability that both v and v? never arrive in the T rounds is given by 1? 2 ? 1
T e2
.
Thus, we get y(e) + y(e?) ? 1 ? 12 which shows that y is feasible to the fourthe
constraint. ?
The expected weight of t?he optimal solution is EA[ e?E w(e)Y (e)] which from
linearity of expectation gives e?E w(e)y(e). Since y is a feasible solution we have
that the optimal value to LP (3.1) is at least as large as the expected optimal
solution.
22
3.4 Our Contribution
Our main contribution to this problem is to improve the competitive ratio of
0.667 due to Haeupler et al. [102] to 0.7. We do so by designing a new algorithm
based on the dependent-rounding schemes of Gandhi et al. [92]. We describe in
detail the key observation we make in the algorithm of Haeupler et al. [102] which
we exploit in our algorithm design.
Haeupler et al. [102] used two matchings, M1 and M2 of the offline graph
G = (U, V,E) to guide the online algorithm and leverage the power of two choices.
When a vertex v arrives for the first time, an attempt to match to its neighbor in
M1 is made and on its second arrival an attempt to match to its neighbor in M2 is
made. However, these two matchings may not be edge disjoint, leaving some arriving
vertices with only one choice (or possibly none). In fact, choosing two matchings as
a guide that maximize both the edge weights and the number of disjoint edges is a
major challenge that arises in applying the power of two choices to this setting.
When the same edge (u, v) is included in both matchings M1 and M2, the copy
of (u, v) in M2 offers no additional benefit and a second arrival of v is wasted (which
happens with a substantial probability). Concretely, Haeupler et al. [102] choose
the two matchings in the following manner. M1 is obtained by solving an LP similar
to (3.1) and rounding this to an integral solution. M2 is constructed by first finding
a maximum-weight matching and then removing all the edges which have already
been included in M1. A key element of their proof is to show that the probability
of an edge being removed from M2 is at most 1? 1/e ? 0.63.
23
We make the observation that generating the two matchings in a slightly corre-
lated manner can drastically decrease this probability which translates to improved
competitive ratio. Moreover, we show a general technique to construct an ordered
set of k matchings where k is a tunable parameter. For k = 2, we show that the
probability of an edge appearing in both M1 and M2 is at most 1? 2/e ? 0.26.
3.5 Overview of our algorithm
We provide a overview of both the warm-up and the final algorithm. The
algorithm has two components, namely the offline phase and the online phase. The
offline phase is as follows. The algorithm first solves the LP (3.1) to obtain the
fractional solution x?. Then we use a rounding scheme described in subsection 3.5.1
to obtain a integral vector F where each coordinate in this vector is in the set
{0, 1, 2}. Then we construct two matchings M1 and M2 of the graph induced by
the vector F (repeating an edge whenever any coordinate is 2). The online phases
uses the two matchings M1 and M2 to guide its action. When a vertex type v ? V
arrives for the first time, we attempt to match it to the neighbor in M1. When it
arrives for the second time, we attempt to match it to the neighbor in M2. In its
third and succeeding arrivals, we simply drop the vertex.
3.5.1 LP rounding technique
We round x? to an integral solution F using a two step process we call
DR[x?, k]. The first step is to multiply x? by k. The second step is to apply the
24
dependent rounding techniques of Gandhi et al. [92] to this new vector. For most
parts of this section we use k = 2. In the conclusion, we briefly mention an improved
algorithm that uses k = 3 to achieve the best known competitive ratio of 0.705.
While dependent rounding is typically applied to values between 0 and 1,
the useful properties extend naturally to our case in which kx?(e) may be greater
than 1 for some edge e ? E. To understand this process, it is easiest to imagine
splitting each kx?(e) into two edges with the integer value x?(e) = bkx?(e)c and
fractional value x??(e) = kx?(e) ? bkx?(e)c. The former will remain unchanged by
the dependent rounding procedure since it is already an integer while the latter will
be rounded to 1 with probability x??(e) and 0 otherwise. The final value F (e) would
be the sum of the two rounded values. The two properties of dependent rounding
we use are:
(P1) Marginal distribution: For every edge e ? E, let p(e) = kx?(e)?bkx?(e)c.
Then, Pr[F (e) = dkx?(e)e] = p(e) and Pr[F (e) = bkx?(e)c] = 1? p(e).
(P2) Degree-pr?eservation: For any vertex w ? U ? V , let its fractional degree
kx?? (w) := kx
?
e??(w) (e) and integral degree be the random variable F (w) =
? ?
e??(w) F (e). Then F (w) ? {bkx (w)c , dkx (w)e}.
3.6 Warmup: 0.688 competitive algorithm
As a warm-up, we describe a simple algorithm which achieves a competitive
ratio of 0.688 and introduces the key ideas in our approach. We begin by solving the
LP (3.1) to get a fractional solution vector x? and applying DR[x?, 2] as described
25
in Subsection 3.5.1 to get an integral vector F. We construct a bipartite graph
GF with F (e) copies of each edge e ? E. Note that GF will have max degree 2
since for all w ? U ? V , F (w) ? d2x?(w)e ? 2 and thus we can decompose it into
two matchings using Hall?s Theorem. The exact choice of the two matchings is not
critical to the algorithm as long as the union contains all edges in GF. Finally, we
randomly permute the two matchings into an ordered pair of matchings, [M1,M2].
These matchings serve as a guide for the online phase of the algorithm, similar to
Haeupler et al. [102]. The warm-up algorithm, denoted by EW0, is summarized in
Algorithm 1.
Algorithm 1: EW0
1 Construct and solve the benchmark LP (3.1) for the input instance.
2 Let x? be an optimal fractional solution vector. Invoke DR[x?, 2] to get a
random integral vector F.
3 Create the graph GF with F (e) copies of each edge e ? E and decompose
it into two matchings as described in text.
4 Randomly permute the matchings to get a random ordered pair of
matchings, say [M1,M2].
5 When a vertex type v arrives for the first time, attempt to match v to u1
if (u1, v) ?M1; when v arrives for the second time, attempt to match v
to u2 if (u2, v) ?M2.
6 When a vertex type v arrives for the third time or after, do nothing in
that step.
26
3.6.0.1 Analysis of EW0
We will show that EW0 (Algorithm 1) achieves a competitive ratio of 0.688.
Let [M1,M2] be the randomly ordered pair of matchings. Note an edge e ? E
appears in both matchings if and only if x?(e) > 1/2. We consider three types of
edges. We say an edge e ? E is of type ?1, denoted by e ? ?1, if and only if e
appears only in M1. Similarly e ? ?2, if and only if e appears only in M2. Finally,
e ? ?b, if and only if e appears in both M1 and M2. Let P1, P2, and Pb be the
probabilities that an edge e ? ?1, e ? ?2, and e ? ?b is matched, respectively.
Haeupler et al. [102] proved the following Lemma 2 that bounds P1, P2, Pb for any
given matching M1 and M2.
Lemma 2 (Section 3 of Haeupler et al. [102]). For any two matchings M1 and
M2 steps (5) and (6) in Algorithm 1 implies that we have (1) P1 > 0.5808; (2)
P2 > 0.14849 and (3) Pb > 0.632.
We use Lemma 2 to prove that the warm-up algorithm EW0 achieves a ratio
of 0.688 by examining the probability that a given edge e is becomes type ?1, ?2,
or ?b.
Analysis of EW0. Consider the following two cases.
? Case 1: 0 ? x?(e) ? 1/2: By the marginal distribution property of depen-
dent rounding (P1), there can be at most one copy of e in GF. Moreover, the
probability of including e in GF is 2fe. Since an edge in GF can appear in
either M1 or M2 with equal probability 1/2, we have that Pr[e ? ?1] = Pr[e ?
27
?2] = x
?(e). Thus, the probability that edge e is added to the matching is
(x?(e)P1 + x
?(e)P2) = 0.729x
?(e).
? Case 2: 1/2 ? x?(e) ? 1?1/e: Similarly, by the marginal distribution (P1),
Pr[e ? ?b] = Pr[F (e) = d2x?(e)e] = 2x?(e)? b2x?(e)c = 2x?(e)? 1. It follows
that Pr[e ? ?1] = Pr[e ? ?2] = (1/2)(1? (2x?(e)? 1)) = 1? x?(e). Thus,the
probability that edge e is added to the matching is (noting that the first term
is from case 1 while the second term is from case 2) ((1 ? x?(e))(P1 + P2) +
(2x?(e) ? 1)Pb) ? 0.688x?(e), where the worst-case occurs for an edge e with
x?(e) = 1? 1/e.
To prove the competitive ratio, we first show the following Lemma.
Lemma 3. Let x? denote the optimal solution to LP (3.1). Suppose we have that
for every edge e ? E, Pr[e is included in matching] ? ?x?(e) then the competitive
ratio is at least ?.
Proof. From Li?nearity of Expectation, we ha?ve that the total expected size of the
matching is E[ e?E w(e)I[e is matched]] = e?E w(e) Pr[e is matched]. From the
premise we have that for every edge e ? E, Pr[e is?included in matching] ? ?x
?(e).
Thus, the?total expected size of the matching is ? w(e)x
?
e?E (e). From Lemma 1
we have ?e?E w(e)x (e) ? OPT. Thus, the competitive ratio is at least ?.
In the two cases above, we proved that for every e ? E we have Pr[e is matched] ?
0.688x?(e). Thus from Lemma 3 we obtain a competitive ratio of 0.688.
28
3.7 0.7 competitive algorithm
In this section, we describe an improvement to the warm-up algorithm to get a
competitive ratio of 0.7. We start by making an observation about the performance
of the warm-up algorithm. After solving LP (3.1), let edges with x?(e) > 1/2 be
called large and edges with x?(e) ? 1/2 be called small. Let L and S, be the set of
large and small edges, respectively. In the warm-up analysis, small edges contributed
a much higher value of 0.729 towards the ratio as opposed to the worst-case 0.688
by the large edges. This is primarily due to the fact that we may potentially get
two copies of a large edge in GF. In this case, the copy in M1 has a better chance of
being matched, since there is no edge which can ?block? it (i.e., an edge with the
same offline neighbor that gets matched first), but the copy in M2 has no chance of
being matched.
To correct for this imbalance, we make an additional modification to the vector
x? before applying DR[x?, k]. The rest of the algorithm is exactly the same. Let ?
be a parameter to be optimized in the analysis. For all large edges ` ? L such that
x?(`) > 1/2, we set x??(`) = x?(`) + ?. For all small edges s ? S which are adjacent
to some large edge, let ` ? L be the largest edge adjacent to s such that x?(`) > 1/2.
Note that it is possible for s to have two la(rge nei)ghbors, but we only care about
?
the larger of the two. We set x??(s) = x?(s) 1?x? (`) .
1?x?(`)
In other words, we increase the values of large edges while ensuring that for all
w ? U?V , x?(w) ? 1 by reducing the values of neighboring small edges proportional
to their original values. Note that it is not possible for two large edges to be
29
adjacent since they must both have x?(e) > 1/2. For all other small edges which are
not adjacent to any large edges, we leave their values unchanged. We then apply
DR[x?, 2] to this new vector, multiplying by 2 and applying dependent rounding as
before. Algorithm 2 formally describes the algorithm.
Algorithm 2: EW(?)
1 Construct and solve the benchmark LP (3.1) for the input instance.
Let x?2 be an optimal fractional solution vector.
3 For all edges e ? E with x?(e) > 1/2 let x??(`) = x?(`) + ?
4 For all edges e ? E with x?(e) ? 1/2 do the following. Let ` ? L be the
largest edge a(djacent)to e such that x?(`) > 1/2. Let
x??(e) = x?(e) 1?x?
?(`)
1? ? .x (`)
5 Invoke DR[x?, 2] to get a random integral vector F.
6 Run steps (2)-(6) of Algorithm 1
3.7.0.1 Analysis
Theorem 1. For edge-weighted online stochastic matching with integral arrival
rates, EW(0.0142) achieves a competitive ratio of at least 0.7.
Proof. As in the warm-up analysis, we?ll consider large and small edges separately
? Scenario 1: 0 ? x?(s) ? 1 :
2
Here we have two cases
? Case 1: s is not adjacent to any large edges.
30
In this case, the analysis is the same as Case 1 in the warm-up analysis.
Thus, the probability that edge s is added to the matching is 0.729x?(e).
? Case 2: s is adjacent to some large edge `.
For this case, let x?(`) be the value of the largest neighboring edge in the
original LP solution. Then the probability that edge s is added to the
matching is
(
1? (x?
)
? (`) + ?)x (s) ? (0.1484 + 0.5803).1? x (`)
This follows from Lemma 2; in particular, the first two terms are the
result of how we set x?(s) in the algorithm, while the two numbers, 0.1484
and 0.5803, are the probabilities that s is matched when it is in M2 and
M1, respectively. Note that for x
?(`) ? [0, 1) this is a decreasing function
in x?(`). So the worst case is when x?(`) = 1?1/e (due to third constraint
in LP (3.1)) Thus, the probability that edge s is added to the matching
is ( )
? 1? (1? 1/e + ?)x (s) (0.1484 + 0.5803).
1? (1? 1/e)
Since ? = 0.0142, this evaluates to,
0.701x?(s). (3.2)
? 1 < x?(`) ? 1 ? 1 : Here, the probability that ` is added to the matching is,
2 e
[1? (x?(`) + ?)][P ?1 +P2] + [2(x (`) + ?)? 1]Pb. This can re-arranged to obtain
(P1 + P2)(1? ?) + (2? ? 1)P + x?b (`)[2Pb ? P1 ? P2]. (3.3)
31
Since ? = 0.0142 using Lemma 2 we have (P1+P2)(1??)+(2??1)Pb = 0.1048.
Similarly, using Lemma 2 we have 2Pb ? P1 ? P2 = 0.535. Thus, Eq. (3.3)
simplifies to,
0.1048 + x?(`)0.535 (3.4)
We can write Eq. (3.4) as x?(`)[0.1048/x?(`) + 0.535]. Note that 1 < x?(`) ?
2
1? 1 . Thus, Eq. (3.4) can be lower-bounded by
e
0.701x?(`). (3.5)
Thus combining Eq. (3.2) and (3.5) with Lemma 3 we get a competitive ratio
of 0.7.
We now show that the chosen value of ? = 0.0142 ensures that both x??(`)
and x??(s) are less than 1 after modification. Since x?(`) ? 1 ? 1/e we have that
x?(`) + ? ? 1 ? 1/e + 0.0142 ? 1. Note that x?(`) ? 1/2. Hence, t(he modifie)d
? 1?(x?value x? (s) is always less than or equal to the original value, since (`)+?)
1?x?(`)
is decreasing in the range x?(`) ? [1/2, 1 ? 1/e] and has a value less than 0.98 at
x?(`) = 1/2.
3.8 Conclusion
In this chapter, we considered the Online (Edge) Weighted Matching problem
under the known product distribution arrival assumption. The key contribution
is a new algorithm based on randomized (dependent) rounding schemes of Gandhi
et al. [92] that improves the competitive ratio to 0.701. For simplicity in exposition,
we describe a simpler algorithm in this thesis. In Brubach et al. [42], we give
32
another algorithm which uses similar ideas and DR[x?, 3] to obtain the best-known
competitive ratio of 0.705. The analysis is significantly involved with many more
cases and we refer the reader to the paper for more details.
All algorithms in prior work and the one described in this thesis fall under the
class of online algorithms known as non-adaptive algorithms. In other words, after
realizing part of the randomness in the input, the algorithm does not change it?s
strategy. It is an interesting open problem to consider adaptive algorithms for this
problem and understand if significant improvements to competitive ratios can be
made. This is called adaptivity gap in the literature Gupta et al. [98, 99]. The best
known lower-bound for this problem is 0.823 which holds even when all edge weights
are identical. Thus, the immediate open question is to bridge the gap between the
upper-bound of 0.705 and the lower-bound of 0.823. Adaptive algorithm is one
possible approach towards bridging this gap.
33
Chapter 4: Extension 1: Online Weighted Matching with Reusable
Vertices
4.1 Introduction
In this chapter, we consider an extension of the Online Matching problem
where the offline vertices are reusable. Online Matching problems are motivated
by market design problems where agents on one side of a market are paired with
agents, contracts, or transactions on the other. In many of these applications such
as matching drivers to riders, jobs to servers, organs to patients the process is
dynamic where one side of the market arrives in an online fashion and is matched
sequentially to the other side. Moreover, in these applications the match has a
temporal component; after a certain period of time the offline agent is freed and
can be used for other matches. We motivate this problem using the example of
ride-share below.
Taxi Dispatching Services and Ride-Sharing Systems. Traditional taxi
services and rideshare systems like Uber and Didi Chuxing match drivers to would-
be riders Lee et al. [128], Lowalekar et al. [132], Tong et al. [176]. Here, the offline
Service Providers are different vehicle drivers. Once an online request (potential
rider) arrives, the system matches it to a nearby driver instantly such that the
34
rider?s waiting time is minimized. In most cases, the driver will rejoin the system
and can be matched again once she finishes the service. Additionally, the arrival
rates of requests changes dramatically across the day. Consider the online arrivals
during peak hours and off-peak hours for example: the arrival rates in the former
case can be much larger than the latter.
Motivated by these applications, we consider the Online Matching with Reusable
Resources problem. As before we are given a weighted bipartite graph G = (U, V,E).
The vertex set U is available offline and at each time-step t ? [T ] a vertex v ? V
is sampled from a known product distribution. When a vertex v arrives, we have
to match it to one of the available vertices in U or drop it; this action is irrevo-
cable. The key difference is that the vertices in U are reusable. Once we assign
v to u, the vertex u will rejoin the system after Ce rounds with e = (u, v), where
Ce ? {0, 1, . . . , T} is an integral random variable with known distribution. In other
words, after Ce time-steps the vertex u becomes available for future matches. The
random variable Ce is called the occupation time of u w.r.t. e. The goal is to maxi-
mize the total weight of the successful matches at the end of T time-steps. Here we
make no assumption on the individual distributions Dt in the product distribution
and can vary (adversarially) with time.
4.2 Our Contributions
The main contribution in this chapter is to define the Online Weighted Match-
ing with Reusable Vertices problem and provide a new algorithm that has tight
35
theoretical guarantees. This model capture a wide range of real-world applications
related to online scheduling, organ allocation, rideshare dispatch, among others. We
extend the ideas from the previous chapter to design a new algorithm that achieves
a competitive ratio of 1 ?  for any given constant  > 0. The key new ingredi-
2
ent in the algorithm is the notion of attenuation ? using Monte-Carlo simulations
to approximately estimate the probability of certain safe event and weight actions
inversely proportional to this estimate. We also show that our algorithm is nearly
optimal among all non-adaptive algorithms.
4.3 Main Model
In this section, we present a formal statement of our main model. Consider
a weighted bipartite graph G = (U, V,E) where U and V represent the offline and
online vertices respectively. Let w : E ? R+ denote the weight function that
assigns a weight to every edge in the graph. We have a finite time horizon T (known
beforehand) and for each time t ? [T ], a vertex v is sampled (we?use the term v
arrives) from a known probability distribution {p (v)} such that p (v) ? 11t v?V t
This sample is indepe?ndent for each round t. The expected number of times v arrives
across the T rounds, t?[T ] pt(v), is called the arrival rate for vertex v. Once a vertex
v arrives, we need to make an irrevocable decision immediately: either reject v or
assign v to one of its neighbors in U . For each u ? U , once it is assigned to some
v, it becomes unavailable for Ce rounds with e = (u, v), and subsequently rejoins
the system. Here Ce is an?integral random variable taking values in {0, 1, . . . , T}
1Thus, with probability 1? v?V pt(v), none of the vertices from V will arrive at t.
36
with the distribution known in advance. The goal is to design an online assignment
policy such that the total (expected) weight of the assignments made is maximized.
In this work we assume that |V |  |U | and T  1.
4.4 LP benchmark
We use the following LP to upper-bound the optimal online algorithm. The
linear program has variables xt(e) for every e ? E and t ? [T ].
? ?
maximize ?t?[T ] e?E w(e) xt(e) such that
?e??(u?) xt(e) ? pt(v) ? ?u ? U, t ? [T ]
t?<t e??(u) xt?(e) Pr[Ce > t? t?] + e??(u) xt(e) ? 1 ?u ? U, t ? [T ]
0 ? xt(e) ? 1 ?e ? E, t ? [T ]
(4.1)
As before we have the following Lemma which states that the optimal value
of the LP is an upper-bound to the optimal online algorithm.
Lemma?4. Le?t {x
?
t}t?[T ] denote an optimal solution to LP (4.1). Then we have
OPT ? ?t?[T ] e?E w(e)xt (e).
Proof. The proof of this is similar to that of Lemma 1. Let Yt(e) denote an indicator
random variable to denote if an edge e = (u, v) was matched at time t in the
optimal solution for an online sequence A. Let yt(e) := EA[Yt(e)] for every t ? [T ]
and e ? E. We will now show that the vector y := (yt(e))e?E,t?[T ] is a feasible
solution to the LP (4.3). Consider a vertex u ? U . Let Rt(v) denote the indicator
37
?
random variable to denote if vertex v arrived at time t in A. Thus, e??(u) Yt(e) ?
?Rt(v). Taking expectation on both sides and using linearity of expectation we get
e??(u) yt(e) ? pt(v). This shows that y is feasible to the first constraint. Since
Yt(e) is an indicator random variable we have 0 ? Yt(e) ? 1. Taking expectation,
we obtain 0 ? yt(e) ? 1 and thus y is feasible to the third constraint. We will now
show that it is feasible to the second constraint. Consider a u ? U and time-step
t ? [T ]. u is either matched at some time-step t? < t and is not yet available at t or
that u got matched at time t. Let I[Ce > t?t?] denote the i?ndicat?or for the re-appear
time of e?dge e to be larger than t ? t
?. Thus, we have t?<t e??(u) Yt?(e)I[Ce >
t? t?] + e??(u) Yt(e) ??1. Tak?ing expectation on both sides?and using linearity of
expectation we get that t?<t e??(u) yt?(e) Pr[C > t? t?e ? ]?+ e??(u) yt(e) ? 1. The
expected?weight?of the optimal solution is thus EA[ t?[T ] e?E w(e)Yt(e)] which is
equal to t?[T ] e?E w(e)yt(e) from linearity of expectation. Since y is feasible to
the LP, the optimal value to LP (4.3) is at least as large as the expected optimal
solution.
In fact, it is suffices to have a finite sample estimate of Pr[C ?e > t? t ] for every
edge e ? E. In particular, we have the following Lemma. The proof follows from
the fact that {xt}t?[T ] is scaled down by a factor (1 + ?) and hence the objective is
scaled down by a factor (1 + ?).
Lemma 5. Suppose for some ? ? 0, we have an estimate f(e, y) of Pr[Ce > y] for
all edges e and y >= 0 , where f(e, y)/Pr[Ce > y] always lies in [1/(1 + ?), 1 + ?].
Then, by using f(e, t ? t?) in the LP instead of Pr[Ce > t ? t?] and scaling down
38
the resultant vector {xt}t?[T ] by (1 + ?), we only get a further loss of (1 + ?) in the
competitive ratio.
4.5 1/2? -competitive algorithm
The main idea of the algorithm is as follows. Let {x?t}t?[T ] denote an optimal
solution to LP (4.1). Suppose we aim to develop an online algorithm achieving a
ratio of ? ? [0, 1]. Consider an assignment e = (u, v) when some v arrived at time
t. Let SFt(e) be the event that e is safe at t (i.e., u is available at t). By using
Monte-Carlo simulations on the algorithm?s strategy up to t, we can get an estimate
of Pr[SFt(e)], denoted by ?t(e), within an arbitrary small error. Therefore when
x?(e)
SFt(e) holds, we assign v to u with probability t
? when v arrives, which leads
pt(v) ?t(e)
to the fact that e is assigned with probability exactly equal to ?x?t (e) unconditionally.
We call any strategy that satisfies ? ? ?t(e) as valid. At the outset, this looks similar
to the Inverse Propensity Scoring (IPS) used in the multi-armed bandit literature
Auer et al. [24]. However, there is a key difference between IPS estimates and these
Monte-Carlo estimates. In the bandit literature, one usually scales the value by
the probability of playing an action, since this is the cost of observing only bandit
feedback. However, here we scale by a quantity that depends on the probability of
a certain event happening during the run of the algorithm, because of playing other
actions. The linear program gives a distribution over the edges assuming that all
the neighbors are available. Hence this scaling can be interpreted as the cost the
algorithm needs to incur when some neighbors are already matched.
39
The above strategy of using Monte-Carlo simulations to weight the probability,
called simulation-based attenuation, has been used previously for other problems,
such as stochastic knapsack Ma [133] and stochastic matching Adamczyk et al. [2].
Throughout the analysis, we assume that we know the exact value of ?t(e) :=
Pr[SFt(e)] for all t and e. This is because, for any giv[en accuracy  > 0 using]
?( 1 2 ? log(1)) samples, we can ensure that ? 1t(e) ? Pr[SFt(e)] ,Pr[SFt(e)]Pr[SFt(e)] ? 1+
with probability at least 1? ?, via a standard Chernoff-bound argument. Thus, this
leads to a loss of at most (1 ? ?()) multiplicative factor in the competitive ratio,
which is negligible when  = o(1)2. Hence, ignoring it leads to a cleaner presentation.
Algorithm 3 gives a formal description of our algorithm.
Algorithm 3: Simulation-based algorithm: ALG-REUSE(?)
1 Solve LP (4.1) to obtain optimal solution {x?t}.
2 For each time t, let vt denote the request arriving at time t.
x?(e)
3 If ?t(vt) = ?, then reject vt; otherwise choose e ? ?t(vt) with prob. t ? ?pt(vt) ?t(e)
where e = (u, v).
First, we show that when ? = 1 , Algorithm 3 is a valid strategy (i.e., ? ? ?t(e)2
for every t ? [T ], e ? Et).
Lemma 6. For every t ? [T ] and e ? Et, we have ? (e) ? 1t .2
Proof. We prove this by induction on t. The base case is when t = 1. In this case
we have ?t(e) = 1 for all e = (u, ?) trivially and thus, we are done.
2o(1) is a vanishing term when both Ce and T/Ce are sufficiently large
40
Consider the inductive case. For all t? < t, assume that ?t?(e) ? 1 . Consider2
time t and a given edge e = (u, vt). Assignment e is unsafe at t iff u is already
assigned to some v? at t? < t and that the assignment e? = (u, v?) makes u unavailable
at time t. Therefore we have,
? ? x??(e)
1? ?t(e) = 1? Pr[SF tt(e)] = Pr[Ce > t? t?]. (4.2)
t?
2
<t e??t(u)
Using the constraints of the LP (4.1) with Eq. (4.2) we have,
? ?
? x
? (e) ??
?t(e) = 1
t Pr[Ce > t? ?
1 1 1
t ] ? + x?t (e) ? .
? t? 2 2 2 2<t e??t(u) ? e??t(u)x?t (e) ? ? ? x?(e)Hence we have te??t(vt) pt(vt) ?t(e) e?? ? 1 and thus we are done.t(vt) pt(vt)
We will now prove that the competitive ratio of Algorithm 3 is at least 1?o(1).
2
In particular, we prove the following theorem.
Theorem 2. Algorithm ALG-REUSE(1) achieves a competitive ratio of 1 ? o(1).
2 2
Proof. Consider any edge e in the offline-graph. Using Lemma 6 and the definition
of ?t(e) we?have that the probability that this assignment is chosen in the T rounds
is at least ?t?[T ] xt (e)?(1?o(1)). Thus, combining this with Lemma 3 we have that
the competitive ratio is at least ?(1? o(1)) = 1(1? o(1)).
2
4.6 Lower-bounds
In this section we show that the analysis in the previous section is tight. In
particular, any algorithm that uses the optimal solution of LP (4.1) as a guide for
the online actions cannot get a competitive ratio better than 1 +o(1). In particular,
2
we show that the worst-case occurs for the following instance.
41
Construction 1. Consider a complete bipartite graph G = (U, V,E) where |U | = K,
|V | = T 2. Let pt(v) = 12 for each v ? V and t ? [T ]. In other words, in each roundT
t ? [T ], each v ? V is sampled uniformly. For each e, let Ce be the same value K,
which implies that any u ? U will be unavailable for K rounds after an assignment.
All assignments have the same weight ( i.e., w(e) = 1 for all e ? E).
Theorem 3. No non-adaptive algorithm that uses the optimal solution of LP (4.1)
as a guide for online actions, can achieve a competitive ratio better than 1 + o(1)
2
on the instance in Construction 1.
Proof. First, split the T rounds in the online phase into T ? K + 1 consecutive
windows W = {W`}`?[T?K+1] such that W` = {`, `+ 1, . . . , `+K ? 1} for each 1 ?
` ? T ?K+ 1. Consider the benchmark LP (4.1) for the instance in construction 1.
It can be written as follows.
? ?
maximize ?t?[T ] e?E xt(e) such that
?e??(u)?xt(e) ?
1
2 ?u ? U, t ? [T ]T
t?W e??(u) xt(e) ? 1 ?u ? U, 1 ? ` ? T ?K + 1`
0 ? xt(e) ? 1 ?e ? E, t ? [T ]
(4.3)
We can verify that an optimal solution to LP (4.3) is as follows: x?t (e) =
1/(T 2K) for all e ? E and t ? [T ] with the optimal objective value of T . Consider
any optimal non-adaptive algorithm. The expected number of arrivals of any v ? V
after T steps is 1/T . Thus, for any non-adaptive algorithm ALG, it needs to specify
the allocation distribution Dv for each v during the first arrival. Consider a given
42
non-adaptive algorithm; it can be?described by a set of parameters {?u,v ? [0, 1]}
for each pair v ? V, u ? ?(v) with u??(v) ?u,v ? 1, for each v ? V . In other words,
the algorithm can be described by the probability that a v ? V will be assigned to
u ? U in the run?of the algorithm.
Let ? 1u := v??(u) ?u,v ? 2 , which is the probability that u is matched in eachT
round if it is safe at the beginning of that round, when running ALG. Hence,
? ? ? 1 ? ?? 1?u = ?u,v = ?2 u,v ? ? 1.T T 2
u?U u?U v??(u) v?V u??(v)
Consider a given u and the corresponding ?u. Let ?t(u) be the probability
that u is a?vailable at time t. Then the expected number of matches of u after the T
rounds is t?[T ] ?u?t(u). Then the variables ?t(u) and ?u satisfy a recurrence, given
by Lemma 7, at each time-step t ? [T ]
Lemma 7. ?1 < t ? T , we have
?
?t(u) + ?u ?t?(u) = 1
t?K+1?t?<t
Proof. When t = 1, this follows from the fact that u is safe at time t = 1. For
each time t > 1, let SFt(u) be the event that u is safe at time t and At(u) be the
event that u is matched at time t. In each window of K time slots, we have that
the events {SFt(u), At?(u), t ?K + 1 ? t? < t} are all mutually exclusive and that
together they cover the entire sample space. Therefore, we have,
?
1 = Pr[SFt(u)] + Pr[At?(u)]
t??K+1?t?<t
= ?t(u) + ?u ?t(u)
t?K+1?t?<t
43
Recall that the op?timal v?alue of the benchmark LP (4.3) is T . However, the
performance of ALG is u?U t?[T ] ?u?t(u). Thus, the competitive ratio achieved
by ALG can be captured by the following maximization program.
? ?
maximize u?U t?[T ]
?u?t(u)
? such thatT
u?U ?u ??1 ?u ? U, t ? [T ] (4.4)
?t(u) + ?u t?K+1?t?<t ?t?(u) = 1?1 < t ? T, u ? U
?u ? 0, ?1(u) = 1 ?u ? U
We now prove the following Lemma about the optimal value of the pro-
gram (4.4). Theorem 3 follows directly from this lemma.
Lemma 8. The optimal value of the maximization program (4.4) is at most 1
2? +1/K
o(1) when K = o(T ).
Pro?of. Consider any given vertex u ? U . From Lemma 7 we have that ?t(u) +
?u t?K+1?t?<t ?t?(u) = 1 for all 1 ? t ? T . Summing both sides over t ? [T ], we
have the following.
( )?
1 + ?u(K ? 1) t?[T ] ?t(u) =
T + ?u(K ? 1)?T (u) + ?u(K ? 2)?T?1(u) + ? ? ?+ ?u?T?K+2(u)
? T +K ? 1
Therefore we have,
? T K ? 1 T 1
?t(u) ? + ? + (4.5)
1 + ?u(K ? 1) 1 + ?u(K ? 1) 1 + ?u(K ? 1) ?
t? u[T ]
44
?
Define Hu := t?[T ] ?u?t(u). From Eq. (4.5), we have that H ?
T?u
u + 1.1+?u(K?1)
Thus the objective value in the maximization program (4.4) can be upper-bounded
as follows. ? ?
u?U t?[T ] ?u?t(u) ? H ?u ? ?u K= + .
T T 1 + ?u(K ? 1) T
u?U u?U
Thus, the optimal value to the maximization program (4.4) can be upper
bounded by{the following maximization program.? ? ? }u K
max + : ?u = 1, ?u ? 0,?u ? U
1 + ?u(K ? 1) T
u?U u?U
According to our assumption we have K = o(T ); thus, the second term eval-
uates to o(1). Let g(x) := x/(1 + x(K ? 1)). Fo?r any K ? 2, g(x) is a concave
function. Thus, when maximizing g(x) subject to u?U ?u = 1, the maximizer will
be achieved when all ?u = 1/K. The resultant value is
1
? + o(1).2 1/K
Unconditional Hardness. Manshadi et al. [137] prove that for the online match-
ing problem under known distribution (but disposable offline vertices), no algorithm
can achieve a ratio better than 0.823. Since our setting generalizes this, their hard-
ness results directly apply to our problem as well.
4.7 Conclusion
In this work, we provide a model that captures the application of assignment
in ride-sharing platforms. One key aspect in our model is to consider the reusable
aspect of the offline resources. This helps in modeling many other important ap-
plications where agents enter and leave the system multiple times (e.g., ride-share,
45
organ allocation and crowdsourcing markets). We provide an LP based algorithm
that achieves a tight competitive ratio of 1?o(1). We supplement this with a lower-
2
bound for LP-based algorithms. In Chapter 6 we use this model to run experiments
on the New York yellow cabs taxi dataset to show the application of this model for
a real-world problem. We show that the proposed algorithms sometimes does better
than simple heuristics that are commonly employed.
46
Chapter 5: Extension 2: Online Submodular Weighted Matching
5.1 Introduction
In this chapter, we consider another extension of the Online Weighted Match-
ing problem called the Online Submodular Matching Problem. In the online weighted
matching problem, each edge is associated with a weight and the goal of maximizing
the total weight captures the notion of maximizing the total relevance of the match.
In many applications, we also care about the diversity of the final matching along-
side the relevance. Ahmed et al. [15] considered a motivating example of matching
academic papers to potential reviewers: simply maximizing the relevance (the qual-
ity of each match) could potentially assign a paper to multiple scholars in a single
lab due to shared expertise, which is undesirable. Instead, we want to assign each
paper to relevant experts with diverse backgrounds to obtain comprehensive feed-
back. Maximizing diversity is of particular importance in various recommendation
systems, ranging from recommendations of new books and movies on eBay Chen
et al. [62] to returning search-engine queries Agrawal et al. [9]. A common strategy
to address diversity is to first formulate a specific objective (typically maximization
over a submodular function capturing the balance of diversity and relevance and
then design an efficient algorithm?typically a greedy one?to solve it (e.g., Ahmed
47
et al. [15] and references within). Inspired by the wide range of applications we
propose the Online Submodular Matching problem and design algorithms that give
provable guarantees. We depart from the greedy paradigm of algorithms and use
tools from submodular optimization.
5.2 Our Contributions
Our contributions can be summarized as follows.We propose the Online Sub-
modular Bipartite Matching problem, which captures the balance between relevance
and diversity in the context of matching markets. We then provide a provably good
algorithm for this model. The algorithm is based on Contention Resolution schemes
(CRS, Chekuri et al. [61]) used in the offline submodular maximization literature.
The key algorithmic idea is to construct an offline guide based on using CRS on
the expected offline problem and use this guide in the online phase. Our algorithm
leverages useful properties of CRS from prior work to argue about the total expected
objective of the online algorithm.
5.3 Problem Model
Consider a given bipartite graph G = (U, V,E) where U and V represent the
offline and online vertices respectively. Additionally,. We have a finite time horizon
T (known beforehand) and for each time (or round) t ? [T ] := {1, 2, . . . , T}, at
most one vertex v is sampled?in which case we say v arrives?from a given known
probability distribution {p(v)}v?V . The sampling process is independent across
48
different times. As in chapter 3 we consider the case of integral arrival rates ? thus
wlog the arrival rate of each vertex v ? V is 1. This implies that p(v) = 1 for
T
every vertex v ? V . When a vertex v arrives, we need to make an immediate and
irrevocable decision: either to reject v or assign v to one of its neighbors in U . Each
u has a unit capacity: it will be unavailable in the future upon being matched.1 We
are given a non-negative monotone submodular function f : 2E ? R+ as input. Our
goal is to design an online matching algorithm such that E[f(M)] is maximized,
where M is the final (random) matching obtained.
Related model. One important direction in addressing diversity in online algo-
rithms has been via online convex programming. In particular, Agrawal and De-
vanur [10] considered the model of maximizing a concave function under convex
constraints. At each time-step a random vector is drawn from an unknown distribu-
tion and the goal is to satisfy a convex constraint in expectation. Our work differs
from theirs in multiple aspects. First, the offline problem of Agrawal and Devanur
[10] is poly-time solvable, while our problem even in the offline version has unknown
hardness (status unknown for both NP- and APX-hardness). Equivalence between
discrete and continuous functions exists for submodular minimization via the Lova?sz
extension. However, a similar continuous relaxation for submodular maximization is
NP-hard to evaluate2. Hence, it is unclear how one would use their model to address
our problem. Secondly, they consider the large budget regime while all matching
1The general case where each u has a given capacity B(u) can be reduced to this by creating
B(u) copies of u.
2e.g., Slide 26 in https://goo.gl/HAhqaZ
49
type problems differ from allocation problems in that this assumption is not true (in
fact, the main challenge is small budgets). The other difference is that our ?known
i.i.d.? gives algorithm design more power as compared to unknown distributions and
therefore helps obtain improved ratios rigorously. For example, the online matching
problem with linear objectives has been studied both in unknown distribution and
known i.i.d. models separately since it presents a natural trade-off?knowing more
information about the distribution and the competitive ratio. Based on applications,
one would make assumption one-way or the other.
Special cases. Our model generalizes some well-known problems in this literature.
Note that if the submodular function is just a linear function of the weights this
reduces to online weighted matching in Chapter 3. Our model can also capture the
Submodular Welfare Maximization (SWM) problem Kapralov et al. [114]. Given
an instance of SWM we can add polynomially many extra vertices and reduce it to
an instance of our problem. The SWM problem is defined as follows. We have n
vertices in U which are available offline. With each vertex u ? U we have a monotone
submodular function gu : 2
V ? R associated with it. When a vertex v arrives we
need to match v to one of its neighbors in U . At the end of the online phase, let
S1, S2, . . . , Sn be the set of ve?rtices assigned to vertices 1, 2, . . . , n respectively. The
goal is to maximize the sum i?[n]wi(Si).
Note that sum of submodular functions is a submodular function. Hence to
make the objective function of SWM fit our framework, we do the following. Let
E(u) denote the set of edges incident to u. Define a function g?u : E(u) ? R. For
50
any subset S ? E(u), we let g?(S) := g(S(v)) where S(v) is the set of endpoints
of S in V . The non-trivial part to handle is that in SWM any vertex in U can
be matched multiple times, while in our setting we allow any U to be matched
exactly once. We can overcome this by creating additional vertices as follows. For
any u ? U , create |?(u)| ? T copies. A copy, indexed by v ? ?(u) and t ? [T ] has
an edge only to v. Additionally, wlog we can enforce that at time t, an algorithm
can be matched to vertices whose copy is indexed by t (this doesn?t change the
optimal value). Therefore, we now have an instance with a submodular objective
and matching constraints.
Consider an arrival sequence ? . Let the optimal allocation in SWM under this
arrival be S?,1, S?,2, . . . , S?,n at times T1, T2, . . . , Tn. Then note that by considering
the edges (uS ,T , S?,i) yields the same value to the objective in the Online Sub-?,i ?,i
modular Matching model. Additionally, the above argument also holds in the other
direction since there is a one-to-one mapping between the two optimal solutions.
5.4 Background on Submodular Optimization
Definition 5 (Submodular function). A function f : 2[n] ? R+ on a ground-set of
elements [n] := {1, 2, . . . , n} is called submodular if for every A,B ? [n], we have
that f(A ? B) + f(A ? B) ? f(A) + f(B) and f(?) = 0. Additionally, f is said to
be monotone if for every A ? B ? [n], we have that f(A) ? f(B).
For our algorithms, we assume a value-oracle access to a submodular function.
This means that, there is an oracle which on querying a subset T ? [n], returns the
51
value f(T ). The algorithm does not have access to f explicitly.
Examples. Some common examples of submodular functions include the coverage
function, piece-wise linear functions, budget-additive functions among others. In
our experiments section, we use the following two examples.
1. Coverage function. Given a universe U and g subsets A1, A2, . . . , Ag ? U , the
function f(S) = | ?i?S Ai| is called the coverage function for any S ? [g].
This can naturally be extended to the weighted case. Given a non-negative
weight function w : U ? R+, then the weighted coverage function is defined
as f(S) = w(?i?SAi).
2. Budget-additive function. For a given total budget B and a set of weights
wi ? 0 on the elements [g] of universe U , f?or any subset S ? U the budget-
additive function is defined as f(S) = min{ i?S wi, B}.
Definition 6 (Multilinear extension). The multilinear extension of a submodu-
l?ar func?tion f is t?he continuous function F : [0, 1]
n ? R+ defined as F (x) :=
T?[n]( k?T x(k) k 6?T (1? x(k)))f(T ).
Note that F (x) = f(x) for every x ? {0, 1}n. The multilinear extension is a
useful tool in maximization of submodular objectives. In particular, the above has
the following probabilistic interpretation. Let Rx ? [n] be a random subset of items
where each item i ? [n] is added into Rx independently with probability xi. We
then have F (x) = E[f(Rx)].
We now describe the key theorem relating sub-modular functions to its multi-
linear extension. This theorem was proved in Bansal et al. [35] and was generalized
52
to a procedure called contention resolution schemes (CRS) by Chekuri et al. [61].
In this work, we only require the following theorem, but use the term CRS to refer
to the following theorem.
Theorem 4 (Bansal et al. [35]). Let f be a non-negative monotone submodular
function over E with |E| = m. Given a fractional vector x ? [0, 1]m, let X =
(X(e))e?E be a random binary vector such that each X(e) is a Bernoulli random
variable with mean x(e). Suppose Z ? X is another random binary vector satisfying
the following properties (1) & (2) for some ? > 0. Then we have that E[f(Z)] ?
?E[f(X)].
1. Marginal Property: Pr[Z(e) = 1|X(e) = 1] ? ?.
2. Monotonicity Property: Pr[Z(e) = 1|X = a] ? Pr[Z(e) = 1|X = b],
?a ? b ? {0, 1}m,?e ? support(a) := {e? : a(e?) = 1}.
By definition of the multilinear extension we have that E[f(X)] = F (x). Thus
the theorem can be viewed as conditions when ?linearity of expectation? can be
applied to F .
5.5 Related Works in Submodular Optimization
The offline version of our problem is the well-studied ?maximizing a monotone
submodular function subject to a bipartite matching polytope constraint? problem.
More generally, the constraint set can be viewed as an intersection of two partition
matroids. The general area of submodular maximization is well studied; here, we
53
only survey algorithmic advances related to maximization of a monotone submodular
function subject to various constraints. The classical work of Nemhauser et al. [151]
showed that the natural greedy algorithm achieves an (1?1/e)-approximation under
a cardinality constraint, which is optimal in the value oracle model assuming P 6=
NP Nemhauser and Wolsey [150]. Under a general matroid constraint, Calinescu
et al. [52] gave an algorithm achieving the optimal ratio of 1?1/e (in the value oracle
model defined in Section 5.4) using the pipage rounding technique. Lee et al. [129]
considered the constraint case of k matroids with k ? 2 and presented a local-search
based algorithm. Sarpatwar et al. [166] studied the case of intersection of k matroids
?(k+1)
and a single knapsack constraint and gave a 1?e -approximation algorithm for
k+1
any k ? 1. Recently a series of works has considered submodular maximization in
the online setting. In particular, Buchbinder et al. [50] and Chan et al. [57] studied
online submodular maximization in the adversarial arrival order with preemption:
on arrival of an item, we should decide whether to accept it or not and possibly
rejecting a previously accepted item. In this work, we do not allow preemption but
consider a more flexible arrival assumption (i.e., known i.i.d.). This makes the
problem tractable and admits algorithms with non-trivial competitive ratios. Apart
from the offline and online models, submodular maximization has received much
attention in other models due to its applications in summarization Tschiatschek
et al. [180], data subset selection and active learning Wei et al. [186], and diverse
summarization Mirzasoleiman et al. [143], to name a few. It has been studied
in the streaming Badanidiyuru et al. [32], Mirzasoleiman et al. [146], distributed
Mirzasoleiman et al. [144, 145] and stochastic Karimi et al. [115], Stan et al. [171]
54
settings.
5.6 Mathematical Program to Upper-bound OPT
We use the following mathematical program to upper-bound OPT. This can-
not be solved exactly in polynomial time but can be approximated to a factor of
1? 1/e efficiently (Adamczyk et al. [3], Calinescu et al. [52]). Let F : [0, 1]m ? R+
be the multilinear extension of f . Consider the following mathematical program.
maximize ?F (x) such that
?e??(v) x(e) ? r(v) ?v ? V (5.1)
e??(u) x(e) ? 1 ?u ? U
0 ? x(e) ? 1 ?e ? E
Lemma 9. There is an efficient algorithm (running in polynomial(time))which re-
turns a feasible solution x? to the program (5.1) such that F (x?) ? 1? 1 E[OPT],
e
where E[OPT] is the offline optimal value.
Proof. First, it is easy to see that the constraint set (denoted by P) in LP (5.1) is
downward closed. In other words, if x is a feasible solution, then (1? )x for every
0 ?  ? 1 is also a feasible solution. Define f+(y) to be the optimal solution of the
following program{. ? ? ? }
f+(y) := max ?Af(A) : ?A ? 1;?A ? 0;?e ? E ?A ? y(e)
A?E A?E A:e?A
(5.2)
Using Lemma 6 in Adamczyk et al. [3] we have that the continuous greedy
algorithm of Calinescu et al. [52] yields a feasible solution x? such that F (x?) ?
55
(1 ? 1/e) max f+y?P (y). As noted in Adamczyk et al. [3], for any given y ? P ,
f+(y) is the maximum value of E[f(R)] over all possible random sets R such that
Pr[e ? R] ? y(e) for each e. Thus, this implies that max +y?P f (y) is an upper
bound of the offline optimal. This proves that F (x?) ? (1 ? 1/e) maxy?P f+(y) ?
(1? 1/e)E[OPT].
5.7 0.125-competitive Algorithm
We present an algorithm for the Online Submodular Matching problem that
achieves a competitive ratio of at least 0.125. The central idea is as follows. We first
start with an approximately optimal solution x? to the mathematical program (5.1).
We then obtain Rx? , which is obtained by independently sampling each edge e ? E
with probability x?(e). Let X ? {0, 1}m be the indicator vector corresponding to
Rx? such that X(e) = 1 iff e ? Rx? . Let EX(v) = {e : e ? ?(v), X(e) = 1}
and EX(u) = {e : e ? ?(u), X(e) = 1} be the set of sampled edges incident to u
and v respectively. Now we obtain another random vector Y ? {0, 1}m from X by
uniformly sampling an edge from EX(u) for each u ? U (the sampling process is
independent across different u). We then use both X and Y to guide the online
phase. Algorithm 4 describes this algorithm formally.
We now prove the following theorem about the performance of this algorithm.
In particular, we prove the following theorem.
Theorem 5. The algorithm CR-ALG achieves a competitive ratio of at least 1(1?
2
e?1/2)(1? 1/e) for Online Submodular Matching problem with integral arrival rates.
56
Algorithm 4: A CR-based algorithm (CR-ALG)
Offline Phase:
1 Approximately solve the program (5.1) using the continuous greedy of
Calinescu et al. [52]. Let x? = (x(e)?)e?E be the approximate solution with
F (x?) ? (1? 1/e)E[OPT].
2 Independently sample each edge with probability x?(e). Let
X = (X(e))e?E ? {0, 1}m be the resultant indicator vector.
3 For each w ? U ? V , let EX(w) := {e : e ? ?(w), X(e) = 1} be the set of
sampled edges incident to w. Sample one edge uniformly at random from
EX(u) for each u if EX(u) 6= ?. Let Y ? X be the indicator vector of the
edges chosen after both the sampling processes.
Online Phase:
4 When v ? V arrives at time t ? [T ], sample an edge e uniformly from EX(v).
Match v iff Y (e) = 1 and e = (u, v) is safe at t (i.e., u is available); skip it
otherwise.
Proof. Let Z = (Z(e))e?E be the indicator vector for any e ? E being matched
in Algorithm CR-ALG. Let X be as defined in Theorem 4 with x = x? where
x? is the approximate optimal solution to the offline program. We show that Z
satisfies the two properties stated in Theorem 4 with ? = 1(1 ? e?0.5). Hence, we
2
have that E[f(Z)] ? ?E[f(X)]. From Lemma 9 we have that E[f(X)] = F (x?) ?
(1? 1/e)E[OPT]. Combining these two facts proves Theorem 5.
We now show that Z satisfies the two properties stated in Theorem 4 with
? = 1(1? e?0.5). To prove this, we first prove the following Lemma.
2
57
Lemma 10. Consider an edge e = (u, v) with X(e) = 1. Let |EX(u)| = ku and
|EX(v)| = kv, where ku and kv are the number of edges incident to u and v in the
sub-graph induced by X, respectively. Then,
1 ( ( 1 ))
Pr[Z(e) = 1 | X(e) = 1, |EX(u)| = ku, |EX(v)| = kv] ? 1? exp ? . (5.3)
ku kv
Proof. A sufficient condition for Z(e) = 1 to occur given X(e) = 1, |EX(u)| =
ku, |EX(v)| = kv is that Y (e) = 1 in this conditional space and that Z(e) = 1 given
that Y (e) = 1. Thus, we have
Pr[Z(e) = 1 | X(e) = 1, |EX(u)| = ku, |EX(v)| = kv]
? Pr[Y (e) = 1 | X(e) = 1, |EX(u)| = ku]
? Pr[Z(e) = 1 | Y (e) = 1, |EX(v)| = kv, |EX(u)| = ku]. (5.4)
Since Y (e) = 1 is set uniformly for every edge e ? EX(u) we have that
1
Pr[Y (e) = 1 | X(e) = 1, |EX(u)| = ku] = . (5.5)
ku
We now show that the following holds.
? 1 1 ( 1 )t?1
Pr[Z(e) = 1 | Y (e) = 1, |EX(v)| = kv, |EX(u)| = ku] ? 1? .
T kv T ? kv
t?[T ]
(5.6)
Consider a given e = (u, v) with Y (e) = 1. We compute the probability that u is
safe. At any time t ? [T ], the probability that a vertex u is matched is at most
1 (i.e., the probability that a neighbor(v of u a)rrives in this time-step). Thus,kvT t?1
the probability that u is safe is at least 1? 1 ?. Thus, (the proba)bility thatTkv t?1
e = (u, v) is matched in the T time-steps is at least 1 1t?[T ] 1?
1
T kv T ? .kv
58
Plugging equations (5.5) and (5.6) back into Eq. (5.4) we get,
Pr[Z(e) = 1 | X((e) = 1, |EX(u)| = ku, |EX(v)| = k ]? vT
? 1 1 1 1
)t?1
1?
ku( T kv ( T)? kt=1 v
? 1 1
)
1? exp ? .
ku kv
Given Lemma 10, we are now ready to prove the two properties. First, we
show that the Marginal Property holds with ? = (1? e?0.5). Taking expectation
over the random sets |EX(u)| and |EX(v)|, Eq. (5.3) in Lemma 10 becomes,
[ ( ( ))]
Pr[Ze = 1|Xe = 1] ? E 1ku,kv 1? exp ? 1 (5.7)ku kv
( ( ))
Since 1 and 1 ? exp ? 1 , are convex functions in ku and kv by Jensen?sku kv
inequality, this evaluates to,
[ 1 ( (? ? 1 ))] ? 1 ( ( 1 ))Eku,kv 1 exp 1? exp ? . (5.8)ku kv E[|EX(u)|] E(|EX(v)|)?
From construction we?have, E[|EX(u)|] = 1+
?
e??E(u),e? 6=e x (e
?) ? 2. Likewise,
we have E[|EX(v)|] = 1 + ?e??E(v),e? 6=e x (e?) ? r(v) = 2, since x? = (x?(e))e?E is
feasible to the mathematical program (5.1). Thus, the RHS in Eq. (5.8) can be
lower-bounded by
1 ( ( 1 )) 1( )
1? exp ? ? 1? e?1/2 (:= ?). (5.9)
E[|EX(u)|] E(|EX(v)|) 2
Combining Eq. (5.9) and (5.7) proves the marginal property.
( Th(e Mo)n)otonicity Property follows from combining the fact that
1 and
ku
1?exp ? 1 are decreasing functions in ku and kv respectively and Eq. (5.3).kv
59
5.8 Conclusion
In this chapter, we proposed a new model, Online Submodular bipartite match-
ing, which effectively captures notions such as relevance and diversity in matching
markets. Many applications such as advertising, hiring diverse candidates, recom-
mending movies or songs naturally fit within this framework. We propose an algo-
rithm based on contention-resolution schemes and prove theoretical guarantees on
their performance. In Dickerson et al. [78], we propose another algorithm that has an
?
improved competitive ratio in the special case when |U | = o( T ) and T ??. For
this special case, this algorithm achieves a competitive ratio of (1?1/e)2. Dickerson
et al. [78] also has initial numerical experiments on the MovieLens dataset Harper
and Konstan [103] which validates the theoretical results.
60
Chapter 6: Rideshare: Empirical Evaluation of Online Matching
6.1 Introduction
In this chapter, we consider a case-study of the Online Matching with Reusable
Vertices problem studied in Chapter 4 in the context of ride-share. We cast the prob-
lem of matching drivers to riders as an Online Matching problem. Using the publicly
available New York City yellow cabs dataset,1 which contains the trip records for
trips in Manhattan, Brooklyn, and Queens for the year 2013, we evaluate the perfor-
mance of the proposed algorithm and many natural heuristics. Further, we also show
that many of the assumptions made to help provable guarantees hold in practice.
6.2 Dataset and Assumptions
The dataset is split into 12 months. For each month we have numerous records
each corresponding to a single trip. Each record has the following structure. We
have an anonymized license number which is the primary key corresponding to a car.
For privacy purposes a long string is used as opposed to the actual license number.
We then have the time at which the trip was initiated, the time at which the trip
ended, and the total time of the trip in seconds. This is followed by the starting
1http://www.andresmh.com/nyctaxitrips/
61
coordinates (i.e., latitude and longitude) of the trip and the destination coordinates
of the trip.
Assumptions. We make two assumptions specific to our experimental setup.
Firstly, we assume that every car starts and ends at the same location, for all
trips that it makes. Initially, we assign every car a location (potentially the same)
which corresponds to its docking position. On receiving a request, the car leaves
from this docking position to the point of pick-up, executes the trip and returns
to this docking position. Secondly, we assume that occupation time distributions
(OTD) associated with all matches are identically (and independently) distributed,
i.e., {Ce}e?E follow the same distribution. Note that this is a much stronger assump-
tion than what we made in the model, and is completely inspired by the dataset
(see Section 6.4). We test our model on two specific distributions, namely a normal
distribution and the power-law distribution (see Figure 6.5). The docking position
of each car and parameters associated with each distribution are all learned from
the training dataset (described below in the Training discussion).
6.3 Experimental Setup
The cars represent the set of vertices U and the request types represent the set
of vertices V . We assign an edge between a car u ? U and request type v ? V iff the
request type v can be assigned to the car u. Given this modeling, the experimental
setup is as follows. We randomly select 30 cabs (each cab is denoted by u). We
discretize the Manhattan map into cells such that each cell is approximately 4 miles
62
(increments of 0.15 degrees in latitude and longitude). For each pair of locations,
say (a, b), we create a request type v, which represents all trips with starting and
ending locations falling into a and b respectively. In our model, we have |U | = 30
and |V | ? 550 (variations depending on day to day requests with low variance).
We focus on the month of January 2013. We split the records into 31 parts, each
corresponding to a day of January. We choose a random set of 12 parts for training
purposes and use the remaining for testing purposes.
The edge weight w(e) on e = (u, v) (i.e., edge from a car u to type v) is set as a
function of two distances in our setup. The first is the trip distance (i.e., the distance
from the starting location to the ending location of v, denoted L1) while the second
is the docking distance (i.e., the distance from the docking position of u to the
starting/ending location of v, denoted L2). We set w(e) = max(L1??L2, 0), where
? is a parameter capturing the subtle balance between the positive contribution
from the trip distance and negative contribution from the docking distance to the
final profit. We set ? = 0.5 for the experiments. We consider each single day as the
time horizon and set the total number of rounds T = 24?60 = 288 by discretizing
5
the 24-hour period into a time-step of 5 minutes. Throughout this section, we use
time-step and round interchangeably.
Training. We use the training dataset of 12 days to learn various parameters. As
for the arrival rates {pt(v)}t?[T ],v?V , we first count the total number of appearances
of each request type v at time-step t in the 12 parts (denote it by ct(v)). Then,
we set pt(v) = ct(v)/12 when we assume known adversarial arrivals (KAD) and
63
p(v) = pt(v) = (c/12)/T (i.e., the arrival distributions are assumed the same across
all the time-steps for each v) when we assume i.i.d. arrivals (KIID). The estimation
procedure for the parameters of the two distributions for the occupation time is as
follows. We first compute the average number of seconds between two requests in
the dataset (note this was 5 minutes in the experimental setup). We then assume
that each time-step of our online process corresponds to a time-difference of this
average in seconds. We then compute the sample mean and sample variance of the
trip lengths (as number of seconds taken by the trip divided by five minutes) in the
12 parts. Hence we use the normal distribution obtained by this sample mean and
standard deviation as the distribution with which a car is unavailable. We assign
the docking position of each car to the location (in the discretized space) in which
the majority of the requests were initiated (i.e., starting location of a request) and
matched to this car.
6.4 Justifying The Two Important Model Assumptions
Known Adversarial Distributions. Figure 6.4 plots the number of arrivals of a
particular type at various times during the day. Notice the significant increase in the
number of requests in the middle of the day as opposed to the mornings and nights.
This justified our arrival assumption of KAD which assumes different arrival distri-
butions at different time-steps. Hence the LP (and the corresponding algorithm)
can exploit this vast difference in the arrival rates and potentially obtain improved
results compared to the assumption of Known Identical Independent Distributions
64
(KIID). This is confirmed by our experimental results shown in Figures 6.1 and 6.2.
Identical Occupation Time Distribution. We assume each car will be available
again via an independent and identical random process regardless of the matches
it received. The validity of our assumptions can be seen in Figures 6.5 and 6.6,
where the x-axis represents the different occupation time and the y-axis represents
the corresponding number of requests in the dataset responsible for each occupation
time. It is clear that for most requests the occupation time is around 2-3 time-steps
and dropping drastically beyond that with a long tail. Figure 6.6 displays occupation
times for two representative (we chose two out of the many cars we plotted, at
random) cars in the dataset; we see that the distributions roughly coincide with
each other, suggesting that such distributions can be learned from historical data
and used as a guide for future matches.
6.5 Results
Inspired by the experimental setup by Tong et al. [176], we run five different
algorithms on our dataset. The first algorithm is the ALG-LP. In this algorithm,
when a request v arrives, we choose a neighbor u with probability x?t (e)/pt(v) with
e = (u, v) if u is available. Here x?t (e) is an optimal solution to the benchmark
LP (4.3) and pt(v) is the arrival rate of type v at time-step t. The second algorithm
is called ALG-SC-LP. Recall that Et(v) is the set of ?safe? or ava?ilable assignments
with respect to v when the type v arrives at t. Let xt(v) =
?
e?E x (e). Int(v) t
ALG-SC-LP, we sample a safe assignment for v with probability x?t (e)/xt(v). The
65
Figure 6.1: OTD is normal distribution under KIID
next two algorithms are heuristics oblivious to the underlying LP. Our third algo-
rithm is called GREEDY which is as follows. When a request v comes, match it
to the safe neighbor u with the highest edge weight. Our fourth algorithm is called
UR-ALG which chooses one of the safe neighbors uniformly at random. Finally,
we use a combination of LP-oblivious algorithm and LP-based algorithm called
-GREEDY. In this algorithm when a type v comes, with probability  we use the
greedy choice and with probability 1?  we use the optimal LP choice. In our algo-
rithm, we optimized the value of  and set it to  = 0.1. We summarize our results
in the following plots. Figures 6.1, 6.2, and 6.3 show the performance of the five
algorithms and OPT (optimal value of the benchmark LP) under the different as-
66
Figure 6.2: OTD is normal distribution under KAD
67
Figure 6.3: OTD is power law distribution under KAD
68
Figure 6.4: The number of requests of a given type at various time-steps. x-axis:
time-step, y-aixs: number of requests
sumptions of the OTD (normal or power law) and online arrives (KIID or KAD). In
all three figures the x-axis represents test data-set number and the y-axis represents
average weight of matching.
Discussion. From the figures, it is clear that both the LP-based solutions, namely
ALG-LP and ALG-SC-LP, do better than choosing a free neighbor uniformly at
random. Additionally, with distributional assumptions the LP-based solutions out-
perform greedy algorithm as well. We would like to draw attention to a few inter-
esting details in these results. Firstly, compared to the LP optimal solution, our
LP-based algorithms have a competitive ratio in the range of 0.5 to 0.7. We believe
69
Figure 6.5: Occupation time distribution of all cars. x-axis: number of time-steps,
y-axis: number of requests
70
Figure 6.6: Occupation time distribution of two different cars. x-axis: number of
time-steps, y-axis: number of requests
71
this is because of our experimental setup. In particular, we have that the rates are
high (> 0.1) only in a few time-steps while in all other time-steps the rates are very
close to 0. This means that it resembles the structure of the theoretical worst case
example we showed in Chapter 4. In future experiments, running our algorithms
during peak periods (where the request rates are significantly larger than 0) may
show that competitive ratios in those cases approach 1. Secondly, it is surprising
that our algorithm is fairly robust to the actual distributional assumption we made.
In particular, from Figures 6.2 and 6.3 it is clear that the difference between the
assumption of normal distribution versus power-law distribution for the unavailabil-
ity of cars is negligible. This is important since it might not be easy to learn the
exact distribution in many cases (e.g., cases where the sample complexity is high)
and this shows that a close approximation will still be as good.
72
Part II
Multi-armed Bandits
73
Chapter 7: Bandits with Knapsacks
7.1 Introduction
We now focus on another model of sequential decision making paradigm, called
multi-armed bandits. Multi-armed bandits (MAB) is an elegant model for study-
ing the tradeoff between acquisition and usage of information, a.k.a. explore-exploit
tradeoff (Robbins [159], Thompson [175]). In each round an algorithm sequentially
chooses from a fixed set of alternatives (sometimes known as actions or arms), and
receives reward for the chosen action. Crucially, the algorithm does not have enough
information to answer all ?counterfactual? questions about what would have hap-
pened if a different action was chosen in this round. Studied over many decades,
multi-armed bandits is a very active research area spanning computer science, op-
erations research, and economics (Bergemann and Va?lima?ki [37], Bubeck and Cesa-
Bianchi [43], Cesa-Bianchi and Lugosi [54], Gittins et al. [94]).
In this work, we focus on bandit problems which feature supply or budget
constraints, as is the case in many realistic applications. For example, a seller who
experiments with prices may have a limited inventory, and a website optimizing ad
placement may be constrained by the advertisers? budgets. This general problem
is called Bandits with Knapsacks (BwK) since, in this model, a bandit algorithm
74
needs effectively to solve a knapsack problem (find an optimal packing of items into
a limited-size knapsack) or generalization thereof. The BwK model was introduced
in Badanidiyuru et al. [33] as a common generalization of numerous motivating
examples, ranging from dynamic pricing to ad allocation to repeated auctions to
network routing/scheduling. Various special cases with budget/supply constraints
were studied previously (e.g., Babaioff et al. [28], Badanidiyuru et al. [30], Besbes
and Zeevi [38], Combes et al. [67], Singla and Krause [169]).
In BwK, the algorithm is endowed with d ? 1 limited resources that are
consumed by the algorithm. In each round, the algorithm chooses an action (arm)
from a fixed set of K actions, and the outcome consists of a reward and consumption
of each resource; all are assumed to lie in [0, 1]. The algorithm observes bandit
feedback, i.e., only the outcome of the chosen arm. The algorithm stops at time
horizon T , or when the total consumption of some resource exceeds its budget. The
goal is to maximize the total reward, denoted REW.
7.1.1 Formal Model
We now describe the formal model. Figure 7.1 describes the protocol. There
are T rounds, K possible actions and d resources, indexed as [T ], [K], [d], respec-
tively. In each round t ? [T ], the algorithm chooses an action at ? [K] and receives
an outcome vector o d+1t = (rt; ct,1 , . . . , ct,d) ? [0, 1] , where rt is a reward and
ct,i is consumption of each resource i ? [d]. Each resource i is endowed with bud-
get Bi ? T . The game stops early, at some round ?alg < T , when/if the total
consumption of any resource exceeds its budget. The algorithm?s objective is to
75
maximize its total reward. Without loss of generality all budgets are the same:
B1 = B2 = . . . = B = B.
1
d
The outcome vectors are chosen as follows. In each round t, the adversary
chooses the outcome matrix M ? [0, 1]K?(d+1)t , where rows correspond to actions.
The outcome vector ot is defined as the at-th row of this matrix, denoted Mt(at).
Only this row is revealed to the algorithm. The adversary is deterministic and
oblivious, meaning that the entire sequence M1 , . . . ,MT is chosen before round
1. A problem instance of BwK consists of (known) parameters (d,K, T,B), and the
(unknown) sequence M1 , . . . ,MT .
In the stochastic version of BwK, henceforth termed Stochastic BwK, each
outcome matrix Mt is chosen from some fixed but unknown distribution DBwK over
the outcome matrices. An instance of this problem consists of (known) parameters
(d,K, T,B), and the (unknown) distribution DBwK.
In the adversarial version of BwK, henceforth termed Adversarial BwK, the
outcome matrix Mt is chosen by an adversary. We use two versions of the adversary,
namely oblivious adversary who chooses all the outcome matrices before the start
of the game and adaptive adversary who chooses the outcome matrix at any time
t ? [T ] after observing the actions a1, a2, . . . , at?1 of the algorithm.
Following prior work Agrawal and Devanur [11], Badanidiyuru et al. [33], we
assume, w.l.o.g., that one of the resources is a dummy resource similar to time;
1To see that this is indeed w.l.o.g., for each resource i, divide all per-round consumptions ct,i
by Bi/B, where B := mini?[d]Bi is the smallest budget. In the modified problem instance, all
consumptions still lie in [0, 1], and all the budgets are equal to B.
76
formally, each action consumes B/T units of this resource per round (we only need
this for Stochastic BwK). Further, we posit that one of the actions is a null action,
which lets the algorithm skips a round: it has 0 reward and consumes 0 amount of
each resource other than the dummy resource.
Given: number of time-steps T , resources d, arms K and budget B.
In each round t ? [T ],
1. ALG selects an arm at ? [K].
2. Nature reveals reward rt(at) and consumptions {ci,t(at)}i?[d].
3. ALG gets reward rt(at) and consumes ci,t(at) amount of each resource
i ? [d]
4. Stop if some resource consumed more than B amount.
Figure 7.1: Bandits with Knapsacks Protocol
7.1.2 Benchmarks
?
Let REW(ALG) = t?[? ] rt be the total reward of algorithm ALG in thealg
BwK problem. Our benchmark is the best fixed distribution (Definition 2), a distri-
bution over actions which maximizes E[REW(?)] for a particular problem instance.
The expected total reward of this distribution is denoted OPTFD.
For Stochastic BwK, one can compete with the best dynamic policy : an algo-
rithm that maximizes E[REW(?)] for a particular problem instance. Essentially, this
77
algorithm knows the latent distribution DBwK over outcome matrices. Its expected
total reward is denoted OPTDP.
We argue that the best fixed distribution over arms is an appropriate bench-
mark for Adversarial BwK. First, consider the total expected reward of the best
dynamic policy, denote it OPTDP. (The best dynamic policy is the best algorithm,
in hindsight, that is allowed to switch arms arbitrarily across time-steps.) This is
the strongest possible benchmark, but it is too strong for Adversarial BwK. Indeed,
we show a simple example with just one resource (with budget B), where competi-
tive ratio against this benchmark is at least T2 for any algorithm. Second, considerB
the total expected reward of the best fixed arm, denote it OPTFA. It is a traditional
benchmark in multi-armed bandits, but is uninteresting for Adversarial BwK. We
show that the competitive ratio is at least ?(K) in the worst case, and this is
matched, in expectation, by a trivial algorithm that samples one arm at random
and sticks with it forever.
For Stochastic BwK, these three benchmarks are related as follows. The best
fixed distribution is still the main object of interest, as far as the design and analysis
of algorithms is concerned. However, all results ? both ours and prior work ? are
almost automatically extended to compete against the best dynamic policy. The
best fixed arm is a much weaker benchmark than the best fixed distribution: there
are simple examples when their expected reward differs by a factor of two, in multiple
special cases of interest (Badanidiyuru et al. [33]).
78
7.2 Related Work
The literature on regret-minimizing online learning algorithms is vast; see
Bubeck and Cesa-Bianchi [43], Cesa-Bianchi and Lugosi [54], Hazan [104] for back-
ground. Most relevant are two algorithms for adversarial rewards/costs: Hedge
for full feedback (Freund and Schapire [89]), and EXP3 for bandit feedback (Auer
et al. [25]); both are based on the weighted majority algorithm from Littlestone and
Warmuth [130].
Stochastic BwK was introduced and optimally solved in Badanidiyuru et al.
[33]. Subsequent work extended these results to soft supply/budget constraints
(Agrawal and Devanur [11]), a more general notion of rewards2 (Agrawal and Deva-
nur [11]), combinatorial semi-bandits (Sankararaman and Slivkins [163]), and con-
textual bandits (Agrawal and Devanur [12], Agrawal et al. [14], Badanidiyuru et al.
[31]). Several special cases with budget/supply constraints were studied previously:
dynamic pricing (Babaioff et al. [28], Besbes and Zeevi [38, 39], Wang et al. [184]),
dynamic procurement (Badanidiyuru et al. [30], Singla and Krause [169]) (a version
of dynamic pricing where the algorithm is a buyer rather than a seller), dynamic
ad allocation (Combes et al. [67], Slivkins [170]), and a version with a single re-
source and unlimited time (Ding et al. [79], Gyo?rgy et al. [100], Tran-Thanh et al.
[177, 178]). While all this work is on regret minimization, Guha and Munagala
[96], Gupta et al. [97] studied closely related Bayesian formulations.
2The total reward is determined by the time-averaged outcome vector, but can be an arbitrary
Lischitz-concave function thereof.
79
Stochastic BwK was optimally solved using three different algorithms (Agrawal
and Devanur [11], Badanidiyuru et al. [33]), with extremely technical and delicate
analyses. All three algorithms involve inherently ?stochastic? techniques such as
?successive elimination? and ?optimism under uncertainty?, and do not appear to
extend to the adversarial version. One of them, PrimalDualBwK from Badanidiyuru
et al. [33], is a primal-dual algorithm superficially similar to ours. Indeed, it decou-
ples into two online learning algorithms: a ?primal? algorithm which chooses among
arms, and a ?dual? algorithm similar to ours, which chooses among resources. How-
ever, the two algorithms are not playing a repeated game in any meaningful sense, let
alone a zero-sum game. The primal algorithm operates under a much richer input:
it takes the entire outcome vector for the chosen arm, as well as the ?dual distribu-
tion? ? the distribution over resources chosen by the dual algorithm. Further, the
primal algorithm is very problem-specific: it interprets the dual distribution as a
vector of costs over resources, and chooses arms with largest reward-to-cost ratios,
estimated using ?optimism under uncertainty?.
Our approach to using regret minimization in games can be traced to Fre-
und and Schapire [87, 89] (see Ch. 6 in Schapire and Freund [167]), who showed
how a repeated zero-sum game played by two agents yields an approximate Nash
equilibrium.
This approach has been used as a unifying algorithmic framework for sev-
eral problems: boosting (Freund and Schapire [87]), linear programs (Arora et al.
[21]), maximum flow (Christiano et al. [66]), and convex optimization (Abernethy
?
and Wang [1], Wang and Abernethy [182]). While we use a result with the 1/ t
80
convergence rate for the equilibrium property, recent literature obtains faster con-
vergence for cumulative payoffs (but not for the equilibrium property) under various
assumptions (e.g., Rakhlin and Sridharan [155], Syrgkanis et al. [172], Wei and Luo
[185]).
Repeated Lagrangian games, in conjunction with regret minimization in games,
have been used in a series of recent papers (Agarwal et al. [7], Hsu et al. [106], Kearns
et al. [118], Rogers et al. [160], Roth et al. [161, 162]), as an algorithmic tool to solve
convex optimization problems; application domains range from differential privacy
to algorithmic fairness to learning from revealed preferences. All these papers deal
with deterministic games (i.e., same game matrix in all rounds). Reframing the
problem in terms of repeated Lagrangian games is a key technical insight in this
work.
Most related to our paper are Roth et al. [161, 162], where a repeated La-
grangian game is used as a subroutine (the ?inner loop?) in an online algorithm; the
other papers solve an offline problem. We depart from this prior work in several re-
spects: we use a stochastic game, we deal with some subtleties specific to Stochastic
BwK, and we provide a very different analysis for our main results on Adversarial
BwK, where we cannot rely on the standard machinery.
Online packing problems (e.g., Buchbinder and Naor [47], Devanur et al. [72],
see Buchbinder and Naor [48] for a survey) can be seen as a special case of Adversarial
BwK with a much more permissive feedback model: the algorithm observes full
feedback (the outcomes for all actions) before choosing an action. Online packing
subsumes various online matching problems, including the AdWords problem (Mehta
81
et al. [141]) motivated by ad allocation (see Mehta [139] for a survey). While we
derive O(log T ) competitive ratio against OPTFD, online packing admits a similar
result against OPTDP.
Another related line of work concerns online convex optimization with con-
straints Chen and Giannakis [63], Chen et al. [64], Mahdavi et al. [134, 135], Neely
and Yu [149]. Their setting differs from ours in several important respects. First,
the action set is a convex subset of RK (and the algorithms rely on the power to
choose arbitrary actions in this set). In particular, there is no immediate way to
handle discrete action sets.3 Second, convexity/concavity is assumed on the rewards
and resource consumption. Third, in addition to bandit feedback, full feedback is
observed for the resource consumption, and (in all papers except Chen and Gian-
nakis [63]) one also observes either full feedback on rewards or the rewards gradient
around the chosen action. Fourth, their algorithm only needs to satisfy the budget
constraints at the time horizon (whereas in BwK the budget constraints hold for all
rounds). Fifth, their fixed-distribution benchmark is weaker than ours: essentially,
its time-averaged consumption must be small enough at each round t. Due to these
differences, their setting admits sublinear regret in the adversarial setting, whereas
we have a ?(log T ) lower bound on the competitive ratio.
Logarithmic competitive ratios are quite common in prior work on approxima-
tion algorithms and online algorithms, e.g., in the context of the set cover problem
Johnson [112], Lova?sz [131], buy-at-bulk network design Awerbuch and Azar [26],
3Unless there is full feedback, in which case one can use a standard reduction whereby actions in
online convex optimization correspond to distributions over actions in a K-armed bandit problem.
82
sparsest cut Arora et al. [20], and dial-a-ride problem Charikar and Raghavachari
[58], the online k-server problem Bansal et al. [34], online packing/covering problems
Azar et al. [27], online set cover Alon et al. [18], online network design Umboh [181],
and online paging Fiat et al. [85].
Simultaneous work. Two very recent (and yet unpublished) manuscripts have
come to our attention after the initial version of this paper has appeared on arxiv.org.
Rivera et al. Rivera et al. [158] consider online convex optimization with knapsacks
with full feedback. Focusing on the stochastic version, they design an algorithm
similar to LagrangeBwK, and derive a regret bound similar to ours, using a simi-
lar analysis. They also claim an extension to bandit feedback, without providing
any details (such as precise statement of Lemma 11 in terms of the regret property
(7.4)).
Rangi et al. Rangi et al. [157] consider Adversarial BwK in the special case
when there is only one constrained resource, including time. They attain sublinear
regret, i.e., a regret bound that is sublinear in T . They also assume a known lower
bound cmin > 0 on realized per-round consumption of each resource, and their regret
bound scales as 1/cmin. They also achieve polylog(T ) instance-dependent regret for
the stochastic version using the same algorithm (matching results from prior work on
the stochastic version). BwK with only one constrained resource (including time) is
a much easier problem, compared to the general case with multiple resources studied
in this paper, in the following sense. First, the single-resource version admits much
?
stronger performance guarantees (polylog(T ) vs. T regret bounds for Stochastic
83
BwK, and sublinear regret vs. approximation ratio for Adversarial BwK). Second,
the optimal all-knowing time-invariant policy is the best fixed arm, rather than the
best fixed distribution over arms.
7.3 Challenges and Our Contributions
7.3.1 Challenges
Adversarial BwK is a much harder problem compared to Stochastic BwK. The
new challenge is that the algorithm needs to decide how much budget to save for the
future, without being able to predict it. (It is also the essential challenge in online
packing problems, and it drives our lower bounds.) This challenge compounds the
ones already present in Stochastic BwK: that exploitation may be severely limited
by the resource consumption during exploration, that optimal per-round reward no
longer guarantees optimal total reward, and that the best fixed distribution over
arms may perform much better than the best fixed arm. Jointly, these challenges
amount to the following. An algorithm for Adversarial BwK must compete, during
any given time segment [1, ? ], with a distribution over arms that maximizes the total
reward on this time segment. However, this distribution may behave very differently,
in terms of expected per-round outcomes, compared to the optimal distribution for
some other time segment [1, ? ?].
In more concrete terms, let OPTFD be the total expected reward of the best
fixed distribution over arms. In Stochastic BwK (as well as in adversarial bandits)
84
an algorithm can achieve sublinear regret: OPT 4FD?E[REW] = o(T ). In contrast,
in Adversarial BwK regret minimization is no longer possible, and we therefore are
primarily interested in the competitive ratio OPTFD /E[REW].
It is instructive to consider a simple example in which the competitive ratio
is at least 5 ? o(1) for any algorithm. There are two arms and one resource with
4
budget T . Arm 1 has zero rewards and zero consumption. Arm 2 has consumption
2
1 in each round, and offers reward 1 in each round of the first half-time (T rounds).
2 2
In the second half-time, it offers either reward 1 in all rounds, or reward 0 in all
rounds. Thus, there are two problem instances that coincide for the first half-time
and differ in the second half-time. The algorithm needs to choose how much budget
to invest in the first half-time, without knowing what comes in the second. Any
choice leads to competitive ratio at least 5 on one of the problem instances.
4
Extending this idea, we prove an even stronger lower bound on the competitive
ratio:
OPTFD /E[REW] ? ?(log T ). (7.1)
Like the simple example above, the lower-bounding construction involves only two
arms and only one resource, and forces the algorithm to make a huge commitment
without knowing the future.
?
4More specifically, one can achieve regret O?( KT ) for adversarial bandits (Auer et al. [25]),
as well as for Stochastic BwK if all budgets are ?(T ) Badanidiyuru et al. [33]. One can achieve
sublinear regret for Stochastic BwK if all budgets are ?(T?), ? ? (0, 1) (Badanidiyuru et al. [33]).
85
7.3.2 Our Contributions
Our main result is an algorithm which nearly matches (7.1), achieving
E[REW] ? 1 (OPT
O(log T ) FD
?o(OPTFD)) . (7.2)
We put forward a new algorithm for BwK, called LagrangeBwK, that unifies the
stochastic and adversarial versions. It has a natural game-theoretic interpretation
for Stochastic BwK, and admits a simpler analysis compared to the prior work.
For Adversarial BwK, we use LagrangeBwK as a subroutine, though with a different
parameter and a different analysis, to derive two algorithms: a simple one that
achieves (7.2), and a more involved one that achieves the same competitive ratio with
?
high probability. Absent resource consumption, we recover the optimal O?( KT )
regret for adversarial bandits.
LagrangeBwK is based on a new perspective on Stochastic BwK. We reframe a
standard linear relaxation for Stochastic BwK in a way that gives rise to a repeated
zero-sum game, where the two players choose among arms and resources, respec-
tively, and the payoffs are given by the Lagrange function of the linear relaxation.
Our algorithm consists of two online learning algorithms playing this repeated game.
We analyze LagrangeBwK for Stochastic BwK, building on the tools from regret min-
imization in stochastic games, and achieve a near-optimal regret bound when the
optimal value and the budgets are ?(T ).5
Discussion. LagrangeBwK has numerous favorable properties. As just discussed, it
5This regime is of primary importance in prior work, e.g., Besbes and Zeevi [38], Wang et al.
[184].
86
is simple, unifying, modular, and yields strong performance guarantees in multiple
settings. It is the first ?black-box reduction? from bandits to BwK: we take a bandit
algorithm and use it as a subroutine for BwK. This is a very natural algorithm for
the stochastic version once the single-shot game is set up; indeed, it is immediate
from prior work that the repeated game converges to the optimal distribution over
arms. Its regret analysis for Stochastic BwK is extremely clean. Compared to prior
work (Agrawal and Devanur [11], Badanidiyuru et al. [33]), we side-step the intricate
analysis of sensitivity of the linear program to non-uniform stochastic deviations that
arise from adaptive exploration.
LagrangeBwK has a primal-dual interpretation, as arms and resources corre-
spond respectively to primal and dual variables in the linear relaxation. Two players
in the repeated game can be seen as the respective primal algorithm and dual algo-
rithm. Compared to the rich literature 6 on (e.g., primal-dual algorithms Buchbinder
and Naor [48], Mehta [139], Williamson and Shmoys [188]) LagrangeBwK has a very
specific and modular structure dictated by the repeated game.
7.4 Preliminaries
In this section, we state the required preliminaries from prior work on adver-
sarial bandits and zero-sum games.
6This also includes the more recent literature on stochastic online packing problems
(e.g., Agrawal et al. [13], Devanur and Hayes [70], Devanur et al. [72], Feldman et al. [84], Molinaro
and Ravi [147]).
87
7.4.1 Adversarial Online Learning
Given: action set A, payoff range [bmin, bmax].
In each round t ? [T ],
1. the adversary chooses a payoff vector ft ? [bmin, b Kmax] ;
2. the algorithm chooses a distribution pt over A, without observing ft,
3. algorithm?s chosen action at ? A is drawn independently from pt;
4. payoff ft(at) is received by the algorithm.
Figure 7.2: Adversarial online learning
In this protocol (see Figure 7.2), the adversary can use previously chosen arms
to choose the payoff vector ft, but not the algorithm?s random seed. The distribution
ft is chosen as a deterministic function of history. (The history at round t consists,
for each round s < t, of the chosen action as and the observed feedback in this
round.) We focus on two feedback models: bandit feedback (no auxiliary feedback)
and full feedback (the entire payoff vector ft). The version for costs can be defined
similarly, by setting the payoffs to be the negative of costs.
We are interested in adversarial online learning algorithms with known upper
bounds on regret,
[ ? ] [? ]
RAOL(T ) := maxa?A t?[T ] ft(a) ? t?[T ] ft(at) . (7.3)
The benchmark here is the total payoff of the best arm, according to the payoff
88
vectors actually chosen by the adversary. More precisely, we assume high-probability
regret bounds of the following form:
?? > 0 Pr [ RAOL(T ) ? (bmax ? bmin)R?(T ) ] ? 1? ?, (7.4)
for some function R?(?). We will actually use a stronger version implied by (7.4),7
[ ]
?? > 0 Pr ?? ? [T ] RAOL(?) ? (bmax ? bmin)R?/T (T ) ? 1? ?. (7.5)
Algorithms EXP3.P (Auer et al. [25]) for bandit feedback, and Hedge (Freund and
Schapire [88]) for full feedback, satisfy (7.4) with, resp.,
(? ) (? )
R?(T ) = O |A|T log(T/?) and R?(T ) = O T log(|A|/?) . (7.6)
7.4.2 Regret minimization in games
We build on the framework of regret minimization in games. A zero-sum game
(A1, A2,G) is a game between two players i ? {1, 2} with action sets A1 and A2 and
payoff matrix G ? RA1?A2 . If each player i chooses an action ai ? Ai, the outcome
is a number G(a1, a2). Player 1 receives this number as reward, and player 2 receives
it as cost. A repeated zero-sum game G with action sets A1 and A2, time horizon T
and game matrices G1 , . . . ,GT ? RA1?A2 is a game between two algorithms, ALG1
and ALG2, which proceeds over T rounds such that each round t is a zero-sum game
(A1, A2,Gt). The goal of ALG1 is to maximize the total reward, and the goal of
ALG2 is to minimize the total cost.
7Regret bound (7.5) follows from (7.4) using a simple ?zeroing-out? trick: for a given round
? ? [T ], the adversary can set all future payoffs to some fixed value x ? [bmin, bmax], in which case
RAOL(?) = RAOL(T ).
89
The game G is called stochastic if the game matrix Gt in each round t is drawn
independently from some fixed distribution. For such games, we are interested in
the expected game, defined by the expected game matrix G = E[Gt]. We can relate
the algorithms? performance to the minimax value of G.
Lemma 11. Consider a stochastic repeated zero-sum game between algorithms ALG1
and ALG2, with payoff range [bmin, bmax]. Assume that each ALGj, j ? {1, 2} is an
algorithm for adversarial online learning, as per Figure 7.2, which satisfies regret
bound (7.4) with R?(T ) = Rj,?(T ).
Let ? be some fixed round in the game. For each algorithm ALGj, j ? {1, 2},
let Aj be its ac?tion set, let pt,j ? ?A be the distribution chosen in each round t,j
and let p?j =
1
t?[? ] pt,j be the average play distribution at round ? . Let v
? be the
?
minimax value for the expected game G = E[Gt].
Then for each ? > 0, with probability at least 1? 2? it holds that
( ? )
?p2 ? ? p?T G p ? v? 1A2 1 2 ? (bmax ? bmin) R1, ?/T (T ) +R2, ?/T (T ) + 4 2T log(T/?) .?
(7.7)
Eq. (7.7) states that the average play of player 1 is approximately optimal
against any distribution chosen by player 2.8 This lemma is well-known for the de-
terministic case (i.e., when Gt = G for each round t), and folklore for the stochastic
case. We provide a proof in Appendix 10.3.2 for the sake of completeness.
8If each player j chooses distribution pj ? ?A , and the game matrix is G, then expectedj
reward/cost is pT1 Gp2.
90
7.5 A new algorithm for Stochastic BwK
We present a new algorithm for Stochastic BwK, based on the framework of
regret minimization in games. This is a very natural algorithm once the single-shot
game is set up, and it allows for a very clean regret analysis. We will also use this
algorithm as a subroutine for the adversarial version.
On a high level, we define a stochastic zero-sum game for which a mixed Nash
equilibrium corresponds to an optimal solution for a linear relaxation of the original
problem. Our algorithm consists of two regret-minimizing algorithms playing this
game. The framework of regret minimization in games guarantees that the average
primal and dual play distributions (p?1 and p?2 in Lemma 11) approximate the mixed
Nash equilibrium in the expected game, which correspondingly approximates the
optimal solution.
7.5.1 Linear relaxation and Lagrange functions
We start with a linear relaxation of the problem that all prior work relies on.
This relaxation is stated in terms of expected rewards/consumptions, i.e., implicitly,
in terms of the expected outcome matrix M = E[Mt]. We explicitly formulate the
relaxation in terms of M, and this is essential for the subsequent developments. For
ease of notation, we write the a-th row of M, for each action a ? [K], as
M(a) = (rM(a); cM1 (a) , . . . , c
M
d (a)),
91
so that rM(a) is the expected reward and cMi (a) is the expected consumption of each
resource i.
Essentially, the relaxation assumes that each instantaneous outcome matrix
Mt is equal to the expected outcome matrix M = E[Mt]. The relaxation seeks the
best distribution over actions, focusing on a single round with budgets rescaled as
B/T . This leads to the following linear program (LP):
?
maximize ?a?[K] X(a) r
M(a) such that
?a?[K] X(a) = 1 (7.8)
?i ? [d] X(a) cMa?[K] i (a) ? B/T
?a ? [K] 0 ? X(a) ? 1.
We denote this LP by LPM,B,T . The solution X is the best fixed distribution over
actions, according to the relaxation. The value of this LP, denoted OPTLP(M, B, T ),
is the expected per-round reward of this distribution. It is also the total reward of
X in the relaxation, divided by T . We know from Badanidiyuru et al. [33] that
T ?OPTLP(M, B, T ) ? OPTDP ? OPTFD, (7.9)
where OPTDP and OPTFD are the total expected rewards of, respectively, the best
dynamic policy and the best fixed distribution. In words, OPTDP is sandwiched
between the total expected reward of the best fixed distribution and that of its
linear relaxation.
Associated with the linear program LPM,B,T is the Lagrange function L =
92
LM,B,T . It is a function L : ?K ? Rd?0 ? R?defined as? ? ? ?
L T(X, ?) := X(a) rM(a) + ? ?i 1? X(a) cMi (a)? . (7.10)B
a?[K] i?[d] a?[K]
The values ?1 , . . . , ?d in Eq. (7.10) are called the dual variables, as they correspond
to the variables in the dual LP. Lagrange functions are meaningful due to their max-
min property (e.g., Theorem D.2.2 in Ben-Tal and Nemirovski [36]):
min max L(X, ?) = max minL(X, ?) = OPTLP(M, B, T ). (7.11)
??0 X??K X??K ??0
This property holds for our setting because LPM,B,T has at least one feasible solution
(namely, one that puts probability one on the null action), and the optimal value of
the LP is bounded.
Remark 2. We use the linear program LPM,B,T and the associated Lagrange func-
tion LM,B,T throughout this chapter. Both are parameterized by an outcome matrix
M, budget B and time horizon T . In particular, we can plug in an arbitrary M,
and we heavily use this ability throughout. For the adversarial version, it is essential
to plug in parameter T0 ? T instead of the time horizon T . For the analysis of the
high-probability result in Adversarial BwK, we use a rescaled budget B0 ? B instead
of budget B.
7.5.2 Our algorithm: repeated Lagrangian game
The Lagrange function L = LM,B,T from (7.10) defines the following zero-sum
game: the primal player chooses an arm a, the dual player chooses a resource i, and
93
the payoff is a number
L(a, i) = rM(a) + 1? T cMi (a). (7.12)B
The primal player receives this number as a reward, and the dual player receives it
as cost. This game is termed the Lagrangian game induced by LM,B,T . This game
will be crucial throughout this chapter.
The Lagrangian game is related to the original linear program as follows:
Lemma 12. Assume one of the resources is the dummy resource. Consider the
linear program LPM,B,T , for some outcome matrix M. Then the value of this LP
equals the minimax value v? of the Lagrangian game induced by LM,B,T . Further, if
(X, ?) is a mixed Nash equilibrium in the Lagrangian game, then X is an optimal
solution to the LP.
The proof can be found in Appendix 10.3. The idea is that because of the
special structure of the LP, the second equality in (7.11) also holds when the dual
vector ? is restricted to distributions.
Consider a repeated version of the Lagrangian game. Formally, the repeated
Lagrangian game with parameters B0 ? B and T0 ? T is a repeated zero-sum game
between the primal algorithm that chooses among arms and the dual algorithm that
chooses among resources. Each round t of this game is the Lagrangian game induced
by the Lagrange function Lt := LMt,B0,T0 , where Mt is the round-t outcome matrix.
Note that we use parameters B0, T0 instead of budget B and time horizon T .
9
9These parameters are needed only for the adversarial version. For Stochastic BwK we use
B0 = B and T0 = T .
94
Remark 3. Consider repeated Lagrangian game for Stochastic BwK (with B0 = B
and T0 = T ). The payoffs in the expected game are defined by the expected Lagrange
function L := E[Lt]. By linearity, L is the Lagrange function for the expected
outcome matrix M = E[Mt]:
L := E[Lt] = LM,B,T . (7.13)
Our algorithm, called LagrangeBwK, is very simple: it is a repeated Lagrangian
game in which the primal algorithm receives bandit feedback, and the dual algorithm
receives full feedback.
To set up the notation, let at and it be, respectively, the chosen arm and
resource in round t. The payoff is therefore Lt(at, it). It can be rewritten in terms
of the observed outcome vector ot = (rt; ct,1 , . . . , ct,d) (which corresponds to the
at-th row of the instantaneous outcome matrix Mt):
Lt(at, it) = r + 1? T0 c T0t B t,i0 t ? [? + 1, 2]. (7.14)B0
Note that the payoff range is [bmin, b
T0
max] = [? + 1].B0
With this notation, the pseudocode for LagrangeBwK is summarized in Al-
gorithm 5. The pseudocode is simple and self-contained, without referring to the
formalism of repeated games and Lagrangian functions. Note that the algorithm is
implementable, in the sense that the outcome vector ot revealed in each round t of
the BwK problem suffices to generate full feedback for the dual algorithm.
95
Algorithm 5: Algorithm LagrangeBwK for Stochastic BwK.
input: parameters B0, T0, primal algorithm ALG1, dual algorithm ALG2.
// ALG1, ALG2 are adversarial online learning algorithms
// with bandit feedback and full feedback, respectively
for round t = 1, 2, 3, . . . do
1. ALG1 returns arm at ? [K], algorithm ALG2 returns resource it ? [d].
2. arm at is chosen, outcome vector ot = (rt(at); ct,1(at) , . . . , ct,d(at)) ?
[0, 1]d+1 is observed.
3. The payoff Lt(at, it) from (7.14) is reported to ALG1 as reward, and
to ALG2 as cost.
4. The payoff Lt(at, i) is reported to ALG2 for each resource i ? [d].
7.5.3 Performance guarantees
We consider algorithm LagrangeBwK with parameter T0 = T . We assume the
existence of the dummy resource; this is to ensure that the crucial step, Eq. (7.20),
works out even if the algorithm stops at time T , without exhausting any actual
?
resources. We obtain a regret bound that is non-trivial whenever B > ?( T ), and
is optimal, up to log factors, in the regime when min(OPTDP, B) > ?(T ).
Theorem 6. Consider Stochastic BwK with K arms, d resources, time horizon T ,
and budget B. Assume that one resource is the dummy resource (with consumption
B for each arm). Fix the failure probability parameter ? ? (0, 1). Consider algorithm
T
96
LagrangeBwK with parameters B0 = B, T0 = T .
If EXP3.P and Hedge are used as the primal and the dual algorithms, respec-
tively, then the algorithm achieves the following regret bound, with probability at least
1? ?:
(
T ? )
OPTDP?REW(LagrangeBwK) ? O TK log(dT/?) . (7.15)
B
In general, suppose each algorithm ALGj satisfies a regret bound (7.4) with
R?(T ) = Rj,?(T ) and payoff range [b
T
min, bmax] = [? + 1, 2]. Then with probabilityB
at least 1?O(?T ) it holds that
( ) ( ? )
OPTDP?REW(LagrangeBwK) ? O T RB 1, ?/T (T ) +R2, ?/T (T ) + T log(dT/?) .
(7.16)
Remark 4. To obtain (7.15) from the ?black-box? result (7.16), we use regret bounds
in Eq. (7.6).
Remark 5. From Badanidiyuru et al. [33], the optimal regret bound for Stochastic
BwK is (? ? )
OPTDP?E[REW] ? O? K OPTDP (1 + OPTDP /B) .
Thus, the regret bound (7.15) is near-optimal if min(OPTDP, B) > ?(T ), and non-
?
trivial if B > ?( T ).
We next prove the ?black-box? regret bound (7.16). For the sake of analysis,
consider a version of the repeated Lagrangian game that continues up to the time
horizon T . In what follows, we separate the ?easy steps? from what we believe is
the crux of the proof.
97
Notation. Let Xt ?be the distribution chosen in round t by the primal algorithm
ALG1. Let X :=
1
? t?[? ] Xt be the distribution of average play up to round ? . Let?
M = E[Mt] be the expected outcome matrix. Let r = (rM(a) : a ? [K]) be the
vector of expected rewards over the actions. Likewise, ci = (c
M
i (a) : a ? [K]) be
the vector of expected consumption of each resource i ? [d].
Using Azuma-Hoeffding inequality. Consider the first ? rounds, for some ? ?
[T ]. The average reward and resource-i consumption over these rounds are close to
X? ? r and X? ? ci, respectively, with high probability. Specifically, a simple usage of
Azuma-Hoeffding inequality (Lemma 28) implies that
?
1 r
? t
? X? ? r?R0(?)/?, (7.17)
?t?[? ]
1 c
? i,t
? X? ? ci +R0(?)/?, ?i ? [d], (7.18)
t?[? ] ?
hold with probability at least 1? ?, where R0(?) = O( ? log(d/?)).
Regret minimization in games. Let us apply the machinery from regret min-
imization in games to the repeated Lagrangian game. Consider the game matrix
G of the expected game. Using Eq. (7.13) and Lemma 12, we conclude that the
minimax value of G is v? = OPTLP(M, B, T ).
We apply Lemma 11, with a fixed stopping time ? ? [T ]. Recall that the
payoff range is b Tmax ? bmin = + 1. Thus, with probability at least 1? 2? it holdsB
that
T
? ? ?d : X? G? ? v? ? 1 ( T + 1) ? reg(T ), (7.19)? B
?
where the regret term is reg(T ) := R1, ?/T (T ) +R2, ?/T (T ) + 4 2T log(T/?).
98
Crux of the proof. Let us condition on the event that (7.17), (7.18), and (7.19)
hold for each ? ? [T ]. By the union bound, this event holds with probability at
least 1? 3?T .
Let ? denote the stopping time of the algorithm, the first round when the total
consumption of some resource exceeds its budget. Let i be the resource for which
this happens; hence,
?
ci,t > B. (7.20)
t?[? ]
Let us use Eq. (7.19) with ? = ?(i), the point distribution for this resource.
From Eq. (7.13), we have
T
X G?(i)? = X? ? r + 1? T X? ? ci.B
Plugging in (7.17) and (7.18) we get
((? ) ( ? ) )
X? ? r + 1? T X ? c ? 1 r ? T TB ? i ? t?[? ] t B t?[? ] ci,t + ? ? (1 + )R (?) .B 0
Finally, plugging in Eq. (7.20) we get this to be upper-bounded by the following.
((? ) )
? 1 t?[? ] r
T
? t
+ ? ? T ? (1 + )R0(?) .B
Plugging this into Eq. (7.19) and rearranging, we obtain
?
rt ? ? v? + T ? ? ? (1 + T ) ? reg(T ).B
t?[? ]
Since v? ? 1 (because v? = OPTLP, as we?ve proved above),
?
REW(LagrangeBwK) = rt ? T v? ? (1 + T ) ? reg(T ).B
t?[? ]
The claimed regret bound (7.16) follows by Eq. (7.9), completing the proof of The-
orem 6.
99
7.6 A simple algorithm for Adversarial BwK
We present and analyze an algorithm for Adversarial BwK which achieves
O(d2 log T ) competitive ratio, in expectation, up to a low-order additive term.
Our algorithm is very simple: we randomly guess the value of OPTFD and run
LagrangeBwK with parameter T0 driven by this guess. The analysis is very differ-
ent, however, since we cannot rely on the machinery from regret minimization in
stochastic games. The crux of the analysis (Lemma 14) is re-used to analyze the
high-probability algorithm in the next section.
The intuition for our algorithm can be explained as follows. LagrangeBwK
builds on adversarial online learning algorithms ALGj, and appears plausibly ap-
plicable to Adversarial BwK. We analyze it for Adversarial BwK, with an arbitrary
parameter T0 (see Lemma 14, the crux of our analysis), and find that it performs
best when T0 is tailored to OPTFD up to a constant multiplicative factor. This is
precisely what our algorithm achieves using the random guess.
Our algorithm is presented as Algorithm 6. We randomly guess the value of
OPTFD from within a specified range [gmin, gmax], up to the specified multiplicative
factor of ? > 0. We consider multiplicative scales [?u, ?u+1], u ? N, and we guess
uniformly at random among all possible u. Our analysis works as long as OPTFD ?
[g , g ] and ? ? d + 1; then we obtain competitive ratio ?2 dlog gmaxmin max e up to agmin
low-order additive term. As a corollary, we obtain competitive ratio ?2
ceil log T e with no assumptions.
Theorem 7. Consider Adversarial BwK with K arms, d resources, time horizon T ,
100
Algorithm 6: A simple algorithm for Adversarial BwK.
input: scale parameter ? > 0, guess range [gmin, gmax], primal and dual
algorithms ALG1, ALG2
// ALG1, ALG2 are adversarial online learning algorithms
// with bandit feedback and full feedback, resp.
1 Choose u?uniforml?y at random from {0, 1 , . . . , umax}, where
umax = log gmax? .gmin
2 Guess the value of OPTFD as g? = gmin ? ?u.
3 Run LagrangeBwK with algorithms ALG1, ALG2 and parameters B0 = B and
T0 = g?/?.
and budget B. Assume that one of the arms is a null arm that has zero reward and
zero resource consumption. Consider Algorithm 6 with scale parameter ? ? d + 1.
Suppose algorithms ALGj that satisfy the regret bound Eq. (7.4) with ? = T
?2 and
regret term R?(T ) = Rj,?(T ), for any known payoff range [bmin, bmax].
If OPTFD ? [gmin, gmax] then the expected reward of Algorithm 6 satisfies( ? ?)
E[REW] ? (OPT ?reg)/ ?2 log gmaxFD ? , (7.21)
( ) gmin
where reg = (1 + OPTFD ) R1, ?/T (T ) +R2, ?/T (T ) . Taking [g?B min, gmax] = [1, T ], we
obtain
( )
E[REW] ? (OPTFD?reg)/ ?2 dlog? T e . (7.22)
Remark 6. One can use algorithms EXP3.P for ALG1 and Hedge for ALG2, with
regret bounds given by Eq. (7.6), and achieve the regret term
( ) ?
reg = O 1 + OPTFD TK log(Td/?).
?B
101
We obtain a meaningful performance guarantee as long as, say, reg < OPTFD /2;
?
this requires OPTFD and B to be at least ??( TK).
Remark 7. We define the outcome matrices slightly differently compared to Sec-
tion 7.5 in that we do not posit a dummy resource. Formally, we assume that the
null arm has zero consumption in every resource. This is essential for case 1 ( i.e.,
when ?alg ? ?) in the analysis of Lemma 14.
If a problem instance of Adversarial BwK is actually an instance of adversar-
?
ial bandits, then we recover the optimal O?( KT ) regret. (This easily follows by
examining the proof of Lemma 14.)
Lemma 13. Consider LagrangeBwK, with algorithms EXP3.P for ALG1 and Hedge
for ALG2, for an instance of Adversarial BwK with zero resource consumption. This
?
algorithm obtains O?( KT ) regret, for any parameters B0, T0 > 0. Accordingly, so
does Algorithm 6 with any scale parameter ? > 0.
7.6.1 Analysis: proof of Theorem 7 and Lemma 13
Stopped linear program. Let us set up a linear relaxation that is suitable to the
adversarial setting. The expected outcome matrix is no longer available. Instead,
we use average outcome matrices:
?
M = 1? Mt, (7.23)?
t?[? ]
the average up to a given intermediate round ? ? [T ]. Similar to the stochastic case,
the relaxation assumes that each instantaneous outcome matrix Mt is equal to the
102
average outcome matrix M? . What is different now is that the relaxation depends
on ? : using M? is tantamount to stopping precisely at this round.
With this intuition in mind, for a particular end-time ? we consider the linear
program (7.8), parameterized by the time horizon ? and the average outcome matrix
M? . Its value, OPTLP(M? , B, ?), represents the per-round expected reward, so it
needs to be scaled by the factor of ? to obtain the total expected reward. Finally,
we maximize over ? . Thus, our linear relaxation for Adversarial BwK is defined as
follows:
[T ]
OPTLP := max ? ?OPTLP(M? , B, ?) ? OPTFD . (7.24)
??[T ]
The proof of Eq. (7.24) is similar to prior work (Badanidiyuru et al. [33], Devanur
et al. [72]). Denote D? to be the set of all distributions over t?he arms such that for
every p ? D? we have the following: for every i ? [d] we have t?[? ] p ? ct,i ? B. In
other words, D? denotes the set of distributions whose expected?stopping time is at
least ? . Thus it immediately follows that OPTLP(?) ? maxp?D? t?[? ] p ? rt since for
any given p ? D? it is feasible to LP(?). Thus OPTLP(?) is at least the value of any
feasible solution p ? D? . Note that for every fixed distribution p ? ?K , there exists
a ? such that either p ? D? and ?p ?6 D?+1 or p ? DT . Moreover the total expected
reward we can obtain using p is t?[? ] p ? rt. Thus max1???T OPTLP(?) ? OPTFD.
Regret bounds for ALGj. Since each algorithm ALGj, j ? {1, 2} satisfies regret
bound Eq. (7.4) with ? = T?2 and R?(T ) = Rj,?(T ), it also satisfies a stronger
version (7.5) with the same parameters. Recall from Eq. (7.14) that the payoff
range is [bmin, bmax] = [?T0 +1, 2]. For succinctness, let Uj(T |T0) = (1+ T0 )RB B j, ?/T (T )
103
denote the respective regret term in Eq. (7.5).
Let us apply these regret bounds to our setting. Let at ? [K] and it ? [d] be,
resp., the chosen arm and resource in round t. We represent the outcomes as vectors
over arms: rt, ct,i ? [0, 1]K denote, resp., reward vector and resource-i consumption
vector for a given round t. Recall that the round-t payoffs in LagrangeBwK are given
by the Lagrange function Lt := LMt,B,T0 such that
Lt(a, i) = rt(a) + 1? T0 cB t,i(a) (7.25)
for each arm a and resource i. Consider the total Lagrangian payoff at a given round
? ? [T ]:
?
Lt(at, it) = REW? +? ?W? , (7.26)
? t?[? ] ?
where REW? = t?[? ] rt(at) is the total reward up to round ? , andW =
T0
? B t?[? ] ct,it(at)
is the consumption term. The regret bounds sandwich Eq. (7.26) from above and
b?elow:? ? ? ? ? ?max Lt(a, it)?? U ? ?1(T |T0) ? REW? +? ?W? ? min Lt(at, i) + U2(T |T0).
a?[K] i?[d]
t?[? ] t?[? ]
(7.27)
This holds for all ? ? [T ], with probability at least 1 ? 2?. The first inequality
in Eq. (7.27) is due to the primal algorithm, and the second is due to the dual
algorithm. Call them primal and dual inequality, respectively.
Crux of the proof. We condition on the event that Eq. (7.27) holds for all
? ? [T ], which we call the clean event. The crux of the analysis is encapsulated in
104
the following lemma, which analyzes an execution of LagrangeBwK with an arbitrary
parameter T0 under the clean event.
Lemma 14. Consider an execution of LagrangeBwK with B0 = B and an arbitrary
parameter T0 such that the clean event holds. Fix an arbitrary round ? ? [T ], and
consider the LP value relative to this round:
f(?) := OPTLP(M?, B, ?). (7.28)
The algorithm?s reward up to round ? satisfies
REW? ? min(T0, ? ? f(?)? dT0)? ( U1(T |T0) + U2(T |T0) ) . (7.29)
Taking ? to be the maximizer in Eq. (7.24), algorithm?s reward satisfies
REW ? min(T0,OPTFD?dT0)? ( U1(T |T0) + U2(T |T0) ) . (7.30)
Proof. Let ?alg be the stopping time of the algorithm. We consider two cases, de-
pending on whether some resource is exhausted at time ?. In both cases, we focus
on the round min(?alg, ?).
Case 1: ?alg ? ? and some resource i?s exhausted. Let us focus on round
? = ?alg. If i is the exhausted resource, then t?[? ] ct,i(at) > B. Let us apply the
dual inequality in Eq. (7.27) for this resource:
?
REW? +? ?W? ? U2(T |T0) ? Lt(at, i)
t?[? ] ?
= REW +? ? T0? ct,i(at)B
t?[? ]
? REW? +? ? T0.
105
It follows that W? ? T0 ? U2(T |T0).
Now, let us apply the primal inequality in Eq. (7.27) for the null arm. Recall
that the reward and consumption for this arm is 0, so Lt(null, it) = 1 for each
round t. Therefore,
?
REW? +? ?W? + U1(T |T0) ? Lt(null, it) = ?.
t?[? ]
We conclude that REW? ? W? ? U1(T |T0) ? T0 ? U1(T |T0)? U2(T |T0).
Case 2: ?alg ? ?. Let us focus on round ?. Consider the linear program
LPM ,B,?, and let X
? ? ?K be an optimal solution to this LP. The primal inequality?
in Eq. (7.27) implies that
?
REW? +? ?W? + U1(?) ? max Lt(a, it)
a?[K]
? ?t?[?]
? X?(a) Lt(a, it)
t?[?] a??[K] ?
= ? + X? ? r T0 ?t ? X ? cB t,it
t?[?] ? t?[?]
REW? ? ? ? f(?)? T0 X? ? ct,it ? U1(T |T0). (7.31)B
? t?[?]
In the last inequality we used the fact that ?t?[?] X ? rt = ? ? f(?) by optimality
of X?.? ?
t?[?] X ? ct,i ? B for each resource i, since X? is a feasible solution for
OPTLP(M?, B, ?). Then,
? ? ?
X? ? c ?t,it ? X ? ct,i ? dB. (7.32)
t?[?] i?[d] t?[?]
Plugging Eq. (7.32) into Eq. (7.31), we conclude that REW? ? ? ? f(?) ? dT0 ?
U1(T |T0).
106
Conclusions from the two cases imply Eq. (7.30), as claimed.
Wrapping up. If OPTFD lies in the guess range [gmin, gmax], then some guess g? is
approximately correct:
OPTFD /? ? g? ? OPTFD .
With such a guess g?, and provided that ? ? d+ 1, we have T 20 = g?/? ? OPTFD /? ,
and
OPTFD?dT0 ? OPTFD(1? d ) ? OPT /?.? FD
So, by Lemma 14, the algorithm?s execution with this guess, assuming the clean
event, satisfies Eq. (7.30) with min(T0,OPTFD?dT0) ? OPT /?2FD and T0 ? OPTFD /?.
The regret term for this guess is
U1(T |T0) + U (T |T ) ? (1 + OPTFD2 0 ) (R?B 1, ?/T (T ) +R2, ?/T (T )).
To?complete?the proof of Theorem 7, we obtain a suitable guess g? with probability
1/ log gmax? .gmin
Proof Sketch of Lemma 13. Recall that in the adversarial bandit setting we have
ci,t = 0 for every i ? [d] and every t ? [T ]. We re-analyze Lemma 14 with ? = T .
Notice t?hat case 1 never occurs. Thus we obtain obtain Eq. (7.31) in case 2. Note
that T0 ?t?[?] X ? ct,it = 0 since cB i,t = 0. Therefore, we obtain
REWT ? T ? f(T )? U1(T |T0).
?
We now argue that T ? f(T )?= maxa?[K] t?[T ] rt(a). Let X
? be the optimal
distribution over the arms. Thus X?t?[T ] ? rt = T ? f(T ). Note that since ci,t = 0
107
the only constraint on X??is that it lies in ?K . Therefore the maximizer is a point
distribution on maxa?[K] t?[T ] rt(a). This proof does not rely on an(y speci)fic value?
for B0, T0. The payoff range is [bmax, bmin] = [1, 2], so U1(T |T0) = O? KT .
7.7 High-probability algorithm for Adversarial BwK
In this section we recover the O(log T ) approximation ratio for Adversarial
BwK, but with high probability rather than merely in expectation. Our algo-
rithm uses LagrangeBwK as a subroutine, and re-uses the adversarial analysis thereof
(Lemma 14). We do not attempt to optimize the regret term.
The algorithm is considerably more complicated compared to Algorithm 6.
[T ]
Instead of making one random guess g? for the value of OPTLP , we iteratively refine
this guess over time. The algorithm proceeds in phases. In the beginning of each
phase, we start a fresh instance of LagrangeBwK with parameter T0 defined by the
current value of g?.10 We update the guess g? in each round (in a way specified later),
and stop the phase once g? becomes too large compared to its initial value in this
phase. We invoke LagrangeBwK with a rescaled budget B0 = B/? (log T ). Within
each phase, we simulate the BwK problem with budget B0: we stop LagrangeBwK
once the consumption of some resource in this phase exceeds B0. For the remainder
of the phase, we play the null arm with probability 1??0 and do uniform exploration
with the remaining probability, for some parameter ?0 ? (0, 1) (here and elsewhere,
uniform exploration refers to choosing each action with equal probability). The
10The idea of restarting the algorithm in each phase is similar to the standard ?doubling trick?
in the online machine learning literature, but much more delicate in our setting.
108
pseudocode is summarized in Algorithm 7.
Algorithm 7: High-probability algorithm for Adversarial BwK.
input: scale parameter ?, exploration parameter ?0, primal algorithm ALG1,
dual algorithm ALG2
// ALG1, ALG2 are adversarial online learning algorithms
// with bandit feedback and full feedback, resp.
1 Initialize g? = 1.
2 for each phase do
3 Start a fresh instance ALG of LagrangeBwK
4 with parameters B0 = B/2dlog? T e and T0 = g?/(dlog? T e?2).
5 for each round in this phase do
6 Recompute the global estimate g?
7 if g? > T0/? then start a new phase
8 if consumption of all resources in this phase does not exceed B0 then
9 Play the action chosen by ALG, observe the outcome and report it
back to ALG.
10 else
11 Choose the null arm with probability 1? ?0, do uniform
exploration otherwise
To complete algorithm?s specification, let us define how to update the guess
[t]
g? in each round t. The guess, denoted g?t, is an estimate for OPTLP, as defined in
(7.24). We form this estimate using a standard inverse propensity scoring (IPS )
109
technique. Let pt and at be, resp., the distribution and the arm chosen by the
primal algorithm in round t. The instantaneous outcome matrix Mt is estimated
by matrix Mips ? [0,?)K?dt such that each row M
ips
t (a) is defined as follows:
Mipst (a) := 1
1
{at=a} M (a).ft(a tt)
For a given end-time ? , the average outcome matrix M
? ?
from (7.23) is estimated as
ips
M? :=
1 Mips.
? t
t?[? ]
Finally, we plug this estimate into (7.23) and define
ips
g?t := max ? ?OPTLP(M? , B, ?). (7.33)
??[T ]
For the analysis, we will assume that the primal algorithm does some uniform ex-
ploration:
pt(a) ? ? > 0 for each arm a ? [K] and each round t ? [T ]. (7.34)
Theorem 8. Consider Adversarial BwK with K arms, d resources, time horizon T ,
and budget B; assume B > 4T 3/4. Suppose that one of the arms is a null arm that
has zero reward and zero resource consumption. Let ? > 0 be the failure probability
parameter.
Consider Algorithm 7 with parameters ? ? d+1 and ?0 = T?1/4. Assume that
each algorithm ALGj, j ? {1, 2}, satisfies the regret bound (7.4) with payoff range
[b Tmin, bmax] = [? + 1, 2] and regret term R?(T ) = Rj,?(T ). Assume that the primalB
algorithm ALG1 satisfies (7.34) with parameter ? ? T?1/4.
Then the total reward REW collected by Algorithm 7 satisfies
[ ( ) ]
Pr REW ? (OPTFD?reg)/ 2?4 dlog? T e ? 1?O(?T ), (7.35)
110
( )
where the regret term is reg = T K T 3/4 log1/2(1) +R1, ?/T (T ) +RB ? 2, ?/T (T ) .
Remark 8. Using algorithms EXP3.P for ALG1 and Hedge for ALG2, we can
achieve (7.35) with ( ) ?
reg = O TK T 3/4 log(T/?).
B
This is because EXP3.P, with appropriately modified unifo?rm exploration term ? =
T?1/4, satisfies the regret bound (7.4) with R?(?) = O(T
3/4) K log T , and for Hedge
?
we can (still) use 7.6. The theorem is meaningful whenever, say, reg < OPTFD /2.
The latter requires OPT ?B > ??(T 7/4FD ).K
Remark 9. Like in Theorem 7, we posit that the null arm does not consume any
resources.
Proof Sketch. The proof consists of several steps. First, we argue that the guess
[t]
g?t is close to OPTLP with high probability. This argument only relies on the uni-
form exploration property (7.33) and the definition of IPS estimators, not on any
properties of the algorithm. We immediately obtain concentration for the aver-
age outcome matrices; a somewhat subtle point is to derive concentration on the
respective LP-values.
Next, we focus on a particular phase in the execution of the algorithm. We
say that a phase is full if the stopping condition g?t > T0/? has fired. We focus on
the last full phase. We prove there is enough reward to be collected in this phase.
Essentially, letting ?1, ?2 be, resp., the start and end time of this phase, we consider
the BwK problem restricted to time interval [?1, ?2], and lower-bound the LP-value
of this problem in terms of the LP-value of the original problem. Finally, we use the
111
adversarial analysis of LagrangeBwK (Lemma 14) to guarantee that our algorithm
actually collects that value.
Because of the stopping condition g?t > T0/?, there can be at most dlog? T e
phases. Therefore, rescaling the budget to B0/2dlog? T e guarantees that the al-
gorithm consumes at most B/2 of the budget. We then argue that, with high-
probability, the additional uniform exploration in each phase, consumes a budget
of at most B/2 with high-probability. Thus, the algorithm never runs out of bud-
get.
We now describe the full proof of Theorem 8, following the plan outlined in
the proof sketch. We decompose the analysis into several distinct pieces, present
them one by one, and then show how to put them together. Each piece is presented
as a lemma, with appropriate notation and intuition; some of the proofs are deferred
to later in this section. Throughout, we assume that
( )
min{OPTFD, B} > ? KT 3/4 log T . (7.36)?
This is without loss of generality, because otherwise the guarantee in Theorem 8 is
vacuous. Recall that the primal algorithm ALG1 satisfies the uniform exploration
property (7.33) with parameter ? ? T?1/4.
Extended notation. To argue about a given phase, we extend some of our notation
to refer to arbitrary time intervals, not just [1, ? ]. In what follows, fix time interval
112
[?1, ?2], and let ?? = ?2 ? ?1 + 1. Let
?
1 ?2
M[?1,?2] := Mt,??
t?=?1?2
ips 1
M ips[?1,?2] := M?? t
t=?1
be, resp., the average outcome matrix and its IPS-estimate on this time interval.
Define
OBJ([?1, ?2]) := ?? ?OPTLP(M[?1,?2], B,??), (7.37)
ips ? ipsOBJ ([?1, ?2]) := ?? OPTLP(M[?1,?2], B,??). (7.38)
When ?1 = 1 we use the short-hand OBJ(?
ips
2) and OBJ (?2). Recall that
[? ]
OPTLP := max OBJ(?). (7.39)
??[T ]
g?? := max OBJ
ips(?). (7.40)
??[T ]
Uniform exploration does not exhaust budget. The uniform exploration in
Algorithm 7 h?appens for at most ?0 T rounds in expectation, and therefore for at
most ?0 T + 3 ?0T ln(1/?) rounds with probability at least 1 ? ?.11 It does not
consume more than B/2 units of each resource, since ? = T?1/40 and B > 4T
3/4.
IPS estimators are good. We argue that, essentially, the guess g?? is close to
[? ]
OPTLP with high probability. More precisely, we prove that OBJ(?) is close to its
IPS estimator, for any given ? ? [T ]. We will denote the deviation term as
( )
K ? ?OBJ(?)
DEV(?) := C0 ? log
T , (7.41)
?B ?
11By an easy application of Chernoff-Hoeffding bounds (Lemma 29).
113
( ? )
7/4
for a sufficiently large absolute constant C0. Thus, DEV(?) ? O K T log T (:=B ?
DEV(T )).
In this notation, we characterize the IPS estimators as follows:
Lemma 15. With probability at least 1? d?T it holds that
?? ? [T ] OBJips(?)? DEV(?) ? OBJ(?) ? OBJips(?) + DEV(?). (7.42)
[? ]
If the event (7.42) holds, then g?? and OPTLP are indeed close:
? ?
? ? ?? ? [? ]?? [T ] g?? OPTLP ? ? max DEV(t). (7.43)
t?[? ]
The proof only relies on the uniform exploration property (7.33) and the defini-
tion of IPS estimators, not on anything that the algorithm does. A somewhat subtle
point is to derive concentration on the respective LP-values from concentration of
the average outcome matrices.
IPS estimators do not change too fast. We use the stopping condition in the
algorithm to argue that the IPS estimators OBJips(?) do not change too fast from
one phase to another, in some specific sense. Namely, we compare OBJips(?) in the
first round of any full phase to the guess in the last round of the next phase.
Lemma 16. Consider a full phase in the execution of the algorithm. Let ? be the
first round in this phase, and let ? ? be any round in the next phase. Then
1 ? OBJ
ips(?) ? 1 K+ . (7.44)
?2 g?? ? ? ?g?? ?
Proof. There exists a j ? {1, 2 , . . . , dlog? T e} such that OBJips(?) ? ?j. Since ?
is the beginning of a phase, this implies that OBJips(? ? 1) < ?j?1. Observe that
114
in a single time-step the value of the estimate can increase by at most K . Thus
?
OBJips(?) < ?j?1 + K . Moreover since ? ? belongs to the next phase, we have that
?
?j+1 ? g?? ? ? ?j+2. Putting these together we get that
1 ? OBJ
ips(?) ? 1 K ? 1 K+ + .
?2 g?? ? ? ??j+1 ? ?g?? ?
This Lemma relies only on the stopping condition in the algorithm, g? > T0/?,
and the way the guess g?? is expressed in terms of the IPS estimators OBJ
ips(?), as
expressed in 7.40. It is irrelevant to this Lemma how the IPS estimators OBJips(?)
are actually defined.
Last full phase offers sufficient rewards. Recall that a phase in the execution
of the algorithm is called full if the stopping condition g?t > T0/? has fired. We
focus on the last full phase; let ?start, ?end denote the first and last time-steps of this
phase. We prove there is enough reward to be collected in this phase.
Let ? ? denote the maximizer in 7.24 which we interpret as the optimal stopping
time. Essentially, we compare the LP value for the time interval [?start, ?end] with the
LP value for the time interval [1, ? ?]. The former is expressed as OBJ([?start, ?end])
[T ]
and the latter as OPTLP . Note that the time horizon T lies in the subsequent phase
(so we can apply Lemma 16).
Lemma 17. Consider a run of the algorithm such that event (7.42) holds. Then
( )
1 1 [T ]
OBJ([?start, ?end]) ? ? OPTLP ?O(DEV(T )). (7.45)? ?2
Adversarial analysis of LagrangeBwK. Let us plug in the adversarial analysis
115
of LagrangeBwK, as encapsulated in Lemma 14. We focus on the last full phase
in the execution. We interpret it as an execution of algorithm LagrangeBwK with
parameters B0, T0 on an instance of Adversarial BwK with budget B0 that starts at
round ?start of the original problem. Let g? = g??start be the guess at the first round of
the phase. Then the parameters are B0 = B/ratio and T = g?/(?
2
0 ? ratio), where
ratio = dlog? T e.
We apply Lemma 14 for round ? = ?end ? ?start + 1 in the execution of
LagrangeBwK. Restated in our notation, f(?) in Lemma 14 becomes
f(?) = OPTLP(M[?start,? ], B0, ?).end
Thus, we obtain that with probability at least 1? ? we have
??end ( )
? g? dg?REW rt(at) ? min , ?f(?)? ? reg(T ), (7.46)dlog? T e?2 dlog T e?2t=? ?start ( )
where the regret term is reg(T ) := (1 + T ) R (T ) +R
B 1, ?/T 2, ?/T
(T ) .
Rescaling the budget. Since we use rescaled budget B0, it is important to con-
nect the corresponding LP-values to those for the original budget B. We have the
following general fact (observed in Agrawal and Devanur [11]): for any outcome
matrix M, budget B, time horizon T , and rescaling factor ? ? (0, 1] it holds that
OPTLP (M, ?B, T ) ? ? ?OPTLP (M, B, T ) . (7.47)
This holds because an optimal solution ? to LPM,B,T , the vector ? ? is feasible to
LPM,?B,T .
Putting it all together. We defer the proofs of Lemma 15 and Lemma 17 to the
116
following subsections, and show how to complete the proof of Theorem 8 assuming
these Lemmas hold.
Proof of Theorem 8. Throughout this proof, let us condition on the high-probability
events in Lemma 15 and 7.46. Moreover from 7.47 we have that ?f(?) ? 1 OBJ([?start, ?ratio end])
since B = B0 . Thus combining this with 7.46 we get,ratio
??end ( )
? 1 g? dg?REW = rt(at) min , OBJ([?start, ?end])? ? reg(T ). (7.48)
ratio ?2 ?2
t=?start
Further from Lemma 17 with ? = d+ 1, we can re-write 7.48 as
( ( ) )
? 1 g? 1 ? 1 [T ]? g?REW min , OPTLP ? reg(T ). (7.49)ratio ?2 ? ?2 ?
Recall that g? for the phase [?start, ?end] is OBJ
ips(?start). Moreover we have that
T lies in the phase immediately after [?start, ?end].
Thus from 7.44 (first inequality below) and 7.43 (second inequality below) we
have,
g? ? 1 K ? 1 [T ]g?T + OPTLP +O(DEV(T )). (7.50)? ? ?
Likewise we have,
? 1 1 [T ]g? OBJips(?start) ? g?T ? OPTLP ?O(DEV(T )), (7.51)?2 ?2
where the second inequality uses 7.44 and the last inequality uses 7.42. Plugging
7.50 and 7.51 back into 7.49 we get,
117
( )
[T ] ( )
1 OPT
REW ? min LP 1 ? 2 [T ], OPT
ratio ?4 ? ?2 LP
? reg(T )?O(DEV(T )). (7.52)
[T ]
And from 7.24 we have OPTLP ? OPTFD. Plugging this into 7.52 we get
7.35.
7.7.1 Proof of Lemma 17 (last full phase offers sufficient rewards)
We first prove the following property of the optimal solution.
Claim 1. Let T1 < T2 be any two time-steps. Then,
OBJ([T1, T2]) ? OBJ(T2)? OBJ(T1 ? 1). (7.53)
Proof. We prove this claim as follows. Let ?T2 denote the optimal solution to
LPM ,B,T . Since ct,i(a) ? 0 for every a ? [K], t ? [T ] and i ? [d] we haveT2 2
? ?
?T2 ? ct,i ? B =? ?T2 ? ct,i ? B.
t?[T2] t?[T1?1] ?
Thus ?T2 is feasible to LPM ,B,T ?1. This implies that ? ? r ?T ?1 1 t?[T1?1] T2 t1
OBJ(T1 ? 1). Likewise ?T2 is also feasible to LPM .[T ,T ],B,[T1,T1 2 2]
Therefore we have,
118
?T2
OBJ([T1, T2]) ? ?T2 ? rt
t?=T1 ?
= ?T2 ? rt ? ?T2 ? rt
t?[T2] ?t?[T1?1]
= OBJ(T2)? ?T2 ? rt
t?[T1?1]
? OBJ(T2)? OBJ(T1 ? 1)
Consider OBJ([?start, ?end]). From 7.53 we have
OBJ([?start, ?end]) ? OBJ(?end)? OBJ(?start ? 1). (7.54)
From Lemma 15 we can re-write 7.54 as
OBJ(?end)? OBJ(?start ? 1) ? OBJips(?end)? OBJips(?start ? 1)?O(DEV(T )). (7.55)
For some j ? dlog?(T )e, at time-step ?end the value g?t exceeds ?j for the first
time. Likewise g?t exceeds ?
j?1 for the first time at ?start and is smaller than this
value at ? ? 1. This implies that OBJips(? ) ? ?j and OBJips(? ? 1) < ?j?1start end start .
Therefore 7.55 simplifies to
OBJips(? )? OBJipsend (? j j?1start ? 1)?O(DEV(T )) ? ? ? ? ?O(DEV(T )). (7.56)
Combining 7.56, 7.55 and 7.54 we get
OBJ([?start, ?end]) ? ?j ? ?j?1 ?O(DEV(T )).
119
Dividing throughout by g?T and noting that g?T ? ?j+1 we have,
OBJ([? j?1start, ?end]) ? ? (?? 1) ? O(DEV(T ))
g? ( ?j+1T ) g?T
1
= ? 1 ? O(DEV(T ))
? ?2 g?T
Finally applying Lemma 15 to g?T we get 7.45. This completes the proof of
Lemma 17.
7.7.2 Proof of Lemma 15 (IPS estimators are good)
Recall that for every t ? [T ] and a ? [K] we have that pt(a), the probability
that arm a is chosen at time t is at least ? . We now prove Lemma 2 which relates
K
linear sums of rewards and consumptions?computed using the unbiased estimates
and the true values. Denote R?,?(?) :=
K 2? ln(T/?).
?
Claim 2. Let ? > 0, ? > 0 used by the EXP3.P(?) be given parameters. Then we
have the following sta?tements for a?ny fixed z ? ??K.??? ?? ?
Pr??? ? [T ]
? ?
? z ? [r? ? r ]?t t ? > R ??,?(?) ? ? (7.57)
???t??[? ]
?
? ??? ?
?i ? [d] Pr??? ? [T ] ??? z ? [c?t,i ? ct,i(a)]??? > R ??,?(?) ? ? (7.58)t?[? ]
Proof. The proof of this follows directly from the invocation of the Azuma-Hoeffding
inequality. We will show this for Equation (7.57). Define Yt := z ? [r?t ? rt] (like-wise
for the lower-tail use Yt := z ? (rt? r?t)). Note that this forms a martingale difference
sequence since E[z ?(r?t?rt) | Ht?1] = z ? [rt ? rt] = 0. Here we used the fact that z is
not random and fixed before the start of the algorithm. Also we have that |Yt| ? K .?
120
Using Lemma 28 and taking a union bound over all ? ? [T ] we have the desired
equation.
We will now prove the two inequalities in 7.42. We will first prove the first
inequality in 7.42. ( ( ) )
? R (?)Let ? denote the optima(l so?lution to)OPT M ?,?LP ? , B 1? , ? . NoteB
this is valid whenever B > ? K ? log T . From Equation (7.58) we have that
? ?
with probability at least 1? ? for every i ? [d],
? ?
?? ? c?t,i ? ?? ? ct,i +R?,?(?).
t?[? ] t?[? ]
? B (7.59)
?
7.59 used the fact that ?t?[? ] ? ? ct,i ? B (1?R?,?(?)).
Using Equation (7.57), we have that with probability at least 1? ?,
? ?
?? ? r ? ??t ? r?t +R?,?(?).
t?[? ] t?[? ]
Using the fact that,
? ( ( ) )
? ? R?,?(?)? rt = OPTLP M? , B 1? , ? ,
B
t?[? ]
we have the following.
( ( ) )
? R
?
?,?(?)
OPTLP M? , B 1 , ? ?R?,?(?) ? ?? ? r?t. (7.60)
B
t?[? ]
From 7.59 we have that ?? is feasible to OPTipsLP (?) and from 7.60 this implies
that
121
( ( ) )
ips ? R?,?(?)OBJ (?) OPTLP M? , B 1? , ? ?R?,?(?). (7.61)
B
Finally fro(m 7.47 w(e have ) ) ( )
? R?,?(?) ? ? R?,?(?)OPTLP M? , B 1 , ? 1 OBJ (?) . (7.62)
B B
From 7.61 and 7.62 we have ( )
OBJ(?)
OBJips(?) ? OBJ(?)?R? ?,?(?) 1?+( ? ,? B ) ?
KOBJ(?) T
=O ? log
?B ?
which gives the lower-tail in 7.42.
We will now prove the second in(equality(in 7.42 in)a si)milar fashion. Let ???
denote the optimal solution to OBJips ? R?,?(?)M? , B 1 , ? .B
From Equation (7.58) we have that with probability at least 1 ? ? for every
i ? [d],
? ?
??? ? ct,i ? ??? ? c?t,i +R?,?(?).
t?[? ] t?[? ]
? B (7.63)
?
7.63 used the fact that ?t?[? ] ?? ? c?t,i ? B (1?R?,?(?)).
Using Equation (7.57), we have that with probability at least 1? ?,
? ?
??? ? r?t ? ??? ? rt +R?,?(?).
t?[? ] t?[? ]
From the fact that
? ( ( ) )
??? ? R?,?(?)r?t = OPTipsLP M? , B 1? , ? ,B
t?[? ]
122
we get the following.
( ( ) )
ips ? R?,?(?)
?
OPTLP M? , B 1 , ? ?R?,?(?) ? ??? ? rt. (7.64)B
t?[? ]
From 7.63 we have that ???j is feasible to OPTLP (?) and from 7.64 this implies
that
( ( ) )
ips R?,?(?)OPTLP M? , B 1? , ? ? OBJ (?) +R?,?(?). (7.65)B
Finally from 7.47 we have
( ( ) ) ( )
ips R?,?(?) R?,?(?)OPTLP M? , B 1? , ? ? 1? OBJips (?) . (7.66)B B
Combining 7.66 and 7.65 we get,
ips ? R?,?(?)OBJ (?) OBJ(?) + (OBJ(?) +R?,?(?)) . (7.67)
B ?R?,?(?)
Since B0 > 2R?,?(?) we get,
OBJips ? 2R?,?(?)(?) OBJ(?) + ? (OBJ?(??) +R?,?(?)?),B ( ? )
KOBJ(?) T
=O ? log
?B ?
and thus we get the upper-tail in 7.42.
We will now prove 7.43. Recall that g? ips? := maxt?[? ] OBJ (t). Moreover
[? ]
OPTLP = maxt?[? ] OBJ(t).
Consider g?? ? [? ]OPTLP . We have,
123
? [? ] ips ? [? ]g?? OPTLP = max OBJ (t) OPTLP
t?[? ]
? [? ]max(OBJ(t) + DEV(t))?OPT
t? LP[? ]
? [? ]max OBJ(t) + max DEV(t)?OPT
? LPt [? ] t?[? ]
= max DEV(t).
t?[? ]
[? ]
Now consider OPTLP ?g?? . We have,
[? ] [? ]
OPTLP ?g?? ? OPTLP ?max(OBJ(t)? DEV(t))
t?[? ]
? [? ]OPTLP ?max OBJ(t) + max DEV(t)
t?[? ] t?[? ]
= max DEV(t).
t?[? ]
This completes the proof of Lemma 15.
7.8 Lower bounds
We prove the lower bounds on the competitive ratio that we have claimed
in sub-section 7.3.1: the ?(log T ) lower bound w.r.t. the best fixed distribution
benchmark (OPTFD), the ?(T ) lower bound w.r.t. the best dynamic policy bench-
mark (OPTDP), and the ?(K) lower bound w.r.t. the best fixed arm benchmark
(OPTFA). As a warm-up, we analyze the simple example from sub-section 7.3.1 that
leads to the 5 lower bound w.r.t. OPTFD. All lower-bounds are for a randomized4
algorithm against an oblivious adversary. We summarize all these lower bounds in
the following theorem:
124
Theorem 9. Consider Adversarial BwK with a single resource (d = 1) and K
arms. Consider any randomized algorithm for this problem, and let REW denote
its reward. Then:
(a) OPTFD /E[REW] ? 5 ? o(1) for some problem instance (from the example in4
the Introduction).
(b) OPTFD /E[REW] ? ?(log T ) for some problem instance.
(c) OPTDP /E[REW] ? T/B2 for some problem instance, for any given budget B.
(d) OPTFA /E[REW] ? ?(K) for some problem instance.
Remark 10. The lower bounds for parts (a,b,c) hold (even) for problem instances
with K = 2 arms. The lower bounds in parts (a,b) hold even for a much more
permissive feedback model from the online packing literature, namely, when the al-
gorithm observes the outcome vector for all actions in a given round, and moreover
does it before it chooses an arm in this round.
We tweak our construction from Theorem 9(c) to obtain a strong lower bound
for the contextual version of Adversarial BwK (a.k.a. Adversarial cBwK ), as studied
in Section 7.9.3. This lower bound implies that Adversarial cBwK is essentially
?
hopeless in the regime B < T , complementing a strong positive result (Corollary 3)
?
for the regime B > ??( T ). It is proved in Section 7.8.3, along with Theorem 9(c).
Theorem 10. Consider adversarial contextual bandits with knapsacks (Adversarial
cBwK), with policy class ?, a single resource (d = 1), K = 2 arms, and any given
125
?
budget B < T . Consider any randomized algorithm for this problem, and let REW
denote its reward. Then
OPT 2FD(?)/E[REW] ? T/B for some problem instance.
Notation. In the proof of lower-bounds below, we use the following notation.
Given an instance I, we denote OPTFD(I), OPTFA(I) and OPTDP(I) to denote the
optimal value of the best fixed distribution, best fixed arm and best dynamic policy
respectively, for instance I [T ]. Likewise let OPTLP (I) denote the optimal LP value for
instance I and given an algorithm A and an instance I, let E[REW(A, I)] denote
the expected reward obtained by A on instance I, where the expectation is over the
internal randomness of the algorithm.
7.8.1 Warm-up: example from the Introduction
As a warm-up, let us recap and analyze the example from the Introduction.
Construction 2. There are two arms and one resource with budget B = T . Arm
2
1 has zero rewards and zero consumption. Arm 2 has consumption 1 in each round,
and offers reward 1 in each round of the first half-time (T rounds). In the second
2 2
half-time, arm 1 offers either reward 1 in all rounds, or reward 0 in all rounds.
More formally, there are two problem instances, call them I1 and I2, that coincide
for the first half-time and differ in the second half-time.
Lemma 18. Any algorithm suffers OPTFD /E[REW] ? 5 ? o(1) on some instances4
in Construction 2.
126
The intuition is that given a random instance as input the algorithm needs to
choose how much budget to invest in the first half-time, without knowing what comes
in the second, and any choice (in expectation) leads to the claimed competitive ratio.
To prove Lemma 18 (as well we the main lower bound in Theorem 9(b)) we
[T ]
compare algorithm?s performance t(o OPTLP , and inv)oke the following lemma:?
[T ] [T ]
Lemma 19. OPTFD ? OPTLP ?O OPT ? log dTLP .B
[T ]
Proof. Let ? ? denote the time-step at which OPTLP is maximiz?ed. Let p denote
the optimal solution to ? ? ?OPTLP(M?? , B(1? ), ? ?) where  = log dT . Note thatB
OPTFD is at least as large as the expected total reward obtained by the distribution
p. From the Chernoff-Hoeffding bounds (Lemma 29), with probability at least
1? dT?2 we have ?
?i ? [d] p ? ct,i ? B.
t?[??]
Conditioning on this event the expected total reward obtained by p is
?
p ? rt = ? ? ?OPTLP(M?? , B(1? ), ? ?).
t?[??]
Thus the expected total reward obtained by p is at least ? ? ? OPTLP(M?? , B(1 ?
), ? ?). 12 Moreover from Eq. (7.47) we have that
OPTFD ? ? ? ?OPTLP(M ??? , B(1? ), ? )
? (1? )? ? ?O(PTLP(M ???? , B(1?)), ? )
? [T ]? [T ] log dTOPTLP O OPTLP .B
12With probability T?2 we assume that p has an expected reward of 0.
127
Proof of Lemma 18. Denote the two arms by A1 and A0 where A0 denotes the null
arm. The consumption for arm A1 is 1 for all rounds in both I1 and I2. Thus
the only difference between the two instances is the rewards obtained for playing
arm A1 in each round. The instances have two phases where each phase lasts for
T
2
rounds. In phase 1, in both I1 and I2 playing arm A1 fetches a reward 1 . In the2
second phase, in I1, the reward for playing arm A1 is 0 while in I2 the reward for
playing arm A1 is 1. Thus the outcome matrix for the first
T time-steps is the same
2
in instances I1 and I2.
Consider a randomized algorithm A. Let ?1 be the expected number of times
arm A1 is played by A in the first T rounds on instances I2 1 and I2. Note since the
outcome matrix is same, the expected number of times the arm is played should be
same in both the instances. Let ?2,1, ?2,2 denote the expected number of times arm
A1 is played in the second phase in instances I1 and I2 respectively.
Recall that in this section we are interested in a lower-bound on the com-
[T ]
petitive ratio OPTFD /E[REW] for every instance. Consider OPTLP (I1), the opti-
mal value of the b(est fixed d)istribution on I1. Using Eq. (7.24) with ? = T this2
equals T ? [T ]O( PTLP M
T
)T , B, which evaluates to
T . Likewise OPTLP (I2) equals2 2 4
2
T ? OPTLP MT , B, T , which evaluates to 3T . Consider the performance of A on8
I1. We have,
[T ]
OPTLP (I )
( ) ( )
1 T ?1
E A I ? / . (7.68)[REW( , 1)] 4 2
Likewise on I2 we have,
[T ]
OPT ( ) ( )LP (I2) ? 3T / ?1E A I + ?[REW( , )] 8 2 2,2 . (7.69)2
128
Thus the competitive ratio of A is at least the maximum of the ratios in
Eq. (7.68) and Eq. (7.69). Thus we want to minimize this maximum and is achieved
when the two ratios are equal to each other.
Notice that the term ?2,1 does not appear in Eq. (7.68) and Eq. (7.69). By
setting the term in Eq. (7.68) equal to the term in Eq. (7.69) and re-arranging,
?1 = 4?2,2. (7.70)
Moreover we have ?1 + ?2,2 ? B. Combining this with Eq(. ()7.7(0) w)e get
? ? 4B = 2T and the corresponding competitive ratio is at least T / ?11 ? 5 .5 5 4 2 4
By Lemma 19 with d = 1, for every j ? [2], ( ? )
[T ]
I A I ? 5 ? OPTOPT ( )/E[REW( , )] O LP log TFD j j .
4 T B
7.8.2 The main lower bound: proof of Theorem 9(b)
To obtain the ?(log T ) lower bound in Theorem 9(b), we extend Construction 2
to one with ?(log T ) phases rather than just two. As before, the algorithm needs
to decide how much budget to save for the subsequent phases; without knowing
whether they would bring higher rewards or nothing. The construction is as follows,
see Figure 7.3 for a pictorial representation:
Construction 3. There is one resource with budget B, and two arms, denoted
A0, A1. Arm A0 is the ?null arm? that has zero reward and zero consumption. The
consumption of arm A1 is 1 in all rounds. The rewards of A1 are defined as follows.
We partition the time into T phases of duration B each (for simplicity, assume that
B
129
[ ]
B divides T ). We consider T problem instances; for each instance I , ? ? T? armB B
A1 has positive rewards up to and including phase ? ; after that all rewards are 0. In
each phase ? ? [? ], arm A1 has reward ?/T in each round.
1 ? ? + 1
Arm 1 ? 2? . . . ?? 0 . . . Consumption=1
Null Arm 0 0 . . . 0 0 . . . Consumption=0
Time, in TB phases of B rounds each
Figure 7.3: The lower-bounding construction for the ?(log T ) lower bound.
The lower bound holds for a wide range of budgets B, as expressed by the
following lemma:
Lemma 20. Fix an absolute constant ? ? (0, 1), and budget B in the interval
[?(log3 T ), O(T 1??)]. Then any algorithm suffers OPTFD /E[REW] ? ?(log T ) for
some problem instance in Construction 3.
In the rest of this subsection we prove Lemma 20. Fix any randomized algo-
rithmA. As before in this sub-section we are interested in the ratio OPTFD /E[REW(A)].
We argue that it has the claimed competitive ratio on at least one instance I? in the
construction 3. The proof proceeds in two parts. We first argue about the solution
structure of the optimal distribution for the construction 3 (we prove this formally
in Lemma 21). Next we characterize the expected number of times arm A1 is played
if A optimal algorithm in each of the phases. Combining the two we get Lemma 20.
130
Structure of the optimal solution. Define OPTLP(M?? , B, ?
?) to be the optimal
value of LP 7.8 on the instance I? . Then we have the following Lemma.
Lemma 21. For a given instance I? we have the following.
1. The optimal stopping time ? ? = B? .
?
2. OPTLP(M
? 1
?? , B, ? ) ? ? t?[B? ]P(t).
Proof. Note that (2) follows from (1). Indeed, given that the stopping time is B? ,
the optimal?solution is to set X(1) =
1 and X(0) = 1 ? 1 thus obtaining a total
? ?
reward of 1 t?[B? ]P(t). Thus it remains to prove (1).?
First it is easy to prove that ? ? ? B? . Since there are no rewards after
time-step ? ?, we have
? ?
?t? 1 1> 0 OPTLP(M??+t? , B, ? ? + t?) = ? P(t) < P(t).? + t ?
t?[??] t?[B? ]
Now we will argue that the optimal stopping time cannot be strictly lesser than
? ?. To do so, first we argue that for two stopping times t1 < t2 within the same
phase, the maximum objective is achieved for the stopping time t2. This implies
that the optimal stopping time has to be the last time step of some phase.
Consider times t1 < t2 such that P(t1) = P(t2) = ? . Then we want to claim
that
??? ?B ? ?B ?? ?P(t) ? P(t)? .
t1 t
t? 2[t1] t?[t2]
For contradiction assume the inequality does not hold. Then we have the following.
? ? ?
P t
?
1
(t) > ? P(t)? .
t2
t?[t1] t?[t2]
131
? ?
Note that t?[B(??1)]P(t) = Bt? =
B(??1)?
t??[? ] . Thus we have2
?
P B(? ? 1)?(t) = + (t1 ?B(? ? 1))?,
2
t??[t1]
P B(? ? 1)?(t) = + (t2 ?B(? ? 1))?.
2
t?[t2]
Therefore we have,
B(? ? 1)?
+ (t1 ?B(? ?
t1B(? ? 1)?
1))? > + (t2 ?B(? ? 1))?.
2 2t2
Further re-arranging, B(??1) > t2. This is a contradiction since t2 is in phase2
? , so t2 ? B(? ? 1).
Next we argue that the optimal value is achieved when the stopping time is
in the last non-zero rewards phase. Consider two phases ?1 < ?2. Thus the ending
times are B?1 and B?2. To prove that the optimal value increases by stopping at
B?2, as opposed to B?1, we want to show that
1 ? 1 ?
Bt ? Bt.
?1 ?? 2t [?1] t?[?2]
As before assume for a contradiction that this doesn?t hold. Then re-arranging
we get, ?1(?1+1) > ?1(?2+1) , which implies ?1 > ?2, contradiction. We conlude that2 2
the stopping time is ? ? = B? .
Expected behavior of the optimal algorithm. Consider any randomized algo-
rithm A. The performance of A is then as follows.
[T ]
OPTLP
E ?
B? ?OPTLP(MB? , B,B?)
max . (7.71)
[REW(A)] 1???T/B E[REW(A)]
132
Consider two consecutive instances I? and I?+1. The outcome matrix in the
phases 1, 2 , . . . , ? look identical in both these instances. Thus the expected number
of times arm A1 is chosen by algorithm A in phases 1, 2 , . . . , ? is identical. Let
?? denote the expected number of times A plays arm A1 in phase ? on instances
I1, I2, , . . . , I? . The expected number of times arm A1 is played in phase ? on
instances I?+1, I?+2 , . . . , IT/B is set arbitrarily since this term does not appear
in the objective function (i.e., E[REW(A)]). Thus the maximum of the ratios in
Eq. (7.71) occurs when they are equal for all values ? . Therefore we want for every
1 ? ? ? T/B,
?? ?1 1? j?[? ] Bj ?+?1 j?[?+1] Bj= . (7.72)
j?[? ] j?j j?[?+1] j?j
Also note that ?1 + ?2 + . . .+ ?T/B ? B, since the total budget is B and the
consumption is 1 in every round for arm A1 and 0 for arm A0. We will prove that
these lead to the following recurrence for the maximizing values of ?j.
? ? 1j 2 ?j = ?1. (7.73)
2j
We will prove the recurrence Eq. (7.73) via induction. The base case is when
j = 2. Plugging in ? = 1 in Eq. (7.72) we have
1 3/2
= .
?1 ?1 + 2?2
This implies that ? 12 = ?1. Therefore we are done. Now consider the inductive4
case; let all ? up to ?? satisfy the recurrence Eq. (7.73). Consider the instance I?
and I?+1. From Eq.?(7.72) we have, ?
1 1
? ?j?[? ] jB ??+1 j?[?+1] jB= .
?1 +
? ?
j=2 ?1/2 ?1 + j=2 ?1/2 + (? + 1)??+1
133
Rearranging, we get that ??+1 =
1 ?1. This completes the inductive step.2(?+1)
Using the fact that ?1(1 + 1/4 + 1/6 + . . . + B/2T ) ? B we have that ?1 ?
B where H(n) denotes the nth Harmonic number. The val(ue)of Eq.((7.71))for2H(T/B)
? = 1 is the same at all other ? and has a value of B ? 2H T ? ? log T ?
?1 B B
? (log T ). The second last inequality uses the well-known fact that ? + log n ?
H(n) ? ? + lo[g(n + 1) for an abso]lute constant ? and the last inequality uses the
fact that B ? ?(log3 T ), O(T 1??) . [ ]
Therefore we have that for every instance j ? T ,
B
[T ]
OPTLP (Ij) ? ?(log T ).
E[REW(A, Ij)]
Thus combining this with Lemma 19 we obta(in )
OPT 1.5FD ? log T?(log T )?O ? .
( )E[REW(A)] B
Since B ? ? log3 T , we get that OPTFDE[REW(A ? ?(log T ).)]
7.8.3 Best dynamic policy: proof of Theorem 9(c)
Consider the following construction of the lower-bound example.
Construction 4. There is one resource with budget B, and two arms, denoted
A0, A1. Arm A0 is the ?null arm? that has zero reward and zero consumption. The
consumption of arm A1 is 1 in all rounds. The rewards of A1 are defined as follows.
We partition the time into T phases of duration B each (for simplicity, assume that
B
B divides T ). We consider T problem instances; for each instance I? , ? ? [T/B]B
arm A1 has 0 reward in all phases except phase ? ; in phase ? it has a reward of 1
in each round.
134
Lemma 22. Consider Construction 4 with any given time horizon T ? 2 and budget
?
B ? T . Let ALG be an arbitrary randomized algorithm for BwK. Then for one of
the problem instances,
OPTDP /E[REW] ? T/B2. (7.74)
Proof. Let n = T be the number of phases in Construction 4. Let ALG be a
B
deterministic algorithm. Let REW denote its total reward, and let E? [?] denote
the expectation over the uniform-at-random choice of the problem instance I? . We
claim that
OPTDP /E? [REW] ? T/B2. (7.75)
Assume that ALG maximizes E? [REW] (over deterministic algorithms). Then it
satisfies the following:
? Within each phase, if ALG ever chooses to play arm A1, it does so in the first
round of the phase. If it receives a reward of 1 in this round, it plays A1 for
the rest of the phase. Else, it never plays A1 for the rest of this phase.
For each ? ? [n], let ?? denote the number of times ALG chooses arm
A1 in phase ? in problem instance I? . The expected reward of A?LG over the
uniform-at-random choice of the problem instance I? is E[REW] = 1 i?[n] ?i. Letn
(??(1), ??(2) , . . . , ??(k)) be the subsequence of (?1 , . . . , ?n) which contains all
elements with non-zero values.
The key observation is as follows. The problem instances I?(??1) and I?(?) are
identical until phase ?(? ?1)?1. Since the feedback received by ALG until the first
135
time it chooses armA1 in phase ?(??1) is identical, it follows that ??(??1)???(?) = 1.
Therefore, ? ?
?i = ??(i) = k ? ??(1) ?
k(k ? 1)
.
2
i?[n] i?[k]
Noting that ?1 ? B and k ? min(B, n) = B , we have:
?
E[REW] ? 1 ? 2 3
n i
< B /n = B /T.
i?[n]
Since OPTDP = B for every problem instance I? , Eq. (7.75) holds for ALG, and
therefore for any other deterministic algorithm. By Yao?s lemma (Motwani and
Raghavan [148]), for every randomized algorithm ALG there exists a problem in-
stance I? such that (7.74) holds.
We now use the same construction to prove Theorem 10.
?
Proof sketch of Theorem 10. We prove the Theorem by contradiction. Let B ? T .
For contradiction, consider an algorithm ALG for cBwK on a policy set ? such
that OPTFD(?)/E[REW(ALG)] < T/B2. We will now use ALG to construct an
algorithm A for the Construction 4 such that OPTDP /E[REW(A)] < T/B2 for
every instance. This contradicts Lemma 22.
Consider a policy set ? with |n| policies. Every policy ? ? ? maps contexts
in the range [1, T ] to the action set {A1, A0}. In particular, a policy ?? ? ? maps
contexts that lie in the range [B ? (? ?1)+1, B ? ? ] to arm A1 and all other contexts
to A0. A invokes ALG as a sub-routine with the policy set ?. At each time-step t,
A gives the context xt = t to ALG and plays the arm chosen by ALG.
For an instance I? in Construction 4, OPTFD(?) is the total reward obtained
by choosing the action given by ?? in all time-steps. The total reward obtained is B,
136
which equals OPTDP(I? ). Therefore, OPTFD(?)/E[REW(ALG)] < T/B2 implies we
have OPT 2DP /E[REW(A)] < T/B for every instance I? , which is a contradiction.
7.8.4 Best fixed arm: proof of Theorem 9(d)
We use the following construction for the lower-bound.
Construction 5. There is one resource with budget B, and K arms denoted by
A1, A2 , . . . , AK. Arm AK is the ?null arm? that has zero reward and zero con-
sumption. There are K instances in the family. In instance Ij, all arms Aj? where
j? > j have 0 reward and 0 consumption in all time-steps. Consider an instance Ij
for some j ? [K ? 1] and an arm j? ? j. Arm Aj? has a reward of 1K?j? and con-K
sumption of 1 in all time-steps in phase j? and has a reward of 0 and consumption
of 0 in every other time-step. Thus the rewards and consumption are bounded in the
interval [0, 1] for every arm and every time-step in all instances in this family.
Lemma 23. Let T ? 2, 2 ? B ? T , K ? 3 be given parameters of the AdversarialBwK
problem. We show that there exists a family of instances with d = 1 shared resource
such that for every randomized algorithm A we have OPTFAE A is at least ?(K) on[REW( )]
one of these instances.
Proof. First note that the best fixed arm in instance Ij is to pick arm Aj which
yields a total reward of B
KK?j
.
Consider a randomized algorithm A. Observe that in the first j phases, the
instances Ij?1 and Ij have identical outcome matrices. Thus the expected number
137
of times any arm Ak for k ? [K] is chosen in phases {1, 2 , . . . , j} should be the
same in both the instances. Let ?k denote the expected number of times arm k is
played by A in phase k on instances Ik, Ik+1 , . . . , I 13K?1 . Moreover we have that
?1 + ?2 , . . . , ?K?1 ? B.
To show the lower-bound we want to minimize the competitive ratio on every
instance for all possible values of ?1, ?2 , . . . , ?K?1. For ease of no?tation denote
r := 1j K?j . Let ?B denote the set of values to {?k}k?[K?1] such thatK k?[K?1] ?k ?
B. Thus,
OPTFA
E A ? ? rkBmin . (7.76)[REW( )] ?B j?[k] rj?j
The ratio is minimized when all ratios in Eq. (7.76) are equal. We will show
via induction that this yields the following(recurrenc)e,
? ? ? rk?1k 2 ?k = 1 ?1. (7.77)
?rk
Combining this with the condition that k?[K?1] ?k ? B, this yields the
condition ?1 ? B 1 . Moreover the minimizing value in Eq. (7.76) is K ? 1 which
K? K
K
proves Lemma 23.
We will now prove the recurrence Eq. (7.77). Consider the base case with
k = 2. Equalizing the first two terms in Eq. (7.76) we get
r1B r2B
= .
r1?1 r(1?1 + r2)?2
Re-arranging we obtain that ?2 = 1? r1 ?1. We will now prove the induc-r2
tive case. Let the recurrence be true for all 1 ? k ? k?. Consider the case k = k?+1.
13This has to be the same in all instances since the outcome matrix is identical until phase k in
all these instances
138
Setting the k? and k? + 1 ratios in Eq. (7.76) equal, we obtain
? rk?B ? rk?+1B= . (7.78)
j?[k?] rj?j j?[k?+1] rj?j ( )
r
Moreover from the inductive hypothesis we have ?j = 1? j?1 ?r 1 for everyj
j ? k?. Thus we have
?
rj?j = rk??1
?j?[k?]
rj?j = rk??1 + rk?+1?k?+1.
j?[k?+1]
Plugging this back in Eq. (7.78) we get
rk?B rk?+1B
= .
rk??1 ( rk??1 +)rk?+1?k?+1
? r
?
Rearranging we get ? ? = 1 kk +1 ?1. This completes the induction.rk?+1
7.9 Extensions
We obtain several extensions which highlight the modularity of LagrangeBwK:
we apply Theorem 6 and Theorem 7 with appropriately chosen primal algorithm
ALG1, and immediately obtain strong performance guarantees.
14 We tackle four
well-known scenarios:
? full feedback (e.g., Arora et al. [21], Freund and Schapire [88], Littlestone and
Warmuth [130]): in each round, the algorithm chooses an action and observes
14For these theorems to hold, ALG1 needs to satisfy regret bound (7.4) only against adaptive
adversaries that arise in the repeated Lagrange game in the corresponding extension, not against
arbitrary adaptive adversaries.
139
the outcomes of all possible actions; this is a classic scenario in online machine
learning.
? combinatorial semi-bandits (e.g., Audibert et al. [23], Gyo?rgy et al. [101], Kale
et al. [113]): actions are feasible subsets of ?atoms?. The atoms in the chosen
action have individual outcomes that are observed and add up to the action?s
total outcome. Typical motivating example are subsets/lists of news articles,
ads, or web search results.
? contextual bandits with policy sets (e.g., Agarwal et al. [5], Dud??k et al. [80],
Langford and Zhang [127]): before each round, a context is observed, and the
algorithm competes against the best policy (mapping from context to actions)
in a given policy class. In a typical application scenario, the context includes
known features of the current user.
? bandit convex optimization (starting from Flaxman et al. [86], Kleinberg [120],
with recent advances Bubeck et al. [44, 45], Hazan and Levy [105]). Here
the set of actions is a convex set X ? RK . For each round t, the adversary
chooses a concave function ft : X ? [0, 1] such that the reward for chosen
action x ? X is ft(x).
Formalities. To simplify the statements, we make the following assumptions with-
out further mention:
? The dual algorithm, ALG2, is always Hedge, with the associated regret bound
from 7.6. For high-probability regret bounds, ? = 1 is a fixed and known
T
140
failure probability parameter.
? For Stochastic BwK, one resource is the dummy resource (with consumption
B for each arm). Algorithm LagrangeBwK is run with parameters B0 = B andT
T0 = T .
? For Adversarial BwK, one of the arms is a null arm that has zero reward and
zero resource consumption. Algorithm 6 is run with scale parameter ? ? d+1
and range [gmin, gmax] = [1, T ].
A typical corollary. All our corollaries have the following shape, for some regret
term reg:
(C1) In the stochastic version, algorithm LagrangeBwK achieves, with probability
at least 1? 1 ,
T
( )
OPTDP?REW ? O T ? reg .B
(C2) In the adversarial version, Algorithm 6 achi(eves )
1 1
E[REW] 1?O(reg) +OPTFD ?B? .
OPT ?2FD dlog? T e
Corollaries similar to (C2) can be achieved for Algorithm 7, too; we omit them for
ease of exposition.
7.9.1 BwK with full feedback
In the full-feedback version of BwK, the entire outcome matrix Mt is revealed
to the algorithm after each round t. Accordingly, we can use Hedge as the primal
141
algorithm ALG1. The effect is, essentially, that the dependence on K, the number
?
of arms, in the regret term becomes logarithmic rather than K.
Corollary 1. Consider BwK with full feedback. Using Hedge as t?he primal algo-
rithm, we obtain corollaries (C1) and (C2) with regret term reg = T ln (dKT ).
Adversarial BwK with full feedback have not been studied before. For the
stochastic version, the regret bound is unsurprising: one expects to obtain a similar
improvement with each of the three other algorithms in the prior work on Stochastic
BwK by tracing the ?confidence terms? through the analysis. The significance here
is that we obtain this result as an immediate corollary.
7.9.2 Combinatorial Semi-Bandits with Knapsacks
Following Sankararaman and Slivkins [163], we consider Combinatorial Semi-
BwK, a common generalization of BwK and combinatorial semi-bandits (e.g., Au-
dibert et al. [23], Gyo?rgy et al. [101], Kale et al. [113]). In this problem, actions
correspond to subsets of some finite ground set ? of size n, whose elements are
called atoms. There is a fixed family F ? 2? of feasible actions. For each round
t, there is an outcome vector o ? [0, 1 ]d+1t,e for each round atom e ? ?, with then
same semantics as the actions? outcome vectors. If an action S ? ? is chosen, the
outcome vectors ot,?e are observed for all atoms e ? S, and the action?s outcome is
the sum M (S) = dt e?S ot,e ? [0, 1] . In the adversarial case, all outcome vectors
ot,e, t ? [T ], e ? ? are chosen by an adversary arbitrarily before round 1. In the
stochastic case, the atomic outcome matrix (ot,e : e ? ?) is drawn independently in
142
each round t from some fixed distribution. Combinatorial semi-bandits, as studied
previously, is a special case with no resource constraints (d = 0).
Typical motivating examples involve ad/content personalization scenarios. Atoms
can correspond to items news articles, ads, or web search results, and actions are
subsets that satisfy some constraints on quantity, relevance, or diversity of items.
One can also model ranked lists of atoms: then atoms are rank-item pairs, and each
feasible action S ? ? satisfies a constraint that each rank between 1 and |S| is
present in exactly one chosen rank-item pair.
A naive?application of our main results suffers from regret terms that are pro-
portional to |F|, which may be exponential in the number of atoms n. Instead,
the work on combinatorial semi-bandits features regret bounds that scale polynomi-
ally in n. This is what we achieve, too. We use an algorithm from Neu and Barto?k
[152] which solves combinatorial semi-bandits in the absence of resource constraints.
This algorithm satisfies a hi?gh-probability regret bound (7.4) against an adaptive
adversary, with R?(T ) = O( nT log(1/?)).
15
Corollary 2. Consider Combinatorial Semi-BwK with n atoms. Using the algo-
rithm from Neu and Barto?k [152] as the primal algorithm, we obtain corollaries
?
(C1) and (C2) with regret term reg = nT log T .
The adversarial version of Combinatorial Semi-BwK has not been studied be-
fore.
15Prior work (Neu and Barto?k [152], Sankararaman and Slivkins [163]) posits that atoms? per-
round rewards/consumptions lie in the range [0, 1], rather than [0, 1n ], so their stated regret bounds
should be recaled accordingly.
143
The stochastic version has been studied in Sankararaman and Slivkins [163]
when the action set is a matroid, achieving regret
( ? ? ? )
O? OPTDP n/B + T/n+ OPTDP .
?
This regret bound becomes O?( nT ) in the regime when B and OPTDP are ?(T )
(see Footnote 15). We achieve the same regret bound for this regime, without
the matroid assumption and without any extra work. However, the regret bound
in Sankararaman and Slivkins [163] can be substantially better than ours when
OPTDP  T .
7.9.3 Contextual Bandits with Knapsacks
Following Agrawal et al. [14], Badanidiyuru et al. [31], we consider Contextual
Bandits with Knapsacks (cBwK ), a common generalization of BwK and contextual
bandits with policy sets (e.g., Agarwal et al. [5], Dud??k et al. [80], Langford and Zhang
[127]). The only change in the protocol, compared to BwK, is that in the beginning
of each round t a context xt ? X arrives and is observed by the algorithm before
it chooses an action. The context set X is arbitrary and known. In the adversarial
version (Adversarial cBwK ) both xt and the outcome matrix Mt is chosen by an
adversary. In the stochastic version (Stochastic cBwK ) the pair (xt,Mt) is chosen
independently from some fixed but unknown distribution over such pairs.
In cBwK one is also given a finite set ? of policies : deterministic mappings
from contexts to actions.16 Essentially, the algorithm competes with the best course
16W.l.o.g. assume that ? contains all constant policies, i.e., all policies that always evaluate to
144
of action restricted to these policies. For a formal definition, let us interpret cBwK
as a BwK problem with action set ?, denote this problem as BwK(?). In other
words, actions in BwK(?) are policies in cBwK. An algorithm for BwK(?) is oblivious
to context arrivals. It chooses a policy ?t ? ? in each round t, and receives an
outcome for this policy: namely, the outcome for action ?(xt). We are interested
in the usual benchmarks for this problem, the best algorithm OPTDP and the best
fixed distribution OPTFD (where both benchmarks are constrained to use policies in
?); denote them OPTDP(?) and OPTFD(?), respectively.
Without budget constraints (i.e., with B = T ), this is exactly contextual
bandits with policy set ?. Both benchmarks then reduce to the standard benchmark
of the best fixed policy.
Background: algorithm EXP4.P. We use EXP4.P (Beygelzimer et al. [40]),
an algorithm for the contextual version of adversarial online learning with bandit
feedback. The algorithm operates according to the protocol in Figure 7.4.
We are interested in regret bounds for EXP4.P relative to the best fixed policy:
?
OPT? = max??? t?[T ] ft(?(xt)).
For each round t, the pair (xt,gt) induces a payoff vector ft ? [b ?min, bmax] on
policies:
ft(?) = gt(?(xt)) ?? ? ?.
Theorem 11 (Beygelzimer et al. [40]). Fix failure probability ? > 0, policy set ?,
and payoff range [bmin, bmax]. Then algorithm EXP4.P (appropriately tuned) satisfies
the same action.
145
Given: action set [K], context set X , policy set ?, payoff range [bmin, bmax].
In each round t ? [T ],
1. the adversary chooses a context xt ? X and a payoff vector gt ?
[b Kmin, bmax] ;
2. the algorithm chooses a distribution pt over ? (without seeing xt);
3. a policy ?t ? ? is drawn independently from pt;
4. algorithm?s chosen action is defined as at = ?t(xt) ? [K];
5. payoff gt(at) is received by the algorithm.
Figure 7.4: Adversarial contextual bandits
the following regret bound:
[ ? ]
Pr OPT? ? t?[T ] ft(?t) ? (bmax ? bmin)R?(T ) ? 1? ?, (7.79)
(? )
with regret term R?(T ) = O ?K log(KT |?|/?) .
Our solution for cBwK. We solve cBwK by reducing it to BwK(?), and treating it
as a BwK problem. A naive solution simply posits |?| arm?s and directly applies the
machinery developed earlier in this paper. This results in |?| dependence in regret
bounds, which is unsatisfactory, as the policy set may be very large. Instead, we use
EXP4.P as the primal algorithm (ALG1). We interpret EXP4.P as an algorithm
for (non-contextual) adversarial online learning, with action set ?. It is easy to see
that Theorem 11 provides regret bound (7.4) under this interpretation. Therefore,
146
we obtain the following:
Corollary 3. Consider contextual bandits with knapsacks, with policy set ?. Using
EXP4.P as th?e primal algorithm, we obtain corollaries (C1) and (C2) with regret
term reg = TK ln (dKT |?|). The benchmarks are OPTDP = OPTDP(?) and
OPTFD = OPTFD(?).
Adversarial cBwK has not been studied before. The regret bound for the
?
adversarial case is meaningful only if B > T . This is essentially inevitable in light
of the lower bound in Theorem 10.
Stochastic cBwK has been studied in Agrawal et al. [14], Badanidiyuru et al.
[31], achieving regret bound
O(reg)(1 + OPTDP(?)/B), (7.80)
where the reg term is the same as in Corollary 3. Whereas the regret bound from
Corollary 3 is O(reg ? T/B). Note that we match (7.80) in the regime OPTDP(?) >
?(T ). Our regret bound is optimal, up to logarithmic factors, in the regime B >
?(T ).17
Discussion. Our algorithms are slow, as the per-round running time of EXP4.P is
proportional to |?|. The state-of-art approach to computational efficiency in contex-
tual bandits is oracle-efficient algorithms, which make only a small number of calls
to an oracle that finds the best policy in ? for a given data set. In particular, prior
work for Stochastic cB(wK (A?grawal et al. [14]) )obtains an oracle-efficient algorithm
17This is due to the min T, ?( KT log(|?|)/ log(K) regret bound, which holds for contextual
bandits Agarwal et al. [4].
147
with regret bound as in (7.80). To obtain oracle-efficient algorithms for cBwK in
our framework, both for the stochastic and adversarial versions, it suffices to replace
EXP4.P with an oracle-efficient algorithm for adversarial contextual bandits that
obtains regret bound (7.79), possibly with a larger regret term R?. Such algorithms
almost exist: a recent breakthrough (e.g., Rakhlin and Sridharan [156], Syrgkanis
et al. [173, 174]) obtains algorithms with similar regret bounds, but only for expected
regret.
7.9.4 Bandit Convex Optimization with Knapsacks
We consider Bandit Convex Optimization with Knapsacks (BCOwK ), a com-
mon generalization of BwK and bandit convex optimization. We define BCOwK as
a version of BwK, where the action set X is a convex subset of RK . For each round
t, there is a concave function ft : X ? [0, 1] and convex functions gt,i : X ? [0, 1],
for each resource i, so that the reward for choosing action x ? X in this round is
ft(x) and consumption of each resource i is gt,i(x). In the stochastic version, the
tuple of functions (ft; gt,1 , . . . , gt,d) is sampled independently in each round t from
some fixed distribution (which is not known to the algorithm). In the adversarial
version, all these tuples are chosen by an adversary before the first round.
Neither stochastic nor adversarial version of BCOwK have been studied in
prior work (but see the discussion of constrained online convex optimization in
Section 7.2). Bandit convex optimization, as studied previously, is a special case
with no resource constraints (d = 0).
The primal algorithm ALG1 in LagrangeBwK faces an instance of BCO (with
148
an adaptive adversary). This is because the Lagrange function (7.10) is a concave
function of the action, as sum of concave functions. For our primal algorithm, we
use a recent breakthrough on BCO due to Bubeck et al. [45]. This algorithm satisfies
the high-probability regret bound (7.4) against an adaptive adversary, with regret
term ?
R?(T ) = O(K
9.5 log7(T ) T log(1/?)).
We assume the existence of a null arm: a point x ? X such that ft(x) =
gt,i(x) = 0 for each resource i except the ?dummy resource?. (Recall that we
posit the ?dummy resource? ? a resource whose consumption is B/T for each arm
? for the stochastic version.) Unlike elsewhere in this paper, this assumption is
not without loss of generality: indeed, the null arm should be ?embedded? into X
without breaking the convexity/concavity properties. Moreover, we assume that the
null arm lies in the interior of X .
Corollary 4. Consider BCOwK for a given convex set X ? RK. Using the algo-
rithm from Bubeck et al. [45] as the primal algorithm, we obtain corollaries (C1)
?
and (C2) with regret term reg = K9.5 log7.5(T ) T .
Remark 11. LagrangeBwK framework extends to infinite action sets: everything
carries over, as long as 7.11 holds. (Essentially, we never take union bounds over
actions, and we can replace max and sums over actions with sup and integrals.)
For BCOwK, 7.11 is a statement about constrained convex optimization programs.
According to Slater?s condition (see Eq. (5.27) in Boyd and Vandenberghe [41]),
it suffices to have a point x in the interior of X such that gt,i(x) < B/T for each
149
resource i ? [d] other than the dummy resource (or any other resource whose con-
sumption is the same in all rounds). One such point is the null arm.
150
Chapter 8: Better Bounds for Combinatorial Semi-bandits with Knap-
sacks
8.1 Introduction
In this chapter we consider one extension of Stochastic BwK, namely Combi-
natorial Semi-Bandits with Knapsacks. We obtain better bounds for this problem
than what was achieved in Chapter 71. This extension combines two lines of work
related to bandits: on bandits with knapsacks (BwK) (Badanidiyuru et al. [33]) and
on combinatorial semi-bandits (Gyo?rgy et al. [101]). In combinatorial semi-bandits,
actions correspond to subsets of some ?ground set?, rewards are additive across the
elements of this ground set (atoms), and the reward for each chosen atom is revealed
(semi-bandit feedback). A paradigmatic example is an online routing problem, where
atoms are edges in a graph, and actions are paths.
8.1.1 Formal Model
Our model, called Semi-Bandits with Knapsacks (SemiBwK) is a generalization
of multi-armed bandits (henceforth, MAB) with i.i.d. rewards. As such, in each
round t = 1, 2, . . . , T , an algorithm chooses an action St from a fixed set of actions
1This work was done and published before that of Chapter 7.
151
F , and receives a reward ?t(St) for this action which is drawn independently from
a fixed distribution that depends only on the chosen action. The number of rounds
T , a.k.a. the time horizon, is know2.
There are d resources being consumed by the algorithm. The algorithm starts
out with budget Bj ? 0 of each resource j. All budgets are known to the algorithm.
If in round t action S ? F is chosen, the outcome of this round is not only the
reward ?t(S) but the consumption Ct(S, j) of each resource j ? [d]. We refer
to Ct(S) = (Ct(S, j) : j ? [d]) as the consumption vector.3 Following prior work
on BwK, we assume that all budgets are the same: Bj = B for all resources j.
4
Algorithm stops as soon as any one of the resources goes strictly below 0. The round
in which this happens is called the stopping time and denoted ?stop. The reward
collected?in this last round does not count; so the total reward of the algorithm is
REW = t<? ?t(S ).stop t
Actions correspond to subsets of a finite ground set A, with n = |A|; we refer
to elements of A as atoms. Thus, the set F of actions corresponds to the family
of ?feasible subsets? of A. The rewards and resource consumption is additive over
the atoms: for each round t and each atom a there is a reward ?t(a) ? [0, 1] and
consump?tion vector Ct(a) ? [0, 1]
d
?such that for each action S ? F it holds that
?t(S) = a?S ?t(a) and Ct(S) = a?S Ct(a).
2As opposed to Chapter 7, throughout this chapter we use n to denote the number of arms, j
to denote the index of a resource and k to denote the maximum number of atoms in any action.
3We use bold font to indicate vectors and matrices.
4This is w.l.o.g. because we can divide all consumption of each resource j by Bj/minj??[d]Bj? .
Effectively, B is the smallest budget in the original problem instance.
152
We assume the i.i.d. property across rounds, but allow arbitrary correlations
within the same round. Formally, for a given round t we consider the n ? (d + 1)
?outcome matrix? (?t(a),Ct(a) : a ? A), which specifies rewards and resource con-
sumption for all resources and all atoms. We assume that the outcome matrix is
chosen independently from a fixed distribution DM over such matrices. The distribu-
tion DM is not revealed to the algorithm. The mean rewards and mean consumption
is denoted ?(a) := E[?t(a)] and C(a) :=?E[Ct(a)]. We extend ?the notation to
actions, i.e., to subsets of atoms: ?(S) := a?S ?(a) and C(S) := a?S C(a).
An instance of SemiBwK consists of the action set F ? 2[n], the budgets B =
(Bj : j ? [d]), and the distribution DM. The F and B are known to the algorithm,
and DM is not. As explained in the introduction, SemiBwK subsumes Bandits with
Knapsacks (BwK) and semi-bandits. BwK is the special case when F consists
of singletons, and semi-bandits is the special case when all budgets are equal to
Bj = nT (so that the resource consumption is irrelevant).
Following the prior work on BwK, we compete against the ?optimal all-knowing
algorithm? (Definition 1): an algorithm that optimizes the expected total reward
for a given problem instance; its expected total reward is denoted by OPTDP. As
observed in Badanidiyuru et al. [33], OPTDP can be much larger (e.g., factor of 2
larger) than the expected cumulative reward of the best action, for a variety of
important special cases of BwK. Our goal is to minimize regret, defined as OPTDP
minus algorithm?s total reward.
153
8.2 Challenges and Contribution
8.2.1 Our Contribution
We define a common generalization of combinatorial semi-bandits and BwK,
termed Combinatorial Semi-Bandits with Knapsacks (SemiBwK). We focus on the
i.i.d. environment: in each round, the ?outcome? is drawn independently from a
fixed distribution over the possible outcomes. Here the ?outcome? of a round is the
matrix of reward and resource consumption for all atoms.5 We design an algorithm
for SemiBwK, achieving regret rates that are comparable with those for BwK and
combinatorial semi-bandits.
Specifics are as follows. As usual, we assume ?bounded outcomes?: for each
atom and each round, rewards and consumption of each resource is non-negative
and at most 1. Regret is relative to the expected total reward of the best all-
knowing policy, denoted OPTDP. We upper-bound the regret in terms of the relevant
parameters: time horizon T , (smallest) budget B, number of atoms n, and OPTDP
itself (which may be as large as nT ). We obtain
? ? ?
reg ? O?( n)(OPTDP / B + T + OPTDP). (8.1)
?
The ?shape? of the regret bound is consistent with prior work: the OPTDP / B ad-
5Our model allows arbitrary correlations within a given round, both across rewards and con-
sumption for the same atom and across multiple atoms. Such correlations are essential in ap-
plications such as dynamic pricing and dynamic assortment. E.g., customers? valuations can be
correlated across products, and algorithm earns only if it sells; see Section 8.7 for details.
154
? ?
ditive term appears in the optimal regret bound for BwK, and the T and OPTDP
additive terms are very common in regret bounds for MAB. The per-round running
time is polynomial in n, and near-linear in n for some important special cases.
Our regret bound is optimal up to polylog factors for paradigmatic special
cases. BwK is a special c?ase when actions are atoms. For OPTDP > ?(T ), the?
regret bound becomes O?(T n/B+ nT ), where n is the number of actions, which
coincides with the lower bound from Badanidiyuru et al. [33]. Combinatorial semi-
bandits is a special case with B = nT . If all feasible subsets contain at most k atoms,
?
we have OPTDP ? kT , and the regret bound becomes O?( knT ). This coincides with
?
the ?( knT ) lower bound from Kveton et al. [124].
Our main result assumes that the action set, i.e., the family of feasible subsets
of atoms, is described by a matroid constraint.6 This is a rather general scenario
which includes many paradigmatic special cases of combinatorial semi-bandits such
as cardinality constraints, partition matroid constraints, and spanning tree con-
?
straints. We also assume that B > ??(n+ nT ).
Our model captures several application scenarios, incl. dynamic pricing, dy-
namic assortment, repeated auctions, and repeated bidding. We work out these
applications, and explain how our regret bounds improve over prior work.
6Matroid is a standard notion in combinatorial optimization which abstracts and generalizes
linear independence.
155
8.2.2 Challenges and Techniques
Generic challenges in combinatorial semi-bandits concern handling exponen-
tially many actions (both in terms of regret and in terms of the running time), and
taking advantage of the additional feedback. And in SemiBwK, one needs to deal
with distributions over subsets of atoms, rather than ?just? with distributions over
actions.
Our algorithm connects a technique from BwK and a randomized rounding
technique from combinatorial optimization. (With three existing BwK algorithms
and a wealth of approaches for combinatorial optimization, choosing the techniques
is a part of the challenge.)
We build on a BwK algorithm from Agrawal and Devanur [11], which combines
linear relaxations and a well-known ?optimism-under-uncertainty? paradigm. A
generalization of this algorithm to SemiBwK results in a fractional solution x, a vector
over atoms. Randomized rounding converts x into a distribution over feasible subsets
of atoms that equals x in expectation. It is crucial (and challenging) to ensure that
this distribution contains enough randomness so as to admit concentration bounds
not only across rounds, but also across atoms. Our analysis ?opens up? a fairly
technical proof from prior work and intertwines it with a new argument based on
negative correlation.
We present our algorithm and analysis so as to ?plug in? any suitable random-
ized rounding technique. This makes our presentation more lucid, and also leads to
faster running times for important special cases.
156
8.2.3 Solving SemiBwK using prior work.
Solving SemiBwK using an algorithm for BwK would result in a regret bound
like (8.1) with n replaced with the number of actions. The latter could be on the
order of nk if each action can consist of at most k atoms, or perhaps even exponential
in n.
SemiBwK can be solved as a special case of a much more general linear-contextual
extension of BwK from Agrawal and Devanur [11, 12]. In their model, an algorithm
takes advantage of the combinatorial structure of actions, yet it ignores the ad-
ditional feedback from the atoms. Their regret bounds have a worse dependence
on the parameters, and apply for a much more limited range of parameters. Fur-
ther, their per-round running time is linear in the number of actions, which is often
prohibitively large.
To compare the regret bounds, let us focus on instances of SemiBwK in which
at most one unit of each resource is consumed in each round. (This is the case in all
our motivating applications, except repeated bidding.) Then Agrawal and Devanur
? ? ?
[11, 12] assume B > nT 3/4, and achieve regret O?(n T OPTDP +n2 T ). 7 It is easy
B
to see that we improve upon the range and upon both summands. In particular, we
? ?
7Agrawal and Devanur [11, 12] state regret bound with term +n T rather than +n2 T , but
they assume that per-round rewards lie in [0, 1]. Since per-round rewards can be as large as n in
our setting, we need to scale down all rewards by a factor of n, apply their regret bound, and then
?
scale back, which results in the regret bound with +n2 T . When per-round consumption can be as
? ?
large as n, regret bound from Agrawal and Devanur [11, 12] becomes O?(n2 OPT 2DP T/B+n T )
due to rescaling.
157
?
improve both summands by the factor of n n in a lucid special case when B > ?(T )
and OPTDP < O(T ).
8
8.3 Additional Related Work
Combinatorial semi-bandits were studied by Gyo?rgy et al. [101], in the ad-
versarial setting. In the i.i.d. setting, in a series of works by Anantharam et al.
[19], Chen et al. [65], Combes et al. [68], Gai et al. [90, 91], Kveton et al. [126], an
optimal algorithm was achieved. This result was then extended to atoms with linear
rewards by Wen et al. [187]. Kveton et al. [124] obtained improved results for the
special case when action set is described by a matroid. Some other works studied
a closely related ?cascade model?, where the ordering of atoms matters (Katariya
et al. [117], Kveton et al. [125], Zong et al. [189]). Contextual semi-bandits have
been studied in Krishnamurthy et al. [123], Wen et al. [187].
Randomized rounding schemes (RRS) come from the literature on approxi-
mation algorithms in combinatorial optimization (see Papadimitriou and Steiglitz
[153], Williamson and Shmoys [188] for background). RRS were introduced in
Raghavan and Tompson [154]. Subsequent work Asadpour et al. [22], Chekuri et al.
[59, 60], Gandhi et al. [92] developed RRS which correlate the rounded random
variables so as to guarantee sharp concentration bounds.
8In prior work on combinatorial bandits (without constraints), semi-bandit feedback improves
?
regret bound by a factor of n, see the discussion in Kveton et al. [126].
158
8.4 Preliminaries
8.4.1 Combinatorial constraints.
Action set F is given by a combinatorial constraint, i.e., a family of subsets.
Treating subsets of atoms as n-dimensional binary vectors, F corresponds to a finite
set of points in Rn. We assume that the convex hull of F forms a polytope in Rn.
In other words, there exists a set of linear constraints over Rn whose set of feasible
integral solutions is F . We call such F linearizable; the convex hull is called the
polytope induced by F .
Our main result is for matroid constraints, a family of linearizable combinato-
rial constraints which subsumes several important special cases such as cardinality
constraints, partition matroid constraints, spanning tree constraints and transversal
constraints. Formally, F is a matroid if it contains the empty set, and satisfies two
properties: (i) if F contains a subset S, then it also contains every subset of S, and
(ii) for any two subsets S, S ? ? F with |S| > |S ?| it holds that S ? ?{a} ? F for each
atom a ? S \ S ?.
We incorporate prior work on randomized rounding for linear programs. Con-
sider a linearizable action set F with induced polytope P ? [0, 1]n. The randomized
rounding scheme (henceforth, RRS) for F is an algorithm which inputs a feasible
fractional solution x ? P and the linear equations describing P , and produces a
random vector Y over F . We consider RRS?s such that E[Y] = x and Y is nega-
tively correlated (see below for definition); we call such RRS?s negatively correlated.
159
Several such RRS are known: e.g., for cardinality constraints and bipartite match-
ing (Gandhi et al. [92]), for spanning trees (Asadpour et al. [22]), and for matroids
(Chekuri et al. [59]).
8.4.2 Negative correlation.
Let X = (X1, X2, . . . ,?Xm) denote a family of random variables which take
values in [0, 1]. Let X := 1 mi=1Xi be the average, and ? := E[X].m
Family X is called negati[vely cor]related if? ?
[ E X] i ? E[Xi] ?S ? [m] (8.2)? i?S ?i?S
E (1?Xi) ? E[1?Xi] ?S ? [m] (8.3)
i?S i?S
Independent random variables satisfy both properties with equality. For in-
tuition: if X1, X2 are Bernoulli and (8.2) is strict, then X1 is more likely to be 0 if
X2 = 1.
Negative correlation is a generalization of independence that allows for similar
concentration bounds, i.e., high-probability upper bounds on |X ??|. However, our
analysis does not invoke them directly. Instead, we use a concentration bound given
a closely related property: [? ]
E X 1 |S|i ? ( ) ?S ? [m]. (8.4)2
i?S
Theorem 12 (Impagliazzo and Kabanets [110]). If (8.4), then for some absolute
constant c,
Pr[X ? 1 2+ ?] ? c ? e?2m? (?? > 0) (8.5)
2
160
8.4.3 Confidence radius.
We bound deviations |X ? ?| in a way that gets sharper when ? is small,
without knowing ? in advance. (We use the notation X , X, ? as above.) To this
end, we use the notion of confidence radius from Agrawal and Devanur [11], Babaioff
et al. [28], Badanidiyuru et al. [33], Kleinberg et al. [121]9:
?
Rad?(x,m) = ?x/m+ ?/m. (8.6)
If random variables X are independent, then event
|X ? ?| < Rad?(X,m) < 3 Rad?(?,m) (8.7)
happens with probability at least 1 ? O(e??(?)), for any given ? > 0. We use this
notion to define upper/lower confidence bounds on the mean rewards and mean
resource consumption. Fix round t, atom a, and resource j. Let ??t(a) and C?t(a, j)
denote the empirical average of the rewards and resource-j consumption, resp.,
between rounds 1 and t ? 1. Let Nt(a) be the number of times atom a has been
chosen in these rounds (i.e., included in the chosen actions). The confidence bounds
are defined as
C?t (a, j) = proj( C?(a, j)? Rad?(C?(a, j), Nt(a)) )
??t (a) = proj ( ??(a)? Rad?(??(a), Nt(a)) ) (8.8)
where proj(x) := arg miny?[0,1] |y ? x| denotes the projection into [0, 1]. We always
use the same parameter ? = cconf log(ndT ), for an appropriately chosen absolute
9For instance Theorem 2.1 in Badanidiyuru et al. [33]
161
constant cconf. We suppress ? and cconf from the notation. We use a vector notation
??t and C
?
t (j) to denote the corresponding n-dimensional vectors over all atoms a.
By (8.7), with probability 1?O(e??(?)) the following hold.
?(a) ? [??t (a), ?+t (a)]
C(a, j) ? [C?(a, j), C(a, j)+]
8.4.4 Matroid constraints
Recall that in SemiBwK, we have a finite ground set whose elements are called
?atoms?, and a family F of ?feasible subsets? of the ground set which are the actions.
To be consistent with the literature on matroids, the ground set will be denoted E.
Family F of subsets of E is called a matroid if it satisfies the following properties:
? Empty set: The empty set ? is present in F
? Hereditary property: For two subsets X, Y ? E such that X ? Y , we have
that
Y ? F =? X ? F
? Exchange property: For X, Y ? F and |X| > |Y |, we have that
?e ? X \ Y : Y ? {e} ? F
Matroids are linearizable, i.e., the convex hull of F forms a polytope in RE.
(Here subsets of F are interpreted as binary vectors in RE.) In other words, there
exists a set of linear constraints whose set of feasible integral solutions is F . In
162
fact, the convex hull of F , a.k.a. the matroid polytope, can be represented via the
following linear system:
x(S) ? rank(S) ?S ? E
(LP-Matroid)
? x(e) ? [0, 1]E ?e ? E.
Here x(S) := e?S x(e), and rank(S) = max{|Y | : Y ? S, Y ? F} is the ?rank
function? for F .
F is indeed the set of all feasible integral solutions of the above system. This
is a standard fact in combinatorial optimization, e.g., see Theorem 40.2 and its
corollaries in Schrijver [168].
We will now describe some well-studied special cases of matroids. That they
indeed are special cases of matroids is well-known, we will not present the corre-
sponding proofs here.
In all LPs presented?below, we have variables x(e) for each arom e ? E, and
we use shorthand x(S) := e?S x(e) for S ? E.
Cardinality constraints. Cardinality constraint is defined as follows: a subset S
of atoms belongs to F if and only if |S| ? K for some fixed K. This is perhaps the
simplest constraint that our results are applicable to. In the context of SemiBwK,
each action selects at most K atoms.
The corresponding induced polytope is as follows:
x(E) ? K
(LP-Cardinality)
x(e) ? [0, 1] ?e ? E.
Partition matroid constraints. A generalization of cardinality constraints, called
partition matroid constraints, is defined as follows. Suppose we have a collection
163
B1, B2, . . . , Bk of disjoint subsets of E, and numbers d1, d2, . . . , dk. A subset S of
atoms belongs to F if and only if |S ? Bi| ? di for every i. Partition matroid
constraints appear in several applications of SemiBwK such as dynamic pricing, ad-
justing repeated auctions, and repeated bidding. In these applications, each action
selects one price/bid for each offered product. Also, partition matroid constraints
can model clusters of mutually exclusive products in dynamic assortment applica-
tion.
The induced polytope is as follows:
x(Bi) ? di ?i ? [k]
(LP-PartitionMatroid)
x(e) ? [0, 1] ?e ? E.
8.5 Main algorithm (SemiBwK-RRS)
Let us define our main algorithm, called SemiBwK-RRS. The algorithm builds
on an arbitrary RRS for the action set F . It is parameterized by this RRS, the
polytope P induced by F (represented as a collection of linear constraints), and a
number  > 0. In each round t, it recomputes the upper/lower confidence bounds,
as defined in (8.8), and solves the following linear program:
maximize ?+t ? x
subject to C?(j) ? x ? B(1?) , j ? [d] (LPt ALG)T
x ? P
This linear program defines a linear relaxation of the original problem which is
?optimistic? in the sense that it uses upper confidence bounds for rewards and lower
164
confidence bounds for consumption. The linear relaxation is also ?conservative? in
the sense that it rescales the budget by 1? . Essentially, this is to ensure that the
algorithm does not run out of budget with high probability. Parameter  will be
fixed throughout. For ease of notation, we will denote B := (1 ? )B henceforth.
The LP solution x can be seen as a probability vector over the atoms. Finally,
the algorithm uses the RRS to convert the LP solution into a feasible action. The
pseudocode is given as Algorithm 8.
Algorithm 8: SemiBwK-RRS
input: an RRS for action set F , induced polytope P (as a set of linear
constraints),  > 0.
1 for t = 1, 2 , . . . , T do
1. Recompute Confidence Bounds as in (8.8)
2. Obtain fractional solution x ? [0, 1]nt by solving the linear program
LPALG.
3. Obtain a feasible action St ? F by invoking the RRS on vector xt.
4. Semi-bandit Feedback: observe the rewards/consumption for all atoms
a ? St.
If action set F is described by a matroid constraint, we can use the negatively
correlated RRS from Chekuri et al. [59]. In particular, we obtain a complete al-
gorithm for several combinatorial constraints commonly used in the literature on
semi-bandits, such as partition matroid constraints, spanning trees.
165
Theorem 13. Consider the SemiBwK problem with a linearizable action set F that
admits a negatively correlated RRS. Then algorithm SemiBwK-RRS with this RRS
achieves expected regret bound at most
? ( ? ? )
O(log(ndT )) n OPTDP / B + T + OPTDP . (8.9)
Here T is the time horizon, n is the number of atoms, and B is the budget. We
?
require B > 3(?n+ ?nT ), where ? = ?(log(n?dT )) is the parameter in confidence?
radius. Parameter  in the algorithm is set to ?n + ?n + ?nT .
B B B
Corollary 5. Consider the setting in Theorem 13 and assume that the action set
F is defined by a matroid on the set of atoms. Then, using the negatively correlated
RRS from Chekuri et al. [59], we obtain regret bound (8.9).
Running time of the algorithm. The algorithm does two computationally in-
tensive steps in each round: solves the linear program (LPALG) and runs the RRS.
For matroid constraints, the RRS from Chekuri et al. [59] has O(n2) running time.
Hence, in the general case the computational bottleneck is solving the LP, which
has n variables and O(2n) constraints. Matroids are known to admit a polynomial-
time seperation oracle (e.g., see Schrijver [168]). It follows that the entire set of
constraints in LPALG admits a polynomial-time separation oracle, and therefore we
can use the Ellipsoid algorithm to solve LPALG in polynomial time. For some classes
of matroid constraints the LP is much smaller: e.g., for cardinality constraints (just
d + 1 constraints) and for traversal matroids on bipartite graphs (just 2n + d con-
straints). Then near-linear-time algorithms can be used.
166
Our algorithm works under any negatively correlated RRS. We can use this
flexibility to improve the per-round running time for some special cases. (Making
decisions extremely fast is often critical in practical applications of bandits (e.g., see
Agarwal et al. [6].) We obtain near-linear per-round running times for cardinality
constraints and partition matroid constraints. Indeed, LPALG can be solved in near-
linear time, as mentioned above, and we can use a negatively correlated RRS from
Gandhi et al. [92] which runs in linear time. These classes of matroid constraints
are important in our applications (see Section 8.7).
8.6 Proof of Main Theorem
8.6.1 Proof overview.
First, we argue that LPALG provides a good benchmark that we can use instead
of OPTDP. Specifically, at any given round, the optimal value for LPALG in each
round is at least 1 (1?) OPTDP with high probability. We prove this by constructingT
a series of LPs, starting with a generic linear relaxation for BwK and ending with
LPALG, and showing that the optimal value does not decrease along the series.
Next we define an event that occur with high probability, henceforth called
clean event. This event concerns total rewards, and compares our algorithm against
LPALG:
? ? (? ? ? )
| +t?[T ] rt ? t?[T ] ?t ? xt| ? O ?n t?[T ] rt + ?nT + ?n . (8.10)
We prove that it is indeed?a high-probability event in ?three steps. First,
we relate the algorithm?s reward t rt to its expected reward t ? ? St, where we
167
interpret the cho?sen action S?t, a subset of atoms, as a binary vector over the atoms.
Then we relate t ? ? S to ?
+
t t t ? S?t, replacing ex?pected rewards with the upper
confidence bounds. Finally, we relate ?+t t ? St to t ?
+
t ? xt, replacing the output
of the RRS with the ?corresponding expectations. Putting it together, we relate
algorithm?s reward to t ?
+
t ? xt, as needed. It is essential to bound the deviations
?
in the sharpest way possible; in particular, the naive O?( T ) bounds are not good
enough. To this end, we use several tools: the confidence radius from (8.6), the
negative correlation property of the RRS, and another concentration bound from
prior work.
A similar ?clean event? (with a similar proof) concerns the total resource
consumption of the algorithm. We condition on both clean events, and perform the
rest of the analysis via a ?deterministic? argument not involving probabilities. In
particular, we use the second ?clean event? to guarantee that the algorithm never
runs out of resources.
We use negative correlation via a rather delicate argument. We extend the
concentration bound in Theorem 12 to a random process that evolves over time, and
only assumes that property (8.4) holds within each round conditional on the history.
For a given round, we start with a negative correlation property of St and construct
another family of random variables that conditionally satisfies (8.4). The extended
concentrati?on bound is then applied to this family. The net result is a concentration
bound for +t ?t ? St as if we had n? T independent random variables there.
The rest of the section contains the full proof.
168
8.6.2 Linear programs
We argue that LPALG provides a good benchmark that we can use instead of
OPTDP. Fix round t and let OPTALG, t denote the optimal value for LPALG in this
round. Then:
Lemma 24. OPT 1ALG, t ? (1? ) OPTDP with probability at least 1? ?.T
We will prove this by constructing a series of LP?s, starting with a generic
linear relaxation for BwK and ending with LPALG. We show that along the series
the optimal value does not decrease with high probability.
The first LP, adapted from Badanidiyuru et al. [33], has one decision variable
for each action, and applies generically to any BwK problem.
?
maximize ?S?F ?(S)x(S)
subject to S??F C(S, j)x(S) ? B/T j = 1, ..., d
(LPBwK)
0 ? S?F x(S) ? 1.
Let OPTBwK(B) denote the optimal value of this LP with a given budget B.
Then:
Claim 3. OPTBwK(B) ? (1? ) OPTBwK(B) ? 1 (1? ) OPTT DP.
Proof. The second inequality in Claim 3 follows from [Lemma 3.1 in 33]. We will
prove the first inequality as follows. Let x? denote an optimal solution to LPBwK(B).
Consider (1? )x?; this is feasible to LPBwK(B), since for every S,
?
(1? )x?(S) ? 1 and C(S, j)(1? )x(S) ? B/T.
S?A:S?S
169
Hence, this is a feasible solution. Now, consider the objective function. Let y denote
an optimal solution to LPBwK(B). We have that
? ?
OPTBwK(B) = ?(S)y
?(S) ? ?(S)(1? )x?(S) = (1? ) OPTBwK(B).
S?A:S?S S?A:S?S
Now consider a simpler LP where the decision variables correspond to atoms.
As before, P denotes the polytope induced by action set F .
maximize ? ? x
(LPATOMS)
subject to C? ? x 4 B/T x ? P x ? [0, 1]n.
Here C = (C(a, j) : a ? A, j ? d) is the n? d matrix of expected consumption, and
C? denotes its transpose. The notation 4 means that the inequality ? holds for for
each coordinate.
Leting OPTatoms denote the optimal value for LPATOMS, we have:
Claim 4. With probability at least 1?? we have, OPTALG, t ? OPTatoms ? OPTBwK(B).
Proof. We will first prove the second inequality.
Consider the optimal solution vector x to LPBwK(B). Define S
? := {S :
x(S) 6= 0}.
We will now map this to a feasible solution to LPATOMS and show that the
objective value does not decrease. This will then complete the claim. Consider the
following solution y defined as follows.
?
y(a) = x(S).
S?S?:a?S
170
We will now show that y is a feasibl?e solution to the polytope P . From the defi-
nition of y, we can write it as y = x(S) ? I[S]. Here, I[S] is a binary vector,
S?S?
such that it has 1 at position a if and only if atom a is present in set S. Hence, y
is a point in the polytope since it can be written as convex combination of its vertices.
Now, we will show that, y also satisfies the resource consumption constraint.
? ? ?? ?
C(j) ? y = C(a, j) x(S) = C(a, j)x(S) = C(S, j)x(S) ? B/T.
a?A S?S?:a?S S?S? a?S S?S?
The last inequality is because in the optimal solution, the x value correspond-
ing to subset S? is 1 while rest all are 0. We will now show that y produces an
objective value at least as large as x.
?n ?
OPT ?atoms = ? ? y ? ? ? y = ?(a) x(S)
?? a=1 ? S?S?:a?S
= ?(a)x(S) = ?(S)x(S)
S?S? a?S S?S?
= OPTsubsets(B).
Now we will prove the first inequality. We will assume the ?clean event? that
?+t ? ? and C?t ? Ct for all t. Hence, the inequality holds with probability at least
1? ?.
Consider a time t. Given an optimal solution x? to LPATOMS we will show that
this is feasible to LPALG,t. Note that, x
? satisfies the constraint set x ? P since that
is same for both LPALG,t and LPATOMS. Now consider the constraint C
?
t (j) ?x ? B .T
Note that C?t (a, j) ? C(a, j). Hence, we have that C?t (j) ?x? ? C(j) ?x? ? B . TheT
171
last inequality is because x? is a feasible solution to LPATOMS.
Now consider the objective function. Let y? denote the optimal solution to
LPALG,t.
OPT +ALG, t = ?t ? y? ? ?+ ? x? ? ? ? y?t = OPTatoms.
Hence, combining Claim 3 and Claim 4, we obtain Lemma 24.
8.6.3 Negative correlation and concentration bounds
Our analysis relies on several facts about negative correlation and concen-
tration bounds. First, we argue that property (8.2) in the definition of negative
correlation is preserved under a specific linear transformation:
Claim 5. Suppose (X1, X2 , . . . , Xm) is a family of negatively correlated random
variables with support [0, 1]. Fix numbers ?1, ?2 , . . . , ?m ? [0, 1]. Consider two
families of random variables:
( ) ( )
F+ 1 + ?i(Xi ? E[Xi]) ? F? 1? ?i(Xi ? E[Xi])= : i [m] and = : i ? [m] .
2 2
Then both families satisfy property (8.2).
Proof. Let us focus on family F+; the proof for family F? is very similar.
Denote ?i = E[Xi] and Yi := (1 + ?i(Xi ? ?i))/2 and zi := (1? ?i?i)/2 for all
i ? [m]. Note that Yi = ?iXi/2 + zi and zi ? 0, Xi ? 0. Fix a subset S ? [m]. We
172
have, [? ]
E Y = E?
??? ? ??
i (?iXi/2) zj (8.11)
i?S ?T?[S?i?T ]j?S?\T
= E (?iXi/2) zj
T??S? i?T ? j?S\T
? (?i?i/2) zj (8.12)
?T?S i?T j?S\T
= ((1? ?i?i)/2 + ?i?i/2) (8.13)
i?S ?
= (1)|S| = E[Y
2 i
]
i?S
Eq. (8.11) and (8.13) follow from Binomial Theorem. Eq. (8.12) is because
Eq. (8.2) invariant under non-negative scaling, Xi neg. correlated
Second, we extend Theorem 12 to a random process that evolves over time,
and only assumes that property (8.4) holds within each round conditional on the
history.
Theorem 14. Let ZT = {?t,a : a ? A, t ? [T ]} be a family of random variables
taking values in [0, 1]. Assume random variables {?t,a : a ? A} satisfy property
(8.2?) given Zt?1 and have expectation
1 given Zt?1, for each round t. Let Z =2
1
a?A,t?[T ] ?t,a be the average. Then for some absolute constant c,nT
Pr[Z ? 1 + ?] ? c ? ?2m?2e (?? > 0). (8.14)
2
Proof. We prove that family Zt satisfies property (8.4), and then invoke Theorem 12.
Let us restate prope?rty (8.4) fo?r the sake of completeness:
E? ? ? ? ? 2?|S|t,a for any subset S ? ZT . (8.15)
(t,a)?S
173
Fix subset S ? ZT . Partition S into subsets St = {?t,a ? ZT ? S}, for each round t.
Fix round ? and denote
? ?
G? = Ht, where Ht = ?t,a.
t?[? ] a?St
We will now prove the following statement by induction on ? :
?
E[G? ] ? 2?k? , where k? = |St|. (8.16)
t?[? ]
The base case is when ? = 1. Note that G? is just the product of elements in set ?1
and they are negatively correlated from the premise. Therefore we are done. Now
for the inductive case of ? ? 2,
?
E[H? |Z??1] ? E[??,a|Z??1] From property (8.2) in the conditional space
a?S?
(8.17)
? 2?|S? | From assumption in Lemma 14
(8.18)
Therefore, we have
E[G? ] = E[E[G??1H? |Z??1]] Law of iterated expectation
= E[G??1E[H? |Z??1]] Since G??1 is a fixed value conditional on Z??1
? 2?|S? |E[G??1] From Eq. (8.18)
? 2?k? From inductive hypothesis
This completes the proof of Eq. (8.16). We obtain Eq. (8.15) for ? = T .
Third, we invoke Eq. (8.7) for rewards and resource consumptions:
174
Lemma 25. With probability at least 1? e??(?), we have the following:
|??t(a)? ?t(a)| ? 2 Rad(??t(a), Nt(a) + 1)
(8.19)
?j ? [d] |C?t(a, j)? Ct(a, j)| ? 2 Rad(C?t(a, j), Nt(a) + 1).
Fourth, we use a concentration bound from prior work which gets sharper
when the expected sum is very small, and does not rely on independent random
variables:
Theorem 15 (Babaioff et al. [28]). Let X1, X2, . . . , Xm denote a set of {0, 1} random
variables. For each t, let ?t?denote the multiplier determined by random variables
X1, X2, . . . , Xt?1. Let M =
m
t=1Mt where Mt = E[Xt|X1, X2, . . . , Xt?1]. Then for
any b ? 1, we have the following with probability at least 1?m??(b):
?m ?
| ?t(Xt ?Mt)| ? b( M logm+ logm)
t=1
8.6.4 Analysis of the ?clean event?
Let us set up several events, henceforth called clean events, and prove that
they hold with high probability. Then the remainder of the analysis can proceed
conditional on the intersection of these events. The clean events are similar to the
ones in Agrawal and Devanur [11], but are somewhat ?stronger?, essentially because
our algorithm has access to per-atom feedback and our analysis can use the negative
correlation property of the RRS.
In what follows, it is convenient to consider a version of SemiBwK in which the
algorithm does not stop, so that we can argue about what happens w.h.p. if our
175
algorithm runs for the full T rounds. Then we show that our algorithm does indeed
run for the full T rounds w.h.p.
Recall that xt be the optimal fractional solution obtained by solving the LP
in round t. Let Yt ? {0, 1}n be the random binary vector obtained by invoking the
RRS (so that the chosen action St ? F corresponds to a particular realization of
Yt, interpreted as a subset). Let Gt := {Y ?t? : ?t ? t} denote the family of RRS
realizations up to round t.
8.6.4.1 ?Clean event? for rewards
For brevity, for each round t let ?t = (?t(a) : a ? A) be the vector of realized
rewards, and let rt := ?t(St) = ?t ?Yt be the algorithm?s reward at this round.
Lemma 26. Consider SemiBwK without stopping. Then with probability at least
1? nT e??(?):
? ? ??? ? ??| rt ? ?+t ? x ?t| ? O ?n rt + ?nT + ?n .
t?[T ] t?[T ] t?[T ]
Proof. We prove the Lemma by proving the following three high-probability inequal-
ities.
With probability at least 1? nT e???(?): the following hol?ds:? ?
| ? ? | ? 1
?
rt ? Yt 3nT Rad? ?+t ? xt , nT? (8.20)
t?[T ] t?[T ] ?? ? nT t?[T ]? ? ? ? ?
| ? ?Y ? ?+t t ?Yt| ?
?
12??n? ?+ ? x ?t t + 12 ?n+ 12?n (8.21)
?t?[T ] t?[T ]?
| ?+t ?Y +t ? ?t ? xt| ? ?nT . (8.22)
t?[T ]
176
We will use the properties of RRS to prove Eq. (8.22). Proof of Eq. (8.21) is
similar to Agrawal and Devanur [11], while proof of Eq. (8.20) follows immediately
from the setup of the model.?U?sing the parts (8.20) and (8.22) we can now find an
appropriate upper bound on t?[T ] ?
+
t ? xt and using this upper bound, we prove
Lemma 26.
Proof of Eq. (8.20). Recall that rt = ?tYt. Note that, E[?tYt] = ?Yt when the
expectation is taken over just the independent samples of ?. By Theorem 15, with
probability 1? e??(?) we have:
? ? ( )
| ? ? | ? 1
?
rt ? Yt 3nT Rad( ? ?Yt , nTnTt?T t?T ?t?T )
? 13nT Rad( ?
+
t ?Yt , nTnT
t?T )
? 1
?
3nT Rad ?+t ? xt , nT .nT
t?T
The last inequality is because Yt is a feasible solution to LPALG.
Proof of Eq. (8.21). For this part, the arguments similar to Agrawal and Devanur
[11] follow with some minor adaptations. For sake of completeness we describe the
full proof. Note that we have,
? ?n ?
| ? ?Yt ? ?+t ?Yt| ? | ?(a)Yt(a)? ?+t (a)Yt(a)|.
t?T a=1 t?T
Now, using Lemma 25 in Appendix, we have that with probability 1?nTe??(?)
? ?
| ?(a)Yt(a)? ?+t (a)Yt(a)| ? 12 Rad(?(a), Nt(a) + 1).
t?T t?T
177
Hence, we have
?n ? ?NT?(a)+1
| ?(a)Yt(a)? ?+t (a)Yt(a)| = 12 Rad(?(a), r)
a=1 t?T a??A r=1
? 12 (NT (a) + 1) Rad(?(a), NT (a) + 1)
?a?A
? 12 ?n (? ? (NT + 1)) + 12?n.
The last inequality is from the defini?tion of Rad function and using the Cauchy-
Swartz inequality. Note that ?NT = t?T ? ?Yt. Also, since we have with proba-
bility 1? e??(?), ?(a) ? ?+t (a), we have?,? ?? (? ) ?
12 ?n (? ? (NT + 1)) + 12?n ? 12??n ?+t ?Yt + 12 ?n+ 12?n.
t?T
Finally note that Yt is a feasible solution to the semi-bandit polytope P .
Hence, we have that
?+t ?Y +t ? ?t ? xt.
He?nce,??? (? ) ?? (? ? ? ) ?12 ?n ?+t ?Yt + 12 ?n+ 12?n ? 12??n ?+t ? xt + 12 ?n+ 12?n.
t?T t?T
Proof of Eq. (8.22): Recall that for each round t, the UCB vector ?+t is determined
by the random variables Gt?1 = {Yt? : ?t? < t}. Further, conditional on a realization
of Gt?1, the random variables {Yt(a) : a ? A} are negatively correlated from the
property of RRS. Let ?? +t(a) := ?t (a)Yt(a), a ? A. Note that we have E[??t(a)|Gt?1] =
?+t (a)xt(a). Define
1 + ?+t (a)Yt(a)? ?+t (a)xt(a)?t(a) := .
2
178
From Claim 5, we have that {?t(a) : a ? A} conditioned on Gt?1 satisfy (8.2).
Further, E[?t(a)|G ] = 1t?1 . Therefore, the family {?t(a) : t ? [T ], a ? A} satisfies2
the assumptions in Theorem 14 and hence satisfies Eq. (8.14) for some absolute
constant c. Plug?ging back the ??t(a)?s, we obtain an upp?er-tail concentration bound:
1 ??
Pr? ( ??t(a)? ?+t (a)xt(a)) ? ? ? ? c ? e?2nT?2 .nT
t?[T ] a?A
To obtain a corresponding concentration bound for the lower tail, we apply a
similar argument to
? 1 + ?
+
t (a)xt(a)? ??t(a)?t(a) = .2
Once again from Claim 5, we have that {? ?t(a) : a ? A} conditioned on Gt?1 satisfy
(8.2). The family {? ?t(a) : t ? [T ], a ? A} satisfies the assumptions in Theorem 14
and hence satisfies Eq. (8.14). Plugging back the ??t(a)?s, we obtain a lower-tail
concentration bo?und:? ?1 ?? + 2Pr ( ?t (a)xt(a)? ??t(a)) ? ? ? ? c ? e?2nT? .nT
t?[T ] a?A
Com?bining these two we have,
Pr? ??
?
1 | 2?+t (a)Y +t(a)? ?t (a)xt(a)| ? ? ? ? 2 c ? e?2nT? . (8.23)nT
t?[T ] a?A ?
Hence setting ? = ? , we obtain Eq. (8.22) with probability at least 1 ?
nT
e??(?). ??
Combining Eq. (8.20), (8.21) and (8.22) Let us denote H := +t?[T ] ?t ? xt.
Adding the three equations we get
? ? ? ?
| r ?H2t | ? ?H + ? + ?nH +O(?n) + ?nT (8.24)
t?[T ]
179
Rearranging and solving for H, we have
?? ?
H ? rt +O( ?n) + (?nT )1/4
t?[T ]
Plugging this back into Eq. (8.24), we get Lemma 26.
8.6.4.2 ?Clean event? for resource consumption
We define a similar ?clean event? for consumption of each resource j. By a
slight abuse of notation, for each round t let Ct(j) = (Ct(a, j) : a ? A) be the vector
of realized consumption of resource j. Let ?t(j) denote algorithm?s consumption for
resource j at round t (i.e., ?t(j) = Ct(j) ?Yt).
Lemma 27. Consider SemiBwK without stopping. Then with probability at least
1? nT e??(?):
? ? ? ?
?j ? [d] | ?t(j)? C?t (j) ? xt| ? ?nB + ?n+ ?nT .
t?[T ] t?[T ]
Proof. The proof is similar to Lemma 26. We will split the proof into following three
equations. Fix an arbitrary resource j ? [d]. With probability at least 1?nTe??(?)
the following holds:
? ? ( ? )
| ?t(j)?
1
C(j) ?Yt| ? 3nT Rad C(j) ?Yt , nT . (8.25)
nT
t?T t?T t?T
? ??? (? ) ?
| C(j) ?Yt ?C?t (j) ?Yt| ? 12??n C(j) ?Yt + 12 ?n+ 12?n. (8.26)
t?T t?T
180
? ?
| C?t (j) ?Yt ?C?t (j) ? xt| ? ?nT . (8.27)
t?T
?? Using the parts (8.25), (8.26) and (8.27) we can find an upper bound on
t?T Ct(j) ?Yt. Hence, combining equations (8.25), (8.26) and (8.27) with this
bound and taking an Union Bound over all the resources, we get Lemma 27.
Proof of Eq. (8.25). We have that {Ct(a, j) : a ? A} is a set of independent random
variables over a probability spacee C?. Note that, EC Ct(a, j)Yt(a) = C(a, j)Yt(a).?
Hence, we can invoke Theorem 15 on independent random variables to get with
probability 1? nTe??(?)
? ? ( 1 ? )| ?t(j)? C(j) ?Yt| ? 3nT Rad C(j) ?Yt , nT .
nT
t?T t?T t?T
Proof of Eq. (8.26). This is very similar to proof of (8.21) and we will skip the
repetitive parts. Hence, we have with probability 1? nTe??(?)
? ?
| C(j) ?Yt ?C?t (j) ?Yt| ? 12??n(C(j) ? (NT + 1)) + 12?nt?T ?? (? ) ?
? 12??n C(j) ?Yt + 12 ?n+ 12?n.
t?T
Proof of Eq. (8.27). Recall that for each round t and each resource j, the LCB
vector C?t (j) is determined by the random variables Gt?1 = {Yt? : ?t? < t}. Similar
to the proof of Eq. (8.22), random variables {Yt(a) : a ? A} obtained from the RRS
are negatively correlated given Gt?1. As before define ?? (a) = C?t t (a)Yt(a), a ? A.
We have that E[?t(a) | Gt?1] = C?t (a)xt(a).
181
By Claim 5, random variables
1 + ??t(a)? C?t (a)xt(a)?t(a) = , a ? A
2
satisfy (8.2), given Gt?1. We conclude that family {?t(a) : t ? [T ], a ? A} satisfies
the assumptions in Theorem 14, and therefore satisfies Eq. (8.14) for some absolute
constant c. The?refore, we obtain an upper-tail concentr?ation bound for ??t(a)?s:
1 ?? 2
Pr? ( ??t(a)? C?t (a)xt(a)) ? ? ? ? c ? e?2nT? .nT
t?[T ] a?A
To obtain a corresponding concentration bound for the lower tail, we apply a
similar argument to
? 1 + C
?
t (a)xt(a)? ??t(a)?t(a) = .2
Once again, invoking Claim 5 we have that {? ?t(a) : a ? A} conditioned on Gt?1
satisfy (8.2). Thus, family {?t(a) : t ? [T ], a ? A} satisfies the assumptions in
Theorem 14, an?d therefore satisfies Eq. (8.14). We obta?in:? 1 ?? 2Pr ( C?t (a)xt(a)? ?? ? ?2nT?t(a)) ? ? ? c ? e .nT
t?[T ] a?A
Com?bing the two tails we have,? ?1 ??Pr | C? ?t (a)Yt(a)? Ct (a)xt(a)| ? ? ? ? 2 c ? e?2nT?2 . (8.28)nT
t?[T ] a?A ?
Once again, setting ? = ? , we obtain Eq. (8.27) with probability at least
nT
1? e??(?). ??
Proof of Lemma 27. Denote G = t?T C?(j) ?Yt. From Equation (8.25), (8.26)? ?
and (8.27), we have that G2? ?2?( ?n)G ?
?
t?T Ct (j) ?xt+O(?n)+ ?nT . Note
? ?
that C?t?T t (j) ?xt ? B. Hence, G2?2?( ?n)G ? B +O(?n) + ?nT . Hence,
182
? ?
re-arranging this gives us G ? B + O( ?n) + (?nT )1/4. Plugging this back in
Equations (8.25), (8.26) and (8.27), we get Lemma 27.
8.6.5 Putting it all together
Similar to Agrawal and Devanur [11], we will handle the hard constraint on
budget, by choosing an appropriate value of . We then combine the above Lemma
on ?rewards? clean event to compare the reward of the algorithm with that of the
optimal value of LP to obtain the regret bound in Theorem 13. Additionally, we
use the Lemma on ?consumption? clean event to argue that the algorithm doesn?t
exhaust the resource budget before round T . Formally, consider the following.
Recall that from Lemma 24, we have OP 1?TALG, ? (1?) OPTDP. Let us defineT
the performance of the algorithm as ALG = t?T rt. From Lemma 26, that with
probability at least 1? ndT e??(?)
? ?
ALG ? (1? ) OPTDP?O(??nALG)?O(?n)? ?nT?
? (1? ) OPTDP?O( ?nOPTDP)?O(?n)? ?nT (since ALG ? OPTDP).
? ?
Choosing  = ?n + ?n + ?nT and using the assumption that B > 3(?n +
B B B
?
?nT ), we derive Eq. (8.9). For any given ?, we set ? = ?(log(ndT )) to obtain a
?
success probability of at least 1? ?.
Now we will argue that the algorithm does not exhaust the resource budget
before round T with probability at least 1?ndT e??(?). Note that for every resource
183
j ? [d], ?
C?t (j) ? xt ? (1? )B.
t?T ?
Hence, combining this with Lemma 27, we have t?T Ct(j) Yt ? (1? )B + B ? B.
8.7 Applications and special cases
Let us discuss some notable examples of SemiBwK (which generalize some of
the numerous applications listed in Badanidiyuru et al. [33]). Our results for these
examples improve exponentially over a naive application of the BwK framework.
Compared to what can be derived from Agrawal and Devanur [11, 12], our results
feature a substantially better dependence on parameters, a much better per-round
running time, and apply to a wider range of parameters. However, we leave open
the possibility that the regret bounds can be improved for some special cases.
8.7.1 Dynamic pricing.
The dynamic pricing application is as follows. The algorithm has d products
on sale with limited supply: for simplicity, B units of each. Following Besbes and
Zeevi [39], we allow supply constraints across products, e.g., a ?gadget? that goes
into multiple products. In each round t, an agent arrives (who can buy any subset
of the products), the algorithm chooses a vector of prices p ? [0, 1]dt to offer the
agent, and the agent chooses what to buy at these prices. For simplicity, the agent
is interested in buying (or is only allowed to buy) at most one item of each product.
The agent has a valuation vector over products, so that the agent buys a given
184
product if and only if her valuation for this product is at least as high as the offered
price. The entire valuation vector is drawn as an independent sample from a fixed
and unknown distribution (but valuations may be correlated across products). The
algorithm maximizes the total revenue from sales.
To side-step discretization issues, we assume that prices are restricted to a
known finite subset S ? [0, 1]. Achieving general regret bounds without such re-
striction appears beyond reach of the current techniques for BwK.10
To model it as a SemiBwK problem, the set of atoms is all price-product pairs.
The combinatorial constraint is that at most one price is chosen for each product.
(If an action does not specify a price for some product, the default price is used.)
This is a ?partition matroid? constraint. Rewards correspond to revenue from sales,
and resources correspond to?inventory c?onstraints.
We obtain regret O?(d dB|S| + T |S|) using Corollary 5, whenever B >
?
??(n + nT ). This is because OPTDP ? dB, since that is the maximum number of
products available, and the number of atoms is n = d|S|.
For comparison, results of Agrawal and Devanur [11, 12] apply only when
? ?
B > nT 3/4, and yield regret bound of O?(d3|S|2 T ).11 Thus, our regret bounds
feature a better dependence on the number of allowed prices |S| (which can be
10Prior work on dynamic pricing with limited supply (e.g., Babaioff et al. [28], Badanidiyuru
et al. [33], Besbes and Zeevi [38]) achieves regret bounds without restricting itself to a particular
finite set of prices, but only for a simple special case of (essentially) a single product.
11We obtain this by plugging in OPTDP ? dB and n = d|S| into their regret bound. For dynamic
pricing the total per-resource consumption is bounded by 1, so we can apply their results without
rescaling the consumption.
185
very large) and the number of products d. Further, our regret bounds hold in a
meaningful way for the much larger range of values for budget B.
For a naive application of the BwK framework, arms correspond to every
possible realization of prices for the d products. Thus, we have |S|d arms, with a
corresponding exponential blow-up in regret.
8.7.2 Dynamic assortment.
The dynamic assortment problem is similar to dynamic pricing in that the
algorithm is selling d products to an agent, with a limited inventory B of each prod-
uct, and is interested in maximizing the total revenue from sales. As before, agents
can have arbitrary valuation vectors, drawn from a fixed but unknown distribution.
However, the algorithm chooses which products to offer, whereas all prices are fixed
externally. There is a large number of products to choose from, and any subset of
k  d of them can be offered in any given round.
To model this as SemiBwK, atoms correspond to products, and actions corre-
spond to subsets of at most k atoms. The combinatorial constraint forms a partition
matroid. Rewards correspond to sales, and resources correspond to products, as in
?
dynamic pricing. Since OPTDP ? min(dB, kT ), Corollary 5 yields regret O?(k dT )
? ?
when B > ?(T ), and regret O?(d dB + dT ) in general.
In a naive application of BwK, arms are subsets of k products. Hence, we have
O(dk) arms. The other parameters of the problem remain the same. This leads to
?
regret bound O?(d Bdk), with an exponential dependence on k.
186
8.7.3 Repeated auctions.
Consider a repeated auction with adjustable parameters, e.g., repeated second-
price auction with reserve price that can be adjusted from one round to another.
While prior work (Badanidiyuru et al. [33], Cesa-Bianchi et al. [55]) concerned run-
ning one repeated auction, we generalize this scenario to multiple repeated auctions
with shared inventory (e.g., the same inventory may be sold via multiple channels
to different audiences).
More formally, the auctioneer is running r simultaneous repeated auctions
to sell a shared inventory of d products, with limited supply B of each product
(e.g., different auctions can cater to different audiences). Each auction has a pa-
rameter which the algorithm can adjust over time. We assume that this parameter
comes from a finite domain S ? [0, 1]. For simplicity, assume the auctions are syn-
chronized with one another. As in prior work, we assume that in every round of
each auction a fresh set of participants arrives, sampled independently from a fixed
joint distribution, and only a minimal feedback is observed: the products sold and
the combined revenue.
Following prior work (Badanidiyuru et al. [33], Cesa-Bianchi et al. [55]), we
only assume minimal feedback: for each auction, what were the products sold and
what was the combined revenue from this auction. In particular, we do not as-
sume that the algorithm has access to participants? bids. Not using participants?
bids is desirable for privacy considerations, and in order to reduce the participants?
incentives to game the learning algorithm.
187
To model this problem as SemiBwK, atoms are all auction-parameter pairs.
The combinatorial constraint is that an action must specify at most one parameter
value for each auction. This corresponds to partition matroid constraint. There
is a ?default parameter? for each auction, in case an action does not specify the
parameter. We have a resource for each product being auctioned. For simplicity,
each product has supply of B. Note that OPTDP?? dB and?number of atoms is
n = r|S|. Hence, our main result yields regret O?(d r|S|B + r|S|T ).
A naive application of the BwK framework would have arms that correspond
to all possible combinations of parameters, for the total of O(|S|r) arms. Again, we
have an exponential blow-up in regret. Alternatively, one may try running r separate
instances of BwK, one for each auction, but that may result result in budgets being
violated since the items are shared across the auctions and it is unclear a priori how
much of each item will be sold in each auction.
One can also consider a ?flipped? version of the previous example, where the
algorithm is a bidder rather than the auction maker. The bidder participates in r
repeated auctions, e.g., ad auctions for different keywords. We assume a stationary
environment: bidder?s utility from a given bid in a given round of a given auction
is an independent sample from a fixed but unknown distribution. The only limited
resource here is the bidder?s budget B. Bids are constrained to lie in a finite subset
S.
To model this as SemiBwK, atoms correspond to the auction-bid pairs. The
combinatorial constraint is that each action must specify at most one bid for each
auction. (There is a ?default bid? for each auction in case an action does not specify
188
Figure 8.1: Dynamic Assortment (left) and Dynamic Pricing (right) experiments for
n = 26.
the bid for this auction.) There is exactly one resource, which is money and the
total budget is B. Note that?the numbe?r of atoms is n = r|S|. Hence, our main
result yields regret O?(OPTDP r|S|/B + r|S|T ).
A naive application of BwK would have arms that correspond to all possible
combinations of bids, for the total of O(|S|r) arms; so we have an exponential blow-
up in regret.
8.8 Numerical Simulations
We ran some experiments on simulated datasets in order to compare our al-
gorithm, SemiBwK-RRS, with some prior work that can be used to solve SemiBwK:
? the primal-dual algorithm for BwK from Badanidiyuru et al. [33], denoted
pdBwK.
? an algorithm for combinatorial semi-bandits with a matroid constraint: ?Op-
timistic Matroid Maximization? from Kveton et al. [124], denoted OMM.
189
Figure 8.2: Experimental Results for Uniform matroid (left plots) and Partition
matroid (right plots) on independent (upper) and correlated (lower) instances for
n = 26.
? the linear-contextual BwK algorithm from Agrawal and Devanur [12], dis-
cussed in the Introduction, denoted linCBwK.
To speed up the computation in linCBwK, we used a heuristic modification
suggested by the authors in a private communication. This modification did not
substantially affect average rewards in our preliminary experiments. We also made
a heuristic improvement to our algorithm, setting  = 0 and ? = 5. We use the
same value of ? for the pdBwK algorithm as well.
Problem instances. We did not attempt to comprehensively cover the huge va-
riety of problem instances in SemiBwK. Instead, we focus on two representative
applications from Section 8.7.
190
Figure 8.3: Experimental Results for Uniform matroid (left plots) and partition
matroid (right plots) on independent (upper) and correlated (lower) instances for
n = 6.
The first experiment is on dynamic assortment. We have n products, and for
each product i there is an atom i and a resource i. The (fixed) price for each product
is generated as an independent sample from U[0,1], a uniform distribution on [0, 1].
At each round, we sample the buyers?s valuation from U[0,1], independently for each
product. If the valuation for a given product is greater than the price, one item of
this product is sold (and then the reward for atom i is the price, and consumption
of resource i is 1). Else, we set reward for atom i and consumption for resource i to
be 0.
The second experiment is on dynamic pricing with two products. We have n/2
allowed prices, uniformly spaced in the [0, 1] interval. Recall that atoms correspond
191
to price-product pairs, for the total of n atoms. In each round t, the valuation
vt,i for each product i is chosen independently from a normal distribution N (v0i , 1)
truncated on [0, 1]. The mean valuation v0i is drawn (once for all rounds) from U[0,1].
If vt,i is greater than the offered price p, one item of this product is sold. Then
reward for the corresponding atom (p, i) is the price p, and consumption of product
i is 1. If there is no sale for this product, the reward and consumption for each atom
(p, i) is set to 0.
The third experiment is a modification of the dynamic assortment example, in
which we ensure that even non-action (e.g., no sale) exhausts resources other than
time. As in dynamic assortment, we have n products, and for each product i there
is an atom i and a resource i. The (fixed) price for each product is generated as
an independent sample from U[0,1], a uniform distribution on [0, 1]. At each round,
we sample the buyers?s valuation from U[0,1], independently for each product. If the
valuation for a given product is greater than the price, one item of this product is
sold (and then the reward for atom i is the price, and consumption of resource i
is 1). Else, we do something different from dynamic assortment: we set reward for
atom i and consumption for resource i to be the buyer?s valuation.
The fourth experiment is a similar modification of the dynamic pricing exam-
ple. We have n/2 allowed prices, uniformly spaced in the [0, 1] interval. Recall that
atoms correspond to price-product pairs, for the total of n atoms. In each round t,
the valuation vt,i for each product i is chosen independently from a normal distri-
bution N (v0, 1) truncated on [0, 1]. The mean valuation v0i i is drawn (once for all
rounds) from U[0,1]. If the valuation for a given product i is greater than the offered
192
price p, one item of this product is sold (and then reward for the corresponding
atom (p, i) is the price, and consumption of product i is 1). If there is no sale for
this product, we do something different from dynamic pricing. For each atom (p, i),
if p < vt,i then the reward for atom (p, i) is drawn independently from U[0,1] and
resource consumption is 1; else, reward is 0 and consumption is .3. While dynamic
assortment is modeled with a uniform matroid, and dynamic pricing is modeled with
a partition matroid, we tried both matroids on each family.
Experimental setup and results. We choose various values of n, B and T
and run our algorithms on the above two datasets assuming both a uniform ma-
troid constraint and a partition matroid constraint. We choose n ? {6, 26}, T ?
{1000, 2000, 3000, 4000, 5000, 6000} and B = T/2. The maximum number of atoms
in any action is set to K = 2. For a given algorithm, dataset and configuration of
n and T , we simulate each algorithm for 20 independent runs and take the average.
We calculate the total reward obtained by the algorithm at the end of T time-steps.
Figure 8.1 shows results for the first two experiments. Figures 8.2 and 8.3
show the results on the third and fourth experiments. Our algorithm achieves the
best regret among the competitors. As a benchmark, we included the performance
of the fractional allocation in LPOPT , denoted OPTDP DP.
Additional experiment. linCBwK and pdBwK have running times proportional to
the number of actions. We ran an additional experiment which compared per-step
running times. We first calculate the average running time for every 10 steps and
take the median of 50 such runs. For both Uniform matroid and Partition matroid,
193
Figure 8.4: Variation of per-step running times as n increases for the various algo-
rithms.
we run the faster RRS due to Gandhi et al. [92]. See Figure 8.4 for results.
Details of heuristic implementation of linCBwK. We now briefly describe the
heuristic we use to simulate the linCBwK algorithm. Note that even though the per-
time-step running time of linCBwK is reasonable, it takes a significant time when we
want to perform simulations for many time-steps. The time-consuming step in the
linCBwK algorithm is the solution to a convex program for computing the optimistic
estimates (namely ??t and W?t). Hence, the heuristic gives a faster way to obtain
this estimate. We sample multiple times from a multi-variate Gaussian with mean
?? and covariance Mt (to obtain estimate ??t) and with mean w?tj and covariance Mt
(to obtain estimate w?tj for each resource j). We use these samples to compute the
objective to choose the action at time-step t. For each sample, we compute the best
action based on the objective in linCBwK. We finally choose the action that occurs
majority number of times in these samples. The number of samples we choose is set
to 30.
Language Details of algorithms. All algorithms except linCBwK were imple-
194
mented in Python. The linCBwK algorithm was implemented in MATLAB. This
difference is crucial when we compare running times since language construct can
speed-up or slow down algorithms in practice. However, it is known that 12 for
matrix operations commonly encountered in engineering and statistics, MATLAB
implementations runs several orders of magnitude faster than the corresponding
python implementation. Since linCBwK is the slowest of the four algorithms, our
comparison of running times across languages is justified.
8.9 Conclusion
In this chapter we consider an extension of the BwK problem, namely, stochas-
tic combinatorial semi-bandits with knapsacks under matroid constraints and gave a
specialized algorithm with optimal bounds. Moreover, we also show applications of
SemiBwK and run numerical experiments on simulated data to support our theoreti-
cal results. The results in this chapter work for a specialized (yet capture all known
applications) action set and obtain stronger bounds (essentially optimal) than those
in Section 7.9 in Chapter 7.
12https://www.mathworks.com/products/matlab/matlab-vs-python.html
195
196
Chapter 9: Future Directions
In this chapter we conclude with future research directions that stem out of
the problems considered in this thesis. We consider both future directions in the
models of SDM considered in this thesis as well as other problems beyond SDM
that are motivated by problems in this thesis. We also believe that the work in this
thesis can be applied to a wider range of applications and exploring them is a future
direction.
Online Matching
The most direct open problem in online matching is to close the gap between
the upper and lower-bound for the Online Weighted Matching problem in the known
distribution model considered in Chapter 3. We provide an algorithm with a com-
petitive ratio of 0.7 while the known lower-bound is 0.823 given by Manshadi et al.
[137] which also applies to unweighted graphs. Thus, we want to close this gap by
either bringing the upper-bound towards the lower-bound or vice versa (or possibly
both).
The other direction is to consider stochastic arrival models that relax the no-
tion of independence across time-steps which is often too simplifying. In many prac-
tical scenarios, this model of stochastic arrival doesn?t exactly capture the reality of
197
the underlying data. The input often resembles a stochastic process with some pre-
dictable correlations across time-steps (e.g., rider request pattern depends on local
events). In the context of optimization in matching markets and specifically, online
matching, results based on correlated data arrival have not been studied. Thus,
it is interesting to formulate and study online matching and its variants under no-
tions of correlation that are both theoretically challenging and practically relevant.
Candidate models include notions of limited dependence such as Markov chains.
Such models are especially useful in applications where the input is received as a
result of human decisions. The other direction is to consider corrupted inputs, that
are frequently encountered in many applications. Abstractly, this can be modeled
as input sequences that are close to an i.i.d. sequence with adversarial perturba-
tions. In other words, we want algorithms whose performance smoothly interpolate
between the stochastic and the adversarial models. Prior work (Esfandiari et al.
[81]) has proposed one interesting model along these lines for a particular variant
of the unweighted online matching problem. However, the question of interpolation
is not yet fully understood and thus, it will be interesting to study this for the vast
landscape of online matching problems.
From a theoretical stand-point one interesting open question is to obtain a
competitive ratio bound that holds with high-probability (i.e., bound the variance
of the algorithm?s performance). For the online weighted matching problem the
performance guarantees of both the algorithms proposed in Chapter 3 and the one
proposed in Haeupler et al. [102] only hold in expectation. Thus, it will be interesting
to understand if similar ratios can be proved that hold with high-probability. For
198
the unweighted version of the problem, such high-probability bounds are known
(e.g., Feldman et al. [83]).
Bandits with Knapsacks
The work on adversarial BwK (Chapter 7) directly leads to a few open ques-
tions and directions. In particular, it would be illuminating to have a single algo-
rithm that has good performance simultaneously in the stochastic and the adversar-
ial setting without knowing a-priori, the environment the algorithm is operating in
(so called ?best-of-both-worlds algorithm?). Our work almost achieves this, where
we develop a common framework to solve both the settings but technical conditions
prevent us from using it without knowing the environment. A closely related but
distinct question is to obtain instance-specific bounds for the stochastic version of
the problem. In particular, the goal is to derive bounds that improve on the easy
instances of the problem while still achieving the worst-case bounds in the hard
regimes. Another interesting direction is to consider the Stochastic BwK problem
with limited dependence, where the environment is stochastic (but not independent)
while the rewards/consumptions across times have weak correlation. Limited de-
pendence and multi-armed bandits have been previously studied and is supported
by numerous motivating examples. However, in many of these applications we also
have limited inventory and thus, the ideal approach would be to use variants of BwK
in such applications.
A broader direction stemming out of our work is to investigate the interplay
of learning and game-theory in the context of time-varying zero-sum games. In a
199
time-varying zero-sum game, the players play different zero-sum games repeatedly
and the goal for each player is to maximize their total payoff. When both players
play optimally, the strategy pair is said to be in equilibrium. Classical results in
online learning show that MAB algorithms played against each other, approximately
converge to the equilibrium. However, the results are limited to the settings when the
game remains the same across time or the payoff functions across time are sampled
independently and identically from the same distribution. Thus, it is interesting
to understand the behavior of MAB algorithms when no assumptions about the
repeated zero-sum games are made. A very recent work by Cardoso et al. [53] make
some progress in this direction.
Beyond SDM and Related Problems
The central thesis in our work was to explore SDM problems with information
constraints. However, there are many other problems that are closely related that
have similar constraints. Consider the BwK problem. The key challenge for the
algorithm was to perform counterfactual inference (i.e., answering what if for arms
that were not chosen) in the face of limited samples. In sequential experiment design
literature, there are multiple approaches to studying this challenge, with bandit-
based approach being one of them. An alternate method to do this is to pose this as
a problem in causal inference and study the effects of limited samples on inference
algorithms. This question is not well understood; only a handful of prior works
(e.g., Ghoshal and Honorio [93], Sankararaman et al. [164, 165]) have considered
this question and it would be interesting to understand the limits of finite samples
200
for counterfactual reasoning. More broadly, understanding the trade-off between
limited samples and performance of online algorithms with stochastic input is an
interesting research direction (i.e., the algorithm only has access to samples from a
distribution as opposed to the known distribution assumption).
201
Part III
Appendix
202
Appendix: Standard Tools
10.1 Concentration Inequalities
Lemma 28 (Azuma-Hoeffding inequality ([148])). Let Y1, Y2, . . . , YT be a martingale
difference sequence ( i.e., E[Yt | Y?1, Y2 , . . . , Yt?1] = 0). Suppose |Yt| ? c for all
t ? {1, 2, . . . , T}. Let R0,?(T ) := 2Tc2 ln(1/?). Then for every ? > 0,
[? ]
Pr t?[T ] Yt > R0,?(T ) ? ?.
Lemma 29 (Chernoff-Hoeffding bounds ([148])). Let X1, X2, . . . , XT be a sequence
of independent random variables such that |Xt| ? c for all t ? {1, 2, . . . , T}. Let
Zt := E[Xt]. Then for every ? > 0,[???? ?? ?(? ) ]Pr t?[T ] Xt ? Zt? > 3 t?[T ] Zt c2 ln(1/?) ? ?.
10.2 Adversarial Online Learning
Let us revisit adversarial online learning, as per Figure 7.2. Denote the bench-
mark in Eq. (7.4) as
?
OPTAOL(T ) := maxa?A t?[T ] ft(a).
Recall that [bmin, bmax] is the payoff range, and denote ? = bmax ? bmin.
203
Lemma 30. Suppose an algorithm for adversarial online learning satisfies Eq. (7.4)
for some ? > 0. Then
[ ? ( ? ) ]
Pr ?? ? [T ] OPTAOL(?) ? t?[? ] ft ? pt ? ? ? R?/T (T ) + 2T log(T/?) ? 1? 2?.
(10.1)
Proof. Let us use the stronger regret bound Eq. (7.5) implied by Eq. (7.4). Note
that
E[ft(at) | a1, a2 , . . . , at?1] = ft ? pt.
Applying the Azuma-Hoeffding inequality for each ? ? [T ], and taking a union
bound, we have
[ ? ? ? ]
Pr ?? ? [T ] t?[? ] ft(at)? t?[? ] ft ? pt ? ? ? 2T log(T/?) ? 1? ?. (10.2)
Taking a union bound over Eq. (10.2) and Eq. (7.5) and adding the equations we
get Eq. (10.1).
Remark 12. For Hedge algorithm, regret bound Eq. (10.1) is already proved in [88].
10.3 Lagrangians
10.3.1 Proof of Lemma 12
Assume one of the resources is the dummy resource, and one of the arms is
the null arm. Consider the linear program LPM,B,T , for some outcome matrix M.
Let L = LM,B,T denote the Lagrange function.
Lemma 31 (Lemma 12, restated). Suppose (X?, ??) is a mixed Nash equilibrium
for the Lagrangian game. Then X? is an optimal solution for the linear program
204
(7.8). Moreover, the minimax value of the Lagrangian game equals the LP value:
L(X?, ??) = OPTLP.
In what follows we prove Lemma 31. Writing out the definition of the mixed
Nash equilibrium,
L(X?, ?) ? L(X?, ??) ? L(X, ??) ?X ? ?K , ? ? ?d. (10.3)
? ?
For brevity, denote r(X?) = a?[K] X
?(a) r(a) and ci(X
?) = ?a?[K] X (a) ci(a).
We first state and prove the complementary slackness condition for the Nash
equilibrium.
Claim 6. For every resource i ? [d] we have,
(a) 1? T c ?
B i
(X ) ? 0,
(b) ?? > 0 =? 1? Ti c (X?) = 0.B i
Proof. Part (a). For contradiction, consider resource i that minimizes the left-hand
side in (a), and assume that the said left-hand side is strictly negative. We have
two cases: either ??i < 1 or ?
?
i = 1. When ?
?
i < 1, consider another distribution
?? ? ?d such that ??i = 1 and ??i? = 0 for every i? 6= i. Note that we have, L(X?, ??) <
L(X?, ??). This contradicts the first inequality in Eq. (10.3).
Consider the second case, when ??i = 1. Then L(X?, ??) = r(X?)+1? T ci(X?).B
Consider any arm a ? [K] such that X?(a) 6= 0. Let X? ? ?K be another distribution
such that X?(a) := 0 and X?(null) := X?(null) +X?(a) and X?(a?) = X?(a?) for every
a? 6? {a, null}. Note that X?(null) ? 1. Since (X?, ??) is a Nash equilibrium, we
205
have that L(X?, ??) ? L(X?, ??). This implies that ?X?(a)r(a) +X?(a) T ci(a) ? 0.B
Re-arranging we obtain, T ci(a) ? r(a) ? 1. Thus, we have 1? T c (a) ? 0.B B i
Since this holds for every a ? [K] with X?(a) =6 0, we obtain a contradiction:
1? T c (X?
? ? ( )) = T
B i a?[K] X (a) 1? ci(a) ? 0.B
Part (b). For contradiction, assume the statement is false for some resource i.
Then, by part (a), ?? > 0 and 1? T c ?i i(X ) > 0, and consequently L(X?, ??) > r(X?).B
Now, consider distribution ?? which puts probability 1 on the dummy resource.
We then have L(X?, ??) = r(X?) < L(X?, ??), contradicting the first inequality
in Eq. (10.3).
Let X? be some feasible solution for the linear program (7.8). Plugging the
feasibility constraints into the definition of the Lagrangian function, L(X?, ??) ?
r(X?). Claim 6(a) implies that X? is a feasible solution to the linear program (7.8).
Claim 6(b) implies that L(X?, ??) = r(X?). Thus,
r(X?) = L(X?, ??) ? L(X?, ??) ? r(X?).
So, X? is an optimal solution to the LP. In particular, OPT ? ? ?LP = r(X ) = L(X , ? ).
10.3.2 Regret minimization in games
10.3.3 Proof of Lemma 11
?
Let W = 2T log(T/?) denote the term from Lemma 30 in what follows.
We now prove Lemma 11, similar to the proof in [87] for the deterministic
game. Recall that we take averages up to some fixed round ? ? [T ]. We prove that
206
the following two inequalities hold, each with probability at least 1? ?.
1 ? T ? R1, ?/T (T ) + 2Wpt,1 Gt pt,2 ? v ? ? ? . (10.4)? ?
t
1 ??[? ] R2, ?/T (T ) + 2W
pTt,1 G p
T
t t,2 ? p1 G p2 + ? ? ?p2 ? ?A2 . (10.5)? ?
t?[? ]
Eq. (7.7) in Lemma 11 follows by adding Eq. (10.4) and Eq. (10.5).
First we prove Eq. (10.4). Following the set of inequalities in Section 2.5 of
[87] we have,
1 ? ?T ? 1 ?T R1, ?/T (T ) +Wpt,1Gtpt,2 whp p1 Gt pt,2 ? ? ? From Lemma 30? ? ?
t?[? ] t??[? ]1 ?T R1, ?/T (T ) + 2W?whp p1 G pt,2 ? ? ? From Lemma 28? ?
t?[? ]
1 ? T R1, ?/T (T ) + 2W= max p1 G pt,2 ? ? ? From Definition of p?
? ? ? 1
.
p1 ?A1 t?[? ]
R1, ?/T (T ) + 2W
= max p T1 G p2 ? ? ? From Definition of p2.
p1??A1 ?
T R1, ?/T (T ) + 2W? min max p1 G p2 ? ? ?
p2??A p ??2 1 A1 ?
Here ?whp denotes statements that hold with probability at least 1? ?.
Now let us prove Eq. (10.5). Fix distribution p2 ? ?A2 . Then:
1 ? T ? 1 ? T R2, ?/T (T ) +Wpt,1 Gt pt,2 whp pt,1 Gt p2 + ? ? From Lemma 30? ? ?
t?[? ] t??[? ] R (T ) + 2W
? 1 T 2, ?/Twhp pt,1 G p2 + ? ? From Lemma 28
? ?
t?[? ]
= p T
R2, ?/T (T ) + 2W
1 G p2 + ? ? From Definition of p1.
?
Taking a union bound over all the four high-probability inequalities, we get the
lemma.
207
Bibliography
[1] Jacob D Abernethy and Jun-Kun Wang. On frank-wolfe and equilibrium
computation. In Advances in Neural Information Processing Systems (NIPS),
pages 6584?6593, 2017.
[2] Marek Adamczyk, Fabrizio Grandoni, and Joydeep Mukherjee. Improved ap-
proximation algorithms for stochastic matching. In European Symposium on
Algorithms (ESA), pages 1?12. Springer, 2015.
[3] Marek Adamczyk, Maxim Sviridenko, and Justin Ward. Submodular stochas-
tic probing on matroids. Mathematics of Operations Research, 41(3):1022?
1038, 2016.
[4] Alekh Agarwal, Miroslav Dud??k, Satyen Kale, John Langford, and Robert E.
Schapire. Contextual bandit learning with predictable rewards. In 15th Intl.
Conf. on Artificial Intelligence and Statistics (AISTATS), pages 19?26, 2012.
[5] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and
Robert Schapire. Taming the monster: A fast and simple algorithm for con-
textual bandits. In 31st Intl. Conf. on Machine Learning (ICML), 2014.
208
[6] Alekh Agarwal, Sarah Bird, Markus Cozowicz, Miro Dudik, Luong Hoang,
John Langford, Lihong Li, Dan Melamed, Gal Oshri, Siddhartha Sen, and
Aleksandrs Slivkins. Multiworld testing: A system for experimentation, learn-
ing, and decision-making, 2016. A white paper, available at https://github.
com/Microsoft/mwt-ds/raw/master/images/MWT-WhitePaper.pdf.
[7] Alekh Agarwal, Alina Beygelzimer, Miroslav Dud??k, John Langford, and
Hanna Wallach. A reductions approach to fair classification. Fairness, Ac-
countability, and Transparency in Machine Learning (FATML), 2017.
[8] Gagan Aggarwal, Gagan Goel, Chinmay Karande, and Aranyak Mehta. Online
vertex-weighted bipartite matching and single-bid budgeted allocations. In
Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete
Algorithms, pages 1253?1264. SIAM, 2011.
[9] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong.
Diversifying search results. In Proceedings of the second ACM international
conference on web search and data mining, pages 5?14. ACM, 2009.
[10] Shipra Agrawal and Nikhil R Devanur. Fast algorithms for online stochastic
convex programming. In Proceedings of the twenty-sixth annual ACM-SIAM
symposium on Discrete algorithms, pages 1405?1424. SIAM, 2014.
[11] Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and
convex knapsacks. In 15th ACM Conf. on Economics and Computation (ACM
EC), 2014.
209
[12] Shipra Agrawal and Nikhil R. Devanur. Linear contextual bandits with knap-
sacks. In 29th Advances in Neural Information Processing Systems (NIPS),
2016.
[13] Shipra Agrawal, Zizhuo Wang, and Yinyu Ye. A dynamic near-optimal al-
gorithm for online linear programming. Operations Research, 62(4):876?890,
2014.
[14] Shipra Agrawal, Nikhil R. Devanur, and Lihong Li. An efficient algorithm for
contextual bandits with knapsacks, and an extension to concave objectives.
In 29th Conf. on Learning Theory (COLT), 2016.
[15] Faez Ahmed, John P Dickerson, and Mark Fuge. Diverse weighted bipartite
b-matching. In Proceedings of the 26th International Joint Conference on
Artificial Intelligence, pages 35?41. AAAI Press, 2017.
[16] Saeed Alaei, MohammadTaghi Hajiaghayi, and Vahid Liaghat. Online
prophet-inequality matching with applications to ad allocation. In Proceed-
ings of the 13th ACM Conference on Electronic Commerce, pages 18?35. ACM,
2012.
[17] Saeed Alaei, MohammadTaghi Hajiaghayi, and Vahid Liaghat. The online
stochastic generalized assignment problem. In Approximation, Randomization,
and Combinatorial Optimization. Algorithms and Techniques, pages 11?25.
Springer, 2013.
[18] Noga Alon, Baruch Awerbuch, and Yossi Azar. The online set cover prob-
210
lem. In Proceedings of the thirty-fifth annual ACM symposium on Theory of
computing, pages 100?105. ACM, 2003.
[19] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptoti-
cally efficient allocation rules for the multiarmed bandit problem with multiple
plays-part i: Iid rewards. IEEE Transactions on Automatic Control, 32(11):
968?976, 1987.
[20] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric
embeddings and graph partitioning. Journal of the ACM (JACM), 56(2):5,
2009.
[21] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights
update method: a meta-algorithm and applications. Theory of Computing, 8
(1):121?164, 2012.
[22] Arash Asadpour, Michel X Goemans, Aleksander Madry, Shayan Oveis Gha-
ran, and Amin Saberi. An o (log n/log log n)-approximation algorithm for the
asymmetric traveling salesman problem. In SODA, volume 10, pages 379?389.
SIAM, 2010.
[23] Jean-Yves Audibert, Se?bastien Bubeck, and Ga?bor Lugosi. Minimax poli-
cies for combinatorial prediction games. In 24th Conf. on Learning Theory
(COLT), pages 107?132, 2011.
[24] Peter Auer, Nicolo? Cesa-Bianchi, and Paul Fischer. Finite-time analysis of
the multiarmed bandit problem. Machine Learning, 47(2-3):235?256, 2002.
211
[25] Peter Auer, Nicolo? Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The
nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48?77,
2002. Preliminary version in 36th IEEE FOCS, 1995.
[26] Baruch Awerbuch and Yossi Azar. Buy-at-bulk network design. In Proceedings
38th Annual Symposium on Foundations of Computer Science, pages 542?547.
IEEE, 1997.
[27] Yossi Azar, Niv Buchbinder, TH Hubert Chan, Shahar Chen, Ilan Reuven
Cohen, Anupam Gupta, Zhiyi Huang, Ning Kang, Viswanath Nagarajan, and
Joseph Naor. Online algorithms for covering and packing problems with con-
vex objectives. In 2016 IEEE 57th Annual Symposium on Foundations of
Computer Science (FOCS), pages 148?157. IEEE, 2016.
[28] Moshe Babaioff, Shaddin Dughmi, Robert D. Kleinberg, and Aleksandrs
Slivkins. Dynamic pricing with limited supply. ACM Trans. on Economics
and Computation, 3(1):4, 2015. Special issue for 13th ACM EC, 2012.
[29] Ashwinkumar Badanidiyuru. Buyback problem-approximate matroid inter-
section with cancellation costs. In International Colloquium on Automata,
Languages, and Programming, pages 379?390. Springer, 2011.
[30] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Yaron Singer. Learning
on a budget: posted price mechanisms for online procurement. In 13th ACM
Conf. on Electronic Commerce (EC), pages 128?145, 2012.
[31] Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. Re-
212
sourceful contextual bandits. In 27th Conf. on Learning Theory (COLT),
2014.
[32] Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and An-
dreas Krause. Streaming submodular maximization: Massive data summa-
rization on the fly. In Proceedings of the 20th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 671?680. ACM,
2014.
[33] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Ban-
dits with knapsacks. J. of the ACM, 65(3), 2018. Preliminary version in FOCS
2013.
[34] Nikhil Bansal, Niv Buchbinder, Aleksander Madry, and Joseph Naor. A
polylogarithmic-competitive algorithm for the k-server problem. In 2011 IEEE
52nd Annual Symposium on Foundations of Computer Science, pages 267?276.
IEEE, 2011.
[35] Nikhil Bansal, Nitish Korula, Viswanath Nagarajan, and Aravind Srinivasan.
Solving packing integer programs via randomized rounding with alterations.
Theory of Computing, 8(1):533?565, 2012.
[36] Ahron Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimiza-
tion: analysis, algorithms, and engineering applications, volume 2. Siam, 2001.
[37] Dirk Bergemann and Juuso Va?lima?ki. Bandit Problems. In Steven Durlauf
213
and Larry Blume, editors, The New Palgrave Dictionary of Economics, 2nd
ed. Macmillan Press, 2006.
[38] Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand
function: Risk bounds and near-optimal algorithms. Operations Research, 57:
1407?1420, 2009.
[39] Omar Besbes and Assaf J. Zeevi. Blind network revenue management. Oper-
ations Research, 60(6):1537?1550, 2012.
[40] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E.
Schapire. Contextual bandit algorithms with supervised learning guarantees.
In 14th Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), 2011.
[41] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge
university press, 2004.
[42] Brian Brubach, Karthik Abinav Sankararaman, Aravind Srinivasan, and Pan
Xu. New algorithms, better bounds, and a novel model for online stochastic
matching. In 24th Annual European Symposium on Algorithms (ESA 2016).
Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
[43] Se?bastien Bubeck and Nicolo Cesa-Bianchi. Regret Analysis of Stochastic
and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in
Machine Learning, 5(1), 2012.
[44] Se?bastien Bubeck, Ofer Dekel, Tomer Koren, and Yuval Peres. Bandit convex
214
optimization: \(\sqrt{T}\) regret in one dimension. In 28th Conf. on Learning
Theory (COLT), pages 266?278, 2015.
[45] Se?bastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernel-based methods for
bandit convex optimization. In 49th ACM Symp. on Theory of Computing
(STOC), pages 72?85. ACM, 2017.
[46] Niv Buchbinder and Joseph Naor. The design of competitive online algorithms
via a primal-dual approach. Foundations and Trends in Theoretical Computer
Science, 3(2-3):93?263, 2009.
[47] Niv Buchbinder and Joseph (Seffi) Naor. Online primal-dual algorithms for
covering and packing. Math. Oper. Res., 34(2):270?286, May 2009.
[48] Niv Buchbinder and Joseph Seffi Naor. The design of competitive online algo-
rithms via a primal?dual approach. Foundations and Trends?R in Theoretical
Computer Science, 3(2?3):93?263, 2009.
[49] Niv Buchbinder, Kamal Jain, and Joseph Seffi Naor. Online primal-dual al-
gorithms for maximizing ad-auctions revenue. In European Symposium on
Algorithms, pages 253?264. Springer, 2007.
[50] Niv Buchbinder, Moran Feldman, and Roy Schwartz. Online submodular
maximization with preemption. In Proceedings of the twenty-sixth annual
ACM-SIAM symposium on Discrete algorithms, pages 1202?1216. Society for
Industrial and Applied Mathematics, 2015.
215
[51] Niv Buchbinder, Danny Segev, and Yevgeny Tkach. Online algorithms for
maximum cardinality matching with edge arrivals. Algorithmica, pages 1?19,
2017.
[52] Gruia Calinescu, Chandra Chekuri, Martin Pa?l, and Jan Vondra?k. Maximiz-
ing a monotone submodular function subject to a matroid constraint. SIAM
Journal on Computing, 40(6):1740?1766, 2011.
[53] Adrian Rivera Cardoso, Jacob Abernethy, He Wang, and Huan Xu. Competing
against nash equilibria in adversarially changing zero-sum games. In 36th Intl.
Conf. on Machine Learning (ICML), pages 921?930, 2019.
[54] Nicolo? Cesa-Bianchi and Ga?bor Lugosi. Prediction, learning, and games. Cam-
bridge Univ. Press, 2006.
[55] Nicolo? Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Regret mini-
mization for reserve prices in second-price auctions. In ACM-SIAM Symp. on
Discrete Algorithms (SODA), 2013.
[56] TH Hubert Chan, Fei Chen, Xiaowei Wu, and Zhichao Zhao. Ranking on
arbitrary graphs: Rematch via continuous lp with monotone and boundary
condition constraints. In Proceedings of the twenty-fifth annual ACM-SIAM
symposium on Discrete algorithms, pages 1112?1122. SIAM, 2014.
[57] TH Hubert Chan, Zhiyi Huang, Shaofeng H-C Jiang, Ning Kang, and Zhi-
hao Gavin Tang. Online submodular maximization with free disposal: Ran-
domization beats for partition matroids. In Proceedings of the Twenty-Eighth
216
Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1204?1223.
SIAM, 2017.
[58] Moses Charikar and Balaji Raghavachari. The finite capacity dial-a-ride prob-
lem. In Proceedings 39th Annual Symposium on Foundations of Computer
Science, pages 458?467. IEEE, 1998.
[59] Chandra Chekuri, Jan Vondrak, and Rico Zenklusen. Dependent randomized
rounding via exchange properties of combinatorial structures. In Foundations
of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages
575?584. IEEE, 2010.
[60] Chandra Chekuri, Jan Vondra?k, and Rico Zenklusen. Multi-budgeted match-
ings and matroid intersection via dependent rounding. In Proceedings of the
twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages
1080?1097. SIAM, 2011.
[61] Chandra Chekuri, Jan Vondra?k, and Rico Zenklusen. Submodular function
maximization via the multilinear relaxation and contention resolution schemes.
SIAM Journal on Computing, 43(6):1831?1879, 2014.
[62] Cheng Chen, Lan Zheng, Venkatesh Srinivasan, Alex Thomo, Kui Wu, and
Anthony Sukow. Conflict-aware weighted bipartite b-matching and its applica-
tion to e-commerce. IEEE Transactions on Knowledge and Data Engineering,
28(6):1475?1488, 2016.
217
[63] Tianyi Chen and Georgios B Giannakis. Bandit convex optimization for scal-
able and dynamic iot management. IEEE Internet of Things Journal, 2018.
[64] Tianyi Chen, Qing Ling, and Georgios B Giannakis. An online convex opti-
mization approach to proactive network resource allocation. IEEE Transac-
tions on Signal Processing, 65(24):6350?6364, 2017.
[65] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed ban-
dit: General framework and applications. In Sanjoy Dasgupta and David
Mcallester, editors, Proceedings of the 30th International Conference on Ma-
chine Learning (ICML-13), volume 28, pages 151?159. JMLR Workshop and
Conference Proceedings, 2013.
[66] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman,
and Shang-Hua Teng. Electrical flows, laplacian systems, and faster approxi-
mation of maximum flow in undirected graphs. In 43rd ACM Symp. on Theory
of Computing (STOC), pages 273?282. ACM, 2011.
[67] Richard Combes, Chong Jiang, and Rayadurgam Srikant. Bandits with bud-
gets: Regret lower bounds and optimal algorithms. ACM SIGMETRICS Per-
formance Evaluation Review, 43(1):245?257, 2015.
[68] Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre
Proutiere, et al. Combinatorial bandits revisited. In Advances in Neural
Information Processing Systems, pages 2116?2124, 2015.
[69] Nikhil R Devanur and Thomas P Hayes. The adwords problem: online keyword
218
matching with budgeted bidders under random permutations. In Proceedings
of the 10th ACM conference on Electronic commerce, pages 71?78. ACM, 2009.
[70] Nikhil R. Devanur and Thomas P. Hayes. The AdWords problem: Online
keyword matching with budgeted bidders under random permutations. In
10th ACM Conf. on Electronic Commerce (EC), pages 71?78, 2009.
[71] Nikhil R Devanur and Kamal Jain. Online matching with concave returns. In
Proceedings of the forty-fourth annual ACM symposium on Theory of comput-
ing, pages 137?144. ACM, 2012.
[72] Nikhil R. Devanur, Kamal Jain, Balasubramanian Sivan, and Christopher A.
Wilkens. Near optimal online algorithms and fast approximation algorithms
for resource allocation problems. In 12th ACM Conf. on Electronic Commerce
(EC), pages 29?38, 2011.
[73] Nikhil R Devanur, Balasubramanian Sivan, and Yossi Azar. Asymptotically
optimal algorithm for stochastic adwords. In Proceedings of the 13th ACM
Conference on Electronic Commerce, pages 388?404. ACM, 2012.
[74] Nikhil R Devanur, Kamal Jain, and Robert D Kleinberg. Randomized primal-
dual analysis of ranking for online bipartite matching. In ACM-SIAM Symp.
on Discrete Algorithms (SODA), pages 101?107. Society for Industrial and
Applied Mathematics, 2013.
[75] Nikhil R Devanur, Kamal Jain, Balasubramanian Sivan, and Christopher A
219
Wilkens. Near optimal online algorithms and fast approximation algorithms
for resource allocation problems. Journal of the ACM (JACM), 66(1):7, 2019.
[76] John P Dickerson, Karthik A Sankararaman, Aravind Srinivasan, and Pan Xu.
Allocation problems in ride-sharing platforms: Online matching with offline
reusable resources. In Proceedings of the 32nd AAAI Conference on Artificial
Intelligence, 2018.
[77] John P Dickerson, Karthik Abinav Sankararaman, Aravind Srinivasan, and
Pan Xu. Assigning tasks to workers based on historical data: Online task
assignment with two-sided arrivals. In Proceedings of the 17th International
Conference on Autonomous Agents and MultiAgent Systems, pages 318?326.
International Foundation for Autonomous Agents and Multiagent Systems,
2018.
[78] John P Dickerson, Karthik A Sankararaman, Aravind Srinivasan, and Pan Xu.
Balancing relevance and diversity in online bipartite matching via submodu-
larity. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence,
AAAI ?19, 2019.
[79] Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed ban-
dit with budget constraint and variable costs. In 27th AAAI Conference on
Artificial Intelligence (AAAI), 2013.
[80] Miroslav Dud??k, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Lang-
220
ford, Lev Reyzin, and Tong Zhang. Efficient optimal leanring for contextual
bandits. In 27th Conf. on Uncertainty in Artificial Intelligence (UAI), 2011.
[81] Hossein Esfandiari, Nitish Korula, and Vahab Mirrokni. Online allocation with
traffic spikes: Mixing adversarial and stochastic models. In Proceedings of the
Sixteenth ACM Conference on Economics and Computation, pages 169?186.
ACM, 2015.
[82] Jon Feldman, Nitish Korula, Vahab Mirrokni, S Muthukrishnan, and Mar-
tin Pa?l. Online ad assignment with free disposal. In Internet and network
economics, pages 374?385. Springer, 2009.
[83] Jon Feldman, Aranyak Mehta, Vahab Mirrokni, and S Muthukrishnan. Online
stochastic matching: Beating 1-1/e. In Foundations of Computer Science
(FOCS), pages 117?126. IEEE, 2009.
[84] Jon Feldman, Monika Henzinger, Nitish Korula, Vahab S. Mirrokni, and Clif-
ford Stein. Online stochastic packing applied to display ad allocation. In 18th
Annual European Symp. on Algorithms (ESA), pages 182?194, 2010.
[85] Amos Fiat, Richard M Karp, Michael Luby, Lyle A McGeoch, Daniel D
Sleator, and Neal E Young. Competitive paging algorithms. Journal of Algo-
rithms, 12(4):685?699, 1991.
[86] Abraham Flaxman, Adam Kalai, and H. Brendan McMahan. Online Convex
Optimization in the Bandit Setting: Gradient Descent without a Gradient.
221
In 16th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 385?394,
2005.
[87] Yoav Freund and Robert E Schapire. Game theory, on-line prediction and
boosting. In Proceedings of the ninth annual conference on Computational
learning theory, pages 325?332. ACM, 1996.
[88] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of
on-line learning and an application to boosting. Journal of Computer and
System Sciences, 55(1):119?139, 1997.
[89] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplica-
tive weights. Games and Economic Behavior, 29(1-2):79?103, 1999.
[90] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Learning multiuser channel
allocations in cognitive radio networks: A combinatorial multi-armed bandit
formulation. In New Frontiers in Dynamic Spectrum, 2010 IEEE Symposium
on, pages 1?9. IEEE, 2010.
[91] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network op-
timization with unknown variables: Multi-armed bandits with linear rewards
and individual observations, October 2012.
[92] Rajiv Gandhi, Samir Khuller, Srinivasan Parthasarathy, and Aravind Srini-
vasan. Dependent rounding and its applications to approximation algorithms.
Journal of the ACM (JACM), 53(3):324?360, 2006.
222
[93] Asish Ghoshal and Jean Honorio. Learning linear structural equation models
in polynomial time and sample complexity. In International Conference on
Artificial Intelligence and Statistics, pages 1466?1475, 2018.
[94] John Gittins, Kevin Glazebrook, and Richard Weber. Multi-Armed Bandit
Allocation Indices. John Wiley & Sons, 2011.
[95] Gagan Goel and Aranyak Mehta. Online budgeted matching in random in-
put models with applications to adwords. In ACM-SIAM Symp. on Discrete
Algorithms (SODA), pages 982?991, 2008.
[96] Sudipta Guha and Kamesh Munagala. Multi-armed Bandits with Metric
Switching Costs. In 36th Intl. Colloquium on Automata, Languages and Pro-
gramming (ICALP), pages 496?507, 2007.
[97] Anupam Gupta, Ravishankar Krishnaswamy, Marco Molinaro, and R. Ravi.
Approximation algorithms for correlated knapsacks and non-martingale ban-
dits. In 52nd IEEE Symp. on Foundations of Computer Science (FOCS),
pages 827?836, 2011.
[98] Anupam Gupta, Viswanath Nagarajan, and Sahil Singla. Algorithms and
adaptivity gaps for stochastic probing. In Proceedings of the twenty-seventh
annual ACM-SIAM symposium on Discrete algorithms, pages 1731?1747.
SIAM, 2016.
[99] Anupam Gupta, Viswanath Nagarajan, and Sahil Singla. Adaptivity gaps
for stochastic probing: Submodular and xos functions. In Proceedings of the
223
Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages
1688?1702. SIAM, 2017.
[100] Andra?s Gyo?rgy, Levente Kocsis, Ivett Szabo?, and Csaba Szepesva?ri. Contin-
uous time associative bandit problems. In 20th Intl. Joint Conf. on Artificial
Intelligence (IJCAI), pages 830?835, 2007.
[101] Andra?s Gyo?rgy, Tama?s Linder, Ga?bor Lugosi, and Gyo?rgy Ottucsa?k. The on-
line shortest path problem under partial monitoring. J. of Machine Learning
Research (JMLR), 8:2369?2403, 2007.
[102] Bernhard Haeupler, Vahab S Mirrokni, and Morteza Zadimoghaddam. On-
line stochastic weighted matching: Improved approximation algorithms. In
International Workshop on Internet and Network Economics, pages 170?181.
Springer, 2011.
[103] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History
and context. Acm transactions on interactive intelligent systems (tiis), 5(4):
19, 2016.
[104] Elad Hazan. Introduction to Online Convex Optimization. Foundations and
Trends?R in Optimization, 2(3-4):157?325, 2015.
[105] Elad Hazan and Kfir Y. Levy. Bandit convex optimization: Towards tight
bounds. In 27th Advances in Neural Information Processing Systems (NIPS),
pages 784?792, 2014.
224
[106] Justin Hsu, Zhiyi Huang, Aaron Roth, and Zhiwei Steven Wu. Jointly pri-
vate convex programming. In 27th ACM-SIAM Symp. on Discrete Algorithms
(SODA), pages 580?599, 2016.
[107] Zhiyi Huang, Ning Kang, Zhihao Gavin Tang, Xiaowei Wu, Yuhao Zhang,
and Xue Zhu. How to match when all vertices arrive online. In Proceedings of
the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages
17?29. ACM, 2018.
[108] Zhiyi Huang, Binghui Peng, Zhihao Gavin Tang, Runzhou Tao, Xiaowei Wu,
and Yuhao Zhang. Tight competitive ratios of classic matching algorithms in
the fully online model. In Proceedings of the Thirtieth Annual ACM-SIAM
Symposium on Discrete Algorithms, pages 2875?2886. SIAM, 2019.
[109] Nicole Immorlica, Karthik Abinav Sankararaman, Robert Schapire, and Alek-
sandrs Slivkins. Adversarial bandits with knapsacks. In Foundations of Com-
puter Science (FOCS). IEEE, 2019.
[110] Russell Impagliazzo and Valentine Kabanets. Constructive proofs of concen-
tration bounds. In Approximation, Randomization, and Combinatorial Opti-
mization. Algorithms and Techniques, pages 617?631. Springer, 2010.
[111] Patrick Jaillet and Xin Lu. Online stochastic matching: New algorithms with
better bounds. Mathematics of Operations Research, 39(3):624?646, 2013.
[112] David S Johnson. Approximation algorithms for combinatorial problems.
Journal of computer and system sciences, 9(3):256?278, 1974.
225
[113] Satyen Kale, Lev Reyzin, and Robert E. Schapire. Non-stochastic bandit
slate problems. In 24th Advances in Neural Information Processing Systems
(NIPS), pages 1054?1062, 2010.
[114] Michael Kapralov, Ian Post, and Jan Vondra?k. Online submodular welfare
maximization: Greedy is optimal. In Proceedings of the twenty-fourth annual
ACM-SIAM symposium on Discrete algorithms, pages 1216?1225. SIAM, 2013.
[115] Mohammad Karimi, Mario Lucic, Hamed Hassani, and Andreas Krause.
Stochastic submodular maximization: The case of coverage functions. In Ad-
vances in Neural Information Processing Systems, pages 6853?6863, 2017.
[116] Richard M Karp, Umesh V Vazirani, and Vijay V Vazirani. An optimal al-
gorithm for on-line bipartite matching. In Proceedings of the twenty-second
annual ACM symposium on Theory of computing, pages 352?358. ACM, 1990.
[117] Sumeet Katariya, Branislav Kveton, Csaba Szepesva?ri, and Zheng Wen. DCM
bandits: Learning to rank with multiple clicks. In Proceedings of the 33nd
International Conference on Machine Learning, ICML 2016, New York City,
NY, USA, June 19-24, 2016, pages 1215?1224, 2016.
[118] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing
fairness gerrymandering: Auditing and learning for subgroup fairness. In 35th
Intl. Conf. on Machine Learning (ICML), pages 2564?2572, 2018.
[119] Thomas Kesselheim, Klaus Radke, Andreas To?nnis, and Berthold Vo?cking.
An optimal online algorithm for weighted bipartite matching and extensions
226
to combinatorial auctions. In European Symposium on Algorithms (ESA),
pages 589?600. Springer, 2013.
[120] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit prob-
lem. In 18th Advances in Neural Information Processing Systems (NIPS),
2004.
[121] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Ban-
dits and experts in metric spaces. Working paper, published at
http://arxiv.org/abs/1312.1277, 2018. Merged and revised version
of conference papers in ACM STOC 2008 and ACM-SIAM SODA 2010.
[122] Nitish Korula and Martin Pa?l. Algorithms for secretary problems on graphs
and hypergraphs. In Automata, Languages and Programming, pages 508?520.
Springer, 2009.
[123] Akshay Krishnamurthy, Alekh Agarwal, and Miroslav Dud??k. Contextual
semibandits via supervised learning oracles. In 29th Advances in Neural In-
formation Processing Systems (NIPS), 2016.
[124] Branislav Kveton, Zheng Wen, Azin Ashkan, Hoda Eydgahi, and Brian Eriks-
son. Matroid bandits: Fast combinatorial optimization with learning. In
Nevin L. Zhang and Jin Tian, editors, UAI, pages 420?429. AUAI Press, 2014.
[125] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascad-
ing bandits: Learning to rank in the cascade model. In David Blei and Francis
Bach, editors, Proceedings of the 32nd International Conference on Machine
227
Learning (ICML-15), pages 767?776. JMLR Workshop and Conference Pro-
ceedings, 2015.
[126] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvri. Tight
regret bounds for stochastic combinatorial semi-bandits. In Guy Lebanon and
S. V. N. Vishwanathan, editors, AISTATS, JMLR Workshop and Conference
Proceedings. JMLR.org, 2015.
[127] John Langford and Tong Zhang. The Epoch-Greedy Algorithm for Contextual
Multi-armed Bandits. In 21st Advances in Neural Information Processing
Systems (NIPS), 2007.
[128] Der-Horng Lee, Hao Wang, Ruey Long Cheu, and Siew Hoon Teo. Taxi
dispatch system based on current demands and real-time traffic conditions.
Transportation Research Record, 1882(1):193?200, 2004.
[129] Jon Lee, Maxim Sviridenko, and Jan Vondra?k. Submodular maximization
over multiple matroids via generalized exchange properties. In Approximation,
Randomization, and Combinatorial Optimization. Algorithms and Techniques,
pages 244?257. Springer, 2009.
[130] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm.
Information and Computation, 108(2):212?260, 1994.
[131] La?szlo? Lova?sz. On the ratio of optimal integral and fractional covers. Discrete
mathematics, 13(4):383?390, 1975.
228
[132] Meghna Lowalekar, Pradeep Varakantham, and Patrick Jaillet. Online spatio-
temporal matching in stochastic and dynamic domains. Artificial Intelligence,
261:71?112, 2018.
[133] Will Ma. Improvements and generalizations of stochastic knapsack and multi-
armed bandit approximation algorithms. In Proceedings of the twenty-fifth
annual ACM-SIAM symposium on Discrete algorithms, pages 1154?1163. So-
ciety for Industrial and Applied Mathematics, 2014.
[134] Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency:
online convex optimization with long term constraints. J. of Machine Learning
Research (JMLR), 13(Sep):2503?2528, 2012.
[135] Mehrdad Mahdavi, Tianbao Yang, and Rong Jin. Stochastic convex optimiza-
tion with multiple objectives. In Advances in Neural Information Processing
Systems (NIPS), pages 1115?1123, 2013.
[136] Mohammad Mahdian and Qiqi Yan. Online bipartite matching with random
arrivals: an approach based on strongly factor-revealing LPs. In Proceedings
of the forty-third annual ACM symposium on Theory of computing, pages 597?
606. ACM, 2011.
[137] Vahideh H Manshadi, Shayan Oveis Gharan, and Amin Saberi. Online stochas-
tic matching: Online actions based on offline statistics. Mathematics of Op-
erations Research, 37(4):559?573, 2012.
[138] Andrew McGregor. Finding graph matchings in data streams. In Approxima-
229
tion, Randomization and Combinatorial Optimization. Algorithms and Tech-
niques, pages 170?181. Springer, 2005.
[139] Aranyak Mehta. Online matching and ad allocation. Foundations and
Trends?R in Theoretical Computer Science, 8(4):265?368, 2013.
[140] Aranyak Mehta and Debmalya Panigrahi. Online matching with stochastic
rewards. In Foundations of Computer Science (FOCS), pages 728?737. IEEE,
2012.
[141] Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. Adwords
and generalized online matching. J. ACM, 54(5):22, 2007. ISSN 0004-5411.
[142] Aranyak Mehta, Bo Waggoner, and Morteza Zadimoghaddam. Online stochas-
tic matching with unequal probabilities. In Proceedings of the Twenty-Sixth
Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2015.
[143] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, and Amin Karbasi.
Fast constrained submodular maximization: Personalized data summariza-
tion. In International Conference on Machine Learning, pages 1358?1367,
2016.
[144] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Dis-
tributed submodular maximization. The Journal of Machine Learning Re-
search, 17(1):8330?8373, 2016.
[145] Baharan Mirzasoleiman, Morteza Zadimoghaddam, and Amin Karbasi. Fast
230
distributed submodular cover: Public-private data summarization. In Ad-
vances in Neural Information Processing Systems, pages 3594?3602, 2016.
[146] Baharan Mirzasoleiman, Stefanie Jegelka, and Andreas Krause. Streaming
non-monotone submodular maximization: Personalized video summarization
on the fly. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[147] Marco Molinaro and R. Ravi. Geometry of online packing linear programs. In
39th Intl. Colloquium on Automata, Languages and Programming (ICALP),
pages 701?713, 2012.
[148] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cam-
bridge University Press, Cambridge, 1995.
[149] Michael J Neely and Hao Yu. Online convex optimization with time-varying
constraints. arXiv preprint arXiv:1702.04783, 2017.
[150] George L Nemhauser and Laurence A Wolsey. Best algorithms for approxi-
mating the maximum of a submodular set function. Mathematics of operations
research, 3(3):177?188, 1978.
[151] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis
of approximations for maximizing submodular set functions. Mathematical
programming, 14(1):265?294, 1978.
[152] Gergely Neu and Ga?bor Barto?k. Importance weighting without importance
weights: An efficient algorithm for combinatorial semi-bandits. J. of Machine
Learning Research (JMLR), 17(1):5355?5375, 2016.
231
[153] Christos H Papadimitriou and Kenneth Steiglitz. Combinatorial optimization:
algorithms and complexity. Courier Corporation, 1982.
[154] Prabhakar Raghavan and Clark D Tompson. Randomized rounding: a tech-
nique for provably good algorithms and algorithmic proofs. Combinatorica, 7
(4):365?374, 1987.
[155] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable
sequences. In 26th Conf. on Learning Theory (COLT), pages 993?1019, 2013.
[156] Alexander Rakhlin and Karthik Sridharan. BISTRO: an efficient relaxation-
based method for contextual bandits. In 33nd Intl. Conf. on Machine Learning
(ICML), 2016.
[157] Anshuka Rangi, Massimo Franceschetti, and Long Tran-Thanh. Unifying
the stochastic and the adversarial bandits with knapsack. arXiv preprint
arXiv:1811.12253, 2018.
[158] Adrian Rivera, He Wang, and Huan Xu. Online saddle point problem
with applications to constrained online convex optimization. arXiv preprint
arXiv:1806.08301, 2018.
[159] Herbert Robbins. Some Aspects of the Sequential Design of Experiments.
Bull. Amer. Math. Soc., 58:527?535, 1952.
[160] Ryan Rogers, Aaron Roth, Jonathan Ullman, and Zhiwei Steven Wu. Inducing
approximately optimal flow using truthful mediators. In 16th ACM Conf. on
Electronic Commerce (EC), pages 471?488, 2015.
232
[161] Aaron Roth, Jonathan Ullman, and Zhiwei Steven Wu. Watch and learn:
Optimizing from revealed preferences feedback. In 48th ACM Symp. on Theory
of Computing (STOC), pages 949?962, 2016.
[162] Aaron Roth, Aleksandrs Slivkins, Jonathan Ullman, and Zhiwei Steven Wu.
Multidimensional dynamic pricing for welfare maximization. In 18th ACM
Conf. on Electronic Commerce (EC), pages 519?536, 2017.
[163] Karthik Abinav Sankararaman and Aleksandrs Slivkins. Combinatorial semi-
bandits with knapsacks. In Intl. Conf. on Artificial Intelligence and Statistics
(AISTATS), pages 1760?1770, 2018.
[164] Karthik Abinav Sankararaman, Anand Louis, and Navin Goyal. Stability of
linear structural equation models of causal inference. In Proceedings of the
35th Conference on Uncertainty in Artificial Intelligence, UAI ?19, 2019.
[165] Karthik Abinav Sankararaman, Anand Louis, and Navin Goyal. Robust iden-
tifiability in the linear structural equation model for causal inference. In Work-
ing Paper, 2019.
[166] Kanthi K Sarpatwar, Baruch Schieber, and Hadas Shachnai. Constrained sub-
modular maximization via greedy local search. Operations Research Letters,
47(1):1?6, 2019.
[167] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms.
The MIT Press, 2012.
233
[168] Alexander Schrijver. Combinatorial optimization: polyhedra and efficiency,
volume 24. Springer Science & Business Media, 2002.
[169] Adish Singla and Andreas Krause. Truthful incentives in crowdsourcing tasks
using regret minimization mechanisms. In 22nd Intl. World Wide Web Conf.
(WWW), pages 1167?1178, 2013.
[170] Aleksandrs Slivkins. Dynamic ad allocation: Bandits with budgets. A techni-
cal report on arxiv.org/abs/1306.0155, June 2013.
[171] Serban Stan, Morteza Zadimoghaddam, Andreas Krause, and Amin Karbasi.
Probabilistic submodular maximization in sub-linear time. In Proceedings of
the 34th International Conference on Machine Learning, pages 3241?3250.
JMLR. org, 2017.
[172] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast
convergence of regularized learning in games. In 28th Advances in Neural
Information Processing Systems (NIPS), pages 2989?2997, 2015.
[173] Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert E. Schapire. Efficient
algorithms for adversarial contextual learning. In 33nd Intl. Conf. on Machine
Learning (ICML), 2016.
[174] Vasilis Syrgkanis, Haipeng Luo, Akshay Krishnamurthy, and Robert E.
Schapire. Improved regret bounds for oracle-based adversarial contextual ban-
dits. In 29th Advances in Neural Information Processing Systems (NIPS),
2016.
234
[175] William R. Thompson. On the likelihood that one unknown probability ex-
ceeds another in view of the evidence of two samples. Biometrika, 25(3-4):
285?294, 1933.
[176] Yongxin Tong, Jieying She, Bolin Ding, Libin Wang, and Lei Chen. Online
mobile micro-task allocation in spatial crowdsourcing. In 32nd international
conference on data engineering (ICDE), pages 49?60. IEEE, 2016.
[177] Long Tran-Thanh, Archie Chapman, Enrique Munoz de Cote, Alex Rogers,
and Nicholas R. Jennings. -first policies for budget-limited multi-armed ban-
dits. In 24th AAAI Conference on Artificial Intelligence (AAAI), pages 1211?
1216, 2010.
[178] Long Tran-Thanh, Archie Chapman, Alex Rogers, and Nicholas R. Jennings.
Knapsack based optimal policies for budget-limited multi-armed bandits. In
26th AAAI Conference on Artificial Intelligence (AAAI), pages 1134?1140,
2012.
[179] Van-Anh Truong and Xinshang Wang. Prophet inequality with correlated
arrival probabilities, with application to two sided matchings. arXiv preprint
arXiv:1901.02552, 2019.
[180] Sebastian Tschiatschek, Rishabh K Iyer, Haochen Wei, and Jeff A Bilmes.
Learning mixtures of submodular functions for image collection summariza-
tion. In Advances in neural information processing systems, pages 1413?1421,
2014.
235
[181] Seeun Umboh. Online network design algorithms via hierarchical decompo-
sitions. In Proceedings of the twenty-sixth annual ACM-SIAM symposium
on Discrete algorithms, pages 1373?1387. Society for Industrial and Applied
Mathematics, 2015.
[182] Jun-Kun Wang and Jacob Abernethy. Acceleration through optimistic no-
regret dynamics. arXiv preprint arXiv:1807.10455, 2018.
[183] Yajun Wang and Sam Chiu-wai Wong. Two-sided online bipartite matching
and vertex cover: Beating the greedy algorithm. In International Colloquium
on Automata, Languages, and Programming, pages 1070?1081. Springer, 2015.
[184] Zizhuo Wang, Shiming Deng, and Yinyu Ye. Close the gaps: A learning-while-
doing algorithm for single-product revenue management problems. Operations
Research, 62(2):318?331, 2014.
[185] Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial
bandits. In 31st Conf. on Learning Theory (COLT), 2018.
[186] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection
and active learning. In International Conference on Machine Learning, pages
1954?1963, 2015.
[187] Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-
scale combinatorial semi-bandits. In Francis R. Bach and David M. Blei,
editors, ICML, JMLR Workshop and Conference Proceedings, pages 1113?
1122. JMLR.org, 2015.
236
[188] David P Williamson and David B Shmoys. The design of approximation al-
gorithms. Cambridge university press, 2011.
[189] Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav
Kveton. Cascading bandits for large-scale recommendation problems. 2016.
237