entropy

Article

Scheduling to Minimize Age of Incorrect Information with
Imperfect Channel State Information

Yutao Chen and Anthony Ephremides *

����������
�������

Citation: Chen, Y.; Ephremides, A.

Scheduling to Minimize Age of

Incorrect Information with Imperfect

Channel State Information. Entropy

2021, 23, 1572. https://doi.org/

10.3390/e23121572

Academic Editor: Mario Martinelli

Received: 3 November 2021

Accepted: 23 November 2021

Published: 25 November 2021

Publisher’s Note: MDPI stays neutral

with regard to jurisdictional claims in

published maps and institutional affil-

iations.

Copyright: © 2021 by the authors.

Licensee MDPI, Basel, Switzerland.

This article is an open access article

distributed under the terms and

conditions of the Creative Commons

Attribution (CC BY) license (https://

creativecommons.org/licenses/by/

4.0/).

Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA;
cheny@umd.edu
* Correspondence: etony@umd.edu

Abstract: In this paper, we study a slotted-time system where a base station needs to update multiple
users at the same time. Due to the limited resources, only part of the users can be updated in each
time slot. We consider the problem of minimizing the Age of Incorrect Information (AoII) when
imperfect Channel State Information (CSI) is available. Leveraging the notion of the Markov Decision
Process (MDP), we obtain the structural properties of the optimal policy. By introducing a relaxed
version of the original problem, we develop the Whittle’s index policy under a simple condition.
However, indexability is required to ensure the existence of Whittle’s index. To avoid indexability,
we develop Indexed priority policy based on the optimal policy for the relaxed problem. Finally,
numerical results are laid out to showcase the application of the derived structural properties and
highlight the performance of the developed scheduling policies.

Keywords: age of incorrect information; multi-user system; scheduling policy

1. Introduction

The Age of Incorrect Information (AoII) is introduced in [1] as a combination of age-
based metrics (e.g., Age of Information (AoI)) and error-based metrics (e.g., Minimum
Mean Square Error). In communication systems, AoII captures not only the information
mismatch between the source and the destination but also the aging process of inconsistent
information. Hence, two functions dominate AoII. The first is the time penalty function,
which reflects how the inconsistency of information affects the system over time. In real-
life applications, inconsistent information will affect different communication systems in
different ways. For example, machine temperature monitoring is time-sensitive because
the damage caused by overheating will accumulate quickly. However, reservoir water
level monitoring is less sensitive to time. Therefore, by adopting different time penalty
functions, AoII can capture different aging processes of the mismatch in different systems.
The second is the information penalty function, which captures the information mismatch
between the source and the destination. It allows us to measure mismatches in different
ways, depending on how sensitive different systems are to information inconsistencies.
For example, the navigation system requires precise information to give correct instructions,
but the real-time delivery tracking system does not need very accurate location information.
Since we can choose different penalty functions for different systems, AoII is adaptable to
various communication goals, which is why it is regarded as a semantic metric [2].

Since the introduction of AoII, several studies have been performed to reveal its funda-
mental nature. The authors of [3] consider a system with random packet delivery times and
compare AoII with AoI and real-time error via extensive numerical results. The authors
of [4] study the problem of minimizing the AoII that takes the general time penalty function.
Three real-life applications are considered to showcase the performance advantages of AoII
over AoI and real-time error. In [5], the authors investigate the AoII that considers the
quantified mismatch between the source and the destination. The optimization problem
is studied when the system is resource-constrained. The authors of [6] studied the AoII

Entropy 2021, 23, 1572. https://doi.org/10.3390/e23121572 https://www.mdpi.com/journal/entropy

https://www.mdpi.com/journal/entropy
https://www.mdpi.com
https://orcid.org/0000-0003-3992-5468
https://doi.org/10.3390/e23121572
https://doi.org/10.3390/e23121572
https://creativecommons.org/
https://creativecommons.org/licenses/by/4.0/
https://creativecommons.org/licenses/by/4.0/
https://doi.org/10.3390/e23121572
https://www.mdpi.com/journal/entropy
https://www.mdpi.com/article/10.3390/e23121572?type=check_update&version=2


Entropy 2021, 23, 1572 2 of 39

minimization problem in the context of scheduling. It considers a system where the central
scheduler needs to update multiple users at the same time. However, the central scheduler
cannot know the states of the sources before receiving the updates. By introducing the
belief value, Whittle’s index policy is developed and evaluated. In this paper, we also
consider the problem of minimizing AoII in scheduling. Different from [6], we consider
the generic time penalty function and study the minimization problem in the presence of
imperfect Channel State Information (CSI). Due to the existence of CSI, Whittle’s index
policy becomes infeasible in general. Hence, we introduce another scheduling policy that
is more versatile and has comparable performance to Whittle’s index policy.

The problem of scheduling to minimize AoI is studied under various system settings
in [7–11]. The problem studied in this paper is different and more complicated because
AoII considers the aging process of inconsistent information rather than the aging process
of updates. Meanwhile, none of them consider the case where CSI is available. The problem
of optimizing information freshness in the presence of CSI is studied in [12,13]. However,
they focus on the system with a single user and mainly discuss the case where CSI is perfect.
The scheduling problems with the goal of minimizing an error-based performance measure
are considered in [14–16]. Our problem is fundamentally different because AoII also
considers the time effect. Moreover, we consider the system where a base station observes
multiple sources simultaneously and needs to send updates to multiple destinations.

The main contributions of this work can be summarized as follows. (1) We study
the problem of minimizing AoII in a multi-user system where imperfect CSI is available.
Meanwhile, the time penalty function is generic. (2) We derive the structural properties
of the optimal policy for the considered problem. (3) We establish the indexability of the
considered problem under a simple condition and develop Whittle’s index policy. (4) We
obtain the optimal policy for a relaxed version of the original problem. By exploring the
characteristics of the relaxed problem, we provide an efficient algorithm to obtain the
optimal policy. (5) Based on the optimal policy for the relaxed problem, we develop the
Indexed priority policy that is free from indexability and has comparable performance to
Whittle’s index policy.

The remainder of this paper is organized in the following way. In Section 2, we introduce
the system model and formulate the primal problem. Section 3 explores the structural
properties of the optimal policy for the primal problem. Under a simple condition, we
develop Whittle’s index policy in Section 4. Section 5 presents the optimal policy for a
relaxed version of the primal problem. On this basis, we develop the Indexed priority
policy in Section 6. Finally, in Section 7, the numerical results are laid out.

2. System Overview
2.1. Communication Model

We consider a slotted-time system with N users and one base station. Each user is
composed of a source process, a channel, and a receiver. We assume all the users share
the same structure, but the parameters are different. The structure of the communication
model is provided in Figure 1.

Base
Station

Receiver1

Receiver2

ReceiverN

U
S
E
R
1

U
S
E
R
2

U
S
E
R
N

Source1

Source2

SourceN

Channel1

Channel2

ChannelN

Figure 1. The structure of the communication model.


Entropy 2021, 23, 1572 3 of 39

For user i, the source process is modeled by a two-state Markov chain where transitions
happen between the two states with probability pi > 0 and self-transitions happen with
probability 1− pi. At any time slot t, the state of the source process Xi,t ∈ {0, 1} will
be reported to the base station as an update, and the base station will decide whether
to transmit this update through the corresponding channel. The channel is unreliable,
but the estimate of the Channel State Information (CSI) is available at the beginning of
each time slot. Let ri,t ∈ {0, 1} be the CSI at time t. We assume that ri,t is independent
across time and user indices. ri,t = 1 if and only if the transmission attempt at time t will
succeed and ri,t = 0 otherwise. Then, we denote by r̂i,t ∈ {0, 1} the estimate of ri,t. We
assume that r̂i,t is an independent Bernoulli random variable with parameter γi, i.e., r̂i,t = 1
with probability γi ∈ [0, 1] and r̂i,t = 0 with probability 1− γi. However, the estimate
is imperfect. We assume that the error depends only on the user and its estimate. More
precisely, we define the probability of error as pr̂i

e,i , Pr[ri 6= r̂i | r̂i]. We assume pr̂i
e,i < 0.5

because we can flip the estimate if pr̂i
e,i > 0.5. We are not interested in the case of pr̂i

e,i = 0.5
since r̂i,t is useless in this case. Although the channel is unreliable, each transmission
attempt takes exactly one time slot regardless of the result, and the successfully transmitted
update will not be corrupted. Every time an update is received, the receiver will use it as
the new estimate X̂i,t. The receiver will send an ACK/NACK packet to inform the base
station of its reception of the new update. Since an ACK/NACK packet is generally very
small and simple, we assume that it is transmitted reliably and received instantaneously.
Then, if ACK is received, the base station knows that the receiver’s estimate changed to
the transmitted update. If NACK is received, the base station knows that the receiver’s
estimate did not change. Therefore, the base station always knows the estimate at the
receiver side.

At the beginning of each time slot, the base station receives updates from each source
and the estimates of CSI from each channel. The old updates and estimates are discarded
upon the arrival of new ones. Then, the base station decides which updates to transmit,
and the decision is independent of the transmission history. Due to the limited resources,
at most M < N updates are allowed per transmission attempt. We consider a base station
that always transmits M updates.

2.2. Age of Incorrect Information

All the users adopt AoII as a performance metric, but the choices of penalty functions
vary. Let Xt and X̂t be the true state and the estimate of the source process, respectively.
Then, in a slotted-time system, AoII can be expressed as follows

∆AoII(Xt, X̂t, t) =
t

∑
k=Ut+1

(
g(Xk, X̂k)× F(k−Ut)

)
, (1)

where Ut is the last time instance before time t (including t) that the receiver’s estimate
is correct. g(Xt, X̂t) can be any information penalty function that captures the difference
between Xt and X̂t. F(t) , f (t)− f (t− 1) where f (t) can be any time penalty function that
is non-decreasing in t. We consider the case where the users adopt the same information
penalty function g(Xt, X̂t) = |Xt− X̂t| but possibly different time penalty functions. To ease
the analysis, we require f (t) to be unbounded. Combined together, we require f (t1) ≤ f (t2)
if t1 < t2 and limt→+∞ f (t) = +∞. Without a loss of generality, we assume f (0) = 0, as the
source is modeled by a two-state Markov chain, g(Xt, X̂t) ∈ {0, 1}. Hence, Equation (1)
can be simplified to

∆AoII(Xt, X̂t, t) =
t

∑
k=Ut+1

F(k−Ut) = f (st),

where st , t−Ut. Therefore, the evolution of st is sufficient to characterize the evolution
of AoII. To this end, we distinguish between the following cases.


Entropy 2021, 23, 1572 4 of 39

• When the receiver’s estimate is correct at time t + 1, we have Ut+1 = t + 1. Then,
by definition, st+1 = 0.

• When the receiver’s estimate is incorrect at time t + 1, we have Ut+1 = Ut. Then,
by definition, st+1 = t + 1−Ut = st + 1.

To sum up, we get
st+1 = 1{Ut+1 6=t+1} × (st + 1). (2)

A sample path of st is shown in Figure 2. In the remainder of this paper, we use fi(·) to
denote the time penalty function user i adopts.

0 1 2 3 4 5 6 7

1

2

3

X1 = 1

X̂1 = 1

X2 = 0

X̂2 = 1

X3 = 0

X̂3 = 0

X4 = 1

X̂4 = 0

X5 = 1

X̂5 = 0

X6 = 0

X̂6 = 1

X7 = 1

X̂7 = 1

t

st

Figure 2. A sample path of st.

Remark 1. Under this particular choice of the penalty function, st can be interpreted as the time
elapsed since the last time the receiver’s estimate is correct. Please note that st is different from
the Age of Information (AoI) [17], which is defined as the time elapsed since the generation time
of the last received update. We can see that AoI considers the aging process of the update, while
AoII considers the aging process of the estimation error. At the same time, st is also fundamentally
different from the holding time, which, according to [18,19], is defined as the time elapsed since the
last successful transmission. We notice that the receiver’s estimate can become correct even when
no new update is successfully transmitted. Moreover, the information carried by the update may
have become incorrect by the time it is received. We also notice that [18,19] consider the problem of
minimizing the estimation error. However, by adopting AoII as the performance metric, we study
the impact of estimation error on the system.

2.3. System Dynamic

In this section, we tackle the system dynamic. We notice that the status of user i can
be captured by the pair xi,t , (si,t, r̂i,t). In the following, we will use xi,t and (si,t, r̂i,t)
interchangeably. Then, the system dynamic can be fully characterized by the dynamic of
xt , (x1,t, . . . , xN,t). Hence, it suffices to characterize the value of xt+1 given xt and the
base station’s action. To this end, we denote, by at = (a1,t, . . . , aN,t), the base station’s
action at time t. ai,t = 1 if the base station transmits the update from user i at time t and
ai,t = 0 otherwise. We notice that given action at, users are independent and the action
taken on user i will only affect itself. Consequently

Pr(xt+1 | xt, at) =
N

∏
i=1

Pr(xi,t+1 | xi,t, at) =
N

∏
i=1

Pr(xi,t+1 | xi,t, ai,t).

Combined with the fact that all the users share the same structure, it is sufficient to study
the dynamic of a single user. In the following discussions, we drop the user-dependent
subscript i. We recall that r̂t+1 is an independent Bernoulli random variable. Then, we have

Pr(xt+1 | xt, at) = P(r̂t+1)× Pr(st+1 | xt, at). (3)


Entropy 2021, 23, 1572 5 of 39

By definition, P(r̂t+1 = 1) = γ and P(r̂t+1 = 0) = 1− γ. Then, we only need to tackle the
value of Pr(st+1 | xt, at). To this end, we distinguish between the following cases

• When xt = (0, r̂t), the estimate at time t is correct (i.e., X̂t = Xt). Hence, for the
receiver, Xt carries no new information about the source process. In other words,
X̂t+1 = X̂t regardless of whether an update is transmitted at time t. We recall that
Ut+1 = Ut if X̂t+1 6= Xt+1 and Ut+1 = t + 1 otherwise. Since the source is binary, we
obtain Ut+1 = Ut if Xt+1 6= Xt, which happens with probability p and Ut+1 = t + 1
otherwise. According to (2), we obtain

Pr(1 | (0, r̂t), at) = p,

Pr(0 | (0, r̂t), at) = 1− p.

• When at = 0 and xt = (st, r̂t), where st > 0, the channel will not be used and no new
update will be received by the receiver,and so, X̂t+1 = X̂t. We recall that Ut+1 = Ut if
X̂t+1 6= Xt+1 and Ut+1 = t + 1 otherwise. Since Xt 6= X̂t and the source is binary, we
have Ut+1 = Ut if Xt+1 = Xt, which happens with probability 1− p and Ut+1 = t + 1
otherwise. According to (2), we obtain

Pr(st + 1 | (st, r̂t), at = 0) = 1− p,

Pr(0 | (st, r̂t), at = 0) = p.

• When at = 1 and xt = (st, 1) where st > 0, the transmission attempt will succeed
with probability 1 − p1

e and fail with probability p1
e . We recall that Ut+1 = Ut if

X̂t+1 6= Xt+1 and Ut+1 = t + 1 otherwise. Then, when the transmission attempt
succeeds (i.e., X̂t+1 = Xt), Ut+1 = Ut if Xt+1 6= Xt and Ut+1 = t + 1 otherwise. When
the transmission attempt fails (i.e., X̂t+1 = X̂t 6= Xt), we have Ut+1 = Ut if Xt+1 = Xt
and Ut+1 = t + 1 otherwise. Combining (2) with the dynamic of the source process
we obtain

Pr(st + 1 | (st, 1), at = 1) = p1
e (1− p) + (1− p1

e )p , α,

Pr(0 | (st, 1), at = 1) = p1
e p + (1− p1

e )(1− p) = 1− α.

• When at = 1 and xt = (st, 0), where st > 0, following the same line, we obtain

Pr(st + 1 | (st, 0), at = 1) = p0
e p + (1− p0

e )(1− p) , β,

Pr(0 | (st, 0), at = 1) = p0
e (1− p) + (1− p0

e )p = 1− β.

Combines together, we obtain the value of Pr(st+1 | xt, at) in all cases. As only M out
of N updates are allowed per transmission attempt, we realize a necessity to require
transmission attempts always help minimize AoII. It is equivalent to impose Pr(st+1 >
st | (st, r̂t), at = 0) > Pr(st+1 > st | (st, r̂t), at = 1) for any (st, r̂t). Leveraging the results
above, it is sufficient to require p < 0.5. As all the users share the same structure, we
assume, for the rest of this paper, that 0 < pi < 0.5 for 1 ≤ i ≤ N.

2.4. Problem Formulation

The communication goal is to minimize the expected AoII. Therefore, the problem can
be formulated as the following

arg min
φ ∈ Φ

lim
T→∞

1
T
Eφ

(
T−1

∑
t=0

N

∑
i=1

fi(si,t)

)
(4a)

subject to
N

∑
i=1

ai,t = M ∀t, (4b)


Entropy 2021, 23, 1572 6 of 39

where Φ is the set of all causal policies. We refer to the constrained minimization problem
reported in problem (4) as the Primal Problem (PP). We notice that the PP is a Restless
Multi-Armed Bandit (RMAB) Problem. The optimal policy for this type of problem is far
from reachable since it is PSPACE-hard in general [20]. However, we can still derive the
structural properties of the optimal policy. These structural properties can be used as a
guide for the development of scheduling policies and can indicate the good performance
of the developed scheduling policies.

3. Structural Properties of the Optimal Policy

In this section, we investigate the structural properties of the optimal policy for PP.
We first define an infinite horizon with an average cost Markov Decision Process (MDP)
MN(w, M) = (XN ,AN(M),PN , CN(w)), where

• XN denotes the state space. The state is x = (x1, . . . , xN) where xi = (si, r̂i).
• AN(M) denotes the action space. The feasible action is a = (a1, . . . , aN) where

ai ∈ {0, 1} and ∑N
i=1 ai = M. Note that the feasible actions are independent of the

state and the time.
• PN denotes the state transition probabilities. We define Px,x′(a) as the probability that

action a at state x will lead to state x′. It is calculated by

Px,x′(a) =
N

∏
i=1

P(r̂′i)Psi ,s′i
(ai, r̂i),

where Psi ,s′i
(ai, r̂i) is the transition probability from si to s′i when the estimate of CSI

is r̂i and action ai is taken. The values of Psi ,s′i
(ai, r̂i) can be obtained easily from the

results in Section 2.3.
• CN(w) denotes the instant cost. When the system is at state x and action a is taken,

the instant cost is C(x, a) , ∑N
i=1 C(xi, ai) , ∑N

i=1
(

fi(si) + wai
)
.

We notice that PP can be cast intoMN(0, M). Since w = 0, the instant cost is indepen-
dent of action a. Therefore, we abbreviate C(x, a) as C(x). To simplify the analysis, we
consider the case of M = 1. Equivalently, we investigate the structural properties of the
optimal policy forMN(0, 1).

Remark 2. For the case of M > 1, we can apply the same methodology. However, as M increases,
the action space will grow quickly, resulting in the need to consider more feasible actions in each
step of the proof. Hence, to better demonstrate the methodology, we only consider the case of M = 1
in this paper.

It is well known that the optimal policy for MN(0, 1) can be characterized by the
value function. We denote the value function of state x as V(x). A canonical procedure to
calculate V(x) is applying the Value Iteration Algorithm (VIA). To this end, we define Vν(·)
as the estimated value function at iteration ν of VIA and initialize V0(·) = 0. Then, VIA
updates the estimated value functions in the following way

Vν+1(x) = C(x)− θ + min
a∈AN(1)

{
∑

x′∈XN

Px,x′(a)Vν(x′)

}
, (5)

where θ is the optimal value of MN(0, 1). VIA is guaranteed to converge to the value
function [21]. More precisely, Vν(·) = V(·) when ν → +∞. However, the exact value
function is impossible to get since we need infinite iterations and the state space is infinite.
Instead, we provide two structural properties of the value function.

Lemma 1 (Monotonicity). ForMN(0, 1), V(x) is non-decreasing in si for 1 ≤ i ≤ N.


Entropy 2021, 23, 1572 7 of 39

Proof. Leveraging the iterative nature of VIA, we use mathematical induction to prove the
desired results. The complete proof can be found in Appendix A.

Before introducing the next structural property, we make the following definition.

Definition 1 (Statistically identical). Two users are said to be statistically identical if the user-
dependent parameters and the adopted time penalty functions are the same.

For the users that are statistically identical, we can prove the following

Lemma 2 (Equivalence). ForMN(0, 1), if users j and k are statistically identical, V(x) = V
(P(x)) where P(x) is state x with xj and xk exchanged.

Proof. Leveraging the iterative nature of VIA, we use mathematical induction to prove the
desired results. At each iteration, we show that for each feasible action at state x, we can
find an equivalent action at state P(x). Two actions are equivalent if they lead to the same
value function. The complete proof can be found in Appendix B.

Equipped with the above lemmas, we proceed with characterizing the structural
properties of the optimal policy. We recall that the optimal action at each state can be
characterized by the value function. Hence, we denote, by V j(x), the value function
resulting from choosing user j to update at state x. Then, V j(x) can be calculated by

V j(x) = C(x)− θ + ∑
x′−x′j


(

∏
i 6=j

Pxi ,x′i
(0)

)
∑
r̂′j

P(r̂′j)

∑
s′j

Psj ,s′j
(1, r̂j)V(x′)

.

If V j(x) < Vk(x) for all k 6= j, it is optimal to transmit the update from user j. When
V j(x) = Vk(x), the two choices are equally desirable. In the following, we will characterize
the properties of δj,k(x) , V j(x)−Vk(x) for any j and k.

Theorem 1 (Structural properties). ForMN(0, 1), δj,k(x) has the following properties

1. δj,k(x) ≤ 0 if r̂k = p0
e,k = 0. The equality holds when sj = 0 or r̂j = p0

e,j = 0.

2. δj,k(x) is non-increasing in r̂j and is non-decreasing in r̂k when sj, sk > 0. At the same time,
δj,k(x) is independent of r̂i for any i 6= j, k.

3. δj,k(x) ≤ 0 if sk = 0. The equality holds when sj = 0 or r̂j = p0
e,j = 0.

4. δj,k(x) is non-increasing in sj if Γ
r̂j
j ≤ Γr̂k

k and is non-decreasing in sk if Γ
r̂j
j ≥ Γr̂k

k when

sj, sk > 0. We define Γ1
i , αi

1−pi
and Γ0

i , βi
1−pi

for 1 ≤ i ≤ N.

5. δj,k(x) ≤ 0 if sj ≥ sk, r̂j ≥ r̂k, and users j and k are statistically identical.

Proof. The proof can be found in Appendix C.

We notice that Γr̂i
i can be written as

Γr̂i
i =

Pr(si + 1 | (si, r̂i), ai = 1)
Pr(si + 1 | (si, r̂i), ai = 0)

< 1,

where si can be any positive integer. Consequently, Γr̂i
i is independent of any si > 0 and

indicates the decrease in the probability of increasing si caused by action ai = 1. When Γr̂i
i

is large, action ai = 1 will achieve a small decrease in the probability of increasing si. In the
following, we provide an intuitive interpretation of why the monotonicity in Property 4 of

Theorem 1 depends on Γr̂i
i . We take the case of Γ

r̂j
j ≤ Γr̂k

k as an example and assume that
there are only users j and k in the system. Then, according to Section 2.3, the dynamic of sj
and sk can be divided into the following three cases


Entropy 2021, 23, 1572 8 of 39

• Neither sj nor sk increases. In this case, both sj and sk become zero.
• Either sj or sk increases and the other becomes zero. We denote by Pk

j the probability
that only sk increases when aj = 1. The notation for other cases is defined analogously.
The probabilities can be obtained easily using the results in Section 2.3.

• Both sj and sk increase. We denote by Pj the probability that both sj and sk increase
when aj = 1. Pk is defined analogously. The probabilities can be obtained easily using
the results in Section 2.3.

We notice that δj,k(x) implies the tendency of the base station to choose between the
two users. The larger δj,k(x) is, the more the base station tends to choose user k. Thus, we
investigate the base station’s propensity to choose user k when sk increases but sj stays
the same. We ignore the case where the resulting sk is zero since it is independent of the
increase in sk. With this in mind, we first notice that Pk

k ≤ Pk
j . Meanwhile, we can easily

verify that
Pj
Pk

=
Γ

r̂j
j

Γ
r̂k
k

. When Γ
r̂j
j ≤ Γr̂k

k , we have Pj ≤ Pk. Then, there exists a subtle trade-off.

More precisely, choosing user k will result in Pk
k ≤ Pk

j , but at the cost of Pk ≥ Pj. Hence,
in this case, the propensity of the base station is hard to determine. Following the same
line, we can show that choosing user j will lead to Pj

j ≤ Pj
k and Pj ≤ Pk. Thus, there exists

no such trade-off when we investigate the base station’s propensity to choose user j as sj
increases but sk stays the same.

Leveraging Theorem 1, we can provide some specific structural properties of the
optimal policy.

Corollary 1 (Application of Theorem 1). When M = 1, the optimal policy for PP must satisfy
the following

1. The user i with r̂i = p0
e,i = 0 or si = 0 will not be chosen unless it is to break the tie.

2. When user j is chosen at state x1, then for state x2, such that r̂1,j ≤ r̂2,j and s1,i = s2,i for
1 ≤ i ≤ N, the optimal choice must be in the set G = {j} ∪ {k : r̂1,k < r̂2,k}.

3. When N = 2, we consider two states, x1 and x2, which differ only in the value of sj.

Specifically, s1,j ≤ s2,j. If user j is chosen at state x1 and Γ
r̂1,j
j ≤ Γr̂1,k

k , the optimal choice at
state x2 will also be user j.

4. When N = 2, we consider two states, x1 and x2, which differ only in the value of sk.

Specifically, s1,k ≥ s2,k. If user j is chosen at state x1 and Γ
r̂1,j
j ≥ Γr̂1,k

k , the optimal choice at
state x2 will also be user j.

5. When all users are statistically identical, the optimal choice at any time slot must be either
the user with x = (smax,1, 1) where smax,1 , maxsi{(si, 1)} or the user with x = (smax,0, 0)
where smax,0 , maxsi{(si, 0)}. Moreover,

• If smax,1 ≥ smax,0, it is optimal to choose the user with x = (smax,1, 1).
• If smax,1 < smax,0, the optimal choice will switch from the user with x = (smax,0, 0) to

the user with x = (smax,1, 1) when smax,1 increases from 0 to smax,0 solely.

Proof. The first property follows directly from Property 1 and Property 3 of Theorem 1.
For the second property, leveraging Property 2 of Theorem 1, we have δj,k(x2) ≤ δj,k(x1) ≤
0 if r̂1,j ≤ r̂2,j, r̂1,k ≥ r̂2,k, and s1,i = s2,i for 1 ≤ i ≤ N. Thus, the optimal choice will not
be user k in this case. Then, we can conclude that the optimal choice must be in the set
G = {j} ∪ {k : r̂1,k < r̂2,k}.

For the third property, we have proved in Property 4 of Theorem 1 that δj,k(x) is

non-increasing in sj if Γ
r̂j
j ≤ Γr̂k

k . Hence, δj,k(x2) ≤ δj,k(x1) ≤ 0. As we consider the case of
N = 2, the optimal choice at state x2 will also be user j. The fourth property can be shown

in a similar way by noticing that δj,k(x) is non-decreasing in sk when Γ
r̂j
j ≥ Γr̂k

k .
For the last property, we recall from Property 5 of Theorem 1 that it is always better to

choose the user with a larger s if they are statistically identical and have the same r̂. Thus,


Entropy 2021, 23, 1572 9 of 39

we can conclude that the optimal choice must be either the user with x = (smax,1, 1) or
the user with x = (smax,0, 0). Without a loss of generality, we assume xj = (smax,1, 1) and
xk = (smax,0, 0). Now, we distinguish between the following cases

• According to Property 5 of Theorem 1, we can conclude that it is optimal to choose
user j when smax,1 ≥ smax,0.

• To determine the optimal choice in the case of smax,1 < smax,0, we recall that the optimal
choice will be user k (i.e., δj,k(x) ≥ 0) if sj = 0 and will be user j (i.e., δj,k(x) ≤ 0) if
sj = sk. At the same time, Property 4 of Theorem 1 tells us that δj,k(x) is non-increasing
in sj when users j and k are statistically identical. Therefore, we can conclude that the
optimal choice will switch from user k to user j when sj increases from 0 to sk solely.

4. Whittle’s Index Policy

Whittle’s index policy is a well-known low-complexity heuristic that shows a strong
performance in many problems that belong to RMAB [22–24]. In this section, we develop
Whittle’s index policy for PP. We first present the general procedures we adopt to obtain
Whittle’s index.

• We first formulate a relaxed version of PP and apply the Lagrangian approach.
• Then, we decouple the problem of minimizing the Lagrangian function into N decou-

pled problems, each of which only considers a single user. By casting the decoupled
problem into an MDP, we investigate the structural properties and performance of the
optimal policy.

• Leveraging the results above and under a simple condition, we establish the indexa-
bility of the decoupled problem.

• Finally, we obtain the expression of Whittle’s index by solving the Bellman equation.

4.1. Relaxed Problem

The first step in obtaining Whittle’s index is to formulate the Relaxed Problem (RP).
More precisely, instead of requiring the limit on the number of updates allowed per
transmission attempt to be met in each time slot, we relax the constraint such that the limit
is not violated in an average sense. Then, RP can be formulated as

arg min
φ ∈ Φ

∆̄φ , lim
T→∞

1
T
Eφ

(
T−1

∑
t=0

N

∑
i=1

fi(si,t)

)
(6a)

subject to ρ̄φ , lim
T→∞

1
T
Eφ

(
T−1

∑
t=0

N

∑
i=1

ai,t

)
≤ M. (6b)

As RP is specified, we apply the Lagrangian approach. First of all, we write RP into its
Lagrangian form.

L(λ, φ) = lim
T→∞

1
T
Eφ

(
T−1

∑
t=0

N

∑
i=1

( fi(si,t) + λai,t)

)
− λM,

where λ ≥ 0 is the Lagrange multiplier. Then, we investigate the problem of minimizing
the Lagrangian function. Since λM is independent of policies, we can ignore it. More
precisely, we consider the following minimization problem

minimize
φ ∈ Φ

lim
T→∞

1
T
Eφ

(
T−1

∑
t=0

N

∑
i=1

( fi(si,t) + λai,t)

)
. (7)

4.2. Decoupled Model

In this section, we formulate the decoupled problem and investigate its optimal policy.
The decoupled model associated with each user follows the system model with N = 1.


Entropy 2021, 23, 1572 10 of 39

Since all the users share the same structure, we drop the user-dependent subscript i for
simplicity. Then, the decoupled problem can be formulated as

minimize
φ ∈ Φ′

lim
T→∞

1
T
Eφ

(
T−1

∑
t=0

( f (st) + λat)

)
, (8)

where Φ′ is the set of all causal policies when N = 1. We notice that problem (8) can be cast
into the MDPM1(λ,−1). We define M = −1 when there is no restriction on the number
of updates allowed per transmission attempt.

We first investigate the structural properties of the optimal policy forM1(λ,−1) when
λ is a given non-negative constant. We start with characterizing the corresponding value
function V(x).

Corollary 2 (Extension of Lemma 1). ForM1(λ,−1), V(x) is non-decreasing in s.

Proof. The proof follows the same steps as in the proof of Lemma 1. The complete proof
can be found in Appendix D.

Equipped with the above corollary, we can characterize the structural properties of
the optimal policy for (8).

Proposition 1 (Optimal policy for decoupled problem). The optimal policy for the decoupled
problem is a threshold policy with the following properties.

• The optimal policy can be fully captured by n = (n0, n1). More precisely, when the system is
at state (s, r̂), it is optimal to make a transmission attempt only when s ≥ nr̂.

• n0 ≥ n1 > 0.

Proof. We define ∆V(x) , V1(x)− V0(x), where Va(x) is the value function resulting
from taking action a at state x. Then, the optimal action at state x is a = 1 if ∆V(x) < 0,
and a = 0 is optimal otherwise. We use Corollary 2 to characterize the sign of ∆V(x).
The complete proof can be found in Appendix E.

In the following, we evaluate the performance of the threshold policy detailed in
Proposition 1. More precisely, we calculate the expected AoII ∆̄n and the expected transmis-
sion rate ρ̄n resulting from the adoption of threshold policy n. We will see in the following
that ∆̄n and ρ̄n are essential for establishing the indexability and obtaining the expression
of Whittle’s index.

Proposition 2 (Performance). Under threshold policy n = (n0, n1),

∆̄n = π0 p

[
n1−1

∑
k=1

f (k)(1− p)k−1 + (1− p)n1−1

(
n0−1

∑
k=n1

f (k)ck−n1
1 + cn0−n1

1

+∞

∑
k=n0

f (k)ck−n0
2

)]
,

ρ̄n = π0 p(1− p)n1−1
[

γ

1− c1
+ cn0−n1

1

(
1

1− c2
− γ

1− c1

)]
,

where
π0 =

1

2 + p(1− p)n1−1
[

1
1− c1

− 1
p
+ cn0−n1

1

(
1

1− c2
− 1

1− c1

)] ,

c1 = (1− γ)(1− p) + γα, and c2 = (1− γ)β + γα.

Proof. We notice that the dynamic of AoII under the threshold policy can be fully captured
by a Discrete-Time Markov Chain (DTMC). Then, combined with the fact that r̂ is an inde-
pendent Bernoulli random variable, we can obtain the desired results from the stationary
distribution of the induced DTMC. The complete proof can be found in Appendix F.


Entropy 2021, 23, 1572 11 of 39

As f (·) can be any non-decreasing function, ∆̄ can grow indefinitely. Thus, it is
necessary to require that there exists at least one threshold policy that causes a finite ∆̄.
By noting that 1− p ≥ c1 ≥ c2, we have

∆̄ ≥ π0 p

[
n1−1

∑
k=1

f (k)ck−1
2 + cn1−1

2

(
n0−1

∑
k=n1

f (k)ck−n1
2 + cn0−n1

2

+∞

∑
k=n0

f (k)ck−n0
2

)]

= π0 p

(
+∞

∑
k=1

f (k)ck−1
2

)
.

The equality is achieved when n0 = n1 = 1. Then, we can conclude that it is sufficient to
require ∑+∞

k=1 f (k)ck−1
2 < +∞. This will be the underlying assumption throughout the rest

of this paper.

4.3. Indexability

In this section, we establish the indexability of the decoupled problem, which ensures
the existence of Whittle’s index. We start with the definition of indexability.

Definition 2 (Indexability). The decoupled problem is indexable if the set of states in which a = 0
is the optimal action increases with λ, that is,

λ′ < λ =⇒ D(λ′) ⊆ D(λ),

where D(λ) is the set of states in which a = 0 is optimal when Lagrange multiplier λ is adopted.

The Lagrange multiplier λ can be viewed as a cost associated with each transmission
attempt. Intuitively, as λ increases, the base station should stay idle (i.e., a = 0) for a longer
time until s becomes large enough to offset the cost. Although it is intuitively correct that
the decoupled problem is indexable, the indexability is hard to establish as the optimal
policy is characterized by two thresholds. Thus, Whittle’s index does not necessarily exist.
However, the indexability can be established when the following condition is satisfied

p0
e,i = 0 f or 1 ≤ i ≤ N. (9)

Remark 3. Problem (9) only requires the estimate r̂i to be perfect when r̂i = 0. In the case of
r̂i = 1, we still allow the estimate to be inaccurate.

When (9) is satisfied, Propositions 1 and 2 reduce to the following

Corollary 3 (Consequences of (9)). When (9) is satisfied, the optimal policy for the decoupled
problem (8) is the threshold policy n = (+∞, n). The corresponding ∆̄n and ρ̄n are

∆̄n = π0 p

(
n−1

∑
k=1

f (k)(1− p)k−1 + (1− p)n−1
+∞

∑
k=n

f (k)ck−n
1

)
,

ρ̄n = π0 p(1− p)n−1
(

γ

1− c1

)
,

where
π0 =

1

2 + p(1− p)n−1
(

1
1− c1

− 1
p

) .

Proof. We continue with the same notations as in the proof of Propositions 1 and 2. It is
sufficient to show that n0 = +∞. To this end, we consider the state x = (s, 0). By following
the same steps as in the proof of Proposition 1, we have


Entropy 2021, 23, 1572 12 of 39

∆V(s, 0) = λ ≥ 0.

Therefore, it is optimal to stay idle (i.e., a = 0) at state x = (s, 0) for any s ≥ 0. Equivalently,
n0 = +∞. Then, the corresponding ∆̄n and ρ̄n can be calculated as a special case of
Proposition 2 where n0 = +∞, n1 = n, and p0

e = 0.

Leveraging Corollary 3, we can establish the indexability of the decoupled problem.

Proposition 3 (Indexability of decoupled problem). The decoupled problem is indexable when
(9) is satisfied.

Proof. According to Proposition 2.2 of [25], we only need to verify that the expected
transmission rate ρ̄n is strictly decreasing in n. From Corollary 3, we have

ρ̄n =

γ

(
p

1− c1

)
2

(1− p)n−1 +

(
p

1− c1
− 1
) .

As 1
2 < 1− p < 1, we can easily verify that ρ̄n is strictly decreasing in n. Thus, the decou-

pled problem is indexable when (9) is satisfied.

4.4. Whittle’s Index Policy

In this section, we proceed with finding the expression of Whittle’s index and defining
Whittle’s index policy. First of all, we give the definition of Whittle’s index.

Definition 3 (Whittle’s index). When the decoupled problem is indexable, Whittle’s index at state
x is defined as the infimum λ, such that both actions are equally desirable. Equivalently, Whittle’s
index at state x is defined as the infimum λ such that V0(x) = V1(x).

Let us denote by Wx the Whittle’s index at state x. Then, the expression of Whittle’s
index is given by the following Proposition.

Proposition 4 (Whittle’s index). When (9) is satisfied, Whittle’s index is

Wx =



0 when x = (0, r̂) or x = (s, 0),

(1− c1)
+∞

∑
k=s+1

f (k)ck−s−1
1 − ∆̄s

(1− c1)(1− p)− γ(1− p− α)

c1(1− p− α)
+ ρ̄s

when x = (s, 1),

where s > 0 and c1 = (1− γ)(1− p) + γα. ∆̄s and ρ̄s are the expected AoII and the expected
transmission rate when threshold policy n = (+∞, s) is adopted, respectively. At the same time,
Wx is non-negative and is non-decreasing in s.

Proof. Whittle’s indexes at state x = (0, r̂) and x = (s, 0) are obtained easily from the
proof of Proposition 1. For state x = (s, 1), we first use backward induction to calculate
the expressions of some value functions. Then, the expression of Whittle’s index can be
obtained from its definition. The complete proof can be found in Appendix G.

Definition 4 (Whittle’s index policy). At any state x = (x1, x2, . . . , xN), the base station will
transmit the updates from M users with the largest Wxi . The ties are broken arbitrarily. Wxi is
calculated using Proposition 4 with the parameters of user i.

Remark 4. Whittle’s index policy possesses the structural properties detailed in Corollary 1.


Entropy 2021, 23, 1572 13 of 39

• The first two properties can be verified by noting that Wxi ≥ 0 and the equality holds when
r̂i = 0 or si = 0. At the same time, Wxi is non-decreasing in r̂i.

• The third and fourth properties can be verified by noting that Wxi is non-decreasing in si.
• For the last property, we first notice that Wxj = Wxk when users j and k are statistically

identical and xj = xk. Then, the property can be verified by noting that Wxi is non-decreasing
in both si and r̂i.

5. Optimal Policy for Relaxed Problem

In this section, we provide an efficient algorithm to obtain the optimal policy for RP,
based on which we will develop another scheduling policy for PP in the next section that
is free from indexability. At the same time, the performance of the optimal policy for RP
forms a universal lower bound because the following ordering holds

∆̄RP
AoII ≤ ∆̄PP

AoII ,

where ∆̄RP
AoII and ∆̄PP

AoII are the minimal expected AoII of RP and PP, respectively.

Remark 5. Note that the optimal policy for RP may not necessarily be a valid policy for PP,
as the transmitter may transmit more than M updates in one transmission attempt under RP-
optimal policy.

To solve RP, we follow the discussion in Section 4.1. More precisely, we take the
Lagrangian approach and consider the problem reported in (7). We will see in the following
discussion that the optimal policy for RP can be characterized by the optimal policies for
problem (7). Therefore, we first cast problem (7) into the MDP MN(λ,−1). However,
the optimal policy forMN(λ,−1) is difficult to obtain because the state space is infinite.
Even though we can make the state space finite by imposing an upper limit on the value
of s, the state space and the action space grow exponentially with the number of users in
the system. To overcome the difficulty, we investigate the optimal policy forMi

1(λ,−1)
where 1 ≤ i ≤ N. The superscript i means that the only user in the system is user i. We
will show later that the optimal policy forMN(λ,−1) can be fully characterized by the
optimal policies forMi

1(λ,−1) where 1 ≤ i ≤ N.

5.1. Optimal Policy for Single User

In this section, we tackle the problem of finding the optimal policy forMi
1(λ,−1).

Since the users share the same structure, we ignore the superscript i for simplicity. To find
the optimal policy, we first use the Approximating Sequence Method (ASM) introduced
in [26] to make the state space finite. More precisely, we impose s ≤ m where m is a
predetermined upper limit. The state transition probabilities P′s,s′(a, r̂) are modified in the
following way

P′s,s′(a, r̂) =

{
Ps,s′(a, r̂) i f s′ < m,
Ps,s′(a, r̂) + ∑z>m Ps,z(a, r̂) i f s′ = m.

(10)

The action space and the instant cost remain unchanged. Then, we can apply Relative
Value Iteration (RVI) with convergence criteria ε to obtain the optimal policy. We notice
thatM1(λ,−1) coincides with the decoupled model studied in Section 4.2. Hence, we can
utilize the threshold structure of the optimal policy to improve RVI. To this end, we class
a state as active if the optimal action at this state is a = 1. Then, the threshold structure
detailed in Proposition 1 tells us the following. For any state x, if there exists an active state
x1 with s1 ≤ s and r̂1 ≤ r̂, then x must also be active. Hence, we can determine the optimal
action at state x immediately instead of comparing all feasible actions. In this way, we can
reduce the running time of RVI. The pseudocode for the improved RVI can be found in
Algorithm A1 of Appendix M. A similar technique is also presented in [5].


Entropy 2021, 23, 1572 14 of 39

For M1(λ,−1), when problem (9) is satisfied, Whittle’s index exists and can be
calculated efficiently using Proposition 4. Therefore, we can obtain the optimal policy using
Whittle’s index and further reduce the computational complexity. To this end, we denote
by nλ the optimal policy forM1(λ,−1) and present the following proposition

Proposition 5 (Optimal deterministic policy). When (9) is satisfied, the optimal policy for
M1(λ,−1) is nλ = (+∞, n) where n is given by

n =

{
1 i f λ = 0,
max{s ∈ N0 : Ws ≤ λ}+ 1 i f λ > 0.

Ws is the Whittle’s index at state (s, 1).

Proof. We first notice that M1(λ,−1) coincides with the decoupled model studied in
Section 4.2. Then, we show the optimal action for each state with r̂ = 1 using the definition
of Whittle’s index and the fact that the decoupled problem is indexable when (9) is satisfied.
The complete proof can be found in Appendix H.

In the following, we provide a randomized policy that is also optimal forM1(λ,−1).
We will see later that the randomized policy is the key to obtaining the optimal policy
for RP.

Theorem 2 (Optimal randomized policy). There exist two deterministic policies nλ+ and nλ− ,
which are both optimal forM1(λ,−1). We consider the following randomized policy nλ: every time
the system reaches state (0, 0), the base station will make the choice between nλ− with probability
µ and nλ+ with probability 1− µ. The chosen policy will be followed until the next choice. Then,
the randomized policy nλ is optimal forM1(λ,−1) under any µ ∈ [0, 1].

Proof. We show that our system verifies the assumptions given in [27]. Then, leveraging
the characteristics of our system, we can obtain the optimal randomized policy. The com-
plete proof can be found in Appendix I.

In practice, we approximate λ+ ≈ λ + ξ and λ− ≈ λ− ξ where ξ is a small pertur-
bation. Then, the deterministic policies nλ+ and nλ− can be obtained by following the
discussion at the beginning of this subsection. Note that, in most cases, nλ+ and nλ− are
the same.

5.2. Optimal Policy for RP

In this section, we characterize the optimal policy for RP. Let us denote by V(x) and
Vi(xi) the value functions ofMN(λ,−1) andMi

1(λ,−1), respectively. Then, we can prove
the following

Proposition 6 (Separability). V(x) = ∑N
i=1 Vi(xi) where x = (x1, . . . , xN). In other words,

the policy, under which each user adopts its own optimal policy, is optimal forMN(λ,−1).

Proof. We show V(x) = ∑N
i=1 Vi(xi) by comparing the Bellman equations they must satisfy.

The complete proof can be found in Appendix J.

We denote the optimal policy forMN(λ,−1) as φλ = [nλ,1, . . . , nλ,N ] where nλ,i is the
optimal policy forMi

1(λ,−1). For simplicity, we define ∆̄(λ) and ρ̄(λ) as the expected
AoII and the expected transmission rate associated with φλ, respectively. ∆̄i(λ) and ρ̄i(λ)
are defined analogously for user i under policy nλ,i. We also define λ∗ , inf{λ > 0 :
ρ̄(λ) ≤ M}. With Proposition 6 and the above definitions in mind, we proceed with
constructing the optimal policy for RP.


Entropy 2021, 23, 1572 15 of 39

Theorem 3 (Optimal policy for RP). The optimal policy for RP can be characterized by two
deterministic policies φλ∗+ = [nλ∗+ ,1, . . . , nλ∗+ ,N ] and φλ∗− = [nλ∗− ,1, . . . , nλ∗− ,N ] where nλ∗+ ,i and
nλ∗− ,i are both the optimal deterministic policies forMi

1(λ
∗,−1). Then, we mix φλ∗+ and φλ∗− in

the following way: for each user i, every time the user reaches state (0, 0), the base station will make
the choice between nλ∗− ,i with probability µi and nλ∗+ ,i with probability 1− µi. The chosen policy
will be followed by user i until the next choice. Where 1 ≤ i ≤ N, the µi is chosen in such a way as
to satisfy

N

∑
i=1

ρ̄i(λ∗) =
N

∑
i=1

(
µi ρ̄

i(λ∗−) + (1− µi)ρ̄
i(λ∗+)

)
= M. (11)

Then, the mixed policy, denoted by φλ∗ , is optimal for RP.

Proof. According to Lemma 3.10 of [27], a policy is optimal for RP if

1. It is optimal forMN(λ
∗,−1);

2. The resulting expected transmission rate is equal to M.

Then, we construct such a policy using Theorem 2 and Proposition 6. The complete proof
can be found in Appendix K.

Since we approximate λ∗+ ≈ λ∗ + ξ and λ∗− ≈ λ∗ − ξ in practice, ρ̄i(λ∗+) ≤ ρ̄i(λ∗−)
for all i according to the monotonicity given by Lemma 3.4 of [27]. Combining with the
definition of λ∗, we must have ρ̄(λ∗+) ≤ M < ρ̄(λ∗−). Therefore, we can always find µi’s
that realize (11). In this paper, we choose

µi = µ =
M− ρ̄(λ∗+)

ρ̄(λ∗−)− ρ̄(λ∗+)
, f or 1 ≤ i ≤ N. (12)

Then, we describe the algorithm used to obtain the optimal policy for RP. As detailed
in Theorem 3, it is essential to find λ∗. To this end, we recall that, for any user i under given
λ, the optimal deterministic policy nλ,i can be obtained using the results in Section 5.1
and the resulting expected transmission rate ρ̄i(λ) is given by Proposition 2. Since ρ̄i(λ)
is non-increasing in λ for all i according to Lemma 3.4 of [27], ρ̄(λ) = ∑N

i=1 ρ̄i(λ) is also
non-increasing in λ. Hence, we can regard ρ̄(λ) as a non-increasing function of λ. Then,
according to the definition of λ∗, we can use the Bisection search to obtain λ∗ efficiently.
The main steps can be summarized as follows.

1. Initialize λ− = 0 and λ+ = 1.
2. Do λ− = λ+ and λ+ = 2λ+ until ρ̄(λ+) < M.
3. Run Bisection search on the interval [λ−, λ+] until the tolerance 2ξ is met.

Then, λ∗− and λ∗+ can simply be the boundaries of the final interval. The pseudocode for
the Bisection search can be found in Algorithm A2 of Appendix M. After obtaining λ∗− and
λ∗+, the optimal policy φλ∗ is detailed in Theorem 3 and the mixing probabilities µi’s are
given by (12).

Remark 6. We recall that the optimal deterministic policy for each user can be characterized by
two positive thresholds (i.e., n0, n1 > 0). Consequently, under RP-optimal policy, the base station
will never choose the user at state (0, r̂). Then, when M increases, the expected transmission rate
achieved by RP-optimal policy will saturate before M reaches N. When the expected transmission
rate saturates, the RP-optimal policy is φ∗ = [n1, . . . , nN ] where ni = (1, 1) for 1 ≤ i ≤ N.
The saturation happens when M is larger than or equal to the expected transmission rate achieved
by φ∗.

6. Indexed Priority Policy

Although the performance of Whittle’s index policy is known to be good, it requires
indexability, which is usually difficult to establish. In this section, based on the primal-
dual heuristic introduced in [28], we develop a policy that does not require indexability


Entropy 2021, 23, 1572 16 of 39

and has comparable performance to Whittle’s index policy. We start with presenting the
primal-dual heuristic.

6.1. Primal-Dual Heuristic

The heuristic is based on the optimal primal and dual solution pair to the linear
program associated with RP. To introduce the linear program, we define π

ai
xi (φ) ≥ 0 as the

expected time that user i is at state xi and action ai is taken according to policy φ. Then,
for any φ, π

ai
xi (φ) must satisfy the following problems

π0
xi
(φ) + π1

xi
(φ) = ∑

x′i

∑
a′i

Px′i ,xi
(a′i)π

a′i
x′i
(φ), ∀xi, i.

∑
xi

∑
ai

π
ai
xi (φ) = 1, ∀i.

The objective function of RP can be rewritten as

minimize
φ ∈ Φ

N

∑
i=1

∑
xi ,ai

C(xi)π
ai
xi (φ),

where C(xi) = fi(si) is the instant cost at state xi. The constraint on the expected transmis-
sion rate can be rewritten as

N

∑
i=1

∑
xi

π1
xi
(φ) ≤ M.

Thus, the linear program associated with RP can be formulated as the following

minimize
π

ai
xi

N

∑
i=1

∑
xi ,ai

C(xi)π
ai
xi (13a)

subject to π0
xi
+ π1

xi
−∑

x′i

∑
a′i

Px′i ,xi
(a′i)π

a′i
x′i
= 0, ∀xi, i, (13b)

∑
xi

∑
ai

π
ai
xi = 1, ∀i, (13c)

N

∑
i=1

∑
xi

π1
xi
≤ M, (13d)

π
ai
xi ≥ 0, ∀xi, ai, i. (13e)

The corresponding dual problem is

maximize
σ, σi, σxi

N

∑
i=1

σi −Mσ (14a)

subject to σxi + σi −∑
x′i

Pxi ,x′i
(0)σx′i

≤ C(xi), ∀xi, i, (14b)

σxi + σi −∑
x′i

Pxi ,x′i
(1)σx′i

− σ ≤ C(xi), ∀xi, i, (14c)

σ ≥ 0. (14d)

Let {π̄ai
xi} and {σ̄, σ̄i, σ̄xi} be the optimal primal and dual solution pair to the problems

reported in (13) and (14). We define

ψ̄0
xi
= ∑

x′i

Pxi ,x′i
(0)σ̄x′i

+ C(xi)− σ̄i − σ̄xi ≥ 0,

ψ̄1
xi
= ∑

x′i

Pxi ,x′i
(1)σ̄x′i

+ σ̄ + C(xi)− σ̄i − σ̄xi ≥ 0.


Entropy 2021, 23, 1572 17 of 39

For any state x = (x1, . . . , xN), let h(x) = ∑N
i=1 1{π̄1

xi>0}. Then, the heuristic operates in the

following way

• If h(x) ≥ M, the base station will choose the M users with the largest ψ̄0
xi

among the
h(x) users.

• If h(x) < M, these h(x) users are chosen by the base station. The base station will
choose M− h(x) additional users with the smallest ψ̄1

xi
.

However, Linear Programming (LP) is a very general technique and does not appear
to take advantage of the special structure of the problem. Although there are algorithms for
solving rational LP that take time polynomial in the number of variables and constraints,
they run extremely slowly in practice [29]. For our problem, we notice that the users have
separate activity areas that are linked through a common resource constraint. Therefore,
the primal problem can be solved using Dantzig-Wolfe decomposition. Even so, the prob-
lem is still computationally demanding when the system scales up. We recall that we
solved the exact problem efficiently using MDP-specific algorithms in Section 5. It is more
efficient because of the following reasons

• According to Proposition 6, we can decompose the problem into N subproblems.
• For each subproblem, the threshold structure of the optimal policy is utilized to reduce

the running time of RVI.
• As we will see later, the developed policy can be obtained directly from the result of

RVI in practice.

In the following, we will translate the results in Section 5 into the optimal primal and dual
solution pair and propose Indexed priority policy.

6.2. Indexed Priority Policy

We first define the Lagrangian function associated with (13).

L(πai
xi , σ, σi, σxi , ψ

ai
xi ) =

( N

∑
i=1

∑
xi ,ai

C(xi)π
ai
xi

)
+ ∑

i,xi

σxi

(
∑
x′i

∑
a′i

Px′i ,xi
(a′i)π

a′i
x′i
− π0

xi
− π1

xi

)
+

N

∑
i=1

σi

(
1−∑

xi

∑
ai

π
ai
xi

)
+ σ

( N

∑
i=1

∑
xi

π1
xi
−M

)
− ∑

i,xi ,ai

ψ
ai
xi π

ai
xi .

Then, the corresponding Lagrangian dual function is

g(σ, σi, σxi , ψ
ai
xi ) = inf

π
ai
xi

L(πai
xi , σ, σi, σxi , ψ

ai
xi ).

Let πxi be the expected time that user i is at state xi caused by the adoption of φλ∗ , where
φλ∗ is the optimal policy detailed in Theorem 3. Then, we define {πai

xi} as follows

• State xi is where randomization happens (randomization happens when the actions
suggested by the two optimal deterministic policies are different), and it has a value
of π0

xi
= anλ∗− ,i

(xi)(1− µi)πxi + anλ∗+ ,i
(xi)µiπxi and π1

xi
= πxi − π0

xi
where µi is given

by (12) and anλ,i (xi) is the action suggested by nλ,i at state xi.
• For other values of xi, we have π0

xi
= (1− anλ∗ ,i (xi))πxi and π1

xi
= πxi − π0

xi
.

We also define σ = λ∗, σi = θi, and σxi = Vi(xi) where λ∗ is specified in Section 5.2, θi is the
optimal value ofMi

1(λ
∗,−1), and Vi(xi) is the value function associated withMi

1(λ
∗,−1).

Lastly, we define {ψai
xi} as follows

ψ0
xi
= ∑

x′i

Pxi ,x′i
(0)σx′i

+ C(xi)− σi − σxi ,

ψ1
xi
= ∑

x′i

Pxi ,x′i
(1)σx′i

+ σ + C(xi)− σi − σxi .


Entropy 2021, 23, 1572 18 of 39

Then, we can prove the following proposition.

Proposition 7 (Optimal solution pair). {πai
xi} and {σ, σi, σxi , ψ

ai
xi} are primal and dual solutions

to (13), respectively.

Proof. Since (13) is linear and strictly feasible, it is sufficient to show that {πai
xi} and

{σ, σi, σxi , ψ
ai
xi} verify the KKT conditions, which can be expressed as the following four con-

ditions.

1. Primal feasibility: the constraints in (13) are satisfied.
2. Dual feasibility: σ ≥ 0 and ψ

ai
xi ≥ 0 for all xi, ai, and i.

3. Complementary slackness: σ
(

∑N
i=1 ∑xi

π1
xi
− M

)
= 0 and ψ

ai
xi π

ai
xi = 0 for all xi, ai,

and i.
4. Stationarity: the gradient of L(πai

xi , σ, σi, σxi , ψ
ai
xi ) with respect to {πai

xi} vanishes.

Apparently, the first condition is satisfied by {πai
xi}. For the second condition, σ ≥ 0 since

σ = λ∗ ≥ 0 by definition. For ψ
ai
xi , we can verify that ψ

ai
xi = Vi,ai (xi) − Vi(xi) where

Vi,ai (xi) is the value function resulting from taking action ai at state xi. Then, the non-
negativity is guaranteed by the Bellman equation. For the third condition, the first term
is zero because we choose the µi’s given by (12). For the second term, we recall that
ψ

ai
xi = Vi,ai (xi)−Vi(xi). According to the definition of π

ai
xi , we know Vi(xi) = Vi,ai (xi) if

π
ai
xi > 0. Combined together, we can conclude that ψ

ai
xi = 0 when π

ai
xi > 0. Thus, the third

condition is satisfied. For the last condition, setting the gradient equal to zero yields a
system of linear equations. More precisely, for each xi and 1 ≤ i ≤ N

∑
x′i

Pxi ,x′i
(0)σx′i

+ C(xi) = σxi + σi + ψ0
xi

.

∑
x′i

Pxi ,x′i
(1)σx′i

+ σ + C(xi) = σxi + σi + ψ1
xi

.

Then, {σ, σi, σxi , ψ
ai
xi} verifies the system of linear equations by definition. Since all four

conditions are satisfied, we can conclude our proof.

According to Proposition 7, we know that {πai
xi} and {σ, σi, σxi} defined above are

the optimal solutions to problems (13) and (14), respectively. As the optimal solutions are
obtained, we can adopt the heuristic detailed in Section 6.1.

The heuristic can be expressed equivalently as an index policy. To this end, we define
the index Ixi for state xi as

Ixi , ψ̄0
xi
− ψ̄1

xi
.

According to the complementary slackness, Ixi can be reduced to the following.

• For state xi such that π̄1
xi
> 0 and π̄0

xi
= 0, we have ψ̄1

xi
= 0. Therefore, Ixi = ψ̄0

xi
≥ 0.

• For state xi such that π̄1
xi
> 0 and π̄0

xi
> 0, we have ψ̄1

xi
= ψ̄0

xi
= 0. Therefore, Ixi = 0.

• For state xi such that π̄1
xi
= 0 and π̄0

xi
> 0, we have ψ̄0

xi
= 0. Therefore, Ixi = −ψ̄1

xi
≤ 0.

We can show that Ixi possesses the following properties.

Proposition 8 (Properties of Ixi ). For 1 ≤ i ≤ N, Ixi ≥ −λ∗ for any xi. The equality holds
when r̂i = p0

e,i = 0 or si = 0. At the same time, Ixi is non-decreasing in both si and r̂i.

Proof. We notice that Ixi can be expressed as a function of Vi(xi) and λ∗. Meanwhile,
Mi

1(λ
∗,−1) coincides with the decoupled model studied in Section 4.2. Then, we can

verify the properties of Ixi using the results in Section 4.2. The complete proof can be found
in Appendix L.

Comparing with the heuristic detailed in Section 6.1, we can define the Indexed
priority policy.


Entropy 2021, 23, 1572 19 of 39

Definition 5 (Indexed priority policy). At any state x = (x1, x2, . . . , xN), the base station will
transmit the updates from M users with the largest Ixi . The ties are broken arbitrarily.

Remark 7. Indexed priority policy belongs to the class of priority policies introduced in [30]. These
priority policies are asymptotically optimal when certain conditions are satisfied.

Remark 8. Indexed priority policy possesses the structural properties detailed in Corollary 1.

• The first two properties can be verified by noting that Ixi ≥ −λ∗ and the equality holds when
r̂i = p0

e,i = 0 or si = 0. At the same time, Ixi is non-decreasing in r̂i.
• The third and fourth properties can be verified by noting that Ixi is non-decreasing in si.
• For the last property, we first notice that Ixj = Ixk when users j and k are statistically identical

and xj = xk. Then, the property can be verified by noting that Ixi is non-decreasing in both si
and r̂i.

We notice that θi’s and C(xi)’s are canceled out by the definition of Ixi . Therefore, Ixi

can be calculated using λ∗ and the value function ofMi
1(λ
∗,−1). In practice, we can use

either λ∗− or λ∗+ to approximate λ∗, and the value function can be approximated by the
result of the RVI detailed in Section 5.1. Since the state space is infinite, we only calculate a
finite number of Vi(xi), the number of which depends on the truncation parameter m of
ASM. Meanwhile, the probabilities Pxi ,x′i

(ai) in Ixi are modified according to (10).

7. Numerical Results

In this section, we provide numerical results to showcase the performance of the
developed scheduling policies. To eliminate the effect of N, we plot the expected average
AoII. In particular, we provide the expected average AoII achieved by the Indexed priority
policy and Whittle’s index policy when M = 1. The policies are calculated using the results
detailed in Sections 4–6. When obtaining the Indexed priority policy, we set the tolerance
in the Bisection search to ξ = 0.005. Meanwhile, we choose the truncation parameter in
ASM m = 800 and the convergence criteria in RVI ε = 0.01. We notice that the calculation
of Whittle’s index involves an infinite sum. In practice, we approximate the result by
replacing +∞ with a large enough number kmax. Here, we choose kmax = 800. For both
scheduling policies, the resulting expected average AoII is obtained via simulations. Each
data point is the average of 15 runs with 15,000 time slots considered in each run.

We also compare the developed policies with the optimal policy for RP, which can
be calculated by following the discussion in Section 5.2. We adopt the same choices of
parameters as we used to obtain the developed policies. The corresponding performance
is calculated using Proposition 2. Like before, the infinite sum is approximated by replac-
ing +∞ with kmax = 800. We also provide the expected average AoII achieved by the
Greedy policy to show the performance advantages of the developed policies. When the
Greedy policy is adopted, the base station always chooses the user with the largest AoII.
The resulting expected average AoII is obtained via the same simulations as applied to the
developed policies.

Figures 3 and 4 illustrate the performance when the source processes have different
dynamics and when each user’s communication goal is different, respectively. Figure 3a
provides the performance when pi = 0.05 + 0.4(i−1)

N−1 for 1 ≤ i ≤ N. For other parameters,
the users make the same choices. More precisely, fi(s) = s, γi = 0.6, and p0

e,i = p1
e,i = 0.1

for 1 ≤ i ≤ N. Figure 4a provides the performance when fi(s) = s0.5+ i−1
N−1 for 1 ≤ i ≤ N.

Same as before, the users make the same choices for other parameters. More precisely,
pi = 0.3, γi = 0.6, and p0

e,i = p1
e,i = 0.1 for 1 ≤ i ≤ N. In Figures 3b and 4b, we force

p0
e,i = 0 for all users to ensure the existence of Whittle’s index. Other choices remain the

same as in Figures 3a and 4a. According to Corollary 1, the optimal policy will never
choose the user with r̂ = p0

e = 0 unless it is to break the tie. Therefore, in Figures 3b and
4b, we also consider the Greedy+ policy where the base station always chooses the user


Entropy 2021, 23, 1572 20 of 39

with the largest AoII among the users with r̂ = 1. The resulting expected average AoII is
obtained via the same simulations as applied to the Greedy policy.

Figure 5 shows the performance in systems where the parameters for each user are
generated uniformly and randomly within their ranges. In Figure 5a, we consider N = 5,
γ ∈ [0, 1], p ∈ [0.05, 0.45], pr̂

e ∈ [0, 0.45], and f (s) = sτ , where τ ∈ [0.5, 1.5]. There are a
total of 300 different choices and the results are sorted by the performance of RP-optimal
policy in ascending order. Figure 5b adopts the same system settings except that we impose
p0

e,i = 0 for 1 ≤ i ≤ N to ensure the feasibility of Whittle’s index policy. Meanwhile,
we ignore the Greedy policy since the Greedy+ policy achieves a better performance, as
indicated by Figures 3b and 4b.

5 10 15 20 25 30 35 40 45 50
Number of users in the system, N

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

E
xp

ec
te

d
 A

ve
ra

g
e 

A
o

II

Greedy policy
Indexed priority policy
Optimal policy for RP

(a) When p0
e = 0.1.

5 10 15 20 25 30 35 40 45 50
Number of users in the system, N

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

E
xp

ec
te

d
 A

ve
ra

g
e 

A
o

II

Greedy policy
Greedy+ policy
Whittle's index policy
Indexed priority policy
Optimal policy for RP

(b) When p0
e = 0.

Figure 3. Performance when the source processes vary. We choose pi = 0.05 + 0.4(i−1)
N−1 , fi(s) = s, γi = 0.6, p0

e,i = p0
e ,

and p1
e,i = 0.1 for 1 ≤ i ≤ N.

5 10 15 20 25 30 35 40 45 50
Number of users in the system, N

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

E
xp

ec
te

d
 A

ve
ra

g
e 

A
o

II

Greedy policy
Indexed priority policy
Optimal policy for RP

(a) When p0
e = 0.1.

5 10 15 20 25 30 35 40 45 50
Number of users in the system, N

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

E
xp

ec
te

d
 A

ve
ra

g
e 

A
o

II

Greedy policy
Greedy+ policy
Whittle's index policy
Indexed priority policy
Optimal policy for RP

(b) When p0
e = 0.

Figure 4. Performance when the communication goals vary. We choose fi(s) = s0.5+ i−1
N−1 , pi = 0.3, γi = 0.6, p0

e,i = p0
e ,

and p1
e,i = 0.1 for 1 ≤ i ≤ N.


Entropy 2021, 23, 1572 21 of 39

0 50 100 150 200 250 300
System Index

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

E
xp

ec
te

d
 A

ve
ra

g
e 

A
o

II

Greedy policy
Indexed priority policy
Optimal policy for RP

(a) When I = [0, 0.45].

0 50 100 150 200 250 300
System Index

0

0.5

1

1.5

2

2.5

3

3.5

4

E
xp

ec
te

d
 A

ve
ra

g
e 

A
o

II

Greedy+ policy
Whittle's index policy
Indexed priority policy
Optimal policy for RP

(b) When I = {0}.
Figure 5. Performance in systems with random parameters when N = 5. The parameters for each user are chosen randomly
within the following intervals: γ ∈ [0, 1], p ∈ [0.05, 0.45], p0

e ∈ I, p1
e ∈ [0, 0.45], and f (s) = sτ where τ ∈ [0.5, 1.5].

We can make the following observations from the figures.

• The Greedy+ policy yields a smaller expected average AoII than that achieved by the
Greedy policy. Recall that we obtained the Greedy+ policy by applying the structural
properties detailed in Corollary 1. Therefore, simple applications of the structural
properties of the optimal policy can improve the performance of scheduling policies.

• The Indexed priority policy has comparable performance to Whittle’s index policy
in all the system settings considered. The two policies have their own advantages.
The Indexed priority policy has a broader scope of application, while Whittle’s index
policy has a lower computational complexity.

• The performance of the Indexed priority policy and Whittle’s index policy is better
than that of the Greedy/Greedy+ policies and is not far from the performance of
the RP-optimal policy. Recall that the performance of the RP-optimal policy forms a
universal lower bound on the performance of all admissible policies for PP. Hence, we
can conclude that both the Indexed priority policy and Whittle’s index policy achieve
good performances.

8. Conclusions

In this paper, we studied the problem of minimizing the Age of Incorrect Information
in a slotted-time system where a base station needs to schedule M users among N available
users. Meanwhile, the base station has access to imperfect channel state information in
each time slot. The problem is a restless multi-armed bandit problem which is SPACE-
hard. However, by casting the problem into a Markov decision process, we obtain the
structural properties of the optimal policy. Then, we introduce a relaxed version of the
original problem and investigate the decoupled model. Under a simple condition, we
establish the indexability of the decoupled problem and obtain the expression of Whittle’s
index. On this basis, we developed Whittle’s index policy. To get rid of the requirement
for indexability, we developed the Indexed priority policy based on the optimal policy
for the relaxed problem. The characteristics of the relaxed problem are explored to make
the calculation of its optimal policy more efficient. Finally, through numerical results, we
show that simple applications of the structural properties can improve the performance
of scheduling policies. Moreover, Whittle’s index policy and the Indexed priority policy
achieve good and comparable performances.

Author Contributions: Formal analysis, Y.C.; Investigation, Y.C.; Methodology, Y.C.; Supervision,
A.E.; Validation, Y.C.; Writing—original draft, Y.C.; Writing—review & editing, Y.C. and A.E. All
authors have read and agreed to the published version of the manuscript.


Entropy 2021, 23, 1572 22 of 39

Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A. Proof of Lemma 1

We consider two states, x1 and x2, that differ only in the value of sj. Without the loss
of generality, we assume s1,j < s2,j. Then, it is sufficient to show that, for any 1 ≤ j ≤ N,
V(x1) ≤ V(x2). Leveraging the iterative nature of VIA, we use mathematical induction
to prove the monotonicity. First of all, the base case (i.e., ν = 0) is true by initialization.
We assume the lemma holds at iteration ν. Then, we want to examine whether it holds at
iteration ν + 1. The update step reported in problem (5) can be rewritten as follows.

Vν+1(x) = min
a∈AN(1)

Va
ν+1(x), (A1)

where

Va
ν+1(x) = C(x)− θ + ∑

x′−{x′j}


(

∏
i 6=j

Pxi ,x′i
(ai)

)
∑
r̂′j

P(r̂′j)U
j
ν(x, x′)

,

U j
ν(x, x′) = ∑

s′j

Psj ,s′j
(aj, r̂j)Vν(x′).

To prove the desired results, we distinguish between the following cases.

• We first consider the case of s1,j = 0 < s2,j and r̂1,j = r̂2,j = 0. When aj = 1 and for
any x′ − {s′j}, we have

U j
ν(x1, x′) = pjVν(x′; s′j = 1) + (1− pj)Vν(x′; s′j = 0),

U j
ν(x2, x′) = β jVν(x′; s′j = s2,j + 1) + (1− β j)Vν(x′; s′j = 0),

where Vν(x′; s′j = 0) is the estimated value function of the state x′ with s′j = 0 at
iteration ν (at the risk of abusing the notation, we use V(x; sj = s1) and V(x; sj = s2)
to represent the value functions of two states that differ only in the value of sj). Then,
we get

U j
ν(x1, x′)−U j

ν(x2, x′) ≤ (pj − β j)
(
Vν(x′; s′j = 1)−Vν(x′; s′j = 0)

)
≤ 0.

The inequalities hold since β j > pj and Lemma 1 are true at iteration ν by assumption.

Therefore, we have U j
ν(x1, x′) ≤ U j

ν(x2, x′) when aj = 1 for any x′ − {s′j}.
For the case of ai = 1 where i 6= j, we notice that aj = 0. Then, for any x′ − {s′j},
we obtain

U j
ν(x1, x′) = pjVν(x′; s′j = 1) + (1− pj)Vν(x′; s′j = 0),

U j
ν(x2, x′) = (1− pj)Vν(x′; s′j = s2,j + 1) + pjVν(x′; s′j = 0).

Therefore, when ai = 1, we have

U j
ν(x1, x′)−U j

ν(x2, x′) ≤ (2pj − 1)
(
Vν(x′; s′j = 1)−Vν(x′; s′j = 0)

)
≤ 0.

The inequalities hold since 2pj − 1 < 0 and Lemma 1 is true at iteration ν by as-

sumption. Combining with the case of aj = 1, U j
ν(x1, x′) ≤ U j

ν(x2, x′) holds for any


Entropy 2021, 23, 1572 23 of 39

x′ − {s′j} under any feasible action. Since x1 and x2 differ only in the value of sj and
C(x) is non-decreasing in si for 1 ≤ i ≤ N, we can see that Va

ν+1(x1) ≤ Va
ν+1(x2) for

any feasible a. Then, by (A1), we can conclude that the lemma holds at iteration ν + 1
when s1,j = 0 < s2,j and r̂1,j = r̂2,j = 0.

• When s1,j = 0 < s2,j and r̂1,j = r̂2,j = 1, by replacing the β j’s in the above case with
αj’s, we can achieve the same result.

• When 0 < s1,j < s2,j and r̂1,j = r̂2,j, we notice that

Ps1,j ,s1,j+1(aj, r̂1,j) = Ps2,j ,s2,j+1(aj, r̂2,j),

Ps1,j ,0(aj, r̂1,j) = Ps2,j ,0(aj, r̂2,j).

Then, leveraging the monotonicity of Vν(x) and C(x), we can conclude with the
same result.

Combining the three cases, we prove that the lemma also holds at iteration ν + 1 of VIA.
Therefore, the lemma holds at any iteration ν by mathematical induction. Since the results
hold for any 1 ≤ j ≤ N and VIA is guaranteed to converge to the value function when
ν→ +∞, we can conclude our proof.

Appendix B. Proof of Lemma 2

We inherit the notations in the proof of Lemma 1. We still use mathematical induction
to obtain the desired results. The base case ν = 0 is true by initialization. We assume the
lemma holds at iterative ν and examine whether it still holds at iteration ν + 1. In the case
of M = 1, we rewrite (5) as

Vν+1(x) = min
1≤j≤N

V j
ν+1(x), (A2)

where

V j
ν+1(x) = C(x)− θ + ∑

x′

{(
∏
i 6=j

Pi
xi ,x′i

(0)

)
Pj

xj ,x′j
(1)Vν(x′)

}
, (A3)

and Pi
x,x′(ai) is the probability that action ai will lead to state x′ when user i is at state x.

To get the desired results, we distinguish between the following cases

• We first show that V j
ν+1(x) = Vk

ν+1(P(x)). According to (A3), we have

V j
ν+1(x) = C(x)− θ + ∑

x′

{(
∏

i 6=j,k
Pi

xi ,x′i
(0)

)
Pk

xk ,x′k
(0)Pj

xj ,x′j
(1)Vν(x′)

}
.

Vk
ν+1(P(x)) = C(P(x))− θ+

∑
P(x)′

(
∏

i 6=j,k
Pi
P(x)i ,P(x)′i

(0)

)
Pk
P(x)k ,P(x)′k

(1)Pj
P(x)j ,P(x)′j

(0)Vν(P(x)′).

It is obvious that for any P(x)′, there always exists P(x′′) = P(x)′. Then, we obtain

Vk
ν+1(P(x)) = C(P(x))− θ+

∑
P(x′′)

(
∏

i 6=j,k
Pi

xi ,x′′i
(0)

)
Pk

xj ,P(x′′)k
(1)Pj

xk ,P(x′′)j
(0)Vν(P(x′′))

= C(P(x))− θ + ∑
x′′

(
∏

i 6=j,k
Pi

xi ,x′′i
(0)

)
Pk

xj ,x′′j
(1)Pj

xk ,x′′k
(0)Vν(x′′)

= C(P(x))− θ + ∑
x′

(
∏

i 6=j,k
Pi

xi ,x′i
(0)

)
Pk

xj ,x′j
(1)Pj

xk ,x′k
(0)Vν(x′).


Entropy 2021, 23, 1572 24 of 39

The second equality follows from the definition of P(·), the property of sum-
mation, and the assumption at iteration ν. The last equality follows from the
variable renaming. Then, by the definition of statistically identical, we have
Pk

xj ,x′j
(1) = Pj

xj ,x′j
(1), Pj

xk ,x′k
(0) = Pk

xk ,x′k
(0), and C(x) = C(P(x)). Therefore, we can

conclude that V j
ν+1(x) = Vk

ν+1(P(x)).

• Along the same lines, we can easily show that Vk
ν+1(x) = V j

ν+1(P(x)) and Vi
ν+1(x) =

Vi
ν+1(P(x)) for i 6= j, k.

Combining the above cases with (A2), we prove that Vν+1(x) = Vν+1(P(x)). Then,
by induction, we have Vν(x) = Vν(P(x)) at any iteration ν. Since VIA is guaranteed to
converge to the value function when ν→ +∞, we can conclude our proof.

Appendix C. Proof of Theorem 1

For arbitrary j and k

δj,k(x) = ∑
x′−{x′j ,x′k}


(

∏
i 6=j,k

Pxi ,x′i
(0)

)
∑
r̂′j ,r̂
′
k

P(r̂′j)P(r̂′k)Rj,k(x, x′)

, (A4)

where

Rj,k(x, x′) = ∑
s′j ,s
′
k

[(
Psk ,s′k

(0, r̂k)Psj ,s′j
(1, r̂j)− Psk ,s′k

(1, r̂k)Psj ,s′j
(0, r̂j)

)
V(x′)

]
. (A5)

With this in mind, we will prove the properties one by one.

Property 1—δj,k(x) ≤ 0 if r̂k = p0
e,k = 0. The equality holds when sj = 0 or r̂j = p0

e,j = 0.

When r̂k = p0
e,k = 0, transmitting the update from user k will necessarily fail. Therefore,

Psk ,s′k
(0, 0) = Psk ,s′k

(1, 0) for any sk and s′k. Then, we have

Rj,k(x, x′) = ∑
s′k

Psk ,s′k
(0, 0)∑

s′j

[(
Psj ,s′j

(1, r̂j)− Psj ,s′j
(0, r̂j)

)
V(x′)

]
.

To identify the sign of Rj,k(x, x′), we distinguish between the following cases

• When sj = 0, we can easily show that Rj,k(x, x′) = 0 for any x′ − {s′j, s′k} by noticing
that the two possible actions with respect to user j (i.e., aj = 1 and aj = 0) are
equivalent when sj = 0. Since δj,k(x) is a linear combination of Rj,k(x, x′)’s with
non-negative coefficients, we can conclude that δj,k(x) = 0 in this case.

• When sj > 0 and r̂j = 1, for any x′ − {s′j, s′k}, we have

Rj,k(x, x′) = ∑
s′k

Psk ,s′k
(0, 0)(αj + pj − 1)

(
V(x′; s′j = sj + 1)−V(x′; s′j = 0)

)
≤ 0.

(A6)

The inequality holds because of Lemma 1 and the fact that αj + pj < 1. We recall that
δj,k(x) is a linear combination of Rj,k(x, x′)’s with non-negative coefficients. Then, we
can conclude that δj,k(x) ≤ 0 in this case.

• When sj > 0 and r̂j = 0, by replacing the αj in (A6) with β j, we can get the same result.
In this case, the equality holds when β j + pj = 1, or, equivalently, p0

e,j = 0.

Combining the cases, we prove the first property.


Entropy 2021, 23, 1572 25 of 39

Property 2—δj,k(x) is non-increasing in r̂j and is non-decreasing in r̂k when sj, sk > 0.
At the same time, δj,k(x) is independent of r̂i for any i 6= j, k.

We first prove the monotonicity of δj,k(x) with respect to r̂j. To this end, we define
x1 and x2 as two states that differ only in the value of r̂j. Without a loss of generality, we
assume r̂1,j = 1 and r̂2,j = 0. Then, we investigate the sign of δj,k(x1)− δj,k(x2). We define
xi , x1,i = x2,i for i 6= j. Then, according to (A4), δj,k(x1)− δj,k(x2) can be written as

δj,k(x1)−δj,k(x2) =

∑
x′−{x′j ,x′k}


(

∏
i 6=j,k

Pxi ,x′i
(0)

)
∑
r̂′j ,r̂
′
k

P(r̂′j)P(r̂′k)
(

Rj,k(x1, x′)− Rj,k(x2, x′)
).

Since x1,k = x2,k, we have Ps1,k ,s′k
(a, r̂1,k) = Ps2,k ,s′k

(a, r̂2,k) for any s′k. We recall that the
transition probability is independent of r̂ when a = 0. Combining with the fact that
s1,j = s2,j, we also have Ps1,j ,s′j

(0, r̂1,j) = Ps2,j ,s′j
(0, r̂2,j) for any s′j. Combining together,

we obtain
Ps1,k ,s′k

(1, r̂1,k)Ps1,j ,s′j
(0, r̂1,j) = Ps2,k ,s′k

(1, r̂2,k)Ps2,j ,s′j
(0, r̂2,j),

Ps1,k ,s′k
(0, r̂1,k) = Ps2,k ,s′k

(0, r̂2,k).

Leveraging the above two problems, we have

Rj,k(x1, x′)− Rj,k(x2, x′) =

∑
s′j ,s
′
k

[
Psk ,s′k

(0, r̂k)

(
Ps1,j ,s′j

(1, r̂1,j)− Ps2,j ,s′j
(1, r̂2,j)

)
V(x′)

]
.

Consequently, we obtain

δj,k(x1)−δj,k(x2) =

∑
x′−{x′j}

∏
i 6=j

Pxi ,x′i
(0)

∑
r̂′j

P(r̂′j)∑
s′j

(
Ps1,j ,s′j

(1, 1)− Ps2,j ,s′j
(1, 0)

)
V(x′)

.

In the following, we characterize the sign of

R1 , ∑
s′j

(
Ps1,j ,s′j

(1, 1)− Ps2,j ,s′j
(1, 0)

)
V(x′).

As s1,j = s2,j > 0, for any x′ − {s′j}, we have

R1 =
(
(1− αj)− (1− β j)

)
V(x′; s′j = 0) + (αj − β j)V(x′; s′j = s1,j + 1) ≤ 0.

The inequality follows from Lemma 1 and the fact that β j > αj. Since δj,k(x1)− δj,k(x2)
is a linear combination of R1’s with non-negative coefficients, we can conclude that
δj,k(x1) ≤ δj,k(x2). Since r̂1,j > r̂2,j, we can see that δj,k(x) is non-increasing in r̂j.

In a very similar way, we can show that δj,k(x) is non-decreasing in r̂k. We recall that
r̂i will not affect the system dynamic if ai = 0. Consequently, we can conclude that δj,k(x)
is independent of r̂i for any i 6= j, k.

Combining together, we prove the second property.

Property 3—δj,k(x) ≤ 0 if sk = 0. The equality holds when sj = 0 or r̂j = p0
e,j = 0.

Since the probabilities are non-negative, it is sufficient to show that Rj,k(x, x′) satisfies
Property 3 for any x′ − {s′j, s′k}. More precisely, it is sufficient to show that Rj,k(x, x′) ≤ 0


Entropy 2021, 23, 1572 26 of 39

for any x′ − {s′j, s′k} when sk = 0 and the equality holds when sj = 0 or r̂j = p0
e,j = 0. We

recall that Psk ,s′k
(1, r̂k) = Psk ,s′k

(0, r̂k) for any s′k when sk = 0. Hence, for any x′ − {s′j, s′k},
we have

Rj,k(x, x′) = ∑
s′k

[
Psk ,s′k

(0, r̂k)∑
s′j

(
Psj ,s′j

(1, r̂j)− Psj ,s′j
(0, r̂j)

)
V(x′)

]
.

Then, we investigate the following quantity for any x′ − {s′j}

R2 , ∑
s′j

(
Psj ,s′j

(1, r̂j)− Pxj ,x′j
(0, r̂j)

)
V(x′).

To this end, we distinguish between the following cases

• When sj = 0, we have Psj ,s′j
(1, r̂j) = Psj ,s′j

(0, r̂j) for any s′j. Thus, we conclude that

R2 = 0 for any x′ − {s′j}. Consequently, Rj,k(x, x′) = 0 for any x′ − {s′j, s′k}.
• When sj > 0 and r̂j = 1, for any x′ − {s′j}, we have

R2 = (αj − 1 + pj)V(x′; s′j = sj + 1) + (1− αj − pj)V(x′; s′j = 0) ≤ 0 (A7)

The inequality follows from Lemma 1 and the fact that αj + pj < 1. Thus, Rj,k(x, x′) ≤ 0
for any x′ − {s′j, s′k}.

• When sj > 0 and r̂j = 0, by replacing the αj in (A7) with β j, we can get the same result.
In this case, the equality holds when β j + pj = 1, or, equivalently, p0

e,j = 0.

Combined together, we can conclude that Property 3 is true.

Property 4—δj,k(x) is non-increasing in sj if Γ
r̂j
j ≤ Γr̂k

k and is non-decreasing in sk if Γ
r̂j
j ≥ Γr̂k

k

when sj, sk > 0. We define Γ1
i , αi

1−pi
and Γ0

i , βi
1−pi

for 1 ≤ i ≤ N.

Such as we did in the proof of Property 3, it is sufficient to show that Rj,k(x, x′) satisfies
Property 4 for any x′ − {s′j, s′k}. We recall that Rj,k(x, x′) depends on the values of r̂j and r̂k.
Therefore, we distinguish between the following cases

• In the case of r̂j = r̂k = 1 and sj, sk > 0, for any x′ − {s′j, s′k}, (A5) can be written as

Rj,k(x, x′) = ∑
s′j ,s
′
k

[(
Psk ,s′k

(0, 1)Psj ,s′j
(1, 1)− Psk ,s′k

(1, 1)Psj ,s′j
(0, 1)

)
V(x′)

]
=
(

pkαj − (1− pj)(1− αk)
)
V(x′; s′j = sj + 1; s′k = 0)

+
(
(1− pk)(1− αj)− pjαk

)
V(x′; s′j = 0; s′k = sk + 1)

+
(
(1− pk)αj − (1− pj)αk

)
V(x′; s′j = sj + 1; s′k = sk + 1)

+
(

pk(1− αj)− pj(1− αk)
)
V(x′; s′j = 0; s′k = 0).

As we can verify

pkαj − (1− pj)(1− αk) <
1
2
(pk + pj − 1) < 0,

(1− pk)(1− αj)− pjαk >
1
2
(1− pk − pj) > 0.

We define Γ1
i , αi

1−pi
and Γ0

i , βi
1−pi

for 1 ≤ i ≤ N. Then, we have

Γ1
j Q Γ1

k =⇒ (1− pk)αj − (1− pj)αk Q 0.


Entropy 2021, 23, 1572 27 of 39

Combining with Lemma 1, we can conclude that, for any x′ − {s′j, s′k}, Rj,k(x, x′) is

non-increasing in sj if Γ1
j ≤ Γ1

k and is non-decreasing in sk if Γ1
j ≥ Γ1

k .

• In the case of r̂j = r̂k = 0 and sj, sk > 0, by replacing the α’s in the above case with β’s,
we can conclude with the same result.

• In the case of r̂j = 1, r̂k = 0, and sj, sk > 0, for any x′ − {s′j, s′k}, (A5) can be written as

Rj,k(x, x′) = ∑
s′j ,s
′
k

[(
Psk ,s′k

(0, 0)Psj ,s′j
(1, 1)− Psk ,s′k

(1, 0)Psj ,s′j
(0, 1)

)
V(x′)

]
=
(

pkαj − (1− pj)(1− βk)
)
V(x′; s′j = sj + 1; s′k = 0)

+
(
(1− pk)(1− αj)− pjβk

)
V(x′; s′j = 0; s′k = sk + 1)

+
(
(1− pk)αj − (1− pj)βk

)
V(x′; s′j = sj + 1; s′k = sk + 1)

+
(

pk(1− αj)− pj(1− βk)
)
V(x′; s′j = 0; s′k = 0).

As we can verify

pkαj − (1− pj)(1− βk) < pk

(
pj −

1
2

)
< 0,

(1− pk)(1− αj)− pjβk > (1− pk)

(
1
2
− pj

)
> 0.

At the same time
Γ1

j Q Γ0
k =⇒ (1− pk)αj − (1− pj)βk Q 0.

Combined with Lemma 1, we can conclude that, for any x′ − {s′j, s′k}, Rj,k(x, x′) is

non-increasing in sj if Γ1
j ≤ Γ0

k and is non-decreasing in sk if Γ1
j ≥ Γ0

k .

• In the case of r̂j = 0, r̂k = 1, and sj, sk > 0, by swapping the α’s and β’s in the above
case, we can conclude with the same result.

Combined together, we conclude that Rj,k(x, x′) satisfies Property 3 for any x′ − {s′j, s′k}.
Consequently, δj,k(x) is non-increasing in sj if Γ

r̂j
j ≤ Γr̂k

k and is non-decreasing in sk if

Γ
r̂j
j ≥ Γr̂k

k when sj, sk > 0.

Property 5—δj,k(x) ≤ 0 if sj ≥ sk, r̂j ≥ r̂k, and users j and k are statistically identical.

According to Property 3, it is sufficient to consider the case where sj, sk > 0. We
notice that the sign of δj,k(x) can be captured by the sign of the quantity Qj,k(x, x′) , ∑r̂′j ,r̂

′
k

P(r̂′j)P(r̂′k)Rj,k(x, x′). Thus, we divide our discussion into the following cases.

• We first consider the case of sj ≥ sk > 0 and r̂j = r̂k = 0. Leveraging the definition of
statistically identical, for any x′ − {x′j, x′k}, we have

Qj,k(x, x′) = ∑
r̂′j ,r̂
′
k

P(r̂′j)P(r̂′k)κ1

(
V(x′; x′j = (0, r̂′j); x′k = (sk + 1, r̂′k))−

V(x′; x′j = (sj + 1, r̂′j); x′k = (0, r̂′k))
)

,

where κ1 = 1 − pj − β j ≥ 0. Then, by substituting the values of P(r̂) and using
Lemma 2, we obtain


Entropy 2021, 23, 1572 28 of 39

Qj,k(x, x′) =γjγkκ1V(x′; x′j = (sk + 1, 1); x′k = (0, 1))−
γjγkκ1V(x′; x′j = (sj + 1, 1); x′k = (0, 1))+

(1− γj)(1− γk)κ1V(x′; x′j = (sk + 1, 0); x′k = (0, 0))−
(1− γj)(1− γk)κ1V(x′; x′j = (sj + 1, 0); x′k = (0, 0))+

γk(1− γj)κ1V(x′; x′j = (sk + 1, 1); x′k = (0, 0))−
γk(1− γj)κ1V(x′; x′j = (sj + 1, 0); x′k = (0, 1))+

γj(1− γk)κ1V(x′; x′j = (sk + 1, 0); x′k = (0, 1))−
γj(1− γk)κ1V(x′; x′j = (sj + 1, 1); x′k = (0, 0)).

Since users j and k are statistically identical, we have γj = γk. Then, by Lemma 1,
we have Qj,k(x, x′) ≤ 0 for any x′ − {x′j, x′k}. Since δj,k(x) is a linear combination of

Qj,k(x, x′)’s with non-negative coefficients, we can conclude that δj,k(x) ≤ 0.
• For the case of sj ≥ sk > 0 and r̂j = r̂k = 1, by replacing β j in κ1 with αj, we can

conclude with the same result.
• Then, we consider the case of sj ≥ sk > 0, r̂j = 1, and r̂k = 0. We first notice that,

for any x′ − {s′j, s′k}

Rj,k(x, x′) =
(

pkαj − (1− pj)(1− βk)
)
V(x′; s′j = sj + 1; s′k = 0)+(

(1− pk)(1− αj)− pjβk
)
V(x′; s′j = 0; s′k = sk + 1)+(

(1− pk)αj − (1− pj)βk
)
V(x′; s′j = sj + 1; s′k = sk + 1)+(

pk(1− αj)− pj(1− βk)
)
V(x′; s′j = 0; s′k = 0).

As users j and k are statistically identical, we have pj = pk and αj < βk. Leveraging
Lemma 1, we have

Rj,k(x, x′) ≤ (αj + pj − 1)
(

V(x′; s′j = sj + 1; s′k = 0)−

V(x′; s′j = 0; s′k = sk + 1)
)

.

Then, for any x′ − {x′j, x′k}

Qj,k(x, x′) ≤ ∑
r̂′j ,r̂
′
k

P(r̂′j)P(r̂′k)κ2

(
V(x′; x′j = (0, r̂′j); x′k = (sk + 1, r̂′k))−

V(x′; x′j = (sj + 1, r̂′j); x′k = (0, r̂′k))
)

,

where κ2 = 1− pj − αj > 0. Such as we did in the previous cases, we can leverage
Lemmas 1 and 2 to conclude that Qj,k(x, x′) ≤ 0 for any x′ − {x′j, x′k}. Consequently,

δj,k(x) ≤ 0 in this case. The details are omitted for the sake of space.

Combined together, we conclude the proof of Property 5.

Appendix D. Proof of Corollary 2

We follow the same steps as in the proof of Lemma 1. To prove the corollary, it is
sufficient to show that V(x1) ≤ V(x2) when s1 < s2 and r̂1 = r̂2. We use mathematical
induction to prove the monotonicity. First of all, the base case (i.e., ν = 0) is true by
initialization. We assume the lemma holds at iteration ν. Then, we want to examine


Entropy 2021, 23, 1572 29 of 39

whether it holds at iteration ν + 1. For the system with a single user, the update step
reported in problem (5) can be simplified and rewritten as follows

Vν+1(x) = min
a∈{0,1}

Va
ν+1(x), (A8)

where
Va

ν+1(x) = C(x, a)− θ + ∑
r̂′

P(r̂′)∑
s′

Ps,s′(a, r̂)Vν(x′),

and θ is the optimal value forM1(λ,−1). To prove the desired results, we distinguish
between the following cases

• We first consider the case of s1 = 0 < s2 and r̂1 = r̂2 = 0. When a = 1, we have

V1
ν+1(x1) = C(x1, 1)− θ + ∑

r̂′
P(r̂′)

(
pVν(1, r̂′) + (1− p)Vν(0, r̂′)

)
,

V1
ν+1(x2) = C(x2, 1)− θ + ∑

r̂′
P(r̂′)

(
βVν(s2 + 1, r̂′) + (1− β)Vν(0, r̂′)

)
.

Subtracting the two expressions yields

V1
ν+1(x1)−V1

ν+1(x2)

≤ C(x1, 1)− C(x2, 1) + ∑
r̂′

P(r̂′)
[
(p− β)

(
Vν(1, r̂′)−Vν(0, r̂′)

)]
≤ 0.

The inequalities hold since β > p, C(x, a) is non-decreasing in s, and Corollary 2 is
true at iteration ν by assumption.
For the case of a = 0, we obtain

V0
ν+1(x1) = C(x1, 0)− θ + ∑

r̂′
P(r̂′)

(
pVν(1, r̂′) + (1− p)Vν(0, r̂′)

)
,

V0
ν+1(x2) = C(x2, 0)− θ + ∑

r̂′
P(r̂′)

(
(1− p)Vν(s2 + 1, r̂′) + pVν(0, r̂′)

)
.

Therefore, when a = 0, we have

V0
ν+1(x1)−V0

ν+1(x2)

≤ C(x1, 0)− C(x2, 0) + ∑
r̂′

P(r̂′)
[
(2p− 1)

(
Vν(1, r̂′)−Vν(0, r̂′)

)]
≤ 0.

The inequalities hold since 2p − 1 < 0, C(x, a) is non-decreasing in s, and Corol-
lary 2 is true at iteration ν by assumption. Combined together, we can see that
Va

ν+1(x1) ≤ Va
ν+1(x2) for any feasible a. Then, by problem (A8), we can conclude that

the lemma holds at iteration ν + 1 when s1 = 0 < s2 and r̂1 = r̂2 = 0.
• When s1 = 0 < s2 and r̂1 = r̂2 = 1, by replacing the β’s in the above case with α’s, we

can achieve the same result.
• When 0 < s1 < s2 and r̂1 = r̂2, we notice that Ps1,s1+1(a, r̂1) = Ps2,s2+1(a, r̂2) and

Ps1,0(a, r̂1) = Ps2,0(a, r̂2). Then, leveraging the monotonicity of Vν(x) and C(x, a), we
can conclude with the same result.

Combining the three cases, we prove that the lemma holds at iteration ν + 1 of VIA.
Therefore, the lemma holds at any iteration ν by mathematical induction. Since VIA is
guaranteed to converge to the value function when ν→ +∞, we can conclude our proof.


Entropy 2021, 23, 1572 30 of 39

Appendix E. Proof of Proposition 1

We define ∆V(x) , V1(x)−V0(x) where Va(x) is the value function resulting from
taking action a at state x. Then, Va(x) can be calculated as follows

Va(x) = C(x, a)− θ + ∑
x′∈X

Px,x′(a)V(x′), (A9)

where θ is the optimal value forM1(λ,−1). Hence, the optimal action at state x can be
fully characterized by the sign of ∆V(x). More precisely, the optimal action at state x is
a = 1 if ∆V(x) < 0, and a = 0 is optimal otherwise. To determine the sign of ∆V(x) for
each state, we distinguish between the following cases

• We first consider the state x = (0, r̂). Applying the results in Section 2.3 to prob-
lem (A9), we obtain

V0(0, r̂) =− θ + (1− γ)(1− p)V(0, 0) + (1− γ)pV(1, 0)+

γ(1− p)V(0, 1) + γpV(1, 1),

V1(0, r̂) = λ + V0(0, r̂). (A10)

Therefore, ∆V(0, r̂) = λ ≥ 0. Thus, the optimal action at state (0, r̂) is a = 0.
• Then, we consider the state x = (s, 0) where s > 0. Applying the results in Section 2.3

to Equation (A9), we obtain

V0(s, 0) = f (s)− θ + (1− γ)pV(0, 0) + (1− γ)(1− p)V(s + 1, 0)+

γpV(0, 1) + γ(1− p)V(s + 1, 1),

V1(s, 0) = f (s) + λ− θ + (1− γ)(1− β)V(0, 0) + (1− γ)βV(s + 1, 0)+

γ(1− β)V(0, 1) + γβV(s + 1, 1).

Then,
∆V(s, 0) = λ + p0

e (1− 2p)ω, (A11)

where ω = (1− γ)[V(0, 0)−V(s + 1, 0)] + γ[V(0, 1)−V(s + 1, 1)] ≤ 0.
• Finally, we consider the state x = (s, 1) where s > 0. Following the same trajectory,

we have

∆V(s, 1) = λ + (1− p1
e )(1− 2p)ω.

According to Corollary 2 and the fact that p < 0.5, we can see that ∆V(s, 0) and ∆V(s, 1)
are both a constant λ plus a term that is non-increasing in s. As the time penalty function is
unbounded, the value function must also be unbounded. Then, combining the three cases,
we can conclude the following. For fixed r̂, there always exists a threshold nr̂ > 0 such that
the optimal action at state (s, r̂) where s ≥ nr̂ is a = 1, otherwise a = 0 is optimal. Since
r̂ ∈ {0, 1}, the optimal policy can be fully captured by the pair (n0, n1).

In the following, we determine the relationship between n0 and n1. We have

∆V(s, 1)− ∆V(s, 0) = (1− p1
e − p0

e )(1− 2p)ω ≤ 0.

At the same time, for the threshold n0, we know ∆V(n0, 0) < 0. Then, we have ∆V(n0, 1) ≤
∆V(n0, 0) < 0. Combined with the fact that ∆V(s, r̂) is non-increasing in s, we can conclude
that the ordering n0 ≥ n1 is true.

Appendix F. Proof of Proposition 2

We notice that the dynamic of AoII under threshold policy can be fully captured by
a Discrete-Time Markov Chain (DTMC). Then, the expected AoII ∆̄n and the expected
transmission rate ρ̄n under threshold policy n = (n0, n1) can be obtained from the stationary


Entropy 2021, 23, 1572 31 of 39

distribution of the induced DTMC. Let the states of the induced DTMC be the values of s.
We recall that r̂ is an independent Bernoulli random variable with parameter γ. Combined
with the results in Section 2.3, we can easily obtain the state transition probabilities of the
induced DTMC, which are shown in Figure A1.

0 1 2 n1 n1 + 1 n0 n0 + 11− p

p 1− p 1− p c1c1 c2

p

p

1− c1
1− c1

1− c2
1− c2

Figure A1. DTMC induced by the threshold policy n = (n0, n1). In the figure, c1 = (1− γ)(1− p) + γα and c2 = (1− γ)

β + γα.

The balance equations of the induced DTMC are the following

(1− p)π0 + p
n1−1

∑
k=1

πk + (1− c1)
n0−1

∑
k=n1

πk + (1− c2)
+∞

∑
k=n0

πk = π0.

pπ0 = π1.

(1− p)πk−1 = πk f or 2 ≤ k ≤ n1.

c1πk−1 = πk f or n1 + 1 ≤ k ≤ n0.

c2πk−1 = πk f or n0 + 1 ≤ k.

+∞

∑
k=0

πk = 1.

Then, we can easily solve the above system of linear equations. After some algebraic
manipulation, we obtain the following

π0 =
1

2 + p(1− p)n1−1
[

1
1− c1

− 1
p
+ cn0−n1

1

(
1

1− c2
− 1

1− c1

)] .

πk = p(1− p)k−1π0 f or 1 ≤ k ≤ n1.

πk = p(1− p)n1−1ck−n1
1 π0 f or n1 + 1 ≤ k ≤ n0.

πk = p(1− p)n1−1cn0−n1
1 ck−n0

2 π0 f or n0 + 1 ≤ k.

Equipped with the above results, we proceed with calculating ∆̄n and ρ̄n. According to
problem (6a), the expected AoII is:

∆̄n =
+∞

∑
k=0

f (k)πk.

Substituting the expressions of πk’s, we can get the expression of ∆̄n. Proposition 1 tells us
the following.

• For state (s, r̂) where s < n1, it is optimal to stay idle (i.e., a = 0).
• For state (s, r̂) where n1 ≤ s < n0, it is optimal to make a transmission attempt

only when r̂ = 1. We recall that r̂ is an independent Bernoulli random variable with
parameter γ. Therefore, the expected proportion of time that the system is at state
(s, 1) is γπs.


Entropy 2021, 23, 1572 32 of 39

• For state (s, r̂) where s ≥ n0, it is optimal to make transmission attempt regardless
of r̂.

Combined with problem (6b), we have

ρ̄n = γ
n0−1

∑
k=n1

πk +
+∞

∑
k=n0

πk.

Substituting the expressions of πk’s, we can obtain the closed-form expression of ρ̄n.

Appendix G. Proof of Proposition 4

We first tackle the Whittle’s indexes at state (0, r̂) and (s, 0) where s > 0. To this end,
we distinguish between the following cases

• We first consider the state x = (0, r̂). By definition, Whittle’s index is the infimum λ

such that V0(x) = V1(x). According to (A10), we can conclude that Wx = 0 when
x = (0, r̂).

• Then, we consider the state x = (s, 0) where s > 0. We recall that p0
e = 0. Then, we

can conclude, from (A11), that Wx = 0 for all x = (s, 0) where s > 0.

Now, we tackle the Whittle’s index at state x = (s, 1) where s > 0. For convenience,
we denote by Wn the Whittle’s index at state x = (n, 1). According to the monotonicity
of ∆V(x) shown in the proof of Proposition 1, we can conclude that threshold policy
n = (+∞, n + 1) is optimal when V0(n, 1) = V1(n, 1). Then, we can prove the following

Lemma A1. When (9) is satisfied and V0(n, 1) = V1(n, 1), V(s, 1) = V(s, 0) , V(s) for
0 ≤ s ≤ n.

Proof. Since the value function satisfies the Bellman equation, it is sufficient to show that
V(s, 1) and V(s, 0) satisfy the same Bellman equation. We recall that the Bellman equation
for V(x) is given by

V(x) = min
a∈{0,1}

Va(x),

where
Va(x) = C(x, a)− θ + ∑

x′
Px,x′(a)V(x′), (A12)

and θ is the optimal value of the decoupled problem. We recall, from Corollary 3, that the
optimal action at state (s, 0) is staying idle (i.e., a = 0) for any s. We also know that threshold
policy n = (+∞, n+ 1) is optimal when V0(n, 1) = V1(n, 1). Therefore, the optimal actions
at states (s, 0) and (s, 1) where s ≤ n are the same (i.e., a = 0). Equivalently, we have

V(s, r̂) = V0(s, r̂), f or s ≤ n. (A13)

According to the system dynamic reported in Section 2.3, we know that the state transition
probabilities are independent of r̂ when a = 0. Meanwhile, r̂ does not affect the instant cost.
Let x1 = (s, 1) and x2 = (s, 0). Then, for any x′, we have

Px1,x′(0) = Px2,x′(0).

C(x1, 0) = C(x2, 0).

Hence, according to (A12), we can see that V0(s, 0) = V0(s, 1) for any s ≤ n. Combined
with problem (A13), we can conclude that V(s, 0) = V(s, 1) for any 0 ≤ s ≤ n.

By definition, Whittle’s index Wn is the infimum λ such that V0(n, 1) = V1(n, 1).
In this case, according to Lemma A1, V(0, 1) = V(0, 0) = V(0). Then, V0(n, 1) and V1(n, 1)
can be written as

V0(n, 1) = f (n)− θ + pV(0) + (1− p)[(1− γ)V(n + 1, 0) + γV(n + 1, 1)]. (A14)


Entropy 2021, 23, 1572 33 of 39

V1(n, 1) = f (n) + Wn − θ + (1− α)V(0) + α[(1− γ)V(n + 1, 0) + γV(n + 1, 1)].

Without a loss of generality, we assume V(0) = 0. Then, equating the two expressions
yields

Wn = (1− p− α)(γV(n + 1, 1) + (1− γ)V(n + 1, 0)). (A15)

Combining problems (A14) and (A15), we conclude that Wn is

Wn =
(1− p− α)(V0(n, 1) + θ − f (n))

1− p
.

Since the optimal action at state (n, 1) is a = 0, we have V0(n, 1) = V(n, 1) = V(n). Finally,
we obtain

Wn =
(1− p− α)(V(n) + θ − f (n))

1− p
. (A16)

Now, we tackle the expression of V(n). When V0(n, 1) = V1(n, 1), the optimal action at
state (s, r̂) where 0 ≤ s < n is staying idle. Then, leveraging Lemma A1, value function
V(s) where 0 ≤ s < n satisfies the following

V(s) =

{
−θ + f (0) + pV(1) when s = 0,
−θ + f (s) + (1− p)V(s + 1) when 0 < s < n.

(A17)

By backward induction, we end up with the following equation for 0 < s < n.

V(s) =
−θ(1− (1− p)n−s)

p
+

n−s

∑
k=1

f (n− k)(1− p)n−s−k + (1− p)n−sV(n).

Letting s = 1 yields

V(1) =
−θ(1− (1− p)n−1)

p
+

n−1

∑
k=1

f (n− k)(1− p)n−1−k + (1− p)n−1V(n).

From problem (A17), V(1) also satisfies the following

V(1) =
θ − f (0)

p
.

Equating the two expressions of V(1), we obtain

V(n) =
− f (0)

p(1− p)n−1 + θ

(
2

p(1− p)n−1 −
1
p

)
−

n−1

∑
k=1

f (n− k)(1− p)−k. (A18)

We recall that, when V0(n, 1) = V1(n, 1), threshold policy n = (+∞, n + 1) is optimal and
both actions at state x = (n, 1) are equally desirable. Thus, threshold policy n = (+∞, n) is
also optimal. Then, we know

θ = ∆̄n + Wnρ̄n, (A19)

where ∆̄n and ρ̄n are the expected AoII and the expected transmission rate under threshold
policy n = (+∞, n), respectively. Finally, combining problems (A16), (A18) and (A19),
we obtain

Wn =

− f (0)
p(1− p)n + ∆̄n

2− (1− p)n

p(1− p)n − (1− p)−n

(
n

∑
k=1

f (k)(1− p)k−1

)
1

1− p− α
− ρ̄n

2− (1− p)n

p(1− p)n

.

After some algebraic manipulation, we have


Entropy 2021, 23, 1572 34 of 39

Wn =

(1− c1)
+∞

∑
k=n+1

f (k)ck−n−1
1 − ∆̄n

(1− c1)(1− p)− γ(1− p− α)

c1(1− p− α)
+ ρ̄n

,

where c1 = (1− γ)(1− p) + γα.
In the following, we investigate some properties of Whittle’s index. First of all, Wn

is non-negative since 1− p− α and V(n + 1, r̂) in (A15) are all non-negative. Meanwhile,
combining (A15) with the fact that V(n, r̂) is non-decreasing in n, we can verify that Wn is
non-decreasing in n. Combined with the Whittle’s indexes in two other cases (i.e., x = (0, r̂)
and x = (s, 0) where s > 0), we can easily obtain the properties of Wx as detailed in
Proposition 4.

Appendix H. Proof of Proposition 5

We notice thatM1(λ,−1) coincides with the decoupled model studied in Section 4.2.
When problem (9) is satisfied, the decoupled problem is indexable, and, according to
Corollary 3, we only need to show that n is the optimal threshold for the states with
r̂ = 1. We first tackle the case of λ > 0. To this end, we divide our discussion into the
following cases

• For state (s, 1) where s < n, Ws ≤ λ by definition. As the problem is indexable,
we have D(Ws) ⊆ D(λ). We recall that Ws , min{λ′ ≥ 0 : V0(s, 1) = V1(s, 1)}.
Equivalently, Ws , min{λ′ ≥ 0 : (s, 1) ∈ D(λ′)}. Then, we know (s, 1) ∈ D(Ws).
Combined together, we conclude that (s, 1) ∈ D(λ). In other words, the optimal
action at state (s, 1) where s < n is to stay idle (i.e., a = 0).

• For state (s, 1) where s ≥ n, we first recall that Ws = min{λ′ ≥ 0 : (s, 1) ∈ D(λ′)}.
Consequently, for any λ′ < Ws, we know (s, 1) /∈ D(λ′). Meanwhile, we have
Ws ≥Wn > λ by the monotonicity of Whittle’s index and the definition of n. Hence,
we can conclude that (s, 1) /∈ D(λ). In other words, the optimal action at state (s, 1)
where s ≥ n is to make the transmission attempt.

Then, we conclude that n is the optimal threshold for the states with r̂ = 1 when λ > 0.
In the case of λ = 0, according to the proof of Proposition 1, we can easily verify that the
optimal threshold is 1.

Appendix I. Proof of Theorem 2

We first make the following definitions. When M1(λ,−1) is at state x and action
a is taken, cost C1(x, a) , f (s) and C2(x, a) , λa are incurred. We denote the expected
C1-cost and the expected C2-cost under policy φ as C̄1(φ) and C̄2(φ), respectively. Let G be
a non-empty set of states. For the given state i, we defineR∗(i, G) as the class of policies φ,
for which the following hold

• The probability Pφ(xn ∈ G f or some n ≥ 1 | x0 = i) = 1 where xn is the state of
M1(λ,−1) at time n.

• The expected time miG(φ) of a first passage from i to G under φ is finite.
• The expected C1-cost C̄i,G

1 (φ) and the expected C2-cost C̄i,G
2 (φ) of a first passage form

i to G under φ are finite.

With the definitions in mind, we proceed with verifying the assumptions given in [27].

1. For all d > 0, the set A(d) = {x | there exists an action a such that C1(x, a)+C2(x, a) ≤ d}
is finite: For any state x, the cost satisfies C1(x, a) + C2(x, a) = f (s) + λa ≥ f (s).
The equality holds when a = 0. Then, the states in A(d) must satisfy f (s) ≤ d.
Combined with the fact that f (s) is a non-decreasing and unbounded function when
s ∈ N0, we can conclude that A(d) is finite.

2. There exists a stationary policy e such that the induced Markov chain has the following
properties: the state space S consists of a single (non-empty) positive recurrent class R and a


Entropy 2021, 23, 1572 35 of 39

set U of transient states such that e ∈ R∗(i, R) for i ∈ U. Moreover, both C̄1(e) and C̄2(e) on
R are finite: We consider the policy under which the base station makes a transmission
attempt at every time slot. According to the system dynamic detailed in Section 2.3,
we can see that all the states communicate with state (0, 0) and (0, 0) communicates
with all other states. Thus, the state space S consists of a single (non-empty) positive
recurrent class and the set of transient states can simply be an empty set. C̄1(e) and
C̄2(e) are trivially finite as we can verify using Proposition 2.

3. Given any two state x 6= y, there exists a policy φ such that φ ∈ R∗(x, y): We notice that,
under any policy, the maximum increase of s between two consecutive time slots is 1.
Meanwhile, when s decreases, it decreases to zero. Combined with the fact that r̂ is
an independent Bernoulli random variable, we can conclude that there always exists
a path between any x and y with positive probability. mxy(φ), C̄x,y

1 (φ), and C̄x,y
2 (φ)

are trivially finite.
4. If a stationary policy φ has at least one positive recurrent state, then it has a single positive

recurrent class R. Moreover, if x = (0, 0) /∈ R, then φ ∈ R∗(x, R): Given that r̂ is an
independent Bernoulli random variable, we can easily conclude from the system
dynamic that all the states