entropy Article Scheduling to Minimize Age of Incorrect Information with Imperfect Channel State Information Yutao Chen and Anthony Ephremides * ���������� ������� Citation: Chen, Y.; Ephremides, A. Scheduling to Minimize Age of Incorrect Information with Imperfect Channel State Information. Entropy 2021, 23, 1572. https://doi.org/ 10.3390/e23121572 Academic Editor: Mario Martinelli Received: 3 November 2021 Accepted: 23 November 2021 Published: 25 November 2021 Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affil- iations. Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/). Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA; cheny@umd.edu * Correspondence: etony@umd.edu Abstract: In this paper, we study a slotted-time system where a base station needs to update multiple users at the same time. Due to the limited resources, only part of the users can be updated in each time slot. We consider the problem of minimizing the Age of Incorrect Information (AoII) when imperfect Channel State Information (CSI) is available. Leveraging the notion of the Markov Decision Process (MDP), we obtain the structural properties of the optimal policy. By introducing a relaxed version of the original problem, we develop the Whittle’s index policy under a simple condition. However, indexability is required to ensure the existence of Whittle’s index. To avoid indexability, we develop Indexed priority policy based on the optimal policy for the relaxed problem. Finally, numerical results are laid out to showcase the application of the derived structural properties and highlight the performance of the developed scheduling policies. Keywords: age of incorrect information; multi-user system; scheduling policy 1. Introduction The Age of Incorrect Information (AoII) is introduced in [1] as a combination of age- based metrics (e.g., Age of Information (AoI)) and error-based metrics (e.g., Minimum Mean Square Error). In communication systems, AoII captures not only the information mismatch between the source and the destination but also the aging process of inconsistent information. Hence, two functions dominate AoII. The first is the time penalty function, which reflects how the inconsistency of information affects the system over time. In real- life applications, inconsistent information will affect different communication systems in different ways. For example, machine temperature monitoring is time-sensitive because the damage caused by overheating will accumulate quickly. However, reservoir water level monitoring is less sensitive to time. Therefore, by adopting different time penalty functions, AoII can capture different aging processes of the mismatch in different systems. The second is the information penalty function, which captures the information mismatch between the source and the destination. It allows us to measure mismatches in different ways, depending on how sensitive different systems are to information inconsistencies. For example, the navigation system requires precise information to give correct instructions, but the real-time delivery tracking system does not need very accurate location information. Since we can choose different penalty functions for different systems, AoII is adaptable to various communication goals, which is why it is regarded as a semantic metric [2]. Since the introduction of AoII, several studies have been performed to reveal its funda- mental nature. The authors of [3] consider a system with random packet delivery times and compare AoII with AoI and real-time error via extensive numerical results. The authors of [4] study the problem of minimizing the AoII that takes the general time penalty function. Three real-life applications are considered to showcase the performance advantages of AoII over AoI and real-time error. In [5], the authors investigate the AoII that considers the quantified mismatch between the source and the destination. The optimization problem is studied when the system is resource-constrained. The authors of [6] studied the AoII Entropy 2021, 23, 1572. https://doi.org/10.3390/e23121572 https://www.mdpi.com/journal/entropy https://www.mdpi.com/journal/entropy https://www.mdpi.com https://orcid.org/0000-0003-3992-5468 https://doi.org/10.3390/e23121572 https://doi.org/10.3390/e23121572 https://creativecommons.org/ https://creativecommons.org/licenses/by/4.0/ https://creativecommons.org/licenses/by/4.0/ https://doi.org/10.3390/e23121572 https://www.mdpi.com/journal/entropy https://www.mdpi.com/article/10.3390/e23121572?type=check_update&version=2 Entropy 2021, 23, 1572 2 of 39 minimization problem in the context of scheduling. It considers a system where the central scheduler needs to update multiple users at the same time. However, the central scheduler cannot know the states of the sources before receiving the updates. By introducing the belief value, Whittle’s index policy is developed and evaluated. In this paper, we also consider the problem of minimizing AoII in scheduling. Different from [6], we consider the generic time penalty function and study the minimization problem in the presence of imperfect Channel State Information (CSI). Due to the existence of CSI, Whittle’s index policy becomes infeasible in general. Hence, we introduce another scheduling policy that is more versatile and has comparable performance to Whittle’s index policy. The problem of scheduling to minimize AoI is studied under various system settings in [7–11]. The problem studied in this paper is different and more complicated because AoII considers the aging process of inconsistent information rather than the aging process of updates. Meanwhile, none of them consider the case where CSI is available. The problem of optimizing information freshness in the presence of CSI is studied in [12,13]. However, they focus on the system with a single user and mainly discuss the case where CSI is perfect. The scheduling problems with the goal of minimizing an error-based performance measure are considered in [14–16]. Our problem is fundamentally different because AoII also considers the time effect. Moreover, we consider the system where a base station observes multiple sources simultaneously and needs to send updates to multiple destinations. The main contributions of this work can be summarized as follows. (1) We study the problem of minimizing AoII in a multi-user system where imperfect CSI is available. Meanwhile, the time penalty function is generic. (2) We derive the structural properties of the optimal policy for the considered problem. (3) We establish the indexability of the considered problem under a simple condition and develop Whittle’s index policy. (4) We obtain the optimal policy for a relaxed version of the original problem. By exploring the characteristics of the relaxed problem, we provide an efficient algorithm to obtain the optimal policy. (5) Based on the optimal policy for the relaxed problem, we develop the Indexed priority policy that is free from indexability and has comparable performance to Whittle’s index policy. The remainder of this paper is organized in the following way. In Section 2, we introduce the system model and formulate the primal problem. Section 3 explores the structural properties of the optimal policy for the primal problem. Under a simple condition, we develop Whittle’s index policy in Section 4. Section 5 presents the optimal policy for a relaxed version of the primal problem. On this basis, we develop the Indexed priority policy in Section 6. Finally, in Section 7, the numerical results are laid out. 2. System Overview 2.1. Communication Model We consider a slotted-time system with N users and one base station. Each user is composed of a source process, a channel, and a receiver. We assume all the users share the same structure, but the parameters are different. The structure of the communication model is provided in Figure 1. Base Station Receiver1 Receiver2 ReceiverN U S E R 1 U S E R 2 U S E R N Source1 Source2 SourceN Channel1 Channel2 ChannelN Figure 1. The structure of the communication model. Entropy 2021, 23, 1572 3 of 39 For user i, the source process is modeled by a two-state Markov chain where transitions happen between the two states with probability pi > 0 and self-transitions happen with probability 1− pi. At any time slot t, the state of the source process Xi,t ∈ {0, 1} will be reported to the base station as an update, and the base station will decide whether to transmit this update through the corresponding channel. The channel is unreliable, but the estimate of the Channel State Information (CSI) is available at the beginning of each time slot. Let ri,t ∈ {0, 1} be the CSI at time t. We assume that ri,t is independent across time and user indices. ri,t = 1 if and only if the transmission attempt at time t will succeed and ri,t = 0 otherwise. Then, we denote by r̂i,t ∈ {0, 1} the estimate of ri,t. We assume that r̂i,t is an independent Bernoulli random variable with parameter γi, i.e., r̂i,t = 1 with probability γi ∈ [0, 1] and r̂i,t = 0 with probability 1− γi. However, the estimate is imperfect. We assume that the error depends only on the user and its estimate. More precisely, we define the probability of error as pr̂i e,i , Pr[ri 6= r̂i | r̂i]. We assume pr̂i e,i < 0.5 because we can flip the estimate if pr̂i e,i > 0.5. We are not interested in the case of pr̂i e,i = 0.5 since r̂i,t is useless in this case. Although the channel is unreliable, each transmission attempt takes exactly one time slot regardless of the result, and the successfully transmitted update will not be corrupted. Every time an update is received, the receiver will use it as the new estimate X̂i,t. The receiver will send an ACK/NACK packet to inform the base station of its reception of the new update. Since an ACK/NACK packet is generally very small and simple, we assume that it is transmitted reliably and received instantaneously. Then, if ACK is received, the base station knows that the receiver’s estimate changed to the transmitted update. If NACK is received, the base station knows that the receiver’s estimate did not change. Therefore, the base station always knows the estimate at the receiver side. At the beginning of each time slot, the base station receives updates from each source and the estimates of CSI from each channel. The old updates and estimates are discarded upon the arrival of new ones. Then, the base station decides which updates to transmit, and the decision is independent of the transmission history. Due to the limited resources, at most M < N updates are allowed per transmission attempt. We consider a base station that always transmits M updates. 2.2. Age of Incorrect Information All the users adopt AoII as a performance metric, but the choices of penalty functions vary. Let Xt and X̂t be the true state and the estimate of the source process, respectively. Then, in a slotted-time system, AoII can be expressed as follows ∆AoII(Xt, X̂t, t) = t ∑ k=Ut+1 ( g(Xk, X̂k)× F(k−Ut) ) , (1) where Ut is the last time instance before time t (including t) that the receiver’s estimate is correct. g(Xt, X̂t) can be any information penalty function that captures the difference between Xt and X̂t. F(t) , f (t)− f (t− 1) where f (t) can be any time penalty function that is non-decreasing in t. We consider the case where the users adopt the same information penalty function g(Xt, X̂t) = |Xt− X̂t| but possibly different time penalty functions. To ease the analysis, we require f (t) to be unbounded. Combined together, we require f (t1) ≤ f (t2) if t1 < t2 and limt→+∞ f (t) = +∞. Without a loss of generality, we assume f (0) = 0, as the source is modeled by a two-state Markov chain, g(Xt, X̂t) ∈ {0, 1}. Hence, Equation (1) can be simplified to ∆AoII(Xt, X̂t, t) = t ∑ k=Ut+1 F(k−Ut) = f (st), where st , t−Ut. Therefore, the evolution of st is sufficient to characterize the evolution of AoII. To this end, we distinguish between the following cases. Entropy 2021, 23, 1572 4 of 39 • When the receiver’s estimate is correct at time t + 1, we have Ut+1 = t + 1. Then, by definition, st+1 = 0. • When the receiver’s estimate is incorrect at time t + 1, we have Ut+1 = Ut. Then, by definition, st+1 = t + 1−Ut = st + 1. To sum up, we get st+1 = 1{Ut+1 6=t+1} × (st + 1). (2) A sample path of st is shown in Figure 2. In the remainder of this paper, we use fi(·) to denote the time penalty function user i adopts. 0 1 2 3 4 5 6 7 1 2 3 X1 = 1 X̂1 = 1 X2 = 0 X̂2 = 1 X3 = 0 X̂3 = 0 X4 = 1 X̂4 = 0 X5 = 1 X̂5 = 0 X6 = 0 X̂6 = 1 X7 = 1 X̂7 = 1 t st Figure 2. A sample path of st. Remark 1. Under this particular choice of the penalty function, st can be interpreted as the time elapsed since the last time the receiver’s estimate is correct. Please note that st is different from the Age of Information (AoI) [17], which is defined as the time elapsed since the generation time of the last received update. We can see that AoI considers the aging process of the update, while AoII considers the aging process of the estimation error. At the same time, st is also fundamentally different from the holding time, which, according to [18,19], is defined as the time elapsed since the last successful transmission. We notice that the receiver’s estimate can become correct even when no new update is successfully transmitted. Moreover, the information carried by the update may have become incorrect by the time it is received. We also notice that [18,19] consider the problem of minimizing the estimation error. However, by adopting AoII as the performance metric, we study the impact of estimation error on the system. 2.3. System Dynamic In this section, we tackle the system dynamic. We notice that the status of user i can be captured by the pair xi,t , (si,t, r̂i,t). In the following, we will use xi,t and (si,t, r̂i,t) interchangeably. Then, the system dynamic can be fully characterized by the dynamic of xt , (x1,t, . . . , xN,t). Hence, it suffices to characterize the value of xt+1 given xt and the base station’s action. To this end, we denote, by at = (a1,t, . . . , aN,t), the base station’s action at time t. ai,t = 1 if the base station transmits the update from user i at time t and ai,t = 0 otherwise. We notice that given action at, users are independent and the action taken on user i will only affect itself. Consequently Pr(xt+1 | xt, at) = N ∏ i=1 Pr(xi,t+1 | xi,t, at) = N ∏ i=1 Pr(xi,t+1 | xi,t, ai,t). Combined with the fact that all the users share the same structure, it is sufficient to study the dynamic of a single user. In the following discussions, we drop the user-dependent subscript i. We recall that r̂t+1 is an independent Bernoulli random variable. Then, we have Pr(xt+1 | xt, at) = P(r̂t+1)× Pr(st+1 | xt, at). (3) Entropy 2021, 23, 1572 5 of 39 By definition, P(r̂t+1 = 1) = γ and P(r̂t+1 = 0) = 1− γ. Then, we only need to tackle the value of Pr(st+1 | xt, at). To this end, we distinguish between the following cases • When xt = (0, r̂t), the estimate at time t is correct (i.e., X̂t = Xt). Hence, for the receiver, Xt carries no new information about the source process. In other words, X̂t+1 = X̂t regardless of whether an update is transmitted at time t. We recall that Ut+1 = Ut if X̂t+1 6= Xt+1 and Ut+1 = t + 1 otherwise. Since the source is binary, we obtain Ut+1 = Ut if Xt+1 6= Xt, which happens with probability p and Ut+1 = t + 1 otherwise. According to (2), we obtain Pr(1 | (0, r̂t), at) = p, Pr(0 | (0, r̂t), at) = 1− p. • When at = 0 and xt = (st, r̂t), where st > 0, the channel will not be used and no new update will be received by the receiver,and so, X̂t+1 = X̂t. We recall that Ut+1 = Ut if X̂t+1 6= Xt+1 and Ut+1 = t + 1 otherwise. Since Xt 6= X̂t and the source is binary, we have Ut+1 = Ut if Xt+1 = Xt, which happens with probability 1− p and Ut+1 = t + 1 otherwise. According to (2), we obtain Pr(st + 1 | (st, r̂t), at = 0) = 1− p, Pr(0 | (st, r̂t), at = 0) = p. • When at = 1 and xt = (st, 1) where st > 0, the transmission attempt will succeed with probability 1 − p1 e and fail with probability p1 e . We recall that Ut+1 = Ut if X̂t+1 6= Xt+1 and Ut+1 = t + 1 otherwise. Then, when the transmission attempt succeeds (i.e., X̂t+1 = Xt), Ut+1 = Ut if Xt+1 6= Xt and Ut+1 = t + 1 otherwise. When the transmission attempt fails (i.e., X̂t+1 = X̂t 6= Xt), we have Ut+1 = Ut if Xt+1 = Xt and Ut+1 = t + 1 otherwise. Combining (2) with the dynamic of the source process we obtain Pr(st + 1 | (st, 1), at = 1) = p1 e (1− p) + (1− p1 e )p , α, Pr(0 | (st, 1), at = 1) = p1 e p + (1− p1 e )(1− p) = 1− α. • When at = 1 and xt = (st, 0), where st > 0, following the same line, we obtain Pr(st + 1 | (st, 0), at = 1) = p0 e p + (1− p0 e )(1− p) , β, Pr(0 | (st, 0), at = 1) = p0 e (1− p) + (1− p0 e )p = 1− β. Combines together, we obtain the value of Pr(st+1 | xt, at) in all cases. As only M out of N updates are allowed per transmission attempt, we realize a necessity to require transmission attempts always help minimize AoII. It is equivalent to impose Pr(st+1 > st | (st, r̂t), at = 0) > Pr(st+1 > st | (st, r̂t), at = 1) for any (st, r̂t). Leveraging the results above, it is sufficient to require p < 0.5. As all the users share the same structure, we assume, for the rest of this paper, that 0 < pi < 0.5 for 1 ≤ i ≤ N. 2.4. Problem Formulation The communication goal is to minimize the expected AoII. Therefore, the problem can be formulated as the following arg min φ ∈ Φ lim T→∞ 1 T Eφ ( T−1 ∑ t=0 N ∑ i=1 fi(si,t) ) (4a) subject to N ∑ i=1 ai,t = M ∀t, (4b) Entropy 2021, 23, 1572 6 of 39 where Φ is the set of all causal policies. We refer to the constrained minimization problem reported in problem (4) as the Primal Problem (PP). We notice that the PP is a Restless Multi-Armed Bandit (RMAB) Problem. The optimal policy for this type of problem is far from reachable since it is PSPACE-hard in general [20]. However, we can still derive the structural properties of the optimal policy. These structural properties can be used as a guide for the development of scheduling policies and can indicate the good performance of the developed scheduling policies. 3. Structural Properties of the Optimal Policy In this section, we investigate the structural properties of the optimal policy for PP. We first define an infinite horizon with an average cost Markov Decision Process (MDP) MN(w, M) = (XN ,AN(M),PN , CN(w)), where • XN denotes the state space. The state is x = (x1, . . . , xN) where xi = (si, r̂i). • AN(M) denotes the action space. The feasible action is a = (a1, . . . , aN) where ai ∈ {0, 1} and ∑N i=1 ai = M. Note that the feasible actions are independent of the state and the time. • PN denotes the state transition probabilities. We define Px,x′(a) as the probability that action a at state x will lead to state x′. It is calculated by Px,x′(a) = N ∏ i=1 P(r̂′i)Psi ,s′i (ai, r̂i), where Psi ,s′i (ai, r̂i) is the transition probability from si to s′i when the estimate of CSI is r̂i and action ai is taken. The values of Psi ,s′i (ai, r̂i) can be obtained easily from the results in Section 2.3. • CN(w) denotes the instant cost. When the system is at state x and action a is taken, the instant cost is C(x, a) , ∑N i=1 C(xi, ai) , ∑N i=1 ( fi(si) + wai ) . We notice that PP can be cast intoMN(0, M). Since w = 0, the instant cost is indepen- dent of action a. Therefore, we abbreviate C(x, a) as C(x). To simplify the analysis, we consider the case of M = 1. Equivalently, we investigate the structural properties of the optimal policy forMN(0, 1). Remark 2. For the case of M > 1, we can apply the same methodology. However, as M increases, the action space will grow quickly, resulting in the need to consider more feasible actions in each step of the proof. Hence, to better demonstrate the methodology, we only consider the case of M = 1 in this paper. It is well known that the optimal policy for MN(0, 1) can be characterized by the value function. We denote the value function of state x as V(x). A canonical procedure to calculate V(x) is applying the Value Iteration Algorithm (VIA). To this end, we define Vν(·) as the estimated value function at iteration ν of VIA and initialize V0(·) = 0. Then, VIA updates the estimated value functions in the following way Vν+1(x) = C(x)− θ + min a∈AN(1) { ∑ x′∈XN Px,x′(a)Vν(x′) } , (5) where θ is the optimal value of MN(0, 1). VIA is guaranteed to converge to the value function [21]. More precisely, Vν(·) = V(·) when ν → +∞. However, the exact value function is impossible to get since we need infinite iterations and the state space is infinite. Instead, we provide two structural properties of the value function. Lemma 1 (Monotonicity). ForMN(0, 1), V(x) is non-decreasing in si for 1 ≤ i ≤ N. Entropy 2021, 23, 1572 7 of 39 Proof. Leveraging the iterative nature of VIA, we use mathematical induction to prove the desired results. The complete proof can be found in Appendix A. Before introducing the next structural property, we make the following definition. Definition 1 (Statistically identical). Two users are said to be statistically identical if the user- dependent parameters and the adopted time penalty functions are the same. For the users that are statistically identical, we can prove the following Lemma 2 (Equivalence). ForMN(0, 1), if users j and k are statistically identical, V(x) = V (P(x)) where P(x) is state x with xj and xk exchanged. Proof. Leveraging the iterative nature of VIA, we use mathematical induction to prove the desired results. At each iteration, we show that for each feasible action at state x, we can find an equivalent action at state P(x). Two actions are equivalent if they lead to the same value function. The complete proof can be found in Appendix B. Equipped with the above lemmas, we proceed with characterizing the structural properties of the optimal policy. We recall that the optimal action at each state can be characterized by the value function. Hence, we denote, by V j(x), the value function resulting from choosing user j to update at state x. Then, V j(x) can be calculated by V j(x) = C(x)− θ + ∑ x′−x′j  ( ∏ i 6=j Pxi ,x′i (0) ) ∑ r̂′j P(r̂′j) ∑ s′j Psj ,s′j (1, r̂j)V(x′) . If V j(x) < Vk(x) for all k 6= j, it is optimal to transmit the update from user j. When V j(x) = Vk(x), the two choices are equally desirable. In the following, we will characterize the properties of δj,k(x) , V j(x)−Vk(x) for any j and k. Theorem 1 (Structural properties). ForMN(0, 1), δj,k(x) has the following properties 1. δj,k(x) ≤ 0 if r̂k = p0 e,k = 0. The equality holds when sj = 0 or r̂j = p0 e,j = 0. 2. δj,k(x) is non-increasing in r̂j and is non-decreasing in r̂k when sj, sk > 0. At the same time, δj,k(x) is independent of r̂i for any i 6= j, k. 3. δj,k(x) ≤ 0 if sk = 0. The equality holds when sj = 0 or r̂j = p0 e,j = 0. 4. δj,k(x) is non-increasing in sj if Γ r̂j j ≤ Γr̂k k and is non-decreasing in sk if Γ r̂j j ≥ Γr̂k k when sj, sk > 0. We define Γ1 i , αi 1−pi and Γ0 i , βi 1−pi for 1 ≤ i ≤ N. 5. δj,k(x) ≤ 0 if sj ≥ sk, r̂j ≥ r̂k, and users j and k are statistically identical. Proof. The proof can be found in Appendix C. We notice that Γr̂i i can be written as Γr̂i i = Pr(si + 1 | (si, r̂i), ai = 1) Pr(si + 1 | (si, r̂i), ai = 0) < 1, where si can be any positive integer. Consequently, Γr̂i i is independent of any si > 0 and indicates the decrease in the probability of increasing si caused by action ai = 1. When Γr̂i i is large, action ai = 1 will achieve a small decrease in the probability of increasing si. In the following, we provide an intuitive interpretation of why the monotonicity in Property 4 of Theorem 1 depends on Γr̂i i . We take the case of Γ r̂j j ≤ Γr̂k k as an example and assume that there are only users j and k in the system. Then, according to Section 2.3, the dynamic of sj and sk can be divided into the following three cases Entropy 2021, 23, 1572 8 of 39 • Neither sj nor sk increases. In this case, both sj and sk become zero. • Either sj or sk increases and the other becomes zero. We denote by Pk j the probability that only sk increases when aj = 1. The notation for other cases is defined analogously. The probabilities can be obtained easily using the results in Section 2.3. • Both sj and sk increase. We denote by Pj the probability that both sj and sk increase when aj = 1. Pk is defined analogously. The probabilities can be obtained easily using the results in Section 2.3. We notice that δj,k(x) implies the tendency of the base station to choose between the two users. The larger δj,k(x) is, the more the base station tends to choose user k. Thus, we investigate the base station’s propensity to choose user k when sk increases but sj stays the same. We ignore the case where the resulting sk is zero since it is independent of the increase in sk. With this in mind, we first notice that Pk k ≤ Pk j . Meanwhile, we can easily verify that Pj Pk = Γ r̂j j Γ r̂k k . When Γ r̂j j ≤ Γr̂k k , we have Pj ≤ Pk. Then, there exists a subtle trade-off. More precisely, choosing user k will result in Pk k ≤ Pk j , but at the cost of Pk ≥ Pj. Hence, in this case, the propensity of the base station is hard to determine. Following the same line, we can show that choosing user j will lead to Pj j ≤ Pj k and Pj ≤ Pk. Thus, there exists no such trade-off when we investigate the base station’s propensity to choose user j as sj increases but sk stays the same. Leveraging Theorem 1, we can provide some specific structural properties of the optimal policy. Corollary 1 (Application of Theorem 1). When M = 1, the optimal policy for PP must satisfy the following 1. The user i with r̂i = p0 e,i = 0 or si = 0 will not be chosen unless it is to break the tie. 2. When user j is chosen at state x1, then for state x2, such that r̂1,j ≤ r̂2,j and s1,i = s2,i for 1 ≤ i ≤ N, the optimal choice must be in the set G = {j} ∪ {k : r̂1,k < r̂2,k}. 3. When N = 2, we consider two states, x1 and x2, which differ only in the value of sj. Specifically, s1,j ≤ s2,j. If user j is chosen at state x1 and Γ r̂1,j j ≤ Γr̂1,k k , the optimal choice at state x2 will also be user j. 4. When N = 2, we consider two states, x1 and x2, which differ only in the value of sk. Specifically, s1,k ≥ s2,k. If user j is chosen at state x1 and Γ r̂1,j j ≥ Γr̂1,k k , the optimal choice at state x2 will also be user j. 5. When all users are statistically identical, the optimal choice at any time slot must be either the user with x = (smax,1, 1) where smax,1 , maxsi{(si, 1)} or the user with x = (smax,0, 0) where smax,0 , maxsi{(si, 0)}. Moreover, • If smax,1 ≥ smax,0, it is optimal to choose the user with x = (smax,1, 1). • If smax,1 < smax,0, the optimal choice will switch from the user with x = (smax,0, 0) to the user with x = (smax,1, 1) when smax,1 increases from 0 to smax,0 solely. Proof. The first property follows directly from Property 1 and Property 3 of Theorem 1. For the second property, leveraging Property 2 of Theorem 1, we have δj,k(x2) ≤ δj,k(x1) ≤ 0 if r̂1,j ≤ r̂2,j, r̂1,k ≥ r̂2,k, and s1,i = s2,i for 1 ≤ i ≤ N. Thus, the optimal choice will not be user k in this case. Then, we can conclude that the optimal choice must be in the set G = {j} ∪ {k : r̂1,k < r̂2,k}. For the third property, we have proved in Property 4 of Theorem 1 that δj,k(x) is non-increasing in sj if Γ r̂j j ≤ Γr̂k k . Hence, δj,k(x2) ≤ δj,k(x1) ≤ 0. As we consider the case of N = 2, the optimal choice at state x2 will also be user j. The fourth property can be shown in a similar way by noticing that δj,k(x) is non-decreasing in sk when Γ r̂j j ≥ Γr̂k k . For the last property, we recall from Property 5 of Theorem 1 that it is always better to choose the user with a larger s if they are statistically identical and have the same r̂. Thus, Entropy 2021, 23, 1572 9 of 39 we can conclude that the optimal choice must be either the user with x = (smax,1, 1) or the user with x = (smax,0, 0). Without a loss of generality, we assume xj = (smax,1, 1) and xk = (smax,0, 0). Now, we distinguish between the following cases • According to Property 5 of Theorem 1, we can conclude that it is optimal to choose user j when smax,1 ≥ smax,0. • To determine the optimal choice in the case of smax,1 < smax,0, we recall that the optimal choice will be user k (i.e., δj,k(x) ≥ 0) if sj = 0 and will be user j (i.e., δj,k(x) ≤ 0) if sj = sk. At the same time, Property 4 of Theorem 1 tells us that δj,k(x) is non-increasing in sj when users j and k are statistically identical. Therefore, we can conclude that the optimal choice will switch from user k to user j when sj increases from 0 to sk solely. 4. Whittle’s Index Policy Whittle’s index policy is a well-known low-complexity heuristic that shows a strong performance in many problems that belong to RMAB [22–24]. In this section, we develop Whittle’s index policy for PP. We first present the general procedures we adopt to obtain Whittle’s index. • We first formulate a relaxed version of PP and apply the Lagrangian approach. • Then, we decouple the problem of minimizing the Lagrangian function into N decou- pled problems, each of which only considers a single user. By casting the decoupled problem into an MDP, we investigate the structural properties and performance of the optimal policy. • Leveraging the results above and under a simple condition, we establish the indexa- bility of the decoupled problem. • Finally, we obtain the expression of Whittle’s index by solving the Bellman equation. 4.1. Relaxed Problem The first step in obtaining Whittle’s index is to formulate the Relaxed Problem (RP). More precisely, instead of requiring the limit on the number of updates allowed per transmission attempt to be met in each time slot, we relax the constraint such that the limit is not violated in an average sense. Then, RP can be formulated as arg min φ ∈ Φ ∆̄φ , lim T→∞ 1 T Eφ ( T−1 ∑ t=0 N ∑ i=1 fi(si,t) ) (6a) subject to ρ̄φ , lim T→∞ 1 T Eφ ( T−1 ∑ t=0 N ∑ i=1 ai,t ) ≤ M. (6b) As RP is specified, we apply the Lagrangian approach. First of all, we write RP into its Lagrangian form. L(λ, φ) = lim T→∞ 1 T Eφ ( T−1 ∑ t=0 N ∑ i=1 ( fi(si,t) + λai,t) ) − λM, where λ ≥ 0 is the Lagrange multiplier. Then, we investigate the problem of minimizing the Lagrangian function. Since λM is independent of policies, we can ignore it. More precisely, we consider the following minimization problem minimize φ ∈ Φ lim T→∞ 1 T Eφ ( T−1 ∑ t=0 N ∑ i=1 ( fi(si,t) + λai,t) ) . (7) 4.2. Decoupled Model In this section, we formulate the decoupled problem and investigate its optimal policy. The decoupled model associated with each user follows the system model with N = 1. Entropy 2021, 23, 1572 10 of 39 Since all the users share the same structure, we drop the user-dependent subscript i for simplicity. Then, the decoupled problem can be formulated as minimize φ ∈ Φ′ lim T→∞ 1 T Eφ ( T−1 ∑ t=0 ( f (st) + λat) ) , (8) where Φ′ is the set of all causal policies when N = 1. We notice that problem (8) can be cast into the MDPM1(λ,−1). We define M = −1 when there is no restriction on the number of updates allowed per transmission attempt. We first investigate the structural properties of the optimal policy forM1(λ,−1) when λ is a given non-negative constant. We start with characterizing the corresponding value function V(x). Corollary 2 (Extension of Lemma 1). ForM1(λ,−1), V(x) is non-decreasing in s. Proof. The proof follows the same steps as in the proof of Lemma 1. The complete proof can be found in Appendix D. Equipped with the above corollary, we can characterize the structural properties of the optimal policy for (8). Proposition 1 (Optimal policy for decoupled problem). The optimal policy for the decoupled problem is a threshold policy with the following properties. • The optimal policy can be fully captured by n = (n0, n1). More precisely, when the system is at state (s, r̂), it is optimal to make a transmission attempt only when s ≥ nr̂. • n0 ≥ n1 > 0. Proof. We define ∆V(x) , V1(x)− V0(x), where Va(x) is the value function resulting from taking action a at state x. Then, the optimal action at state x is a = 1 if ∆V(x) < 0, and a = 0 is optimal otherwise. We use Corollary 2 to characterize the sign of ∆V(x). The complete proof can be found in Appendix E. In the following, we evaluate the performance of the threshold policy detailed in Proposition 1. More precisely, we calculate the expected AoII ∆̄n and the expected transmis- sion rate ρ̄n resulting from the adoption of threshold policy n. We will see in the following that ∆̄n and ρ̄n are essential for establishing the indexability and obtaining the expression of Whittle’s index. Proposition 2 (Performance). Under threshold policy n = (n0, n1), ∆̄n = π0 p [ n1−1 ∑ k=1 f (k)(1− p)k−1 + (1− p)n1−1 ( n0−1 ∑ k=n1 f (k)ck−n1 1 + cn0−n1 1 +∞ ∑ k=n0 f (k)ck−n0 2 )] , ρ̄n = π0 p(1− p)n1−1 [ γ 1− c1 + cn0−n1 1 ( 1 1− c2 − γ 1− c1 )] , where π0 = 1 2 + p(1− p)n1−1 [ 1 1− c1 − 1 p + cn0−n1 1 ( 1 1− c2 − 1 1− c1 )] , c1 = (1− γ)(1− p) + γα, and c2 = (1− γ)β + γα. Proof. We notice that the dynamic of AoII under the threshold policy can be fully captured by a Discrete-Time Markov Chain (DTMC). Then, combined with the fact that r̂ is an inde- pendent Bernoulli random variable, we can obtain the desired results from the stationary distribution of the induced DTMC. The complete proof can be found in Appendix F. Entropy 2021, 23, 1572 11 of 39 As f (·) can be any non-decreasing function, ∆̄ can grow indefinitely. Thus, it is necessary to require that there exists at least one threshold policy that causes a finite ∆̄. By noting that 1− p ≥ c1 ≥ c2, we have ∆̄ ≥ π0 p [ n1−1 ∑ k=1 f (k)ck−1 2 + cn1−1 2 ( n0−1 ∑ k=n1 f (k)ck−n1 2 + cn0−n1 2 +∞ ∑ k=n0 f (k)ck−n0 2 )] = π0 p ( +∞ ∑ k=1 f (k)ck−1 2 ) . The equality is achieved when n0 = n1 = 1. Then, we can conclude that it is sufficient to require ∑+∞ k=1 f (k)ck−1 2 < +∞. This will be the underlying assumption throughout the rest of this paper. 4.3. Indexability In this section, we establish the indexability of the decoupled problem, which ensures the existence of Whittle’s index. We start with the definition of indexability. Definition 2 (Indexability). The decoupled problem is indexable if the set of states in which a = 0 is the optimal action increases with λ, that is, λ′ < λ =⇒ D(λ′) ⊆ D(λ), where D(λ) is the set of states in which a = 0 is optimal when Lagrange multiplier λ is adopted. The Lagrange multiplier λ can be viewed as a cost associated with each transmission attempt. Intuitively, as λ increases, the base station should stay idle (i.e., a = 0) for a longer time until s becomes large enough to offset the cost. Although it is intuitively correct that the decoupled problem is indexable, the indexability is hard to establish as the optimal policy is characterized by two thresholds. Thus, Whittle’s index does not necessarily exist. However, the indexability can be established when the following condition is satisfied p0 e,i = 0 f or 1 ≤ i ≤ N. (9) Remark 3. Problem (9) only requires the estimate r̂i to be perfect when r̂i = 0. In the case of r̂i = 1, we still allow the estimate to be inaccurate. When (9) is satisfied, Propositions 1 and 2 reduce to the following Corollary 3 (Consequences of (9)). When (9) is satisfied, the optimal policy for the decoupled problem (8) is the threshold policy n = (+∞, n). The corresponding ∆̄n and ρ̄n are ∆̄n = π0 p ( n−1 ∑ k=1 f (k)(1− p)k−1 + (1− p)n−1 +∞ ∑ k=n f (k)ck−n 1 ) , ρ̄n = π0 p(1− p)n−1 ( γ 1− c1 ) , where π0 = 1 2 + p(1− p)n−1 ( 1 1− c1 − 1 p ) . Proof. We continue with the same notations as in the proof of Propositions 1 and 2. It is sufficient to show that n0 = +∞. To this end, we consider the state x = (s, 0). By following the same steps as in the proof of Proposition 1, we have Entropy 2021, 23, 1572 12 of 39 ∆V(s, 0) = λ ≥ 0. Therefore, it is optimal to stay idle (i.e., a = 0) at state x = (s, 0) for any s ≥ 0. Equivalently, n0 = +∞. Then, the corresponding ∆̄n and ρ̄n can be calculated as a special case of Proposition 2 where n0 = +∞, n1 = n, and p0 e = 0. Leveraging Corollary 3, we can establish the indexability of the decoupled problem. Proposition 3 (Indexability of decoupled problem). The decoupled problem is indexable when (9) is satisfied. Proof. According to Proposition 2.2 of [25], we only need to verify that the expected transmission rate ρ̄n is strictly decreasing in n. From Corollary 3, we have ρ̄n = γ ( p 1− c1 ) 2 (1− p)n−1 + ( p 1− c1 − 1 ) . As 1 2 < 1− p < 1, we can easily verify that ρ̄n is strictly decreasing in n. Thus, the decou- pled problem is indexable when (9) is satisfied. 4.4. Whittle’s Index Policy In this section, we proceed with finding the expression of Whittle’s index and defining Whittle’s index policy. First of all, we give the definition of Whittle’s index. Definition 3 (Whittle’s index). When the decoupled problem is indexable, Whittle’s index at state x is defined as the infimum λ, such that both actions are equally desirable. Equivalently, Whittle’s index at state x is defined as the infimum λ such that V0(x) = V1(x). Let us denote by Wx the Whittle’s index at state x. Then, the expression of Whittle’s index is given by the following Proposition. Proposition 4 (Whittle’s index). When (9) is satisfied, Whittle’s index is Wx =  0 when x = (0, r̂) or x = (s, 0), (1− c1) +∞ ∑ k=s+1 f (k)ck−s−1 1 − ∆̄s (1− c1)(1− p)− γ(1− p− α) c1(1− p− α) + ρ̄s when x = (s, 1), where s > 0 and c1 = (1− γ)(1− p) + γα. ∆̄s and ρ̄s are the expected AoII and the expected transmission rate when threshold policy n = (+∞, s) is adopted, respectively. At the same time, Wx is non-negative and is non-decreasing in s. Proof. Whittle’s indexes at state x = (0, r̂) and x = (s, 0) are obtained easily from the proof of Proposition 1. For state x = (s, 1), we first use backward induction to calculate the expressions of some value functions. Then, the expression of Whittle’s index can be obtained from its definition. The complete proof can be found in Appendix G. Definition 4 (Whittle’s index policy). At any state x = (x1, x2, . . . , xN), the base station will transmit the updates from M users with the largest Wxi . The ties are broken arbitrarily. Wxi is calculated using Proposition 4 with the parameters of user i. Remark 4. Whittle’s index policy possesses the structural properties detailed in Corollary 1. Entropy 2021, 23, 1572 13 of 39 • The first two properties can be verified by noting that Wxi ≥ 0 and the equality holds when r̂i = 0 or si = 0. At the same time, Wxi is non-decreasing in r̂i. • The third and fourth properties can be verified by noting that Wxi is non-decreasing in si. • For the last property, we first notice that Wxj = Wxk when users j and k are statistically identical and xj = xk. Then, the property can be verified by noting that Wxi is non-decreasing in both si and r̂i. 5. Optimal Policy for Relaxed Problem In this section, we provide an efficient algorithm to obtain the optimal policy for RP, based on which we will develop another scheduling policy for PP in the next section that is free from indexability. At the same time, the performance of the optimal policy for RP forms a universal lower bound because the following ordering holds ∆̄RP AoII ≤ ∆̄PP AoII , where ∆̄RP AoII and ∆̄PP AoII are the minimal expected AoII of RP and PP, respectively. Remark 5. Note that the optimal policy for RP may not necessarily be a valid policy for PP, as the transmitter may transmit more than M updates in one transmission attempt under RP- optimal policy. To solve RP, we follow the discussion in Section 4.1. More precisely, we take the Lagrangian approach and consider the problem reported in (7). We will see in the following discussion that the optimal policy for RP can be characterized by the optimal policies for problem (7). Therefore, we first cast problem (7) into the MDP MN(λ,−1). However, the optimal policy forMN(λ,−1) is difficult to obtain because the state space is infinite. Even though we can make the state space finite by imposing an upper limit on the value of s, the state space and the action space grow exponentially with the number of users in the system. To overcome the difficulty, we investigate the optimal policy forMi 1(λ,−1) where 1 ≤ i ≤ N. The superscript i means that the only user in the system is user i. We will show later that the optimal policy forMN(λ,−1) can be fully characterized by the optimal policies forMi 1(λ,−1) where 1 ≤ i ≤ N. 5.1. Optimal Policy for Single User In this section, we tackle the problem of finding the optimal policy forMi 1(λ,−1). Since the users share the same structure, we ignore the superscript i for simplicity. To find the optimal policy, we first use the Approximating Sequence Method (ASM) introduced in [26] to make the state space finite. More precisely, we impose s ≤ m where m is a predetermined upper limit. The state transition probabilities P′s,s′(a, r̂) are modified in the following way P′s,s′(a, r̂) = { Ps,s′(a, r̂) i f s′ < m, Ps,s′(a, r̂) + ∑z>m Ps,z(a, r̂) i f s′ = m. (10) The action space and the instant cost remain unchanged. Then, we can apply Relative Value Iteration (RVI) with convergence criteria ε to obtain the optimal policy. We notice thatM1(λ,−1) coincides with the decoupled model studied in Section 4.2. Hence, we can utilize the threshold structure of the optimal policy to improve RVI. To this end, we class a state as active if the optimal action at this state is a = 1. Then, the threshold structure detailed in Proposition 1 tells us the following. For any state x, if there exists an active state x1 with s1 ≤ s and r̂1 ≤ r̂, then x must also be active. Hence, we can determine the optimal action at state x immediately instead of comparing all feasible actions. In this way, we can reduce the running time of RVI. The pseudocode for the improved RVI can be found in Algorithm A1 of Appendix M. A similar technique is also presented in [5]. Entropy 2021, 23, 1572 14 of 39 For M1(λ,−1), when problem (9) is satisfied, Whittle’s index exists and can be calculated efficiently using Proposition 4. Therefore, we can obtain the optimal policy using Whittle’s index and further reduce the computational complexity. To this end, we denote by nλ the optimal policy forM1(λ,−1) and present the following proposition Proposition 5 (Optimal deterministic policy). When (9) is satisfied, the optimal policy for M1(λ,−1) is nλ = (+∞, n) where n is given by n = { 1 i f λ = 0, max{s ∈ N0 : Ws ≤ λ}+ 1 i f λ > 0. Ws is the Whittle’s index at state (s, 1). Proof. We first notice that M1(λ,−1) coincides with the decoupled model studied in Section 4.2. Then, we show the optimal action for each state with r̂ = 1 using the definition of Whittle’s index and the fact that the decoupled problem is indexable when (9) is satisfied. The complete proof can be found in Appendix H. In the following, we provide a randomized policy that is also optimal forM1(λ,−1). We will see later that the randomized policy is the key to obtaining the optimal policy for RP. Theorem 2 (Optimal randomized policy). There exist two deterministic policies nλ+ and nλ− , which are both optimal forM1(λ,−1). We consider the following randomized policy nλ: every time the system reaches state (0, 0), the base station will make the choice between nλ− with probability µ and nλ+ with probability 1− µ. The chosen policy will be followed until the next choice. Then, the randomized policy nλ is optimal forM1(λ,−1) under any µ ∈ [0, 1]. Proof. We show that our system verifies the assumptions given in [27]. Then, leveraging the characteristics of our system, we can obtain the optimal randomized policy. The com- plete proof can be found in Appendix I. In practice, we approximate λ+ ≈ λ + ξ and λ− ≈ λ− ξ where ξ is a small pertur- bation. Then, the deterministic policies nλ+ and nλ− can be obtained by following the discussion at the beginning of this subsection. Note that, in most cases, nλ+ and nλ− are the same. 5.2. Optimal Policy for RP In this section, we characterize the optimal policy for RP. Let us denote by V(x) and Vi(xi) the value functions ofMN(λ,−1) andMi 1(λ,−1), respectively. Then, we can prove the following Proposition 6 (Separability). V(x) = ∑N i=1 Vi(xi) where x = (x1, . . . , xN). In other words, the policy, under which each user adopts its own optimal policy, is optimal forMN(λ,−1). Proof. We show V(x) = ∑N i=1 Vi(xi) by comparing the Bellman equations they must satisfy. The complete proof can be found in Appendix J. We denote the optimal policy forMN(λ,−1) as φλ = [nλ,1, . . . , nλ,N ] where nλ,i is the optimal policy forMi 1(λ,−1). For simplicity, we define ∆̄(λ) and ρ̄(λ) as the expected AoII and the expected transmission rate associated with φλ, respectively. ∆̄i(λ) and ρ̄i(λ) are defined analogously for user i under policy nλ,i. We also define λ∗ , inf{λ > 0 : ρ̄(λ) ≤ M}. With Proposition 6 and the above definitions in mind, we proceed with constructing the optimal policy for RP. Entropy 2021, 23, 1572 15 of 39 Theorem 3 (Optimal policy for RP). The optimal policy for RP can be characterized by two deterministic policies φλ∗+ = [nλ∗+ ,1, . . . , nλ∗+ ,N ] and φλ∗− = [nλ∗− ,1, . . . , nλ∗− ,N ] where nλ∗+ ,i and nλ∗− ,i are both the optimal deterministic policies forMi 1(λ ∗,−1). Then, we mix φλ∗+ and φλ∗− in the following way: for each user i, every time the user reaches state (0, 0), the base station will make the choice between nλ∗− ,i with probability µi and nλ∗+ ,i with probability 1− µi. The chosen policy will be followed by user i until the next choice. Where 1 ≤ i ≤ N, the µi is chosen in such a way as to satisfy N ∑ i=1 ρ̄i(λ∗) = N ∑ i=1 ( µi ρ̄ i(λ∗−) + (1− µi)ρ̄ i(λ∗+) ) = M. (11) Then, the mixed policy, denoted by φλ∗ , is optimal for RP. Proof. According to Lemma 3.10 of [27], a policy is optimal for RP if 1. It is optimal forMN(λ ∗,−1); 2. The resulting expected transmission rate is equal to M. Then, we construct such a policy using Theorem 2 and Proposition 6. The complete proof can be found in Appendix K. Since we approximate λ∗+ ≈ λ∗ + ξ and λ∗− ≈ λ∗ − ξ in practice, ρ̄i(λ∗+) ≤ ρ̄i(λ∗−) for all i according to the monotonicity given by Lemma 3.4 of [27]. Combining with the definition of λ∗, we must have ρ̄(λ∗+) ≤ M < ρ̄(λ∗−). Therefore, we can always find µi’s that realize (11). In this paper, we choose µi = µ = M− ρ̄(λ∗+) ρ̄(λ∗−)− ρ̄(λ∗+) , f or 1 ≤ i ≤ N. (12) Then, we describe the algorithm used to obtain the optimal policy for RP. As detailed in Theorem 3, it is essential to find λ∗. To this end, we recall that, for any user i under given λ, the optimal deterministic policy nλ,i can be obtained using the results in Section 5.1 and the resulting expected transmission rate ρ̄i(λ) is given by Proposition 2. Since ρ̄i(λ) is non-increasing in λ for all i according to Lemma 3.4 of [27], ρ̄(λ) = ∑N i=1 ρ̄i(λ) is also non-increasing in λ. Hence, we can regard ρ̄(λ) as a non-increasing function of λ. Then, according to the definition of λ∗, we can use the Bisection search to obtain λ∗ efficiently. The main steps can be summarized as follows. 1. Initialize λ− = 0 and λ+ = 1. 2. Do λ− = λ+ and λ+ = 2λ+ until ρ̄(λ+) < M. 3. Run Bisection search on the interval [λ−, λ+] until the tolerance 2ξ is met. Then, λ∗− and λ∗+ can simply be the boundaries of the final interval. The pseudocode for the Bisection search can be found in Algorithm A2 of Appendix M. After obtaining λ∗− and λ∗+, the optimal policy φλ∗ is detailed in Theorem 3 and the mixing probabilities µi’s are given by (12). Remark 6. We recall that the optimal deterministic policy for each user can be characterized by two positive thresholds (i.e., n0, n1 > 0). Consequently, under RP-optimal policy, the base station will never choose the user at state (0, r̂). Then, when M increases, the expected transmission rate achieved by RP-optimal policy will saturate before M reaches N. When the expected transmission rate saturates, the RP-optimal policy is φ∗ = [n1, . . . , nN ] where ni = (1, 1) for 1 ≤ i ≤ N. The saturation happens when M is larger than or equal to the expected transmission rate achieved by φ∗. 6. Indexed Priority Policy Although the performance of Whittle’s index policy is known to be good, it requires indexability, which is usually difficult to establish. In this section, based on the primal- dual heuristic introduced in [28], we develop a policy that does not require indexability Entropy 2021, 23, 1572 16 of 39 and has comparable performance to Whittle’s index policy. We start with presenting the primal-dual heuristic. 6.1. Primal-Dual Heuristic The heuristic is based on the optimal primal and dual solution pair to the linear program associated with RP. To introduce the linear program, we define π ai xi (φ) ≥ 0 as the expected time that user i is at state xi and action ai is taken according to policy φ. Then, for any φ, π ai xi (φ) must satisfy the following problems π0 xi (φ) + π1 xi (φ) = ∑ x′i ∑ a′i Px′i ,xi (a′i)π a′i x′i (φ), ∀xi, i. ∑ xi ∑ ai π ai xi (φ) = 1, ∀i. The objective function of RP can be rewritten as minimize φ ∈ Φ N ∑ i=1 ∑ xi ,ai C(xi)π ai xi (φ), where C(xi) = fi(si) is the instant cost at state xi. The constraint on the expected transmis- sion rate can be rewritten as N ∑ i=1 ∑ xi π1 xi (φ) ≤ M. Thus, the linear program associated with RP can be formulated as the following minimize π ai xi N ∑ i=1 ∑ xi ,ai C(xi)π ai xi (13a) subject to π0 xi + π1 xi −∑ x′i ∑ a′i Px′i ,xi (a′i)π a′i x′i = 0, ∀xi, i, (13b) ∑ xi ∑ ai π ai xi = 1, ∀i, (13c) N ∑ i=1 ∑ xi π1 xi ≤ M, (13d) π ai xi ≥ 0, ∀xi, ai, i. (13e) The corresponding dual problem is maximize σ, σi, σxi N ∑ i=1 σi −Mσ (14a) subject to σxi + σi −∑ x′i Pxi ,x′i (0)σx′i ≤ C(xi), ∀xi, i, (14b) σxi + σi −∑ x′i Pxi ,x′i (1)σx′i − σ ≤ C(xi), ∀xi, i, (14c) σ ≥ 0. (14d) Let {π̄ai xi} and {σ̄, σ̄i, σ̄xi} be the optimal primal and dual solution pair to the problems reported in (13) and (14). We define ψ̄0 xi = ∑ x′i Pxi ,x′i (0)σ̄x′i + C(xi)− σ̄i − σ̄xi ≥ 0, ψ̄1 xi = ∑ x′i Pxi ,x′i (1)σ̄x′i + σ̄ + C(xi)− σ̄i − σ̄xi ≥ 0. Entropy 2021, 23, 1572 17 of 39 For any state x = (x1, . . . , xN), let h(x) = ∑N i=1 1{π̄1 xi>0}. Then, the heuristic operates in the following way • If h(x) ≥ M, the base station will choose the M users with the largest ψ̄0 xi among the h(x) users. • If h(x) < M, these h(x) users are chosen by the base station. The base station will choose M− h(x) additional users with the smallest ψ̄1 xi . However, Linear Programming (LP) is a very general technique and does not appear to take advantage of the special structure of the problem. Although there are algorithms for solving rational LP that take time polynomial in the number of variables and constraints, they run extremely slowly in practice [29]. For our problem, we notice that the users have separate activity areas that are linked through a common resource constraint. Therefore, the primal problem can be solved using Dantzig-Wolfe decomposition. Even so, the prob- lem is still computationally demanding when the system scales up. We recall that we solved the exact problem efficiently using MDP-specific algorithms in Section 5. It is more efficient because of the following reasons • According to Proposition 6, we can decompose the problem into N subproblems. • For each subproblem, the threshold structure of the optimal policy is utilized to reduce the running time of RVI. • As we will see later, the developed policy can be obtained directly from the result of RVI in practice. In the following, we will translate the results in Section 5 into the optimal primal and dual solution pair and propose Indexed priority policy. 6.2. Indexed Priority Policy We first define the Lagrangian function associated with (13). L(πai xi , σ, σi, σxi , ψ ai xi ) = ( N ∑ i=1 ∑ xi ,ai C(xi)π ai xi ) + ∑ i,xi σxi ( ∑ x′i ∑ a′i Px′i ,xi (a′i)π a′i x′i − π0 xi − π1 xi ) + N ∑ i=1 σi ( 1−∑ xi ∑ ai π ai xi ) + σ ( N ∑ i=1 ∑ xi π1 xi −M ) − ∑ i,xi ,ai ψ ai xi π ai xi . Then, the corresponding Lagrangian dual function is g(σ, σi, σxi , ψ ai xi ) = inf π ai xi L(πai xi , σ, σi, σxi , ψ ai xi ). Let πxi be the expected time that user i is at state xi caused by the adoption of φλ∗ , where φλ∗ is the optimal policy detailed in Theorem 3. Then, we define {πai xi} as follows • State xi is where randomization happens (randomization happens when the actions suggested by the two optimal deterministic policies are different), and it has a value of π0 xi = anλ∗− ,i (xi)(1− µi)πxi + anλ∗+ ,i (xi)µiπxi and π1 xi = πxi − π0 xi where µi is given by (12) and anλ,i (xi) is the action suggested by nλ,i at state xi. • For other values of xi, we have π0 xi = (1− anλ∗ ,i (xi))πxi and π1 xi = πxi − π0 xi . We also define σ = λ∗, σi = θi, and σxi = Vi(xi) where λ∗ is specified in Section 5.2, θi is the optimal value ofMi 1(λ ∗,−1), and Vi(xi) is the value function associated withMi 1(λ ∗,−1). Lastly, we define {ψai xi} as follows ψ0 xi = ∑ x′i Pxi ,x′i (0)σx′i + C(xi)− σi − σxi , ψ1 xi = ∑ x′i Pxi ,x′i (1)σx′i + σ + C(xi)− σi − σxi . Entropy 2021, 23, 1572 18 of 39 Then, we can prove the following proposition. Proposition 7 (Optimal solution pair). {πai xi} and {σ, σi, σxi , ψ ai xi} are primal and dual solutions to (13), respectively. Proof. Since (13) is linear and strictly feasible, it is sufficient to show that {πai xi} and {σ, σi, σxi , ψ ai xi} verify the KKT conditions, which can be expressed as the following four con- ditions. 1. Primal feasibility: the constraints in (13) are satisfied. 2. Dual feasibility: σ ≥ 0 and ψ ai xi ≥ 0 for all xi, ai, and i. 3. Complementary slackness: σ ( ∑N i=1 ∑xi π1 xi − M ) = 0 and ψ ai xi π ai xi = 0 for all xi, ai, and i. 4. Stationarity: the gradient of L(πai xi , σ, σi, σxi , ψ ai xi ) with respect to {πai xi} vanishes. Apparently, the first condition is satisfied by {πai xi}. For the second condition, σ ≥ 0 since σ = λ∗ ≥ 0 by definition. For ψ ai xi , we can verify that ψ ai xi = Vi,ai (xi) − Vi(xi) where Vi,ai (xi) is the value function resulting from taking action ai at state xi. Then, the non- negativity is guaranteed by the Bellman equation. For the third condition, the first term is zero because we choose the µi’s given by (12). For the second term, we recall that ψ ai xi = Vi,ai (xi)−Vi(xi). According to the definition of π ai xi , we know Vi(xi) = Vi,ai (xi) if π ai xi > 0. Combined together, we can conclude that ψ ai xi = 0 when π ai xi > 0. Thus, the third condition is satisfied. For the last condition, setting the gradient equal to zero yields a system of linear equations. More precisely, for each xi and 1 ≤ i ≤ N ∑ x′i Pxi ,x′i (0)σx′i + C(xi) = σxi + σi + ψ0 xi . ∑ x′i Pxi ,x′i (1)σx′i + σ + C(xi) = σxi + σi + ψ1 xi . Then, {σ, σi, σxi , ψ ai xi} verifies the system of linear equations by definition. Since all four conditions are satisfied, we can conclude our proof. According to Proposition 7, we know that {πai xi} and {σ, σi, σxi} defined above are the optimal solutions to problems (13) and (14), respectively. As the optimal solutions are obtained, we can adopt the heuristic detailed in Section 6.1. The heuristic can be expressed equivalently as an index policy. To this end, we define the index Ixi for state xi as Ixi , ψ̄0 xi − ψ̄1 xi . According to the complementary slackness, Ixi can be reduced to the following. • For state xi such that π̄1 xi > 0 and π̄0 xi = 0, we have ψ̄1 xi = 0. Therefore, Ixi = ψ̄0 xi ≥ 0. • For state xi such that π̄1 xi > 0 and π̄0 xi > 0, we have ψ̄1 xi = ψ̄0 xi = 0. Therefore, Ixi = 0. • For state xi such that π̄1 xi = 0 and π̄0 xi > 0, we have ψ̄0 xi = 0. Therefore, Ixi = −ψ̄1 xi ≤ 0. We can show that Ixi possesses the following properties. Proposition 8 (Properties of Ixi ). For 1 ≤ i ≤ N, Ixi ≥ −λ∗ for any xi. The equality holds when r̂i = p0 e,i = 0 or si = 0. At the same time, Ixi is non-decreasing in both si and r̂i. Proof. We notice that Ixi can be expressed as a function of Vi(xi) and λ∗. Meanwhile, Mi 1(λ ∗,−1) coincides with the decoupled model studied in Section 4.2. Then, we can verify the properties of Ixi using the results in Section 4.2. The complete proof can be found in Appendix L. Comparing with the heuristic detailed in Section 6.1, we can define the Indexed priority policy. Entropy 2021, 23, 1572 19 of 39 Definition 5 (Indexed priority policy). At any state x = (x1, x2, . . . , xN), the base station will transmit the updates from M users with the largest Ixi . The ties are broken arbitrarily. Remark 7. Indexed priority policy belongs to the class of priority policies introduced in [30]. These priority policies are asymptotically optimal when certain conditions are satisfied. Remark 8. Indexed priority policy possesses the structural properties detailed in Corollary 1. • The first two properties can be verified by noting that Ixi ≥ −λ∗ and the equality holds when r̂i = p0 e,i = 0 or si = 0. At the same time, Ixi is non-decreasing in r̂i. • The third and fourth properties can be verified by noting that Ixi is non-decreasing in si. • For the last property, we first notice that Ixj = Ixk when users j and k are statistically identical and xj = xk. Then, the property can be verified by noting that Ixi is non-decreasing in both si and r̂i. We notice that θi’s and C(xi)’s are canceled out by the definition of Ixi . Therefore, Ixi can be calculated using λ∗ and the value function ofMi 1(λ ∗,−1). In practice, we can use either λ∗− or λ∗+ to approximate λ∗, and the value function can be approximated by the result of the RVI detailed in Section 5.1. Since the state space is infinite, we only calculate a finite number of Vi(xi), the number of which depends on the truncation parameter m of ASM. Meanwhile, the probabilities Pxi ,x′i (ai) in Ixi are modified according to (10). 7. Numerical Results In this section, we provide numerical results to showcase the performance of the developed scheduling policies. To eliminate the effect of N, we plot the expected average AoII. In particular, we provide the expected average AoII achieved by the Indexed priority policy and Whittle’s index policy when M = 1. The policies are calculated using the results detailed in Sections 4–6. When obtaining the Indexed priority policy, we set the tolerance in the Bisection search to ξ = 0.005. Meanwhile, we choose the truncation parameter in ASM m = 800 and the convergence criteria in RVI ε = 0.01. We notice that the calculation of Whittle’s index involves an infinite sum. In practice, we approximate the result by replacing +∞ with a large enough number kmax. Here, we choose kmax = 800. For both scheduling policies, the resulting expected average AoII is obtained via simulations. Each data point is the average of 15 runs with 15,000 time slots considered in each run. We also compare the developed policies with the optimal policy for RP, which can be calculated by following the discussion in Section 5.2. We adopt the same choices of parameters as we used to obtain the developed policies. The corresponding performance is calculated using Proposition 2. Like before, the infinite sum is approximated by replac- ing +∞ with kmax = 800. We also provide the expected average AoII achieved by the Greedy policy to show the performance advantages of the developed policies. When the Greedy policy is adopted, the base station always chooses the user with the largest AoII. The resulting expected average AoII is obtained via the same simulations as applied to the developed policies. Figures 3 and 4 illustrate the performance when the source processes have different dynamics and when each user’s communication goal is different, respectively. Figure 3a provides the performance when pi = 0.05 + 0.4(i−1) N−1 for 1 ≤ i ≤ N. For other parameters, the users make the same choices. More precisely, fi(s) = s, γi = 0.6, and p0 e,i = p1 e,i = 0.1 for 1 ≤ i ≤ N. Figure 4a provides the performance when fi(s) = s0.5+ i−1 N−1 for 1 ≤ i ≤ N. Same as before, the users make the same choices for other parameters. More precisely, pi = 0.3, γi = 0.6, and p0 e,i = p1 e,i = 0.1 for 1 ≤ i ≤ N. In Figures 3b and 4b, we force p0 e,i = 0 for all users to ensure the existence of Whittle’s index. Other choices remain the same as in Figures 3a and 4a. According to Corollary 1, the optimal policy will never choose the user with r̂ = p0 e = 0 unless it is to break the tie. Therefore, in Figures 3b and 4b, we also consider the Greedy+ policy where the base station always chooses the user Entropy 2021, 23, 1572 20 of 39 with the largest AoII among the users with r̂ = 1. The resulting expected average AoII is obtained via the same simulations as applied to the Greedy policy. Figure 5 shows the performance in systems where the parameters for each user are generated uniformly and randomly within their ranges. In Figure 5a, we consider N = 5, γ ∈ [0, 1], p ∈ [0.05, 0.45], pr̂ e ∈ [0, 0.45], and f (s) = sτ , where τ ∈ [0.5, 1.5]. There are a total of 300 different choices and the results are sorted by the performance of RP-optimal policy in ascending order. Figure 5b adopts the same system settings except that we impose p0 e,i = 0 for 1 ≤ i ≤ N to ensure the feasibility of Whittle’s index policy. Meanwhile, we ignore the Greedy policy since the Greedy+ policy achieves a better performance, as indicated by Figures 3b and 4b. 5 10 15 20 25 30 35 40 45 50 Number of users in the system, N 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 E xp ec te d A ve ra g e A o II Greedy policy Indexed priority policy Optimal policy for RP (a) When p0 e = 0.1. 5 10 15 20 25 30 35 40 45 50 Number of users in the system, N 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 E xp ec te d A ve ra g e A o II Greedy policy Greedy+ policy Whittle's index policy Indexed priority policy Optimal policy for RP (b) When p0 e = 0. Figure 3. Performance when the source processes vary. We choose pi = 0.05 + 0.4(i−1) N−1 , fi(s) = s, γi = 0.6, p0 e,i = p0 e , and p1 e,i = 0.1 for 1 ≤ i ≤ N. 5 10 15 20 25 30 35 40 45 50 Number of users in the system, N 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 E xp ec te d A ve ra g e A o II Greedy policy Indexed priority policy Optimal policy for RP (a) When p0 e = 0.1. 5 10 15 20 25 30 35 40 45 50 Number of users in the system, N 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 E xp ec te d A ve ra g e A o II Greedy policy Greedy+ policy Whittle's index policy Indexed priority policy Optimal policy for RP (b) When p0 e = 0. Figure 4. Performance when the communication goals vary. We choose fi(s) = s0.5+ i−1 N−1 , pi = 0.3, γi = 0.6, p0 e,i = p0 e , and p1 e,i = 0.1 for 1 ≤ i ≤ N. Entropy 2021, 23, 1572 21 of 39 0 50 100 150 200 250 300 System Index 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 E xp ec te d A ve ra g e A o II Greedy policy Indexed priority policy Optimal policy for RP (a) When I = [0, 0.45]. 0 50 100 150 200 250 300 System Index 0 0.5 1 1.5 2 2.5 3 3.5 4 E xp ec te d A ve ra g e A o II Greedy+ policy Whittle's index policy Indexed priority policy Optimal policy for RP (b) When I = {0}. Figure 5. Performance in systems with random parameters when N = 5. The parameters for each user are chosen randomly within the following intervals: γ ∈ [0, 1], p ∈ [0.05, 0.45], p0 e ∈ I, p1 e ∈ [0, 0.45], and f (s) = sτ where τ ∈ [0.5, 1.5]. We can make the following observations from the figures. • The Greedy+ policy yields a smaller expected average AoII than that achieved by the Greedy policy. Recall that we obtained the Greedy+ policy by applying the structural properties detailed in Corollary 1. Therefore, simple applications of the structural properties of the optimal policy can improve the performance of scheduling policies. • The Indexed priority policy has comparable performance to Whittle’s index policy in all the system settings considered. The two policies have their own advantages. The Indexed priority policy has a broader scope of application, while Whittle’s index policy has a lower computational complexity. • The performance of the Indexed priority policy and Whittle’s index policy is better than that of the Greedy/Greedy+ policies and is not far from the performance of the RP-optimal policy. Recall that the performance of the RP-optimal policy forms a universal lower bound on the performance of all admissible policies for PP. Hence, we can conclude that both the Indexed priority policy and Whittle’s index policy achieve good performances. 8. Conclusions In this paper, we studied the problem of minimizing the Age of Incorrect Information in a slotted-time system where a base station needs to schedule M users among N available users. Meanwhile, the base station has access to imperfect channel state information in each time slot. The problem is a restless multi-armed bandit problem which is SPACE- hard. However, by casting the problem into a Markov decision process, we obtain the structural properties of the optimal policy. Then, we introduce a relaxed version of the original problem and investigate the decoupled model. Under a simple condition, we establish the indexability of the decoupled problem and obtain the expression of Whittle’s index. On this basis, we developed Whittle’s index policy. To get rid of the requirement for indexability, we developed the Indexed priority policy based on the optimal policy for the relaxed problem. The characteristics of the relaxed problem are explored to make the calculation of its optimal policy more efficient. Finally, through numerical results, we show that simple applications of the structural properties can improve the performance of scheduling policies. Moreover, Whittle’s index policy and the Indexed priority policy achieve good and comparable performances. Author Contributions: Formal analysis, Y.C.; Investigation, Y.C.; Methodology, Y.C.; Supervision, A.E.; Validation, Y.C.; Writing—original draft, Y.C.; Writing—review & editing, Y.C. and A.E. All authors have read and agreed to the published version of the manuscript. Entropy 2021, 23, 1572 22 of 39 Funding: This research received no external funding. Institutional Review Board Statement: Not applicable. Informed Consent Statement: Not applicable. Data Availability Statement: Not applicable. Conflicts of Interest: The authors declare no conflict of interest. Appendix A. Proof of Lemma 1 We consider two states, x1 and x2, that differ only in the value of sj. Without the loss of generality, we assume s1,j < s2,j. Then, it is sufficient to show that, for any 1 ≤ j ≤ N, V(x1) ≤ V(x2). Leveraging the iterative nature of VIA, we use mathematical induction to prove the monotonicity. First of all, the base case (i.e., ν = 0) is true by initialization. We assume the lemma holds at iteration ν. Then, we want to examine whether it holds at iteration ν + 1. The update step reported in problem (5) can be rewritten as follows. Vν+1(x) = min a∈AN(1) Va ν+1(x), (A1) where Va ν+1(x) = C(x)− θ + ∑ x′−{x′j}  ( ∏ i 6=j Pxi ,x′i (ai) ) ∑ r̂′j P(r̂′j)U j ν(x, x′) , U j ν(x, x′) = ∑ s′j Psj ,s′j (aj, r̂j)Vν(x′). To prove the desired results, we distinguish between the following cases. • We first consider the case of s1,j = 0 < s2,j and r̂1,j = r̂2,j = 0. When aj = 1 and for any x′ − {s′j}, we have U j ν(x1, x′) = pjVν(x′; s′j = 1) + (1− pj)Vν(x′; s′j = 0), U j ν(x2, x′) = β jVν(x′; s′j = s2,j + 1) + (1− β j)Vν(x′; s′j = 0), where Vν(x′; s′j = 0) is the estimated value function of the state x′ with s′j = 0 at iteration ν (at the risk of abusing the notation, we use V(x; sj = s1) and V(x; sj = s2) to represent the value functions of two states that differ only in the value of sj). Then, we get U j ν(x1, x′)−U j ν(x2, x′) ≤ (pj − β j) ( Vν(x′; s′j = 1)−Vν(x′; s′j = 0) ) ≤ 0. The inequalities hold since β j > pj and Lemma 1 are true at iteration ν by assumption. Therefore, we have U j ν(x1, x′) ≤ U j ν(x2, x′) when aj = 1 for any x′ − {s′j}. For the case of ai = 1 where i 6= j, we notice that aj = 0. Then, for any x′ − {s′j}, we obtain U j ν(x1, x′) = pjVν(x′; s′j = 1) + (1− pj)Vν(x′; s′j = 0), U j ν(x2, x′) = (1− pj)Vν(x′; s′j = s2,j + 1) + pjVν(x′; s′j = 0). Therefore, when ai = 1, we have U j ν(x1, x′)−U j ν(x2, x′) ≤ (2pj − 1) ( Vν(x′; s′j = 1)−Vν(x′; s′j = 0) ) ≤ 0. The inequalities hold since 2pj − 1 < 0 and Lemma 1 is true at iteration ν by as- sumption. Combining with the case of aj = 1, U j ν(x1, x′) ≤ U j ν(x2, x′) holds for any Entropy 2021, 23, 1572 23 of 39 x′ − {s′j} under any feasible action. Since x1 and x2 differ only in the value of sj and C(x) is non-decreasing in si for 1 ≤ i ≤ N, we can see that Va ν+1(x1) ≤ Va ν+1(x2) for any feasible a. Then, by (A1), we can conclude that the lemma holds at iteration ν + 1 when s1,j = 0 < s2,j and r̂1,j = r̂2,j = 0. • When s1,j = 0 < s2,j and r̂1,j = r̂2,j = 1, by replacing the β j’s in the above case with αj’s, we can achieve the same result. • When 0 < s1,j < s2,j and r̂1,j = r̂2,j, we notice that Ps1,j ,s1,j+1(aj, r̂1,j) = Ps2,j ,s2,j+1(aj, r̂2,j), Ps1,j ,0(aj, r̂1,j) = Ps2,j ,0(aj, r̂2,j). Then, leveraging the monotonicity of Vν(x) and C(x), we can conclude with the same result. Combining the three cases, we prove that the lemma also holds at iteration ν + 1 of VIA. Therefore, the lemma holds at any iteration ν by mathematical induction. Since the results hold for any 1 ≤ j ≤ N and VIA is guaranteed to converge to the value function when ν→ +∞, we can conclude our proof. Appendix B. Proof of Lemma 2 We inherit the notations in the proof of Lemma 1. We still use mathematical induction to obtain the desired results. The base case ν = 0 is true by initialization. We assume the lemma holds at iterative ν and examine whether it still holds at iteration ν + 1. In the case of M = 1, we rewrite (5) as Vν+1(x) = min 1≤j≤N V j ν+1(x), (A2) where V j ν+1(x) = C(x)− θ + ∑ x′ {( ∏ i 6=j Pi xi ,x′i (0) ) Pj xj ,x′j (1)Vν(x′) } , (A3) and Pi x,x′(ai) is the probability that action ai will lead to state x′ when user i is at state x. To get the desired results, we distinguish between the following cases • We first show that V j ν+1(x) = Vk ν+1(P(x)). According to (A3), we have V j ν+1(x) = C(x)− θ + ∑ x′ {( ∏ i 6=j,k Pi xi ,x′i (0) ) Pk xk ,x′k (0)Pj xj ,x′j (1)Vν(x′) } . Vk ν+1(P(x)) = C(P(x))− θ+ ∑ P(x)′ ( ∏ i 6=j,k Pi P(x)i ,P(x)′i (0) ) Pk P(x)k ,P(x)′k (1)Pj P(x)j ,P(x)′j (0)Vν(P(x)′). It is obvious that for any P(x)′, there always exists P(x′′) = P(x)′. Then, we obtain Vk ν+1(P(x)) = C(P(x))− θ+ ∑ P(x′′) ( ∏ i 6=j,k Pi xi ,x′′i (0) ) Pk xj ,P(x′′)k (1)Pj xk ,P(x′′)j (0)Vν(P(x′′)) = C(P(x))− θ + ∑ x′′ ( ∏ i 6=j,k Pi xi ,x′′i (0) ) Pk xj ,x′′j (1)Pj xk ,x′′k (0)Vν(x′′) = C(P(x))− θ + ∑ x′ ( ∏ i 6=j,k Pi xi ,x′i (0) ) Pk xj ,x′j (1)Pj xk ,x′k (0)Vν(x′). Entropy 2021, 23, 1572 24 of 39 The second equality follows from the definition of P(·), the property of sum- mation, and the assumption at iteration ν. The last equality follows from the variable renaming. Then, by the definition of statistically identical, we have Pk xj ,x′j (1) = Pj xj ,x′j (1), Pj xk ,x′k (0) = Pk xk ,x′k (0), and C(x) = C(P(x)). Therefore, we can conclude that V j ν+1(x) = Vk ν+1(P(x)). • Along the same lines, we can easily show that Vk ν+1(x) = V j ν+1(P(x)) and Vi ν+1(x) = Vi ν+1(P(x)) for i 6= j, k. Combining the above cases with (A2), we prove that Vν+1(x) = Vν+1(P(x)). Then, by induction, we have Vν(x) = Vν(P(x)) at any iteration ν. Since VIA is guaranteed to converge to the value function when ν→ +∞, we can conclude our proof. Appendix C. Proof of Theorem 1 For arbitrary j and k δj,k(x) = ∑ x′−{x′j ,x′k}  ( ∏ i 6=j,k Pxi ,x′i (0) ) ∑ r̂′j ,r̂ ′ k P(r̂′j)P(r̂′k)Rj,k(x, x′) , (A4) where Rj,k(x, x′) = ∑ s′j ,s ′ k [( Psk ,s′k (0, r̂k)Psj ,s′j (1, r̂j)− Psk ,s′k (1, r̂k)Psj ,s′j (0, r̂j) ) V(x′) ] . (A5) With this in mind, we will prove the properties one by one. Property 1—δj,k(x) ≤ 0 if r̂k = p0 e,k = 0. The equality holds when sj = 0 or r̂j = p0 e,j = 0. When r̂k = p0 e,k = 0, transmitting the update from user k will necessarily fail. Therefore, Psk ,s′k (0, 0) = Psk ,s′k (1, 0) for any sk and s′k. Then, we have Rj,k(x, x′) = ∑ s′k Psk ,s′k (0, 0)∑ s′j [( Psj ,s′j (1, r̂j)− Psj ,s′j (0, r̂j) ) V(x′) ] . To identify the sign of Rj,k(x, x′), we distinguish between the following cases • When sj = 0, we can easily show that Rj,k(x, x′) = 0 for any x′ − {s′j, s′k} by noticing that the two possible actions with respect to user j (i.e., aj = 1 and aj = 0) are equivalent when sj = 0. Since δj,k(x) is a linear combination of Rj,k(x, x′)’s with non-negative coefficients, we can conclude that δj,k(x) = 0 in this case. • When sj > 0 and r̂j = 1, for any x′ − {s′j, s′k}, we have Rj,k(x, x′) = ∑ s′k Psk ,s′k (0, 0)(αj + pj − 1) ( V(x′; s′j = sj + 1)−V(x′; s′j = 0) ) ≤ 0. (A6) The inequality holds because of Lemma 1 and the fact that αj + pj < 1. We recall that δj,k(x) is a linear combination of Rj,k(x, x′)’s with non-negative coefficients. Then, we can conclude that δj,k(x) ≤ 0 in this case. • When sj > 0 and r̂j = 0, by replacing the αj in (A6) with β j, we can get the same result. In this case, the equality holds when β j + pj = 1, or, equivalently, p0 e,j = 0. Combining the cases, we prove the first property. Entropy 2021, 23, 1572 25 of 39 Property 2—δj,k(x) is non-increasing in r̂j and is non-decreasing in r̂k when sj, sk > 0. At the same time, δj,k(x) is independent of r̂i for any i 6= j, k. We first prove the monotonicity of δj,k(x) with respect to r̂j. To this end, we define x1 and x2 as two states that differ only in the value of r̂j. Without a loss of generality, we assume r̂1,j = 1 and r̂2,j = 0. Then, we investigate the sign of δj,k(x1)− δj,k(x2). We define xi , x1,i = x2,i for i 6= j. Then, according to (A4), δj,k(x1)− δj,k(x2) can be written as δj,k(x1)−δj,k(x2) = ∑ x′−{x′j ,x′k}  ( ∏ i 6=j,k Pxi ,x′i (0) ) ∑ r̂′j ,r̂ ′ k P(r̂′j)P(r̂′k) ( Rj,k(x1, x′)− Rj,k(x2, x′) ). Since x1,k = x2,k, we have Ps1,k ,s′k (a, r̂1,k) = Ps2,k ,s′k (a, r̂2,k) for any s′k. We recall that the transition probability is independent of r̂ when a = 0. Combining with the fact that s1,j = s2,j, we also have Ps1,j ,s′j (0, r̂1,j) = Ps2,j ,s′j (0, r̂2,j) for any s′j. Combining together, we obtain Ps1,k ,s′k (1, r̂1,k)Ps1,j ,s′j (0, r̂1,j) = Ps2,k ,s′k (1, r̂2,k)Ps2,j ,s′j (0, r̂2,j), Ps1,k ,s′k (0, r̂1,k) = Ps2,k ,s′k (0, r̂2,k). Leveraging the above two problems, we have Rj,k(x1, x′)− Rj,k(x2, x′) = ∑ s′j ,s ′ k [ Psk ,s′k (0, r̂k) ( Ps1,j ,s′j (1, r̂1,j)− Ps2,j ,s′j (1, r̂2,j) ) V(x′) ] . Consequently, we obtain δj,k(x1)−δj,k(x2) = ∑ x′−{x′j} ∏ i 6=j Pxi ,x′i (0) ∑ r̂′j P(r̂′j)∑ s′j ( Ps1,j ,s′j (1, 1)− Ps2,j ,s′j (1, 0) ) V(x′) . In the following, we characterize the sign of R1 , ∑ s′j ( Ps1,j ,s′j (1, 1)− Ps2,j ,s′j (1, 0) ) V(x′). As s1,j = s2,j > 0, for any x′ − {s′j}, we have R1 = ( (1− αj)− (1− β j) ) V(x′; s′j = 0) + (αj − β j)V(x′; s′j = s1,j + 1) ≤ 0. The inequality follows from Lemma 1 and the fact that β j > αj. Since δj,k(x1)− δj,k(x2) is a linear combination of R1’s with non-negative coefficients, we can conclude that δj,k(x1) ≤ δj,k(x2). Since r̂1,j > r̂2,j, we can see that δj,k(x) is non-increasing in r̂j. In a very similar way, we can show that δj,k(x) is non-decreasing in r̂k. We recall that r̂i will not affect the system dynamic if ai = 0. Consequently, we can conclude that δj,k(x) is independent of r̂i for any i 6= j, k. Combining together, we prove the second property. Property 3—δj,k(x) ≤ 0 if sk = 0. The equality holds when sj = 0 or r̂j = p0 e,j = 0. Since the probabilities are non-negative, it is sufficient to show that Rj,k(x, x′) satisfies Property 3 for any x′ − {s′j, s′k}. More precisely, it is sufficient to show that Rj,k(x, x′) ≤ 0 Entropy 2021, 23, 1572 26 of 39 for any x′ − {s′j, s′k} when sk = 0 and the equality holds when sj = 0 or r̂j = p0 e,j = 0. We recall that Psk ,s′k (1, r̂k) = Psk ,s′k (0, r̂k) for any s′k when sk = 0. Hence, for any x′ − {s′j, s′k}, we have Rj,k(x, x′) = ∑ s′k [ Psk ,s′k (0, r̂k)∑ s′j ( Psj ,s′j (1, r̂j)− Psj ,s′j (0, r̂j) ) V(x′) ] . Then, we investigate the following quantity for any x′ − {s′j} R2 , ∑ s′j ( Psj ,s′j (1, r̂j)− Pxj ,x′j (0, r̂j) ) V(x′). To this end, we distinguish between the following cases • When sj = 0, we have Psj ,s′j (1, r̂j) = Psj ,s′j (0, r̂j) for any s′j. Thus, we conclude that R2 = 0 for any x′ − {s′j}. Consequently, Rj,k(x, x′) = 0 for any x′ − {s′j, s′k}. • When sj > 0 and r̂j = 1, for any x′ − {s′j}, we have R2 = (αj − 1 + pj)V(x′; s′j = sj + 1) + (1− αj − pj)V(x′; s′j = 0) ≤ 0 (A7) The inequality follows from Lemma 1 and the fact that αj + pj < 1. Thus, Rj,k(x, x′) ≤ 0 for any x′ − {s′j, s′k}. • When sj > 0 and r̂j = 0, by replacing the αj in (A7) with β j, we can get the same result. In this case, the equality holds when β j + pj = 1, or, equivalently, p0 e,j = 0. Combined together, we can conclude that Property 3 is true. Property 4—δj,k(x) is non-increasing in sj if Γ r̂j j ≤ Γr̂k k and is non-decreasing in sk if Γ r̂j j ≥ Γr̂k k when sj, sk > 0. We define Γ1 i , αi 1−pi and Γ0 i , βi 1−pi for 1 ≤ i ≤ N. Such as we did in the proof of Property 3, it is sufficient to show that Rj,k(x, x′) satisfies Property 4 for any x′ − {s′j, s′k}. We recall that Rj,k(x, x′) depends on the values of r̂j and r̂k. Therefore, we distinguish between the following cases • In the case of r̂j = r̂k = 1 and sj, sk > 0, for any x′ − {s′j, s′k}, (A5) can be written as Rj,k(x, x′) = ∑ s′j ,s ′ k [( Psk ,s′k (0, 1)Psj ,s′j (1, 1)− Psk ,s′k (1, 1)Psj ,s′j (0, 1) ) V(x′) ] = ( pkαj − (1− pj)(1− αk) ) V(x′; s′j = sj + 1; s′k = 0) + ( (1− pk)(1− αj)− pjαk ) V(x′; s′j = 0; s′k = sk + 1) + ( (1− pk)αj − (1− pj)αk ) V(x′; s′j = sj + 1; s′k = sk + 1) + ( pk(1− αj)− pj(1− αk) ) V(x′; s′j = 0; s′k = 0). As we can verify pkαj − (1− pj)(1− αk) < 1 2 (pk + pj − 1) < 0, (1− pk)(1− αj)− pjαk > 1 2 (1− pk − pj) > 0. We define Γ1 i , αi 1−pi and Γ0 i , βi 1−pi for 1 ≤ i ≤ N. Then, we have Γ1 j Q Γ1 k =⇒ (1− pk)αj − (1− pj)αk Q 0. Entropy 2021, 23, 1572 27 of 39 Combining with Lemma 1, we can conclude that, for any x′ − {s′j, s′k}, Rj,k(x, x′) is non-increasing in sj if Γ1 j ≤ Γ1 k and is non-decreasing in sk if Γ1 j ≥ Γ1 k . • In the case of r̂j = r̂k = 0 and sj, sk > 0, by replacing the α’s in the above case with β’s, we can conclude with the same result. • In the case of r̂j = 1, r̂k = 0, and sj, sk > 0, for any x′ − {s′j, s′k}, (A5) can be written as Rj,k(x, x′) = ∑ s′j ,s ′ k [( Psk ,s′k (0, 0)Psj ,s′j (1, 1)− Psk ,s′k (1, 0)Psj ,s′j (0, 1) ) V(x′) ] = ( pkαj − (1− pj)(1− βk) ) V(x′; s′j = sj + 1; s′k = 0) + ( (1− pk)(1− αj)− pjβk ) V(x′; s′j = 0; s′k = sk + 1) + ( (1− pk)αj − (1− pj)βk ) V(x′; s′j = sj + 1; s′k = sk + 1) + ( pk(1− αj)− pj(1− βk) ) V(x′; s′j = 0; s′k = 0). As we can verify pkαj − (1− pj)(1− βk) < pk ( pj − 1 2 ) < 0, (1− pk)(1− αj)− pjβk > (1− pk) ( 1 2 − pj ) > 0. At the same time Γ1 j Q Γ0 k =⇒ (1− pk)αj − (1− pj)βk Q 0. Combined with Lemma 1, we can conclude that, for any x′ − {s′j, s′k}, Rj,k(x, x′) is non-increasing in sj if Γ1 j ≤ Γ0 k and is non-decreasing in sk if Γ1 j ≥ Γ0 k . • In the case of r̂j = 0, r̂k = 1, and sj, sk > 0, by swapping the α’s and β’s in the above case, we can conclude with the same result. Combined together, we conclude that Rj,k(x, x′) satisfies Property 3 for any x′ − {s′j, s′k}. Consequently, δj,k(x) is non-increasing in sj if Γ r̂j j ≤ Γr̂k k and is non-decreasing in sk if Γ r̂j j ≥ Γr̂k k when sj, sk > 0. Property 5—δj,k(x) ≤ 0 if sj ≥ sk, r̂j ≥ r̂k, and users j and k are statistically identical. According to Property 3, it is sufficient to consider the case where sj, sk > 0. We notice that the sign of δj,k(x) can be captured by the sign of the quantity Qj,k(x, x′) , ∑r̂′j ,r̂ ′ k P(r̂′j)P(r̂′k)Rj,k(x, x′). Thus, we divide our discussion into the following cases. • We first consider the case of sj ≥ sk > 0 and r̂j = r̂k = 0. Leveraging the definition of statistically identical, for any x′ − {x′j, x′k}, we have Qj,k(x, x′) = ∑ r̂′j ,r̂ ′ k P(r̂′j)P(r̂′k)κ1 ( V(x′; x′j = (0, r̂′j); x′k = (sk + 1, r̂′k))− V(x′; x′j = (sj + 1, r̂′j); x′k = (0, r̂′k)) ) , where κ1 = 1 − pj − β j ≥ 0. Then, by substituting the values of P(r̂) and using Lemma 2, we obtain Entropy 2021, 23, 1572 28 of 39 Qj,k(x, x′) =γjγkκ1V(x′; x′j = (sk + 1, 1); x′k = (0, 1))− γjγkκ1V(x′; x′j = (sj + 1, 1); x′k = (0, 1))+ (1− γj)(1− γk)κ1V(x′; x′j = (sk + 1, 0); x′k = (0, 0))− (1− γj)(1− γk)κ1V(x′; x′j = (sj + 1, 0); x′k = (0, 0))+ γk(1− γj)κ1V(x′; x′j = (sk + 1, 1); x′k = (0, 0))− γk(1− γj)κ1V(x′; x′j = (sj + 1, 0); x′k = (0, 1))+ γj(1− γk)κ1V(x′; x′j = (sk + 1, 0); x′k = (0, 1))− γj(1− γk)κ1V(x′; x′j = (sj + 1, 1); x′k = (0, 0)). Since users j and k are statistically identical, we have γj = γk. Then, by Lemma 1, we have Qj,k(x, x′) ≤ 0 for any x′ − {x′j, x′k}. Since δj,k(x) is a linear combination of Qj,k(x, x′)’s with non-negative coefficients, we can conclude that δj,k(x) ≤ 0. • For the case of sj ≥ sk > 0 and r̂j = r̂k = 1, by replacing β j in κ1 with αj, we can conclude with the same result. • Then, we consider the case of sj ≥ sk > 0, r̂j = 1, and r̂k = 0. We first notice that, for any x′ − {s′j, s′k} Rj,k(x, x′) = ( pkαj − (1− pj)(1− βk) ) V(x′; s′j = sj + 1; s′k = 0)+( (1− pk)(1− αj)− pjβk ) V(x′; s′j = 0; s′k = sk + 1)+( (1− pk)αj − (1− pj)βk ) V(x′; s′j = sj + 1; s′k = sk + 1)+( pk(1− αj)− pj(1− βk) ) V(x′; s′j = 0; s′k = 0). As users j and k are statistically identical, we have pj = pk and αj < βk. Leveraging Lemma 1, we have Rj,k(x, x′) ≤ (αj + pj − 1) ( V(x′; s′j = sj + 1; s′k = 0)− V(x′; s′j = 0; s′k = sk + 1) ) . Then, for any x′ − {x′j, x′k} Qj,k(x, x′) ≤ ∑ r̂′j ,r̂ ′ k P(r̂′j)P(r̂′k)κ2 ( V(x′; x′j = (0, r̂′j); x′k = (sk + 1, r̂′k))− V(x′; x′j = (sj + 1, r̂′j); x′k = (0, r̂′k)) ) , where κ2 = 1− pj − αj > 0. Such as we did in the previous cases, we can leverage Lemmas 1 and 2 to conclude that Qj,k(x, x′) ≤ 0 for any x′ − {x′j, x′k}. Consequently, δj,k(x) ≤ 0 in this case. The details are omitted for the sake of space. Combined together, we conclude the proof of Property 5. Appendix D. Proof of Corollary 2 We follow the same steps as in the proof of Lemma 1. To prove the corollary, it is sufficient to show that V(x1) ≤ V(x2) when s1 < s2 and r̂1 = r̂2. We use mathematical induction to prove the monotonicity. First of all, the base case (i.e., ν = 0) is true by initialization. We assume the lemma holds at iteration ν. Then, we want to examine Entropy 2021, 23, 1572 29 of 39 whether it holds at iteration ν + 1. For the system with a single user, the update step reported in problem (5) can be simplified and rewritten as follows Vν+1(x) = min a∈{0,1} Va ν+1(x), (A8) where Va ν+1(x) = C(x, a)− θ + ∑ r̂′ P(r̂′)∑ s′ Ps,s′(a, r̂)Vν(x′), and θ is the optimal value forM1(λ,−1). To prove the desired results, we distinguish between the following cases • We first consider the case of s1 = 0 < s2 and r̂1 = r̂2 = 0. When a = 1, we have V1 ν+1(x1) = C(x1, 1)− θ + ∑ r̂′ P(r̂′) ( pVν(1, r̂′) + (1− p)Vν(0, r̂′) ) , V1 ν+1(x2) = C(x2, 1)− θ + ∑ r̂′ P(r̂′) ( βVν(s2 + 1, r̂′) + (1− β)Vν(0, r̂′) ) . Subtracting the two expressions yields V1 ν+1(x1)−V1 ν+1(x2) ≤ C(x1, 1)− C(x2, 1) + ∑ r̂′ P(r̂′) [ (p− β) ( Vν(1, r̂′)−Vν(0, r̂′) )] ≤ 0. The inequalities hold since β > p, C(x, a) is non-decreasing in s, and Corollary 2 is true at iteration ν by assumption. For the case of a = 0, we obtain V0 ν+1(x1) = C(x1, 0)− θ + ∑ r̂′ P(r̂′) ( pVν(1, r̂′) + (1− p)Vν(0, r̂′) ) , V0 ν+1(x2) = C(x2, 0)− θ + ∑ r̂′ P(r̂′) ( (1− p)Vν(s2 + 1, r̂′) + pVν(0, r̂′) ) . Therefore, when a = 0, we have V0 ν+1(x1)−V0 ν+1(x2) ≤ C(x1, 0)− C(x2, 0) + ∑ r̂′ P(r̂′) [ (2p− 1) ( Vν(1, r̂′)−Vν(0, r̂′) )] ≤ 0. The inequalities hold since 2p − 1 < 0, C(x, a) is non-decreasing in s, and Corol- lary 2 is true at iteration ν by assumption. Combined together, we can see that Va ν+1(x1) ≤ Va ν+1(x2) for any feasible a. Then, by problem (A8), we can conclude that the lemma holds at iteration ν + 1 when s1 = 0 < s2 and r̂1 = r̂2 = 0. • When s1 = 0 < s2 and r̂1 = r̂2 = 1, by replacing the β’s in the above case with α’s, we can achieve the same result. • When 0 < s1 < s2 and r̂1 = r̂2, we notice that Ps1,s1+1(a, r̂1) = Ps2,s2+1(a, r̂2) and Ps1,0(a, r̂1) = Ps2,0(a, r̂2). Then, leveraging the monotonicity of Vν(x) and C(x, a), we can conclude with the same result. Combining the three cases, we prove that the lemma holds at iteration ν + 1 of VIA. Therefore, the lemma holds at any iteration ν by mathematical induction. Since VIA is guaranteed to converge to the value function when ν→ +∞, we can conclude our proof. Entropy 2021, 23, 1572 30 of 39 Appendix E. Proof of Proposition 1 We define ∆V(x) , V1(x)−V0(x) where Va(x) is the value function resulting from taking action a at state x. Then, Va(x) can be calculated as follows Va(x) = C(x, a)− θ + ∑ x′∈X Px,x′(a)V(x′), (A9) where θ is the optimal value forM1(λ,−1). Hence, the optimal action at state x can be fully characterized by the sign of ∆V(x). More precisely, the optimal action at state x is a = 1 if ∆V(x) < 0, and a = 0 is optimal otherwise. To determine the sign of ∆V(x) for each state, we distinguish between the following cases • We first consider the state x = (0, r̂). Applying the results in Section 2.3 to prob- lem (A9), we obtain V0(0, r̂) =− θ + (1− γ)(1− p)V(0, 0) + (1− γ)pV(1, 0)+ γ(1− p)V(0, 1) + γpV(1, 1), V1(0, r̂) = λ + V0(0, r̂). (A10) Therefore, ∆V(0, r̂) = λ ≥ 0. Thus, the optimal action at state (0, r̂) is a = 0. • Then, we consider the state x = (s, 0) where s > 0. Applying the results in Section 2.3 to Equation (A9), we obtain V0(s, 0) = f (s)− θ + (1− γ)pV(0, 0) + (1− γ)(1− p)V(s + 1, 0)+ γpV(0, 1) + γ(1− p)V(s + 1, 1), V1(s, 0) = f (s) + λ− θ + (1− γ)(1− β)V(0, 0) + (1− γ)βV(s + 1, 0)+ γ(1− β)V(0, 1) + γβV(s + 1, 1). Then, ∆V(s, 0) = λ + p0 e (1− 2p)ω, (A11) where ω = (1− γ)[V(0, 0)−V(s + 1, 0)] + γ[V(0, 1)−V(s + 1, 1)] ≤ 0. • Finally, we consider the state x = (s, 1) where s > 0. Following the same trajectory, we have ∆V(s, 1) = λ + (1− p1 e )(1− 2p)ω. According to Corollary 2 and the fact that p < 0.5, we can see that ∆V(s, 0) and ∆V(s, 1) are both a constant λ plus a term that is non-increasing in s. As the time penalty function is unbounded, the value function must also be unbounded. Then, combining the three cases, we can conclude the following. For fixed r̂, there always exists a threshold nr̂ > 0 such that the optimal action at state (s, r̂) where s ≥ nr̂ is a = 1, otherwise a = 0 is optimal. Since r̂ ∈ {0, 1}, the optimal policy can be fully captured by the pair (n0, n1). In the following, we determine the relationship between n0 and n1. We have ∆V(s, 1)− ∆V(s, 0) = (1− p1 e − p0 e )(1− 2p)ω ≤ 0. At the same time, for the threshold n0, we know ∆V(n0, 0) < 0. Then, we have ∆V(n0, 1) ≤ ∆V(n0, 0) < 0. Combined with the fact that ∆V(s, r̂) is non-increasing in s, we can conclude that the ordering n0 ≥ n1 is true. Appendix F. Proof of Proposition 2 We notice that the dynamic of AoII under threshold policy can be fully captured by a Discrete-Time Markov Chain (DTMC). Then, the expected AoII ∆̄n and the expected transmission rate ρ̄n under threshold policy n = (n0, n1) can be obtained from the stationary Entropy 2021, 23, 1572 31 of 39 distribution of the induced DTMC. Let the states of the induced DTMC be the values of s. We recall that r̂ is an independent Bernoulli random variable with parameter γ. Combined with the results in Section 2.3, we can easily obtain the state transition probabilities of the induced DTMC, which are shown in Figure A1. 0 1 2 n1 n1 + 1 n0 n0 + 11− p p 1− p 1− p c1c1 c2 p p 1− c1 1− c1 1− c2 1− c2 Figure A1. DTMC induced by the threshold policy n = (n0, n1). In the figure, c1 = (1− γ)(1− p) + γα and c2 = (1− γ) β + γα. The balance equations of the induced DTMC are the following (1− p)π0 + p n1−1 ∑ k=1 πk + (1− c1) n0−1 ∑ k=n1 πk + (1− c2) +∞ ∑ k=n0 πk = π0. pπ0 = π1. (1− p)πk−1 = πk f or 2 ≤ k ≤ n1. c1πk−1 = πk f or n1 + 1 ≤ k ≤ n0. c2πk−1 = πk f or n0 + 1 ≤ k. +∞ ∑ k=0 πk = 1. Then, we can easily solve the above system of linear equations. After some algebraic manipulation, we obtain the following π0 = 1 2 + p(1− p)n1−1 [ 1 1− c1 − 1 p + cn0−n1 1 ( 1 1− c2 − 1 1− c1 )] . πk = p(1− p)k−1π0 f or 1 ≤ k ≤ n1. πk = p(1− p)n1−1ck−n1 1 π0 f or n1 + 1 ≤ k ≤ n0. πk = p(1− p)n1−1cn0−n1 1 ck−n0 2 π0 f or n0 + 1 ≤ k. Equipped with the above results, we proceed with calculating ∆̄n and ρ̄n. According to problem (6a), the expected AoII is: ∆̄n = +∞ ∑ k=0 f (k)πk. Substituting the expressions of πk’s, we can get the expression of ∆̄n. Proposition 1 tells us the following. • For state (s, r̂) where s < n1, it is optimal to stay idle (i.e., a = 0). • For state (s, r̂) where n1 ≤ s < n0, it is optimal to make a transmission attempt only when r̂ = 1. We recall that r̂ is an independent Bernoulli random variable with parameter γ. Therefore, the expected proportion of time that the system is at state (s, 1) is γπs. Entropy 2021, 23, 1572 32 of 39 • For state (s, r̂) where s ≥ n0, it is optimal to make transmission attempt regardless of r̂. Combined with problem (6b), we have ρ̄n = γ n0−1 ∑ k=n1 πk + +∞ ∑ k=n0 πk. Substituting the expressions of πk’s, we can obtain the closed-form expression of ρ̄n. Appendix G. Proof of Proposition 4 We first tackle the Whittle’s indexes at state (0, r̂) and (s, 0) where s > 0. To this end, we distinguish between the following cases • We first consider the state x = (0, r̂). By definition, Whittle’s index is the infimum λ such that V0(x) = V1(x). According to (A10), we can conclude that Wx = 0 when x = (0, r̂). • Then, we consider the state x = (s, 0) where s > 0. We recall that p0 e = 0. Then, we can conclude, from (A11), that Wx = 0 for all x = (s, 0) where s > 0. Now, we tackle the Whittle’s index at state x = (s, 1) where s > 0. For convenience, we denote by Wn the Whittle’s index at state x = (n, 1). According to the monotonicity of ∆V(x) shown in the proof of Proposition 1, we can conclude that threshold policy n = (+∞, n + 1) is optimal when V0(n, 1) = V1(n, 1). Then, we can prove the following Lemma A1. When (9) is satisfied and V0(n, 1) = V1(n, 1), V(s, 1) = V(s, 0) , V(s) for 0 ≤ s ≤ n. Proof. Since the value function satisfies the Bellman equation, it is sufficient to show that V(s, 1) and V(s, 0) satisfy the same Bellman equation. We recall that the Bellman equation for V(x) is given by V(x) = min a∈{0,1} Va(x), where Va(x) = C(x, a)− θ + ∑ x′ Px,x′(a)V(x′), (A12) and θ is the optimal value of the decoupled problem. We recall, from Corollary 3, that the optimal action at state (s, 0) is staying idle (i.e., a = 0) for any s. We also know that threshold policy n = (+∞, n+ 1) is optimal when V0(n, 1) = V1(n, 1). Therefore, the optimal actions at states (s, 0) and (s, 1) where s ≤ n are the same (i.e., a = 0). Equivalently, we have V(s, r̂) = V0(s, r̂), f or s ≤ n. (A13) According to the system dynamic reported in Section 2.3, we know that the state transition probabilities are independent of r̂ when a = 0. Meanwhile, r̂ does not affect the instant cost. Let x1 = (s, 1) and x2 = (s, 0). Then, for any x′, we have Px1,x′(0) = Px2,x′(0). C(x1, 0) = C(x2, 0). Hence, according to (A12), we can see that V0(s, 0) = V0(s, 1) for any s ≤ n. Combined with problem (A13), we can conclude that V(s, 0) = V(s, 1) for any 0 ≤ s ≤ n. By definition, Whittle’s index Wn is the infimum λ such that V0(n, 1) = V1(n, 1). In this case, according to Lemma A1, V(0, 1) = V(0, 0) = V(0). Then, V0(n, 1) and V1(n, 1) can be written as V0(n, 1) = f (n)− θ + pV(0) + (1− p)[(1− γ)V(n + 1, 0) + γV(n + 1, 1)]. (A14) Entropy 2021, 23, 1572 33 of 39 V1(n, 1) = f (n) + Wn − θ + (1− α)V(0) + α[(1− γ)V(n + 1, 0) + γV(n + 1, 1)]. Without a loss of generality, we assume V(0) = 0. Then, equating the two expressions yields Wn = (1− p− α)(γV(n + 1, 1) + (1− γ)V(n + 1, 0)). (A15) Combining problems (A14) and (A15), we conclude that Wn is Wn = (1− p− α)(V0(n, 1) + θ − f (n)) 1− p . Since the optimal action at state (n, 1) is a = 0, we have V0(n, 1) = V(n, 1) = V(n). Finally, we obtain Wn = (1− p− α)(V(n) + θ − f (n)) 1− p . (A16) Now, we tackle the expression of V(n). When V0(n, 1) = V1(n, 1), the optimal action at state (s, r̂) where 0 ≤ s < n is staying idle. Then, leveraging Lemma A1, value function V(s) where 0 ≤ s < n satisfies the following V(s) = { −θ + f (0) + pV(1) when s = 0, −θ + f (s) + (1− p)V(s + 1) when 0 < s < n. (A17) By backward induction, we end up with the following equation for 0 < s < n. V(s) = −θ(1− (1− p)n−s) p + n−s ∑ k=1 f (n− k)(1− p)n−s−k + (1− p)n−sV(n). Letting s = 1 yields V(1) = −θ(1− (1− p)n−1) p + n−1 ∑ k=1 f (n− k)(1− p)n−1−k + (1− p)n−1V(n). From problem (A17), V(1) also satisfies the following V(1) = θ − f (0) p . Equating the two expressions of V(1), we obtain V(n) = − f (0) p(1− p)n−1 + θ ( 2 p(1− p)n−1 − 1 p ) − n−1 ∑ k=1 f (n− k)(1− p)−k. (A18) We recall that, when V0(n, 1) = V1(n, 1), threshold policy n = (+∞, n + 1) is optimal and both actions at state x = (n, 1) are equally desirable. Thus, threshold policy n = (+∞, n) is also optimal. Then, we know θ = ∆̄n + Wnρ̄n, (A19) where ∆̄n and ρ̄n are the expected AoII and the expected transmission rate under threshold policy n = (+∞, n), respectively. Finally, combining problems (A16), (A18) and (A19), we obtain Wn = − f (0) p(1− p)n + ∆̄n 2− (1− p)n p(1− p)n − (1− p)−n ( n ∑ k=1 f (k)(1− p)k−1 ) 1 1− p− α − ρ̄n 2− (1− p)n p(1− p)n . After some algebraic manipulation, we have Entropy 2021, 23, 1572 34 of 39 Wn = (1− c1) +∞ ∑ k=n+1 f (k)ck−n−1 1 − ∆̄n (1− c1)(1− p)− γ(1− p− α) c1(1− p− α) + ρ̄n , where c1 = (1− γ)(1− p) + γα. In the following, we investigate some properties of Whittle’s index. First of all, Wn is non-negative since 1− p− α and V(n + 1, r̂) in (A15) are all non-negative. Meanwhile, combining (A15) with the fact that V(n, r̂) is non-decreasing in n, we can verify that Wn is non-decreasing in n. Combined with the Whittle’s indexes in two other cases (i.e., x = (0, r̂) and x = (s, 0) where s > 0), we can easily obtain the properties of Wx as detailed in Proposition 4. Appendix H. Proof of Proposition 5 We notice thatM1(λ,−1) coincides with the decoupled model studied in Section 4.2. When problem (9) is satisfied, the decoupled problem is indexable, and, according to Corollary 3, we only need to show that n is the optimal threshold for the states with r̂ = 1. We first tackle the case of λ > 0. To this end, we divide our discussion into the following cases • For state (s, 1) where s < n, Ws ≤ λ by definition. As the problem is indexable, we have D(Ws) ⊆ D(λ). We recall that Ws , min{λ′ ≥ 0 : V0(s, 1) = V1(s, 1)}. Equivalently, Ws , min{λ′ ≥ 0 : (s, 1) ∈ D(λ′)}. Then, we know (s, 1) ∈ D(Ws). Combined together, we conclude that (s, 1) ∈ D(λ). In other words, the optimal action at state (s, 1) where s < n is to stay idle (i.e., a = 0). • For state (s, 1) where s ≥ n, we first recall that Ws = min{λ′ ≥ 0 : (s, 1) ∈ D(λ′)}. Consequently, for any λ′ < Ws, we know (s, 1) /∈ D(λ′). Meanwhile, we have Ws ≥Wn > λ by the monotonicity of Whittle’s index and the definition of n. Hence, we can conclude that (s, 1) /∈ D(λ). In other words, the optimal action at state (s, 1) where s ≥ n is to make the transmission attempt. Then, we conclude that n is the optimal threshold for the states with r̂ = 1 when λ > 0. In the case of λ = 0, according to the proof of Proposition 1, we can easily verify that the optimal threshold is 1. Appendix I. Proof of Theorem 2 We first make the following definitions. When M1(λ,−1) is at state x and action a is taken, cost C1(x, a) , f (s) and C2(x, a) , λa are incurred. We denote the expected C1-cost and the expected C2-cost under policy φ as C̄1(φ) and C̄2(φ), respectively. Let G be a non-empty set of states. For the given state i, we defineR∗(i, G) as the class of policies φ, for which the following hold • The probability Pφ(xn ∈ G f or some n ≥ 1 | x0 = i) = 1 where xn is the state of M1(λ,−1) at time n. • The expected time miG(φ) of a first passage from i to G under φ is finite. • The expected C1-cost C̄i,G 1 (φ) and the expected C2-cost C̄i,G 2 (φ) of a first passage form i to G under φ are finite. With the definitions in mind, we proceed with verifying the assumptions given in [27]. 1. For all d > 0, the set A(d) = {x | there exists an action a such that C1(x, a)+C2(x, a) ≤ d} is finite: For any state x, the cost satisfies C1(x, a) + C2(x, a) = f (s) + λa ≥ f (s). The equality holds when a = 0. Then, the states in A(d) must satisfy f (s) ≤ d. Combined with the fact that f (s) is a non-decreasing and unbounded function when s ∈ N0, we can conclude that A(d) is finite. 2. There exists a stationary policy e such that the induced Markov chain has the following properties: the state space S consists of a single (non-empty) positive recurrent class R and a Entropy 2021, 23, 1572 35 of 39 set U of transient states such that e ∈ R∗(i, R) for i ∈ U. Moreover, both C̄1(e) and C̄2(e) on R are finite: We consider the policy under which the base station makes a transmission attempt at every time slot. According to the system dynamic detailed in Section 2.3, we can see that all the states communicate with state (0, 0) and (0, 0) communicates with all other states. Thus, the state space S consists of a single (non-empty) positive recurrent class and the set of transient states can simply be an empty set. C̄1(e) and C̄2(e) are trivially finite as we can verify using Proposition 2. 3. Given any two state x 6= y, there exists a policy φ such that φ ∈ R∗(x, y): We notice that, under any policy, the maximum increase of s between two consecutive time slots is 1. Meanwhile, when s decreases, it decreases to zero. Combined with the fact that r̂ is an independent Bernoulli random variable, we can conclude that there always exists a path between any x and y with positive probability. mxy(φ), C̄x,y 1 (φ), and C̄x,y 2 (φ) are trivially finite. 4. If a stationary policy φ has at least one positive recurrent state, then it has a single positive recurrent class R. Moreover, if x = (0, 0) /∈ R, then φ ∈ R∗(x, R): Given that r̂ is an independent Bernoulli random variable, we can easily conclude from the system dynamic that all the states