gets reward $Y_{i,j}(n) = S_{i,j}(n)$, while the other users on channel $j$ get no reward. The regret of a policy $\Phi$ after $n$ time slots is

$$R_n^{\Phi} = n\,\theta^{*} - \mathbb{E}^{\Phi}\Big[\sum_{t=1}^{n} Y_{\Phi(t)}(t)\Big], \qquad (1)$$

where $\theta^{*} = \max_{k} \sum_{(i,j)\in A_k} \theta_{i,j}$, the expected reward of the optimal arm, is the expected sum-weight of the maximum weight matching of users to channels with $\theta_{i,j}$ as the weight.
Intuitively, we would like the regret $R_n^{\Phi}$ to be as small as possible. If it is sub-linear with respect to time $n$, the time-averaged regret will tend to zero.
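For a toy instance, the benchmark $\theta^{*}$ in (1) can be computed by enumerating all $P(N, M)$ matchings of users to channels. A minimal stdlib-only sketch; the $\theta$ values below are made up purely for illustration:

```python
from itertools import permutations

# Hypothetical 2-user, 3-channel instance: theta[i][j] is the mean
# reward of user i on channel j (unknown to the learner).
theta = [[0.9, 0.2, 0.5],
         [0.3, 0.8, 0.4]]
M, N = 2, 3

# Each arm is a matching: an assignment of the M users to M distinct
# channels, so there are P(N, M) = N!/(N-M)! arms (6 here).
arms = list(permutations(range(N), M))

# theta* = max_k sum_{(i,j) in A_k} theta_{i,j}: the sum-weight of the
# maximum weight matching, used as the benchmark in Eq. (1).
theta_star = max(sum(theta[i][j] for i, j in enumerate(a)) for a in arms)
print(theta_star)  # 0.9 + 0.8 = 1.7
```

For larger instances this enumeration is exponential; the point of Algorithm 3 later in the section is to avoid it.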
IV. A NAIVE APPROACH
To begin with, we show a straightforward, relatively naive approach to solving the combinatorial multi-armed bandit problem that we have defined. This approach essentially ignores the dependencies across the different arms, storing observed information about each arm independently, and making decisions based on this information alone.
In particular, we use the UCB1 policy given by Auer et al. [5]. In this policy, shown in Algorithm 1, two variables are stored and updated each time an arm is played: $\hat{Y}_k$ is the average of all the observation values of arm $k$ up to the current time slot (sample mean); $n_k$ is the number of times that arm $k$ has been played up to the current time slot. $\hat{Y}_k$ and $n_k$ are both initialized to 0 and updated as follows:
$$\hat{Y}_k(n) = \begin{cases} \dfrac{\hat{Y}_k(n-1)\,n_k + \sum_{(i,j)\in A_k} S_{i,j}(n)}{n_k(n-1)+1}, & \text{if arm } k \text{ is played} \\[4pt] \hat{Y}_k(n-1), & \text{else} \end{cases} \qquad (2)$$

$$n_k(n) = \begin{cases} n_k(n-1)+1, & \text{if arm } k \text{ is played} \\ n_k(n-1), & \text{else} \end{cases} \qquad (3)$$
Algorithm 1 Policy UCB1 from Auer et al. [5]
1: // INITIALIZATION
2: Play each arm once. Update $\hat{Y}_k$, $n_k$ accordingly;
3: // MAIN LOOP
4: while 1 do
5:   Play arm $k$ that maximizes $\hat{Y}_k + \sqrt{\dfrac{2\ln n}{n_k}}$;
6:   Update $\hat{Y}_k$, $n_k$ accordingly;
7: end while
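A minimal runnable sketch of Algorithm 1, assuming Bernoulli rewards with made-up arm means (not from the paper):

```python
import math
import random

random.seed(0)  # for reproducibility of this illustration

means = [0.2, 0.5, 0.8]  # hypothetical Bernoulli arm means
K = len(means)
Y = [0.0] * K   # sample mean of each arm's observations
n_k = [0] * K   # number of times each arm has been played

def play(k):
    """Pull arm k, then apply the updates (2)-(3) to its statistics."""
    reward = 1.0 if random.random() < means[k] else 0.0
    Y[k] = (Y[k] * n_k[k] + reward) / (n_k[k] + 1)
    n_k[k] += 1

# INITIALIZATION: play each arm once.
for k in range(K):
    play(k)

# MAIN LOOP: play the arm maximizing Y_k + sqrt(2 ln n / n_k).
for n in range(K + 1, 5000):
    k = max(range(K), key=lambda a: Y[a] + math.sqrt(2 * math.log(n) / n_k[a]))
    play(k)

# The best arm (index 2) should dominate the play counts.
print(n_k)
```

Note the index shrinks like $\sqrt{\ln n / n_k}$, so suboptimal arms are still sampled occasionally, which is what yields the logarithmic bound in Theorem 1.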
Theorem 1: The expected regret under the UCB1 policy, specified in Algorithm 1, is at most

$$\Big[\, 8 \sum_{k:\,\theta_k<\theta^{*}} \Big(\frac{\ln n}{\Delta_k}\Big) \Big] + \Big(1 + \frac{\pi^2}{3}\Big)\Big(\sum_{k:\,\theta_k<\theta^{*}} \Delta_k\Big), \qquad (4)$$

where $\Delta_k = \theta^{*} - \theta_k$ and $\theta_k = \sum_{(i,j)\in A_k} \theta_{i,j}$.
Proof: See [5, Theorem 1].
$\hat{\theta}_{i,j}(n)$: average (sample mean) of all the observed values of channel $j$ by user $i$ up to the current time slot. Note that $\mathbb{E}[\hat{\theta}_{i,j}(n)] = \theta_{i,j}$.
$\hat{\theta}_k(n)$: $\sum_{(i,j)\in A_k} \hat{\theta}_{i,j}(n)$; the summation of all the average observation values in arm $k$ at time $n$. Note that $\mathbb{E}[\hat{\theta}_k(n)] = \theta_k$.
$\Delta_k$: $\theta^{*} - \theta_k$.
$\Delta_{\min}$: $\min_{k:\,\theta_k<\theta^{*}} \Delta_k$.
$\Delta_{\max}$: $\max_{k} \Delta_k$.
$n^k_i$: $n_{i,j}$ such that $(i,j) \in A_k$ at the current time slot.
$T_k(n)$: number of times arm $k$ has been played by MLPS in the first $n$ time slots.
$\hat{T}_k(n)$: $\min_{(i,j)\in A_k} n_{i,j}(n)$.
$\hat{\theta}^k_{i,n^k_i}$: $\hat{\theta}_{i,j}(n)$ such that $(i,j)\in A_k$ and $n_{i,j}(n) = n^k_i$.
$\hat{\theta}_{k,n^k_1,\ldots,n^k_M}$: $\sum_{i=1}^{M} \hat{\theta}^k_{i,n^k_i}$.
$\hat{T}_k(n^k_1,\ldots,n^k_M)$: $\min_i n^k_i$.
TABLE I
NOTATION
Table I summarizes some notation we use in the description
and analysis of our algorithm.
The key idea behind this algorithm is to store and use
observations for each user-channel pair, rather than for each
arm as a whole. Since the same user-channel combination
can occur in different matchings, this allows exploitation of
information gained from the operation of one arm to make
decisions about a correlated arm.
We use two $M$ by $N$ matrices to store the information after we play an arm at each time slot. One is $(\hat{\theta}_{i,j})_{M\times N}$, in which $\hat{\theta}_{i,j}$ is the average (sample mean) of all the observed values of channel $j$ by user $i$ up to the current time slot (obtained through potentially different sets of arms over time). The other one is $(n_{i,j})_{M\times N}$, in which $n_{i,j}$ is the number of times that channel $j$ has been observed by user $i$ up to the current time slot.
At each time slot $n$, after an arm $k$ is played, we get the observation of $S_{i,j}(n)$ for all $(i,j) \in A_k$. Then $(\hat{\theta}_{i,j})_{M\times N}$ and $(n_{i,j})_{M\times N}$ (both initialized to 0 at time 0) are updated as follows:

$$\hat{\theta}_{i,j}(n) = \begin{cases} \dfrac{\hat{\theta}_{i,j}(n-1)\,n_{i,j} + S_{i,j}(n)}{n_{i,j}(n-1)+1}, & \text{if } (i,j) \in A_k \\[4pt] \hat{\theta}_{i,j}(n-1), & \text{else} \end{cases} \qquad (5)$$

$$n_{i,j}(n) = \begin{cases} n_{i,j}(n-1)+1, & \text{if } (i,j) \in A_k \\ n_{i,j}(n-1), & \text{else} \end{cases} \qquad (6)$$
Note that while we indicate the time index in the above
updates for notational clarity, it is not necessary to store the
matrices from previous time steps while running the algorithm.
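As a concrete sketch of updates (5)-(6), keeping the two matrices as plain Python lists (the arm and observation values are hypothetical):

```python
# (theta_hat_{i,j}) and (n_{i,j}) for M = 2 users, N = 3 channels,
# both initialized to 0 at time 0.
M, N = 2, 3
theta_hat = [[0.0] * N for _ in range(M)]
n_obs = [[0] * N for _ in range(M)]

def update(arm, observations):
    """arm: set of (i, j) pairs in A_k; observations[(i, j)] = S_{i,j}(n).
    Pairs outside A_k keep their previous values (the 'else' branches)."""
    for (i, j) in arm:
        s = observations[(i, j)]
        theta_hat[i][j] = (theta_hat[i][j] * n_obs[i][j] + s) / (n_obs[i][j] + 1)
        n_obs[i][j] += 1

# Two illustrative time slots playing two different matchings that
# share the pair (0, 0) -- this sharing is exactly what MLPS exploits.
update({(0, 0), (1, 1)}, {(0, 0): 1.0, (1, 1): 0.0})
update({(0, 0), (1, 2)}, {(0, 0): 0.0, (1, 2): 1.0})
print(theta_hat[0][0], n_obs[0][0])  # 0.5 2
```

Only the current matrices are kept, matching the note above that earlier time steps need not be stored.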
Our proposed policy, which we refer to as matching learning
with polynomial storage (MLPS), is shown in Algorithm 2.
Algorithm 2 Matching Learning with Polynomial Storage (MLPS)
1: // INITIALIZATION
2: for p = 1 to M do
3:   for q = 1 to N do
4:     n = N(p − 1) + q;
5:     Play any permutation $k$ such that $(p, q) \in A_k$;
6:     Update $(\hat{\theta}_{i,j})_{M\times N}$, $(n_{i,j})_{M\times N}$ accordingly.
7:   end for
8: end for
9: // MAIN LOOP
10: while 1 do
11:   n = n + 1;
12:   Run Algorithm 3 to play arm $k$ that maximizes

$$\sum_{(i,j)\in A_k} \hat{\theta}_{i,j} + M\sqrt{\frac{(M+1)\ln n}{\min_{(i,j)\in A_k} n_{i,j}}} \qquad (7)$$

13:   Update $(\hat{\theta}_{i,j})_{M\times N}$, $(n_{i,j})_{M\times N}$ accordingly.
14: end while
Denote

$$W_k(n) = \sum_{(i,j)\in A_k} \hat{\theta}_{i,j} + M\sqrt{\frac{(M+1)\ln n}{\min_{(i,j)\in A_k} n_{i,j}}}. \qquad (8)$$

After the initialization period, in which each user-channel pair is observed once, the MLPS policy plays an arm with the maximum value $W_k(n)$ at each time slot. Note that there are $P(N, M)$ arms, so using exhaustive search to solve this maximization is prohibitively expensive. We therefore propose the subroutine presented in Algorithm 3 to solve the relevant combinatorial optimization over matchings in polynomial time.
If we only focus on the first part, i.e., $\max_k \sum_{(i,j)\in A_k} \hat{\theta}_{i,j}$, it would be the problem of finding a maximum weight matching on a labeled bipartite graph between users and channels with weights $\hat{\theta}_{i,j}$. However, note that the second term in the optimization required by the MLPS policy, $\min_{(i,j)\in A_k} n_{i,j}$, depends on which permutation is picked, and cannot be attached to any particular edge in the maximum weight matching problem. Therefore, something additional is called for. The subroutine presented in Algorithm 3 solves the problem by conditioning on each edge, one at a time, to see which one fits the second term, and solving for the max-weight matching over the remaining users and channels.
Algorithm 3 Matching Optimization Subroutine
Input: $M$, $N$, $(\hat{\theta}_{i,j})_{M\times N}$, $(n_{i,j})_{M\times N}$
Output: $k$, an arm that maximizes $W_k$
1: $\tilde{\theta}_{i,j} = \hat{\theta}_{i,j}$, $\forall i, j$.
2: for i = 1 to M do
3:   for j = 1 to N do
4:     // ASSUME THAT $n_{i,j}$ WILL GIVE THE MAXIMUM VALUE IN (7)
5:     $\forall\, 1 \le p \le M$, set $\tilde{\theta}_{p,j} = 0$;
6:     $\forall\, 1 \le q \le N$, set $\tilde{\theta}_{i,q} = 0$;
7:     for i′ = 1 to M do
8:       for j′ = 1 to N do
9:         // DELETE AN EDGE $(i', j')$ IF $n_{i',j'} < n_{i,j}$
10:        if $n_{i',j'} < n_{i,j}$ then
11:          Set $\tilde{\theta}_{i',j'} = 0$;
12:        end if
13:      end for
14:    end for
15:    Solve the Maximum Weight Matching problem (e.g., using the Hungarian algorithm [21]) on the bipartite graph of users and channels with edge weights $(\tilde{\theta}_{i,j})_{M\times N}$. Call the permutation corresponding to this maximum weight matching $k_{i,j}$. Compute $W_{k_{i,j}}$ according to (8);
16:   end for
17: end for
18: Let $k = \arg\max_{k_{i,j}} W_{k_{i,j}}$
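A stdlib-only sketch of Algorithm 3. Two simplifications are assumed for illustration: brute-force enumeration stands in for the Hungarian algorithm, and boosting the conditioned edge's weight stands in for matching over the remaining users and channels and then adding the edge back; every matrix and number below is made up:

```python
import math
from itertools import permutations

def max_weight_matching(w, M, N):
    """Brute-force max-weight matching of M users to N channels.
    (The paper uses the Hungarian algorithm here; for a small sketch,
    enumerating the P(N, M) permutations is simpler.)"""
    best_val, best_arm = float("-inf"), None
    for arm in permutations(range(N), M):
        val = sum(w[i][j] for i, j in enumerate(arm))
        if val > best_val:
            best_val, best_arm = val, arm
    return best_arm

def index_W(arm, theta_hat, n_obs, n, M):
    """The MLPS index (8) of an arm (user i -> channel arm[i])."""
    s = sum(theta_hat[i][j] for i, j in enumerate(arm))
    n_min = min(n_obs[i][j] for i, j in enumerate(arm))
    return s + M * math.sqrt((M + 1) * math.log(n) / n_min)

def mlps_select(theta_hat, n_obs, n, M, N):
    """Sketch of Algorithm 3: condition on each pair (i, j) being the
    minimum-count edge of the chosen arm, prune, match, keep the best."""
    total = sum(map(sum, theta_hat))
    best_arm, best_val = None, float("-inf")
    for i in range(M):
        for j in range(N):
            w = [row[:] for row in theta_hat]
            # Lines 5-6: zero out row i and column j.
            for p in range(M):
                w[p][j] = 0.0
            for q in range(N):
                w[i][q] = 0.0
            # Lines 7-14: delete every edge observed fewer times than (i, j).
            for p in range(M):
                for q in range(N):
                    if n_obs[p][q] < n_obs[i][j]:
                        w[p][q] = 0.0
            # Force (i, j) into the matching: boosting its weight above the
            # total of all other weights is an illustrative stand-in for
            # solving on the remaining graph and adding (i, j) back.
            w[i][j] = total + 1.0
            arm = max_weight_matching(w, M, N)
            val = index_W(arm, theta_hat, n_obs, n, M)  # line 15, per (8)
            if val > best_val:
                best_arm, best_val = arm, val
    return best_arm  # line 18

# Toy instance: 2 users, 3 channels, time slot n = 10.
theta_hat_ex = [[0.9, 0.2, 0.5], [0.3, 0.8, 0.4]]
n_obs_ex = [[5, 1, 2], [3, 4, 2]]
print(mlps_select(theta_hat_ex, n_obs_ex, 10, 2, 3))
```

On this instance the selected arm is the one whose rarely observed pair earns the largest exploration bonus, not the matching with the largest sample means, which is exactly the tension the second term of (7) introduces.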
Theorem 2: Algorithm 3 computes a solution to the optimization problem

$$\max_k \sum_{(i,j)\in A_k} \hat{\theta}_{i,j} + M\sqrt{\frac{(M+1)\ln n}{\min_{(i,j)\in A_k} n_{i,j}}} \qquad (9)$$

in polynomial time.
Proof of Theorem 2: Denote $O^{*} = \{k^{*} : k^{*} = \arg\max_k \sum_{(i,j)\in A_k} \hat{\theta}_{i,j} + M\sqrt{\frac{(M+1)\ln n}{\min_{(i,j)\in A_k} n_{i,j}}}\}$, and denote $n^{k}_{\min} = \min_{(i,j)\in A_k} n_{i,j}$. We prove Theorem 2 by showing the following: for an arm $k^{*} \in O^{*}$, in the iteration of Algorithm 3 in which the pair achieving $n^{k^{*}}_{\min}$ is assumed to give the maximum value in (7), $k^{*}$ stays as a permutation in $(\tilde{\theta}_{i,j})_{M\times N}$ for running the maximum weight matching (Lemma 1).
Proof: When $n^{k^{*}}_{\min}$ is assumed, note that $\forall (i,j) \in A_{k^{*}}$, $n_{i,j} \ge n^{k^{*}}_{\min}$, so none of the weights $\tilde{\theta}_{i,j}$ with $(i,j) \in A_{k^{*}}$ are zeroed at lines 10-12. Moreover, since $k^{*}$ is a matching, $j = j'$ implies $i = i'$, and $i = i'$ implies $j = j'$; so none of them are deleted at line 5 or 6 either. Therefore, Lemma 1 holds.
Lemma 2: For $k^{*} \in O^{*}$, when $n^{k^{*}}_{\min}$ is assumed, the arm $k'$ returned by the maximum weight matching at line 15 satisfies $\sum_{(i,j)\in A_{k'}} \hat{\theta}_{i,j} \ge \sum_{(i,j)\in A_{k^{*}}} \hat{\theta}_{i,j}$.
Also note that

$$\min_{(i,j)\in A_{k'}} n_{i,j} = \min_{(i,j)\in A_{k^{*}}} n_{i,j} = n^{k^{*}}_{\min},$$

so we have

$$\sum_{(i,j)\in A_{k'}} \hat{\theta}_{i,j} + M\sqrt{\frac{(M+1)\ln n}{\min_{(i,j)\in A_{k'}} n_{i,j}}} \ge \sum_{(i,j)\in A_{k^{*}}} \hat{\theta}_{i,j} + M\sqrt{\frac{(M+1)\ln n}{\min_{(i,j)\in A_{k^{*}}} n_{i,j}}}. \qquad (10)$$

(10) implies that $k' \in O^{*}$; therefore Lemma 2 holds.
Lemma 2 implies that an arm attaining the maximum in (9) is one of the candidates $k_{i,j}$ compared at line 18. Further, the computation time of maximum weight matching on a bipartite graph is polynomial (e.g., the Hungarian method requires $O((M+N)^3)$ computations), and there are exactly $M \cdot N$ such computations in Algorithm 3. Hence Theorem 2 holds.
Theorem 3: The expected regret under the MLPS policy, specified in Algorithm 2, is at most

$$\left[\frac{4M^2(M+1)(MN)\ln n}{(\Delta_{\min})^2} + M^2 N\Big(1+\frac{\pi^2}{3}\Big)\right]\Delta_{\max}. \qquad (11)$$
Proof of Theorem 3: Denote $C_{t,n_j}$ as $M\sqrt{\frac{(M+1)\ln t}{n_j}}$. We introduce $\tilde{T}_{ij}(n)$ as a counter after the initialization period. It is updated in the following way: at each time slot after the initialization period, one of two cases must happen: (1) an optimal arm is played; (2) a non-optimal arm is played. In the first case, $(\tilde{T}_{ij}(n))_{M\times N}$ won't be updated. When a non-optimal arm $k(n)$ is picked at time $n$, there must be at least one $(i,j) \in A_k$ such that $n_{i,j}(n) = \min_{(i,j)\in A_k} n_{i,j}$. If there is only one such pair, $\tilde{T}_{ij}(n)$ is increased by 1. If there are multiple such pairs, we arbitrarily pick one, say $(i', j')$, and increment $\tilde{T}_{i'j'}$ by 1.
Each time a non-optimal arm is picked, exactly one element in $(\tilde{T}_{ij}(n))_{M\times N}$ is incremented by 1. This implies that the total number of times we have played the non-optimal arms is equal to the summation of all counters in $(\tilde{T}_{ij}(n))_{M\times N}$. Therefore, we have:

$$\sum_{k:\,\theta_k<\theta^{*}} \mathbb{E}[T_k(n)] = \sum_{i=1}^{M}\sum_{j=1}^{N} \mathbb{E}[\tilde{T}_{ij}(n)]. \qquad (12)$$

Also note that for $\tilde{T}_{ij}(n)$, the following inequality holds:

$$\tilde{T}_{ij}(n) \le n_{i,j}(n), \qquad \forall\, 1 \le i \le M,\ 1 \le j \le N. \qquad (13)$$
Denote by $\tilde{I}_{ij}(n)$ the indicator function which is equal to 1 if $\tilde{T}_{ij}(n)$ is incremented at time $n$. Let $l$ be an arbitrary positive integer. Then:

$$\tilde{T}_{ij}(n) = \sum_{t=MN+1}^{n} \mathbb{1}\{\tilde{I}_{ij}(t)\} \le l + \sum_{t=MN+1}^{n} \mathbb{1}\{\tilde{I}_{ij}(t),\ \tilde{T}_{ij}(t-1) \ge l\}.$$
When $\tilde{I}_{ij}(t) = 1$, a non-optimal arm is picked for which $n_{i,j}$ is the minimum in this arm. We denote this arm as $k(t)$, since at each time that $\tilde{I}_{ij}(t) = 1$ we could get a different arm. Then,

$$\tilde{T}_{ij}(n) \le l + \sum_{t=MN+1}^{n} \mathbb{1}\{\hat{\theta}^{*}(t-1) + C_{t-1,\hat{T}^{*}(t-1)} \le \hat{\theta}_{k(t)}(t-1) + C_{t-1,\hat{T}_{k(t)}(t-1)},\ \tilde{T}_{ij}(t-1) \ge l\}$$
$$= l + \sum_{t=MN}^{n} \mathbb{1}\{\hat{\theta}^{*}(t) + C_{t,\hat{T}^{*}(t)} \le \hat{\theta}_{k(t)}(t) + C_{t,\hat{T}_{k(t)}(t)},\ \tilde{T}_{ij}(t) \ge l\}.$$
Note that $l \le \tilde{T}_{ij}(t)$ implies

$$l \le \tilde{T}_{ij}(t) \le n_{i,j}(t) = n^{k(t)}_{i} = \min_{j} n^{k(t)}_{j}. \qquad (14)$$

This means:

$$\forall\, 1 \le i \le M, \quad n^{k(t)}_{i} \ge l. \qquad (15)$$
Then we have,

$$\tilde{T}_{ij}(n) \le l + \sum_{t=MN}^{n} \mathbb{1}\Big\{\min_{0<n_1,\ldots,n_M<t} \big[\hat{\theta}^{*}_{n_1,\ldots,n_M} + C_{t-1,\hat{T}^{*}(n_1,\ldots,n_M)}\big] \le \max_{l\le n^{k(t)}_1,\ldots,n^{k(t)}_M<t} \big[\hat{\theta}_{k(t),n^{k(t)}_1,\ldots,n^{k(t)}_M} + C_{t-1,\hat{T}_{k(t)}(n^{k(t)}_1,\ldots,n^{k(t)}_M)}\big]\Big\}$$
$$\le l + \sum_{t=1}^{\infty} \Big[\sum_{n_1=1}^{t-1}\cdots\sum_{n_M=1}^{t-1}\ \sum_{n^{k(t)}_1=l}^{t-1}\cdots\sum_{n^{k(t)}_M=l}^{t-1} \mathbb{1}\big(\hat{\theta}^{*}_{n_1,\ldots,n_M} + C_{t-1,\hat{T}^{*}(n_1,\ldots,n_M)} \le \hat{\theta}_{k(t),n^{k(t)}_1,\ldots,n^{k(t)}_M} + C_{t-1,\hat{T}_{k(t)}(n^{k(t)}_1,\ldots,n^{k(t)}_M)}\big)\Big].$$

The event $\hat{\theta}^{*}_{n_1,\ldots,n_M} + C_{t-1,\hat{T}^{*}(n_1,\ldots,n_M)} \le \hat{\theta}_{k(t),n^{k(t)}_1,\ldots,n^{k(t)}_M} + C_{t-1,\hat{T}_{k(t)}(n^{k(t)}_1,\ldots,n^{k(t)}_M)}$ means that at least one of the following must be true:

$$\hat{\theta}^{*}_{n_1,\ldots,n_M} \le \theta^{*} - C_{t-1,\hat{T}^{*}(n_1,\ldots,n_M)}, \qquad (16)$$
$$\hat{\theta}_{k(t),n^{k(t)}_1,\ldots,n^{k(t)}_M} \ge \theta_{k(t)} + C_{t-1,\hat{T}_{k(t)}(n^{k(t)}_1,\ldots,n^{k(t)}_M)}, \qquad (17)$$
$$\theta^{*} < \theta_{k(t)} + 2C_{t-1,\hat{T}_{k(t)}(n^{k(t)}_1,\ldots,n^{k(t)}_M)}. \qquad (18)$$
Now we find the upper bound for $\Pr\{\hat{\theta}^{*}_{n_1,\ldots,n_M} \le \theta^{*} - C_{t-1,\hat{T}^{*}(n_1,\ldots,n_M)}\}$. Note that

$$\hat{T}^{*}(n_1,\ldots,n_M) = \min_i n_i \triangleq n_{\min}.$$

We also define $n^{k}_{\min} = \min_i n^{k}_i$, so $\hat{T}_k(n^{k}_1,\ldots,n^{k}_M) = n^{k}_{\min}$.
We have:

$$\Pr\{\hat{\theta}^{*}_{n_1,\ldots,n_M} \le \theta^{*} - C_{t-1,n_{\min}}\}$$
$$= \Pr\{\hat{\theta}^{*}_{1,n_1} + \hat{\theta}^{*}_{2,n_2} + \cdots + \hat{\theta}^{*}_{M,n_M} \le \theta^{*}_1 + \theta^{*}_2 + \cdots + \theta^{*}_M - C_{t-1,n_{\min}}\}$$
$$\le \Pr\Big\{\text{At least one of the following must hold: } \hat{\theta}^{*}_{1,n_1} \le \theta^{*}_1 - \tfrac{1}{M} C_{t-1,n_{\min}},\ \hat{\theta}^{*}_{2,n_2} \le \theta^{*}_2 - \tfrac{1}{M} C_{t-1,n_{\min}},\ \ldots,\ \hat{\theta}^{*}_{M,n_M} \le \theta^{*}_M - \tfrac{1}{M} C_{t-1,n_{\min}}\Big\}$$
$$\le \Pr\{\hat{\theta}^{*}_{1,n_1} \le \theta^{*}_1 - \tfrac{1}{M} C_{t-1,n_{\min}}\} + \Pr\{\hat{\theta}^{*}_{2,n_2} \le \theta^{*}_2 - \tfrac{1}{M} C_{t-1,n_{\min}}\} + \cdots + \Pr\{\hat{\theta}^{*}_{M,n_M} \le \theta^{*}_M - \tfrac{1}{M} C_{t-1,n_{\min}}\}.$$
$\forall\, 1 \le i \le M$, applying the Chernoff-Hoeffding bound [24], we can find the upper bound of each item in the above expression as:

$$\Pr\{\hat{\theta}^{*}_{i,n_i} \le \theta^{*}_i - \tfrac{1}{M} C_{t-1,n_{\min}}\} = \Pr\{n_i \hat{\theta}^{*}_{i,n_i} \le n_i \theta^{*}_i - \tfrac{n_i}{M} C_{t-1,n_{\min}}\}$$
$$\le e^{-2\frac{1}{n_i}(n_i)^2\frac{(M+1)\ln t}{n_{\min}}} \le e^{-2\frac{1}{n_i}(n_i)^2\frac{(M+1)\ln t}{n_i}} = e^{-2(M+1)\ln t} = t^{-2(M+1)}.$$
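To make the Chernoff-Hoeffding step concrete: it applies $\Pr\{S_{n_i} \le n_i\mu - a\} \le e^{-2a^2/n_i}$ with deviation $a = \tfrac{n_i}{M} C_{t-1,n_{\min}}$, and the exponent can be checked by direct substitution (treating $\ln(t-1) \le \ln t$):

```latex
\frac{2a^2}{n_i}
  = \frac{2}{n_i}\Big(\frac{n_i}{M}\Big)^{2} M^{2}\,\frac{(M+1)\ln t}{n_{\min}}
  = \frac{2\,n_i\,(M+1)\ln t}{n_{\min}}
  \;\ge\; 2(M+1)\ln t \quad (\text{since } n_i \ge n_{\min}),
```

so the probability is at most $e^{-2(M+1)\ln t} = t^{-2(M+1)}$, independent of the particular counts $n_1,\ldots,n_M$.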
Thus,

$$\Pr\{\hat{\theta}^{*}_{n_1,\ldots,n_M} \le \theta^{*} - C_{t-1,\hat{T}^{*}(n_1,\ldots,n_M)}\} \le M t^{-2(M+1)}. \qquad (19)$$

Similarly, we can get the upper bound of the probability for inequality (17):

$$\Pr\{\hat{\theta}_{k(t),n^{k(t)}_1,\ldots,n^{k(t)}_M} \ge \theta_{k(t)} + C_{t-1,\hat{T}_{k(t)}(n^{k(t)}_1,\ldots,n^{k(t)}_M)}\} \le M t^{-2(M+1)}. \qquad (20)$$
Note that for $l \ge \left\lceil \frac{4(M+1)\ln n}{(\Delta_{k(t)}/M)^2} \right\rceil$,

$$\theta^{*} - \theta_{k(t)} - 2C_{t-1,\hat{T}_{k(t)}(n^{k(t)}_1,\ldots,n^{k(t)}_M)} = \theta^{*} - \theta_{k(t)} - 2M\sqrt{\frac{(M+1)\ln t}{n^{k(t)}_{\min}}}$$
$$\ge \theta^{*} - \theta_{k(t)} - M\sqrt{\frac{4(M+1)\ln n}{n^{k(t)}_{\min}}} \ge \theta^{*} - \theta_{k(t)} - M\sqrt{\frac{4(M+1)\ln n}{4(M+1)\ln n}\Big(\frac{\Delta_{k(t)}}{M}\Big)^2}$$
$$= \theta^{*} - \theta_{k(t)} - \Delta_{k(t)} = 0. \qquad (21)$$
(21) implies that condition (18) is false when $l = \left\lceil \frac{4(M+1)\ln n}{(\Delta_{k(t)}/M)^2} \right\rceil$. If we let

$$l = \left\lceil \frac{4(M+1)\ln n}{(\Delta^{i,j}_{\min}/M)^2} \right\rceil, \quad \text{where } \Delta^{i,j}_{\min} = \min_k \{\Delta_k : (i,j) \in A_k,\ \theta_k < \theta^{*}\}, \qquad (22)$$

then (18) is false for every non-optimal arm $k(t)$ containing $(i,j)$.
Therefore,

$$\mathbb{E}[\tilde{T}_{ij}(n)] \le \left\lceil \frac{4(M+1)\ln n}{(\Delta^{i,j}_{\min}/M)^2} \right\rceil + \sum_{t=1}^{\infty}\Big[\sum_{n_1=1}^{t-1}\cdots\sum_{n_M=1}^{t-1}\ \sum_{n^{k}_1=l}^{t-1}\cdots\sum_{n^{k}_M=l}^{t-1} 2M t^{-2(M+1)}\Big]$$
$$\le \frac{4M^2(M+1)\ln n}{(\Delta^{i,j}_{\min})^2} + M + \sum_{t=1}^{\infty} 2M t^{-2} \le \frac{4M^2(M+1)\ln n}{(\Delta^{i,j}_{\min})^2} + M\Big(1 + \frac{\pi^2}{3}\Big). \qquad (23)$$
So under our MLPS policy,

$$R_n^{\Phi} = n\theta^{*} - \mathbb{E}^{\Phi}\Big[\sum_{t=1}^{n} Y_{\Phi(t)}(t)\Big] = \sum_{k:\,\theta_k<\theta^{*}} \Delta_k\, \mathbb{E}[T_k(n)]$$
$$\le \Delta_{\max} \sum_{k:\,\theta_k<\theta^{*}} \mathbb{E}[T_k(n)] = \Delta_{\max} \sum_{i=1}^{M}\sum_{j=1}^{N} \mathbb{E}[\tilde{T}_{ij}(n)]$$
$$\le \left[\sum_{i=1}^{M}\sum_{j=1}^{N} \frac{4M^2(M+1)\ln n}{(\Delta^{i,j}_{\min})^2} + M^2 N\Big(1+\frac{\pi^2}{3}\Big)\right]\Delta_{\max}$$
$$\le \left[\frac{4M^2(M+1)(MN)\ln n}{(\Delta_{\min})^2} + M^2 N\Big(1+\frac{\pi^2}{3}\Big)\right]\Delta_{\max}. \qquad (24)$$