

Cognitive Radio Engine Training


Haris I. Volos, Member, IEEE, and R. Michael Buehrer, Senior Member, IEEE
Abstract
Training is the task of guiding a cognitive radio engine through the process of learning a desired
system's behavior and capabilities. The training speed and expected performance during this task are of
paramount importance to the system's operation, especially when the system is facing new conditions.
In this paper, we provide a thorough examination of cognitive engine training, and we analytically
estimate the number of trials needed to conclusively find the best-performing communication method
in a list of methods sorted by their possible throughput. We show that, even if only a fraction of
the methods meet the minimum packet success rate requirement, near maximal performance can be
reached quickly. Furthermore, we propose the Robust Training Algorithm (RoTA) for applications in
which stable performance during training is of utmost importance. We show that the RoTA can facilitate
training while maintaining a minimum performance level, albeit at the expense of training speed. Finally,
we test four key training techniques (ε-greedy exploration, Boltzmann exploration, the Gittins index
strategy, and the RoTA) and we identify and explain the three main factors that affect performance
during training: the domain knowledge of the problem, the number of methods with acceptable
performance, and the exploration rate.
I. INTRODUCTION
This work contributes to the study of cognitive engine (CE) design within the field of cognitive
radio (CR). The scope of CR research is radios with advanced adaptation capabilities that are
facilitated using artificial intelligence (AI) [1], [2] and other computer science concepts. According
to the CR pioneer Joe Mitola, an ideal CR is capable of not only optimally utilizing its own wireless
capabilities, but also self-determining its goals by observing its human operator's behavior [3].
Current CR devices are capable of efficient spectrum utilization and optimized performance in
challenging conditions. The scope of our work [4]-[6] is to propose methods that allow a radio
communication device to observe the environment and its own performance and thus to determine
on its own the best communication method to use to meet its objectives in a given environment.

This paper was presented in part at MILCOM 2010, San Jose, CA, Oct. 31 to Nov. 3, 2010, and it was awarded the Fred W.
Ellersick Award for Best Unclassified Paper.
H. I. Volos and R. M. Buehrer are associated with Wireless@Virginia Tech, Blacksburg, VA 24061, USA.
The National Science Foundation supported this work under Grant No. 0520418.
To make CR possible, communication engineers have, in the last few years, borrowed ideas
from machine learning and AI [1]. A CE is an intelligent agent that enables the radio to have
the desired learning and adaptation abilities. This intelligent agent [7] senses its environment
(the wireless channel), acts by using a communication method based on its past experience,
and observes its own performance to learn its capabilities, adding to its experience. The works
of Rieser, Rondeau, and Le [2], [8], [9] propose CEs that deal with the user, policy, and
radio domains. Their designs are similar and are based on the genetic algorithm (GA), case-based
reasoning (CBR), and multi-objective optimization principles. He et al. [10] designed a
CBR-based CE for IEEE 802.22 wireless regional area network (WRAN) applications and also
investigated both the radio and policy domains. Other works have focused only on the radio
domain. For example, Newman et al. [11] and Z. Zhao et al. [12] applied a GA and particle
swarm optimization, respectively, to multi-channel radio links. On the other hand, N. Zhao et
al. [13] proposed a CE design based on ant colony optimization. Jiang and Weng designed a
CE with dynamic resource allocation [14], and Y. Zhao et al. [15] looked into utility function
selection for streaming video with a CE testbed. Finally, for learning and optimization of a
wireless link, Baldo and Zorzi [16] applied an artificial neural network (ANN) and Clancy et al.
[17] used predicate logic.

CE design is only one aspect of CR; significant work has been done in other CR areas
[18], mostly related to Dynamic Spectrum Access (DSA). For example, fundamental CR studies
have been done on achievable rates [19], communication limits [20], fundamental issues [21],
and design tradeoffs [22]. Sridharan and Vishwanath [23] and Jafar and Shamai [24] derive
theoretical capacities for multiple-input and multiple-output (MIMO) CR systems, and Scutari
et al. [25] propose some techniques for operating in a MIMO spectrum-sharing setting. Work in
other CR aspects includes spectrum sensing [26], [27], cognitive networks [28], [29], security
[30], and minimization of system power consumption [31].
A. Research Challenges
A primary function of a CE is to learn and adapt to its environment. We define training as the
task of guiding a CE through the process of learning a radio's behavior in response to specific
environments and inputs. This work and others [2], [11] have demonstrated that a CE can be
successfully trained.

Although research has demonstrated [2], [5], [11] that a CE can be trained with a reasonable
amount of time and effort, no study has focused on the performance during training and the
factors that affect that performance. Estimating the training speed and the system's expected
performance during and after training is of paramount importance to the CE designer.
Quantitatively performing this task is difficult because of the diverse techniques used and the
numerous operating scenarios. Nevertheless, in this paper we attempt to provide quantitative
and qualitative insights into CE training by examining four specific training techniques. We also
explain the findings and analysis from a general point of view, so that the lessons learned can
be applied to other systems.
This study makes three key contributions. First, we analytically estimate the number of trials
needed to conclusively find, from a sorted list, the method that best meets the operation objective.
The operation objective is typically defined as a combination of packet success rate (PSR) and
spectral efficiency. The methods with acceptable performance are assumed to be toward the
bottom of the list, and those with unacceptable performance (i.e., those that do not meet the
required PSR) are assumed to be toward the top of the list. The number of trials required to find
the maximal-performing method is found to increase linearly with the number of methods having
unacceptable performance. As a specific data point, we show via simulation that when at least
300 of 1000 methods have acceptable performance, 70% of the maximal performance can be
reached in about 2000 trials.

Second, we propose the Robust Training Algorithm (RoTA) for applications in which stable
performance during training is of utmost importance. Previously proposed training techniques
[2], [5], [11] make no attempt to provide stable performance during training, which can be
extensive. This aspect may lead to instantaneous suboptimal performance and, more importantly,
can cause outages, which are unacceptable for sensitive applications in which bounded performance
is paramount. We show that the RoTA can facilitate training while maintaining a minimum
performance level, albeit at the expense of the training speed. The RoTA's key difference from
the other methods is its primary focus on short-term performance (i.e., PSR), even while it tries
to improve the long-term performance of the system.

Third, we test four key training techniques (ε-greedy exploration, Boltzmann exploration,
the Gittins index strategy, and the proposed RoTA) in five different scenarios, assuming 1000
available methods. The purpose of these tests is to evaluate the performance of each method
and, more importantly, to gain insights into the factors that affect training performance. From
these tests, we find that knowledge of the methods' potential performance causes the training
techniques to focus more on the most promising methods, which lowers initial performance.
When the methods have equal potential performance, maximal performance is reached faster.
We also find that confidence intervals can be used when a minimum performance target is set.
The confidence intervals allow the removal of under-performing methods from the list, speeding
up the search process. Finally, we find that the number of acceptably-performing methods has a
direct effect on performance.
B. Paper Outline
Section II provides a discussion and an overview of training. Section III provides an analytical
estimation of the number of trials needed to conclusively find the best method. It also provides
simulation results on the association between the expected performance and the number of trials.
Section IV provides an overview of three key training techniques and the proposed RoTA. Section
V presents results for all four key training techniques along with a discussion of the factors that
affect training. Finally, Section VI provides some concluding remarks.
II. TRAINING OVERVIEW
Training is used in many AI-based systems [7]. For example, a learning-tree-based classifier
is typically trained using a specific training set, with the goal of minimizing the classification
error. In a back-propagation ANN, the least mean squares (LMS) algorithm is widely used to
minimize the training error [32]. In k-means clustering, the goal is to identify k groups in a set of
data that minimize the sum of the squared distances from each data point to its assigned group
[33]. The examples cited are offline techniques, i.e., the whole dataset is available before
training commences. On the other hand, online versions [34], [35] are available that can process
data as the data arrive.

Three primary types of learning are used in the context of a CE: reinforcement, supervised, and
unsupervised learning. Because different types of learning can be applied in the context of
a CE, we refer to the different learning types more generically as training to keep the discussion
general. First, in a CE, the training task is assumed to be online, and a joint learning and
optimization process takes place. This operation can be cast as a reinforcement learning [36]
task, which attempts to learn so that a reward is maximized. In reinforcement learning, the
behavior is adjusted as rewards are received. Second, supervised learning is based on examples
of the desired behavior or attribute being learned. Supervised learning exists in the context of
a CE when the capabilities of the system are learned by observing action-outcome pairs. Both
reinforcement and supervised learning exist in the context of a CE: reinforcement learning is
used to decide, based on previous experience, upon the next communication method to be used;
supervised learning is used when the action-outcome pairs are used to estimate system abilities.
One example is the training of a Bayesian classifier, in which the action is the communication
method that was used to establish the communication link and the outcome is the number of
successful and failed packets using this method. Finally, in unsupervised learning, no explicit
groupings are specified in the collected data. The unsupervised learner extracts features from the
data, such as clusters of similar items. Unsupervised learning can be used for data organization
and memory compression; this type of learning is omitted from our study.
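To make the action-outcome idea concrete, the following minimal Python sketch keeps per-method success/failure counts and turns them into a PSR estimate that a CE could consult when choosing the next communication method. This is an illustration only, not the classifier of [5]; the class name, the method label, and the Beta(1, 1) prior smoothing are assumptions made for the example.

    from collections import defaultdict

    class PsrTracker:
        """Tracks action-outcome pairs: (method used, packet success/failure)."""
        def __init__(self, prior_success=1, prior_failure=1):
            # Beta(1, 1) prior, i.e., an uninformative starting estimate
            self.counts = defaultdict(lambda: [prior_success, prior_failure])

        def record(self, method, success):
            """Record the outcome of one packet sent with the given method."""
            self.counts[method][0 if success else 1] += 1

        def psr_estimate(self, method):
            """Posterior-mean estimate of the packet success rate of a method."""
            s, f = self.counts[method]
            return s / (s + f)

    tracker = PsrTracker()
    for outcome in [True, True, False, True]:      # observed packets (assumed data)
        tracker.record("QPSK_rate_1_2", outcome)   # hypothetical method label
    print(tracker.psr_estimate("QPSK_rate_1_2"))   # ~0.667 with the Beta(1, 1) prior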
III. EXPECTED LENGTH OF TRAINING
A. Problem Description
The general problem in a link adaptation CE is that there is a list with a large number of
potential communication methods that can be used. Each potential method is a discrete combination
of modulation, coding, antenna techniques, and other possible parameters defining the
communication method to be used. In this list, some of the methods are admissible and the rest are
inadmissible. An admissible method is a method that meets the minimum performance requirements
per the operation objective, and an inadmissible method fails to meet those requirements given
the current environment. The minimum performance requirements are typically a given PSR
or bit rate (bits/s/Hz). For example, if a method has a 90% PSR in the current environment
but the minimum required PSR is 95%, then this method will be inadmissible. The goal in a
link adaptation CE is to find the admissible method with the highest performance metric. In
most cases, such as when the goal is to maximize bandwidth efficiency, the maximum potential
performance of each configuration is already known. Therefore, the list of configurations can
potentially be sorted by how well each item serves the current goal. In this case, the problem
becomes a search through a sorted list.

Rather than finding the perfect technique for minimizing the learning cost (which we have done
in previous publications [4], [5] by adopting an optimal exploration vs. exploitation balancing
strategy), the objective of this section is to estimate the number of trials needed to find the optimal
method and to determine the expected performance after a number of trials. In our derivation,
the estimation of the number of trials is only indirectly dependent on the performance metric.
Though developing a new technique was not a goal of this section, the analysis and assumptions
made in this section have influenced the search methodology employed by the proposed RoTA
(Section IV-D).
B. Assumptions
For this task to be analytically tractable, some assumptions must be made. First, we assume
that the radio has a set K of N_K methods, where each method k has a potential reward R_k if
the method is admissible. Each method is assumed to be evaluated until its admissibility or
inadmissibility is verified. Finally, we assume that an admissible method requires T_a trials to
be verified; likewise, an inadmissible method requires T_i trials to be verified.

The sampling space is a sorted list (by ascending potential reward) of all the methods. The
admissible methods can be anywhere in the sampling space. However, the probability tree for
any arbitrary distribution of admissible methods cannot be simplified analytically. To make this
problem tractable, it is assumed that if N_A out of N_K methods are admissible, then methods 1
to N_A are admissible, and methods N_A + 1 to N_K are inadmissible. In fact, this tends to be true
when the return is spectral efficiency (capacity). However, since this is just an assumption for
analytical purposes, techniques like a binary search (which is very efficient for searching through
a sorted list) cannot be used for the analysis or for the actual search technique. Because the
inadmissible methods must all be evaluated to conclusively know that the best-performing method
was found, our assumption makes the results obtained a worst case compared to the case when
the admissible methods are randomly distributed within the search space. By definition, all the
inadmissible methods need to be evaluated because they have the highest potential performance.
On the other hand, when the admissible methods are randomly distributed (within the search
space), an admissible method likely exists near the top of the list. That is, if an admissible
method is found near the top, only a few methods will remain to be evaluated. In the analysis
to follow, if it is assumed that T_a = T_i = 1, then the results will present a lower bound on the
expected trials needed in the worst case.
C. The Expected Number of Trials Needed to Find the Maximal-Performing Method
We want to estimate the expected number of trials needed to find the maximal-performing
method X. Let K = A ∪ I be the set of all N_K methods, where A and I are the sets of the
admissible and inadmissible methods, respectively: A = {1, 2, 3, ..., X} and I = {X + 1, X + 2,
X + 3, ..., N_K}. Thus N_A = |A|, N_I = |I|, and N_K = |K|, where |S| is the cardinality of set
S. The problem can be broken into two subproblems, one for each subset of K. The number of
trials needed to search through set I can be found to be N_I T_i; all the methods in I need to be
evaluated since they have a higher potential performance metric than the admissible methods.
The expected number of trials T_K needed to search through K is given by

    T_K = T_A + T_I    (1)

where T_I = N_I T_i and T_A is the number of trials required to search through A until method X is
reached. The process of going through A^1 can be modeled as a Markov chain with an absorbing
state [37]. When the chain reaches the absorbing state, it remains in that state. There are N_A
states starting from 1 to X, e.g., state X - 1 means that we are evaluating method X - 1. In
our case, all the states but X are non-absorbing states. A state is defined to be an admissibility
evaluation session that takes on average T_a trials to be concluded, i.e., each state represents the
T_a trials that are required to verify that the method (represented by the state) is admissible. There
are two basic sampling cases: uniform sampling, in which each method has an equal chance
of being sampled, p_x = 1/N_A, and weighted sampling, in which p_x = R_x / \sum_{i=1}^{N_A} R_i. The
proposed RoTA uses a similar search pattern (Section IV-D).

^1 Since we have no knowledge of which methods are in each set (A and I) as we search, we randomly switch between the
two sets until I is empty.
The sampling probabilities can be used to populate the transition matrix of the Markov chain.
The Markov chain has N_A states that transition only to the higher-order states. Each state's
probability of transition is represented in the transition matrix.

This transition matrix is a square matrix of size N_A, where the rows are the current states and
the columns are the next states. For example, if the process is at state X - 2, the probability
that state X will be next is given by the entry in the (N_A - 2)th row and (N_A)th column. The
Markov model for uniform sampling is depicted in Figure 1; from state X - 2, only the final
two states can be selected with equal probability. The process may start from any state with a
probability p_x. Also, from the first state, all remaining states may be selected with probability
1/(N_A - 1). The transition matrix P for the weighted-sampling search of the Markov chain is

    P =
    \begin{bmatrix}
    0 & \cdots & \dfrac{R_{X-3}}{\sum_{i=2}^{N_A} R_i} & \dfrac{R_{X-2}}{\sum_{i=2}^{N_A} R_i} & \dfrac{R_{X-1}}{\sum_{i=2}^{N_A} R_i} & \dfrac{R_{X}}{\sum_{i=2}^{N_A} R_i} \\
    \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\
    0 & \cdots & 0 & \dfrac{R_{X-2}}{\sum_{i=N_A-2}^{N_A} R_i} & \dfrac{R_{X-1}}{\sum_{i=N_A-2}^{N_A} R_i} & \dfrac{R_{X}}{\sum_{i=N_A-2}^{N_A} R_i} \\
    0 & \cdots & 0 & 0 & \dfrac{R_{X-1}}{\sum_{i=N_A-1}^{N_A} R_i} & \dfrac{R_{X}}{\sum_{i=N_A-1}^{N_A} R_i} \\
    0 & \cdots & 0 & 0 & 0 & 1 \\
    0 & \cdots & 0 & 0 & 0 & 1
    \end{bmatrix}    (2)

From the transition matrix, state X is observed to be reached with probability 1 from
itself (state X, the absorbing state) and from state X - 1. The transition matrix for uniform
sampling can be trivially found by setting R_i = 1 for all i. The transition probabilities between the
non-absorbing states are given by the submatrix Q

    Q = P(i, j),   i = 1, 2, ..., N_A - 1;  j = 1, 2, ..., N_A - 1    (3)
The fundamental matrix N of the Markov chain is needed to calculate T_A and is defined to
be

    N = (I_{N_A - 1} - Q)^{-1}    (4)

where I_{N_A - 1} is an identity matrix of size N_A - 1. N(i, j) is the expected number of periods
spent in the non-absorbing state j if the process started from the non-absorbing state i. Finally,
T_A is given by

    T_A = ( 1 + \sum_{x=1}^{N_A - 1} p_x \sum_{i=1}^{N_A - 1} N(x, i) ) T_a    (5)

where p_x is the probability that method x is selected first in the search process.
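As a cross-check of Eqs. (1)-(5), the short Python sketch below builds the weighted-sampling transition matrix, extracts Q, and evaluates the fundamental matrix. The function name and the numerical values of N_A, N_I, T_a, and T_i are illustrative assumptions, not the paper's simulation settings.

    import numpy as np

    def expected_trials(rewards_admissible, n_inadmissible, T_a, T_i):
        """Evaluate Eqs. (1)-(5): expected trials to reach the best method X."""
        R = np.asarray(rewards_admissible, dtype=float)   # sorted ascending; last state is X
        n_a = len(R)
        # Transition matrix (Eq. (2)): from state j, go to state m > j w.p. R_m / sum_{i>j} R_i
        P = np.zeros((n_a, n_a))
        for j in range(n_a - 1):
            P[j, j + 1:] = R[j + 1:] / R[j + 1:].sum()
        P[n_a - 1, n_a - 1] = 1.0                          # absorbing state X
        Q = P[:-1, :-1]                                    # Eq. (3): non-absorbing block
        N = np.linalg.inv(np.eye(n_a - 1) - Q)             # Eq. (4): fundamental matrix
        p_start = R / R.sum()                              # weighted first selection
        T_A = (1 + p_start[:-1] @ N.sum(axis=1)) * T_a     # Eq. (5)
        T_I = n_inadmissible * T_i                         # all inadmissible methods evaluated
        return T_A + T_I                                   # Eq. (1)

    # Illustrative numbers (assumed): 300 admissible methods with rewards 1..300,
    # 700 inadmissible methods, and T_a = T_i = 100 trials per admissibility decision.
    print(expected_trials(np.arange(1, 301), 700, 100, 100))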
9
D. Estimating Admissibility
The main criterion for determining the admissibility of a method is its estimated PSR, π̂.
However, confidently estimating the PSR requires several trials. The number of trials required for a
confident estimate of the PSR can be bounded by using the Chernoff bound [38]

    T >= ln(2/δ) / (2 ε^2)    (6)

where T is the number of trials required for the estimate π̂ to be within ε of the true probability
with confidence 1 - δ. Equation (6) can be used to estimate T_i and T_a. The actual confidence
intervals can be estimated using our previously published methods [5].
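As an illustration (with ε and δ chosen purely for this example), requiring the estimate to be within ε = 0.05 of the true PSR with confidence 1 - δ = 0.95 gives

    T \geq \frac{\ln(2/0.05)}{2 \times 0.05^2} = \frac{\ln 40}{0.005} \approx 738 \text{ trials.}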
For analysis purposes, let us assume that ε is defined as being equal to |π - π_min|, where
π_min is the minimum PSR required for a method to be considered admissible. Evaluating Eq. (6)
as ε → 0 gives T → ∞. From this observation, we can conclude that significantly more trials are
needed to verify a method when its PSR is close to π_min. Since π_min is typically close to 1,
an admissible method is more likely to have a PSR close to π_min than an inadmissible method:
an admissible method has π ∈ [π_min, 1], an inadmissible method has π ∈ [0, π_min),
and typically π_min >> 1 - π_min (i.e., π_min is more likely to be a high value, such as 0.8 or 0.9,
than a low value such as 0.3). Of course, in a practical system, ε will have a set minimum value
so that π >= π_min can be verified in a finite number of trials. Each individual method's T_i or T_a
will depend on the operation parameters, but, without loss of generality, we assume two cases
in this work (T_i = T_a and T_a > T_i) to demonstrate their effect on the results.
The expected number of trials is only indirectly dependent on the performance metric. The
performance metric to be maximized will determine which methods are admissible and which are
inadmissible, and this determination directly impacts the expected number of trials. As already
explained in this section, the performance metric will also impact the number of trials needed
to determine the admissibility or inadmissibility of a specific method (T_i and T_a), and this
determination also impacts the expected number of total trials.
E. Results
Figure 2 evaluates Eq. (1) for the special case N_A = N_K, i.e., I = ∅, when uniform and
weighted sampling are used. The simulation agrees with the analytical results. In this best-case
scenario (N_A = N_K), a relatively small number of trials is required to find the best-performing
method. However, if Eq. (1) is evaluated for N_A < N_K, this number will increase linearly
(with N_I) and soon dominate the number of trials required, assuming T_a ≈ T_i, because all the
N_I inadmissible methods must be evaluated. On the other hand, Figure 3 depicts the simulated
performance for N_A ranging from 1 to 1000, N_K = 1000, and T_a = T_i = 100. The performance at
each step was normalized, i.e., it was divided by the reward of the maximal-performing method.
When just 30% of the methods are admissible (N_A = 300), 70% of the maximal performance
can be reached in about 2000 trials.

Figure 4 plots the number of trials needed to reach 95% of the maximal performance for
the case shown in Figure 3 and other cases; 10^5 trials are needed when N_A = 1. A large
number of trials is needed because the assumed distribution of the admissible and inadmissible
methods represents a worst-case scenario (assuming T_a = T_i) in terms of searching. In the assumed
distribution, all the N_A admissible methods are at the bottom of the list below the N_I inadmissible
methods, and, as a result, all the inadmissible methods must be evaluated to determine the best-performing
method. On the other hand, if the admissible methods are randomly distributed, the
number of trials required drops quickly to 400 as N_A approaches 1000 (the total number of
methods). Nevertheless, even when the admissible methods are not randomly distributed, the
number of trials drops to 400, but not as fast. Conversely, when T_a = 1000, 3 x 10^3 trials are
required when N_A = N_K = 1000. Moreover, for (T_a = T_i = 100) and (T_a = 1000, T_i = 100),
approximately the same number of trials is needed for N_A < 300 because T_i is the same for
both cases. Finally, by using the CE we described in an earlier publication [5] and for the case of
π_min = 0.9, we found T_a to be 366 trials and T_i to be 99 trials. When π_min = 0.99, T_a was further
skewed to 1050 trials and T_i to 139, and these findings are consistent with our analysis that T_a
increases with π_min (i.e., ε in Eq. (6) has a lower value). A performance analysis similar to that
shown in Figure 4 was performed in Section V for the key training techniques that are described
in Section IV. The results of the analysis of Section V are shown in Figure 6, which shows that
the RoTA curve decays similarly to the "T_a = T_i = 100, sorted admissibility order" curve in
Figure 4. This similarity is not surprising because the RoTA uses a similar search pattern to that
assumed in this section.
IV. KEY TRAINING TECHNIQUES
In this section, we present three existing training techniques and a newly proposed technique.
The techniques are the ε-greedy strategy [36], Boltzmann exploration [36], the Gittins index
strategy [39], and the RoTA [6]. All the techniques have two things in common. First, all of them
are based on stochastic principles. Second, they all have a factor that affects the exploration rate.
In the ε-greedy strategy, the exploration rate is managed by the ε factor; in Boltzmann exploration,
by the temperature T; in the Gittins index strategy, by the discount factor; and in the RoTA, by
the training window length. A higher ε causes exploration to be performed more often; a higher T
allows more methods to be selected for operation; a higher discount factor in the Gittins index
strategy makes methods more attractive for exploration; and a wider training window length
will allow the RoTA to absorb more failed packets over the window's duration, thus allowing
more exploration packets to be sent. The techniques also have their differences and unique
characteristics: first, the classic version of ε-greedy never stops exploration. Second, Boltzmann
exploration becomes greedier (more likely to select the top-performing method) as its temperature
factor decreases in value. Third, the Gittins index strategy needs to update and compare only
one method's index per step. Finally, the RoTA throttles exploration more aggressively so that
a minimum performance level is maintained in the short term.
A. The ε-Greedy Strategy

The ε-greedy strategy [36] is a simple strategy that uses (i.e., exploits) the best method,
k_greedy = argmax_k \bar{R}_k(n), with probability 1 - ε (ε ∈ [0, 1]), where \bar{R}_k(n) is the average reward of
method k at trial n. However, with probability ε, the ε-greedy strategy explores by using a
random method, k, uniformly selected. As n → ∞, by the law of large numbers, \bar{R}_k(n) will
converge to the true mean. The ε-greedy techniques guarantee that all methods are explored
as the horizon tends to infinity. The parameter ε controls how fast exploration is performed. A
higher ε will cause a faster exploration rate, and the ε-greedy strategy will arrive more quickly
at an optimal or near-optimal method. However, the high exploration rate may reduce overall
returns because of the higher exploration cost.

The ε-greedy strategy described here is the classic version. With this version, exploration never
stops. For this reason, there is a variation that slows down the exploration rate as the number
of trials increases. This decrease in the exploration rate is especially important when the search
space has many significantly under-performing methods. The exploration parameter ε is updated
at every trial n by

    ε = ε_0 / (1 + n d_ε)    (7)

where ε_0 is the initial value of ε and d_ε is the decrease rate. Both described ε-greedy versions
are used in this work, along with a variation of the classic version that limits exploration to only
potentially better-performing methods (see Section V).
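The following Python sketch illustrates the two selection rules described above, the classic ε-greedy rule and the decaying-ε schedule of Eq. (7). The function names, reward estimates, and parameter values are placeholders for illustration only.

    import random

    def epsilon_greedy(avg_reward, epsilon):
        """Classic epsilon-greedy: exploit the best-known method w.p. 1 - epsilon."""
        if random.random() < epsilon:
            return random.randrange(len(avg_reward))                     # explore: uniform pick
        return max(range(len(avg_reward)), key=avg_reward.__getitem__)   # exploit

    def decayed_epsilon(eps0, n, d_eps):
        """Eq. (7): exploration rate after n trials for the decaying variant."""
        return eps0 / (1 + n * d_eps)

    avg_reward = [0.2, 0.9, 0.5]            # illustrative running averages per method
    for n in range(5):
        eps = decayed_epsilon(0.5, n, 0.1)
        k = epsilon_greedy(avg_reward, eps)
        print(n, round(eps, 3), k)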
B. Boltzmann Exploration
Boltzmann exploration [36] weighs its actions based on their estimated value (i.e., methods
with a higher value are more likely to be selected). Each method is selected with probability p_k:

    p_k = e^{\bar{R}_k(n)/T} / \sum_i e^{\bar{R}_i(n)/T}    (8)

where T is a positive parameter called the temperature. When the temperature is high (T > 1000),
the methods are selected probabilistically based on their relative return
(p_k ≈ \bar{R}_k(n) / \sum_{i=1}^{N_K} \bar{R}_i(n)).
That is, methods with a higher estimated value \bar{R}_k(n) are more likely to be selected. However,
as T → 0, p_{k_max} → 1, where k_max = argmax_k \bar{R}_k(n). Therefore, as the temperature decreases,
Boltzmann exploration disproportionately selects methods with higher reward (i.e., it becomes
greedier), and, when T → 0, it selects only the method with the highest estimated reward.

The temperature T is updated at each trial n using

    T = T_0 / (1 + n d_T)    (9)

where T_0 is the initial value of T and d_T is the decrease rate.
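A compact sketch of the selection rule in Eq. (8) combined with the temperature schedule of Eq. (9) follows; as before, the estimates and constants are illustrative assumptions.

    import math
    import random

    def boltzmann_select(avg_reward, temperature):
        """Eq. (8): sample a method index with softmax (Boltzmann) probabilities."""
        weights = [math.exp(r / temperature) for r in avg_reward]
        return random.choices(range(len(avg_reward)), weights=weights, k=1)[0]

    T0, d_T = 2000.0, 0.5                    # illustrative initial temperature and decay rate
    avg_reward = [0.2, 0.9, 0.5]
    for n in range(5):
        T = T0 / (1 + n * d_T)               # Eq. (9)
        print(n, round(T, 1), boltzmann_select(avg_reward, T))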
C. The Gittins Index Strategy
Gittins proved that exploration vs. exploitation can be optimally balanced using a dynamic
allocation index-based strategy [39]. This strategy maximizes the total sum of rewards collected
over a long-term horizon.^2 The strategy is simply to use the method with the highest Gittins
index, which is based on the reward statistics of each method and must be estimated only when
those statistics change (i.e., only when a method is used). We discuss the use of the Gittins
indices in more detail in two of our prior publications [4], [5].

The Gittins index is dependent upon the underlying distribution of R_k. In this work, we
consider the Gittins index for the normal reward process (NRP) and the Bernoulli reward process
(BRP). In the application examined in this work, the underlying process is Bernoulli: a packet
is either successful or unsuccessful. For an NRP, the Gittins index is equal to

    ν(μ̂, σ̂^2, n, γ) = μ̂ + σ̂ ν(0, 1, n, γ)    (10)

where μ̂ and σ̂^2 are the estimates of the mean and the variance of the return, respectively, using n
trials; γ ∈ (0, 1) is a discount factor; and ν(0, 1, n, γ) is the Gittins index for a zero-mean, unit-variance
distributed process (tabulated in Gittins' book [39]). For a BRP, the Gittins index is
equal to [40]

    ν(α, β, γ, R_k) = R_k ν(α, β, γ, 1)    (11)

where ν(α, β, γ, 1) is the Gittins index for a Bernoulli process (again tabulated in Gittins' book
[39]) with α successes, β failures, and a reward of 1 if successful. R_k is the reward received
when method k is successful. In the BRP case, the belief state is represented by {α, β}. Other
works offer more information on the Gittins index [4], [5], [39], [40].

^2 An important aspect differentiating this strategy from the RoTA is the RoTA's short-term focus on the PSR performance metric.
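To show how Eq. (11) is applied in a selection loop, the sketch below assumes a precomputed lookup table nu_table[(alpha, beta)] of Bernoulli Gittins index values for a fixed discount factor (such values are tabulated in [39]); the table contents, the function name gittins_brp, and the method labels and rewards are placeholders, not values taken from [39].

    # Hypothetical excerpt of a tabulated Bernoulli Gittins index nu(alpha, beta, gamma, 1)
    # for a fixed discount gamma; the numbers below are placeholders, not values from [39].
    nu_table = {(1, 1): 0.70, (2, 1): 0.78, (1, 2): 0.45, (2, 2): 0.62}

    def gittins_brp(alpha, beta, R_k):
        """Eq. (11): index of a Bernoulli reward process with reward R_k on success."""
        return R_k * nu_table[(alpha, beta)]

    # Belief state {alpha, beta} and potential reward for each method (illustrative)
    methods = {"A": {"alpha": 1, "beta": 1, "R": 2.0},
               "B": {"alpha": 2, "beta": 1, "R": 1.0}}

    # Select the method with the highest index; only the used method's index
    # needs to be recomputed after its statistics change.
    best = max(methods, key=lambda m: gittins_brp(methods[m]["alpha"],
                                                  methods[m]["beta"],
                                                  methods[m]["R"]))
    print(best)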
D. The Robust Training Algorithm
In the RoTA, learning is accomplished by sending training packets. A training packet has an
uncertain PSR. The training packets are assumed to carry payload data; therefore, a successfully
delivered training packet contributes to the communication link's throughput. During training,
packets with known performance that exceeds π_min are also sent, which makes it possible to
maintain a PSR close to π_min. These packets shall be subsequently referred to as offsetting packets.
Therefore, in this case, the training packets' potential negative effect on the PSR is counterbalanced
by the offsetting packets' known performance. The RoTA's goal is to provide stable PSR
performance over a short-term horizon (i.e., during training).
Let π_l and π_u be the lower and upper bounds [5], respectively, of the estimated PSR for a given
method using confidence intervals. π_min is the minimum required PSR. C_min is the minimum
required rate (bits/s/Hz). π_O (with bounds π_O,l and π_O,u) is the PSR of the offsetting packets.
π_T (with bounds π_T,l and π_T,u) is the training packets' PSR. N_w is the training window length
in packets. N_T is the expected number of training packets within the training window. s is a
vector that indicates which of the last N_w packets were successful: s(j) = 1 if the jth packet was
successful, and s(j) = 0 otherwise. Finally, a_w = \sum_{j=2}^{N_w} s(j) is the number of successful
packets in the last N_w - 1 steps. The offsetting methods have π_O,l > π_min, and the training
methods satisfy π_T,u >= π_min > π_T,l. Let T and O be the sets containing the training and offsetting
methods, respectively, which are populated by going through all the N_K methods in the set K
and selecting the ones that meet the respective criteria for each set. Similar to the assumed search
pattern of Section III, the RoTA maintains a sorted list that is pruned when a better-performing
method is found, but the RoTA slows down experimentation to ensure a minimum short-term
PSR performance.
The RoTA is very straightforward: based on the last N_w successes and failures, training is
performed when a PSR of at least π_min is expected over the training window. If no offsetting
methods are available, then the only option is to enter a training loop until an offsetting method
is found. The RoTA finishes when no more training methods are available. In that situation, all
N_K methods are found to perform worse (in terms of PSR and/or spectral efficiency) than the
best-performing method currently known, which meets the minimum performance requirements
(π_min, C_min). The pseudocode of the RoTA is provided below.
Input: Set of methods K, π_min, C_min, and N_w
Result: Runs until T = ∅

 1: populate T and O (see their definitions)
 2: a_w ← 0
 3: s ← {}
 4: while T ≠ ∅ do
 5:   while O = ∅ do
 6:     find the training method with max{π_T,l}
 7:     for i = 1 to N_w do
 8:       use the training method
 9:       if π_T,u < π_min or π_T,l >= π_min then
10:         break for
11:       end if
12:     end for
13:     update T and O
14:   end while
15:   if T = ∅ then
16:     break while
17:   else
18:     pick an offsetting method with a high π_O,l
19:     randomly pick a training method
20:   end if
21:   for i = 1 to N_w do
22:     if a_w / N_w >= π_min then
23:       use the training method
24:       s(1 : N_w - 1) ← s(2 : N_w)
25:       if successful packet then
26:         s(N_w) ← 1
27:       else
28:         s(N_w) ← 0
29:       end if
30:       a_w ← \sum_{j=2}^{N_w} s(j)
31:       if π_T,u < π_min or π_T,l >= π_min then
32:         break for
33:       end if
34:     else
35:       use the offsetting method
36:       s(1 : N_w - 1) ← s(2 : N_w)
37:       if successful packet then
38:         s(N_w) ← 1
39:       else
40:         s(N_w) ← 0
41:       end if
42:       a_w ← \sum_{j=2}^{N_w} s(j)
43:       if π_O,l < π_min then
44:         break for
45:       end if
46:     end if
47:   end for
48:   update T and O
49: end while
The algorithm deserves a few comments. First, the operation in lines 7 and 21 is limited to
N_w iterations to offer an opportunity to avoid (if possible) methods with an arbitrarily high T_a or T_i.
Second, the loop breaks in lines 10 and 32 if the training method either satisfies or fails π_min.
On the other hand, the loop breaks in line 44 if an offsetting method no longer satisfies π_min.
Third, in line 22, the condition a_w / N_w >= π_min looks ahead to the next step and estimates the PSR
assuming that the next packet is dropped. If π_min is expected to be maintained, training is
allowed because the potentially negative effects of training are expected to be mitigated. Finally,
we assume that the channel is constant over the training window. The RoTA can operate in
varying channel conditions by repopulating, or remembering, T and O for previously experienced
channel conditions.
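The window-gating decision in lines 21-47 can be summarized in a few lines of Python. This is a simplified sketch of the gating logic only: it omits the confidence-interval updates and the pruning of T and O, and the function names, stand-in packet senders, and parameter values are illustrative assumptions.

    from collections import deque
    import random

    def run_window(send_training_packet, send_offsetting_packet, pi_min, N_w):
        """One pass of the RoTA window loop (lines 21-47, simplified):
        send a training packet only if the PSR target would still hold
        even if that packet were lost."""
        window = deque([0] * N_w, maxlen=N_w)        # empty history, as in lines 2-3 (a_w = 0)
        for _ in range(N_w):
            a_w = sum(list(window)[1:])              # successes in the last N_w - 1 slots
            if a_w / N_w >= pi_min:                  # look-ahead check of line 22
                success = send_training_packet()
            else:
                success = send_offsetting_packet()   # protect the short-term PSR
            window.append(1 if success else 0)

    # Illustrative stand-ins for the two packet types (their PSRs are assumptions)
    run_window(lambda: random.random() < 0.5,        # uncertain training method
               lambda: random.random() < 0.99,       # known-good offsetting method
               pi_min=0.9, N_w=50)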
To enable training, an offsetting method with π_O,l > π_min is required. The greater the difference,
the faster the training rate, because more failing packets can be tolerated. The minimum training
window that is expected to accommodate at least N_T training packets is given by

    N_w = ⌈ N_T (π_O,l - π_T,l) / (π_O,l - π_min) ⌉    (12)

Table I evaluates Eq. (12) for N_T = 1, π_T,l = 0, and fifty-five (π_min, π_O,l) pairs. A wider window
is required when π_O,l is close to π_min, i.e., if π_min << π_O,l, a narrower training window is
needed than when π_min ≈ π_O,l. For example, when π_O,l = 0.999 and π_min = 0.90, the training
window must be at least seven trials to accommodate one training packet, but if π_min = 0.95,
N_w must be three times longer.
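As a sanity check on Eq. (12), with illustrative values rather than those of Table I: requiring the expected PSR over the window, (N_T π_T,l + (N_w - N_T) π_O,l) / N_w, to be at least π_min and solving for N_w yields Eq. (12). For example, with π_O,l = 0.99, π_min = 0.90, π_T,l = 0, and N_T = 1,

    N_w = \left\lceil \frac{1 \cdot (0.99 - 0)}{0.99 - 0.90} \right\rceil = \lceil 11 \rceil = 11 \text{ packets.}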
V. RESULTS
We tested seven techniques (including variations) to evaluate the effect of distinct factors
on the trials needed to reach a certain level of performance. The factors considered were the
respective training parameter value (TPV), i.e., the training parameter of each technique (e.g., ε
for ε-greedy, γ for the Gittins index strategy, and so on); the number of admissible methods; the
methods' reward structure; and the use of confidence intervals. The following seven techniques
and variations were tested: ε-greedy I (classic); ε-greedy II with decreasing ε; ε-greedy III with
limited exploration; Boltzmann exploration; the Gittins index strategy with NRP; the Gittins
index strategy with BRP; and the RoTA.

Each of the techniques was tested varying the respective TPV. Techniques 1 to 3 were tested
with ε ∈ {0.001, 0.01, 0.1, 0.5}. Technique 4 was tested with T ∈ {10, 250, 500, 1000, 2000, 4000}.
Techniques 5 and 6 were tested with γ ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 0.99}, and technique 7 was tested with
N_w ∈ {25, 50, 100, 250, 500}. The parameters are listed in order of ascending effective exploration.

In addition, five test scenarios were used. The first test scenario assumed π_min = 0.9 and
R_k = 1 for all k. The first scenario tested the techniques' performance when all the methods have the
same potential reward but different actual rewards due to different values of the PSR. The second test
scenario was the same as the first one, but confidence intervals were not used to help reduce
the list of the potential methods during training, i.e., there was no target PSR, and, as a result,
the RoTA was not applicable in this scenario. The third and fourth scenarios were similar
to the first and second, respectively, but the methods had different potential rewards ranging
from 1 to 500.5 (R_k = 1 to 500.5). Finally, the fifth scenario initialized the mean estimates
to zero (π_min = 0.9, R_k = 1 to 500.5, and \bar{R}_k(1) = 0 for all k). In this scenario, only the first four
techniques were tested because the Gittins index strategy and the confidence intervals cannot be
employed when starting with zero observations. This scenario tested the effect of initialization
(prior knowledge) on the training techniques' performance.

The two sets of results are given in Tables II and III. First, Table II presents the total normalized
reward and PSR at 1000 and 15000 trials. The results are averaged over all values of N_A and
include all the techniques and parameters tested. Second, Table III tabulates the number of trials
required to reach a certain performance level, averaged over all parameters, for each of the
techniques for different values of N_A. In addition to the tables, Figure 6, like Figure 4, shows
the number of trials needed to reach 95% of the maximal performance versus the number
of admissible methods for the third scenario. Figure 7 shows the average normalized performance
versus the respective training parameter for each training technique in the order tabulated
above. Finally, Figure 8 depicts the average normalized performance versus the test scenario. The
PSR is equivalent to the average total normalized reward for the first two scenarios, which saves
space in the plot. The results in the figures are averaged over all the parameters not considered
in the figure, e.g., Figure 7 is averaged over all N_A and over scenarios one and three. We will discuss
these results in detail in the next subsections.
A. Robust Training Algorithm Discussion
We first discuss the results for the RoTA. Table III illustrates that, in the first test scenario,
near maximal performance is achieved very fast, e.g., only 382 and 697 trials are required
to reach 80% and 90% of the maximal performance, respectively, because all the methods have equal
rewards. Therefore, a relatively high-performing method is easily found. On the other hand, in
the third test scenario, about 60% of the maximal performance is reached quickly (139 trials). However,
significantly more trials are required (as many as 14,000 trials when N_A = 100) to reach near
maximal performance, especially when N_A is below 600, because most of the methods are
inadmissible with very low (near zero) performance and thus do not contribute significantly to
the link's operation. Table II shows that the RoTA outperforms the other methods in the first
1000 trials of the third-scenario evaluation, because the RoTA by design focuses on providing
a guaranteed PSR in the short term.

Figure 5 depicts a sample run of the RoTA in a simulated multi-antenna system [5] for a signal-to-noise
ratio (SNR) of 15 dB, π_min = 0.9, C_min = 0.1875 bits/s/Hz, and N_w = 50. Figure 5
illustrates that after 78 trials (vertical dividing line) using the RoTA, an offsetting method is found
and the PSR stays near the target π_min. At the same point, the number of methods remaining
to be explored drops significantly. At about 2250 trials, near maximal performance is reached.
Since nearly 2300 trials are required to learn the optimal communication method, the guaranteed
PSR the RoTA provides is extremely valuable. As shown in Figure 6, the RoTA requires more
trials than the other methods to reach the desired performance level. However, Figures 7 and 8
demonstrate that the RoTA has either the highest or one of the highest PSR levels. Thus, the
RoTA trades training time for guaranteed performance during training.

The speed of the RoTA can be significantly enhanced by using knowledge propagation techniques,
as we proposed in an earlier publication [5]. Specifically, when a method is evaluated, the
result can be used to enhance our knowledge about other methods. For example, when a method
with a high combined modulation/coding distance metric is found to be inadmissible, all else
being equal, a method with a lower distance metric is likely to be inadmissible and, therefore,
should not be considered. Likewise, SNR and other channel metrics that have a monotonic effect
on performance can be used to inform our understanding of performance at other values. As
a comparison, Figure 5's scenario was simulated using these knowledge propagation techniques;
an offsetting method was found in only 59 trials (compared to 78), and near maximal performance
was reached in just 800 trials (compared to 2250).
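A minimal sketch of this kind of knowledge propagation follows; the idea is simply that a metric with a monotonic effect on performance lets one observation prune many untried methods. The function name prune_by_monotonic_metric, the method labels, and the metric values are hypothetical illustrations, not the actual metric of [5].

    def prune_by_monotonic_metric(methods, observed_method, observed_admissible):
        """If a method is inadmissible, drop every untried method whose metric is
        no better (assuming the metric has a monotonic effect on performance)."""
        if observed_admissible:
            return methods                       # a success prunes nothing in this sketch
        threshold = methods[observed_method]
        return {m: v for m, v in methods.items() if v > threshold}

    # Hypothetical methods keyed by a combined modulation/coding distance metric
    methods = {"16QAM_r3_4": 1.0, "QPSK_r3_4": 2.0, "QPSK_r1_2": 3.0, "BPSK_r1_2": 4.0}

    # Observing that QPSK_r3_4 is inadmissible removes 16QAM_r3_4 from consideration too
    remaining = prune_by_monotonic_metric(methods, "QPSK_r3_4", observed_admissible=False)
    print(sorted(remaining))                     # -> ['BPSK_r1_2', 'QPSK_r1_2']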
B. Factors That Affect Training
An analysis of the results identifies three main factors that affect training performance: the
problem domain knowledge, the number of admissible methods, and the exploration rate.

1) Problem Domain Knowledge: The problem domain knowledge can be divided into three
subcategories: knowledge of each method's potential reward, statistical knowledge (confidence
intervals), and other domain knowledge, such as performance trends from communications theory.
The first two were extensively tested in this study, but the latter received minimal treatment, with
only one test.

First, we consider knowledge of each method's potential reward.^3 We observe that performance
is higher in the first scenario than in the third scenario because of the knowledge of each method's
potential reward (Figure 8). For example, in Table II, ε-greedy I with ε = 0.1 has a normalized
performance (reward) of 90% in the first scenario compared to 40% in the third scenario. Similarly,
from Table III, the Gittins index strategy requires 1963 trials to reach 80% of the maximal
performance in the third scenario compared to 196 trials in the first scenario. In the first scenario,
all methods have the same potential reward, and the training techniques equally consider all the
methods. On the other hand, in the third scenario the methods have different potential rewards,
and this difference causes the training techniques to focus more on the most promising methods,
improving PSR performance in all training techniques except the Gittins index strategy.

Second, statistical knowledge allows us to calculate confidence intervals on each method's potential
based on the collected observations. This knowledge allows methods with a low projected
performance to be removed from the training list, thus improving performance. Test scenarios
two and four do not use the confidence intervals during training; therefore, Boltzmann exploration
performs better in scenarios one and three than in scenarios two and four, respectively, as depicted
in Figure 8. The same trend is evident in Table III. Boltzmann exploration is affected the most
because of the probabilistic nature (see Eq. (8)) of its operation, in which all the methods are
likely to be selected when the parameter T is sufficiently high. The other training techniques
also enjoy improved performance by using confidence intervals, but Boltzmann exploration gains
the most. The RoTA is not tested in scenarios two and four because they offer no minimum
PSR requirement, and thus the RoTA is inapplicable.

Third, other domain knowledge, such as communication theory fundamentals, bit error rate
(BER), and capacity curves, although not fully applied in this study (i.e., used in only a single test
with the RoTA), can be used to limit the search space. (Other domain knowledge is discussed
in more detail in Section V-A and in our previous publication [5].)

^3 In communications systems, we generally know the maximum potential of each method, e.g., 4 bits/s/Hz.
2) The Number of Admissible Methods: An admissible method (i.e., a method that meets
a minimum performance requirement) can provide some service, albeit potentially suboptimal,
compared to an inadmissible method. A case in which the one admissible method must be found
out of many inadmissible ones would intuitively be expected to require more resources than the
case in which all methods are admissible.

The findings of Section III support this expectation, and Figure 6 (which is like Figure 4 of
Section III) shows the number of trials needed to find the best-performing method. The number
of trials needed is found to decrease as the number of admissible methods increases. Also, the
RoTA curve in Figure 6 is approximately a case between the first two curves in Figure 4, although
the RoTA curve lacks the points for N_A < 150. The RoTA has the curve most similar to Figure
4 because its search pattern is similar to the search pattern assumed in Section III. Also from
Figure 6, the Gittins index strategy is seen to be optimal, and the proposed RoTA approach pays
for performance guarantees with additional training time.

From the results of Table III, the number of admissible methods, N_A, is a strong indicator
of the performance during learning and the learning rate. In summary, the results demonstrate
that as long as most of the methods are admissible, near maximal performance is expected to
be reached in a relatively small number of trials. On the other hand, a very low N_A indicates
that significant effort may be required to reach near maximal performance.
3) The Exploration Rate: Table II indicates that an increasing exploration rate (see Section
IV) in many cases has a negative effect on the total reward due to higher exploration costs, e.g.,
ε-greedy II in test scenario 1. On the other hand, there are cases (ε-greedy II in test scenarios 2, 4,
and 5; the RoTA in test scenario 3; the Gittins index BRP in test scenario 2) where a higher exploration
rate yields higher performance. This effect is generally observed in cases with a more challenging
environment, like test scenarios two and four, which lack a minimum performance requirement
and in which confidence intervals cannot be used to eliminate methods with low performance. In
addition, in some cases, a rising exploration rate gives higher returns up to a certain level, and
then performance starts to suffer (Boltzmann exploration in test scenario 4). In brief, in terms
of aggregate throughput, a higher exploration rate generally hurts performance because more
time is spent on methods with uncertain performance. However, in some instances, a higher
exploration rate may lead to better performance.

Similar trends can also be observed in Figure 7, which shows a downward trend for ε-greedy
and Boltzmann exploration. On the other hand, the Gittins index strategy appears to be
practically flat, and the RoTA enjoys increased performance with increasing TPV. The RoTA's
performance improves because its TPV is the training window size (N_w), which allows for more
experimentation when N_w is larger, therefore allowing the RoTA to more quickly find better-performing
methods. However, this speed comes at the expense of the PSR, which appears to
be decreasing but is still above the target of 0.9.
C. Performance Summary of the Key Techniques
This section summarizes the findings from the techniques' perspective. First, the ε-greedy
techniques are very simple to implement and reach near-maximal performance relatively fast
in most, but not all, cases (i.e., they are not robust to the scenario). For this reason, they can be a
good alternative to more sophisticated techniques, such as the Gittins index strategy. Second,
Boltzmann exploration generally yields very good results, but it may encounter a case like
test scenario two (Figure 8) in which the performance is significantly lower than in the other
cases. This occurs because Boltzmann exploration fails to find a near-maximal method before
exploration stops (T → 0): there is almost always a probability that any method may be selected
during exploration, in contrast with the other training techniques, which tend to focus more on
the best-performing methods. The Gittins index strategy yields better results when all the methods
have equal potential performance (test scenarios one and two; Figure 8). In these scenarios, all that
is known is that the methods have different actual rewards.^4 The Gittins index strategy performs well
because it is designed to balance exploration and exploitation in a setting in which the methods
have the same potential performance but unknown actual performance. Finally, the RoTA is the
only technique that consistently maintains a minimum short-term PSR, albeit at the expense of
the average total return and the convergence speed. The appropriate selection will depend on the
operation's objective and the prior knowledge about the methods' potential reward distribution.
In summary, when the methods have equal rewards and a long horizon, the Gittins index strategy
should be used because it is designed to handle such situations. When the PSR during training
is a concern, the RoTA should be used. The RoTA will allow exploration only when it expects
that the PSR can be maintained at the target level. When discrete rewards exist (i.e., not all the
methods have equal rewards), we may use any of the Gittins index strategy, the ε-greedy techniques,
or Boltzmann exploration. However, the latter may be more computationally intensive to evaluate
(Eq. (8)) in a real system.

^4 The benefit of knowing each method's potential performance is absent because all methods have the same potential performance.
VI. SUMMARY AND CONCLUSIONS
In this study, we focused solely on training (i.e., learning) in the context of a cognitive engine,
which is the intelligent agent in a cognitive radio. First, we provided an overview of training.
Second, we analytically estimated the number of trials needed to conclusively find the best-performing
communication method in a list of methods sorted by their potential. The number
of trials required to conclusively find the best-performing method was found to increase linearly
with the number of methods with unacceptable performance (inadmissible methods). Third, we
proposed the Robust Training Algorithm (RoTA), which can facilitate training while maintaining
a minimum performance level in the short term, albeit at the expense of training speed.

Fourth, we tested four key training techniques in five different scenarios. Knowledge of the
methods' maximum potential performance was found to cause the CE to focus on the most
promising (and thus more risky) communication methods during training, causing an initial
period of low performance. We also found that confidence intervals can be used, when a minimum
performance target is set, for removing under-performing methods from the list, speeding up the
search process. Moreover, the number of acceptably-performing (admissible) methods had a
direct effect on performance.

Fifth, we verified our previously published conclusion [5] that the Gittins index strategy is
consistently one of the top-performing techniques for a long-term operating horizon. In fact,
this strategy performs best when the methods have equal potential performance. Nevertheless,
ε-greedy exploration was found to be a good alternative, and Boltzmann exploration is almost
as good as ε-greedy. Finally, the RoTA should be used when the PSR during training must be
maintained near a target level.

Finally, the effect of the factors that affect training (the problem domain knowledge, the number
of communication methods, and the exploration rate) was investigated.
REFERENCES
[1] A. He et al., A Survey of Articial Intelligence for Cognitive Radios, IEEE Transactions on Vehicular Technology,
vol. 59, no. 4, pp. 15781592, May 2010.
[2] T. W. Rondeau, Application of Articial Intelligence to Wireless Communications, Ph.D. dissertation, Virginia Tech,
2007.
[3] J. Mitola, III, Cognitive Radio: An Integrated Agent Architecture for Software Dened Radio, Ph.D. dissertation, The
Royal Institute of Technology (KTH), Stockholm , Sweden, May 2000.
[4] H. I. Volos and R. M. Buehrer, On Balancing Exploration vs. Exploitation in a Cognitive Engine for Multi-Antenna
Systems, in Proceedings of the IEEE Global Telecommunications Conference, Nov. 2009, pp. 16.
[5] H. I. Volos and R. M. Buehrer, Cognitive Engine Design for Link Adaptation: An Application to Multi-Antenna Systems,
IEEE Transactions on Wireless Communications, vol. 9, no. 9, pp. 29022913, Sept. 2010.
[6] H. I. Volos and R. M. Buehrer, Robust Training of a Link Adaptation Cognitive Engine, in IEEE Military Communications
Conference, Oct. 31Nov. 3. 2010, pp. 13181323.
[7] S. J. Russell and P. Norvig, Articial Intelligence: A Modern Approach. Upper Saddle River, NJ: Pearson Education,
2003.
[8] C. J. Rieser, Biologically Inspired Cognitive Radio Engine Model Utilizing Distributed Genetic Algorithms for Secure
and Robust Wireless Communications and Networking, Ph.D. dissertation, Virginia Tech, 2004.
[9] B. Le, T. W. Rondeau, and C. W. Bostian, Cognitive Radio Realities, Wiley Journal on Wireless Communications and
Mobile Computing, vol. 7, no. 9, pp. 10371048, 2007.
23
[10] A. He et al., "Development of a Case-Based Reasoning Cognitive Engine for IEEE 802.22 WRAN Applications," ACM Mobile Computing and Communications Review, vol. 13, no. 2, pp. 37-48, 2009.
[11] T. R. Newman et al., "Cognitive Engine Implementation for Wireless Multicarrier Transceivers," Wiley Journal on Wireless Communications and Mobile Computing, vol. 7, no. 9, pp. 1129-1142, 2007.
[12] Z. Zhao, S. Xu, S. Zheng, and J. Shang, "Cognitive Radio Adaptation Using Particle Swarm Optimization," Wireless Communications and Mobile Computing, vol. 9, no. 7, pp. 875-881, 2009.
[13] N. Zhao, S. Li, and Z. Wu, "Cognitive Radio Engine Design Based on Ant Colony Optimization," Wireless Personal Communications, pp. 1-10, 2011.
[14] C.-H. Jiang and R.-M. Weng, "Cognitive Engine with Dynamic Priority Resource Allocation for Wireless Networks," Wireless Personal Communications, vol. 63, no. 1, pp. 31-43, Mar. 2012.
[15] Y. Zhao, S. Mao, J. Reed, and Y. Huang, "Utility Function Selection for Streaming Videos with a Cognitive Engine Testbed," Mobile Networks and Applications, vol. 15, pp. 446-460, 2010.
[16] N. Baldo and M. Zorzi, "Learning and Adaptation in Cognitive Radios Using Neural Networks," in 5th IEEE Consumer Communications and Networking Conference, Jan. 2008, pp. 998-1003.
[17] C. Clancy, J. Hecker, E. Stuntebeck, and T. O'Shea, "Applications of Machine Learning to Cognitive Radio Networks," IEEE Wireless Communications, vol. 14, no. 4, pp. 47-52, Aug. 2007.
[18] A. MacKenzie et al., "Cognitive Radio and Networking Research at Virginia Tech," Proceedings of the IEEE, vol. 97, no. 4, pp. 660-688, 2009.
[19] N. Devroye, P. Mitran, and V. Tarokh, "Achievable Rates in Cognitive Radio Channels," IEEE Transactions on Information Theory, vol. 52, no. 5, pp. 1813-1827, May 2006.
[20] N. Devroye, P. Mitran, and V. Tarokh, "Limits on Communications in a Cognitive Radio Channel," IEEE Communications Magazine, vol. 44, no. 6, pp. 44-49, June 2006.
[21] S. Haykin, "Fundamental Issues in Cognitive Radio," in Cognitive Wireless Communication Networks, E. Hossain and V. K. Bhargava, Eds. New York: Springer, 2007, pp. 1-43.
[22] A. Sahai, R. Tandra, S. M. Mishra, and N. Hoven, "Fundamental Design Tradeoffs in Cognitive Radio Systems," in ACM First International Workshop on Technology and Policy for Accessing Spectrum, Aug. 2006.
[23] S. Sridharan and S. Vishwanath, "On the Capacity of a Class of MIMO Cognitive Radios," IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 1, pp. 103-117, Feb. 2008.
[24] S. Jafar and S. Shamai, "Degrees of Freedom of the MIMO X Channel," in IEEE Global Telecommunications Conference, Nov. 2007, pp. 1632-1636.
[25] G. Scutari, D. P. Palomar, and S. Barbarossa, "Cognitive MIMO Radio: A Competitive Optimality Design Based on Subspace Projections," IEEE Signal Processing Magazine, vol. 25, no. 6, pp. 49-59, 2008.
[26] S. Haykin, D. Thomson, and J. Reed, "Spectrum Sensing for Cognitive Radio," Proceedings of the IEEE, vol. 97, no. 5, pp. 849-877, May 2009.
[27] D. Datla, R. Rajbanshi, A. Wyglinski, and G. Minden, "An Adaptive Spectrum Sensing Architecture for Dynamic Spectrum Access Networks," IEEE Transactions on Wireless Communications, vol. 8, no. 8, pp. 4211-4219, Aug. 2009.
[28] R. W. Thomas, D. H. Friend, L. A. DaSilva, and A. B. MacKenzie, "Cognitive Networks: Adaptation and Learning to Achieve End-to-End Performance Objectives," IEEE Communications Magazine, vol. 44, no. 12, pp. 51-57, Dec. 2006.
[29] I. F. Akyildiz, W.-Y. Lee, M. C. Vuran, and S. Mohanty, "NeXt Generation/Dynamic Spectrum Access/Cognitive Radio Wireless Networks: A Survey," Computer Networks, vol. 50, no. 13, pp. 2127-2159, 2006.
[30] T. Clancy and N. Goergen, "Security in Cognitive Radio Networks: Threats and Mitigation," in 3rd International Conference on Cognitive Radio Oriented Wireless Networks and Communications, May 2008, pp. 1-8.
[31] A. He et al., "System Power Consumption Minimization for Multichannel Communications Using Cognitive Radio," in IEEE International Conference on Microwaves, Communications, Antennas and Electronics Systems, Nov. 2009, pp. 1-5.
[32] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley-Interscience, Nov. 2000.
[33] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, Aug. 2006.
[34] D. Saad, On-line Learning in Neural Networks. Cambridge, UK: Cambridge University Press, 1998.
[35] L. O'Callaghan et al., "Streaming-Data Algorithms for High-Quality Clustering," in Proceedings of the 18th International Conference on Data Engineering, 2002, pp. 685-694.
[36] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, Mar. 1998.
[37] J. Kemeny and J. Snell, Finite Markov Chains. New York: Springer-Verlag, 1976.
[38] M. J. Kearns and U. V. Vazirani, An Introduction to Computational Learning Theory. Cambridge, MA: MIT Press, 1994.
[39] J. C. Gittins, Multi-Armed Bandit Allocation Indices. Chichester: Wiley, 1989.
[40] D. Acuña and P. Schrater, "Bayesian Modeling of Human Sequential Decision-Making on the Multi-Armed Bandit Problem," in Proceedings of the 30th Annual Conference of the Cognitive Science Society, 2008.
Fig. 1: Markov model with uniform sampling probabilities
Fig. 2: Expected trials needed to find the top-performing method vs. the number of total methods (N_K). All methods are admissible (N_A = N_K, T_a = 400). Curves: uniform sampling (analysis), weighted sampling (analysis), uniform sampling (simulation), and weighted sampling (simulation).
Fig. 3: Expected normalized performance vs. trials (weighted sampling, N_K = 1000, T_i = T_a = 100). Horizontal axis: number of trials; vertical axis: number of admissible methods (N_A); normalized performance shown on a 0 to 1 scale.
Fig. 4: Trials needed to reach 95% of maximal performance vs. number of admissible methods (N_A), N_K = 1000. Curves: T_a = T_i = 100 with sorted admissibility order; T_a = 1000, T_i = 100 with sorted admissibility order; T_a = T_i = 100 with random admissibility order.
Fig. 5: Example training session with a simulated communication system (SNR = 15 dB, minimum required PSR = 0.9, C_min = 0.1875 bit/s/Hz, N_K = 110). Traces: attempted and achieved average throughput against the maximal average (bits/s/Hz); PSR and PSR training packets relative to the minimum PSR requirement; percentage of remaining methods. The annotation marks where the offsetting list becomes nonempty.
Fig. 6: Trials needed to reach 95% of maximal performance (reward) vs. number of admissible methods (N_A), N_K = 1000, test scenario 3. Curves: ε-greedy, Boltzmann exploration, Gittins index, and RoTA.
Fig. 7: Average normalized performance vs. respective training parameter value (TPV I through TPV VI). (Note: higher exploration is used as the graph moves from left to right.) The respective training parameters are ε for ε-greedy, the temperature T for Boltzmann exploration, the discount factor for the Gittins index, and N_w for RoTA. The figure is averaged over scenarios one and three for a total of 15,000 trials. Curves: reward and PSR for ε-greedy, Boltzmann exploration, Gittins index, and RoTA.
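
The training parameters in Fig. 7 are exploration knobs. As a minimal, hypothetical Python sketch (not the cognitive engine implementation of this paper; all names are illustrative), the snippet below shows how the two simplest of them act on a vector of estimated per-method rewards: ε is the probability of picking a random method under ε-greedy, and the temperature T flattens or sharpens the softmax used by Boltzmann exploration. The Gittins index and RoTA selection rules are omitted.

import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon):
    # Explore with probability epsilon, otherwise exploit the best-known method.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature):
    # Softmax selection: a higher temperature T spreads probability over more methods.
    q = np.asarray(q_values, dtype=float)
    logits = (q - q.max()) / temperature      # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q), p=probs))

# Hypothetical reward estimates for three communication methods.
q = [0.2, 0.9, 0.5]
print(epsilon_greedy(q, epsilon=0.1))   # usually selects method 1
print(boltzmann(q, temperature=0.25))   # mostly method 1; a larger T explores more

Sweeping ε or T upward corresponds to moving from left to right along the horizontal axis of Fig. 7, i.e., toward more exploration.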
Fig. 8: Average normalized performance vs. scenario (the PSR is equal to the reward for scenarios 1 and 2). Left group: average total normalized reward for scenarios 1 through 5; right group: average total PSR for scenarios 3 through 5. Bars: ε-greedy, Boltzmann exploration, Gittins index, and RoTA.
TABLE I: Minimum training window length (N_w), N_T = 1

Min.         Offsetting Method Lower Bound PSR (O_l)
PSR     0.55  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  0.999
0.50      11     6     5     4     3     3     3     3     3      3
0.55            12     7     5     4     4     3     3     3      3
0.60                  13     7     5     4     4     3     3      3
0.65                        14     8     6     5     4     4      3
0.70                              15     8     6     5     4      4
0.75                                    16     9     6     5      5
0.80                                          17     9     7      6
0.85                                                18    10      7
0.90                                                      19     11
0.95                                                              21
TABLE II: Average normalized performance (%) over all N_A. Rows: test scenarios 1 through 5, reported as average total normalized reward and average total PSR (a single entry represents both where noted), after 1,000 and after 15,000 trials. Columns: training techniques and their respective training parameters; ε-Greedy I, II, and III (ε = 0.001, 0.01, 0.1, 0.5), Boltzmann exploration (T = 10, 250, 500, 1000, 2000, 4000), Gittins index NRP and BRP (0.5, 0.6, 0.7, 0.8, 0.9, 0.99), and RoTA (N_w = 25, 50, 100, 250, 500).
TABLE III: Number of trials required to reach a performance (reward) (%) level. Rows: test scenarios 1 through 5, each at several values of the number of admissible methods (N_A). Columns: ε-Greedy, Boltzmann exploration, Gittins index, and RoTA at performance (reward) levels of 30, 60, 80, 90, and 95%. Marked entries indicate that the performance (reward) level was not reached after 15,000 trials.