
Accepted Manuscript

Multi-criteria active deep learning for image classification

Jin Yuan, Xingxing Hou, Yaoqiang Xiao, Da Cao, Weili Guan,


Liqiang Nie

PII: S0950-7051(19)30074-7
DOI: https://doi.org/10.1016/j.knosys.2019.02.013
Reference: KNOSYS 4673

To appear in: Knowledge-Based Systems

Received date : 9 June 2018


Revised date : 11 February 2019
Accepted date : 12 February 2019

Please cite this article as: J. Yuan, X. Hou, Y. Xiao et al., Multi-criteria active deep learning for
image classification, Knowledge-Based Systems (2019),
https://doi.org/10.1016/j.knosys.2019.02.013

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to
our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form.
Please note that during the production process errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.
Multi-Criteria Active Deep Learning for Image Classification

Jin Yuana,∗, Xingxing Houa , Yaoqiang Xiaoa , Da Caoa , Weili Guanc , Liqiang Nieb
a College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
b School of Computer Science and Technology, Shandong University, Shandong, China
c Hewlett Packard Enterprise, Singapore

Abstract
As a robust and heuristic technique in machine learning, active learning has been established as
an effective method for addressing large volumes of unlabeled data; it interactively queries users
(or certain information sources) to obtain desired outputs at new data points. With regard to
deep learning techniques (e.g., CNNs) and their applications (e.g., image classification), labeling
work is of great significance because the training process for estimating the parameters of a neural
network requires abundant labeled samples. Although a few active learning algorithms have been
proposed that devise certain straightforward sampling strategies (e.g., density, similarity, uncertainty,
and label-based measures) for deep learning algorithms, they employ a single sampling strategy and
do not consider the relationships among multiple sampling strategies.
To this end, we devised a novel solution, “multi-criteria active deep learning” (MCADL), to
learn an active learning strategy for deep neural networks in image classification. Our sample se-
lection strategy selects informative samples by considering multiple criteria simultaneously (i.e.,
density, similarity, uncertainty, and label-based measure). Moreover, our approach is capable of
adjusting weights adaptively to fuse the results from multiple criteria effectively by exploring the
utilities of the criteria at different training stages. Through extensive experiments on two popular
image datasets (i.e., MNIST and CIFAR-10), we demonstrate that our proposed method consis-
tently outperforms highly competitive active learning approaches; thereby, it can be verified that
our multi-criteria active learning proposal is rational and our solution is effective.
Keywords: Deep Neural Network; Active Learning; Multi-Criteria; Image Classification

1. Introduction

Over the last decade, deep neural networks have brought revolutionary progress in computer
vision [23, 29, 39], natural language processing [49, 25], and information retrieval applications [1, 2].
As a consequence, a variety of popular convolutional neural networks (CNNs) (e.g.,
AlexNet [18], VGGNet [43], and ResNet [12]) have been proposed by utilizing large annotated
datasets [17, 34]. However, with the increase in the number of layers in CNNs, more labeled
training samples are required owing to the growth of training parameters in neural networks.
Label acquisition for abundant unlabeled samples is a long, laborious and expensive process.

∗ Corresponding author.
Email address: yuanjin@hnu.edu.cn (Jin Yuan)
Preprint submitted to Knowledge-Based Systems February 11, 2019
Therefore, training a high-quality CNN model with minimal labeled data has attracted the attention
of the research community [8, 15, 31]. Meanwhile, active learning [4] offers an effective
approach to achieve this goal.
In active learning, the model is initially trained on a small labeled training set; then, a sam-
pling strategy or decision function is employed to determine which sample in the unlabeled pool
would be provided to users for labeling. Thereafter, the model is updated based on the new
training set. This process repeats until the unlabeled pool becomes empty or the performance
attains a certain threshold. The widely used sampling strategies in active learning include uncertainty [20],
diversity [22], error reduction [32], and variance reduction [37]. Although these
strategies are established to be effective in accelerating performance enhancement, the improvement
achieved using a single sampling strategy is limited. On the one hand, the algorithm learns
multiple class labels within a single structure, and these sampling strategies neglect the problem of
performance imbalance among classes. On the other hand, a single sampling strategy considers only
one criterion, and its utility decreases as training continues. Therefore, it is more reasonable to
integrate multiple criteria into a sample selection strategy.
To address these problems, this paper proposes a novel solution, “multi-criteria active deep
learning” (MCADL), to effectively accelerate the performance improvement in image classification
applications. MCADL introduces an AL algorithm and applies it to a CNN to enhance the
performance during image classification tasks. In our framework, the initial CNN model is con-
structed based on a few randomly selected samples. Thereafter, additional informative samples
are recommended for users’ labeling and then added to the training set to update the CNN model.
In each iteration, our approach measures the informativeness of the unlabeled samples under two
conditions — the labeled samples and the current model. For the first condition, we use den-
sity and similarity to measure the informativeness of unlabeled samples to reduce information
redundancy. For the second condition, the informativeness of unlabeled samples is calculated
according to the uncertainty and label-based measure. Uncertainty is utilized to accelerate the
convergence of the model. The label-based measure is employed to accelerate the performance
improvement as well as reduce the performance difference among classes, wherein it tends to
select two types of samples. The first type is samples from classes that have recently exhibited
rapid performance improvement; these samples have the potential to accelerate the performance
improvement of the CNN model. The second type is samples from classes that exhibit low
performance; these samples are used to achieve a performance balance among classes. The final
informativeness scores are calculated by fusing the results from these two parts, where the fusion
weights are automatically adjusted with respect to the performance change during the training
process. By conducting experiments on two popular image datasets (i.e., MNIST and CIFAR-10),
we demonstrate that our proposed approach yields significant gains compared to other
state-of-the-art methods.
The main contributions of this work are summarized as follows:
• We explore the potential yet challenging problem of applying active learning to deep neural
networks in order to address the problem of performance imbalance among classes. To our
knowledge, this is the first work attempting to integrate multiple criteria to accelerate the
performance improvement as well as achieve a performance balance among classes.
• To further explore the effectiveness of different criteria, we propose a dynamic fusion strat-
egy, where the fusion weights are automatically adjusted with respect to the performance
change among classes. Our proposed adaptive learning framework is evaluated by apply-
ing it to image classification.
• Extensive experiments performed on two image datasets demonstrate the effectiveness of
our proposed solution. Meanwhile, we have released datasets and our implementation to
facilitate the research community in further exploration1 .
The paper is organized as follows: In Section 2, we review related work. We then provide
an overview of our approach in Section 3. Next, in Section 4 we report our experimental re-
sults. Following this, the limitations and conclusions are delineated in Section 5 and Section 6,
respectively.

2. Related Work

In this section, we review related approaches to sampling strategies, which are categorized
into two groups: sampling strategies for traditional machine learning algorithms in Section 2.1
and sampling strategies for deep learning in Section 2.2.

2.1. Sampling Strategy for Traditional Machine Learning Algorithms


Active learning is aimed at selecting informative samples to be labeled in order to reduce
the labeling cost of achieving high performance [11, 22, 54, 30, 53, 55]. Most researchers focus on
the sampling strategy in active learning. The commonly used sampling strategy, “uncertainty
sampling”, is established to be effective for classification [42, 52, 46], retrieval [5, 47], image
segmentation [50, 40, 38], and recommender systems [33, 27, 26]. To measure uncertainty, the
most popular strategy, “entropy-based uncertainty sampling,” attempts to identify the samples
with the largest entropy values on the conditional probability distribution over all the labels [41].
However, this approach is easily affected by trivial labels. Considering this, Lewis and Catlett
proposed “least confidence-based uncertainty sampling” [21] to measure uncertainty by considering
only the label with the highest posterior probability. Although its calculation is simple,
this approach considers only the most probable label and omits the confidence distribution of
the other labels. As a compromise, Gilad-Bachrach et al. proposed “margin-based sampling” [9] to
simultaneously consider the first and second most probable class labels. Although uncertainty sampling
achieves promising performance, it does not consider the information redundancy among samples.
As a result, researchers are inclined toward incorporating diversity into uncertainty to select in-
formative samples. The diversity ensures that the selected samples are dissimilar; thus, it can
reduce the informative redundancy among the samples. For example, Li et al. [22] used mutual
information to measure the mutual dependence between a candidate and each of the other se-
lected samples so that the candidate is different from the other selected samples; Cardoso et al. [3]
measured the sample diversity by using four similarity measures: Chebyshev distance, Cosine
distance, Euclidean distance, and Manhattan distance. In [51], Yang et al. explored the multi-
class active learning problem and imposed the diversity constraint on the objective function to
select the most informative samples. Furthermore, a few researchers have proposed to exploit
unlabeled samples to minimize the generalization errors of classifiers. For example, Käding
et al. [15] presented the EMOC principle for active learning, which can actively select relevant
batches of unlabeled samples from a large number of random sets of fixed size that can minimize
the loss of the model. Ji et al. [14] proposed a variance minimization criterion in active learning
on graphs, which selects the nodes to be labeled such that the total variance of the distribution

1 https://github.com/houxingxing/Multi-Criteria-Active-Deep-Learning-for-Image-Classification

on the unlabeled data and the expected prediction error are minimized. A few others focus on
reducing the generalization error indirectly by minimizing the output variance [37, 8]. However,
these generalization error minimization approaches are typically computationally expensive.

2.2. Sampling Strategy for Deep Learning


Although the available AL approaches [42, 52, 46] have demonstrated remarkable results for
image classification, their classifiers/models are trained with hand-crafted features (e.g., HoG and
SIFT) on small-scale visual datasets. Recently, the sample selection strategy for deep learning
has attracted the attention of researchers [7, 44, 48, 56]. Compared to the traditional machine
learning approaches, a deep learning algorithm integrates feature extraction and classification;
moreover, it requires significantly more time to train as it includes a large number of parameters.
Furthermore, a deep learning algorithm simultaneously learns multiple class labels within a single structure
based on a large number of labeled samples. Because of this complex process, AL algorithms
need to consider multiple criteria to enhance the performance for image classification tasks. For
example, Stark et al. [44] applied uncertainty sampling to improve the performance of CNNs for
CAPTCHA recognition; here, the uncertainty value is equal to the ratio between the top two
probabilities. Gal et al. [7] proposed a new theoretical framework that casts the dropout layer in
deep neural networks as an approximate Bayesian inference to represent the model uncertainty
in deep learning. Apart from uncertainty, Zhou et al. [56] introduced diversity in the sampling
strategy and selected the most informative samples to accelerate the performance improvement
of CNNs during transfer learning. Sener et al. [36] defined active learning as a core-set selection
problem and attempted to select a subset from a completely labeled dataset such that the model
trained on the selected subset performs as closely as possible to the model trained on the entire
dataset. All the works above use only labeled samples to train CNN models; occasionally, the
labeled samples provided by the available AL approaches are insufficient for CNNs. In [48], Wang
et al. proposed a cost-effective active learning algorithm, where a set of pseudo-labeled samples
with high confidence scores is offered to train the CNN model without additional labor cost.
Although the methods above achieve certain success, they omit the problem of performance im-
balance, wherein the classification accuracies for different classes by a well-trained CNN vary
substantially. Moreover, the utility of a unique strategy will be exhausted as the training con-
tinues; therefore, incorporating multiple criteria and exploring their utilities during the training
stage are worthy of study.

3. Proposed Multi-Criteria Active Deep Learning

In this section, the details of our approach are presented. We introduce the framework of our
approach in Section 3.1 and then elaborate the details of the proposed sample selection strategy
in Section 3.2.

3.1. Framework
The performance of image classification can be significantly improved by increasing the
number of layers in a neural network. However, training a high performance deep neural network
model requires large quantities of labeled samples, and the process of obtaining labeled samples
is time consuming. Therefore, we take the initiative to apply the active learning algorithm in deep
learning, which can reduce the cost of manual labeling to a certain extent. Figure 1 shows the
framework of our proposed active learning algorithm. In the first round, the MCADL randomly
Figure 1: Multi-criteria active deep learning framework.

selects a set of samples from the unlabeled set for users’ labeling, and the initial CNN model is
learned based on the labeled samples. Here, the size of the initial sample set is manually set
in the experiments. In the remaining rounds, the CNN model is applied to evaluate the samples in
the unlabeled pool; then, our sampling strategy selects the most informative samples for users’
labeling from the unlabeled set by adaptively integrating the information according to the density,
similarity, uncertainty, and label-based measures. Subsequently, the CNN model is updated. This
process repeats until the unlabeled set is empty or the performance is satisfactory.
The core of MCADL is the multi-criteria sampling strategy. The above four sampling
criteria are divided into two groups to measure the informativeness of unlabeled samples; this
grouping corresponds to two conditions: “informativeness measure under the labeled samples”
and “informativeness measure under the existing model”. For the first
group, MCADL uses density and similarity measures to reduce information redundancy. For the
second group, MCADL obtains the information of unlabeled samples based on uncertainty and
the label-based measure. Uncertainty can speed up the convergence of the model, whereas the
label-based measure can speed up the performance improvement of the CNN model and reduce
the performance differences between classes. Finally, MCADL calculates the final informative-
ness of each unlabeled sample by fusing the results from the two groups; here, the fusion weights
are automatically adjusted according to the performance changes of the CNN model during the
training process. The selected informative samples are fed into the CNN model to update the
parameters for higher performance.

3.2. Informative Sample Selection


Given a CNN model $M$ as well as the labeled sample set $D_L$, informative sample selection
is aimed at offering an informative sample set $D_I$ from the unlabeled sample set $D_U$. We
define $\mathrm{Inf}(D_I \mid D_L, M)$ as the informativeness of the sample set $D_I$ under the conditions $D_L$ and
$M$. The informative sample selection strategy is aimed at identifying the sample set $D_I$ from $D_U$
that maximizes the value of $\mathrm{Inf}(D_I \mid D_L, M)$:
\[
\arg\max_{D_I} \mathrm{Inf}(D_I \mid D_L, M) \tag{1}
\]
Here, we assume that $D_L$ and $M$ independently affect the informativeness of $D_I$. Eq. 1 is then
transformed into
\[
\arg\max_{D_I} \; \alpha\,\mathrm{Inf}(D_I \mid D_L) + (1-\alpha)\,\mathrm{Inf}(D_I \mid M) \tag{2}
\]
where $\alpha$ is a weight to balance the utilities of $\mathrm{Inf}(D_I \mid D_L)$ and $\mathrm{Inf}(D_I \mid M)$. The first part, $\mathrm{Inf}(D_I \mid D_L)$,
defines the informativeness of $D_I$ under the labeled sample set $D_L$; the second part, $\mathrm{Inf}(D_I \mid M)$,
represents the informativeness of $D_I$ under the current model $M$.

3.2.1. Informativeness Measure Under Labeled Samples


Given a labeled sample set $D_L$, the informativeness of each sample $x_i$ in $D_I$ is calculated
based on two factors: density and similarity. Density represents the closeness of samples: a
high density indicates redundant information and marginal informativeness, and vice versa. We
calculate the informativeness based on density as follows:
\[
\mathrm{Inf}_{Den}(x_i \mid D_L) = 1 - \frac{1}{|D_L^s|} \sum_{x_j \in D_L^s} \mathrm{Cosdis}(x_i, x_j) \tag{3}
\]
where $s$ is the pseudo-label assigned to $x_i$ by the CNN model, $D_L^s$ is the set of samples from class $s$ in $D_L$,
and $\mathrm{Cosdis}(\cdot,\cdot)$ is the cosine distance metric [45]. Here, we use only the labeled samples with
the same class as $x_i$ to calculate the density informativeness. Similarity is used to ensure the
diversity of samples. This implies that $x_i$ should be different from the other samples in $D_L$. Thus,
the informativeness can be calculated based on similarity as follows:
\[
\mathrm{Inf}_{Simi}(x_i \mid D_L) = 1 - \max_{x_j \in D_L^s} \mathrm{Cosdis}(x_i, x_j) \tag{4}
\]
Finally, we combine these two parts with equal weights to calculate $\mathrm{Inf}(x_i \mid D_L)$:
\[
\mathrm{Inf}(x_i \mid D_L) = 0.5\,\mathrm{Inf}_{Den}(x_i \mid D_L) + 0.5\,\mathrm{Inf}_{Simi}(x_i \mid D_L) \tag{5}
\]
We sum the informativeness of all the samples in $D_I$ to obtain $\mathrm{Inf}(D_I \mid D_L)$. A high value of $\mathrm{Inf}(D_I \mid D_L)$
ensures that the samples in $D_I$ contain rich and non-repetitive information.
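To make these criteria concrete, the following is a minimal NumPy sketch of Eqs. 3–5; the function name, the use of deep features as sample representations, and the call to scikit-learn's cosine_distances are our own illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances  # plays the role of Cosdis(., .)

def labeled_condition_informativeness(x_i, pseudo_label, labeled_feats, labeled_classes):
    """Inf(x_i | D_L) from Eqs. 3-5: equal-weight fusion of density and similarity.

    x_i             -- 1-D feature vector of the unlabeled sample (e.g., CNN features)
    pseudo_label    -- class s predicted for x_i by the current CNN model
    labeled_feats   -- (n, d) feature matrix of the labeled set D_L
    labeled_classes -- (n,) array with the class label of each row of labeled_feats
    """
    same_class = labeled_feats[labeled_classes == pseudo_label]        # D_L^s
    if len(same_class) == 0:                                           # no labeled sample of class s yet
        return 1.0
    dists = cosine_distances(x_i.reshape(1, -1), same_class).ravel()
    inf_den = 1.0 - dists.mean()   # Eq. 3: one minus the mean cosine distance to D_L^s
    inf_simi = 1.0 - dists.max()   # Eq. 4: one minus the maximum cosine distance to D_L^s
    return 0.5 * inf_den + 0.5 * inf_simi                              # Eq. 5
```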

3.2.2. Informativeness Measure Under Existing Model

Given a CNN model $M$, the informativeness measure of $D_I$ is related to two factors. The first
factor is the uncertainty measure: uncertain samples provide ambiguous information that accelerates
the convergence of the CNN model [48]. The second factor is the label-based measure,
which is used to select the most effective labels and to prevent performance imbalance among
classes [13]. We combine these two parts as
\[
\mathrm{Inf}(x_i \mid M) = \beta\,\mathrm{Inf}_{Unc}(x_i \mid M) + (1-\beta)\,\mathrm{Inf}_{Lab}(x_i \mid M) \tag{6}
\]
where $\beta$ is a weight to balance the two parts. We sum the informativeness of all the samples in
$D_I$ as
\[
\mathrm{Inf}(D_I \mid M) = \sum_{i=1}^{N} \mathrm{Inf}(x_i \mid M) \tag{7}
\]
The common criteria for measuring uncertainty are entropy [41], margin sampling [35], and
least confidence [37]. Generally, the least confidence strategy considers only the most probable
label and omits the confidence distribution of the other labels; therefore, it is not well suited to the
multi-classification problem. In contrast, the entropy-based active learning strategy considers all
the labels; however, it is easily affected by trivial labels. As a compromise, the margin sampling
method overcomes the above problems by simultaneously considering the top two probable labels.
However, considering only the top two probable labels is somewhat inadequate, particularly when
the number of classes is large. Therefore, we modify the margin sampling strategy. Let $P(C_k \mid x_i, M)$
be the posterior probability of the unlabeled sample $x_i$ belonging to class $C_k$ under the model $M$.
$\mathrm{Inf}_{Unc}(x_i \mid M)$ can be calculated by the modified margin sampling strategy, which is expressed as
\[
\mathrm{Avg}_K(x_i \mid M) = \frac{1}{K} \sum_{k=1}^{K} P(C_k \mid x_i, M) \tag{8}
\]
\[
\mathrm{Inf}_{Unc}(x_i \mid M) = 1 - \frac{1}{K} \sum_{k=1}^{K} \bigl| P(C_k \mid x_i, M) - \mathrm{Avg}_K(x_i \mid M) \bigr| \tag{9}
\]
where $C_k$ is the label of the $k$-th most probable class and $K$ is the number of top probable classes
considered. $\mathrm{Avg}_K(x_i \mid M)$ is the mean value of the top $K$ probabilities. In the experiments, $K$ is
chosen as the smallest value for which the sum of the top $K$ probabilities marginally exceeds 0.5.
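As an illustration, the modified margin sampling score of Eqs. 8 and 9 can be computed from a single softmax output as in the NumPy sketch below; the choice of K as the smallest prefix of the sorted probabilities whose sum reaches 0.5 follows the description above, and the function name is ours.

```python
import numpy as np

def modified_margin_uncertainty(probs):
    """Inf_Unc(x_i | M) from Eqs. 8-9 for one softmax probability vector."""
    p = np.sort(np.asarray(probs))[::-1]             # posterior probabilities, descending
    k = int(np.searchsorted(np.cumsum(p), 0.5)) + 1  # smallest K whose top-K sum reaches 0.5
    top_k = p[:k]
    avg_k = top_k.mean()                             # Eq. 8
    return 1.0 - np.mean(np.abs(top_k - avg_k))      # Eq. 9
```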
The second part is called the “label-based measure”. It tends to select two types of samples. The
first type is samples from classes that have recently exhibited rapid performance improvement;
these samples have the potential to accelerate the performance improvement of the CNN model.
The second type is samples from classes that exhibit low performance; these samples are used to
achieve a performance balance among classes. Let $AR_t^m$ be the classification accuracy of the
$m$-th class by the CNN model $M$ in round $t$ on the validation set. Then, we assign a weight to the
$m$-th class as
\[
W_t^m =
\begin{cases}
\max\!\left(0,\ \dfrac{AR_t^m - AR_{t-1}^m}{Z_1}\right), & \min_m AR_t^m < b \\[2ex]
\dfrac{1/AR_t^m}{Z_2}, & \min_m AR_t^m \ge b
\end{cases}
\tag{10}
\]

where Z1 and Z2 are normalization factors, and b is a threshold value. We set b = 0.5 in the
experiments. In the beginning, our approach pays more attention to the samples from the classes
that have the fastest performance enhancement. We evaluate the performance enhancement of
each class by considering the performance difference between the t-th and (t −1)-th rounds on the
validation set. As the performance continues to improve, our approach tends to select samples
from the classes with low performance to balance the performance among classes.
Given an unlabeled sample $x_i$ in round $t$, we first determine its most similar sample $x_s$ in the
visual space of $D_L$ and then assign the label $s$ of $x_s$ to $x_i$ as its pseudo-label. Finally, the value of
$\mathrm{Inf}_{Lab}(x_i \mid M)$ is set to $W_t^s$:
\[
\mathrm{Inf}_{Lab}(x_i \mid M) = W_t^s \tag{11}
\]
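The class weights of Eq. 10 can be recomputed each round from the per-class validation accuracies; the sketch below is our own illustration and assumes the accuracies of rounds t and t-1 are available as NumPy arrays.

```python
import numpy as np

def label_based_weights(acc_t, acc_prev, b=0.5):
    """Per-class weights W_t^m from Eq. 10, given validation accuracies of rounds t and t-1."""
    acc_t, acc_prev = np.asarray(acc_t), np.asarray(acc_prev)
    if acc_t.min() < b:                      # some class is still weak: reward recently fast-improving classes
        raw = np.maximum(0.0, acc_t - acc_prev)
    else:                                    # all classes above threshold b: favor the low-performance classes
        raw = 1.0 / acc_t
    z = raw.sum()                            # normalization factor Z1 or Z2
    return raw / z if z > 0 else np.full(len(acc_t), 1.0 / len(acc_t))

# Eq. 11: an unlabeled sample x_i inherits the weight of the class s of its most similar labeled
# sample, i.e., inf_lab = weights[s].
```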
3.2.3. Weight Adjustment
In our approach, we dynamically adjust the values of $\alpha$ and $\beta$ in Eq. 2 and Eq. 6. At the beginning
of training, the CNN model exhibits low performance, and its predicted probabilities are not
trustworthy. Therefore, we rely on the metrics based on labeled samples, namely similarity and
density, and set a large value for the parameter $\alpha$ in the initial stage. As the training
proceeds, the performance of the CNN model continues to increase, and the probabilities predicted
by the CNN model become credible. Meanwhile, the effect of similarity and density begins
to diminish. As a result, we gradually reduce the value of $\alpha$. This process is expressed as
\[
\alpha = \alpha_{ini}\, e^{-AR_t} \tag{12}
\]
where $\alpha_{ini}$ is the initial value and $AR_t$ is the average classification accuracy on the validation
set in the $t$-th round.
The parameter $\beta$ is used to regulate the degree of emphasis between the uncertainty and the
label-based measure. In the early stages of training, the CNN model $M$ cannot accurately predict
labels; therefore, the label-based measure is considered untrustworthy. As a result, we focus more
on the uncertainty measure by assigning a large value to $\beta$ in order to accelerate the convergence
of the model. As the accuracy of the model increases and the convergence rate of the model
decreases, we focus on the validity of the labels. Therefore, we subsequently and gradually reduce
the value of $\beta$. We express this process as Eq. 13, where $\beta_{ini}$ is the initial value:
\[
\beta = \beta_{ini}\, e^{-AR_t} \tag{13}
\]
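For clarity, the schedules in Eqs. 12 and 13 are a simple exponential decay driven by the average validation accuracy; a short sketch with our own variable names is given below.

```python
import math

def adaptive_weights(avg_acc_t, alpha_ini=0.9, beta_ini=0.9):
    """Eqs. 12-13: both weights decay exponentially with the average validation accuracy AR_t."""
    decay = math.exp(-avg_acc_t)
    return alpha_ini * decay, beta_ini * decay

# Example: at AR_t = 0.3 both weights are about 0.67, while at AR_t = 0.9 they drop to about 0.37,
# shifting the emphasis from (density, similarity) toward the model-based part, and from
# uncertainty toward the label-based measure.
```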

3.2.4. Algorithm Implementation


Algorithm 1 summarizes the process of our algorithm. Our approach is aimed at selecting
an informative subset $D_I$ from the unlabeled set $D_U$. In each round $r$, our approach first calculates
the informativeness $\mathrm{Inf}(x_i \mid D_L, M)$ for each unlabeled sample $x_i$. We then select the top $N$
unlabeled samples with the largest informativeness values to construct $D_I$. Next, we add $D_I$ to $D_L$
to update the model. Finally, we evaluate the performance of the new model to update the
parameters $\alpha$ and $\beta$. This process repeats for $R$ rounds. In each round, the sample selection strategy
in Algorithm 1 first calculates an informativeness value for each unlabeled sample (time complexity
$O(|D_U|)$); then, it selects the top $N$ unlabeled samples with the largest informativeness values
(time complexity $O(N|D_U|)$). Therefore, the time complexity of our sample selection strategy in
Algorithm 1 is $O(N|D_U|)$.

4. Experiments

In this section, we conduct extensive experiments on two popular datasets to answer the
following four research questions:

RQ1. How does MCADL perform as compared with the other state-of-the-art competitors?
RQ2. Does MCADL using two parts in Eq. 2 outperform that using one part?
RQ3. Does our adaptive fusion outperform the fixed fusion method, where we fix the
values of α, β?
RQ4. How does the label-based measure affect the performance? Are the two parts in
Eq. 10 effective for achieving performance enhancement and imbalance?
Algorithm 1: Multi-criteria active deep learning algorithm
Input:
  The labeled set $D_L^r$ in round $r$
  The unlabeled set $D_U^r$ in round $r$
  The whole set $D = D_L^r \cup D_U^r$
  The initial set size $N_{ini}$
  The maximum number of rounds $R$
  The number of selected samples in each round $N$
  The initial values of the parameters $\alpha_{ini}$, $\beta_{ini}$
Output:
  The CNN model $M_R$.
1  $D_U^0 = D$, $D_L^0 = \{\}$
2  Randomly select $N_{ini}$ samples from $D_U^0$ and add them to $D_I$; then,
   $D_U^1 = D_U^0 - D_I$, $D_L^1 = D_L^0 + D_I$, $D_I = \{\}$.
3  Train $M_1$ by using $D_L^1$.
4  Calculate $\alpha$ and $\beta$ according to Eq. 12 and Eq. 13.
5  for $r = 1$ to $R$ do
6    for each sample $x_i$ in $D_U^r$ do
7      Calculate $\mathrm{Inf}(x_i \mid D_L^r)$ according to Eq. 5.
8      Calculate $\mathrm{Inf}_{Unc}(x_i \mid M_r)$ according to Eq. 9.
9      Calculate $\mathrm{Inf}_{Lab}(x_i \mid M_r)$ according to Eq. 11.
10     Calculate $\mathrm{Inf}(x_i \mid M_r)$ according to Eq. 6.
11     Calculate the informativeness value for $x_i$ as
       $\mathrm{Inf}(x_i \mid M_r, D_L^r) = \alpha\,\mathrm{Inf}(x_i \mid D_L^r) + (1-\alpha)\,\mathrm{Inf}(x_i \mid M_r)$
12   end
13   Add the top $N$ unlabeled samples $\{x_i\}_{i=1}^{N}$ with the largest informativeness values to $D_I$.
14   $D_U^{r+1} = D_U^r - D_I$, $D_L^{r+1} = D_L^r + D_I$, $D_I = \{\}$.
15   Train the model $M_{r+1}$ based on $D_L^{r+1}$.
16   Update the parameters $\alpha$, $\beta$ according to Eq. 12 and Eq. 13.
17 end
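A compact Python sketch of one selection round (steps 6–13 of Algorithm 1) is given below; it reuses the helper functions from the sketches in Section 3.2, and the array-based model interface is our own assumption about how the pieces fit together.

```python
import numpy as np

def select_informative_batch(unlabeled_feats, unlabeled_probs, pseudo_labels,
                             labeled_feats, labeled_classes, class_weights,
                             alpha, beta, n_select=128):
    """Score every unlabeled sample and return the indices of the top N most informative ones."""
    scores = []
    for x_i, probs, s in zip(unlabeled_feats, unlabeled_probs, pseudo_labels):
        inf_dl = labeled_condition_informativeness(x_i, s, labeled_feats, labeled_classes)  # Eq. 5
        inf_unc = modified_margin_uncertainty(probs)                                        # Eq. 9
        inf_lab = class_weights[s]                                                          # Eq. 11
        inf_m = beta * inf_unc + (1.0 - beta) * inf_lab                                     # Eq. 6
        scores.append(alpha * inf_dl + (1.0 - alpha) * inf_m)                               # step 11
    return np.argsort(scores)[::-1][:n_select]
```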

4.1. Experimental Settings


4.1.1. Experimental Setup and Dataset
We conducted experiments on two image datasets. The first dataset is MNIST [19], which
contains 10 classes with 6,000 training samples and 1,000 testing samples per class. In total, there
are 60,000 training samples and 10,000 testing samples, where each sample is a 28 × 28
gray-scale image. We randomly divided the training samples into two parts: the first part contains
50,000 training samples as the pooling set for users’ labeling to train the CNN model; the second
part contains 10,000 samples as the validation set used to calculate the parameters for our sample
selection algorithm. The second dataset is CIFAR-10 [17], which contains 10 classes with
5,000 training samples and 1,000 testing samples per class. In total, there are 50,000 training
samples and 10,000 testing samples, where each sample is a 32 × 32 color image. Similarly, we
randomly divided the training set into two parts: the pooling set with 40,000 samples and the
validation set with 10,000 samples. Detailed information is presented in Table 1.

Table 1: Data division on MNIST and CIFAR-10.

Dataset     Image size   Original training   Original testing   Experiment training   Experiment validation   Experiment testing
MNIST       28 × 28      60,000              10,000             50,000                10,000                  10,000
CIFAR-10    32 × 32      50,000              10,000             40,000                10,000                  10,000
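For reference, the split in Table 1 can be reproduced with a few lines; the use of keras.datasets and the random seed below are our own choices, not specified in the paper.

```python
import numpy as np
from tensorflow.keras.datasets import mnist

# Split the 60,000 MNIST training images into a 50,000-sample pooling set and a
# 10,000-sample validation set, as in Table 1.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
rng = np.random.default_rng(seed=0)        # arbitrary seed
perm = rng.permutation(len(x_train))
pool_idx, val_idx = perm[:50000], perm[50000:]
x_pool, y_pool = x_train[pool_idx], y_train[pool_idx]
x_val, y_val = x_train[val_idx], y_train[val_idx]
```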

Table 2: Configuration of CNN architecture on MNIST.

Layer Type Input Kernel Stride Output

data input 1 × 28 × 28 N/A N/A 1 × 28 × 28

conv1 convolution 1 × 28 × 28 3×3 1 32 × 26 × 26

conv2 convolution 32 × 26 × 26 3×3 1 64 × 24 × 24

pool3 max pooling 64 × 24 × 24 2×2 2 64 × 12 × 12

fc4 fully connected 64 × 12 × 12 1×1 1 128 × 1 × 1

fc5 fully connected 128 × 1 × 1 1×1 1 10 × 1 × 1

4.1.2. Network Architecture and Selection of Parameters


We designed two CNN architectures, one for MNIST and one for CIFAR-10, by considering
the differences between handwritten digits and natural objects. The MNIST dataset contains only
10 different digits, so we constructed a simple network structure, as presented in Table 2. This
CNN architecture contains only two convolution layers, one pooling layer, and two fully connected
layers. The CIFAR-10 dataset contains 10 semantic objects (e.g., “cat,” “airplane,” and “deer”).
We constructed a more complex CNN model by adding an additional convolution layer and an
additional pooling layer compared to the model used for the MNIST dataset. The details of this
CNN architecture are presented in Table 3.
We employed Keras to implement the CNN models, where mini-batch gradient descent [24]
was used to learn the parameters and Adam [16] was used to accelerate the network
convergence. All the experiments were conducted on a common desktop PC with an Intel i7
4.0 GHz CPU and a GTX 1080 GPU.
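To illustrate, a minimal Keras definition matching the MNIST architecture in Table 2 could look as follows; the activation functions, loss, and other training hyperparameters are not specified in the paper and are assumptions here.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mnist_cnn():
    """CNN of Table 2: two 3x3 convolutions, one 2x2 max pooling, and two dense layers."""
    model = keras.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),   # conv1: 32 x 26 x 26
        layers.Conv2D(64, (3, 3), activation="relu"),   # conv2: 64 x 24 x 24
        layers.MaxPooling2D((2, 2)),                    # pool3: 64 x 12 x 12
        layers.Flatten(),
        layers.Dense(128, activation="relu"),           # fc4
        layers.Dense(10, activation="softmax"),         # fc5: class probabilities
    ])
    # Adam [16] is used to accelerate convergence; the loss choice is an assumption.
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```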
As Algorithm 1 illustrates, we should assign initial values to the parameters. According
to the description in Section 3.2.3, the initial values $\alpha_{ini}$ and $\beta_{ini}$ were set to large values; herein, we set
$\alpha_{ini} = \beta_{ini} = 0.9$. The values of $\alpha$ and $\beta$ were then adaptively adjusted according to Eq. 12 and
Eq. 13. For the experiments on the MNIST dataset, we set $N_{ini} = 100$ and $R = 7$. On the CIFAR-10
dataset, we set $N_{ini} = 2000$ and $R = 30$. In each round, $N = 128$ samples were provided to users
for labeling to update the model.

4.1.3. Evaluation Metric


We use accuracy [28] to measure the performance. Accuracy is a commonly used metric for
evaluating the quality of an algorithm in image classification; it is a measure of the correctness
of the classifier as a whole. As Eq. 14 illustrates, the accuracy refers to the ratio of the number

Table 3: Configuration of CNN architecture on CIFAR-10.

Layer Type Input Kernel Stride/Pad Output

data input 3 × 32 × 32 N/A N/A 3 × 32 × 32

conv1 convolution 3 × 32 × 32 3×3 1/0 32 × 30 × 30

pool2 max pooling 32 × 30 × 30 2×2 2/0 32 × 15 × 15

conv3 convolution 32 × 15 × 15 3×3 1/2 64 × 15 × 15

conv4 convolution 64 × 15 × 15 3×3 1/0 64 × 13 × 13

pool5 max pooling 64 × 13 × 13 2×2 2/0 64 × 6 × 6

fc6 fully connected 64 × 6 × 6 1×1 1/0 512 × 1 × 1

fc7 fully connected 512 × 1 × 1 1×1 1/0 10 × 1 × 1

of correctly classified samples $N_c$ to the total number of samples $N_t$ for a specified test dataset:
\[
Accuracy = \frac{N_c}{N_t} \tag{14}
\]
Meanwhile, in order to measure the performance imbalance among classes by the CNN model,
we adopt the standard deviation [10], which measures the dispersion of a data distribution. A
smaller standard deviation implies less deviation of the values from the average, and vice versa.
We express it as
\[
StandardDeviation = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left(AR^m - AR_{avg}\right)^2} \tag{15}
\]
where $M$ is the number of classes, $AR_{avg}$ is the average accuracy over all the classes, and $AR^m$ is
the classification accuracy of the $m$-th class on the test set.
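Both metrics can be computed directly from per-sample predictions; the helper below is our own sketch and assumes every class appears in the test labels.

```python
import numpy as np

def accuracy_and_class_std(y_true, y_pred, num_classes=10):
    """Overall accuracy (Eq. 14) and the standard deviation of per-class accuracies (Eq. 15)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = np.mean(y_true == y_pred)                        # N_c / N_t
    per_class = np.array([np.mean(y_pred[y_true == m] == m)     # AR^m for each class m
                          for m in range(num_classes)])
    return accuracy, per_class.std()                            # population std over the M classes
```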

4.1.4. Baselines
To demonstrate the effectiveness of our method, we compared it with the following methods.

• Random (RD): This approach randomly selects 128 samples to be manually labeled in
each round.
• Entropy-based strategy (EP) [41]: The EP selects the 128 most uncertain samples mea-
sured by entropy to be labeled in each round.
• Margin sampling (MS) [35]: MS selects the 128 most uncertain samples measured by the
probability difference between the top two probable classes in each round.
• Least confidence method (LC) [37]: LC selects the 128 most uncertain samples, measured by
the confidence of the most probable class, in each round.
• Dropout Bayesian active learning by disagreement (Dropout BALD) [6, 7]: Dropout BALD
selects the 128 most uncertain samples measured by using dropout as a Bayesian approxi-
mation in each round.
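For reference, the three uncertainty-based baselines reduce to simple functions of the softmax output; the per-sample scores below (higher means more uncertain) are our own sketch of how such scores are typically computed.

```python
import numpy as np

def entropy_score(probs):            # EP [41]: Shannon entropy over all class probabilities
    p = np.clip(np.asarray(probs), 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def margin_score(probs):             # MS [35]: a small gap between the top two classes means high uncertainty
    top2 = np.sort(np.asarray(probs))[::-1][:2]
    return 1.0 - (top2[0] - top2[1])

def least_confidence_score(probs):   # LC [37]: a low maximum probability means high uncertainty
    return 1.0 - np.max(np.asarray(probs))
```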

Figure 2: Performance comparison between MCADL and state-of-the-art approaches. Each panel plots classification accuracy against the number of rounds for the RD, MS, EP, LC, Dropout BALD, and MCADL strategies on (a) MNIST and (b) CIFAR-10.

4.2. Performance Evaluation


4.2.1. Overall Performance Comparisons (RQ1)
To demonstrate the effectiveness of our proposed approach, we compare it with several state-
of-the-art approaches: 1) RD, 2) EP, 3) MS, 4) LC, and 5) Dropout BALD.
Figure 2 illustrates the performance comparison between our approach and the above five
methods on two datasets. From the experiment, we can draw the following conclusions:
• All the active learning algorithms listed above outperform the RD approach on both
datasets, particularly in the later rounds. This establishes the effectiveness of applying
active sample selection strategies to deep learning algorithms.
• MCADL achieves the highest performance among all the active learning approaches
listed above on both datasets. This is because a general active sample selection
strategy considers only one criterion, whereas MCADL integrates multiple criteria to select
informative samples. Moreover, MCADL adaptively adjusts weights to explore the utilities
of different criteria during the training stage; thus, it can enhance the performance faster.
• The performance of the MS approach is marginally better than those of the LC and EP
approaches. This demonstrates that the MS approach measures uncertainty more accurately.
As previously mentioned, the LC approach considers only one label; thus, it is unsuitable
for the multi-classification problem. With regard to the EP approach, although
it considers all the labels, it is easily affected by noise from trivial labels.

Table 4: Time complexity comparison between MCADL and state-of-the-art approaches.

Algorithm   RD         EP          MS          LC          Dropout BALD   MCADL
Cost        O(|D_U|)   O(N|D_U|)   O(N|D_U|)   O(N|D_U|)   O(|M|)         O(N|D_U|)

Table 4 presents the time complexity comparison between MCADL and the state-of-the-art
approaches; here, $N$ is the number of labeled samples in each round, $|D_U|$ is the number of samples
in the unlabeled set, and $|M|$ is the number of times the model is retrained in each round. RD is
the fastest among all the approaches, whereas our approach exhibits speed comparable to that of
the EP, MS, and LC approaches.

Figure 3: Performance comparison between the MCADL, “MCADL-α = 1”, and “MCADL-α = 0” approaches on the MNIST and CIFAR-10 datasets. Each panel plots classification accuracy against the number of rounds on (a) MNIST and (b) CIFAR-10.

4.2.2. Analysis of Combination Problem (RQ2)


In this experiment, we evaluate the effectiveness of the two parts in Eq. 2. We consider two extreme
cases: α = 1 and α = 0. In the first case, our approach calculates informativeness only under the
labeled samples (“MCADL-α = 1”); in the second case, it calculates informativeness only under
the existing model (“MCADL-α = 0”).
Figure 3 shows the results on both datasets. Apparently, MCADL integrating both parts
outperforms the variants using only one part. This indicates that the two factors in Eq. 2 are both
effective for selecting informative samples. Moreover, the performance of “MCADL-α = 1”
is higher than that of “MCADL-α = 0” on the CIFAR-10 dataset; the contrary is true on the
MNIST dataset. This is because the samples in the MNIST dataset are handwritten digits with
marginal visual differences; thus, the effectiveness of the “density” and “similarity” criteria is
limited.

4.2.3. Analysis of Weight Problem (RQ3)


In this experiment, we evaluate the effectiveness of our adaptive weight adjustment in
Eq. 12 and Eq. 13. We compare our adaptive weight strategy with the fixed weight strategy,
which adopts our approach albeit with fixed weights. We compare three settings:
MCADL-Fix1 (α = β = 0.1), MCADL-Fix2 (α = β = 0.5), and MCADL-Fix3 (α = β = 0.9).
Figure 4 illustrates the performance comparison results. MCADL outperforms the MCADL-Fix1,
MCADL-Fix2, and MCADL-Fix3 approaches on both the MNIST and CIFAR-10 datasets.
This demonstrates that the adaptive weight adjustment is more effective for selecting informative
samples than the fixed weight settings. Further exploration of the influence of α, β
reveals that the performance of MCADL-Fix3 on CIFAR-10 is marginally higher than those of
MCADL-Fix1 and MCADL-Fix2, particularly in the early rounds. This validates our assumption
in Section 3.2.3. In the early rounds, the trained CNN model is not trustworthy; thus,
setting large α, β in the initial rounds is more effective for selecting informative samples.
Figure 4: Performance comparison between MCADL and the MCADL-Fix1, MCADL-Fix2, and MCADL-Fix3 approaches on MNIST and CIFAR-10 datasets. Each panel plots classification accuracy against the number of rounds on (a) MNIST and (b) CIFAR-10.

Table 5: Performance comparison between MCADL, MCADL-Eq(10)Upper, and MCADL-Eq(10)Lower approaches on the CIFAR-10 dataset.

Round                                    5       10      15      20      25      30
Accuracy            MCADL-Eq(10)Upper    0.525   0.548   0.556   0.572   0.585   0.589
                    MCADL-Eq(10)Lower    0.521   0.533   0.544   0.566   0.579   0.581
                    MCADL                0.525   0.548   0.556   0.579   0.592   0.595
Standard Deviation  MCADL-Eq(10)Upper    0.137   0.128   0.114   0.148   0.133   0.156
                    MCADL-Eq(10)Lower    0.134   0.136   0.146   0.110   0.110   0.123
                    MCADL                0.137   0.128   0.114   0.134   0.130   0.130

On the MNIST dataset, digit recognition is a relatively easy task, and the trained CNN models
exhibit high performance in the early rounds. Therefore, the impact of setting different α, β is
marginal.

4.2.4. Analysis of Class Imbalance Problem (RQ4)


In this experiment, we evaluate the effectiveness of the label-based measure. The label-based
measure selects samples from two types of classes: the classes exhibiting rapid performance
improvement (the upper part in Eq. 10) and those exhibiting low performance (the lower part in
Eq. 10). To validate the utilities of these two parts, we revise our approach as follows:

• MCADL-Eq(10)Upper: MCADL uses only the upper part in Eq. 10.


• MCADL-Eq(10)Lower: MCADL uses only the lower part in Eq. 10.
Table 5 presents the performance comparison results on the CIFAR-10 dataset. Here, we do
not conduct experiments on the MNIST dataset, because the accuracies on all the classes are
significantly high and balanced; thus, the effectiveness of the label-based measure is concealed. In

the early rounds, MCADL and MCADL-Eq(10)Upper exhibit higher accuracy than MCADL-Eq(10)Lower.
This demonstrates that selecting samples from classes with rapid performance
improvement is effective in the early stage. In the subsequent rounds, MCADL changes its strategy
and achieves the highest accuracy. This indicates that it is crucial to solve the performance
imbalance problem once the overall performance has increased to a certain level. Moreover, we
use the standard deviation to measure the performance imbalance among classes. We observe
that the standard deviation values of MCADL-Eq(10)Lower are smaller than those of MCADL
and MCADL-Eq(10)Upper. However, this balanced performance is accompanied by a reduction
in overall performance. Compared to MCADL-Eq(10)Lower, MCADL achieves higher accuracy
with only a marginal increase in the standard deviation values.

5. Limitations

This paper has two limitations. First, although our approach has a theoretical time cost of $O(N|D_U|)$,
which is identical to that of the entropy-based strategy, the least confidence method, and
the margin sampling method, in practice it selects informative samples more slowly than those
approaches because it needs to calculate informativeness values from multiple criteria, which
increases the computation time. Second, our approach performs well when performance imbalance
among classes occurs during the training process; when the performance among classes is
relatively balanced, our approach has limited effect.

6. Conclusions and Future Work

In this paper, we presented a novel active learning algorithm for deep learning to reduce the
cost of manual labeling; to our knowledge, this is the first work that attempts to simultaneously
consider multiple criteria, including density, similarity, uncertainty, and the label-based measure. It
is capable of enhancing the prediction accuracy by adaptively adjusting the weights among multiple
criteria. The problem of performance imbalance among classes can thereby be alleviated, which
is particularly favorable to the overall performance. To validate the effectiveness and rationality
of our proposed approach, extensive experiments have been conducted on the MNIST and
CIFAR-10 datasets. The experimental results demonstrate that our proposed approach consistently
outperforms state-of-the-art active learning approaches.
Notwithstanding its effectiveness on imbalanced data, our proposed approach has limited
effect on balanced data. Moreover, we need to manually set a few hyperparameters to achieve
adequate performance. In the future, we plan to expand our work in two directions:
1) applying our proposed approach to larger and more unbalanced datasets; and 2) attempting to
construct a decision function that combines uncertainty sampling with a model optimization
method, which is likely to be a promising topic.

Acknowledgement

The authors are highly grateful to the anonymous referees for their careful reading and in-
sightful comments. The work is supported by the NSFC Grant 61502157.

References
[1] Da Cao, Xiangnan He, Liqiang Nie, Xiaochi Wei, Xia Hu, Shunxiang Wu, and Tat-Seng Chua. 2017. Cross-
Platform App Recommendation by Jointly Modeling Ratings and Texts. ACM Transactions on Information Systems
35, 4 (2017), 37.
[2] Da Cao, Liqiang Nie, Xiangnan He, Xiaochi Wei, Shunzhi Zhu, and Tat-Seng Chua. 2017. Embedding Factoriza-
tion Models for Jointly Recommending Items and User Generated Lists. In International Conference on Research
and Development in Information Retrieval. 585–594.
[3] Thiago NC Cardoso, Rodrigo M Silva, Sérgio Canuto, Mirella M Moro, and Marcos A Gonçalves. 2017. Ranked
Batch-mode Active Learning. Information Sciences 379 (2017), 313–337.
[4] David A Cohn, Zoubin Ghahramani, and Michael I Jordan. 1996. Active Learning with Statistical Models. Journal
of Artificial Intelligence Research 4, 1 (1996), 129–145.
[5] Begüm Demir and Lorenzo Bruzzone. 2015. A Novel Active Learning Method in Relevance Feedback for Content-
based Remote Sensing Image Retrieval. IEEE Transactions on Geoscience and Remote Sensing 53, 5 (2015),
2323–2334.
[6] Yarin Gal. 2016. Uncertainty in Deep Learning. Ph.D. Dissertation. University of Cambridge.
[7] Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty
in deep learning. In International Conference on Machine Learning. 1050–1059.
[8] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian Active Learning with Image Data. arXiv
preprint arXiv:1703.02910 (2017).
[9] Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. 2004. Margin based Feature Selection-theory and Algo-
rithms. In International Conference on Machine Learning. ACM, 43.
[10] Michiel Hazewinkel. 2013. Encyclopaedia of Mathematics: Volume 6: Subject IndexAuthor Index. Springer
Science & Business Media.
[11] Guoliang He, Yifei Li, and Wen Zhao. 2017. An Uncertainty and Density based Active Semi-supervised Learning
Scheme for Positive Unlabeled Multivariate Time Series Classification. Knowledge-Based Systems 124 (2017),
80–92.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition.
In International Conference on Computer Vision and Pattern Recognition. 770–778.
[13] Paulina Hensman and David Masko. 2015. The Impact of Imbalanced Training Data for Convolutional Neural
Networks. Degree Project in Computer Science, KTH Royal Institute of Technology (2015).
[14] Ming Ji and Jiawei Han. 2012. A Variance Minimization Criterion to Active Learning on Graphs. In Artificial
Intelligence and Statistics. 556–564.
[15] Christoph Käding, Erik Rodner, Alexander Freytag, and Joachim Denzler. 2016. Active and Continuous Explo-
ration with Deep Neural Networks and Expected Model Output Changes. arXiv preprint arXiv:1612.06129 (2016).
[16] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint
arXiv:1412.6980 (2014).
[17] Alex Krizhevsky and Geoffrey Hinton. 2009. Learning Multiple Layers of Features from Tiny Images. Technical
Report (2009).
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet Classification with Deep Convolutional
Neural Networks. In Advances in Neural Information Processing Systems. 1097–1105.
[19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document
recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[20] David D. Lewis and William A. Gale. 1994. A Sequential Algorithm for Training Text Classifiers. In International Conference on
Research and Development in Information Retrieval. 3–12.
[21] David D. Lewis and Jason Catlett. 1994. Heterogeneous Uncertainty Sampling for Supervised Learning. In Inter-
national Conference on Machine Learning. Morgan Kaufmann, 148–156.
[22] Xin Li and Yuhong Guo. 2013. Adaptive Active Learning for Image Classification. In International Conference on
Computer Vision and Pattern Recognition. 859–866.
[23] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully Convolutional Networks for Semantic Segmen-
tation. In International Conference on Computer Vision and Pattern Recognition. 3431–3440.
[24] Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, Quoc V Le, and Andrew Y Ng. 2011. On Optimiza-
tion Methods for Deep Learning. In International Conference on Machine Learning. 265–272.
[25] Liqiang Nie, Meng Wang, Luming Zhang, Shuicheng Yan, Bo Zhang, and Tat-Seng Chua. 2015. Disease inference
from health-related questions via sparse deep learning. IEEE Transactions on Knowledge and Data Engineering
27, 8 (2015), 2107–2119.
[26] Liqiang Nie, Xiaochi Wei, Dongxiang Zhang, Xiang Wang, Zhipeng Gao, and Yi Yang. 2017. Data-Driven Answer
Selection in Community QA Systems. IEEE Transactions on Knowledge and Data Engineering 29, 6 (2017),
1186–1198.
[27] Liqiang Nie, Luming Zhang, Yan Yan, Xiaojun Chang, Maofu Liu, and Ling Shaoling. 2017. Multiview physician-
specific attributes fusion for health seeking. IEEE Transactions on Cybernetics 47, 11 (2017), 3680–3691.
[28] David L Olson and Dursun Delen. 2008. Advanced Data Mining Techniques. Springer Science & Business Media.
[29] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time
object detection. In International Conference on Computer Vision and Pattern Recognition. 779–788.
[30] Oscar Reyes, Abdulrahman H Altalhi, and Sebastián Ventura. 2018. Statistical comparisons of active learning
strategies over multiple datasets. Knowledge-Based Systems 145 (2018), 274–288.
[31] Phill Kyu Rhee, Enkhbayar Erdenee, Shin Dong Kyun, Minhaz Uddin Ahmed, and Songguo Jin. 2017. Active and
Semi-Supervised Learning for Object Detection with Imperfect Data. Cognitive Systems Research (2017).
[32] N. Roy and A. McCallum. 2001. Toward Optimal Active Learning through Sampling Estimation of Error Reduc-
tion. In International Conference on Machine Learning. 441–448.
[33] Neil Rubens, Dain Kaplan, and Masashi Sugiyama. 2011. Active Learning in Recommender Systems. (2011),
735–767 pages.
[34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej
Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision 115, 3 (2015), 211–252.
[35] Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active Hidden Markov Models for Information
Extraction. In International Symposium on Intelligent Data Analysis. Springer, 309–318.
[36] Ozan Sener and Silvio Savarese. 2018. Active Learning for Convolutional Neural Networks: A Core-Set Approach.
In International Conference on Learning Representations.
[37] Burr Settles. 2010. Active learning Literature Survey. University of Wisconsin, Madison 52, 55-66 (2010), 11.
[38] Ronghua Shang, Pingping Tian, Licheng Jiao, Rustam Stolkin, Jie Feng, Biao Hou, and Xiangrong Zhang. 2016. A
spatial fuzzy clustering algorithm with kernel metric based on immune clone for SAR image segmentation. IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9, 4 (2016), 1640–1652.
[39] Ronghua Shang, Jiaming Wang, Licheng Jiao, Rustam Stolkin, Biao Hou, and Yangyang Li. 2018. SAR Targets
Classification Based on Deep Memory Convolution Neural Networks and Transfer Parameters. IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing 11, 8 (2018), 2834–2846.
[40] Ronghua Shang, Yijing Yuan, Licheng Jiao, Biao Hou, Amir Masoud Ghalamzan Esfahani, and Rustam Stolkin.
2017. A Fast Algorithm for SAR Image Segmentation Based on Key Pixels. IEEE Journal of Selected Topics in
Applied Earth Observations and Remote Sensing 10, 12 (2017), 5657–5673.
[41] Claude E Shannon. 2001. A Mathematical Theory of Communication. ACM SIGMOBILE Mobile Computing and
Communications Review 5, 1 (2001), 3–55.
[42] Sheng-Jun Huang, Jia-Lve Chen, Xin Mu, and Zhi-Hua Zhou. 2017. Cost-Effective Active Learning from Diverse
Labelers. In International Conference on Artificial Intelligence. 1879–1885.
[43] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-scale Image Recog-
nition. arXiv preprint arXiv:1409.1556 (2014).
[44] Fabian Stark, Caner Hazırbas, Rudolph Triebel, and Daniel Cremers. 2015. Captcha Recognition with Active Deep
Learning. In Workshop New Challenges in Neural Computation. Citeseer, 94.
[45] Pang-Ning Tan et al. 2006. Introduction to data mining. Pearson Education India.
[46] Van Cuong Tran, Ngoc Thanh Nguyen, Hamido Fujita, Dinh Tuyen Hoang, and Dosam Hwang. 2017. A combi-
nation of active learning and self-learning for named entity recognition on Twitter using conditional random fields.
Knowledge-Based Systems 132 (2017), 179–187.
[47] Sudheendra Vijayanarasimhan and Kristen Grauman. 2014. Large-scale Live Active Learning: Training Object
Detectors with Crawled Data and Crowds. International Journal of Computer Vision 108, 1-2 (2014), 97–114.
[48] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. 2016. Cost-effective Active Learning for Deep
Image Classification. IEEE Transactions on Circuits and Systems for Video Technology 27 (2016), 2591–2600.
[49] Xiaochi Wei, Heyan Huang, Liqiang Nie, Hanwang Zhang, Xian-Ling Mao, and Tat-Seng Chua. 2017. I Know
What You Want to Express: Sentence Element Inference by Incorporating External Knowledge Base. IEEE Trans-
actions on Knowledge and Data Engineering 29, 2 (2017), 344–358.
[50] Lin Yang, Yizhe Zhang, Jianxu Chen, Siyuan Zhang, and Danny Z Chen. 2017. Suggestive annotation: A deep
active learning framework for biomedical image segmentation. In International Conference on Medical Image
Computing and Computer-Assisted Intervention. Springer, 399–407.
[51] Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G Hauptmann. 2015. Multi-class Active
Learning by Uncertainty Sampling with Diversity Maximization. International Journal of Computer Vision 113, 2
(2015), 113–127.
[52] Zhipeng Ye, Peng Liu, Jiafeng Liu, Xianglong Tang, and Wei Zhao. 2016. Practice Makes Perfect: An Adaptive
Active Learning Framework for Image Classification. Neurocomputing 196 (2016), 95–106.
[53] Lili Yin, Huangang Wang, and Wenhui Fan. 2018. Active learning based support vector data description method
for robust novelty detection. Knowledge-Based Systems 153 (2018), 40–52.
[54] Jin Yuan, Xiangdong Zhou, Junqi Zhang, Mei Wang, Qi Zhang, Wei Wang, and Baile Shi. 2007. Positive sample
enhanced angle-diversity active learning for SVM based image retrieval. In International Conference on Multime-
dia and Expo. 2202–2205.
[55] Qi Zhou, Yan Wang, Ping Jiang, Xinyu Shao, Seung-Kyum Choi, Jiexiang Hu, Longchao Cao, and Xiangzheng
Meng. 2017. An active learning radial basis function modeling method based on self-organization maps for
simulation-based design problems. Knowledge-Based Systems 131 (2017), 10–27.
[56] Zongwei Zhou, Jae Shin, Lei Zhang, Suryakanth Gurudu, Michael Gotway, and Jianming Liang. 2017. Fine-Tuning
Convolutional Neural Networks for Biomedical Image Analysis: Actively and Incrementally. In International Con-
ference on Computer Vision and Pattern Recognition.

