
Pattern Recognition 67 (2017) 97–109


Hierarchical learning of multi-task sparse metrics for large-scale image classification

Yu Zheng a, Jianping Fan b, Ji Zhang c, Xinbo Gao a,∗

a Department of Electronic Engineering, Xidian University, Xi'an 710071, PR China
b Department of Computer Science, University of North Carolina, Charlotte, NC 28223, USA
c Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, PR China

∗ Corresponding author. E-mail addresses: yuzheng.xidian@gmail.com (Y. Zheng), jfan@uncc.edu (J. Fan), zhang_ji@stu.xjtu.edu.cn (J. Zhang), xbgao@mail.xidian.edu.cn, xbgao.xidian@gmail.com (X. Gao).

ARTICLE INFO

Article history:
Received 20 July 2016
Revised 18 December 2016
Accepted 24 January 2017
Available online 3 February 2017

Keywords:
Hierarchical multi-task sparse metric learning
Visual tree
Large-scale image classification

ABSTRACT

In this paper, a novel approach is developed to learn a tree of multi-task sparse metrics hierarchically over a visual tree to achieve a fast solution to large-scale image classification, where an enhanced visual tree is first learned to organize large numbers of image categories hierarchically in a coarse-to-fine fashion. Over the visual tree, a tree of multi-task sparse metrics is learned hierarchically by: (a) performing multi-task sparse metric learning over the sibling child nodes under the same parent node to explicitly separate their commonly-shared metric from their node-specific metrics; and (b) propagating the node-specific metric for the parent node to its sibling child nodes (at the next level of the visual tree), so that more discriminative metrics can be learned for controlling inter-level error propagation effectively. We have evaluated our hierarchical multi-task sparse metric learning algorithm over three different image sets and the experimental results demonstrated that our hierarchical multi-task sparse metric learning algorithm can obtain better performance than the state-of-the-art algorithms on large-scale image classification.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Large-scale image classification (i.e., classifying millions of images into thousands or even tens of thousands of categories) is a fundamental research issue for the communities of computer vision, machine learning and multimedia computing, and it has received extensive attention recently [1–4]. In spite of recent significant progress on recognizing hundreds of image categories, large-scale image classification is still a challenging task because: (1) distinguishing large numbers of image categories (i.e., thousands or even tens of thousands of image categories) is inherently more complex than distinguishing among just a few, thus the accuracy rates of the state-of-the-art methods are usually very low; (2) the computational cost at test time is a critical issue (i.e., the computational cost of a flat approach grows linearly with the number of image categories, which is unacceptable for large-scale image classification applications); (3) some image categories may have strong inter-category visual correlations, and it does not make any sense to ignore their strong inter-category visual correlations completely and learn their inter-related classifiers independently.

One promising way to address these issues is to organize large numbers of image categories hierarchically through a tree structure by exploiting their inter-category correlations. Some previous works [2,3,5] have leveraged semantic ontologies (taxonomies) to organize large numbers of image categories hierarchically according to their inter-category semantic relationships. On the other hand, it is very attractive to learn a visual hierarchy directly from large amounts of training images, and some research [6–8] has recently been done on learning label trees to organize large numbers of image categories hierarchically according to their inter-category visual correlations. To learn a label tree, a confusion matrix for N image categories is first obtained from N one-versus-rest (OVR) binary classifiers. Such an approach to label tree learning may seriously suffer from two problems: (a) huge computational cost: it could be very expensive to learn N OVR binary classifiers independently, especially when N is very large (N is typically very large for large-scale image classification); (b) huge sample imbalance: for a given image category (positive class), the negative samples from the other N − 1 image categories (negative classes) heavily outnumber its positive samples, thus huge numbers of negative samples may easily mislead the training of the OVR binary classifiers and result in incorrect OVR binary classifiers (with very low accuracy rates), which may further produce an incorrect confusion matrix for label tree construction. Therefore, other researchers have proposed visual tree learning [9,10] to organize large numbers of image categories hierarchically. Unlike label tree learning, which requires a pre-calculated confusion matrix, visual tree learning can be achieved by using the inter-category visual similarities directly, which may be able to deal with the issues of huge computational cost and huge sample imbalance effectively. However, one open problem of visual tree learning is how to provide more effective representations of image categories (which should be able to cover huge intra-category visual diversity sufficiently) and how to characterize the inter-category visual similarities accurately. Based on these observations, it is very attractive to develop new algorithms that are able to characterize the inter-category visual similarities accurately while reducing the computational cost dramatically.

It is also worth noting that the performance of all the algorithms for image classification crucially depends on the discrimination ability of the underlying metrics for similarity characterization. For example, the distance metric plays an important role in the performance of kNN (k-nearest neighbor) classifiers, and the performance of kernel-based algorithms (such as SVMs) can be improved significantly by adopting a proper metric to achieve more accurate similarity characterization [11–16]. Recently, much research has focused on learning large-margin metrics to achieve more accurate similarity characterization [17–20]. Even though such large-margin metric learning algorithms have provided a good solution for enhancing the discrimination power of the metrics, they are not scalable to large-scale image classification applications due to their huge computational complexity.

Fig. 1. The flowchart of our hierarchical multi-task sparse metric learning algorithm.

Based on these observations, a hierarchical multi-task sparse metric learning algorithm is developed to learn a tree of multi-task sparse metrics over an enhanced visual tree to achieve more accurate similarity characterization and support a more effective solution to large-scale image classification. As shown in Fig. 1, our hierarchical multi-task sparse metric learning algorithm contains three key components: (a) An enhanced visual tree is first learned to organize large numbers of image categories hierarchically, and such an enhanced visual tree can provide a good environment to automatically identify the inter-related tasks for multi-task sparse metric learning, e.g., the tasks for learning the large-margin metrics for the sibling child nodes under the same parent node are strongly inter-related; (b) For the root node at the first level of the visual tree, a multi-task sparse metric learning algorithm is developed to learn a multi-task sparse metric for separating its sibling child nodes more accurately; (c) For the non-root nodes of the visual tree, both the inter-node visual correlations (among the sibling non-root nodes under the same parent node) and the inter-level visual correlations (between the parent node and its sibling child nodes at the next level of the visual tree) are leveraged to learn a tree of multi-task sparse metrics, so that more discriminative metrics can be learned for controlling inter-level error propagation effectively.

The rest of the paper is organized as follows. A brief review of related work is presented in Section 2. In Section 3, we introduce our work on constructing the enhanced visual tree. In Section 4, we present our work on hierarchical learning of a tree of multi-task sparse metrics over the visual tree. Section 5 demonstrates our experimental results, followed by conclusions.

2. Related work

Recently, metric learning [15–17,19,21–23] has received extensive attention in the communities of computer vision, machine learning and multimedia computing, especially Mahalanobis metric learning. Weinberger et al. have developed a metric learning approach to support nearest neighbor classification (LMNN) [19], which aims to learn a Mahalanobis distance by forcing the samples from the same class to be closer while the samples from different classes are pushed far away. Parameswaran et al. have extended the LMNN approach to a multi-task setting and proposed a multi-task metric learning approach [24]. The multi-task metric learning approach aims to learn one common metric shared among multiple inter-related tasks and multiple task-specific metrics simultaneously. By separating the commonly-shared metric from the task-specific metrics explicitly, the multi-task metric learning approach can obtain more discriminative metrics and achieve higher classification accuracy rates.

The problem of hierarchical classification [2,8,9,25,26] has been investigated extensively in the past decades. Hierarchical classification approaches can significantly reduce the computational cost at test time by organizing large numbers of image categories hierarchically in a coarse-to-fine fashion [10]. Some researchers have integrated the semantic ontology to support hierarchical classifier training [2,27,28]. Because the feature space is the common space for classifier training and image classification, it is more attractive to learn a visual hierarchy to organize large numbers of image categories directly in the feature space. The visual tree [9,10] is developed to organize large numbers of image categories by characterizing the inter-category visual similarities directly, where mean features are employed for category representations and the inter-category visual similarities are simply characterized by the distances between the mean features. Because of huge intra-category visual diversity, such mean features may not have sufficient capacity for category representation (i.e., such mean features may not be able to cover the huge intra-category visual diversity sufficiently). In addition, the hierarchical approaches may seriously suffer from the problem of inter-level error propagation (i.e., misclassification at a high-level node will pass to its child nodes and propagate along the tree structure until the leaf nodes).

To control the inter-level error propagation, some researchers have used a hierarchy to regularize the child nodes towards their parent node [10,29,30]. Xiao et al. have developed an interesting approach by adding an orthogonal regularization term into the objective function for hierarchical classifier training [30], so that the classifier learned for the current node could be orthogonal to the classifier for its parent node. It is worth noting that orthogonality cannot guarantee to prevent inter-level error propagation effectively, because the orthogonal classifiers cannot ensure that: (a) the classifier for the parent node does not make mistakes; and (b) the mistakes of the parent node are not propagated to its child nodes.

Rather than enforcing orthogonality between the classifiers for the parent node and its child nodes, some interesting approaches have been developed to learn a tree of disjoint metrics by adding a trace-norm regularization term into the objective function [21,31,32], so that the classifiers at different levels can select disjoint metrics or feature subsets. By focusing on the inter-level (parent-child) correlations, such algorithms can learn disjoint and sparse metrics for different nodes at different levels of a pre-defined tree structure, but the inter-node correlations among the sibling child nodes under the same parent node are completely ignored. It is worth noting that the discrimination power of the metrics may largely depend on their ability to discriminate the sibling child nodes under the same parent node. A good solution to hierarchical classifier training has to accomplish two inter-related requirements simultaneously: (a) making better inter-level (parent-child) decisions along the tree structure, so that the inter-level error propagation can be controlled effectively; and (b) achieving more accurate separation of the sibling child nodes under the same parent node, so that the corresponding node classifiers (especially the classifiers for high-level nodes) make fewer misclassification decisions.

3. Visual tree learning

Because the feature space is the common space for classifier training and image classification, it is more attractive for us to learn a hierarchical tree structure directly in the feature space. In this section, we are interested in learning an enhanced visual tree. Given M image categories, we learn a visual tree VT = (V, E), which comprises a set of tree nodes V and a set of edges E. Our algorithm for visual tree learning contains two key steps: (a) Active Sampling for Category Representation: One open problem of visual tree learning is how to extract an effective representation for each image category and how to characterize the inter-category visual similarities more accurately. For two given image categories, one straightforward way of characterizing their inter-category visual similarity is to compute the pairwise distances among all their relevant training samples. But the computational cost of such a pairwise sample-oriented method is very heavy because the number of training samples could be very large. Some researchers employ mean features for category representations, and the inter-category visual similarities are simply calculated by using the distances between the mean features. Because of huge intra-category visual diversity, such mean features may not be able to cover the intra-category visual diversity sufficiently, and thus they are ineffective for category representation. Based on this understanding, we utilize active sampling to find multiple representative samples for each image category; using multiple representative samples for category representation allows us to cover the intra-category visual diversity at a certain sufficiency level while significantly reducing the computational cost for calculating the inter-category visual similarities. It is worth noting that Yen et al. proposed an advanced scalable exemplar clustering method [33], which can also select a representative subset for each image category. (b) Hierarchical Affinity Propagation Clustering for Node Partitioning: A top-down approach to hierarchical affinity propagation (AP) clustering is developed for visual tree learning. Such a hierarchical AP clustering process for visual tree construction starts from the root node (that contains all the image categories) and ends at the leaf nodes (each leaf node contains only one single image category).

3.1. Active sampling for category representation

Inspired by a recent work [34], we utilize active sampling to select multiple informative samples for category representation. In order to select the most informative samples for covering the intra-category visual diversity at a certain sufficiency level, active sampling considers two criteria: (a) representativeness: the samples in high-density regions should be selected; (b) diversity: the samples with high diversity should be selected. For a sample xi ∈ X, its representativeness is defined as:

R(x_i) = \frac{1}{|N_i|} \sum_{j \in N_i} \exp\left(-\|x_i - x_j\|_2^2 / 2\sigma_R^2\right)    (1)

where R(xi) is the representativeness of xi and Ni is the set of neighbors of xi. A Gaussian distance is employed to measure the distance between xi and the remaining samples, and the 2·ns/n − 1 samples with the smallest distances are selected as the neighbor set, where ns is the number of training samples and n is the number of active samples [34]. σR is the bandwidth of the Gaussian kernel.

The diversity is defined as:

D(x_i) = \min_{x_j \in S} \left[-\exp\left(-\|x_i - x_j\|_2^2 / 2\sigma_R^2\right)\right]    (2)

where D(xi) is the diversity of xi and S is the set of already selected informative samples.

Active sampling considers both representativeness and diversity; thus the combined objective function is defined as:

\arg\max_{x_i} \left(\lambda R(x_i) + (1 - \lambda) D(x_i)\right)    (3)

where λ ∈ (0, 1) is the trade-off between representativeness and diversity. Considering the balance, we choose λ = 0.5 in this paper.

Active sampling allows us to select an informative subset from the sample set, so that the computational cost for calculating the inter-category visual similarities can be reduced dramatically while covering the intra-category visual diversity at a certain sufficiency level.
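To make the selection procedure above concrete, the following is a minimal Python/NumPy sketch of a greedy active-sampling step based on Eqs. (1)–(3); the function and variable names (e.g., active_sampling, n_active) are our own illustration rather than the authors' released code, and the neighbor-set size follows the 2·ns/n − 1 rule quoted above.

```python
import numpy as np

def active_sampling(X, n_active, sigma_r=1.0, lam=0.5):
    """Greedily pick n_active representative and diverse samples from X (n_s x d)."""
    n_s = X.shape[0]
    # Pairwise squared Euclidean distances and Gaussian affinities.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    affinity = np.exp(-sq_dists / (2.0 * sigma_r ** 2))

    # Neighbor set N_i: the 2*n_s/n_active - 1 nearest samples (excluding x_i itself).
    k = max(1, int(2 * n_s / n_active) - 1)
    neighbors = np.argsort(sq_dists, axis=1)[:, 1:k + 1]
    repres = np.take_along_axis(affinity, neighbors, axis=1).mean(axis=1)   # Eq. (1)

    selected = [int(np.argmax(repres))]            # start from the most representative sample
    while len(selected) < n_active:
        diversity = (-affinity[:, selected]).min(axis=1)                    # Eq. (2)
        score = lam * repres + (1.0 - lam) * diversity                      # Eq. (3)
        score[selected] = -np.inf                  # never re-select a chosen sample
        selected.append(int(np.argmax(score)))
    return X[selected]

# Example: pick 30 active samples for one category of 128-d features.
category_features = np.random.randn(500, 128)
representatives = active_sampling(category_features, n_active=30)
```

The greedy loop simply alternates between scoring all unselected samples and taking the best one, which keeps the selection cost linear in the number of active samples once the affinity matrix is available.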

Fig. 2. Results of two different hierarchical AP clustering approaches on synthetic data. Left subfigures show the clustering results of high-level nodes; right subfigures show the clustering results of low-level nodes.

3.2. Hierarchical affinity propagation clustering for node partitioning

After the most representative samples are selected for each image category, hierarchical clustering is employed to build the enhanced visual tree. Because the number of cluster centers for affinity propagation (AP) clustering [35] does not need to be pre-defined, AP clustering allows us to obtain a better structure for the visual tree, and a hierarchical AP clustering algorithm is used to construct the enhanced visual tree. Givoni et al. have proposed a bottom-up approach to support hierarchical AP clustering [36], but it may lead to a higher accuracy rate for the bottom level and a lower accuracy rate for the top level. To control the inter-level error propagation, the metrics for high-level nodes are required to be more discriminative than the metrics for low-level nodes, so that the inter-level error propagation can be controlled effectively. Based on this understanding, a top-down approach to hierarchical AP clustering is developed to learn an enhanced visual tree. Fig. 2 shows the clustering results on a group of synthetic data obtained by using the two different hierarchical AP clustering methods. One can observe that our top-down hierarchical AP algorithm achieves better clustering accuracy rates for the high-level nodes.

In our top-down hierarchical AP algorithm, the Hausdorff distance [37] is used to characterize the inter-category visual similarities. Let A = {a1, . . . , am} and B = {b1, . . . , bn} denote two point sets. The inter-category visual similarity should be characterized through a measure of how far two point sets are from each other. We employ the Hausdorff distance to characterize the inter-category visual similarities, since the Hausdorff distance exploits the max-min characteristic of the sets: it calculates the maximum value over the nearest-neighbor distances to measure the maximum mismatch between two point sets. The Hausdorff distance between point sets A and B is defined as:

d_H(A, B) = \max(d_h(A, B), d_h(B, A))    (4)

where dh(A, B) denotes the one-sided Hausdorff distance, defined as:

d_h(A, B) = \sup_{x \in A} \inf_{y \in B} \|x - y\|    (5)

Hereby ‖·‖ is the Euclidean norm. By definition, the Hausdorff distance between two point sets is the greatest of all the distances from a point in one set to the closest point in the other set.

Based on this understanding, we use the Hausdorff distance to calculate the distance between the most representative samples of two image categories, so that we can obtain a more accurate similarity matrix for visual tree construction (via hierarchical AP clustering over the similarity matrix). Such a hierarchical AP clustering process for visual tree construction starts from the root node (that contains all the image categories) and ends at the leaf nodes (each leaf node contains only one single image category).

For the root node R, all the image categories L(R) ⊆ {1, . . . , L} are partitioned into B subsets by using AP clustering. Then, AP clustering is employed to partition these B subsets simultaneously. Note that the number of cluster centers B is learned from the training samples automatically. During the learning process, we calculate the within-class distance dw and the between-class distance db under the same parent node. It is assumed that one super-category C has n samples and has m sibling super-categories which share the same parent node; the within-class distance dw and the between-class distance db are defined as:

d_w = \mathrm{tr}\left(\sum_{i=1}^{n} (x_i - \mu_{y_i})(x_i - \mu_{y_i})^T\right)    (6)

d_b = \mathrm{tr}\left(\sum_{k=1}^{m} n_k (\mu_k - \mu)(\mu_k - \mu)^T\right)    (7)

where xi indicates one sample in super-category C and μ_{y_i} indicates the mean sample of this super-category, μk indicates the mean sample of the kth super-category and μ indicates the mean of all the mean samples of the sibling super-categories under the same parent node, and nk is the number of samples in the kth super-category. dw indicates the degree of aggregation within the most representative samples of super-category C, and db indicates the degree of separation between the most representative samples of super-category C and those of its sibling super-categories under the same parent node. After dw and db are obtained, we calculate their ratio rd = dw/db. Note that if rd is greater than a preset threshold θ, then only |L(C)| leaf nodes are generated directly. This hierarchical AP clustering process for node partitioning is applied recursively until a complete tree is created, in which each leaf node contains one single image category. For a given leaf node, its depth is defined as the path length from itself to the root node. For example, Fig. 3 shows the enhanced visual tree for the CIFAR-100 image set, where we can observe that the depth of the enhanced visual tree is 4 (from the root node to the leaf nodes). Such a visual hierarchy is an imbalanced tree, and different high-level nodes are related with different numbers of fine-grained image categories.
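A highly simplified sketch of this top-down partitioning step is given below, using scikit-learn's AffinityPropagation on a precomputed similarity matrix built from negative Hausdorff distances. The helper names (build_visual_tree, stop_splitting) are hypothetical, the stopping test aggregates dw and db over all sibling super-categories rather than per super-category as described above, and the code only illustrates the recursion, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from sklearn.cluster import AffinityPropagation

def hausdorff(A, B):
    # Symmetric Hausdorff distance between two sets of active samples (Eqs. (4)-(5)).
    return max(directed_hausdorff(A, B)[0], directed_hausdorff(B, A)[0])

def stop_splitting(categories, reps, theta):
    # Simplified r_d = d_w / d_b test in the spirit of Eqs. (6)-(7).
    means = np.array([reps[c].mean(axis=0) for c in categories])
    mu = means.mean(axis=0)
    d_w = sum(((reps[c] - reps[c].mean(axis=0)) ** 2).sum() for c in categories)
    d_b = sum(len(reps[c]) * ((m - mu) ** 2).sum() for c, m in zip(categories, means))
    return d_b == 0 or d_w / d_b > theta

def build_visual_tree(categories, reps, theta=4.0):
    """Recursively partition `categories` (ids), given reps[c] = active samples of c."""
    if len(categories) <= 1 or stop_splitting(categories, reps, theta):
        return list(categories)                        # generate leaf nodes directly
    n = len(categories)
    S = np.zeros((n, n))                               # similarity = negative Hausdorff distance
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = -hausdorff(reps[categories[i]], reps[categories[j]])
    labels = AffinityPropagation(affinity="precomputed", random_state=0).fit_predict(S)
    if len(set(labels)) <= 1:
        return list(categories)
    return [build_visual_tree([c for c, l in zip(categories, labels) if l == lab], reps, theta)
            for lab in sorted(set(labels))]            # recurse on each subset
```

Because AP selects the number of exemplars automatically, the branching factor of each node falls out of the data rather than being fixed in advance, which is exactly the property exploited above.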
4. Hierarchical multi-task sparse metric learning

In this section, we introduce our hierarchical multi-task sparse metric learning algorithm, which can learn a tree of multi-task sparse metrics hierarchically over the visual tree, where both the inter-node visual correlations (among the sibling child nodes under the same parent node) and the inter-level visual correlations (between the parent node and its sibling child nodes at the next level of the visual tree) are leveraged to learn more discriminative metrics to enhance the separability.

4.1. Learning multi-task sparse metric for root node

After the visual tree is available, a top-down approach is used to learn a tree of multi-task sparse metrics hierarchically over the visual tree. For the root node at the first level of the visual tree, a multi-task sparse metric learning algorithm is developed to learn a multi-task sparse metric to achieve more accurate separation of its sibling child nodes at the second level of the visual tree. One open issue for multi-task learning [38,39] is how to automatically identify the inter-related tasks. It is worth noting that our visual tree has provided a good environment to automatically identify the inter-related tasks for multi-task metric learning, e.g., the tasks for learning the metrics for the sibling child nodes under the same parent node are strongly inter-related, and such inter-related metrics should be learned jointly to enhance their discrimination power.

For the root node at the first level of the visual tree, its Mahalanobis distance is defined as:

d_W(x_i, x_j) = \sqrt{(x_i - x_j)^T W (x_i - x_j)}    (8)

where xi, xj are two data points and W is the Mahalanobis metric matrix. If the weight matrix W reduces to the identity matrix I, the distance dW(·, ·) is identical to the Euclidean distance.

Our goal in learning a multi-task sparse metric for the root node is to enhance the separability of its sibling child nodes effectively [24]. Thus the metric for each sibling child node under the same root node (each task) consists of two parts: the common metric shared among the multiple sibling child nodes under the root node and the node-specific individual metric. By using such a multi-task sparse metric, the distance function for the tth sibling child node is defined as:

d_t(x_i, x_j) = \sqrt{(x_i - x_j)^T (W_0 + W_t)(x_i - x_j)}    (9)

where W0 ⪰ 0 is the common metric, which is shared among all the sibling child nodes under the root node, and Wt ⪰ 0 is the node-specific individual metric for the tth child node. Both W0 and Wt are symmetric positive semidefinite matrices. The joint objective function for multi-task sparse metric learning is then defined as:

\min_{W_0, \ldots, W_B} \; \gamma_0 \|W_0 - I\|_F^2 + \sum_{t=1}^{B} \alpha_t \, \mathrm{tr}[W_0 + W_t] + \sum_{t=1}^{B} \left( \gamma_t \|W_t\|_F^2 + \sum_{i,j} d_t^2(x_i, x_j) + \sum_{i,j,k} \xi_{i,j,k} \right)    (10)

subject to:

d_t^2(x_i, x_k) - d_t^2(x_i, x_j) \geq 1 - \xi_{ijk}
\xi_{ijk} \geq 0
W_0, W_1, \ldots, W_B \succeq 0

where the trade-off parameters γ0 and γt are used to adjust the relative importance between W0 and Wt, t = 1, . . . , B, xi is an input exemplar, xj is a target neighbor, xk is an impostor, and ξijk is a slack variable. In the extreme cases, if γ0 → ∞, one single metric W0 is learned for all the sibling child nodes under the root node; on the other hand, if γt → ∞, B independent metrics are learned. By adding the trace norm into multi-task sparse metric learning, the multi-task sparse metric (that is learned for the root node) can have higher discrimination power for distinguishing its sibling child nodes accurately.
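Two basic ingredients of Eqs. (9) and (10), namely evaluating distances under the combined metric W0 + Wt and projecting a metric back onto the positive semidefinite cone after an update, can be sketched in Python as follows. This is not the authors' solver; the function names are ours and the snippet only illustrates the quantities involved.

```python
import numpy as np

def multitask_distance(xi, xj, W0, Wt):
    """d_t(x_i, x_j) under the combined metric W_0 + W_t   (Eq. (9))."""
    diff = xi - xj
    return float(np.sqrt(diff @ (W0 + Wt) @ diff))

def project_psd(W):
    """Project a symmetric matrix onto the PSD cone (the W >= 0 constraints in Eq. (10))."""
    W = (W + W.T) / 2.0
    eigval, eigvec = np.linalg.eigh(W)
    return (eigvec * np.clip(eigval, 0.0, None)) @ eigvec.T

def hinge_violation(xi, xj, xk, W0, Wt):
    """Slack for one (exemplar, target neighbor, impostor) triple, i.e. the margin
    constraint of Eq. (10): xi_ijk = [1 + d_t(x_i, x_j)^2 - d_t(x_i, x_k)^2]_+."""
    return max(0.0, 1.0 + multitask_distance(xi, xj, W0, Wt) ** 2
                        - multitask_distance(xi, xk, W0, Wt) ** 2)

# Toy usage: a shared metric plus one node-specific metric in a 128-d feature space.
d = 128
W0 = np.eye(d)                      # commonly-shared metric, initialized at the identity
Wt = np.zeros((d, d))               # node-specific metric for the t-th sibling child node
x_i, x_j, x_k = np.random.randn(3, d)
print(multitask_distance(x_i, x_j, W0, Wt), hinge_violation(x_i, x_j, x_k, W0, Wt))
```

In a typical alternating or (sub)gradient scheme for this kind of objective, the hinge terms drive the updates of W0 and each Wt, and project_psd is applied after every update to keep all metrics valid.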

Fig. 3. The enhanced visual tree for CIFAR-100 image set with 100 categories. The depth of the enhanced visual tree is 4 (from the root node to the leaf nodes). The visual hierarchy is an imbalanced tree, and different high-level nodes are related with different numbers of fine-grained image categories.

By embedding the inter-node visual similarities (i.e., the inter-task relationships) into a structure regularization term, our multi-task sparse metric learning algorithm can learn more discriminative metrics for separating the sibling child nodes under the same parent node. By focusing on the sibling child nodes under the same parent node, our algorithm can effectively control the complexity of multi-task sparse metric learning. Our multi-task sparse metric learning algorithm is able to establish two separable metrics for characterizing: (1) the common visual similarity shared among the sibling child nodes under the same parent node; and (2) the node-specific visual difference for each sibling child node. By separating the common metric from the node-specific metrics explicitly, our multi-task sparse metric learning algorithm can have higher discrimination power in distinguishing the sibling child nodes under the same parent node, which may share significant common visual properties and are usually hard to separate from each other.

4.2. Hierarchical learning of multi-task sparse metrics for non-root nodes

For a non-root node at a mid-level of the visual tree, a hierarchical approach is developed to leverage both the inter-node visual correlations (among the sibling child nodes under the same parent node) and the inter-level visual correlations (between the parent node and its sibling child nodes at the next level of the visual tree) for learning its multi-task sparse metric hierarchically. For the sibling non-root nodes, their parent node is used to characterize their common visual properties; this is the reason why such sibling non-root nodes and their associated image categories are assigned to the same parent node. By borrowing the node-specific metric from the parent node to set a prior regularization term for its sibling child nodes at the next level of the visual tree, our hierarchical multi-task sparse metric learning algorithm is able to leverage both the inter-level visual correlations and the inter-node visual correlations to train more discriminative metrics for the sibling child nodes under the same parent node. We define the inter-level regularization term as:

\|W_0 - W_p\|^2    (11)

where W0 is the common metric shared among the sibling child nodes under the same parent node p and Wp is the node-specific metric for the parent node p at the upper level of the visual tree. For the current non-root node, we force the node-specific individual metric of its parent node at the upper level of the visual tree and its commonly-shared metric to be similar or close. Thus the node-specific individual metric learned for the parent node at the upper level of the visual tree is propagated to its sibling child nodes.

For the sibling child nodes under the same parent node, the joint objective function for hierarchical learning of their multi-task sparse metrics is defined as:

\min_{W_0, \ldots, W_B} \; \gamma_0 \|W_0 - I\|_F^2 + \sum_{t=1}^{B} \alpha_t \, \mathrm{tr}[W_0 + W_t] + \sum_{t=1}^{B} \left( \gamma_t \|W_t\|_F^2 + \sum_{i,j} d_t^2(x_i, x_j) + \sum_{i,j,k} \xi_{i,j,k} \right) + \beta \|W_0 - W_p\|^2    (12)

subject to:

d_t^2(x_i, x_k) - d_t^2(x_i, x_j) \geq 1 - \xi_{ijk}
\xi_{ijk} \geq 0
W_0, W_1, \ldots, W_B \succeq 0

where αt is used to control the sparse regularization of W0 + Wt for t = 1, . . . , B, and β is used to adjust the inter-level regularization. When the multi-task sparse metrics are available for the sibling non-root nodes at the current level, a level-by-level process is adopted to learn the multi-task sparse metrics at the next level of the visual tree recursively until the leaf nodes are reached.

By leveraging the visual tree to generate subtrees (each subtree contains one parent node and its sibling child nodes) and to identify the inter-related tasks for multi-task sparse metric learning, our hierarchical multi-task sparse metric learning algorithm can provide an iterative solution for large-scale metric learning, so that learning a tree of multi-task sparse metrics over the enhanced visual tree becomes computationally tractable. By leveraging more discriminative metrics for classifier training, the corresponding tree classifiers become more discriminative, which can accurately rule out unlikely coarse-grained groups of image categories (i.e., irrelevant high-level nodes on the visual tree) at an early stage and significantly reduce the computational cost of large-scale image classification. By propagating the node-specific metric of the parent node to its sibling child nodes at the next level of the visual tree, our hierarchical multi-task sparse metric learning algorithm can control the inter-level error propagation effectively and achieve higher accuracy rates for large-scale image classification.
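The level-by-level procedure described above can be summarized by the following Python sketch. The simple node class is ours, and train_metrics_for_siblings stands in for a solver of Eq. (10) / Eq. (12) (e.g., a projected subgradient method), so the snippet only illustrates how the node-specific metric of a parent becomes the prior for its children.

```python
class TreeNode:
    """A visual-tree node: non-leaf nodes have children, leaf nodes hold one category."""
    def __init__(self, children=None, category=None):
        self.children = children or []
        self.category = category
        self.metric = None          # node-specific metric W_t, filled in during training

def train_tree_of_metrics(root, train_metrics_for_siblings):
    """Learn a tree of multi-task sparse metrics level by level (Section 4).

    train_metrics_for_siblings(parent, W_prior) is a placeholder for a solver of
    Eq. (10) (root node, W_prior is None) or Eq. (12) (non-root node, W_prior = W_p);
    it must return the shared metric W_0 and one node-specific metric per child.
    """
    frontier = [(root, None)]                       # (parent node, inherited prior W_p)
    while frontier:
        next_frontier = []
        for parent, w_prior in frontier:
            if not parent.children:                 # leaf nodes need no further metrics
                continue
            w0, w_children = train_metrics_for_siblings(parent, w_prior)
            for child, wt in zip(parent.children, w_children):
                child.metric = wt                   # node-specific metric of this child
                next_frontier.append((child, wt))   # propagated as the prior one level down
        frontier = next_frontier                    # move to the next level of the tree
```

Each parent node thus defines one independent multi-task problem, which is what keeps the overall training cost tractable even for thousands of categories.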
5. Experimental results and analysis

In this section, we evaluate the performance of the proposed approach on three public image sets. We first investigate the effect of different parameters on the classification performance. Then we validate the effectiveness of the proposed enhanced visual tree through comparison with other hierarchical structures. We finally evaluate our hierarchical multi-task sparse metric classifier by comparing it with flat classifiers and other state-of-the-art hierarchical approaches.

5.1. Image sets

In this paper, three well-known image sets are used to evaluate the performance of our method, including CIFAR-100 [40], Caltech-256 [41], and ILSVRC-2012 [5]. Each of them is briefly introduced as follows.

CIFAR-100 has 100 image categories and each category contains 600 images, where 500 are training images and 100 are testing images. The size of each image is 32 × 32 pixels.

Caltech-256 contains 256 image categories, and each category contains more than 80 images. The size of each image is about 300 × 200 pixels. The object's category label as well as its bounding box are manually annotated. In our experiments, we use 50 images per class for training and the rest for testing.

ILSVRC-2012 contains 1000 image categories and is a subset of ImageNet, where each category has over 1,000 images. Our algorithm is trained based on ILSVRC2014 (the latest databases remain unchanged from ILSVRC2012), which has over 1.2 million training images, 50 thousand validation images, and 150 thousand testing images. We utilize all training images to train our classifier and use all validation images as our testing data.

5.2. Experimental settings

The DeCAF features [42] are extracted for image representation in our experiments. The DeCAF features are extracted by using a deep convolutional network [43], where the output of the 7th layer (a fully connected layer) is treated as the discriminative feature for image representation. The dimensionality of the original DeCAF features is 4096; PCA is used to reduce the dimensionality of the DeCAF features to 128 to speed up the image classification process. The mean accuracy (%) is used as the criterion to evaluate the performance of all these approaches. A single machine with four 3.20 GHz cores and 32 GB memory is utilized to run all experiments. There are many parameter settings in this paper. We investigate the effect of the different parameters on the classification performance at different stages.
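For reference, the dimensionality-reduction step described above can be reproduced with a few lines of scikit-learn; the arrays below are random stand-ins for the actual DeCAF features.

```python
import numpy as np
from sklearn.decomposition import PCA

decaf_train = np.random.randn(5000, 4096)         # stand-in for extracted 4096-d DeCAF features
pca = PCA(n_components=128)                       # project down to 128 dimensions
train_features = pca.fit_transform(decaf_train)   # fit the projection on training features
# test_features = pca.transform(decaf_test)       # reuse the same projection at test time
```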
5.2.1. Visual tree learning

In Section 3, for the balance of representativeness and diversity of the active samples, we set the trade-off parameter λ = 0.5. In active sampling, the number of active samples affects the structure of the enhanced visual tree, and different tree structures result in different classification accuracy; in addition, more samples lead to a higher computational cost for constructing the enhanced visual tree. Therefore, we determine the number of active samples per category by experiment. The effect of the number of active samples per category is shown in Fig. 4. The x-axis indicates the number of active samples we select from each category. The left y-axis indicates the classification accuracy rates (%). The right y-axis indicates the computational cost (s) for constructing the enhanced visual tree. The red curve shows the classification accuracy with changes in the number of active samples, while the blue curve shows the computational cost for constructing the enhanced visual tree with changes in the number of active samples. One can observe that, with the growth in the number of active samples, the classification accuracy is not necessarily increased. The reason is that it suffices for the active samples to reflect the distribution of the original data for a reasonable tree structure, and hence better classification accuracy, to be obtained; more active samples do not guarantee that we can approximate the real distribution of the original data more accurately. On the other hand, one can observe that the computational cost increases linearly with the number of active samples. According to the results in Fig. 4, we set the number of active samples per category to Nas = 30 for the CIFAR-100 image set, Nas = 5 for the Caltech-256 image set and Nas = 15 for the ILSVRC-2012 image set. Moreover, the threshold θ affects the structure of the enhanced visual tree. According to our experience, we set θ = 4.0, 5.0 and 1.5 for the CIFAR-100, Caltech-256 and ILSVRC-2012 image sets, respectively.

5.2.2. Hierarchical learning

In Section 4, there are many parameters that affect our hierarchical classifier: the trade-off parameters γ0 and γt, the sparse regularization parameter αt and the inter-level regularization parameter β. Note that we consider the importance of all the tasks to be the same, so we set all the γt, t = 1, . . . , B, to be equal, and likewise the αt. We evaluate the effect of these parameters as shown in Fig. 5. Fig. 5(a), (b) and (c) correspond to the CIFAR-100, Caltech-256 and ILSVRC-2012 image sets, respectively. Since all γt, t = 1, . . . , B, are equal, we define a parameter ψ = γ0/γt to verify the effect of the trade-off parameters. The first column shows the evaluation of the effect of the trade-off parameters for the three image sets, the second column shows the evaluation of the sparse regularization parameter and the third column shows the evaluation of the inter-level regularization parameter.

Fig. 4. The effect of active samples on the enhanced visual tree. The red curve shows the classification accuracy with changes in the number of active samples while the blue curve shows the computational cost for building the enhanced visual tree with changes in the number of active samples. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The x-axis indicates the value of the different parameters. The y-axis indicates the classification accuracy (%). We choose the trade-off parameter ψ from {0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500} and choose the sparse regularization parameter αt and the inter-level regularization parameter β from {0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100}. One can observe that these parameters do not affect the classification accuracy seriously. This shows that the proposed hierarchical learning approach is robust to the parameter settings. According to the results in Fig. 5, we set γ0 = 1, γt = 100, αt = 0.5 and β = 0.1 for the CIFAR-100 image set, γ0 = 1, γt = 50, αt = 1 and β = 0.5 for the Caltech-256 image set, and γ0 = 1, γt = 10, αt = 1 and β = 10 for the ILSVRC-2012 image set.

5.3. Experiments on different hierarchical structures

Some databases provide a semantic ontology (taxonomy), such as the CIFAR-100 image set. The reason that we construct the enhanced visual tree rather than using the available semantic ontology is shown in Fig. 6. We employ the t-SNE toolkit [44] to visualize the partition of the root node of the semantic ontology and of the enhanced visual tree, respectively, on the CIFAR-100 image set. Fig. 6(a) illustrates the partition of the root node of the semantic ontology. One can observe that the data points are confusing in the feature space, because semantically similar categories are not necessarily close in the feature space. Because the feature space is the common space for classifier training and image classification, it is more attractive to learn a better visual hierarchy to organize large numbers of image categories directly in the feature space. As shown in Fig. 6(b), the distinction degree of the enhanced visual tree is clearly higher than that of the semantic ontology.

In this experiment, we compare the proposed enhanced visual tree (EVT) structure with other hierarchical structures on the three image sets, as shown in Table 1. For those image sets with no semantic ontology, such as the Caltech-256 image set, a semantic hierarchy is constructed manually by employing WordNet [45]. We employ the approach in [6] to construct the label tree and the approach in [10] to construct the visual tree. For the semantic ontology, its hierarchical structure is constructed by semantic meaning while we perform classification in the feature space rather than in the semantic space, so it achieves bad performance on all three image sets. The label tree and the visual tree learn their hierarchical structures in the feature space and achieve good performance. However, for the label tree, a confusion matrix must first be obtained from one-versus-rest (OVR) binary classifiers, and such an approach to label tree learning may seriously suffer from huge computational cost. The visual tree approach employs mean features for category representations; however, such mean features may not have sufficient capacity for category representation. Our proposed enhanced visual tree employs representative active samples for category representations, so it can obtain a better hierarchical structure and achieves the best performance on all three image sets.

Table 1
Comparison of classification accuracy (%) of the different hierarchical structures and the proposed enhanced visual tree on three image sets.

Approaches               CIFAR-100   Caltech-256   ILSVRC-2012
Semantic ontology [45]   48.95       57.78         44.47
Label tree [6]           52.04       61.62         49.64
Visual tree [10]         51.30       62.13         51.03
Enhanced visual tree     54.33       64.06         58.76

5.4. Experiments on different methods

In these experiments, we compare the proposed method with state-of-the-art approaches on the three image sets, respectively. To illustrate our experimental results, our hierarchical multi-task metric learning algorithm is referred to as HMML, and it has been compared with two other types of classifiers: flat classifiers and hierarchical classifiers. The baseline kNN classifier with Euclidean distance is referred to as Euclid and the state-of-the-art metric learning algorithm is referred to as LMNN [19]. Lei et al. have developed a hierarchical large-margin metric learning algorithm referred to as HLMM [46], which learns the metrics automatically from the training samples. The hierarchical multi-task learning algorithm is referred to as HMTL [24], which performs multi-task metric learning (without inter-level metric propagation). Grauman et al. have developed an interesting approach to learn a tree of disjoint metrics by adding a trace-norm regularization term into the objective function, so that the classifiers at different levels can select disjoint metrics or feature subsets. The idea of this approach has some similarities with the proposed method, so we compare with it; it is referred to as ToM [31]. Another type of inter-level constraint for SVM classifiers is also included for comparison and is referred to as HTC [10]. Deep learning has demonstrated its outstanding performance on extracting discriminative features to significantly boost the accuracy rates for large-scale image classification. Therefore, on the large-scale image dataset ILSVRC-2012, we compare our algorithm with a convolutional neural network (referred to as CNN) using the AlexNet model [43], which is the same model used for our feature extraction. In this paper, we employ the MatConvNet [47] deep learning toolbox (http://www.vlfeat.org/matconvnet/) to complete the experiments with the CNN.

Fig. 5. The effect of parameters on classification accuracy. The first column shows the effect of the trade-off parameters, the second column shows the effect of the sparse regularization parameter and the third column shows the effect of the inter-level regularization parameter.

5.4.1. CIFAR-100

We first investigate the performance of the proposed HMML method on the CIFAR-100 image set. The experimental results are shown in Table 2. From these results, one can observe that: (a) All the hierarchical approaches can achieve higher accuracy rates than the flat approaches, because the flat approaches completely ignore the inter-category visual correlations and learn the metrics for all the categories independently, whereas the hierarchical approaches can effectively leverage the inter-category visual similarities to enhance the separability of the sibling child nodes under the same parent node; (b) By learning large-margin metrics that pull the exemplars of the same category closer while pushing the exemplars of different categories far away, both the flat and hierarchical approaches can achieve higher performance than the traditional nearest neighbor approach that uses the Euclidean metric directly; (c) Our hierarchical multi-task metric learning algorithm (HMML) can achieve better performance than the traditional HLMM approach, because our algorithm can leverage multi-task learning and inter-level metric propagation to learn more discriminative metrics. In addition, our visual hierarchy has provided a good environment to automatically identify the inter-related tasks for multi-task metric learning. By supporting inter-level metric propagation, our hierarchical multi-task metric learning algorithm (HMML) can learn more discriminative metrics and control the inter-level error propagation effectively.

Fig. 6. Visualization of semantic ontology and enhanced visual tree in CIFAR-100 image set.

The experimental results also show that our proposed method achieves better performance than ToM, because the ToM approach concentrates on utilizing the trace norm to control the inter-level error propagation and ignores the inter-category visual correlations. Moreover, the proposed method also achieves better performance than HTC, because our enhanced visual tree can characterize the inter-category visual correlations accurately in the feature space.

Table 2
Comparison on CIFAR-100 image set.

Structure      Approaches   Mean accuracy (%)
Flat           Euclid       46.03
               LMNN [19]    48.84
Hierarchical   HLMM [46]    50.16
               HMTL [24]    52.78
               ToM [31]     53.02
               HTC [10]     53.53
               HMML         54.33

5.4.2. Caltech-256

We investigate the performance of the proposed HMML method on the Caltech-256 image set. The experimental results are shown in Table 3. One can observe that conclusions similar to those obtained on the CIFAR-100 image set are reached. In addition, the hierarchical approaches with inter-level regularization (inter-level metric propagation) can learn more discriminative metrics and control the inter-level error propagation effectively, and thus they can achieve better performance on large-scale image classification. Our experimental results have also demonstrated that the visual hierarchy plays an essential role in the hierarchical learning of a tree of multi-task sparse metrics.

Table 3
Comparison on Caltech-256 image set.

Structure      Approaches   Mean accuracy (%)
Flat           Euclid       50.36
               LMNN [19]    54.05
Hierarchical   HLMM [46]    58.17
               HMTL [24]    60.22
               ToM [31]     61.52
               HTC [10]     60.78
               HMML         64.06

5.4.3. ILSVRC-2012

We investigate the performance of the proposed HMML method on the ILSVRC-2012 image set. An enhanced visual tree is learned automatically, and Fig. 7 illustrates the enhanced visual tree with 5 levels, where icon images are used to illustrate the image categories and one icon image is used to represent one particular image category. The icon images for the visually-similar image categories, which are assigned to the same non-leaf node, are integrated to illustrate that non-leaf node jointly. Due to space limitations, only three subtrees are illustrated.

Table 4 shows the experimental results. One can observe that the improvement in the accuracy rates of LMNN is not obvious compared with the traditional kNN classifier using the Euclidean distance directly. Our hierarchical multi-task metric learning algorithm can achieve higher accuracy rates than the ToM and HTC approaches. Deep learning methods have achieved reasonable successes on large-scale image classification because of their outstanding performance in extracting discriminative features. However, the CNN utilizes softmax regression as the last layer of the network, which leads it to ignore the inter-category correlations. Thus, one can observe that our hierarchical multi-task metric learning algorithm can achieve better performance than the CNN method. This evidence has shown that our hierarchical multi-task metric learning algorithm can accurately learn a tree of multi-task sparse metrics over the enhanced visual tree and control the inter-level error propagation effectively.

The detailed comparison on all 1000 image categories in the ILSVRC-2012 image set is illustrated in Fig. 8. Each method is arranged in ascending order of the category accuracy. Obviously, our hierarchical multi-task metric learning algorithm can achieve better accuracy rates for most of the 1000 image categories. Fig. 9 illustrates the comparison of all approaches. The x-axis indicates the classification accuracy rates (%), and the y-axis indicates the percentage of the image categories whose accuracy rates exceed the corresponding threshold. One can observe that our hierarchical multi-task metric learning algorithm can achieve better performance on large-scale image classification.

Fig. 7. Visualization of three different paths of our enhanced visual tree for ILSVRC-2012 image set with 1000 categories. The icon images are used to illustrate the image categories and one icon image is used to represent one particular image category.

Fig. 8. Accuracy comparison on the ILSVRC-2012 image set with 1000 image categories.

Table 4
Comparison on ILSVRC-2012 image set.

Structure      Approaches   Mean accuracy (%)
Flat           Euclid       44.09
               LMNN [19]    45.72
Hierarchical   HLMM [46]    51.67
               HMTL [24]    52.94
               ToM [31]     56.62
               HTC [10]     53.27
               CNN [47]     56.69
               HMML         58.76

5.5. Experiments on computational efficiency

The computational cost at test time is evaluated for all the approaches on the three image sets. Table 5 illustrates the experimental results. One can observe that all the hierarchical approaches are much faster than the flat approaches, whose computational cost increases linearly with the number of image categories. This is because the hierarchical approaches allow us to search only part of the categories at the testing phase (i.e., each test sample is matched against only part of the nodes of our enhanced visual tree and their node classifiers), so they can achieve high efficiency. The different hierarchical approaches show little difference in testing efficiency because their efficiency depends mainly on the depth of the enhanced visual tree, and their tree structures are uniform in this experiment. One can also note that the testing efficiency of deep learning is not high.

Fig. 9. Comparison on accuracy percentage. The x-axis indicates the classification accuracy rates (%), and the y-axis indicates the percentage of the image categories whose accuracy rates exceed the corresponding threshold.
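To see why the test-time cost depends on the tree depth rather than on the total number of categories, consider the following sketch of a top-down traversal; the node attributes and the node_score helper (e.g., a metric-based nearest-neighbor score under W0 + Wt) are hypothetical and only illustrate the routing logic, not the authors' implementation.

```python
def classify(x, root, node_score):
    """Route a test sample down the enhanced visual tree and return a category label.

    At each non-leaf node the sample is compared only against that node's children
    under their learned metrics, so the number of evaluations grows with the depth
    of the tree instead of with the total number of image categories.
    """
    node = root
    while node.children:
        # node_score(x, child) is a placeholder for the node classifier
        # (lower score = better match under the child's metric W_0 + W_t).
        node = min(node.children, key=lambda child: node_score(x, child))
    return node.category            # each leaf node holds one single image category
```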

Table 5
Comparison of testing time (milliseconds/image).

Approaches   CIFAR-100   Caltech-256   ILSVRC-2012
Euclid       25.86       115.59        1309.15
LMNN         23.07       127.08        1233.94
HLMM         9.27        6.65          30.1
HMTL         9.89        7.01          33.1
ToM          9.18        5.77          28.49
HTC          21.46       18.25         55.95
CNN          –           –             65.80
HMML         9.93        6.98          33.4

6. Conclusion

In this paper, a hierarchical multi-task sparse metric learning algorithm is developed to learn a tree of multi-task sparse metrics hierarchically over an enhanced visual tree to achieve a fast solution for large-scale image classification, where both the inter-level visual correlations (between the parent node and its sibling child nodes at the next level of the enhanced visual tree) and the inter-node visual correlations (among the sibling child nodes under the same parent node) are exploited to learn a tree of multi-task sparse metrics over the enhanced visual tree. By explicitly separating the node-specific metrics from the commonly-shared metric, our hierarchical multi-task metric learning algorithm can learn more discriminative metrics for large-scale image classification applications. The experimental results have demonstrated that our algorithm has superior performance compared with other state-of-the-art techniques in terms of both classification accuracy and computational efficiency.

The contributions of this paper can be summarized as follows: (a) An enhanced visual tree is constructed to organize large numbers of image categories in a coarse-to-fine fashion and to automatically identify the inter-related tasks for multi-task sparse metric learning; (b) A new objective function is developed for multi-task sparse metric learning; (c) A top-down approach is developed for supporting hierarchical learning of a tree of multi-task sparse metrics over the enhanced visual tree.

Acknowledgments

The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions. This work was supported in part by the National Natural Science Foundation of China under Grants 61432014, U1605252, 61571347, 61672402 and 61603233, in part by the Key Industrial Innovation Chain in Industrial Domain under Grant 2016KTZDGY-02, in part by the Fundamental Research Funds for the Central Universities under Grant BDZ021403 and Grant JB149901, in part by the National High-Level Talents Special Support Program of China under Grant CS31117200001, and in part by the Program for Changjiang Scholars and Innovative Research Team in University of China under Grant IRT13088.

References

[1] Y. Li, Q. Huang, W. Xie, X. Li, A novel visual codebook model based on fuzzy geometry for large-scale image classification, Pattern Recognit. 48 (10) (2015) 3125–3134.
[2] L.-J. Li, C. Wang, Y. Lim, D.M. Blei, L. Fei-Fei, Building and using a semantivisual image hierarchy, in: CVPR, 2010, pp. 3336–3343.
[3] B. Zhao, F. Li, E.P. Xing, Large-scale category structure aware image categorization, in: NIPS, 2011, pp. 1251–1259.
[4] S. Gopal, Large-scale Structured Learning, Ph.D. thesis, Carnegie Mellon University, 2014.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: CVPR, 2009, pp. 248–255.
[6] S. Bengio, J. Weston, D. Grangier, Label embedding trees for large multi-class tasks, in: NIPS, 2010, pp. 163–171.
[7] J. Deng, S. Satheesh, A.C. Berg, F. Li, Fast and balanced: efficient label tree learning for large scale object recognition, in: NIPS, 2011, pp. 567–575.
[8] B. Liu, F. Sadeghi, M. Tappen, O. Shamir, C. Liu, Probabilistic label trees for efficient large scale image classification, in: CVPR, 2013, pp. 843–850.
[9] J. Fan, X. He, N. Zhou, J. Peng, R. Jain, Quantitative characterization of semantic gaps for learning complexity estimation and inference model selection, IEEE Trans. Multimedia 14 (5) (2012) 1414–1428.
[10] J. Fan, N. Zhou, J. Peng, L. Gao, Hierarchical learning of tree classifiers for large-scale plant species identification, IEEE Trans. Image Process. 24 (11) (2015) 4172–4184.
[11] E.P. Xing, M.I. Jordan, S. Russell, A.Y. Ng, Distance metric learning with application to clustering with side-information, in: NIPS, 2003, pp. 505–512.
[12] A. Bar-Hillel, T. Hertz, N. Shental, D. Weinshall, Learning a Mahalanobis metric from equivalence constraints, J. Mach. Learn. Res. 6 (6) (2005) 937–965.
[13] A. Globerson, S.T. Roweis, Metric learning by collapsing classes, in: NIPS, 2005, pp. 451–458.
[14] C.J. Veenman, Statistical disk cluster classification for file carving, in: IAS, 2007, pp. 393–398.
[15] B. Kulis, Metric learning: a survey, Found. Trends Mach. Learn. 5 (4) (2012) 287–364.
[16] Q. Qian, R. Jin, S. Zhu, Y. Lin, Fine-grained visual categorization via multi-stage metric learning, in: CVPR, 2015, pp. 3716–3724.
[17] M.P. Kumar, P.H. Torr, A. Zisserman, An invariant large margin nearest neighbour classifier, in: ICCV, 2007, pp. 1–8.
[18] L. Torresani, K.-c. Lee, Large margin component analysis, in: NIPS, 2006, pp. 1385–1392.
[19] K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. 10 (2009) 207–244.
[20] J. Goldberger, G.E. Hinton, S.T. Roweis, R. Salakhutdinov, Neighbourhood components analysis, in: NIPS, 2004, pp. 513–520.
[21] N. Verma, D. Mahajan, S. Sellamanickam, V. Nair, Learning hierarchical similarity metrics, in: CVPR, 2012, pp. 2280–2287.
[22] Y. Ying, K. Huang, C. Campbell, Sparse metric learning via smooth optimization, in: NIPS, 2009, pp. 2214–2222.
[23] Y. Zhang, D.-Y. Yeung, Transfer metric learning by learning task relationships, in: ACM SIGKDD, 2010, pp. 1199–1208.
[24] S. Parameswaran, K.Q. Weinberger, Large margin multi-task metric learning, in: NIPS, 2010, pp. 1867–1875.
[25] N. Zhou, J. Fan, Automatic image–text alignment for large-scale web image indexing and retrieval, Pattern Recognit. 48 (1) (2015) 205–219.
[26] L. Cai, T. Hofmann, Hierarchical document categorization with support vector machines, in: ICIKM, 2004, pp. 78–87.
[27] M. Marszalek, C. Schmid, Constructing category hierarchies for visual recognition, in: ECCV, 2008, pp. 479–491.
[28] E. Bart, M. Welling, P. Perona, Unsupervised organization of image collections: taxonomies and beyond, IEEE Trans. Pattern Anal. Mach. Intell. 33 (11) (2011) 2302–2315.
[29] S. Gopal, Y. Yang, Recursive regularization for large-scale classification with hierarchical and graphical dependencies, in: ACM SIGKDD, 2013, pp. 257–265.
[30] L. Xiao, D. Zhou, M. Wu, Hierarchical classification via orthogonal transfer, in: ICML, 2011, pp. 801–808.
[31] K. Grauman, F. Sha, S.J. Hwang, Learning a tree of metrics with disjoint visual features, in: NIPS, 2011, pp. 621–629.
[32] B. Babenko, S. Branson, S. Belongie, Similarity metrics for categorization: from monolithic to category specific, in: ICCV, 2009, pp. 293–300.
[33] I.E. Yen, D. Malioutov, A. Kumar, Scalable exemplar clustering and facility location via augmented block coordinate descent with column generation, in: AISTATS, 2016, pp. 1260–1269.
[34] H. Wang, X. Gao, K. Zhang, J. Li, Single image super-resolution using active-sampling Gaussian process regression, IEEE Trans. Image Process. 25 (2) (2016) 935–948.
[35] B.J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972–976.
[36] I. Givoni, C. Chung, B.J. Frey, Hierarchical affinity propagation, in: UAI, 2011, pp. 238–246.
[37] R.T. Rockafellar, R.J.-B. Wets, Variational Analysis, vol. 317, Springer Science & Business Media, 2009.
[38] F. Cai, V. Cherkassky, Generalized SMO algorithm for SVM-based multitask learning, IEEE Trans. Neural Netw. Learn. Syst. 23 (6) (2012) 997–1003.
[39] T. Evgeniou, C.A. Micchelli, M. Pontil, Learning multiple tasks with kernel methods, J. Mach. Learn. Res. (2005) 615–637.
[40] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Technical report, University of Toronto, 2009.
[41] G. Griffin, A. Holub, P. Perona, Caltech-256 object category dataset, 2007.
[42] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, arXiv preprint arXiv:1408.5093, 2014.
[43] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1097–1105.
[44] L.v.d. Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (11) (2008) 2579–2605.
[45] G.A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (11) (1995) 39–41.
[46] H. Lei, K. Mei, J. Xin, P. Dong, J. Fan, Hierarchical learning of large-margin metrics for large-scale image classification, Neurocomputing 208 (2016) 46–58.
[47] A. Vedaldi, K. Lenc, MatConvNet: convolutional neural networks for MATLAB, in: ACM MM, 2015, pp. 689–692.

Yu Zheng received the B.Eng. degree in electronic information engineering from Xidian University, Xi’an, China, in 2012. He is currently pursuing the Ph.D. degree in
intelligent information processing with the VIPS Laboratory, School of Electronic Engineering, Xidian University. His current research interests include machine learning and
computer vision.

Jianping Fan received the M.S. degree in theoretical physics from Northwestern University, Xi'an, China, in 1994, and the Ph.D. degree in optical storage and computer science
from the Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Shanghai, China, in 1997. He was a Post-Doctoral Researcher with Fudan University,
Shanghai, in 1998. From 1998 to 1999, he was a Researcher with the Japan Society of Promotion of Science, and the Department of Information System Engineering, Osaka
University, Osaka, Japan. From 1999 to 2001, he was a Post-Doctoral Researcher with the Department of Computer Science, Purdue University, West Lafayette, IN. In 2001, he
joined the Department of Computer Science, University of North Carolina at Charlotte. His research interests include automatic image/video analysis, semantic image/video
classification, personalized image/video recommendation, surveillance videos, and statistical machine learning.

Ji Zhang received the B.E. and M.S. degrees in Electronics Engineering from Xi'an Jiaotong University, Xi'an, China, in 2011 and 2014, respectively. He is now pursuing his
Ph.D. degree on Pattern Recognition and Intelligence Systems at the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University. His research interests include
image analysis and embedded vision computing. He is now a visiting student at UNC-Charlotte.

Xinbo Gao received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi’an, China, in 1994, 1997, and 1999, respectively. From
1997 to 1998, he was a Research Fellow at the Department of Computer Science, Shizuoka University, Shizuoka, Japan. From 2000 to 2001, he was a Post-Doctoral Research
Fellow at the Department of Information Engineering, the Chinese University of Hong Kong, Hong Kong. Since 2001, he has been at the School of Electronic Engineering,
Xidian University. He is currently a Cheung Kong Professor of Ministry of Education, a Professor of Pattern Recognition and Intelligent System, and the Director of the
State Key Laboratory of Integrated Services Networks, Xi’an, China. His current research interests include multimedia analysis, computer vision, pattern recognition, machine
learning, and wireless communications. He has published five books and around 200 technical articles in refereed journals and proceedings. Prof. Gao is on the Editorial
Boards of several journals, including Signal Processing (Elsevier), and Neurocomputing (Elsevier). He served as the General Chair/Co-Chair, Program Committee Chair/Co-Chair,
or PC Member for around 30 major international conferences. He is currently a fellow of the Institution of Engineering and Technology.
