ICVGIP 2016 Submission #324. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Image2Words: A neural network architecture for image tag assignment

Anonymous ICVGIP submission
Paper ID 324
ABSTRACT

Social networking and media sharing create huge volumes of image data that are tagged by the users. However, the tags provided by the users are noisy and bear weak association with the contents of the image. Automatic image tagging and refining the existing tags is one of the important problems in multimedia and computer vision. In this paper, we propose a neural network based learning procedure to automatically generate suitable tags to describe a given image. Our system image2words is an MLP based cross media mapping learned from images to tags. We train our model on a set of images and associated human given tags. Given an image, the proposed system predicts a set of suitable tags and associated confidences. We evaluate our approach on benchmark datasets for image tagging and report very competitive performance, despite being a simple model.

Figure 1: The overview of the proposed approach. We learn an efficient mapping from image space to a dictionary. Our system is a multi-layer perceptron (MLP) which takes visual features of an image as input and predicts the confidences towards the words present in a pre-compiled dictionary. Note that our system predicts multiple words with associated confidences which need not sum up to 1.

CCS Concepts

• Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network reliability;

Keywords

image tagging, multi-label classification, tag assignment, cross modal mapping, social media, social tagging

1. INTRODUCTION

The popular usage of social websites and networking platforms causes a massive outflow of visual content everyday. This also enables tagging, commenting and sharing the generated images and videos worldwide over the internet. As a result, huge volumes of personally tagged images are available. The problem of image retrieval has been shadowed by the problem of improving the tags given to the images and using them to search. When users upload their images and browse the social profiles of others, they tag the images without constraint. These tags are provided by the common users. More often, these tags are observed to be substandard when it comes to content association. Moreover, they contain noise in the form of personalization, multiple (regional) languages, cultural impact etc. For instance, Figure 2 shows a couple of randomly picked images and the associated user given tags from the MIRFLICKR [10] dataset. In the tags given for these images, most of the words have no direct association with the contents of the image. More often the tags are inaccurate, ambiguous, casual and do not directly convey the contents of the image. The tags need to be improved by a great deal before they can be used for a set of related tasks.

In this paper, we attempt the problem of tag assignment and refinement. Tag assignment is the task of assigning a set of suitable tags that match and explain the contents of a given unlabeled image. Tag refinement refers to the task of refining the existing set of initial tags for a given image: removing the less relevant tags and adding more appropriate new tags. We recognize this as a problem of learning a mapping between an image and a set of possible words that can be used as tags. We learn a system which accepts image features as input and predicts a confidence for each word present in a pre-compiled dictionary to be a tag for the given image. Our system predicts multiple tags and ranks them with a confidence measure. In other words, we perform a multi-label classification on the input image features in order to generate the tags.
The proposed system is learned from the carefully annotated, noise free (image, caption) pairs. Unlike the user provided tags of images uploaded on social media, these captions are collected to have the best association with the image content. The motivation for this comes from the need to predict more accurate and general tags. This enables us to avoid predicting personalized and uncommon tags while still having a variety of relevant tags. The major contributions of the paper can be listed as follows:

• We learn an efficient mapping in the form of a simple MLP, that assigns a set of tags to images ranked according to a confidence

• We utilize the existing image caption data to train our system by posing it as a multi-label classification problem

The contents of the paper are organised as follows: Section 2 explains the proposed approach, Section 3 presents the benchmark datasets and the experimental setup followed for evaluating the proposed approach. Finally, we relate our proposed method to the existing literature in Section 4 and conclude in Section 5.

Figure 2: Sample images from the MIRFLICKR dataset. The sets of user given tags for these images respectively are {365days, flickr group roulette, fairy tale, Story book Themes, tortoise and the hare, slow and steady wins the race, my neighbors think Im insane, OaM is a Dork} and {boston, zoo, franklin park zoo, lion, animal, wildlife, k10d, outdoors, boston zoo franklin, simba, aslan, mufasa, Corey Leopold, ChallengeYouWinner}. The evident noise present in user given tags such as these poses challenges while processing them.

2. PROPOSED APPROACH

In this section we detail the components of the proposed Image2Words system. We begin with the notation followed throughout the paper. (I, C) denotes an (image, caption) pair. Note that, throughout this paper, caption refers to a complete sentence describing an image and tag refers to a word or phrase associated with an image. Therefore, a dataset D of N annotated images is denoted as D = {(I_1, C_1), (I_2, C_2), ..., (I_N, C_N)}, where the subscript denotes the image index. We represent the set of words that constitute the caption C_i as C_i = {w_i1, w_i2, ...}. Note that different captions may consist of different numbers of words. The primary objective of our work is to predict a set of tags {t_1, t_2, ...} for a given image I as shown in Equation (1), where f(I) denotes the visual features extracted from the image.

f(I) \xrightarrow[\text{Image2words}]{D = \{(I_i, C_i)\}_{i=1}^{N}} \{t_1, t_2, \ldots\}    (1)

2.1 Visual features

Convolutional neural networks (Convnets) have been observed to outperform the traditional gradient based features [23, 26, 2, 6] in almost all the vision tasks. Convnets such as Alexnet [15], VGG [30] and googlenet [31] that are trained on large datasets such as IMAGENET [29] learn features that can be easily fine-tuned towards other related tasks. In this work, as we are learning a mapping between images and tags, we represent the contents of the image with the semantic features learned by the convnets. For this purpose, we choose to work with features learned by the deepest fully connected layers. This, we believe, is a reasonable choice given the performance of this feature in many recognition related tasks as reported in [8, 7].

2.2 The tags

The proposed system is a mapping learned between two modalities, viz. image and text. We require sufficient labeled data for the mapping to be reliable. There are image datasets available with associated tag data such as MIRFLICKR [10], NUS-WIDE [5] etc. One can use these datasets for learning a system to predict the tags. However, if the aim of the work is to assign suitable tags and to remove irrelevant tags, we believe one should use more cleanly associated tags. As explained in Section 1 and shown in Figure 2, these tags are noisy and weakly associated with the contents. This motivates us to use cleaner and strongly associated tags.

Automatic caption generation is one of the recent vision problems that has drawn the attention of many researchers. Caption generation systems such as [13, 12, 35] are trained on datasets that provide images and associated human given descriptions in the form of complete sentences (called captions). Examples of such datasets are Flickr8K [9], Flickr30K [37] and MSCOCO [4].

We utilize these datasets to train our image2words system to predict relevant tags for images. Note that, though these datasets provide full sentences, our system predicts only tags. We extract tags from the captions, which we refer to as the Dictionary, to train our system. For extracting the tags from a given caption, we first remove the stop words from the caption and then lemmatize the remaining words. This set of lemmatized words (called tags) acts as the set of labels in the tag space for the given image.
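For concreteness, a minimal sketch of this caption pre-processing is given below, assuming NLTK's English stop-word list and WordNet lemmatizer are available; the helper names are illustrative and not part of our pipeline.

```python
# Minimal sketch of the caption-to-tags step of Section 2.2, assuming the
# NLTK 'stopwords' and 'wordnet' corpora are installed. Function names are
# illustrative, not taken from the actual implementation.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def caption_to_tags(caption):
    """Lower-case, keep alphabetic words, drop stop words, lemmatize the rest."""
    words = re.findall(r"[a-z]+", caption.lower())
    return {lemmatizer.lemmatize(w) for w in words if w not in STOP}

def build_dictionary(captions):
    """The Dictionary is the union of the tags extracted from all captions."""
    vocab = set()
    for c in captions:
        vocab |= caption_to_tags(c)
    return sorted(vocab)

# Example: caption_to_tags("A boy leaps to catch a baseball") -> {'boy', 'leap', ...}
```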
2.3 The network and the objective

The proposed image2words system is essentially a Multi Layer Perceptron (MLP) learned on the image features f(I). We interpret the problem of tag assignment as a multi-label classification problem. Given an image, our objective is to classify it into the multiple tag classes which the image fits in. For instance, in Figure 1, the given image belongs to the boy, baseball, leap and play classes. In other words, these are the tags predicted for the given image.

Figure 3 shows the basic structure of the network. It contains a series of hidden layers h_1, h_2, ..., h_n that transform the input visual features f(I) into a set of tags. The visible layer contains as many neurons as the dimension of the image feature f(I). The number of units in the output layer is equal to the number of possible tags an image can be assigned, which is equal to the size of the Dictionary. This number generally equals the number of tags extracted from the training images.

Figure 3: The proposed image2words architecture. Our system is an MLP. It learns the mapping from image features extracted by a convnet to a set of relevant tags. The output layer predicts the confidences {P(w_i|I)} towards the tags present in the Dictionary.

Unlike single label classification, we perform a multi-label classification, which allows us to assign multiple labels to the given image. The units in the output layer have a sigmoid non-linearity, whose output lies in [0, 1]. This acts as the predicted confidence towards the corresponding label (tag). The set of target labels is encoded as a single vector with 1s in the respective positions and zeros everywhere else; this is the same as adding the one hot vectors corresponding to the tags. We train the network by minimizing the binary cross entropy loss (shown in Equation 2) at each of the output nodes,

L(t, p) = -\big(t \log(p) + (1 - t) \log(1 - p)\big)    (2)

where t is the target and p is the prediction. The network is thus trained over a set of images and their sets of tags. Note that the output of the network is not a probability distribution over the tags; instead it is the confidence (in [0, 1]) towards each tag.
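To make the setup concrete, a minimal Keras sketch of such a network is given below. It is an illustration rather than the exact implementation used for the reported results: the hidden-layer activation, the optimizer and the single dropout rate are assumptions, while the 1024-128-14 layer sizes follow the MIRFlickr configuration reported later in Section 3.5.

```python
# Illustrative sketch (not the authors' code) of the image2words MLP.
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

def build_image2words(feat_dim=1024, dict_size=14):
    model = Sequential()
    # Hidden layer with Batch Normalization and Dropout, as in Section 3.5.
    model.add(Dense(128, activation='relu', input_dim=feat_dim))
    model.add(BatchNormalization())
    model.add(Dropout(0.25))
    # Sigmoid outputs: one independent confidence per Dictionary word.
    model.add(Dense(dict_size, activation='sigmoid'))
    # Binary cross entropy applied at every output node (Equation 2).
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

# X: N x 1024 convnet features, Y: N x dict_size multi-hot tag vectors.
# model = build_image2words()
# model.fit(X, Y, batch_size=128, epochs=50)
```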
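Given such a model, assigning tags to a new image reduces to a single forward pass followed by ranking the per-word sigmoid confidences. A small usage sketch with hypothetical helper names:

```python
# `dictionary` is the word list of Section 2.2; `model` is the MLP above;
# `features` is the 1024-D convnet descriptor of one image.
import numpy as np

def assign_tags(model, dictionary, features, top_k=5):
    conf = model.predict(features.reshape(1, -1))[0]   # one confidence per word
    order = np.argsort(-conf)                          # highest confidence first
    return [(dictionary[i], float(conf[i])) for i in order[:top_k]]

# e.g. -> [('boy', 0.9), ('baseball', 0.6), ('leap', 0.3), ...]
```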

3. EXPERIMENTS

In this section we present the datasets used for training our models and the benchmark datasets for evaluating the proposed method.

3.1 Datasets

The following datasets are considered for evaluation:

• MIRFlickr [10]: This is a set of images and associated tags collected from the photo sharing website Flickr. It contains 25,000 images and a total of 67,389 user provided tags. It has a wide variety of images and tags, as it is collected from 9862 user profiles on Flickr. The dataset is provided with a set of 14 test tags, from which the relevant ones shall be assigned to the images.

• NUS-WIDE [5]: The dataset contains 259,233 images with a total of 55,913 associated tags. The number of users who contributed to this dataset is 51645. It provides 81 ground truth test tags.

We repeat the evaluation protocol followed in [22]. We have trained our model on the Flickr8K [9] dataset. It has 8091 images and each image has 5 captions provided by different humans. We have used the standard 8000 images for our experiments. In total, the dataset has 30000 (image, caption) pairs.

3.2 Evaluation metrics

An efficient approach shall rank the relevant tags ahead of the irrelevant ones. Similarly, for a given test tag, relevant images shall be ranked before the irrelevant images. Similar to many works as reported in the survey by Li et al. [22], we use the image centric Mean image Average Precision (MiAP) to evaluate the tag ranking. Similarly, we use the tag centric Mean Average Precision (MAP) to measure the quality of ranking the images. While assigning tags to an image I, the image centric iAP measure is computed as

iAP(I) = \frac{1}{T} \sum_{k=1}^{K} \frac{rel(k)}{k} \, \delta(I, t_k)    (3)

where T is the total number of relevant tags for image I, K is the total number of ground truth test tags, rel(k) is the number of relevant tags in the top k predicted tags, and \delta(I, t_k) = 1 if the tag t_k is relevant and 0 otherwise. Finally, the MiAP is computed by averaging the iAP measure across all the images.

Similarly, the tag centric ranking measure of a given test tag is computed as

AP(t) = \frac{1}{R} \sum_{i=1}^{n} \frac{rel(i)}{i} \, \delta(I_i, t)    (4)

where R is the total number of relevant images for the given tag t and rel(i) is the number of relevant images in the top i ranked list. MAP is computed by taking the mean of AP(t) across all the tags.

3.3 Visual Features

There has been increasing evidence proving the effectiveness of the features learned by convnets in vision. The deeper layers in convnets learn semantic features and exhibit a degree of robustness to general transformations. Therefore we describe the image contents by fully connected layers. We use the caffe [11] implementation of googlenet [31] trained on IMAGENET [29] as our feature descriptor. In order to extract wholesome information we follow the standard crops procedure and take the four corners and the center along with their flips for extracting convnet features. However, instead of taking their mean, we take the max of each activation to obtain a single representation for the image. Googlenet has 1024 units in the final fully connected layer, resulting in a 1024D feature description (after the max pooling of the crops) per image. Note that we can use any other convnet architecture for describing the image contents. Moreover, we do not perform any sort of fine-tuning and use the features learned from IMAGENET.

3.4 Training

We use the Flickr8K dataset to train our image2words architecture. We pre-process (as discussed in 2.2) the captions associated with all the 8000 images of the dataset. We form the Dictionary from the tags extracted from the captions. At this point, we can prepare customized dictionaries for (i) the task at hand and (ii) the dataset under consideration.

For instance, if the task is tag assignment, the objective is to assign each image one of the ground truth test tags. Therefore, we can remove from the Dictionary all the irrelevant tags, and remove the images that have no association with any of the ground truth test tags. Different datasets may have different ground truth test tags, which require different Dictionaries. On the other hand, if the task is tag refinement, we need to improve the tags of a given image by adding more relevant tags. In this case, we can have words semantically related to the ground truth tags in the Dictionary to be able to augment the existing tags of an image.

3.5 MIRFlickr

After composing the Dictionary with the 14 ground truth test tags of the MIRFlickr dataset, and removing the images that have no association with any of those tags, we are left with 21334 (image, caption) pairs for training. Note that training happens only on these images; our system is not trained on the benchmark datasets. The only information passed from the benchmark datasets is the set of ground truth test tags.

The network we trained for evaluating on the MIRFlickr dataset contains a single hidden layer (1024-128-14). The input image features f(I) are the fc7 features learned by Googlenet: 1024D units in the visible layer. The MIRFlickr dataset has 14 ground truth test tags, so that many units in the output layer with sigmoid non-linearity. We have employed Batch Normalization (BN) and Dropout layers after the hidden layer. The Dropout is increased progressively with depth from 0.25 to 0.5 while training the network. The results of the tag assignment are shown in Table 2. Note that all other approaches are trained on the Train1M [21] dataset, whereas we have trained our system on the Flickr8K dataset. We have obtained these numbers from the very recent survey paper by Li et al. [22]. In spite of the much smaller training data and the simple transformation learning, the results obtained by the proposed image2words are competitive.

Table 2: The metric values computed for the Tag Assignment experiment on the two benchmark datasets.

Method                 MIRFlickr   NUS-WIDE
MiAP:
CNN+KNN [25]           0.379       0.376
CNN+TagVote [18]       0.389       0.396
CNN+TagProp [34]       0.392       0.380
CNN+TagFeature [3]     0.383       0.373
CNN+RelExample [17]    0.373       0.388
image2words            0.48        -
MAP:
CNN+KNN [25]           0.639       0.4
CNN+TagVote [18]       0.638       0.402
CNN+TagProp [34]       0.641       0.397
CNN+TagFeature [3]     0.563       0.326
CNN+RelExample [17]    0.584       0.373
image2words            0.525       -

3.6 NUS-WIDE

to be filled

3.7 Ablation on the depth of the network

We have performed ablation experiments to choose the depth of our network. We measured the MiAP metric for networks of different depths. We have performed these experiments with all the training samples from MIRFlickr, which is 21334 (image, caption) pairs. The results are shown in Table 1.

Table 1: MiAP measure resulting from architectures of different depths on the MIRFlickr dataset. We use this observation to fix the depth of our network.

Depth   1024-128-14   1024-128-64-14   1024-512-128-64-14
MiAP    0.4801        0.4597           0.4652

4. RELATION TO THE EXISTING WORKS

Assigning tags to images has been a problem of investigation in the vision and multimedia research community. Several interesting methods [3, 18, 36, 34, 17, 38, 32, 24, 33, 16, 19, 39] have been presented. Techniques have been built that work with parametric and non-parametric models. An excellent review of image and tag related articles is presented by Li et al. [22]. We resort to this comprehensive review article for understanding the existing works in the field. Though there is rich literature available for the problem of image tag assignment, as we propose a learning technique, we present only a brief account of model based works for tag assignment. Tang et al. [32] proposed a graph-based semi-supervised approach to propagate tags over images with noisy tags. Similarly, Zhou et al. [38] proposed an approach to learn a number of visual concepts at image and region level from weakly labeled data. Many approaches have been proposed to assign tags to new images by comparing them to a training set. Li et al. refer to these methods as instance based approaches. Approaches such as Li et al. [19], Kennedy et al. [14], Li et al. [20], Lee et al. [16] and Zhu et al. [39] assign a tag based on its frequency of occurrence among the tags of its visual neighbors. Another set of works, Ballan et al. [1], Qi et al. [28] and Pereira et al. [27], try to approach the problem via cross-modal learning of images and tags. The proposed approach in this paper also attempts to learn a cross modal mapping from images to tags. However, we pose this as a multi-label classification problem, by treating the set of associated tags as multiple labels. It also ranks the set of tags by predicting a confidence value for each of them.

5. CONCLUSIONS

In this paper, we have presented an approach to assign suitable tags to images. Unlike many of the existing approaches, we posed it as a multi-label classification problem. We look at the tags as labels to be assigned to images and classify images to have multiple tags. We utilize the human given annotations of images in order to train our system, which is essentially a Multi Layer Perceptron (MLP). We observe that models can be trained efficiently over images and tags with strong association to the image contents to predict suitable tags for a new set of images.

6. REFERENCES
[1] L. Ballan, T. Uricchio, L. Seidenari, and A. Del Bimbo. A cross-media model for automatic image annotation. In Proceedings of International Conference on Multimedia Retrieval, ICMR '14, pages 73:73-73:80, New York, NY, USA, 2014. ACM.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346-359, June 2008.
[3] L. Chen, D. Xu, I. W. Tsang, and J. Luo. Tag-based image retrieval improved by augmented features and group-based refinement. IEEE Transactions on Multimedia (T-MM), 14(4):1057-1067, August 2012.
[4] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015.
[5] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proc. of ACM Conf. on Image and Video Retrieval (CIVR'09), Santorini, Greece, July 8-10, 2009.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886-893, 2005.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[8] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV - European Conference on Computer Vision, 2014.
[9] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Int. Res., 47(1):853-899, May 2013.
[10] M. J. Huiskes, B. Thomee, and M. S. Lew. New trends and ideas in visual concept detection. In ACM International Conference on Multimedia Information Retrieval, 2010.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[12] A. Karpathy, A. Joulin, and F. Li. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, pages 1889-1897, 2014.
[13] A. Karpathy and F.-F. Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128-3137. IEEE, 2015.
[14] L. Kennedy, M. Slaney, and K. Weinberger. Reliable tags using image similarity: Mining specificity and expertise from large-scale multimedia databases. In Proceedings of the 1st Workshop on Web-scale Multimedia Corpus, WSMC '09, pages 17-24, New York, NY, USA, 2009. ACM.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[16] S. Lee, W. De Neve, and Y. M. Ro. Visually weighted neighbor voting for image tag relevance learning. Multimedia Tools and Applications, 72(2):1363-1386, 2014.
[17] X. Li and C. G. M. Snoek. Classifying tag relevance with relevant positive and negative examples. In ACM International Conference on Multimedia, 2013.
[18] X. Li, C. G. M. Snoek, and M. Worring. Learning social tag relevance by neighbor voting. IEEE Transactions on Multimedia, 11(7):1310-1322, 2009.
[19] X. Li, C. G. M. Snoek, and M. Worring. Learning social tag relevance by neighbor voting. IEEE Transactions on Multimedia, 11(7):1310-1322, Nov 2009.
[20] X. Li, C. G. M. Snoek, and M. Worring. Unsupervised multi-feature tag relevance learning for social image retrieval. In Proceedings of the ACM International Conference on Image and Video Retrieval, CIVR '10, pages 10-17, New York, NY, USA, 2010. ACM.
[21] X. Li, C. G. M. Snoek, M. Worring, and A. W. M. Smeulders. Harvesting social images for bi-concept search. IEEE Transactions on Multimedia, 14(4):1091-1104, 2012.
[22] X. Li, T. Uricchio, L. Ballan, M. Bertini, C. G. M. Snoek, and A. D. Bimbo. Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Comput. Surv., 49(1):14:1-14:39, June 2016.
[23] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2), 2004.
[24] H. Ma, J. Zhu, M. R. T. Lyu, and I. King. Bridging the semantic gap between image contents and tags. IEEE Transactions on Multimedia, 12(5):462-473, Aug 2010.
[25] A. Makadia, V. Pavlovic, and S. Kumar. Baselines for image annotation. International Journal of Computer Vision, 90(1):88-105, 2010.
[26] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the BMVC, 2002.
[27] J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos. On the role of correlation and abstraction in cross-modal multimedia retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):521-535, March 2014.
[28] G.-J. Qi, C. C. Aggarwal, Q. Tian, H. Ji, and T. S. Huang. Exploring context and content links in social media: A latent space method. IEEE Trans. Pattern Anal. Mach. Intell., 34(5):850-862, 2012.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[32] J. Tang, R. Hong, S. Yan, T.-S. Chua, G.-J. Qi, and R. Jain. Image annotation by knn-sparse graph-based label propagation over noisily tagged web images. ACM Trans. Intell. Syst. Technol., 2(2):14:1-14:15, Feb. 2011.
[33] B. Q. Truong, A. Sun, and S. S. Bhowmick. Content is still king: The effect of neighbor voting schemes on tag relevance for social image retrieval. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR '12, pages 9:1-9:8, New York, NY, USA, 2012. ACM.
[34] J. Verbeek, M. Guillaumin, T. Mensink, and C. Schmid. Image annotation with TagProp on the MIRFLICKR set. In MIR 2010 - 11th ACM International Conference on Multimedia Information Retrieval, pages 537-546, Philadelphia, United States, Mar. 2010. ACM Press.
[35] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pages 3156-3164, 2015.
[36] H. Xu, J. Wang, X.-S. Hua, and S. Li. Tag refinement by regularized LDA. In Proceedings of the 17th ACM International Conference on Multimedia, MM '09, pages 573-576, New York, NY, USA, 2009. ACM.
[37] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67-78, 2014.
[38] B. Zhou, V. Jagadeesh, and R. Piramuthu. ConceptLearner: Discovering visual concepts from weakly labeled image collections. Computer Vision and Pattern Recognition (CVPR), 2015.
[39] X. Zhu, W. Nejdl, and M. Georgescu. An adaptive teleportation random walk model for learning social tag relevance. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '14, pages 223-232, New York, NY, USA, 2014. ACM.