Gong Cheng, Chengcheng Ma, Peicheng Zhou, Xiwen Yao, Junwei Han
the semantic gap between low-level feature representation and high-level visual recognition tasks.

With tens of millions of parameters to learn, directly training a deep CNN model on our scene classification dataset is problematic, since the dataset contains only thousands of images, which is far too few to fit so many parameters. Fortunately, thanks to the powerful generalization ability of pre-trained deep CNN models, we can transfer the parameters of those pre-trained models directly to our own application. In this work, we build on three successful CNN models, AlexNet [1], VGGNet [2] (16 layers), and GoogleNet [3], to accomplish our scene classification task.

Fig. 1 gives an overview of the introduced CNN feature-based scene classification method. Specifically, given the training dataset, we first extract the deep CNN features of each image using two simple and effective strategies. In the first strategy, the CNN model parameters are pre-trained on the ImageNet dataset [24] and then transferred directly to our scene classification task. That is, we simply use the pre-trained CNN models as feature extractors by removing the last classification layer and taking the output of its preceding layer as the feature vector of the input image. In the second strategy, the CNN model parameters are also pre-trained on the ImageNet dataset [24] but then domain-specifically fine-tuned to adapt to our VHR remote sensing image dataset by replacing the CNN's 1000-way softmax classification layer with a C-way softmax classification layer (for our C scene classes) and using a smaller learning rate, which allows fine-tuning to make progress without clobbering the initialization. Then, we train a linear one-vs-all SVM classifier for each image class by treating the images of the chosen class as positive instances and the remaining images as negative instances. Finally, an unlabeled test image is assigned the label of the classifier with the highest response.

3. EXPERIMENTAL RESULTS

In the experiments, we comprehensively evaluate the performance of the introduced deep CNN-based scene classification method on the publicly available UC Merced land-use dataset (http://vision.ucmerced.edu/datasets) [20, 21]. The dataset is composed of the following 21 land-use classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. Each class consists of 100 images measuring 256×256 pixels, with a pixel resolution of 30 cm, in the red-green-blue (RGB) color space.
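As an illustrative sketch of the pipeline described above (not the implementation used in this work), the following shows strategy 1's classification stage: fixed feature vectors per image, one linear scorer per class trained one-vs-all, and each test image assigned to the class with the highest response. Here random vectors stand in for the deep CNN features, and ridge regression stands in for the linear SVM; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for deep CNN features: in the paper these would be the
# activations of the layer preceding the removed softmax layer;
# here they are random vectors with one class-specific coordinate boosted.
C, n_per_class, d = 21, 20, 64
X = rng.normal(size=(C * n_per_class, d))
y = np.repeat(np.arange(C), n_per_class)
X[np.arange(len(y)), y % d] += 5.0  # make the classes linearly separable

def train_one_vs_all(X, y, n_classes, lam=1e-2):
    """One linear scorer per class: the chosen class is positive (+1),
    all remaining images are negative (-1). Ridge regression is used
    here as a simple stand-in for a linear SVM."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    W = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)
        W[c] = np.linalg.solve(A, X.T @ t)
    return W

W = train_one_vs_all(X, y, C)

# A test image is assigned the label of the classifier with the
# highest response (here evaluated on the training features).
pred = np.argmax(X @ W.T, axis=1)
print(f"accuracy on these toy features: {(pred == y).mean():.2f}")
```

The one-vs-all decision rule is exactly the argmax over per-class responses; only the scorer (SVM vs. ridge) differs from the paper's setup.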
Implementation details. Owing to the limited size of the training set, data augmentation to increase the number of training examples is necessary to avoid over-fitting. In our implementation, for each training image, we first rotate it in steps of 90° from 0° to 360°, and then extract five n×n patches (the four corner patches and the center patch) as well as their horizontal reflections from each rotated image. This increases the size of our training set by a factor of 40. For AlexNet [1] n=227, and for VGGNet [2] and GoogleNet [3] n=224. For the fine-tuning of all three CNN models, (1) the batch size is set to 50; (2) the learning rate is set to 0.001 and is decreased by a factor of 0.1 every 3000 iterations; and (3) the momentum is set to 0.9 and the weight decay to 0.0005 for all layers.

Table 1. Average classification accuracies (%) of 15 different methods on the UC Merced land-use dataset

    Method                                                       Accuracy (%)
    BOVW [21]                                                    76.81
    BOVW+SCK [21]                                                77.71
    Color-HLS [21]                                               81.19
    Texture [21]                                                 76.91
    SPCK++ [20]                                                  77.38
    The method of [19]                                           81.67
    Partlets [11]                                                91.33
    UFL-SC [22]                                                  90.26
    The method of [18]                                           86.67
    CNN-based (strategy 1: without fine-tuning), AlexNet [1]     95.00
    CNN-based (strategy 1: without fine-tuning), VGGNet [2]      93.57
    CNN-based (strategy 1: without fine-tuning), GoogleNet [3]   94.29
    CNN-based (strategy 2: with fine-tuning), AlexNet [1]        97.14
    CNN-based (strategy 2: with fine-tuning), VGGNet [2]         98.10
    CNN-based (strategy 2: with fine-tuning), GoogleNet [3]      98.33

To make a comprehensive comparison with state-of-the-art methods [11, 18-22] that have been evaluated on the same land-use dataset, we use the same five-fold cross-validation methodology as in [11, 18-22]. To be specific, the images of each class are randomly divided into five equal, non-overlapping sets.
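The per-class five-fold split described above can be sketched as follows (a minimal stand-in, not the authors' code; the fold size of 20 follows from the 100 images per class, and the index bookkeeping is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

n_classes, n_per_class, n_folds = 21, 100, 5
fold_size = n_per_class // n_folds  # 20 images per class per fold

# For each class, randomly split its 100 image indices into
# five equal, non-overlapping sets of 20.
folds = [[] for _ in range(n_folds)]
for c in range(n_classes):
    idx = rng.permutation(n_per_class) + c * n_per_class
    for f in range(n_folds):
        folds[f].extend(idx[f * fold_size:(f + 1) * fold_size])

# One cross-validation round: hold out one fold, train on the rest.
test_set = set(folds[0])
train_set = set().union(*folds[1:])
print(len(train_set), len(test_set))  # 1680 420
```

Each of the five rounds holds out a different fold, and the reported accuracy is the fraction of held-out images correctly labeled.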
For each land-use class, we select four sets for training and evaluate on the held-out set. The classification accuracy is the fraction of held-out images over all 21 classes that are correctly labeled. For a fair comparison, we adopt the same training and test datasets for all three CNN models, and the results of [11, 18-22] are the best results reported in the corresponding literature.

Table 1 gives the average classification accuracies over all 21 land-use classes of our method with three different deep CNN models (AlexNet [1], 16-layer VGGNet [2], and GoogleNet [3]) and 9 state-of-the-art approaches detailed in [11, 18-22]. As can be seen from Table 1: (1) the introduced CNN-based scene classification methods, both without and with fine-tuning (strategies 1 & 2), outperform all other comparison approaches [11, 18-22]. This shows that CNN-based methods can significantly improve on traditional handcrafted feature-based methods, and it demonstrates the effectiveness and superiority of the introduced deep CNN-based methods. (2) By fine-tuning the CNN networks, the average classification accuracy is further boosted by 2.14%, 4.53%, and 4.04% for AlexNet [1], 16-layer VGGNet [2], and GoogleNet [3], respectively. (3) The deeper CNN models perform better (98.10% and 98.33% vs. 97.14%), but the margin is relatively small (~1%).

4. CONCLUSIONS

In this paper, we introduced a deep CNN-based scene classification method, which adopts existing high-capacity CNN models to extract richer high-level image representation features. Comparisons with the state of the art on a publicly available land-use dataset demonstrated the effectiveness of our method. In future work, we will (1) test the method on more publicly available remote sensing image datasets; and (2) apply dimension reduction methods such as [25] to generate compact low-dimensional features in order to reduce computational complexity.

5. ACKNOWLEDGMENT

This work was supported in part by the National Science Foundation of China under Grant 61401357, the Fundamental Research Funds for the Central Universities under Grant 3102016ZY023, and the Innovation Foundation for Doctor Dissertation of NPU under Grant CX201622.

6. REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Conf. Adv. Neural Inform. Process. Syst., 2012, pp. 1097-1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learning Representations, 2015, pp. 1-14.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., "Going deeper with convolutions," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2015, pp. 1-9.
[4] G. Cheng, L. Guo, T. Zhao, J. Han, H. Li, and J. Fang, "Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA," Int. J. Remote Sens., vol. 34, no. 1, pp. 45-59, 2013.
[5] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, "Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325-3337, 2015.
[6] G. Cheng and J. Han, "A survey on object detection in optical remote sensing images," ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11-28, 2016.
[7] X. Yao, J. Han, G. Cheng, and X. Qian, "Semantic annotation of high-resolution satellite images via weakly supervised learning," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 6, pp. 3660-3671, 2016.
[8] D. Zhang, J. Han, G. Cheng, Z. Liu, S. Bu, and L. Guo, "Weakly supervised learning for target detection in remote sensing images," IEEE Geosci. Remote Sens. Lett., vol. 12, no. 4, pp. 701-705, 2015.
[9] G. Cheng, J. Han, L. Guo, and T. Liu, "Learning coarse-to-fine sparselets for efficient object detection and scene classification," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., 2015, pp. 1173-1181.
[10] P. Zhou, G. Cheng, Z. Liu, S. Bu, and X. Hu, "Weakly supervised target detection in remote sensing images based on transferred deep features and negative bootstrapping," Multidimension. Syst. Signal Process., DOI: 10.1007/s11045-015-0370-3, 2015.
[11] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, "Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 8, pp. 4238-4249, 2015.
[12] J. Han, P. Zhou, D. Zhang, G. Cheng, L. Guo, Z. Liu, et al., "Efficient, simultaneous detection of multi-class geospatial targets based on visual saliency modeling and discriminative learning of sparse coding," ISPRS J. Photogramm. Remote Sens., vol. 89, pp. 37-48, 2014.
[13] G. Cheng, J. Han, L. Guo, X. Qian, P. Zhou, X. Yao, et al., "Object detection in remote sensing imagery using a discriminatively trained mixture model," ISPRS J. Photogramm. Remote Sens., vol. 85, pp. 32-43, 2013.
[14] X. Yao, J. Han, L. Guo, S. Bu, and Z. Liu, "A coarse-to-fine model for airport detection from remote sensing images using target-oriented visual saliency and CRF," Neurocomputing, vol. 164, pp. 162-172, 2015.
[15] G. Cheng, J. Han, P. Zhou, and L. Guo, "Multi-class geospatial object detection and geographic image classification based on collection of part detectors," ISPRS J. Photogramm. Remote Sens., vol. 98, pp. 119-132, 2014.
[16] G. Cheng, J. Han, P. Zhou, and L. Guo, "Scalable multi-class geospatial object detection in high-spatial-resolution remote sensing images," in Proc. IEEE Int. Geosci. Remote Sens. Symp., 2014, pp. 2479-2482.
[17] L. Yang, G. Bi, M.-d. Xing, and L. Zhang, "Airborne SAR moving target signatures and imagery based on LVD," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 11, pp. 5958-5971, 2015.
[18] G. Cheng, P. Zhou, J. Han, L. Guo, and J. Han, "Auto-encoder-based shared mid-level visual dictionary learning for scene classification using very high resolution remote sensing images," IET Comput. Vis., vol. 9, no. 5, pp. 639-647, 2015.
[19] A. M. Cheriyadat, "Unsupervised feature learning for aerial scene classification," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 1, pp. 439-451, 2014.
[20] Y. Yang and S. Newsam, "Spatial pyramid co-occurrence for image classification," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 1465-1472.
[21] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proc. ACM SIGSPATIAL Int. Conf. Adv. Geogr. Inform. Syst., 2010, pp. 270-279.
[22] F. Hu, G.-S. Xia, Z. Wang, X. Huang, L. Zhang, and H. Sun, "Unsupervised feature learning via spectral clustering of multidimensional patches for remotely sensed scene classification," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2015-2030, 2015.
[23] F. F. Li and P. Perona, "A bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., 2005, pp. 524-531.
[24] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, "Imagenet: A large-scale hierarchical image database," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2009, pp. 248-255.
[25] C. Yao and G. Cheng, "Approximative Bayes optimality linear discriminant analysis for Chinese handwriting character recognition," Neurocomputing, DOI: 10.1016/j.neucom.2016.05.017, 2016.