
SCENE CLASSIFICATION OF HIGH RESOLUTION REMOTE SENSING

IMAGES USING CONVOLUTIONAL NEURAL NETWORKS

Gong Cheng, Chengcheng Ma, Peicheng Zhou, Xiwen Yao, Junwei Han

School of Automation, Northwestern Polytechnical University, Xi’an, 710072, China

ABSTRACT

Scene classification of high resolution remote sensing images plays an important role in a wide range of applications. While significant efforts have been made to develop various methods for scene classification, most of them rely on handcrafted or shallow learning-based features. In this paper, we investigate the use of deep convolutional neural networks (CNN) for scene classification. To this end, we first adopt two simple and effective strategies to extract CNN features: (1) using pre-trained CNN models as universal feature extractors, and (2) domain-specifically fine-tuning pre-trained CNN models on our scene classification dataset. Scene classification is then carried out with simple classifiers such as a linear support vector machine (SVM). In our work, three off-the-shelf CNN models, namely AlexNet [1], VGGNet [2], and GoogleNet [3], are investigated. Comprehensive evaluations on a publicly available 21-class land-use dataset and comparisons with several state-of-the-art approaches demonstrate that deep CNN features are effective for scene classification of high resolution remote sensing images.

Index Terms—Scene classification, convolutional neural network (CNN), deep learning, feature extraction, remote sensing images

1. INTRODUCTION

Scene classification of high resolution remote sensing images plays an important role in a wide range of applications, such as land-use determination, natural hazards detection, urban planning, and geospatial object detection [4-17]. During the last decades, significant efforts have been made to develop various approaches for scene classification from satellite and aerial images [11, 15, 18-22]. However, most of them are based on handcrafted or shallow learning-based features, such as the bag-of-visual-words (BOVW) model [23] and its variations [4, 20, 21], sparse coding-based features [12, 19], and midlevel visual elements-oriented features [9, 11, 15, 18]. Although these methods have achieved good performance for scene classification of very high resolution (VHR) remote sensing images, as the scene classification task becomes more challenging, the descriptive capability of such low-/mid-level image features becomes very limited.

More recently, deep learning algorithms, especially convolutional neural networks (CNN), have shown much stronger feature representation power in computer vision. In this paper, we investigate the use of deep CNN features for land-use scene classification. To this end, we first adopt two simple and effective strategies to extract CNN features: (1) using pre-trained CNN models as universal feature extractors, and (2) domain-specifically fine-tuning pre-trained CNN models on our scene classification dataset. Scene classification is then carried out with simple classifiers such as a linear SVM. In our work, three off-the-shelf CNN models, namely AlexNet [1], VGGNet [2], and GoogleNet [3], are investigated. Comprehensive evaluations on a publicly available 21-class land-use dataset and comparisons with several state-of-the-art approaches demonstrate that deep CNN features are effective for scene classification of VHR remote sensing images.

2. METHOD

Since scene classification is usually performed in feature space, effective image feature extraction is at the core of scene classification. In this paper, we introduce deep CNN models to extract richer, high-level descriptions of image content, which offers a possible solution for narrowing the semantic gap between low-level feature representations and high-level visual recognition tasks.

978-1-5090-3332-4/16/$31.00 ©2016 IEEE 767 IGARSS 2016


Fig. 1. Overview of the introduced deep CNN-based scene classification method.
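The feature-extraction stage of Fig. 1 keeps the activation of the layer just before the classifier as the image descriptor. The following is a minimal, hypothetical numpy sketch of that idea using a tiny untrained stand-in network; the paper's actual extractors are AlexNet, VGGNet, and GoogleNet pre-trained on ImageNet, and the filter counts and feature size here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def forward_features(img, conv_w, fc_w):
    """Run the stand-in network (conv -> ReLU -> global average pool -> FC)
    but stop before the final classification layer, returning the
    penultimate activation as the image's feature vector."""
    n_filt, k, _ = conv_w.shape
    h, w = img.shape
    fmap = np.empty((n_filt, h - k + 1, w - k + 1))
    for f in range(n_filt):                      # 'valid' cross-correlation
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                fmap[f, i, j] = np.sum(img[i:i + k, j:j + k] * conv_w[f])
    act = np.maximum(fmap, 0.0)                  # ReLU
    pooled = act.mean(axis=(1, 2))               # global average pooling
    return pooled @ fc_w                         # penultimate FC output

# Illustrative sizes only: 8 random 3x3 filters, a 16-d feature layer.
conv_w = rng.normal(size=(8, 3, 3))
fc_w = rng.normal(size=(8, 16))
feature = forward_features(rng.normal(size=(32, 32)), conv_w, fc_w)
```

In the paper's setting, the analogous vector would be, for example, the 4096-dimensional output of AlexNet's last fully connected layer before the softmax.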

With tens of millions of parameters to learn, directly training a deep CNN model on our scene classification dataset is problematic, since the dataset contains only thousands of images, which is too scarce to fit so many parameters. Fortunately, thanks to the powerful generalization ability of pre-trained deep CNN models, we can transfer the parameters of those pre-trained models directly to our own application. In this work, we build on three successful CNN models, namely AlexNet [1], VGGNet [2] (16 layers), and GoogleNet [3], for our scene classification task.

Fig. 1 gives an overview of the introduced CNN feature-based scene classification method. Specifically, given the training dataset, we first extract the deep CNN features of each image using two simple and effective strategies. In the first strategy, the CNN model parameters are pre-trained on the ImageNet dataset [24] and then directly transferred to our scene classification task. That is, we simply use the pre-trained CNN models as feature extractors by removing the last classification layer and taking the output of the previous layer as the feature vector of the input image. In the second strategy, the CNN model parameters are also pre-trained on the ImageNet dataset [24] but then domain-specifically fine-tuned to adapt to our VHR remote sensing image dataset: the CNN's 1000-way softmax classification layer is replaced with a C-way softmax layer (for our C scene classes) and training continues with a smaller learning rate, which allows fine-tuning to make progress without clobbering the initialization. Then, we train a linear one-vs-all SVM classifier for each image class by treating the images of the chosen class as positive instances and all remaining images as negative instances. Finally, an unlabeled test image is assigned the label of the classifier with the highest response.

1 http://vision.ucmerced.edu/datasets
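The classification stage above (one linear one-vs-all SVM per class, with each test image assigned to the highest-responding classifier) can be sketched as follows. This is a hedged illustration using a plain hinge-loss subgradient trainer and toy 2-D clusters standing in for CNN features; the paper does not specify its SVM solver:

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=300):
    """Binary linear SVM trained with the hinge-loss subgradient.
    X: (n, d) feature vectors; y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1                     # margin violators
        w -= lr * (lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / n)
        b += lr * y[viol].sum() / n
    return w, b

def fit_one_vs_all(X, labels, n_classes):
    """One SVM per class: the chosen class is positive, all others negative."""
    return [train_linear_svm(X, np.where(labels == c, 1.0, -1.0))
            for c in range(n_classes)]

def classify(classifiers, X):
    """Assign each sample the label of the classifier with the highest response."""
    scores = np.stack([X @ w + b for w, b in classifiers], axis=1)
    return scores.argmax(axis=1)

# Toy stand-in for extracted CNN features: three well-separated 2-D clusters.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
feats = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in centers])
labels = np.repeat(np.arange(3), 20)
pred = classify(fit_one_vs_all(feats, labels, 3), feats)
```

In the actual pipeline, the rows of the feature matrix would be the CNN feature vectors extracted by strategy 1 or 2.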
3. EXPERIMENTAL RESULTS

In the experiments, we comprehensively evaluate the performance of the introduced deep CNN-based scene classification method on the publicly available UC Merced land-use dataset 1 [20, 21]. The dataset is composed of the following 21 land-use classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. Each class consists of 100 images measuring 256×256 pixels, with a pixel resolution of 30 cm in the red-green-blue (RGB) color space.

Implementation details. Due to the limited size of the training set, data augmentation is necessary to increase the number of training examples and avoid over-fitting. In our implementation, for each training image, we first rotate it in steps of 90° from 0° to 360°, and then extract five n×n patches (the four corner patches and the center patch) as well as their horizontal reflections from each rotated image. This increases the size of the training set by a factor of 40. For AlexNet [1], n=227; for VGGNet [2] and GoogleNet [3], n=224. For the fine-tuning of all three CNN models, (1) the batch size is set to 50; (2) the learning rate is set to 0.001 and is decreased by a factor of 0.1 every 3000 iterations; and (3) the momentum is set to 0.9 and the weight decay to 0.0005 for all layers.
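The augmentation recipe above (4 rotations × 5 patches × 2 reflections = 40 examples per image) can be sketched in a few lines of numpy; this is a hedged illustration using n = 224 as for VGGNet and GoogleNet (AlexNet uses n = 227), with zeros standing in for real pixel data:

```python
import numpy as np

def augment(img, n=224):
    """40x augmentation: rotate in 90-degree steps (0, 90, 180, 270),
    crop the four corner patches plus the centre n x n patch from each
    rotation, and add every patch's horizontal reflection."""
    h, w = img.shape[:2]
    offsets = [(0, 0), (0, w - n), (h - n, 0), (h - n, w - n),
               ((h - n) // 2, (w - n) // 2)]     # four corners + centre
    patches = []
    for k in range(4):
        rot = np.rot90(img, k)                   # square input keeps its shape
        for r, c in offsets:
            p = rot[r:r + n, c:c + n]
            patches.append(p)
            patches.append(p[:, ::-1])           # horizontal reflection
    return patches

# A UC Merced image is 256x256 RGB; zeros stand in for real pixel data.
crops = augment(np.zeros((256, 256, 3)), n=224)
```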
To make a comprehensive comparison with state-of-the-art methods [11, 18-22] that have been evaluated on the same land-use dataset, we use the same five-fold cross-validation methodology as in [11, 18-22]. Specifically, the images of each class are randomly divided into five equal non-overlapping sets. For each land-use class, we select four sets for training and evaluate on the held-out set. The classification accuracy is the fraction of held-out images over all 21 classes that are correctly labeled. For a fair comparison, we adopt the same training and test datasets for all three CNN models, and the results of [11, 18-22] are the best results reported in the respective literature.

Table 1 Average classification accuracies (%) of 15 different methods on UC Merced land-use dataset

Method                                                  Accuracy (%)
BOVW [21]                                               76.81
BOVW+SCK [21]                                           77.71
Color-HLS [21]                                          81.19
Texture [21]                                            76.91
SPCK++ [20]                                             77.38
The method of [19]                                      81.67
Partlets [11]                                           91.33
UFL-SC [22]                                             90.26
The method of [18]                                      86.67
CNN-based, strategy 1 (without fine-tuning): AlexNet [1]    95.00
CNN-based, strategy 1 (without fine-tuning): VGGNet [2]     93.57
CNN-based, strategy 1 (without fine-tuning): GoogleNet [3]  94.29
CNN-based, strategy 2 (with fine-tuning): AlexNet [1]       97.14
CNN-based, strategy 2 (with fine-tuning): VGGNet [2]        98.10
CNN-based, strategy 2 (with fine-tuning): GoogleNet [3]     98.33

Table 1 gives the average classification accuracies over all 21 land-use classes of our method with three different deep CNN models (AlexNet [1], 16-layer VGGNet [2], and GoogleNet [3]) and 9 state-of-the-art approaches detailed in [11, 18-22]. As can be seen from Table 1: (1) the introduced CNN-based scene classification methods, both without and with fine-tuning (strategies 1 and 2), outperform all other comparison approaches [11, 18-22]. This shows that CNN-based methods can significantly improve on traditional handcrafted feature-based methods and demonstrates the effectiveness and superiority of the introduced deep CNN-based methods. (2) By fine-tuning the CNN networks, the average classification accuracy is further boosted by 2.14%, 4.53%, and 4.04% for AlexNet [1], 16-layer VGGNet [2], and GoogleNet [3], respectively. (3) The deeper CNN models perform better (98.10% and 98.33% vs. 97.14%), but the margin is relatively small (~1%).

4. CONCLUSIONS

In this paper, we introduced a deep CNN-based scene classification method, which adopts existing high-capacity CNN models to extract richer high-level image representation features. Comparisons with the state of the art on a publicly available land-use dataset demonstrated the effectiveness of our method. In future work, we will (1) test the method on more publicly available remote sensing image datasets; and (2) apply dimension reduction methods such as [25] to generate compact low-dimensional features in order to reduce computational complexity.

5. ACKNOWLEDGMENT

This work was supported in part by the National Science Foundation of China under Grant 61401357, the Fundamental Research Funds for the Central Universities under Grant 3102016ZY023, and the Innovation Foundation for Doctor Dissertation of NPU under Grant CX201622.

6. REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Conf. Adv. Neural Inform. Process. Syst., 2012, pp. 1097-1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learning Representations, 2015, pp. 1-14.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., "Going deeper with convolutions," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2015, pp. 1-9.
[4] G. Cheng, L. Guo, T. Zhao, J. Han, H. Li, and J. Fang, "Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA," Int. J. Remote Sens., vol. 34, no. 1, pp. 45-59, 2013.
[5] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, "Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325-3337, 2015.
[6] G. Cheng and J. Han, "A survey on object detection in optical remote sensing images," ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11-28, 2016.
[7] X. Yao, J. Han, G. Cheng, and X. Qian, "Semantic annotation of high-resolution satellite images via weakly supervised learning," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 6, pp. 3660-3671, 2016.
[8] D. Zhang, J. Han, G. Cheng, Z. Liu, S. Bu, and L. Guo, "Weakly supervised learning for target detection in remote sensing images," IEEE Geosci. Remote Sens. Lett., vol. 12, no. 4, pp. 701-705, 2015.
[9] G. Cheng, J. Han, L. Guo, and T. Liu, "Learning coarse-to-fine sparselets for efficient object detection and scene classification," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., 2015, pp. 1173-1181.
[10] P. Zhou, G. Cheng, Z. Liu, S. Bu, and X. Hu, "Weakly supervised target detection in remote sensing images based on transferred deep features and negative bootstrapping," Multidimension. Syst. Signal Process., DOI: 10.1007/s11045-015-0370-3, 2015.
[11] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, "Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 8, pp. 4238-4249, 2015.
[12] J. Han, P. Zhou, D. Zhang, G. Cheng, L. Guo, Z. Liu, et al., "Efficient, simultaneous detection of multi-class geospatial targets based on visual saliency modeling and discriminative learning of sparse coding," ISPRS J. Photogramm. Remote Sens., vol. 89, pp. 37-48, 2014.
[13] G. Cheng, J. Han, L. Guo, X. Qian, P. Zhou, X. Yao, et al., "Object detection in remote sensing imagery using a discriminatively trained mixture model," ISPRS J. Photogramm. Remote Sens., vol. 85, pp. 32-43, 2013.
[14] X. Yao, J. Han, L. Guo, S. Bu, and Z. Liu, "A coarse-to-fine model for airport detection from remote sensing images using target-oriented visual saliency and CRF," Neurocomputing, vol. 164, pp. 162-172, 2015.
[15] G. Cheng, J. Han, P. Zhou, and L. Guo, "Multi-class geospatial object detection and geographic image classification based on collection of part detectors," ISPRS J. Photogramm. Remote Sens., vol. 98, pp. 119-132, 2014.
[16] G. Cheng, J. Han, P. Zhou, and L. Guo, "Scalable multi-class geospatial object detection in high-spatial-resolution remote sensing images," in Proc. IEEE Int. Geosci. Remote Sens. Symp., 2014, pp. 2479-2482.
[17] L. Yang, G. Bi, M.-d. Xing, and L. Zhang, "Airborne SAR moving target signatures and imagery based on LVD," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 11, pp. 5958-5971, 2015.
[18] G. Cheng, P. Zhou, J. Han, L. Guo, and J. Han, "Auto-encoder-based shared mid-level visual dictionary learning for scene classification using very high resolution remote sensing images," IET Comput. Vis., vol. 9, no. 5, pp. 639-647, 2015.
[19] A. M. Cheriyadat, "Unsupervised feature learning for aerial scene classification," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 1, pp. 439-451, 2014.
[20] Y. Yang and S. Newsam, "Spatial pyramid co-occurrence for image classification," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 1465-1472.
[21] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proc. ACM SIGSPATIAL Int. Conf. Adv. Geogr. Inform. Syst., 2010, pp. 270-279.
[22] F. Hu, G.-S. Xia, Z. Wang, X. Huang, L. Zhang, and H. Sun, "Unsupervised feature learning via spectral clustering of multidimensional patches for remotely sensed scene classification," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2015-2030, 2015.
[23] F. F. Li and P. Perona, "A bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., 2005, pp. 524-531.
[24] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, "Imagenet: A large-scale hierarchical image database," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2009, pp. 248-255.
[25] C. Yao and G. Cheng, "Approximative Bayes optimality linear discriminant analysis for Chinese handwriting character recognition," Neurocomputing, DOI: 10.1016/j.neucom.2016.05.017, 2016.
