Engineering Geology
journal homepage: www.elsevier.com/locate/enggeo
Article info

Article history:
Received 23 February 2010
Received in revised form 11 August 2010
Accepted 10 September 2010
Available online 19 September 2010

Keywords:
Landslide predictability
Decision tree
Spatial events
C4.5 algorithm
Korea

Abstract

A data mining classification technique can be applied to landslide susceptibility mapping. Because of its advantages, a decision tree is one popular classification algorithm, although hardly used previously to analyze landslide susceptibility because the obtained data assume a uniform class distribution whereas landslide spatial event data when represented on a grid raster layer are highly class imbalanced. For this study of South Korean landslides, a decision tree was constructed using Quinlan's algorithm C4.5. The susceptibility of landslide occurrence was then deduced using leaf-node ranking or m-branch smoothing. The area studied at Injae suffered substantial landslide damage after heavy rains in 2006. Landslide-related factors for nearly 600 landslides were extracted from local maps: topographic, including curvature, slope, distance to ridge, and aspect; forest, providing age, type, density, and diameter; and soil texture, drainage, effective thickness, and material. For the quantitative assessment of landslide susceptibility, the accuracy of the twofold cross-validation was 86.08%; accuracy using all known data was 89.26% based on a cumulative lift chart. A decision tree can therefore be used efficiently for landslide susceptibility analysis and might be widely used for prediction of various spatial events.
Crown Copyright © 2010 Published by Elsevier B.V. All rights reserved.
⁎ Corresponding authors: J.-G. Han is to be contacted at Geoscience Information Department, Korea Institute of Geoscience and Mineral Resources (KIGAM), Gwahang-no 92, Yusung-gu, Daejeon, 305-350, South Korea. Tel.: +82 42 868 3297; fax: +82 42 868 3413. K.H. Ryu, College of Electrical & Computer Engineering, Chungbuk National University, 410 Seongbong-ro, Heungdeok-gu, Cheongju, Chungbuk, South Korea. Tel.: +82 43 267 2254; fax: +82 44 275 2254.
E-mail addresses: jghan@kigam.re.kr (J.-G. Han), khryu@dblab.chungbuk.ac.kr (K.H. Ryu).
0013-7952/$ – see front matter. Crown Copyright © 2010 Published by Elsevier B.V. All rights reserved.
doi:10.1016/j.enggeo.2010.09.009
Y.-K. Yeon et al. / Engineering Geology 116 (2010) 274–283

1. Introduction

Landslides occur mainly because of heavy rain, and their recurrence year after year has caused heavy damage to property and lives not only in Korea but also throughout the world. To mitigate landslide damage, it is necessary to assess and manage areas that are susceptible to them. Hence, in recent years, the assessment of landslide hazard and risk has become a topic of major interest (Aleotti and Chowdhury, 1999). Landslide susceptibility is defined as the propensity of an area to generate landslides (Guzzetti et al., 2006), with susceptibility represented by a relative value in a given area. Recently, with the development of GIS data-processing techniques, quantitative studies have been applied to landslide susceptibility analysis using various techniques. Such studies can be grouped by the techniques used, such as probabilistic methods (Luzi et al., 2000; Lee and Min, 2001; Donati and Turrini, 2002; Lee and Chol, 2003; Neuhäuser and Terhorst, 2007), logistic regression (Atkinson and Massari, 1998; Dai et al., 2001; Dai and Lee, 2001; Nefeslioglu et al., 2008), and artificial neural network methods (Ermini et al., 2004; Lee et al., 2004; Gómez and Kavzoglu, 2005; Melchiorre et al., 2008). Most of these studies were aimed at increasing the accuracy of landslide prediction by finding suitable techniques for the respective study area.

The objective of this study was to suggest a method to carry out landslide susceptibility analysis using a decision tree, a popular classification technique. Unlike other statistical methods, a decision tree makes no statistical assumptions, can handle data that are represented on different measurement scales, and is computationally fast (Pal and Mather, 2003). Also, such a tree represents a good compromise between comprehensibility, accuracy, and efficiency (Ferri et al., 2003). However, the decision tree algorithm was considered unsuitable for spatial event prediction such as landslide susceptibility analysis, because most decision tree algorithms, including C4.5 (Quinlan, 1993), normally require a discrete output class whereas susceptibility needs to be represented as a continuous value. The Classification and Regression Tree algorithm (CART) (Breiman et al., 1984), which can estimate probability, assumes a uniformly distributed training data set. Thus, previous studies (Saito et al., 2009; Nefeslioglu et al., 2010) only carried out limited applications without overcoming these problems.

To estimate probability from a decision tree in a class-imbalanced data set, Provost and Domingos (2003), Zadrozny and Elkan (2001), and Ferri et al. (2003) used leaf-node ranking methods, which are achieved by smoothing the class frequencies. They used C4.5 and demonstrated
better accuracy by growing a large tree. In this paper, we apply these previous methods to spatial events and suggest a method to carry out landslide susceptibility mapping using a decision tree.

2. Study area and data

The selected study area, shown in Fig. 1, covers 34,696,750 m² and is located between Inje-eup and Buk-myeon in the middle of Kwangwon, Korea. The site lies between latitude 38° 5' 52.19''N and 38° 3' 42.43''N, and longitude 128° 12' 4.92''E and 128° 17' 56.08''E. It is mainly a granite-based rocky mountainous area. Landslides in the study area were caused by heavy rainfall during the period July 11–18, 2006. The average annual rainfall of this area was about 1400 mm from 1995 to 2005 and rose to 1740 mm in 2006. During the 8-day period of landslide occurrence, about 559 mm of rain fell.

To extract the landslide causal factors for the area, we used a 1:5000-scale topographic map, a 1:25,000-scale soil map, and a 1:25,000-scale forest map. A 5 × 5 m Digital Elevation Model (DEM) extracted from the topographic map was used for generating slope, aspect, curvature, and distance from the mountain ridge. From the soil map, data on texture, drainage, material, and effective thickness were extracted. From the forest map, forest type, diameter class, density, and age data were extracted. All landslide factors were converted into 5 × 5 m float-type raster images with 1,387,870 pixels. Among the layers, slope, curvature, and distance from ridge had continuous values, whereas the others had discrete values, as shown in Table 1. As for the event data set, a total of 590 landslides were identified within the study area by analyzing a 0.4 m resolution airborne image and a Triangulated Irregular Network (TIN), as shown in Fig. 2. In this paper, we used ArcGIS 9.2 software for preparing the image data set.

Table 1
Data set used for landslide susceptibility mapping.

Map source                   Thematic layer        Type         Scale                Conversion (resolution)
Airborne image               Landslide point       Class        0.4 m
DEM (from topographic map)   Slope                 Continuous   5 × 5 m (1:5,000)    5 m × 5 m float-type image
                             Aspect                Discrete
                             Curvature             Continuous
                             Distance from ridge   Continuous
Soil map                     Texture               Discrete     1:25,000             5 m × 5 m float-type image
                             Drainage              Discrete
                             Material              Discrete
                             Effective thickness   Discrete
Forest map                   Forest type           Discrete     1:25,000             5 m × 5 m float-type image
                             Diameter class        Discrete
                             Density               Discrete
                             Age                   Discrete

The correlation between landslide occurrence and the classes of each extracted attribute layer can be derived by calculating the frequency ratio (Bonham-Carter, 1994), i.e., the ratio of the proportion of landslide occurrences that fall in a given attribute class to the proportion of the whole study area occupied by that class. If the ratio is greater than 1, the relationship between a landslide event and the factor's range or type is strong. If the ratio is less than 1, the relationship is weak (Lee and Sambath, 2006).

As for the topographic map, the relationships between each attribute layer extracted from the map and the landslides were analyzed. The slope layer was examined to determine whether particular slope intervals have a strong relationship with landslides. Slope angles between 20° and 39° have a stronger relationship than other intervals. With respect to aspect, landslides are concentrated on the East, Southeast, and South-facing areas. The “curvature” of the topography refers to the degree of the convex or concave nature of the geomorphology. For the curvature interval −17 to 2, the frequency ratio is higher than 1; hence, this interval is highly susceptible to landslides. The buffered ridge means the distance from the ridge. In the relationship between the buffered ridge and landslides, the closer to the ridge, the higher the susceptibility. However, it is the interval 26 to 125 m that shows a strong relationship between the attribute layer and the landslides.

Certain relationships have been discerned between landslides and forest factors. In the case of timber diameter, the frequency ratio of landslide occurrence is high when the timber is thin, with an especially strong relationship observed in the case of young trees. As for forest type, among the 11 types of trees considered, pine, planted pine, Korean pine, larch, and poplar are highly susceptible to landslides.

The relationships between landslides and soil factors are as follows. In the case of texture, “Coarse loamy” and “Loamy skeletal” soils were susceptible to landslides. Soil material refers to the origin of the soil, and several materials, such as “Colluvium from granite,” “Colluvium from porphyry,” and “Residuum on granite,” were susceptible to landslides. The effective thickness of soil is related to the environment of plant growth: plants grow well in deep soil and poorly in shallow soil. Hence, in the study area, shallow soil areas are susceptible to landslides. However, no landslides were found in areas of very shallow soil depth.
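As a rough illustration of the frequency ratio described in this section, the sketch below divides the share of landslides that fall in an attribute class by the share of the study area that the class occupies. The function name and the per-class counts in the usage lines are hypothetical; only the totals (590 landslides and 1,387,870 pixels) come from the text.

```python
def frequency_ratio(events_in_class: int, total_events: int,
                    cells_in_class: int, total_cells: int) -> float:
    """Frequency ratio of one attribute class.

    Numerator: proportion of all landslide events located in the class.
    Denominator: proportion of all raster cells belonging to the class.
    """
    if total_events <= 0 or total_cells <= 0 or cells_in_class <= 0:
        raise ValueError("totals and class size must be positive")
    event_share = events_in_class / total_events
    area_share = cells_in_class / total_cells
    return event_share / area_share


# Hypothetical class: 200 of the 590 landslides on 300,000 of 1,387,870 pixels.
fr = frequency_ratio(200, 590, 300_000, 1_387_870)
print(f"frequency ratio = {fr:.2f}")  # above 1 suggests a strong relationship
```

A value above 1 marks a class in which landslides are over-represented relative to its area, which is the sense in which the text calls a relationship strong.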
3. Methods

[...] selecting an attribute with the smallest Entropy. At a node N, Entropy is calculated by

$$\mathrm{Entropy}(N) = -\sum_{j} p(C_j \mid N)\,\log_2 p(C_j \mid N) \tag{1}$$

where p(C_j|N) is the relative frequency of class C_j at node N. For an attribute A that partitions N into k subsets N_1, ..., N_k, the Entropy for selecting attribute A is given by

$$\mathrm{Entropy}_A(N) = \sum_{j=1}^{k} \frac{|N_j|}{|N|} \times \mathrm{Entropy}(N_j) \tag{2}$$

InfoGain is the gain obtained as the difference between the Entropy of the original node and the Entropy of the newly split nodes. The equation is as follows:

$$\mathrm{InfoGain}(A) = \mathrm{Entropy}(N) - \mathrm{Entropy}_A(N) \tag{3}$$

Thus, C4.5 selects an attribute with the smallest Entropy or the biggest InfoGain. InfoGain has a tendency to select an attribute with many split points. This feature makes the tree grow toward continuous attributes. To solve this problem, InfoGain is normalized by SplitInfo, a kind of Entropy on the split points of an attribute; it therefore has a high value for an attribute with many splits. When node N is divided into v subsets N_1, ..., N_v, the equation for SplitInfo is:

$$\mathrm{SplitInfo}(A) = -\sum_{i=1}^{v} \frac{|N_i|}{|N|} \times \log_2 \frac{|N_i|}{|N|} \tag{4}$$

Thus, InfoGain compensated by SplitInfo is GainRatio, which is defined as follows:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{InfoGain}(A)}{\mathrm{SplitInfo}(A)} \tag{5}$$
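The Entropy, InfoGain, SplitInfo, and GainRatio quantities defined above can be sketched in a few lines of Python. This is a minimal illustration for one candidate split on class labels, not the authors' Java implementation; the function names are hypothetical, and returning 0 when SplitInfo is 0 is an assumption made only to avoid division by zero.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Entropy of the class labels at a node (0.0 for an empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(labels: Sequence[Hashable],
               subsets: List[Sequence[Hashable]]) -> float:
    """GainRatio of a split of `labels` into `subsets` (a partition of labels)."""
    n = len(labels)
    # Weighted entropy of the child nodes produced by the split.
    split_entropy = sum(len(s) / n * entropy(s) for s in subsets)
    info_gain = entropy(labels) - split_entropy
    # SplitInfo: entropy of the subset sizes themselves.
    split_info = -sum(len(s) / n * math.log2(len(s) / n)
                      for s in subsets if len(s) > 0)
    # Assumption: treat a degenerate split (SplitInfo == 0) as worthless.
    return info_gain / split_info if split_info > 0 else 0.0


# Toy node: 1 = landslide cell, 0 = non-landslide cell.
node = [1, 1, 0, 0]
print(gain_ratio(node, [[1, 1], [0, 0]]))  # split separates the classes
print(gain_ratio(node, [[1, 0], [1, 0]]))  # split carries no information
```

Dividing by SplitInfo penalizes splits that fragment the node into many small subsets, which is the bias toward many-valued attributes that the text describes.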
[...] of nonevent classes, and n_event is the number of event classes, the probability of the event class can be estimated as follows:

$$P(\mathrm{node}) = \frac{n_{event}}{n_{nonevent} + n_{event}} \tag{6}$$

However, this frequency cannot be used directly as the estimated probability of the event, because tree nodes are split by a purity measure and the probability estimated from the frequencies of a leaf node may take an extreme value, 0 or 1. Thus, instead of estimating the probability directly from the frequencies of leaf nodes, it is more desirable to estimate relative probability by ranking leaf nodes, which can be achieved by smoothing frequencies.

Laplace smoothing (Provost and Domingos, 2003) uses the Laplace correction to avoid probability values of exactly 1 or 0 at leaf nodes. Another method, M-estimate smoothing (Cussents, 1993; Zadrozny and Elkan, 2001), uses the prior probability of events to smooth the probabilities so that estimates are shifted toward the minority class base rate. Both of the above methods consider a uniform class distribution of the sample (Ferri et al., 2003). To obtain predictive accuracy in the class-imbalanced data set, Ferri et al. (2003) introduced m-branch smoothing, a recursive root-to-leaf extension of m probability estimation. On each path, the probability estimates at a parent node are propagated downward to all of its children. The rank of the child node can be expressed by m-branch as follows when the target class is an event class:
where M is a constant, N is the global cardinality of the data set, and d is the depth of the node.

4. Mapping and validation of landslide susceptibility

We followed the process of landslide susceptibility mapping shown in Fig. 3. The C4.5 algorithm was used for constructing the decision tree, as in previous studies (Provost and Domingos, 2003; Zadrozny and Elkan, 2001; Ferri et al., 2003). After the tree construction process, leaf nodes were ranked relative to one another by the m-branch smoothing method. To find the most accurate tree model, we tested the accuracy as a function of the parameter M of the m-branch smoothing. To assess accuracy, we carried out a goodness-of-fit test using the set of all known landslides and a twofold cross-validation to test the predictive aspects of the decision tree. In the twofold cross-validation, two independent subsets were used to construct and to evaluate the model. A full-grown decision tree based on C4.5, built without a pruning step, was used. We programmed the tree algorithm using the Java programming language.

As for the result assessment, the minority event data were regarded as confirmative, and the majority nonevent data were not, because events might occur in nonevent areas in the future. One widely used assessment technique, the Receiver Operating Characteristic (ROC) (Swets, 1988), can be considered for the model evaluation, but it does not take this aspect into account because it evaluates the results included in the nonevent data, which are nonconfirmative. As an alternative method, a lift chart can be used, which evaluates the degree of the classification on the target class. Lift charts were introduced in the business data mining area by Berry and Linoff (1997). Then, Chung and Fabbri (1999) used one for estimating a landslide prediction model. Generally, a lift chart is used for accumulating the lift value. What lift actually measures is the change in concentration of a particular class when the model is used to select a group from the general population (Berry and Linoff, 1997). Therefore, if the resulting curve is skewed toward the left side, the accuracy of the prediction result may be higher, and the performance is quantified by calculating the area under the curve (AUC).

To test the goodness of fit of the model, we used the 590-landslide set for constructing and evaluating the prediction model. From the constructed tree, 828 leaf nodes were generated. A series of nodes from a single leaf to the root of the tree can be converted into a rule. For twofold cross-validation, two groups of 295 landslides were selected from the 590 landslides; the distribution of both groups, Landslide SetA and Landslide SetB, is given in Table 2. In the first fold process, Landslide SetA was used to build a decision tree, and Landslide SetB was used as the validation data set. In the second fold process, the roles of the two data sets were swapped. From the constructed trees, 393 leaves in the first fold and 486 leaves in the second fold were generated. This procedure is described in Fig. 4.

In the twofold cross-validation, the best accuracy covering 89.26% of the AUC was shown when M was 2500. In the goodness-of-fit test, the AUC was assessed to be 86.08% when M was 8000, as shown in Fig. 5. The susceptibility maps trained from the twofold validation process are shown in Fig. 6. The susceptibility map trained using the all-landslides set is shown in Fig. 7. The cumulative lift charts for each result are shown in Fig. 8.

Fig. 5. AUC values according to the parameter M of m-branch smoothing in the goodness of fit and twofold validation.
Fig. 6. Twofold cross-validation results; (a) the result of first fold, and (b) the result of second fold.
Fig. 7. Landslide susceptibility map using the all-known data set. Rectangular area is selected to represent the rules.

The landslide susceptibility results can also be assessed by the distribution of the percentile value of susceptibility. Fig. 9 represents the distribution of the percentile value of susceptibility gained from both the twofold cross-validation and the goodness of fit in the 95% confidence interval. In the twofold cross-validation result, the mean was 15.01% (Std. Dev. = 15.94) and the median was 12.58%. In the result from goodness of fit, the mean was 11.07% (Std. Dev. = 11.75) and the median was 7.63%. Thus, the result of goodness of fit was better than the result of the twofold cross-validation.

5. Discussion

A decision tree is built by selecting attributes; thus, prior knowledge of these is not needed. This feature is helped by gaining knowledge from a real-world phenomenon because many factors are
Fig. 10. The same rectangular area as in Fig. 7 with marked locations (1) and (2) of relatively low susceptibility, and (3) and (4) where landslides occurred.
Fig. 11. Configuration of nodes at locations (1), (2), (3), and (4) seen in Fig. 10.
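The leaf-node smoothing and lift-chart evaluation used in Sections 3 and 4 can be illustrated with a small sketch. It assumes the usual textbook forms of the Laplace correction and the m-estimate, and it computes the area under a cumulative lift curve in which cells are ranked by descending score and the vertical axis is the fraction of events captured so far. It does not implement m-branch smoothing, and the function names, the trapezoidal integration, and the toy data are assumptions rather than the authors' code.

```python
from typing import Sequence


def laplace_estimate(n_event: int, n_total: int, n_classes: int = 2) -> float:
    """Laplace-corrected event probability at a leaf: (k + 1) / (n + C)."""
    return (n_event + 1) / (n_total + n_classes)


def m_estimate(n_event: int, n_total: int, base_rate: float, m: float) -> float:
    """m-estimate at a leaf: (k + b * m) / (n + m), with b the event base rate."""
    return (n_event + base_rate * m) / (n_total + m)


def cumulative_lift_auc(scores: Sequence[float], is_event: Sequence[int]) -> float:
    """Area under the cumulative lift curve (trapezoidal rule).

    x-axis: fraction of cells selected, taken in descending score order.
    y-axis: fraction of all events found among the selected cells.
    """
    total_events = sum(is_event)
    if total_events == 0:
        raise ValueError("at least one event is required")
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    area, captured, prev_y = 0.0, 0, 0.0
    for i in order:
        captured += is_event[i]
        y = captured / total_events
        area += (prev_y + y) / 2.0 / n  # trapezoid of width 1/n
        prev_y = y
    return area


# A pure leaf with 0 events out of 10 cells no longer gets probability 0.
print(laplace_estimate(0, 10))
print(m_estimate(0, 10, base_rate=0.01, m=100))

# Toy ranking: the single event cell receives the highest score.
print(cumulative_lift_auc([0.9, 0.8, 0.1, 0.2], [1, 0, 0, 0]))
```

Because the curve only tracks how early the known events are captured, cells without a recorded event are never treated as confirmed negatives, which matches the reasoning given for preferring the lift chart to ROC.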
Acknowledgments

This work was supported in part by the cooperative research program of the Korea Institute of Geoscience and Mineral Resources (KIGAM) and the Korea Aerospace Research Institute (KARI), and in part by a grant (#07-KLSG-C05) from Cutting-edge Urban Development - Korean Land Spatialization Research Project funded by Ministry of Land, Transport and Maritime Affairs (MLTM) of Korean government and by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF No. 2010-0001732). Constructive comments and suggestions by anonymous reviewers also helped us improve the presentation of this paper.

References

Aleotti, P., Chowdhury, R., 1999. Landslide hazard assessment: summary review and new perspectives. Bull. Eng. Geol. Environ. 58, 21–44.
Atkinson, P.M., Massari, R., 1998. Generalized linear modeling of susceptibility to landsliding in the central Apennines, Italy. Computers & Geosciences 24, 373–385.
Berry, M.J.A., Linoff, G., 1997. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons.
Bonham-Carter, G.F., 1994. Geographic information system for geoscientist, modeling with GIS. Pergamon Press, Oxford. 398.
Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees. Chapman & Hall. Wadsworth, Inc., New York.
Chung, C.F., Fabbri, A.G., 1999. Probabilistic prediction models for landslide hazard mapping. Photogrammetric Engineering & Remote Sensing (PE&RS) 65 (12), 1388–1399.
Cussents, J., 1993. Bayes and pseudo-Bayes estimates of conditional probabilities and their reliabilities. Proceedings of European Conference on Machine Learning.
Dai, F.C., Lee, C.F.J., Li, J., Xu, Z.W., 2001. Assessment of landslide susceptibility on the natural terrain of Lantau Island, Hong Kong. Environmental Geology 40, 381–391.
Donati, L., Turrini, M.C., 2002. An objective method to rank the importance of the factors predisposing to landslides with the GIS methodology: application to an area of the Apennines (Valnerina; Perugia, Italy). Eng. Geol. 63, 277–289.
Ermini, L., Catani, L., Casagli, N., 2004. Artificial Neural Networks applied to landslide susceptibility assessment. Geomorphology 66 (1–4), 327–343.
Ferri, C., Flach, P.A., Hernández-Orallo, J., 2003. Improving the AUC of probabilistic estimation trees. Proc. of the 14th European Conf. on Machine Learning, pp. 121–132.
Gómez, H., Kavzoglu, T., 2005. Assessment of shallow landslide susceptibility using artificial neural networks in Jabonosa River Basin, Venezuela. Eng. Geol. 78, 11–27.
Guzzetti, F., Reichenbach, P., Ardizzone, F., Cardinali, M., Galli, M., 2006. Estimating the quality of landslide susceptibility models. Geomorphology 81, 166–184.
Lee, S., Chol, U.C., 2003. Development of GIS-based geological hazard information system and its application for landslide analysis in Korea. Geosci. J. 7, 243–252.
Lee, S., Min, K., 2001. Statistical analysis of landslide susceptibility at Yongin, Korea. Environmental Geology 40, 1095–1113.
Lee, S., Sambath, T., 2006. Landslide susceptibility mapping in the Damrei Romel area, Cambodia using frequency ratio and logistic regression models. Environ. Geol. 50, 847–855.
Lee, S., Ryu, J.H., Won, J.S., Park, H.J., 2004. Determination and application of the weights for landslide susceptibility mapping using an artificial neural network. Eng. Geol. 71 (3–4), 289–302.
Luzi, L., Pergalani, F., Terlien, M.T.J., 2000. Slope vulnerability to earthquakes at subregional scale, using probabilistic techniques and geographic information systems. Eng. Geol. 58, 313–336.
Melchiorre, C., Matteucci, M., Azzoni, A., Zanchi, A., 2008. Artificial neural networks and cluster analysis in landslide susceptibility zonation. Geomorphology 94, 379–400.
Nefeslioglu, H., Duman, T., Durmaz, S., 2008. Landslide susceptibility mapping for a part of tectonic Kelkit Valley (Eastern Black Sea region of Turkey). Geomorphology 94, 401–418.
Nefeslioglu, H., Sezer, E., Gokceoglu, C., Bozkir, A., Duman, T., 2010. Assessment of landslide susceptibility by decision trees in the metropolitan area of Istanbul, Turkey. Mathematical Problems in Engineering 2010, Article ID 901095.
Neuhäuser, B., Terhorst, B., 2007. Landslide susceptibility assessment using “weights-of-evidence” applied to a study area at the Jurassic escarpment (SW-Germany). Geomorphology 86, 12–24.
Pal, M., Mather, P.M., 2003. An assessment of the effectiveness of decision tree methods for land cover classification. Remote Sens. Environ. 86, 554–556.
Provost, F.J., Domingos, P., 2003. Tree Induction for Probability-based Ranking. Machine Learning, Kluwer Academic Publisher, 52 (3), 199–215.
Quinlan, J.R., 1986. Induction of decision trees. Machine Learning 1, 81–106.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Saito, H., Nakayama, D., Matsuyama, H., 2009. Comparison of landslide susceptibility based on a decision-tree model and actual landslide occurrence: the Akaishi Mountains, Japan. Geomorphology 109 (3–4), 108–121.
Swets, J.A., 1988. Measuring the accuracy of diagnostic systems. Science 240, 1285–1293.
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., MacLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., Steinberg, D., 2008. Top 10 algorithms in data mining. Knowl. Inf. Syst. 14 (1), 1–37.
Zadrozny, B., Elkan, C., 2001. Learning and making decisions when costs and probabilities are both unknown. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213.