
Engineering Geology 116 (2010) 274–283

Contents lists available at ScienceDirect

Engineering Geology
journal homepage: www.elsevier.com/locate/enggeo

Landslide susceptibility mapping in Injae, Korea, using a decision tree


Young-Kwang Yeon a, Jong-Gyu Han a,⁎, Keun Ho Ryu b,⁎
a Geoscience Information Department, Korea Institute of Geoscience and Mineral Resources (KIGAM), Gwahang-no 92, Yusung-gu, Daejeon, 305-350, South Korea
b College of Electrical & Computer Engineering, Chungbuk National University, 410 Seongbong-ro, Heungdeok-gu, Cheongju, Chungbuk, South Korea

Article info

Article history:
Received 23 February 2010
Received in revised form 11 August 2010
Accepted 10 September 2010
Available online 19 September 2010

Keywords:
Landslide predictability
Decision tree
Spatial events
C4.5 algorithm
Korea

Abstract

A data mining classification technique can be applied to landslide susceptibility mapping. Because of its advantages, a decision tree is one popular classification algorithm, although hardly used previously to analyze landslide susceptibility because the obtained data assume a uniform class distribution whereas landslide spatial event data when represented on a grid raster layer are highly class imbalanced. For this study of South Korean landslides, a decision tree was constructed using Quinlan's algorithm C4.5. The susceptibility of landslide occurrence was then deduced using leaf-node ranking or m-branch smoothing. The area studied at Injae suffered substantial landslide damage after heavy rains in 2006. Landslide-related factors for nearly 600 landslides were extracted from local maps: topographic, including curvature, slope, distance to ridge, and aspect; forest, providing age, type, density, and diameter; and soil texture, drainage, effective thickness, and material. For the quantitative assessment of landslide susceptibility, the accuracy of the twofold cross-validation was 86.08%; accuracy using all known data was 89.26% based on a cumulative lift chart. A decision tree can therefore be used efficiently for landslide susceptibility analysis and might be widely used for prediction of various spatial events.
Crown Copyright © 2010 Published by Elsevier B.V. All rights reserved.

⁎ Corresponding authors. J.-G. Han: Geoscience Information Department, Korea Institute of Geoscience and Mineral Resources (KIGAM), Gwahang-no 92, Yusung-gu, Daejeon, 305-350, South Korea. Tel.: +82 42 868 3297; fax: +82 42 868 3413. K.H. Ryu: College of Electrical & Computer Engineering, Chungbuk National University, 410 Seongbong-ro, Heungdeok-gu, Cheongju, Chungbuk, South Korea. Tel.: +82 43 267 2254; fax: +82 44 275 2254.
E-mail addresses: jghan@kigam.re.kr (J.-G. Han), khryu@dblab.chungbuk.ac.kr (K.H. Ryu).

0013-7952/$ – see front matter. Crown Copyright © 2010 Published by Elsevier B.V. All rights reserved.
doi:10.1016/j.enggeo.2010.09.009

1. Introduction

Landslides occur mainly because of heavy rain, and their reoccurrence year after year has led to heavy damage to property and lives not only in Korea but also throughout the world. To mitigate landslide damage, it is necessary to assess and manage areas that are susceptible to them. Hence, in recent years, the assessment of landslide hazard and risk has become a topic of major interest (Aleotti and Chowdhury, 1999). Landslide susceptibility is defined as the propensity of an area to generate landslides (Guzzetti et al., 2006) with susceptibility represented by relative value in a given area. Recently, with the development of GIS data-processing techniques, quantitative studies have been applied to landslide susceptibility analysis using various techniques. Such studies can be identified on the basis of the techniques used, such as probabilistic methods (Luzi et al., 2000; Lee and Min, 2001; Donati and Turrini, 2002; Lee and Chol, 2003; Neuhäuser and Terhorst, 2007), logistic regression (Atkinson and Massari, 1998; Dai et al., 2001; Dai and Lee, 2001; Nefeslioglu et al., 2008), and artificial neural network methods (Ermini, 2004; Lee et al., 2004; Gómez, 2005; Melchiorre et al., 2008). Most of these studies were aimed at increasing the accuracy of landslide prediction by finding suitable techniques for the respective study area.

The objective of this study was to suggest a method to carry out landslide susceptibility analysis using a decision tree, a popular classification technique. Unlike other statistical methods, a decision tree makes no statistical assumptions, can handle data that are represented on different measurement scales, and is computationally fast (Pal and Mather, 2003). Also, such a tree represents a good compromise between comprehensibility, accuracy, and efficiency (Ferri et al., 2003). However, the decision tree algorithm was considered to be an unsuitable method to apply in spatial event prediction such as landslide susceptibility analysis because in the case of most decision tree algorithms, including C4.5 (Quinlan, 1993), they normally require a discrete type of output class whereas susceptibility needs to be represented as a continuous value. The Classification and Regression Tree algorithm (CART) (Breiman et al., 1984), which can estimate probability, assumes a uniform distribution of training data set. Thus, previous studies (Saito et al., 2009; Nefeslioglu et al., 2010) only carried out limited applications without overcoming these problems.

To estimate probability from a decision tree in a class imbalanced data set, Provost and Domingos (2003), Zadrozny and Elkan (2001), and Ferri et al. (2003) used leaf node ranking methods, which are achieved by smoothing the class frequencies. They used C4.5 and demonstrated
a better accuracy by making a big tree. In this paper, we apply these methods to spatial events and suggest a method to carry out landslide susceptibility mapping using a decision tree.

2. Study area and data

The selected study area, shown in Fig. 1, covers 34,696,750 m² and is located between Inje-eup and Buk-myeon in the middle of Kwangwon, Korea. This site lies between latitude 38° 5' 52.19''N and 38° 3' 42.43''N, and longitude 128° 12' 4.92''E and 128° 17' 56.08''E. The site is mainly a granite-based rocky mountainous area. Landslides in the study area were caused by heavy rainfall during the period July 11–18, 2006. The average annual rainfall of this area was about 1400 mm from 1995 to 2005, increasing to 1740 mm in 2006. During the 8-day period of landslide occurrence, it rained about 559 mm.

To extract the landslide causal factors for the area, we used a 1:5000-scale topographic map, a 1:25,000-scale soil map, and a 1:25,000-scale forest map. A 5 × 5 m Digital Elevation Model (DEM) extracted from the topographic map was used for generating slope, aspect, curvature, and distance from the mountain ridge. From the soil map, data on texture, drainage, material, and effective thickness were extracted. From the forest map, forest type, diameter class, density, and age data were extracted. All landslide factors were converted into 5 × 5 m float-type raster images with 1,387,870 pixels. Among the layers, slope, curvature, and distance from ridge had continuous values, whereas the others had discrete values, as shown in Table 1. As for the event data set, a total of 590 landslides were identified within the study area by analyzing a 0.4 m resolution airborne image and a Triangulated Irregular Network (TIN), as shown in Fig. 2. In this paper, we used ArcGIS 9.2 software for preparing the image data set.

Table 1
Data set used for landslide susceptibility mapping.

| Map source | Thematic layer | Type | Scale | Conversion (resolution) |
|---|---|---|---|---|
| Airborne image | Landslide point | Class | 0.4 m | |
| DEM (from topographic map) | Slope | Continuous | 5 × 5 m (1:5,000) | 5 m × 5 m float-type image |
| | Aspect | Discrete | | |
| | Curvature | Continuous | | |
| | Distance from ridge | Continuous | | |
| Soil map | Texture | Discrete | 1:25,000 | 5 m × 5 m float-type image |
| | Drainage | Discrete | | |
| | Material | Discrete | | |
| | Effective thickness | Discrete | | |
| Forest map | Forest type | Discrete | 1:25,000 | 5 m × 5 m float-type image |
| | Diameter class | Discrete | | |
| | Density | Discrete | | |
| | Age | Discrete | | |

The correlation between landslide occurrence and the classes of each extracted attribute layer can be derived by calculating the frequency ratio (Bonham-Carter, 1994), i.e., the ratio of the proportion of landslide events falling in a given attribute class to the proportion of the whole study area occupied by that class. If the ratio is greater than 1, the relationship between a landslide event and the factor's range or type is strong. If the ratio is less than 1, then the relationship is weak (Lee and Sambath, 2006).

For the topographic map, the relationship between each extracted attribute layer and the landslides was analyzed. For slope, the analysis determines whether a particular slope interval has a strong relationship with landslide occurrence. Slope angles between 20° and 39° have a stronger relationship than other intervals. With respect to aspect, landslides are concentrated on east-, southeast-, and south-facing areas. The “curvature” of the topography refers to the degree of the convex or concave nature of the geomorphology. The frequency ratio is higher than 1 in the interval from −17 to −3; hence, this interval is highly susceptible to landslides. The buffered ridge is the distance from the ridge. In the relationship between the buffered ridge and landslides, areas closer to the ridge show higher susceptibility. However, the interval from 26 to 125 m shows a strong relationship between the attribute layer and the landslides.

Certain relationships have been discerned between landslides and forest factors. In the case of timber diameter, the frequency ratio of landslide occurrence is high when the timber is thin, with an especially strong relationship observed in the case of young trees. As for forest type, among the 11 types considered, pine, planted pine, Korean pine, larch, and poplar are highly susceptible to landslides.

The relationships between landslides and soil factors are as follows. In the case of texture, “Coarse loamy” and “Loamy skeletal” soils showed susceptibility to landslides. Soil material refers to the origin of the soil, and several materials, such as “Colluvium from granite,” “Colluvium from porphyry,” and “Residuum on granite,” showed susceptibility to landslides. The effective thickness of soil is related to the environment of plant growth: plants grow well in deep soil and poorly in shallow soil. Hence, in the study area, the shallow soil area exhibits susceptibility to landslides. Conversely, however, no landslide was found in areas of very shallow soil depth.
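
The frequency ratio described in this section can be checked numerically. The following Python sketch assumes the ratio is the share of landslides falling in a class divided by the share of study-area pixels in that class, a reading that reproduces the values listed in Table 2. The function and variable names are illustrative and are not taken from the original study.

```python
def frequency_ratio(class_landslides, total_landslides, class_pixels, total_pixels):
    """Share of landslides in a class divided by the share of area in that class."""
    event_share = class_landslides / total_landslides
    area_share = class_pixels / total_pixels
    return event_share / area_share


# (pixels in class, landslides in class) for three t_aspect classes, copied from Table 2.
T_ASPECT = {
    "NORTH": (169295, 46),
    "EAST": (115706, 84),
    "SOUTHEAST": (151134, 155),
}
TOTAL_PIXELS = 1_387_870
TOTAL_LANDSLIDES = 590

ratios = {
    name: frequency_ratio(slides, TOTAL_LANDSLIDES, pixels, TOTAL_PIXELS)
    for name, (pixels, slides) in T_ASPECT.items()
}

for name, fr in ratios.items():
    # A ratio above 1 marks a strong relationship, below 1 a weak one.
    label = "strong" if fr > 1 else "weak"
    print(f"{name}: FR = {fr:.2f} ({label})")
```

Rounded to two decimals, the three ratios agree with the frequency ratio column of Table 2 for the same classes.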

Fig. 1. Study area (Inje, Korea).

Fig. 2. Landslide location with TIN (Triangulated Irregular Network) image.

3. Methods

3.1. Decision tree

The decision tree algorithm, C4.5, is widely used for classification tasks (Wu et al., 2008) and extends ID3 (Quinlan, 1986) with additional functions, including the use of continuous attributes. C4.5 consists of tree growth and tree pruning steps. In the former, tree growth begins from a node, which is then split by selecting the attribute that best classifies a set of examples on the basis of an attribute selection measure.

The attribute selection measure uses the concept of entropy, which is defined as the degree of disorder. Thus, a tree grows by selecting an attribute with the smallest Entropy. At a node N, Entropy is calculated by

$$\mathrm{Entropy}(N) = -\sum_{j} p(C_j \mid N)\,\log_2 p(C_j \mid N) \tag{1}$$

where p(C_j|N) is the relative frequency of class C_j at node N. When attribute A splits N into k subsets N_1, ..., N_k, the Entropy for selecting attribute A is given by

$$\mathrm{Entropy}_A(N) = \sum_{j=1}^{k} \frac{|N_j|}{|N|} \times \mathrm{Entropy}(N_j) \tag{2}$$

InfoGain is the gain from the difference between the Entropy of the original node and the Entropy of the newly split nodes. The equation is as follows:

$$\mathrm{InfoGain}(A) = \mathrm{Entropy}(N) - \mathrm{Entropy}_A(N) \tag{3}$$

Thus, C4.5 selects an attribute with the smallest Entropy or biggest InfoGain. InfoGain has a tendency to select an attribute with many split points. This feature makes the tree grow toward continuous attributes. To solve this problem, InfoGain is normalized by SplitInfo, a kind of Entropy on the split points of an attribute. Thus, it has a high value for an attribute with a large number of splits. When node N is divided into v subsets, the equation for SplitInfo is:

$$\mathrm{SplitInfo}(A) = -\sum_{j=1}^{v} \frac{|N_j|}{|N|} \times \log_2 \frac{|N_j|}{|N|} \tag{4}$$

Thus, InfoGain compensated by SplitInfo is GainRatio, which is defined as follows:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{InfoGain}(A)}{\mathrm{SplitInfo}(A)} \tag{5}$$

For landslide susceptibility mapping, we can consider the probability of an event class in the leaf node. When n_nonevent is the number

Fig. 3. Procedure of landslide susceptibility mapping using decision tree.
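
Eqs. (1)–(5) of Section 3.1 can be illustrated with a small Python sketch. It is a minimal illustration for a single discrete attribute on toy labels, not the authors' Java implementation. Continuous attributes, which C4.5 handles by searching for split thresholds, are left out, and all names and data below are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Eq. (1): -sum_j p(C_j|N) * log2 p(C_j|N) over the classes present at a node."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def gain_ratio(labels, attribute_values):
    """Eqs. (2)-(5) for one discrete attribute observed alongside the class labels."""
    n = len(labels)
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)
    weights = [len(subset) / n for subset in subsets.values()]
    # Eq. (2): entropy of the split, weighted by subset size.
    entropy_a = sum(w * entropy(s) for w, s in zip(weights, subsets.values()))
    # Eq. (3): reduction in entropy achieved by the split.
    info_gain = entropy(labels) - entropy_a
    # Eq. (4): entropy of the partition itself, which grows with the number of subsets.
    split_info = -sum(w * math.log2(w) for w in weights)
    # Eq. (5): an attribute with a single value cannot split, so its ratio is set to 0.
    return info_gain / split_info if split_info > 0 else 0.0


# Toy node with four pixels (hypothetical data).
labels = ["event", "event", "nonevent", "nonevent"]
aspect = ["south", "south", "north", "north"]    # separates the classes perfectly
texture = ["coarse", "fine", "coarse", "fine"]   # carries no class information

ratio_aspect = gain_ratio(labels, aspect)
ratio_texture = gain_ratio(labels, texture)
print(ratio_aspect, ratio_texture)
```

In this toy case the attribute that separates events from nonevents receives the larger gain ratio and would be chosen for the split.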



Table 2
Data distribution according to the layers.

| Layers (variables) | Value domain | No. of pixels in domain | No. of landslides | Frequency ratio | Landslide SetA | Landslide SetB |
|---|---|---|---|---|---|---|
| t_aspect | FLAT | 1177 | 0 | 0.00 | 0 | 0 |
| | NORTH | 169295 | 46 | 0.64 | 25 | 21 |
| | NORTHEAST | 150783 | 24 | 0.37 | 9 | 15 |
| | EAST | 115706 | 84 | 1.71 | 35 | 49 |
| | SOUTHEAST | 151134 | 155 | 2.41 | 95 | 60 |
| | SOUTH | 165153 | 104 | 1.48 | 54 | 50 |
| | SOUTHWEST | 193688 | 74 | 0.90 | 28 | 46 |
| | WEST | 222224 | 61 | 0.65 | 31 | 30 |
| | NORTHWEST | 218710 | 42 | 0.45 | 18 | 24 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| t_slope | 0 ~ 4 | 16376 | 0 | 0.00 | 0 | 0 |
| | 5 ~ 9 | 48348 | 0 | 0.00 | 0 | 0 |
| | 10 ~ 14 | 86785 | 4 | 0.11 | 1 | 3 |
| | 15 ~ 19 | 132934 | 36 | 0.64 | 18 | 18 |
| | 20 ~ 24 | 194440 | 95 | 1.15 | 58 | 37 |
| | 25 ~ 29 | 251315 | 132 | 1.24 | 67 | 65 |
| | 30 ~ 34 | 268259 | 144 | 1.26 | 63 | 81 |
| | 35 ~ 39 | 219885 | 112 | 1.20 | 47 | 65 |
| | 40 ~ 44 | 117105 | 48 | 0.96 | 29 | 19 |
| | 45 ~ 49 | 40957 | 17 | 0.98 | 10 | 7 |
| | 50 ~ 54 | 9928 | 1 | 0.24 | 1 | 0 |
| | 55 ~ 59 | 1449 | 1 | 1.62 | 1 | 0 |
| | 60 ~ 64 | 84 | 0 | 0.00 | 0 | 0 |
| | Over 65 | 5 | 0 | 0.00 | 0 | 0 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| t_curvature | Below −28 | 6 | 0 | 0.00 | 0 | 0 |
| | −27 ~ −23 | 17 | 0 | 0.00 | 0 | 0 |
| | −22 ~ −18 | 80 | 0 | 0.00 | 0 | 0 |
| | −17 ~ −13 | 789 | 1 | 2.98 | 0 | 1 |
| | −12 ~ −8 | 9052 | 32 | 8.32 | 6 | 26 |
| | −7 ~ −3 | 149623 | 292 | 4.59 | 99 | 193 |
| | −2 ~ 2 | 953923 | 260 | 0.64 | 185 | 75 |
| | 3 ~ 7 | 255806 | 5 | 0.05 | 5 | 0 |
| | 8 ~ 12 | 17256 | 0 | 0.00 | 0 | 0 |
| | 13 ~ 17 | 1158 | 0 | 0.00 | 0 | 0 |
| | 18 ~ 22 | 141 | 0 | 0.00 | 0 | 0 |
| | 23 ~ 27 | 16 | 0 | 0.00 | 0 | 0 |
| | Over 28 | 3 | 0 | 0.00 | 0 | 0 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| t_ridgebuffer (distance from ridge: meter) | 1 ~ 25 | 314988 | 75 | 0.56 | 34 | 41 |
| | 26 ~ 50 | 292307 | 192 | 1.55 | 98 | 94 |
| | 51 ~ 75 | 250570 | 154 | 1.45 | 90 | 64 |
| | 76 ~ 100 | 204480 | 104 | 1.20 | 48 | 56 |
| | 101 ~ 125 | 150759 | 48 | 0.75 | 18 | 30 |
| | 126 ~ 150 | 99397 | 13 | 0.31 | 3 | 10 |
| | 151 ~ 175 | 51535 | 4 | 0.18 | 4 | 0 |
| | 176 ~ 200 | 17706 | 0 | 0.00 | 0 | 0 |
| | 201 ~ 225 | 4863 | 0 | 0.00 | 0 | 0 |
| | 226 ~ 250 | 1145 | 0 | 0.00 | 0 | 0 |
| | 251 ~ 275 | 120 | 0 | 0.00 | 0 | 0 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| f_diameter (cm) | Non-forest | 106105 | 12 | 0.27 | 4 | 8 |
| | 6 ~ 16 | 365945 | 255 | 1.64 | 128 | 127 |
| | 18 ~ 28 | 735844 | 304 | 0.97 | 153 | 151 |
| | Over 30 | 179976 | 19 | 0.25 | 10 | 9 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| f_age | Non-forest | 106105 | 12 | 0.27 | 4 | 8 |
| | 11–20 year | 214605 | 184 | 2.02 | 93 | 91 |
| | 21–30 year | 151340 | 71 | 1.10 | 35 | 36 |
| | 31–40 year | 203433 | 86 | 0.99 | 41 | 45 |
| | 41–50 year | 570157 | 225 | 0.93 | 115 | 110 |
| | Over 51 year | 142230 | 12 | 0.20 | 7 | 5 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| f_density | Non-forest | 106105 | 12 | 0.27 | 4 | 8 |
| | Less than 50% | 19390 | 6 | 0.73 | 2 | 4 |
| | 51–70% | 682056 | 323 | 1.11 | 159 | 164 |
| | Over 71% | 580319 | 249 | 1.01 | 130 | 119 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| f_type | Non-forest | 101418 | 8 | 0.19 | 3 | 5 |
| | Pine | 465626 | 237 | 1.20 | 117 | 120 |
| | Non-conifer | 322600 | 109 | 0.79 | 54 | 55 |
| | Agricultural land | 2049 | 0 | 0.00 | 0 | 0 |
| | Mixed (nonconifer, conifer) | 206404 | 23 | 0.26 | 12 | 11 |
| | Planted pine | 17897 | 27 | 3.55 | 12 | 15 |
| | Planted nonconifer | 1261 | 0 | 0.00 | 0 | 0 |
| | Korean pine | 141118 | 123 | 2.05 | 60 | 63 |
| | Larch | 122973 | 55 | 1.05 | 33 | 22 |
| | Poplar | 3886 | 4 | 2.42 | 3 | 1 |
| | Field | 2638 | 4 | 3.57 | 1 | 3 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| s_texture | Coarse loamy | 1231824 | 536 | 1.02 | 259 | 277 |
| | Fine loamy | 54928 | 6 | 0.26 | 5 | 1 |
| | Fine loamy or coarse loamy | 3512 | 0 | 0.00 | 0 | 0 |
| | Loamy skeletal | 94508 | 47 | 1.17 | 30 | 17 |
| | River overflow area | 33 | 0 | 0.00 | 0 | 0 |
| | Sandy skeletal | 3065 | 1 | 0.77 | 1 | 0 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| s_material | River overflow area | 33 | 0 | 0.00 | 0 | 0 |
| | Alluvium | 6498 | 3 | 1.09 | 0 | 3 |
| | Alluvium-colluvium from acid rock | 3077 | 0 | 0.00 | 0 | 0 |
| | Alluvium-colluvium from granite | 8980 | 3 | 0.79 | 2 | 1 |
| | Colluvium | 39992 | 5 | 0.29 | 4 | 1 |
| | Colluvium from granite | 74310 | 34 | 1.08 | 25 | 9 |
| | Colluvium from porphyry | 11693 | 7 | 1.41 | 3 | 4 |
| | Local alluvium | 6212 | 1 | 0.38 | 1 | 0 |
| | Local alluvium-colluvium | 21278 | 4 | 0.44 | 3 | 1 |
| | Residuum on granite | 1210155 | 532 | 1.03 | 257 | 275 |
| | Residuum on granite gneiss | 5642 | 1 | 0.42 | 0 | 1 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| s_drainage | River overflow area | 33 | 0 | 0.00 | 0 | 0 |
| | Imperfectly | 4200 | 0 | 0.00 | 0 | 0 |
| | Moderately well | 15341 | 5 | 0.77 | 2 | 3 |
| | Somewhat excessively | 1210330 | 532 | 1.03 | 257 | 275 |
| | Well | 157966 | 53 | 0.79 | 36 | 17 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| s_thickness | Deep | 32206 | 9 | 0.66 | 7 | 2 |
| | Moderately deep | 1108115 | 455 | 0.97 | 232 | 223 |
| | River overflow area | 33 | 0 | 0.00 | 0 | 0 |
| | Shallow | 241114 | 126 | 1.23 | 56 | 70 |
| | Very shallow | 6402 | 0 | 0.00 | 0 | 0 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |



Fig. 4. Twofold cross-validation step.

of nonevent classes, and n_event is the number of event classes, the probability of the event class can be estimated as follows:

$$P(\mathrm{node}) = \frac{n_{\mathrm{event}}}{n_{\mathrm{nonevent}} + n_{\mathrm{event}}} \tag{6}$$

However, this frequency cannot be used directly as the estimated probability of the event because tree nodes are split by a purity measure, and the estimated probability from the frequencies of a leaf node may be an extreme value: 0 or 1. Thus, instead of estimating the probability directly from the frequencies of leaf nodes, it is more desirable to estimate relative probability by ranking leaf nodes, which can be achieved by smoothing frequencies.

3.2. Leaf node ranking methods

The methods outlined here were developed for applications with class-imbalanced data sets, and they can be applied in the evaluation of reliability and cost-sensitive learning. Leaf node ranking methods commonly use the ratio of the target class in the leaf node, but the way of smoothing is different.

Laplace smoothing (Provost and Domingos, 2003) uses Laplace correction for avoiding a probability value of 1 or 0 from leaf nodes. Another method, M-estimate smoothing (Cussents, 1993; Zadrozny and Elkan, 2001), uses the prior probability of events to smooth the probabilities so that estimates are shifted toward the minority class base rate. Both of the above methods consider a uniform class distribution of the sample (Ferri et al., 2003). To obtain predictive accuracy in the class-imbalanced data set, Ferri et al. (2003) introduced m-branch smoothing, a recursive root-to-leaf extension of m probability estimation. On each path, the probability estimates at a parent node are propagated downward to all of its children. The rank of the child node can be expressed by m-branch as follows when the target class is an event class:

$$\mathrm{Rank}(\mathrm{node.child}) = \frac{n_{\mathrm{event}} + m \times \mathrm{Rank}(\mathrm{node.parent})}{n_{\mathrm{event}} + n_{\mathrm{nonevent}} + m} \tag{7}$$

where parameter m is calculated by:

$$m = M + \frac{d - 1}{d} \times M \times \sqrt{N} \tag{8}$$
Fig. 5. AUC values according to the parameter M of m-branch smoothing in the goodness of fit and twofold validation.
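
The m-branch rank of Eqs. (7) and (8) can be sketched in Python as follows. The sketch walks a root-to-leaf path of (n_event, n_nonevent) counts, with M the constant, N the size of the whole data set, and d the depth of the node. The starting rank at the root (0.5 here), the depth convention (root at d = 1), and the toy counts are assumptions made for illustration, since the text does not state them.

```python
import math


def raw_frequency(n_event, n_nonevent):
    """Eq. (6): unsmoothed event frequency at a node."""
    return n_event / (n_event + n_nonevent)


def m_branch_rank(path_counts, M, N, root_prior=0.5):
    """Apply Eq. (7) along a root-to-leaf path of (n_event, n_nonevent) counts.

    root_prior is the rank fed into the root node; 0.5 is an assumed value.
    """
    rank = root_prior
    for depth, (n_event, n_nonevent) in enumerate(path_counts, start=1):
        # Eq. (8): m grows with depth, so deeper nodes lean more on their parent.
        m = M + (depth - 1) / depth * M * math.sqrt(N)
        # Eq. (7): the parent's rank acts as a prior for the child's frequency.
        rank = (n_event + m * rank) / (n_event + n_nonevent + m)
    return rank


# Hypothetical tree over N = 100 pixels; both leaves hold 2 events and 0 nonevents.
N_PIXELS = 100
path_a = [(10, 90), (8, 12), (2, 0)]   # leaf under an event-rich parent
path_b = [(10, 90), (2, 78), (2, 0)]   # leaf under an event-poor parent

rank_a = m_branch_rank(path_a, M=1, N=N_PIXELS)
rank_b = m_branch_rank(path_b, M=1, N=N_PIXELS)
print(raw_frequency(2, 0), rank_a, rank_b)
```

Both toy leaves are pure, so Eq. (6) gives 1 for each, whereas the m-branch rank separates them according to the event share of their ancestors and never reaches the extreme values 0 or 1.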

where M is a constant, N is the global cardinality of the data set, and d is the depth of the node.

4. Mapping and validation of landslide susceptibility

We followed the process of landslide susceptibility mapping shown in Fig. 3. The C4.5 algorithm was used for constructing the decision tree, as in previous studies (Provost and Domingos, 2003; Zadrozny and Elkan, 2001; Ferri et al., 2003). After the tree construction process, leaf nodes were ranked relative to one another by the m-branch smoothing method. To search for the best accuracy of the tree model, we tested the accuracy according to the parameter M of the m-branch smoothing. For the assessment of accuracy performance, we carried out the goodness of fit using the set of all known landslides and the twofold cross-validation for testing predictive aspects of the decision tree. In the twofold cross-validation, two independent subsets were used to construct and to evaluate the model. A full-grown decision tree based on C4.5, without a pruning step, was used. We programmed the tree algorithm using the Java programming language.

As for the result assessment, the minority event data were regarded as confirmative, and the majority nonevent data were not, because events might occur in nonevent areas in the future. One of the widely used assessment techniques, the Receiver Operating Characteristic (ROC) (Swets, 1988), can be considered for the model evaluation, but it does not consider such an aspect because it evaluates the results included in the nonevent data, which are nonconfirmative. As an alternative method, a lift chart can be used, which evaluates the degree of the classification on the target class. Lift charts were introduced in the business data mining area by Berry and Linoff (1997). Then, Chung and Fabbri (1999) used one for estimating a landslide prediction model. Generally, a lift chart is used for accumulating the lift value. What lift actually measures is the change in concentration of a particular class when the model is used to select a group from the general population (Berry and Linoff, 1997). Therefore, if the resulting curve is biased toward the left side, the
Fig. 6. Twofold cross-validation results; (a) the result of first fold, and (b) the result of second fold.

Fig. 7. Landslide susceptibility map using the all-known data set. Rectangular area is selected to represent the rules.

accuracy of the prediction result may be higher and the performance is quantified by calculating the area under the curve (AUC).

To test the goodness of fit of the model, we used a 590-landslide set for constructing and evaluating the prediction model. From the constructed tree, 828 leaf nodes were generated. A series of nodes from a single leaf to the root of the tree can be converted into a rule.

For twofold cross-validation, two groups of 295 landslides were selected from the 590 landslides; the distribution of both the groups, Landslide SetA and Landslide SetB, is given in Table 2. In the first fold process, Landslide SetA was used to build a decision tree, and Landslide SetB was used as the validation data set. In the second fold process, the roles of the two data sets were swapped. From the constructed trees, 393 leaves in the first fold and 486 leaves in the second fold were generated. This procedure is described in Fig. 4.

In the twofold cross-validation, the best accuracy covering 89.26% of the AUC was shown when M was 2500. In the goodness-of-fit test, the AUC was assessed to be 86.08% when M was 8000, as shown in Fig. 5. The susceptibility maps trained from the twofold validation process are shown in Fig. 6. The susceptibility map trained using the all-landslides set is shown in Fig. 7. The cumulative lift charts for each result are shown in Fig. 8.

The landslide susceptibility results can also be assessed by the distribution of the percentile value of susceptibility. Fig. 9 represents the distribution of the percentile value of susceptibility gained from both the twofold cross-validation and the goodness of fit in the 95% confidence interval. In the twofold cross-validation result, the mean was 15.01% (Std. Dev. = 15.94) and the median was 12.58%. In the result from goodness of fit, the mean was 11.07% (Std. Dev. = 11.75) and the median was 7.63%. Thus, the result of goodness of fit was better than the result of the twofold cross-validation.

5. Discussion

A decision tree is built by selecting attributes; thus, prior knowledge of these is not needed. This feature is helped by gaining knowledge from a real-world phenomenon because many factors are

Fig. 8. Cumulative lift charts of goodness of fit and twofold cross-validation.
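
A cumulative lift chart of the kind shown in Fig. 8, and its AUC, can be sketched in Python. The sketch assumes that pixels are sorted by decreasing susceptibility score and that the curve plots the cumulative share of landslides captured against the cumulative share of area examined. The exact construction used in the study is not spelled out in the text, so this is one common reading; ties between scores are not treated specially, and the toy scores are hypothetical.

```python
def cumulative_lift_curve(scores, is_event):
    """Return (area share, captured event share) points, highest scores first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_events = sum(is_event)
    xs, ys = [0.0], [0.0]
    captured = 0
    for k, i in enumerate(order, start=1):
        captured += is_event[i]
        xs.append(k / len(scores))           # share of pixels examined so far
        ys.append(captured / total_events)   # share of landslides found so far
    return xs, ys


def area_under_curve(xs, ys):
    """Trapezoidal area under the piecewise-linear curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
    )


# Eight hypothetical pixels, three of which contain a landslide (1 = event).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
is_event = [1, 1, 0, 1, 0, 0, 0, 0]

xs, ys = cumulative_lift_curve(scores, is_event)
auc = area_under_curve(xs, ys)
print(f"AUC = {auc:.4f}")
```

A curve that rises quickly on the left, meaning most events are found among the highest-ranked pixels, yields an AUC close to 1, while a ranking unrelated to the events would give a value near 0.5.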



correlated in the real world. Thus, we collected and could use all 12 landslide factors for this study. The use of continuous attributes is also one of the advantages of using a decision tree to improve the prediction ability. We used m-branch smoothing for inducing relative probability from the tree. This method can estimate the relative value of event occurrence for landslide susceptibility mapping. However, for better accuracy, the parameter value of the m-branch should be searched experimentally.

Because landslides occur by means of the interaction among causal factors, to analyze factors of the event, it is necessary to explain the relationship among the factors. A rule consists of an “AND” combination of nodes from the root to the leaf. When a rule is interpreted, all of the nodes in the combination must be used. Thus, the relationship among causal factors is implicitly included in the rule.

We selected four places to represent the configuration of nodes in the result, which was trained by using the all-landslide data set, shown within the rectangular area of Fig. 7. The places marked (1) and (2) are of relatively low susceptibility, as shown in Fig. 10. Sites (3) and (4) are locations where landslides occurred. The node information of each location is described in Fig. 11. A series of nodes can be represented as a rule. For example, at location (1), the m-branch and percentile values are 0.58092 and 43.83%, respectively. The rule is represented as t_curvature > –1 & s_texture = “Loamy skeletal” & t_slope <= 33.0.

As for the event occurrence locations (3) and (4), m-branch values are 0.89634 (percentile = 1.63) and 0.89556 (percentile = 1.80), respectively, and are higher than those of (1) and (2). Event locations share the 1st to 8th nodes. Among the nodes, “t_curvature” and “t_slope” appeared several times because continuous attributes can be repeatedly selected. When we interpret the series of nodes according to the rule, a shallower node can be ignored when the same attribute is found at a deeper level. For example, location (3) can be represented as t_ridgebuffer <= 27.0 & s_material = “Residuum on granite gneiss” & f_diameter = “18 ~ 28” & t_slope > 25 & t_curvature <= –6.0 & t_aspect = “South” & f_density = “51 ~ 70%” & s_thickness = “Moderately deep” & f_age = “30 ~ 40 year”.

From a predictive point of view, we carried out a twofold cross-validation. When the class-imbalanced landslide data set is considered, the performance of the prediction model may have been underestimated. The twofold cross-validation does not consider two aspects. First, generally many landslides may occur at a place where they previously occurred, whereas the cross-validation method performs a leave-one-out process or tests without replacement. Second, if the number of folds is small, then the predictive result will be pessimistic because the amount of training data used for the construction of a prediction model is small. This explains the difference in results between goodness of fit and twofold cross-validation.

Fig. 9. Box and whisker plots representing percentiles of landslide susceptibility. The box represents second and third quartiles. Whiskers represent first and fourth quartiles. The thicker line in the box represents the median. An open circle represents extreme values. Star points represent outliers.

Fig. 10. The same rectangular area as in Fig. 7 with marked locations (1) and (2) of relatively low susceptibility, and (3) and (4) where landslides occurred.

Fig. 11. Configuration of nodes at locations (1), (2), (3), and (4) seen in Fig. 10.
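
The reading of a root-to-leaf path as an “AND” combination of node tests, discussed in Section 5, can be sketched in Python. The rule below follows the one quoted for location (1): curvature greater than −1, texture “Loamy skeletal”, and slope at most 33.0. The tuple encoding, the helper names, and the two sample pixels are hypothetical and serve only as an illustration.

```python
import operator

# Comparison operators that appear in the rules quoted in Section 5.
OPS = {">": operator.gt, "<=": operator.le, "=": operator.eq}

# Rule for location (1); each tuple is one node test on the root-to-leaf path.
RULE_LOCATION_1 = [
    ("t_curvature", ">", -1),
    ("s_texture", "=", "Loamy skeletal"),
    ("t_slope", "<=", 33.0),
]


def rule_matches(rule, pixel):
    """A rule fires only if every node test on the path holds (logical AND)."""
    return all(OPS[op](pixel[attribute], value) for attribute, op, value in rule)


# Two hypothetical pixels that differ only in soil texture.
pixel_a = {"t_curvature": 0, "s_texture": "Loamy skeletal", "t_slope": 28.0}
pixel_b = {"t_curvature": 0, "s_texture": "Coarse loamy", "t_slope": 28.0}

match_a = rule_matches(RULE_LOCATION_1, pixel_a)
match_b = rule_matches(RULE_LOCATION_1, pixel_b)
print(match_a, match_b)
```

Because a single failed test is enough to reject a pixel, the interaction among causal factors is carried by the rule as a whole rather than by any one node.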

6. Conclusions

Landslides are caused mainly by heavy rains or earthquakes, but the landslide occurrences and the scale are different depending on geo-environmental conditions. A landslide is explained by the environmental conditions at the location where the event occurred, and thus a landslide event can be predicted at a location when specific conditions are satisfied.

A decision tree was not previously considered to be a suitable method to analyze landslide susceptibility because data used in such trees assume a uniform class distribution. However, the ratio between event and nonevent classes of spatial event data sets is highly imbalanced because landslides represented in grid raster spatial data are composed of a small number of pixels. Thus, a minority event class is treated as noise. Moreover, it is not desirable to estimate probability directly from a decision tree in the class-imbalanced data set.

In this paper, we used a full-grown (unpruned) decision tree because the minority event class could otherwise be ignored in the tree-building process. The minority event class, however, has more meaning than the majority nonevent class in the spatial data. The leaf node ranking method for representing susceptibility is achieved by smoothing frequencies. The smoothing technique played an important role in estimating relative rank in the imbalanced data set. This study showed that a decision tree can be used efficiently for spatial prediction problems. Furthermore, it is expected that a decision tree will be widely used for various other spatial prediction problems.

Acknowledgments

This work was supported in part by the cooperative research program of the Korea Institute of Geoscience and Mineral Resources (KIGAM) and the Korea Aerospace Research Institute (KARI), and in part by a grant (#07-KLSG-C05) from Cutting-edge Urban Development - Korean Land Spatialization Research Project funded by Ministry of Land, Transport and Maritime Affairs (MLTM) of Korean government and by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF No. 2010-0001732). Constructive comments and suggestions by anonymous reviewers also helped us improve the presentation of this paper.

References

Aleotti, P., Chowdhury, R., 1999. Landslide hazard assessment: summary review and new perspectives. Bull Eng Geo Environ 58, 21–44.
Atkinson, P.M., Massari, R., 1998. Generalized linear modeling of susceptibility to landsliding in the central Apennines, Italy. Computer & Geosciences 24, 373–385.
Berry, M.J.A., Linoff, G., 1997. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons.
Bonham-Carter, G.F., 1994. Geographic information system for geoscientist, modeling with GIS. Pergamon Press, Oxford. 398.
Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees, Chapman & Hall. Wadsworth, Inc, New York.
Chung, C.F., Fabbri, A.G., 1999. Probabilistic prediction models for landslide hazard mapping. Photogrammetric Engineering & Remote Sensing (PE&RS) 65 (12), 1388–1399.
Cussents, J., 1993. Bayes and pseudo-Bayes estimates of conditional probabilities and their reliabilities. Proceedings of European Conference on Machine Learning.
Dai, F.C., Lee, C.F.J., Li, J., Xu, Z.W., 2001. Assessment of landslide susceptibility on the natural terrain of Lantau Island, Hong Kong. Environmental Geology 40, 381–391.
Donati, L., Turrini, M.C., 2002. An objective method to rank the importance of the factors predisposing to landslides with the GIS methodology: application to an area of the Apennines (Valnerina; Perugia, Italy). Eng. Geol. 63, 277–289.
Ermini, L., Catani, L., Casagli, N., 2004. Artificial Neural Networks applied to landslide susceptibility assessment. Geomorphology 66 (1–4), 327–343.
Ferri, C., Flach, P.A., Hernández-Orallo, J., 2003. Improving the AUC of probabilistic estimation trees. Proc. of the 14th European Conf. on Machine Learning, pp. 121–132.
Gómez, H., Kavzoglu, T., 2005. Assessment of shallow landslide susceptibility using artificial neural networks in Jabonosa River Basin, Venezuela. Eng. Geol. 78, 11–27.
Guzzetti, F., Reichenbach, P., Ardizzone, F., Cardinali, M., Galli, M., 2006. Estimating the quality of landslide susceptibility models. Geomorphology 81, 166–184.
Lee, S., Chol, U.C., 2003. Development of GIS-based geological hazard information system and its application for landslide analysis in Korea. Geosci. J. 7, 243–252.
Lee, S., Min, K., 2001. Statistical analysis of landslide susceptibility at Yongin, Korea. Environmental Geology 40, 1095–1113.
Lee, S., Sambath, T., 2006. Landslide susceptibility mapping in the Damrei Romel area, Cambodia using frequency ratio and logistic regression models. Environ. Geol. 50, 847–855.
Lee, S., Ryu, J.H., Won, J.S., Park, H.J., 2004. Determination and application of the weights for landslide susceptibility mapping using an artificial neural network. Eng. Geol. 71 (3–4), 289–302.
Luzi, L., Pergalani, F., Terlien, M.T.J., 2000. Slope vulnerability to earthquakes at subregional scale, using probabilistic techniques and geographic information systems. Eng. Geol. 58, 313–336.
Melchiorre, C., Matteucci, M., Azzoni, A., Zanchi, A., 2008. Artificial neural networks and cluster analysis in landslide susceptibility zonation. Geomorphology 94, 379–400.
Nefeslioglu, H., Duman, T., Durmaz, S., 2008. Landslide susceptibility mapping for a part of tectonic Kelkit Valley (Eastern Black Sea region of Turkey). Geomorphology 94, 401–418.
Nefeslioglu, H., Sezer, E., Gokceoglu, C., Bozkir, A., Duman, T., 2010. Assessment of landslide susceptibility by decision trees in the metropolitan area of Istanbul, Turkey. Mathematical Problems in Engineering 2010, Article ID 901095.
Neuhäuser, B., Terhorst, B., 2007. Landslide susceptibility assessment using “weights-of-evidence” applied to a study area at the Jurassic escarpment (SW-Germany). Geomorphology 86, 12–24.
Pal, M., Mather, P.M., 2003. An assessment of the effectiveness of decision tree methods for land cover classification. Remote Sens. Environ. 86, 554–556.
Provost, F.J., Domingos, P., 2003. Tree Induction for Probability-based Ranking. Machine Learning, Kluwer Academic Publisher 52 (3), 199–215.
Quinlan, J.R., 1986. Induction of decision trees. Machine Learning 1, 81–106.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Saito, H., Nakayama, D., Matsuyama, H., 2009. Comparison of landslide susceptibility based on a decision-tree model and actual landslide occurrence: the Akaishi Mountains, Japan. Geomorphology 109 (3–4), 108–121.
Swets, J.A., 1988. Measuring the accuracy of diagnostic systems. Science 240, 1285–1293.
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., MacLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., Steinberg, D., 2008. Top 10 algorithms in data mining. Knowl. Inf. Syst. 14 (1), 1–37.
Zadrozny, B., Elkan, C., 2001. Learning and making decisions when costs and probabilities are both unknown. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213.
