
ORIGINAL PAPER

Comparison of classification and clustering methods in spatial rainfall pattern recognition at Northern Iran
Saeed Golian & Bahram Saghafian &
Sara Sheshangosht & Hossein Ghalkhani
Received: 23 April 2009 / Accepted: 2 February 2010
© Springer-Verlag 2010
Abstract Pattern recognition is the science of finding structure
in data and classifying it. Many classification and clustering
methods are used in the pattern recognition area.
In this research, rainfall data in a region in Northern Iran
are classified with the natural breaks classification method and
with a revised fuzzy c-means (FCM) algorithm as a
clustering approach. To compare these two methods, the
results of the FCM method are hardened. The comparison
showed overall agreement between the natural breaks
classification and FCM clustering methods. The differences
arise from the nature of the two methods: in the FCM, the
boundaries between adjacent clusters are not sharp, while they
are abrupt in the natural breaks method. The sensitivity of both
methods with respect to rain gauge density was also
analyzed. For each rain gauge density, the percentage of
boundary region and the hardening error are at a minimum
in the first cluster, while the second cluster has the
maximum error. Moreover, the number of clusters was
sensitive to the number of stations. Since the optimum
number of classes is not apparent in classification
methods and the boundary between adjacent classes is
abrupt, the use of clustering methods such as the FCM
method overcomes these deficiencies. The methods were
also applied to mapping an aridity index in the study
region, where the results revealed good agreement
between the FCM clustering and natural breaks classification
methods.
1 Introduction
Pattern recognition by classification or clustering methods
has many applications in meteorological and hydrological
studies. For example, it has been applied to investigate
temperature trends in the US (Lawson et al. 1981) and to
identify areas of central North America with similar cloud
frequency behavior (Schulz and Samson 1988). In
decision-making for water resource management and
planning, clustering annual and monthly rainfall data
and extracting regions with different rainfall patterns can
be a useful tool for managers and stakeholders.
There are several definitions provided for pattern
recognition. For example, Bezdek (1981) stated that
pattern recognition is a search for structure in data.
Schalkoff (1992) also defined pattern recognition as the
science that concerns the description or classification
(recognition) of measurements. Approaches to pattern
recognition include neural networks, classification methods,
and clustering algorithms. Classification belongs to the
supervised pattern recognition category, whereas clustering
refers to unsupervised approaches.
Lauzon et al. (2006) used a Kohonen neural network for
clustering of precipitation fields. They also employed the mean
rainfall in each cluster and the upstream flows as inputs to a
lumped rainfall-runoff model and simulated the downstream
flow. The results demonstrated the relevance of
the proposed clustering method, which produces groups
of precipitation fields that are in agreement with the
global climatological features affecting the study region.
S. Golian (*)
Shahrood University of Technology,
Shahrood, Iran
e-mail: s.golian@aut.ac.ir
B. Saghafian
Science and Research Branch, Islamic Azad University,
Tehran, Iran
S. Sheshangosht · H. Ghalkhani
Water Research Institute,
Tehran, Iran
Theor Appl Climatol
DOI 10.1007/s00704-010-0267-x
Ramachandra Rao and Srinivas (2006, 2008) used hybrid
clustering algorithms for regionalization analysis. The
watershed was initially clustered by means of agglomerative
hierarchical clustering algorithms such as single linkage,
complete linkage, and Ward's algorithm. Each derived
cluster was then refined with partitional clustering procedures
such as the K-means algorithm. The regions given by the
clustering algorithms were, in general, not statistically
homogeneous in terms of runoff generation mechanism.
Hargrove and Hoffman (2005) discussed clustering
methods and their ability in eco-regionalization
(regionalization of eco-regions). They used a geographic
multivariate clustering method with the K-means algorithm
as a type of quantitative regionalization method. Maps of
nine characteristics, namely elevation, plant-available
water capacity, soil organic matter, total soil nitrogen,
depth to a seasonally high water table, mean precipitation
during the growing season, mean solar insolation during
the growing season, degree-day heat sum during the growing
season, and degree-day cold sum during the non-growing
season, were generated at 1 km resolution. Using these
maps and the K-means clustering method, they divided the
United States into as many as 3,000 eco-regions.
Kulkarni and Kripalani (1998) used the FCM method to
classify seasonal (June through September) percentage
departures from normal rainfall patterns over India for the
period 1871–1994. The dominant modes of spatio-temporal
variability in the Indian monsoon rainfall were identified.
Monthly rainfall data for June through September over
124 years were obtained for 306 stations,
and spatial averages for 51 uniform blocks of 2.5° latitude
by 2.5° longitude were prepared. Using the FCM method,
the most dominant rainfall patterns were classified into four
clusters, and the spatio-temporal characteristics of each
cluster were analyzed.
Osaragi (2002) proposed a spatial data classification
method based on the minimization of information loss and
compared the results with five other classification methods.
He applied each method to seven different sets of data from
the Digital Mesh Statistics compiled by the Statistics Bureau and
Statistics Center of Japan. Each data set was classified into
nine classes. Then, the ratio L of information loss of each
method was compared with that of the other methods. The results
of the numerical analysis showed that the natural breaks method
was the most effective classification method.
Claggett et al. (2004) assessed development pressure and
land-use changes in the Baltimore–Washington, DC, region,
exploring the utility of two modeling approaches for
forecasting future development trends and patterns. The
study area was divided into five classes representing
percent area of urban land by using the Jenks optimization
algorithm to identify breakpoints between classes that
minimize the sum of the variance for each class. The
output data from the two modeling approaches were divided
into five classes of development pressure ranging from
very low to very high.
Comparisons between classification and clustering methods
have not been reported in previous studies. In this paper, we use a
fuzzy clustering algorithm to classify annual rainfall data over
the period 1975 to 2008 in northern Iran, and the results are
compared with those of a hard classification method. The optimum
number of clusters is derived through the fuzzy clustering
method. Unlike in the hard classification method, the boundary
between adjacent clusters is not sharp, and the boundary
region is defined by the interval
$\left[\frac{1}{c}, \frac{c-1}{c}\right]$, where c is the
number of clusters.
2 Methodology
Suitable rainfall data were available from 1975 to 2008 for
some 25 rain gauge stations in the study region. Data
filling, where required, was performed using a multivariate
regression method between adjacent stations. Next, statistical
tests were conducted for all stations in the study area. In case
of hydrological and water resources time series at common
time scales (e.g., monthly or annual), most statistical analyses
are based on a set of fundamental assumptions, i.e., the series
is consistent, is trend-free and constitutes a stochastic process
whose random component follows the appropriate probability
distribution function (Eischeid et al. 1995). Consistency
implies that all the collected data belong to the same
statistical population. Trend exists in a data set if there is a
significant correlation between the observations and time.
Trend or nonstationarity is normally introduced through
human activities such as land-use changes or human-induced
climate change. The double mass curve is the most widely used
technique for consistency testing. This test revealed that the data
were consistent for all rain gauges.
In general, randomness in a hydrological time series
means that the data arise from natural causes. If there is no
randomness, then the series is persistent; this persistence is
normally quantified in terms of the serial correlation
coefficient (McMahon and Mein 1986).
The Spearman rank order correlation nonparametric test
was used to investigate the existence of long-term trends in
the data sets. An outlier test was also carried out. Outliers are
data points that depart significantly from the trend of the
remaining data. The retention, modification, or deletion of
these outliers can significantly affect the statistical parameters
computed from the data. Results showed that three stations
had trends and outlier values at the 5% significance level. These
stations were removed from further analysis.
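As an illustration of the trend test described above, the following is a minimal sketch of a Spearman rank correlation between an annual series and its time order. It is not the authors' implementation: the function names `average_ranks` and `spearman_trend` are hypothetical, ties are assumed to receive average ranks, and the returned t statistic is meant to be compared against a Student t table with n − 2 degrees of freedom.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def spearman_trend(series):
    """Spearman coefficient between a series and time, plus its t statistic."""
    n = len(series)
    time_ranks = [float(i + 1) for i in range(n)]
    value_ranks = average_ranks(series)
    mean = (n + 1) / 2  # mean rank is the same for both rank vectors
    cov = sum((a - mean) * (b - mean) for a, b in zip(time_ranks, value_ranks))
    var_t = sum((a - mean) ** 2 for a in time_ranks)
    var_v = sum((b - mean) ** 2 for b in value_ranks)
    rs = cov / math.sqrt(var_t * var_v)  # undefined if all values are equal
    if 1.0 - rs * rs <= 1e-12:
        t_stat = math.copysign(math.inf, rs)  # perfectly monotonic series
    else:
        t_stat = rs * math.sqrt((n - 2) / (1.0 - rs * rs))
    return rs, t_stat
```

A series would be flagged as having a trend when the absolute t statistic exceeds the tabulated critical value at the chosen significance level.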
In the third step, the nonparametric run test was applied
to test for randomness. This test is described by McGhee (1985),
among others. Application of the run test showed that
the data of two stations were not random at the 5%
two-tailed significance level.
Mean annual precipitation was retained for the remaining
stations in the study area. The spatial distribution of the rainfall
field was determined by the inverse distance weighted
(IDW) method, in which the value of a variable at an unsampled
point is obtained from the values at adjacent points by the following
relationship:
$$Z^{*} = \frac{\sum_{i=1}^{n} \dfrac{Z_i}{d_i^{a}}}{\sum_{i=1}^{n} \dfrac{1}{d_i^{a}}} \qquad (1)$$
where $Z^{*}$ is the estimated quantity, $Z_i$ is the observed quantity
at the i-th station, $d_i$ is the distance between the i-th station and the
unsampled point, a is the power, usually between 1 and 3,
and n is the number of sampled points involved in the
interpolation. The power a influences the accuracy of the
estimates, so that adjacent points are given greater weights
when a is increased. The interpolation was carried out on
a 500-m pixel size.
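The IDW relationship above can be sketched in a few lines. This is an illustrative example rather than the code used in the study: the function name `idw_estimate`, the planar (x, y) coordinates, and the default power of 2 are assumptions, and a target that coincides with a station is assumed to take that station's value.

```python
import math


def idw_estimate(target, stations, power=2.0):
    """Estimate Z* at `target` (x, y) from (x, y, z) station tuples by IDW."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, z in stations:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return z  # assumption: an exact hit returns the observed value
        w = 1.0 / d ** power  # weight 1 / d_i^a
        weighted_sum += w * z
        weight_total += w
    return weighted_sum / weight_total
```

For instance, `idw_estimate((1.0, 0.0), [(0.0, 0.0, 100.0), (2.0, 0.0, 300.0)])` gives the midpoint value because both stations are equally distant; increasing `power` shifts the estimate toward the nearest station when distances differ.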
In the IDW method, the weights of the sampled points are
determined by their distance to the unsampled
point, while their position and distribution have no effect on
the estimate. To study the dependence of the
results on station density, four station density scenarios
with different distances from the region boundary were
considered: 45, 33, 22, and 15 stations were involved
in scenarios 1 to 4, respectively.
To investigate the annual rainfall patterns, a clustering
and a classification method were applied. Clustering
algorithms can be divided into hard clustering and fuzzy
clustering. In hard clustering, each feature vector is
assigned to one of the clusters with a degree of membership
equal to one. This is based on the assumption that feature
vectors can be divided into non-overlapping clusters with
well-defined boundaries between them. Fuzzy clustering
allows a feature vector to belong to all the clusters
simultaneously, with a degree of membership in the
[0, 1] interval, which means that the cluster boundaries
overlap.
Generally, all clustering methods are designed to maximize
within-group similarity and to minimize between-group
similarity. To achieve this purpose, some measures of
similarity or distance between pairs of observations/objects
must be established. The most commonly used distance
measure is the Euclidean distance (Bunkers et al. 1996).
In this study, the fuzzy c-means method described by
Bezdek (1981) is used as the pattern clustering method, with
Euclidean distance as the measure of similarity. The
natural breaks method is used for classification. It is
implemented with the Jenks optimization method
(Jenks 1967), in which the boundary values are determined in
such a way that the average squared deviation in each
class is minimized.
2.1 Clustering with FCM algorithm
The determination of the number of clusters is the most
important issue in clustering algorithms. Here, we use the cluster
validity index (CVI) criterion proposed by Fukuyama and
Sugeno (1989), as follows:
$$S(c) = \sum_{k=1}^{N} \sum_{i=1}^{c} (\mu_{ik})^{m} \left( \|x_k - v_i\|^{2} - \|v_i - \bar{x}\|^{2} \right) \qquad (2)$$
where N is the number of data to be clustered, c is the number of
clusters (c ≥ 2), $x_k$ is the k-th data, usually a vector, $\bar{x}$ is the
average of the data $x_1, x_2, \ldots, x_N$, $v_i$ is the vector expressing
the center of the i-th cluster, $\|\cdot\|$ is the norm, $\mu_{ik}$ is the grade
of membership of the k-th data to the i-th cluster, and m is an
adjustable weight (usually m = 1.5–3).
The number of clusters, c, is determined so that S(c)
reaches a minimum as c increases. It is also imposed that:
$$\sum_{i=1}^{c} \mu_{ik} = 1 \qquad (3)$$
which means that the memberships of a chosen input feature
vector over all the c fuzzy clusters should sum up to 1.0.
The procedure for determining cluster centers and grade
of membership of k-th data belonging to the i-th cluster is
as follows (Sugeno and Yasukawa 1993):
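1. Set t (iteration index) to unity.
2. Set an initial vector for cluster centers: $V^{0} = (v_1, v_2, \ldots, v_c)$.
3. Calculate the membership matrix $U^{t}_{c \times N}$ from the vector of
cluster centers determined in the previous step (c is
the number of clusters and N is the number of data):
$$\mu_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{\|x_k - v_i\|}{\|x_k - v_j\|} \right)^{\frac{2}{m-1}}} \qquad (4)$$
4. Calculate the new vector of cluster centers $V^{t}$ from matrix $U^{t}_{c \times N}$,
with each center given by:
$$v_i^{t} = \frac{\sum_{k=1}^{N} (\mu_{ik})^{m} x_k}{\sum_{k=1}^{N} (\mu_{ik})^{m}} \qquad (5)$$
5. If $\|V^{t} - V^{t-1}\| \le \varepsilon$, then stop; else set t = t + 1 and go to step 3.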


Fig. 1 Digital elevation model (DEM) and the location of rain gauges network in the study area
Fig. 2 Four rain gauge scenario networks
Fig. 3 Map of mean annual rainfall
Fig. 4 Map of mean annual temperature
Once the procedure stops, the cluster validity index, S(c),
is calculated for the given c. This procedure is repeated
for different numbers of clusters. When S(c) starts to increase
with c, the optimum number of clusters (equal to c − 1) has been
obtained.
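The FCM iteration and the Fukuyama and Sugeno index described in this section can be sketched for scalar data such as cell rainfall depths. This is a simplified illustration, not the revised FCM used in the study: the function names are hypothetical, the initial centers are assumed to be spread evenly over the data range, m = 2 is an assumed default, and the standard Bezdek membership update with the Euclidean stopping norm is used.

```python
import math


def memberships(data, centers, m):
    """Membership grades u[i][k] of datum k in cluster i."""
    u = [[0.0] * len(data) for _ in centers]
    for k, x in enumerate(data):
        dists = [abs(x - v) for v in centers]
        if 0.0 in dists:
            # Datum coincides with a center: full membership there.
            u[dists.index(0.0)][k] = 1.0
            continue
        for i, di in enumerate(dists):
            u[i][k] = 1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in dists)
    return u


def fcm_1d(data, c, m=2.0, eps=1e-6, max_iter=500):
    """Fuzzy c-means for scalar data; returns (centers, memberships)."""
    lo, hi = min(data), max(data)
    # Assumption: initial centers are spread evenly over the data range.
    centers = [lo + (i + 0.5) * (hi - lo) / c for i in range(c)]
    for _ in range(max_iter):
        u = memberships(data, centers, m)
        new_centers = []
        for i in range(c):
            weights = [g ** m for g in u[i]]
            total = sum(weights)
            # Keep the old center if a cluster has lost all membership.
            new_centers.append(
                sum(w * x for w, x in zip(weights, data)) / total
                if total > 0.0 else centers[i]
            )
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(centers, new_centers)))
        centers = new_centers
        if shift <= eps:
            break
    return centers, memberships(data, centers, m)


def fukuyama_sugeno(data, centers, u, m):
    """Validity index S(c): compactness minus separation, weighted by u^m."""
    mean = sum(data) / len(data)
    return sum(
        u[i][k] ** m * ((x - v) ** 2 - (v - mean) ** 2)
        for i, v in enumerate(centers)
        for k, x in enumerate(data)
    )


def choose_cluster_count(data, m=2.0, c_max=10):
    """Increase c from 2 and stop at the first increase of S(c)."""
    best_c, best_s = None, None
    for c in range(2, c_max + 1):
        centers, u = fcm_1d(data, c, m)
        s = fukuyama_sugeno(data, centers, u, m)
        if best_s is not None and s > best_s:
            break  # S(c) increased, so keep c - 1
        best_c, best_s = c, s
    return best_c
```

Because FCM converges to a local optimum, the selected number of clusters can depend on the initialization, so in practice several starting configurations may be worth trying.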
2.2 Classification with natural breaks algorithm
Natural breaks is a common method for spatial data
classification and is based on the Jenks optimization algorithm
introduced in 1967. In general, the optimization minimizes within-class
variance and maximizes between-class variance in an iterative
series of calculations. Optimization is achieved when the
goodness of variance fit (GVF) quantity is maximized. The
GVF value is calculated as follows (Dent 1996):
1. Calculate the arithmetic mean ($\bar{X}$) of the variable, and
calculate the sum of the squared deviations between the
observation values ($x_i$) and the mean, i.e.,
$\sum_{i} (x_i - \bar{X})^{2}$.
This value is called the squared deviations, array mean
(SDAM).
2. Calculate the arithmetic mean within each class ($\bar{Z}_c$).
For each class, calculate the sum of the squared
deviations between the observation values ($x_i$) and the
class arithmetic mean ($x_i - \bar{Z}_c$). Finally, the sum over all
classes is determined by
$\sum_{c} \sum_{i \in c} (x_i - \bar{Z}_c)^{2}$. This value is
called the squared deviations, class means (SDCM).
3. Calculate the GVF:
$$\mathrm{GVF} = \frac{\mathrm{SDAM} - \mathrm{SDCM}}{\mathrm{SDAM}} \qquad (6)$$
The method first specifies an arbitrary grouping of the
numeric data. SDAM is a constant and does not change unless
the data change. The mean of each class is computed, and the
SDCM is calculated. Observations are then moved from one
class to another in an effort to reduce the SDCM and
thereby increase the GVF statistic. This process continues
until the GVF value no longer increases.
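The GVF calculation can be illustrated with a short sketch. Note the swapped-in search strategy: instead of the iterative reassignment of observations described above, the hypothetical `best_breaks` function below exhaustively tries every contiguous split of the sorted data, which is only practical for small samples but makes the maximized quantity explicit. The function names are assumptions for illustration.

```python
from itertools import combinations


def sum_sq_dev(values):
    """Sum of squared deviations of values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def gvf(classes):
    """Goodness of variance fit for a list of classes (lists of values)."""
    all_values = [v for cls in classes for v in cls]
    sdam = sum_sq_dev(all_values)  # undefined GVF if all values are equal
    sdcm = sum(sum_sq_dev(cls) for cls in classes)
    return (sdam - sdcm) / sdam


def best_breaks(values, n_classes):
    """Exhaustive search over contiguous splits for the maximum GVF."""
    data = sorted(values)
    best_score, best_classes = -1.0, None
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = (0,) + cuts + (len(data),)
        classes = [data[a:b] for a, b in zip(bounds, bounds[1:])]
        score = gvf(classes)
        if score > best_score:
            best_score, best_classes = score, classes
    return best_score, best_classes
```

A GVF close to 1 indicates that almost all of the variance lies between classes rather than within them.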
Table 1 Number of clusters and annual precipitation depth (in
millimeters) of cluster centers for different rain gauge network
densities
            45 stations   33 stations   22 stations   15 stations
Cluster 1   237.0         238.1         228.9         225.9
Cluster 2   439.5         451.2         392.6         361.8
Cluster 3   562.5         561.2         501.39        484.0
Cluster 4   748.2         745.2         595.4         592.4
Cluster 5   -             -             704.2         713.7
Cluster 6   -             -             819.5         821.5
Fig. 5 Comparison of Fuzzy FCM and Natural Breaks classification methods for different densities of rain gauges network
For each cluster in each scenario, the sum of the grades of
membership of all cells in the study region is calculated,
and the result is divided by the total number of cells in the
area. Thus, the mean grade of membership to a specific
cluster in each scenario is determined with the FCM clustering
method. The mean grade of membership of cells
located in boundary regions is also obtained. For this purpose,
all cells with a grade of membership greater than $\frac{1}{c}$ are assumed
to belong to that cluster, cells with a grade of
membership between $\frac{1}{c}$ and $\frac{c-1}{c}$ belong to the boundary
region, and cells with a grade of membership of less than $\frac{1}{c}$ do
not belong to that cluster, where c is the number of clusters for each
scenario. The following abbreviations are used hereafter:
MMFFuzzyC_i      Mean grade of membership to the i-th
                 cluster for all cells in the study area
                 in the fuzzy clustering method.
MMFFuzzyC_B_i    Mean grade of membership of
                 boundary cells to the i-th cluster in
                 the fuzzy clustering method.
MMFFuzzyC_WB_i   Mean grade of membership of
                 non-boundary cells to the i-th cluster in
                 the fuzzy clustering method.
Error_i          Hardening error in the fuzzy clustering
                 method, which is the mean
                 grade of membership of cells with a
                 grade of membership less than $\frac{1}{c}$.
MMFClass_i       Mean grade of membership to the i-th
                 class in the classification method. For
                 each month, the study area is
                 partitioned with the natural breaks
                 classification method into the same
                 number of classes as the clusters of the
                 clustering method. The mean grade
                 of membership to the i-th class (which
                 takes a value of 0 or 1) is calculated.
To compare the classification and clustering methods in
each month, the mean grades of membership to the i-th
class/cluster are illustrated together.
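The membership statistics defined above can be sketched as follows. This is an illustrative reading of the definitions, not the authors' code: the function names and dictionary keys are hypothetical, the boundary interval is assumed to include its end points, and hardening is assumed to assign each cell to the cluster with its largest grade of membership, which the text does not spell out.

```python
def cluster_summary(grades, c):
    """Summarize one cluster's grades of membership (one value per cell)."""
    lower, upper = 1.0 / c, (c - 1.0) / c
    boundary = [g for g in grades if lower <= g <= upper]
    outside = [g for g in grades if g < lower]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {
        "mean_grade": mean(grades),              # cf. MMFFuzzyC_i
        "boundary_mean_grade": mean(boundary),   # cf. MMFFuzzyC_B_i
        "boundary_share": len(boundary) / len(grades),
        "hardening_error": mean(outside),        # cf. Error_i
    }


def harden(u):
    """Assign each cell k to the cluster i with the largest u[i][k]."""
    n_cells = len(u[0])
    return [max(range(len(u)), key=lambda i: u[i][k]) for k in range(n_cells)]
```

With c = 4, for example, the boundary interval is [0.25, 0.75], so a cell with a grade of 0.5 counts as a boundary cell while a cell with a grade of 0.1 contributes to the hardening error.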
Fig. 6 Comparison of FCM and natural breaks methods with 45 rain gauges
Fig. 7 Comparison of FCM and natural breaks methods with 33 rain gauges
Fig. 8 Comparison of FCM and natural breaks methods with 22 rain gauges
Fig. 9 Comparison of FCM and natural breaks methods with 15 rain gauges
The methodology was further applied to evaluate
aridity conditions in the study region. The De Martonne
aridity index is expressed by:
$$I = \frac{P}{T + 10} \qquad (7)$$
where I is the De Martonne aridity index, P is the annual
precipitation in millimeters, and T is the annual temperature in
degrees Celsius. The De Martonne climate classification
is as follows:
0<I<10, arid climate; 10<I<20, semi-arid climate; 20<
I<24, Mediterranean climate; 24<I<30, semi-wet climate;
and I>30, wet climate.
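The index and its class limits translate directly into code. In this minimal sketch, the function names are hypothetical, and the class limits are assumed to be half-open intervals (a value exactly on a limit falls into the wetter class), since the classification above leaves the limit values themselves unassigned.

```python
def de_martonne(precip_mm, temp_c):
    """De Martonne aridity index I = P / (T + 10)."""
    return precip_mm / (temp_c + 10.0)


def climate_class(index):
    """Map an aridity index to a De Martonne climate class."""
    # Assumption: each class is the half-open interval [lower, upper).
    if index < 10:
        return "arid"
    if index < 20:
        return "semi-arid"
    if index < 24:
        return "Mediterranean"
    if index < 30:
        return "semi-wet"
    return "wet"
```

For example, a mean annual precipitation of 255.26 mm and a mean temperature of 9.60 °C give an index of about 13.02, which falls in the semi-arid class.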
2.3 Case study region
This study focuses on the Golestan Dam basin, with an area of
5,171 km², located in the north-east of Iran. Figure 1 provides a
schematic view of the basin boundary. The elevation in this
region varies from 53 to 2,544 m, with an average of 935 m.
The region's climate is influenced by the Alborz Mountains, the
Turkmenistan desert, moisture from the Caspian Sea, and Siberian
circulation. The average annual precipitation across the
watershed is 530 mm, varying from 200 mm to 800 mm
in different parts of the region. Figure 1 shows the rain
gauge network with 45 stations in or around the basin.
Different rain gauge network densities (scenarios) are also
depicted in Fig. 2. In Figs. 3 and 4, maps of mean annual
rainfall and temperature are shown.
Fig. 10 Comparison of FCM and Natural Breaks methods for De Martonne
Table 2 Characteristics of FCM cluster centers
            Mean annual          Mean temperature   De Martonne   Climate
            precipitation (mm)   (°C)               index
Cluster 1   255.26               9.60               13.02         Semi-arid
Cluster 2   627.92               11.55              29.13         Semi-wet
Cluster 3   773.94               14.10              32.11         Wet
Cluster 4   700.70               12.06              31.76         Wet
Cluster 5   517.73               13.41              22.11         Mediterranean
Cluster 6   437.73               12.32              19.61         Semi-arid
Cluster 7   350.55               11.09              16.62         Semi-arid
Cluster 8   565.23               12.15              25.52         Semi-wet
Fig. 11 Comparison of proportion of boundary region for the first cluster
(blue), last cluster (green), and mean of other clusters (dashed line)
3 Results
Table 1 presents the number of clusters and the mean annual
precipitation of the cluster centers obtained by the FCM algorithm.
The results of the FCM fuzzy clustering algorithm and the natural
breaks classification method for each rain gauge network
density are presented in Table 1 and illustrated in Fig. 5.
The bar diagrams for each rain gauge density show each cluster's
share of the total area in that particular case. For
instance, for the case with 15 stations, the greatest share
belongs to the 5th cluster (class). The resemblance of the
two methods is also evident.
It can also be seen that, for each month, the percentage of
boundary regions of the first cluster is smaller than
that of the other clusters.
Results of clustering and classification for different rain
gauge densities are illustrated in Figs. 6, 7, 8, and 9. As one
may note, heavy rainfall occurs in central areas extending
from the northeast to the southwest, while low
rainfall (cluster 1 in all scenarios) occurs in the eastern areas
of the region. This is compatible with the synoptic
evidence in the region. Boundary regions are formed
from cells with a degree of membership between
$\frac{1}{c}$ and $\frac{c-1}{c}$.
Thus, it may be deduced that, in seasonal pattern
recognition, the results of classification and clustering
methods are quite similar.
Climate classes derived by the FCM clustering and
natural breaks classification methods are illustrated in
Fig. 10. In the FCM algorithm, some eight classes were
derived on the basis of Fukuyama and Sugeno (1989) CVI.
Table 2 contains characteristics of derived climate classes.
4 Conclusions
Two different methods were compared for pattern recognition
of rainfall and a climate index in a northeastern region of Iran. The
first method is the unsupervised fuzzy clustering method
(FCM), while the second method is a supervised hard
classification type based on the Jenks optimization method. For
each rain gauge density, the number of clusters derived by
the FCM method using the Sugeno CVI criterion was input to
the natural breaks classification method (supervised).
Comparison of the results showed agreement between the natural
breaks classification and the FCM clustering methods. The
differences arise from the nature of these two methods. In the
FCM, the boundaries between adjacent clusters are not sharp,
while the boundary between adjacent classes is
abrupt in the natural breaks method (Figs. 6, 7, 8, and 9). The
sensitivity of the methods with respect to rain gauge density
was also evaluated. Results demonstrated that the number of
derived clusters (classes) was sensitive to the number of
stations. By decreasing the number of rain gauges from 45 to
15, the number of clusters increased from four to six
(Table 1).
According to the results of this study, it appears that the
value of $\frac{1}{c}$, representing a lower limit for the degree of
membership of data to a cluster, could be a proper criterion.
Moreover, the interval $\left[\frac{1}{c}, \frac{c-1}{c}\right]$
represents the boundary
interval between adjacent clusters.
As demonstrated in Fig. 11, for the different rain
gauge densities, the shares of boundary cells for the first and
the last clusters, with minimum and maximum annual
precipitation, respectively, are smaller than those of the other
clusters with intermediate rainfall depths. For instance,
in the case of 22 rain gauges, the mean proportions of boundary
regions for the first cluster, the last cluster, and the mean of
the other clusters are 8.3%, 26.6%, and 42.4%, respectively.
As the optimum number of classes is not pre-determined
in unsupervised classification methods, its value may be
determined based on the CVI criteria of clustering methods.
It was further deduced that the patterns attained are comparable
with those of the FCM clustering method.
References
Bezdek JC (1981) Pattern recognition with fuzzy objective function
algorithms. Plenum Press, New York, USA, p 256
Bunkers MJ, Miller JR, DeGaetano AT (1996) Definition of climate
regions in the Northern plains using an objective cluster
modification technique. J Climate 9:130–146
Claggett PR, Jantz CA, Goetz SJ, Bisland C (2004) Assessing
development pressure in the Chesapeake Bay watershed: an
evaluation of two land-use change models. Environ Monit
Assess 94:129–146
Dent BD (1996) Cartography and thematic map design. Wm. C. Brown
Publishing, Dubuque, IA
Eischeid JK, Baker CB, Karl TR, Diaz HF (1995) The quality control
of long-term climatological data using objective data analysis. J
Appl Meteor 34:2787–2795
Fukuyama Y, Sugeno M (1989) A new method of choosing the
number of clusters for the fuzzy c-means method. Proceedings of
Fifth Fuzzy Systems Symposium, Kobe, Japan, pp 247–250
Hargrove WW, Hoffman FM (2005) Potential of multivariate quantitative
methods for delineation and visualization of ecoregions. Environ
Manage 34(1):S39–S60
Jenks GF (1967) The data model concept in statistical mapping. Int
Yearb Cartogr 7:186–190
Kulkarni A, Kripalani RH (1998) Rainfall patterns over India:
classification with fuzzy c-means method. Theor Appl Climatol
59:137–146
Lauzon N, Anctil F, Baxter CW (2006) Clustering of heterogeneous
precipitation fields for assessment and possible improvement of
lumped neural network models for streamflow forecasts. Hydrol
Earth Syst Sci 10:485–494
Lawson MP, Balling RC Jr, Peter AJ, Rundquist DC (1981) Spatial
analysis of secular temperature fluctuations. J Climatol 1:325–332
McGhee JW (1985) Introductory statistics. West Publishing Co., New
York, USA
McMahon TA, Mein RG (1986) River and reservoir yield. Water
Resources Publication, Littleton, Colorado, USA
Osaragi T (2002) Classification methods for spatial data representation.
CASA (Center of Advanced Spatial Analysis, University College
London)
Ramachandra Rao A, Srinivas VV (2006) Regionalization of
watersheds by hybrid-cluster analysis. J Hydrol 318:37–56
Ramachandra Rao A, Srinivas VV (2008) Regionalization of
watersheds: an approach based on cluster analysis. Water Sci Technol
Libr 58, Springer, Dordrecht, Netherlands
Schalkoff R (1992) Pattern recognition: statistical, structural and
neural approaches. Wiley, NY
Schulz TM, Samson PJ (1988) Nonprecipitating low cloud
frequencies for central North America: 1982. J Appl Meteor
27:427–440
Sugeno M, Yasukawa T (1993) A fuzzy-logic-based approach to
qualitative modeling. IEEE Trans Fuzzy Syst 1(1). February
1993
