Journal of Hydraulic Research Vol. 46, Extra Issue 2 (2008), pp. 235-245
© 2008 International Association of Hydraulic Engineering and Research

Extreme water levels of the Vistula River and Gdansk Harbour


Les niveaux d'eau extrêmes de la rivière Vistule et du port de Gdańsk
DOMINIC E. REEVE, Centre for Coastal Dynamics and Engineering, School of Engineering, University of Plymouth,
Drake Circus, Plymouth, PL4 8AA, UK. Tel.: +44 (0)1752 233 672; fax: +44 (0)1752 232 638;
e-mail: dominic.reeve@plymouth.ac.uk (author for correspondence)
GRZEGORZ ROZYNSKI, Institute of Hydroengineering, Polish Academy of Science, IBW PAN, 7 Koscierska,
80-328 Gdansk, Poland. Tel.: +48 58 5222907; fax: +48 58 5524211;
e-mail: grzegorz@ibwpan.gda.pl
YING LI, Centre for Coastal Dynamics and Engineering, School of Engineering, University of Plymouth,
Drake Circus, Plymouth, PL4 8AA, UK. E-mail: ying.li@plymouth.ac.uk
ABSTRACT
Using Canonical Correlation Analysis (CCA), Rozynski et al. (2006) have demonstrated that there is a weak correlation between the water levels of
Vistula River and Gdansk Harbour. Herein the CCA analysis serves as a first step for a univariate bootstrap resampling technique, applied to investigate
coincident extreme water levels of Vistula River and Gdansk Harbour. The CCA-derived assumption of statistical independence is argued as being
a suitable working approximation for (outline) engineering design. This allows the goodness-of-fit of different statistical models to be assessed in a
quantitative manner with a bootstrap method. This also provides a convenient means of defining extreme levels together with their confidence intervals.
The analysis with two statistical methods provides insight into the character of joint coastal extremes in an estuary of a large north European river.
The rationale and methodology are at least partly applicable to similar estuaries of northern Europe.
RÉSUMÉ
Utilisant l'Analyse de Corrélation Canonique (CCA), Rozynski et al. (2006) ont démontré qu'il y a une corrélation faible entre les niveaux d'eau de
la rivière Vistule et du port de Gdańsk. Sur ce point l'analyse du CCA sert de base à une technique de ré-échantillonnage univariée (bootstrap), qui
a été appliquée à l'étude sur les niveaux d'eau extrêmes dans le port de Gdańsk et la rivière Vistule. La supposition dérivée du CCA d'indépendance
statistique est soutenue comme étant une approximation appropriée pour les plans conceptuels. Ceci permet d'évaluer de manière quantitative la qualité
d'ajustement de différents modèles statistiques avec une méthode bootstrap. Ce processus fournit également des moyens pratiques de définir les
niveaux d'eau extrêmes ainsi que les intervalles de confiance. L'application des deux méthodes donne une idée de la nature des extrêmes côtiers, dans
l'estuaire d'une rivière importante du nord de l'Europe. Les raisons et la méthodologie sont applicables au moins en partie, à des estuaires similaires
du nord de l'Europe.

Keywords: Bootstrap resampling, canonical correlation analysis, extremes, flooding, risk, water level
1 Introduction

In most coastal and estuarine engineering works it is valuable
to consider the joint occurrence of combined conditions, such
as high water levels and large waves. At its simplest level, an
engineer might decide to test the design wave conditions at a
range of water levels. Using the assumptions of pure dependence
and independence of waves and water levels, one could derive
the maximum range of uncertainty in water level for specified
wave conditions. To improve on such methods semi-empirical
techniques were proposed (e.g., Hawkes et al., 2002). These
determine the degree of dependence between two variables and
then apply an intuitive method to estimate joint extremes. This
type of approach has been put onto a rigorous footing (see e.g.,
Coles and Tawn, 1994). This provides a statistical model of the
joint distribution of extreme values of waves and water levels
at a site, and hence estimates of the joint extreme conditions.
The present concern, however, is not the joint dependence of
two variables at a single site but the joint dependence of water
levels at two separate sites where, for example, water levels on
the coast and upstream of a river may be correlated through a
common physical process such as a surge. In what follows, the
extreme behaviour of water levels at two locations in Poland is
investigated. The first is an open coastal site in Gdansk
Harbour, within the Gulf of Gdansk (Fig. 1).

Revision received September 24, 2007 / Open for discussion until December 31, 2008.

Figure 1 Geographic location of mareograph gauge and benchmark station at Tczew [map not reproduced; labels: Baltic Sea, Gulf of Gdansk, mareograph gauge, Gdansk, Vistula, new outlet since 1895, Lagoon, benchmark station at Tczew, Poland; scale bar 10 km]

The second is the benchmark station Tczew, on the banks
of the River Vistula. Tczew is about 35 km upstream from the
mouth of the river and water level variations there are strongly
influenced by sea surges propagating up the river. The River Vistula meets the Gulf of Gdansk approximately 15 km southeast
of the first site. The regions surrounding the mouth of the river
were prone to flooding, caused by ice jams in the winter months
that constricted the flow of river water into the Gulf. In 1895 the
mouth of the Vistula was engineered to create a short, straight
route for the river to discharge into the Gulf as part of a flood
alleviation scheme. Prior to the scheme, the impact of coastal
surge events on upstream locations such as Tczew was effectively damped due to the much longer distance of the original
river course. With the scheme in place, coastal surges propagated much further upstream. The question therefore arises as to
what degree water levels on the coast and upstream are correlated through the common phenomenon of surge, and whether
the coastal monitoring station could be used to provide an early
warning of potential flood conditions upstream. From physical
considerations it might be considered that river discharges and
surges could be associated with a similar annual cycle of storms.
There is a significant amount of previous work devoted to the
problem of the joint probability of high tides combining with
large surge. Tides are, by and large, well understood and predictable, being essentially deterministic. Surges, in contrast, are
not so easy to predict and are of a more stochastic nature. Thus,
much of the research in this area was aimed at deconvolving the
deterministic and stochastic components of the total water level,
(e.g., Pugh and Vassie, 1980; Tawn and Vassie, 1989).
In the Baltic Sea, the tidal component of the water level variation is negligible and the water levels have a purely stochastic
nature. On the open coast, water level fluctuations are due to
surge. Surge is a term used to describe the combined effects of
static surge or the inverse barometer effect, propagating coastal
Kelvin waves or surge waves, wind set up due to wind stress,
and wave set-up arising from the radiation stress associated with
breaking waves. Under certain conditions, the combined surge
can enter the river mouth and propagate upstream as a wave,
changing its form in response to the river geometry. The water
level variations up-river will be due to a combination of upstream
propagating coastal surge, rainfall and run-off flow, and downstream propagating surge waves caused by the operation of flood
protection infrastructure such as sluices and barriers. Thus, it
might be anticipated that water level records from Gdansk and
Tczew would show a level of correlation that could be useful for
flood warning.
An analysis of water level recordings at Gdansk and Tczew
was presented by Rozynski et al. (2006). Basic statistical measures including the mean values and the standard deviations were
calculated to describe the behaviour of water levels in the Vistula
estuary. Further, they used Singular Spectrum Analysis (SSA)
to identify key phenomenological patterns of behaviour at each
site and Canonical Correlation Analysis (CCA) to perform an
analysis of the joint behaviour between the two sites. Here, the
results of a CCA analysis are combined with a bootstrap resampling technique, whose application allows for an assessment of
the extreme water levels at both sites, together with confidence
intervals. The flow chart of the work is shown in Fig. 2, in which the key part of the present work is enclosed by a dotted line. The integration
of both techniques provides a more comprehensive understanding of the character of joint extremes in an estuary of a large river
in northern Europe. This brings additional value, for the results are
at least partly representative of other large estuaries in northern Europe, providing clues for improved flood management
practices.

Figure 2 Flow chart for this work [chart not reproduced; box labels: water level analysis at two distinct sites; raw data at Vistula; raw data at Gdansk; statistical unknown; behaviour of water levels; basic statistics; SSA; key patterns; CCA; joint behaviour; bootstrap resampling]


Traditionally, a probability distribution function is used to
model the distribution of annual maxima, or the peaks over a
specified threshold level. A method such as maximum likelihood
estimation is used to fit the function to the data. Subsequently, the
best-fit probability distribution function is used to determine the
extreme values corresponding to selected return periods. Uncertainties in the extreme values arise not only from sampling errors
associated with a finite set of data, but also from the choice of
statistical model, i.e., the distribution function.
To estimate the uncertainty, the bootstrap resampling technique was employed, as introduced by Efron (1979). Uncertainties arising from the finite sample size are usually expressed in
terms of confidence intervals. The size of the intervals is influenced by the number of data (more data will generally lead to
smaller confidence intervals), and the variation in the sample.
Uncertainties arising from model selection are more difficult to
quantify. Here, the maximum likelihood method is used to estimate the parameters of several candidate distribution functions.
The goodness-of-fit is determined with an error norm that allows
the fit to the data in different ranges to be investigated (e.g., Reeve,
1996). Once the estimated parameters are established for each
distribution function the extreme values corresponding to specific
quantiles can be calculated. In general, there are no closed form
expressions for the confidence limits on the estimates of extreme
values. The bootstrap method provides a means of estimating
these confidence limits.
In Sec. 2 the methodology used for the CCA analysis is
introduced. Section 3 describes the bootstrapping method for
statistical model selection. Section 4 provides a description of
the water level datasets. In Sec. 5 the results of the CCA analysis are presented, together with the rationale for the assumption
of independence. Section 6 includes the results of the bootstrap
resampling calculations, including a comparison of the performance of different distributions in describing the extreme water
levels, estimates of extreme water levels and confidence limits.
The paper concludes with a discussion of the advantages and
limitations of the statistical techniques used in the study and a
summary of the key conclusions.

2 Methodology of the Canonical Correlation Analysis


Canonical correlation analysis is a multivariate statistical model
that facilitates the study of interrelationships among sets of multiple dependent variables and multiple independent variables
(Green, 1978; Green and Douglas Caroll, 1978). In contrast
to multiple regression analysis, where the values of one dependent
variable are predicted with a linear function of a set of independent variables,
canonical correlation analysis is capable of predicting multiple dependent variables from multiple independent
variables. Thus, it can also be used to predict the values of a
predictand random field from a predictor random field. Canonical correlation analysis places the fewest restrictions on the
types of data on which it operates; linearity is essentially the only
underlying assumption. This assumption implies that the correlation coefficient between any two variables is based on a linear
relationship. Moreover, the canonical correlation is the linear
relationship between the linear combinations of variables. If the
combinations are related in a nonlinear manner, the relationship
will not be captured by canonical correlation. Thus, while canonical correlation analysis is the most generalized multivariate
method, it is still constrained to identifying linear relationships.
Let the predictor random field be $Y_{t,y}$ with $t = 1, 2, \ldots, n_t$
observations of $y = 1, 2, \ldots, n_y$ elements. The corresponding
predictand random field $Z_{t,z}$ must have the same number of observations $n_t$, although the number of elements $z = 1, 2, \ldots, n_z$
need not be the same. After removal of the mean values of all
elements in both fields, maximally correlated linear combinations
of the vector observations $Y_t$ and $Z_t$ are constructed, that is, new
variables $U_t$ and $V_t$ are obtained such that $U_1$ is related to $V_1$,
$U_2$ to $V_2$, and so forth. In addition they all have unit variances
and are orthogonal (Graham, 1990). Orthogonality means that the
zero-mean variables $U_m$ and $V_n$ with $m \neq n$ are uncorrelated and, given
their normality, independent of each other. With angle brackets
$\langle \cdot \rangle$ denoting the expected values of the vector products through
time the above description is mathematically equivalent to

$$\langle U_m V_n \rangle = \begin{cases} \max, & m = n \\ 0, & m \neq n \end{cases} \tag{1}$$

$$\langle U_m U_n \rangle = \begin{cases} 0, & m \neq n \\ 1, & m = n \end{cases} \tag{2}$$

$$\langle V_m V_n \rangle = \begin{cases} 0, & m \neq n \\ 1, & m = n \end{cases} \tag{3}$$

Given Eqs (1), (2), and (3) the desired weights for transforming $Y$ into $U$ and $Z$ into $V$ can be found by solution of
the following eigenvalue problem

$$\left| (Y^{T} Y)^{-1} (Y^{T} Z)(Z^{T} Z)^{-1} (Z^{T} Y) - \lambda I \right| = 0, \tag{4}$$

where the letter $T$ indicates the matrix transpose. The eigenvalues
$\lambda_m$ (often denoted $\rho_m^2$) are the squared canonical correlations
sought in Eq. (1). The associated eigenvectors $R_{y,m}$ provide the
required weights for transforming $Y$ into $U$. The canonical mode
$m$ consists of the canonical correlation $\rho_m$ and the associated
eigenvector $R_{y,m}$. The maximum number of canonical modes $n_m$
is determined by the rank of the quadruple product in Eq. (4);
for large natural systems it will almost certainly be equal to the
number of observations $n_t$. That is why indexation of elements
of vectors and matrices with $m$ is retained. In matrix notation

$$U = YR. \tag{5}$$


When $Y$ and $Z$ are swapped in Eq. (4) the same canonical
correlations are obtained together with the eigenvectors $Q$ for
transforming $Z$ into $V$, i.e., $V = ZQ$.
The predictor $Y$ is linked to the predictand $Z$ with the matrix
of regression coefficients $S$, relating the values of the predictor canonical mode temporal amplitudes $U$ to the individual points in the
predictand field $Z$. Due to orthogonality and unit variance of $U$,
this matrix can be evaluated from $S_{m,z} = \langle U_m Z_z \rangle$, where $m$ is
the canonical mode index and $z$ the spatial index of elements
of $Z$. In matrix notation the regression equation takes the form
$\hat{Z} = US$, where $\hat{Z}$ constitutes predictions of $Z$ with the $Y$ field.
Using Eq. (5), this can be written in real space as

$$\hat{Z} = YRS. \tag{6}$$


Now the CCA regression skill is computed as

$$\mathrm{skill} = 1 - (w/\sigma)^2, \tag{7}$$

where the skill is the explained variance, $w$ is the standard deviation of
prediction error and $\sigma$ stands for the standard deviation of measurements in the predictand. This expression can also be used to
assess the impact of individual canonical modes or their specific
subsets, when the remaining canonical modes in matrices on the
right hand side of Eq. (6) are skipped.
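
The chain from the eigenvalue problem of Eq. (4) to the regression skill of Eq. (7) can be illustrated with a short numerical sketch. It is not the authors' implementation. It assumes NumPy, uses synthetic random fields, and assumes more observations than elements so that the two inverses in Eq. (4) exist; how rank-deficient cases were treated in the study is not stated in this section. The function name `cca_skill` and all variable names are hypothetical.

```python
import numpy as np

def cca_skill(Y, Z):
    """Sketch of Eqs (4)-(7): returns eigenvalues, weights R, matrix S, skill."""
    Y = Y - Y.mean(axis=0)            # remove the mean of every element
    Z = Z - Z.mean(axis=0)
    nt = Y.shape[0]
    # Quadruple product (Y^T Y)^-1 (Y^T Z) (Z^T Z)^-1 (Z^T Y) of Eq. (4)
    M = np.linalg.solve(Y.T @ Y, Y.T @ Z) @ np.linalg.solve(Z.T @ Z, Z.T @ Y)
    lam, R = np.linalg.eig(M)
    lam, R = lam.real, R.real         # eigenvalues are real in exact arithmetic
    # Keep the leading modes (at most min(ny, nz) are non-trivial)
    order = np.argsort(lam)[::-1][:min(Y.shape[1], Z.shape[1])]
    lam, R = lam[order], R[:, order]
    R = R / (Y @ R).std(axis=0)       # scale weights so each U_m has unit variance
    U = Y @ R                         # Eq. (5)
    S = U.T @ Z / nt                  # S[m, z] = <U_m Z_z>
    Z_hat = U @ S                     # Eq. (6): Z_hat = Y R S
    w = (Z - Z_hat).std()             # standard deviation of prediction error
    sigma = Z.std()                   # standard deviation of the predictand
    return lam, R, S, 1.0 - (w / sigma) ** 2   # Eq. (7)

# Synthetic example: two fields sharing one weak common signal (illustrative only)
rng = np.random.default_rng(0)
nt = 200
common = rng.normal(size=(nt, 1))
Y = common + rng.normal(size=(nt, 3))
Z = 0.5 * common + rng.normal(size=(nt, 4))
lam, R, S, skill = cca_skill(Y, Z)
print("squared canonical correlations:", lam)
print("regression skill:", skill)
```

Dropping columns of `R` (and the matching rows of `S`) before forming `Z_hat` corresponds to skipping canonical modes, as described for Eq. (6).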

3 Methodology for the statistical model selection


For this investigation the sequences of annual maximum water
levels are used to estimate, by the maximum likelihood method, the parameters of a selection of commonly-used families
of distributions. The candidate distributions are the normal, lognormal, gamma, exponential, Weibull, Gumbel (also called
Extreme type (I)), and the generalised extreme value (GEV). The
last three are extreme value distributions whilst the others are traditional distributions for comparison. The asymptotic behaviour of
the distribution of maximum values was investigated by Fisher
and Tippett (1928) who found three types of limiting distribution,
which can all be described by the GEV distribution. If $X$ obeys
the GEV($\mu$, $\sigma$, $\kappa$) distribution it has the distribution function

$$\Pr(X \le x) = \exp\left\{ -\left[ 1 - \kappa \frac{x - \mu}{\sigma} \right]^{1/\kappa} \right\}, \qquad 1 - \kappa \frac{x - \mu}{\sigma} > 0. \tag{8}$$

The three parameters of this distribution are
(i) a location parameter $\mu$
(ii) a scale parameter $\sigma$, with $\sigma > 0$
(iii) a shape parameter $\kappa$
where the condition in Eq. (8) reflects the support of the distribution. For
$\kappa > 0$ this distribution corresponds to the reversed Weibull distribution; as $\kappa \to 0$ the distribution function tends to the Gumbel
distribution and for $\kappa < 0$ it corresponds to the Fréchet distribution. The Weibull and Gumbel distributions are widely used,
and are special cases of the GEV distribution, corresponding to
$\kappa > 0$ and $\kappa = 0$, respectively. Details of these and of the other
distributions used herein are given in the Appendix.
The level $x_P$ exceeded with probability $P$, i.e.,
$\Pr(X > x_P) = P$, is given by

$$x_P = \mu + \frac{\sigma}{\kappa} \left[ 1 - \left( -\ln(1 - P) \right)^{\kappa} \right], \tag{9}$$

so $x_P$ is the return level for return period $1/P$ units of time.
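
A minimal sketch of Eqs (8) and (9) follows, using only the Python standard library. The function names and the parameter values in the example are hypothetical and are not fitted to the Vistula or Gdansk data. The case $\kappa = 0$ is handled as the Gumbel limit.

```python
import math

def gev_cdf(x, mu, sigma, kappa):
    """Pr(X <= x) from Eq. (8); kappa = 0 is treated as the Gumbel limit."""
    z = (x - mu) / sigma
    if kappa == 0.0:
        return math.exp(-math.exp(-z))
    t = 1.0 - kappa * z
    if t <= 0.0:
        # Outside the support: above the upper bound (kappa > 0)
        # or below the lower bound (kappa < 0)
        return 1.0 if kappa > 0.0 else 0.0
    return math.exp(-t ** (1.0 / kappa))

def gev_return_level(P, mu, sigma, kappa):
    """Level exceeded with probability P, Eq. (9)."""
    y = -math.log(1.0 - P)
    if kappa == 0.0:
        return mu - sigma * math.log(y)
    return mu + sigma / kappa * (1.0 - y ** kappa)

# Illustrative parameters in cm (assumed values, not results from the paper)
for kappa in (-0.2, 0.0, 0.2):
    x100 = gev_return_level(0.01, mu=500.0, sigma=20.0, kappa=kappa)
    print(kappa, round(x100, 2), gev_cdf(x100, 500.0, 20.0, kappa))
```

Feeding the return level back into the distribution function recovers the non-exceedance probability $1 - P$, which is a quick consistency check between the two equations.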


The theoretical justification for the GEV provides a basis for
extrapolation beyond the data to long return period events. A
crucial assumption in fitting distribution functions to data is that
the data are independent and identically distributed (iid). With
hydraulic, hydrodynamic and hydrological data, this requirement
can be difficult to achieve even for observations separated by a
month, because many datasets feature strong annual periodicity.
This periodicity imposes a degree of determinism, which remains imprinted
as significant correlations at long lags. This problem can only
be (partly) remedied by sampling annual maxima, which is
simultaneously one of the primary drawbacks, since it is usually
wasteful of data. Nevertheless, the iid assumption is justified
by the CCA output and supported by the fact that correlations
among annual maxima are barely significant for a population of
only 29 years of observation.
From a purist's view the point of examining specific distributions, as well as more generalised distributions that include
specific distributions as a special case, may well be questioned.
However, this is included here from the practising engineer's
perspective because, if the simpler analysis afforded by a two-parameter
distribution is sufficient, then this is an important
practical consideration. The bootstrap resampling technique was
introduced by Efron (1979), but has only recently become accessible with the rapid increases in computing capability. The
bootstrap technique involves the following steps:
(1) First, let the number of elements in the original data set be n
(2) Generate another sample of n elements by randomly sampling the original data set with replacement, i.e., when an
element is sampled it is replaced immediately
(3) Repeat this, say, B times, to create multiple (or bootstrap)
replications, each with n elements
(4) Perform fitting to each replication (here maximum likelihood
is used)
(5) Compute the estimated parameters, error norm and confidence limits from the sample of B replications. The confidence limits can be calculated directly from the bootstrap
sample by choosing the values corresponding to particular
quantiles from the ordered set of bootstrapped values.
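
The five steps above can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the authors' code: the Gumbel family stands in for the full set of candidate distributions, the maximum likelihood equations are solved with a damped fixed-point iteration chosen here for brevity, and the 29 "annual maxima" are synthetic. All function names are hypothetical.

```python
import math
import random

def fit_gumbel_ml(x, iterations=200):
    """Maximum likelihood Gumbel fit (location mu, scale beta)."""
    n = len(x)
    mean = sum(x) / n
    x0 = min(x)                        # shift for numerical stability
    # Start from the method-of-moments scale
    beta = math.sqrt(6.0 * sum((v - mean) ** 2 for v in x) / n) / math.pi
    for _ in range(iterations):
        w = [math.exp(-(v - x0) / beta) for v in x]
        g = mean - sum(v * wi for v, wi in zip(x, w)) / sum(w)
        beta = 0.5 * (beta + g)        # damped update of the scale equation
    w = [math.exp(-(v - x0) / beta) for v in x]
    mu = x0 - beta * math.log(sum(w) / n)
    return mu, beta

def gumbel_return_level(mu, beta, T):
    """Level with return period T (exceedance probability 1/T)."""
    return mu - beta * math.log(-math.log(1.0 - 1.0 / T))

def bootstrap_return_level(data, T=100, B=500, level=0.90, seed=1):
    rng = random.Random(seed)
    n = len(data)                                  # step 1
    estimates = []
    for _ in range(B):                             # step 3
        replication = rng.choices(data, k=n)       # step 2: with replacement
        mu, beta = fit_gumbel_ml(replication)      # step 4
        estimates.append(gumbel_return_level(mu, beta, T))
    estimates.sort()                               # step 5: ordered values
    k_lo = max(round((1.0 - level) / 2.0 * B), 1)
    k_hi = round((1.0 + level) / 2.0 * B)
    return estimates[k_lo - 1], estimates[k_hi - 1]

# Synthetic annual maxima in cm (illustrative only, not the measured series)
rng = random.Random(0)
maxima = [600.0 - 80.0 * math.log(rng.expovariate(1.0)) for _ in range(29)]
mu, beta = fit_gumbel_ml(maxima)
print("fitted (mu, beta):", mu, beta)
print("100-year level:", gumbel_return_level(mu, beta, 100))
print("90% bootstrap interval:", bootstrap_return_level(maxima))
```

With `B=100` and `level=0.90` the interval is taken between the 5th and 95th ordered values, which matches the percentile rule described in the text.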
Let the original sample of n elements be denoted by $X$. Then
the B bootstrap replications are denoted by $X^{*1}, X^{*2}, \ldots, X^{*B}$,
with the asterisk denoting a bootstrap replication and the number its position in the sequence of replications. The corresponding
best fit distribution functions are $F^{*i}(x)$ with parameter sets
$\theta^{*1}, \theta^{*2}, \ldots, \theta^{*B}$, respectively. The process of calculating the
average best fit parameters for a distribution $F(x)$ using the
bootstrap replications is shown in Fig. 3.
Steps 1 to 3 are relatively straightforward. For Steps 4 and 5,
some extra care is required. There is often no a priori reason that
the distribution of the original data, G(x), should belong to any
of the families of distributions specified. In the following it is
however assumed that in each family of distributions there is a
set of parameter values that is closest to G(x) in a quantifiable
sense. The closeness of the model to the raw data is measured
with an error norm proposed by Linhart and Zucchini (1986).
The error norm, $\Delta$, is defined by

$$\Delta(\theta) = \max_{x} \left| G(x)^{h} - F_{\theta}(x)^{h} \right|, \tag{10}$$

where $F_{\theta}(x)$ is the distribution function of the approximating
model with parameter set $\theta$. The parameter $h > 0$, and the emphasis on the fit of the model at various portions of the distribution
may be changed by altering the value of $h$. A value of $h = 1.0$
corresponds to the well known Kolmogorov-Smirnov statistic.

Figure 3 Bootstrapping procedure [diagram not reproduced: the raw data are resampled (size n) into bootstrap replications X*1, X*2, X*3, ..., X*B, each with its best fit F*1(x), F*2(x), F*3(x), ..., F*B(x) and its error norm]

To make an objective choice between different families of
distributions an additional statistic is used as a selection criterion.
The statistic is the expected value of the maximum difference
between the empirical distribution of the original data and the best
fit distribution for any particular family of distributions, namely

$$E\left(\Delta(\hat{\theta})\right) = \max_{1 \le i \le n} \left| \left( \frac{i}{n + 1} \right)^{h} - F_{\hat{\theta}}(x_i)^{h} \right|, \tag{11}$$

where $x_i$ ($i = 1, 2, \ldots, n$) are the original set of water levels in increasing order of magnitude, $\hat{\theta}$ is the estimator of the
parameters in the approximating family for a particular bootstrap replication, and $E$ is the expectation operator. In general,
analytical expressions for the expectation in Eq. (11) are not available, but a direct estimate may be obtained by using bootstrap
methods. Thus the expectation is estimated by computing the
average error norm over the B replications of the original data.
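
The error norm of Eq. (10) and its bootstrap average in Eq. (11) can be sketched as below, using only the Python standard library. This is an illustration under assumptions: the normal family with its closed-form maximum likelihood estimates is used purely for brevity, the data are synthetic, and the function names are hypothetical.

```python
import math
import random

def error_norm(x_sorted, cdf, h=1.0):
    """max_i |(i/(n+1))^h - F(x_i)^h| for data sorted in increasing order."""
    n = len(x_sorted)
    return max(abs((i / (n + 1)) ** h - cdf(x) ** h)
               for i, x in enumerate(x_sorted, start=1))

def normal_cdf_ml(sample):
    """Return the normal distribution function fitted by maximum likelihood."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((v - m) ** 2 for v in sample) / n)
    return lambda x: 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))

def expected_error_norm(data, fit_cdf, B=500, h=1.0, seed=1):
    """Average error norm over B bootstrap replications, as in Eq. (11)."""
    rng = random.Random(seed)
    x_sorted = sorted(data)            # original data, increasing order
    total = 0.0
    for _ in range(B):
        replication = rng.choices(data, k=len(data))
        # Fit to the replication, then compare with the original data
        total += error_norm(x_sorted, fit_cdf(replication), h)
    return total / B

# Synthetic water levels in cm (illustrative only)
rng = random.Random(0)
levels = [rng.gauss(600.0, 80.0) for _ in range(29)]
for h in (0.5, 1.0, 2.0):
    print(h, expected_error_norm(levels, normal_cdf_ml, B=200, h=h))
```

Repeating the calculation for each candidate family and comparing the averages gives the ranking used as the selection criterion; changing `h` shifts the emphasis between the lower and upper parts of the distribution.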
Finally, the parameters estimated from the maximum likelihood method for each family of distributions can be used to
estimate extreme return values corresponding to particular return
periods. By ordering these return values, their confidence intervals can be estimated in a straightforward manner. For
example, if 100 replications are performed the 90% confidence
interval is the interval between the 5th and 95th largest values.
The possibility of calculating directly the confidence intervals of
extreme values is a major advantage of the bootstrap technique
(Efron and Tibshirani, 1993). Here, the results are computed
from 500 replications. To test the sensitivity of the results to
the number of replications, the bootstrapping calculations for the
results shown in Tables 3 and 4 were repeated for 1000 and 2000
replications. Small changes in the numerical values of the error
norm were found but the relative ordering of the magnitudes was
unchanged, indicating that 500 replications provide stable and
robust estimates for this dataset.

4 Water level measurements


The data consist of simultaneous daily records of water level in
the Vistula Estuary at the Tczew benchmark station and daily
records of a mareograph gauge, situated in the Gdansk Harbour
and deemed representative for the whole Gulf of Gdansk (Fig. 1).
These data sets originate from Poland's Institute of Meteorology and Water Management (IMGW), which is responsible for the
acquisition of meteorological data in Poland, including water levels in harbours and rivers. For the analysis of the extremes, annual
maxima were extracted from each series. Both time series span
29 years and cover the period between 1961 and 1989. They were
compiled by a library survey within IBW PAN. The records after
1989 were not available to the team.
The Vistula river has a catchment area of 194,376 km²
(Cyberski and Wróblewski, 2000), which is the largest hydraulic
system on the non-tidal Baltic Sea. The river discharges water
directly into the Gulf of Gdansk and its mouth is an artificial
cross-cut through the coastal dune strip (Fig. 1). This became
operational in 1895 and was constructed to prevent ice jam floods
in the delta due to the complicated configuration of the previously
active branches. These branches were cut off with sluices constructed at the same time as the cross-cut was made. It has been
noted from available studies (Ostrowski et al., 2005) that the
influence of extreme storms can reach 50 km upstream from the
mouth of the Vistula.
For the benchmark station at Tczew the average water level for
the 1961-1989 period was 3.97 m, with a maximum of 10.17 m
on 13th June 1962 and a minimum of 1.84 m on 8th August
1964; the alert level of 9.20 m was exceeded 18 times in that
period. The standard deviation of this series over the entire 29
years was equal to 1.34 m. This series reflects developments in
the whole catchment; its large area tends to average out local effects,
hence this series is highly regular. Note that this station has its own
reference datum, whose exact elevation over the mean sea level
is confidential. Thus, records of the water level in Tczew should
not be confused with the standard Amsterdam 5.00 m level.
At the Gdansk mareograph station all values of seawater level
are referred to the Amsterdam zero level of 5.00 m. The mean
value within the period studied between 1961 and 1989 was equal
to 5.065 m, which is more than the standard 5.00 m level. Most
likely this demonstrates sea level rise, visible in ordinary linear
regression of this series, producing a slope of 0.0013 with an
intercept of 4.995 m. The standard deviation of the entire series
over 29 years was 0.21 m. The maximum level, recorded on 20th
January 1983, reached 6.27 m, whereas the minimum level from
13th March 1972 was only 4.38 m. This series incorporates the
effects of storms such as the surge effect (rise of seawater level
due to low air pressure during nearly all storm events), the wind
set up (fetch induced rise of seawater level) and wave set up (due
to radiation stress associated with breaking waves).
The lag-1 autocorrelation for the Vistula is 0.578 and 0.107
for the monthly and annual maxima, respectively. The results for
the Gulf of Gdansk show a similar pattern with autocorrelation of
0.468 and 0.110 for the monthly and annual maxima. The zero-lag cross-correlation coefficient of the monthly maxima between
the two sets is 0.129, and 0.155 for the annual maxima. The
results demonstrate a noticeable reduction in the autocorrelation
level when progressing from a monthly to annual sampling rate.
They also indicate only a low level of linear correlation between
the two datasets, whether sampled at monthly or annual intervals.
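
The lag-1 autocorrelation and zero-lag cross-correlation quoted above can be computed along the following lines. The text does not state which estimators were used, so the formulas below are one common choice and should be read as an assumption; the function names and the short example series are hypothetical.

```python
import math

def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation about the series mean."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den

def cross_correlation(x, y):
    """Zero-lag (Pearson) correlation between two series of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Tiny illustrative series (not the measured maxima)
river = [4.1, 5.3, 4.8, 6.0, 5.1, 4.4]
sea = [5.2, 5.0, 5.4, 5.1, 5.3, 5.0]
print(lag1_autocorrelation(river), cross_correlation(river, sea))
```

Applied to the monthly and annual maxima of each station, the same two functions would give the four autocorrelations and two cross-correlations reported in this paragraph.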

5 Results: Canonical correlation analysis


Figure 4 presents the basic feature embedded in both series, i.e.,
the seasonality, obtained as empirical mean values for each day
of the year and as a key feature identified by Singular Spectrum
Analysis.

Figure 4 Seasonality in data: Vistula (top), Gulf of Gdansk (bottom) [plots not reproduced; top panel: Vistula at Tczew, mean empirical statistical year of water level from 1961-1989 records, water level (cm) against day of year, with empirical mean and SSA curves; bottom panel: Gdansk harbour, mean empirical statistical year of seawater level from 1961-1989 records, seawater level (cm) against day of year, with empirical mean and SSA curves]


The immediate impression is that they are close to anti-phase,
as the maximum level in the River Vistula occurs in March-April
and coincides with yearly seawater minima. Simultaneously, its
lowest levels can be expected in September-October, coinciding
with seawater levels well above the average, yet not at their all-year high. Since the contribution of seasonality is overwhelming
for the variability of the data and equals 79% for the River Vistula
and nearly 61% for the seawater, Fig. 4 demonstrates that CCA
analysis makes little sense for the raw data; strong annual seasonality, featuring anti-phase behaviour, could possibly overshadow
more delicate relations. On the other hand, Fig. 4 assists in the
selection of months with the least likelihood of joint extremes,
i.e., the months in which average water levels are far too low to
allow a joint extreme to occur. Upon visual inspection of Fig. 4
these months span from May until October; they were excluded
from further analysis.
Figure 5 in turn shows the behaviour of the 2nd important
feature extracted with the SSA method, namely exemplary 10
years of random deviations from annual variations of the River
Vistula and 4.2% of the seawater; in connection with seasonal
variations these quantities grow up to 91.5% for the River Vistula
and 75% for the seawater. It is clear that the greater the seasonal
component, the greater are the random deviations from seasonality.
This is especially true for the River Vistula, for the correlation
coefficient between random deviations and the raw series is equal
to 0.5, whereas this coefficient computed for the seasonal component and the deviations is only 0.27. For the seawater these
quantities are equal to 0.25 and 0.31, respectively, again indicating the greater irregularity of this series. All in all, the random
deviations appear to be the only possible driver for joint extremes,
despite apparently insignificant individual contributions to water
level variability.

Figure 5 Fragment of CCA input versus centred raw data: Vistula (top), Gulf of Gdansk (bottom) [plots not reproduced; each panel shows the centred raw series and the random deviations (cm) for the days from 1st Jan. 1961 to 31st Dec. 1970]

Bearing in mind the last two paragraphs it was assumed that
joint extremes can only occur between November and April, so
separate CCA runs were performed for the random deviations
from seasonality corresponding to these months with seawater as the predictors (Y-matrix) and Vistula as the predictands
(Z-matrix). The rows in these matrices contained the deviations for a given month studied (November, December, January,
February, March, April), starting from 1961 in the 1st row up to
1989 in the 29th row. Therefore, spatial locations defined by
column numbers referred to a day in a month, e.g., for November
the term Y(6, 25) contained the predictor seawater component
from 25th November 1966. The prediction skills for the CCA runs,
i.e., the proportions of variability of deviations from seasonality of
water levels in the River Vistula explained by the variability of deviations from
seasonality of the seawater level in consecutive
months, were low (Table 1).
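
The row and column layout of the CCA input can be sketched as follows, using only the Python standard library. Two simplifications are assumed and should not be read as the authors' procedure: the seasonal signal is taken as the plain mean for each calendar day, whereas the study extracted it with SSA, and the daily series is synthetic. The function names are hypothetical.

```python
import math
from datetime import date, timedelta

def daily_climatology(series):
    """Mean value for each (month, day) over all years in the series."""
    sums = {}
    for d, value in series.items():
        acc = sums.setdefault((d.month, d.day), [0.0, 0])
        acc[0] += value
        acc[1] += 1
    return {key: s / c for key, (s, c) in sums.items()}

def month_matrix(series, month, n_days, first_year=1961, last_year=1989):
    """Rows are years, columns are days of the month, entries are deviations."""
    clim = daily_climatology(series)
    return [[series[date(year, month, day)] - clim[(month, day)]
             for day in range(1, n_days + 1)]
            for year in range(first_year, last_year + 1)]

# Synthetic daily series in cm: seasonal cycle plus a slow trend (illustrative)
series = {}
d = date(1961, 1, 1)
while d <= date(1989, 12, 31):
    series[d] = 500.0 + 20.0 * math.cos(2 * math.pi * (d.month - 1) / 12) + (d.year - 1975)
    d += timedelta(days=1)

Y = month_matrix(series, month=11, n_days=30)
# Y(6, 25) in the 1-based notation of the text is Y[5][24] here: 25th November 1966
print(len(Y), len(Y[0]), Y[5][24])
```

One such matrix per month and per station gives the predictor and predictand inputs of the separate CCA runs.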

Table 1 CCA prediction skills (variance of predictand explained by predictor)

Month       CCA prediction skill
November    0.186
December    0.100
January     0.134
February    0.169
March       0.170
April       0.127

Thus, figures in Table 1 are not correlations but show the (low)
ability of predictors to explain the behaviour of the predictands.
The best skill was achieved for November, February and March,
which is partly explained by Fig. 6, where average values and
standard deviations (SD) of the CCA input are plotted.
In November the predictor SD (seawater random deviations from
seasonality) reaches a maximum and coincides with the growing
predictand SD (Vistula random deviations from seasonality). In
February and March the predictor SD is still high and encounters
a high predictand SD. The lowest score for December becomes
less surprising when it is noted that the predictor SD goes down
slightly from its November peak to meet a rapidly growing
predictand SD. In this way the overall SSA and CCA results provide
substantial evidence that joint extremes in the Vistula estuary
are very rare. The major reason for this is that annual maxima
are phase-shifted and deviations from this basic pattern, which
could trigger a joint extreme, are to a great extent independent.
Hence, independent bootstrap analyses of both series could be
safely undertaken in order to assess joint probabilities of extreme
events in the Vistula estuary. Additional practical implications of
the SSA and CCA study include: (1) elaboration of an approach
in which the interdependence of two random signals can be
evaluated by extracting their key patterns featuring phenomena and
processes (SSA method) and then checking their interrelations
(CCA method), and (2) the more general character of the results,
which might at least partly be applicable to other large estuaries
in northern Europe (e.g., the rivers Elbe, Oder, Rhine and
Scheldt).
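If the two annual maxima are treated as independent, as argued above, the probability that both exceed given levels in the same year is simply the product of the two marginal annual exceedance probabilities. The following Python sketch illustrates this arithmetic; the function names and the return periods used are hypothetical and are not results of this study.

```python
def annual_exceedance_probability(return_period_years):
    """Annual exceedance probability P = 1/T of the T-year level."""
    if return_period_years <= 1:
        raise ValueError("return period must exceed 1 year")
    return 1.0 / return_period_years


def joint_exceedance_probability(t_river, t_sea):
    """Probability that both annual maxima exceed their T-year levels
    in the same year, assuming the two series are independent."""
    return (annual_exceedance_probability(t_river)
            * annual_exceedance_probability(t_sea))


# Illustrative only: a 100-year river level with a 10-year sea level
p_joint = joint_exceedance_probability(100, 10)
print(p_joint, 1.0 / p_joint)
```

Note that this concerns exceedance within the same year only; it says nothing about whether the two maxima coincide in time within that year.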
6 Results: Statistical model selection
6.1 Vistula water levels
The annual maxima from the water level recordings at Tczew
were used to estimate parameters with the maximum likelihood
method for a range of distributions, including some conventional distributions as well as the extreme distributions. All the
estimated parameters for the conventional distributions and the
extreme distributions are displayed in Table 2.
The annual maxima, sorted in increasing order, are plotted against
the reduced variate x_{N,i} = -ln(-ln(p_{N,i})), where
Hazen's formula (Chambers et al., 1983) was used with p_{N,i} =
(i - 0.5)/N to present the probability plot (also often called the
QQ plot). The Gumbel QQ plots showing the best-fit for the
Gumbel, the Weibull and the Gev are shown in Fig. 7(a), together
with the annual maxima. If the data obey the Gumbel distribution,
all points should fall along a straight line. From observation alone, it is
hard to define which distribution function fits best to the data. The
error norm provides a rational means of discriminating between
the performances of the different distributions in describing the
data. To this end, the error norms are summarised in Table 3.
The different values of h correspond to the emphasis of the
norm being on different parts of the distribution (Eq. (11)). Low
values of h weight the norm towards the fit at the lower tail while
values of h greater than 1 weight the norm towards the fit at the
upper tail.
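The construction of the Gumbel QQ plot described above can be sketched in a few lines of Python: Hazen plotting positions p_i = (i - 0.5)/N are assigned to the sorted annual maxima and converted to the reduced variate -ln(-ln(p_i)). The function names and the sample values are hypothetical and serve only as an illustration.

```python
import math


def hazen_plotting_positions(n):
    """Hazen plotting positions p_i = (i - 0.5)/n for i = 1..n."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def gumbel_reduced_variate(p):
    """Gumbel reduced variate x = -ln(-ln(p)) for 0 < p < 1."""
    return -math.log(-math.log(p))


def gumbel_qq_points(annual_maxima):
    """Pair each sorted annual maximum with its reduced variate."""
    ordered = sorted(annual_maxima)
    probs = hazen_plotting_positions(len(ordered))
    return [(gumbel_reduced_variate(p), level)
            for p, level in zip(probs, ordered)]


# Hypothetical water levels in metres, for illustration only
sample = [7.1, 9.4, 6.3, 8.0, 7.6]
for x, level in gumbel_qq_points(sample):
    print(f"{x:6.3f}  {level:.1f}")
```

If the sample followed a Gumbel distribution, the printed pairs would lie close to a straight line when plotted.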

Figure 6 Mean values and standard deviations (SD) of CCA input, i.e., random
deviations from seasonality for November-April (cm), plotted against day
number from 1st Nov to 30th Apr: Vistula (top), Gulf of Gdansk (bottom)

Table 2 Best-fit distribution parameters for annual maxima data at Vistula

Model           Parameters
Normal          location = 7.8203; scale = 1.2551
Log-normal      location = 2.0433; scale = 0.1657
Gamma           shape = 37.4946; scale = 0.2086
Exponential     scale = 7.8203
Weibull (II)    shape = 7.0069; scale = 8.3548
Weibull (III)   shape = 3.6195; scale = 4.4886; location = 3.7789
Gumbel          location = 7.1909; scale = 1.2170
Gev             shape = 0.3606; location = 7.4252; scale = 1.2869

Figure 7 (a) Gumbel QQ plot showing annual maxima water levels in Vistula and best fit distributions, (b) Best Gev fit with 95% confidence limits
together with annual maximum water levels in Vistula, (c) Gumbel QQ plot showing annual maximum water levels in Gulf of Gdansk and best fit
distribution, and (d) Best Gev fit with 95% confidence limits together with annual maximum water levels in Gulf of Gdansk.
Axes in all panels: reduced variate (horizontal), water level in m (vertical)

Table 3 Computed expectation error norm (Eq. (11)) for Vistula data with 500 bootstrap replications

Model           h = 1.5   h = 1.25  h = 1.00  h = 0.75  h = 0.50  h = 0.25
Normal          0.1360    0.1292    0.1232    0.1179    0.1182    0.1118
Log-normal      0.1390    0.1316    0.1233    0.1181    0.1337    0.1506
Gamma           0.1381    0.1305    0.1225    0.1169    0.1256    0.1335
Exponential     0.3877    0.4311    0.4763    0.5172    0.5227    0.4124
Weibull (II)    0.1336    0.1291    0.1251    0.1205    0.1156    0.0972
Weibull (III)   0.1311    0.1237    0.1173    0.1151    0.1251    0.1367
Gumbel          0.1364    0.1300    0.1242    0.1207    0.1412    0.1737
Gev             0.1286    0.1208    0.1161    0.1135    0.1109    0.0993

As shown in Table 3, the error norms of the Gev distribution function are almost consistently the smallest among these
distributions at different parts of the distribution curve, indicating the Gev distribution is the best-fit for the annual maximum
water levels in the Vistula, based on the data over the interval
from 1961 to 1989. Apart from the Gev distribution, the error
norms of the Weibull (III) are also small (Table 3). The fact
that the best fit to the data is achieved by the three-parameter
distributions is perhaps not so surprising as these distributions

have more degrees of freedom to match the data. This is most
evident with the Exponential distribution which has only one
parameter and has consistently the largest error norm. Of the
two-parameter distributions, it is less clear which provides the best
fit, although the extreme value distributions perform adequately.
From the analysis above, it was concluded that the Gev is the best
fit function for the annual maximum water level. Corresponding
to this, the QQ plot in Fig. 7(a) is presented again in Fig. 7(b),
which shows the Gev distribution and the 5% and 95% quantiles
together with the annual maxima. As may be seen clearly, the
annual maxima fall within these quantile limits.
Table 4 summarises the return values at a selection of return
periods with 95% confidence limits based on 500 bootstrap
re-samplings. The extreme values are determined from the
Weibull (II), the Weibull (III), the Gumbel and the Gev distributions
for return periods from 2 years to 200 years. It should
be noted that with about 30 years of data (1961-1989), return
periods larger than 30 years correspond to events that are unlikely
to have occurred during the period of the measurements.
In each cell of Table 4, the first value is the return value from the
best fit to the raw annual maxima water level data in the Vistula,
and the bracketed values are the lower and upper limits from the
bootstrap replications. From these limits one can identify which
distribution function gives the best fit for the annual maxima.
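As a rough check on how return values follow from fitted parameters, the sketch below inverts three distribution functions at the non-exceedance probability 1 - 1/T. It assumes the Gev form F(x) = exp{-[1 - shape (x - location)/scale]^(1/shape)}, which is consistent with the upper limit of location + scale/shape quoted in this section, and it assumes the roles (shape, location, scale) assigned here to the Table 2 values; both are inferences rather than statements from the source. The function names are hypothetical.

```python
import math


def gev_return_level(T, location, scale, shape):
    """T-year level for F(x) = exp{-[1 - shape*(x - location)/scale]**(1/shape)}."""
    y = -math.log(1.0 - 1.0 / T)
    return location + scale / shape * (1.0 - y ** shape)


def gumbel_return_level(T, location, scale):
    """T-year level for F(x) = exp{-exp[-(x - location)/scale]}."""
    return location - scale * math.log(-math.log(1.0 - 1.0 / T))


def weibull_return_level(T, scale, shape, location=0.0):
    """T-year level for F(x) = 1 - exp{-[(x - location)/scale]**shape}."""
    return location + scale * math.log(T) ** (1.0 / shape)


T = 100  # return period in years
print("Gev          ", round(gev_return_level(T, 7.4252, 1.2869, 0.3606), 2))
print("Gumbel       ", round(gumbel_return_level(T, 7.1909, 1.2170), 2))
print("Weibull (II) ", round(weibull_return_level(T, 8.3548, 7.0069), 2))
print("Weibull (III)", round(weibull_return_level(T, 4.4886, 3.6195, 3.7789), 2))
```

Under these assumptions the printed values can be compared directly with the best-fit entries for the 100-year return period in Table 4.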
For engineering design, it is more helpful to have smaller
confidence intervals, which is another reason to favour the Gev
distribution in this case. As one example, the return value
determined from the Gev distribution function for the return period
of 100 years is 10.31 m with the 5% quantile value of 9.32 m
and the 95% quantile of 12.21 m. In passing it should be noted
that the upper limit of 12.21 m is larger than the upper limit
of the best fitting Gev distribution, whose parameters are such
that location + scale/shape = 10.99 m. However, the bootstrapping
process means that some of the bootstrap samples have parameters
that allow larger limits, and this is reflected in the quantile
values. When using
the best fit parameters given in Table 2 the physical limit on
the maximum values inherent in the form of distribution should
clearly be remembered. The fact that the bootstrapping method
employed here indicates that larger values may occur reflects
the uncertainty in the process of estimating the parameter values
of the distribution. In comparison, the Weibull (III) provides an
unreasonably large upper limit as shown in Table 4. The reason
is that the Weibull (III) distribution is quite sensitive to the
minima of the sample. This sensitivity is reflected in the scatter of
the results within the bootstrap sampling, and hence the limits of
the confidence interval.
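The way bootstrap resampling turns a single fitted return value into a value with quantile limits can be sketched as follows. This is a simplified illustration and not the procedure of the paper: it fits a Gumbel distribution by the method of moments instead of fitting the Gev by maximum likelihood, the annual maxima are invented, and all function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(sample):
    """Method-of-moments Gumbel fit (a stand-in for maximum likelihood)."""
    scale = statistics.stdev(sample) * math.sqrt(6.0) / math.pi
    location = statistics.mean(sample) - EULER_GAMMA * scale
    return location, scale


def gumbel_return_level(T, location, scale):
    """T-year level of the Gumbel distribution."""
    return location - scale * math.log(-math.log(1.0 - 1.0 / T))


def bootstrap_return_level(sample, T, n_boot=500, seed=0):
    """Best-fit T-year level with 5% and 95% bootstrap quantiles."""
    rng = random.Random(seed)
    best = gumbel_return_level(T, *fit_gumbel_moments(sample))
    replicates = []
    for _ in range(n_boot):
        # resample the annual maxima with replacement and refit
        resample = [rng.choice(sample) for _ in sample]
        replicates.append(gumbel_return_level(T, *fit_gumbel_moments(resample)))
    replicates.sort()
    lower = replicates[int(0.05 * (n_boot - 1))]
    upper = replicates[int(0.95 * (n_boot - 1))]
    return best, lower, upper


# Hypothetical annual maxima in metres, not the Tczew record
maxima = [6.1, 7.4, 8.9, 7.0, 6.6, 9.8, 7.7, 8.2, 6.9, 7.3, 8.5, 7.9]
best, lower, upper = bootstrap_return_level(maxima, T=100)
print(f"{best:.2f} ({lower:.2f}, {upper:.2f})")
```

A distribution whose refitted return values scatter widely across the replications, as reported here for the Weibull (III), produces a wide interval between the two quantiles.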
6.2 Seawater levels at Gdansk
An analogous bootstrapping analysis was also performed on the
annual maxima seawater levels in the Gulf of Gdansk. For
conciseness, the parameters estimated with the maximum
likelihood method are not presented. A significant difference
from the River Vistula data is that the estimated shape parameter
is negative for the Gev distribution. The Gumbel QQ plots with the best fit
for the Gumbel, the Weibull and the Gev are shown in Fig. 7(c)
together with the annual maxima water level data in the Gulf of
Gdansk. Figure 7(c) shows that the Weibull (II) does not fit the
raw data well. By eye, there is little difference in the
goodness-of-fit among the Weibull (III), the Gumbel and the Gev
distributions. To discriminate between the performances of
these three distribution functions it is necessary to consider the
confidence intervals for each.
The values of the error norm from the 500 bootstrap replications are shown in Table 5. These suggest that the Gev provides
the best fit for the annual maximum seawater level data in the
Gulf of Gdansk.

Table 4 Return values of maximum annual extremes in Vistula with 95% confidence limits in brackets

Return period (years)   Weibull (II) (m)       Weibull (III) (m)      Gumbel (m)             Gev (m)
2                       7.93 (7.45, 8.39)      7.84 (6.51, 8.22)      7.64 (7.15, 8.09)      7.87 (7.30, 8.47)
5                       8.94 (8.40, 9.36)      8.90 (7.59, 9.23)      9.02 (8.37, 9.51)      8.92 (8.31, 9.43)
10                      9.41 (8.79, 9.82)      9.43 (8.06, 10.19)     9.93 (9.13, 10.52)     9.41 (8.81, 9.79)
20                      9.77 (9.12, 10.20)     9.86 (8.44, 12.02)     10.81 (9.78, 11.49)    9.77 (9.07, 10.22)
50                      10.15 (9.38, 10.64)    10.32 (8.86, 14.47)    11.94 (10.59, 12.80)   10.12 (9.24, 11.25)
100                     10.39 (9.57, 10.92)    10.62 (9.05, 16.74)    12.79 (11.16, 13.78)   10.31 (9.32, 12.21)
200                     10.60 (9.74, 11.16)    10.89 (9.26, 19.17)    13.64 (11.73, 14.76)   10.46 (9.39, 12.93)

Note: Return period is the expected (mean) time (usually in years) between the exceedance of a particular extreme threshold

Table 5 Computed expectation error norm for data in Gdansk with 500 bootstrap replications

Model           h = 1.5   h = 1.25  h = 1.00  h = 0.75  h = 0.50  h = 0.25
Normal          0.1736    0.1652    0.1530    0.1358    0.1193    0.1002
Log-normal      0.1686    0.1604    0.1487    0.1323    0.1159    0.0962
Gamma           0.1699    0.1615    0.1497    0.1331    0.1169    0.0976
Exponential     0.4769    0.5310    0.5822    0.6170    0.6020    0.4585
Weibull (II)    0.2146    0.2018    0.1845    0.1790    0.2025    0.1894
Weibull (III)   0.1348    0.12907   0.1223    0.1182    0.1406    0.1741
Gumbel          0.1321    0.1258    0.1192    0.1130    0.1139    0.1145
Gev             0.1258    0.1188    0.1119    0.1068    0.1091    0.1107

Table 6 Return values of maximum annual extremes in Gdansk with 95% confidence limits in brackets

Return period (years)   Weibull (II) (m)       Weibull (III) (m)      Gumbel (m)             Gev (m)
2                       5.76 (5.70, 5.84)      5.72 (5.61, 5.78)      5.72 (5.67, 5.79)      5.72 (5.67, 5.79)
5                       5.94 (5.81, 6.04)      5.89 (5.79, 5.98)      5.87 (5.79, 5.97)      5.87 (5.79, 5.97)
10                      6.01 (5.85, 6.13)      5.99 (5.85, 6.27)      5.97 (5.87, 6.10)      5.97 (5.85, 6.10)
20                      6.07 (5.88, 6.20)      6.07 (5.89, 6.69)      6.07 (5.94, 6.22)      6.07 (5.89, 6.27)
50                      6.13 (5.91, 6.27)      6.17 (5.95, 7.41)      6.19 (6.03, 6.37)      6.20 (5.93, 6.61)
100                     6.16 (5.93, 6.31)      6.24 (5.98, 7.99)      6.29 (6.10, 6.49)      6.30 (5.96, 6.90)
200                     6.19 (5.95, 6.35)      6.31 (6.01, 8.68)      6.38 (6.17, 6.60)      6.40 (5.97, 7.40)

Figure 7(d) shows the Gev distribution and the 5% and 95%
quantiles together with the annual maxima. As may be seen
clearly, the annual maxima fall almost on the Gev curve and
are well within the quantile limits. The extreme values for the
different return periods determined from the Weibull, Gumbel
and Gev distribution functions are shown in Table 6. The limits
for the extreme values are based on the 95% confidence interval
from the 500 bootstrap replications. The results determined from
the Gev, which performed best, are given in the last column of Table 6.
It may be noted that the confidence intervals for the water level
in the Gulf of Gdansk are much smaller than those for the Vistula.
This is not surprising when considering the respective ranges and
standard deviations of the series at Gdansk and Tczew, see Sec. 4.

7 Conclusions

In this paper the results of an investigation of the extreme distributions of the annual maximum water levels were presented
both at Tczew on the River Vistula and in the Gulf of Gdansk
with 29 years of simultaneous readings covering the period from
1961 to 1989. The extreme probability distributions were validated with the univariate bootstrap resampling technique on both
of the datasets. Some key conclusions are:
(1) The SSA and CCA approach highlights the synergistic
effect of combining two statistical methods; key behavioural
patterns in the data can be identified (SSA) and their
interdependence scrutinised (CCA).
(2) The SSA and CCA results signify practical independence of
both signals, which allows their separate bootstrap analyses.
If they are highly correlated, a joint probability approach,
such as that proposed by Hawkes et al. (2002), should be
considered.
(3) The GEV distribution provides the best model for the distribution of extreme annual maximum water levels at both
locations. It has the best goodness-of-fit, it is an asymptotic
model for extremes and it detects the finite character of the
phenomenon. It is also fairly robust in that the confidence
limits are such that extreme value estimates are useful for
practical purposes when bootstrap resampling is applied.
(4) The three-parameter Weibull distribution also performed
well, but exhibited greater sensitivity to the replications than
the GEV distribution.
(5) The two datasets exhibit very different behaviour. The
Gdansk water level gauge data were moderately well
described by a Gumbel distribution. In contrast the Tczew
extreme water levels exhibit a distinct curve on the QQ
plot which is not well-captured by a Gumbel distribution. This suggests that there is a limiting process that
restricts the extreme values of the water levels at Tczew.
One possible explanation is that the water levels at Tczew
are representative of a whole catchment and are therefore
smoother as they are the result of a sum of processes over the
whole catchment. Localised flooding in part of the catchment
could also act to limit water levels at Tczew.
(6) The choice of an error norm allows the fit to the data to be
weighted towards particular sections of the data. It allows
a quantitative assessment of the goodness-of-fit of different
statistical models throughout the range of the data.
(7) The bootstrap resampling method allows confidence intervals
for different return values to be estimated directly.
(8) This investigation shows how the bootstrap resampling
method can assist in selecting a distribution that provides
a good fit to the data and is also robust. In practical application, the choice of the distribution function may be guided
not just by the goodness-of-fit but also the robustness of the
fit (as measured by the confidence interval of estimates of
extreme values).
(9) The overall results have a more general applicability as they
should be applicable, at least in part, to other large estuaries.


Acknowledgements
The work described in this publication was supported by the European Community's Sixth Framework Programme through the
grant to the budget of the Integrated Project FLOODsite, Contract
GOCE-CT-2004-505420. The paper reflects the authors' views
and not those of the European Community. Neither the European
Community nor any member of the FLOODsite Consortium is
liable for any use of the information in this paper.
Appendix
Definitions of distributions and their parameters used in this
paper.
Normal distribution with location parameter \mu and scale parameter \sigma:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] \]   (A1)

Log-normal distribution with location parameter \mu and scale
parameter \sigma:

\[ f(x) = \frac{1}{\sigma x\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln(x)-\mu}{\sigma}\right)^{2}\right] \]   (A2)

Gamma distribution with parameters \alpha and \beta:

\[ f(x) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta} \]   (A3)

Exponential distribution with scale parameter \lambda:

\[ F(x) = 1 - e^{-x/\lambda} \]   (A4)

General extreme value (Gev) with location parameter \mu, scale
parameter \sigma and shape parameter \delta:

\[ F(x) = \exp\left\{-\left[1 - \frac{\delta(x-\mu)}{\sigma}\right]^{1/\delta}\right\}, \qquad \sigma - \delta(x-\mu) > 0 \]   (A5)

The special cases of Gev used in this work, Weibull and Gumbel
distribution, are as follows:
Weibull (III) distribution with location parameter \mu, scale
parameter \sigma and shape parameter k:

\[ F(x) = 1 - \exp\left[-\left(\frac{x-\mu}{\sigma}\right)^{k}\right] \]   (A6)

When \mu = 0, (A6) will reduce to the Weibull (II).

Gumbel distribution with location parameter \mu,
scale parameter \sigma:

\[ F(x) = \exp\left\{-\exp\left[-\frac{x-\mu}{\sigma}\right]\right\} \]   (A7)
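Two properties of the Gev used in this paper can be checked numerically with a short Python sketch: the distribution has a finite upper limit at location + scale/shape when the shape parameter is positive, and it approaches the Gumbel distribution as the shape parameter tends to zero. The sketch assumes the Gev form F(x) = exp{-[1 - shape (x - location)/scale]^(1/shape)}, which is consistent with the upper limit quoted in Section 6.1; the function names are hypothetical.

```python
import math


def gev_cdf(x, location, scale, shape):
    """Gev distribution function for a non-zero shape parameter."""
    t = 1.0 - shape * (x - location) / scale
    if t <= 0.0:
        # outside the support: above the upper limit when shape > 0,
        # below the lower limit when shape < 0
        return 1.0 if shape > 0 else 0.0
    return math.exp(-t ** (1.0 / shape))


def gumbel_cdf(x, location, scale):
    """Gumbel distribution function."""
    return math.exp(-math.exp(-(x - location) / scale))


# A very small shape parameter gives values close to the Gumbel case
print(gev_cdf(8.0, 7.0, 1.0, 1e-6), gumbel_cdf(8.0, 7.0, 1.0))

# With the Vistula Gev values the upper limit is about 10.99 m,
# so the distribution function equals 1 at 11.0 m
print(gev_cdf(11.0, 7.4252, 1.2869, 0.3606))
```

A negative shape parameter, as reported for the Gdansk data, removes the upper limit and gives a heavier upper tail than the Gumbel distribution.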

Notation
. = Expected value
h = Weighting parameter
nm = Number of canonical modes
nt = Number of observations (realizations) of Y and
Z fields
ny = Number of spatial points in Y predictor field
nz = Number of spatial points in Z predictand field
t = Time index
w = Average prediction error; a standard deviation of
discrepancies (remainders) between predictand
Z and predictions Ẑ
x_P = Return level for return period 1/P units of time
y = Location index in Y
z = Location index in Z
I = Identity matrix
F^(i)(x) = Best fit distribution function for ith bootstrap
replication
Q (nz × nm) = Normalized eigenvectors of CCA system matrix
with Z swapped for Y
R (ny × nm) = Eigenvectors of CCA system matrix, each scaled
to unit length
S (nm × nz) = Matrix of regression coefficients relating canonical predictor amplitudes U to points in predictand Z, cf. CCA description
U (nt × nm) = Canonical predictor field, each row has a unit
variance
V (nt × nm) = Canonical predictand field
X_i = ith bootstrap replication of original data sample X
Y (nt × ny) = Predictor field
Z (nt × nz) = Predictand field
Ẑ (nt × nz) = Predictions of Z with Y
(nm) = Canonical correlations in CCA analysis
= Standard deviation of measurements in CCA analysis
, , , , , = Statistical model parameters (see Table 2
, , , , for further details)
i = Best fit distribution parameter set for the ith
bootstrap sample
= CCA prediction skill
 = Error norm

References
Chambers, J.M., Cleveland, W.S., Kleiner, B., Tukey, P.A.
(1983). Graphical Methods for Data Analysis, Duxbury,
Boston MA.
Coles, S.G., Tawn, J.A. (1994). Statistical Methods for Multivariate Extremes: An Application to Structural Design (with
discussion). Appl. Statistics 43, 1-48.
Cyberski, J., Wróblewski, A. (2000). Riverine Water Inflows and
the Baltic Water Volume 1901-1990. Hydrology and Earth
System Sciences 4(1), 1-11.
Efron, B. (1979). Bootstrap Methods: Another Look at the
Jackknife. Ann. Statist. 7, 1-26.
Efron, B., Tibshirani, R.J. (1993). An Introduction to the
Bootstrap, Chapman and Hall, New York.
Fisher, R.A., Tippett, L.H.C. (1928). Limiting Forms of the
Frequency Distributions of the Largest or Smallest Member of
a Sample. Proc. Camb. Phil. Soc. 24, 180-190.
Graham, N.E. (1990). Canonical Correlation Analysis. World
Meteorological Organization report. WMO review of climate
diagnostic models.
Green, P.E. (1978). Analyzing Multivariate Data, Holt, Rinehart
& Winston, Hinsdale IL.
Green, P.E., Douglas Caroll, J. (1978). Mathematical Tools for
Applied Multivariate Analysis, Academic Press, New York.
Hawkes, P.J., Gouldby, B.P., Tawn, J.A., Owen, M.W. (2002).
The Joint Probability of Waves and Water Levels in Coastal
Defence Design. J. Hydraul. Res. 40, 241-251.
Linhart, H., Zucchini, W. (1986). Model Selection, Wiley,
New York.
Ostrowski, R., Pruszak, Z., Szmytkiewicz, M. (2005). Red River
Delta (Vietnam) and Vistula Delta (Poland): Similarities and
Differences. Proc. Seminar Sediment Transport in Rivers and
Transitional Waters, IBW PAN, Gdansk, pp. 68-72.
Pugh, D.T., Vassie, J.M. (1980). Applications of the Joint Probability Method for Extreme Sea Level Computations. Proc
ICE, Part 2, 69, 959-975.
Reeve, D.E. (1996). Estimation of Extreme Indian Monsoon
Rainfall. Int. J. Climatology 16, 105-112.
Rozynski, G., Ostrowski, R., Pruszak, Z., Szmytkiewicz, M.,
Skaja, M. (2006). Data-Driven Analysis of Joint Coastal
Extremes Near a Large Non-Tidal Estuary in North Europe.
Estuarine, Coastal and Shelf Science 68(1/2), 317-327.
Tawn, J.A., Vassie, J.M. (1989). Extreme Sea Levels: The Joint
Probabilities Method Revisited and Revised. Proc ICE, Part
2, 87, 429-442.
